Switching Teams
Moving an Application from MySQL to Postgres

Julie Baumler
@competentgirl
juli@baumler.com
June 18, 2013

SQL is an ISO standard, so you should just be able to switch backend databases, right?

Sadly no.

My Database Background

kaypro (image from http://www.mrmartinweb.com/computer.html)

My Database Background

FASTIXX Award

My Database Background

Remedy Install

"Hey, can I admin this unimportant database the DBA's don't have time for? I think it will help me understand admining for systems running databases better"
...
"Why is a sys admin the DBA for the most used database in the organization?"

CATME App History

Why not MySQL

MySQL may not have been the optimal choice for us in the first place.

Options Considered

We didn't consider NoSQL because our data is relational

Why PostgreSQL

  • The way we write SQL plays to Postgres's strengths
  • Our needs and Postgres's strengths line up well
  • We had access to good Postgres resources
  • mysqldump | psql

    Postgres mode (mysqldump --compatible=postgresql): Postgres hated it

    ANSI mode (mysqldump --compatible=ansi): better

    convert.[sed|awk]

    # remove unnecessary or inappropriate escapes
    # fix others
    if (in_insert){
    	gsub(/\\r/,"");
    	gsub(/<\/p>\\n\\n/, "\n\n");
    	gsub(/\\n\\n/, "\n\n");
    	gsub(/<[bB][rR][\/]*>\\n/, "\n");
    	gsub(/\\n/, "\n");
    	gsub(/\\"/, "\"");
    	# shield a trailing \'); so the quote conversion below
    	# does not mangle the end of the INSERT, then restore it
    	sub(/\\');$/, "PROTECTED_END_SLASH");
    	gsub(/\\'/, "''");
    	sub("PROTECTED_END_SLASH$", "\\');");
    }

    convert.awk

    
    if ( !in_insert && $0 ~ /enum/ ){
    	# enums become create type statements
    	#CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
    	enum=data=$0;
    	gsub(/^[ 	]*"/,"", enum);
    	gsub(/" .*$/, "", enum);
    	gsub(/^[ 	]*".*" enum\(/, "(", data);
    	gsub(/\).*$/, ");", data);
    	print "DROP TYPE " enum "_enum ;"
    	print "CREATE TYPE " enum "_enum AS ENUM " data;
    
    	gsub(/enum\(.*\)/, enum"_enum");
    }
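
    For example, reusing the "mood" column from the comment above (the table itself is hypothetical), an ANSI-mode dump line such as

    "mood" enum('sad','ok','happy') NOT NULL,

    comes out as

    DROP TYPE mood_enum ;
    CREATE TYPE mood_enum AS ENUM ('sad','ok','happy');

    and the column definition itself is rewritten to

    "mood" mood_enum NOT NULL,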
    
    

    convert.awk

    
    if ( $0 ~ /AUTO_INCREMENT/ || $0 ~ /auto_increment/ )
    {
    	# Create a script to create the lines to be run to set next values
    	# correctly:
            # ALTER SEQUENCE <table>_<column>_seq 
    	#   RESTART WITH <prior>;
    	col_name = $1; gsub(/\"/, "", col_name);
    	seq_lines = seq_lines "SELECT 'ALTER SEQUENCE " table_name "_" col_name "_seq" \
    	    " RESTART WITH '||max(" col_name ")+1||';' FROM " table_name ";\n"
    
    	# change the line appropriately
    	gsub(/ AUTO_INCREMENT/, "");
    	gsub(/ auto_increment/, "");
    	gsub(/int\([5-9]\)/, "serial");
    	gsub(/int\(1[0-9]\)/, "bigserial");
    }
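
    As a sketch of the effect (table "users" and column "id" are hypothetical), a dump line like

    "id" int(11) NOT NULL AUTO_INCREMENT,

    becomes

    "id" bigserial NOT NULL,

    while seq_lines collects a helper query such as

    SELECT 'ALTER SEQUENCE users_id_seq RESTART WITH '||max(id)+1||';' FROM users;

    whose output is then run to move the new sequence past the existing ids.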
    
    

    convert.awk

    # convert data types
    gsub(/tinyint\(.*\)/, "smallint");
    gsub(/mediumint\(.*\)/, "integer");
    gsub(/int\([0-4]\)/, "smallint");
    gsub(/int\([5-9]\)/, "integer");
    gsub(/int\(1[0-9]\)/, "bigint");
    gsub(/" double/, "\" double precision");
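
    For instance, column definitions like these (names made up)

    "flags" tinyint(4) NOT NULL,
    "views" mediumint(9) NOT NULL,
    "score" double,

    become

    "flags" smallint NOT NULL,
    "views" integer NOT NULL,
    "score" double precision,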
    
    
    

    convert.awk

    
    # MySQL allows a zero date ('0000-00-00 00:00:00'); Postgres does not
    gsub(/datetime/, "timestamp");
    
    gsub(/0000-00-00/,"1970-01-01");
    
    gsub(/DEFAULT CURRENT_TIMESTAMP/, "DEFAULT CURRENT_TIMESTAMP(0)");
    
    # the ON UPDATE auto-update behavior is handled in application code
    gsub(/ON UPDATE CURRENT_TIMESTAMP/, "");
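
    Roughly, for hypothetical columns:

    "created" datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
    "updated" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    becomes

    "created" timestamp NOT NULL DEFAULT '1970-01-01 00:00:00',
    "updated" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP(0),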
    

    convert.awk

    
    # handle "KEY" statements by making indexes
    # form: CREATE UNIQUE INDEX name ON table (column [, ...]);
    if ( $0 ~ /^[ 	]*[UNIQUE ]*KEY/ ) {
    	if ( $0 ~ /^[ 	]*UNIQUE KEY/ ) { unique=1; }
    	gsub(/^[ 	]*[UNIQUE ]*KEY/, "");
    	if ( $0 ~ /,$/ ){
    		gsub(/,$/, "");
    	} else {
    		# for some reason matching wasn't reliable
    		# for removing comma at end of statement
    		l=length(statement);
    		statement = substr(statement, 1, l-2);
    	}
    	keyname=$1
    	gsub(/\"/, "", keyname);
    	$1="";
    	index_line = index_line "\nCREATE " (unique? "UNIQUE" : "") " INDEX \"" table_name"_"keyname"\" on " table_name " " $0";";
    	unique=0;
    	keyname="";
    } 
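
    Roughly, for a hypothetical "users" table, key lines like

    UNIQUE KEY "email_key" ("email"),
    KEY "last_login" ("last_login")

    turn into index statements that are printed after the CREATE TABLE:

    CREATE UNIQUE INDEX "users_email_key" on users ("email");
    CREATE INDEX "users_last_login" on users ("last_login");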
    

    convert.awk

    /;/{
    print statement;
    print index_line;
    table_name=statement=index_line="";
    in_insert=0;
    }
    END{ 
    print seq_lines; 
    }
        

    Query Changes

    	
    # The following changes were used when converting to Postgres
    s/date_format(\(.*\), '%d-%b-%Y')/to_char(\1, 'DD-Mon-IYYY')/ig
    s/ifnull(\(.*\), \(.*\))/coalesce(\1, \2)/ig
    s/convert_tz(\(.*\), \@\@global.time_zone, \(.*\))/\
        \1 at time zone \2/ig
    s/datediff(\(.*\), \(.*\))/(\1 - \2)/ig
    s/date_add(\(.*\), \(.*\))/(\1 + \2)/ig
    # note that Postgres random() is self-seeding
    s/rand(.*)/random()/ig
    s/group_concat(\(.*\) separator \(.*\))/\
        array_to_string(array_agg(\1), \2)/ig
    s/group_concat(\(.*\))/array_to_string(array_agg(\1),',')/ig
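
    A before/after sketch of what these substitutions do (table and column names are made up):

    -- MySQL
    SELECT IFNULL(nickname, name) AS display_name,
           DATE_FORMAT(created, '%d-%b-%Y') AS joined
      FROM players;

    SELECT team, GROUP_CONCAT(name SEPARATOR ', ') FROM players GROUP BY team;

    -- after the script
    SELECT coalesce(nickname, name) AS display_name,
           to_char(created, 'DD-Mon-IYYY') AS joined
      FROM players;

    SELECT team, array_to_string(array_agg(name), ', ') FROM players GROUP BY team;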
        

    Query Changes

    # datediff with comparison replacement is:
    # select created from table_with_dates
    #    where date_part( 'years', age(created)) >= 1;
    # OR
    # select created from table_with_dates
    #    where created >= (CURRENT_TIMESTAMP - interval '1 year');
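
    For reference, the MySQL form being replaced looked roughly like this (same hypothetical table):

    SELECT created FROM table_with_dates
       WHERE DATEDIFF(NOW(), created) >= 365;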
    

    Query Changes

    # if can often be replaced with
    # greatest() or least() 
    # otherwise 
    # if (a = b, c, d) becomes
    # case a when b then c
    #	else d
    # end
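
    A short sketch with made-up columns:

    -- MySQL
    SELECT IF(score > best, score, best),
           IF(status = 1, 'active', 'inactive')
      FROM results;

    -- Postgres
    SELECT greatest(score, best),
           CASE status WHEN 1 THEN 'active' ELSE 'inactive' END
      FROM results;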
        

    Query Changes

    # between is sometimes better written as overlaps, i.e.
    # x between y and z becomes
    # (x, interval '0 day') overlaps (y, z)
    #         case (date( now() at time zone c.time_zone), interval '0 day')
    #            overlaps ( coalesce(x.start_date, s.start_date),
    #                    coalesce(x.end_date, s.end_date)) 
    #	    when true then 'Y'
    #            else 'N'
    #        end as between_dates,
    #        case (date(now() at time zone c.time_zone) >=
    #           coalesce(x.start_date, s.start_date)) 
    #	   when true then 'Y'
    #           else 'N'
    #        end as started,
        

    Query Changes

    
    # boolean params need to be set to a bool value and
    # cast with ::boolean
    # be careful with (? or column = ?) where
    # both params are the same:
    # it doesn't work unless the params are boolean
    
    # Need to use "AS" where MySQL treats it as optional
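
    A sketch of both points, with hypothetical names and driver-style ? placeholders:

    -- a parameter used as a boolean needs an explicit cast
    SELECT * FROM users WHERE (?::boolean OR is_admin = ?);

    -- MySQL accepts "SELECT count(*) total"; spell out the AS for Postgres
    SELECT count(*) AS total FROM users;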
        

    User Generated Content Data Conversion

    Our test data converted with no problems. Actual user-generated content was a different story.

    Moral: Know your data

    User Generated Content Data Conversion
    chardet

    Checking encoding with chardet was way too slow (we'd still be converting data!), and it wasn't particularly accurate.

    User Generated Content Data Conversion
    iconv

    Used with specific sets of data where we knew the encoding

    Windows-1252 is Evil

    I call it 'fake Latin-1'

    User Generated Content Data Conversion
    fix utf8.c

    Started with sed; it quickly became messy

    Tried awk, then Perl

    Ended up using C, where I could use the char type to represent a byte and wchar_t to represent wide characters

    Converted to HTML character entities where reasonable, to match a partial prior conversion

    User Generated Content Data Conversion
    fix utf8.c

    
    /* VALID UTF-8
     * 00 to 7f hex (0 to 127):
    	     first and only byte of a sequence.
     * C2 to DF hex (194 to 223):
    	     first byte of a two-byte sequence.
     * E0 to EF hex (224 to 239):
    	     first byte of a three-byte sequence.
     * 80 to BF hex (128 to 191):
    	     continuing byte in a multi-byte sequence.
     */
        

    BUT

    0x92 is probably a Windows-1252 curly apostrophe (right single quote), not a control character
    

    Questions

    Julie Baumler
    @competentgirl
    juli@baumler.com
    https://www.catme.org