Switching Teams
Moving an Application from MySQL to Postgres
Julie Baumler
@competentgirl
juli@baumler.com
June 18, 2013
Julie Baumler
@competentgirl
juli@baumler.com
June 18, 2013
(image from http://www.mrmartinweb.com/computer.html)
"Hey, can I admin this unimportant database the DBA's don't have time for?
I think it will help me understand admining for systems running
databases better"
...
"Why is a sys admin the DBA for the most used database in the organization?"
@johndalton @0xabad1dea MySQL: Doing everything adequately right, most of the time.
— MenTaLguY (@mentalguy) June 5, 2013
Didn't consider NoSQL because this is relational data
Postgres mode - Postgres hated it
ANSI mode - better
# remove unnecessary or inappropriate escapes # fix others if (in_insert){ gsub(/\\r/,""); gsub(/<\/p>\\n\\n/, ""); gsub(/\\n\\n/, "
"); gsub(/<[bB][rR][\/]*>\\n/, "
"); gsub(/\\n/, "
"); gsub(/\\"/, "\""); sub(/\\');$/, "PROTECTED_END_SLASH"); gsub(/\\'/, "''"); sub("PROTECTED_END_SLASH$", "\\');"); }
if ( !in_insert && $0 ~ /enum/ ){ # enums become create type statements #CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy'); enum=data=$0; gsub(/^[ ]*"/,"", enum); gsub(/" .*$/, "", enum); gsub(/^[ ]*".*" enum\(/, "(", data); gsub(/\).*$/, ");", data); print "DROP TYPE " enum "_enum ;" print "CREATE TYPE " enum "_enum AS ENUM " data; gsub(/enum\(.*\)/, enum"_enum"); }
if ( $0 ~ /AUTO_INCREMENT/ || $0 ~ /auto_increment/ ) { # Create a script to create the lines to be run to set next values # correctly: # ALTER SEQUENCE <table>_<column>_seq # RESTART WITH <prior>; col_name = $1; gsub(/\"/, "", col_name); seq_lines = seq_lines"SELECT 'ALTER SEQUENCE "table_name"_"col_name"_seq RESTART WITH '||max("col_name")+1||';' FROM "table_name";\n" # change the line appropriately gsub(/ AUTO_INCREMENT/, ""); gsub(/ auto_increment/, ""); gsub(/int\([5-9]\)/, "serial"); gsub(/int\(1[0-9]\)/, "bigserial"); }
# convert data types gsub(/tinyint\(.*\)/, "smallint"); gsub(/mediumint\(.*\)/, "integer"); gsub(/int\([0-4]\)/, "smallint"); gsub(/int\([5-9]\)/, "integer"); gsub(/int\(1[0-9]\)/, "bigint"); gsub(/" double/, "\" double precision");
#datetime '0000-00-00 00:00:00' gsub(/datetime/, "timestamp"); gsub(/0000-00-00/,"1970-01-01"); gsub(/DEFAULT CURRENT_TIMESTAMP/, "DEFAULT CURRENT_TIMESTAMP(0)"); # this is handled in code gsub(/ON UPDATE CURRENT_TIMESTAMP/, "");
# handle "KEY" statements by making indexes # forum: CREATE UNIQUE INDEX name ON table (column [, ...]); if ( $0 ~ /^[ ]*[UNIQUE ]*KEY/ ) { if ( $0 ~ /^[ ]*UNIQUE KEY/ ) { unique=1; } gsub(/^[ ]*[UNIQUE ]*KEY/, ""); if ( $0 ~ /,$/ ){ gsub(/,$/, ""); } else { # for some reason matching wasn't reliable # for removing comma at end of statement l=length(statement); statement = substr(statement, 1, l-2); } keyname=$1 gsub(/\"/, "", keyname); $1=""; index_line = index_line "\nCREATE " (unique? "UNIQUE" : "") " INDEX \"" table_name"_"keyname"\" on " table_name " " $0";"; unique=0; keyname=""; }
/;/{ print statement; print index_line; table_name=statement=index_line=""; in_insert=0; } END{ print seq_lines; }
# The following changes used in converting to postgres s/date_format(\(.*\), '%d-%b-%Y')/to_char(\1, 'DD-Mon-IYYY')/ig s/ifnull(\(.*\), \(.*\))/coalesce(\1, \2)/ig s/convert_tz(\(.*\), \@\@global.time_zone, \(.*\))/\ \1 at time zone \2/ig s/datediff(\(.*\), \(.*\))/(\1 - \2)/ig s/date_add(\(.*\), \(.*\))/(\1 + \2)/ig # note that Postgres random() is self-seeding s/rand(.*)/random()/ig s/group_concat(\(.*\) separator \(.*\)/\ array_to_string(array_agg(\1), \2)/ig s/group_concat(\(.*\))/array_to_string(array_agg(\1),',')/ig
# datediff with comparison replacement is: # select created from table_with_dates # where date_part( 'years', age(created)) >= 1; # OR # select created from table_with_dates # where created >= (CURRENT_TIMESTAMP - interval '1 year');
# if can often be replaced with # greatest() or least() # otherwise # if (a = b, c, d) becomes # case a when b then c # else d # end
# between sometimes is better as overlaps ie # x between y and z # (x, interval '0 day') overlaps y, z # case (date( now() at time zone c.time_zone), interval '0 day') # overlaps ( coalesce(x.start_date, s.start_date), # coalesce(x.end_date, s.end_date)) # when true then 'Y' # else 'N' # end as between_dates, # case (date(now() at time zone c.time_zone) >= # coalesce(x.start_date, s.start_date)) # when true then 'Y' # else 'N' # end as started,
# boolean params need to be set to a bool value and # cast with ::boolean # be careful with (? or column = ?) where # both params are the same, # it doesn't work unless the params are boolean
# Need to use "AS" when it is optional in mysql
Our test data converted no problem. Actual user generated content was a different story.
Checking encoding with chardet was way too slow, we'd still be converting data! Also, it wasn't particularly accurate.
Used with specific sets of data where we knew the encoding
I call it 'fake Latin-1'
Started with sed, quickly became messy
Tried awk, then perl
Ended up using C as I could use char datatype to represent a byte and then wchar_t to represent wide characters
Converted to html escape characters where reasonable to match partial prior conversion
/* VALID UTF-8 * 00 to 7f hex (0 to 127): first and only byte of a sequence. * C2 to DF hex (194 to 223): first byte of a two-byte sequence. * E0 to EF hex (224 to 239): first byte of a three-byte sequence. * 80 to BF hex (128 to 191): continuing byte in a multi-byte sequence. */
BUT
0x92 is probably a fancy backquote not a control character
Julie Baumler
@competentgirl
juli@baumler.com
https://www.catme.org