Switching Teams
Moving an Application from MySQL to Postgres

Julie Baumler
@competentgirl
juli@baumler.com
June 18, 2013

SQL is an ISO standard, so you should just be able to switch backend databases, right?

Sadly no.

My Database Background

kaypro (image from http://www.mrmartinweb.com/computer.html)

My Database Background

mid-1990's worked on proprietary database accessed via AWK

My Database Background

late 1990's - Perl and Berkeley DB
a couple more proprietary databases on AIX
(Unicorn, others)
later 1990's - mySQL, because we had no money, Postgres needed to be shutdown and vaccuumed daily and we needed to run 24/7 (hopefully)

early 2000's - Oracle, some DB/2

My Database Background

Remedy Install

"Hey, can I admin this unimportant database the DBA's don't have time for? I think it will help me understand admining for systems running databases better"
...
"Why is a sys admin the DBA for the most used database in the organization?"

My Database Background

mid 2000's - back to mySQL, sqlite
MS SQL Server
more mySQL, some Postgres
2011 - started work on CATME Project

CATME App History

2005 - built on mod_perl and MySQL
2008 - Sun buys MySQL AB
2010 - Oracle buys Sun
We start to worry about future of MySQL
2012 - Conversion of CATME backend to Postgres

Why not MySQL

@johndalton @0xabad1dea MySQL: Doing everything adequately right, most of the time.
— MenTaLguY (@mentalguy) June 5, 2013

Why not MySQL

MySQL may not have been optimal choice for us in the first place.

The way we write SQL plays to MySQL's weaknesses (ie subqueries)
Our needs and MySQL's strengths don't line up well.
Didn't need round-robin replication

Options Considered

Wait
Maria DB
Postgres

Didn't consider NoSQL because this is relational data

Why PostgreSQL

The way we write SQL plays to Postgres's strengths

Our needs and Postgres's strengths line up well.

We had access to good Postgres resources

mysqldump | psql

Postgres mode - Postgres hated it

ANSI mode - better

convert.[sed|awk]

# remove unnecessary or inappropriate escapes
# fix others
if (in_insert){ 
	gsub(/\\r/,"");
	gsub(/<\/p>\\n\\n/, "");
	gsub(/\\n\\n/, "
");
	gsub(/<[bB][rR][\/]*>\\n/, "
");
	gsub(/\\n/, "
");
	gsub(/\\"/, "\"");
        sub(/\\');$/, "PROTECTED_END_SLASH");
	gsub(/\\'/, "''");
        sub("PROTECTED_END_SLASH$", "\\');");

}

convert.awk


if ( !in_insert && $0 ~ /enum/ ){
	# enums become create type statements
	#CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
	enum=data=$0;
	gsub(/^[ 	]*"/,"", enum);
	gsub(/" .*$/, "", enum);
	gsub(/^[ 	]*".*" enum\(/, "(", data);
	gsub(/\).*$/, ");", data);
	print "DROP TYPE " enum "_enum ;"
	print "CREATE TYPE " enum "_enum AS ENUM " data;

	gsub(/enum\(.*\)/, enum"_enum");
}

convert.awk


if ( $0 ~ /AUTO_INCREMENT/ || $0 ~ /auto_increment/ )
{
	# Create a script to create the lines to be run to set next values
	# correctly:
        # ALTER SEQUENCE <table>_<column>_seq 
	#   RESTART WITH <prior>;
	col_name = $1; gsub(/\"/, "", col_name);
	seq_lines = seq_lines"SELECT 'ALTER SEQUENCE 
	    "table_name"_"col_name"_seq RESTART WITH 
	    '||max("col_name")+1||';' FROM "table_name";\n"

	# change the line appropriately
	gsub(/ AUTO_INCREMENT/, "");
	gsub(/ auto_increment/, "");
	gsub(/int\([5-9]\)/, "serial");
	gsub(/int\(1[0-9]\)/, "bigserial");
}

convert.awk

# convert data types
gsub(/tinyint\(.*\)/, "smallint");
gsub(/mediumint\(.*\)/, "integer");
gsub(/int\([0-4]\)/, "smallint");
gsub(/int\([5-9]\)/, "integer");
gsub(/int\(1[0-9]\)/, "bigint");
gsub(/" double/, "\" double precision");

convert.awk


#datetime '0000-00-00 00:00:00'
gsub(/datetime/, "timestamp");

gsub(/0000-00-00/,"1970-01-01");

gsub(/DEFAULT CURRENT_TIMESTAMP/, "DEFAULT CURRENT_TIMESTAMP(0)");

# this is handled in code
gsub(/ON UPDATE CURRENT_TIMESTAMP/, "");

convert.awk


# handle "KEY" statements by making indexes
# forum: CREATE UNIQUE INDEX name ON table (column [, ...]);
if ( $0 ~ /^[ 	]*[UNIQUE ]*KEY/ ) {
	if ( $0 ~ /^[ 	]*UNIQUE KEY/ ) { unique=1; }
	gsub(/^[ 	]*[UNIQUE ]*KEY/, "");
	if ( $0 ~ /,$/ ){
		gsub(/,$/, "");
	} else {
		# for some reason matching wasn't reliable
		# for removing comma at end of statement
		l=length(statement);
		statement = substr(statement, 1, l-2);
	}
	keyname=$1
	gsub(/\"/, "", keyname);
	$1="";
	index_line = index_line "\nCREATE " (unique? "UNIQUE" : "") " INDEX \"" table_name"_"keyname"\" on " table_name " " $0";";
	unique=0;
	keyname="";
}

convert.awk

/;/{
print statement;
print index_line;
table_name=statement=index_line="";
in_insert=0;
}
END{ 
print seq_lines; 
}

Query Changes

	
# The following changes used in converting to postgres
s/date_format(\(.*\), '%d-%b-%Y')/to_char(\1, 'DD-Mon-IYYY')/ig
s/ifnull(\(.*\), \(.*\))/coalesce(\1, \2)/ig
s/convert_tz(\(.*\), \@\@global.time_zone, \(.*\))/\
    \1 at time zone \2/ig
s/datediff(\(.*\), \(.*\))/(\1 - \2)/ig
s/date_add(\(.*\), \(.*\))/(\1 + \2)/ig
# note that Postgres random() is self-seeding
s/rand(.*)/random()/ig
s/group_concat(\(.*\) separator \(.*\)/\
    array_to_string(array_agg(\1), \2)/ig
s/group_concat(\(.*\))/array_to_string(array_agg(\1),',')/ig

Query Changes

# datediff with comparison replacement is:
# select created from table_with_dates
#    where date_part( 'years', age(created)) >= 1;
# OR
# select created from table_with_dates
#    where created >= (CURRENT_TIMESTAMP - interval '1 year');

Query Changes

# if can often be replaced with
# greatest() or least() 
# otherwise 
# if (a = b, c, d) becomes
# case a when b then c
#	else d
# end

Query Changes

# between sometimes is better as overlaps ie 
# x between y and z
# (x, interval '0 day') overlaps y, z
#         case (date( now() at time zone c.time_zone), interval '0 day')
#            overlaps ( coalesce(x.start_date, s.start_date),
#                    coalesce(x.end_date, s.end_date)) 
#	    when true then 'Y'
#            else 'N'
#        end as between_dates,
#        case (date(now() at time zone c.time_zone) >=
#           coalesce(x.start_date, s.start_date)) 
#	   when true then 'Y'
#           else 'N'
#        end as started,

Query Changes


# boolean  params need to be set to a bool value and 
# cast with ::boolean 
# be careful with (? or column = ?) where 
# both params are the same,
# it doesn't work unless the params are boolean

# Need to use "AS" when it is optional in mysql

User Generated Content Data Conversion

Our test data converted no problem. Actual user generated content was a different story.

Moral: Know your data

User Generated Content Data Conversion
chardet

Checking encoding with chardet was way too slow, we'd still be converting data! Also, it wasn't particularly accurate.

User Generated Content Data Conversion
iconv

Used with specific sets of data where we knew the encoding

TIS-620 (Thailand)
CP949 (Korea)

Windows-1252 is Evil

I call it 'fake Latin-1'

User Generated Content Data Conversion
fix utf8.c

Started with sed, quickly became messy

Tried awk, then perl

Ended up using C as I could use char datatype to represent a byte and then wchar_t to represent wide characters

Converted to html escape characters where reasonable to match partial prior conversion

User Generated Content Data conversion
fix utf8.c


/* VALID UTF-8
 * 00 to 7f hex (0 to 127):
	     first and only byte of a sequence.
 * C2 to DF hex (194 to 223):
	     first byte of a two-byte sequence.
 * E0 to EF hex (224 to 239):
	     first byte of a three-byte sequence.
 * 80 to BF hex (128 to 191):
	     continuing byte in a multi-byte sequence.
 */

BUT

0x92 is probably a fancy backquote not a control character

Questions

Julie Baumler
@competentgirl
juli@baumler.com
https://www.catme.org

Switching Teams Moving an Application from MySQL to Postgres

SQL is an ISO standard, so you should just be able to switch backend databases, right?

Sadly no.

My Database Background

My Database Background

My Database Background

My Database Background

Remedy Install

My Database Background

CATME App History

Why not MySQL

Why not MySQL

MySQL may not have been optimal choice for us in the first place.

Options Considered

Why PostgreSQL

mysqldump | psql

convert.[sed|awk]

convert.awk

convert.awk

convert.awk

convert.awk

convert.awk

convert.awk

Query Changes

Query Changes

Query Changes

Query Changes

Query Changes

User Generated Content Data Conversion

Moral: Know your data

User Generated Content Data Conversionchardet

User Generated Content Data Conversioniconv

Windows-1252 is Evil

User Generated Content Data Conversionfix utf8.c

User Generated Content Data conversionfix utf8.c

Questions

Switching Teams
Moving an Application from MySQL to Postgres

User Generated Content Data Conversion
chardet

User Generated Content Data Conversion
iconv

User Generated Content Data Conversion
fix utf8.c

User Generated Content Data conversion
fix utf8.c