Change collation on all columns of a database

It was recently required for me to change the collation of each and every column of every table in a database from ‘latin1′ to ‘utf8′. Although the table collations were correct, the column collations were incorrect. It’s a cumbersome process to perform manually, and there’s apparently no real automated way to do it without a script. Although collation information is only meta-data, not actual data, I found this problem interesting.

Changing one column collation information is easy enough to do with one MySQL query:

ALTER TABLE `moods` 
CHANGE `mood_label` `mood_label` text CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Changing all the columns is more difficult. Here’s a small script that I came up with to do it recently:

dionyziz@orion:~$ mysqldump -u root --password=1234 \ 
--no-data --no-create-db --compact ccbeta \
|egrep 'CREATE TABLE|latin1' \
|sed 's/CREATE TABLE `\(.*\)` (/;ALTER TABLE `\1`/' \
|sed 's/character set latin1/CHARACTER SET utf8 COLLATE utf8_unicode_ci/' \
|sed 's/  `\(.*\)`/ CHANGE `\1` `\1`/'>columns
dionyziz@orion:~$ php -r 'file_put_contents( "columns", 
    preg_replace( "#^;|ALTER TABLE `.*`(\\s*;|$)#", "", 
    preg_replace( "#,(\\s*);#", ";\\1", 
    file_get_contents( "columns" ) ) ) );'
dionyziz@orion:~$ mysql -u root --password=1234 ccbeta <columns

Let’s go through it step-by-step.

mysqldump -u root --password=1234 --no-data --no-create-db --compact ccbeta

This creates a list of CREATE TABLE statements for all our tables. That’s good because it’ll allow us to determine whether the collation of a column is incorrect. Here’s an example CREATE TABLE statement:

CREATE TABLE `albums` (
  `album_id` int(11) NOT NULL auto_increment,
  `album_userid` int(11) NOT NULL default '0',
  `album_created` datetime NOT NULL default '0000-00-00 00:00:00',
  `album_name` text character set latin1 NOT NULL,
  `album_description` text character set latin1 NOT NULL,
  PRIMARY KEY  (`album_id`),
  KEY `album_userid` (`album_userid` )
) ENGINE=MyISAM AUTO_INCREMENT=55 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

In this example, the `album_name` and `album_description` columns are wrong and need their collations changed.

egrep 'CREATE TABLE|latin1'

This simple line limits our results to only lines that contain “CREATE TABLE” or “latin1″. That’s useful since it’ll only show the table names followed by a list of all incorrectly collated columns, if any. The result would be something like this:

CREATE TABLE `relations` (
CREATE TABLE `searches` (
 `search_query` text character set latin1 NOT NULL,
CREATE TABLE `shoutbox` (
 `shout_text` text character set latin1 NOT NULL,
 `shout_delreason` text character set latin1 NOT NULL,

(with more entries potentially)

Good. Now all we need to do is modify these lines to make them ALTER TABLE lines:

sed 's/CREATE TABLE `\(.*\)` (/;ALTER TABLE `\1`/'

Ah, the magic of regular expressions. This removes the final “(” of every CREATE TABLE line, as we don’t need it and also changes the word “CREATE” into “ALTER”. It also adds a semicolon in front of the ALTER TABLE statement (to terminate the previous statement).

sed 's/character set latin1/CHARACTER SET utf8 COLLATE utf8_unicode_ci/'

Straightforward enough, this replaces the existing character set instruction from latin1 to utf8, and adds the correct collation as well.

sed 's/  `\(.*\)`/ CHANGE `\1` `\1`/'

Finally, this adds the word “CHANGE” in front of every column line and repeats the column name (as we want to tell MySQL which column to change (first repetition) and to which to change it (second repetition)). The result is:

;ALTER TABLE `relations`
;ALTER TABLE `searches`
 CHANGE `search_query` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
;ALTER TABLE `shoutbox`
 CHANGE `shout_text` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
 CHANGE `shout_delreason` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,

Pretty close to what we actually want. You’ll notice three problems:

  • There are empty ALTER statements
  • There’s an extra comma at the end of every column (providing all your tables have a primary key, as they should)
  • There’s a redundant semicolon at the beginning

These problems cannot easily be fixed by sed because sed performs a line-to-line processing. A sed expert might have been able to provide us with a better solution, but I’ll prefer to use the PREG feature of PHP. To use PHP, first let’s save our current result into a file:

>columns

Time to run our PHP code on the target file:

php -r 'file_put_contents( "columns", 
    preg_replace( "#^;|ALTER TABLE `.*`(\\s*;|$)#", "", 
    preg_replace( "#,(\\s*);#", ";\\1", 
    file_get_contents( "columns" ) ) ) );'

Let’s analyze it in short.

file_get_contents( "columns" );

This, simply enough, reads the “columns” file into memory. Now we’ll perform two regular expression replacements:

First, we’ll match the following regular expression:

#,(\s*);# 

(notice that the # are separators that wrap the regular expression for clarity — they aren’t part of the actual regular expression)

Anything matching this will be replaced by ;\1. This means that a comma followed by any whitespace (including a new line) followed by a semicolon will be replaced by only a semicolon (and the same whitespace). This simply removes the redundant comma at the end of every ALTER statement.

Second, we’ll match the following:

#^;|ALTER TABLE `.*`(\\s*;|$)# 

Anything matching will be removed. You’ll notice that this regular expression matches basically two things (separated by the first alternation (pipe) character).

The first part is:

#^;# 

It’ll remove the first line if it only contains a single semicolon (which it does in our example).

The second part is:

#ALTER TABLE `.*`(\\s*;|$)# 

This will look for empty ALTER TABLE statements (an ALTER TABLE statement followed only by whitespace and a semicolon or an end-of-file) and remove them.

Finally, we’ll write the result back to the file we read from:

file_put_contents( "columns", ... );

Now if we cat that file we’ll see that it contains all ALTER statements in the form we want them:

ALTER TABLE `searches`
 CHANGE `search_query` `search_query` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
ALTER TABLE `shoutbox`
 CHANGE `shout_text` `shout_text` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
 CHANGE `shout_delreason` `shout_delreason` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;

Excellent. Finally, let’s execute it:

mysql -u root --password=1234 ccbeta <columns

You can also add ‘time’ in front of it to measure how long it’ll take. We can now validate that the collations were changed successfully by, again, performing our initial dump and grepping for ‘latin1′ to confirm that there are none.

Leave a Reply