Archive for the ‘Regular Expressions’ Category

vim regexp magic

Tuesday, September 23rd, 2008

Input, this big one-liner:

function UnitUserSettingsSave( tInteger $dobd, tInteger $dobm, tInteger $doby, tText $gender, tInteger $place, tInteger $education, tInteger $school, tInteger $mood, tText $sex, tText $religion, tText $politics, tText $slogan, tText $aboutme, tText $favquote, tText $haircolor, tText $eyecolor, tInteger $height, tInteger $weight, tText $smoker, tText $drinker, tText $email, tText $msn, tText $gtalk, tText $skype, tText $yahoo, tText $web, tText $oldpassword, tText $newpassword, tText $emailprofilecomment, tText $notifyprofilecomment, tText $emailphotocomment, tText $notifyphotocomment, tText $emailpollcomment, tText $notifypollcomment, tText $emailjournalcomment, tText $notifyjournalcomment, tText $emailreply, tText $notifyreply, tText $emailfriendaddition, tText $notifyfriendaddition, tText $emailtagcreation, tText $notifytagcreation, tText $emailfavourite, tText $notifyfavourite ) {

Output, this beautifully spaced multiliner:

function UnitUserSettingsSave( tInteger $dobd, tInteger $dobm,
          tInteger $doby, tText $gender,
          tInteger $place, tInteger $education,
          tInteger $school, tInteger $mood,
          tText $sex, tText $religion,
          tText $politics, tText $slogan,
          tText $aboutme, tText $favquote,
          tText $haircolor, tText $eyecolor,
          tInteger $height, tInteger $weight,
          tText $smoker, tText $drinker,
          tText $email, tText $msn,
          tText $gtalk, tText $skype,
          tText $yahoo, tText $web,
          tText $oldpassword, tText $newpassword,
          tText $emailprofilecomment, tText $notifyprofilecomment,
          tText $emailphotocomment, tText $notifyphotocomment,
          tText $emailpollcomment, tText $notifypollcomment,
          tText $emailjournalcomment, tText $notifyjournalcomment,
          tText $emailreply, tText $notifyreply,
          tText $emailfriendaddition, tText $notifyfriendaddition,
          tText $emailtagcreation, tText $notifytagcreation,
          tText $emailfavourite, tText $notifyfavourite ) {

How? With one command, in vim:

:s/\(.\{-1,}\),\(.\{-1,}\),/\1,\2,\r        /ig

What does it do?

First off, :s/needle/replacement/g searches the current line for the regular expression needle and replaces it with the expression replacement. Only the current line is searched because we didn’t specify a range before the “s”. “s” is the Ex command that we’re running, short for “substitute”. The “g” flag after the final slash stands for “global”, meaning it should feel free to replace several occurrences in the same line, not just the first. (The “i” flag in the command above makes the match case-insensitive; since the pattern contains no letters, it makes no difference here and can be dropped.)
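To see what the “g” flag buys us outside of vim, here is a minimal sketch in Python (my choice of language for illustration, not something the vim command depends on). The count=1 argument to re.sub plays the role of leaving “g” off, and the sample string is made up.

```python
import re

line = "a, b, c, d"  # made-up sample line

# Without "g", vim's :s stops after the first match; count=1 imitates that.
first_only = re.sub(r",", ";", line, count=1)

# With "g", every match on the line is replaced; re.sub does this by default.
every_match = re.sub(r",", ";", line)

print(first_only)   # a; b, c, d
print(every_match)  # a; b; c; d
```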

Now, for the needle expression. It can be essentially split into two parts:
1) \(.\{-1,}\),
2) \(.\{-1,}\),

These two expressions match exactly the same thing: any run of characters followed by a comma (the one that you see at the end of each expression). A dot on its own matches just one character, so we add the lazy quantifier \{-1,} after it to let it match more than one character (as many as it needs to reach the comma at the end).

The expression .\{-1,} means: match one or more of any character, as many as are needed for the whole expression to match. Because this is a lazy quantifier, it matches as few characters as possible, stopping as soon as it can find a comma right afterwards (the comma itself is not part of it).
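The difference between lazy and greedy is easiest to see side by side. Here is a small Python sketch, assuming that Python’s .+? is a fair stand-in for vim’s .\{-1,} and .+ for the greedy .\{1,}; the argument string is shortened from the function above.

```python
import re

args = "tInteger $dobd, tInteger $dobm, tInteger $doby,"

# Lazy: ".+?" stops at the first comma it can find.
lazy = re.match(r"(.+?),", args).group(1)

# Greedy: ".+" grabs as much as it can and only gives back the last comma.
greedy = re.match(r"(.+),", args).group(1)

print(lazy)    # tInteger $dobd
print(greedy)  # tInteger $dobd, tInteger $dobm, tInteger $doby
```

With the greedy version, a single match would swallow nearly the whole argument list, so there would be nothing left to break up.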

So both expressions tied together match anything followed by a comma followed by anything followed by a comma. Translation? They match two of the arguments of those provided in the function argument list.

The parentheses around each of them, written \( and \), capture what is within, to be used in the replacement string. Our replacement string is simply “\1,\2,\r” followed by eight spaces. It will replace \1 with the first parenthesized match, then add a comma, then replace the \2 with the second parenthesized match, then add yet another comma. Finally it will add a new line (\r) and the whitespace that indents the next line.

Repeating this pattern with the “global” flag applies the regular expression several times on the line, so a new line is added after every second argument.
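For anyone who wants to try the whole substitution without opening vim, here is a rough Python equivalent. It is only a sketch: the function signature is a shortened, made-up version of the one above, “\n” stands in for vim’s \r, and the pattern is my translation of the vim one into Python syntax.

```python
import re

# Shortened, made-up version of the long signature above.
line = ("function F( tInteger $a, tText $b, tInteger $c, "
        "tText $d, tText $e ) {")

# (.+?),(.+?), mirrors \(.\{-1,}\),\(.\{-1,}\), from the vim command.
# Every pair of arguments is put back unchanged, followed by a line break
# and eight spaces of indentation.
wrapped = re.sub(r"(.+?),(.+?),", r"\1,\2,\n        ", line)

print(wrapped)
```

The last argument is left alone because there is no second comma after it for the pattern to match.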

Change collation on all columns of a database

Sunday, May 25th, 2008

I recently needed to change the collation of each and every column of every table in a database from ‘latin1’ to ‘utf8’. Although the table collations were correct, the column collations were not. It’s a cumbersome process to perform manually, and there’s apparently no real automated way to do it without a script. Although collation information is only meta-data, not actual data, I found this problem interesting.

Changing the collation of a single column is easy enough to do with one MySQL query:

ALTER TABLE `moods` 
CHANGE `mood_label` `mood_label` text CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Changing all the columns is more difficult. Here’s a small script that I came up with to do it recently:

dionyziz@orion:~$ mysqldump -u root --password=1234 \
--no-data --no-create-db --compact ccbeta \
|egrep 'CREATE TABLE|latin1' \
|sed 's/CREATE TABLE `\(.*\)` (/;ALTER TABLE `\1`/' \
|sed 's/character set latin1/CHARACTER SET utf8 COLLATE utf8_unicode_ci/' \
|sed 's/  `\(.*\)`/ CHANGE `\1` `\1`/'>columns
dionyziz@orion:~$ php -r 'file_put_contents( "columns", 
    preg_replace( "#^;|ALTER TABLE `.*`(\\s*;|$)#", "", 
    preg_replace( "#,(\\s*);#", ";\\1", 
    file_get_contents( "columns" ) ) ) );'
dionyziz@orion:~$ mysql -u root --password=1234 ccbeta <columns

Let’s go through it step-by-step.

mysqldump -u root --password=1234 --no-data --no-create-db --compact ccbeta

This creates a list of CREATE TABLE statements for all our tables. That’s good because it’ll allow us to determine whether the collation of a column is incorrect. Here’s an example CREATE TABLE statement:

CREATE TABLE `albums` (
  `album_id` int(11) NOT NULL auto_increment,
  `album_userid` int(11) NOT NULL default '0',
  `album_created` datetime NOT NULL default '0000-00-00 00:00:00',
  `album_name` text character set latin1 NOT NULL,
  `album_description` text character set latin1 NOT NULL,
  PRIMARY KEY  (`album_id`),
  KEY `album_userid` (`album_userid` )
) ENGINE=MyISAM AUTO_INCREMENT=55 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

In this example, the `album_name` and `album_description` columns are wrong and need their collations changed.

egrep 'CREATE TABLE|latin1'

This simple line limits our results to only lines that contain “CREATE TABLE” or “latin1”. That’s useful since it’ll only show the table names followed by a list of all incorrectly collated columns, if any. The result would be something like this:

CREATE TABLE `relations` (
CREATE TABLE `searches` (
  `search_query` text character set latin1 NOT NULL,
CREATE TABLE `shoutbox` (
  `shout_text` text character set latin1 NOT NULL,
  `shout_delreason` text character set latin1 NOT NULL,

(with more entries potentially)
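The filtering step is simple enough to mimic in a few lines of Python, which can be handy for checking what survives it. This is just a sketch: keep_relevant is a name I made up, and the dump text is a trimmed version of the albums example above.

```python
import re

# Trimmed version of the example CREATE TABLE statement.
dump = """CREATE TABLE `albums` (
  `album_id` int(11) NOT NULL auto_increment,
  `album_name` text character set latin1 NOT NULL,
  PRIMARY KEY  (`album_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
"""

def keep_relevant(text):
    """Keep only lines mentioning CREATE TABLE or latin1, like the egrep step."""
    pattern = re.compile(r"CREATE TABLE|latin1")
    return [line for line in text.splitlines() if pattern.search(line)]

for line in keep_relevant(dump):
    print(line)
```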

Good. Now all we need to do is modify these lines to make them ALTER TABLE lines:

sed 's/CREATE TABLE `\(.*\)` (/;ALTER TABLE `\1`/'

Ah, the magic of regular expressions. This removes the final “(” of every CREATE TABLE line, as we don’t need it, and changes the word “CREATE” into “ALTER”. It also adds a semicolon in front of the ALTER TABLE statement (to terminate the previous statement).

sed 's/character set latin1/CHARACTER SET utf8 COLLATE utf8_unicode_ci/'

Straightforward enough, this replaces the existing character set instruction from latin1 to utf8, and adds the correct collation as well.

sed 's/  `\(.*\)`/ CHANGE `\1` `\1`/'

Finally, this adds the word “CHANGE” in front of every column line and repeats the column name, since MySQL wants both the column to change (first occurrence) and the name it should have afterwards (second occurrence). The result is:

;ALTER TABLE `relations`
;ALTER TABLE `searches`
 CHANGE `search_query` `search_query` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
;ALTER TABLE `shoutbox`
 CHANGE `shout_text` `shout_text` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
 CHANGE `shout_delreason` `shout_delreason` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
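The three sed substitutions can be replayed line by line in Python to check what each kind of line turns into. This is a sketch rather than a replacement for the pipeline: to_alter_lines is a made-up name, and the patterns are my translation of the sed ones into Python syntax.

```python
import re

def to_alter_lines(lines):
    """Apply the three substitutions to each line, in the same order as the pipeline."""
    out = []
    for line in lines:
        # CREATE TABLE `x` (   ->   ;ALTER TABLE `x`
        line = re.sub(r"CREATE TABLE `(.*)` \(", r";ALTER TABLE `\1`", line, count=1)
        # Swap the character set clause and add the collation.
        line = line.replace("character set latin1",
                            "CHARACTER SET utf8 COLLATE utf8_unicode_ci", 1)
        # Two spaces + `col`   ->   one space + CHANGE `col` `col`
        line = re.sub(r"  `(.*)`", r" CHANGE `\1` `\1`", line, count=1)
        out.append(line)
    return out

sample = [
    "CREATE TABLE `searches` (",
    "  `search_query` text character set latin1 NOT NULL,",
]
for line in to_alter_lines(sample):
    print(line)
```

The last pattern needs two spaces in front of the backtick, which is why it leaves the ALTER TABLE lines alone: those have only one.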

Pretty close to what we actually want. You’ll notice three problems:

  • There are empty ALTER statements
  • There’s an extra comma after the last column of every statement (provided all your tables have a primary key, as they should)
  • There’s a redundant semicolon at the beginning

These problems cannot easily be fixed by sed because sed processes its input line by line. A sed expert might have been able to provide us with a better solution, but I prefer to use PHP’s preg functions. To use PHP, first let’s save our current result into a file:

>columns

Time to run our PHP code on the target file:

php -r 'file_put_contents( "columns", 
    preg_replace( "#^;|ALTER TABLE `.*`(\\s*;|$)#", "", 
    preg_replace( "#,(\\s*);#", ";\\1", 
    file_get_contents( "columns" ) ) ) );'

Let’s analyze it in short.

file_get_contents( "columns" );

This, simply enough, reads the “columns” file into memory. Now we’ll perform two regular expression replacements:

First, we’ll match the following regular expression:

#,(\s*);# 

(notice that the # are separators that wrap the regular expression for clarity — they aren’t part of the actual regular expression)

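Anything matching this will be replaced by ;\1. This means that a comma followed by any whitespace (including a new line) followed by a semicolon will be replaced by only a semicolon (and the same whitespace). This simply removes the redundant comma at the end of every ALTER statement. Note that it only fires when a semicolon follows the comma, so the very last column line in the file keeps its comma unless the dump ends with a table that has no latin1 columns.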
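Here is the same replacement tried on a small made-up fragment in Python, whose re module treats this particular pattern the same way as PHP’s preg_replace as far as I can tell. The column definition is shortened for readability.

```python
import re

# Made-up fragment in the shape produced by the sed steps.
statements = (
    ";ALTER TABLE `searches`\n"
    " CHANGE `search_query` `search_query` text NOT NULL,\n"
    ";ALTER TABLE `shoutbox`\n"
)

# The comma is dropped, the semicolon takes its place and the captured
# whitespace (here the line break) is put back after it.
fixed = re.sub(r",(\s*);", r";\1", statements)

print(fixed)
```

In effect the semicolon that used to sit in front of the next ALTER TABLE moves up to the end of the previous statement.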

Second, we’ll match the following:

#^;|ALTER TABLE `.*`(\s*;|$)#

Anything matching will be removed. You’ll notice that this regular expression matches basically two things (separated by the first alternation (pipe) character).

The first part is:

#^;# 

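It’ll remove the semicolon at the very beginning of the file (the redundant one we noticed above). Since we aren’t using multi-line mode, ^ only matches at the start of the whole file, not at the start of every line.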

The second part is:

#ALTER TABLE `.*`(\s*;|$)#

This will look for empty ALTER TABLE statements (an ALTER TABLE statement followed only by whitespace and a semicolon or an end-of-file) and remove them.
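Both halves of this expression can be watched at work in a short Python sketch. The input is made up: it assumes the comma fix has already been applied, and it ends with a hypothetical `users` table that has nothing to change so that the end-of-file case is exercised too.

```python
import re

text = (
    ";ALTER TABLE `relations`\n"
    ";ALTER TABLE `searches`\n"
    " CHANGE `search_query` `search_query` text NOT NULL;\n"
    "ALTER TABLE `users`\n"
)

# ^; removes the leading semicolon. The other branch removes an ALTER TABLE
# that is followed only by whitespace and a semicolon, or by the end of the text.
cleaned = re.sub(r"^;|ALTER TABLE `.*`(\s*;|$)", "", text)

print(cleaned)
```

Only the `searches` statement survives, because it is the only ALTER TABLE with a CHANGE line between it and the next semicolon.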

Finally, we’ll write the result back to the file we read from:

file_put_contents( "columns", ... );

Now if we cat that file we’ll see that it contains all ALTER statements in the form we want them:

ALTER TABLE `searches`
 CHANGE `search_query` `search_query` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
ALTER TABLE `shoutbox`
 CHANGE `shout_text` `shout_text` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
 CHANGE `shout_delreason` `shout_delreason` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;

Excellent. Finally, let’s execute it:

mysql -u root --password=1234 ccbeta <columns

You can also add ‘time’ in front of it to measure how long it takes. We can now validate that the collations were changed successfully by, again, performing our initial dump and grepping for ‘latin1’ to confirm that no matches remain.
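If you would rather script that final check, something along these lines would do. It is a minimal sketch: latin1_leftovers is a made-up helper, and the two dump strings are invented samples rather than real mysqldump output.

```python
def latin1_leftovers(dump_text):
    """Return the lines of a schema dump that still mention latin1."""
    return [line.strip() for line in dump_text.splitlines() if "latin1" in line]

# Invented samples standing in for the dump before and after the change.
before = "CREATE TABLE `moods` (\n  `mood_label` text character set latin1 NOT NULL,\n"
after = "CREATE TABLE `moods` (\n  `mood_label` text NOT NULL,\n"

print(latin1_leftovers(before))  # one offending line
print(latin1_leftovers(after))   # empty list, nothing left to fix
```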