Ruben Laguna's blog

Jul 29, 2010 - 2 minute read - Comments - wordpress

Wordpress migration: " (quotes) and ' (apostrophe) being replaced with â€œ and â€™

I migrated from TextDrive to a Joyent Shared Accelerator, and in the process I had to migrate the Wordpress MySQL database as well. After the migration, “ and ’ were showing as â€œ and â€™ respectively. It was a charset problem: apparently the data itself was already UTF-8, stored in a Latin1 database (due to the WP default charset).

So I did the backup again (like this)

$ mysqldump --user=$DB1USER --password=$DB1PASSWD \
--default-character-set=latin1 $DB1NAME > dump.sql

and then I imported the dump.sql file again into the Joyent utf8 mysql:

$ sed -e 's/DEFAULT CHARSET=latin1;/DEFAULT CHARSET=utf8;/' dump.sql > dp2.sql
$ mysqldump --user=$DB2USER --password=$DB2PASSWORD --add-drop-table \
--no-data $DB2NAME | grep ^DROP | mysql --user=$DB2USER --password=$DB2PASSWORD \
$DB2NAME # to drop all existing tables
$ mysql --user=$DB2USER --password=$DB2PASSWORD $DB2NAME < dp2.sql

and the problem was solved!
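Before importing, it can be reassuring to check that the dump really contains UTF-8 bytes. This is a minimal sketch, assuming an iconv that exits non-zero on an invalid input sequence (GNU iconv does); `is_valid_utf8` is a helper name I made up for illustration, it is not part of any tool.

```shell
# Hypothetical helper: succeeds only if every byte sequence in the file is
# well-formed UTF-8. Converting UTF-8 to UTF-8 changes nothing, but iconv
# stops with an error at the first sequence it cannot decode.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# dp2.sql is the converted dump; skip the check if it is not there.
if [ -f dp2.sql ]; then
    if is_valid_utf8 dp2.sql; then
        echo "dp2.sql is valid UTF-8"
    else
        echo "dp2.sql contains bytes that are not valid UTF-8"
    fi
fi
```

Note that this only proves the bytes are well-formed. Text that was already double-encoded is also valid UTF-8, so it would pass the check and still look wrong.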

Then it got me thinking: how do you actually get from “Japonés en viñetas” to â€œJaponÃ©s en viÃ±etasâ€? So I tried to reproduce the same result from the command line:

$ echo \“Japonés en viñetas\”
“Japonés en viñetas”
$ echo \“Japonés en viñetas\” |iconv -f latin1 -t utf-8
âJaponÃ©s en viÃ±etasâ

That’s not quite what I was expecting. Then I read the Wikipedia article on ISO-8859-1/Latin1 and found that Latin-1 is commonly confused with Windows-1252, and that “Many web browsers and e-mail clients will interpret ISO-8859-1 control codes as Windows-1252 characters in order to accommodate such mislabeling”.
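The difference between the two charsets is easy to see on a single character. This is a rough sketch, assuming iconv knows the names ISO-8859-1 and WINDOWS-1252; the function names are mine, chosen only for this example. It decodes the three UTF-8 bytes of “ both ways and dumps the result as hex.

```shell
# The three UTF-8 bytes of the left curly quote “ (U+201C): e2 80 9c,
# written as octal escapes so the terminal encoding does not matter.
lquote() { printf '\342\200\234'; }

# Read the bytes as ISO-8859-1: 0x80 and 0x9c are invisible control codes.
as_latin1() { lquote | iconv -f ISO-8859-1 -t UTF-8; }

# Read the bytes as Windows-1252: 0x80 is the euro sign and 0x9c is œ.
as_cp1252() { lquote | iconv -f WINDOWS-1252 -t UTF-8; }

as_latin1 | od -An -tx1   # bytes: c3 a2 c2 80 c2 9c     (â plus two controls)
as_cp1252 | od -An -tx1   # bytes: c3 a2 e2 82 ac c5 93  (â € œ)
```

Both decodings start with â, but only the Windows-1252 one produces the visible € and œ that showed up on the blog.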

So I tried it with Windows-1252:

$ echo \“Japonés en viñetas\”
“Japonés en viñetas”
$ echo \“Japonés en viñetas\” |iconv -f windows-1252 -t utf-8
â€œJaponÃ©s en viÃ±etasâ€
iconv: (stdin):1:25: cannot convert

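There it is: “ becomes â€œ when the text is actually UTF-8 but is misinterpreted as Windows-1252. The closing ” cannot be converted because its last byte, 0x9D, has no character assigned in Windows-1252.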
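The same trick works in reverse, which is handy for repairing text that has already been mangled. This is a minimal sketch under the assumption that the text went through exactly one round of "UTF-8 read as Windows-1252"; `fix_mojibake` is a made-up name, not an existing command.

```shell
# Hypothetical helper: encode the mangled text back to Windows-1252. The
# resulting bytes are the original ones, which were UTF-8 all along.
fix_mojibake() { iconv -f UTF-8 -t WINDOWS-1252; }

# The mangled form of “Japonés (â € œ J a p o n Ã © s) as UTF-8 bytes in
# octal, so the example does not depend on how this file is encoded.
printf '\303\242\342\202\254\305\223Japon\303\203\302\251s\n' | fix_mojibake
```

On a UTF-8 terminal the output should read “Japonés again. Characters that hit an unassigned Windows-1252 byte during the first conversion were already lost by then, so they cannot be recovered this way.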
