Updating a PHP, MySQL, and Javascript Site to Support Foreign Languages with UTF-8

Written by Ryan on May 17th, 2014

Recently on SourceAudio we decided to make supporting foreign languages a priority. We’d always supported HTML-encoded foreign language characters, but clients found that extremely clumsy and had no desire to learn such an arcane syntax, for which I couldn’t blame them. The solution was to start supporting them properly, which meant switching character encodings across all layers of the site. After some deliberation, we decided to go with UTF-8, since that would get us all the characters we needed and seemed to have the widest support.

If you’re not familiar with character encoding, Joel Spolsky’s classic article on the subject gives a good overview. Basically, we needed to support characters like õôóõêç and 测试 in addition to the traditional English characters.

With that decided, it was time to start working on the layers. First up, we needed our backend data to be stored in UTF-8, and that meant updating MySQL.

MySQL

Our first step was to update the collation of our database tables, which all defaulted to latin1, supposedly for historical reasons. There are a number of resources you can look at to figure out which collation is best for you, but we decided on utf8_unicode_ci. That collation gives a natural sort order, but it has some search problems that we compensate for by specifying “COLLATE utf8_bin” in some of our searches. For example, imagine you have three albums:

  1. aggressive rock 6
  2. About Rock 2
  3. Another Rock Album

With utf8_unicode_ci, if you ask for them sorted by name asc, your results will be returned in a case-insensitive order:

  1. About Rock 2
  2. aggressive rock 6
  3. Another Rock Album

utf8_bin, on the other hand, looks only at the binary representation of the characters, so it doesn’t consider “a” equal to “A” and will return:

  1. About Rock 2
  2. Another Rock Album
  3. aggressive rock 6

Which of those orders you want will depend on your application, but most non-technical people assume the first is correct, and that’s what we needed.

When you’re searching, though, sometimes you want exact matches. Say we search to see whether there’s an album called “Aggressive Rock 6”: utf8_unicode_ci will return the “aggressive rock 6” album, which is not what we want. utf8_bin, on the other hand, would return no rows since it’s case sensitive, so we can change the collation for just this one query, using

SELECT *
FROM `albums`
WHERE `name` = 'Aggressive Rock 6' COLLATE utf8_bin

And the “COLLATE” bit will make sure we only get the exact matches we want.

Note: since we started doing this, MySQL has come out with a new UTF-8 charset called utf8mb4, which supports even more characters. If you have MySQL 5.5.3 or greater, you should (and might have to) use utf8mb4_unicode_ci and utf8mb4_bin instead. It takes up more storage (up to 4 bytes per character instead of 3), but the point was to support as many characters as possible, and space is a small price to pay. Otherwise, they appear to behave identically to the old versions.
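For reference, converting an existing table is a one-liner (a sketch using the albums table from the examples in this post; back up before altering production tables):

ALTER TABLE `albums` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;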

I’m putting this in the MySQL section since the fix happened on the database layer, but it didn’t actually present itself until after we changed our PHP code. Something to watch out for: we had a number of characters stored in a different encoding, and when we started reading them as UTF-8, they came out as gibberish. Perhaps you can find a better solution, but we found that searching for ‘%â€%’ revealed the badly encoded strings, and the following series of REPLACE calls repaired most of them.

UPDATE `albums` SET `name` = REPLACE(`name`, '’', '''') WHERE `name` LIKE '%’%';
UPDATE `albums` SET `name` = REPLACE(`name`, '„', '"') WHERE `name` LIKE '%„%';
UPDATE `albums` SET `name` = REPLACE(`name`, '•', '-') WHERE `name` LIKE '%•%';
UPDATE `albums` SET `name` = REPLACE(`name`, '‘', '''') WHERE `name` LIKE '%‘%';
UPDATE `albums` SET `name` = REPLACE(`name`, '′', '''') WHERE `name` LIKE '%′%';
UPDATE `albums` SET `name` = REPLACE(`name`, '–', '-') WHERE `name` LIKE '%–%';
UPDATE `albums` SET `name` = REPLACE(`name`, '—', '-') WHERE `name` LIKE '%—%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'é', 'é') WHERE `name` LIKE '%é%';
UPDATE `albums` SET `name` = REPLACE(`name`, '…', '…') WHERE `name` LIKE '%…%';
UPDATE `albums` SET `name` = REPLACE(REPLACE(`name`, '“', '"'), 'â€', '"') WHERE `name` LIKE '%“%';

There were other affected characters but they were mostly in foreign language strings and I couldn’t figure out what they were supposed to be. If I had to do it again, I’d probably try to read out the affected strings into PHP, figure out their original encodings, convert them into UTF-8, and resave them.
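For what it’s worth, here’s a minimal sketch of that approach. It assumes the garbled strings are UTF-8 bytes that were misread as windows-1252 and re-saved as UTF-8 (the usual cause of the â€ pattern above), and $broken is a hypothetical value read from the database:

// Converting the double-encoded string back to windows-1252 recovers
// the original UTF-8 byte sequence.
$repaired = mb_convert_encoding($broken, 'Windows-1252', 'UTF-8');

// Only resave the result if it came out as valid UTF-8.
if (mb_check_encoding($repaired, 'UTF-8')) {
	// UPDATE the row with $repaired
}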

PHP

Before dealing with PHP code itself, it’s important to make sure your editor of choice is saving files in UTF-8 format. How you do that will depend on your editor. If you skip this step, any UTF-8 characters that appear in your code will get mangled.

To enable multibyte strings in PHP, you’ll need to update your php.ini file and find the [mbstring] section. We’re using the following settings:

mbstring.language = English
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = On
mbstring.detect_order = auto
mbstring.substitute_character = none

So all of our strings now support multibyte characters and use the UTF-8 charset. There’s one more setting to consider. We initially had it on because it makes the transition easier, but we ended up scrapping it and going the hard way instead, and I’ll explain why.

Once you switch to multibyte strings in PHP, you can’t use a lot of the normal string functions to which you’re accustomed. strlen, strpos, substr, strtolower, even functions like mail, all have to be swapped out for their multibyte equivalents. If you call the normal strlen on a multibyte string, it counts bytes rather than characters, which can yield inaccurate lengths. Fortunately, the new functions are all simply mb_{old function name}: mb_strlen, mb_strpos, mb_substr, mb_strtolower, and, because it’s not PHP if the naming convention isn’t inconsistent, mb_send_mail. You can find a full list of the affected functions in the PHP documentation.
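For example, with UTF-8 as the internal encoding (“é” takes two bytes):

strlen('café');    // 5, because it counts bytes
mb_strlen('café'); // 4, because it counts characters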

Note: str_replace isn’t one of the affected functions. For whatever reason, it supports multibyte strings out of the box, and so do the preg regex functions (as long as you use the u pattern modifier). But that doesn’t cover every string function, so you’ll have to handle them on a case-by-case basis.

mbstring.overload

Now if you start looking at that page of functions, you’ll see that it’s actually information about the PHP config option, mbstring.overload. If you add to your php.ini the line

mbstring.overload = 7

PHP will actually overload all the affected functions for you without you having to do anything! (The 7 is a bitmask: 1 overloads mail, 2 the str functions, and 4 the ereg functions.) Essentially, when you call strlen, PHP quietly calls mb_strlen instead on the backend without you having to actually change the function call. If you have a big project and you’re not excited at the prospect of a painful find & replace process, that sounds like a godsend. We tried it at first, and while it does work exactly as advertised, that behavior creates some problems.

Firstly, there’s no way to get back to the old functions. If, for example, you’re using strlen to count the number of bytes in a string instead of the number of characters and you need it to keep doing that even if the string has multibyte characters, you’re SOL. With overloading on, strlen will always return the number of characters and not the number of bytes. With overloading off, strlen is the number of bytes and mb_strlen is the number of characters. You have the control. That problem is not insurmountable though and I whipped up a byte counting function, which you’re welcome to:

function countBytes($str) {
	// Treating the string as single-byte latin1 makes every byte count as one character
	return mb_strlen($str, 'latin1');
}

You might be able to come up with your own implementations of the other overloaded functions too but that’s not the only issue.

The problem that was a deal breaker for us was external libraries. A good library will probably have support for different charsets, and therefore multibyte strings, but it’s very unlikely to be prepared for its functions to be overloaded to behave slightly differently. We use an ID3 library, for example, and it was expecting strlen to return byte counts. When your libraries can’t count on their functions to perform their normal… functions, they won’t behave correctly. You could certainly dig through them all, find any instances where it’s a problem, and fix them by hand, but you’re setting yourself up for a maintenance nightmare in the long run. I ended up putting on some good music and whiling away a few hours with my trusty find-in-files tool, converting our codebase to use multibyte functions without overloading. I think it’s better to put in a little more time up front than to load yourself up with work and weird bugs in the future.

Note: unlike most php.ini settings, you can’t set mbstring.overload at runtime with ini_set or in your .htaccess file. It has to be in your .ini file. I’m not sure why that is but since it’s kind of a weird hack sort of feature, I guess I’m not too surprised.

mb_substr

One other change you’ll need to make to your string handling: when accessing specific characters in a string, you can’t use $string[$index] with multibyte strings. That’ll return byte $index, which might not be the same as character $index. If you want the character, you’ll need mb_substr($string, $index, 1).
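A quick illustration (again with “é” taking two bytes in UTF-8):

$s = 'café';
echo $s[3];               // the first byte of "é", which prints as garbage
echo mb_substr($s, 3, 1); // "é"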

However, mb_substr is actually significantly slower than accessing the string directly since, in order to find the character you’re asking for, it needs to scan the entire string up to that point. For example, if you ask for the 8th character and all characters are 1 byte, you know you can just grab byte number 8 and ignore the rest of the string. But if the characters each take 1-4 bytes, you might need byte number 8 or you might need bytes 32 through 35. The only way to know is to go through each character in the string up to and through the 8th and figure out how many bytes all of them take up.

For short strings, that shouldn’t really matter because computers are fast, but we had some strings that were hundreds of thousands of characters long, and when we tried to iterate over them with mb_substr, it ran for… well, it was at least half an hour. I gave up after that, so I can’t tell you how long it would eventually take. And that’s a block of code that used to complete in under a second. For shorter strings, you can use preg_split, but for longer ones, Mészáros Lajos‘ answer to a Stack Overflow question on the subject is money. His solution involves iterating over the string like mb_substr does, but remembering your progress so you don’t have to redo all the work each time.

With normal iteration and mb_substr, the function has to recount every time. So iterating over an 8-character string means checking character 1, then 1 and 2, then 1, 2, and 3, etc. Mészáros’ solution is to remember the byte position of the last check so you don’t have to start over each time. You just end up checking 1, then 2, then 3, etc. instead.

function nextChar($string, &$pointer) {
	if (!isset($string[$pointer])) return false;
	$char = ord($string[$pointer]);
	if ($char < 128) {
		// Single-byte (ASCII) character
		return $string[$pointer++];
	} else {
		// Multibyte character: the lead byte tells us the sequence length
		if ($char < 224) {
			$bytes = 2;
		} elseif ($char < 240) {
			$bytes = 3;
		} elseif ($char < 248) {
			$bytes = 4;
		} elseif ($char < 252) {
			$bytes = 5;
		} else {
			$bytes = 6;
		}
		$str = substr($string, $pointer, $bytes);
		$pointer += $bytes;
		return $str;
	}
}

$i = 0;
while (($char = nextChar($string, $i)) !== false) {
	// do something
}

I, for one, thought that was a pretty clever solution, and it seems to smoke all the others in both speed and memory usage.

An interesting side effect here: when you have a string that you're going to break into smaller strings, my inclination is usually to perform any operations I can on the whole string before splitting it, since there's efficiency in avoiding the overhead of repeated function calls. But if one of those operations uses mb_substr to iterate through the string, there might actually be an advantage to splitting it first and then running the operation on the substrings. It really just depends on the specifics, but it's something to keep in mind.

Downloading

Another PHP change we had to make was in the filenames of downloadable files. Part of SourceAudio is letting end users download music files, and admins can change the suggested filenames for those files, so it's possible those filenames will have foreign language characters. So when you send the Content-Disposition header to the end user, how do you specify that the filename is UTF-8?

I found a great Stack Overflow question that addresses this very concern, and here's our version of the solution proposed by Martin Ørding-Thomsen, which works well in my testing.

// IE 8 and below can decode percent-encoded UTF-8 in the plain filename parameter
if (preg_match('/MSIE (.*?);/', $_SERVER['HTTP_USER_AGENT'], $matches) && $matches[1] < 9) {
	header("Content-disposition: attachment; filename=".str_replace('+', ' ', urlencode($filename))."\r\n");
// The stock Android browser mishandles encoded filenames, so it gets a plain ASCII-safe name
} elseif (mb_strpos(mb_strtolower($_SERVER['HTTP_USER_AGENT']), 'android') !== false) {
	header("Content-disposition: attachment; filename=\"".cleanFilename($filename)."\"\r\n");
// Everyone else gets an ASCII-safe fallback plus the RFC 5987 filename* form
} else {
	header("Content-disposition: attachment; filename=\"".cleanFilename($filename)."\"; filename*=UTF-8''".str_replace('+', '%20', urlencode($filename))."\r\n");
}

The cleanFilename() function just removes most non-alphanumeric characters to guarantee compatibility.
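We haven't included cleanFilename() here, but a minimal sketch of that idea might look like this:

function cleanFilename($filename) {
	// Strip everything outside a conservative ASCII set
	return preg_replace('/[^A-Za-z0-9 ._-]/', '', $filename);
}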

URLs

On a funny Unicode-related note, you can see that Martin Ørding-Thomsen's name has a multibyte character in it, and if you look at the URL of his Stack Overflow user page, http://stackoverflow.com/users/346150/martin-rding-thomsen, that character is stripped out. You can put Unicode characters in URLs, but support seems spotty. I tested Chrome, Firefox, and Safari; all of them will render Unicode characters in URLs, but when you try to copy and paste them, Chrome and Firefox escape them and Safari doesn't.

So the URL http://secularcoding.com/Øtest copied in Chrome and Firefox yields http://secularcoding.com/%C3%98test when pasted, but Safari returns the original URL, unchanged.

If possible, I'd recommend against having Unicode URLs, but if you have to, test extensively!

Uploads

Handling uploaded files can also be problematic because it's difficult to force your users to use the encoding you want when they're sending you data to process. In SourceAudio's case, one example is CSV files with metadata for tracks. The uploading clients aren't familiar with character encodings and tend to just hit save and forget it. Depending on the program and operating system they used, that can result in a number of different encodings, though one in particular we get a lot of is windows-1252, which I think comes from Excel on Windows, though it seems to be LibreOffice's default format on Windows too.

If you're curious what encodings your users' files are actually in, good luck getting PHP to tell you. There's an mb_detect_encoding function that you'd think would be perfect for this purpose, but in all my experiments, it's wrong more than it's right. If you're on Windows, I found a program called File Encoding Checker that seems to work pretty well, though the fact that it only works on directories and not single files is a little obnoxious. I assume there are similar programs on other operating systems or, if you're feeling ambitious, you can scan the files yourself and look for the telltale bytes.

As far as actually fixing the problem goes, we found Sebastián Grignoli's Encoding class is a good way to take care of those alternate format files. Once we load the information from the file into a string, we pass it into Encoding::toUTF8() and it does a great job of converting whatever it finds in there. It also handles the situation where the data is already UTF-8 so there's no harm in running it (other than time) if you're not sure what you're getting.
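Usage is about as simple as it gets (a sketch: the include path and variable names here are placeholders for however you load the class and the uploaded file):

require_once 'Encoding.php';

$raw = file_get_contents($uploadedCsvPath);
$utf8 = Encoding::toUTF8($raw); // converts windows-1252/latin1 input, leaves valid UTF-8 untouched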

Headers

The last thing to think about is your content types, but that's pretty easy. In your HTML, just add

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

And before you send any HTML or AJAX responses, send a

header("Content-Type: text/html; charset=UTF-8\r\n");

We don't use much XML, but the proper way to specify it there is with a

<?xml version="1.0" encoding="UTF-8"?>

And that should be enough to let the browser know what to expect!

Solr

SourceAudio uses Solr for searching, and I was worried that setting up UTF-8 in Solr would be challenging, but apparently it's the default character set. The only thing we needed to do was make sure we sent the proper Content-Type, 'Content-type:text/xml; charset=utf-8', when adding documents. Once we made that change and repushed the data, we were immediately able to search for foreign language characters.

If you want to strip accents so searches for "cafe" will match "café", you can always use an ASCIIFoldingFilterFactory.
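As a sketch, that filter goes in a field type's analyzer chain in your schema.xml (the field type name here is made up):

<fieldType name="text_folded" class="solr.TextField">
	<analyzer>
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.ASCIIFoldingFilterFactory"/>
	</analyzer>
</fieldType>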

You can read some more about Solr's UTF-8 support in their FAQ but I was pretty impressed - it really just works.

Javascript

If you're still using escape() instead of encodeURIComponent(), it's time to make the switch. escape() comes up with different (incorrect) encodings for multibyte strings than encodeURIComponent() does, and really, it's bad all around.
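You can see the difference right in the browser console:

escape('café');             // "caf%E9", a Latin-1 escape; bigger characters get non-standard %uXXXX
encodeURIComponent('café'); // "caf%C3%A9", proper percent-encoded UTF-8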

That's about it, as far as we found. Once you start sending the correct headers and set your HTML meta tags correctly, Javascript just starts treating strings as UTF-8 and it all just works.

Conclusions

The release went fairly smoothly, though we continue to have problems with non-UTF-8 characters in the database being garbled. Reading everything out, converting it, and saving it back might be the solution, but I haven't taken that step yet. New data going in appears to be fine. We also encountered a strange problem where some strings in the database would have a \0 inserted between every character, but since only a couple of fields are affected, I'm guessing it's something specific we're doing in code rather than a systemic issue. To make it more interesting, you can only see the null characters in a few places, while other places just ignore them. I'll update this post if I figure that one out.

Supporting non-English languages is an important step for a growing website. You might not be worried about non-Latin characters, but we used to hit issues with even the accented letters in something like café, and switching to UTF-8 solves all of that. The real disappointment is that UTF-8 isn't the default character set and that this process is necessary at all. I'm sure there are great legacy reasons projects use the character sets they do (there always are), but I'm not convinced it isn't worth sucking it up and making the change. In the future, I'll probably make this one of the first steps in any new project and then never have to worry about it again.

Anything I missed? Let me know in the comments. I'd be happy to add to this post and give you credit.
