Updating a PHP, MySQL, and Javascript Site to Support Foreign Languages with UTF-8

Written by Ryan on May 17th, 2014

Recently on SourceAudio we decided to make supporting foreign languages a priority. We’d always supported HTML-encoded foreign language characters but clients found that extremely clumsy and had no desire to learn that arcane syntax, for which I couldn’t blame them. The solution was to start supporting them properly, which meant switching out the character encoding across all layers of the site. After some deliberation, we decided to go with UTF-8, since that would get us all the characters we needed and seemed to have the widest support.

If you’re not familiar with character encoding, Joel Spolsky gives a good overview here. Basically, we needed to support characters like õôóõêç and 测试 in addition to the traditional English characters.

With that decided, it was time to start working on the layers. First up, we needed our backend data to be stored in UTF-8 and that meant updating MySQL.

MySQL

Our first step was to update the collation of our database tables, which all defaulted to latin1, supposedly for historical reasons. There are a number of resources you can look at to figure out which collation is best for you but we decided on utf8_unicode_ci. That collation gives a natural sort order but has some search problems that we have to compensate for by specifying “COLLATE utf8_bin” in some of our searches. For example, imagine you have three albums:

  1. aggressive rock 6
  2. About Rock 2
  3. Another Rock Album

With utf8_unicode_ci (utf8_general_ci behaves the same way here), if you ask for them sorted by name asc, your results will be returned in a case-insensitive order:

  1. About Rock 2
  2. aggressive rock 6
  3. Another Rock Album

utf8_bin, on the other hand, only looks at the binary representation of the characters, so it doesn’t consider “a” and “A” to be the same letter and will return

  1. About Rock 2
  2. Another Rock Album
  3. aggressive rock 6

Which of those results you want will depend on your application but most non-technical people assume the first order is correct and that’s what we needed.

When you’re searching, though, sometimes you want exact matches. So we’ll search to see if there’s an album called “Aggressive Rock 6” and utf8_unicode_ci will return the “aggressive rock 6” album, which is not what we want. utf8_bin, on the other hand, would return no rows since it’s case sensitive, so we can just change the collation for this one query, using

SELECT *
FROM `albums`
WHERE `name` = 'Aggressive Rock 6' COLLATE utf8_bin

And the “COLLATE” bit will make sure we only get the exact matches we want.

Note: since we started doing this, MySQL came out with a new UTF-8 charset called utf8mb4, which supports even more characters. If you have MySQL 5.5.3 or greater, you should (and might have to) use utf8mb4_unicode_ci and utf8mb4_bin instead. Characters can take up more storage (up to 4 bytes each instead of a maximum of 3) but the point was to support as many characters as possible and space is a small price to pay. Otherwise, they appear to be identical to the old versions.
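
Actually converting an existing table is a one-liner. As a sketch (using the albums table from the example above and a made-up database name), it looks like this; just watch out for indexed VARCHAR columns, which can hit index length limits once a character can take 4 bytes:

-- Sketch: convert an existing table to utf8mb4
ALTER TABLE `albums` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Sketch: change the default for new tables ("sourceaudio" is a placeholder database name)
ALTER DATABASE `sourceaudio` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;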

I’m putting this in the MySQL section since the fix happened on the database layer but it didn’t actually present itself until after we changed our PHP code. Something to watch out for: we had a number of strings that were stored in a different encoding, and when we started reading them as UTF-8 they came out as gibberish. Perhaps you can find a better solution, but we found that searching for ‘%â€%’ revealed the badly encoded strings and the following series of REPLACE calls repaired most of them.

UPDATE `albums` SET `name` = REPLACE(`name`, 'â€™', '''') WHERE `name` LIKE '%â€™%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€ž', '"') WHERE `name` LIKE '%â€ž%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€¢', '-') WHERE `name` LIKE '%â€¢%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€˜', '''') WHERE `name` LIKE '%â€˜%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€²', '''') WHERE `name` LIKE '%â€²%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€“', '-') WHERE `name` LIKE '%â€“%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€”', '-') WHERE `name` LIKE '%â€”%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'Ã©', 'é') WHERE `name` LIKE '%Ã©%';
UPDATE `albums` SET `name` = REPLACE(`name`, 'â€¦', '…') WHERE `name` LIKE '%â€¦%';
UPDATE `albums` SET `name` = REPLACE(REPLACE(`name`, 'â€œ', '"'), 'â€', '"') WHERE `name` LIKE '%â€œ%';

There were other affected characters but they were mostly in foreign language strings and I couldn’t figure out what they were supposed to be. If I had to do it again, I’d probably try to read out the affected strings into PHP, figure out their original encodings, convert them into UTF-8, and resave them.
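
As a rough sketch of that approach, using the Encoding class I talk about in the Uploads section below (its fixUTF8() method is meant for exactly this kind of double-encoded garbage; the connection details and database name here are placeholders):

require_once 'Encoding.php';

$db = new mysqli('localhost', 'dbuser', 'dbpass', 'sourceaudio');
$db->set_charset('utf8mb4');

// find the suspicious rows the same way as above, fix them, and write them back
$rows = $db->query("SELECT `id`, `name` FROM `albums` WHERE `name` LIKE '%â€%'");
$update = $db->prepare("UPDATE `albums` SET `name` = ? WHERE `id` = ?");

while ($row = $rows->fetch_assoc()) {
	$id = (int) $row['id'];
	$fixed = Encoding::fixUTF8($row['name']);
	$update->bind_param('si', $fixed, $id);
	$update->execute();
}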

PHP

Before dealing with PHP code itself, it’s important to make sure your editor of choice is saving files in UTF-8 format. How you do that will depend on your editor. If you skip this step, any UTF-8 characters that appear in your code will get mangled.

To enable multibyte strings in PHP, you’ll need to update your php.ini file and find the [mbstring] section. We’re using the following settings:

mbstring.language = English
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = On
mbstring.detect_order = auto
mbstring.substitute_character = none

So all of our strings now support multibyte characters and use the UTF-8 charset. There’s one more setting to consider. We initially had it on because it makes the transition easier, but we ended up scrapping it and doing things the hard way instead, and I’ll explain why.

Once you switch to multibyte strings in PHP, you can’t use a lot of the normal string functions to which you’re accustomed. strlen, strpos, substr, strtolower, even functions like mail, all have to be swapped out for their multibyte equivalents. If you call normal strlen on a multibyte string, it won’t treat multibyte characters as single characters and that can yield inaccurate lengths. Fortunately, the new functions are all simply mb_{old function name} so mb_strlen, mb_strpos, mb_substr, mb_strtolower, and, because it’s not PHP if the naming convention isn’t inconsistent, mb_send_mail. You can find a full list of the affected functions in the PHP documentation.
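
To make that concrete, here's the kind of difference you'll see (assuming your source file is saved as UTF-8, per the note above):

$str = 'café';                     // 4 characters, but the é takes 2 bytes in UTF-8

echo strlen($str);                 // 5, because it counts bytes
echo mb_strlen($str, 'UTF-8');     // 4, because it counts characters

echo strtoupper($str);             // "CAFé" in a typical setup, the multibyte é isn't touched
echo mb_strtoupper($str, 'UTF-8'); // "CAFÉ"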

Note: str_replace isn’t one of the affected functions. It supports multibyte strings out of the box because it works byte by byte, and UTF-8 is designed so that one character’s byte sequence can never appear in the middle of another’s. The preg regex functions are also fine, as long as you add the u (PCRE_UTF8) modifier to your patterns when you need character-aware matching. Not every string function is covered, though, so you’ll have to handle them on a case-by-case basis.

mbstring.overload

Now if you start looking at that page of functions, you’ll see that it’s actually information about the PHP config option, mbstring.overload. If you add to your php.ini the line

mbstring.overload = 7

PHP will actually overload all the affected functions for you without having to do anything! Essentially, when you call strlen, PHP quietly calls mb_strlen instead on the backend without you having to actually change the function call. If you have a big project and you’re not excited at the prospect of a painful find & replace process, that sounds like a godsend. We tried it at first and while it does work exactly as advertised, that behavior creates some problems.

Firstly, there’s no way to get back to the old functions. If, for example, you’re using strlen to count the number of bytes in a string instead of the number of characters and you need it to keep doing that even if the string has multibyte characters, you’re SOL. With overloading on, strlen will always return the number of characters and not the number of bytes. With overloading off, strlen is the number of bytes and mb_strlen is the number of characters. You have the control. That problem is not insurmountable though and I whipped up a byte counting function, which you’re welcome to:

function countBytes($str) {
	// measuring the string against a single-byte encoding gives you the byte count
	return mb_strlen($str, 'latin1');
}
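
Usage is what you'd expect, and for what it's worth, mbstring also accepts '8bit' as an encoding name if 'latin1' feels too arbitrary:

echo countBytes('café');          // 5
echo mb_strlen('café', '8bit');   // 5, same idea
echo mb_strlen('café', 'UTF-8');  // 4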

You might be able to come up with your own implementations of the other overloaded functions too but that’s not the only issue.

The problem that was a deal breaker for us was external libraries. A good included library will probably have support for different charsets and therefore support multibyte strings, but it’s very unlikely that it’ll be prepared for its functions to be overloaded to behave slightly differently. We use an ID3 library, for example, and it was expecting strlen to return byte counts. When your libraries can’t count on their functions to perform their normal… functions, they won’t behave correctly. You could certainly dig through them all, find any instances where it’s a problem, and fix them by hand, but you’re setting yourself up for a maintenance nightmare in the long run. I ended up putting on some good music and whiling away a few hours with my trusty find-in-files tool to convert our codebase to use multibyte functions without overloading. I think it’s better to put in a little more time up front than load yourself up with work and weird bugs in the future.

Note: unlike most php.ini settings, you can’t set mbstring.overload at runtime with ini_set or in your .htaccess file. It has to be in your .ini file. I’m not sure why that is but since it’s kind of a weird hack sort of feature, I guess I’m not too surprised.

mb_substr

One other change you’ll need to make to your string handling: when dealing with multibyte strings, you can’t access a specific character with $string[$index]. That returns byte $index, which might not be the same thing as character $index. If you want the character, you’ll need mb_substr($string, $index, 1);
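
Here's the difference in practice (assuming mbstring.internal_encoding is UTF-8, as above):

$str = 'touché';

echo $str[5];               // a lone byte, the first half of the é, which is garbage on its own
echo mb_substr($str, 5, 1); // "é"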

However, mb_substr is actually significantly slower than accessing the string directly since, in order to find the character you’re asking for, it needs to scan the entire string up to that point. For example, if you ask for the character at index 8 and every character is 1 byte, you know you can just grab byte 8 and ignore the rest of the string. But if characters can take 1-4 bytes, you might need byte 8 or you might need bytes 32 through 35. The only way to know is to go through every character in the string up to and including the one you want and figure out how many bytes each of them takes up.

For short strings, that shouldn’t really matter because computers are fast but we had some strings that were hundreds of thousands of characters long and when we tried to iterate over them with mb_substr, it ran for… well, it was at least half an hour. I gave up after that so I can’t tell you how long it would eventually take. And that’s a block of code that used to complete in under a second. For shorter strings, you can use preg_split but for longer ones, Mészáros Lajos’ answer to that question is money. His solution involves iterating over the string like mb_substr but remembering your progress so you don’t have to redo all the work each time.

With normal iteration and mb_substr, the function has to recount every time. So iterating over an 8 character string needs to check character 1, then 1 and 2, then 1, 2, and 3, etc. Mészáros’ solution is to remember the byte position of the last check so you don’t have to start over each time. You just end up checking 1, then 2, then 3, etc. instead.

function nextChar($string, &$pointer) {
	if (!isset($string[$pointer])) return false;
	$char = ord($string[$pointer]);
	if ($char < 128) {
		// single-byte (ASCII) character
		return $string[$pointer++];
	} else {
		// multibyte character: the lead byte tells us how many bytes it spans
		if ($char < 224) {
			$bytes = 2;
		} elseif ($char < 240) {
			$bytes = 3;
		} elseif ($char < 248) {
			$bytes = 4;
		} elseif ($char < 252) {
			$bytes = 5;
		} else {
			$bytes = 6;
		}
		$str = substr($string, $pointer, $bytes);
		$pointer += $bytes;
		return $str;
	}
}

$i = 0;
while (($char = nextChar($string, $i)) !== false) {
	// do something
}

I, for one, thought that was a pretty clever solution, and it seems to smoke all the others in both speed and memory usage.

An interesting side effect: when I have a string that I'm going to break into smaller strings, my inclination is usually to perform any operations I can on the whole string before splitting it, because there's efficiency in avoiding the overhead of repeated function calls. But if one of those operations uses mb_substr to iterate through the string, there might actually be an advantage to splitting it first and then running the op on the substrings. It really just depends on the specifics but it's something to keep in mind.

Downloading

Another PHP change we had to make was in the filenames of downloadable files. Part of SourceAudio is letting end users download music files, and admins can change the suggested filenames for those files, so it's possible those filenames will have foreign language characters. So when you send the Content-Disposition header to the end user, how do you specify that the filename is UTF-8?

I found a great Stack Overflow question that addresses this very concern and here's our version of the solution proposed by Martin Ørding-Thomsen, which works well in my testing.

if (preg_match('/MSIE (.*?);/', $_SERVER['HTTP_USER_AGENT'], $matches) && $matches[1] < 9) {
	header("Content-disposition: attachment; filename=".str_replace('+', ' ', urlencode($filename))."\r\n");
} elseif (mb_strpos(mb_strtolower($_SERVER['HTTP_USER_AGENT']), 'android') !== false) {
	header("Content-disposition: attachment; filename=\"".cleanFilename($filename)."\"\r\n");
} else {
	header("Content-disposition: attachment; filename=\"".cleanFilename($filename)."\"; filename*=UTF-8''".str_replace('+', '%20', urlencode($filename))."\r\n");
}

The cleanFilename() function just removes most non-alphanumeric characters to guarantee compatibility.
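
I didn't include ours above, but as a rough illustration (not our exact implementation), a function like that can be as simple as:

// Illustrative cleanFilename()-style helper: transliterate what iconv can,
// strip everything else, and fall back to a generic name if nothing survives
function cleanFilename($filename) {
	$clean = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $filename);
	$clean = preg_replace('/[^A-Za-z0-9 ._-]/', '', (string) $clean);
	return $clean !== '' ? $clean : 'download';
}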

URLs

On a funny unicode-related note, you can see that Martin Ørding-Thomsen's name has a multibyte character in it and, if you look at the url of his Stack Overflow user page, http://stackoverflow.com/users/346150/martin-rding-thomsen, that character is stripped out. You can put unicode characters in urls but support seems spotty. I tested Chrome, Firefox, and Safari; all of them will render unicode characters in urls, but when you try to copy and paste them, Chrome and Firefox escape them and Safari doesn't.

So the url http://secularcoding.com/Øtest copied in Chrome and Firefox yields http://secularcoding.com/%C3%98test when pasted but Safari returns the original url, unchanged.

If possible, I'd probably recommend against having unicode urls but if you have to, test extensively!

Uploads

Handling uploaded files can also be problematic because it's difficult to force your users to use the encoding you want when they're sending you data to process. In SourceAudio's case, one example is CSV files with metadata for tracks. The uploading clients aren't familiar with character encodings and tend to just hit save and forget it. Depending on the program and operating system they used, that can result in a number of different encodings, though one we get a lot of is windows-1252, which I think is coming from Excel on Windows, though it seems to be LibreOffice's default format on Windows too.

If you're curious what format files your users are uploading, good luck getting PHP to tell you. There's an mb_detect_encoding function that you'd think would be perfect for this purpose but in all my experiments, it's wrong more than it's right. If you're on Windows, I found a program called File Encoding Checker that seems to work pretty well, though the fact that it only works on directories and not single files is a little obnoxious. I assume there are similar programs on other operating systems or, if you're feeling ambitious, you can scan the files yourself and look for the telltale bytes.

As far as actually fixing the problem goes, we found Sebastián Grignoli's Encoding class is a good way to take care of those alternate format files. Once we load the information from the file into a string, we pass it into Encoding::toUTF8() and it does a great job of converting whatever it finds in there. It also handles the situation where the data is already UTF-8 so there's no harm in running it (other than time) if you're not sure what you're getting.
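
In practice that looks something like this (a sketch; the 'metadata' form field name is made up and the require path is wherever you keep the class):

// Sketch: normalize an uploaded CSV to UTF-8 before parsing it
require_once 'Encoding.php';

$raw  = file_get_contents($_FILES['metadata']['tmp_name']);
$utf8 = Encoding::toUTF8($raw);   // safe even if $raw was already UTF-8

// naive line-by-line parse, good enough for simple CSVs
$rows = array_map('str_getcsv', explode("\n", $utf8));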

Headers

The last thing to think about is your content types but that's pretty easy. In your HTML, just add

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

And before you send any HTML or AJAX responses, send a

header("Content-Type: text/html; charset=UTF-8\r\n");

We don't use much XML but the proper way to specify it there is with a

<?xml version="1.0" encoding="UTF-8"?>

And that should be enough to let the browser know what to expect!

Solr

SourceAudio uses Solr for searching and I was worried that setting up UTF-8 in Solr would be challenging, but apparently it's the default character set. The only thing we needed to do was make sure that we send in the proper Content-Type, 'Content-type:text/xml; charset=utf-8', when adding documents. Once we made that change and repushed the data, we were immediately able to search for foreign language characters.
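
For reference, if you're pushing documents to Solr with raw curl calls from PHP, the change amounts to something like this (just a sketch; the core URL and document fields are made up):

// Sketch: POST an <add> document to Solr with an explicit UTF-8 content type
$xml = '<add><doc>'
     . '<field name="id">42</field>'
     . '<field name="name">Überraschung 测试</field>'
     . '</doc></add>';

$ch = curl_init('http://localhost:8983/solr/update');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type:text/xml; charset=utf-8'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);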

If you want to strip accents so searches for "cafe" will match "café", you could always use an ASCIIFoldingFilterFactory.
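
That's a schema.xml change; roughly, you add the filter to the analyzer chain for your text field type. A sketch (your field type name and the rest of the chain will differ):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- folds accented characters to their ASCII equivalents, so "café" matches "cafe" -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>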

You can read some more about Solr's UTF-8 support in their FAQ but I was pretty impressed - it really just works.

Javascript

If you're still using escape() instead of encodeURIComponent(), it's time to make the switch. escape() comes up with different (and non-standard) encodings for multibyte characters than encodeURIComponent() does, producing %uXXXX sequences instead of percent-encoded UTF-8, and really, it's bad all around.
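
A quick illustration of the difference:

// escape() uses its own non-standard %uXXXX / Latin-1 escapes;
// encodeURIComponent() produces proper percent-encoded UTF-8
escape('café');              // "caf%E9"
encodeURIComponent('café');  // "caf%C3%A9"

escape('测试');               // "%u6D4B%u8BD5"
encodeURIComponent('测试');   // "%E6%B5%8B%E8%AF%95"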

That's about it, as far as we found. Once you start sending the correct headers and set your HTML meta tags correctly, Javascript just starts treating strings as UTF-8 and it all just works.

Conclusions

The release went fairly smoothly, though we continue to have problems with non-UTF-8 characters that were already in the database being garbled. Reading everything out, converting it, and saving it back might be the solution but I haven't taken that step yet. New data going in appears to be fine. We also encountered a strange problem where some strings in the database would have a \0 inserted between every character, but since it's just a couple of fields that are affected, I'm guessing it's something specific we're doing in code rather than a systemic issue. To make it more interesting, the stray characters only show up in a few places on the site while everywhere else just ignores them. I'll update this post if I figure that one out.

Supporting non-English languages is an important step for a growing website. You might not be worried about non-Latin characters but we would previously encounter issues with even accented letters in something like café and switching to UTF-8 solves all of that. The real disappointment is that it's not the default character set and that this process is necessary. I'm sure there are great legacy reasons projects use the character sets they do (there always are) but I'm not convinced it's not worth it to suck it up and make the change. In the future, I'll probably make this one of the first steps in any new project and then never have to worry about it again.

Anything I missed? Let me know in the comments. I'd be happy to add to this post and give you credit.

 

Oversharing

Written by Ryan on May 11th, 2014

I’ve had several topics come up while working at SourceAudio that would be perfect for posting according to my goal of writing about topics that are tricky to Google. However, it’s hard to decide sometimes whether it’s better to share or if there’s a competitive advantage to keeping certain things hard to discover.

Simple topics, like running benchmarks of different ways to instantiate classes in Javascript, I don’t really worry about. Sure, I guess there’s an advantage to the company for that to be hard to find out but it’s not a big one and it’s not hard to come up with on your own.

But what about more complex topics, like how to generate iTunes-compatible metadata in AIFF files or how to properly estimate the size of zip files when creating them on the fly? I loved figuring that stuff out and it’d be fun to write about but those topics could be of interest to SourceAudio competitors so do I have a responsibility to keep them a secret? Not that those things are impossible to discover if someone was interested but the time involved in doing the research, in poking around at files, in poring over documentation, in running test after test until I got it just right – all that really adds up. There’s value in that knowledge.

As someone who’s benefited heavily from others being willing to share their valuable knowledge, do I have an obligation to share when I figure something out? Or is the greater obligation to the company?

It’s the classic “Information wants to be free” problem and I don’t have the answer. My heart wants to set it loose but my head is a bit more cautious. I’m usually a head guy, which I guess explains why I haven’t posted in a couple years, but I don’t want to be Smaug, sleeping forever on my piles of information and contributing nothing. How do you know when it’s time to open up the Lonely Mountain?

 

IE9 User Agent in HTTP Requests vs navigator.userAgent

Written by Ryan on June 6th, 2012

While trying to figure out why file uploads weren’t working in IE9 on SourceAudio, I discovered an interesting quirk: IE9’s user agent as reported by navigator.userAgent isn’t necessarily the same as the user agent that it sends in its http requests.

Apparently this is intended and understood behavior but it was the first I’d heard of it.

To summarize, MS found that as programs and add-ons added “feature tokens” to your user agent string, the string would become so long that some servers would throw a fit. To prevent the issue, IE9 stopped including these feature tokens in the user agent it sends to the server, so instead of sending

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET4.0C; .NET4.0E)

It just sends in

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

However when accessing the user agent through javascript, you get the whole thing.

So why does that matter?

SourceAudio uses your user agent as salt for encryption, so the server needs a consistent user agent. Ordinarily the browser always sends the same one, so it doesn’t matter. But when doing flash uploads, flash doesn’t use the same user agent (or cookies, annoyingly) as your browser normally does, so we have to send in the user agent manually using navigator.userAgent, and suddenly it’s different.

As a solution, rather than sending in the user agent directly, we can just use

navigator.userAgent.replace(/(Trident\/[0-9.]+);.*/, '$1)')

Which strips out all the feature tokens after the Trident version number, returning the user agent to the format that’s used for http requests.

Of course, that’s only a good idea for IE9 because previous versions sent in all the feature tokens, servers be damned, so you’ll want to either browser detect before applying the replace or try something like

navigator.userAgent.replace(/(Trident\/[5-9]\.[0-9.]+);.*/, '$1)')

IE9 is just Trident/5.0 but I threw in the 5-9 assuming future versions of IE will exhibit the same behavior.

I haven’t tried this in IE10. Can anyone report on its user agent behavior?

 

Javascript prototype functions and performance

Written by Ryan on September 2nd, 2011

Apparently I decided to spend my Friday night profiling javascript and I figured I might as well share a couple performance differences I discovered where I wasn’t expecting them.

First off, it’s apparently faster to instantiate objects by extending their prototype than it is to put their methods inside the function’s brackets. I got this idea from this stackoverflow question and am going to show a variation of the test proposed by Andrew:

var X,Y, x,y,z, i, intNow;

X = function() {};
X.prototype.message = function(s) { var mymessage = s + "";}
X.prototype.addition = function(i,j) { return (i *2 + j * 2) / 2; }

Y = function() {
    this.message = function(s) { var mymessage = s + "";}
    this.addition = function(i,j) { return (i *2 + j * 2) / 2; }
};

function Z() {
    this.message = function(s) { var mymessage = s + "";}
    this.addition = function(i,j) { return (i *2 + j * 2) / 2; }
};

intNow = (new Date()).getTime();
for (i = 0; i < 1000000; i++) {
    x = new X();
}
console.log((new Date()).getTime() - intNow); 
// Chrome=1089ms; IE9=1494ms; FF=1842ms; Safari=414ms

intNow = (new Date()).getTime();
for (i = 0; i < 1000000; i++) {
    y = new Y();
}
console.log((new Date()).getTime() - intNow); 
// Chrome=1270ms; IE9=1721ms; FF=2266ms; Safari=896ms

intNow = (new Date()).getTime();
for (i = 0; i < 1000000; i++) {
    z = new Z();
}
console.log((new Date()).getTime() - intNow); 
// Chrome=1292ms; IE9=1713ms; FF=2257ms; Safari=885ms

You can see from the times after the console logs that using .prototype is faster across the board. It's not even really close in some browsers. I've always liked putting object functions inside the objects' brackets because I think it improves readability but the results are pretty clear - you pay a solid performance penalty for that organization.

Also, unrelatedly, Safari smokes everybody at this test.

That made me curious about whether or not it would also be faster to call functions assigned to an object's prototype than it would be to call the same function declared globally. I modified the previous test and ran:

var i, intNow;

function test() {
	return this + '-test';
};

String.prototype.test = function() {
	return this + '-test';
};
var t = 'test';

intNow = (new Date()).getTime();
for (i = 0; i < 1000000; i++) {
    test(t);
}

console.log((new Date()).getTime() - intNow); 
// Chrome=2504ms; IE9=2160ms; FF=5623ms; Safari=944ms

intNow = (new Date()).getTime();
for (i = 0; i < 1000000; i++) {
    t.test();
}

console.log((new Date()).getTime() - intNow); 
// Chrome=2329ms; IE9=1286ms; FF=1955ms; Safari=486ms

It appears to be consistently faster to attach a function to an object and call it than it is to call the function globally. Does that mean you should always attach functions to objects? Perhaps. I think there's a fair scope related argument for doing that anyway but the performance benefit was a lot higher than I expected.

Now if you'll excuse me, I need to go change some code on SourceAudio. Apparently I've been doing this all wrong.

Note: Don't judge Chrome vs the other browsers in these tests. It's my workhorse browser and I've got seven tabs open while the others were all fresh boots.

 

Richard Dawkins: Faith School Menace

Written by Ryan on August 23rd, 2010

You should really watch Richard Dawkins: Faith School Menace:

I found the part with the teacher and students in the Muslim faith school to be particularly frustrating. How can you reasonably expect pupils to make an educated decision about which viewpoint is correct when their instructor clearly has her own, anti-science belief and can’t even answer common questions about evolutionary theory? All the while you’re teaching them in another class that all the information from the Koran is absolutely correct. When conflicts arise, are they going to follow the complex theory that was just taught half-assed by someone who doesn’t believe a word of it, or are they going to go with the view that was taught with much conviction and fervor in a different class (which, btw, you’ll go to hell if you don’t choose)?

Is that really an atmosphere of ideological equality? Nevermind, as he talks about later, that kids (and even adults) are more inclined to believe purpose driven explanations anyway because it’s the way we’re wired. If you want to teach evolution and you want children to really get it, you need someone teaching it who really understands it themselves and wants to convince their students. An Islamic shill is not going to do that. They can claim they’re giving children choice but without proper education on both viewpoints and without giving them proper tools with which to make choices (an education founded in critical thinking instead of indoctrination), it’s not really a choice at all.

And then the guy invokes the “just a theory” line and you know that was dropped in science class without any explanation of the difference in definitions between scientific and common usage of the term.

PS – Thanks for not letting US viewers watch it on your website, Channel 4. I was perfectly happy to support you and contribute to your advertising revenue by watching your version but I guess that’s not happening.

 

Google TV – Gaming Console?

Written by Ryan on May 20th, 2010

Google today announced Google TV, which, while doing a number of other interesting things, allows you to easily run Android apps on your TV. There are plenty of games in the Android Market. If you had a good controller, I’m not sure how that experience would be very different from playing something on a traditional gaming console.

Sure, there are other systems you’d probably want in place – friends, achievements, etc. – but you already have your contacts built into the phone and there are third party achievement systems even now (though something a little more ubiquitous would be nice). Graphics are going to take a hit but Google could easily remedy that by throwing in some specialized hardware and you could certainly play less graphically intensive games in the meantime.

There are obviously a number of hurdles still but they’re talking about a device with a content delivery network and a pre-built library of games that attaches to your TV. That goes a lot of the way. And if you build a game for that, it’s also going to show up on one of the biggest smartphone install bases – that’s not a bad deal.

 

The Problem with document.location.hash

Written by Ryan on May 10th, 2010

SourceAudio, like a lot of ajax heavy applications, uses the hash to store state information. For example, when you search, you might end up on a url like http://www.sourceaudio.com/#explorer?s=search+terms&pg=1

The “page” is the “explorer” and the parameters are after the question mark. There are a number of ways you can format your hash but using the standard url format has been a pretty good solution for us. At least until today.

In the code, when we need to get that hash information, we would use document.location.hash but I realized today that Firefox has a problem when you start having ampersands in values. Naturally, you’d encode them with escape() to end up with something like #explorer?s=bump%20%26%20grind&pg=1 (from “bump & grind”) but when you try to retrieve that with document.location.hash, Firefox automatically un-escapes the hex codes.

Try it out. Go to firebug and run

document.location.hash = 'a%20%26%20b';
console.log(document.location.hash);

You’d hope to see ‘a%20%26%20b’ but instead, you get ‘a & b’.
Just ran Chrome out of curiosity. It returns ‘a%20%26%20b’, as does IE8.

I guess there are applications where Firefox’s behavior would be useful but it’s a real problem when the ampersand is a special character in your syntax. There’s no way to tell the difference between an & and a %26. So if you’re trying to parse ‘s=bump%20%26%20grind&pg=1’ so that ‘s’ is ‘bump & grind’ and ‘pg’ is ‘1’, you instead get ‘s=bump & grind&pg=1’, which parses to ‘s’ = ‘bump ‘, ‘ grind’ = ‘’, and ‘pg’ = ‘1’. It’s no good.

So, what’s the answer? Apparently it’s not to use document.location.hash at all. Consistently across browsers, document.location.href contains the full path, without any un-escaping. So if you want the hash, all you have to do is

var hash = document.location.href.replace(/^[^#]+#/, '');

Just don’t trust document.location.hash
Or the Firefox developers, it seems.

 

Knight’s Tour

Written by Ryan on December 22nd, 2009

I wrote a little program to calculate the “winning” boards of the Knight’s Tour problem after my grandfather brought it up, and I noticed some interesting C# array performance differences while I was at it. I ended up using lots of .Clone() operations on the arrays and just some basic index accessing.

On the first try, I wrote it with a multidimensional int[,] array and it took 48.5s to run 10M iterations. After reading that you should really flatten multidimensional arrays, and changing the code to just use int[], it only took 10.2s to run otherwise identical code. That’s pretty huge.

Out of curiosity, I switched it again to use ArrayList objects and knocked it even further down to 6.4s.

So, multidimensional arrays, slow! ArrayLists, fast! Is anything else better?

 

Asynchronous PHP with wget

Written by Ryan on September 23rd, 2009

I have a site where there’s a cache of a bunch of tracks (music site) and there are instances when a user does something and I need to rebuild that cache. The rebuilding takes a good five or ten seconds and will only become slower as we get more tracks so having the user wait while I do that is no longer feasible. I have a script that does a periodic rebuild (every fifteen min) but I’ve found that users don’t understand that their changes were accepted when they don’t show up for several minutes. It has to rebuild on demand but asynchronously. That way you get their changes applied pretty quickly but don’t have to wait on it.

I tried googling for asynchronous php solutions but couldn’t get anything to actually work. Brent accepted this answer but I had no luck with that. It would make the call but it wouldn’t do it asynchronously. I always had to wait for the thing to finish.

Some people suggested using the command line to run php and make the call that way but I couldn’t do that because of some architectural crap. I came up with this beauty though:
exec('wget -O /dev/null -o /dev/null -qb --no-cache --post-data=foo=bar http://theurl.com/whatever.php');
and it works perfectly. You can call a script on any server (not just your own like with php) and it runs in the background and discards the output quietly. Just one line and forget it.

Naturally, this only works on Linux servers with wget installed. Sorry windows folks

 

jQuery document.body is null error

Written by Ryan on September 23rd, 2009

I had a little “fun” with jQuery where it was giving me a “document.body is null” error…or sometimes it wouldn’t say anything but would just quietly fail. After some investigating, I made an interesting discovery about jQuery’s data() function.

I was chaining calls to data, like
myObject.data('foo', foo).data('bar', bar);

which generally works fine but apparently if foo is undefined, an error is thrown and the whole thing dies. If bar is undefined, it’s no problem, but something about trying to call data() with an undefined value breaks jQuery’s ability to make any further chained calls to that element.

The easiest solution I found is just dropping in a little
if (!foo) foo = null;
myObject.data('foo', foo).data('bar', bar);

And then everything works!