Tech Note: Character Encoding Bug Hunt

An open thread for a Friday morning. I’m chasing down a long-running, very annoying character translation bug that causes Western European characters (with accènts, ümlauts, etc.) to show up as garbage when the Ajax system transfers them back and forth from the server.

I’m pretty sure I’ve finally killed the bug, but we can test my solution to destruction in this thread.

UPDATE at 5/30/08 12:11:33 pm:

Our plan for world domination is coming together, and the character encoding part of it now works very well. After trying a million or more different approaches, and only getting halfway there, the real solution involved a PHP function containing only 5 lines of code:

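    function convertLatin1ToHtml($str) {
        // Every character PHP can turn into a named entity (quotes excluded).
        // Note: on PHP 5.4+ these tables default to UTF-8; pass 'ISO-8859-1'
        // as a third argument if the input really is Latin-1.
        $allEntities = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES);
        // Only the characters with special meaning in HTML: &, < and >.
        $specialEntities = get_html_translation_table(HTML_SPECIALCHARS, ENT_NOQUOTES);
        // Drop the HTML-specific entries so existing tags are left intact.
        $noTags = array_diff($allEntities, $specialEntities);
        // Replace each remaining extended character with its entity.
        $str = strtr($str, $noTags);
        return $str;
    }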

The source of the encoding problem is the way JavaScript mishandles raw (unencoded) European characters when it displays them.

There’s no problem when JavaScript reads the text field and sends the text to the server. Before sending the text you’ll typically pass it through a JavaScript function like escape (if your pages are served as ISO-8859-1) or encodeURIComponent (if you’re serving UTF-8); both of these functions correctly encode the extended characters so that PHP can translate them back.
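To make the difference between those two functions concrete, here is a quick sketch that should run in a browser console or in Node. The variable names are made up for illustration, and keep in mind that escape is a deprecated legacy function, even though browsers still ship it.

```javascript
// The same character, encoded two ways before being sent to the server.
const text = '\u00FC'; // ü

// escape(): code points up to 0xFF become a single %XX byte,
// which lines up with ISO-8859-1.
const latin1Encoded = escape(text);

// encodeURIComponent(): always percent-encodes the UTF-8 bytes.
const utf8Encoded = encodeURIComponent(text);

console.log(latin1Encoded); // %FC
console.log(utf8Encoded);   // %C3%BC
```

The server has to decode with the matching charset: %FC is only a valid ü if it is read as ISO-8859-1, and %C3%BC only if it is read as UTF-8.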

The problem occurs on the return trip: if the PHP script sends back any raw extended characters, JavaScript has a tantrum, dumps out a bunch of garbage, and embarrasses itself in front of the whole internet.

The solution: any text that may contain European characters and will be returned to a JavaScript routine (for example, via XMLHttpRequest) needs to be passed through the function above to properly encode the extended characters as HTML entities (for example, ü becomes &amp;uuml; in the raw markup, i.e. the literal characters & u u m l ;).
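Why would entities survive the trip when raw characters don't? One plausible explanation (an assumption for illustration, not something established above) is a charset mismatch: the server sends ISO-8859-1 bytes and the receiving side decodes them as UTF-8. The sketch below shows that an entity such as &uuml; is pure ASCII, so it reads the same under either decoding. The sample bytes and the decodeLatin1 helper are hypothetical.

```javascript
// "Mün" as raw ISO-8859-1 bytes: 0xFC is ü in Latin-1, but it is not
// a valid byte anywhere in UTF-8.
const rawLatin1 = Uint8Array.of(0x4D, 0xFC, 0x6E);

// The same text with the extended character written as an entity: ASCII only.
const entityBytes = new TextEncoder().encode('M&uuml;n');

// In ISO-8859-1 every byte value is the code point of the same number.
const decodeLatin1 = (bytes) => String.fromCharCode(...bytes);
const decodeUtf8 = (bytes) => new TextDecoder('utf-8').decode(bytes);

// Raw bytes: the result depends on which decoder is used.
const rawAsLatin1 = decodeLatin1(rawLatin1); // "Mün"
const rawAsUtf8 = decodeUtf8(rawLatin1);     // "M\uFFFDn" (replacement character)

// Entity bytes: both decoders agree.
const entityAsLatin1 = decodeLatin1(entityBytes);
const entityAsUtf8 = decodeUtf8(entityBytes);

console.log(rawAsUtf8 === rawAsLatin1);       // false
console.log(entityAsUtf8 === entityAsLatin1); // true
```

Whatever the real cause in a given setup, the upside of the entity approach is that it removes the dependency on both ends agreeing about the charset.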

This function exists because we can’t simply encode the whole text with a call to htmlentities. In comments and LGF articles, the text may contain HTML tags—and we don’t want those to be encoded or they’ll display as text in the browser instead of acting as HTML.

So the function above gets the two PHP translation tables (for htmlentities and htmlspecialchars), and calls array_diff to generate a translation table that omits the HTML-specific characters &, < and >. (Single and double quotes are already absent from both tables because of ENT_NOQUOTES.) Then it simply calls strtr (string translate) to replace those pesky foreign characters with their equivalent HTML entities, leaving the HTML tags themselves alone.
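The table-difference idea is easy to see in miniature. Here is an illustrative JavaScript port, not the PHP function itself: the two tables are tiny hand-written stand-ins for what get_html_translation_table returns, and convertToEntities is a hypothetical name.

```javascript
// Abbreviated stand-in for the HTML_ENTITIES table (the real one is much longer).
const allEntities = new Map([
  ['&', '&amp;'], ['<', '&lt;'], ['>', '&gt;'],
  ['\u00E8', '&egrave;'], ['\u00E9', '&eacute;'], ['\u00FC', '&uuml;'],
]);

// Stand-in for the HTML_SPECIALCHARS table with ENT_NOQUOTES.
const specialEntities = new Map([
  ['&', '&amp;'], ['<', '&lt;'], ['>', '&gt;'],
]);

// Like array_diff: keep entries whose value does not appear in the second table.
const specialValues = new Set(specialEntities.values());
const noTags = new Map(
  [...allEntities].filter(([, entity]) => !specialValues.has(entity))
);

// Like strtr: swap each character that has an entry, leave the rest alone.
function convertToEntities(str) {
  return Array.from(str, (ch) => (noTags.has(ch) ? noTags.get(ch) : ch)).join('');
}

const converted = convertToEntities('<b>acc\u00E8nts & \u00FCmlauts</b>');
console.log(converted); // <b>acc&egrave;nts & &uuml;mlauts</b>
```

Notice that the tags and the bare ampersand come through untouched, which is the whole point of subtracting the special-character table.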

And now we have a nice, safe, JavaScript-friendly string that can be passed back to any browser and displayed correctly, without fear of embarrassment.

(Note: as usual, there’s a caveat with Internet Explorer—some HTML entities are not supported in IE by default, and may display as little boxes.)
