These days it’s pretty standard to require support for multiple languages and special characters on your website. But it’s still terribly easy to trip up and make mistakes, usually indicated by weird characters popping up across your web content. Here are a few tips on how to sort out your character encoding.
When things go wrong
You might see characters like Â or ï¿½ popping up over your text. This indicates the web browser is having problems interpreting the characters. The usual cause is a mismatch of the character encoding between how it is stored (by the filesystem or database) and how it is displayed (by the browser).
One common issue is to store a UK price with the pound sign in UTF-8 and display it as Latin-1. If you do that, you’ll get the rather catchy: Â£5.99
You might also spot the pretty black diamond question mark character: �
This is the Unicode replacement character and is displayed if the browser can’t output the correct symbol. Again, this is usually a mismatch of encodings, for example storing the content in Latin-1 and displaying as UTF-8.
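Both mismatches are easy to reproduce in a few lines. The following is a minimal Python sketch (Python is used here purely for illustration, and the variable names are made up) that encodes a price one way and decodes it the other:

```python
# Illustrative only: reproduce both kinds of mismatch in memory.
price = "£5.99"

# Stored as UTF-8, displayed as Latin-1: the two-byte pound sign
# (0xC2 0xA3) is read as two separate Latin-1 characters.
stored_utf8 = price.encode("utf-8")
shown_as_latin1 = stored_utf8.decode("latin-1")
print(shown_as_latin1)  # Â£5.99

# Stored as Latin-1, displayed as UTF-8: the lone byte 0xA3 is not
# valid UTF-8, so a lenient decoder substitutes U+FFFD.
stored_latin1 = price.encode("latin-1")
shown_as_utf8 = stored_latin1.decode("utf-8", errors="replace")
print(shown_as_utf8)  # �5.99
```

A browser behaves much like the lenient decoder in the second case: rather than failing, it shows the replacement character and carries on.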
UTF-what?
UTF-8 is a popular character encoding for Unicode, the current standard for storing text on computer systems. Unicode has over 107,000 characters and is intended to support all languages. Compare this with the catchy ISO-8859-1, also known as Latin-1, which contains only 191 characters and doesn’t fully support French (it has no œ, for example).
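To see the gap for yourself, here is a small Python sketch. The helper name fits_latin1 is hypothetical; it simply asks Python's built-in latin-1 codec whether a string can be encoded:

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("café"))     # True
print(fits_latin1("cœur"))     # False: œ is missing from Latin-1
print("cœur".encode("utf-8"))  # b'c\xc5\x93ur'
```

UTF-8 handles the same word without complaint by spending two bytes on the œ.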
How to make your website Unicode friendly
Basically you need to store, transfer and display content as UTF-8 to ensure reliable Unicode character support. Some tips on a few common areas where character encoding affects the web appear below.
Files
The most obvious place to start is your files, i.e. HTML, CSS, JavaScript and any scripting languages such as PHP. If you have any special characters in a file (e.g. content in a web page or in a template) then you need to ensure you save those files as UTF-8.
The simplest thing is to just ensure all your files are saved as UTF-8. This is usually possible in a modern text editor. See below for tips on some common editors:
- In TextMate just go to Preferences > Advanced to set the default character encoding
- In Coda it’s Preferences > Editor
- In Eclipse, or Zend Studio, it’s in Preferences > General > Workspace
You can usually tell you have a decent text editor or IDE if you get an option to change the character encoding when you select Save As.
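If you inherit a project and aren't sure how its files were saved, a quick scan can help. This is a rough Python sketch rather than a definitive tool: the function name find_non_utf8 and the default list of extensions are assumptions for illustration.

```python
from pathlib import Path


def find_non_utf8(root, extensions=(".html", ".css", ".js", ".php")):
    """Return sorted paths under root whose bytes are not valid UTF-8."""
    bad = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)


if __name__ == "__main__":
    # Scan the current directory and list any offending files.
    for path in find_non_utf8("."):
        print(path)
```

Bear in mind that a file containing only plain ASCII is also valid UTF-8, so this only flags files that actually contain bytes UTF-8 can't explain.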
MySQL
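Ensure all your database tables use the utf8 character set, with a collation such as utf8_general_ci. This can easily be achieved by choosing it when creating tables. (Note that MySQL’s utf8 stores at most three bytes per character; MySQL 5.5.3 and later also offer utf8mb4, which covers the full Unicode range.)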
Whenever you connect to MySQL make sure you run the following SQL. This ensures MySQL selects and inserts data as UTF-8.
SET NAMES utf8;
Bear in mind that before MySQL 4.1 UTF-8 wasn’t fully supported, and the default character set was latin1 (with the collation latin1_swedish_ci). If you want to convert your data to UTF-8, take a look at this O’Reilly article on converting data from Latin-1 to UTF-8 or a Drupal guide on migrating from MySQL 4.0 to 4.1.
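The core of any such conversion is decoding the bytes with their real encoding and re-encoding them as UTF-8. The Python sketch below shows that idea only; it is not the method from either guide, and the function name to_utf8 is made up. It leaves input that already decodes as UTF-8 untouched, to avoid double encoding.

```python
def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Re-encode raw bytes as UTF-8, leaving valid UTF-8 input untouched."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (or plain ASCII)
    except UnicodeDecodeError:
        return raw.decode(source_encoding).encode("utf-8")


print(to_utf8(b"\xa35.99"))          # b'\xc2\xa35.99'
print(to_utf8(b"\x80100", "cp1252"))  # b'\xe2\x82\xac100'
```

The "already UTF-8" check is a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 too, so test on a copy of your data before trusting it with the real thing.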
Apache
You can check the default encoding of your webserver by publishing a blank HTML page and using the Firefox Web Developer toolbar plugin to view the page response headers (Information > View Response Headers). You should see a line like:
Content-Type: text/html; charset=ISO-8859-1
A lot of webservers serve Latin-1 by default (just like the above example). If you don’t see a charset entry in there, most web browsers also default to Latin-1. In either case you should ideally change the character encoding your HTML files are served with.
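If you'd rather check header values in a script than by eye, the charset can be pulled out of a Content-Type value with Python's standard library. This is a small sketch; the function name charset_of is hypothetical, and the Latin-1 fallback simply mirrors the browser behaviour described above.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the lower-cased charset from a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


print(charset_of("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_of("text/html; charset=UTF-8"))       # utf-8
print(charset_of("text/html"))                      # iso-8859-1 (fallback)
```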
If you can change your Apache settings then open up your httpd.conf file, look for a line starting AddDefaultCharset and ensure it reads as follows:
AddDefaultCharset UTF-8
Make sure you remove any comment character (#) from the start of the line to enable this declaration. You’ll need to configtest Apache and gracefully restart it to make the changes appear. If you don’t know how to do that you really shouldn’t be editing the server’s httpd.conf file ;-)
If you’re on a shared server, or you don’t want to set a default for all websites on Apache, you can set UTF-8 encoding for just one website via the VirtualHost declaration.
<Directory /path/to/document-root>
    AddCharset UTF-8 .html .php .js .css
</Directory>
Just add all the file extensions you want output as UTF-8.
This can also be done via an .htaccess file in the document root of your website (while flexible, it will be slightly slower):
AddCharset UTF-8 .html .php .js .css
If all else fails and you’re using PHP to output your web pages, you can force it in PHP with:
header('Content-Type: text/html; charset=utf-8');
HTML
You should always state the character encoding in your HTML document.
In HTML 4 use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
For XHTML (served as HTML) add the closing slash:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
And in HTML5, the far more elegant:
<meta charset="utf-8" />
Keep this tag at the top of your <head> element (i.e. before the title tag or any other element). According to the HTML5 spec at the time of writing it’s supposed to be within the first 512 bytes of your document (later versions of the spec relaxed this to 1024 bytes), which for most of us means you have about 250 spare characters if you are serving an XHTML page with a valid doctype. If you’re using the more spartan HTML5 you’ll have more like 400 characters to play with.
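A quick way to check whether a page's declaration lands early enough is to search only the first chunk of bytes for a complete meta tag. The sketch below is a simplified Python check, not a full HTML parser: the regular expression and the name declared_charset are assumptions for illustration. The limit defaults to the 512 bytes quoted above and can be raised to 1024 to match the current HTML standard.

```python
import re

# Matches a complete <meta ...> tag that mentions charset=..., in either
# the short HTML5 form or the older http-equiv form.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)[^>]*>",
    re.IGNORECASE,
)


def declared_charset(html: bytes, limit: int = 512):
    """Return the charset from a meta tag that ends within limit bytes, else None."""
    match = META_CHARSET.search(html[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<!DOCTYPE html><html><head><meta charset="utf-8" /><title>Hi</title>'
print(declared_charset(page))  # utf-8
```

Because the search only sees the first limit bytes, a tag that starts in time but is cut off before its closing > counts as missing, which errs on the cautious side.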
Finally, when sending out emails make sure you specify the encoding in the email header:
Content-Type: text/plain; charset="UTF-8"
This can be done in PHP as so:
// $to, $subject and $messageBody omitted for brevity
$headers = 'From: name@domain.com' . "\r\n" .
    'Reply-To: name@domain.com' . "\r\n" .
    'Content-Type: text/plain; charset="UTF-8"';
mail($to, $subject, $messageBody, $headers);
Though I’d recommend you use an email library such as Zend Framework’s excellent Zend_Mail to avoid things like nasty email header injection!
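For comparison, here is how the same idea looks when a mail library builds the headers for you. This is a Python sketch using the standard library's email package rather than Zend_Mail, and the function name build_message and the addresses are placeholders. The library writes the Content-Type header itself and refuses header values that contain line breaks, which is the usual route for header injection.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email whose body is declared and encoded as UTF-8."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body, charset="utf-8")
    return msg


msg = build_message("name@domain.com", "you@domain.com", "Your order", "Total: £5.99")
print(msg["Content-Type"])  # text/plain; charset="utf-8"
```

Actually sending the message (for example with smtplib) is left out, since that depends on your mail server setup.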
And if all else fails
If you’re managing an old site with data stuck with Latin-1 content all over the place, perhaps it’s easier to just serve that site as Latin-1. So if you have a webserver-wide default encoding of UTF-8 consider switching that for Latin-1 just for that website. You can always upgrade next time the client wants a site redesign!
If you have problems with that and use PHP, I recommend getting familiar with mbstring (if you aren’t already): http://php.net/manual/en/book.mbstring.php
If you need to convert a file (like a database dump) that contains a mixture of ASCII, Latin-1, CP1252 and UTF-8 then you can simply pipe it through fix_latin. This is a script I developed for moving a Postgres DB to Unicode but have since found useful for other purposes: http://search.cpan.org/dist/Encoding-FixLatin/
Another tool you may find useful is my Unicode Character Finder: http://liip.to/unicode
Hannes – good tip on mb_ functions for dealing with Unicode strings within your code. Hopefully the next version of PHP, be it 5.4 or whatever they decide on, will have pretty solid Unicode support for normal string functions too.
Grant – those resources look really useful, thanks.