Tips, Tricks, News and Reviews for Web Coders
Yesterday marked the final day of web-related activities at the South By Southwest (SXSW) 2006 festival in Austin, Texas. While several friends of SitePoint were there, Team SitePoint itself watched jealously from SPHQ in Melbourne.
Among the "friends of SitePoint" who were on the ground in Austin I count Web Standards Project (WaSP) Steering Committee Member Ian Lloyd, who has contributed several articles to the site, has written our upcoming book, Build Your First Website The Right Way Using HTML & CSS, and has acted as expert reviewer for another upcoming book, Building Best Practice Websites: 101 Essential Checklists. Ian spent more time than could reasonably be expected blogging the goings-on at SXSW 2006 on our behalf (day 1, day 2, day 3, day 4).
Last, but certainly not least, was WaSP Group Lead Molly Holzschlag, who was an expert reviewer on Build Your Own Standards Compliant Website Using Dreamweaver 8.
"If I have seen further it is by standing on ye shoulders of Giants," I always say. So did Isaac Newton, by the way.
Do You Know Your Character Encodings?
Last month, I attended a meeting of the Melbourne chapter of the Web Standards Group, where Richard Ishida, the Internationalization Activity Lead of the W3C, gave a remarkably clear presentation on one of the most ignored issues in web development: character encodings.
Have you ever noticed certain characters on your site not displaying the way they should? Perhaps the curly quotation marks look like little boxes, or the long dashes have been replaced with question marks. Problems like these usually arise from an incomplete understanding of character encodings on the part of the developer responsible for the site.
I'd go so far as to guess that, in English-speaking circles at least, there are more web developers who have never learned about character encodings, and just deal with the consequences when issues like the above crop up, than developers who have.
As a site grows to the point where it must address an international audience (or even just an audience that likes curly quotes), however, it's more and more difficult to ignore these issues. Even worse, in these heady times of daily hack attempts, incorrect handling of character encodings can result in severe security vulnerabilities (as Google recently discovered).
So what is a character encoding, exactly? Well, let's start with something it's not: a character encoding is not a character set.
A character set, or more specifically a coded character set, is a set of character symbols, each of which has a unique numerical ID called the character's code point.
Some examples of character sets include the 128-character ASCII character set, which is mostly made up of the letters, numbers, and punctuation used in the English language, and the 256-character ISO-8859-1, or Latin 1 character set, which includes all the ASCII characters plus accented and other additional characters used in related languages like French. The most expansive character set in common use is the Universal Character Set (UCS), as defined in the Unicode standard, which contains over 1.1 million code points.
The first thing to understand is that every HTML document uses Unicode's UCS. Some older browsers, or less powerful devices, may not support (and thus will not display) the complete character set, but the fact remains that any HTML document may contain any character in the UCS.
What does vary from document to document is the character encoding, which defines how each of the characters in the UCS is to be represented as one or more bytes in the text data of the page.
This figure shows the ASCII, ISO-8859-1, and Unicode code points for three characters (the letter 'A', the acute-accented letter 'e', and the Hebrew letter 'alef'), and how those characters map to a series of bytes in five common character encodings:
Looking first at the character sets, note how the letter 'A' is available as a character in all three character sets, but the acute 'e' isn't available in ASCII, and 'alef' is only available in Unicode. Characters keep the same code points across these character sets because ISO-8859-1 was designed as an extension of ASCII, and Unicode in turn was designed as an extension of ISO-8859-1. There are certainly other character sets where the code points of these characters, where they exist, would differ.
As I mentioned above, however, web pages always use the Unicode character set, so these code points are the only ones that matter for the purposes of web development.
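If you'd like to check these code points for yourself, here is a minimal Python sketch. The helper names (`in_charset`, `describe`) are my own inventions for illustration, and the sketch treats "can be encoded by the codec" as a stand-in for "is a member of the character set".

```python
# Characters from the discussion above: 'A', acute 'e', and Hebrew 'alef'.
CHARS = ["A", "\u00e9", "\u05d0"]

def in_charset(ch: str, codec: str) -> bool:
    """Return True if the character can be represented by `codec`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def describe(ch: str) -> dict:
    """Collect the Unicode code point and character set membership."""
    return {
        "code_point_dec": ord(ch),
        "code_point_hex": format(ord(ch), "04X"),
        "in_ascii": in_charset(ch, "ascii"),
        "in_iso_8859_1": in_charset(ch, "iso-8859-1"),
    }

for ch in CHARS:
    print(repr(ch), describe(ch))
```

Note that the code point reported for 'A' and the acute 'e' is the same number those characters have in ASCII and ISO-8859-1, which is the extension relationship described above.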
Now take a look at the character encodings in the figure. The first, 7-bit ASCII, dates back to the 1960s, and is commonly used today as a "lowest common denominator" in email systems. If an email message contains only characters from the ASCII character set, and those characters are encoded as per their ASCII code points (e.g. the letter A is code point 65, which in hexadecimal (base-16) is 41, so the byte value used to represent it should be 41), then it should be compatible with any Internet email system, no matter how obsolete. Because ASCII contains only 128 code points, only seven of the eight bits in a byte are needed to represent any ASCII character. The byte values in a 7-bit ASCII document will therefore never exceed 7F (that's 127 in base-10).
ISO-8859-1 is the default encoding assumed by many browsers and related English-language software. It uses all eight bits of each byte to represent all 256 code points in the ISO-8859-1 character set. Though this provides the characters required for the vast majority of English language documents, as well as documents in many related languages like French, there are plenty of languages that are based on characters not included in this set. Even certain specialized characters in English documents, curly quotes and long dashes for instance, are not a part of ISO-8859-1. This explains why such characters are most often responsible for revealing a character encoding problem.
To serve the needs of other languages, there is an abundance of character encodings like ISO-8859-1 that make use of the 256 possible byte values to represent a set of 256 characters. Additionally, there are a number of encodings that use two bytes per character to allow for 65,536 different characters. Commonly used for Chinese and other languages requiring a large number of characters, these are called double-byte character sets (DBCS), even though they are in fact encodings.
But for documents that may contain characters from any language, the best encodings are those that can address Unicode's entire UCS. The simplest of these is UTF-32, which simply uses four bytes to represent each UCS character by its code point. 'A', which is code point 65 (41 hex), is represented by the four byte values 00 00 00 41, the acute 'e' (code point E9 hex) is 00 00 00 E9, and 'alef' (05D0 hex) is 00 00 05 D0.
The problem with UTF-32 is that, because the vast majority of characters in documents occur early in the UCS, almost every character in a given document will begin with two 00 bytes, which is quite a waste. Effectively, most UTF-32 documents will be four times the size of the same document encoded in a single-byte encoding like ISO-8859-1.
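To see that four-to-one ratio in practice, here is a small Python sketch. The sample sentence is my own, and I'm assuming the big-endian flavour of UTF-32 without a byte order mark so that the bytes line up with the values quoted above.

```python
text = "Character encodings matter."

single_byte = text.encode("iso-8859-1")
four_byte = text.encode("utf-32-be")  # big-endian, no byte order mark

# Same text, two encodings: compare the sizes in bytes.
print(len(single_byte), len(four_byte))

# The first two characters, showing the leading 00 bytes.
print(four_byte[:8].hex(" "))
```

Every character in this sample costs four bytes in UTF-32, three of which are zero.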
The UTF-8 and UTF-16 encodings address this by using a variable number of bytes per character. In UTF-8, the most common characters use only a single byte, which is equal to that character's UCS code point, while less common characters use two, even rarer characters use three, and only the very rarest of characters use four bytes. UTF-16 accommodates a larger set of "common" characters whose two-byte encodings match their UCS code points, reserving four-byte encodings for rarer characters.
Looking at the figure, you can see that 'A' has encodings that match its UCS code point in both UTF-8 (41) and UTF-16 (00 41). The acute 'e' and 'alef', on the other hand, fall outside the single-byte range of UTF-8, so each has a special two-byte encoding that differs from its UCS code point (C3 A9 and D7 90 respectively). In UTF-16, however, both are considered common enough to get encodings that match their two-byte code points (00 E9 and 05 D0).
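The byte values discussed in this section can be reproduced with a few lines of Python. Treat this as a sketch: the `byte_values` helper is a name I've made up, and I've picked the big-endian, no-byte-order-mark variants of UTF-16 and UTF-32 so that the output matches the byte order used in the text.

```python
ENCODINGS = ["ascii", "iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"]
CHARS = {"A": "A", "e-acute": "\u00e9", "alef": "\u05d0"}

def byte_values(ch, encoding):
    """Hex byte values for `ch` in `encoding`, or None if not representable."""
    try:
        return ch.encode(encoding).hex(" ").upper()
    except UnicodeEncodeError:
        return None

# One line per character and encoding.
for name, ch in CHARS.items():
    for enc in ENCODINGS:
        print(f"{name:8} {enc:11} {byte_values(ch, enc) or 'not available'}")
```

Characters that a given encoding cannot represent are reported as "not available" rather than raising an error.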
Make sense? If you've followed this far, you've grasped all the concepts you need to work intelligently with character encodings. Keep reading to find out how all this affects your work as a web developer.
Character Encodings and the Web
Okay, so a character encoding specifies how a set of characters (like Unicode's UCS, which is used on the web) can be written as bytes in a stored document. So what does this mean to web developers?
As a web developer, there are two types of text data that you need to deal with: the text that makes up the pages of your site, and the text that is sent by your users' browsers (usually as a form submission). In each case, you should be aware of the character encoding that is in use, and treat that data accordingly.
It turns out that the encodings of these two bodies of text data are linked: the default encoding that a browser will use when submitting a form is governed by the encoding of the document that contained the form. A page encoded in ISO-8859-1 will submit its form data in ISO-8859-1, while a page encoded in UTF-8 will submit in UTF-8.
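Here's a rough Python illustration of what that means on the wire. It only imitates the percent-encoding step of a form submission using the standard library's `urlencode`; the `submitted_query` helper and the sample field are assumptions of mine, not a model of any particular browser.

```python
from urllib.parse import urlencode

def submitted_query(fields, page_encoding):
    """Percent-encode form fields as a page in `page_encoding` would."""
    return urlencode(fields, encoding=page_encoding)

form = {"name": "Caf\u00e9"}

# The same field value produces different bytes depending on the page encoding.
print(submitted_query(form, "iso-8859-1"))
print(submitted_query(form, "utf-8"))
```

The server-side script has to decode the submission with the same encoding the page used, or the accented character will come out wrong.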
So the first thing you need to do is pick an appropriate encoding in whichever editor you use to create your web documents. Depending on your editor, this will involve setting a configuration option (e.g. in Dreamweaver), or simply choosing the right encoding when you first save the file (e.g. in Notepad).
You also need to tell browsers which encoding your documents are using. Browsers cannot guess the character encoding--every document just looks like a series of byte values until an encoding is provided to interpret them. So next you must declare the character encoding of each of your documents. To indicate the encoding of an HTML document, include a <meta> tag in its header. For an ISO-8859-1 document, the tag looks like this:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
And for a UTF-8 document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Yes, that's right: you specify the character encoding with a tag inside the document itself.
You might wonder how a browser can even read this tag if it doesn't yet know the character encoding, but it turns out that most encodings in popular use have enough characters in common that the simple HTML code leading up to this tag can usually be interpreted by guessing at a simple encoding (say ISO-8859-1), and then starting over if the tag indicates the browser has guessed wrong.
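The guess-then-restart idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any real browser is implemented: the regular expression, the 1024-byte window, and the `decode_html` name are all assumptions of mine, and an unrecognised encoding name would simply raise an error.

```python
import re

# Loose pattern for a charset declaration inside a <meta> tag.
META_CHARSET = re.compile(r'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw: bytes, first_guess: str = "iso-8859-1") -> str:
    """Decode HTML bytes, restarting with the declared encoding if found."""
    # First pass: read only the start of the document with the guessed encoding.
    head = raw[:1024].decode(first_guess, errors="replace")
    match = META_CHARSET.search(head)
    declared = match.group(1) if match else first_guess
    # Second pass: decode the whole document with the declared encoding.
    return raw.decode(declared, errors="replace")

page = ('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
        "<p>caf\u00e9</p>").encode("utf-8")
print(decode_html(page))
```

The first pass works because the tag itself consists only of characters that have the same byte values in ISO-8859-1 and UTF-8.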
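For files other than HTML documents, such as style sheets, the encoding can be declared in the HTTP Content-Type header that the server sends with the file:
Content-Type: text/css; charset=UTF-8
You can use the HTTP header approach for HTML documents as well, but you should still include the <meta> tag in the document.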
Once you've specified an encoding, you can verify that browsers are picking up on it. Open the page in Firefox, right-click the background and choose Page Info. The window that appears will show the character encoding that was used to interpret the document.
So all this raises the question: which character encoding should you be using? Well, in most cases, the answer is UTF-8. It gives you access to a multitude of characters in your documents without significantly increasing the file size, and it's reasonably backwards-compatible with older browsers and simple devices that do not support Unicode. If, however, you need to use significant quantities of CJK (Chinese, Japanese or Korean) text, which requires a much larger range of characters, then you might find UTF-16 is a more efficient choice.
That is, unless you're using PHP. One of the biggest weaknesses of PHP (up to and including PHP 5.1) is that its built-in string functions handle multi-byte character encodings like UTF-8 and UTF-16 incorrectly. PHP was written with the assumption that one byte equals one character, which simply isn't the case in such encodings. An optional module or library can be used to provide alternative string functions that do support multi-byte characters, but many of the PHP scripts in circulation use the built-in functions, and simply can't handle Unicode characters as a result.
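To show why the "one byte equals one character" assumption breaks down, here is a short sketch. It's written in Python rather than PHP, and it only illustrates the general problem of treating UTF-8 text as a plain sequence of bytes; it doesn't reproduce the behaviour of any specific PHP function.

```python
word = "na\u00efve"          # five characters, one of them outside ASCII
utf8_bytes = word.encode("utf-8")

print(len(word))        # number of characters
print(len(utf8_bytes))  # number of bytes

# Cutting after three *bytes* splits the two-byte sequence for the i-diaeresis.
truncated = utf8_bytes[:3]
print(truncated.decode("utf-8", errors="replace"))
```

Any code that counts, truncates, or reverses a string one byte at a time will mangle multi-byte characters in the same way.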
This problem will be addressed in PHP 6, where Unicode support will be an integral part of the language, but in the meantime getting PHP to treat Unicode correctly is something of a black art. It's certainly possible to do--high quality PHP scripts like WordPress and phpBB handle Unicode quite well--but you really need to know your PHP to do it.
For this reason, PHP-based web sites are commonly written using the ISO-8859-1 encoding. SitePoint's article and forum pages, for example, are all written using ISO-8859-1.
As you can probably gather, using ISO-8859-1 has a few disadvantages. For one thing, you're limited to using that relatively small character set to write your documents. What happens when you need a curly quote, or some other character not found in the ISO-8859-1 set?
HTML's answer to this problem is the character entity. I'm sure you're familiar with these: codes that begin with an ampersand and end with a semicolon, which come in both named and numeric forms.
Take a moment to absorb the fact that numeric character entities refer to UCS code points for characters, not the byte values for characters in any particular encoding. The numeric character entity for 'alef' is the same no matter what encoding you are using in your document.
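A quick Python sketch makes the point concrete. The `to_numeric_entity` helper is a hypothetical name of mine; the `xmlcharrefreplace` error handler and `html.unescape` are part of Python's standard library.

```python
import html

def to_numeric_entity(ch: str) -> str:
    """Build the decimal numeric character entity for a single character."""
    return f"&#{ord(ch)};"

alef = "\u05d0"
entity = to_numeric_entity(alef)
print(entity)

# The same entity appears whichever limited encoding the document uses.
for encoding in ("ascii", "iso-8859-1"):
    print(encoding, alef.encode(encoding, errors="xmlcharrefreplace"))

# Decoding the entity gives back the original character.
print(html.unescape(entity) == alef)
```

The number inside the entity is just the character's UCS code point written in decimal, which is why it doesn't change from one document encoding to the next.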
So character entities let you deal with characters outside your selected encoding when writing documents, but what about the other side of the coin? How do you deal with characters outside a limited encoding like ISO-8859-1 when it comes to form submissions?
Sadly, this is one place where browsers have disagreed for a long time, and even today, after much pulling of hair and gnashing of teeth, the solutions that most browsers now support are less than ideal.
One of the biggest problems is Windows, which on English language systems makes use of a slightly modified version of ISO-8859-1 called Windows-1252. Sam Ruby has documented the differences in his survival guide. Windows-1252 represents certain useful characters like curly quotes as single bytes, taking the places of less commonly used ISO-8859-1 characters. As a result, Internet Explorer browsers will often consider such characters as being within the document encoding, and will submit them as such. On the server, these single-byte encodings get interpreted as their ISO-8859-1 equivalents, which is what often leads to ugly boxes and other nonsense characters showing up on web pages in the place of curly quotes and the like, particularly when text entered on a Windows system is displayed on a non-Windows browser like Safari.
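The mismatch is easy to reproduce in Python, which ships codecs for both encodings. This is only a sketch of the byte-level confusion, with variable names of my own choosing; it doesn't simulate a real browser or server.

```python
curly_open = "\u201c"  # left curly double quote

sent = curly_open.encode("windows-1252")   # what a Windows-1252 browser submits
misread = sent.decode("iso-8859-1")        # how an ISO-8859-1 server reads it
recovered = sent.decode("windows-1252")    # reading it with the right encoding

print(sent.hex().upper())
print(f"U+{ord(misread):04X}")
print(recovered == curly_open)
```

Read as ISO-8859-1, the byte lands on an invisible control character rather than a quotation mark, which is where the ugly boxes come from.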
That exception aside, most current browsers, when faced with a character that is not in the encoding in which a form is to be submitted, will convert that character to a numeric character entity and submit that instead. This may sound sensible at first, but consider that HTML forms are supposed to submit plain text, not HTML code. Special characters like < and > are not automatically encoded as &lt; and &gt;, so the server has no way to tell an entity generated by the browser from the same text typed in by the user.
Some browsers have approached this problem differently, replacing certain out-of-encoding characters with in-encoding equivalents (e.g. curly quotes with straight quotes), and replacing other problem characters with a generic substitute (e.g. '?'). While this solution is technically superior, you do miss out on the few cases where the more common approach described above manages to preserve the desired characters without any side-effects.
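The two approaches can be compared using Python's built-in encoding error handlers. Bear in mind this is an approximation under my own assumptions: `xmlcharrefreplace` stands in for the entity-substituting browsers, while `replace` only covers the generic '?' substitution and doesn't attempt the "curly quotes to straight quotes" mapping.

```python
# Curly quotes around "Hi", followed by entity text the user typed literally.
typed_by_user = "\u201cHi\u201d and &#8220;"

# Strategy 1: out-of-encoding characters become numeric character entities.
as_entities = typed_by_user.encode("iso-8859-1", errors="xmlcharrefreplace")

# Strategy 2: out-of-encoding characters become a generic '?' substitute.
as_substitutes = typed_by_user.encode("iso-8859-1", errors="replace")

print(as_entities)
print(as_substitutes)
```

In the first result, the entity the user typed by hand is indistinguishable from the ones generated for the curly quotes; in the second, the curly quotes are lost but the user's literal text survives unambiguously.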
A full discussion of how different browsers tackle the problem of character encoding in form submissions would take too long to go into here, but there are good writeups available for those who go looking. In short, however, your best bet for conquering these problems is to move your site to UTF-8 (or UTF-16 if appropriate) as soon as you can.
Wow, what an issue! A short newsletter is a good newsletter, but I just couldn't compress the essentials of character encoding any more without it becoming incomprehensible (and trust me, there are plenty of incomprehensible treatments of the subject around).
As I mentioned at the top, much of the information in this issue is distilled from the second hour of a talk that Richard Ishida gave to the Melbourne Web Standards Group not long ago. If I've piqued your interest but you're still a bit foggy on the details, you can listen to the complete audio of that presentation, and read through his slides, enhanced with complete tutorial notes.
Once you start working with Unicode, you'll find a number of utilities on Ishida's site will come in very handy. There's a tool for browsing the complete UCS, and another for converting between Unicode characters, code points, encodings, and numeric character entities, both of which are definitely worth bookmarking.
That's it for this issue. Next time, I'll try to pick a lighter subject!
Help Your Friends Out
People you care about can benefit from the wealth of information on new and maturing technologies available on the Internet. Help them learn how to do it by forwarding them this issue of the SitePoint Tech Times!
SitePoint Pty. Ltd.
Thanks for reading!
© SitePoint 1998-2006. All Rights Reserved.