Character sets and encoding
- Damodar Chetty
- October 07, 2007
The topic of character sets is one that causes a lot of heartburn for Web developers trying to internationalize their sites. I've spent countless hours trying to get these concepts straight, and in the next few paragraphs, I'm going to attempt to unravel some of the arcana that this involves.
Languages have often been written to persistent storage (clay tablets, stone, parchment, paper, etc.) using their associated script. A language's script comprises a set of symbols that represent its consonants and vowels (e.g., the symbol 'a', for the English language, or the letter 'व' in Devanagari.)
The complete set of all these symbols (or characters) for a given script is completely independent of any computer usage - it is what a child might learn in kindergarten.
In order for a symbol within a script to be represented on a computer, we need to perform the following tasks:
Computers work with numbers, so the only way they can work with natural languages is if we have a mechanism of encoding each character symbol into a byte-representation. This byte-representation (aka an encoding) would have to be a standard so that my computer interprets the characters in the same manner that yours does.
So far, we took written characters, mapped them to numbers (their code points), and determined an encoding. We also tied each code point to various glyphs, one for each point size, type face, and style combination that is supported.
Then, so long as we have a keyboard (or other input device) that can convert the symbols we want (as marked on the individual keys) into code points, and as long as our editor converts those code points into the appropriate byte representation (based on the current encoding), and as long as the graphics card is able to determine the glyph associated with that symbol - we are done.
In other words, the process that is followed during the encoding process is as follows:
The reverse occurs during the decoding process (as a file is being read, as text is being rendered to a screen, etc.).
At this point, I'm going to take a step back and hammer through a few more details in a couple of critical areas, viz., character sets and encodings.
As we saw earlier, a character set defines a mapping between virtual symbols ('0', 'a', 'Y', '@', etc.) and code points (numbers).
The primordial character set (at least in most everyone's living memory) is ASCII - the American Standard Code for Information Interchange. This character set was designed primarily to encode US English. To that end, the numbers '0' through '9' are represented by the logical code points 48 through 57; the letters 'A'-'Z' by 65-90, and 'a'-'z' by 97-122. Interspersed in there were other printable characters (:, -, +, etc.) as well as a number of control characters (mostly obsolete now) that had special hardware meaning to the computer (e.g., tab, bell, end of transmission, etc.)
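To make the ASCII numbers above concrete, here is a minimal Java sketch that simply casts characters to integers and back. The class and method names are made up for illustration.

```java
public class AsciiCodePoints {
    // Returns the numeric code point of a single (BMP) character.
    static int codePointOf(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        System.out.println(codePointOf('0')); // 48
        System.out.println(codePointOf('9')); // 57
        System.out.println(codePointOf('A')); // 65
        System.out.println(codePointOf('a')); // 97
        // Going the other way: a code point back to its character.
        System.out.println((char) 122);       // z
    }
}
```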
The major drawback of ASCII was that while it was sufficient for writing American English, its support for other languages and locales was nonexistent. For example, it did not even have a pound character for UK English. Of course, this was hardly a problem where documents were primarily generated and consumed within a single locale.
However, all the standard characters used in US English were easily accommodated in 7 bits, leaving the code points above 127 up for grabs. Unfortunately, there were only 128 more code points available - surely not enough room for all the languages in the world. One compromise was to use this area, called the code page, to hold code points for one additional language - that of the country in which the computer was likely to be used. This worked well when exchanging documents between computers that shared the same code page. However, any document sent to a computer that used a different code page would be rendered incorrectly. After all, the correct glyph can be chosen only when both computers can agree on which symbol is represented by a particular code point. It is interesting to note that computers across the world agreed on the code points below 128 - and so any 2 computers could safely exchange documents that were restricted to the ASCII set (US English). It was only when characters beyond 127 were used, that the confusion arose. In addition, it was impossible to have a document that incorporated multiple languages that conflicted in the use of these code points. E.g., Latin-1 combines most Western languages and Icelandic, but Latin-5 replaces Icelandic with Turkish. So, a document that needed to use both Icelandic and Turkish would have to be written in Unicode.
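The code page conflict is easy to reproduce. The sketch below decodes the same single byte under Latin-1 (ISO-8859-1) and Latin-5 (ISO-8859-9). It assumes your JRE ships the ISO-8859-9 charset (most do, but Java only guarantees a handful of charsets); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    // Decodes one byte value using the named single-byte charset.
    static String decode(int byteValue, String charsetName) {
        return new String(new byte[] { (byte) byteValue }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Below 128 every code page agrees: both lines print "A".
        System.out.println(decode(0x41, "ISO-8859-1"));
        System.out.println(decode(0x41, "ISO-8859-9"));

        // Above 127 they disagree: the same byte is the Icelandic eth (U+00F0)
        // in Latin-1, but the Turkish g-with-breve (U+011F) in Latin-5.
        System.out.println(decode(0xF0, "ISO-8859-1"));
        System.out.println(decode(0xF0, "ISO-8859-9"));
    }
}
```

The byte never changed; only the receiver's assumption about the code page did.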
Unicode was introduced as an attempt to eliminate this confusion. The Unicode character set encompasses characters from almost every language in the world. This has the disadvantage that multiple bytes are now required to represent each code point (since it supports over a million code points, but a single byte can distinguish only 256). For historical reasons, the first 256 code points of the Unicode character set map directly to the Latin-1 characters (ISO 8859-1). The code points 256 to 383 support languages like Afrikaans, Czech, Turkish, Welsh, etc.; Tamil is encoded in the code points 2944 to 3071, Thai in 3584 to 3711; Hiragana and Katakana in 12352 to 12543, and so on. Geometric shapes (9632 to 9727), box drawing elements (9472 - 9599) and Zapf Dingbats (9984 - 10175) also find representation in this set.
The major advantage, however, is that as long as you are using a Unicode encoding, you can mix characters from any of these languages in the same document, and the receiver will be able to decode it appropriately for rendering to output.
To summarize, a "character set" encompasses two concepts: a collection of characters from one or more languages that you intend to use in a document, and a mapping of each of those characters to a code point - i.e., a numeric code that uniquely identifies each character within that character set. ASCII or Latin-1 are smaller collection maps (<256 symbols), whereas Unicode is a mucho-grande map (> 1,000,000 symbols).
So, if you just say "Unicode", all you are referring to is the mapping between the individual character symbols and their corresponding integer code points. Before a character set can be used by a computer, you need to specify an encoding - i.e., how these integers will be represented as bytes in memory.
A "character encoding" adds yet another element to this mix - an algorithm that determines how each code point will be represented in terms of bits and bytes. I.e., this comprises the number of bytes that will be required and the endian-ordering of the bytes as they are written.
The simplest character encodings are those where there is a trivial one-to-one mapping between each code point (number) in the character set and a single byte. E.g., all the characters in ASCII can be represented in a single byte.
In the late 80s the ISO as well as the Unicode consortium began work on a unified character set that would support multilingual software. The combined efforts bore fruit in 1993 with ISO 10646-1, which defines the Universal Character Set (UCS). The UCS contains the characters required to represent most known languages, many historic scripts (e.g., hieroglyphs), as well as mathematical and graphical symbols. (Fictional scripts such as Klingon have been proposed, but have not been accepted into the standard.) It was designed as a 31-bit character set (code points U-00000000 to U-7FFFFFFF), allowing over 2 billion (2^31) characters.
The most commonly used characters (e.g., Latin-1, etc.) are found in the first plane, called the Basic Multilingual Plane (BMP). Each plane is a group of 2^16 (65,536) code points, with the least significant 16 bits identifying a code point within its plane. ISO 10646-2 (2001) defined characters outside the BMP. Until 2001 (Unicode 3.1), a common misconception was that Unicode only defined up to 65,536 characters, and so a 2-byte encoding (UCS-2) would suffice. Unfortunately, this misconception continues through today.
The UCS-4 encoding can represent all Unicode characters; the UCS-2 encoding, however, can represent only those from the BMP (U+0000 to U+FFFF). This led to the popular misconception that Unicode only needed a 2-byte encoding based on an unsigned integer. However, Unicode's valid code points far exceed 65,535 (there are over a million possible code points), and as a result, a fixed-width encoding needs 4 bytes per character (UTF-32 or UCS-4).
The full usage of 31 bits would allow representation of over 2 billion different symbols. However, the standard as codified today in ISO 10646 uses only 21 bits, giving a little over 1 million code points (0x000000 to 0x10FFFF). A Unicode code point is written as U+ followed by its hexadecimal value, e.g., U+0041 represents the character 'A'.
A fixed 4 bytes per character is unfortunately wasteful when dealing with English text, for example, where the characters lie between 0 and 127 (U+0000 to U+007F, which corresponds to ASCII) or 0-255 (U+0000 to U+00FF, which corresponds to Latin-1), and so can be represented in just 1 byte. I.e., a file encoded in Latin-1 grows to 4 times its size when re-encoded in UTF-32. Hence, additional encodings were proposed. In particular, UTF-8, which uses 1 byte to represent the standard ASCII set and up to 4 bytes to represent all Unicode characters; and UTF-16, which uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes for the supplementary characters.
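You can measure this size trade-off yourself. The following sketch counts the bytes that Java produces for the same text under several encodings. It assumes the JRE provides the UTF-32BE charset (common, but not guaranteed by the Java specification), and it uses the big-endian variants so that no byte order mark inflates the counts. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class EncodedSize {
    // Number of bytes needed to store the text in the named encoding.
    static int size(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "hello";          // 5 ASCII characters
        String clef = "\uD834\uDD1E";      // U+1D11E, one supplementary character

        for (String cs : new String[] { "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-32BE" }) {
            System.out.println(cs + ": hello = " + size(english, cs) + " bytes");
        }
        // The supplementary character takes 4 bytes in every Unicode encoding.
        for (String cs : new String[] { "UTF-8", "UTF-16BE", "UTF-32BE" }) {
            System.out.println(cs + ": U+1D11E = " + size(clef, cs) + " bytes");
        }
    }
}
```

For "hello" this reports 5, 5, 10 and 20 bytes respectively, which is the fourfold growth from Latin-1 to UTF-32 described above.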
In UTF-8, the original ASCII characters (U+0000 to U+007F) are encoded simply as bytes (0x00 to 0x7F), making it byte-compatible with the ASCII encoding. All UCS characters beyond this range are encoded using several bytes each.
| U-00000000 - U-0000007F: | 0xxxxxxx |
| U-00000080 - U-000007FF: | 110xxxxx 10xxxxxx |
| U-00000800 - U-0000FFFF: | 1110xxxx 10xxxxxx 10xxxxxx |
| U-00010000 - U-001FFFFF: | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U-00200000 - U-03FFFFFF: | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
| U-04000000 - U-7FFFFFFF: | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
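The bit patterns in the table translate almost directly into shifts and masks. Here is a hand-rolled sketch covering the first four rows only (which is all that is needed for code points up to U+10FFFF). It is meant to illustrate the layout, not to replace the encoder built into Java, and it does no validation of surrogate or out-of-range values. The class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Encodes one code point following the first four rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Cross-checks the hand-rolled result against Java's own UTF-8 encoder.
    static boolean matchesLibrary(int cp) {
        byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, encode(cp));
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4ECA, 0x1D11E }; // one sample per table row
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " -> " + encode(cp).length + " byte(s), matches library: "
                    + matchesLibrary(cp));
        }
    }
}
```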
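Of course, your choice of "encoding flavor" will depend on the specific language you will be using. I.e., if your document had a lot of characters that lie in the last range in the above table, you would end up using 6 bytes per character, which is 50% worse than using UTF-32. In practice, since Unicode code points now stop at U+10FFFF, only the first four rows of the table are used and UTF-8 never needs more than 4 bytes per character. The trade-off still shows up in the third row, though: characters between U+0800 and U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16.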
It is hard to reiterate this enough ... a character set can have multiple encodings. For example, the Japanese character sets can be represented using either EUC-JP or Shift_JIS encodings.
There is a lot of misinformation floating around regarding this. E.g., even the HTTP Content-Type header refers to the encoding as a charset, when it really means encoding:
Content-Type: text/html; charset=utf-8
This has been corrected in the declaration used in XML files, where it is finally called an encoding.
When dealing with internationalization of Web applications, you need to clearly specify the character encoding to be used. Otherwise, the text will be rendered as unrecognizable characters.
An interesting aside is that you can encode any Unicode character into an ISO 8859-1 document using character escapes. I.e., you simply enclose the hexadecimal code point value for that character within &#x and ;. E.g., &#x5D0; is rendered as א.
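If you want to generate such escapes programmatically, something along the following lines should do. This is only a sketch: the class and method names are invented, and it deliberately leaves HTML-significant characters such as & and < alone, which a real page generator would also have to escape.

```java
public class NumericEscapes {
    // Replaces every code point outside Latin-1 (above U+00FF) with an
    // &#x...; numeric character reference; everything else passes through.
    static String escapeForLatin1(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase())
                  .append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForLatin1("\u05D0"));          // &#x5D0;
        System.out.println(escapeForLatin1("caf\u00E9 \u05D0")); // the e-acute stays as is
    }
}
```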
So far, we have discovered that a character encoding refers to how symbols in some script are converted to byte sequences, i.e.:
Character Symbol -mapped to-> code point -converted to-> sequence of 1 or more bytes
The reverse process (decoding) converts a byte sequence back into the appropriate character symbol that should be rendered:
Sequence of 1 or more bytes -converted to-> code point -mapped to-> Character Symbol
In other words, to a computer, there really is no such thing as "text" - whether in a file or in a String object. All content is ["text" + "encoding"]. The encoding determines how character symbols are converted to bytes, and how the bytes are converted back into character symbols. Without this additional information, any conversion most often results in gibberish.
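The two pipelines can be walked through in a few lines of Java. The sketch below takes the Devanagari letter from the start of this article, looks up its code point, encodes it as UTF-8, and decodes it again, once with the right encoding and once with the wrong one. The class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class EncodeDecodePipeline {
    // Formats bytes as space-separated upper-case hex, e.g. "E0 A4 B5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String symbol = "\u0935";                                  // the Devanagari letter va

        // Encoding: symbol -> code point -> bytes
        int codePoint = symbol.codePointAt(0);
        byte[] bytes = symbol.getBytes(StandardCharsets.UTF_8);
        System.out.println("U+" + Integer.toHexString(codePoint)); // U+935
        System.out.println(hex(bytes));                            // E0 A4 B5

        // Decoding with the right encoding restores the symbol...
        String ok = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(ok.equals(symbol));                     // true

        // ...decoding with the wrong one yields three unrelated characters.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                        // 3
    }
}
```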
With a Web application, the two communicating parties comprise the Web server/container and the client browser. The client browser sends the server an HTTP request that either requests a page, or passes in form field parameters; and the server returns an HTTP response with the requested information or form. If the information sent across the wire must be understandable to the receiver, each sender must clearly indicate the encoding being used for the content of the communication.
By default, Web applications assume that any HTTP request is encoded using ISO 8859-1 (Latin-1). Either party may choose to use a different encoding - but it is then up to that party to ensure that this fact is communicated to the other. This is usually done via the Content-Type HTTP header which may be set to, say, text/plain;charset=UTF-8 to indicate that all characters in the request are encoded in UTF-8.
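To illustrate how a receiver might act on this header, here is a rough sketch that pulls the charset parameter out of a Content-Type value and falls back to ISO 8859-1 when none is given. The helper is hypothetical and handles only the simple cases; a real container has its own, more thorough parser.

```java
import java.util.Locale;

public class ContentTypeCharset {
    // The default described in the text above.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Extracts the charset parameter from a Content-Type header value.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain;charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // ISO-8859-1
    }
}
```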
Unless both parties have a common understanding of the encoding being used over a connection, the decoding process will not re-assemble the original plain text sent.
Note that in addition to knowing the encoding, the client browser must also have a font installed that can display the characters, so that it may render the symbols appropriately.
To use a particular encoding within a JSP, set the page directive's pageEncoding and contentType attributes.
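With JSPs there are 3 encodings that come into play: the encoding of the JSP file itself (set by the pageEncoding attribute), the encoding of the response (set by the charset in the contentType attribute), and the encoding of the request.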
E.g.,
<%@ page pageEncoding="Shift_JIS" contentType="text/html;charset=UTF-8"%>
If neither a pageEncoding nor a contentType attribute is specified, then the default encoding of ISO 8859-1 is used to decode the bytes of the JSP file as well as for encoding the response returned to the client.
If a pageEncoding is not specified, the charset specified by the contentType attribute is used to decode the bytes in the JSP file as well.
If a pageEncoding is specified, but a contentType is not, then the charset specified by pageEncoding is used for both.
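The three fallback rules just listed boil down to a small decision function. The sketch below encodes only those rules as stated here; it is not the container's actual implementation, and the class and method names are invented for illustration. A null argument stands for "attribute not specified".

```java
public class JspEncodingRules {
    static final String DEFAULT = "ISO-8859-1";

    // Returns { encoding used to read the JSP file, encoding used for the response }.
    static String[] resolve(String pageEncoding, String contentTypeCharset) {
        // File encoding: pageEncoding wins, then the contentType charset, then the default.
        String fileEncoding = pageEncoding != null ? pageEncoding
                : contentTypeCharset != null ? contentTypeCharset
                : DEFAULT;
        // Response encoding: the contentType charset wins, then pageEncoding, then the default.
        String responseEncoding = contentTypeCharset != null ? contentTypeCharset
                : pageEncoding != null ? pageEncoding
                : DEFAULT;
        return new String[] { fileEncoding, responseEncoding };
    }

    public static void main(String[] args) {
        String[] both = resolve("Shift_JIS", "UTF-8");
        System.out.println(both[0] + " / " + both[1]);     // Shift_JIS / UTF-8

        String[] neither = resolve(null, null);
        System.out.println(neither[0] + " / " + neither[1]); // ISO-8859-1 / ISO-8859-1
    }
}
```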
The Web container will raise a translation-time error if an unrecognized page encoding is specified.
Given its importance in correctly decoding a JSP file's contents, and for encoding the response to the client, the page directive along with its charset specification should ideally appear as the first line of the JSP page. At the latest, it should appear before any characters appear that can only be interpreted when the charset is known. I.e., before any non-ASCII characters are encountered.
There are two mechanisms by which a server can inform the client browser about a non-default encoding being used in a response.
response.setContentType("text/html; charset=UTF-8");
ServletOutputStream sos = response.getOutputStream();
PrintWriter out = new PrintWriter(new OutputStreamWriter(sos, "UTF-8"), true);
response.setLocale(new java.util.Locale("", "")); // setLocale takes a Locale; supply the language and country codes here
out.println("<html>");
...
The java.io classes that end in Reader and Writer (e.g., BufferedReader/Writer, InputStreamReader, PrintWriter) support reading and writing of character data streams in different encodings.
A PrintWriter (used by Servlets and JSP pages) by default encodes using ISO 8859-1. Servlets can also use an OutputStream - which performs no encoding.
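To see what the choice of Writer encoding does to the bytes that actually go out, here is a small stand-alone sketch. It writes the Korean greeting used in the JSP samples through a PrintWriter wrapped around an OutputStreamWriter, with an in-memory stream standing in for the servlet's output stream. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

public class WriterEncodingDemo {
    // Writes the text through a PrintWriter using the named encoding
    // and returns the raw bytes that reached the underlying stream.
    static byte[] write(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(sink, Charset.forName(charsetName)));
        out.print(text);
        out.flush(); // push buffered characters through the encoder
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String hello = "\uc548\ub155\uc138\uc694"; // five Hangul syllables

        // UTF-8 needs 3 bytes for each of these characters.
        System.out.println(write(hello, "UTF-8").length);      // 15

        // ISO 8859-1 cannot represent them, so each becomes a '?' byte.
        byte[] latin1 = write(hello, "ISO-8859-1");
        System.out.println(latin1.length);                     // 5
        System.out.println((char) latin1[0]);                  // ?
    }
}
```

Note that the lossy ISO 8859-1 case fails silently: no exception is thrown, the characters are simply replaced.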
<html>
<head>
<% response.setLocale(java.util.Locale.KOREAN); %>
</head>
<body>
<%="\uc548\ub155\uc138\uc694"%>
</body>
</html>
This is equivalent to:
<html>
<head>
<%@ page contentType='text/html;charset=EUC-KR'%>
<% response.setHeader("Content-Language", "ko"); %>
</head>
<body>
<%="\uc548\ub155\uc138\uc694"%>
</body>
</html>
The encoding of a request is the character encoding that should be used to decode the parameters contained in that request. An internationalized request is one that contains a form that allows users to enter characters in a non Latin-1 character set. I.e., characters that are not supported in HTTP's default encoding.
In this case, the first step is for the server to inform the browser which encoding it should use to encode the user input.
For a JSP, the page directive's contentType attribute that we met earlier does double duty in this case. I.e., it not only informs the server which encoding should be used to encode the characters being returned to the client, but also tells the client which encoding is to be used to encode the characters being submitted to the server.
The same applies for servlets - which directly set the Content-type header for this same purpose.
E.g.,
<%@ page pageEncoding="Shift_JIS" contentType="text/html;charset=UTF-8"%>
An HTTP request can only contain parameter values made up from the characters defined by the ISO 8859-1 character set. Hence, the browser must encode all other characters entered in input fields in terms of the allowed characters. It encodes each non-standard character as a string, starting with a % sign followed by a hex value for that character. The problem is that the hex value only makes sense if you know which charset it comes from.
Luckily, most browsers use the charset of the response containing the form to encode the parameter values when the form is submitted. As long as you keep track of the response encoding, you can tell the container which encoding to use to decode the parameter values.
Assume that the encoding is Unicode, i.e., the user can enter values using any Unicode character. Then, when the user submits the form, each character in the form field payload is first encoded into bytes using the UTF-8 encoding.
The browser then uses the HTTP standard URL encoding scheme to encode the resulting bytes. This encoding scheme dictates that even when the default ISO 8859-1 encoding is used, the bytes for all characters other than a-z, A-Z, 0-9, and a handful of punctuation marks must be converted into the byte's value in hexadecimal preceded by a % sign.
For a charset of UTF-8, a Japanese character symbol is typically represented by 3 bytes. Each of these bytes would be converted to a % followed by its value in hex. E.g., %E4%BB%8A would represent a single Japanese character.
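Java's own java.net.URLEncoder and URLDecoder implement this same form encoding scheme, so you can reproduce what the browser does. In the sketch below, the wrapper class and its method names are made up; the checked exception is wrapped only to keep the example short.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormEncodingDemo {
    // Percent-encodes a form value after converting it to bytes in the given charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Reverses the percent-encoding, interpreting the bytes in the given charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // One Japanese character, three UTF-8 bytes, three %XX groups.
        System.out.println(encode("\u4eca", "UTF-8"));       // %E4%BB%8A

        // The same character yields different hex depending on the charset.
        System.out.println(encode("\u00e9", "UTF-8"));       // %C3%A9
        System.out.println(encode("\u00e9", "ISO-8859-1"));  // %E9

        // Decoding only works if the receiver assumes the sender's charset.
        System.out.println(decode("%E4%BB%8A", "UTF-8").equals("\u4eca")); // true
    }
}
```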
When the container receives this information, it must know which charset the browser used to encode it. As stated earlier, some browsers don't return a Content-Type request header - so, it is up to you to keep track of which encoding was used by a particular form, and to use that encoding to process the input. Once the container is told which charset to use, it can decode any parameter values correctly.
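String value = request.getParameter("employeeFirstName");
// getCharacterEncoding() returns null if the request named no charset; falling back to UTF-8 is an assumption.
String enc = request.getCharacterEncoding() != null ? request.getCharacterEncoding() : "UTF-8";
emp.firstName = new String(value.getBytes("ISO-8859-1"), enc); // undo the container's default ISO 8859-1 decoding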
Here we use HttpServletRequest.getCharacterEncoding() to obtain the encoding being used in the request from the content-type header.
The String(byte[], String) constructor uses the specified character set to decode the specified array of bytes.
You can also use ServletRequest.setCharacterEncoding(String enc) to override the character encoding supplied by the container. This method must be called prior to parsing any request parameters.
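The re-decoding trick works outside a servlet container too, which makes it easy to experiment with. The sketch below simulates a container that decoded UTF-8 bytes as ISO 8859-1 and then repairs the damage. It relies on the fact that ISO 8859-1 maps every byte value to exactly one character, so the original bytes can always be recovered. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ParameterRedecode {
    // Recovers the raw bytes from a value that was wrongly decoded as
    // ISO 8859-1, then decodes them again with the charset actually used.
    static String redecode(String misdecoded, String actualCharsetName) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharsetName));
    }

    public static void main(String[] args) {
        // What the browser put on the wire for one Japanese character.
        byte[] wire = "\u4eca".getBytes(StandardCharsets.UTF_8);

        // What a container assuming the default encoding hands back.
        String asSeenByContainer = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(asSeenByContainer.length());                     // 3

        // Re-decoding with the right charset restores the original character.
        System.out.println(redecode(asSeenByContainer, "UTF-8").equals("\u4eca")); // true
    }
}
```

Calling setCharacterEncoding before the first parameter is read avoids the need for this round trip altogether.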
You have a number of options when trying to use Unicode characters in a document.
out.println("<h1>\uf460</h1>");
An internationalized application must determine the encoding of the incoming request parameters. An HTML browser encodes each request using the encoding of the page that was the source of the request, but this is only useful if the original page's encoding is known. The following options are available for you to determine a request's locale: