
Character Encoding

Published May 24th, 2006

Trying to find out why large text was fuzzy in IE on my PC, I took a short refresher on how text characters are rendered in browsers.

Computers use character encoding to map 8-bit character codes (octets) to the glyphs of a font. An octet can take 256 different values (0 to 255), so it is possible to make up to 256 code points simply by indicating whether each bit in an octet is set to 1 or 0. For example:

0100 0001 = A
0100 0010 = B
0100 0011 = C
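The three bit patterns above can be checked with a few lines of Python. This is only a minimal sketch; the helper name `decode_octet` is made up for illustration.

```python
# Each 8-bit pattern is just a number; the encoding maps that number to a character.
patterns = ["0100 0001", "0100 0010", "0100 0011"]


def decode_octet(bits: str) -> str:
    """Turn a string of eight 0/1 digits (spaces allowed) into its ASCII character."""
    value = int(bits.replace(" ", ""), 2)   # binary text -> integer in 0..255
    return bytes([value]).decode("ascii")   # integer -> one byte -> character


decoded = [decode_octet(p) for p in patterns]
print(decoded)   # ['A', 'B', 'C']
print(2 ** 8)    # 256 distinct values fit in one octet
```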

Using this principle, Windows Notepad uses 8 bits (1 byte) of memory per character or space. Character encoding as such is a simple enough concept; Morse code uses it to map sets of dots and dashes to the letters of the Latin alphabet.

ASCII character encoding

In the early days of Unix, text in computers was represented by ASCII (American Standard Code for Information Interchange), a character encoding system for the English alphabet. There are 95 printable ASCII characters (including the space), with codes numbered 32 to 126. Code 127 is the non-printing DEL control character.

!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
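The table above can be regenerated from the code numbers themselves. A small sketch, assuming nothing beyond Python's built-in `chr`:

```python
# Build the printable ASCII range from its code numbers (32 is the space, 126 is '~').
printable = "".join(chr(code) for code in range(32, 127))

# Print it in three rows of 32 codes, matching the layout of the table.
for start in range(0, len(printable), 32):
    print(printable[start:start + 32])

print(len(printable))  # 95
```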

Each of these ASCII characters can actually be stored in 7 bits of an 8-bit byte, leaving the eighth bit free, so that a further 128 codes (128 to 255) were available for other purposes such as accented characters for European languages. These became known as OEM character sets (and later the Windows code pages loosely referred to as "ANSI") but they proliferated to the point where reliable document exchange between PCs in different countries was no longer possible.
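To see why exchange became unreliable, here is a rough sketch that reads one and the same byte under several code pages. The four code pages are just examples picked from the codecs that ship with Python; they are not a list taken from any standard.

```python
# One byte above 127, interpreted under four different 8-bit character sets.
raw = bytes([0xE9])
code_pages = ["cp437", "cp850", "latin-1", "cp1252"]

readings = {name: raw.decode(name) for name in code_pages}
for name, char in readings.items():
    print(f"{name:8} 0xE9 -> {char!r}")

# Codes below 128 are plain ASCII and read the same everywhere.
ascii_readings = {name: b"A".decode(name) for name in code_pages}
```

The same file therefore shows different characters depending on which code page the receiving PC happens to assume.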

Fortunately, with the arrival of the Internet - the ultimate in document exchange - came Unicode.

The Unicode standard

Unicode characters don't map directly to bit patterns but to 'code points', which are numbers conventionally written in hexadecimal, for example U+0041 - Latin capital letter A (the Windows XP character map utility can usually be opened with Start » All Programs » Accessories » System Tools » Character Map).
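The U+0041 example can be reproduced without Character Map. A minimal sketch using Python's standard `unicodedata` module:

```python
import unicodedata

ch = "A"
code_point = ord(ch)               # the character's number in Unicode
label = f"U+{code_point:04X}"      # conventional notation: hex, at least 4 digits
name = unicodedata.name(ch)        # the character's official Unicode name

print(label, name)                 # U+0041 LATIN CAPITAL LETTER A
```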

Character sets are no longer limited to 256, or even the 65,536 that are available with 16 bits per character code. Unicode covers almost all writing systems in current use, including Chinese characters (Han), Russian (Cyrillic), Arabic, Gujarati, Hebrew, and Greek. In fact the Unicode Consortium aims to create one single character set to encompass all the world's writing systems.

UTF-8

Unicode Transformation Format 8-bit (UTF-8) is a 'variable-length' character encoding for Unicode. It can represent any character in the Unicode standard and is backwards compatible with ASCII. For this reason it's becoming the preferred encoding for web pages and email.

UTF-8 stores code points 0 to 127 in a single byte. Code points 128 and above are stored using 2 to 4 bytes (the original design allowed up to 6, but the current definition stops at 4). Compared to early 16-bit (2-byte) Unicode, UTF-8 is more memory-efficient for text made up mostly of the unaccented Latin characters used by most North Americans. UTF-8 was created by programming guru Ken Thompson, together with Rob Pike, in 1992.
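The variable length is easy to observe. The sketch below encodes four sample characters (an arbitrary choice for illustration) and compares the byte counts with a 16-bit encoding:

```python
# Sample characters from progressively higher code point ranges.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, e-acute, euro sign, an emoji

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_lengths[ch]} byte(s), "
          f"UTF-16 {utf16_lengths[ch]} byte(s)")

# Backwards compatibility: pure ASCII text is byte-for-byte identical in UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
```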

ISO 8859-1 (Latin 1)

This is a standard character encoding of the Latin alphabet developed by the ISO. It is not Unicode but part of the ISO 8859 standard of 8-bit character encodings for use by computers. Maintenance of ISO 8859, including ISO 8859-1, ceased in 2004, but ISO-8859-1 (ISO 8859-1 with additional character assignments - note the hyphenation) remains the default encoding under HTTP/1.1 for documents delivered with a MIME type beginning with "text/".

Windows-1252

Windows-1252 (Western European) is, as the name suggests, a Windows character set. It is similar but not identical to ISO-8859-1: Windows-1252 assigns printable characters such as curly quotes, dashes and the euro sign to codes 128 to 159, where ISO-8859-1 has control codes. Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
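The mismatch shows up with the "smart quotes" that Word likes to insert. A small sketch, assuming the pasted text arrived as Windows-1252 bytes but the page is declared as ISO-8859-1:

```python
import unicodedata

# "Hello" wrapped in Word-style curly quotes, as Windows-1252 bytes.
raw = b"\x93Hello\x94"

as_cp1252 = raw.decode("cp1252")    # curly quotes, as the author intended
as_latin1 = raw.decode("latin-1")   # the same bytes read as ISO-8859-1

print(repr(as_cp1252))
print(repr(as_latin1))

# Under ISO-8859-1 the first byte becomes a control character (category 'Cc'),
# which has no business appearing in HTML text.
first_category = unicodedata.category(as_latin1[0])
```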

This doesn't apply to Windows Notepad. A document created in Notepad is so-called "plain" text (but not actually ASCII) and can be saved with ANSI, Unicode (little endian by default), *Unicode big endian, or UTF-8 character encoding.

[*From Microsoft: The bytes (a unit of storage) in a word in a Unicode document created on a big-endian processor, such as the Macintosh, are arranged in an order opposite to that of the bytes in a word in a document created on an Intel processor. The most significant byte has the lowest address, with the word stored big end first. To make your documents accessible to users on these types of computers, save your Notepad file in the big-endian Unicode format.]
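The two byte orders can be compared directly. This is a minimal sketch using Python's built-in UTF-16 codecs, which is an assumption about what Notepad means by "Unicode" here:

```python
import codecs

text = "A"                           # U+0041

little = text.encode("utf-16-le")    # least significant byte first
big = text.encode("utf-16-be")       # most significant byte first
print(little, big)                   # b'A\x00' b'\x00A'

# A byte order mark (BOM) at the start of the file tells the reader which is which.
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
```

Both decode back to the same text because the generic "utf-16" decoder reads the BOM first and picks the byte order accordingly.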

As a digression, the term "endian" comes from Gulliver's Travels, in which wars were fought between those who thought eggs should be cracked on the big end and those who insisted on the little end.

Specifying a character set

It's not hard to see why, in the interests of universal web accessibility, an HTML/XHTML web page should always tell the output device (usually a web browser) which character set to use in displaying written content. In practice this simply means specifying a character set (abbreviated to "charset") in a 'Content-Type' meta element in the head of the HTML/XHTML, with a content value such as:

"text/html; charset=UTF-8"
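As a rough illustration of what a program on the receiving end does with that value, the sketch below pulls the charset out of a Content-Type string using Python's standard `email.message` module. The helper name `charset_of` and its fallback default are assumptions for the example, not how any particular browser works.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or a fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the parameter in lower case, or None if absent.
    return msg.get_content_charset() or default


declared = charset_of("text/html; charset=UTF-8")
undeclared = charset_of("text/html")
print(declared, undeclared)   # utf-8 iso-8859-1

# The declared charset is then used to turn the page's bytes into text.
page_text = b"caf\xc3\xa9".decode(declared)
```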

The browser then responds by using the specified encoding - in the top menus, see View » Encoding, which should indicate either Western European (ISO) or Unicode (UTF-8), at least in IE and Firefox - Opera does its own thing.

Note that entity references are not required for UTF-8, but with ISO 8859-1 they are needed for any character outside its 8-bit repertoire. In other words, Unicode and UTF-8 encoding allow the use of true 'special characters' without invalidating HTML/XHTML.
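The difference can be sketched with Python's `xmlcharrefreplace` error handler, which swaps any character the target encoding cannot hold for a numeric character reference. The sample string is made up for illustration.

```python
import html

text = "caf\u00e9 \u20ac5"   # e-acute exists in ISO 8859-1, the euro sign does not

# ISO 8859-1: the euro sign has no byte of its own, so it becomes a reference.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(latin1_bytes)          # b'caf\xe9 &#8364;5'

# UTF-8: every character is encoded directly, no references needed.
utf8_bytes = text.encode("utf-8")

# A browser resolving the reference recovers the original text.
round_trip = html.unescape(latin1_bytes.decode("iso-8859-1"))
```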

Flash makes large HTML text fuzzy in IE6 (& IE7)

None of this solved the fuzzy text problem. Eventually I traced this to the presence of my Flash header - nothing to do with character encoding. For some reason (in IE6) a Flash file in a page makes any large HTML text beneath it go semi-transparent with a fuzzy appearance. After loading my page, pages on other sites showed the same problem, so the Flash file was doing something strange to my IE6 browser.

I've removed the Flash file and the large text is no longer fuzzy, but there's still an issue with Flash and IE generally. If I close IE and reopen it, then view a page without Flash, large text is fine. But as soon as I open a page containing Flash the fuzziness reappears and affects all web pages (on any site) from then on, until I close the browser.

Page last modified: March 09, 2014

Patrick Elsewhere