Sonic's Ultimate HTML Character Set pages

Originally compiled and posted 10 May 2000.
Content last modified Saturday, 9 January 2021
External links last verified Thursday, 5 July 2007

Introduction and definitions

September 2020: This article is obsolete, being left here for historical reference for those who might need to try and figure out what was going on with WWW pages made before Unicode was standardized. Unicode has been standard for years now, and covers characters vastly beyond the limited historic character sets presented here. I do not intend to update this article further, nor verify external links and track changes thereto.

I've learned quite a bit about character sets over the last few years, both those used on the Internet (especially the World Wide Web), and those particular to two of the dominant personal computer platforms (Macintosh and Wintel).

For the purposes of these pages, i define some terms as follows:

Character Set:

a mapping of character glyphs to an internal numbering system.

Glyph:

the visual representation of a character in a natural language.

Character Encoding:

A method of converting a sequence of bytes (such as data sent over the Net to a web browser) into a sequence of characters (such as those you are currently viewing). Examples: charset=utf-8, charset=macintosh, or charset=ISO-8859-1.
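
To make that definition concrete, here is a minimal sketch in Python (chosen only for illustration; nothing on these pages depends on it). It reads one byte under two legacy encodings, then writes one character under three. The codec names "mac_roman" and "cp1252" are Python's spellings for the Macintosh and Windows ANSI encodings, not charset values taken from this article.

```python
# One byte, two interpretations: the declared encoding decides
# which character the byte 0x8E turns into.
raw = bytes([0x8E])

as_mac = raw.decode("mac_roman")  # Python's codec for charset=macintosh
as_win = raw.decode("cp1252")     # Windows "ANSI" (Western European)

# One character, three byte sequences: the reverse direction.
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
mac_bytes = e_acute.encode("mac_roman")
latin1_bytes = e_acute.encode("iso-8859-1")
utf8_bytes = e_acute.encode("utf-8")

print(repr(as_mac), repr(as_win))
print(mac_bytes.hex(), latin1_bytes.hex(), utf8_bytes.hex())
```

A page saved in one encoding but declared as another goes wrong in exactly this way: the bytes stay the same and the characters change.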

SGML Character References:

an encoding-independent mechanism for representing ANY character. There are two numeric forms:

Decimal: &#D; where "D" is a decimal position number in ISO 10646
Hexadecimal: &#xH; where "x" may be upper or lowercase, and "H" is a hexadecimal position number in ISO 10646
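
As a rough illustration of the numeric forms, the Python sketch below resolves a decimal and a hexadecimal reference to the same character, then builds references from a character's position. It relies on the standard library's html.unescape; the helper names to_decimal_ref and to_hex_ref are made up for this example.

```python
import html

# Decimal 233 and hexadecimal E9 name the same position in ISO 10646.
decimal_ref = html.unescape("&#233;")
hex_lower = html.unescape("&#xE9;")
hex_upper = html.unescape("&#XE9;")  # the "x" may be upper or lowercase


def to_decimal_ref(ch: str) -> str:
    """Build the decimal reference for a single character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Build the hexadecimal reference for a single character."""
    return f"&#x{ord(ch):X};"


print(decimal_ref, hex_lower, hex_upper)
print(to_decimal_ref("\u00e9"), to_hex_ref("\u20ac"))
```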

Character Entity:

a case-sensitive mnemonic alternative to a character reference. Form: &zzzz; where "zzzz" represents a case-sensitive name of varying length, made up of Roman letters and, in a few cases, digits (for example &frac12;).
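
The case sensitivity is easy to check. This small Python sketch, which assumes only the standard library's html module and its HTML 4 entity table, shows two names that differ only in case resolving to two different characters.

```python
import html
from html.entities import name2codepoint

lower = html.unescape("&eacute;")  # small e with acute
upper = html.unescape("&Eacute;")  # capital E with acute

# Each entity is just a name for a position in ISO 10646.
position = name2codepoint["eacute"]

# A few entity names contain digits as well as letters.
half = html.unescape("&frac12;")

print(lower, upper, position, half)
```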

Unicode:

a universal character set, designed to encompass all the glyphs of all the world's human languages. For the purposes of these pages, synonymous with the ISO 10646 standard.

Using Character Encodings, References, and Entities - Real-World Considerations

In a perfect world, there would be one uniform standard which would encompass all glyphs of all human languages, with room for new additions... a standard used on all computing equipment. Actually, there now is such a standard: Unicode.

Thankfully, at the dawn of the new millennium, Unicode has become the standard character set for Microsoft Windows and the Mac OS (Apple and Xerox were early major proponents of the Unicode standard), as well as other essential platforms with which the author has insufficient familiarity to discuss here.

As of 2005, there is so little legacy non-Unicode software still being used that there is no reason not to uniformly adopt Unicode on all web pages, all the time, right now today.

The easiest and safest approach as of this update is to stick with the UTF-8 form of Unicode, which far and away has the deepest established deployment on the popular computing systems. Due to limitations in certain widely-used web browsers from Redmond, Washington, U.S.A. and a few others, it is necessary to omit the standard (and normally desirable) Unicode byte order mark (BOM) at the beginning of each HTML file (page). Not all web page creation software and/or text editors can generate real Unicode. It is essential that web designers, both professional and amateur, ensure that whatever means they use to create web pages really does generate the character set promised.
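
One way to check what a tool really wrote is to look at the bytes. The sketch below is a minimal Python example, not a complete validator: it produces UTF-8 with and without the byte order mark, then tests a byte string for each property. The helper names are invented here; "utf-8-sig" is Python's name for UTF-8 with a leading BOM.

```python
import codecs

page = "<!DOCTYPE html>\n<title>caf\u00e9</title>\n"

without_bom = page.encode("utf-8")    # plain UTF-8, no BOM
with_bom = page.encode("utf-8-sig")   # same bytes preceded by EF BB BF


def starts_with_utf8_bom(data: bytes) -> bool:
    """True if the data begins with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(starts_with_utf8_bom(without_bom), starts_with_utf8_bom(with_bom))
print(is_valid_utf8("caf\u00e9".encode("iso-8859-1")))
```

A file that claims to be UTF-8 but fails the second check was most likely saved in a legacy encoding.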

Every single web page on the internet really ought to have a character set/encoding declaration. Preferably (from what i read), this is done server-side, in the charset parameter of the HTTP Content-Type header. The next best option is under the control of the HTML author. For standard HTML 4.01 and earlier, this is in the form of a META tag in the <HEAD> part:

<META http-equiv="Content-Type" content="text/html; charset=utf-8">

Ideally, this should be the first or second item in the <HEAD> part. That way, the browser will know how to render any special characters in the TITLE or other parts of the markup that precede the body text.
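
To see why position matters, consider a reader that looks for the declaration only near the start of the file. The Python sketch below is a simplified stand-in for that behaviour, not how any particular browser works. The function name sniff_meta_charset is hypothetical, and the 1024-byte default is an assumption borrowed from the prescan window described in later HTML specifications.

```python
import re
from typing import Optional

# Simplified pattern: a META tag containing charset=..., quoted or not.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(page: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset declared within the first `limit` bytes, if any."""
    match = _META_CHARSET.search(page[:limit])
    return match.group(1).decode("ascii").lower() if match else None


early = (
    b"<HTML><HEAD>\n"
    b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    b"<TITLE>Example</TITLE></HEAD></HTML>"
)
# Same declaration, but pushed past the scan window by a very long TITLE.
late = (
    b"<HTML><HEAD><TITLE>" + b"x" * 2000 + b"</TITLE>"
    b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"</HEAD></HTML>"
)

print(sniff_meta_charset(early), sniff_meta_charset(late))
```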

(If, for some reason, you feel a need to choose an older, legacy character set, see IANA registered charset values for a full listing of possible charset [actually character encoding] entries.)

The beauty of all this: whether UTF-8 or an older, legacy character set is declared, the web page author may then generate and use any characters available, just as if using a word processor on that platform... assuming the HTML generation software can actually work with the desired character set. No need to play around with all the character references and/or character entities, as in the Bad Old Days!
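
A short Python sketch of the difference, using made-up sample text: with UTF-8 every character goes into the file directly, while a legacy encoding can hold only its own repertoire and has to fall back on character references for the rest. The "xmlcharrefreplace" error handler is a standard Python feature that performs that fallback; whether a given HTML editor does the same is an assumption to verify.

```python
text = "caf\u00e9 costs 3\u20ac"  # e-acute and the euro sign

# UTF-8 can store both characters as-is.
utf8_page = text.encode("utf-8")

# ISO-8859-1 has e-acute but no euro sign, so the euro sign
# is written as a decimal character reference instead.
latin1_page = text.encode("iso-8859-1", errors="xmlcharrefreplace")


def fits_in(text: str, encoding: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(utf8_page)
print(latin1_page)
print(fits_in(text, "utf-8"), fits_in(text, "iso-8859-1"))
```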

Of course, choosing anything other than UTF-8 these days dramatically limits the number of characters available, and dramatically decreases the number of different computer systems (i.e. human visitors) that can see the pages as intended. Given the passage of time, there really no longer is a downside to using Unicode, and it really does make life easier for almost everyone—except the few folks who insist upon using mid-1990s or earlier systems and stick with long-obsolete browsers and/or very old OSes that know nothing of Unicode. I happen to be composing this, in May 2005, on a 1998 “G2” Macintosh computer (9600/350), and am a huge fan of older computers and using them as long as possible. My several equally-old systems have no problem with Unicode, with a decent choice of browsers and OSes. Since Micro$oft seems to have gotten deeper into Unicode in the OS earlier than Apple, i fully expect its support of Unicode goes far back (i do not know for sure, and welcome corrections).

If you really want to do so, you may look at the completely obsolete, and partially outdated (even as it was written in 2000), information that used to be in this section.

The Tables

The set of tables linked below attempts to be a comprehensive cross-reference of character positions in character sets for SGML Character References, Entities, MacOS Roman Standard, Windows ANSI, and Unicode. The columns are as follows:

The character name
A GIF format image of the character, which hopefully should display correctly on any graphic web browser
SGML Character References (the ampersanded numbers) in decimal
SGML Character References (the ampersanded numbers) as displayed on the browser you are currently using to view the page
Character Entities (the ampersanded names)
Character Entities (the ampersanded names) as displayed on the browser you are currently using to view the page
Decimal number position in the Standard Macintosh Roman Character Set
Decimal number position in the Wintel ANSI Character Set
Unicode UCS-2 number (hexadecimal)

The sort order is by the SGML Character Reference number, which happens to correspond to the Unicode number. Please note that the GIF images of the glyphs have been collected from different font families, and therefore will not all appear uniform. Related to this, the vertical position of the GIFs is not guaranteed to be precisely representative.
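
For readers who want to reproduce a row themselves, the following Python sketch computes most of the columns for a given character. It is an approximation under stated assumptions: Python's "mac_roman" and "cp1252" codecs stand in for the Standard Macintosh Roman and Wintel ANSI sets, the entity names come from Python's HTML 4 table, and the function and key names are invented for this example. The results may not match every entry in the tables, particularly the controversial ones.

```python
import unicodedata
from html.entities import codepoint2name
from typing import Optional


def legacy_position(ch: str, codec: str) -> Optional[int]:
    """Decimal position of ch in a single-byte codec, or None if absent."""
    try:
        return ch.encode(codec)[0]
    except UnicodeEncodeError:
        return None


def table_row(ch: str) -> dict:
    """Compute the non-image columns for one character."""
    cp = ord(ch)
    entity = codepoint2name.get(cp)
    return {
        "name": unicodedata.name(ch, "(unnamed)"),
        "reference": f"&#{cp};",
        "entity": f"&{entity};" if entity else None,
        "mac_roman": legacy_position(ch, "mac_roman"),
        "windows_ansi": legacy_position(ch, "cp1252"),
        "unicode_hex": f"{cp:04X}",
    }


# Sort by reference number, which is also the Unicode number.
rows = sorted(
    (table_row(c) for c in "\u20ac\u00e9A"),
    key=lambda row: int(row["unicode_hex"], 16),
)
for row in rows:
    print(row)
```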

If you would prefer to see the sort order based on the order in the MacOS character set, try Scott Lawton's useful table HTML entities for the Macintosh character set, available both online and as a download. Actually, if the tables here would be more useful to you in another sort order, please email me with your desired sort order and the reason(s) that order would be of value to you. Upon receiving sufficient requests and when i have time, i will see about posting additional sort orders here (of the same tables below).

I am an interested amateur, not an unimpeachable expert. If you find errors/inaccuracies and/or have suggestions for improvement, please send them to the author. It will help greatly if you cite the source of information from which your correction is derived. If you want to improve upon some of my glyphs, submissions as tiny, black and white GIFs, bitmaps (such as from a screenshot), or similar Macintosh-readable formats are welcomed.

Since the author is an English-using American, the character set focus of these tables is almost exclusively Roman characters, Greek and characters from other languages commonly used in mathematics, and other symbols of interest to English-using Net users. I encourage knowledgeable users of other languages with sufficient time, interest, and accurate information to contribute additions to these tables, either by submitting identically-formatted, W3C HTML 4.01-compliant HTML tables + glyph GIFs to me for inclusion on this site, or providing link information for tables located elsewhere. All contributors whose materials are used will be credited as they wish. Thanks!

Character Set Tables:

Low ASCII (0-127)

Controversial High ASCII (128-159)

Remaining High ASCII (160-255)

Symbols, common Western European, Greek Math (338 and up, selected)

Undersubstantiated "controversial" entries

Links to other related sites
