Character References 101

This document explains character references for those unfamiliar with the mechanics of HTML (HyperText Markup Language, the language used to create World Wide Web pages). When my friend looked at my HTML 4 Character References Tables she was as impressed as hell, but none the wiser. She asked me to tell her all about character references in HTML. What better way to explain a topic that needs a computer to demonstrate it than with a web page?

A Look at HTML

If you select the item Source, Page Source, or Show Source from the View menu of your web browser, you will be shown how the page you are now viewing looks behind the scenes. You will see that the HTML page is liberally sprinkled with tags enclosed in angle brackets, like so: < >. For example, the title of this page is enclosed in the tags <title>Character References 101</title>. The purpose of these tags is to instruct the browser how to format and display the contents of the web page being viewed. But what about all those characters which make up the contents of the web page?
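To see tags doing their job outside a browser, here is a minimal sketch in Python that pulls the title out of a scrap of HTML source. It uses the standard library's `html.parser` module; the names `TitleFinder` and `find_title` are made up for this illustration, and the sample page is only a stand-in for a real one.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lower case.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append rather than assign.
        if self.in_title:
            self.title += data


def find_title(source):
    """Return the title text of an HTML document given as a string."""
    finder = TitleFinder()
    finder.feed(source)
    finder.close()
    return finder.title


sample = (
    "<html><head><title>Character References 101</title></head>"
    "<body><p>Hello</p></body></html>"
)
print(find_title(sample))
```

The tags themselves never reach the reader; only the text between them does, which is why the characters in that text need a notation of their own.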

The Mother of All Character Encoding Systems

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. .... Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. ...

from What is Unicode?
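The conflict described in the quotation can be shown with a single byte. The sketch below, in Python, decodes the same number under three older encodings and then asks for the one Unicode number of a character. The codec names are those shipped with Python's standard library; choosing these three encodings is my own illustration, not part of the quoted text.

```python
# One stored number (233, hexadecimal E9), read under three older encodings.
raw = bytes([0xE9])


def decode_with(encoding_name):
    """Interpret the single byte according to the named encoding."""
    return raw.decode(encoding_name)


latin1_char = decode_with("latin-1")   # ISO 8859-1, common on Unix and the early web
mac_char = decode_with("mac_roman")    # classic Macintosh
dos_char = decode_with("cp437")        # original IBM PC

print(latin1_char, mac_char, dos_char)

# Unicode assigns each character exactly one number, whatever the platform.
code_point = ord("é")
print(code_point)
```

The same byte comes out as three different characters, which is exactly the kind of corruption the quotation warns about.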

In December 1997, HTML 4 incorporated Unicode as part of its standard. Currently Unicode can represent over ninety-five thousand characters. Look down at your keyboard and you can see that only a tiny fraction of these characters is available for regular use. So, in practice, one usually specifies one of the older, more limited character encoding systems for one’s HTML document, and relies on character references (special little “codes”) to represent characters outside one’s normal repertoire.
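As a small sketch of that practice, Python can write text in a limited encoding and substitute a decimal character reference for anything that does not fit. The error handler `xmlcharrefreplace` is part of Python's standard library; the function name `to_ascii_with_references` is only a label for this example.

```python
def to_ascii_with_references(text):
    """Keep ASCII characters as typed; turn every other character into
    a decimal character reference such as &#233;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_with_references("café"))
print(to_ascii_with_references("price: 5 €"))
```

Note that this only handles characters outside the encoding. It does not touch characters that HTML reserves for its own markup.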

Making a Molehill Out of a Mountain

(From here on I can speak only as one who uses a computer keyboard intended for the English language, with a Latin, a.k.a. Roman, alphabet. I have no experience using a computer in any other language.)

Now look more closely at your keyboard. You will see that the keys are made up of the alphabet, single-digit numbers, and a selection of punctuation. You will also see modifier keys, which let you alter what the keyboard enters when you strike a key, such as: “Alt”, “control/Ctrl”, “option”, and “shift”. With these keys you have access to about one hundred and ninety visible characters, many more than we need for typing English.

Meanwhile, until the dream of Unicode becomes a reality, we still live in a world of conflicting character encoding systems, in particular those used by different computer platforms such as Macintosh, Unix, or Windows.

As an author of HTML documents you learn that, for the contents of your web page, it is safest to use only the alphabet, numbers, and punctuation that can be typed without using a modifier key, or typed using the “shift” key. These characters are commonly referred to as 7-bit ASCII, or simply ASCII, because they are the visible part of ASCII (American Standard Code for Information Interchange, now incorporated in Unicode), established in the waning days of the teletypewriter. Anyone who learned to type on a mechanical or electric typewriter, or even used a teletype, will be familiar with this character set already. Plain-text e-mail, and messages to newsgroups, are in ASCII. By historical accident, this is the one group of characters everyone agrees upon. ASCII, by whatever name, is the encoding standard used on the majority of computers worldwide (by itself or incorporated in other encoding systems).

Any letter, number, or punctuation mark that needs the “Alt” or “option” key, such as é (small e acute), is better represented by a character reference. The following characters, reserved for HTML markup, should also be represented by character references: & (ampersand), < (less-than sign, a.k.a. angle bracket), and > (greater-than sign, a.k.a. angle bracket). Under certain circumstances, such as inside an attribute value enclosed in double quotation marks, " ((double) quotation mark) should be represented by a character reference as well.
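A short Python sketch of the reserved characters in action, using `html.escape` from the standard library. The two wrapper names, `escape_text` and `escape_attribute`, are invented here to show the difference between ordinary text and an attribute value.

```python
import html


def escape_text(text):
    """Replace &, < and > for use in ordinary page text."""
    return html.escape(text, quote=False)


def escape_attribute(value):
    """Replace &, <, > and quotation marks for use inside an attribute value."""
    return html.escape(value, quote=True)


print(escape_text('AT&T says 1 < 2 > 0 "ok"'))
print(escape_attribute('say "hi" & go'))
```

In ordinary text the quotation mark is left alone; inside an attribute value it is replaced, which is one of the "certain circumstances" mentioned above.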

Character references come in three alternative forms: character entity (abbreviation) references, decimal character references, and hexadecimal character references.
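The three forms can be put side by side with a few lines of Python. This is a sketch only: `three_forms` is a name made up for the example, and the entity names come from the HTML 4 table that Python ships as `html.entities.codepoint2name`.

```python
import html
from html.entities import codepoint2name


def three_forms(char):
    """Return (entity reference, decimal reference, hexadecimal reference).
    The entity reference is None when HTML 4 defines no name for the character."""
    code_point = ord(char)
    name = codepoint2name.get(code_point)
    entity = f"&{name};" if name else None
    return entity, f"&#{code_point};", f"&#x{code_point:X};"


forms = three_forms("é")
print(forms)

# All three spellings decode back to the same character.
decoded = [html.unescape(form) for form in forms]
print(decoded)
```

Whichever form you choose, the browser ends up with the same character.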

Now we are left with 92 (91 1/2?) alphanumeric characters and punctuation marks that one may type without using character references, that is, ASCII minus its controls and the characters reserved for HTML tags. The allowed characters can be summarized as: space, single-digit numbers, unaccented letters a–z and A–Z, and the limited punctuation found on typewriters.
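The arithmetic behind the figure of 92 can be checked with a few lines of Python. The variable names are my own; the ranges are simply the visible part of ASCII (numbers 32 through 126) with the three reserved characters taken out.

```python
RESERVED = "&<>"

# Visible ASCII runs from 32 (space) through 126 (tilde).
visible_ascii = [chr(number) for number in range(32, 127)]

# Remove the characters reserved for HTML markup.
freely_typed = [ch for ch in visible_ascii if ch not in RESERVED]

print(len(visible_ascii), len(freely_typed))
```

The quotation mark is still counted among the 92 here, since it needs a reference only some of the time.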

... Basic Latin
Decimal  Character  Hexadecimal  Character      Description
         (Decimal)               (Hexadecimal)
&#32;               &#x0020;                    space (space bar)
&#33;    !          &#x0021;     !              exclamation mark
&#34;    "          &#x0022;     "              quotation mark
&#35;    #          &#x0023;     #              number sign
&#36;    $          &#x0024;     $              dollar sign
&#37;    %          &#x0025;     %              percent sign

&#39;    '          &#x0027;     '              apostrophe
&#40;    (          &#x0028;     (              left parenthesis
&#41;    )          &#x0029;     )              right parenthesis
&#42;    *          &#x002A;     *              asterisk
&#43;    +          &#x002B;     +              plus sign
&#44;    ,          &#x002C;     ,              comma
&#45;    -          &#x002D;     -              minus sign
&#46;    .          &#x002E;     .              full stop = period
&#47;    /          &#x002F;     /              solidus = (forward) slash
&#48;    0          &#x0030;     0              digit zero
&#49;    1          &#x0031;     1              digit one
&#50;    2          &#x0032;     2              digit two
&#51;    3          &#x0033;     3              digit three
&#52;    4          &#x0034;     4              digit four
&#53;    5          &#x0035;     5              digit five
&#54;    6          &#x0036;     6              digit six
&#55;    7          &#x0037;     7              digit seven
&#56;    8          &#x0038;     8              digit eight
&#57;    9          &#x0039;     9              digit nine
&#58;    :          &#x003A;     :              colon
&#59;    ;          &#x003B;     ;              semicolon

&#61;    =          &#x003D;     =              equals sign

&#63;    ?          &#x003F;     ?              question mark
&#64;    @          &#x0040;     @              commercial at
&#65;    A          &#x0041;     A              latin capital letter A
&#66;    B          &#x0042;     B              latin capital letter B
&#67;    C          &#x0043;     C              latin capital letter C
&#68;    D          &#x0044;     D              latin capital letter D
&#69;    E          &#x0045;     E              latin capital letter E
&#70;    F          &#x0046;     F              latin capital letter F
&#71;    G          &#x0047;     G              latin capital letter G
&#72;    H          &#x0048;     H              latin capital letter H
&#73;    I          &#x0049;     I              latin capital letter I
&#74;    J          &#x004A;     J              latin capital letter J
&#75;    K          &#x004B;     K              latin capital letter K
&#76;    L          &#x004C;     L              latin capital letter L
&#77;    M          &#x004D;     M              latin capital letter M
&#78;    N          &#x004E;     N              latin capital letter N
&#79;    O          &#x004F;     O              latin capital letter O
&#80;    P          &#x0050;     P              latin capital letter P
&#81;    Q          &#x0051;     Q              latin capital letter Q
&#82;    R          &#x0052;     R              latin capital letter R
&#83;    S          &#x0053;     S              latin capital letter S
&#84;    T          &#x0054;     T              latin capital letter T
&#85;    U          &#x0055;     U              latin capital letter U
&#86;    V          &#x0056;     V              latin capital letter V
&#87;    W          &#x0057;     W              latin capital letter W
&#88;    X          &#x0058;     X              latin capital letter X
&#89;    Y          &#x0059;     Y              latin capital letter Y
&#90;    Z          &#x005A;     Z              latin capital letter Z
&#91;    [          &#x005B;     [              left square bracket
&#92;    \          &#x005C;     \              reverse solidus = backslash
&#93;    ]          &#x005D;     ]              right square bracket
&#94;    ^          &#x005E;     ^              circumflex accent (caret, in my experience as a typist)
&#95;    _          &#x005F;     _              low line = spacing underscore
&#96;    `          &#x0060;     `              grave accent
&#97;    a          &#x0061;     a              latin small letter a
&#98;    b          &#x0062;     b              latin small letter b
&#99;    c          &#x0063;     c              latin small letter c
&#100;   d          &#x0064;     d              latin small letter d
&#101;   e          &#x0065;     e              latin small letter e
&#102;   f          &#x0066;     f              latin small letter f
&#103;   g          &#x0067;     g              latin small letter g
&#104;   h          &#x0068;     h              latin small letter h
&#105;   i          &#x0069;     i              latin small letter i
&#106;   j          &#x006A;     j              latin small letter j
&#107;   k          &#x006B;     k              latin small letter k
&#108;   l          &#x006C;     l              latin small letter l
&#109;   m          &#x006D;     m              latin small letter m
&#110;   n          &#x006E;     n              latin small letter n
&#111;   o          &#x006F;     o              latin small letter o
&#112;   p          &#x0070;     p              latin small letter p
&#113;   q          &#x0071;     q              latin small letter q
&#114;   r          &#x0072;     r              latin small letter r
&#115;   s          &#x0073;     s              latin small letter s
&#116;   t          &#x0074;     t              latin small letter t
&#117;   u          &#x0075;     u              latin small letter u
&#118;   v          &#x0076;     v              latin small letter v
&#119;   w          &#x0077;     w              latin small letter w
&#120;   x          &#x0078;     x              latin small letter x
&#121;   y          &#x0079;     y              latin small letter y
&#122;   z          &#x007A;     z              latin small letter z
&#123;   {          &#x007B;     {              left curly bracket = opening brace
&#124;   |          &#x007C;     |              vertical line = vertical bar
&#125;   }          &#x007D;     }              right curly bracket = closing brace
&#126;   ~          &#x007E;     ~              tilde
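A table like this one follows a fixed pattern, so its reference columns can be generated rather than typed. The sketch below is in Python; `table_row` is a hypothetical helper, and the descriptions come from Python's `unicodedata` module, so a few of them are worded differently from the ones in my table (Unicode calls number 45 "hyphen-minus", for instance).

```python
import unicodedata


def table_row(number):
    """Return (decimal reference, hexadecimal reference, Unicode name)
    for one character number."""
    character = chr(number)
    decimal_reference = f"&#{number};"
    hexadecimal_reference = f"&#x{number:04X};"  # four hex digits, as in the table
    description = unicodedata.name(character).lower()
    return decimal_reference, hexadecimal_reference, description


# Visible ASCII, leaving out the three characters reserved for markup.
rows = [table_row(number) for number in range(32, 127) if chr(number) not in "&<>"]

for row in rows[:3]:
    print(*row)
```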

The character entity references for all of the above characters are missing. I saw some in cyberspace somewhere but have been unable to locate them again. It matters only in the interest of completeness. CHARACTER REFERENCES ARE NEITHER NEEDED NOR USED FOR THE ABOVE CHARACTERS.
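For the curious, one place where such names can be looked up is the list of named references from the later HTML5 standard, which Python ships as `html.entities.html5`. The sketch below searches that list for names belonging to visible ASCII characters. The function name `ascii_entity_names` is my own, and these HTML5 names are not part of the HTML 4 tables discussed in this document.

```python
import html
from html.entities import html5  # maps names such as "excl;" to characters


def ascii_entity_names():
    """Map each visible ASCII character to the HTML5 names defined for it."""
    found = {}
    for name, value in html5.items():
        # Keep only fully spelled names (ending in ";") for single ASCII characters.
        if name.endswith(";") and len(value) == 1 and 32 < ord(value) < 127:
            found.setdefault(value, []).append("&" + name)
    return found


names = ascii_entity_names()
print(sorted(names["!"]))
print(html.unescape("&excl;"), html.unescape("&commat;"))
```

As stated above, this is for completeness only; the characters themselves can simply be typed.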

For the three types of character references for the other characters available on your keyboard, look at HTML 4 Character References Tables (252 characters specified in HTML 4.01).

(We can thank the World Wide Web Consortium for this nest of references to references. W3C is the organization responsible for getting members to agree on standards for the WWW.)