This document is an explanation of character references for those unfamiliar with the mechanics of HTML (HyperText Markup Language, the language used to create World Wide Web pages). When my friend looked at my HTML 4 Character References Tables she was as impressed as hell, but none the wiser. She asked me to tell her all about character references in HTML. What better way to explain a topic that needs a computer to demonstrate it than with a web page?
If you select the item Source, Page Source, or Show Source from the View menu of your web browser, you will be shown how the page you are now viewing looks behind the scenes. You will see that the HTML page is liberally sprinkled with tags enclosed in angle brackets, like so: < >. For example, the title of this page is enclosed in the tags <title>Character References 101</title>. The purpose of these tags is to instruct the browser how to format and display the contents of the web page being viewed. But what about all those characters which make up the contents of the web page?
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. .... Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
Unicode is changing all that!
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. ...
— from What is Unicode?
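To make the conflict described in the quotation concrete, here is a small sketch in Python (my choice of language for illustration only; the codec names are the ones Python's standard library happens to use). It decodes the same byte under three legacy encodings, then encodes the same character under each of them.

```python
# One byte value, three legacy encodings, three different characters.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # ISO 8859-1
as_mac = raw.decode("mac_roman")    # classic Macintosh
as_dos = raw.decode("cp437")        # original IBM PC / DOS code page

print(as_latin1, as_mac, as_dos)

# The reverse problem: one character, three different numbers.
e_acute = "\u00e9"  # small e acute
numbers = {
    name: e_acute.encode(name)[0]
    for name in ("latin-1", "mac_roman", "cp437")
}
print(numbers)
```

In Unicode the same character has one number only, 233 (hexadecimal E9), whatever the platform.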
In December 1997, HTML 4 incorporated Unicode as part of its standard. At the time of writing, Unicode defines over ninety-five thousand characters. Look down at your keyboard and you can see that only a tiny fraction of all these characters are available for regular use. So, in practice, one usually specifies one of the older, more limited character encoding systems for one’s HTML document, and relies on character references (special little “codes”) to represent characters outside one’s normal repertoire.
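As a rough illustration of that division of labour, the Python sketch below saves a sample sentence in an older encoding and lets Python substitute a character reference for anything the encoding cannot hold. The sample text is made up; the `xmlcharrefreplace` error handler is a standard Python feature, not something HTML itself provides.

```python
text = "caf\u00e9 costs 5 \u20ac"  # "café costs 5 €"

# ISO 8859-1 has a slot for e acute but none for the euro sign,
# so only the euro sign becomes a decimal character reference.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Plain ASCII has neither, so both become references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1_bytes)
print(ascii_bytes)
```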
(From here on I can speak only as one who uses a computer keyboard intended for the English language, with a Latin, a.k.a. Roman, alphabet. I have no experience using a computer in any other language.)
Now look more closely at your keyboard. You will see that the keys are made up of the alphabet, single-digit numbers, and a selection of punctuation. You will also see modifier keys, which let you alter what the keyboard enters when you strike a key, such as: “Alt”, “control/Ctrl”, “option”, and “shift”. With these keys you have access to about one hundred and ninety visible characters, many more than we need for typing English.
Meanwhile, until the dream of Unicode becomes a reality, we still live in a world of conflicting character encoding systems, in particular those used by different computer platforms such as Macintosh, Unix, or Windows.
As an author of HTML documents you learn that, for the contents of your web page, it is safest to use only the alphabet, numbers, and punctuation that can be typed without using a modifier key, or typed using the “shift” key. These characters are commonly referred to as 7-bit ASCII, or simply ASCII, because they are the visible part of ASCII (American Standard Code for Information Interchange, now incorporated in Unicode), established in the waning days of the teletypewriter. Anyone who learned to type on a mechanical or electric typewriter, or even used a teletype, will be familiar with this character set already. Plain-text e-mail, and messages to newsgroups, are in ASCII. By historical accident, this is the one group of characters everyone agrees upon. ASCII, by whatever name, is the encoding standard used on the majority of computers worldwide (by itself or incorporated in other encoding systems).
Any letter, number, or punctuation mark that requires the “Alt” or “option” key to produce, such as é (small e acute), is better represented by a character reference. The following characters, reserved for HTML markup, should be represented by character references: & (ampersand), < (less-than sign, a.k.a. left angle bracket), and > (greater-than sign, a.k.a. right angle bracket). Also, under certain circumstances, such as inside an attribute value enclosed in double quotation marks, " (double quotation mark) should be represented by a character reference.
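If you generate pages with a program rather than by hand, the reserved characters can be replaced automatically. The sketch below uses `html.escape` from Python's standard library; the sample sentence is invented for the purpose.

```python
import html

snippet = 'if a < b & b > c, say "yes"'

# quote=False replaces only the three reserved characters: & < >
body_text = html.escape(snippet, quote=False)

# quote=True (the default) also replaces quotation marks, which matters
# when the text ends up inside an attribute value such as title="..."
attribute_text = html.escape(snippet, quote=True)

print(body_text)
print(attribute_text)
```

Note that the ampersand has to be handled first; otherwise the ampersands belonging to freshly written references would themselves be replaced. `html.escape` takes care of that order for you.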
Character references come in three alternative forms: character entity (abbreviation) references, decimal character references, and hexadecimal character references.
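Taking é (small e acute) as the example, the three forms look like this. The Python sketch below is only a way of checking that all three name the same character; it relies on `html.unescape` and `html.entities.codepoint2name` from the standard library.

```python
import html
from html.entities import codepoint2name

entity = "&eacute;"       # character entity reference (an abbreviated name)
decimal = "&#233;"        # decimal character reference
hexadecimal = "&#xE9;"    # hexadecimal character reference

# All three decode to the same single character.
decoded = [html.unescape(ref) for ref in (entity, decimal, hexadecimal)]
print(decoded)

# Going the other way: build each form from the character's number.
code_point = ord("\u00e9")
rebuilt = (
    f"&{codepoint2name[code_point]};",
    f"&#{code_point};",
    f"&#x{code_point:X};",
)
print(rebuilt)
```

The decimal and hexadecimal forms are simply the character's Unicode number written in base 10 or base 16; only the entity form needs a table of agreed names.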
Now we are left with 92 (91 1/2?) alpha-numeric characters and punctuation that one may type without using special references, that is, ASCII minus its controls and the characters reserved for HTML tags. The allowed characters can be summarized as: space, single-digit numbers, unaccented letters a–z and A–Z, and the limited punctuation found on typewriters.
| Decimal reference | Character (decimal) | Hexadecimal reference | Character (hexadecimal) | Description |
|---|---|---|---|---|
| `&#32;` | | `&#x20;` | | space (space bar) |
| `&#33;` | ! | `&#x21;` | ! | exclamation mark |
| `&#34;` | " | `&#x22;` | " | quotation mark |
| `&#35;` | # | `&#x23;` | # | number sign |
| `&#36;` | $ | `&#x24;` | $ | dollar sign |
| `&#37;` | % | `&#x25;` | % | percent sign |
| | | | | |
| `&#39;` | ' | `&#x27;` | ' | apostrophe |
| `&#40;` | ( | `&#x28;` | ( | left parenthesis |
| `&#41;` | ) | `&#x29;` | ) | right parenthesis |
| `&#42;` | * | `&#x2A;` | * | asterisk |
| `&#43;` | + | `&#x2B;` | + | plus sign |
| `&#44;` | , | `&#x2C;` | , | comma |
| `&#45;` | - | `&#x2D;` | - | minus sign |
| `&#46;` | . | `&#x2E;` | . | full stop = period |
| `&#47;` | / | `&#x2F;` | / | solidus = (forward) slash |
| `&#48;` | 0 | `&#x30;` | 0 | digit zero |
| `&#49;` | 1 | `&#x31;` | 1 | digit one |
| `&#50;` | 2 | `&#x32;` | 2 | digit two |
| `&#51;` | 3 | `&#x33;` | 3 | digit three |
| `&#52;` | 4 | `&#x34;` | 4 | digit four |
| `&#53;` | 5 | `&#x35;` | 5 | digit five |
| `&#54;` | 6 | `&#x36;` | 6 | digit six |
| `&#55;` | 7 | `&#x37;` | 7 | digit seven |
| `&#56;` | 8 | `&#x38;` | 8 | digit eight |
| `&#57;` | 9 | `&#x39;` | 9 | digit nine |
| `&#58;` | : | `&#x3A;` | : | colon |
| `&#59;` | ; | `&#x3B;` | ; | semicolon |
| | | | | |
| `&#61;` | = | `&#x3D;` | = | equals sign |
| | | | | |
| `&#63;` | ? | `&#x3F;` | ? | question mark |
| `&#64;` | @ | `&#x40;` | @ | commercial at |
| `&#65;` | A | `&#x41;` | A | latin capital letter A |
| `&#66;` | B | `&#x42;` | B | latin capital letter B |
| `&#67;` | C | `&#x43;` | C | latin capital letter C |
| `&#68;` | D | `&#x44;` | D | latin capital letter D |
| `&#69;` | E | `&#x45;` | E | latin capital letter E |
| `&#70;` | F | `&#x46;` | F | latin capital letter F |
| `&#71;` | G | `&#x47;` | G | latin capital letter G |
| `&#72;` | H | `&#x48;` | H | latin capital letter H |
| `&#73;` | I | `&#x49;` | I | latin capital letter I |
| `&#74;` | J | `&#x4A;` | J | latin capital letter J |
| `&#75;` | K | `&#x4B;` | K | latin capital letter K |
| `&#76;` | L | `&#x4C;` | L | latin capital letter L |
| `&#77;` | M | `&#x4D;` | M | latin capital letter M |
| `&#78;` | N | `&#x4E;` | N | latin capital letter N |
| `&#79;` | O | `&#x4F;` | O | latin capital letter O |
| `&#80;` | P | `&#x50;` | P | latin capital letter P |
| `&#81;` | Q | `&#x51;` | Q | latin capital letter Q |
| `&#82;` | R | `&#x52;` | R | latin capital letter R |
| `&#83;` | S | `&#x53;` | S | latin capital letter S |
| `&#84;` | T | `&#x54;` | T | latin capital letter T |
| `&#85;` | U | `&#x55;` | U | latin capital letter U |
| `&#86;` | V | `&#x56;` | V | latin capital letter V |
| `&#87;` | W | `&#x57;` | W | latin capital letter W |
| `&#88;` | X | `&#x58;` | X | latin capital letter X |
| `&#89;` | Y | `&#x59;` | Y | latin capital letter Y |
| `&#90;` | Z | `&#x5A;` | Z | latin capital letter Z |
| `&#91;` | [ | `&#x5B;` | [ | left square bracket |
| `&#92;` | \ | `&#x5C;` | \ | reverse solidus = backslash |
| `&#93;` | ] | `&#x5D;` | ] | right square bracket |
| `&#94;` | ^ | `&#x5E;` | ^ | circumflex accent (caret, in my experience as a typist) |
| `&#95;` | _ | `&#x5F;` | _ | low line = spacing underscore |
| `&#96;` | \` | `&#x60;` | \` | grave accent |
| `&#97;` | a | `&#x61;` | a | latin small letter a |
| `&#98;` | b | `&#x62;` | b | latin small letter b |
| `&#99;` | c | `&#x63;` | c | latin small letter c |
| `&#100;` | d | `&#x64;` | d | latin small letter d |
| `&#101;` | e | `&#x65;` | e | latin small letter e |
| `&#102;` | f | `&#x66;` | f | latin small letter f |
| `&#103;` | g | `&#x67;` | g | latin small letter g |
| `&#104;` | h | `&#x68;` | h | latin small letter h |
| `&#105;` | i | `&#x69;` | i | latin small letter i |
| `&#106;` | j | `&#x6A;` | j | latin small letter j |
| `&#107;` | k | `&#x6B;` | k | latin small letter k |
| `&#108;` | l | `&#x6C;` | l | latin small letter l |
| `&#109;` | m | `&#x6D;` | m | latin small letter m |
| `&#110;` | n | `&#x6E;` | n | latin small letter n |
| `&#111;` | o | `&#x6F;` | o | latin small letter o |
| `&#112;` | p | `&#x70;` | p | latin small letter p |
| `&#113;` | q | `&#x71;` | q | latin small letter q |
| `&#114;` | r | `&#x72;` | r | latin small letter r |
| `&#115;` | s | `&#x73;` | s | latin small letter s |
| `&#116;` | t | `&#x74;` | t | latin small letter t |
| `&#117;` | u | `&#x75;` | u | latin small letter u |
| `&#118;` | v | `&#x76;` | v | latin small letter v |
| `&#119;` | w | `&#x77;` | w | latin small letter w |
| `&#120;` | x | `&#x78;` | x | latin small letter x |
| `&#121;` | y | `&#x79;` | y | latin small letter y |
| `&#122;` | z | `&#x7A;` | z | latin small letter z |
| `&#123;` | { | `&#x7B;` | { | left curly bracket = opening brace |
| `&#124;` | \| | `&#x7C;` | \| | vertical line = vertical bar |
| `&#125;` | } | `&#x7D;` | } | right curly bracket = closing brace |
| `&#126;` | ~ | `&#x7E;` | ~ | tilde |
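The numbers in a table like this one need not be looked up by hand. As a minimal sketch in Python (the names `RESERVED` and `table_row` are my own, purely for illustration), the following builds the decimal and hexadecimal reference for each visible ASCII character and confirms the count of 92.

```python
RESERVED = "&<>"  # the characters reserved for HTML tags, left out of the table

# Visible ASCII runs from 32 (space) to 126 (tilde).
printable = [chr(n) for n in range(32, 127)]
allowed = [ch for ch in printable if ch not in RESERVED]

def table_row(ch):
    """Return (decimal reference, hexadecimal reference) for one character."""
    n = ord(ch)
    return f"&#{n};", f"&#x{n:X};"

print(len(printable), len(allowed))  # 95 visible characters, 92 once & < > are removed
for ch in allowed[:5]:
    print(repr(ch), *table_row(ch))
```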
The character entity references for all of the above characters are missing. I saw some in cyberspace somewhere but have been unable to locate them again. It matters only in the interest of completeness. CHARACTER REFERENCES ARE NOT NEEDED NOR USED FOR THE ABOVE CHARACTERS.
For the three types of character references for the other characters available on your keyboard, look at HTML 4 Character References Tables (252 characters specified in HTML 4.01).
(We can thank the World Wide Web Consortium for this nest of references to references. W3C is the organization responsible for getting members to agree on standards for the WWW.)