UTF-8 is a variable-width character encoding, and it can encode every character covered by Unicode using 1 to 4 eight-bit bytes. It was originally designed by Ken Thompson and Rob Pike in 1992. Those names will be familiar to anyone with an interest in the Go programming language, as they were two of its original creators as well.

UTF-8 is recommended by the W3C as the default encoding in HTML files, and stats indicate that it is used on 91.3% of all web pages, as of April 2018.

At the time of its introduction, ASCII was the most popular character encoding in the Western world. In ASCII every letter, digit, and symbol was assigned a number, and since that number was stored in a single 8-bit byte, at most 256 characters could be represented, which was enough at the time.

UTF-8 was designed to be backward compatible with ASCII. This was very important for its adoption, since ASCII was much older (1963) and widespread, and moving to UTF-8 happened almost transparently. The first 128 characters of UTF-8 map exactly to ASCII. Why 128? Because ASCII uses a 7-bit encoding, which allows up to 128 combinations. Why 7 bits? We now take 8 bits for granted, but back when ASCII was conceived, 7-bit systems were popular as well.

Being 100% compatible with ASCII also makes UTF-8 very efficient, because the most frequently used characters in Western languages are encoded with just 1 byte. Here is the map of byte usage:
| Number of bytes | First code point | Last code point |
| --- | --- | --- |
| 1 | U+0000 | U+007F |
| 2 | U+0080 | U+07FF |
| 3 | U+0800 | U+FFFF |
| 4 | U+10000 | U+10FFFF |
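To make the table concrete, here is a minimal Go sketch (fitting, given Thompson and Pike's involvement) that prints the UTF-8 bytes and byte count for one sample character from each range. The sample characters are just illustrative choices, not anything prescribed by the encoding itself.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// One sample character per byte-length range from the table above:
	// 'A' (U+0041) is plain ASCII, 'è' (U+00E8) needs 2 bytes,
	// '€' (U+20AC) needs 3, and '𝄞' (U+1D11E, the musical G clef)
	// sits beyond U+FFFF and therefore needs 4.
	for _, r := range []rune{'A', 'è', '€', '𝄞'} {
		fmt.Printf("%c  U+%04X  -> % X  (%d byte(s))\n",
			r, r, string(r), utf8.RuneLen(r))
	}
}
```

Running it prints the raw UTF-8 bytes, for example `41` for `A` and `F0 9D 84 9E` for the clef, confirming that an ASCII character still occupies a single byte while characters beyond U+FFFF take four.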