Encoding Text, UTF-32 and UTF-16 – How Unicode Works (Part 1)
The standard for how to represent human writing for many years was ASCII, or American Standard Code for Information Interchange. This representation reserved 7 bits for encoding a character. This served early computing well, but did not scale as computers were used in more and more languages and cultures across the world. This article explains how this simple encoding grew into a standard that aims to represent the writing systems of every culture on Earth. This is Unicode.
At the time of this writing, I recently switched to the Firefox Internationalization team and began working on a Unicode sub-committee to help implement ICU4X. This involves reading lots of specs, and I wanted to take the time to write out my interest in the encodings of UTF-8, 16, and 32.
From first principles, let’s design our own simple encoding of the English language. Let’s see how many bits we need to encode the alphabet. Representing a to z can be done with 26 numbers. So let’s do that.
Imaginary 5 bit Character Encoding
1 | Letter Number Hex Binary |
At this point, we’ve needed 5 bits to represent 26 letters. However, 5 bits can hold 32 values, so we still have 6 values (or “code points”) left: 26-31 if written as base 10 numbers, or 0x1a to 0x1f if written as base 16. We can shove a few useful characters into the remaining values.
1 | Character Number Hex Binary |
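As a rough sketch of the letter mapping above, the imaginary encoding could be written in JavaScript like this. The helper names `encodeLetter` and `toFiveBits` are made up for illustration, and the extra non-letter values are left out.

```javascript
// Hypothetical helper: map "a".."z" to the numbers 0..25.
function encodeLetter(letter) {
  const value = letter.charCodeAt(0) - "a".charCodeAt(0);
  if (letter.length !== 1 || !(value >= 0 && value <= 25)) {
    throw new RangeError(`Cannot encode ${JSON.stringify(letter)}`);
  }
  return value;
}

// Hypothetical helper: render a value as a fixed-width 5 bit binary string.
function toFiveBits(value) {
  return value.toString(2).padStart(5, "0");
}

// 26 letters need 5 bits, which leaves 6 of the 32 values unused.
const bitsNeeded = Math.ceil(Math.log2(26));
const unusedValues = 2 ** bitsNeeded - 26;

console.log(toFiveBits(encodeLetter("a"))); // 00000
console.log(toFiveBits(encodeLetter("z"))); // 11001
console.log(bitsNeeded, unusedValues);      // 5 6
```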
Our 5 bits are enough for encoding our somewhat bad character set. However, the early ASCII encoding system needed just 7 bits, which is 2 more than our encoding. Modern processors operate on fixed-size groups of bits whose sizes are multiples of 8. This unit of operation is known as a “word”. For instance, my current laptop has a word size of 64 bits, so when my processor adds two numbers, it takes two 64 bit numbers and runs the add operation.
When ASCII was invented a common word size was 8 bits. In fact, the hole punched tape frequently used with computers at the time had 8 holes (or bits) per line. However, the committee decided that 7 bits was a better choice: transmitting bits took both time and money, and 7 bits were sufficient for their character set needs. The 7 bit encoding could handle values 0 to 127, or 0x00 to 0x7f. That is 96 more characters than our 5 bit scheme can represent.
Rather than fill out the imaginary encoding above, it may be useful to examine the complete ASCII encoding scheme.
Complete ASCII Encoding
1 | ---------------------- ----------------------- --------------------- ----------------------- |
This table presents some interesting “letters” such as NUL 0x00, BEL 0x07, and BS 0x08. These are known as control characters. Our imaginary 5 bit encoding had no need for these non-printing characters, but early ASCII found them useful for transmitting information to terminals. NUL 0x00 is used in languages like C to signal the end of a string. BEL 0x07 could be used to signal to a device to emit a sound or a flash. BS 0x08 or backspace, may overprint the previous character.
Moving beyond English
For this entire article so far, I’ve typed only letters that could be represented in ASCII. As a US citizen whose native language is English, this is pretty useful. However, the world is much bigger than that. I speak Spanish as well, and I simply can’t represent ¡Hola! ¿Qué tal? in just ASCII. But hold on, ASCII only uses 7 bits, and modern computing typically has word sizes that are multiples of 8 bits. Going from 7 bits to 8 bits expands the range of symbols from 128 to 256. Now we can use that extra bit to add more letters and represent ¡ ¿ é. This encoding is known as the Latin 1 encoding, or ISO/IEC 8859-1. This encoding includes control characters, but it expands the number of printing characters to 191.
1 | Character Number Hex Binary |
What’s notable is that by adding only 1 more bit, we can now mostly represent many more languages across the world. Not only that, but Latin 1 is backwards compatible with ASCII. This means legacy documents can still be interpreted just fine.
Languages (arguably) supported by Latin 1: Afrikaans, Albanian, Basque, Breton, Corsican, English, Faroese, Galician, Icelandic, Irish, Indonesian, Italian, Leonese, Luxembourgish, Malay, Manx, Norwegian, Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Scots, Southern Sami, Spanish, Swahili, Swedish, Tagalog, and Walloon.
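To make the 7 bit versus 8 bit difference concrete, here is a small sketch that checks whether a string would fit in ASCII or in Latin 1. The helper names (`fitsInBits`, `isAscii`, `isLatin1`) are invented for this example. It leans on the fact that JavaScript's code unit values for these characters match their Latin 1 numbers, and it writes the accented characters as escapes so that the precomposed forms are used.

```javascript
// Hypothetical helper: true when every code unit fits in the given bit width.
function fitsInBits(text, bits) {
  const max = 2 ** bits - 1;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > max) return false;
  }
  return true;
}

const isAscii = (text) => fitsInBits(text, 7);  // values 0x00 to 0x7f
const isLatin1 = (text) => fitsInBits(text, 8); // values 0x00 to 0xff

// "¡Hola! ¿Qué tal?" with ¡ (0xA1), ¿ (0xBF), and é (0xE9) as escapes.
const greeting = "\u00A1Hola! \u00BFQu\u00E9 tal?";

console.log(isAscii("Hello"));   // true
console.log(isAscii(greeting));  // false
console.log(isLatin1(greeting)); // true
```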
Supporting all of human culture
We started with an imaginary 5 bit encoding scheme, moved on to 7 bit ASCII, and finally to an 8 bit Latin 1 encoding. How many more bits do we need to theoretically support all known and existing human cultures? This is a pretty tall order, but also a noble goal.
Enter Unicode.
Unicode aims to be a universal character encoding. At this point UTF-8 (or Unicode Transformation Format, 8 bits) is the de facto winner in encoding text, especially on the internet. So what is Unicode precisely? According to Unicode’s FAQ:
Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuations, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.
Our imaginary 5 bit system needed to encode 26 letters of the lowercase English alphabet, while Unicode can support 1,114,112 different code points. Now it’s time to define a code point. This is the number representing some kind of character that is being encoded. Keep in mind that in ASCII, not all of these code points represented printable characters, like the BEL. This is true in Unicode, with even stranger types of code points.
It’s customary to display code points in hex representation. So for instance 0x61 is the hex value of the code point for "a" in ASCII. In base-10 notation this would be the number 97. Unicode is backwards compatible with ASCII, and to signify that a value represents a Unicode code point, its hex value is typically prefixed with U+. So the “Latin Small Letter A” is designated as U+0061 in Unicode.
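The U+ notation is easy to reproduce in a web console. The sketch below uses the standard `codePointAt` method; the wrapper name `formatCodePoint` is only illustrative.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function formatCodePoint(character) {
  const codePoint = character.codePointAt(0);
  // Pad to at least four hex digits, as is customary for Unicode.
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log("a".codePointAt(0));        // 97
console.log(formatCodePoint("a"));      // U+0061
console.log(formatCodePoint("\u00E9")); // U+00E9 (é)
```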
Here are some examples of different Unicode characters and their respective code points. Note that so far, all of these code points are completely backwards compatible with ASCII and Latin 1. However, there are still potentially 1,114,112 different code points, ranging from 0x00 to 0x10FFFF that can be represented in Unicode.
1 | ASCII Range Latin1 Range |
How to fit 1,114,112 code points
In the 5 bit encoding scheme we could fit 32 code points. Let’s examine how many code points can fit into different sized words.
1 | Type Size Type Name Max Code Points Visual Bit Size |
From here on out, I’m going to use the Rust type names u8, u16, and u32 for the unsigned integer types that are used to encode the code points.
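The capacities of those integer sizes can be double-checked with a few lines of JavaScript. This is only a sanity check; the `capacity` helper and the constant name are assumptions made for this example.

```javascript
// Number of distinct values an unsigned integer of the given width can hold.
function capacity(bits) {
  return 2 ** bits;
}

// Unicode code points run from 0x0 to 0x10FFFF inclusive.
const UNICODE_CODE_POINTS = 0x10ffff + 1; // 1,114,112

for (const bits of [8, 16, 32]) {
  const fits = capacity(bits) >= UNICODE_CODE_POINTS;
  console.log(`u${bits}: ${capacity(bits)} values, fits all of Unicode: ${fits}`);
}
// u8: 256 values, fits all of Unicode: false
// u16: 65536 values, fits all of Unicode: false
// u32: 4294967296 values, fits all of Unicode: true
```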
Encoding code points – UTF-32
From here, it’s important to understand the distinction that the code points are documented with hex values, but that doesn’t necessarily mean that is how they are actually represented in binary. It’s time to discuss encodings.
The simplest encoding method for Unicode is UTF-32, which uses a u32 for each code point. The main advantage of this approach is its simplicity: it’s fixed length. If you have an array of code points, you can index the array at any position and you are guaranteed to find a complete code point there. The biggest problem here is the waste of space. Modern computing has word sizes divisible by 8. Let’s examine the bit layout of u32. (Note, I’m ignoring byte endianness for the sake of simplicity here.)
The max code points that can be represented in a u32 according to the table above is 4,294,967,296, which is way more space than is needed by Unicode’s 1,114,112 potential code points.
UTF-32 encoding wastes at least 11 bits per code point, as can be seen by this table:
1 | Size Max Code Points Visual Bit Size |
It’s really even worse in terms of representing common English texts. Consider the word "Hello" – the encoding visually looks like:
1 | UTF-32 ASCII |
The UTF-32 memory for this short string is primarily filled with zeros. This is a pretty hefty memory price to pay for a simple implementation. The next encodings will present more memory-optimized solutions.
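To see those zeros for yourself, here is a minimal sketch that stores one u32 per code point and counts the zero bytes. JavaScript has no built-in UTF-32 encoder, so `encodeUtf32` is a made-up helper and `Uint32Array` stands in for an array of u32 values.

```javascript
// Hypothetical helper: one 32 bit value per code point.
function encodeUtf32(text) {
  // Array.from splits the string by code point, not by code unit.
  return Uint32Array.from(Array.from(text), (ch) => ch.codePointAt(0));
}

const utf32 = encodeUtf32("Hello");
// View the same memory byte by byte to count the zeros.
const utf32Bytes = new Uint8Array(utf32.buffer);
const zeroBytes = utf32Bytes.filter((byte) => byte === 0).length;

console.log(utf32.byteLength); // 20
console.log(zeroBytes);        // 15
```

Only 5 of the 20 bytes carry any information for this ASCII-only string.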
Saving space with UTF-16
Instead of using a u32, let’s attempt to encode our code points in a smaller base unit, the 16 bit u16.
1 | Size Max Code Points Visual Bit Size |
Now we have a different sort of problem: there aren’t quite enough bits to fully represent the Unicode code point space. Let’s ignore this for a moment, and see how our word "Hello" encodes into UTF-16.
1 | UTF-16 ASCII |
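The same layout can be reproduced in a console. JavaScript strings already expose their 16 bit code units through `charCodeAt`, so this sketch just copies them into a `Uint16Array`; the name `encodeUtf16` is an assumption for illustration.

```javascript
// Hypothetical helper: copy a string's code units into an array of u16 values.
function encodeUtf16(text) {
  const units = new Uint16Array(text.length);
  for (let i = 0; i < text.length; i++) {
    units[i] = text.charCodeAt(i);
  }
  return units;
}

const utf16 = encodeUtf16("Hello");
const utf16Hex = Array.from(utf16, (unit) => unit.toString(16).padStart(4, "0"));

console.log(utf16.byteLength);   // 10
console.log(utf16Hex.join(" ")); // 0048 0065 006c 006c 006f
```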
The memory size is already looking much better! Now we only have a few bytes of “wasted” zero values. There is still the problem of those pesky missing 5 bits, but for now let’s see what we can represent with a single u16. For this we’ll take a small digression into looking at how Unicode organizes its code points.
Unicode Blocks
The first term to discuss is the Unicode block. This is a basic organization block for organizing characters. It is a block of contiguous code points that represents some kind of logical grouping of code points. The blocks are not uniform in size.
The first block should be familiar. It’s the Basic Latin code block. This is the range U+0000 to U+007F, and is backwards compatible with ASCII. The next code block is the Latin-1 Supplement. Together with Basic Latin these two are backwards compatible with the ISO/IEC 8859-1 Latin 1 encoding.
These blocks don’t stop there, and quickly diverge from the familiar Latin-based characters, to the most common scripts that are used across the globe. The table below contains samples of these blocks (a full listing is available here).
1 | ----------------------------------- -------------------------- --------------------------------------- |
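Looking up the block for a code point is just a range check. The sketch below hard-codes four blocks as a sample; `SAMPLE_BLOCKS` and `findBlock` are names invented here, and a real lookup would need the full block listing.

```javascript
// A hand-picked sample of blocks, not the full Unicode listing.
const SAMPLE_BLOCKS = [
  { name: "Basic Latin", start: 0x0000, end: 0x007f },
  { name: "Latin-1 Supplement", start: 0x0080, end: 0x00ff },
  { name: "Hiragana", start: 0x3040, end: 0x309f },
  { name: "Katakana", start: 0x30a0, end: 0x30ff },
];

// Hypothetical helper: name the sample block containing a character.
function findBlock(character) {
  const codePoint = character.codePointAt(0);
  const block = SAMPLE_BLOCKS.find(
    (b) => codePoint >= b.start && codePoint <= b.end
  );
  return block ? block.name : "(not in this sample)";
}

console.log(findBlock("a"));      // Basic Latin
console.log(findBlock("\u00E9")); // Latin-1 Supplement (é)
console.log(findBlock("\u3068")); // Hiragana (と)
console.log(findBlock("\u30C8")); // Katakana (ト)
```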
Unicode planes
Now the next unit of organization beyond blocks is the plane. This term is more tightly coupled to the idea of memory. Each plane is a fixed size of 65,536 code points. This fits perfectly into a u16. The blocks listed above all fit into the first plane, the “Basic Multilingual Plane”. Most of the planes are still unassigned, and a full listing can be found here.
1 | Plane Range |
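Because every plane holds exactly 65,536 (or 2^16) code points, the plane number is simply whatever is left after dropping the low 16 bits. A tiny sketch, with `planeOf` as an illustrative name:

```javascript
// Hypothetical helper: the plane is the code point with its low 16 bits removed.
function planeOf(codePoint) {
  return codePoint >> 16;
}

console.log(planeOf(0x0061));   // 0, the Basic Multilingual Plane
console.log(planeOf(0x1f44d));  // 1
console.log(planeOf(0x10ffff)); // 16, the last plane

// 17 planes of 65,536 code points account for the whole code space.
console.log(17 * 65536);        // 1114112
```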
Fitting Unicode into a u16
The Basic Multilingual Plane contains the most common scripts in use today. It’s also quite interesting that a single plane fits into a u16. Given these two constraints, let’s encode some more interesting characters into UTF-16. For this let’s take the Japanese hiragana and katakana characters となりのトトロ which make up the movie title for “My Neighbor Totoro”.
1 | ===================================== |
Inspecting these values we can see that there are far fewer zeros in them. The high byte is 00110000 or 0x30. Looking at the Basic Multilingual Plane above, we can see that the first four characters fit into the U+3040 - U+309F Hiragana range of code points, while the last three fall in the adjacent U+30A0 - U+30FF Katakana range. So this is great, rather than just Latin characters, we can represent Japanese scripts, amongst many others. While the values 00110000 are repeating, they are not the seemingly “wasted space” of the 00000000 from the high byte of the Latin1 and ASCII encoded code points.
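A quick console check of the high byte, as a sketch: the title is written with escapes so the exact code points are visible, and the variable names are only illustrative.

```javascript
// となりのトトロ written out as code point escapes.
const totoroTitle = "\u3068\u306A\u308A\u306E\u30C8\u30C8\u30ED";

// Every character here is a single code unit, so charCodeAt(0) is enough.
const totoroUnits = Array.from(totoroTitle, (ch) => ch.charCodeAt(0));
const totoroHighBytes = totoroUnits.map((unit) => unit >> 8);

console.log(totoroUnits.map((unit) => unit.toString(16)).join(" "));
// 3068 306a 308a 306e 30c8 30c8 30ed
console.log(totoroHighBytes.every((byte) => byte === 0x30)); // true
```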
This is all great, but leaves the question, what happens if you want to encode something that is outside of the basic multilingual plane? For that we will examine the following emoji:
👍 U+1F44D
Copying and pasting this emoji into your search engine of choice, you can quickly look up the code point for a given symbol, which here is U+1F44D. The first thing you may notice is that the value for this code point is beyond the Basic Multilingual Plane. Referring to the planes table above, we can identify that the range 0x01_0000 to 0x01_FFFF encompasses this code point value of 0x01_F44D. This plane is the Supplementary Multilingual Plane.
For a hint of how this code point is encoded, open up the web console if your browser has it, and type the following code:
"👍".length
The result is 2: a single character requires two u16 values to encode it.
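The difference between counting code units and counting code points shows up directly in the console. The emoji is written as an escape here so that the code point is unambiguous.

```javascript
const thumbsUp = "\u{1F44D}"; // 👍

// length counts UTF-16 code units.
console.log(thumbsUp.length);             // 2
// Array.from iterates by code point.
console.log(Array.from(thumbsUp).length); // 1
console.log(thumbsUp.codePointAt(0).toString(16)); // 1f44d
```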
Code units, code points, and high and low surrogates
I’m glad I did not lead this article with this title, or else my readers would have stopped right there. However, if you’ve followed along so far, let’s dive in.
Technically UTF-16 is a variable length format, and indexing into an array does not give you a code point but a code unit. The reason is higher-plane values like the 👍. This single code point requires two code units. Let’s open up the web console again and poke at some implementation details.
"👍"[0]
1 | ======================================= |
Here indexing into the emoji reveals two different code units. The first is U+D83D, and the second is U+DC4D. Right away we can see that both of these fit into the Basic Multilingual Plane, so it’s time to look up the code block to which they belong in that plane. The first is in range of the High Surrogates code block (U+D800 - U+DBFF) while the second is in range of the Low Surrogates (U+DC00 - U+DFFF).
Now for the magic: A high and low surrogate code unit can be combined together using bit math to form the higher plane code point that would not normally fit in a u16. The following table illustrates this visually.
1 | =================================================== |
1 | U+D83D High Surrogate 11011000_00111101 "👍"[0] |
To verify this we can write some code in the web console to do this operation. Feel free to skip this example if you haven’t done bit fiddling operations before.
1 | function combineSurrogates(high, low) { |
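For readers who want something complete to paste into a console, here is one way the combining step could be sketched. It is named `combineSurrogatesSketch` to keep it apart from the function above, and it assumes the two inputs are already a valid high and low surrogate.

```javascript
// A sketch of combining a surrogate pair into a code point.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
function combineSurrogatesSketch(high, low) {
  const highBits = high - 0xd800; // upper 10 bits of the offset
  const lowBits = low - 0xdc00;   // lower 10 bits of the offset
  // The 20 bit offset counts up from the first code point past the BMP.
  return 0x10000 + (highBits << 10) + lowBits;
}

console.log(combineSurrogatesSketch(0xd83d, 0xdc4d).toString(16)); // 1f44d
```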
We can also do the reverse of turning a code point into a high and low surrogate pair.
1 | function toSurrogatePair(codePoint) { |
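The reverse direction can be sketched in the same spirit. The name `toSurrogatePairSketch` and the `{ high, low }` return shape are assumptions made for this example.

```javascript
// A sketch of splitting a supplementary code point into a surrogate pair.
function toSurrogatePairSketch(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Only code points beyond the BMP need surrogates");
  }
  const offset = codePoint - 0x10000;   // a 20 bit value
  return {
    high: 0xd800 + (offset >> 10),      // upper 10 bits
    low: 0xdc00 + (offset & 0x3ff),     // lower 10 bits
  };
}

const pair = toSurrogatePairSketch(0x1f44d);
console.log(pair.high.toString(16), pair.low.toString(16)); // d83d dc4d
```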
Poking more at JavaScript’s UTF-16
Looking into more of how JavaScript encodes the values, you can see other interesting patterns. First let’s build a function that will output a nicely formatted text table of the UTF-16 encoded values.
1 | function getCodeUnitsTable(utf16string) { |
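As a simplified stand-in, the sketch below prints one row per code unit with its index, hex value, and binary value. It is named `codeUnitsTableSketch` to keep it separate from getCodeUnitsTable, its column layout is an assumption, and it prints only numbers rather than the characters themselves.

```javascript
// A simplified sketch: one text row per UTF-16 code unit.
function codeUnitsTableSketch(utf16string) {
  const rows = [];
  for (let i = 0; i < utf16string.length; i++) {
    const unit = utf16string.charCodeAt(i);
    const hex = unit.toString(16).toUpperCase().padStart(4, "0");
    const binary = unit.toString(2).padStart(16, "0");
    // Split the 16 bits into a high byte and a low byte for readability.
    rows.push(`${i}  0x${hex}  ${binary.slice(0, 8)}_${binary.slice(8)}`);
  }
  return rows.join("\n");
}

console.log(codeUnitsTableSketch("Hi"));
// 0  0x0048  00000000_01001000
// 1  0x0069  00000000_01101001
```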
Now we can inspect some strings. First off is a Latin1 string.
getCodeUnitsTable("¡Hola! ¿Qué tal?");
1 | =================================== |
From here we can see that the high byte for the entire string is still just 0. If we ignore the high byte, and only use the low byte values, this would be a valid Latin1 ISO/IEC 8859-1 string. In fact, this is an optimization that JavaScript engines already do, since they are required to operate with UTF-16 strings. Inspecting SpiderMonkey’s source shows frequent mentions of Latin1 encoded strings. This creates some additional complexity, but can cut the string storage size in half for web apps that use Latin1 strings. This would change the encoding from using a u16 to a u8 for these special strings.
Of course, the internet is a global resource, so much of the content is not Latin1 encoded. Repeating the example of “My Neighbor Totoro” text from above, you can see that a JavaScript engine would need to use the full u16. These characters are still in the basic multilingual plane.
getCodeUnitsTable("となりのトトロ");
1 | ================================== |
However, things change when we go beyond it to the Supplementary Multilingual Plane. Let’s look at some Egyptian hieroglyphs using our function above.
getCodeUnitsTable("𓃓𓆡𓆣𓂀𓅠");
Running this code now creates a tricky situation. I can’t directly paste the results into my editor window. This happens because the text has surrogate pairs. When we try to create the text output using getCodeUnitsTable above, this generates invalid code units. A high surrogate must have a matching low surrogate, or programs get angry (or at least they should do something to handle it). The fact that untrusted code units can be malformed UTF-16 is a great way to introduce bugs into a system. Because of this, you shouldn’t naively take in UTF-16 input without validating its encoding. Browsers do a lot here to protect users from these encoding issues.
Here is the slightly amended table that my editor will actually let me paste into this document.
1 | ============================================== |
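The validation mentioned above can be sketched by hand: every high surrogate must be followed immediately by a low surrogate, and a low surrogate may never appear on its own. The name `isWellFormedUtf16` is made up for this example; newer JavaScript engines also ship a built-in `String.prototype.isWellFormed()` that answers the same question.

```javascript
// A sketch of UTF-16 validation over a string's code units.
function isWellFormedUtf16(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: the next unit must be a low surrogate.
      // At the end of the string charCodeAt returns NaN, which fails the check.
      const next = text.charCodeAt(i + 1);
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the low surrogate we just consumed
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no high surrogate before it.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedUtf16("\u{1F44D}"));    // true
console.log(isWellFormedUtf16("\uD83D"));       // false, unmatched high
console.log(isWellFormedUtf16("\uDC4D\uD83D")); // false, wrong order
```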
In order to inspect the actual code points in UTF-16 encoded JavaScript, we need to build a separate utility that knows the difference between code units, and code points. It needs to properly handle surrogate pairs.
1 | function getCodePointsTable(string) { |
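A simplified sketch of such a utility is below. It relies on the fact that `for...of` over a string walks code points rather than code units, so a surrogate pair arrives as a single two-unit chunk. The name `codePointsTableSketch` and the row format are assumptions for illustration.

```javascript
// A simplified sketch: one text row per code point.
function codePointsTableSketch(text) {
  const rows = [];
  // for...of iterates by code point, keeping surrogate pairs together.
  for (const character of text) {
    const codePoint = character.codePointAt(0);
    const hex = codePoint.toString(16).toUpperCase().padStart(4, "0");
    rows.push(`U+${hex}  code units: ${character.length}`);
  }
  return rows.join("\n");
}

console.log(codePointsTableSketch("a\u{1F44D}"));
// U+0061  code units: 1
// U+1F44D  code units: 2
```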
Now run this with some of the glyphs from above, but include some Latin1 characters as well.
getCodePointsTable("Glyphs: 𓃓𓆡𓆣𓂀𓅠");
1 | ============================== |
Now we can see what happens when characters are used outside of the basic multilingual plane, and the extra care that is needed to process and work with this text. This might seem overly complex for building a typical web app, but bugs will come up if a developer tries to do text processing naively. It’s possible to break the encoding and have improperly encoded UTF-16.
In fact, let’s do that.
Broken Surrogate Pairs
I posted a tweet with a broken surrogate pair on Twitter. It turns out that Twitter did the right thing and stripped it out. However, I tried to do it again with a reply and putting an unmatched low surrogate mid-tweet. The second time the input visually notified me of my breakage, and would not let me send the tweet.
The browser is permissive in what it will display, and will show a glyph with the surrogate’s value when encountering an unmatched surrogate. This is because the DOMString is not technically UTF-16. It’s typically interpreted as UTF-16.
From MDN:
DOMString is a sequence of 16-bit unsigned integers, typically interpreted as UTF-16 code units.
Building a web page is a messy process. Servers and developers make mistakes all of the time. Having a hard error when an encoding issue in UTF-16 comes up is against the principles of displaying a webpage.
You can try this yourself:
var brokenString = "👍"[1] + "👍"[0];
Firefox, Safari, and Chrome each handle the surrogates a little bit differently. Firefox shows a glyph representing the hex value, Chrome shows the “replacement character” (� U+FFFD), while Safari shows nothing at all. The DOMString in all three is still the broken UTF-16.
There are also other standards that can deal with the realities of messy UTF-16. One such is USVString, which replaces the broken surrogates with the “replacement character” (� U+FFFD). This turns the string into one that is then safe to process. Similarly there is WTF-8, or “Wobbly Transformation Format, 8-bit”, which allows unpaired surrogates to be encoded in a UTF-8-like form.
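The replacement step can be sketched in a few lines. This is an illustration of the idea rather than the browser's own conversion code, and `toUsvStringSketch` is a made-up name; newer JavaScript engines expose similar behavior through `String.prototype.toWellFormed()`.

```javascript
// A sketch: replace every unpaired surrogate with U+FFFD.
function toUsvStringSketch(text) {
  let result = "";
  // for...of keeps valid surrogate pairs together, so any chunk whose
  // code point is still in the surrogate range must be unpaired.
  for (const character of text) {
    const codePoint = character.codePointAt(0);
    const isLoneSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
    result += isLoneSurrogate ? "\uFFFD" : character;
  }
  return result;
}

// The same broken string as above: low surrogate first, then high.
const reversedPair = "\uDC4D\uD83D";
console.log(toUsvStringSketch(reversedPair) === "\uFFFD\uFFFD");     // true
console.log(toUsvStringSketch("ok \u{1F44D}") === "ok \u{1F44D}");   // true
```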
One last thing
"👍🏽".length
At this point we’ve seen that the normal thumbs up emoji is made up of two code units. However, why does this single character have a length of 4?
getCodeUnitsTable("👍🏽")
1 | =================================== |
Looking at the code units, we can see that it’s made up of two pairs of surrogates.
getCodePointsTable("👍🏽");
1 | ============================== |
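The same result can be checked directly in the console. This sketch assumes the emoji is the thumbs up U+1F44D followed by the medium skin tone modifier U+1F3FD, and writes both as escapes so the two code points are explicit.

```javascript
// Assumed composition: thumbs up followed by a skin tone modifier.
const thumbsUpMedium = "\u{1F44D}\u{1F3FD}";

// Four code units, because each code point needs a surrogate pair.
console.log(thumbsUpMedium.length); // 4

// Two code points when iterating properly.
const modifierCodePoints = Array.from(thumbsUpMedium, (ch) =>
  ch.codePointAt(0).toString(16)
);
console.log(modifierCodePoints.join(" ")); // 1f44d 1f3fd
```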
This last thumbs up example is a tease to show that there is more going on with how code points interact with each other.
What’s next?
So far we’ve covered imaginary text encodings, code points, the indexable UTF-32 encoding, and the slightly variable length UTF-16 encoding. There’s still plenty to cover with grapheme clusters, diacritical marks, and the oh so variable UTF-8 encoding. Stay tuned for the next part in this series.
