Encoding Text in UTF-8 – How Unicode Works (Part 2)
In part 1 of this article I covered the idea of creating character sets, and different strategies for encoding them. The article covered UTF-32 and UTF-16 encodings with the benefits and drawbacks of each. However, for most documents, UTF-8 encoding is the most popular by far, but is more complicated in its implementation.
For a quick re-cap, a code point is a base unit of meaning in the Unicode. A code point can represent a single letter, like “a”, but it’s not always this simple. In more complex cases code points can combine together to represent graphemes. Code points can also define other behaviors beyond just simple letters displayed to the screen.
However, what we’re interested in with UTF-8 is to be able to store a code point in a variable list of single byte (u8) values. A byte is made up of 8 bits.
1 | Bits Bytes Max Code Points Visual Bit Size |
UTF-32 easily fits the potential 2,097,152 code points into a single u32. In part 1 I showed how UTF-16 can store the full code point range using surrogate pairs (two adjacent code points) to represent a higher plane code point. Code points in the basic multilingual plane (which tend to be more common) can be represented by a single u16, and then higher plane code points require two u16 values.
Single byte values and ASCII
UTF-8 is a variable length encoding. This means that each code point takes one or more bytes (u8 values) to be encoded. The easiest code points to encode in UTF-8 are the ASCII range values, or officially in unicode the “C0 Controls and Basic Latin” code block. This range of values takes 7 bits and can represent the first 128 code points.
Consider this sampling of code points from this block.
1 | Codepoint Binary Letter Name |
The common feature here is that the most significant bit is always 0. This serves as the binary signal that the byte represents a single code point, and does not require any more bytes. This is a neat property in that the UTF-8 encoding of ASCII characters is fully backwards compatible with ASCII encoded characters.
Time to get variable
The next step here is to consider what happens if the most significant bit is 1. This signals to the encoding that the code point uses a variable length encoding. For variable length encodings, the first byte will include information about how many bytes it takes to represent the code point. Let’s ignore the trailing bytes for now, and only look at the first byte.
110x_xxxx Signals that the code point requires 2 bytes to represent it.1110_xxxx Signals that the code point requires 3 bytes to represent it.1111_0xxx Signals that the code point requires 4 bytes to represent it.
The variable length of bytes after this leading byte all start with a 10xx_xxxx. Given this pattern, let’s buid out a table and analyze how many bits that leaves us to represent a codepoint. The x characters represent the available bits for encoding data.
1 | Byte Count Bytes and Bit Pattern Available bits |
21 bits can represent a maximum of 2,097,152 code points given that 2^21 is 2,097,152. In hex this number is 0x20_0000. The largest code point in Unicode is U+10_ffff, which means we’ve found the number of bits needed to represent all of the code points.
Doing the math in the table below shows the code point ranges, and the minimum bytes needed for a given range of Unicode code points.
1 | Byte Count Code Point Range Math |
Exploring the bytes on your own.
The following input can be used to test what the encoding of different text looks like with UTF-8. Feel free to change the contents and explore it.
Variable length leaves room for bad encodings
At this point the leading bits give a good indication of the expected number of bytes. However these encodings could be incorrect, giving bad (or potentially dangerous) results. The table below shows various incorrect combinations of bytes.
❌ An ASCII byte followed by a variable length byte:0xxxxxxx 10xxxxxx
❌ The first byte indicates that there is one more byte, however there are actually 2.110xxxxx 10xxxxxx 10xxxxxx
❌ The first byte indicates that there are two more byte, however the next byte is in the ASCII range.1110xxxx 0xxxxxxx
Surrogate pairs are not valid in UTF-8
The other point of failure for UTF-8 encoding is that UTF-16 surrogate pairs are not valid in UTF-8. As a reminder, these surrogate pairs are a way to encode higher plane values into UTF-16. However, in UTF-8, there is enough encoding space with the variable length of bytes to encode the full Unicode code point space. Because of this, these surrogates are considered invalid.
Iterating over characters
One thing striking about the Rust language, is that the basic string types are all valid UTF-8. This is in contrast with the JavaScript language, where all of the strings are UTF-16. Iterating over UTF-16 strings mostly works without worrying too much about it.
1 | // The string is UTF-16 encoded: |
As can be seen with the emoji, the iteration breaks when dealing with higher plane values that require surrogate pairs to represent them. In UTF-8, the encoding is much more variable, and iterating over the bytes directly would only work on ASCII-encoded text.
The Rust code point iterator
The following code examples are written using the Rust programming language. I attempted to write it in such a way that you don’t have to be familiar with Rust, and hopefully it will read like pseudo-code.
💡 Tip: Paste the code examples into the main() function in play.rust-lang.org/ and run them directly in your browser.
Now, iterating over the bytes of a UTF-8 string is probably not all that useful:
1 | // Create a new string, using the example from above and print it out. |
But iterating using the built-in char iterator is a lot more useful.
1 | // Create a new string, using the example from above and print it out. |
Since the bytes are variable length compared to the number of code points in the string, Rust provides a way to get either the underlying bytes, or an iterator for the code points (technically scalar values, more on this later). One key problem with working with UTF-8 is that you can’t just index directly into the data and know that you will be on a code point boundary, or how many code points in you are. Iteration is the key for working with code points.
So how does Rust build a code point iterator? Let’s build our own!
Building a code point iterator.
First off? What exactly is a char in Rust? Rust provides a std::mem::size_of function which will give you the size of a type in bytes.
1 | println!("Size of char: {:?} bytes", std::mem::size_of::<char>()); |
The char is 4 bytes, which is more than adequate to hold any Unicode code point. This type can then hold our decoded UTF-8 code point. Rather than use the char, we’ll use a raw u32 that matches the size of a char. Now it would be more Rust-like to have a true iterator here, but for the simplicity of this code example we will push the u32s on to a vector. To keep this example
1 | let string = "¡Hola! ⇨ 👋"; |
This example is pretty basic and limited so far, as it does not handle the variable length bytes. The following code is a bit larger piece of code. I’ve liberally added code comments to the code. The Rust is written in a way to hopefully be more readable to a casual programmer who does not have Rust experience. I favored simple code that is full of comments rather than focusing on writing idiomatic Rust.
1 | /// Take a string, and return the code points as `u32`s. The bytes of the string |
That was a bit of code, congrats on getting through it. This shows that iterating over UTF-8 code involves a bit more complexity compared to other encodings. There is a trade-off between the simplicity of the encoding and implementation and the memory size of the encoded text.
One thing that we did not handle in our code was checking for high and low surrogates. Because of this, we made a “Unicode code point” iterator, but not a “Unicode scalar value” iterator. The Rust docs for the primitive type char has the following line:
The char type represents a single character. More specifically, since ‘character’ isn’t a well-defined concept in Unicode, char is a ‘Unicode scalar value’, which is similar to, but not the same as, a ‘Unicode code point’.
This should makes sense because a Unicode scalar values is defined as.
Any Unicode code point except high-surrogate and low-surrogate code points.
The surrogate code points are only useful in the UTF-16 encoding scheme.
What’s next after code point iteration?
Code points are definitely more useful to iterate over compared to raw UTF-8 encoded bytes. However, code points aren’t always the boundary that we need for expressing what is drawn to a screen. It can take combinations of code points to express a character such as diacritical marks or emojis. Check out my article on diacritical marks for a full deep dive into this topic.
In order to iterate over other boundaries, we require more complicated text segmentation algorithms. These algorithms require more information like Unicode property tables to understand the meaning of code points. This requires data tables that must be kept up to date, and is beyond the scope of this article. There are ICU libraries for handling these more complicated iterators, and bindings and implementations available in different languages. For an example to play around with, there is the unicode_segmentation crate which provides iterators for grapheme clusters (essentially characters you see on the screen), words, and sentences.
UTF-8
UTF-8 is the most common encoding on the web by far. From Wikipedia:
UTF-8 has been the most common encoding for the World Wide Web since 2008. As of November 2021, UTF-8 accounts for on average 97.5% of all web pages
Hopefully this article helps shed some light on how this encoding works, and some of the challenges with working with a variable length encoding scheme.
