Greg Tatum

Encoding Text in UTF-8 – How Unicode Works (Part 2)

In part 1 of this article I covered the idea of creating character sets and different strategies for encoding them. That article covered the UTF-32 and UTF-16 encodings, with the benefits and drawbacks of each. For most documents, however, UTF-8 is by far the most popular encoding, though it is more complicated to implement. For a quick recap, a code point is a base unit of meaning in Unicode. A code point can represent a single…

Diacritical Marks in Unicode

I won’t bury the lede. By the end of this article you should be able to write your name in crazy diacritics like this: Ḡ͓̟̟r̬e̱̬͔͑g̰ͮ̃͛ ̇̅T̆a̐̑͢ṫ̀ǔ̓͟m̮̩̠̟. This article is part of the Unicode and i18n series motivated by my work with internationalization in Firefox and the Unicode ICU4X sub-committee. Unicode is made up of a variety of code points that can represent many things beyond just a simple letter. The code point itself is a numeric…

Encoding Text, UTF-32 and UTF-16 – How Unicode Works (Part 1)

For many years, the standard for representing human writing was ASCII, the American Standard Code for Information Interchange. This representation reserved 7 bits for encoding each character, which allows for 128 distinct values. This served early computing well, but did not scale as computers were used in more and more languages and cultures across the world. This article explains how this simple encoding grew into a standard that aims to represent the writing systems of every culture on Earth.