Published 2022-02-08 by Seth Michael Larson
Reading time: 8 minutes
If you're anything like me, you love emojis! Emojis appear as images on the screen, but they aren't images like a PNG or JPEG. What do emojis look like to computers?
More often than not, the mechanism used to turn bytes into characters and emojis on your computer is "UTF-8".
I recently learned how UTF-8 works and felt that the definition lent itself perfectly to creating diagrams explaining the implementation. I created these diagrams for my own enjoyment and wanted to share them. Hopefully this will inspire you to learn how other low-level protocols work too! All diagrams are available on GitHub.
UTF-8 is an encoding, currently defined in RFC 3629 (first published in 1996 as RFC 2044), that describes how Unicode characters are encoded into bytes. Unicode uses a concept called a "codepoint", which is essentially a number that maps to a single "character" within the Unicode standard. Unicode codepoints are often written as hex with a U+ prefix. For example, the character "😂" is codepoint 0x1F602 (128,514 in decimal) and would be written as U+1F602.
NOTE: I use the term "octet" in this article, which means "a grouping of 8 bits". Today's computers consider 8 bits to be 1 byte, but previously there were systems which used a different number of bits per "byte", hence the distinction. For our purposes an octet and a byte are the same thing.
Every octet that can be produced using UTF-8 will fall into one of five types. The octet will either be a "header" octet specifying a length of 1, 2, 3, or 4 octets or a "tail" octet which only holds onto data. You can determine what type each individual octet is by examining the high-order bits (in this representation these bits are left-most in a block and colored green).
Each octet also has "empty" spaces for bits (visualized as X in blue) which we'll eventually fill with data. You can also see the 4 unique layouts that a UTF-8 encoded codepoint can use to store different amounts of data (between 7 and 21 bits).
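To make the five octet types concrete, here's a small summary of my own (not taken from the RFC) mapping each bit prefix to its meaning and the number of data bits it leaves free:

# The five UTF-8 octet types, keyed by their high-order bit prefix.
OCTET_TYPES = {
    "0":     "1 octet header (7 data bits)",
    "10":    "tail octet (6 data bits)",
    "110":   "2 octet header (5 data bits)",
    "1110":  "3 octet header (4 data bits)",
    "11110": "4 octet header (3 data bits)",
}

Adding up the data bits per layout gives 7, 5+6=11, 4+6+6=16, and 3+6+6+6=21, which is where the range of 7 to 21 bits above comes from.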
Bit prefixing is a common technique that allows for encoding information while still leaving room in the octet for other information. It works by choosing a list of prefixes such that, when each bit is read from left to right, you eventually know unambiguously which prefix has been used. To get the most out of the technique, the shortest prefixes are assigned to the octet types that are used most often, maximizing the space left over for data.
In the case of UTF-8, the shortest prefixes are 0 and 10, where 10 is for the tail octet, which is used between 1 and 3 times per character for higher codepoints. The 0 prefix was likely selected more for its utility than its frequency when designing UTF-8: it maintains compatibility with US-ASCII, more on this later!
Looking at the prefixes for UTF-8 octets, the possibilities are 0, 10, 110, 1110, and 11110. If I start reading an octet from left to right and encounter the bits 1, 1, and 0, I immediately know without reading further that this octet is a "2 octet header" and can't possibly be any other octet type.
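Because the prefixes are unambiguous, classifying an octet only requires masking its high-order bits. Here's a minimal Python sketch (the function name is my own) of that idea:

def octet_type(octet: int) -> str:
    # Compare the high-order bits against each prefix, shortest first.
    if octet & 0b10000000 == 0b00000000:
        return "1 octet header"
    elif octet & 0b11000000 == 0b10000000:
        return "tail"
    elif octet & 0b11100000 == 0b11000000:
        return "2 octet header"
    elif octet & 0b11110000 == 0b11100000:
        return "3 octet header"
    elif octet & 0b11111000 == 0b11110000:
        return "4 octet header"
    raise ValueError("not a valid UTF-8 octet")

Running it over our emoji's bytes shows one header followed by three tails:

>>> [octet_type(o) for o in b"\xf0\x9f\x98\x82"]
['4 octet header', 'tail', 'tail', 'tail']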
Using Python we're able to see that our target output is \xf0\x9f\x98\x82:
>>> emoji = "😂"
>>> emoji.encode("utf-8")
b'\xf0\x9f\x98\x82'
Encoding a Unicode codepoint into bytes is a multi-step process. The first step is determining the number of octets required to encode the codepoint. The codepoint value for "Face with Tears of Joy" (😂) is 0x1F602. From the previous diagram, the value 0x1F602 falls in the range for a 4-octet header (between 0x10000 and 0x10FFFF).
The next step is converting the codepoint value 0x1F602 into binary. In Python you can do f"{0x1F602:b}" which will return '11111011000000010' as a string. This value is padded with zeroes until there are 21 bits to fit the layout for 4 octets. The padded value can be seen at the top of the "UTF-8 encoding" section in the diagram as "000 011111 011000 000010".
From there we lay out our octets: 1 header and 3 tail octets, for a total of 4. That leaves 21 empty bits to fill with data. Starting with the right-most tail octet and moving left, we fill in the empty bits of each octet. You can follow where each bit ends up with the arrows.
After that we turn each 8-bit grouping into a byte, which we display in hexadecimal at the bottom of the diagram. This value matches what we expected to receive from UTF-8 encoding this character: Success! ✨
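The whole encoding step can also be written with bit shifts and masks instead of strings. Here's a minimal sketch for the 4-octet layout only (a real encoder handles all four layouts): take the top 3 bits for the header and 6 bits for each tail:

def encode_4_octets(codepoint: int) -> bytes:
    return bytes([
        0b11110000 | (codepoint >> 18),               # header: top 3 data bits
        0b10000000 | ((codepoint >> 12) & 0b111111),  # tail: next 6 bits
        0b10000000 | ((codepoint >> 6) & 0b111111),   # tail: next 6 bits
        0b10000000 | (codepoint & 0b111111),          # tail: lowest 6 bits
    ])

>>> encode_4_octets(0x1F602)
b'\xf0\x9f\x98\x82'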
Decoding a Unicode codepoint from bytes only requires reversing the above process. Each byte will be examined first for the header type and then individual bits will be extracted from each octet and added together to reproduce the codepoint.
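As a sketch of that reversal for our 4-octet example (again, a real decoder handles all layouts and validates its input), we mask off each prefix and shift the data bits back into place:

def decode_4_octets(data: bytes) -> int:
    return (
        (data[0] & 0b00000111) << 18    # 3 data bits from the header
        | (data[1] & 0b00111111) << 12  # 6 data bits from each tail
        | (data[2] & 0b00111111) << 6
        | (data[3] & 0b00111111)
    )

>>> hex(decode_4_octets(b"\xf0\x9f\x98\x82"))
'0x1f602'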
NOTE: You can always find a codepoint boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10, which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.
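That resynchronization trick is easy to sketch in Python too: walk left while the current octet matches the tail prefix 10 (the function name is mine):

def find_header(data: bytes, index: int) -> int:
    # Move left past tail octets until a header octet is found.
    while data[index] & 0b11000000 == 0b10000000:
        index -= 1
    return index

>>> find_header(b"\xf0\x9f\x98\x82", 3)  # start on the last tail octet
0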
Back when UTF-8 was first introduced, there were many systems that didn't understand any character encoding beyond "US-ASCII". This meant that whenever data encoded with another Unicode encoding was used, those systems would produce garbage, even if the characters were within the US-ASCII range.
UTF-8's use of the bit prefix 0 for the range 0x00-0x7F means that all characters within the US-ASCII range are encoded exactly as they would have been had they been explicitly encoded using US-ASCII instead of UTF-8.
This was a big win for compatibility, as it meant many systems could start using UTF-8 as an encoding immediately. As long as input data wasn't outside of the US-ASCII range, the encoded bytes would not change, which allowed for incremental adoption within a group of systems instead of having to switch "all at once" to a new encoding.
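You can verify this compatibility directly in Python:

>>> "hello".encode("utf-8")
b'hello'
>>> "hello".encode("utf-8") == "hello".encode("ascii")
True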
Below is a diagram which shows all the different possible headers that a codepoint could be encoded with in UTF-8. You can use it for your reference, or just admire all the time I spent in diagrams.net 😂.
Grapheme clusters are mentioned in the diagram; simply put, they are multiple codepoints that, when placed together, "combine" into a single "thing" drawn on the screen (that "thing" being a "grapheme"). Maybe I'll write about these in the future!