Published 2022-02-08 by Seth Michael Larson
Reading time: 8 minutes
If you're anything like me, you love emojis! Emojis appear as images on the screen, but they aren't images like a PNG or JPEG. What do emojis look like to computers?
More often than not, the mechanism used to turn bytes into characters and emojis on your computer is "UTF-8".
I recently learned how UTF-8 works and felt that the definition lent itself perfectly to creating diagrams explaining the implementation. I created these diagrams for my own enjoyment and wanted to share them. Hopefully this will inspire you to learn how other low-level protocols work too! All diagrams are available on GitHub.
UTF-8 is an encoding, currently defined in RFC 3629 (first published in 1996 as RFC 2044), that describes how Unicode characters are encoded into bytes. Unicode uses a concept called a "codepoint", which is essentially a number that maps to a single "character" within the Unicode standard. Unicode codepoints are often written as hex with a U+ prefix. For example, the character "😂" is codepoint 0x1F602 (128,514 in decimal) and would be written as U+1F602.
NOTE: I use the term "octet" in this article, which means "a grouping of 8 bits". Today's computers consider 8 bits to be 1 byte, but previously there were systems which used a different number of bits per "byte", hence the distinction. For our purposes an octet and a byte are the same thing.
Every octet that can be produced using UTF-8 will fall into one of five types. The octet will either be a "header" octet specifying a length of 1, 2, 3, or 4 octets or a "tail" octet which only holds onto data. You can determine what type each individual octet is by examining the high-order bits (in this representation these bits are left-most in a block and colored green).
Each octet also has "empty" spaces for bits (visualized as X in blue) which we'll eventually fill with data. You can also see the 4 unique layouts that a UTF-8 encoded codepoint can use to store different amounts of data (between 7 and 21 bits).
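To make the five octet types concrete, here's a small summary of my own (not taken from the RFC) mapping each bit prefix to its meaning and the number of data bits it leaves free:

# The five UTF-8 octet types, keyed by their high-order bit prefix.
OCTET_TYPES = {
    "0":     "1 octet header (7 data bits)",
    "10":    "tail octet (6 data bits)",
    "110":   "2 octet header (5 data bits)",
    "1110":  "3 octet header (4 data bits)",
    "11110": "4 octet header (3 data bits)",
}

Adding up the data bits per layout gives 7, 5+6=11, 4+6+6=16, and 3+6+6+6=21, which is where the range of 7 to 21 bits above comes from.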
Bit prefixing is a common technique that allows for encoding information while still leaving room in the octet for other information. It works by choosing a list of prefixes such that, when each bit is read from left to right, you eventually know unambiguously which prefix has been used. To get the most out of the technique, the shortest prefixes are assigned to the octet types that are used most often, maximizing the space left over for data.
In the case of UTF-8, the shortest prefixes are 0 and 10, where 10 is for the tail octet, which is used between 1 and 3 times per character for higher codepoints. The 0 prefix was likely selected more for its utility than its frequency when designing UTF-8: it maintains compatibility with US-ASCII, more on this later!
Looking at the prefixes for UTF-8 octets, the possibilities are 0, 10, 110, 1110, and 11110. If I start reading an octet from left to right and encounter the bits 1, 1, and 0, I immediately know without reading further that this octet is a "2 octet header" and can't possibly be any other octet type.
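Because the prefixes are unambiguous, classifying an octet only requires masking its high-order bits. Here's a minimal Python sketch (the function name is my own) of that idea:

def octet_type(octet: int) -> str:
    # Compare the high-order bits against each prefix, shortest first.
    if octet & 0b10000000 == 0b00000000:
        return "1 octet header"
    elif octet & 0b11000000 == 0b10000000:
        return "tail"
    elif octet & 0b11100000 == 0b11000000:
        return "2 octet header"
    elif octet & 0b11110000 == 0b11100000:
        return "3 octet header"
    elif octet & 0b11111000 == 0b11110000:
        return "4 octet header"
    raise ValueError("not a valid UTF-8 octet")

Running it over our emoji's bytes shows one header followed by three tails:

>>> [octet_type(o) for o in b"\xf0\x9f\x98\x82"]
['4 octet header', 'tail', 'tail', 'tail']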
Using Python we're able to see that our target output is \xf0\x9f\x98\x82:
>>> emoji = "😂"
>>> emoji.encode("utf-8")
b'\xf0\x9f\x98\x82'
Encoding a Unicode codepoint into bytes is a multi-step process. The first step is determining the number of octets required to encode the codepoint. The codepoint value for "Face with Tears of Joy" (😂) is 0x1F602. From the previous diagram, the value 0x1F602 falls in the range for a 4-octet header (between 0x10000 and 0x10FFFF).
The next step is converting the codepoint value 0x1F602 into binary. In Python you can do f"{0x1F602:b}" which will return '11111011000000010' as a string. This value is padded with zeroes until there are 21 bits to fit the layout for 4 octets. The padded value can be seen at the top of the "UTF-8 encoding" section in the diagram as "000 011111 011000 000010".
From there we lay out our octets: 1 header and 3 tail octets, for a total of 4. That leaves 21 empty bits to fill with data. Starting with the right-most tail octet and moving left, we fill in the empty bits of each octet. You can follow where each bit ends up with the arrows.
After that we turn each 8-bit grouping into a byte, which we display in hexadecimal at the bottom of the diagram. This value matches what we expected to receive from UTF-8 encoding this character: Success! ✨
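The whole encoding step can also be written with bit shifts and masks instead of strings. Here's a minimal sketch for the 4-octet layout only (a real encoder handles all four layouts): take the top 3 bits for the header and 6 bits for each tail:

def encode_4_octets(codepoint: int) -> bytes:
    return bytes([
        0b11110000 | (codepoint >> 18),               # header: top 3 data bits
        0b10000000 | ((codepoint >> 12) & 0b111111),  # tail: next 6 bits
        0b10000000 | ((codepoint >> 6) & 0b111111),   # tail: next 6 bits
        0b10000000 | (codepoint & 0b111111),          # tail: lowest 6 bits
    ])

>>> encode_4_octets(0x1F602)
b'\xf0\x9f\x98\x82'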
Decoding a Unicode codepoint from bytes only requires reversing the above process. Each byte will be examined first for the header type and then individual bits will be extracted from each octet and added together to reproduce the codepoint.
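As a sketch of that reversal for our 4-octet example (again, a real decoder handles all layouts and validates its input), we mask off each prefix and shift the data bits back into place:

def decode_4_octets(data: bytes) -> int:
    return (
        (data[0] & 0b00000111) << 18    # 3 data bits from the header
        | (data[1] & 0b00111111) << 12  # 6 data bits from each tail
        | (data[2] & 0b00111111) << 6
        | (data[3] & 0b00111111)
    )

>>> hex(decode_4_octets(b"\xf0\x9f\x98\x82"))
'0x1f602'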
NOTE: You can always find a codepoint boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10, which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.
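That resynchronization trick is easy to sketch in Python too: walk left while the current octet matches the tail prefix 10 (the function name is mine):

def find_header(data: bytes, index: int) -> int:
    # Move left past tail octets until a header octet is found.
    while data[index] & 0b11000000 == 0b10000000:
        index -= 1
    return index

>>> find_header(b"\xf0\x9f\x98\x82", 3)  # start on the last tail octet
0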
Back when UTF-8 was first introduced, there were many systems that didn't understand any character encoding beyond "US-ASCII". This meant that whenever data encoded with another Unicode encoding was used, those systems would produce garbage, even if the characters were within the US-ASCII range.
UTF-8's use of the bit prefix 0 for the range 0x00-0x7F means that all characters within the US-ASCII range are encoded exactly as they would have been had they been explicitly encoded using US-ASCII instead of UTF-8.
This was a big win for compatibility, as it meant many systems could start using UTF-8 as an encoding immediately. As long as input data wasn't outside of the US-ASCII range, the encoded bytes would not change, which allowed for incremental adoption within a group of systems instead of having to switch "all at once" to a new encoding.
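You can verify this compatibility directly in Python:

>>> "hello".encode("utf-8")
b'hello'
>>> "hello".encode("utf-8") == "hello".encode("ascii")
True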
Below is a diagram which shows all the different possible headers that a codepoint could be encoded with in UTF-8. You can use it for your reference, or just admire all the time I spent in diagrams.net 😂.
Grapheme clusters are mentioned in the diagram; simply put, they are multiple codepoints that, when placed together, "combine" into a single "thing" drawn on the screen (that "thing" being a "grapheme"). Maybe I'll write about these in the future!