Seth Michael Larson

Blogging about Python and the Internet

How does UTF-8 turn “😂” into “F09F9882”?

Published 2022-02-08 - ❤︎ Subscribe for more via the newsletter or RSS

If you're anything like me, you love emojis! Emojis appear like an image on the screen, but they aren't an image like a PNG or JPEG. What do emojis look like to computers?





[Diagram: the emoji 😂 passes through an unknown step "???" and comes out as the bytes 0xF0 0x9F 0x98 0x82]

More often than not the mechanism being used to turn bytes into characters and emojis on your computer is "UTF-8".

I recently learned how UTF-8 works and felt that the definition lent itself perfectly to creating diagrams explaining the implementation. I created these diagrams for my own enjoyment and wanted to share them. Hopefully this will inspire you to learn how other low-level protocols work too! All diagrams are available on GitHub.

What is UTF-8?

UTF-8 is an encoding, currently defined in RFC 3629 (first published in 1996 as RFC 2044), that describes how to encode Unicode characters into bytes. Unicode uses a concept called a "codepoint" which is essentially a number that maps to a single "character" within the Unicode standard. Unicode codepoints are often written as hex with a U+ prefix. For example, the character "😂" is codepoint 0x1F602 (128,514 in decimal) and would be written as U+1F602.
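If you want to check a codepoint yourself, here is a minimal sketch using only Python built-ins (`ord`, `chr`, and format specifiers). The variable names are my own and just for illustration.

```python
# Look up the codepoint for a character and write it in U+ notation.
emoji = "\U0001F602"  # Face with Tears of Joy, written as an escape

codepoint = ord(emoji)           # the codepoint as a plain integer
notation = f"U+{codepoint:04X}"  # upper-case hex, at least 4 digits

print(codepoint)               # 128514
print(hex(codepoint))          # 0x1f602
print(notation)                # U+1F602
print(chr(0x1F602) == emoji)   # True, chr() is the inverse of ord()
```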

The 5 octets of UTF-8

NOTE: I use the term "octet" in this article which means "a grouping of 8 bits". Today's computers consider 8 bits to be 1 byte, but previously there were systems which used a different number of bits per "byte" hence the distinction. For our purposes an octet and a byte are the same thing.

Every octet that can be produced using UTF-8 will fall into one of five types. The octet will either be a "header" octet specifying a length of 1, 2, 3, or 4 octets or a "tail" octet which only holds onto data. You can determine what type each individual octet is by examining the high-order bits (in this representation these bits are left-most in a block and colored green).

Each octet also has "empty" spaces for bits (visualized as X in blue) which we'll eventually fill with data. You can also see the 4 unique layouts that a UTF-8 encoded codepoint can use to store different amounts of data (between 7 and 21 bits).

[Diagram: UTF-8 octets and UTF-8 octet layouts]

UTF-8 header octets:
0xxxxxxx: 1 octet header, U+0000–U+007F
110xxxxx: 2 octet header, U+0080–U+07FF
1110xxxx: 3 octet header, U+0800–U+FFFF
11110xxx: 4 octet header, U+10000–U+10FFFF
10xxxxxx: Tail/data octet

UTF-8 octet layouts:
0xxxxxxx: 7 data bits
110xxxxx 10xxxxxx: 11 data bits
1110xxxx 10xxxxxx 10xxxxxx: 16 data bits
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 data bits
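The codepoint ranges in the diagram translate fairly directly into code. Below is a rough sketch of a helper that picks the layout for a codepoint; `utf8_length` is a name I made up for this example, and the sketch ignores the surrogate range (U+D800–U+DFFF), which valid UTF-8 never encodes.

```python
def utf8_length(codepoint: int) -> int:
    """Return how many octets UTF-8 needs for a codepoint (hypothetical helper)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint is outside of the Unicode range")
    if codepoint <= 0x7F:    # fits in 7 data bits
        return 1
    if codepoint <= 0x7FF:   # fits in 11 data bits
        return 2
    if codepoint <= 0xFFFF:  # fits in 16 data bits
        return 3
    return 4                 # needs the 21 data bit layout


print(utf8_length(0x61))     # 1
print(utf8_length(0x1F602))  # 4
```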

What are bit prefixes?

Bit prefixing is a common technique for encoding information in the first few bits of an octet while still leaving room in the octet for other information. It works by choosing a list of prefixes such that, when the bits are read from left to right, you eventually know unambiguously which prefix has been used.

Shorter prefixes leave more room for data, so bit prefixing is most useful when the shortest prefixes are given to the octet types that are used most often.

In the case of UTF-8 the shortest prefixes are 0 and 10, where 10 marks the tail octet, which is used 1 to 3 times per character for higher codepoint characters. The 0 prefix was likely selected for its utility rather than its frequency: it lets UTF-8 maintain compatibility with US-ASCII. More on this later!

Looking at the prefixes for UTF-8 octets the possibilities are 0, 10, 110, 1110, and 11110. If I start reading an octet from left to right and encounter the bits 1, 1, and 0 I immediately know without reading further that this octet is a "2 octet header" and can't possibly be any other octet type.
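Reading a prefix from left to right can be sketched with bit shifts: shifting an octet right until only the candidate prefix remains. `octet_type` is a hypothetical helper for this article, and it only looks at the prefix. A few octet values (0xC0, 0xC1, and 0xF5 through 0xF7) match a prefix but never appear in valid UTF-8, which this sketch doesn't check.

```python
def octet_type(octet: int) -> str:
    """Classify a single octet by its bit prefix (illustrative only)."""
    if octet >> 7 == 0b0:
        return "1 octet header"
    if octet >> 6 == 0b10:
        return "tail octet"
    if octet >> 5 == 0b110:
        return "2 octet header"
    if octet >> 4 == 0b1110:
        return "3 octet header"
    if octet >> 3 == 0b11110:
        return "4 octet header"
    raise ValueError(f"0x{octet:02X} doesn't start with a UTF-8 prefix")


# Iterating over bytes yields integers, one per octet.
for octet in b"\xf0\x9f\x98\x82":
    print(f"0x{octet:02X} {octet:08b} {octet_type(octet)}")
```

Because no prefix is the start of another prefix, the order of the checks doesn't change the result.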

Following the flow of bits

Using Python we're able to see that our target output is \xf0\x9f\x98\x82:

>>> emoji = "😂"
>>> emoji.encode("utf-8")
b'\xf0\x9f\x98\x82'

Encoding a Unicode codepoint into bytes is a multi-step process. The first step is determining the number of octets required to encode the codepoint. The codepoint value for "Face with Tears of Joy" ( 😂 ) is 0x1F602. From the previous diagram the value 0x1F602 falls in the range for a 4 octet header (between 0x10000 and 0x10FFFF).

The next step is converting the codepoint value 0x1F602 into binary. In Python you can do f"{0x1F602:b}" which will return '11111011000000010' as a string. This value is padded with zeroes until there are 21 bits to fit the layout for 4 octets. This padded value can be seen on the top of the "UTF-8 encoding" section in the diagram as "000 011111 011000 000010".
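The padding and grouping can also be done in Python. This is a small sketch for the 4 octet layout only; the `021b` format specifier pads the binary string with zeroes to 21 digits, and the slice positions follow the 3 + 6 + 6 + 6 split of data bits.

```python
codepoint = 0x1F602

# Zero-pad the binary representation to the 21 data bits of the 4 octet layout.
bits = f"{codepoint:021b}"

# 3 bits go in the header octet, then 6 bits in each of the 3 tail octets.
groups = [bits[:3], bits[3:9], bits[9:15], bits[15:]]

print(bits)              # 000011111011000000010
print(" ".join(groups))  # 000 011111 011000 000010
```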

[Diagram: encoding the Unicode codepoint U+1F602 (😂) into UTF-8]

Codepoint bits: 000 011111 011000 000010
Header octet: 11110 + 000 = 11110000 = 0xF0
Tail octet 1: 10 + 011111 = 10011111 = 0x9F
Tail octet 2: 10 + 011000 = 10011000 = 0x98
Tail octet 3: 10 + 000010 = 10000010 = 0x82

From there we lay out our 4 octets: 1 header octet followed by 3 tail octets. Together they have 21 empty bits to fill with data. Starting with the third tail octet and working from right to left, we fill the empty bits in each octet. You can follow where each bit ends up with the arrows.

After that we turn each 8 bit grouping into a byte which we display in hexadecimal at the bottom of the diagram. This value matches what we expected to receive from UTF-8 encoding this character: Success! ✨
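To tie the steps together, here is a sketch of a hand-rolled encoder. It is not how Python implements `str.encode()`, just my own illustration of the process described above, and `encode_utf8` is a made-up name. Like the diagram, it fills the tail octets from right to left.

```python
def encode_utf8(codepoint: int) -> bytes:
    """Encode one codepoint into UTF-8 by hand (illustrative sketch)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint is outside of the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogates can't be encoded in UTF-8")

    # 1 octet layout: prefix 0 followed by 7 data bits.
    if codepoint <= 0x7F:
        return bytes([codepoint])

    # Pick the header prefix and number of tail octets from the range.
    if codepoint <= 0x7FF:
        header_prefix, tail_count = 0b11000000, 1
    elif codepoint <= 0xFFFF:
        header_prefix, tail_count = 0b11100000, 2
    else:
        header_prefix, tail_count = 0b11110000, 3

    octets = []
    for _ in range(tail_count):
        # Take the lowest 6 bits and put them behind the 10 prefix.
        octets.append(0b10000000 | (codepoint & 0b00111111))
        codepoint >>= 6
    # Whatever bits remain fit in the header octet.
    octets.append(header_prefix | codepoint)

    # The octets were collected right to left, so reverse them.
    return bytes(reversed(octets))


print(encode_utf8(0x1F602))  # b'\xf0\x9f\x98\x82'
```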

Decoding a Unicode codepoint from bytes only requires reversing the above process. Each byte is examined first for its header type, then the data bits are extracted from each octet and concatenated to reproduce the codepoint.
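The reverse direction can be sketched in the same spirit. `decode_utf8_codepoint` is a hypothetical helper that decodes only the first codepoint of a non-empty input. A real decoder must also reject overlong encodings and surrogates, which this sketch skips for brevity.

```python
def decode_utf8_codepoint(data: bytes) -> int:
    """Decode the first codepoint from UTF-8 bytes (illustrative sketch)."""
    header = data[0]

    # The header prefix gives the number of tail octets and
    # which bits of the header octet hold data.
    if header >> 7 == 0b0:
        tail_count, codepoint = 0, header & 0b01111111
    elif header >> 5 == 0b110:
        tail_count, codepoint = 1, header & 0b00011111
    elif header >> 4 == 0b1110:
        tail_count, codepoint = 2, header & 0b00001111
    elif header >> 3 == 0b11110:
        tail_count, codepoint = 3, header & 0b00000111
    else:
        raise ValueError("first octet isn't a header octet")

    tails = data[1:1 + tail_count]
    if len(tails) != tail_count:
        raise ValueError("input ends before the last tail octet")

    for octet in tails:
        if octet >> 6 != 0b10:
            raise ValueError("expected a tail octet")
        # Make room for 6 more bits, then append the tail's data bits.
        codepoint = (codepoint << 6) | (octet & 0b00111111)
    return codepoint


print(hex(decode_utf8_codepoint(b"\xf0\x9f\x98\x82")))  # 0x1f602
```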

NOTE: You can always find a codepoint boundary from an arbitrary point in a stream of octets by moving left an octet each time the current octet starts with the bit prefix 10 which indicates a tail octet. At most you'll have to move left 3 octets to find the nearest header octet.
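Here's a short sketch of that boundary search, assuming the input is valid UTF-8. `codepoint_start` is a name I picked for illustration.

```python
def codepoint_start(data: bytes, index: int) -> int:
    """Walk left from index until we're no longer on a tail octet."""
    while index > 0 and data[index] >> 6 == 0b10:
        index -= 1
    return index


data = "a\U0001F602".encode("utf-8")  # b'a\xf0\x9f\x98\x82'

print(codepoint_start(data, 4))  # 1, the 0xF0 header octet
print(codepoint_start(data, 0))  # 0, "a" is its own header octet
```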

Why is UTF-8 everywhere?

Back when UTF-8 was first introduced there were many systems that didn't understand any character encoding beyond "US-ASCII". This meant whenever data encoded with another Unicode encoding was used then that system would produce garbage, even if the characters were within the US-ASCII range.

UTF-8 reserves the bit prefix 0 for exactly the range 0x00-0x7F covered by US-ASCII. This means that all characters within the US-ASCII range are encoded exactly as they would have been had they been explicitly encoded using US-ASCII instead of UTF-8.

This was a big win for compatibility as it meant many systems could start using UTF-8 as an encoding immediately. As long as input data wasn't outside of the US-ASCII range the encoded bytes would not change which allowed for incremental adoption within a group of systems instead of having to switch "all at once" to a new encoding.
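You can see this compatibility from Python with nothing but the built-in codecs. The sample strings below are arbitrary ones I chose for the demonstration.

```python
text = "hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Within the US-ASCII range both encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)    # True

# So bytes written by an ASCII-only system are already valid UTF-8.
print(ascii_bytes.decode("utf-8"))  # hello

# Outside of the US-ASCII range every octet has its high bit set,
# so those octets can't be mistaken for ASCII characters.
mixed = "hello \U0001F602".encode("utf-8")
print([hex(octet) for octet in mixed if octet >= 0x80])
# ['0xf0', '0x9f', '0x98', '0x82']
```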

[Diagram: the Unicode codepoint U+61 ("a") encoded with UTF-8 and with ASCII]

UTF-8 encoding: 0 + 1100001 = 01100001 = 0x61
ASCII encoding: 01100001 = 0x61

Giant reference card

Below is a diagram which shows all the different header types that a codepoint can be encoded with in UTF-8. You can use it for your reference or just admire all the time I spent in diagrams.net 😅.

Grapheme clusters are mentioned in the diagram. Simply put, they are multiple codepoints that when placed together will "combine" into a single "thing" drawn on the screen (that "thing" being a "grapheme"). Maybe I'll write about these in the future!

[Diagram: UTF-8 reference card]

Unicode codepoint, UTF-8 encoding, bytes:
a (U+61): 1100001 = 01100001 = 0x61
đ (U+0111): 00100 010001 = 11000100 10010001 = 0xC4 0x91
ở (U+1EDF): 0001 111011 011111 = 11100001 10111011 10011111 = 0xE1 0xBB 0x9F

Grapheme cluster 🇺🇸:
🇺 (U+1F1FA): 000 011111 000111 111010 = 11110000 10011111 10000111 10111010 = 0xF0 0x9F 0x87 0xBA
🇸 (U+1F1F8): 000 011111 000111 111000 = 11110000 10011111 10000111 10111000 = 0xF0 0x9F 0x87 0xB8

UTF-8 octets:
0xxxxxxx: 1 octet header, U+0000–U+007F
110xxxxx: 2 octet header, U+0080–U+07FF
1110xxxx: 3 octet header, U+0800–U+FFFF
11110xxx: 4 octet header, U+10000–U+10FFFF
10xxxxxx: Tail/data octet
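The grapheme cluster from the reference card is easy to poke at from Python. In this small sketch the flag is written as two escaped codepoints so it doesn't depend on your editor's emoji support; note that `bytes.hex()` only accepts a separator on Python 3.8 and later.

```python
# One flag on screen, but two codepoints underneath:
# REGIONAL INDICATOR SYMBOL LETTER U followed by LETTER S.
flag = "\U0001F1FA\U0001F1F8"

print(len(flag))                          # 2
print([f"U+{ord(c):04X}" for c in flag])  # ['U+1F1FA', 'U+1F1F8']

# Each codepoint is encoded to UTF-8 independently, 4 octets each.
print(flag.encode("utf-8").hex(" "))      # f0 9f 87 ba f0 9f 87 b8
```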

