Bits and Bytes: ASCII and Unicode

[Image: binary code. Photo credit: Turkei89]

After running into my recent character encoding conflict with Python, I decided to write up a basic explanation of what character encoding actually means and how the major encodings differ.

As you probably already know, computers at their core ultimately speak a language of just two “letters”: 1’s and 0’s, True or False, something or nothing, on or off. That’s why it’s called binary code; there are only two opposing options. Each individual one or zero is known as a bit, a portmanteau derived from “binary digit.”

We humans of course speak languages with many more letters; the Cambodian Khmer script has 74 characters! So how do you get a computer to understand all of that? In the early computing days, way back in 1963, this conflict was addressed with the development of the ASCII standard (American Standard Code for Information Interchange). ASCII links each letter or character to a unique 7-bit code made up of computer-readable ones and zeroes, and in practice each code is stored in an 8-bit chunk. These 8-bit chunks are known as bytes, a term coined by computer science pioneer Dr. Werner Buchholz. The name fits, doesn’t it? Little bites of code. Anyway, 8 bits of binary data give you up to 256 different combinations (that’s 2⁸), numbered 0 to 255; ASCII itself only defines the first 128 of them (0 to 127), and “extended ASCII” variants use the rest. Now our computer has a standard set of characters with which to work!
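If you’d like to see those numbers for yourself, here’s a minimal Python sketch (Python only because it’s what got me into this mess in the first place); it just prints the 2⁸ count and the byte that ASCII stores for the letter A:

```python
# Eight bits give 2**8 = 256 possible values; ASCII only defines the first 128.
print(2 ** 8)  # 256

# Encoding a single character shows the byte that actually gets stored.
encoded = "A".encode("ascii")
print(encoded[0])                 # 65 -- the decimal code for 'A'
print(format(encoded[0], "08b"))  # 01000001 -- the same value written as 8 bits
```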

 

In the ASCII standard, each character is assigned one of these slots, and it can be expressed by its binary code, its decimal number (0 – 127), its hexadecimal code, or the human-readable glyph (what you see on your screen or keyboard). For example, an asterisk can be expressed as follows:

Binary: 00101010
Decimal: 042
Hexadecimal: 2A
Glyph: *
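Python’s built-in ord(), format(), and chr() functions will do these conversions for you, so here’s a quick sanity check of that little table:

```python
codepoint = ord("*")             # look up the asterisk's code
print(format(codepoint, "08b"))  # 00101010  (binary)
print(codepoint)                 # 42        (decimal)
print(format(codepoint, "X"))    # 2A        (hexadecimal)
print(chr(codepoint))            # *         (glyph)
```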

Neat! But wait: the first 32 slots are reserved for control/formatting characters, different languages use different alphabets, and what about accents and symbols and special characters? We run into the limitations of a single-byte encoding pretty quickly.
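In Python, that limitation shows up as soon as you try to squeeze a non-ASCII character into ASCII; something like the following is roughly the wall I hit in the first place:

```python
# 'é' is code point 233, which doesn't fit in ASCII's 0-127 range.
try:
    "é".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # ... can't encode character '\xe9': ordinal not in range(128)
```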

In order to truly be able to express all the characters we might need, we had to move on from the single-byte model to something that allows for even more combinations of 1’s and 0’s. This expanded alphabet eventually became standardized as the Universal Character Set, or Unicode. The original Unicode encoding (now called UCS-2) doubled the size of each character to 16 bits, and 2¹⁶ gives us a whopping 65,536 possible combinations!

The development of Unicode gets more technical from there, and there are a few different Unicode encodings in use today, but by far the most common are UTF-8 and UTF-16. Where ASCII and UCS-2 both work in fixed-length code units (1 byte or 2 bytes per character, respectively), UTF-8 and UTF-16 are variable-length, allowing for even more combinations. UTF-16 extends UCS-2 by using one 16-bit code unit for each of the original UCS-2 characters and two 16-bit code units (32 bits total) for additional characters. UTF-8 uses one byte for the first 128 characters, just like ASCII, and up to 4 bytes for additional characters.
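You can watch that variable-length behavior in Python by encoding a few characters and counting the bytes that come back (the characters below are just ones I picked to span the ranges):

```python
# Byte counts for the same characters in UTF-8 vs. UTF-16.
# ("utf-16-le" avoids the 2-byte byte-order mark skewing the counts.)
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))

# Output:
# A 1 2
# é 2 2
# € 3 2
# 😀 4 4   <- the emoji takes four bytes in both: a UTF-16 surrogate pair
```

The one-byte “A” is exactly why UTF-8 stays backward compatible with plain ASCII text.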

For an even more detailed explanation that’s still easy to understand, check out one or both of the following articles:

Unicode and You

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)