Character encoding
Character encoding is an important concept in the process of converting byte streams into characters that can be displayed. Converting bytes into characters mainly involves two things: a character set and an encoding scheme.
Character set
A character set is nothing but a list of characters, where each symbol or character is mapped to a numeric value, also known as a code point. Since there are so many characters and symbols in the world, a character set is required to support all of them.
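A minimal sketch in Java of this character-to-code-point mapping (the sample characters and the class name are arbitrary choices for illustration; the characters are written as Unicode escapes so the file compiles regardless of source encoding):

```java
public class CodePoints {
    public static void main(String[] args) {
        // Each character maps to a numeric code point in the character set.
        // "A" is U+0041 (decimal 65), the euro sign is U+20AC (8364),
        // and the grinning-face emoji is U+1F600 (128512).
        String sample = "A\u20AC\uD83D\uDE00";
        // Iterate over code points rather than chars, so characters
        // outside the Basic Multilingual Plane are handled as one unit.
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X  (decimal %d)%n", cp, cp));
    }
}
```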
Encoding Schemes
Encoding schemes describe how code points are mapped to bytes, using code units of different sizes as a basis: 8 bits for UTF-8, 16 bits for UTF-16 and 32 bits for UTF-32. UTF stands for Unicode Transformation Format, and each UTF defines an algorithm to map every Unicode code point to a unique byte sequence.
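A minimal sketch showing the same code point (U+20AC, the euro sign) mapped to different byte sequences by each scheme. The class name is hypothetical; the big-endian charset variants are used here so no byte-order mark appears in the output, and "UTF-32BE" is assumed to be available (it is in typical JDKs, though it is not in `StandardCharsets`):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSchemes {
    public static void main(String[] args) {
        String euro = "\u20AC"; // code point U+20AC
        dump("UTF-8   ", euro.getBytes(StandardCharsets.UTF_8));
        dump("UTF-16BE", euro.getBytes(StandardCharsets.UTF_16BE));
        dump("UTF-32BE", euro.getBytes(Charset.forName("UTF-32BE")));
    }

    // Print each byte in hex plus the total length.
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        System.out.println(label + " -> " + sb.toString().trim()
                + "  (" + bytes.length + " bytes)");
    }
}
```

Running this prints `E2 82 AC` (3 bytes) for UTF-8, `20 AC` (2 bytes) for UTF-16, and `00 00 20 AC` (4 bytes) for UTF-32: one code point, three different byte sequences.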
Difference between UTF-8, UTF-16 and UTF-32
UTF-8 | UTF-16 | UTF-32 |
---|---|---|
Variable-length encoding | Variable-length encoding | Fixed-width encoding |
Takes 1 byte at minimum and 4 bytes at maximum, depending upon the code point | Takes either 2 or 4 bytes | Always takes exactly 4 bytes |
Compatible with ASCII | Incompatible with ASCII | Incompatible with ASCII |
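A sketch that verifies the table empirically (the sample characters and class name are arbitrary; "UTF-32BE" is assumed to be available in the JDK as above). It also previews the ASCII-compatibility point from the summary below:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SizeComparison {
    public static void main(String[] args) {
        // Arbitrary samples: ASCII "A", accented Latin "é" (U+00E9),
        // CJK "中" (U+4E2D), and an emoji (U+1F600).
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        Charset utf32 = Charset.forName("UTF-32BE");
        for (String s : samples) {
            System.out.printf("%-4s UTF-8: %d bytes, UTF-16: %d bytes, UTF-32: %d bytes%n",
                s,
                s.getBytes(StandardCharsets.UTF_8).length,
                s.getBytes(StandardCharsets.UTF_16BE).length,
                s.getBytes(utf32).length);
        }
        // ASCII compatibility: for pure ASCII text, the UTF-8 bytes
        // are identical to the US-ASCII bytes.
        String ascii = "Hello";
        System.out.println("ASCII == UTF-8 for \"Hello\": "
            + Arrays.equals(ascii.getBytes(StandardCharsets.US_ASCII),
                            ascii.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note the CJK sample: it takes 3 bytes in UTF-8 but only 2 in UTF-16, which is the trade-off discussed in the summary below.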
Summary:
- In UTF-8, every code point from 0 to 127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or up to 4 bytes. In short, UTF-8 is a variable-length encoding that takes 1 to 4 bytes, depending upon the code point. UTF-16 is also a variable-length encoding, but it takes either 2 or 4 bytes: code points outside the Basic Multilingual Plane are stored as a pair of 16-bit code units called a surrogate pair. UTF-32, on the other hand, is a fixed 4 bytes.
- UTF-8 has an advantage where ASCII characters dominate the text, because in that case most characters need only one byte. A UTF-8 file containing only ASCII characters has exactly the same bytes as an ASCII file, which means English text looks exactly the same in UTF-8 as it did in ASCII (as the size-comparison sketch above demonstrates). Given the dominance of ASCII in the past, this was the main reason for the initial acceptance of Unicode and UTF-8.
- UTF-16 is not fixed width. It uses 2 or 4 bytes. Only UTF-32 is fixed width, and unfortunately hardly anyone uses it. Also worth knowing: Java Strings are represented internally using UTF-16 code units; earlier versions of Java used UCS-2, which is fixed width (see the sketch after this list).
- You might think that because UTF-8 takes fewer bytes for many characters it would use less memory than UTF-16, but that really depends on what language the string is in. For many non-European languages, such as Chinese, UTF-8 requires more memory than UTF-16 (3 bytes per character versus 2).
- ASCII is faster to process than any multi-byte encoding scheme, simply because there is less data to process.
- UTF-32 covers all possible characters in 4 bytes each. This makes it pretty bloated. I can't think of any advantage to using it.
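A minimal sketch of the Java point above (the class name is hypothetical). Because Java's `char` is a UTF-16 code unit, a character outside the Basic Multilingual Plane occupies two chars, so `length()` (which counts code units) and `codePointCount` (which counts code points) can disagree:

```java
public class JavaUtf16 {
    public static void main(String[] args) {
        // U+1F600 is outside the Basic Multilingual Plane, so UTF-16
        // stores it as a surrogate pair: two char code units.
        String s = "\uD83D\uDE00";
        System.out.println("length() (UTF-16 code units): " + s.length());         // 2
        System.out.println("codePointCount: " + s.codePointCount(0, s.length()));  // 1
        System.out.printf("high surrogate: U+%04X, low surrogate: U+%04X%n",
                (int) s.charAt(0), (int) s.charAt(1));                 // U+D83D, U+DE00
    }
}
```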