---
tags: [binary, binary-encoding]
---

# Text encoding

Text encoding is an applied instance of
[binary encoding](Binary_encoding.md).

## ASCII

Around 100 characters are required to render A-Z, a-z, 0-9 and the special
characters of Latin text. ASCII (American Standard Code for Information
Interchange) covers these with a 7-bit code, which gives $2^7 = 128$ code
values. In practice each character is stored in one byte: as only 7 bits are
needed, a leading `0` is added. As $2^8 = 256$, a full byte could distinguish
256 values.

Below are some examples of the ASCII correspondences:

| Binary   | Hex | Character |
| -------- | --- | --------- |
| 00100000 | 20  | [space]   |
| 00100001 | 21  | !         |
| 00101011 | 2B  | +         |

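The correspondences in the table can be reproduced programmatically. The following is a minimal Python sketch; the helper name `ascii_row` is an illustrative assumption, not something defined elsewhere in these notes.

```python
def ascii_row(char: str) -> tuple[str, str, str]:
    """Return (binary, hex, character) for a single ASCII character."""
    code = ord(char)  # numeric code of the character
    if code > 127:
        raise ValueError(f"{char!r} is not an ASCII character")
    # 8 binary digits (the leading 0 pads the 7-bit code) and 2 hex digits
    return format(code, "08b"), format(code, "02X"), char


# Print one row per character, mirroring the table's columns
for ch in [" ", "!", "+"]:
    print(ascii_row(ch))
```
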
However, ASCII only defines 128 characters, so a full byte with its 256
possible values is more than it needs. This led people to use the remaining 128
values (those with a leading `1`) to accommodate characters from non-English
languages. This was quickly found to be insufficient, necessitating the
development of a new standard, Unicode.

## Unicode and UTF-8

Whereas ASCII only encodes 128 characters, centred on the English alphabet, the
scope of Unicode is much broader. It is a superset of ASCII: every character in
ASCII is in Unicode, but not the converse. Unicode is a universal character set
that aims to define every character in every written language of the world. The
Unicode standard is maintained by the Unicode Consortium and defines more than
140,000 characters from more than 150 modern and historic scripts, along with
emoji.

In contrast to ASCII, Unicode doesn't achieve this by mapping every character
directly to a fixed pattern of bits. Instead it provides an abstract
representation of every character, which is then encoded using a designated
encoding such as UTF-8, UTF-16 or UTF-32. These abstract representations are
called "code points" and are written as hexadecimal numbers between 0x0 and
0x10FFFF (1114111), prefixed with `U+`. For example, `U+00F7` is the division
sign.

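To make the `U+` notation concrete, here is a small Python sketch that converts between characters and code points. The function name `code_point` is a hypothetical helper chosen for illustration; `ord` and `chr` are Python built-ins.

```python
def code_point(char: str) -> str:
    """Format a character's Unicode code point in U+XXXX notation."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits
    return f"U+{ord(char):04X}"


division = "\N{DIVISION SIGN}"  # the character at U+00F7

print(code_point(division))  # character -> code point notation
print(code_point("A"))       # ASCII characters have code points too
print(chr(0x00F7))           # code point -> character
print(ord(chr(0x10FFFF)))    # the highest code point, as a decimal integer
```
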
As the encoding names suggest, they differ in the number of bits they use:

- UTF-8 and UTF-16 are variable-length encodings.
- In UTF-8, a character occupies a minimum of 8 bits and a maximum of 32.
- In UTF-16, a character occupies either 16 or 32 bits.
- UTF-32 is a fixed-length encoding of 32 bits per character.

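The size differences between the three encodings can be checked directly. The sketch below is illustrative only: the sample characters and the helper name `encoded_bits` are assumptions made for the example. It uses the little-endian codec variants so that Python does not prepend a byte order mark, which would inflate the counts.

```python
# Sample characters from different parts of the code point range
samples = ["A", "\N{DIVISION SIGN}", "\N{EURO SIGN}", "\N{GRINNING FACE}"]


def encoded_bits(char: str, encoding: str) -> int:
    """Number of bits a single character occupies in the given encoding."""
    return len(char.encode(encoding)) * 8


for ch in samples:
    sizes = {
        enc: encoded_bits(ch, enc)
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    # UTF-8 and UTF-16 sizes vary per character; UTF-32 is always 32 bits
    print(f"U+{ord(ch):04X}", sizes)
```
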
> Unicode can be stored using several different encodings, which translate the
> character codes into sequences of bytes. So, crucially Unicode is not itself
> an encoding.

UTF-8 uses the ASCII set for the first 128 characters, encoding each in a single
byte. That's handy because it means ASCII text is also valid UTF-8.

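This compatibility is easy to verify. A short Python sketch, with an arbitrary sample string chosen for illustration:

```python
text = "Hello, world!"  # contains only ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Pure ASCII text produces identical bytes under both encodings
print(ascii_bytes == utf8_bytes)

# Bytes written as ASCII can therefore be decoded as UTF-8
print(ascii_bytes.decode("utf-8") == text)

# The converse does not hold: a non-ASCII character cannot be encoded as ASCII
try:
    "\N{DIVISION SIGN}".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)
```

The last step fails because the division sign lies outside the first 128 code points, so UTF-8 needs more than one byte for it.
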