Character Encoding
Computers operate on numbers, so text is usually handled as a sequence of numbers, each of which is assigned to a character. The set of rules that defines this mapping is called a character encoding.
The following is an example of the ASCII encoding of the string "Hello".
Character | H | e | l | l | o |
Code | 72 | 101 | 108 | 108 | 111 |
The number sequence 72, 101, 108, 108, 111 corresponds to the word "Hello". There are two primary aspects to character encoding, namely the Character Map and the Data Representation.
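The mapping in the table above can be reproduced with a short sketch. It uses Python's built-in `ord` and `chr`; the variable names are illustrative only.

```python
# Map each character of "Hello" to its numeric code and back again.
text = "Hello"

# ord() returns the code assigned to a single character.
codes = [ord(ch) for ch in text]
print(codes)

# chr() performs the reverse lookup, from code to character.
decoded = "".join(chr(n) for n in codes)
print(decoded)

# Encoding the string as ASCII yields one byte per character,
# and each byte holds the same value as the code above.
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes))
```

For text that stays within the ASCII range, the codes and the encoded bytes are the same numbers, which is why the two are easy to conflate.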
Character Map
A character map is also referred to as a code page. It is the association of numbers, or codes, with characters. Each entry is called a code point.
Most character encodings follow the ASCII code points for the 26 letters of the Latin alphabet and the digits.
Range (Decimal) | Characters |
---|---|
48 - 57 | Digits 0 - 9 |
65 - 90 | Letters A - Z |
97 - 122 | Letters a - z |
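The ranges in the table can be checked directly against code point values. The following is a minimal sketch; the function name `classify` and its return labels are hypothetical, chosen only for illustration.

```python
def classify(ch: str) -> str:
    """Classify one character by the ASCII ranges listed in the table."""
    code = ord(ch)
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase letter"
    if 97 <= code <= 122:
        return "lowercase letter"
    return "other"


for ch in "A7z ":
    print(repr(ch), ord(ch), classify(ch))
```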
However, local standards such as GB2312, Shift JIS and ISO/IEC 8859 use varying assignments for other characters. This led to the creation of the Unicode Standard, or Unicode. Although the other standards continue to be used and to evolve, Unicode is now the dominant code page worldwide.
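To illustrate how assignments vary between code pages, the sketch below decodes the same single byte with three parts of ISO/IEC 8859, using the codec names that ship with Python's standard library. The byte value 0xE9 is an arbitrary choice for the example.

```python
# One byte outside the ASCII range, interpreted under three code pages.
raw = bytes([0xE9])

results = {}
for codec in ("iso8859_1", "iso8859_5", "iso8859_7"):
    ch = raw.decode(codec)
    results[codec] = ch
    # Show the Unicode code point each code page assigns to this byte.
    print(codec, f"U+{ord(ch):04X}", ch)
```

The byte is identical in all three cases. Only the code page used to interpret it changes, and with it the resulting character (a Latin, a Cyrillic and a Greek letter respectively).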
Data Representation
Aside from code points, a character encoding also dictates how each code sequence should be encoded (and decoded). For instance, the GSM 03.38 encoding follows the ASCII code points for letters and numbers, but is not compatible with ASCII because it uses a 7-bit or 16-bit representation. Unicode encodings such as UTF-8, UTF-16 and UTF-32 use an identical code page but represent each code point differently, so they may lead to errors when used interchangeably.
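The difference between Unicode encodings can be seen by encoding the same string three ways. This is a minimal sketch using Python's built-in codecs. The little-endian variants (`utf-16-le`, `utf-32-le`) are chosen here only so that no byte order mark is added to the output.

```python
text = "Hello"

# Same code points, three different byte representations.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

print(len(utf8), utf8)    # 1 byte per character for this string
print(len(utf16), utf16)  # 2 bytes per character for this string
print(len(utf32), utf32)  # 4 bytes per character

# Decoding with the wrong encoding does not always raise an error.
# Here the UTF-16 bytes are read as UTF-8, and every padding byte
# becomes a NUL character in the result.
mixed = utf16.decode("utf-8")
print(repr(mixed))
```

In this case the mismatch produces silently corrupted text rather than an exception, which is one way such errors go unnoticed.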