3.5 · Character Encoding

Goal: explain how letters become bytes, name the four encodings on the syllabus, and choose the right one for a job.

Why we need encoding

Computers store only 0s and 1s. A character encoding is an agreed mapping between characters (letters, symbols, ideographs) and integer codes, which can then be stored as bits.

Three things have to agree for text to travel correctly:

Character set — which characters are available (the alphabet).
Encoding — how each character maps to a binary code.
Byte order / decoding software — how the bytes are read back.
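
The mapping can be sketched in a few lines. Python is used here purely for illustration (these notes do not prescribe a language): the built-ins `ord` and `chr` expose the character-to-integer mapping, and `encode` / `decode` handle the step to and from stored bytes.

```python
# Character -> integer code -> bytes, and back again.
ch = "A"
code = ord(ch)                  # integer code assigned to the character
raw = ch.encode("ascii")        # the bytes that would actually be stored
restored = raw.decode("ascii")  # decoding with the same encoding recovers the text

print(code, raw, restored)
print(chr(code) == ch)          # chr() is the inverse of ord()
```

Decoding only works because both sides agree on the encoding name; that agreement is the third item in the list above.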

ASCII — the original

ASCII (American Standard Code for Information Interchange) defines codes for the basic English alphabet, digits, punctuation and control characters.

Aspect	Detail
Bits per character	7 (often padded to 8 bits = 1 byte)
Total characters	128
Covers	English letters, digits, punctuation, control codes (CR, LF…)
Year	1963

Selected ASCII codes

Character	Denary	Binary	Hex
`A`	65	100 0001	41
`a`	97	110 0001	61
`0`	48	011 0000	30
`9`	57	011 1001	39
`(space)`	32	010 0000	20
`\n` (newline)	10	000 1010	0A
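
The rows of this table can be regenerated rather than memorised. The following is a small sketch; the helper name `ascii_row` is made up for this example.

```python
def ascii_row(ch: str) -> str:
    """Return 'denary binary hex' for one ASCII character."""
    code = ord(ch)
    bits = format(code, "07b")  # 7-bit binary, zero-padded on the left
    # Split the 7 bits as 3 + 4 to match the layout used in the table.
    return f"{code} {bits[:3]} {bits[3:]} {code:02X}"

for ch in ["A", "a", "0", "9", " ", "\n"]:
    print(repr(ch), ascii_row(ch))
```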

Two pieces of trivia worth remembering:

a − A = 32 (lowercase = uppercase + 32 → easy case conversion).
'9' − '0' = 9 (easy digit-character to integer conversion).
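
Both facts translate directly into code. This is a minimal sketch that assumes single ASCII characters as input; the function names are illustrative only.

```python
def to_upper_ascii(ch: str) -> str:
    """Uppercase one ASCII letter by subtracting 32; anything else passes through."""
    if len(ch) == 1 and "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch

def digit_value(ch: str) -> int:
    """Convert one digit character '0'..'9' to its integer value."""
    if len(ch) != 1 or not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord("0")

print(to_upper_ascii("a"), digit_value("9"))
```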

You do NOT need to memorise specific codes

The C&A Guide explicitly says: "Recall of specific codes is not required." You need to understand the principle.

Big-5 — Traditional Chinese

Aspect	Detail
Bits per character	16 (2 bytes)
Total characters	~13,000+
Covers	Traditional Chinese (mainly Hong Kong & Taiwan)
Year	1984

Each Chinese character is stored as 2 bytes. A document mixing English and Chinese under Big-5 stores English in 1 byte and Chinese in 2.
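
A quick way to see the 1-byte / 2-byte split is to encode a mixed string. The sketch below assumes Python's built-in `big5` codec is an acceptable stand-in for the Big-5 standard described here.

```python
text = "A中"

# Encode each character on its own to see how many bytes it needs.
for ch in text:
    encoded = ch.encode("big5")
    print(ch, len(encoded), encoded.hex())

total = len(text.encode("big5"))  # bytes for the whole mixed string
print("total bytes:", total)
```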

GB — Simplified Chinese

The Chinese national standard GB (国标) covers Simplified Chinese characters used in mainland China.

Aspect	Detail
Bytes per Chinese character	2 in GB2312 and GBK; 2 or 4 in GB18030
Covers	Simplified Chinese; later versions include Traditional
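
The growth of the GB family shows up in the byte counts. This sketch relies on Python's built-in `gb2312`, `gbk` and `gb18030` codecs; the sample characters are arbitrary choices for illustration.

```python
samples = {
    "gb2312": "中",    # a common Chinese character
    "gbk": "中",
    "gb18030": "😀",   # outside the older GB repertoires, needs 4 bytes
}

# Number of bytes each sample character occupies in its codec.
sizes = {codec: len(ch.encode(codec)) for codec, ch in samples.items()}
print(sizes)
```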

Unicode — the global standard

Unicode assigns a unique number ("code point") to every character in every script — Latin, Chinese, Arabic, emoji, mathematical symbols, etc.

Aspect	Detail
Total characters	150,000+ (and growing)
Encodings	UTF-8 (1–4 bytes per char), UTF-16 (2 or 4 bytes), UTF-32 (always 4 bytes)
Used by	Modern Web, modern operating systems, almost all new software
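
To see the difference between a code point and its encoded size, the following sketch prints both for three characters. The helper name `utf_sizes` is hypothetical, and the `-le` codec variants are chosen only so that Python does not prepend a byte-order mark to the count.

```python
def utf_sizes(ch: str) -> dict:
    """Bytes needed for one character in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in ["A", "中", "😀"]:
    print(f"U+{ord(ch):04X}", utf_sizes(ch))
```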

UTF-8 is the de-facto Web standard

UTF-8 is clever:

Encodes ASCII characters in 1 byte (backwards-compatible).
Encodes other characters in 2, 3 or 4 bytes.
The leading bits of the first byte (0, 110, 1110 or 11110) tell the decoder how many bytes belong to the current character; every following byte starts with 10.

'A'  → 1 byte:  01000001
'£'  → 2 bytes: 11000010 10100011
'中' → 3 bytes: 11100100 10111000 10101101
'😀' → 4 bytes: 11110000 10011111 10011000 10000000
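
In the four examples above, the first byte begins with 0, 110, 1110 and 11110 respectively, and that prefix carries the length. The sketch below reads the length from the first byte; it is a simplified check (it does not reject every invalid lead byte), and the function name is made up for this example.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with first_byte occupies."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: single-byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")

for ch in ["A", "£", "中", "😀"]:
    data = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in data)
    print(ch, utf8_sequence_length(data[0]), bits)
```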

Side-by-side comparison

Feature	ASCII	Big-5	GB	Unicode (UTF-8)
Origin	USA	Taiwan / HK	Mainland China	International
Bytes per character	1	2	2–4	1–4
Covers	Basic English	Traditional Chinese	Simplified Chinese	All major scripts
Mixed-language safe?	English only	Limited	Limited	Yes
Modern recommendation	Subset of UTF-8	Legacy	Legacy	Yes

Why the size of the character set matters

"The relationship between the size of the character set and the representation should be explained."
— C&A Guide

The number of bits needed to encode a character set is determined by the set's size:

Character set size	Bits needed
2 (binary alphabet)	1
16	4
256	8
65,536 (BMP of Unicode)	16
1,114,112 (full Unicode)	21 (rounded to 24 or 32)

A larger character set → more bits per character → more storage and bandwidth needed.
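
The rule behind the table: n bits give 2^n distinct codes, so a set of N characters needs the smallest n with 2^n ≥ N. A minimal sketch of that calculation follows (the function name `bits_needed` is illustrative):

```python
def bits_needed(set_size: int) -> int:
    """Smallest number of bits giving each of set_size characters a distinct code."""
    if set_size < 1:
        raise ValueError("character set must contain at least one character")
    # The largest code is set_size - 1; its bit length is the answer.
    return (set_size - 1).bit_length()

for size in [2, 16, 256, 65_536, 1_114_112]:
    print(size, bits_needed(size))
```

A set of 128 characters gives 7, which matches ASCII.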

Common student mistakes

Saying "ASCII can store Chinese" — it cannot (only 128 codes).
Confusing Unicode (the character set) with UTF-8 (one encoding of it).
Claiming Big-5 covers Simplified Chinese — it does not.
Forgetting that mixing encodings causes mojibake (garbled text).
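
Mojibake is easy to reproduce. In this sketch the text is stored as UTF-8 and then read back with a different encoding; Latin-1 is an assumption chosen for the demonstration only because it accepts any byte value without raising an error.

```python
original = "中文"
stored = original.encode("utf-8")    # written to disk as UTF-8 (6 bytes)
garbled = stored.decode("latin-1")   # read back with the wrong encoding
repaired = stored.decode("utf-8")    # read back with the right encoding

print(repr(garbled))
print(repaired)
```

The bytes themselves are never damaged; only the interpretation is wrong, which is why decoding again with the correct encoding recovers the text.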

Real-world tips

Always save text files as UTF-8 to avoid garbled Chinese text.
When opening a CSV in Excel, choose UTF-8 encoding to keep Chinese readable.
Web pages should declare <meta charset="UTF-8"> in the <head>.
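
When a program writes the file, the same advice means naming the encoding explicitly instead of relying on the system default. A minimal sketch, using a hypothetical file name in a temporary directory:

```python
import os
import tempfile

text = "中文ABC"

# Hypothetical file location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Name the encoding on both write and read so they always agree.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

size_on_disk = os.path.getsize(path)
print(read_back, size_on_disk)
```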

Worked example · How many bytes does "中文ABC" take?

Under UTF-8:

Char	Bytes
中	3
文	3
A	1
B	1
C	1
Total	9 bytes

Under Big-5:

Char	Bytes
中	2
文	2
A	1
B	1
C	1
Total	7 bytes

Big-5 is smaller here but cannot represent emojis or Cyrillic.
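
The two tallies can be checked mechanically. This sketch assumes Python's built-in `utf-8` and `big5` codecs; the helper name `byte_counts` is made up for the example.

```python
def byte_counts(text: str, encoding: str) -> list:
    """Return (character, byte count) pairs for text under the given encoding."""
    return [(ch, len(ch.encode(encoding))) for ch in text]

for enc in ("utf-8", "big5"):
    rows = byte_counts("中文ABC", enc)
    print(enc, rows, "total:", sum(n for _, n in rows))
```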

Practice activity

Match each scenario to the most suitable encoding:

Scenario	Best encoding
Storing English-only legacy database records from 1985	?
A Hong Kong newspaper website with Traditional Chinese & emojis	?
A simplified-Chinese pamphlet for mainland China, no other scripts	?
A multilingual Wikipedia article	?

Suggested answers (in the order of the table above)

ASCII
UTF-8 (Unicode)
GB (or UTF-8)
UTF-8 (Unicode)
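
One way to test a choice is to ask whether the text can be encoded at all. The sketch below uses a hypothetical helper, `can_encode`, with made-up sample strings; a `False` result means at least one character has no code in that encoding.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text has a code in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

headline = "新聞 😀"  # Traditional Chinese plus an emoji
for enc in ("ascii", "big5", "utf-8"):
    print(enc, can_encode(headline, enc))
```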

Exam-style question

Q (4 marks): Compare ASCII and Unicode in terms of (a) the range of characters supported, (b) the bytes used per character. State one reason most modern web pages use Unicode (UTF-8).

Sample answer:

Range: ASCII supports 128 characters (basic English, digits, punctuation, control codes). Unicode supports over 150,000 characters covering all major scripts including Chinese, Arabic, emoji.
Bytes: ASCII uses 1 byte per character (7 bits used). Unicode encoded as UTF-8 uses 1–4 bytes depending on the character; ASCII characters still take 1 byte for backward compatibility.
Reason: Web pages serve global audiences and must display many scripts, including Chinese and emoji, on the same page; Unicode covers all of them in a single character set.

Key takeaways

ASCII = 1 byte, English only.
Big-5 = 2 bytes, Traditional Chinese.
GB = 2–4 bytes, Simplified Chinese.
Unicode (UTF-8) = 1–4 bytes, global default.
Larger character set → more bits per character.

