A proposal for a Latin script character encoding

Last updated: 17 June 2021

Unicode is an information technology standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The standard, which is maintained by the Unicode Consortium, defines 143,859 characters covering 154 modern and historic scripts, as well as symbols, emoji and non-visual control and formatting codes.

Before Unicode, numerous encodings were used around the world. Many were language-specific or region-specific, and determining the encoding of a text file could be a challenging endeavour. Unicode was therefore created to be a single standard for the entire world. The world has embraced Unicode, with UTF-8, Unicode's most popular encoding, dominating the World Wide Web. UTF-8 is ASCII-compatible, which undoubtedly helped with the encoding's adoption.

Given Unicode's ambitious goals, it is no surprise that, even though 30 years have elapsed since the first volume of the Unicode standard was published, the work is not complete and the end is not in sight. Implementing Unicode in software APIs is a colossal undertaking that few have managed to do well and none comprehensively. Normalisation, decomposition, collation, rendering and bidirectional text: these are some of the things that any Unicode implementation must contend with, for all existing writing systems, past and present.

I decided to draft a proposal for a variable-width character encoding for the Latin script because of the frustration that I have experienced whilst working with Unicode in software development. I believe that it is time to take a step back and rethink this aspect of information technology from the ground up. Each character is encoded in one or two bytes, which should be more than enough to cover the Latin script and all necessary symbols.

Note that this work is not meant to be a technical standard but a draft. A Q&A section is located below.

Tables

Table 1
Work on this table is finished.
Table 2
Work on this table is not finished. Neither the punctuation marks nor their placement have been decided.
Table 3
Work on this table is not finished. It is intended to contain symbols (e.g. mathematical) and free-standing diacritics (e.g. ').
Table 4
Work on this table is not finished. It is intended to contain modifiers that are placed before alphabetic characters (e.g. [diacritic modifier] + [a] = [á]).
Table 1
DEC HEX BIN Character Notes
0 0 0 null implementation-specific
1 1 1 A capital letter
2 2 10 B capital letter
3 3 11 C capital letter
4 4 100 D capital letter
5 5 101 E capital letter
6 6 110 F capital letter
7 7 111 G capital letter
8 8 1000 H capital letter
9 9 1001 I capital letter
10 A 1010 J capital letter
11 B 1011 K capital letter
12 C 1100 L capital letter
13 D 1101 M capital letter
14 E 1110 N capital letter
15 F 1111 O capital letter
16 10 10000 P capital letter
17 11 10001 Q capital letter
18 12 10010 R capital letter
19 13 10011 S capital letter
20 14 10100 T capital letter
21 15 10101 U capital letter
22 16 10110 V capital letter
23 17 10111 W capital letter
24 18 11000 X capital letter
25 19 11001 Y capital letter
26 1A 11010 Z capital letter
27 1B 11011 a small letter
28 1C 11100 b small letter
29 1D 11101 c small letter
30 1E 11110 d small letter
31 1F 11111 e small letter
32 20 100000 f small letter
33 21 100001 g small letter
34 22 100010 h small letter
35 23 100011 i small letter
36 24 100100 j small letter
37 25 100101 k small letter
38 26 100110 l small letter
39 27 100111 m small letter
40 28 101000 n small letter
41 29 101001 o small letter
42 2A 101010 p small letter
43 2B 101011 q small letter
44 2C 101100 r small letter
45 2D 101101 s small letter
46 2E 101110 t small letter
47 2F 101111 u small letter
48 30 110000 v small letter
49 31 110001 w small letter
50 32 110010 x small letter
51 33 110011 y small letter
52 34 110100 z small letter
53 35 110101 0 digit
54 36 110110 1 digit
55 37 110111 2 digit
56 38 111000 3 digit
57 39 111001 4 digit
58 3A 111010 5 digit
59 3B 111011 6 digit
60 3C 111100 7 digit
61 3D 111101 8 digit
62 3E 111110 9 digit
63 3F 111111 null implementation-specific
Table 2
DEC HEX BIN Character Notes
64 40 1000000 null implementation-specific
65 41 1000001 Þ capital letter
66 42 1000010 Æ capital letter
67 43 1000011 Ð capital letter
68 44 1000100 ẞ capital letter
69 45 1000101 þ small letter
70 46 1000110 æ small letter
71 47 1000111 ð small letter
72 48 1001000 ß small letter
73 49 1001001 space
74 4A 1001010 tab
75 4B 1001011 newline
76 4C 1001100 . full stop
77 4D 1001101 , comma
78 4E 1001110 : colon
79 4F 1001111 ; semicolon
80 50 1010000 - hyphen
81 51 1010001 en dash
82 52 1010010 em dash
83 53 1010011 horizontal bar
84 54 1010100 ' apostrophe
85 55 1010101 ‘ left single quotation mark
86 56 1010110 ’ right single quotation mark
87 57 1010111 ‚ single low quotation mark
88 58 1011000 ‛ single high reversed quotation mark
89 59 1011001 “ left double quotation mark
90 5A 1011010 ” right double quotation mark
91 5B 1011011 „ double low quotation mark
92 5C 1011100 ‟ double high reversed quotation mark
93 5D 1011101 ‹ single left-pointing angle quotation mark
94 5E 1011110 › single right-pointing angle quotation mark
95 5F 1011111 « double left-pointing angle quotation mark
96 60 1100000 » double right-pointing angle quotation mark
97 61 1100001 ( left bracket
98 62 1100010 ) right bracket
99 63 1100011 { left curly bracket
100 64 1100100 } right curly bracket
101 65 1100101 [ left square bracket
102 66 1100110 ] right square bracket
103 67 1100111 left angle bracket
104 68 1101000 right angle bracket
105 69 1101001 ? question mark
106 6A 1101010 ¿ inverted question mark
107 6B 1101011 ! exclamation mark
108 6C 1101100 ¡ inverted exclamation mark
109 6D 1101101 / slash
110 6E 1101110 \ backslash
111 6F 1101111 | vertical bar
112 70 1110000 * asterisk
113 71 1110001 ^ caret
114 72 1110010 _ underscore
115 73 1110011 double underscore
116 74 1110100 · interpunct
117 75 1110101 • bullet
118 76 1110110 & ampersand
119 77 1110111 @ at sign
120 78 1111000 # number sign
121 79 1111001 ° degree symbol
122 7A 1111010 % per cent sign
123 7B 1111011 ‰ per mille sign
124 7C 1111100 ‱ basis point
125 7D 1111101 + plus sign
126 7E 1111110 − minus sign
127 7F 1111111 null implementation-specific
Table 3
DEC HEX BIN Character Notes
128 80 10000000 null implementation-specific
129 81 10000001 × multiplication sign
130 82 10000010 dot operator
131 83 10000011 ÷ division sign
132 84 10000100 ± plus–minus sign
133 85 10000101 ~ tilde
134 86 10000110 approximation sign
135 87 10000111 = equals sign
136 88 10001000 < less-than sign
137 89 10001001 > greater-than sign
138 8A 10001010 symbol
139 8B 10001011 symbol
140 8C 10001100 symbol
141 8D 10001101 symbol
142 8E 10001110 symbol
143 8F 10001111 symbol
144 90 10010000 symbol
145 91 10010001 symbol
146 92 10010010 symbol
147 93 10010011 symbol
148 94 10010100 symbol
149 95 10010101 symbol
150 96 10010110 symbol
151 97 10010111 symbol
152 98 10011000 symbol
153 99 10011001 symbol
154 9A 10011010 symbol
155 9B 10011011 symbol
156 9C 10011100 symbol
157 9D 10011101 symbol
158 9E 10011110 symbol
159 9F 10011111 symbol
160 A0 10100000 symbol
161 A1 10100001 symbol
162 A2 10100010 symbol
163 A3 10100011 symbol
164 A4 10100100 symbol
165 A5 10100101 symbol
166 A6 10100110 symbol
167 A7 10100111 symbol
168 A8 10101000 symbol
169 A9 10101001 symbol
170 AA 10101010 symbol
171 AB 10101011 symbol
172 AC 10101100 symbol
173 AD 10101101 symbol
174 AE 10101110 symbol
175 AF 10101111 free-standing diacritic
176 B0 10110000 free-standing diacritic
177 B1 10110001 free-standing diacritic
178 B2 10110010 free-standing diacritic
179 B3 10110011 free-standing diacritic
180 B4 10110100 free-standing diacritic
181 B5 10110101 free-standing diacritic
182 B6 10110110 free-standing diacritic
183 B7 10110111 free-standing diacritic
184 B8 10111000 free-standing diacritic
185 B9 10111001 free-standing diacritic
186 BA 10111010 free-standing diacritic
187 BB 10111011 free-standing diacritic
188 BC 10111100 free-standing diacritic
189 BD 10111101 free-standing diacritic
190 BE 10111110 free-standing diacritic
191 BF 10111111 null implementation-specific
Table 4
DEC HEX BIN Character Notes
192 C0 11000000 null implementation-specific
193 C1 11000001 diacritic modifier
194 C2 11000010 diacritic modifier
195 C3 11000011 diacritic modifier
196 C4 11000100 diacritic modifier
197 C5 11000101 diacritic modifier
198 C6 11000110 diacritic modifier
199 C7 11000111 diacritic modifier
200 C8 11001000 diacritic modifier
201 C9 11001001 diacritic modifier
202 CA 11001010 diacritic modifier
203 CB 11001011 diacritic modifier
204 CC 11001100 diacritic modifier
205 CD 11001101 diacritic modifier
206 CE 11001110 diacritic modifier
207 CF 11001111 diacritic modifier
208 D0 11010000 diacritic modifier
209 D1 11010001 diacritic modifier
210 D2 11010010 diacritic modifier
211 D3 11010011 diacritic modifier
212 D4 11010100 diacritic modifier
213 D5 11010101 diacritic modifier
214 D6 11010110 diacritic modifier
215 D7 11010111 diacritic modifier
216 D8 11011000 diacritic modifier
217 D9 11011001 diacritic modifier
218 DA 11011010 diacritic modifier
219 DB 11011011 diacritic modifier
220 DC 11011100 diacritic modifier
221 DD 11011101 diacritic modifier
222 DE 11011110 diacritic modifier
223 DF 11011111 diacritic modifier
224 E0 11100000 diacritic modifier
225 E1 11100001 (undefined) modifier
226 E2 11100010 (undefined) modifier
227 E3 11100011 (undefined) modifier
228 E4 11100100 (undefined) modifier
229 E5 11100101 (undefined) modifier
230 E6 11100110 (undefined) modifier
231 E7 11100111 (undefined) modifier
232 E8 11101000 (undefined) modifier
233 E9 11101001 (undefined) modifier
234 EA 11101010 (undefined) modifier
235 EB 11101011 (undefined) modifier
236 EC 11101100 (undefined) modifier
237 ED 11101101 (undefined) modifier
238 EE 11101110 (undefined) modifier
239 EF 11101111 (undefined) modifier
240 F0 11110000 (undefined) modifier
241 F1 11110001 (undefined) modifier
242 F2 11110010 (undefined) modifier
243 F3 11110011 (undefined) modifier
244 F4 11110100 (undefined) modifier
245 F5 11110101 (undefined) modifier
246 F6 11110110 (undefined) modifier
247 F7 11110111 (undefined) modifier
248 F8 11111000 (undefined) modifier
249 F9 11111001 (undefined) modifier
250 FA 11111010 (undefined) modifier
251 FB 11111011 (undefined) modifier
252 FC 11111100 (undefined) modifier
253 FD 11111101 (undefined) modifier
254 FE 11111110 (undefined) modifier
255 FF 11111111 null implementation-specific
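
The layout of the tables can be sketched in code. Each table holds 64 values, so the top two bits of a byte select the table, and Table 1 maps A to Z, a to z and 0 to 9 onto the consecutive values 1 to 62. The Python sketch below is illustrative only: the function names are hypothetical and not part of the proposal, and only Table 1 is covered because it is the only finished table.

```python
import string

# Table 1: values 1..62 map to A-Z, a-z, 0-9.
# Values 0 and 63 are implementation-specific and are not handled here.
TABLE1 = string.ascii_uppercase + string.ascii_lowercase + string.digits


def table_of(byte: int) -> int:
    """Return the table number (1 to 4) that a byte value falls into."""
    if not 0 <= byte <= 255:
        raise ValueError("not a byte value")
    return (byte >> 6) + 1  # the top two bits select one of four 64-entry tables


def is_modifier(byte: int) -> bool:
    """True for the Table 4 modifier range C1..FE (C0 and FF are reserved)."""
    return 0xC1 <= byte <= 0xFE


def encode_table1(text: str) -> bytes:
    """Encode text made only of Table 1 characters; raises ValueError otherwise."""
    return bytes(TABLE1.index(ch) + 1 for ch in text)


def decode_table1(data: bytes) -> str:
    """Decode bytes in the range 1..62 back to Table 1 characters."""
    chars = []
    for b in data:
        if not 1 <= b <= 62:
            raise ValueError(f"byte {b:#04x} is outside the Table 1 character range")
        chars.append(TABLE1[b - 1])
    return "".join(chars)
```

For example, `encode_table1("Hello")` gives the bytes 08 1F 26 26 29, and `table_of(0x41)` returns 2 because 41 (Þ) sits in Table 2.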

Q&A

Q: Can this encoding be used in the same text file as other encodings?

A: Yes, as long as the other encodings use a standard mechanism to identify themselves. Most people do not mix languages in one and the same text file, let alone writing systems. Implementing encoding detection in text editors and operating systems should be trivial; it is in fact much easier than, for instance, implementing syntax highlighting in source-code editors.

Q: How does this encoding compare to UTF-8 in terms of storage efficiency?

A: In most cases this encoding and UTF-8 are equally efficient. In a few cases this encoding is more efficient than UTF-8. UTF-8, however, is never more efficient than this encoding.
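
To make the comparison concrete, here is a rough sketch that counts bytes under both schemes. It rests on assumptions that the proposal has not finalised: every character in Tables 1 and 2 costs one byte, and a letter carrying one diacritic costs two bytes (a modifier followed by the base letter). The names `proposal_size`, `utf8_size` and `SINGLE_BYTE_EXTRA` are hypothetical, and the set covers only a handful of Table 2 characters.

```python
import unicodedata

# A few single-byte characters from Table 2 (not exhaustive; illustration only).
SINGLE_BYTE_EXTRA = set("ÞÆÐẞþæðß \t\n.,:;-'()?!")


def proposal_size(text: str) -> int:
    """Estimate the encoded size in bytes under the assumptions stated above."""
    size = 0
    for ch in text:
        if (ch.isascii() and ch.isalnum()) or ch in SINGLE_BYTE_EXTRA:
            size += 1  # Table 1 letter/digit or a listed Table 2 character
            continue
        # Assumption: base letter + one diacritic = modifier byte + letter byte.
        decomposed = unicodedata.normalize("NFD", ch)
        if (
            len(decomposed) == 2
            and decomposed[0].isascii()
            and decomposed[0].isalpha()
            and unicodedata.combining(decomposed[1])
        ):
            size += 2
        else:
            raise ValueError(f"character {ch!r} is not covered by this sketch")
    return size


def utf8_size(text: str) -> int:
    """Size of the same text in UTF-8, for comparison."""
    return len(text.encode("utf-8"))
```

Under these assumptions, "cafe" and "café" take the same number of bytes in both schemes (4 and 5 respectively), whereas "Þing" takes 4 bytes here and 5 in UTF-8.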

Q: How many characters can be encoded?

A: The final number has not been decided, as most of the modifiers have not been defined, but two bytes can encode at most 65,536 distinct values, or nearly half (about 46%) of the 143,859 characters in the Unicode standard.

Q: How would encodings compatible with this proposal be identified?

A: By using an implementation-specific byte (the grey cells, marked "null" in the tables) along with two modifiers (the yellow cells). This encoding (Latin) could be identified by the byte sequence FF E1 E1, Cyrillic by FF E1 E2, and so on.
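
A minimal sketch of how a reader might check for such a signature follows. The two byte sequences are the examples given above; the function name `detect_script` and the choice to return the remaining payload are assumptions made for illustration.

```python
# Signatures taken from the examples in the text; further writing systems
# would presumably follow the same three-byte pattern.
SIGNATURES = {
    bytes([0xFF, 0xE1, 0xE1]): "Latin",
    bytes([0xFF, 0xE1, 0xE2]): "Cyrillic",
}


def detect_script(data: bytes):
    """Return (script name, payload) if a known signature leads the data,
    otherwise (None, data) unchanged."""
    prefix = data[:3]
    if prefix in SIGNATURES:
        return SIGNATURES[prefix], data[3:]
    return None, data
```

A file beginning with FF E1 E1 would thus be reported as Latin, with the signature stripped before decoding.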

Q: Is this not a step back to the pre-Unicode chaos?

A: Absolutely not. The pre-Unicode chaos was not caused by the existence of more than one standard for encoding text (just as the existence of multiple Unicode encodings does not automatically make Unicode chaotic) but rather by the lack of coordination and cooperation. What Unicode did was provide a standard for detecting different (Unicode) encodings, and that is exactly what this proposal does, albeit on different terms. It makes no sense for there to be encodings per region or per language. It does, however, make sense for there to be an encoding per writing system. It is a way to break up the gargantuan task of implementing such a system. There are many regions in the world and thousands of languages in existence but very few writing systems.

Q: What about emoji?

A: Pictograms and smileys should never have been added to the Unicode standard as they are not part of any writing system. They should be formatted as inline images.

Q: What about file signatures (magic numbers)?

A: This encoding can be identified both by file signatures (magic numbers) in the file contents and by file attributes. The former would be achieved using the implementation-specific null characters at the start of text files.

Q: Why Latin script?

A: I am most familiar with the Latin script. Others may implement the other writing systems.

Q: Why variable-width?

A: A single-byte encoding would not be enough to encode all the alphabetic characters of the Latin script, let alone the necessary symbols. A two-byte encoding would be too much for the Latin script and waste storage as the required number of characters is but a fraction of the 65,536 distinct possible values.