In the microcomputer world, the octet is called a byte. Such a grouping of eight bits gives 256 possible values, enough for a character set that includes all our letters and digits and most of the punctuation symbols that carry meaning, with plenty to spare.

In the English language, of course.

Having this character set, we see that sequences of bytes are particularly useful. These are what we call strings.

Of course, a file is also a sequence of bytes; when those bytes are meant to be read as characters, we have a text file. Conversely, a file is not a text file when its bytes are not to be (directly) interpreted as characters.

We then have UTF text files. In the strict sense above, these are not text files: each Unicode character is encoded as a short string of one or more bytes rather than as a single byte. Sequences of Unicode characters are called wide-strings.
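
As a rough illustration of that encoding step, the Python sketch below encodes a few characters as UTF-8 and counts the bytes each one needs. The choice of UTF-8 (one of several UTFs) and the sample characters are assumptions made for the example.

```python
# Sample characters, written as escapes so the source stays plain ASCII:
# "A", e-acute, the euro sign, and an emoji (grinning face).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to the bytes that represent it in UTF-8.
encoded = {ch: ch.encode("utf-8") for ch in samples}

for ch, raw in encoded.items():
    # Show the code point, the byte count, and the bytes in hexadecimal.
    hex_bytes = " ".join(f"{b:02x}" for b in raw)
    print(f"U+{ord(ch):04X} -> {len(raw)} byte(s): {hex_bytes}")
```

Only the first sample fits in a single byte; the others need two, three, and four bytes respectively, which is why such a file cannot be read as one character per byte.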

Most text editors can handle UTF text files, so the distinction is only important to programmers and to those who get called upon when people are seeing strange things.

It is worth noting that the ASCII character set, whose 128 codes Unicode keeps as its first 128 code points, includes control characters. These are not intended to be associated with a glyph (the visible shape of a character, as distinct from its character code). The end of a line is marked by a control character (or a pair of them, depending on the system); the tab is a control character.
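
To make the control characters visible, here is a small Python sketch that walks over a short string and flags the characters that Unicode classifies as controls. The sample text is made up for the example; the check relies on the standard `unicodedata` module, where the general category "Cc" means "control".

```python
import unicodedata

# Hypothetical sample line: two words separated by a tab, ended by a line feed.
text = "name\tvalue\n"

for ch in text:
    # Category "Cc" is Unicode's general category for control characters.
    kind = "control" if unicodedata.category(ch) == "Cc" else "other"
    print(f"{ord(ch):3d}  {ch!r:6}  {kind}")

# Collect just the character codes of the control characters.
control_codes = [ord(ch) for ch in text if unicodedata.category(ch) == "Cc"]
```

Here `control_codes` ends up holding 9 (tab) and 10 (line feed), the two characters in the sample that have no glyph of their own.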

The NUL control character--character code zero--is used in the standard C library to indicate the end of a string. A binary-safe string is one which can have NULs in it (a file is thus a binary-safe string).
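
The difference can be sketched in Python, whose `bytes` type is binary-safe. The helper `c_string_view` below is hypothetical; it only imitates how a NUL-terminated reader would treat the same buffer, and is not a call into the C library.

```python
def c_string_view(buffer: bytes) -> bytes:
    """Return the bytes a NUL-terminated reader would see: everything before the first NUL."""
    end = buffer.find(b"\x00")
    return buffer if end == -1 else buffer[:end]


# Seven bytes with a NUL in the middle.
data = b"abc\x00def"

print(len(data))            # the binary-safe view keeps all seven bytes
print(c_string_view(data))  # the NUL-terminated view stops after "abc"
```

Anything stored after the first NUL is invisible to the NUL-terminated view, which is why such strings cannot hold arbitrary file contents.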

As every string is also a numeric sequence, there is a natural ordering of any set of strings: the lexicographic ordering. However, capital letters (in ASCII) are grouped together before lowercase letters, so that "Zebra" sorts before "apple"; thus a lexicographic ordering is not user-friendly.
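
A short Python sketch of the problem, using a made-up word list: the default sort compares character codes, while sorting on a case-folded key is one crude way to get closer to what an English reader expects. It is offered as an illustration only, not as a general solution.

```python
# Hypothetical word list mixing capitalized and lowercase words.
words = ["banana", "Zebra", "apple", "Cherry"]

# Plain lexicographic ordering by character code: capitals come first.
by_code = sorted(words)

# Ordering on a case-folded key: case no longer decides the order.
by_folded = sorted(words, key=str.casefold)

print(by_code)
print(by_folded)
```

The first list puts "Cherry" and "Zebra" ahead of "apple"; the second reads the way a dictionary would, at least for plain English words.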

But in order to make a wide-string ordering user-friendly, we will need to have at least one person of every language of the world together in a room.

Now ask yourself, would it be worth spending government money on this?

Is it worth spending government money on emoji?

