The Project Gutenberg FAQ - V-74

V.74. What characters can I use?

 a) You should use plain ASCII for straight English texts.
 b) When producing a text partly or completely in a language that requires accents, you should use the appropriate ISO-8859 character set for that language, state which one you are using, and also provide a 7-bit plain ASCII version with the accents stripped.
 c) When producing a text in a language that doesn't use one of the ISO-8859 character sets, you should use the encoding most commonly used for that language. [e.g. Chinese--Big 5]
 d) When producing a text containing more characters than can be found in any one of the ISO-8859 character sets, you should use Unicode.
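
As a rough illustration of rules (a), (b) and (d), the sketch below tries a short list of character sets in order and reports the first one that can hold every character of a text. The function name pick_charset, the candidate list, and the choice of UTF-8 as the Unicode fallback are all assumptions made for this example, not part of any Project Gutenberg tool. Rule (c) is not covered, since the usual encoding for a language such as Chinese depends on the language rather than on the characters alone.

```python
# Candidate character sets, tried in order. This list is an assumption
# for illustration; extend or reorder it to suit the texts at hand.
CANDIDATES = ("ascii", "iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7")


def pick_charset(text, candidates=CANDIDATES):
    """Return the first candidate that can encode every character of text.

    Falls back to "utf-8" (one Unicode encoding, chosen here as an
    assumption) when no single candidate covers the whole text.
    """
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # some character is missing from this set
        return name
    return "utf-8"


if __name__ == "__main__":
    for sample in ("Hello", "café", "Łódź", "café Привет"):
        print(repr(sample), "->", pick_charset(sample))
```

Because plain ASCII comes first in the list, a straight English text is never reported as needing anything larger.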

You should use plain ASCII wherever possible--that is, the letters and numbers and punctuation available on a standard U.S. keyboard, without accented letters. The immediate and major exception to this is when you are typing a text written in a language like French or German that requires accents.

There is a problem with using non-ASCII characters. They do not display consistently on all computers; in fact, they do not even display consistently on the same computer! On my computer, for example, what looks like an e-acute in this editor shows as a black box in another editor, or even in the same editor with a different font. And this is by no means confined to some theoretical minority; we have to deal with it all the time when posting texts.

Further, standards are changing: ten years ago, the character set Codepage 850 [MS-DOS] was very common; now it's rare except in some texts that have survived those ten years.
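
The display problem comes down to the same byte meaning different things in different character sets. The short sketch below, which assumes a file saved as ISO-8859-1 and then opened by software expecting Codepage 850 or plain ASCII, decodes one byte string three ways. The sample bytes are made up for the example.

```python
# "café" as an ISO-8859-1 editor would save it: the e-acute is byte 0xE9.
raw = b"caf\xe9"

# The same four bytes, interpreted under three different assumptions.
as_latin1 = raw.decode("iso-8859-1")
as_cp850 = raw.decode("cp850")
as_ascii = raw.decode("ascii", errors="replace")  # non-ASCII byte replaced

print("ISO-8859-1:", as_latin1)
print("Codepage 850:", as_cp850)
print("ASCII only:", as_ascii)
```

Only the first reading gives back the word that was typed; the other two show a different letter or a replacement mark in place of the e-acute.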

We want to preserve these texts over centuries, not just decades, and at the moment there is no single clear standard that we can use across all texts. Unicode may become a future standard, but, right now, it's not something that people use every day, and it's not supported by a lot of common software.

ASCII, while limited, is supported by almost all computers everywhere, so we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original. When we get a text in, say, German, we post two versions of it--one with accents and one without.
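
One way to produce the accent-stripped version is sketched below. It is a minimal example, not the procedure Project Gutenberg actually uses: the function name to_plain_ascii and the FALLBACKS table are hypothetical, and a real text may call for different substitutions (for instance, some prefer "ae" for a-umlaut in German).

```python
import unicodedata

# Hypothetical fallback table for letters that Unicode does not split
# into a base letter plus an accent; extend it as a given text requires.
FALLBACKS = {"ß": "ss", "æ": "ae", "Æ": "AE", "ø": "o", "Ø": "O"}


def to_plain_ascii(text):
    """Return a 7-bit ASCII approximation of text with accents stripped."""
    out = []
    for ch in text:
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # NFKD splits an accented letter into base letter + combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    # Anything still outside ASCII becomes "?" so the loss stays visible.
    return "".join(out).encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    print(to_plain_ascii("Die Größe der Straße"))
    print(to_plain_ascii("Il était une fois"))
```

The result is degraded compared to the 8-bit original, as noted above, but it can be read on any system that handles ASCII.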
