History and Philosophy of Project Gutenberg

By Michael S. Hart
© August 1992

The Beginning

Project Gutenberg began in 1971 when Michael Hart was given an operator's account with $100,000,000 of computer time in it by the operators of the Xerox Sigma V mainframe at the Materials Research Lab at the University of Illinois.

This was totally serendipitous: two of the four-operator crew happened to be Michael's best friend and his brother's best friend. Michael just happened "to be at the right place at the right time," when there was more computer time than people knew what to do with, and those operators were encouraged to do whatever they wanted with that fortune in "spare time" in the hope that they would learn more for their job proficiency.

At any rate, Michael decided there was nothing he could do, in the way of "normal computing," that would repay the huge value of the computer time he had been given ... so he had to create $100,000,000 worth of value in some other manner. An hour and 47 minutes later, he announced that the greatest value created by computers would not be computing, but would be the storage, retrieval, and searching of what was stored in our libraries.

He then proceeded to type in the "Declaration of Independence" and tried to send it to everyone on the networks ... which can only be described today as a not so narrow miss at creating an early version of what was later called the "Internet Virus."

A friendly dissuasion from this yielded the first posting of a document in electronic text, and Project Gutenberg was born as Michael stated that he had "earned" the $100,000,000 because a copy of the Declaration of Independence would eventually be an electronic fixture in the computer libraries of 100,000,000 of the computer users of the future.

The Beginning of the Gutenberg Philosophy

The premise on which Michael Hart based Project Gutenberg was: anything that can be entered into a computer can be reproduced indefinitely ... what Michael termed "Replicator Technology." The concept of Replicator Technology is simple: once a book or any other item (including pictures, sounds, and even 3-D items) can be stored in a computer, then any number of copies can and will be available. Everyone in the world, or even not in this world (given satellite transmission), can have a copy of a book that has been entered into a computer.

This philosophical premise has created several offshoots:

  1. Electronic Texts (Etexts) created by Project Gutenberg are to be made available in the simplest, easiest to use forms available.

Suggestions to make them less readily available are not to be treated lightly. Therefore, Project Gutenberg Etexts are made available in what has become known as "Plain Vanilla ASCII," meaning the low set of the American Standard Code for Information Interchange, i.e. the same kind of characters you read on a normal printed page; italics, underlines, and bolds have been capitalized.

The reason for this is that 99% of the hardware and software a person is likely to run into can read and search these files.

Any other system of etext storage is going to fall short of an audience of 99%.

This does not mean there are not other valid means of doing the etext business ... after all, over half the computers run DOS, so one could address a wide audience by just doing DOS. Plain Vanilla ASCII, however, addresses the audience with Apples and Ataris all the way to the old homebrew Z80 computers, while an audience of Mac, UNIX and mainframers is still included.
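The Plain Vanilla ASCII convention described above (emphasis rendered as capitals, nothing outside the low ASCII set) can be sketched in a few lines. This is only an illustration, not the procedure Project Gutenberg used: the `_italic_` and `*bold*` markers are a hypothetical input markup chosen for the example, and the function names are made up here.

```python
import re

# Hypothetical input markup (an assumption for this sketch):
# _text_ marks italics or underlining, *text* marks bold.
EMPHASIS = re.compile(r"(?<!\w)([_*])(?=\S)(.+?)(?<=\S)\1(?!\w)")


def to_plain_vanilla(text: str) -> str:
    """Drop the emphasis markers and capitalize the emphasized span."""
    return EMPHASIS.sub(lambda match: match.group(2).upper(), text)


def is_low_ascii(text: str) -> bool:
    """True when every character falls in the 7-bit ASCII range."""
    return text.isascii()


sample = "Alice was _very_ tired of sitting by her *sister* on the bank."
plain = to_plain_vanilla(sample)
print(plain)
print(is_low_ascii(plain))
```

A file that passes a check like `is_low_ascii` needs no special program to read it, which is the whole point of the format.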

In this same vein, Project Gutenberg selects etexts targeted a bit on the "bang for the buck" philosophy ... we choose etexts we hope extremely large portions of the audience will want and use frequently. We are constantly asked to prepare etext from out of print editions of esoteric materials, but this does not provide for usage by the audience we have targeted, 99% of the general public.

Also in the same vein, Project Gutenberg has avoided requests, demands, and pressures to create "authoritative editions." We do not write for the reader who cares whether a certain phrase in Shakespeare has a ":" or a ";" between its clauses. We put our sights on a goal to release etexts that are 99.9% accurate in the eyes of the general reader. Given the preferences your proofreaders have, and the general lack of reading ability the public is currently reported to have, we probably exceed those requirements by a significant amount. However, for the person who wants an "authoritative edition" we will have to wait some time until this becomes more feasible. We do, however, intend to release many editions of Shakespeare and the other classics for the comparative study on a scholarly level, before the end of the year 2001, when we are scheduled to complete our 10,000 book Project Gutenberg Electronic Public Library.

Project Gutenberg has been a part of celebrations of the 100th Anniversary of Public Libraries, starting in 1995. Project Gutenberg hopes to found "The Public Domain Register," after the 100th Anniversary of The U.S. Copyright Register in 1997.

We hope you will be part of it, too. You are all invited.

Footnote:

Our eventual goal is to provide Public Domain Etext editions a short time after they enter the Public Domain. Of course, the period before a copyrighted work enters the Public Domain was extended from 28 years (with a 28 year extension available) to 50 years more than the life of the author, so this put a kink, to put it mildly, into our plans. (The original copyright was for 14 years, in the U.S.) Thus, a person could originally make a reasonable prediction that anything under copyright would be in the Public Domain while it could still be used. Under the new law it is impossible to predict the length of a copyright, and the likelihood of a new book entering the Public Domain during the lifetime of the average reader is minimal. Suppose you are 25 when you read a new book and the author is 50: wait the average 25 years for the author to die (what a thought!). Now you have to wait another 50 years to have access to that book; it doesn't matter when it was written (unless it is an old one ... before the period the law retroacted to) ... so you would have to wait (on the average) until you were 100 years old. A 25-year-old under the original law would only have to wait for 14 years ... until the age of 39. Quite a difference, between the ages of 39 and 100. Not only that, but the copyright laws would have to stay the same for all that time ... something in serious doubt, seeing how much they have changed in the recent century.
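The arithmetic in this footnote can be checked with a short sketch. The function names are hypothetical, the figures are the ones given in the footnote, and the book is assumed to be new on the day the reader picks it up.

```python
def reader_age_life_plus(reader_age: int, years_until_author_dies: int,
                         term_after_death: int = 50) -> int:
    """Reader's age when a book enters the Public Domain under a
    life-of-the-author-plus-N-years rule."""
    return reader_age + years_until_author_dies + term_after_death


def reader_age_fixed_term(reader_age: int, term: int) -> int:
    """Reader's age when a brand new book enters the Public Domain
    under a fixed copyright term."""
    return reader_age + term


# Reader is 25, author is 50 and lives another 25 years on average.
print(reader_age_life_plus(25, 25))        # 100
# Original U.S. term of 14 years.
print(reader_age_fixed_term(25, 14))       # 39
# 28 years plus the 28 year extension.
print(reader_age_fixed_term(25, 28 + 28))  # 81
```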

The Project Gutenberg Philosophy

The Project Gutenberg Philosophy is to make information, books and other materials available to the general public in forms a vast majority of the computers, programs and people can easily read, use, quote, and search.

This has several ramifications:

  1. The Project Gutenberg Etexts should cost so little that no one will really care how much they cost. They should be a general size that fits on the standard media of the time ...
  2. The Project Gutenberg Etexts should be so easily used that no one should ever have to care about how to use, read, quote and search them ...

1. The Project Gutenberg Etexts should cost so little that no one will really care how much they cost. They should be a general size that fits on the standard media of the time.

That is, when we started, the files had to be very small, as a normal 300 page book took one meg of space, which no one in 1971 could be expected to have (in general). So doing the U.S. Declaration of Independence (only 5K) seemed the best place to start. This was followed by the Bill of Rights, then the whole US Constitution, as space was getting large (at least by the standards of 1973). Then came the Bible, as individual books of the Bible were not that large, then Shakespeare (a play at a time), and then general work in the areas of light and heavy literature and references.

By the time Project Gutenberg got famous, the standard was 360K disks, so we did books such as Alice in Wonderland or Peter Pan because they could fit on one disk. Now the 1.44 meg disk is the standard and ZIP is the standard compression; the practical filesize is about three million characters, more than long enough for the average book.
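Three million characters on a 1.44 disk implies a compression ratio of roughly two to one. A rough way to test whether a given etext would fit is sketched below. It is an estimate under stated assumptions: `zlib` produces the same DEFLATE stream that ZIP normally uses, but the ZIP container and the disk's filesystem overhead are ignored, and the capacity constant and function names are chosen for this example.

```python
import zlib

# Nominal capacity of a "1.44" disk: 1440 KB of 1024 bytes each.
# Assumption: filesystem overhead is ignored in this sketch.
FLOPPY_BYTES = 1440 * 1024


def compressed_size(text: str) -> int:
    """Size in bytes of the text after DEFLATE compression at level 9."""
    return len(zlib.compress(text.encode("ascii"), 9))


def fits_on_disk(text: str, capacity: int = FLOPPY_BYTES) -> bool:
    """True when the compressed text fits within the given capacity."""
    return compressed_size(text) <= capacity


# Ratio implied by "about three million characters" on one disk.
print(round(3_000_000 / FLOPPY_BYTES, 2))

# A repetitive stand-in text; real books compress far less dramatically.
sample = "Alice was beginning to get very tired of sitting by her sister. " * 1000
print(len(sample), compressed_size(sample), fits_on_disk(sample))
```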

However, pictures are still so bulky to store on disk that it will be a while before we include even the low-res Tenniel illustrations in Alice and Looking-Glass. We ARE very interested in doing them, though, and are only waiting for advances in technology to release a test edition. The market will have to establish SOME standards for graphics, however, before we can attempt to reach general audiences, at least on the graphics level.

To illustrate our faith in graphics, and in the future, we have gone one step further in our pursuit of what we named "Replicator Technology" (TM) a few years ago. We would like to end this phase of Project Gutenberg (with a first 3D application of Replicator Technology) by doing CAT, MRI and XRAY Fluoroscopy scans of something, perhaps a painting, and printing 3D copies. If anyone can get us access to a hundred year old masterpiece ...

2. The Project Gutenberg Etexts should be so easily used that no one should ever have to care about how to use, read, quote and search them.

This has created a need to present these Project Gutenberg Etexts in "Plain Vanilla ASCII" as we have come to call it over the years.

The reason for this is simple ... it is the only text mode that is easy on both the eyes and the computer.

However, this encourages others to improve our etexts in a variety of ways and to distribute them in a variety of the available media, as follows:

Once an etext is created in Plain Vanilla ASCII, it is the foundation for as many editions as anyone could hope to do in the future. Anyone desiring an etext edition matching, or not matching, a particular paper edition can readily do the changes they like without having to prepare that whole book again. They can use the Project Gutenberg Etext as a foundation, and then build in any direction they like.

Thus any complaints about how we do italics, bold, and underscoring, or whether we should use this or that markup formula, are sent back with encouragement to do it any way any person wants it, with the basic work already done, with our compliments.

The same goes for media. We have had a long-standing work ethic of providing our etexts in any medium people wanted: Amiga, Apple, Atari ... to IBM, to Mac, to TRS-80 ...

However, now that our etexts are carried on so many BBSs, networks and other locations, it is easier for you to download a file in a manner that puts it in your format than for us to make and mail a disk, so we don't really do that too much.

The major point of all this is that years from now Project Gutenberg Etexts are still going to be viable, but program after program, and operating system after operating system, are going to go the way of the dinosaur, as will all those pieces of hardware running them. Of course, this is valid for all Plain Vanilla ASCII etexts ... not just those your access has allowed you to get from Project Gutenberg. The point is that a decade from now we probably won't have the same operating systems, or the same programs, and therefore all the various kinds of etexts that are not Plain Vanilla ASCII will be obsolete. We need to have etexts in files a Plain Vanilla search/reader program can deal with; this is not to say there should never be any markup ... just that those forms of markup should be easily convertible into regular, Plain Vanilla ASCII files so their utility does not expire when the programs to use them are no longer with us. Remember all the trouble with CONVERT programs to get files changed from old word processor programs into Plain Vanilla ASCII?

Do you want to go through all that again with every book the whole world ever puts into etext?

The value of Plain Vanilla ASCII is obvious ... so is much of the value of most of the various markup systems we have in the world. But until some real standards arrive, we would be limiting our options a great deal if we did not keep copies of all etexts in Plain Vanilla ASCII as well.

We don't have anything against markup. Not vice versa.

Alice in Wonderland, the Bible, Shakespeare, the Koran and many others will be with us as long as civilization ... an operating system, a program, a markup system ... will not.

This includes the many requests we have for compression in particular formats. There are only two formats we know of that are suitable for transfer to a wide general audience: Plain Vanilla ASCII (.txt files) and ZIPped files of them (.zip files). Requests for other compression formats must be ignored, as they are appropriate only for small portions of our target audience. However (programmers take note: we will need help), we are planning to put some compression links on our files so they can be transmitted in any of an assortment of compression formats on the fly; i.e. we should be able to generate any kind of file asked for, while keeping only one copy of each etext on our servers ... as the .Z compression format does in a similar manner today.
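The idea of keeping one stored copy and generating the requested format on the fly might look something like the sketch below. It is an illustration under assumptions, not a description of any Project Gutenberg server: the function names and the format table are invented here, gzip and bzip2 merely stand in for "other compression formats," and the .Z format is left out because the Python standard library has no encoder for it.

```python
import bz2
import gzip
import io
import zipfile


def as_zip(name: str, data: bytes) -> bytes:
    """Wrap the stored text in an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)
    return buffer.getvalue()


# One encoder per requested format; only the plain text is ever stored.
ENCODERS = {
    "txt": lambda name, data: data,
    "zip": as_zip,
    "gz": lambda name, data: gzip.compress(data),
    "bz2": lambda name, data: bz2.compress(data),
}


def serve(name: str, data: bytes, fmt: str) -> bytes:
    """Return the single stored copy encoded in the requested format."""
    try:
        encode = ENCODERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return encode(name, data)


stored = b"When in the Course of human events ...\n" * 50
for fmt in ENCODERS:
    print(fmt, len(serve("when.txt", stored, fmt)))
```

Adding a format then means adding one entry to the table, with no second copy of any etext on disk.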

The Selection of Project Gutenberg Etexts

There are three portions of the Project Gutenberg Library, which can basically be described as:

Light Literature; such as Alice in Wonderland, Through the Looking-Glass, Peter Pan, Aesop's Fables, etc.

Heavy Literature; such as the Bible or other religious documents, Shakespeare, Moby Dick, Paradise Lost, etc.

References; such as Roget's Thesaurus, almanacs, and a set of encyclopedias, dictionaries, etc.

The Light Literature Collection is designed to get persons to the computer in the first place, whether the person may be a pre-schooler or a great-grandparent. We love it when we hear about kids or grandparents taking each other to an etext of Peter Pan when they come back from watching HOOK at the movies, or when they read Alice in Wonderland after seeing it on TV. We have also been told that nearly every Star Trek movie has quoted current Project Gutenberg etext releases (from Moby Dick in The Wrath of Khan; a Peter Pan quote finishing up the most recent, etc.) not to mention a reference to Through the Looking-Glass in JFK. This was a primary concern when we chose the books for our libraries.

We want people to be able to look up quotations they heard in conversation, movies, music, other books, easily with a library containing all these quotations in an easy to find etext format.

With Plain Vanilla ASCII you will easily be able to search an entire library, without any program more sophisticated than a plain search program. In fact, these Project Gutenberg Etext files are so plain that you can do a search on them without even using an intermediate search program (i.e. a program between you and the disk). Norton's and other direct disk access programs can search every one of your files without you even naming them, pointing to an etext directory, or whatever. You can simply search a raw output from the disk ... I do this on a half gigabyte disk partition, containing all our editions.
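As an illustration of how little is needed, here is a minimal sketch of a "plain search program" over a directory of .txt etexts. The function name, the directory layout and the tiny stand-in file are assumptions made for the example; it works on files rather than on raw disk output.

```python
import tempfile
from pathlib import Path


def search_library(root, phrase):
    """Yield (path, line_number, line) for every line under root that
    contains the phrase, ignoring case."""
    needle = phrase.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open("r", encoding="ascii", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path, number, line.rstrip("\n")


# Build a tiny stand-in library so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    Path(root, "moby.txt").write_text(
        "Call me Ishmael.\nFrom hell's heart I stab at thee.\n",
        encoding="ascii",
    )
    for path, number, line in search_library(root, "i stab at thee"):
        print(f"{path.name}:{number}: {line}")
```

Because the files carry no markup or binary structure, a substring test on each line is all the "search engine" there is.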