We have to divide this question into two answers, for books up to 10,000, and books after 10,000 (or older books reposted after we hit 10,000).
Books after 10,000 -- the new naming scheme
Since eBook number 10,000, we name our files based on the PG etext number; thus, the base of the name simply reflects the order in which the book was posted. 12345.txt is just the 12,345th book posted.
Also, when we correct an older book, we may repost it into the new naming scheme rather than just replacing it in the old scheme. When we do this, its naming conventions are the same as if it had been numbered after 10,000, and, additionally, we add a subdirectory "old/", into which we put all of the older files, so that they are preserved for anyone who wants to examine them. In this way, we will eventually move all e-books to the new naming scheme.
Formats or character sets other than plain ASCII then get extensions added to indicate the type of file. Character sets get digits; formats get letters. The most common of these are:
Thus, eBook number 12345 may -- fairly typically -- have the files 12345.txt, 12345.zip, 12345-8.txt, 12345-8.zip, 12345-h.htm and 12345-h.zip, as well as other possible character sets or formats.
Other formats get appropriate three-letter extensions, like -pdf.
The complete set of naming rules for post-10K eBooks is:
1. Directory structure: the directory for the eBook shall be contained in a hierarchy of directories, each one a single digit, being all the digits of the etext number except the last, in order. The name of the directory for the eBook itself shall be the number of the eBook. Thus, eBook #12345 will be contained in:
/1/2/3/4/12345/
and 123456 in
/1/2/3/4/5/123456/
Where an e-book is a reposting of a pre-10,000 text, we will create an old/ subdirectory, containing all of the old files associated with that text. For example, consider:
Mike, by P. G. Wodehouse 7423
The corrected, reposted files will be found in:
/7/4/2/7423/
and the older, pre-10K files will all be held in:
/7/4/2/7423/old/
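As an illustration only, the directory rule above can be sketched in Python. The function names are hypothetical, not part of any PG tool, and the sketch declines single-digit numbers because the rule as stated here does not say how they are handled.

```python
def ebook_directory(number: int) -> str:
    """Return the directory for an eBook number, e.g. 12345 -> /1/2/3/4/12345/."""
    if number < 10:
        # Assumption: the rule as written needs at least one leading digit,
        # so single-digit numbers are left out of this sketch.
        raise ValueError("rule as stated needs a number of two or more digits")
    digits = str(number)
    # Every digit except the last becomes one directory level, in order.
    parents = "/".join(digits[:-1])
    return f"/{parents}/{digits}/"


def old_files_directory(number: int) -> str:
    """Return the old/ subdirectory used when a pre-10K text is reposted."""
    return ebook_directory(number) + "old/"


print(ebook_directory(12345))
print(old_files_directory(7423))
```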
2. Filenames within the eBook's directory shall be the eBook's number, with extensions preceded by a minus sign, indicating character set or format.
a) A file without a character set or format indicator is plain 7-bit ASCII. [In practice, we might allow a few 8-bit characters -- up to a dozen or two -- and still call it ASCII]
* Example: 12345.txt [7-bit plain vanilla ASCII]
b) Character sets, for text files, get digits:
* Example: 12345-8.txt [Text in some 8-bit encoding]
c) File types get letters. Ideally, one-letter formats should be standards-based and editable. For now, the following is the list of single-letter formats.
Other formats get preferably three (more if necessary) letters.
* Example: 12345-x.xml [XML]
* Example: 12345-pdf.pdf [PDF]
When more than one variant of a format is posted, the poster will add additional letters as appropriate.
* Example: If an HTML of 12345 has been posted as 12345-h, and we are posting a new HTML of the same eBook broken into pages, it might be posted as 12345-hp.
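A minimal sketch of how rule 2 might be applied when reading a filename: digits after the minus sign mean a character set, letters mean a format, and no indicator means plain ASCII. The function name and the returned labels are our own illustration. The pattern covers only the basic number, indicator and extension shape, so readme files and date-stamped older editions are rejected rather than guessed at.

```python
import re

# eBook number, an optional "-indicator", then ".extension"
_NAME = re.compile(r"^(?P<number>\d+)(?:-(?P<suffix>[0-9a-z]+))?\.(?P<ext>[0-9a-z]+)$")


def classify(filename: str) -> dict:
    """Split a post-10K filename and say what kind of indicator it carries."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"not a basic post-10K filename: {filename!r}")
    suffix = match.group("suffix")
    if suffix is None:
        kind = "ascii"      # rule 2a: no indicator means plain ASCII
    elif suffix.isdigit():
        kind = "charset"    # rule 2b: character sets get digits
    elif suffix.isalpha():
        kind = "format"     # rule 2c: file types get letters
    else:
        kind = "unknown"    # mixed digits and letters are not described
    return {
        "number": int(match.group("number")),
        "suffix": suffix,
        "ext": match.group("ext"),
        "kind": kind,
    }


for name in ("12345.txt", "12345-8.txt", "12345-hp.htm", "12345-pdf.pdf"):
    print(name, classify(name)["kind"])
```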
3. Under the eBook's directory are all files for that eBook. The .txt files will be in the eBook's main directory, as well as other formats that require only one file (PDF, RTF...). Formats that are likely to require ancillary files get a subdirectory named for file type, with the file within. This is to make it predictable to find the formats, and to allow for any ancillary files to be stored in the subdirectory.
Formats that get a subdirectory include: HTML, TeX and XML. Formats that do not get a subdirectory include: PDF, RTF, LIT, PDB.
The subdir name for each shall be the name of the primary file that lives there.
* Example: The file 12345-h.htm will be at /12345/12345-h/12345-h.htm , and any ancillary files (such as JPEG or CSS) will be in (or below) the same subdirectory.
4. A .zip for each format will be in the main eBook directory. The .zip will unzip to a subdirectory if it's a multi-file format from #3 above, otherwise it will simply unzip a file. In the case of some pre-compressed formats, such as MP3, a .zip may not make sense, in which case it may be omitted.
* Example: 12345-h.zip will be at 12345/ , and when unzipped will create a subdirectory 12345-h/ with 12345-h.htm and any ancillary files.
* Example: 12345-pdf.zip will be at 12345/, and when unzipped will create 12345-pdf.pdf in the current directory.
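Rules 3 and 4 together can be sketched as follows. The set of indicators that get a subdirectory is an assumption drawn from the HTML, TeX and XML examples in this answer (h, hp, t, x); it is not an official list, and the function name is hypothetical.

```python
# Assumption: indicators for formats that get their own subdirectory,
# taken from the HTML, TeX and XML examples in the text.
SUBDIR_SUFFIXES = {"h", "hp", "t", "x"}


def file_paths(number: int, suffix: str, ext: str) -> tuple[str, str]:
    """Return (path of the primary file, path of its .zip), relative to the parent of the eBook directory."""
    main = f"{number}/"
    stem = f"{number}-{suffix}"
    if suffix in SUBDIR_SUFFIXES:
        # Multi-file format: the subdirectory is named for the primary file.
        primary = f"{main}{stem}/{stem}.{ext}"
    else:
        # Single-file format: the file sits in the main eBook directory.
        primary = f"{main}{stem}.{ext}"
    # The .zip always lives in the main eBook directory.
    return primary, f"{main}{stem}.zip"


print(file_paths(12345, "h", "htm"))
print(file_paths(12345, "pdf", "pdf"))
```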
5. Versions and editions: when a new EDITION (a corrected file) is posted, the original file is renamed by appending its own posting date as an extension, in the form .yyyymmdd, and the corrected file takes its place. So 12345.txt, when replaced, becomes 12345.txt.20030101 and the new, corrected file becomes 12345.txt.
New EDITIONS will get a "Most recently updated: " line added to their standard metadata.
The Release Date in the standard header will be the month and year of the actual first posting of that eBook.
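The renaming in rule 5 amounts to appending the original file's posting date. A small sketch, with a hypothetical function name:

```python
from datetime import date


def archived_name(filename: str, posted: date) -> str:
    """Name given to a superseded file: its old name plus .yyyymmdd of its own posting date."""
    return f"{filename}.{posted:%Y%m%d}"


print(archived_name("12345.txt", date(2003, 1, 1)))
```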
6. Each file (e.g., 12345-h.htm) should have a Project Gutenberg header, metadata and footer. In cases where the file is not editable (such as PDF), or where adding a header isn't realistic (such as MP3), the header, metadata and footer can go in a "readme" file named for the file, with "-readme" added before the extension. The "readme" file shall be in the same directory as the file to which it refers, and shall be included in the ZIP file for that format. Where the format is multifile, there should be only one "readme" for all files.
* Example: "12345-pdf-readme.txt" for the file 12345-pdf.pdf
Note: If we were able to add the standard header prior to creating the PDF file, it could be distributed as any other editable format without a readme.
* Example: "12345-m-readme.txt" for the files 12345-m-001.mp3, 12345-m-002.mp3, etc.
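Going by the two examples above, the readme name can be derived from the eBook number and the format indicator alone, which is also what gives a multifile format a single readme. This sketch assumes the readme is always a .txt file, as in both examples; the function name is hypothetical.

```python
def readme_name(filename: str) -> str:
    """Derive the readme name for a format file, e.g. 12345-pdf.pdf -> 12345-pdf-readme.txt."""
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("-")
    if len(parts) < 2:
        raise ValueError(f"no format indicator in {filename!r}")
    # Keep only the number and the indicator, so that 12345-m-001.mp3 and
    # 12345-m-002.mp3 share one readme.
    number, suffix = parts[0], parts[1]
    # Assumption: the readme is plain text (.txt), as in the examples.
    return f"{number}-{suffix}-readme.txt"


print(readme_name("12345-pdf.pdf"))
print(readme_name("12345-m-002.mp3"))
```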
7. The GUTINDEX file(s) will have entries of the form:
Title, by Author eBook#
eBook # will be in 5 digits, followed by a "C" if copyrighted and "*" if reserved. "by " will be omitted if there is not enough space. Any additional data, such as a translator or subtitle, will be on a following line or lines surrounded by square brackets [] and indented by two spaces.
GUTINDEX will have approximate date indicators such as:
** MARCH 2004: 822 eBooks
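For anyone reading GUTINDEX with a script, an entry line of this form might be picked apart as sketched below. The regular expression and the function name are our own illustration, not a PG tool. The sketch assumes the eBook number is the last thing on the line, allows fewer than five digits for older texts, and treats the author as unknown when ", by " is absent.

```python
import re

# Title/author text, whitespace, then the number with an optional C or * flag.
_ENTRY = re.compile(r"^(?P<text>\S.*?)\s+(?P<number>\d{1,5})(?P<flag>[C*]?)$")


def parse_entry(line: str):
    """Parse one GUTINDEX entry line; return None for anything else."""
    match = _ENTRY.match(line.rstrip())
    if match is None:
        return None  # bracketed continuation lines and date indicators land here
    text = match.group("text")
    title, sep, author = text.rpartition(", by ")
    if not sep:
        # "by " may have been omitted for lack of space; do not guess the split.
        title, author = text, None
    return {
        "title": title,
        "author": author,
        "number": int(match.group("number")),
        "copyrighted": match.group("flag") == "C",
        "reserved": match.group("flag") == "*",
    }


print(parse_entry("Mike, by P. G. Wodehouse 7423"))
```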
The following is an example for etext #12345, assuming it has ASCII, 8-bit and Unicode text files, an HTML and an HTML broken into pages, XML, PDF, TeX and LIT formats, and MP3. Assume that we couldn't edit the LIT, and so had to add a "readme" for that containing the header as in point 6 above.
The directory 12345 for the eBook will be at
1/2/3/4/12345/
and it will contain the files
1/2/3/4/12345/12345.txt
1/2/3/4/12345/12345.zip
1/2/3/4/12345/12345-0.txt
1/2/3/4/12345/12345-0.zip
1/2/3/4/12345/12345-8.txt
1/2/3/4/12345/12345-8.zip
1/2/3/4/12345/12345-h.zip
1/2/3/4/12345/12345-hp.zip
1/2/3/4/12345/12345-t.zip
1/2/3/4/12345/12345-x.zip
1/2/3/4/12345/12345-pdf.pdf
1/2/3/4/12345/12345-pdf.zip
1/2/3/4/12345/12345-lit.lit
1/2/3/4/12345/12345-lit-readme.txt
1/2/3/4/12345/12345-lit.zip
and in its subdirectories the further files
1/2/3/4/12345/12345-h/12345-h.htm
1/2/3/4/12345/12345-h/image1.png
1/2/3/4/12345/12345-hp/12345-hp.htm
1/2/3/4/12345/12345-hp/page2.htm
1/2/3/4/12345/12345-hp/image1.png
1/2/3/4/12345/12345-t/12345-t.tex
1/2/3/4/12345/12345-x/12345-x.xml
1/2/3/4/12345/12345-x/12345-x.xsl
1/2/3/4/12345/12345-x/image1.png
1/2/3/4/12345/12345-m/12345-m-readme.txt
1/2/3/4/12345/12345-m/12345-m-001.mp3
1/2/3/4/12345/12345-m/12345-m-002.mp3
Books up to 10,000 -- the old naming scheme
Older PG files are named for the text, the edition, and the format type.
Nearly all of these PG files are named in "8.3" format--that is, up to eight characters, a dot, and three more characters. (It should have been all of them, by the rules, but we had to break a few.)
The first five characters in the filename are simply a unique name for that text, for example, "Ulysses" by Joyce begins with "ulyss".
If the text has been posted as both a 7-bit and 8-bit text, then the first character of the filename will be a 7 or an 8, to indicate that. For example, we have both 7crmp10 and 8crmp10 for Dostoevsky's Crime and Punishment.
The 6th and 7th characters of the name are the edition number--01 through 99. We normally start at edition 10 (1.0); numbers lower than that indicate that we think the text needs some more work; numbers higher than that mean that someone has corrected the original edition 10.
The 8th character of the filename, if it exists, indicates either the version or the format of the file. When we post a different version of the text based on a different source, for example a different translation, we give it an a, b, c, and so on. Where we have posted a text in a different format, we also add an eighth character: "h" for HTML, "x" for XML, "r" for RTF, "t" for TeX and "u" for Unicode are established formats. There have been some experimental postings with "l" for LIT, and "p" for either PRC or PDB.
So, for example:
| 7crmp10 | is our first edition of Crime and Punishment in plain ASCII |
| 8sidd10 | is our first edition of Siddhartha, as an 8-bit text |
| dyssy10b | is our first edition of our third translation of Homer's Odyssey, in plain ASCII |
| jsbys11 | is our second edition of Jo's Boys, in plain ASCII |
| vbgle10h | is our HTML format of our first edition of Darwin's Voyage of the Beagle |
| 7ldv110 | is our 7-bit ASCII version of the first volume of the Notebooks of Leonardo da Vinci |
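The examples above can be split mechanically with a sketch like the following. It is only a heuristic under our own assumptions: the last two digits are read as the edition, one optional trailing letter as the version or format, and everything before that as the base name. It does not try to decide whether a leading 7 or 8 is a character-set marker, and the function name is hypothetical.

```python
import re

# Base name (as short as possible), two-digit edition, optional trailing letter.
_OLD = re.compile(r"^(?P<base>[0-9a-z]+?)(?P<edition>\d{2})(?P<suffix>[a-z]?)$")


def parse_old_name(name: str) -> tuple[str, int, str]:
    """Split a pre-10K base filename into (base, edition, version-or-format letter)."""
    match = _OLD.match(name)
    if match is None:
        raise ValueError(f"does not look like an old-scheme name: {name!r}")
    return match.group("base"), int(match.group("edition")), match.group("suffix")


for name in ("7crmp10", "dyssy10b", "jsbys11", "vbgle10h", "7ldv110"):
    print(name, parse_old_name(name))
```

The trailing letter alone does not say whether it marks a version (a, b, c) or a format (h, x, r, t, u), so a caller still has to interpret it.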
To make it worse, we don't always stick to these rules, for example:
| 1ddc810 | is our first edition of the first book of Dante's Divina Commedia in Italian, as an 8-bit text |
| 80day10 | is our first edition of Verne's Around the World in 80 days, in plain 7-bit ASCII in English. |
| emma10 | is our first edition of Jane Austen's "Emma"--with a 4-character basename instead of 5. |
Some series have special, non-standard names. Shakespeare is named with a digit representing the overall source (First Folio, etc), then "ws", then a series number, so for example 0ws2610, 1ws2610 and 2ws2610 are all versions of "Hamlet". The Tom Swift series is named with a two-digit prefix denoting the series number, then "tom", so for example 01tom10 is "Tom Swift and his Motor-Cycle".
And what should we do with a text from a different source that is formatted as HTML? For example, if dyssy10b is the name of the third translation, what should the HTML version be named? dyssy10bh is obvious, but it uses 9 characters.
The problem, of course, is that we are trying to fit a lot of information into an 8-character filename. As the collection grows and the number of formats and versions increases, the pressure on filenames grows with it. So while the filename is a good guide to the contents, it's not definitive.