The Project Gutenberg FAQ - H-4

H.4. What are the PG rules for HTML texts?

1. The only absolute rule is that the HTML must be valid according to one of the W3C HTML standards, and that any CSS used must also be valid.

You can verify that your HTML is valid at the W3C's HTML Validator at http://validator.w3.org/

You can verify that your CSS is valid at the W3C's CSS Validator at http://jigsaw.w3.org/css-validator/

For a more convenient and friendly, though less official, check of the correctness of your HTML, you should use Dave Raggett's Tidy program at http://tidy.sourceforge.net, which not only points out any messiness in your HTML code, but also has some neat modes to clean it up and standardize the formatting.

 

After that, we have some requirements and recommendations. Compliance with a requirement may be waived if there is a really good reason to make an exception in a particular case.

 

2. Requirement: File names and extensions

All file (and, if present, subdirectory) names and extensions should be in lower case throughout, and should use only the letters "a" through "z", the digits "0" through "9", the dash character "-", the underscore character "_", and the period character ".". The period should be used only once in each file name, to indicate the extension, as in image.jpg. Yes, we know this is not strictly necessary, but we don't want to have to correct every submission where the HTML references "image.png" and the accompanying file is named IMAGE.PNG. This applies to everything linked from the main HTML file, whether subdirectories, images, other HTML files, or CSS.

All images, if present, must be in a subdirectory named /images.

While the old 8.3 format (eight characters plus a three-character extension) is not a requirement for file names, file names should be kept reasonably short, and never, ever exceed 32 characters.
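
As a rough illustration, the naming rules above can be checked mechanically. The following Python sketch is not an official PG tool; the names NAME_PATTERN, MAX_NAME_LENGTH and is_valid_file_name are made up for this example. It checks a single file name (without any directory part) and assumes that the extension consists only of lower-case letters and digits.

```python
import re

# Lower-case letters, digits, dash and underscore, followed by exactly one
# period and the extension. Assumption for this sketch: the extension itself
# contains only lower-case letters and digits.
NAME_PATTERN = re.compile(r"[a-z0-9_-]+\.[a-z0-9]+")

# The rules say a file name should never exceed 32 characters.
MAX_NAME_LENGTH = 32


def is_valid_file_name(name: str) -> bool:
    """Return True if a single file name (no directory part) follows the rules."""
    if len(name) > MAX_NAME_LENGTH:
        return False
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["image.jpg", "IMAGE.PNG", "my.image.jpg", "plate_01-a.png"]:
        print(candidate, is_valid_file_name(candidate))
```

Subdirectory names have no extension, so they would need a slightly different pattern; that case is left out here to keep the sketch short.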

 

3. Requirement: Accessibility

Where styles are used, whether with CSS or HTML, you must not impose personal preferences that may interfere with some readers' ability to read or enjoy the text. That is a guiding principle.

The W3C Accessibility Guidelines at http://www.w3.org/TR/WCAG10/full-checklist.html provide a checklist for web pages in general, and that is partly applicable here -- it is certainly a good idea to be familiar with their guidelines. However, we are dealing with a special case in making eBooks: while the W3C makes certain content recommendations, we have no control over the content itself; while the W3C recommends use of the latest technologies, this is meaningless in our context, where the text may be unchanged for decades; while the W3C is talking about web sites in general, we are making one specific type of HTML page.

Listing all possible implications of that is not practical, but specifically, you should try to:

a) Ensure that your text is well laid out, sensible and readable at all font sizes.
b) Ensure, if you use CSS, that your HTML is readable even when the CSS is removed.
c) Ensure that images have a meaningful "alt" attribute so that a description of the image is available for those who can't see it, and tables have a "summary" attribute.

and you should avoid:

a) Forcing absolute font sizes in points (pt); instead, you can use, for example, "em" or "%" to indicate larger or smaller text in CSS, or "<big>", "<small>", or a relative size such as "-1" or "+1" in an HTML font tag.
b) Forcing absolute fonts or font-families, or generic font-families.
c) Forcing background colors other than white, or text colors other than black.
d) The use of frames, blinking text, pop-up windows, auto-redirect or auto-refresh.
e) The use of tables other than for tabular data. Many commercial web pages use tables for their entire layout, but we should use tables only where we are displaying actual tables of information.
f) Creating a hyperlink to anything outside the eBook itself, except in a Credits Line that links to the site of an image or text provider for the eBook.

As always, despite the general rules, there may be cases where, in a small part of the text, these restrictions should not apply. For example, it may be appropriate to use the generic font-family "cursive" in rendering a letter, or a different color for a small insert or a heading.
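
Some of the points above, such as "alt" attributes on images and "summary" attributes on tables, can be spot-checked automatically. The sketch below uses Python's standard html.parser module; the class name AccessibilityChecker and the function find_accessibility_problems are hypothetical names chosen for this example, and the check is only a rough aid, not a substitute for reading your own HTML. It treats an empty "alt" or "summary" as missing, on the assumption that an empty description is not a meaningful one.

```python
from html.parser import HTMLParser


class AccessibilityChecker(HTMLParser):
    """Collect images without alt text and tables without a summary."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (line number, message) pairs

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        attributes = dict(attrs)
        line, _column = self.getpos()
        # A missing attribute gives None, so fall back to an empty string.
        if tag == "img" and not (attributes.get("alt") or "").strip():
            self.problems.append((line, "img without alt text"))
        if tag == "table" and not (attributes.get("summary") or "").strip():
            self.problems.append((line, "table without summary"))


def find_accessibility_problems(html: str) -> list:
    """Return the problems found in one HTML document, in document order."""
    checker = AccessibilityChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

A check like this cannot tell whether an "alt" text is actually meaningful; it only finds the cases where there is nothing at all.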

 

4. Requirement: No scripting

We don't want our readers to be worried about malicious or just plain buggy code, so we do not post any form of scripting in an HTML file, including JavaScript.
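
If you want to confirm that no scripting has crept into a file, for example from an HTML editor's export, a check along the following lines may help. This is only a sketch using Python's standard html.parser module; ScriptFinder and find_scripting are names invented for this example. It assumes that scripting appears in one of three forms: a script element, an attribute whose name starts with "on" (such as onclick), or a URL beginning with "javascript:".

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Record script elements, event-handler attributes and javascript: URLs."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.findings.append("<script> element")
        for name, value in attrs:
            # Attribute names arrive lower-cased; values may be None.
            if name.startswith("on"):
                self.findings.append(f"{name} attribute on <{tag}>")
            elif value and value.strip().lower().startswith("javascript:"):
                self.findings.append(f"javascript: URL in {name} on <{tag}>")


def find_scripting(html: str) -> list:
    """Return a description of each piece of scripting found, in order."""
    finder = ScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

An empty result means that none of these three forms was found in the file.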

 

5. Requirement: HTML and plain-text

Project Gutenberg does publish well-formatted, standards-compliant HTML. However, we insist that a plain text version be available for all HTML documents we publish (even if images or formatting are absent), except when ASCII can't reasonably be used at all, for example with Arabic or mathematical texts.

 

6. Requirement: Archive format for posting

If the HTML book contains more than one file (including images), create a ZIP (preferable) or TAR archive containing all of the files in the book for upload.
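
Any ZIP program will do for this. For those who prefer to script it, here is a minimal Python sketch using the standard zipfile module. The function name build_archive and the directory layout it expects (one directory holding the HTML file and its images subdirectory) are assumptions made for this example.

```python
import zipfile
from pathlib import Path


def build_archive(book_dir: str, archive_path: str) -> list:
    """Zip every file under book_dir and return the names stored in the archive."""
    root = Path(book_dir)
    archive = Path(archive_path).resolve()
    added = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself in case it is
            # being written inside the book directory.
            if path.is_file() and path.resolve() != archive:
                # Store paths relative to the book directory, with forward
                # slashes, so that "images/..." links still resolve.
                arcname = path.relative_to(root).as_posix()
                zf.write(path, arcname)
                added.append(arcname)
    return added
```

For example, build_archive("mybook", "mybook.zip") would store mybook/book.html as book.html and mybook/images/cover.jpg as images/cover.jpg, where "mybook" and the file names are placeholders.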

 

7. Recommendation: Simplicity

Make your HTML as simple as possible. HTML is an evolving standard, and one that may be completely obsolete in the long term. Use of advanced features may just mean that your version will be obsolete or unreadable that much faster.

 

8. Recommendation: Images

Images included with your HTML should be in a format that Web browsers can read: GIF, JPEG or PNG. Images should be edited for high quality in a reasonably small file size. Make the best decision you can concerning the image size and placement in the text. Every image included must be linked into (referenced by) the HTML.
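
To check that every image is actually referenced, you can compare the files in your images subdirectory against the links in the HTML. The Python sketch below is one way to do that; ImageReferenceCollector and compare_images are hypothetical names, and the sketch assumes that images are referenced either by the "src" of an img tag or by the "href" of an a tag (for instance, a link to a larger version), always with a path starting "images/".

```python
from html.parser import HTMLParser


class ImageReferenceCollector(HTMLParser):
    """Collect the targets of img src and a href attributes."""

    def __init__(self):
        super().__init__()
        self.references = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            is_image = tag == "img" and name == "src"
            is_link = tag == "a" and name == "href"
            if value and (is_image or is_link):
                self.references.add(value)


def compare_images(html: str, image_files: list) -> tuple:
    """Return (files never referenced, references with no matching file)."""
    collector = ImageReferenceCollector()
    collector.feed(html)
    collector.close()
    referenced = {ref for ref in collector.references if ref.startswith("images/")}
    files = set(image_files)
    return sorted(files - referenced), sorted(referenced - files)
```

Here image_files is expected to be a list of paths relative to the HTML file, such as "images/cover.jpg", written exactly as they would appear in the HTML.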

 

9. Recommendation: Line lengths

If it is reasonable to do so, try to wrap paragraphs of text at around the normal PG margin of 70 characters. Ideally, your HTML should be as near as possible identical to your text version except for the HTML tags and entities. There are several reasons for this: people who open your HTML won't all be using browsers; people will need to make corrections; not all editors can handle very long lines; and even with editors that can handle long lines, it's easier to work with short lines. Further, it is very desirable that your text and HTML files should, as near as possible, match line-for-line to make maintenance easier -- rewrapping the HTML just makes it harder to compare and fix.
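
A quick way to find lines that have run well past the margin is to list them with their lengths. The Python sketch below does only that; the function name long_lines is made up for this example, and the default limit of 70 is taken from the margin mentioned above. It deliberately reports lines rather than rewrapping them, since rewrapping would break the line-for-line match with the text version.

```python
def long_lines(text: str, limit: int = 70) -> list:
    """Return (line number, length) for every line longer than limit."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Lines made long by HTML tags alone may be perfectly acceptable, so treat the result as a list of places to look at, not a list of errors.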

 

10. Recommendation: Single-file HTML

Normally, all HTML and CSS for the book should be provided in one single file, with all images as separate files in an /images subdirectory. There may be times when it is appropriate to split the HTML into multiple files -- for example, when it is too big to fit in a standard browser -- and in such cases it may also be appropriate to provide the CSS as a separate file linked from each of the HTML files.

Where you must split an HTML eBook into multiple files, the naming requirements for files listed in point 2 above apply.
