Project Gutenberg Newsletter:
Distributed Proofreaders Update


Distributed Proofreaders Update for 17 December 2003

If you are reading this without your favorite beverage, stop right now and go correct that situation. . . . Go on! This week we have a smorgasbord of newsworthy topics to cover, and if you care enough about PG to be reading this newsletter at all, then I promise you an interesting little diversion for the DP segment this week. So go get a big mug of coffee or tea, maybe a Guinness, whatever oils your works. I'll be right here waiting when you get back.

Today, December 17th, we mark the 100th anniversary of the moment when a dream of great significance was realized. A century ago two brothers with a vibrant vision and a healthy dose of ingenuity set humanity free from exile on the planet's surface. It is easy to conjecture that the desire to fly as freely as the birds reaches back to the earliest days of our existence. Once it was clear that we could do more than float up in a balloon, subject to the winds--that we could be masters of the air--the future of Humanity changed forever. A mere 66 years later, the flight that began on a little hill in North Carolina reached all the way to the moon. 100 years on, our visions now look to the unlimited vastness of the Universe itself.

When I logged into DP earlier, after several days away, I smiled broadly to see these words greeting me on my return:

"Today we are celebrating the 100th anniversary of the Wright brothers' first flight with some specially selected 'aviation' material. Come fly with us!"

I smiled as I thought that if there ever was an appropriate slogan for recruiting new supporters to the vision of Project Gutenberg, this was it: "Come fly with us!"

All dreams are 'lofty' things. Out of the airy invisible, a rare individual plucks an intangible idea as it floats by like a feather. It is a simple truth that we owe everything we are today on this earth to the rare breed of people whom we call "dreamers." Go ahead and try to touch anything in the room around you, including the room itself, that was not once an idea in the mind of one individual. That is who we are and that is how we make things in this world. Do not let anyone ever tell you differently: dreaming is a wonderful occupation.

One day we may very well travel to other star systems, and on that journey will be two spirits who on this day 100 years ago raised humanity's aim above and beyond. As I reflected on that thought earlier, I smiled more broadly still to realize that it is in this potential future that our work in these projects is directly connected to that day at Kitty Hawk. When our children's children take off towards other worlds, it is a certainty that they will be bringing legacies from home with them, legacies that will survive to their generations partly because of the work we do in the present.

What Michael Hart began more than thirty years ago, with the words: "We hold these truths to be self-evident..." has taken a trajectory not dissimilar to that of the Wright brothers. More than 10,000 works have followed the Declaration of Independence, and in years to come that number will eventually reach 1,000,000.

Come fly with us! Yes, how appropriate indeed!

This is a good time to look back upon the Gutenberg journey and forward to the future. Over the past 48 hours those who gathered together in California have returned home and begun settling back into their daily routines. It is a safe bet that no one who participated in the meetings and discussions of the past several days is quite the same as they were a week ago. We are distributed throughout the world, and we get a great deal accomplished that way. However, the infusion of energy, innovation and inspiration that is generated by face-to-face interaction in real time adds a whole new level of dynamism to our collective efforts.

You have been reading, and will continue to read, accounts from those who were there. I was not at the meetings, so I will confine myself to reporting the news items from the event. One of the 'flash' items from the conference was the recognition of the significance of the completion of the Copyright Renewals. If you have been with DP more than a couple of months then you know what a 'piece of work' these projects were to complete. Working from the trenches of the proofing rounds, you may not be aware of the incredible worth of these dry manifests. They are nothing short of golden in their value to the public domain. In time we will look back and say: "Yes, I was there, I worked on the renewals." And we will say it with deep pride.

One person who has provided material recognition of the present worth of the Copyright Renewals is Brewster Kahle. To commemorate the successful completion of DP's work on the CRs, Brewster kept a promise to donate $10,000.00 to Project Gutenberg. Now, 10K certainly does not alter the destiny of PG, but it is a significant gesture and a contribution that proofers at DP can feel a true part of. Through the Internet Archive, Brewster has long been a supporter of PG and DP. He also provided a variety of support to see to the success of this week's conference.

When the CRs are incorporated into a searchable database they will serve to verify the eligibility of thousands of publications for the public domain. This task is so tedious at present as to be nearly unworthy of the effort involved. The easy availability of the Copyright Renewals will change that forever, thus making available an immeasurable wealth of cultural and historic content to the whole world. On behalf of all whom this accomplishment eventually touches, let me voice a sincere and profound appreciation to all those who worked on the many stages of the CR project!

What the conference provided on the whole was the chance for many people to get down to some serious discussion of the present state, future directions and possible strategies for PG and all affiliated projects. Topics included: sustaining and increasing the participation levels of volunteers; innovations to the cataloguing system for the PG library; initiation of an image library; increase of support to the readers of e-books; incorporation of XML and the future availability of format-on-demand features of PG titles; community development among the readership; derivative content developments from existing texts; and much more.

Whether these all come to be, and in what manner and time frame, will be the topic of future discussions and conferences on-line and off. As I discuss these topics with those who were out in California, I will share more details with you here in the weeks ahead. I know that many people reading this will wish they could have been out there at the conference. If it is any consolation, I share your feelings. But... let us remember what we are celebrating at DP in December... the conclusion of what is surely the most productive and successful year in the project's history, and the coming of what promises to be the year which will--by a wide margin--claim that title.

We have shared the journey through this wonderful year together and have given the best of our efforts and intentions to PG/DP. In the year ahead we will do the same and perhaps far more. There will be more meetings and conferences in different regions and countries and with each we will build upon what has begun in California. Look forward then, and stay close to your newsletter in the weeks ahead. In 2004 we will fly...and no doubt, more than once or twice we will glance up at the stars and smile, knowing there is a vast and wondrous destiny aloft, which we are all a part of.

With my travels and personal time constraints over the past couple of weeks it has been necessary to set aside the continuation of our exploration and profiles of the tools involved in the DP production process. Today we will pick up where we left off and take a look at the pair of tools developed by Steve Schulze, aka ThunderGnat to the DP community. We also have a piece this week from Bill Keir about those pesky little sprites that trip up the best proofreaders and go by the name of Scannos.

The pair of tools known as Guiprep and Guiguts are much adored by the content developers and post processors of DP. It is rather an injustice to call them a "pair" of tools, as they both perform a wide range of processes that have come to be essential to efficient and expedient production. From the beginning, we decided that the best way to provide you with the latest and most informative background to the tools is to let the developers speak for themselves. So without further introduction I turn the mic over to Steve.

Musings on Guiguts

Guiguts came about because of my frustration with Proofreaders Toolkit, an older, no longer supported toolkit that was used to prepare texts for Distributed Proofreaders. Proofreaders Toolkit (PRTK) has a GUI front end to gutcheck built into it. It works, but there are several things about it that are sub-optimal.

Number 1. It was designed to work with an older version of gutcheck. The command line options for gutcheck have changed slightly since the PRTK was written, so it doesn't interface very well.

Number 2. An even bigger problem: every time you make an edit to the file, the list of gutcheck errors becomes unsynchronized and it gets hard to find subsequent errors that gutcheck reported.

I had previously written a preprocessing application called Prep to do pre-proofing checks on texts before they were uploaded to the site. After 8 versions of Prep, I added a GUI front end to it to make it easier to select options for processing. (There were some 30 or so options and the command line was getting out of hand. There are over 60 now.) When I added the front end, I changed the name to Guiprep to differentiate it from command line Prep.

When the frustration level with PRTK's interface to gutcheck grew too much, I thought, "Heck, I could probably write something to do that," and did so. When it came time to name it, I thought, "Well, I already have Guiprep, a GUI front end to Prep; this is a GUI front end to gutcheck, so I'll call it Guigutcheck." But that was too long, so I shortened it to Guiguts (which I found amusing anyway, so that was a big plus too).

Guiguts is written in Perl to take advantage of its very powerful text processing functions and cross-platform support. It will run on Windows and Linux platforms and could be easily modified to work on Mac OS X (but I don't have access to an OS X system for development and testing). It unfortunately cannot be easily ported to Mac OS 9 and earlier due to lack of some necessary Perl modules for those OSes. Since it is written in Perl, the source is automatically available for experimentation and hacking to anyone who is inclined to do so. I also distribute a compiled Windows executable version (winguts.exe) for those who don't have a Perl interpreter on their machine and just want to download and go.

Guiguts was originally intended just to be a front end to gutcheck. In order to make it usable, I had to make it a fairly full-featured text editor so you would be able to make corrections to errors gutcheck reported. So, since I already had a fairly decent text editor written, I figured I'd add some other specialized functions that would come in handy for some texts I was post processing. I think some of the first functions I added were to do bulk change of case to selected text. (Make it all uppercase, lowercase, whatever.) Not too unusual in a decent text editor, but useful. Another thing I added early on was a word frequency and comparison function. It would count all of the words in a text and how many times they occurred, then let you display them in various sort orders. (By frequency, by alphabetical order, etc.) I wrote a function to help find "stealth scannos", words that are commonly mis-scanned but will pass a spellcheck, like "arid" for "and" for instance.
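
The word frequency function described above can be sketched in a few lines. What follows is an illustration in Python, not Guiguts' actual Perl code: the function names are invented for this example, and the definition of a "word" (a run of letters, allowing internal apostrophes or hyphens) is an assumption.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    # Assumption: a word is a run of letters, optionally joined by
    # internal apostrophes or hyphens (e.g. "don't", "to-day").
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return Counter(words)


def by_frequency(counts):
    """Most frequent words first; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def by_alphabet(counts):
    """Words in plain alphabetical order."""
    return sorted(counts.items())


sample = "The cat and the dog and the bird."
counts = word_frequencies(sample)
for word, n in by_frequency(counts):
    print(f"{n:3d}  {word}")
```

Sorting by frequency tends to surface the rare words, which is where mis-scanned oddities usually hide.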

As time went by, a core group of intrepid testers suggested new functions and improvements to existing ones until it has become a fairly powerful and comprehensive post processing toolkit on its own.

A partial list of functions and capabilities:

Search & Replace: Full search and replace functions, search for full or partial words. Able to search using regular expressions with variable extraction for replacement terms.

Stealth Scannos: Find words that were scanned incorrectly but will pass spellcheck.

Spell check: Provides hooks to tie in Aspell or Ispell to do full interactive spell checking.

Find Orphaned Brackets: Brackets or parentheses are often mismatched in a text, and it can be a real pain to find the unmatched ones. This function makes it easy.

Case Adjustment: An array of bulk case adjustment functions, convert to uppercase, convert to lowercase, convert to sentence case, convert to title case.

Bulk indenting: Indent a selection of text in or out 1 space with each press; preserves relative indenting.

Text Rewrap: Automatic rewrap of selected text. Adjustable rewrap margin. Adjustable indent. Lots of options.

Word Frequency Analysis: Sort and count words in the text. Specialized sub functions to find hyphenated words, words with accents, words with mixed alphabetic and numeric characters, and several others.

Footnote Fixup: Functions to automate renumbering, moving and reformatting footnotes.

HTML Fixup: Functions to work with HTML markup including finding orphaned markup and auto generating an HTML version of a text.

ASCII Box Drawing: Automatically draw ASCII boxes around selected text. Optionally rewrap and center or left or right justify the text in the box.

A whole host of other specialized functions.

And oh yes, it provides a GUI interface to gutcheck.
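
To give a feel for what a function like Find Orphaned Brackets has to do, here is a minimal sketch in Python. It is not taken from Guiguts (which is written in Perl); the function name and the choice of bracket types are assumptions made for illustration. The idea is the usual one: remember each opening bracket until its partner turns up, and report whatever is left over.

```python
# Assumption: only these three bracket types are checked.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def find_orphaned_brackets(text):
    """Return sorted (line, column, char) tuples for unmatched brackets."""
    stack = []    # opening brackets still waiting for their partner
    orphans = []  # closing brackets that had no matching opener
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in OPENERS:
                stack.append((line_no, col, ch))
            elif ch in PAIRS:
                if stack and stack[-1][2] == PAIRS[ch]:
                    stack.pop()  # matched pair, nothing to report
                else:
                    orphans.append((line_no, col, ch))
    # Anything left on the stack was opened but never closed.
    orphans.extend(stack)
    return sorted(orphans)


sample = "ok (fine)\nbad ] here\nopen [ never closed"
for line_no, col, ch in find_orphaned_brackets(sample):
    print(f"line {line_no}, column {col}: unmatched {ch!r}")
```

Reporting line and column numbers is what lets an editor jump straight to the offending character.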

Thank you, Steve! ... for the background and all the effort to develop these powerful tools for the PG/DP community!

Now Big Bill is going to fill us in on the slippery bane of all post processors, the dreaded Tasmanian Scanno. Okay, so they're not from Tasmania, but Bill has been doing his best to exile them there for the rest of us.

Stealth Scannos by Bill Keir

In the late 19th century wasn't the telephone considered wonderful modem technology? Or was it wonderful *modern* technology?

A standard step in preparing a text for PG is to spell-check it. Of course that can only do so much, and while it will detect words that have been OCRd as junk, it won't detect words that have been OCRd as other words.

When "he" is OCRd as "fe", we have a scanno - analogous to typo - an error. Spell-checkers will catch scannos that produce non-words; that's what spell-checkers do, identify non-words.

But when "he" is OCRd as "lie", we have a scanno that a spell checker will not blink at - a stealth scanno that flies under the spell-checker's radar.

Tonya was the first to publish a list of the most commonly occurring scannos of this type, as part of her comprehensive PPing checklist. Classics she cited included "arid" being produced for "and", "yon" for "you" and "modem" for "modern". These occur so often, and the words look so similar on the screen (unless you're using a custom font, see below), that they are missed by human eyes as well as by spell-checkers. Tonya therefore suggested that it was worthwhile to search your text for "arid", as most occurrences were probably mis-scanned "and"s.

I named them and offered to be a central clearing house for reported sightings. We now have hundreds of stealth scannos and more every month. Various software tools make use of the listings, and once again the cooperation of many individuals has led to improving the standards of quality of our submitted texts.
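
A search of the kind Tonya suggested is easy to automate once a list exists. The sketch below, in Python, is only an illustration of the idea and not the code of any DP tool: the function name is invented, and the list holds just the four examples mentioned in this article, where the real listings run to hundreds of entries.

```python
import re

# Suspect word -> the word it most likely should have been.
# Assumption: a tiny sample list built from the examples in this article.
STEALTH_SCANNOS = {"arid": "and", "yon": "you", "modem": "modern", "lie": "he"}


def find_stealth_scannos(text, scannos=STEALTH_SCANNOS):
    """Return (line_no, word, likely_intended) for each suspect word."""
    # Match whole words only, so "yonder" or "earlier" are not flagged.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, scannos)) + r")\b", re.IGNORECASE
    )
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            word = match.group(1)
            hits.append((line_no, word, scannos[word.lower()]))
    return hits


sample = (
    "The telephone was wonderful modem technology.\n"
    "He came arid went as lie pleased."
)
for line_no, word, intended in find_stealth_scannos(sample):
    print(f"line {line_no}: '{word}' (perhaps '{intended}'?)")
```

Every hit still needs a human decision, since "arid", "modem" and "lie" are perfectly good words in the right context.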

Thank you, Bill! Next week we will profile two important tools that Bill has developed: the Re-Wrap & Indent script and the Smooth Proofing font. We will also go back to the original tools of the PG/DP community: GutCheck and PRTK. As we wrap up the tool profiles you can look forward to a special section on the PG newsletter site set aside specifically for information on all the tools as well as links to download the latest versions, whether you work within the DP site or develop texts for PG independently.

Finally this week, be aware that we have a new feature on the DP masthead. Where we used to count down the number of titles until 10,000 at PG, we now show the current total of works contributed to Project Gutenberg by DP. At time of writing, the figure stands at 2,851 books posted. Watch this space around the first week of January as we reach 3,000.

Until next week, enjoy your holiday preparations as well as the continued celebrations at DP. This year promises to go out in a grand style. Keep giving your best and the same will return to fill the hours of your days.

For now...

Thierry Alberto
