What does it take to get a
document digitized and published online here at the Rudolf Steiner
Archive? Here are the steps:

[ From this ... ]
Locate and acquire the document,
either purchased or from a library (see image at right).
Scan each page, saving it as a
computer file (preferably, but not necessarily, a TIFF [Tagged Image
File Format] file). At this point it is a graphic image, like a photo of
each page.
Run the files through OCR (Optical
Character Recognition) software, which converts any alphabetic characters
it finds in the image to actual “text” characters, resulting
in the creation of text files. The accuracy of this process varies from
95% recognition for very clear documents to no recognition at all for
some old manuscripts, which have to be keyed in (typed) by hand.
Proofread and correct each text
file, comparing it against the original document (preferred) or the
scanned images, and save the result as revised files. This includes:
- editing for typographic errors, whether
caused by OCR inaccuracy or present in the original document (it happens!),
- verifying special characters, especially
left and right quotation marks, and diacritical marks such as umlauts.
Proofread to locate all
footnotes and graphics (e.g., diagrams, drawings) in order to place
them correctly in the online version.
Proofread to locate all
references to items online in order to set up links for
cross-references.
Convert to HTML. For a single
lecture, this is a single file. For a book or collection,
there are multiple files, including cover image, contents, prefaces,
appendices, synopses, notes, footnotes, and cross-references. Much
of this is automated, but the human eye is still needed, and a lot
of it must be done manually.
Not all browsers are equal! There
is quite a bit of work that needs to be done to make the document
render at least close to the same way in all browsers! What looks
fine in one browser may look terrible in another. And when you fix it
in the other browser, it breaks in the first one. We recommend
Mozilla!
[ ... to this. ]
Put the document into the database(s):
cross-reference it with other documents, and create the index, keywords, and
other information needed for our database and the search/research
tools we have created.
Publish on the website (Whew!
See image at left).
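To make the OCR and proofreading steps above a little more concrete, here is a minimal sketch in Python of two checks a proofreader might automate. It is only an illustration, not the tooling actually used at the Archive: the function names `recognition_rate` and `lines_with_straight_quotes` are made up for this example, and the matching is an approximation based on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def recognition_rate(ocr_text: str, corrected_text: str) -> float:
    """Approximate share of the corrected text that the OCR output got right.

    Hypothetical helper: counts the characters of the proofread text that
    difflib can align with the raw OCR output, divided by the length of
    the proofread text.
    """
    if not corrected_text:
        return 1.0  # nothing to recognize, so nothing was missed
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(corrected_text)


def lines_with_straight_quotes(text: str) -> list[int]:
    """Return 1-based line numbers that still contain a straight double quote.

    Hypothetical helper: such lines are candidates for review, since the
    published text should use left and right quotation marks instead.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if '"' in line
    ]


if __name__ == "__main__":
    raw = 'Die Philosophie der Freiheit\nHe said "Goethe" twice.'
    fixed = "Die Philosophie der Freiheit\nHe said “Goethe” twice."
    print(f"recognition rate: {recognition_rate(raw, fixed):.1%}")
    print("lines to review:", lines_with_straight_quotes(raw))
```

A low recognition rate on a sample page is a hint that the document may be quicker to key in by hand than to correct, though where that threshold lies is a judgment call.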
From start to finish, a 10-page
lecture could take anywhere from one to eight hours, from initial
scanning to finally appearing online. For a collection or book, it
can take 10-50% more time to handle all the indexing,
cross-referencing, and formatting. Also, graphics and diagrams can take a
lot of work to clean up after they have been scanned. Some of our
materials are original typewritten manuscripts on very fragile,
yellowed paper, and are nearly impossible to process with OCR.
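As a rough worked example of these figures, the sketch below turns them into a time range for a document of any length. It assumes, purely for illustration, that the work scales linearly with the number of pages, which the figures above do not actually claim; the function name `estimate_hours` is likewise made up for this example.

```python
def estimate_hours(pages: int, is_collection: bool = False) -> tuple[float, float]:
    """Return a (low, high) estimate in hours for digitizing a document.

    Based on the quoted range of one to eight hours for a 10-page lecture.
    ASSUMPTION: time grows linearly with page count.
    """
    low = pages * (1.0 / 10)   # 1 hour per 10 pages
    high = pages * (8.0 / 10)  # 8 hours per 10 pages
    if is_collection:
        # Books and collections take 10-50% more time for indexing,
        # cross-referencing, and formatting.
        low *= 1.10
        high *= 1.50
    return low, high


if __name__ == "__main__":
    low, high = estimate_hours(200, is_collection=True)
    print(f"200-page book: roughly {low:.0f} to {high:.0f} hours")
```

Under that linear assumption, a 200-page book would come out at roughly 22 to 240 hours, before any extra time for cleaning up graphics or keying in fragile manuscripts by hand.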
Most of the digitizing project is
done in-house, but we have wonderful volunteers all over the world who
acquire and scan documents, run them through their own OCR software (if
they have it), and create files that they send to us. The final
proofreading, cross-references, creation of HTML files, setting up
for our databases and tools, and online publication are all done
in-house. And, of course, we provide the heavy-duty servers and
broadband to make it all available to the world.
Our Search and Research Tools and Database Management
Jim Stewart has designed and created the online tools (the
database, searching capability, keyword indexing and cross-referencing,
etc.) that enable users to access and research the online
documents with ease. This has been an ongoing project for more than
25 years.
How to Help
We have a tremendous backlog of materials we want to get online, and
there are so many irreplaceable resources at risk worldwide! If you
can afford to donate even a little to help support this initiative,
you will be helping to save irreplaceable works and to make the information
available to so many others! Please check our Donation and Appeal
pages to see how you can help!