What does it take to get a
document digitized and published online here at the Rudolf Steiner
Archive? Here are the steps:
[ From this ... ]
- Locate and acquire the document,
either purchased or from a library (see image at
right).
- Scan each page, saving it as a
computer file (preferably, but not necessarily, a TIFF [Tagged
Image File Format] file). At this point it is a graphic image, like
a photo of each page.
- Run the files through OCR (Optical
Character Recognition) software, which converts any alphabetic
characters it finds in the image to actual “text”
characters, resulting in the creation of text files. The accuracy
of this process varies from 95% recognition for very clear documents
to no recognition at all for some old manuscripts, which have to
be keyed in (typed) by hand.
- Proofread and correct each text file,
comparing it against the original document (preferred) or the scanned
images, and save the result as one or more revised files. This includes:
  - editing for typographic errors, whether caused by OCR inaccuracy
  or present in the original document (it happens!),
  - verifying special characters, especially left and right quotation
  marks, and diacritical marks such as umlauts.
- Proofread to locate all footnotes
and graphics (e.g., diagrams, drawings) in order to place them
correctly in the online version.
- Proofread to locate all references
to items online in order to set up links for cross-references.
- Convert to HTML. For a single lecture,
this is a single file. For a book or collection, there are
multiple files, including cover image, contents, prefaces, appendices,
synopses, notes, footnotes, and cross-references. Much of this is
automated, but the human eye is still needed, and a lot of it must
be done manually.
- Not all browsers are equal! Quite
a bit of work is needed to make the document
render in at least close to the same way in all browsers! What looks
fine in one browser may look terrible in another. And when you fix
it in the other browser, it breaks the first one. We recommend
[ ... to this. ]
- Enter the document into the database(s):
cross-reference it with other documents, and create the index, keywords,
and other information needed for our database and the
search/research tools we have created.
- Publish on the website
(Whew! See image at left).
From initial scanning to finally appearing online, a 10-page
lecture could take anywhere from one to eight hours. For a collection
or book, it can take 10-50% more time to handle all the indexing,
cross-referencing, and formatting. Also, graphics and diagrams can take
a lot of work to clean up after they have been scanned. Some of
our materials are original typewritten manuscripts on very fragile,
yellowed paper, and are nearly impossible to process with OCR.
Currently, there are 1102 on-site volumes and 4142 individual
documents here at the Archive!
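The proofreading step above calls for verifying left and right quotation marks and diacritical marks such as umlauts. As an illustration only, here is a minimal Python sketch of a helper that flags lines a human proofreader might want to look at. The function name `flag_special_characters` and the three checks are assumptions made for this example; they are not the Archive's actual tooling, and the final judgment still belongs to the proofreader.

```python
import unicodedata

LEFT_DOUBLE = "\u201c"   # left curly double quotation mark
RIGHT_DOUBLE = "\u201d"  # right curly double quotation mark


def flag_special_characters(text):
    """Return (line_number, message) pairs for a human proofreader.

    Hypothetical helper: it only points at suspicious lines and
    never changes the text itself.
    """
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A straight double quote may mean the left/right distinction was lost.
        if '"' in line:
            findings.append((number, "straight double quote"))
        # Unequal counts can be legitimate (a quotation spanning lines),
        # but they are worth a second look.
        if line.count(LEFT_DOUBLE) != line.count(RIGHT_DOUBLE):
            findings.append((number, "unbalanced curly double quotes"))
        # A combining mark means a diacritic is stored as letter + mark
        # (for example u + U+0308) rather than as one character.
        if any(unicodedata.combining(ch) for ch in line):
            findings.append((number, "decomposed diacritic"))
    return findings


def compose_diacritics(text):
    """Normalize to NFC so that letter + combining mark becomes one character."""
    return unicodedata.normalize("NFC", text)
```

A typical use would be to run `flag_special_characters` on an OCR text file, review the flagged lines against the scanned page, and apply `compose_diacritics` only once the text has been checked.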
Most of the digitizing project is done in-house, but
we have wonderful volunteers all over the world who acquire and scan
documents, run them through their own OCR software (if they have it), and
create files that they send to us. The final proofreading,
cross-referencing, creation of HTML files, setup for our databases
and tools, and online publication are all done in-house. And, of course,
we provide the heavy-duty servers and broadband to make it all available
to the world.
Our Search and Research Tools and Database
Jim Stewart has designed and created the online
tools (the database, searching capability, keyword indexing
and cross-referencing, etc.) that enable users to access and
research the online documents with ease. This has been an ongoing
project for almost 30 years.
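To give a rough idea of what keyword indexing involves, here is a minimal Python sketch of an inverted index: a table that maps each word to the documents containing it. This is a generic textbook technique shown under stated assumptions, not a description of how the Archive's own database works. The function names and the document identifiers used with them are hypothetical.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Map each lowercase word to the set of document ids that contain it.

    `documents` is a dict of {document_id: text}; the ids are hypothetical.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Start from the matches for the first word, then narrow with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Building the index once and looking words up in it avoids rereading every document for each search, which matters as a collection grows into thousands of documents.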
How to Help
We have a tremendous backlog of materials we want
to get online, and there are so many irreplaceable resources at risk
worldwide! If you can afford to donate even a little to help support
this initiative, you will be helping to save irreplaceable works and
make the information available to so many others! Please check our
pages to see how you can help!