[Ecommerce] Preservation of word processing documents (Australian Report)

Manon Ress manon.ress@cptech.org
Mon Sep 25 17:42:01 2006


http://www.apsr.edu.au/publications/word_processing_preservation.pdf

Preservation of word processing documents by Ian Barnes
The Australian National University
Friday, 14 July 2006

This work was funded by the Australian Commonwealth Department of
Education, Science & Training, through the Australian Partnership for
Sustainable Repositories, which is part of the Systemic
Infrastructure Initiative, part of the Commonwealth Government's
"Backing Australia's Ability: An Innovative Action Plan for the Future".


QUOTE pp 6-7:

3.3.2. Open Document Format

Open Document Format[13] is the native file format of the latest
versions of OpenOffice.org Writer[14], the
word processor component of the OpenOffice.org open source office
suite. OpenOffice.org is the open source
version of Star Office, which was originally developed by Star
Division in Germany. Star Division were bought
by Sun Microsystems, who still support the continuing development.
OpenOffice.org is the world's largest open
source software project. Most development seems to be done by Sun
engineers, but there is also a very active
community.

Open Document Format grew out of OpenOffice.org's earlier Open Office
XML format. It is now an OASIS
and ISO standard and a European Commission recommendation. It is
supported by the open source word
processors KOffice and AbiWord, with more to come.

An ODF file is a Zip archive containing several XML files, plus
images and other objects. The Zip archiving
and compression tool is freely available on all major platforms, so
there should never be a problem getting at the
content of an ODF document. Using a Zip archive does mean that the
files are prone to catastrophic loss of
content with even minor data corruption, in the same way as the
Microsoft Word formats discussed above.
If we are going to archive word processing documents, I believe that
ODF is a better option than Microsoft
Word format in any of its variations. Even the new XML-based Word
formats will still suffer from being owned
by a for-profit corporation.
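
As an illustration of the point that an ODF file is just a Zip archive of XML files (this sketch is not from the report), the snippet below builds a tiny ODF-like package in memory, lists its members and pulls the text out of content.xml. The helper names are hypothetical, and the sample package is deliberately incomplete: a real ODF file also carries a manifest, styles and metadata.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace URIs for the office and text vocabularies.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_sample_odt():
    """Build a minimal ODF-like Zip archive in memory (illustrative only)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:h>Introduction</text:h>"
        "<text:p>Hello, archive.</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", content)
    return buf.getvalue()


def list_members(data):
    """Return the names of the files stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def extract_paragraphs(data):
    """Return the text of every heading and paragraph in content.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p"}
    return ["".join(el.itertext()) for el in root.iter() if el.tag in wanted]


if __name__ == "__main__":
    sample = make_sample_odt()
    print(list_members(sample))
    print(extract_paragraphs(sample))
```

Nothing beyond a Zip reader and an XML parser is needed, which is the accessibility argument made above. The corruption risk is also visible here: if the archive cannot be opened, none of the members can be read.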

One possible preservation strategy would be to convert all word
processing documents to ODF for storage. This
can be done easily using OpenOffice.org itself as a converter. The
conversion could be set up as part of the
repository ingest process so that it would be almost totally painless
for users. Conversion to ODF gets all the
formatting of most Word documents, with only minor differences in
layout. For complex documents that use
lots of floating text boxes, these minor differences can make a mess
of the appearance of the document. For
documents that use embedded active content (chunks from live
spreadsheets etc), the embedding will probably
fail. For most "normal" documents, even complex ones, the conversion
is good.
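
A rough sketch of what an ingest-time conversion step might look like (not from the report). It assumes a present-day LibreOffice install whose soffice binary accepts the --headless, --convert-to and --outdir options; the OpenOffice.org release the report describes may have needed a different mechanism. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, binary="soffice"):
    """Return the command line for a headless conversion to .odt.

    Assumes LibreOffice-style options; adjust for other office suites.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "odt",
        "--outdir", str(outdir),
        str(source),
    ]


def convert_to_odf(source, outdir, binary="soffice"):
    """Convert one document, or return None if the converter is missing."""
    if shutil.which(binary) is None:
        return None  # no converter on PATH; the caller decides what to do
    subprocess.run(build_convert_command(source, outdir, binary), check=True)
    return Path(outdir) / (Path(source).stem + ".odt")


if __name__ == "__main__":
    print(build_convert_command("thesis.doc", "archive"))
```

A repository could call convert_to_odf on each upload and keep the original file next to the result, since documents with floating text boxes or embedded live content may not convert cleanly.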

The main disadvantage of this strategy is that Open Document Format
is still a word processing format, not a
structured document format. What does this mean, and why is it a
problem?

• Word processing formats are at heart about describing the
  appearance of the document, not its structure. For serious
  processing it's the structure we want. In 20, 50 or 100 years, most
  readers will probably not care about the size of the paper, the
  margins, the fonts used and so on. Even today, if we're going to
  serve up a document as a web page, those details are irrelevant.
  Sometimes these details can even be a disadvantage, for example if
  the document insists on fonts that are unavailable on your computer.
  On the other hand, the division of the document into sections will
  always be relevant, useful and important, and must be preserved.
• Word processing formats are flat. That is, the document is a
  sequence of paragraphs and headings. What we'd really like is a deep
  structure with sections, subsections and so on, nested inside each
  other (as in DocBook or TEI). We want this deep structure because it
  makes structured searches and queries possible, and makes conversion
  with XSLT much easier.

It is possible to do automated conversion from flat to deep structure
[15] (and see Section 6 below), but this is only possible at the
moment with documents that conform to a well-designed template. In the
future heuristic methods might extend this to less carefully prepared
documents, but the results are likely to be inconsistent.
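
To make the flat-versus-deep distinction concrete, here is a minimal sketch (not the method of [15]) that turns a flat sequence of headings and paragraphs into nested sections. It assumes each item is a (level, text) pair, with level 0 for a body paragraph and 1, 2, ... for heading levels, which is roughly what a well-designed template guarantees. The function name and data layout are hypothetical.

```python
def nest(flat):
    """Convert a flat list of (level, text) items into nested sections.

    level 0 is a body paragraph; level n >= 1 is a heading of depth n.
    """
    root = {"title": None, "level": 0, "paragraphs": [], "sections": []}
    stack = [root]  # path of currently open sections, outermost first
    for level, text in flat:
        if level == 0:
            # Body text belongs to the innermost open section.
            stack[-1]["paragraphs"].append(text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"title": text, "level": level,
                   "paragraphs": [], "sections": []}
        stack[-1]["sections"].append(section)
        stack.append(section)
    return root


if __name__ == "__main__":
    flat_doc = [
        (1, "Intro"), (0, "a"),
        (2, "Background"), (0, "b"),
        (1, "Methods"), (0, "c"),
    ]
    tree = nest(flat_doc)
    print([s["title"] for s in tree["sections"]])
```

The sketch trusts the heading levels it is given. A document that jumps from level 1 straight to level 3, or that fakes headings with bold text, would need the heuristic handling mentioned above.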

The other disadvantage of Open Document Format is that even for
simple documents it is extremely complex.
For example, unzipping a one-page document of about 120 words results
in a collection of files totalling 300K
in size. This makes it relatively difficult to locate the meaningful
content and structure and transform it into
other formats for viewing or other uses. Instead of leaving documents
in this complex format and having a hard
job writing converters (XSLT stylesheets) for all possible future
uses, it would be better to store documents in a
simple, clear, well-structured format that makes converters easier to
write.
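
The size observation can be checked on any document with a few lines of code. The sketch below (not from the report) reports the uncompressed size of each member of a Zip package and the fraction taken up by content.xml. The sample archive and its sizes are made up for the demonstration; the function names are hypothetical.

```python
import io
import zipfile


def member_sizes(data):
    """Map each archive member to its uncompressed size in bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {info.filename: info.file_size for info in zf.infolist()}


def content_share(data, member="content.xml"):
    """Fraction of the unpacked package occupied by one member."""
    sizes = member_sizes(data)
    total = sum(sizes.values())
    return sizes[member] / total if total else 0.0


# Made-up sample: 100 bytes of "content" next to 300 bytes of "styles".
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as _zf:
    _zf.writestr("content.xml", "x" * 100)
    _zf.writestr("styles.xml", "y" * 300)
sample = _buf.getvalue()

if __name__ == "__main__":
    print(member_sizes(sample))
    print(content_share(sample))
```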

END OF QUOTE
************************************************
Manon Anne Ress
manon.ress@cptech.org,
www.cptech.org

Consumer Project on Technology
1621 Connecticut Ave, NW, Washington, DC 20009 USA
Tel.:  +1.202.332.2670, Ext 16 Fax: +1.202.332.2673

Consumer Project on Technology
1 Route des  Morillons, CP 2100, 1211 Geneva 2, Switzerland
Tel: +41 22 791 6727

Consumer Project on Technology
24 Highbury Crescent, London, N5 1RX, UK
Tel: +44(0)207 226 6663 ex 252 Fax: +44(0)207 354 0607