Jan Schaumann
``These download files are in Microsoft Word 6.0 format. After unzipping, these files can be viewed in any text editor, including all versions of Microsoft Word, WordPad, and Microsoft Word Viewer.''
-- From Microsoft's Website
Mmmmmmmkay. So we are told that Microsoft Word files can be ``viewed in any text editor,'' which probably is why so many people insist on sending even simple text documents as big Word attachments. Well, seriously, how often do you receive an email with a Word document attached, because the original poster simply assumes that everyone uses Microsoft Word (if they even think about it at all)?
Not only is it dangerous for anyone actually using Word to open attachments that might possibly contain macro-viruses, but for anyone not using any of Microsoft's products, it has become a real nuisance.1 This article will try to make your (office) life a bit easier by elaborating on the various possibilities of dealing with these dreaded documents.
Short of rejecting any documents not in standard format (more on this later), there is no optimal way of dealing with MS Word documents in Linux. As mentioned above, sometimes even Word can't read Word. There are, however, a number of approaches to opening most of these documents and even preserving their formatting.
There are the full-fledged Word Processors, very similar to Microsoft's Word,
there are a few file converters and some rather unconventional means of
extracting the information out of a .doc. Depending on your needs, you
may choose one solution in one particular situation, and at times a different
one.
If you need to do a lot of word processing and often exchange documents with
co-workers, you will almost certainly want to install a complete Office Suite. An
Office Suite comes with, among other things, a Word Processor which lets you read
(and sometimes write) various MS Word formats, even though each suite has its
own document format as well.
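The most common Office Suites available for Linux, each discussed below, are Applixware Office[5], Corel WordPerfect Office 2000[9], KOffice[6] and StarOffice[3]/OpenOffice[4].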
Of all the above, Applixware Office[5] is the only one not available free of charge - however, Applix was kind enough to provide me with a copy of their software for this article (retail price is $99). I received a few colorful boxes containing a copy of Applixware Office, Applixware Words (standalone) and Applixware Spreadsheets (standalone). The Office package came with a beautiful handbook, something that I would certainly appreciate once I had set it up.
Eager to test the new software, I attempt to install Applixware Words,
following the instructions in the manual. At first, things seem to go
smoothly, but then closed software practices take their toll: there appears to
be a small bug in the install scripts which makes it impossible for me to
install the software. When I attempt to install into /opt/applix (as
suggested by the program), the error log tells me after the failed
installation that it apparently tried to install to /optapplix, even
though it created /opt/applix.
Note that this would be just a minor nuisance if the user were able to
edit the install script, but as this is closed software, there is nothing I
can do. I try /opt//applix and a few other tricks, but to no avail.
Due to unmet dependencies (on a Debian system, it appears as if I do not have
any of the most basic RPMs installed) the RPM install fails as well, so in
a last attempt to install Applixware Words, I generate .deb packages
from the RPMs using alien[1].
Even though this seems to install the packages, running the application leads to several errors. I finally give up and compose an email to Applix to inquire about these problems. After all, you get 30 days of support from Applix with the product.
A few hours later I receive an email instructing me to install from RPMs using the following command:
$> ./install.bsh -rpm2cpio -library -location \
   <path to install location> -rpmloc <path/to/rpm/files>
This requires copying the RPMs to a different location and, in my opinion, does not necessarily qualify as an ``easy installation''. But once the software is installed, Applixware Words proves to be a solid product.
Reading even complex documents which include tables, graphics and mathematical
formulae worked better than in most of the other applications I tried - only
if the original document was saved with the "Fast Save" option in MS Word did
Applix Words warn the user that the result might not be accurate. Exporting
and writing to different file formats such as .rtf, .doc and
plain ASCII worked equally well.2
As mentioned earlier, Applixware Office comes with a nice printed manual,
something that surely will be appreciated by everybody doing a lot of work
with these products. A short online tutorial together with the help menu
complements the printed manual and makes Applix Words easy to use even for a
novice. Another advantage of this suite is that its native file format
(.aw) is plain ASCII text, with the specifications freely available
from the web site[5], which makes it easy to write import/export
filters.
It should be mentioned, though, that Applix Words did exhibit some rather strange resizing behaviour under Blackbox.
Corel[7], best known to most for its Linux distribution (Corel Linux[8]) and CorelDRAW, also developed a very powerful Office Suite: Corel WordPerfect Office 2000[9], which includes the well-known WordPerfect Word Processor. WP offers anything one might wish for in a tool like this - it has been regarded as superior to MS Word by many people, and is available for Windows and Linux3.
If you want to license WP Office 2000 for your entire office, you will find a hefty price-tag attached; however, for personal use, you can download WordPerfect by itself for free. I found it a breeze to install and use - it easily opened all of the Word documents I found on my hard drive, and was even able to display mathematical formulae properly.
KOffice[6], brought to you by the friendly people of KDE, was released together with KDE 2.0 in October 2000, as beta software. Nonetheless, the word processor - KWord - looks impressive. It integrates nicely with all the other KDE apps, and neatly imported most of the MS Word documents I fed it.
Problems arose when I tried to open a document containing mathematical formulae, but since I have been assured that these formulae bring down every version of Word itself but the latest (no surprise there), I would still recommend it. By the time KOffice 1.1 is released, I'm sure KWord will easily suffice for most needs.
This Office Suite is, of course, licensed under the GPL and available for free
download from your favorite mirror. Debian's apt-get install kword took
care of all dependencies for me, but since KOffice relies on KDE 2.0 and
Qt2.2, you might find yourself upgrading a lot of packages before you can use
this program.
Quite some time ago, Sun Microsystems acquired StarOffice[3], an Office Suite available for many Operating Systems. StarOffice was one of the first Office Suites able to compete with Microsoft's Office. While Sun always offered StarOffice for free download, only fairly recently did they announce the release of the source code to the Open Source community, which ultimately led to the OpenOffice[4] project. So, yes, this is another GPL'ed project.
StarOffice/OpenOffice includes a very powerful word processor, which can read
most Word Documents and can even write to .doc-format. However, it has
a drawback: it's a memory-hog. Not only does it require a significant amount
of disk space for the complete installation, it also takes a while until all
the components are started. If you have a slow machine, this might not be
your first choice - on the other hand, if you have enough space and memory,
I'm sure you'll find StarOffice/OpenOffice to meet all of your needs with
respect to Word Processing.
All of the aforementioned applications are full Office Suites: rather hefty
packages more suited to people who actually do perform a lot of
word processing and who, at the same time, need to have applications for
spreadsheets, presentations and the like.
For those of you who just want a word processor for the occasional letter of complaint to your landlord, there are some lighter approaches. The most common lightweight word processor is AbiWord[2].
AbiWord, designed to be ``full-featured, and remain lean,'' seems to live up
to its goal. It's fast, available for a large variety of platforms, free (as
in beer and as in speech), and under heavy development. However, I do have to
admit that it chokes on some documents, or opens them without preserving the
original formatting. In particular, MS Word's way of dealing with tables
seems to confuse AbiWord.
Another very small and light word processor is Pathetic Writer (or pw), which is part of the Siag Office Suite. The reason I mention pw here and did not include it in the full-fledged Office Suites is that it
seems rather thin. pw will not open Microsoft's .doc's, but it will
happily perform your everyday word processing and can import and export most
common formats.
Siag Office, just like AbiWord, is published under the GPL and available for free download.
All of the above-mentioned applications have various requirements: some rely heavily on pre-installed libraries (such as KWord), some are rather resource-hungry (StarOffice/OpenOffice), others are expensive and/or not open-sourced. But all of them try to preserve a certain style or a certain way of formatting a document.
While this is certainly useful and important, I have found that I, personally,
have no use whatsoever for a Word Processor, no matter which one. In 90% of
the cases where some thoughtless person sends me a .doc, the
information contained within the document could have easily been communicated
in plain text with a fraction of the size.
So let's talk business now and see how we can extract the necessary information from the proprietary file formats. There are a few tools worth mentioning, whose beauty lies in that we do not even need X - they are all command-line tools.
antiword is quick, and since it is a command-line tool, we can redirect the output to
another process or file for further modification. To take a quick glance at
the content of the file, you could pipe the output to less:
$> antiword HUGE.DOC | less
Or, if you'd rather have a hardcopy:
$> antiword -p letter HUGE.DOC | lpr
I found antiword so useful that I replaced my previous mailcap-entry
for handling of MS-Word files (in which I used to call abiword) with
the following line:
application/msword;antiword %s | vim -
This allows me to read through .doc-attachments from my mail reader
(mutt) - and since I pipe the output right into my favorite editor, I can
even make modifications and save it to another file. Note that by placing
this entry into my ~/.mailcap, all applications respecting this file
will use antiword and vim to display .doc's. If you are using a
graphical browser such as Netscape, you might want to use a different
editor, or use the "-g" switch for vim to spawn a GUI frontend.
If you are a hard-core minimalist, you will find that the command
strings, part of the GNU binutils package, is often sufficient to
extract the plain text information from a .doc - however, antiword has the significant advantage over strings that it can also
extract images in addition to just the text.
For details on the use of the various options and on how to extract images
from a Word file, see antiword(1).
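The choice between antiword and strings can be wrapped in a small shell function. What follows is only a sketch: the name doc2txt is made up for this illustration, and it assumes that antiword and strings, where installed, can be found on your PATH.

```shell
# doc2txt: print the text of a Word file on standard output.
# Hypothetical helper, not part of antiword. It prefers antiword and
# falls back to strings when antiword is not installed.
doc2txt() {
    if [ ! -r "$1" ]; then
        echo "doc2txt: cannot read '$1'" >&2
        return 1
    fi
    if command -v antiword >/dev/null 2>&1; then
        antiword "$1"
    elif command -v strings >/dev/null 2>&1; then
        strings "$1"
    else
        echo "doc2txt: neither antiword nor strings found" >&2
        return 1
    fi
}
```

With the function defined in your shell startup file, doc2txt HUGE.DOC | less behaves like the antiword pipeline shown earlier, but still produces something readable on a machine where only binutils is installed.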
Another command-line tool is wv, originally named mswordview. It served me well for
.doc's, as it
converted them quite reliably into nice HTML. Note that I'm not talking
about wordview, a Microsoft product. The similarity in the names caused
the author to rename his tool.
While it certainly is great for a browser to use an application that turns Word files into HTML, this is not always the ideal output format. Therefore, wv nowadays includes a whole set of tools to convert Word documents into a large variety of formats, including, but not limited to, ASCII text, HTML, LaTeX, PostScript and PDF. wv is published under the GPL and available for free download.
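Because the wv converters are ordinary command-line programs, a whole directory of attachments can be converted in one go. The sketch below assumes that wv's wvText script is installed and is called as wvText input output. The function name docs_to_text is invented for this example.

```shell
# docs_to_text: convert every .doc file in a directory to plain text
# with wvText, writing NAME.txt next to NAME.doc.
# Hypothetical helper; assumes wv's wvText script is on the PATH.
docs_to_text() {
    dir=${1:-.}
    for doc in "$dir"/*.doc; do
        # If no file matches, the pattern stays literal, so skip it.
        [ -e "$doc" ] || continue
        wvText "$doc" "${doc%.doc}.txt" || echo "failed: $doc" >&2
    done
}
```

Calling docs_to_text ~/attachments leaves the original files untouched. The pattern only matches a lower-case .doc suffix, so files named like HUGE.DOC would need a second pattern. Swapping wvText for wvHtml (and .txt for .html) should give HTML output in the same way.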
The typical user trying to write a simple progress report, for example, usually follows a certain scheme:
Now I am fully aware that this is not the proper way to utilize a powerful Word Processor, but let's face it - that's exactly how the majority of users - those for whom these ``user friendly'' applications are designed - work. The effort required to enter a table of contents, a bibliography, cross-references, etc. can only be imagined.
Eventually, the outcome is a document that takes hours to prepare, and that looks the way it should only on this platform using this particular version of this word processor.
To avoid such bad practices, let's investigate some alternative methods of preparing platform-independent documents.
Simple, plain ASCII text is usually sufficient to send information from one person to another - that's exactly why e-mail, for example, is still a text medium. HTML in emails4 does not add anything to the content. ASCII text can be read from anywhere with any editor (and not just with ``any editor, including Microsoft Word...''). By structuring the text clearly, by using paragraphs and horizontal lines constructed out of hyphens, maybe even by using *bold*, /italics/ and _underlined_ text as used on Usenet, one can write clear, easy-to-read and understand, and, most importantly, portable documents.
LaTeX is an astounding typesetting engine, derived from TeX. It takes a
.tex file as input and typesets it, generating a .dvi file. It
is available for a large variety of platforms, and documents typeset with
LaTeX look incredibly professional. Yet you can use your favorite editor to
create the input files, since LaTeX is a command line tool.
When using LaTeX, one can concentrate on the content of the document instead
of the way it looks - the typesetting engine will take care of the layout. A
.tex-file contains a few tags (which may remind you of HTML) to
determine the way the text will be displayed.
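To give an idea of what such an input file looks like, the following sketch writes a minimal one from the shell. The function name write_sample_tex, the file name and the contents of the report are all invented for illustration; only the LaTeX commands themselves are standard.

```shell
# write_sample_tex: create a minimal LaTeX input file.
# Hypothetical helper; the file name and the text are examples only.
write_sample_tex() {
    cat > "${1:-report.tex}" <<'EOF'
\documentclass{article}
\title{Progress Report}
\author{A. N. Author}
\begin{document}
\maketitle
\section{Status}
The converter now handles \emph{most} documents.
\begin{itemize}
  \item Tables are preserved.
  \item Images are still missing.
\end{itemize}
\end{document}
EOF
}
```

After write_sample_tex report.tex, running latex report.tex should typeset the file and leave a report.dvi next to it. Note that the input says what each piece of text is (a title, a section, a list), not which font or size to use.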
This is a completely different way of writing a document - no more pointing and clicking and highlighting and re-considering and so on and so on. But it may be daunting to someone who is used to using a Graphical User Interface.
Now this is where the good guys from LyX[13] come into play. They developed a GUI to LaTeX, enabling the inexperienced user to take advantage of the power of TeX, but without having to learn it from scratch (yet).
Upon first glance, LyX may look similar to your average Word Processor, but if you follow the tutorial, you will quickly see the difference and how you can increase productivity by concentrating on your work, your material, rather than on the visual representation.
If you often connect remotely to your machine to get work done, like I do, you
don't always have the ability to export your display or to forward X. This is
when you learn to appreciate the power of the command line, when you find out
that everything you ever need is right there at your fingertips. By using
your favorite editor (vim, in my case) and LaTeX, you can easily get all your
work done through a single terminal to your machine.
Another advantage of LyX and LaTeX is that you can easily export your files
into platform-independent formats such as PostScript or PDF. By combining the
power of make with the power of LaTeX, this can be done with just a
few commands. Take, for example, this document - I turned the input file
into a beautiful PDF (Figure 6) simply by using the command
$> make pdf
Even though the Makefile itself (Figure 7) is simple, it allows
me to easily convert my document into a large variety of output formats using
several different command-line tools, such as ps2pdf and
latex2html.
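For readers without a Makefile at hand, the kind of command sequence a pdf target might run can be sketched as a shell function. This is not the Makefile from Figure 7: the function name tex2pdf is invented, and the route through dvips is an assumption. It also assumes that latex, dvips and ps2pdf are installed.

```shell
# tex2pdf: typeset NAME.tex and convert the result to NAME.pdf by way
# of DVI and PostScript. Hypothetical helper, not the article's
# Makefile; assumes latex, dvips and ps2pdf are installed.
tex2pdf() (
    # Work in the directory of the input file, since latex writes its
    # output into the current directory. The ( ) body runs in a
    # subshell, so the caller's working directory is left alone.
    cd "$(dirname "$1")" || exit 1
    base=$(basename "$1" .tex)
    # Run latex twice so that cross-references are resolved.
    latex "$base.tex" && latex "$base.tex" || exit 1
    dvips -o "$base.ps" "$base.dvi" || exit 1
    ps2pdf "$base.ps" "$base.pdf"
)
```

tex2pdf words.tex would then leave words.dvi, words.ps and words.pdf next to the input file. The advantage of make over a function like this is that make only re-runs the steps whose input files have changed.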
Finally, LaTeX is extensible - you can write your own styles to achieve different results depending on the kind of document you are writing. But most likely, someone else has already done so and uploaded it to the Comprehensive TeX Archive Network (CTAN[14], TeX's equivalent to Perl's CPAN).
Personally, I'm sure you will find that LaTeX is far superior even for these little every-day tasks when it comes to creating professional (looking) documents. In order to take advantage of LaTeX, however, it is necessary to free your mind from what you may be used to - this may take a while. But don't be afraid, there is a lot of helpful documentation out there5.
After a short time of going through a tutorial and, most importantly, giving it a try and taking a look at some examples, you will never want to go back - you can take my Word for it.
This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.48)
The translation was initiated by Jan Schaumann on 2001-10-15