More than Word(s)

Jan Schaumann


Introduction

``These download files are in Microsoft Word 6.0 format. After unzipping, these files can be viewed in any text editor, including all versions of Microsoft Word, WordPad, and Microsoft Word Viewer.''
-- From Microsoft's Website

Mmmmmmmkay. So we are told that Microsoft Word files can be ``viewed in any text editor,'' which probably is why so many people insist on sending even simple text documents as big Word attachments. Well, seriously, how often do you receive an email with a Word document attached, because the original poster simply assumes that everyone uses Microsoft Word (if they even think about it at all)?

Not only is it dangerous for anyone actually using Word to open attachments that might contain macro viruses, but for anyone not using any of Microsoft's products, it has become a real nuisance. This article will try to make your (office) life a bit easier by walking through the various ways of dealing with these dreaded documents.

Reading and Writing Words

Short of rejecting any documents not in a standard format (more on this later), there is no optimal way of dealing with MS Word documents in Linux. Sometimes even Word can't read Word. There are, however, a number of approaches to opening most of these documents and even preserving the formatting.

There are full-fledged Word Processors very similar to Microsoft's Word, a few file converters, and some rather unconventional means of extracting the information from a .doc. Depending on your needs, you may choose one solution in one situation and a different one in another.

Big Words

If you need to do a lot of word processing and often exchange documents with co-workers, you most certainly want to install a complete Office Suite. An Office Suite comes with, among others, a Word Processor which lets you read (and sometimes write) various MS Word formats, even though they all have their own document formats as well.

The most common Office Suites available for Linux are:

- Applixware Office
- Corel WordPerfect Office 2000
- KOffice
- StarOffice/OpenOffice

Applixware Office

Of all the above, Applixware Office[5] is the only one not available free of charge - however, Applix was kind enough to provide me with a copy of their software for this article (retail price is $99). I received a few colorful boxes containing a copy of Applixware Office, Applixware Words (standalone) and Applixware Spreadsheets (standalone). The Office package came with a beautiful handbook, something that I would certainly appreciate once I had set it up.

Eager to test the new software, I attempt to install Applixware Words, following the instructions in the manual. At first, things seem to go smoothly, but then closed software practices take their toll: there appears to be a small bug in the install scripts which makes it impossible for me to install the software. When I attempt to install into /opt/applix (as suggested by the program), the error log tells me after the failed installation that it apparently tried to install to /optapplix, even though it created /opt/applix.

Note that this would be just a minor nuisance if the user were able to edit the install script, but as this is closed software, there is nothing I can do. I try /opt//applix and a few other tricks, but to no avail.

Due to unmet dependencies (on a Debian system, it appears as if I did not have any of the most basic RPMs installed), the RPM install fails as well, so in my last attempt to install Applixware Words, I generate ".deb" packages from the RPMs using alien[1].

Even though this seems to install the packages, running the application leads to several errors. I finally give up and compose an email to Applix to inquire about these problems. After all, you get 30 days of support from Applix with the product.

A few hours later I receive an email instructing me to install from the RPMs using the following command:

$> ./install.bsh -rpm2cpio -library -location \
<path to install location> -rpmloc <path/to/rpm/files>

This requires copying the RPMs to a different location and, in my opinion, does not necessarily qualify as an ``easy installation''. But once the software is installed, Applixware Words proves to be a solid product.

Figure 1: A Screenshot of Applixware Words 5.0

Reading even complex documents which include tables, graphics and mathematical formulae worked better than in most of the other applications I tried - only if the original document was saved with the "Fast Save" option in MS Word did Applix Words warn the user that the result might not be accurate. Exporting and writing to different file formats such as .rtf, .doc and plain ASCII worked equally well.

As mentioned earlier, Applixware Office comes with a nice printed manual, something that surely will be appreciated by everybody doing a lot of work with these products. A short online tutorial together with the help menu complement the printed manual and make Applix Words easy to use even for a novice. Another advantage of this suite is that its native file format (.aw) is plain ASCII text, with the specifications freely available from the web site[5], which makes it easy to write import/export filters.

It should be mentioned, though, that Applix Words did exhibit some rather strange resizing behaviour under Blackbox.

Corel WordPerfect Office 2000

Corel[7], best known to most for its Linux distribution (Corel Linux[8]) and CorelDRAW, also developed a very powerful Office Suite: Corel WordPerfect Office 2000[9], which includes the well-known WordPerfect Word Processor. WP offers anything one might wish for in a tool like this - it has been regarded as superior to MS Word by many people, and is available for Windows and Linux.

If you want to license WP Office 2000 for your entire office, you will find a hefty price-tag attached; however, for personal use, you can download WordPerfect by itself for free. I found it a breeze to install and use - it easily opened all of the Word documents I found on my hard drive, and was even able to display mathematical formulae properly.

Figure 2: A Screenshot of WordPerfect 8

KOffice

KOffice[6], brought to you by the friendly people of KDE, was released together with KDE 2.0 in October 2000, as beta software. Nonetheless, the word processor - KWord - looks impressive. It integrates nicely with all the other KDE apps, and neatly imported most of the MS Word documents I fed it.

Problems arose when I tried to open a document containing mathematical formulae, but since I have been assured that these formulae bring down every version of Word itself but the latest (no surprise there), I would still recommend it. By the time KOffice 1.1 is released, I'm sure KWord will easily suffice for most needs.

This Office Suite is, of course, licensed under the GPL and available for free download from your favorite mirror. Debian's apt-get install kword took care of all dependencies for me, but since KOffice relies on KDE 2.0 and Qt2.2, you might find yourself upgrading a lot of packages before you can use this program.

Figure 3: A Screenshot of KWord

StarOffice/OpenOffice

Quite some time ago, Sun Microsystems acquired StarOffice[3], an Office Suite available for many Operating Systems. StarOffice was one of the first Office Suites able to compete with Microsoft's Office. While Sun always offered StarOffice for free download, only fairly recently did they announce the release of the source code to the Open Source community, which ultimately led to the OpenOffice[4] project. So, yes, this is another GPL'ed project.

Figure 4: A Screenshot of StarOffice (now OpenOffice)

StarOffice/OpenOffice includes a very powerful word processor, which can read most Word Documents and can even write to .doc-format. However, it has a drawback: it's a memory-hog. Not only does it require a significant amount of disk space for the complete installation, it also takes a while until all the components are started. If you have a slow machine, this might not be your first choice - on the other hand, if you have enough space and memory, I'm sure you'll find StarOffice/OpenOffice to meet all of your needs with respect to Word Processing.

Small Words

All of the aforementioned applications are full Office Suites: rather hefty packages more suited to people who actually do perform a lot of word processing and who, at the same time, need applications for spreadsheets, presentations and the like.

For those of you who just want a word processor for the occasional letter of complaint to your landlord, there are some lighter approaches. The most common lightweight word processor is AbiWord[2].

AbiWord, designed to be ``full-featured, and remain lean,'' seems to live up to its goal. It's fast, available for a large variety of platforms, free (as in beer and as in speech), and under heavy development. However, I do have to admit that it chokes on some documents, or opens them without preserving the original formatting. In particular, MS Word's way of dealing with tables seems to confuse AbiWord.

Another very small and light word processor is Pathetic Writer (or pw), which is part of the Siag Office Suite[10]. The reason I mention pw here and did not include it among the full-fledged Office Suites is that it seems rather thin. pw will not open Microsoft's .doc files, but it will happily perform your everyday word processing and can import and export most common formats.

Siag Office, just like AbiWord, is published under the GPL and available for free download.

Unconventional Words

All of the above-mentioned applications have various requirements: some rely heavily on pre-installed libraries (such as KWord), some are rather resource-hungry (StarOffice/OpenOffice), others are expensive and/or not open-sourced. But all of them try to preserve a certain style or a certain way of formatting a document.

While this is certainly useful and important, I have found that I, personally, have no use whatsoever for a Word Processor, no matter which one. In 90% of the cases where some thoughtless person sends me a .doc, the information contained within the document could have easily been communicated in plain text with a fraction of the size.

So let's talk business now and see how we can extract the necessary information from the proprietary file formats. There are a few tools worth mentioning, whose beauty lies in the fact that we do not even need X - they are all command-line tools.

antiword

antiword[11] takes a Word document as input and extracts the information contained in it, converting it to plain ASCII text or to PostScript. It tries to maintain the formatting as much as possible and, if I may say, it does a fairly decent job of it.

It's quick, and since it is a command-line tool, we can redirect the output to another process or file for further modification. To take a quick glance at the content of the file, you could pipe the output to less:

$> antiword HUGE.DOC | less

Or, if you'd rather have a hardcopy:

$> antiword -p letter HUGE.DOC | lpr
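The same redirection works for a whole directory of attachments. What follows is only a sketch: the function name doc2txt_all is made up for this example, and it assumes that antiword is in your PATH and that you want each .txt file written next to its .doc.

```shell
# doc2txt_all: convert every .doc/.DOC file in a directory to plain text.
# Hypothetical helper; relies only on "antiword FILE" printing text to stdout.
doc2txt_all() {
    dir=${1:-.}
    for doc in "$dir"/*.doc "$dir"/*.DOC; do
        [ -e "$doc" ] || continue        # skip patterns that matched nothing
        txt=${doc%.*}.txt                # HUGE.DOC -> HUGE.txt
        if ! antiword "$doc" > "$txt"; then
            rm -f "$txt"                 # do not leave an empty file behind
            echo "could not convert $doc" >&2
        fi
    done
}
```

Calling doc2txt_all ~/attachments would then leave a plain-text copy beside every Word file in that directory.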

I found antiword so useful that I replaced my previous mailcap-entry for handling of MS-Word files (in which I used to call abiword) with the following line:

application/msword;antiword %s | vim -

This allows me to read through .doc-attachments from my mail reader (mutt) - and since I pipe the output right into my favorite editor, I can even make modifications and save it to another file. Note that by placing this entry into my ~/.mailcap, all applications respecting this file will use antiword and vim to display .doc's. If you are using a graphical browser such as Netscape, you might want to use a different editor, or use the "-g" switch for vim to spawn a GUI frontend.
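If you would rather see the text inline than in an editor, a mailcap entry can be marked with the copiousoutput flag, which tells the mail reader that the command simply prints text. The sketch below is an alternative to my own setup, not a description of it: the function name add_msword_entry is invented, and the entry is only appended if the file has no application/msword line yet.

```shell
# add_msword_entry: append an antiword viewer line to a mailcap file
# (default ~/.mailcap) unless an application/msword entry already exists.
# Hypothetical helper for illustration.
add_msword_entry() {
    mailcap=${1:-$HOME/.mailcap}
    entry='application/msword; antiword %s; copiousoutput'
    grep -q '^application/msword' "$mailcap" 2>/dev/null ||
        printf '%s\n' "$entry" >> "$mailcap"
}
```

With mutt, adding the line "auto_view application/msword" to ~/.muttrc should then display the converted text right in the pager; other mail readers may treat the flag differently.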

If you are a hard-core minimalist, you will find that the command strings, part of the GNU binutils package, is often sufficient to extract the plain text information from a .doc - however, antiword has the significant advantage over strings that it can also extract images in addition to just the text.

For details on the use of the various options and on how to extract images from a Word file, see antiword(1).

wv

The other application, formerly known as mswordview and now available as wv[12], has been around for quite some time. When I first installed RedHat 5.2 a few years ago, the Netscape browser used mswordview as the standard application to handle .doc files, as it converted them quite reliably into nice HTML. Note that I'm not talking about wordview, a Microsoft product. The similarity in the name caused the author to rename his tool.

While it certainly is great for a browser to use an application that turns Word files into HTML, this is not always the ideal output format. Therefore, wv nowadays includes a whole set of tools to convert Word documents into a large variety of formats, including, but not limited to, ASCII text, HTML, LaTeX, PostScript and PDF. wv is published under the GPL and available for free download.
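As a sketch of how these converters can be combined, the function below produces both an HTML and a plain-text version of one document. It assumes that your wv installation provides the wrapper scripts wvHtml and wvText and that each takes an input file followed by an output file name. The function name doc_to_web is invented, and the results are written to the current directory.

```shell
# doc_to_web: convert one Word file to HTML and to plain text with wv.
# Hypothetical helper; output files land in the current directory.
doc_to_web() {
    name=$(basename "$1")
    name=${name%.*}                       # report.doc -> report
    wvHtml "$1" "$name.html" &&
        wvText "$1" "$name.txt"
}
```

For example, doc_to_web ~/mail/report.doc should leave report.html and report.txt in the directory you called it from.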

No Words

Ok, so far we've seen how we can read Word documents, and even what options there are to write documents that, in Winworld, would most likely be done in Word. But I can't help concluding that the Word Processor itself, as an application, is either not required or useless in the vast majority of cases.

The typical user trying to write a simple progress report, for example, usually follows a certain scheme:

Now I am fully aware that this is not the proper way to utilize a powerful Word Processor, but let's face it - that's exactly how the majority of users - those for whom these ``user friendly'' applications are designed - work. The efforts required to enter a table of contents, a bibliography, cross-references, etc. can only be imagined.

Eventually, the outcome is a document that takes hours to prepare, and that looks the way it should only on this platform using this particular version of this word processor.

To avoid such bad practices, let's investigate some alternative methods of preparing platform-independent documents.

A classic: ASCII

As I have mentioned repeatedly, the information contained within the majority of documents is plain text. At times some fancy formatting may be nice, but it's optional. The main interest of the person writing the document should be to communicate the information.

Simple, plain ASCII text is usually sufficient to send information from one person to another - that's exactly why e-mail, for example, is still a text medium. HTML in emails does not add anything to the content. ASCII text can be read from anywhere with any editor (and not just with ``any editor, including Microsoft Word...''). By structuring the text clearly, by using paragraphs and horizontal lines constructed out of hyphens, maybe even by using *bold*, /italics/ and _underlined_ text as used on Usenet, one can write clear, easy-to-read and, most importantly, portable documents.
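As an illustration of these conventions, here is a small sketch that prints the skeleton of a plain-text report. The helper names rule and ascii_report, the 60-column width and the report contents are all invented for this example.

```shell
# rule: print a horizontal line made of hyphens (default width 60).
rule() {
    printf "%${1:-60}s\n" '' | tr ' ' '-'
}

# ascii_report: print a small report using Usenet-style emphasis.
ascii_report() {
    echo "Progress Report, Week 12"
    rule
    echo
    echo "The importer is *done*; the exporter is /almost/ done."
    echo "Open issue: tables are _not_ converted yet."
    echo
    rule
}
```

Redirecting ascii_report into a file or piping it into your mail program gives you something every recipient can read, no matter which editor they use.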

LyX and LaTeX

While plain ASCII text should be the choice for most cases, it cannot be denied that occasionally one might need more formatting. If not for the sake of information, it may simply look better. Your boss will prefer a progress report that looks neat. Well, no problem - no need to dig out the old Word Processor again. Just use LyX, the graphical frontend to LaTeX.

LaTeX is an astounding typesetting system built on TeX. It takes a .tex file as input and typesets it, generating a .dvi file. It is available for a large variety of platforms, and documents typeset with LaTeX look incredibly professional. Yet you can use your favorite editor to create the input files, since LaTeX is a command-line tool.

When using LaTeX, one can concentrate on the content of the document instead of the way it looks - the typesetting engine will take care of the layout. A .tex file contains a few tags (which may remind you of HTML) to determine the way the text will be displayed.
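To give an idea of what such tags look like, the following sketch writes a minimal LaTeX source file from the shell. The function name write_report, the default file name report.tex and the contents of the report are made up for illustration.

```shell
# write_report: create a minimal LaTeX source file (default: report.tex).
# Hypothetical helper; the document text is just an example.
write_report() {
    cat > "${1:-report.tex}" <<'EOF'
\documentclass{article}
\title{Progress Report}
\author{A. N. Author}
\begin{document}
\maketitle
\section{Status}
The importer is \emph{almost} done.
\begin{itemize}
  \item Tokenizer: finished
  \item Grammar: in progress
\end{itemize}
\end{document}
EOF
}
```

After write_report report.tex, running latex report.tex should produce report.dvi, provided a LaTeX installation is available. Note that nothing in the source says how large the title should be or where the section heading goes; that is left to the document class.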

This is a completely different way of writing a document - no more pointing and clicking and highlighting and re-considering and so on and so on. But it may be daunting to someone who is used to using a Graphical User Interface.

Now this is where the good guys from LyX[13] come into play. They developed a GUI to LaTeX, enabling the inexperienced user to take advantage of the power of TeX, but without having to learn it from scratch (yet).

At first glance, LyX may look similar to your average Word Processor, but if you follow the tutorial, you will quickly see the difference and how you can increase productivity by concentrating on your work, your material, rather than on the visual representation.

Figure 5: A Screenshot of LyX

If you often connect remotely to your machine to get work done, like I do, you don't always have the ability to export your display or to forward X. This is when you learn to appreciate the power of the command line, when you find out that everything you ever need is right there at your fingertips. By using your favorite editor (vim, in my case) and LaTeX, you can easily get all your work done through a single terminal to your machine.

Another advantage of LyX and LaTeX is that you can easily export your files into platform-independent formats such as PostScript or PDF. By combining the power of make with the power of LaTeX, this can be done with just a few commands. Take, for example, this document - I turned the input file into a beautiful PDF (Figure 6) simply by using the command

$> make pdf

Figure 6: This document in source and in PDF format

Even though the Makefile itself (Figure 7) is simple, it allows me to easily convert my document into a large variety of output formats using several different command-line tools, such as ps2pdf and latex2html.

Figure 7: The Makefile for this document
TARGET = words
LATEX = latex
DVI...
... *.ps *.pdf *.toc *.txt
rm -fr $(TARGET)/
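Since the Makefile in Figure 7 is abbreviated, here is a rough sketch of the kind of steps a "pdf" target could run. This is a guess at a typical tool chain (latex, dvips, ps2pdf), not a reconstruction of the elided rules, and the function name build_pdf is invented.

```shell
# build_pdf: turn NAME.tex into NAME.pdf via DVI and PostScript.
# Hypothetical helper; assumes latex, dvips and ps2pdf are installed.
build_pdf() {
    target=${1:-words}
    latex "$target.tex" &&                  # .tex -> .dvi
    latex "$target.tex" &&                  # second pass for cross-references
    dvips -o "$target.ps" "$target.dvi" &&  # .dvi -> PostScript
    ps2pdf "$target.ps" "$target.pdf"       # PostScript -> PDF
}
```

Wrapping these steps in a Makefile target has the added benefit that make only rebuilds the PDF when the source file has actually changed.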

Finally, LaTeX is extensible - you can write your own styles to achieve different results depending on the kind of document you are writing. But most likely, someone else has already done so and uploaded it to the Comprehensive TeX Archive Network (CTAN[14], TeX's equivalent to Perl's CPAN).

Conclusion

In brief, whichever way you choose to handle your Word Processing, the importance of conveying the information in a portable document format needs to be expressed. Just try to make it clear to the people you correspond with, to the people who continually send you MS Word documents and then insist that you ``fix your computer'' when you tell them that you can't open them, or that some formatting got lost. I have found that if one explains in a friendly way how a PDF or a PS, for example, can be read by anyone on almost every platform, occasionally one can educate all but the most stubborn citizens of Winworld.

Personally, I'm sure you will find that LaTeX is far superior even for these little everyday tasks when it comes to creating professional (looking) documents. In order to take advantage of LaTeX, however, it is necessary to free your mind from what you may be used to - this may take a while. But don't be afraid, there is a lot of helpful documentation out there.

After a short time of going through a tutorial and, most importantly, giving it a try and taking a look at some examples, you will never want to go back - you can take my Word for it.

Bibliography

[1] alien, http://kitenet.net/programs/alien/
[2] Abisource, http://www.abisource.com/
[3] StarOffice, http://www.sun.com/products/staroffice/
[4] OpenOffice, http://www.openoffice.org
[5] Applixware Office, http://www.vistasource.com
[6] KOffice, http://www.koffice.org
[7] Corel, http://www.corel.com
[8] Corel Linux, http://linux.corel.com/products/linux_os/index.htm
[9] Corel WordPerfect, http://linux.corel.com/products/wpo2000_linux/index.htm
[10] Siag Office, http://siag.nu
[11] Antiword, http://www.winfield.demon.nl/index.html
[12] wv, http://www.wvWare.com/
[13] LyX, http://www.lyx.org
[14] Comprehensive TeX Archive Network, http://www.ctan.org


Jan Schaumann 2001-10-15