Converting Word-Processor Files to Plain Text (Hopkins)
Converting Word-Processor Files to Plain Text Files


  • What is a Plain-Text File?
  • The PK5 Plain-Text Requirement
  • Problematics of Converting Word Files to Plain Text
  • How a 75-Character Line Length Should Look
  • Copying/Pasting into PICO
  • Copying/Pasting to a Text Editor
  • Saving a File as 'Text Only' in Word
  • E-mail: Plain Text vs 'Attachments'

What is a 'Plain Text' or 'ASCII' Text?

'Ascii text' (sometimes known as 'DOS text') is a synonym for 'plain text': files which do not have any of the special screen-display and printing codes that are part of most word-processor documents. 'Ascii texts' do not display underlining, boldface, italics, or different font sizes. They include only the words, numbers and punctuation of the text, without additional 'styling'. They are the lowest common denominator of text files. As such, they can be retrieved into any text editor or word processor, and pass through any e-mail system on the internet with maximum speed and minimum likelihood of text change or loss.

Plain text files are still required for certain types of specialized software (concordancing software, for example) which automatically inserts its own special markup coding for the words and punctuation of the text, and does not want any 'foreign' coding in the text which may interfere with its own. Likewise, conference abstracts are sometimes requested as plain text, so that the word-processing or DTP software used by the organizers can more easily produce all the abstracts published for the conference in exactly the same format and style.

Also, web pages are built on ascii text, to which HTML 'tags' are added so that web browsers (such as Internet Explorer, Firefox, Safari, Opera, etc.) will know how to display the text as the author had intended.

The PK5 Plain Text Requirement

One of the PK5 exam requirements is for students to place online a plain-text file in a specified standard format. This format is in principle the same as will also be used for the PK5 'HTML paper'. The requirements for both are as follows:
  • Line length should be the plain-text 'standard' of a maximum of 75 characters/spaces per line of text (so the text will be human-readable even with an 800x600 display setting). [This is the same 'standard' as used by the PICO editor on Kielo when used within the default Putty terminal window.]

  • Paragraphs should be single-spaced text, separated from the next paragraph by double-spacing. There should also be double-space separation between text paragraphs and the text's title, section headers, etc.

  • Paragraphs should be in 'block' format, with all lines beginning at the same left-side margin. In other words, do not use paragraph indentations.

  • The material used for the PK5 plain text file should be appropriate for a text file; it cannot include images, for example, and also should not include tables or other information which is presented in parallel columns. While this can be done, a plain text file is not an efficient means to either produce or convey such material (HTML, PDF, etc., would be used instead).

A sample text file which meets these requirements can be found here; it was converted from an original Word file, which can be found here (RTF).
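
The line-length requirement can also be checked automatically. The following is only a minimal sketch in Python, not part of the PK5 instructions: the function name check_plain_text is hypothetical, and flagging non-ascii characters (such as Word's 'smart quotes') is an added assumption about what is worth checking by eye.

```python
MAX_WIDTH = 75  # the 'standard' maximum line length described above


def check_plain_text(text, max_width=MAX_WIDTH):
    """Return a list of (line_number, message) pairs for possible problems."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Lines longer than the limit run off a narrow display.
        if len(line) > max_width:
            problems.append(
                (number, f"{len(line)} characters (limit is {max_width})")
            )
        # Characters outside basic ascii may not display the same everywhere.
        if any(ord(ch) > 127 for ch in line):
            problems.append((number, "non-ascii character(s) to check"))
    return problems


if __name__ == "__main__":
    sample = "A short line\n" + "x" * 80 + "\n"
    for number, message in check_plain_text(sample):
        print(f"line {number}: {message}")
```

An empty result means only that no line exceeded the limit and no non-ascii characters were found; spacing between paragraphs still needs to be checked by reading the file.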

Examples of Problematics When Converting Word Documents to Plain Text

In principle, producing the plain text file to meet the above requirements is simple. The main problematics are usually the following:
  • Extra spacing has not been removed above the beginning of the text, or between paragraphs or other text elements. Such spacing may have been used for Word formatting, or may have been inadvertently created when copying and pasting from one type of software to another (see an example here).

    NB: Paragraphs and other text elements should be separated by double-spacing (as illustrated in the link above); not by triple or quadruple-spacing. The only exception to this would be if you wished to use triple-spacing for additional separation between different 'sections' of the paper, such as between the three joke versions in the sample file.

  • The file's line length is too long (i.e. longer than the 'standard' maximum of 75 characters/spaces per line). This is often a problem if a Word file has been converted to 'plain text' using the Word "save as" command, as described below. This will usually result in a file which looks like this, where each 'paragraph' of text is displayed as a single line that runs far off the right side of the display screen and is therefore not 'human-readable'.

  • Problems with Word's 'Smart Quotes'. Unlike the other examples above, students have often encountered problems when either converting or copy/pasting Word files in which the 'smart quotes' option has been turned on. Word's 'smart quotes' (and the curly apostrophes used to indicate possessive forms) are characters outside the basic ASCII character set, so there is no exact equivalent for them in ascii text. The result is usually that the smart quotes or apostrophes are converted to periods, although some other characters also occasionally show up. For example, instead of "her dog's breath smells bad" one would get "her dog.s breath smells bad".

    There are two solutions to this problem. (1) Turn off 'smart quotes' in Word for that file and re-save a copy of the file with ordinary quotation marks (or apostrophes). (2) If you already have a plain text with the wrong characters, you could either go through the text manually and change the errant characters, or else (since the problem will be consistent) use a text editor like NoteTab that has a search-and-replace function to search for all instances of .s and replace them with 's (if it is apostrophes rather than quotation marks that are the problem).
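
A third possibility, where the text is still available with its curly quotes intact (that is, before they have been turned into periods), is to replace them with a short script. This is only a sketch in Python under that assumption; the function name straighten_quotes is hypothetical, and the table covers only the four most common curly quote characters.

```python
# Curly ('smart') quote characters and their plain ascii equivalents.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also the curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def straighten_quotes(text):
    """Replace curly quotes and apostrophes with plain ascii ones."""
    return text.translate(str.maketrans(REPLACEMENTS))


if __name__ == "__main__":
    print(straighten_quotes("her dog\u2019s breath smells bad"))
```

This does not repair a file in which the quotes have already been converted to periods; for that, the search-and-replace approach in (2) is still needed.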

Fortunately, these problems are easy to see when you check your text, and can be solved easily using one of the options described below.

How a 75-character Line Length Should Look!

The 75-character/space line length should look like this; PICO was used to produce this file by correcting this faulty version.

Copying and Pasting Directly into PICO

The easiest way to convert a word-processor file to ascii may be to "copy and paste" the WORD text directly into PICO. In the following, WORD will be used as the reference, but the procedure is the same for any Windows software.

NB! When using Pico to produce the correct line-length for your plain-text file, be certain you have not re-sized the default Pico window: a larger window will also produce longer line lengths! [This is the only known problem when using Pico to produce the correct 'standard' line length; the problem has been increasing in recent years as students are more often using widescreen monitors with high resolutions on which the default Pico window size may seem 'too small'.]

  1. Start WORD, "open" the document you want to convert, use the "Select All" command (either Control-A or else Edit-Select All from the top-left menu bar) to highlight all the text you wish to copy, and then use the "Copy" command to move the highlighted text to your Windows "clipboard".
  2. With your Putty (or Tectia SSH, etc.) connection to Kielo open, start PICO with a blank screen. Then paste in the text from the clipboard. When you paste the text into PICO, it will be plain text. It will also be single-spaced, and the paragraphs will normally have the correct line-length (if they don't, use PICO's "Control-J" command to reformat each paragraph). You may also need to edit the text so that the title and author are centered (if desired), extra spacing is removed, and the spacing between paragraphs and at the end of the text is appropriate.
  3. Then use PICO's "Control-O" command to save the file with the desired name (with a ".txt" ending, if it is a plain-text file). Remember to use the UNIX "chmod 644" command to make the file "visible."
  4. If you paste a text which includes Scandinavian characters, be certain to check with your web browser that these characters display properly. If they do not, replace them with HTML "escape" codes so that they will display properly (see The HTML Coded Character Set). If there are many such codes, however, this function would most easily be done offline using a Windows-based text editor's "search and replace" function.
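
The replacement of Scandinavian characters with HTML "escape" codes mentioned in step 4 can be sketched in Python as follows. This is an illustration only, not the procedure used on Kielo: the function name to_html_entities is hypothetical, and the sketch assumes the text has already been read into a Python string. It uses the entity table from Python's standard html.entities module and falls back to a numeric code where no named code exists.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace every non-ascii character with an HTML escape code."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)                # plain ascii stays as it is
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")   # e.g. &auml;
        else:
            pieces.append(f"&#{code};")      # numeric fallback
    return "".join(pieces)


if __name__ == "__main__":
    print(to_html_entities("Jyv\u00e4skyl\u00e4"))
```

The escape codes are meant for HTML pages; in a file that is served as plain text they would be displayed literally.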

Copying/Pasting Texts First into a Text Editor

Texts can also be copied and pasted from Word (or elsewhere) into a text editor such as NoteTab Light (which can be freely installed on your own computers) that allows you to specify the text line length. Set this line length to either 74 or 75. Then you can easily reformat entire text files to the correct line length, save this revised format, and then upload it to Kielo. For longer files this may save considerable time, compared with needing to reformat paragraph-by-paragraph when using the PICO Control-J command described above.
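
The same whole-file reformatting can be sketched in Python with the standard textwrap module. This is a minimal illustration, not a replacement for the editors named above: the function name rewrap is hypothetical, and the sketch assumes that paragraphs are separated by at least one blank line. Note that it reduces all paragraph separation to a single blank line, so any deliberate triple-spacing between sections would have to be restored by hand.

```python
import re
import textwrap


def rewrap(text, width=75):
    """Re-flow each blank-line-separated paragraph to the given width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        # Collapse internal whitespace first, then fill to the line width.
        textwrap.fill(" ".join(paragraph.split()), width=width)
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped) + "\n"


if __name__ == "__main__":
    long_paragraph = "This sentence is repeated to make one long line. " * 5
    print(rewrap(long_paragraph + "\n\nA second, short paragraph."))
```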

Text editors such as Notetab Light which also have search-and-replace functions will enable quick replacement of native 'Scandi' characters with their HTML 'escape code' equivalents, when producing HTML pages, and other such editing tasks which may be easier to do on your own computer than directly on Kielo with PICO.

Saving a File as 'Text Only' in Microsoft Word

However, instead of using 'copy and paste', one may also use Word's "save as" function to save the file as "text only" or "text only with line breaks", and then use WinSCP to upload the saved text file to Kielo to retrieve into PINE. Note, though, that even if the "text only with line breaks" option is used, this often results in the overly long line length problem shown here, so that corrective measures would need to be taken either with PICO or another editor, as described both above and below.

The following instructions assume that (1) you have your file on-screen in WORD, (2) you have already saved it as a Word ".doc" or ".rtf" file, and (3) you now want to save it as a separate ascii file in order to upload it. In principle all Windows word processors (such as WordPerfect, OpenOffice, etc.) and text editors (WordPad, etc.) offer the same menu commands as given in the WORD examples below.

  1. Assume you are editing a file named "test.doc" and want to save it as an ascii text.
  2. Click on "File" in the upper-left corner of the menu bar.
  3. From the menu you get, choose "Save As:"
  4. You will then get a menu at the bottom with two boxes, "Filename" on top and "Save as type" on the bottom. In the "Filename" box will be the original "test.doc" name of your file.
  5. From the "Save as type" box choose "Text Only" (or "text only with line breaks"). Depending on how your version of Word is configured, after you click this option, you might notice that the original "test.doc" filename in the upper box will have changed automatically to "test.txt". You may also enter this name (or any other name that you wish) manually in the "Save As" box.
  6. Click 'okay' to save a copy of your "test.doc" Word file as an ascii text named "test.txt". Note that you will still have both the original ".doc" or ".rtf" file plus the new ".txt" version you just created.
  7. This plain-text (ascii) file may now be uploaded with FTP. However, after the file is on-line in your public_html directories, check it with your web browser. You will probably notice that you will need to retrieve the file into PICO and use the "Control-J" command to reformat the paragraphs into proper line length.

ASCII Text vs 'Attachments' For E-mail

When sending e-mail, it is always preferable to have the content in the 'body' or 'text field' of the e-mail note, rather than appended to the note as an attachment. Content which is in the body of an e-mail note is essentially plain text (and in the case of the PICO editor and ALPINE mailer, purely plain text), although Windows-based or web-based e-mail software, such as Outlook Express, Eudora, GMail, or IMP/HORDE, is able to display the text with proprietary 'markup' such as different font sizes, boldface, italics, etc. When copied/pasted from e-mail software to other software, however, the text usually reverts to being only 'plain' text, without markup.

If your e-mail note consists only of content in the text-field of the note, it will always be easily readable (without the need for external software such as Adobe Reader, etc.), will always transmit more quickly, and will not run the risk of being blocked as a potential virus-carrier by the receiver. However, certain material, particularly that which includes images or which must include special markup or must be in an unchangeable (such as PDF) format, cannot be sent as plain text. In such cases, when you need to use an attachment to convey the information, you should first obtain explicit permission from the receiver to do so (e.g. can that person receive attachments in a particular format).
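
For readers who send e-mail from a script rather than from a mailer, the difference between body text and an attachment can be seen in a short Python sketch using the standard email.message module. The function name build_plain_text_message and the example.org addresses are hypothetical; the sketch only builds the message and does not send it.

```python
from email.message import EmailMessage


def build_plain_text_message(sender, recipient, subject, body):
    """Build an e-mail note whose content is entirely in the body."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    # set_content() with a string produces a single text/plain part,
    # i.e. a note with no attachments.
    message.set_content(body)
    return message


if __name__ == "__main__":
    note = build_plain_text_message(
        "student@example.org",      # hypothetical addresses
        "teacher@example.org",
        "PK5 plain text",
        "The text of the note goes here, in the body.\n",
    )
    print(note.get_content_type())
    print(note.is_multipart())
```

A note built this way has the content type text/plain and is not multipart, which is what 'content in the body' means in practice.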

Do not send "attachments" without first getting explicit permission to do so. Among the problems associated with attachments (see Understanding Viruses and Attachments for more detail) are:

  1. As binary files, they may contain viruses (unlike plain-text files);
  2. Due to the virus threat, some computer systems have installed firewalls or filters which either prevent e-mail with 'attachments' from entering the system, or else automatically delete the attachment from the e-mail message it is 'attached' to;
  3. The receiver may not have the necessary software to be able to process the attached file;
  4. Even if the receiver has the right type of mailer and word-processor to process the attachment, 'extra steps' (meaning extra time and work) will still be required in order to open and read it. The 'special formatting' of the WORD document is often not worth the inconvenience of processing the attachment, compared to getting the same text as 'plain e-mail';
  5. Attachments are usually 2-4 times (or more) larger in size than the same file as plain text due to all the extra formatting data — they thus travel more slowly and consume more network and disk space;
  6. Attachments may not transmit successfully through list-server software; they may instead produce enormous quantities of 'garbage characters' in the e-mail of every list recipient. Therefore, do NOT send attachments of any sort to a Listserv, Listproc or Majordomo list. This is one of the reasons why many e-mail lists (such as all of the UTA FAST lists) have been configured to automatically reject any attachments to e-mail being posted to the list.



Last Updated 03 March 2011