Converting Word-Processor Files to Plain Text Files
What is a Plain-Text File?
The PK5 Plain-Text Requirement
Problematics of Converting Word Files to Plain Text
How a 75-Character Line Length Should Look
Copying/Pasting into PICO
Copying/Pasting to a Text Editor
Saving a File as 'Text Only' in Word
E-mail: Plain Text vs 'Attachments'
What is a 'Plain Text' or 'ASCII' Text?
'Ascii text' (sometimes known as 'DOS text') is a synonym for 'plain text',
that is, a file which does not have any of the special screen-display and
printing codes that are part of most word-processor documents. 'Ascii
texts' do not display underlining, boldface, italics, or different font
sizes. They include only the words, numbers and punctuation of the text,
without additional 'styling'. They are the lowest common denominator of
text files.
As such, they can be retrieved into any text editor or word processor, and
pass through any e-mail system on the internet with maximum speed and
minimum likelihood of text change or loss.
Plain text files are still required for certain types of specialized
software (concordancing software, for example) which will automatically
insert its own special markup coding for the words and punctuation of the
text, and which does not want any 'foreign' coding in the text that may
interfere with its own. Likewise, even conference abstracts
are sometimes requested as plain text, so the word-processing or DTP
software used by the organizers can more easily produce all the abstracts
published for the conference in exactly the same format and style.
Also, web pages are built on ascii text, to which HTML 'tags' are added
so that web browsers (such as Internet Explorer, Firefox, Safari, Opera,
etc.) will know how to display the text as the author had intended.
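As a rough illustration of what 'plain ascii' means in practice, the
following Python sketch lists every character in a text that falls outside
the 7-bit ASCII range. The function name and the sample text are
illustrative only and are not part of the PK5 materials.

```python
def non_ascii_positions(text):
    """Return (line_number, column, character) for every non-ASCII character."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # ASCII covers code points 0-127 only.
            if ord(ch) > 127:
                found.append((line_no, col, ch))
    return found


# Hypothetical sample: the second line contains a curly apostrophe.
sample = "Plain words only.\nHer dog\u2019s breath smells bad.\n"
for line_no, col, ch in non_ascii_positions(sample):
    print(f"line {line_no}, column {col}: {ch!r}")
```

An empty result suggests the text is pure ascii; anything reported is a
character that may not survive every editor or mail system unchanged.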
The PK5 Plain Text Requirement
One of the PK5 exam requirements is for students to place online a
plain-text file in a specified standard format. This format is in
principle the same as will also be used for the PK5 'HTML paper'. The
requirements for both are as follows:
- Line length should be the plain-text 'standard' of a maximum of 75
characters/spaces per line of text (so the text will be human-readable
even with an 800x600 display setting). [This is the same 'standard' as
used by the PICO editor on Kielo when used within the default Putty
terminal window.]
- Paragraphs should be single-spaced text, separated from the next
paragraph by double-spacing. There should also be double-space separation
between text paragraphs and the text's title, section headers, etc.
- Paragraphs should be in 'block' format, with all lines beginning at
the same left-side margin. In other words, do not use paragraph
indentations.
- The material used for the PK5 plain text file should be appropriate
for a text file; it cannot include images, for example, and also should
not include tables or other information which is presented in parallel
columns. While this can be done, a plain text file is not an efficient
means to either produce or convey such material (HTML, PDF, etc., would be
used instead).
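The requirements above can be checked mechanically. The following Python
sketch is one possible checker, not an official PK5 tool: the function
name and the messages are invented for illustration. It assumes that
'double-spacing' means one blank line, and it tolerates two blank lines in
a row (triple-spacing between sections). Note that it will also flag a
centered title as 'indented', which can be ignored.

```python
MAX_LINE_LENGTH = 75  # maximum characters/spaces per line


def check_plain_text(text, max_len=MAX_LINE_LENGTH):
    """Return a list of human-readable problems found in the text."""
    problems = []
    blank_run = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            blank_run += 1
            # Report once, when a third consecutive blank line appears.
            if blank_run == 3:
                problems.append(f"line {line_no}: more than two blank lines in a row")
            continue
        blank_run = 0
        if len(line) > max_len:
            problems.append(f"line {line_no}: {len(line)} characters (maximum {max_len})")
        if line[0] in " \t":
            problems.append(f"line {line_no}: line is indented")
    return problems


if __name__ == "__main__":
    demo = "Title\n\nA short paragraph.\n\n   An indented line.\n"
    for problem in check_plain_text(demo):
        print(problem)
```

To check a real file, read it first (for example with
open("mytext.txt").read(), where the filename is hypothetical) and pass
the result to check_plain_text.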
A sample text file which meets these requirements can be found here; it
was converted from an original Word file which can be found here (RTF).
Examples of Problematics When Converting Word Documents to Plain Text
In principle, producing the plain text file to meet the above requirements
is simple. The main problems are usually the following:
- Extra text spacing above the beginning of the text or between
paragraphs or other text elements which may have been used for Word
formatting, or which was inadvertently created when copying and pasting
from one type of software to another, has not been removed (see an
example here).
NB: Paragraphs and other text elements should be separated by
double-spacing (as illustrated in the link above); not by triple or
quadruple-spacing. The only exception to this would be if you wished to
use triple-spacing for additional separation between different 'sections'
of the paper, such as between the three joke versions in the sample file.
- The file's line length is too long (e.g. longer than the
'standard' maximum of 75 characters/spaces per line). This is
often a problem if a Word file has been converted to 'plain text' using
the Word "save as" command, as described below. This will usually result
in a file which looks like this, where
each 'paragraph' of text is displayed as a single text line which will run
far off the right-side of the display screen, which is not
'human-readable.'
- Problems with Word's 'Smart Quotes'. This problem is slightly
different from the other examples above: students have often encountered
it when either converting or copy/pasting Word files in which the 'smart
quotes' option has been turned on. The characters Word uses for 'smart
quotes' (and for apostrophes to indicate possessive forms) are not part of
the ASCII character set, so there is no 'standard' equivalent in plain
text. The result is usually that the smart quotes or apostrophes are
converted to periods, although some other characters also occasionally
show up. For example, instead of "her dog's breath smells bad" one would
get "her dog.s breath smells bad".
There are two solutions to this problem. (1) Turn off 'smart quotes'
in Word for that file and re-save a copy of the file with ordinary
quotation marks (or apostrophes). (2) If you already have a plain
text with the wrong characters, you could either go through the text
manually and change the errant characters, or else (since the problem will
be consistent) use a text editor like NoteTab that has a
search-and-replace function to search for all instances of ".s" and
replace them with "'s" (if it is apostrophes rather than quotation marks
that are the problem).
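Both repairs can also be scripted. The Python sketch below is only an
illustration and its function names are invented. The first function
replaces curly quotation marks and apostrophes (while they are still
intact) with their plain equivalents. The second attempts the
search-and-replace repair on a text that has already been damaged. It
assumes that a period squeezed between a letter and a final "s" is always
a damaged apostrophe, which will not hold for every text, so the result
should be proofread.

```python
import re

# Curly punctuation mapped to its plain ASCII counterpart.
SMART_PUNCTUATION = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}
TABLE = str.maketrans(SMART_PUNCTUATION)


def straighten_quotes(text):
    """Replace curly quotes and apostrophes with plain ASCII ones."""
    return text.translate(TABLE)


def repair_possessives(text):
    """Turn 'dog.s' back into "dog's" (assumption: '.s' after a letter is damage)."""
    return re.sub(r"(?<=\w)\.s\b", "'s", text)


print(straighten_quotes("her dog\u2019s breath smells bad"))
print(repair_possessives("her dog.s breath smells bad"))
```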
Fortunately, these problems are easy to see when you check your
text, and can be solved easily using one of the options described below.
How a 75-character Line Length Should Look!
The 75-character/space line length should look like this; this file was
produced by using PICO to correct this faulty version.
Copying and Pasting Directly into PICO
The easiest way to convert a word-processor file to ascii may be to
"copy and paste" the WORD text directly into PICO. In the
following, WORD will be used as the reference, but the procedure is the
same for any Windows software.
NB! When using Pico to produce the correct line-length for your
plain-text file, be certain you have not re-sized the default
Pico window: a larger window will also produce longer line
lengths! [This is the only known problem when using Pico to produce
the correct 'standard' line length; the problem has been increasing in
recent years as students are more often using widescreen monitors with
high resolutions on which the default Pico window size may seem 'too
small'.]
- Start WORD, "open" the document you want to convert, use the
"Select All" command (either Control-A or else Edit-Select
All from the top-left menu bar) to highlight all the text you wish to
copy, and then use the "Copy" command to move the highlighted text to your
Windows "clipboard".
- With your Putty (or Tectia SSH, etc.) connection to Kielo open, start
PICO with a blank screen. Then paste in the text from the clipboard.
When you paste the text into PICO, it will be plain text. It will
also be single-spaced, and the paragraphs will normally have the
correct line-length (if they don't, use PICO's "Control-J" command to
reformat each paragraph). You may also need to edit the text so that the
title and author are centered (if desired), extra spacing is removed, and
the spacing between paragraphs and at the end of the text is appropriate.
- Then use PICO's "Control-O" command to save the file with the desired
name (with a ".txt" ending, if it is a plain-text file). Remember to use
the UNIX "chmod 644" command to make the file "visible."
- If you paste a text which includes Scandinavian characters, be
certain to check with your web browser that these characters
display properly. If they do not, replace them with HTML "escape" codes
so that they will display properly (see The HTML
Coded Character Set). If there are many such codes, however, this
function would most easily be done offline using a Windows-based text
editor's "search and replace" function.
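For HTML pages, the replacement of Scandinavian characters with escape
codes can be automated offline. The following Python sketch uses the
standard library's table of HTML entity names and falls back to a numeric
code for any character that has no named entity. The function name is
invented for illustration, and the sketch is meant for text that will be
placed in an HTML page, not for the plain-text (.txt) file itself.

```python
from html.entities import codepoint2name


def escape_non_ascii(text):
    """Replace every non-ASCII character with an HTML escape code."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # ordinary ASCII is left untouched
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity, e.g. &auml;
        else:
            out.append(f"&#{code};")  # numeric fallback
    return "".join(out)


print(escape_non_ascii("Jyväskylä"))
```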
Copying/Pasting Texts First into a Text Editor
Texts can also be copied and pasted from Word (or elsewhere) into a text
editor such as Notetab Light (which
can be freely installed on your own computers) that allows you to specify
the text line length. Set this line length to either 74 or 75. Then you
can easily reformat entire text files to the correct line length, save
this revised format, and then upload it to Kielo. For longer files this
may save considerable time, compared with needing to reformat
paragraph-by-paragraph when using the PICO Control-J command described
above.
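The same whole-file reformatting can be sketched in Python with the
standard textwrap module. This is an illustration only, not the method
used by Notetab or PICO, and the function name is invented. It assumes
that paragraphs are separated by at least one blank line. It also removes
any indentation, reduces all extra spacing to a single blank line, and
runs together any lines that are not separated by a blank line, so
centered titles or deliberate triple-spacing would need to be restored by
hand afterwards.

```python
import re
import textwrap


def rewrap(text, width=75):
    """Re-flow each paragraph to at most `width` characters per line."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    # Paragraphs are whatever is separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    # One blank line ('double-spacing') between paragraphs.
    return "\n\n".join(wrapped) + "\n"


long_paragraph = "This sentence is repeated to imitate a paragraph saved as one line. " * 4
print(rewrap(long_paragraph + "\n\n\n\nA second paragraph.\n"))
```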
Text editors such as Notetab Light which also have search-and-replace
functions will enable quick replacement of native 'Scandi' characters with
their HTML 'escape code' equivalents, when producing HTML pages, and other
such editing tasks which may be easier to do on your own computer than
directly on Kielo with PICO.
Saving a File as 'Text Only' in Microsoft Word
However, instead of using 'copy and paste', one may also use Word's "save
as" function to save the file as "text only" or "text only with line
breaks", and then use WinSCP to upload the saved text file to Kielo to
retrieve into PINE. Note, though, that even if the "text only
with line breaks" option is used this often results in the
overly-long line length problem shown here, so that corrective measures would need
to be taken either with PICO or another editor, as described both above
and below.
The following instructions assume that (1) you have your file on-screen
in WORD, (2) you have already saved it as a Word ".doc" or ".rtf" file,
and (3) you now want to save it as a separate ascii file in order to
upload it. In principle all Windows word processors (such as WordPerfect,
OpenOffice, etc.) and text editors (WordPad, etc.) offer the same menu
commands as given in the WORD examples below.
- Assume you are editing a file named "test.doc" and want to save it as
an ascii text.
- Click on the upper-left corner "File" menu bar
- From the menu you get, choose "Save As:"
- You will then get a menu at the bottom with two boxes, "Filename" on
top and "Save as type" on the bottom. In the "Filename" box will be
the original "test.doc" name of your file.
- From the "Save as type" box choose "Text Only" (or "text only with
line breaks"). Depending on how your version of Word is configured, after
you click this option, you might notice that the original "test.doc"
filename in the upper box will have changed automatically to "test.txt".
You may also enter this name (or any other name that you wish) manually in
the "Save As" box.
- Click 'okay' to save a copy of your "test.doc" Word file as an ascii
text named "test.txt". Note that you will still have both the original
".doc" or ".rtf" file plus the new ".txt" version you just created.
- This plain-text (ascii) file may now be uploaded with FTP. However,
after the file is on-line in your public_html directories, check it with
your web browser. You will probably notice that you will need to retrieve
the file into PICO and use the "Control-J" command to reformat the
paragraphs into proper line length.
ASCII Text vs 'Attachments' For E-mail
When sending e-mail, it is always preferable to have the content in the
'body' or 'text field' of the e-mail note, rather than appended to the
note as an attachment. Content which is in the body of an e-mail note is
essentially plain text (and in the case of the PICO editor and ALPINE
mailer, purely plain text), although Windows-based e-mail software, such
as Outlook Express, Eudora, GMail, or IMP/HORDE, is able to display the
text with proprietary 'markup' such as different font sizes, boldface,
italics, etc. When copied/pasted from e-mail software to other software,
however, the text usually reverts to being only 'plain' text, without
markup.
If your e-mail note consists only of content in the text-field of the
note, it will always be easily readable (without the need for external
software such as Adobe Reader, etc.), will always transmit more quickly,
and will not run the risk of being blocked as a potential virus-carrier by
the receiver.
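For those who send e-mail from a script rather than from a mailer, the
same principle applies. The following Python sketch builds a message whose
content sits entirely in a plain-text body, with no attachment. The
function name and the example.org addresses are placeholders, and the
sketch only constructs the message; it does not send it.

```python
from email.message import EmailMessage


def build_plain_text_message(sender, recipient, subject, body):
    """Build an e-mail whose entire content is a plain-text body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)  # a single text/plain part, no attachment
    return msg


message = build_plain_text_message(
    "student@example.org",   # placeholder address
    "teacher@example.org",   # placeholder address
    "PK5 abstract",
    "The abstract goes here as ordinary text.\n",
)
print(message.get_content_type())
```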
However, certain material, particularly that which includes images or
which must include special markup or must be in an unchangeable format
(such as PDF), cannot be sent as plain text. In such cases, when you need
to use an attachment to convey the information, you should first obtain
explicit permission from the receiver to do so (e.g. by asking whether
that person can receive attachments in a particular format).
Do not send "attachments" without first getting
explicit permission to do so. Among the problems associated with
attachments (see Understanding Viruses and
Attachments for more detail) are:
- As binary files, they may contain viruses (unlike plain-text files);
- Due to the virus threat, some computer systems have installed
firewalls or filters which either prevent e-mail with 'attachments' from
entering the system, or else automatically delete the attachment from the
e-mail message it is 'attached' to;
- The receiver may not have the necessary software to be able to
process the attached file;
- Even if the receiver has the right type of mailer and word-processor
to process the attachment, 'extra steps' (meaning extra time and work)
will still be required in order to open and read it. The 'special
formatting' of the WORD document is often not worth the inconvenience of
processing the attachment, compared to getting the same text as 'plain
e-mail';
- Attachments are usually 2-4 times (or more) larger in size than the
same file as plain text due to all the extra formatting data they
contain; they thus travel more slowly and consume more network and disk
space;
- Attachments may not transmit successfully through list-server
software; they may instead produce enormous quantities of 'garbage
characters' in the e-mail of every list recipient. Therefore, do NOT send
attachments of any sort to a Listserv, Listproc or Majordomo list. This
is one of the reasons why many e-mail lists (such as all of the UTA FAST
lists) have been configured to automatically reject any attachments to
e-mail being posted to the list.
Last Updated 03 March 2011