Clarity README, University of Tampere

Service documentation, Tampere Services, Version S1

The Query Construction part of this document describes specifically Version S1 (April 3rd 2003 fixed version) of utaclir (/space4/utaclir/utaclir_s1) and its query properties.

Heikki Keskustalo, Eija Airio, Bemmu Sepponen, 3rd April 2003

This file describes the UTA services, that is, the services located on the Unix server kastanja.uta.fi (Solaris 7) in Tampere, Finland (Department of Information Studies, Tampere, Finland). The services listed in this file are as follows: Query translation; Search engines; Document fetching; Term Info; Lexical Translation; Morphological Analysis; Relevance Evaluation; Full Text Document Text Extraction. For details, see the explanations below. Please contact Heikki Keskustalo, Eija Airio, or Bemmu Sepponen before utilizing the resources in order to verify that you are using the most up-to-date resources, including SOAP resources. See also /home/iq3/soapdocs/ for SOAP information.

Basic services available at kastanja.uta.fi (/space4/clarity/ and /space4/utaclir/) for Clarity:

(1) This README file explaining the resources

(2) Query construction

ADVICE TO THE SEARCHER

User-given queries can be translated by utaclir version S1 in the following four directions (see "Advice to the system administrator" below for details):

English -> Swedish

English -> Finnish

Swedish -> English

Finnish -> English

Query formulation is described briefly in the next paragraph.

A query is one line containing a list of query words separated by spaces. Double quotes may be used for phrases. Always use lowercase letters. Use the symbol @ for words you do not want to translate.

Query formulation is described in more detail, together with examples, in the following paragraphs.

For example, to express a query to be translated from English to Finnish or Swedish while marking with the symbol @ that the proper name Bush should not be translated:

president @bush

For a more precise query, use phrase marking to demand that the words occur as adjacent words in the documents:

"president @bush"

Technical note regarding the @ operator: Unrecognized words are marked by a preceding @ in the database index, e.g. @indurain. However, for any individual proper name, such as reagan, it is impossible to know beforehand whether its form in the database index is reagan or @reagan without consulting the index or the morphological analyzer used for building the index. Therefore, when the user marks a query key by the symbol @, for example @xxx, the utaclir query translation system actually transforms the key into the synonym operator form #syn(xxx @xxx). This form matches the literal form of the original string regardless of whether the word was recognized or unrecognized by the morphological program that built the database index.
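The key expansion described in the technical note can be sketched as follows. This is only an illustration of the rule stated above, not the utaclir source code; the function name expand_key is hypothetical, and keys without @ are returned unchanged here, whereas utaclir would translate them.

```python
def expand_key(key: str) -> str:
    """Expand a do-not-translate key '@xxx' into '#syn(xxx @xxx)'.

    Hypothetical helper for illustration only. Keys that do not start
    with '@' are returned as they are (utaclir itself would translate them).
    """
    if not key.startswith("@"):
        return key
    word = key[1:].lower()  # index entries are lowercase
    # Match both the recognized form (xxx) and the unrecognized form (@xxx).
    return f"#syn({word} @{word})"


if __name__ == "__main__":
    print(expand_key("@bush"))
```

The synonym form is used because the searcher cannot know in advance which of the two index forms exists in the target database.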

Query types:

List of words: Simplest query is just a list of words.

Example:

opinion view president bush dna gene

Compounds: When Swedish and Finnish are used as source languages, compound words (words written together as one unit) can be used as source keys.

Example:

geeniteknologian vaikutus ihmisen ravintoketjuun

In the Finnish query above, the 1st and 4th words are compound words, each containing two component words written together. If a compound is untranslatable as a whole word, utaclir tries to split it into component words, translates the component forms, and forms a proper query structure on the basis of these translation results.

Proper names are mostly missing from the translation dictionary: It is important to note that the translation dictionary utilized by utaclir does not contain proper names. Even the common ones are generally missing.

Synonyms: The user is encouraged to list synonyms in her/his queries in order to describe query topics. Part of speech should also be taken into consideration; that is, the user could consider writing

economy economics economical finance trend direction

thereby utilizing both synonyms and several parts of speech, instead of using only one term and one part of speech per query facet, like

economy trend

Words that the user wants to keep in the target language translation in the same form they are in the source language (often acronyms and proper names): the user can mark words by symbol @

@tampere @dna research

Explanation:

We cannot know whether the word "tampere" is in the database index in the form "tampere" or "@tampere". The answer depends on what words the TWOL lexicon happens to include for the language in question. Therefore, if the user advises utaclir by using the key @tampere, utaclir automatically forms the synonym key #syn(tampere @tampere), which matches either form of the word. In practice, it is possible that, for example, the English TWOL does not contain the word Tampere but the Finnish TWOL does. The English TWOL is used for building the index of the English database, and the Finnish TWOL for building the Finnish database. As words recognized by TWOL are stored in their basic forms, and unrecognized strings are stored as such, marked by @, the English database might have the index entry @tampere but not the index entry tampere. And if Tampere was recognized by the Finnish TWOL, the Finnish database might have the index entry tampere but not @tampere.

Thus in the query above the user advises utaclir not to translate the words tampere and dna. (If the user uses capital letters in a word marked by @, they are transformed into lowercase letters.)

Example combining phrase quotations and do-not-translate -symbol @.

opinion "president @bush" @dna

Phrases: As seen above, the user can mark phrases by using double quotes.

At most 3 phrases may be used per query. For example,

"seat belt"

or

"president @bush"

(or, the latter one, in an equivalent manner, "president @Bush").

Above, the searcher marks the phrase by using double quotes.

In query translation, the system demands that the translations of president and the translations of @bush (for the latter word, a synonym operation containing both the strings "bush" and "@bush") occur as adjacent words in the documents. The utaclir output fed into InQuery would then be:

#sum(#uw2(#syn(presidentti pääjohtaja) #syn(bush @bush)) )

Note: The user should just use simple queries containing lists of words, with possibly phrases (and @-words). The structuring presented above is always performed by utaclir S1 query translation program.

Each phrase may contain from 1 up to 4 spaces.

Phrases longer than 4 spaces probably do not make sense; they are more likely sentences containing quotes. For them the #syn operator is used instead of the #uwN operator in version S1.

The following correspondence exists between the number of spaces in the phrase and the value of N in the #uwN operator (in the examples, N is twice the number of spaces):

Phrase examples:

1 space: "presidentti @clinton" -> #uw2(#syn(president) #syn(clinton @clinton))

2 spaces: "presidentti @bill @clinton" ->#uw4(#syn(president) #syn(bill @bill) #syn(clinton @clinton))

3 spaces: "prime minister @john @smith" ->#uw6(...)

4 spaces: four-space phrases, that is, five-word phrases, probably do not actually make sense. Yet for consistency, this case is handled by fitting the translation #syn lists into a #uw8 window.

Very long phrases probably do not occur in user queries.
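The correspondence between phrase length and operator can be sketched as below. This is a minimal illustration derived from the examples above (window size is twice the number of spaces, and #syn is used beyond 4 spaces); the function name phrase_operator is hypothetical and is not part of utaclir.

```python
def phrase_operator(phrase: str) -> str:
    """Return the operator name for a quoted phrase (quotes removed).

    Assumption taken from the examples above: N in #uwN is twice the
    number of spaces, and phrases with more than 4 spaces fall back to #syn.
    """
    spaces = len(phrase.split()) - 1  # words are separated by spaces
    if spaces < 1:
        raise ValueError("a phrase needs at least two words")
    if spaces <= 4:
        return f"#uw{2 * spaces}"
    return "#syn"


if __name__ == "__main__":
    for p in ("presidentti @clinton", "prime minister @john @smith"):
        print(p, "->", phrase_operator(p))
```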

Boolean operators: User could also use Boolean operators:

#and(#or(opinion view) #or(@clinton @bush) @dna)

The above is a basic structure for expressing conjunctive normal form. How well this works in practice is as yet unexplored. It is also possible for the user to formulate queries that result in an invalid translated query structure.

If Boolean operators are used, the expressed structure is propagated into the target query as such. Also, each source key is replaced by a synonym set of the translations of the basic form (or forms) that TWOL derives from the source key.

Advice to the system administrator.

Utaclir prototype version utaclir_s1 contains query translation service for the following four language pairs: English - Swedish, English - Finnish, Swedish - English, Finnish - English. Source language codes are s_eng, s_fin, and s_swe. Target language codes are t_eng, t_fin, t_swe.

Utaclir input is a three-line triplet, where the first line contains the source language code, the second line contains the target language code, and the third line contains the source language query.

For example, translating an English query to Finnish can be expressed as:

s_eng

t_fin

opinion "president @bush" @dna
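A caller could assemble the three-line input along the following lines. This is a sketch under stated assumptions: the function name build_input and the trailing newline are not from the original, and how the text is passed to utaclir (file, standard input, or SOAP) is not shown here.

```python
# The four translation directions listed for utaclir_s1.
SUPPORTED_PAIRS = {
    ("s_eng", "t_swe"),
    ("s_eng", "t_fin"),
    ("s_swe", "t_eng"),
    ("s_fin", "t_eng"),
}


def build_input(source: str, target: str, query: str) -> str:
    """Build the 3-line utaclir input: source code, target code, query."""
    if (source, target) not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported language pair: {source} -> {target}")
    if "\n" in query:
        raise ValueError("the query must be a single line")
    # Assumption: each of the three lines is terminated by a newline.
    return f"{source}\n{target}\n{query}\n"


if __name__ == "__main__":
    print(build_input("s_eng", "t_fin", 'opinion "president @bush" @dna'), end="")
```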

(3) Search engines:

Both interactive and batch engines exist for databases TUTK (Finnish newstext) and LaTimes (CLEF 2000, 2001, English newstext). Search engine is InQuery (v 3.1):

LA Times database

  • LA Times database contains about 113000 newspaper articles from Los Angeles Times. The index was built by using ENGTWOL application "etwt3". The database was used in the CLEF 2000 and 2001 experiments.
  • Usage: Start the search engine by typing one of the commands

    ./iqlat

    ./iqlat_date (if you want to calculate term frequencies: type g)

    ./iqlat_noninter queryfile relevancefile

  • Use InQuery operators:

    #sum

    #syn

    #uwN (unordered window, e.g. #uw3(information retrieval))

    #N (ordered window, e.g. #2(information retrieval))

    #and

    #or

    #band (Boolean and)

  • e.g.

    #sum(berlin #syn(architecture building))

  • Use keys in basic forms. See also "etwt3" below for simulating the basic form construction process used in the indexing phase.

    ./iqlat_date

  • You can use this alternative instead of iqlat and iqlat_ter_count (it acts in the same way, and searching the DATE field is also possible).
  • Use the #field operator like this:

    #field(DATE ndx_operator date)

    #field(DATE ndx_min_max_op)

    #field(DATE ndx_range_op date1 date2)

  • ndx_operator are the following:

    #gt #>

    #gte #>=

    #lt #<

    #lte #<=

    #ne #neq #!=

    #eq #== This is the default operator and may be omitted from the expression if desired.

  • ndx_min_max_op are the following:

    #fmin Minimum entry for a field index.

    #fmax Maximum entry for a field index.

  • ndx_range_op takes two forms:

    #range or #<>

  • examples:

    #field(DATE #> 19940709)

    #field(DATE #range 19940709 19940303)

  • If you want to add a date restriction to other search criteria, using the #band operator is the only logical alternative. Otherwise you will get documents which fulfill either your date specification or the other search criteria.

  • example:

    #band(oil #field(DATE #range 19940101 19940131))

  • - you will get all documents mentioning "oil" that were published in January 1994.

    #sum(oil #field(DATE #range 19940101 19940131))

  • - you will get all documents that either mention "oil" or were published in January 1994.
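Combining a query with a date restriction, as in the #band example above, can be sketched as a small string-building helper. The function name restrict_to_dates is hypothetical; only the #band, #field and #range operators shown above are used, and the dates are expected in the yyyymmdd form.

```python
def restrict_to_dates(query: str, start: str, end: str) -> str:
    """Wrap a query so that only documents in the date range are returned.

    Uses #band (Boolean and), since #sum would also return documents
    that match only the date restriction.
    """
    for date in (start, end):
        if len(date) != 8 or not date.isdigit():
            raise ValueError(f"date must be in the form yyyymmdd: {date!r}")
    return f"#band({query} #field(DATE #range {start} {end}))"


if __name__ == "__main__":
    print(restrict_to_dates("oil", "19940101", "19940131"))
```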

    ./iqlat_noninter:

  • queryfile contains queries like this:

    #q001 = #sum(......);

  • Produces a file queryfile.evl, which contains the positions of the relevant documents for each query.
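A query file for the non-interactive engine could be generated as sketched below. This assumes that each query takes the form "#qNNN = query;" with a three-digit running number and one query per line, which is inferred from the single example above and should be checked against the InQuery manuals; the function name format_queries is hypothetical.

```python
def format_queries(queries: list[str]) -> str:
    """Format InQuery queries as batch query file content.

    Assumption: one '#qNNN = <query>;' entry per line, numbered from 001.
    """
    lines = []
    for number, query in enumerate(queries, start=1):
        lines.append(f"#q{number:03d} = {query};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    content = format_queries(["#sum(berlin #syn(architecture building))"])
    print(content, end="")
```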

    DATABASES

    /* 2nd Dec 2002:

  • Date field: LaTimes, Alma9900, CLEF-Alma, Swedish CLEF.

  • Lat will be ready in January 2003, and Lit probably in February 2003.

  • Lat and Lit db will contain DATE field too.

  • Date ranges:

  • LaTimes: 1.1.1994 - 31.12.1994

  • Alma9900: 2.10.1999 - 31.12.2000

  • Alma CLEF: 1.1.1994 - 31.12.1995

  • Swedish CLEF: 1.1.1994 - 31.12.1995

  • Latvian: 1.1.2000 - 31.12.2000

  • Lithuanian: 1.1.2000 - 31.12.2001

  • Please use #<> (#range) operator.

    */

    CLEF Swedish Collections

    /space2/clef/swedish_index/swedish

  • consisting of the following data:

    199401.sgml 199405.sgml 199409.sgml 199501.sgml 199505.sgml 199509.sgml

    199402.sgml 199406.sgml 199410.sgml 199502.sgml 199506.sgml 199510.sgml

    199403.sgml 199407.sgml 199411.sgml 199503.sgml 199507.sgml 199511.sgml

    199404.sgml 199408.sgml 199412.sgml 199504.sgml 199508.sgml 199512.sgml

  • Character set: a-z, å, ä, ö, ü, é

    A-Z, Å, Ä, Ö, Ü, É

  • Special characters are converted:

    á -> a

    ç -> c

    ë -> e

    Á -> A

    Ç -> C

    Ë -> E

    etc.

  • CLEF Finnish Collections

    /space2/clef/finnish_index/finnish

  • consisting of: aamu1994_1995.sgml

  • Character set: a-z, å, ä, ö,

    A-Z, Å, Ä, Ö

  • Special characters are converted:

    á -> a

    ç -> c

    ë -> e

    Á -> A

    Ç -> C

    Ë -> E

    etc.

    TUTK database

  • TUTK database contains about 54000 Finnish newspaper articles from Aamulehti, Kauppalehti and Keskisuomalainen. The index was built by using the FINTWOL application "utwt". This database contains old material from around 1988-1992; see the licentiate thesis of Eero Sormunen.

  • Character set: a-z, }, {, |, ~

    A-Z, ], [, \, ^

    (where }=å, {=ä, |=ö, ~=ü, ]=Å, [=Ä, \=Ö, ^=Ü)

  • Special characters are converted:

    á -> a

    ç -> c

    ë -> e

    Á -> A

    Ç -> C

    Ë -> E

    etc.

  • Usage: start the search engine by typing command

    ./iqtutk

    ./iqtutk_term_count (if you want to calculate term frequencies: type g)

    ./iqtutk_noninter queryfile relevancefile

  • Use InQuery operators:

    #sum

    #syn

    #uwN (unordered window, e.g. #uw3(helsinki tampere))

    #N (ordered window, e.g. #2(tampere yliopisto))

    #and

    #or

    #band (Boolean and)

  • e.g.

    #sum(berliini #syn(arkkitehti arkkitehtuuri))

    ./iqtutk_noninter:

  • queryfile contains queries like this:

    #q001 = #sum(......);

  • Produces a file queryfile.evl, which contains the positions of the relevant documents for each query.

    ALMA9900 database

  • Character set: a-z, }, {, |, ~

    A-Z, ], [, \, ^

    (where }=å, {=ä, |=ö, ~=ü, ]=Å, [=Ä, \=Ö, ^=Ü)

  • Special characters are converted:

    á -> a

    ç -> c

    ë -> e

    Á -> A

    Ç -> C

    Ë -> E

    etc.

  • ALMA9900 database contains Finnish Aamulehti newspaper texts (Oct 99 - Dec 00); the index was built by utwt (see above).

  • Usage: start the search engine by typing command

    ./iqalma

    ./iqalma_date (if you want to calculate term frequencies: type g)

    ./iqalma_noninter queryfile relevancefile

    ./iqalma_noninter:

  • queryfile contains queries like this:

    #q001 = #sum(......);

  • Produces a file queryfile.evl, which contains the positions of the relevant documents for each query.

    ./iqalma_date

  • You can use this alternative instead of iqalma and iqalma_ter_count (it acts in the same way, with the DATE field added).

    All the dates in this database are converted to the form yyyymmdd (e.g. 19991230) instead of yyyy-mm-dd (e.g. 1999-12-30).

    For using the DATE field, see the LA Times instructions above.

    (4) Document fetch:

  • Use the operator #lit for searching documents by the unique DOCNO, e.g.

    LA Times db

    #lit(LA062694-0140)

  • Character set: a-z,

    A-Z

  • Special characters are converted:

    á -> a

    ç -> c

    ë -> e

    Á -> A

    Ç -> C

    Ë -> E

    etc.

    TUTK db

    #lit(705511)

    (4) Term info:

  • Precise term information is available for databases LaTimes and TUTK:

    LA Times

    (a) Command qbtl can be used for returning term info:

  • Usage:

    ./qbtl

  • The program asks for the btl file of the database as input (/space/clef/latimes/latimes.btl for the LA Times database), and then asks for choices concerning stemming, stopword usage, and where to print the resulting information.

  • (b) Index file "dictLaTimes" contains a list of all the different words found in the database after ENGTWOL analysis. 192811 different word forms are listed. The first 58649 words are words recognized by ENGTWOL. The rest were unrecognized by ENGTWOL and must be preceded by "@" if used as search keys, e.g. "@zzkk".

  • N.B. also separate dictionaries "recLaTimes" and "unrecLaTimes" exist

    for recognized and unrecognized word forms. By using, for example

    "egrep" command, and other Unix tools, these can be utilized:

    kastanja /space4/clarity# grep gorba recLaTimes

    gorbachev

    kastanja /space4/clarity# grep gorba unrecLaTimes

    gorbachocolate

    gorbachove

    gorbatov

  • This could lead to e.g. query (notice the form of unrecognized keys):

    #syn(gorbachev @gorbatov @gorbachove)
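Building such a synonym query from the two dictionaries can be sketched as follows. The function name syn_query is hypothetical; the word lists are assumed to have been picked by hand from the grep output of recLaTimes and unrecLaTimes.

```python
def syn_query(recognized: list[str], unrecognized: list[str]) -> str:
    """Build a #syn query; unrecognized word forms get the '@' prefix."""
    keys = list(recognized) + [f"@{word}" for word in unrecognized]
    return "#syn(" + " ".join(keys) + ")"


if __name__ == "__main__":
    # Words selected from the grep results shown above.
    print(syn_query(["gorbachev"], ["gorbatov", "gorbachove"]))
```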

  • Gram algorithm for studying La Times

  • A new, faster interactive algorithm (Longest Common Subsequence) is now available:

    ./like_noninter

  • For example:

    ./like_noninter ericsson

    ericsson

    ericson

    ericksson

    erison

    ericcson

    erickson

    erichson

    eriksson

  • The program behind this is /space/gram/like.

  • The number of words returned varies (sometimes nothing

    is retrieved).

    It is possible to control the behaviour of like with parameters.

    You can get help by typing /space/gram/like -h.

    New gram algorithm is available for searching the 3 most

    similar query strings existing in the recLaTimes and

    unrecLaTimes files, for any given key string.

  • Usage (interactive mode):

    ./gram_bitwise

    wait until key is asked (loop). This application exits with Ctrl-C

    only.

  • For example:

  • Please enter key: ericsson

    11 grams

    ericsson erickson eriksson @ericson @ericksson @ericcson

  • Usage (non-interactive mode):

    ./gram_bitwise_noninter word

  • For example:

    ./gram_bitwise_noninter ericsson

    ericsson erickson eriksson @ericson @ericksson @ericcson

  • Six words result as the output.

  • The output means that each of the about 192000 different word forms of the LaTimes database was first compared to the key "ericsson" by using a gram algorithm, and the best words were printed. The 3 "most similar" word forms recognized by ENGTWOL are ericsson, erickson, and eriksson. Additionally, the 3 best unrecognized words were @ericson, @ericksson, and @ericcson.

  • For example, query of the form

    #syn(ericsson erickson eriksson @ericson @ericksson @ericcson)

    could be used.
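The general idea of gram-based matching can be illustrated with the sketch below. It is not the algorithm behind gram_bitwise, whose details are not described in this file; it assumes padded character bigrams and the Dice coefficient, and the names bigrams, dice and most_similar are hypothetical.

```python
def bigrams(word: str) -> set[str]:
    """Return the set of character bigrams of a word padded with spaces."""
    padded = f" {word.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient of the bigram sets: 1.0 means identical sets."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def most_similar(key: str, vocabulary: list[str], n: int = 3) -> list[str]:
    """Return the n vocabulary words most similar to the key."""
    return sorted(vocabulary, key=lambda w: dice(key, w), reverse=True)[:n]


if __name__ == "__main__":
    words = ["berlin", "eriksson", "ericsson", "oil"]
    print(most_similar("ericsson", words, n=2))
```

In practice the vocabulary would be read from the recLaTimes and unrecLaTimes files, with the @ prefix added to the words coming from the latter.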

  • TUTK

  • Index key file "dictTutk" contains the words found in the database after FINTWOL analysis. 625012 different word forms are found. These comprise unrecognized word forms (preceded by the character @), words in their basic forms without splitting (preceded by the character /), and "normal" basic word forms (starting with a normal letter). The words are coded in 7-bit ASCII, and each word is followed by its total frequency and document frequency.

  • E.g.

    egrep "^olla" dictTutk

    results output

    olla 591753 51854

    meaning that basic form "olla" ("to be") has document

    frequency of 51854 in the database. Total frequency of the

    term is 591753.
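Reading one line of the dictionary file can be sketched as below. The layout is assumed from the single example above (word, total frequency, document frequency, separated by whitespace); the names TermInfo and parse_dict_line are hypothetical.

```python
from typing import NamedTuple


class TermInfo(NamedTuple):
    term: str
    total_frequency: int
    document_frequency: int


def parse_dict_line(line: str) -> TermInfo:
    """Parse a line such as 'olla 591753 51854' from dictTutk."""
    term, total, documents = line.split()
    return TermInfo(term, int(total), int(documents))


if __name__ == "__main__":
    info = parse_dict_line("olla 591753 51854")
    print(info.term, info.total_frequency, info.document_frequency)
```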

    (7) Lexical translation

    GlobalDix and motcom programs exist (motcom: Fin->Eng, Swe->Eng),

    plus a Duden dictionary table for the CLEF runs of 2000 and 2001, when the target was LaTimes.

    GlobalDix contains at least 18 languages and 300 translation routes, see separate documentation, contact ccheke@uta.fi.

    Fin->Eng

  • NB! Try motcom with -s and -b options to get tagged output:

    /data10/newmot/motcom -s -b -d /data10/newmot/ses kissa

    Old information:

    No really good translation resources are available at kastanja. The motcom command can be used for producing verbose output from Kielikone's Motcom dictionary. This can be filtered. However, the result is far from perfect, and it is impossible to derive a good result by syntactic rules, because the format of the dictionary does not support it.

  • Also script "fineng" can be used for Fin->Eng translations.

    E.g. Finnish word "kissa" can be translated to "cat"

    in the following way.

    ./fineng kissa

    cat

  • This was used at CLEF 2000 and 2001 for Fin-Eng

    translations of University of Tampere, Department

    of Information Studies.

    Ger->Eng

  • Only hand-made Duden dictionary exists (words selected

    for CLEF 2000 and 2001 topics only!) - this was used

    at CLEF 2000 and 2001 Ger->Eng translations of UTA.

    Swe->Eng

  • As in case of Fin->Eng, motcom application was used

    for Swe->Eng translations at CLEF 2000 and 2001

    experiments of UTA.

    Eng->Fin

  • Try motcom with -s and -b options to get tagged output:

    /data10/newmot/motcom -s -b -d /data10/newmot/ses cat

    Also try engfin script. NB! engfin is slow and older

    version - now that -b option exists, you should try to

    utilize it.

    (8) Morphological analysis

    TWOLs by Lingsoft are used. utwt and etwt3 are TWOL applications:

    utwt - used for building the TUTK db index;

    etwt3 - used for building the LaTimes index.

    LA Times

  • Type

    ./etwt3

    for simulating ENGTWOL index building program, e.g.

    the input word (database word form)

    oxen

    results in the following output word forms (index words

    referring to the same address):

    ox

    TUTK

  • Type

    ./utwt

    for simulating FINTWOL index building program, e.g.

    the input word (database word form)

    kuminasta

    results in the following output word forms (index words

    referring to the same address):

    nasta

    kumi

    /kumina

    /kuminasta

    kumina

    kuminasta

  • NB! Some etwt3 output words may contain odd-looking 7-bit ASCII characters. These are in essence "de facto standard" codes for Scandinavian letters. For example, a with dots, a with ring and o with dots are marked with 7-bit braces, the pipe symbol, etc., in the following way (example words are Finnish unless noted):

    { = a with dots, e.g. as in the word "j{{" (meaning "ice" in English)

    } = a with ring, e.g. "t}g" - this is a Swedish word ("train")

    | = o with dots, e.g. "|ljy" ("oil")

  • Capital letters:

    [ = A with dots, e.g. "[ht{ri" (place name)

    ] = A with ring, e.g. "]ke" (first name of a person)

    \ = O with dots, e.g. "\hman" (family name)

  • In addition to these:

    ~ = u with dots ("German y"), e.g. "Z~rich"

    ^ = U with dots ("German Y"), e.g. as in the word "^ber" when starting a sentence

  • You also need to be able to give Finnish input words containing

    scand letters to translation and/or morphological resources in

    the correct form.

  • NB 2: Some translation resources may use different 8-bit codes for denoting these Scandinavian characters. This is very important when designing translation/morphological analysis "pipelines", because the MOTCOM dictionaries, the TWOL morphological analysis program and the TUTK ASCII database may each use a different internal bit representation for Scandinavian letters.
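Converting between the 7-bit codes listed above and the actual letters can be sketched as follows. The mapping is the one given in this file; the function names decode_7bit and encode_7bit are hypothetical. Note that decoding is only safe for index words and similar data, because ordinary braces, brackets, pipes, tildes, backslashes and carets in other text would be converted as well.

```python
# 7-bit code -> Scandinavian letter, as listed in this file.
PAIRS = {
    "}": "å", "{": "ä", "|": "ö", "~": "ü",
    "]": "Å", "[": "Ä", "\\": "Ö", "^": "Ü",
}
DECODE_TABLE = str.maketrans(PAIRS)
ENCODE_TABLE = str.maketrans({letter: code for code, letter in PAIRS.items()})


def decode_7bit(text: str) -> str:
    """Replace 7-bit codes with the corresponding letters."""
    return text.translate(DECODE_TABLE)


def encode_7bit(text: str) -> str:
    """Replace Scandinavian letters with their 7-bit codes."""
    return text.translate(ENCODE_TABLE)


if __name__ == "__main__":
    print(decode_7bit("|ljy"), encode_7bit("jää"))
```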

    (9) Relevance evaluation: scripts for producing precision-

    recall -tables for given query files when the target

    collection is TUTK or LaTimes.

    LA Times

  • Relevance files exist for year 2000 CLEF topics. They can be utilized

    simply by using a script

    interactive mode:

    ./evalLaTimes

    The script asks as inputs 2 filenames: query file and relevance

    file (the correct relevance filename is suggested by the script).

    non-interactive mode:

    ./evalLaTimes_noninter topicfile relevancefile

  • For query file syntax example, see file "queryexample" and

    InQuery manuals.

  • Similar script exists for TUTK database.

    Author: ccheke@uta.fi

    (10) Full text document text extraction

  • The script "batch" (1) executes the query from a file, (2) prints the N top documents into a file, (3) filters the words of those documents into a list of (inflected) Finnish words, (4) runs this list of inflected words through the morphological program, resulting in a file with a list of words in their Finnish basic forms, and (5) translates them to English by using the motcom dictionary script.

  • Usage:

    ./batch

  • Note: you need to type dot and slash.

  • You can tune the functionalities in file "batch" and experiment

    with it.

  • (a) Query and printing fulltext documents (Finnish) to file:

    The script first asks for a query file, you can try "finquery001" or

    "finquery002", or create new query files. A bogus relevance

    file exists but it has no meaning (batch inquery seems to demand

    it). The present version of "batch" prints only the 2 top documents into a file. You can change this by changing the value of -wn in the "batch" script.

  • Output file names end with .wf, for example "finquery001.wf"

    (b) Making list of fulltext document words (inflected)

    Next the script uses sed, egrep, tr in order to filter

    the text files into lowercase string tokens (=inflected Finnish

    words). Then these are used as inputs to twol program

    version (utwt).

    (c) Making basic form of (Finnish) words

  • The resulting file is the .bfw (= basic form words) file. One input word typically produces a group of outputs; these groups are separated by full stops. For example, the genitive form of the 3-part compound word "bruttorekisteritonnin" is split by utwt into 3 basic form "parts" (tonni, rekisteri, brutto), and the basic form of the whole compound is also generated (bruttorekisteritonni). You can see this grouping in the .bfw file:

    tonni

    rekisteri

    brutto

    bruttorekisteritonni

    .

    .

    @rosebay

    alus

    .

    .

  • Note that in this case "@rosebay" (unrecognized Finnish word) and

    "alus" were both formed from the same word "Rosebay-alus". If the

    text contains "Rosebay alus", then the utwt analysis would look

    like

    .

    .

    @rosebay

    .

    .

    alus

    .

    .

    so the two dots correspond to word boundaries in the original

    text file.
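Reading the groups back from a .bfw file can be sketched as below. The layout is assumed from the examples above: one word per line, possibly indented, with lines containing only a full stop marking the group boundaries. The function name read_groups is hypothetical.

```python
def read_groups(lines: list[str]) -> list[list[str]]:
    """Split .bfw lines into groups, one group per original text word."""
    groups: list[list[str]] = []
    current: list[str] = []
    for raw in lines:
        token = raw.strip()  # .bfw lines may start with spaces
        if not token:
            continue
        if token == ".":
            # A full stop closes the current group, if there is one.
            if current:
                groups.append(current)
                current = []
        else:
            current.append(token)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    sample = ["tonni", "rekisteri", "brutto", "bruttorekisteritonni",
              ".", ".", "@rosebay", "alus", ".", "."]
    print(read_groups(sample))
```

With this grouping, "Rosebay-alus" yields one group containing both @rosebay and alus, while "Rosebay alus" yields two separate groups.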

  • If you would want to recognize basic forms of the original compound

    only (like "bruttorekisteritonni"), and not the parts of the

    compound at all, you can do it by removing the last

    egrep -v "/"

    from the "batch" file as basic forms of the original word are

    marked by a starting slash. Normally these are not utilized

    so I added the removal of slash-words as a default.

    (d) Translating words from Finnish to English:

    Next, the basic form Finnish words are printed to *.tmp2 file where

    the lines start with a letter instead of spaces (like in *.bfw)

    (dictionary program does not work otherwise).

  • This file containing Finnish words is used as an input to MOT

    dictionary (Finnish->English).

  • The dictionary translation produces a list of translated English words.

    The translations are printed to file *.eng.
