The Query Construction part of this document specifically describes version S1 (the fixed version of 3rd April 2003) of utaclir (/space4/utaclir/utaclir_s1) and its query properties.
Heikki Keskustalo, Eija Airio, Bemmu Sepponen, 3rd April 2003
This file describes the UTA services, that is, services located on the Unix server kastanja.uta.fi (Solaris 7) at the Department of Information Studies, Tampere, Finland. The services listed in this file are as follows: Query translation; Search engines; Document fetching; Term Info; Lexical Translation; Morphological Analysis; Relevance Evaluation; Full Text document Text Extraction. For details, see the explanations below. Please contact Heikki Keskustalo, Eija Airio, or Bemmu Sepponen before using the resources in order to verify that you are using the most up-to-date resources, including SOAP resources. See also /home/iq3/soapdocs/ for SOAP information.
Basic services available at kastanja.uta.fi (/space4/clarity/ and /space4/utaclir/) for Clarity:
(1) This README file explaining the resources
(2) Query construction
ADVICE TO THE SEARCHER
User-given queries can be translated in the utaclir S1 version in the following four directions (see Advice to the system administrator below for details):
English -> Swedish
English -> Finnish
Swedish -> English
Finnish -> English
Query formulation is described briefly in the next paragraph.
A query is one line containing a list of query words separated by spaces; double quotes may be used for phrases. Always use lowercase letters. Use the symbol @ for words you do not want to translate.
Query formulation is described in more detail, together with examples, in the following paragraphs.
For example, to express a query to be translated from English to Finnish or Swedish, while indicating with the symbol @ that you do not want to translate the proper name Bush:
president @bush
For a more precise query, use phrase marking to demand that the words occur as adjacent words in the documents:
"president @bush"
Technical note regarding the @ operator: Unrecognized words are marked by a preceding @ in the database index, e.g. @indurain. However, for any individual proper name, such as reagan, it is impossible to know beforehand whether its form in the database index is reagan or @reagan without consulting the index or the morphological analyzer used for building the index. Therefore, when the user marks a query key by the symbol @, for example @xxx, the utaclir query translation system actually transforms the key into the synonym operator form #syn(xxx @xxx). This form then matches the literal form of the original string, regardless of whether the word was recognized or unrecognized by the morphological program building the database index.
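The key expansion described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the utaclir source code; the function name expand_at_key is hypothetical. The lowercasing follows the remark elsewhere in this file that capital letters in @-marked words are finally transformed into lowercase.

```python
def expand_at_key(key: str) -> str:
    """Expand a do-not-translate key such as '@bush' into '#syn(bush @bush)'.

    Hypothetical helper: it only mirrors the transformation described in
    this README and is not part of utaclir itself.
    """
    if not key.startswith("@") or len(key) < 2:
        raise ValueError("expected a key of the form @word")
    word = key[1:].lower()  # capital letters are transformed into lowercase
    # Match both the recognized index form (word) and the unrecognized one (@word).
    return f"#syn({word} @{word})"


print(expand_at_key("@Bush"))  # prints: #syn(bush @bush)
```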
Query types:
List of words: Simplest query is just a list of words.
Example:
opinion view president bush dna gene
Compounds: When Swedish and Finnish are used as source languages, compound words (words written together as one unit) can be used as source keys.
Example:
geeniteknologian vaikutus ihmisen ravintoketjuun
In the Finnish query above the 1st and 4th words are compound words, each containing two component words written together. If the compounds are untranslatable as whole words, utaclir tries to split them into component words, then translates the component forms, and forms a proper query structure on the basis of these translation results.
Proper names are mostly missing from the translation dictionary: It is important to note that the translation dictionary utilized by utaclir does not contain proper names. Even the common ones are generally missing.
Synonyms: The user is encouraged to try listing synonyms into her/his queries in order to describe query topics! Also taking part-of-speech into consideration in queries should be considered, that is, the user could consider writing
economy economics economical finance trend direction
thereby utilizing both synonyms and part-of-speech, instead of only using one term and part-of-speech per query facet, like
economy trend
Words that the user wants to keep in the target language translation in the same form they are in the source language (often acronyms and proper names): the user can mark words by symbol @
@tampere @dna research
Explanation:
We cannot know whether the word "tampere" is in the database index in the form "tampere" or "@tampere". The answer depends on what words the TWOL lexicon happens to include for the language in question. Therefore, if the user advises utaclir by using the key @tampere, utaclir automatically forms the synonym key #syn(tampere @tampere), which matches either form of the word. In practice, it is possible that, for example, the English TWOL does not contain the word Tampere, but the Finnish TWOL does. The English TWOL is used for building the index of the English database. The Finnish TWOL is used for building the Finnish database. As words recognized by TWOL are indexed in their basic forms, and unrecognized strings are indexed as such, marked by @, this means that in the English database we might have the index entry @tampere but not the index entry tampere. And if Tampere was recognized by the Finnish TWOL, in the Finnish database we might have the index entry tampere, but not @tampere.
Thus in the query above the user advises utaclir not to translate the words tampere and dna. (If the user uses capital letters in a word marked by @, they are finally transformed into lowercase letters.)
Example combining phrase quotations and the do-not-translate symbol @:
opinion "president @bush" @dna
Phrases: As seen above, the user can mark phrases by using double quotes.
At most 3 phrases may be used per query. For example,
"seat belt"
or
"president @bush"
(or, the latter one, in an equivalent manner, "president @Bush").
Above, the phrase is marked by the searcher by using double quotes.
In query translation, the system demands that the translations of president and the translations of @bush (for the latter word, a synonym operation containing both strings "bush" and "@bush") occur as adjacent words in the documents. The utaclir OUTPUT fed into InQuery would then be:
#sum(#uw2(#syn(presidentti pääjohtaja) #syn(bush @bush)) )
Note: The user should just use simple queries containing lists of words, with possibly phrases (and @-words). The structuring presented above is always performed by utaclir S1 query translation program.
Each phrase may contain from 1 up to 4 spaces.
Phrases longer than 4 spaces probably do not make sense. Probably they are actually sentences containing quotes. For them #syn operator is used instead of uwN-operator in S1 version.
The following correspondence exists between the number of spaces in the phrase and the value of N in the uwN operator:
Phrase examples:
1 space: "presidentti @clinton" -> #uw2(#syn(president) #syn(clinton @clinton))
2 spaces: "presidentti @bill @clinton" -> #uw4(#syn(president) #syn(bill @bill) #syn(clinton @clinton))
3 spaces: "prime minister @john @smith" -> #uw6(...)
4 spaces: four-space phrases, that is, five-word phrases, probably do not actually make sense. Yet for consistency, this case is handled by trying to fit the translation #syn lists into a #uw8 size window.
Very long phrases probably do not occur in user queries.
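The examples above suggest that the window size N is twice the number of spaces in the phrase (uw2, uw4, uw6, uw8). The following Python sketch encodes that observed pattern; the function name uw_window is hypothetical, and the rule is inferred from the examples in this file rather than taken from the utaclir code.

```python
from typing import Optional


def uw_window(phrase: str) -> Optional[int]:
    """Return N for the #uwN operator, or None if no window applies.

    Inferred rule: N = 2 * (number of spaces) for phrases with 1 to 4
    spaces. For longer phrases this README says #syn is used instead of
    a uwN operator, so None is returned.
    """
    spaces = len(phrase.split()) - 1  # words are separated by single spaces
    if 1 <= spaces <= 4:
        return 2 * spaces
    return None


print(uw_window("presidentti @clinton"))         # prints: 2
print(uw_window("prime minister @john @smith"))  # prints: 6
```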
Boolean operators: User could also use Boolean operators:
#and(#or(opinion view) #or(@clinton @bush) @dna)
Above is a basic structure for expressing conjunctive normal form. How well this succeeds in practice is as yet unexplored. It is also possible for the user to formulate queries that result in an invalid translated query structure.
If Boolean operators are used, then the expressed structure is propagated into the target query as such. Also, each source key is replaced by a synonym set of the translations with respect to the TWOL-analyzed basic form (or forms) derived from the source key.
Advice to the system administrator.
Utaclir prototype version utaclir_s1 contains query translation service for the following four language pairs: English - Swedish, English - Finnish, Swedish - English, Finnish - English. Source language codes are s_eng, s_fin, and s_swe. Target language codes are t_eng, t_fin, t_swe.
Utaclir input is a 3 line triplet, where first line contains the source language code, second line contains the target language code, and the third line contains the source language query.
For example, translating English query to Finnish can be expressed as:
s_eng
t_fin
opinion "president @bush" @dna
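A small Python sketch for assembling the 3-line input is given below. It assumes nothing about how the triplet is delivered to utaclir (that is not described here); the names VALID_PAIRS and make_triplet are hypothetical. The checks only reflect rules stated in this file: the four supported language pairs and the limit of at most 3 quoted phrases per query.

```python
# Language pairs supported by utaclir_s1 according to this README.
VALID_PAIRS = {
    ("s_eng", "t_swe"),
    ("s_eng", "t_fin"),
    ("s_swe", "t_eng"),
    ("s_fin", "t_eng"),
}


def make_triplet(source: str, target: str, query: str) -> str:
    """Build the 3-line utaclir input: source code, target code, query."""
    if (source, target) not in VALID_PAIRS:
        raise ValueError(f"unsupported language pair: {source} -> {target}")
    if "\n" in query:
        raise ValueError("the query must be a single line")
    quotes = query.count('"')
    if quotes % 2 != 0 or quotes > 6:
        raise ValueError("use balanced double quotes and at most 3 phrases")
    return f"{source}\n{target}\n{query}\n"


print(make_triplet("s_eng", "t_fin", 'opinion "president @bush" @dna'), end="")
```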
(3) Search engines:
Both interactive and batch engines exist for databases TUTK (Finnish newstext) and LaTimes (CLEF 2000, 2001, English newstext). Search engine is InQuery (v 3.1):
LA Times database
./iqlat
./iqlat_date (if you want to calculate term frequencies: type g)
./iqlat_noninter queryfile relevancefile
#sum
#syn
#uwN (unordered window, e.g. #uw3(information retrieval))
#N (ordered window, e.g. #2(information retrieval))
#and
#or
#band (Boolean and)
#sum(berlin #syn(architecture building))
./iqlat_date
#field(DATE ndx_operator date)
#field(DATE ndx_min_max_op)
#field(DATE ndx_range_op date1 date2)
#gt #>
#gte #>=
#lt #<
#lte #<=
#ne #neq #!=
#eq #== This is the default operator and may be
omitted from the expression if desired.
#fmin Minimum entry for a field index.
#fmax Maximum entry for a field index.
#range or #<>
#field(DATE #> 19940709)
#field(DATE #range 19940709 19940303)
only logical alternative. Otherwise you will get documents which fulfill either your
date specification or your other search criteria.
#band(oil #field(DATE #range 19940101 19940131))
#sum(oil #field(DATE #range 19940101 19940131))
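As an illustration, a date-restricted query string of the #band form shown above can be assembled as follows. This is a sketch that assumes the #band variant is the recommended one, so that both the search terms and the date range must match; the function name date_restricted is hypothetical.

```python
def date_restricted(terms: str, first: str, last: str) -> str:
    """Combine search terms with a DATE range by using #band (Boolean and)."""
    for date in (first, last):
        # Dates in the index are written as yyyymmdd, e.g. 19940131.
        if len(date) != 8 or not date.isdigit():
            raise ValueError(f"expected a yyyymmdd date, got {date!r}")
    return f"#band({terms} #field(DATE #range {first} {last}))"


print(date_restricted("oil", "19940101", "19940131"))
# prints: #band(oil #field(DATE #range 19940101 19940131))
```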
/iqlat_noninter:
#q001 = #(sum......);
DATABASES
/* 2nd Dec 2002:
*/
CLEF Swedish Collections
/space2/clef/swedish_index/swedish
199401.sgml 199405.sgml 199409.sgml 199501.sgml 199505.sgml 199509.sgml
199402.sgml 199406.sgml 199410.sgml 199502.sgml 199506.sgml 199510.sgml
199403.sgml 199407.sgml 199411.sgml 199503.sgml 199507.sgml 199511.sgml
199404.sgml 199408.sgml 199412.sgml 199504.sgml 199508.sgml 199512.sgml
A-Z, Å, Ä, Ö, Ü, É
á -> a
ç -> c
ë -> e
Á -> A
Ç -> C
Ë -> E
etc.
/space2/clef/finnish_index/finnish
A-Z, Å, Ä, Ö
á -> a
ç -> c
ë -> e
Á -> A
Ç -> C
Ë -> E
etc.
TUTK database
Aamulehti, Kauppalehti and Keskisuomalainen. The index was
built by using FINTWOL application "utwt". This database contains
old material around 1988-1992, see licentiate thesis of Eero
Sormunen.
A-Z, ], [, \, ^
(where }=å, {=ä, |=ö, ~=ü, ]=Å, [=Ä, \=Ö, ^=Ü)
á -> a
ç -> c
ë -> e
Á -> A
Ç -> C
Ë -> E
etc.
./iqtutk
./iqtutk_term_count (if you want to calculate term frequencies: type g)
./iqtutk_noninter queryfile relevancefile
#sum
#syn
#uwN (unordered window, e.g. #uw3(helsinki tampere))
#N (ordered window, e.g. #2(tampere yliopisto))
#and
#or
#band (Boolean and)
#sum(berliini #syn(arkkitehti arkkitehtuuri))
/iqtutk_noninter:
#q001 = #(sum......);
ALMA9900 database
A-Z, ], [, \, ^
(where }=å, {=ä, |=ö, ~=ü, ]=Å, [=Ä, \=Ö, ^=Ü)
á -> a
ç -> c
ë -> e
Á -> A
Ç -> C
Ë -> E
etc.
./iqalma
./iqalma_date (if you want to calculate term frequencies: type g)
./iqalma_noninter queryfile relevancefile
(4) Document fetch:
e.g.
/iqalma_noninter:
#q001 = #(sum......);
./iqalma_date
All the dates in this database are converted to form yyyymmdd (ex. 19991230) instead of yyyy-mm-dd (ex. 1999-12-30).
For using DATE field, look at the LA Times instructions above.
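If your dates are in the form yyyy-mm-dd, they can be converted to the yyyymmdd form used in the index with a few lines of Python. This is a minimal sketch; the function name to_index_date is hypothetical.

```python
from datetime import datetime


def to_index_date(iso_date: str) -> str:
    """Convert 'yyyy-mm-dd' to 'yyyymmdd'; invalid dates raise ValueError."""
    return datetime.strptime(iso_date, "%Y-%m-%d").strftime("%Y%m%d")


print(to_index_date("1999-12-30"))  # prints: 19991230
```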
LA Times db
#lit(LA062694-0140)
A-Z
á -> a
ç -> c
ë -> e
Á -> A
Ç -> C
Ë -> E
etc.
TUTK db
#lit(705511)
(4) Term info:
LA Times
(a) Command qbtl can be used for returning term info:
./qbtl
(/space/clef/latimes/latimes.btl for LA Times database)
and replies to choices concerning stemming, stopword
usage, and where to print the resulting information.
found from the database after ENGTWOL analysis. 192811 different
word forms are listed. The first 58649 words are words recognized by
ENGTWOL. The rest were unrecognized by ENGTWOL and must be preceded
by "@" if used as search keys, e.g. "@zzkk"
for recognized and unrecognized word forms. By using, for example
"egrep" command, and other Unix tools, these can be utilized:
kastanja /space4/clarity# grep gorba recLaTimes
gorbachev
kastanja /space4/clarity# grep gorba unrecLaTimes
gorbachocolate
gorbachove
gorbatov
#syn(gorbachev @gorbatov @gorbachove)
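Building such a synonym key from hand-picked grep results can be sketched in Python as follows. The function name syn_key is hypothetical; the only rule it encodes is the one stated above, namely that word forms unrecognized by ENGTWOL must be preceded by @ when used as search keys.

```python
from typing import Iterable


def syn_key(recognized: Iterable[str], unrecognized: Iterable[str]) -> str:
    """Build a #syn(...) key from chosen recognized and unrecognized forms."""
    keys = list(recognized) + ["@" + word for word in unrecognized]
    if not keys:
        raise ValueError("at least one word form is needed")
    return "#syn(" + " ".join(keys) + ")"


# The searcher picks the relevant variants by hand (gorbachocolate is left out).
print(syn_key(["gorbachev"], ["gorbatov", "gorbachove"]))
# prints: #syn(gorbachev @gorbatov @gorbachove)
```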
Common Subsequence) is now available:
./like_noninter
./like_noninter ericsson
ericsson
ericson
ericksson
erison
ericcson
erickson
erichson
eriksson
is retrieved).
It is possible to control the behaviour of like with parameters.
You can get help by typing /space/gram/like -h.
New gram algorithm is available for searching the 3 most
similar query strings existing in the recLaTimes and
unrecLaTimes files, for any given key string.
./gram_bitwise
wait until key is asked (loop). This application exits with Ctrl-C
only.
11 grams
ericsson erickson eriksson @ericson @ericksson @ericcson
gram_bitwise_noninter word
./gram_bitwise_noninter ericsson
ericsson erickson eriksson @ericson @ericksson @ericcson
of the LaTimes database were first compared to key "ericsson"
by using a gram algorithm, and the best words were printed.
3 of the "most similar" ENGTWOL recognized word forms
compared to key "ericsson", are words ericsson, erickson,
and eriksson. Additionally, 3 of the best unrecognized
words were @ericson, @ericksson, and @ericcson.
#syn(ericsson erickson eriksson @ericson @ericksson @ericcson)
could be used.
after ENGTWOL analysis. 625012 different word forms are
found - these contain unrecognized word forms (preceded by
the character @), words in their basic forms without
splitting (preceded by the character /), and "normal" basic
word forms (starting with a normal letter). The words are
coded in 7-bit ASCII, and each word is followed by its total and
document frequency.
egrep "^olla" dictTutk
results in the output
olla 591753 51854
meaning that basic form "olla" ("to be") has document
frequency of 51854 in the database. Total frequency of the
term is 591753.
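A line of this output can be split into its three fields in Python as sketched below. The names TermInfo and parse_dict_line are hypothetical; the field order (word, total frequency, document frequency) follows the example above.

```python
from typing import NamedTuple


class TermInfo(NamedTuple):
    term: str
    total_freq: int  # total number of occurrences in the database
    doc_freq: int    # number of documents containing the term


def parse_dict_line(line: str) -> TermInfo:
    """Parse one dictTutk line of the form 'word total_freq doc_freq'."""
    term, total, doc = line.split()
    return TermInfo(term, int(total), int(doc))


info = parse_dict_line("olla 591753 51854")
print(info.term, info.total_freq, info.doc_freq)  # prints: olla 591753 51854
```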
(7) Lexical translation
GlobalDix and motcom programs exist (motcom: Fin->Eng, Swe->Eng),
plus Duden dictionary table for Clef runs 2000 and 2001 when target was LaTimes.
GlobalDix contains at least 18 languages and 300 translation routes, see separate documentation, contact ccheke@uta.fi.
Fin->Eng
/data10/newmot/motcom -s -b -d /data10/newmot/ses kissa
Old information:
No really good translation resources are available
at kastanja. The motcom command can be used for producing
verbose Kielikone Motcom dictionary output. This can be
filtered. However, the result is far from perfect,
and it is impossible to derive a good result by syntactic
rules - the format of the dictionary does not support it.
E.g. Finnish word "kissa" can be translated to "cat"
in the following way.
./fineng kissa
cat
translations of University of Tampere, Department
of Information Studies.
Ger->Eng
for CLEF 2000 and 2001 topics only!) - this was used
at CLEF 2000 and 2001 Ger->Eng translations of UTA.
Swe->Eng
for Swe->Eng translations at CLEF 2000 and 2001
experiments of UTA.
Eng->Fin
/data10/newmot/motcom -s -b -d /data10/newmot/ses cat
Also try the engfin script. NB! engfin is a slow and older
version - now that the -b option of motcom exists, you should
try to use that option instead.
(8) Morphological analysis
TWOLs by Lingsoft are used; utwt and etwt3 are TWOL applications
(utwt - used at TUTK db index building;
etwt3 - used at LaTimes index building)
LA Times
./etwt3
for simulating ENGTWOL index building program, e.g.
the input word (database word form)
oxen
results in the following output word forms (index words
referring to the same address):
ox
TUTK
./utwt
for simulating FINTWOL index building program, e.g.
the input word (database word form)
kuminasta
results in the following output word forms (index words
referring to the same address):
nasta
kumi
/kumina
/kuminasta
kumina
kuminasta
characters. These are essentially "de facto standard" codes for
Scandinavian letters. For example, a with dots, a with ring and o with
dots are marked with 7-bit braces, brackets, the pipe symbol
etc., in the following way (mostly Finnish example words):
{ = a with dots, e.g. as in the word "j{{" (meaning "ice" in English)
} = a with ring, e.g. "t}g" - this is a Swedish word ("train")
| = o with dots, e.g. "|ljy" ("oil")
[ = A with dots, e.g. "[ht{ri" (place name)
] = A with ring, e.g. "]ke" (first name of a person)
\ = O with dots, e.g. "\hman" (family name)
~ = German u with dots ("Z~rich")
^ = German U with dots (e.g. as in the word "^ber" when starting a sentence)
scand letters to translation and/or morphological resources in
the correct form.
may be used for denoting these scandinavian characters. This
is very important while designing translation/morphological analysis
"pipelines" as different types of character codes for scandinavian
letters may be used (e.g. MOTCOM dictionaries, TWOL morphological
analysis program and TUTK ascii database may each use different
internal bit representations for scandinavian letters).
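When building such pipelines, the 7-bit codes listed above can be converted to and from the actual letters with a simple character table. The following Python sketch is an assumption about how one might do it, not one of the tools at kastanja; the names decode_7bit and encode_7bit are hypothetical. Note that decoding is only safe for text in which brackets, braces and the other code characters are not used literally.

```python
# 7-bit code -> Scandinavian letter, following the code list in this README.
SEVEN_BIT_CODES = {
    "{": "ä", "}": "å", "|": "ö", "~": "ü",
    "[": "Ä", "]": "Å", "\\": "Ö", "^": "Ü",
}
_DECODE = str.maketrans(SEVEN_BIT_CODES)
_ENCODE = str.maketrans({letter: code for code, letter in SEVEN_BIT_CODES.items()})


def decode_7bit(text: str) -> str:
    """Replace the 7-bit codes by the corresponding letters."""
    return text.translate(_DECODE)


def encode_7bit(text: str) -> str:
    """Replace Scandinavian letters by their 7-bit codes."""
    return text.translate(_ENCODE)


print(decode_7bit("|ljy"))    # prints: öljy
print(encode_7bit("Ähtäri"))  # prints: [ht{ri
```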
(9) Relevance evaluation: scripts for producing precision-recall
tables for given query files when the target
collection is TUTK or LaTimes.
LA Times
simply by using a script
interactive mode:
./evalLaTimes
The script asks for 2 filenames as input: the query file and the relevance
file (the correct relevance filename is suggested by the script).
non-interactive mode:
./evalLaTimes_noninter topicfile relevancefile
InQuery manuals.
Author: ccheke@uta.fi
(10) Full text document text extraction
documents into a file, (3) filters the words of those documents into
a list of (inflected) Finnish words, (4) runs this list of inflected
words through a morphological program, thus producing a file
with a list of words in Finnish basic forms, and (5) translates them to
English by using the motcom dictionary script.
./batch
with it.
The script first asks for a query file, you can try "finquery001" or
"finquery002", or create new query files. A bogus relevance
file exists but it has no meaning (batch inquery seems to demand
it). The present version of "batch" prints only the 2 top documents
into a file. You can change this by changing the value of -wn in the "batch"
script.
(b) Making list of fulltext document words (inflected)
Next the script uses sed, egrep, tr in order to filter
the text files into lowercase string tokens (=inflected Finnish
words). Then these are used as inputs to twol program
version (utwt).
(c) Making basic form of (Finnish) words
One input word typically produces a group of outputs;
these groups are separated by full stops.
For example, the genitive form of the 3-part compound word
"bruttorekisteritonnin" is split by utwt into
3 basic form "parts" (tonni, rekisteri, brutto),
and the basic form of the whole compound is also generated
(bruttorekisteritonni). You can see this grouping in the .bfw file:
tonni
rekisteri
brutto
bruttorekisteritonni
.
.
@rosebay
alus
.
.
"alus" were both formed from the same word "Rosebay-alus". If the
text contains "Rosebay alus", then the utwt analysis would look
like
.
.
@rosebay
.
.
alus
.
.
so the two dots correspond to word boundaries in the original
text file.
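The grouping can be read back from a .bfw-style listing with a short Python sketch like the one below. It assumes only what is shown in the examples above: one word form per line, with lines consisting of a single dot marking the boundaries between original words. The function name parse_bfw is hypothetical.

```python
from typing import Iterable, List


def parse_bfw(lines: Iterable[str]) -> List[List[str]]:
    """Group basic forms by the original text word they were derived from."""
    groups: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        token = raw.strip()  # lines in the .bfw file may start with spaces
        if token == ".":
            if current:      # a dot closes the group collected so far
                groups.append(current)
                current = []
        elif token:
            current.append(token)
    if current:
        groups.append(current)
    return groups


print(parse_bfw(["@rosebay", "alus", ".", "."]))            # prints: [['@rosebay', 'alus']]
print(parse_bfw([".", "@rosebay", ".", ".", "alus", "."]))  # prints: [['@rosebay'], ['alus']]
```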
only (like "bruttorekisteritonni"), and not the parts of the
compound at all, you can do it by removing the last
egrep -v "/"
from the "batch" file as basic forms of the original word are
marked by a starting slash. Normally these are not utilized
so I added the removal of slash-words as a default.
(d) Translating words from Finnish to English:
Next, the basic form Finnish words are printed to *.tmp2 file where
the lines start with a letter instead of spaces (like in *.bfw)
(dictionary program does not work otherwise).
dictionary (Finnish->English).
The translations are printed to file *.eng.