Untranslatable query keys are a persistent problem in dictionary-based cross-lingual information retrieval (CLIR). One approach to solving it is to use approximate string matching to retrieve word form variants of the source key from the target database index. We have developed a novel n-gram based string matching technique, which we call the s-gram matching technique (s-gram for skip-gram). In this technique, n-grams are classified into categories on the basis of character contiguity in words, and the categories are then used in matching. The technique has been compared with the conventional n-gram technique, which uses only adjacent characters as n-grams. Several types of words and word pairs were studied, including biological, geographical, economic, technological and other terms. The source words were in French, Spanish, Italian, German, Swedish, Finnish and English, and the target words were their spelling variants in Finnish and English, within target word lists of up to 200 000 words. More recent work has also tested Norwegian and Swedish. In all cross-lingual tests, the targeted s-gram matching technique outperformed the conventional n-gram matching technique as well as longest common subsequence and edit distance. The technique has also been highly effective in identifying monolingual word form variants. Formal definitions of the techniques have been given. The work continues with new language pairs, historical and OCR'd text, and new test set-ups.
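The matching idea described above can be illustrated with a minimal Python sketch. It is not the project's formal definition: the function names (`s_grams`, `s_gram_similarity`, `best_matches`), the restriction to digrams, the grouping of skip distances into the classes `(0,)` and `(1, 2)`, the Jaccard coefficient used per class, the averaging over classes, and the absence of word padding are all assumptions made for illustration.

```python
def s_grams(word, skip):
    """Return the set of character pairs in `word` whose members are
    separated by exactly `skip` characters (skip=0 gives ordinary digrams)."""
    step = skip + 1
    return {word[i] + word[i + step] for i in range(len(word) - step)}


def s_gram_similarity(a, b, classes=((0,), (1, 2))):
    """Compare two words class by class.

    Each class is a tuple of skip distances (a hypothetical grouping).
    Within a class, the s-grams of both words are compared with the
    Jaccard coefficient; the class scores are then averaged.
    Returns 0.0 when neither word yields any s-grams.
    """
    a, b = a.lower(), b.lower()
    scores = []
    for cls in classes:
        grams_a = set().union(*(s_grams(a, k) for k in cls))
        grams_b = set().union(*(s_grams(b, k) for k in cls))
        union = grams_a | grams_b
        if union:  # skip classes in which neither word has any s-grams
            scores.append(len(grams_a & grams_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


def best_matches(key, word_list, top=3):
    """Rank the words of a target word list by similarity to the source key."""
    ranked = sorted(word_list, key=lambda w: s_gram_similarity(key, w), reverse=True)
    return ranked[:top]
```

For example, `best_matches("calcium", ["kalsium", "zzz"])` ranks `kalsium` first, since the two words share s-grams such as `al`, `iu` and `um` while `zzz` shares none. Setting `classes=((0,),)` reduces the sketch to matching on adjacent-character digrams only.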
2002 - 2008
Mr. Heikki Keskustalo, supervised by Dr. Ari Pirkola and Prof. Kalervo Järvelin
Dr. Ari Pirkola
Prof. Kalervo Järvelin
Mrs. Anni Järvelin
Mrs. Sanna Kumpulainen
Mr. Antti Järvelin
Updated 11.03.2008 Responsibility for updating: KJ