Department of Information Studies and Interactive Media, University of Tampere, SIS Research Centre

Project: sGRAM – Approximate String Matching for Out-Of-Vocabulary Words in CLIR Applications

Description

Untranslatable query keys are a persistent problem in dictionary-based cross-lingual information retrieval (CLIR). One way to handle them is to use approximate string matching to retrieve word form variants of the source key from the target database index. We have developed a novel n-gram based string matching technique, which we call the s-gram matching technique (s-gram for skip-gram). In this technique, n-grams are classified into categories on the basis of character contiguity in words, and the categories are then used in matching. The technique has been compared with the conventional n-gram technique, which forms n-grams from adjacent characters only. Several types of words and word pairs were studied, including biological, geographical, economic, technological and other terms. The source language words were in French, Spanish, Italian, German, Swedish, Finnish and English, and the target words were their spelling variants in Finnish and English, located within target word lists of up to 200 000 words. More recent work has also tested Norwegian and Swedish. In all cross-lingual tests, the targeted s-gram matching technique outperformed the conventional n-gram matching technique as well as longest common subsequence and edit distance. The technique has also been highly effective in identifying monolingual word form variants. Formal definitions of the techniques have been given. The work continues with new language pairs, historical and OCR'd text, and new test set-ups.
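The idea of grouping grams by character contiguity can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the choice of digrams, the gram classes {0} and {1, 2} (number of skipped characters), and the Jaccard-style similarity are all assumptions made for illustration. Only grams from the same class are compared with each other.

```python
def skip_digrams(word, skip):
    """Digrams whose two characters have `skip` characters between them.

    skip=0 gives the conventional digrams of adjacent characters.
    """
    step = skip + 1
    return {word[i] + word[i + step] for i in range(len(word) - step)}


def gram_classes(word, classes):
    """One gram set per class; a class is a tuple of skip lengths."""
    return [set().union(*(skip_digrams(word, k) for k in cls)) for cls in classes]


def sgram_similarity(a, b, classes=((0,), (1, 2))):
    """Shared grams divided by all distinct grams, counted class by class.

    The default classes and this formula are assumptions for the sketch.
    """
    shared = total = 0
    for grams_a, grams_b in zip(gram_classes(a, classes), gram_classes(b, classes)):
        shared += len(grams_a & grams_b)
        total += len(grams_a | grams_b)
    return shared / total if total else 0.0


def best_matches(key, index_words, top=3):
    """Rank target index words by similarity to the source key."""
    ranked = sorted(index_words, key=lambda w: sgram_similarity(key, w), reverse=True)
    return ranked[:top]
```

A call such as `best_matches("pharmacologie", ["philosophy", "farmakologia", "pharmacology"], top=1)` ranks the candidate spelling variants of a source key; passing `classes=((0,),)` reduces the sketch to conventional adjacent-character digram matching, which makes the two approaches easy to compare on the same word list.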

Duration

2002 - 2008

Researchers

Mr. Heikki Keskustalo (supervised by Dr. Ari Pirkola and Prof. Kalervo Järvelin)
Dr. Ari Pirkola
Prof. Kalervo Järvelin

Mrs. Anni Järvelin
Mrs. Sanna Kumpulainen
Mr. Antti Järvelin

Publications

  1. Keskustalo, H. & Pirkola, A. & Visala, K. & Leppänen, E. & Järvelin, K. (2003). Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (Eds.). Proceedings of the 10th International Symposium, SPIRE 2003. Manaus, Brazil, October 2003. Berlin: Springer, Lecture Notes in Computer Science 2857, pp. 252-265. ISSN 0302-9743, ISBN 3-540-20177-7.
  2. Pirkola, A. & Keskustalo, H. & Leppänen, E. & Känsälä, A-P. & Järvelin, K. (2002). Targeted S-Gram Matching: A Novel N-Gram Matching Technique for Cross- and Monolingual Word Form Variants. Information Research, 7(2). [Available at http://informationr.net/ir/7-2/paper126.html ]
  3. Järvelin, A. & Järvelin, A. & Järvelin, K. (2007). s-grams: Defining Generalized n-grams for Information Retrieval. Information Processing & Management 43(4): 1005-1019.
  4. Järvelin, A., Kumpulainen, S., Pirkola, A. & Sormunen, E. (2006). Dictionary-independent translation in CLIR between closely related languages. In: de Jong, F.M.G. and Kraaij, W. (Eds.): 6th Dutch-Belgian Information Retrieval Workshop (DIR 2006). Neslia Paniculata: Enschede. [Available at http://hmi.ewi.utwente.nl/dir2006/abstracts/jarvelin_paper.pdf ]


Updated 11.03.2008. Responsibility for updating: KJ


TRIM Research Centre, Pinni A, 5th floor, 33014 University of Tampere, tel. 03 3551 6034
Maintenance: kkoivu@uta.fi
Modified: 22.6.2009 15.01