This research studies mono-, bi- and multilingual retrieval: morphological aspects of information retrieval (IR) and multilingual IR (MLIR), as well as approaches to improve the ranking of document streams. The focus has been on the impact of decompounding on monolingual and bilingual retrieval, the benefits of bilingual IR for users, and bilingual IR in an inflected index.
The first study compared a combined index with separate indexes. In MLIR, when the target collections are indexed separately, the result sets must be merged. There is a lot of research on different merging strategies, but there have been no breakthroughs. We examined the impact of stemming compared with inflected retrieval in both a merged index and separate indexes. The best result was achieved when retrieval was performed in separate indexes and the result lists were merged. Stemming improved the results in both settings (separate indexes and a merged index).
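As an illustration of what merging result lists from separate indexes involves, the following minimal sketch rescales the scores of each list to a common range and then sorts the pooled hits. Min-max normalization is only one common baseline and is an assumption here, not necessarily the strategy used in the study; the function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, retrieval score)


def min_max_normalize(hits: List[Hit]) -> List[Hit]:
    """Rescale the scores of one result list into the range [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal, so every hit gets the top score.
        return [(doc, 1.0) for doc, _ in hits]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in hits]


def merge_result_lists(lists: Dict[str, List[Hit]], k: int = 10) -> List[Hit]:
    """Pool the normalized hits of all indexes and keep the k best."""
    merged: List[Hit] = []
    for hits in lists.values():
        merged.extend(min_max_normalize(hits))
    # The sort is stable, so ties keep the order in which lists were added.
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:k]


# Hypothetical result lists from two language-specific indexes.
results = {
    "fi": [("fi1", 12.0), ("fi2", 6.0), ("fi3", 0.0)],
    "sv": [("sv1", 0.9), ("sv2", 0.3)],
}
top = merge_result_lists(results, k=3)
```

Raw scores from different indexes are not directly comparable, which is why some form of normalization (or a rank-based scheme) is usually needed before the lists can be interleaved.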
The next study was concerned with the impact of decompounding on mono- and bilingual IR. The target languages were English, Finnish, German and Swedish, and the source language was English. According to our research, decompounding in the indexing phase is vital for bilingual retrieval when the source language is a phrase language and the target language is a compound-oriented language. In monolingual retrieval, the impact of decompounding is not notable.
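Decompounding at indexing time can be sketched as a dictionary-based split of each compound into known component words. The example below is a simplified illustration under stated assumptions: the toy lexicon, the function names and the choice to index the full compound together with its parts are hypothetical, and linking elements between components are ignored.

```python
from typing import List, Optional, Set


def decompound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Split a word into lexicon entries, or return None if no full split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, keeping both parts >= min_len.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = decompound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


def index_terms(word: str, lexicon: Set[str]) -> List[str]:
    """Return the terms to index: the word itself plus its components, if any."""
    parts = decompound(word, lexicon)
    if parts is None or len(parts) == 1:
        return [word.lower()]
    return [word.lower()] + parts


# Toy German lexicon (hypothetical); a real system would use a full dictionary.
lexicon = {"wind", "energie", "kraft", "werk"}
terms = index_terms("Windenergie", lexicon)
```

With the components in the index, a translated query word such as the target-language equivalent of "wind" or "energy" can match a document that only contains the compound.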
The aim of our user study was to find out whether query translation is beneficial in Web retrieval. The language pairs were Finnish-Swedish, English-German and Finnish-French. For each language pair, 12-18 participants were recruited, and each participant performed four retrieval tasks. To compare the performance of translated queries with that of target language queries, we asked participants to formulate both a source language query and a target language query for each task. The source language queries were translated into the target language with a dictionary-based system. For English-German, machine translation was used as well. We used Google as the search engine. The results differed depending on the language pair. We concluded that dictionary coverage had an effect on the results. On average, the results of query translation were better than in traditional laboratory tests.
Our last study deals with bilingual retrieval in an inflected index, especially the performance of the FCG approach. FCG (Frequent Case Generation) is a method for retrieval in an inflected index. The idea is to generate the most frequent inflected forms of a word from its base form. FCG has been tested in monolingual retrieval and has proved to be a good method for inflected retrieval, especially for highly inflected languages. The language pairs in this test were English-Finnish, English-Swedish, Swedish-Finnish and Finnish-Swedish. Various query alternatives were run against the inflected index: the lemmatized queries, the n-gram queries and the FCG queries. The lemmatized index with the lemmatized queries was the baseline. The results varied according to the language pair. FCG performed best when Finnish was the target language, while the results were quite poor with Swedish as the target language. The main reason was the way the FCG software works: the Swedish software is based on a dictionary, and many topic words were not included in it, which caused FCG to fail.
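The FCG idea described above can be illustrated with a toy example: each base form in the query is expanded into a few frequent inflected forms, and all of them are looked up in the inflected index. This is only a sketch under strong assumptions. The suffix list is hand-picked for a Finnish noun whose stem does not change, and the function names and the index are hypothetical; real word form generation has to handle stem alternation and vowel harmony and is not shown here.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical suffixes for a few frequent Finnish cases of a noun such as
# "talo" (nominative, genitive, partitive, inessive, elative).
FREQUENT_SUFFIXES: Tuple[str, ...] = ("", "n", "a", "ssa", "sta")


def generate_frequent_forms(
    lemma: str, suffixes: Iterable[str] = FREQUENT_SUFFIXES
) -> List[str]:
    """Generate the assumed most frequent inflected forms of a base form."""
    return [lemma + suffix for suffix in suffixes]


def search_inflected_index(
    lemmas: Iterable[str], index: Dict[str, Set[str]]
) -> Set[str]:
    """Return documents matching any generated form of any query lemma."""
    hits: Set[str] = set()
    for lemma in lemmas:
        for form in generate_frequent_forms(lemma):
            hits |= index.get(form, set())
    return hits


# Toy inflected index: surface form -> ids of documents containing it.
index = {"talossa": {"d1"}, "talon": {"d2"}, "taloille": {"d3"}}
found = search_inflected_index(["talo"], index)
```

In this toy run the document containing only the less frequent form "taloille" is not retrieved, which reflects the trade-off of the approach: covering the most frequent forms keeps queries short at the cost of missing rare ones.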
2006 - 2009
Mrs. Eija Airio - supervised by Prof. Kal Järvelin and Prof. Jaana Kekäläinen
Updated 13.3.2008 Responsibility for updating: EA