Recall and Precision Effects of Anaphor and Ellipsis Resolution in Proximity Searching in a Text Database
Ari Pirkola and Kalervo Järvelin
Department of Information Studies
University of Tampere
P.O.Box 607
FIN-33101 TAMPERE, Finland
Pirkola, A. & Järvelin, K. (1996). Recall and precision effects of anaphor and ellipsis resolution in proximity searching in a text database. In Ingwersen, P. & Pors, N.O. (Eds.) Proceedings CoLIS, 2nd International Conference on Conceptions of Library and Information Science: Integration in Perspective, Copenhagen, Oct. 13-16, 1996. Copenhagen: The Royal School of Librarianship, pp. 459-475.
Abstract
Effects of ellipsis and anaphor resolution on proximity searching in a text database are analyzed. Anaphora and ellipses are classified into proper names and common nouns of basic words, compound words, and phrases. Twenty-eight queries, for which document relevance data were available, were run in a newspaper database of 55,000 articles. Resolution was most relevant for person names (both anaphora and ellipses) and other proper name phrases (ellipses) and only marginal in other keyword categories. Recall improvement due to resolution decreased when query exhaustivity grew and was greater in sentence than in paragraph searches. Resolution also improved precision in the same keyword categories. The data also suggest that precision improvement decreases as query exhaustivity grows.
Conclusions of the report
In conclusion, the present study supports the findings of the previous study on the relevant key categories (proper name phrases) for ellipsis and anaphor resolution and on the order of resolution effects on recall. It also elaborates on the previous findings with new results indicating that:
- recall gains due to resolution fall when query exhaustivity grows.
- both recall and precision performance improve through the resolution of proper name phrase ellipses and anaphora, as argued in the previous study.
- recall and precision improvement depend on query exhaustivity so that resolution gains diminish rapidly as exhaustivity grows.
- recall and precision improvements due to resolution are greater in narrow (briefsearch) queries at low exhaustivity levels, but at high exhaustivity levels only broad (block strategy) queries benefit from resolution.
- for person name ellipses, resolution yields equally many new relevant documents in the sentence and the paragraph proximity contexts, but the relative effect is greater in the sentence context.
- resolution of person name references is a centrality filter rather than a topic filter.
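The recall and precision gains discussed in the list above can be illustrated with a minimal Python sketch. It uses the standard set-based definitions of recall and precision; the function names and the document identifiers are invented for illustration and are not taken from the study.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


# Invented document identifiers, for illustration only (not study data).
relevant = {1, 2, 3, 4, 5}
before = {1, 2, 9}        # hypothetical result set without resolution
after = {1, 2, 3, 4, 9}   # hypothetical result set with resolution

# Gain = performance with resolution minus performance without it.
recall_gain = recall(after, relevant) - recall(before, relevant)
precision_gain = precision(after, relevant) - precision(before, relevant)
```

In this made-up case resolution adds two relevant documents and no non-relevant ones, so both measures rise; whether that happens in practice depends on the key category and the query, as the findings above indicate.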
The present findings are based on realistic test queries for which realistic relevance assessments were available. Thus the findings indicate what should happen in real IR situations when anaphor or ellipsis resolution is employed. A restricted resolution method for proper name phrase ellipses and anaphora is easier to implement and more efficient to apply than a comprehensive resolution method for all key categories. Therefore restricted resolution methods for proper name phrases should be considered for keyword-based full text IR.
These findings certainly depend on the database type. Newspaper articles are rich in proper names because persons and organizations and their deeds are central foci in the news. Newspaper articles also contain relatively short paragraphs and seem to have a text structure consisting of introductory paragraphs and elaborative paragraphs, perhaps in several sets in one article. The relative resolution relevance of key categories probably differs in databases of other types. However, the effects of query exhaustivity and expansion extent may well hold in other collections too.
Notes
Anaphor = a textual element that refers to an earlier text element (the correlate) and shares the meaning of the correlate.
Ellipsis = an incomplete construction derived by the omission of one or more words that are obviously understood but must be supplied to make a construction grammatically complete.
Example: in the sentence "Ari does not have a dog but he would like to (have a dog).", the anaphor is "he" (italicized in the original) and the elliptic omission is shown in parentheses.
Resolution = the anaphor (ellipsis) is replaced by its referent.
Key categories = proper names and common nouns in the categories of basic (non-compound) words, compound words, and (noun) phrases.
Exhaustivity = the number of intersecting concepts (or query blocks).
Narrow query = the query contains only a few keys per concept (average about 3).
Broad query = the query contains many keys per concept (average about 12).
Centrality filter = recognizes documents with a lot of emphasis on filter keys.
Topic filter = recognizes documents dealing with the topic expressed by filter keys, but perhaps only marginally.
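The notions of resolution, exhaustivity and sentence-level proximity defined in these notes can be made concrete with a toy Python sketch. It is not the method used in the study: the referent mapping is supplied by hand, the sentence splitter and substring matching are deliberately naive, and the example text, names and helper functions are all hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def resolve(sentences, referents):
    """Replace each listed anaphor or elliptic form with its full referent.

    `referents` is a hand-made mapping (an assumption of this sketch);
    a real system would have to find the correlate automatically.
    """
    resolved = []
    for s in sentences:
        for short, full in referents.items():
            if full in s:
                continue  # full form already present, nothing to resolve
            s = re.sub(rf"\b{re.escape(short)}\b", lambda m: full, s)
        resolved.append(s)
    return resolved


def sentence_match(sentences, blocks):
    """True if some sentence contains at least one key from every block."""
    for s in sentences:
        low = s.lower()
        # Case-insensitive substring test per key; every block must be hit.
        if all(any(key.lower() in low for key in block) for block in blocks):
            return True
    return False


# Invented example text with an elliptic name ("Virtanen") and an anaphor ("She").
text = ("Anna Virtanen was elected chair. "
        "Virtanen promised a tax cut. "
        "She also visited Tampere.")
sentences = split_sentences(text)
resolved = resolve(sentences, {"Virtanen": "Anna Virtanen",
                               "She": "Anna Virtanen"})

# Two intersecting concepts, so exhaustivity is 2; one key per concept.
blocks = [["Anna Virtanen"], ["tax"]]
exhaustivity = len(blocks)

found_before = sentence_match(sentences, blocks)
found_after = sentence_match(resolved, blocks)
```

Without resolution no single sentence contains both the full name and "tax", so the sentence-level query misses the text; after resolution the second sentence satisfies both blocks. A paragraph-level search would apply the same test to paragraphs instead of sentences.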