Recall and Precision Effects of Anaphor and Ellipsis Resolution in Proximity Searching in a Text Database

Ari Pirkola and Kalervo Järvelin

Department of Information Studies
University of Tampere
P.O.Box 607
FIN-33101 TAMPERE, Finland

Pirkola, A. & Järvelin, K. (1996). Recall and precision effects of anaphor and ellipsis resolution in proximity searching in a text database. In Ingwersen, P. & Pors, N.O. (Eds.) Proceedings CoLIS, 2nd International Conference on Conceptions of Library and Information Science: Integration in Perspective, Copenhagen, Oct. 13-16, 1996. Copenhagen: The Royal School of Librarianship, pp. 459-475.


Abstract

Effects of ellipsis and anaphor resolution on proximity searching in a text database are analyzed. Anaphora and ellipses are classified as proper names or common nouns in the categories of basic words, compound words, and phrases. Twenty-eight queries for which document relevance data were available were run in a newspaper database of 55,000 articles. Resolution was most relevant for person names (both anaphora and ellipses) and other proper name phrases (ellipses), and only marginal in other keyword categories. Recall improvement due to resolution decreased when query exhaustivity grew and was greater in sentence searches than in paragraph searches. Resolution also improved precision in the same keyword categories. The data also suggest that precision improvement decreases as query exhaustivity grows.




Conclusions of the report

In conclusion, the present study supports the findings of the previous study on relevant key categories (proper name phrases) for ellipsis and anaphor resolution and the order of resolution effects on recall. It also elaborates previous findings by new results indicating that:

The present findings are based on realistic test queries for which realistic relevance assessments were available. Thus the findings indicate what should happen in real IR situations when anaphor or ellipsis resolution is employed. A restricted resolution method for proper name phrase ellipses and anaphora is easier to implement and more efficient to apply than a comprehensive resolution method for all key categories. Therefore restricted resolution methods for proper name phrases should be considered for keyword-based full text IR.

These findings certainly depend on the database type. Newspaper articles are rich in proper names because persons and organizations and their deeds are central foci in the news. Newspaper articles also contain relatively short paragraphs and seem to have a text structure consisting of introductory paragraphs and elaborative paragraphs, perhaps in several sets in one article. The relative resolution relevance of key categories probably differs in databases of other types. However, the effects of query exhaustivity and expansion extent may well hold also in other collections.

Notes

Anaphor = a textual element which refers to an earlier text element (the correlate) and shares the meaning of the correlate.

Ellipsis = an incomplete construction derived by the omission of one or more words that are obviously understood but must be supplied to make a construction grammatically complete.

Example: the sentence "Ari does not have a dog but he would like to (have a dog)." contains an anaphor ("he", whose correlate is "Ari") and an elliptic omission, shown in parentheses.

Resolution = the anaphor (ellipsis) is replaced by its referent.
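
The effect of resolution on a sentence-level proximity search can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the method used in the study: the two-sentence text, the hand-made `referents` mapping and the helper names (`split_sentences`, `resolve`, `sentence_match`) are all hypothetical, and the proximity condition is simplified to "all keys occur in the same sentence".

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def resolve(sentence, referents):
    # Replace each anaphor with its referent, using a hand-supplied mapping
    # (hypothetical; a real resolver would have to find the correlate itself).
    def swap(match):
        word = match.group(0)
        return referents.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, sentence)

def sentence_match(sentence, keys):
    # Simplified proximity condition: every key occurs in the same sentence.
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return all(k in words for k in keys)

text = "Ari does not have a dog. He would like to buy one."
referents = {"he": "Ari"}   # assumption: anaphor -> referent, made by hand
keys = ["ari", "buy"]       # assumption: a two-key proximity query

sentences = split_sentences(text)
hits_plain = [s for s in sentences if sentence_match(s, keys)]
resolved = [resolve(s, referents) for s in sentences]
hits_resolved = [s for s in resolved if sentence_match(s, keys)]

print(hits_plain)      # no sentence contains both keys
print(hits_resolved)   # the second sentence matches after resolution
```

Without resolution neither sentence contains both keys, so the sentence search misses the text. After the anaphor is replaced by its referent, the second sentence satisfies the proximity condition.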

Key categories = proper names and common nouns in the categories of basic (non-compound) words, compound words, and (noun) phrases.

Exhaustivity = the number of intersecting concepts (or query blocks).

Narrow query = the query contains only a few keys per concept (average about 3).

Broad query = the query contains many keys per concept (average about 12).
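
The three query notions above can be made concrete with a short Python sketch. It assumes a query is a list of concepts, each concept a list of alternative keys, with keys inside a concept combined by OR and concepts intersected by AND. The function names and the example keys are hypothetical and chosen only for illustration.

```python
def exhaustivity(query):
    # Number of intersecting concepts (query blocks).
    return len(query)

def mean_keys_per_concept(query):
    # Average number of alternative keys per concept:
    # about 3 for a narrow query, about 12 for a broad one.
    return sum(len(keys) for keys in query) / len(query)

def matches(query, words):
    # A text unit matches if every concept contributes at least one key.
    return all(any(k in words for k in keys) for keys in query)

# Hypothetical narrow query with two concepts and three keys each.
narrow = [
    ["dog", "puppy", "hound"],
    ["owner", "keeper", "master"],
]

unit_a = {"the", "owner", "fed", "puppy"}
unit_b = {"the", "puppy", "sleeps"}

print(exhaustivity(narrow), mean_keys_per_concept(narrow))
print(matches(narrow, unit_a), matches(narrow, unit_b))
# Dropping a concept lowers exhaustivity and loosens the match.
print(matches(narrow[:1], unit_b))
```

In this representation, raising exhaustivity adds another list that every matching text unit must satisfy, while broadening a query only lengthens the lists of alternatives.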

Centrality filter = recognizes documents with a lot of emphasis on filter keys.

Topic filter = recognizes documents dealing with the topic expressed by filter keys, but perhaps only marginally.