Nuggeteer: Automatic Nugget-Based Evaluation Using Descriptions and Judgements
TREC Definition and Relationship questions are evaluated on thebasis of information nuggets that may be contained in systemresponses. Human evaluators provide informal descriptions of eachnugget, and judgements (assignments of nuggets to responses) for eachresponse submitted by participants.The best present automatic evaluation for these kinds of questions isPourpre. Pourpre uses a stemmed unigram similarity of responses withnugget descriptions, yielding an aggregate result that is difficult tointerpret, but is useful for relative comparison. Nuggeteer, bycontrast, uses both the human descriptions and the human judgements,and makes binary decisions about each response, so that the end resultis as interpretable as the official score.I explore n-gram length, use of judgements, stemming, and termweighting, and provide a new algorithm quantitatively comparable to,and qualitatively better than the state of the art.