RITE has four subtasks, BC, MC, Entrance Exam (ExamBC and ExamSearch), and RITE4QA, plus a Unit Test pilot task.
Submitting a result for the BC subtask is mandatory; the other subtasks are optional (although we strongly recommend that you also submit a result for the MC subtask).
Given a text pair <t1, t2>, identify whether t1 entails (infers) the hypothesis t2 or not.
A 4-way labeling subtask to detect (forward / bidirectional) entailment or no entailment (contradiction / independence) in a text pair.
This subtask aims to answer multiple-choice questions from real university entrance exams by referring to textual knowledge such as Wikipedia and textbooks. It is an attempt to emulate, as a RITE task, the human process of answering entrance exam questions. Using real entrance exams to evaluate intelligent systems makes this an interesting challenge.
This subtask provides two types of data.
The ExamBC and ExamSearch data files include exactly the same t2 sentences, and they share the same IDs. Therefore, you can use the ExamBC dataset to develop retrieval systems for ExamSearch. We also provide the document IDs of texts that were retrieved by human annotators as candidates for t1 (this data will not be provided in the formal run).
In the ExamSearch subtask, t1 is not provided. As shown below, only t2 and its label are provided. Systems have to retrieve t1 from Wikipedia or textbooks, and judge whether it infers t2 or not.
<dataset>
  <pair id="1" label="Y">
    <t2>パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。</t2>
  </pair>
  <pair id="2" label="N">
    <t2>パルテノン神殿は,ヘレニズム文化の影響下で建設された。</t2>
  </pair>
</dataset>
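A minimal sketch of how a data file in this layout might be read, using Python's standard `xml.etree.ElementTree`. The function name `read_pairs` and the dictionary keys are illustrative choices, not part of the task definition. The sketch assumes that a missing `<t1>` element (as in ExamSearch) or a missing `label` attribute should simply be reported as `None`.

```python
import xml.etree.ElementTree as ET

# Sample copied from the ExamSearch example above.
SAMPLE = """<dataset>
  <pair id="1" label="Y">
    <t2>パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。</t2>
  </pair>
  <pair id="2" label="N">
    <t2>パルテノン神殿は,ヘレニズム文化の影響下で建設された。</t2>
  </pair>
</dataset>"""


def read_pairs(xml_text):
    """Return one dict per <pair>; absent attributes or elements become None."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": pair.get("id"),
            "label": pair.get("label"),  # None when the attribute is absent
            "t1": pair.findtext("t1"),   # None when t1 is not provided
            "t2": pair.findtext("t2"),
        }
        for pair in root.iter("pair")
    ]


pairs = read_pairs(SAMPLE)
for p in pairs:
    print(p["id"], p["label"], p["t1"], p["t2"])
```

Because `t1` comes back as `None` here, a retrieval step would have to fill it in before any entailment judgment is made.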
In the Entrance Exam subtask, the following three measures are evaluated.
In evaluation 2, the correct answer ratio for entrance exams is evaluated. For example, given the question below (most of the actual questions have four choices),
1. パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。
2. パルテノン神殿は,ヘレニズム文化の影響下で建設された。
when a system returns Y for choice 1 and N for choice 2, the system is regarded as having chosen 1 as its answer. When Y is returned for more than one choice, the choice with the highest confidence score is taken as the answer.
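The selection rule just described can be sketched as follows. This is only an illustration: the function name `choose_answer` is hypothetical, and two behaviours are assumptions that the text does not specify, namely returning `None` when no choice is labeled Y and preferring the lower choice number when two Y choices have exactly the same confidence.

```python
def choose_answer(outputs):
    """Pick the choice a system is regarded as having selected.

    `outputs` is a list of (choice_number, label, confidence) triples
    for the choices of a single question.
    """
    yes = [(conf, -num, num) for num, label, conf in outputs if label == "Y"]
    if not yes:
        return None  # assumption: no Y label means no answer is selected
    # Highest confidence wins; an exact tie goes to the lower choice number
    # (assumption, the task description does not cover this case).
    return max(yes)[2]


print(choose_answer([(1, "Y", 0.9), (2, "N", 0.8)]))
print(choose_answer([(1, "Y", 0.6), (2, "Y", 0.7), (3, "N", 0.99)]))
```

The confidence attached to an N label plays no role in this sketch; only Y-labeled choices compete.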
Evaluation 3 is conducted for the ExamSearch subtask. Accuracy of retrieving t1 texts from Wikipedia or textbooks is measured. Together with run results, submit IDs of documents that are used as t1 for judging a label for each t2 (see “Submission Format” for the format for submitting document IDs). For each t2, you can submit up to 5 document IDs (more than 5 IDs will be ignored for evaluation).
For submitted document IDs of pairs that are given “Y” by the system, human annotators will judge whether each document actually infers t2 (any t2 that is given “N” by the system is ignored in the manual evaluation). The criterion for the judgment is the same as the one used for creating the document ID data set included in the development data: when a document includes sentences that infer t2, or sentences that infer a part of t2, the document ID is judged as correct. Search accuracy is evaluated by precision and recall (relative to the outputs from all the systems).
The input and output are the same as in the BC subtask, but the system is embedded as an answer validation component in a Question Answering system. In this way, the impact of a RITE BC component on an overall end-to-end application can be measured.
For further information regarding the above metrics, please refer to the NTCIR-6 CLQA overview paper: Sasaki, Yutaka, Chuan-Jie Lin, Kuang-hua Chen, Hsin-Hsi Chen. 2007. Overview of the NTCIR-6 Cross-Lingual Question Answering (CLQA) Task. In Proceedings of NTCIR-6 Workshop, Japan.
Source: QA question/answer dataset (CT) + a run result from a good QA system (up to 5 answer candidates per question, each with source document ID).
CS data is transliterated from CT.
A RITE4QA run is combined with a “source factoid QA answer ranking” (SrcRank) to produce a new answer ranking (RiteRank) for evaluation. RiteRank is primarily based on the confidence scores of Y labels. If several Y pairs have the same confidence score, their order falls back to the rank in SrcRank. In other words, a “Naive Run”, which is a RITE4QA run with all the pairs labeled as “Y”, will result in a RiteRank that is identical to the SrcRank.
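The re-ranking described above might look like the following sketch. It is not the organizers' evaluation script. The function name `rite_rank` and the data layout are hypothetical, and the treatment of pairs labeled N is an assumption (they are dropped here), since the text only describes how Y pairs are ordered.

```python
def rite_rank(src_rank, run):
    """Re-rank answer candidates for one question.

    src_rank: pair IDs in SrcRank order, best first.
    run:      dict mapping pair ID -> (label, confidence) from a RITE4QA run.
    """
    position = {pair_id: i for i, pair_id in enumerate(src_rank)}
    # Assumption: candidates labeled N are removed from the new ranking.
    yes_ids = [pair_id for pair_id in src_rank if run[pair_id][0] == "Y"]
    # Higher confidence first; equal confidence falls back to the SrcRank order.
    return sorted(yes_ids, key=lambda pid: (-run[pid][1], position[pid]))


src = ["a", "b", "c"]
naive = {"a": ("Y", 1.0), "b": ("Y", 1.0), "c": ("Y", 1.0)}
print(rite_rank(src, naive))
```

With every pair labeled Y at the same confidence, the sort key reduces to the SrcRank position, so the naive run reproduces SrcRank as stated in the text.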
Two SrcRanks are created for RITE4QA evaluation:
Factoid QA evaluation metrics will be reported against BetterRanking and WorseRanking respectively for each RITE4QA run.
In the evaluation reports, BetterRanking scores show how good a system is at improving the answer ranking of a well-performing factoid QA system, while WorseRanking scores show how good a system is when it is applied to the answer ranking of a poorly performing factoid QA system.
For recognizing inference in text, various kinds of semantic/context processing are necessary. While the RITE task aims at such integrated semantic/context processing systems, it also has a problem that research focused on specific linguistic phenomena is not easy to pursue.
The unit test is a data set that provides a breakdown of linguistic phenomena that are necessary for recognizing relations between t1 and t2. Sentence pairs are sampled from the BC subtask data, and several sentence pairs are created for each sample so that only one linguistic phenomenon appears in each pair.
The unit test data corresponds to a subset of the BC task data. The data is small, but you might want to use it for various research including the following purposes.
The unit test data is provided for supporting research on the RITE task, and it is not obligatory to use this data.
The unit test data is provided in the following format.
<dataset>
  <pair id="1-0" label="Y" category="entailment:phrase">
    <t1>川端康成は、「雪国」などの作品でノーベル文学賞を受賞した。</t1>
    <t2>川端康成は、「雪国」などの作品の作者である。</t2>
  </pair>
  <pair id="1-1" label="Y" category="list">
    <t1>川端康成は、「雪国」などの作品の作者である。</t1>
    <t2>川端康成は「雪国」の作者である。</t2>
  </pair>
  ...
The attribute “label” is the same as in the BC task. The attribute “id” has the form “X-Y”, where X denotes the ID of a sentence pair in the original BC data set, and Y is an ID that distinguishes among the sentence pairs for the same X. The t1 of the pair with Y=0 corresponds to the t1 of the original BC data.
The attribute “category” gives the category name of a linguistic phenomenon; one of the labels shown below is specified. In each pair of the unit test, the linguistic phenomenon denoted by “category” appears (apart from trivial issues like removal of punctuation). When multiple places in a sentence can be explained by a single phenomenon, they are described in the same pair.
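As a small illustration of the “X-Y” convention, the sketch below splits unit test IDs and groups them by their original BC pair. The helper names are hypothetical, and treating Y as an integer is an assumption based only on the examples “1-0” and “1-1”.

```python
from collections import defaultdict


def split_unit_test_id(pair_id):
    """Split an ID of the form "X-Y" into (X, Y), with Y as an integer."""
    x, y = pair_id.rsplit("-", 1)
    return x, int(y)  # assumption: Y is always a non-negative integer


def group_by_original(pair_ids):
    """Map each original BC pair ID X to the sorted list of its Y values."""
    groups = defaultdict(list)
    for pair_id in pair_ids:
        x, y = split_unit_test_id(pair_id)
        groups[x].append(y)
    return {x: sorted(ys) for x, ys in groups.items()}


print(split_unit_test_id("1-0"))
print(group_by_original(["1-1", "1-0", "2-0"]))
```

Within each group, the pair with Y equal to 0 is the one whose t1 matches the original BC data.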
Categories are designed based on the following works, while redesigned for Japanese.
In the current categorization, phenomena that require considerable paraphrasing are categorized as phrase rewriting (or disagreement). Consequently, these categories cover a variety of linguistic phenomena. We expect that finer-grained subcategorization will be necessary, but we provide the unit test data in the current categorization because it is not yet evident how to subcategorize further.
Categories when entailment relations hold | |
---|---|
synonymy:lex | Replacement by a synonym |
hypernymy:lex | Replacement by a hyponym/hypernym |
entailment:lex | Replacement by a word with an entailment/presupposition relation |
meronymy:lex | Replacement by a meronym |
synonymy:phrase | Replacement by a synonymous phrase |
hypernymy:phrase | Replacement by a hyponymous/hypernymous phrase |
entailment:phrase | Replacement by a phrase with an entailment/presupposition relation |
meronymy:phrase | Replacement by a meronymous phrase |
nominalization | Paraphrasing by changing parts-of-speech |
coreference | Resolving coreference or anaphora relations (including filling arguments of nouns) |
scrambling | Changing the order of bunsetsu |
case_alternation | Alternating cases (e.g. passivization) |
modifier | Remove/Insert modifiers |
transparent_head | Remove the head of a phrase (e.g. remove “B” in the construction “A no B”) |
clause | Remove/Insert coordinated or subordinated clause |
list | Extract a noun phrase from a list or coordinated phrase |
apposition | Infer IS-A relations from apposition (“modifier” is assigned if one expression is chosen from appositive constructions) |
relative_clause | Extract a sentence from a relative clause or change the structure of a relative clause |
temporal | Temporal inference |
spatial | Spatial inference |
quantity | Quantity inference |
implicit_relation | Infer semantic content that is not explicitly mentioned in a sentence |
inference | Inference based on common knowledge |
Categories when entailment relations do not hold | |
---|---|
disagree:lex | Words are not consistent |
disagree:phrase | Phrases are not consistent |
disagree:modality | Modalities are not consistent |
disagree:modifier | Modifiers are not consistent |
disagree:temporal | Temporal expressions are not consistent |
disagree:spatial | Spatial expressions are not consistent |
disagree:quantity | Quantities are not consistent |
Development (training) and test (formal run) data will be provided at this website in the following XML format for all four subtasks.
Note: The “label” attribute is not included in the formal run data. The “category” attribute is not included in the formal run unit test data.
<dataset>
  <pair id="1" label="Y">
    <t1>アテネの市域の中心にアクロポリスの丘、北東部にリュカベットス山がそびえ、パルテノン神殿、聖イヨルイヨス礼拝堂などがある。</t1>
    <t2>パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。</t2>
  </pair>
  <pair id="2" label="N">
    <t1>パルテノン神殿は、ドーリア式神殿の最高傑作と言える作品である。</t1>
    <t2>パルテノン神殿は,ヘレニズム文化の影響下で建設された。</t2>
  </pair>
</dataset>
Each line represents the output for one text pair:
(Text Pair ID)[SPACE](Label)[SPACE](Confidence)[CR]
Example:
1 Y 0.852
2 Y 0.943
3 Y 0.993
4 Y 1.000
The confidence score in the third column is a real number between 0 and 1. In the BC and MC subtasks, the confidence column is optional (but recommended). In the Entrance Exam and RITE4QA subtasks, the confidence column is required, because it is used to break ties among multiple Y labels in a series of pairs on the same topic.
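A sketch of a helper that writes one result line in the format described above. The name `format_result_line` and the choice of three decimal places (mirroring the example) are assumptions. The line terminator is left to the caller.

```python
def format_result_line(pair_id, label, confidence=None):
    """Build "(Text Pair ID) (Label) (Confidence)" without a line terminator."""
    if confidence is None:
        # The confidence column is optional in the BC and MC subtasks.
        return f"{pair_id} {label}"
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Three decimals, as in the example; the precision is an assumption.
    return f"{pair_id} {label} {confidence:.3f}"


print(format_result_line(1, "Y", 0.852))
print(format_result_line(4, "Y", 1.0))
print(format_result_line(2, "N"))
```

Rejecting out-of-range values early catches scores that were never normalized, for example raw classifier margins.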
Each line corresponds to a pair (i.e. t2) and describes IDs of retrieved documents.
(Text Pair ID)[SPACE](Document ID)[SPACE](Document ID)[SPACE]…[CR]
Example:
1 35 225 892
2 1028 298
3 821 1582 315 709
For each t2, submit up to 5 search results. If you list more than 5 IDs, the IDs beyond the fifth are ignored for evaluation.
Manual judgment is performed only for pairs that are given the label “Y” by the system. You may output document IDs for a t2 with the label “N”, but they are ignored for evaluation.
Document IDs are given by the <id> tag directly under the <page> tag (see the example below). Note that the Wikipedia data contains multiple <id> tags for each <page>; those other than the <id> directly under <page> are irrelevant.
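The document ID line format can be produced with a helper like the one below. This is an illustrative sketch: the function name is hypothetical, and truncating to the first five IDs on the submitter's side is an assumption about how the five-ID limit is applied.

```python
MAX_DOC_IDS = 5  # at most 5 document IDs per t2 are evaluated


def format_doc_id_line(pair_id, doc_ids):
    """Build "(Text Pair ID) (Document ID) ..." without a line terminator."""
    kept = list(doc_ids)[:MAX_DOC_IDS]  # assumption: keep only the first five
    return " ".join(str(value) for value in [pair_id, *kept])


print(format_doc_id_line(1, [35, 225, 892]))
print(format_doc_id_line(3, [821, 1582, 315, 709, 11, 12, 13]))
```

Listing the IDs in order of retrieval rank is a natural choice here, so that any truncation drops the weakest candidates.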
<page>
  <title>19世紀</title>
  <id>1615</id> ← THIS NUMBER
  ...
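One way to pick out only the `<id>` directly under `<page>` is sketched below with Python's standard `xml.etree.ElementTree`, whose `find` with a plain tag name looks at direct children only. The nested `<revision>` element and its ID value are made up for illustration. The sketch also assumes the elements carry no XML namespace, which may not hold for the actual data files.

```python
import xml.etree.ElementTree as ET

# The <revision> block and its <id> value are invented for this example.
PAGE = """<page>
  <title>19世紀</title>
  <id>1615</id>
  <revision>
    <id>999</id>
  </revision>
</page>"""


def document_id(page_xml):
    """Return the text of the <id> that is a direct child of <page>."""
    page = ET.fromstring(page_xml)
    # find("id") matches direct children only, so nested <id> tags are skipped.
    return page.findtext("id")


print(document_id(PAGE))
```

Using `iter("id")` instead would also visit the nested IDs, which is the mistake the text warns about.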
When you are using the TSUBAKI search result data provided by the organizers, the “OrigId” attribute of the “Result” tag denotes the document ID (see the example below).
<pair id="1">
  <ResultSet ...>
    <Result OrigId="1615"> ← THIS NUMBER
    ...