NTCIR-10 RITE Tasks

RITE has four subtasks, BC, MC, Entrance Exam (ExamBC, ExamSearch), and RITE4QA, as well as a Unit Test Pilot Task.

Submitting a result to the BC subtask is mandatory, whereas the other subtasks are optional (although we strongly recommend also submitting a result to the MC subtask).

BC Subtask

Given a text pair <t1, t2>, identify whether or not t1 entails (infers) the hypothesis t2.

  • Label: {Y,N}.
  • Language: Japanese, Simplified Chinese and Traditional Chinese
  • Evaluation: Macro F1 value of Y and N.
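
The same macro F1 is used across the classification subtasks, so a small self-contained sketch may help clarify the definition (this is an illustrative function, not the official scoring script; the same code with labels=("F", "B", "C", "I") corresponds to the MC subtask below).

def macro_f1(gold, pred, labels=("Y", "N")):
    """Average of per-label F1 scores over the label set (a sketch, not the official scorer)."""
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Example: four gold labels vs. four system labels
print(macro_f1(["Y", "N", "Y", "N"], ["Y", "Y", "Y", "N"]))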

MC Subtask

A 4-way labeling subtask to detect (forward / bidirectional) entailment or no entailment (contradiction / independence) in a text pair.

  • Label = {F,B,C,I}
    • F: forward entailment (t1 entails t2 AND t2 does not entail t1).
    • B: bidirectional entailment (t1 entails t2 AND t2 entails t1).
    • C: contradiction (t1 and t2 contradict each other, or cannot be true at the same time).
    • I: independence (otherwise)
  • Language: Japanese, Simplified Chinese and Traditional Chinese
  • Evaluation: Macro F1 value of F,B,C and I.

Entrance Exam Subtask (Japanese Only)

This subtask aims to answer multiple-choice questions from real university entrance exams by referring to textual knowledge such as Wikipedia and textbooks. It is an attempt to emulate the human process of answering entrance exam questions as a RITE task, and an interesting challenge that uses real entrance exams for the evaluation of intelligent systems.

  • Label: {Y,N}.
  • Language: Japanese
  • Evaluation: Macro F1 value of Y and N, correct answer ratio for multiple-choice questions, precision/recall of t1 search results

This subtask provides two types of data.

  • BC style (ExamBC): The data is provided in the same form as the BC subtask. Systems are asked to recognize inference relations between t1 and t2. In this data, t1 is extracted from Wikipedia, while t2 is taken from university entrance exams.
  • Search style (ExamSearch): In this data, t1 is not given. Systems are asked to retrieve texts that can be used as t1 from Wikipedia or textbooks, and answer whether t2 is entailed (inferred) from retrieved texts.

The ExamBC and ExamSearch data files include exactly the same t2 sentences, and their IDs are shared. Therefore, you can use the ExamBC dataset for the development of retrieval systems for ExamSearch. We also provide the document IDs of texts retrieved by human annotators as candidates for t1 (this data will not be provided for the formal run).

In the ExamSearch subtask, t1 is not provided. As shown below, only t2 and its label are provided. Systems have to retrieve t1 from Wikipedia or textbooks, and judge whether it infers t2 or not.

<dataset>
  <pair id="1" label="Y">
    <t2>パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。</t2>
  </pair>
  <pair id="2" label="N">
    <t2>パルテノン神殿は,ヘレニズム文化の影響下で建設された。</t2>
  </pair>
</dataset>

In the Entrance Exam subtask, the following three measures are evaluated.

  1. Macro F1 of Y/N: same evaluation measure as the BC subtask.
  2. Correct answer ratio for multiple-choice questions: Y/N labels are mapped to the selection of answers for the original questions, and the correct answer ratio is measured.
  3. Precision/recall of t1 search results (optional, ExamSearch subtask only): the accuracy of the t1 texts retrieved from Wikipedia or textbooks is evaluated.

In evaluation 2, the correct answer ratio for entrance exams is evaluated. For example, given the question below (most of the actual questions have four choices),

1. パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。
2. パルテノン神殿は,ヘレニズム文化の影響下で建設された。

when a system returns Y for choice 1 and N for choice 2, the system is regarded as having chosen 1 as its answer. When Y is returned for more than one choice, the choice with the highest confidence score is taken as the answer.
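
As a concrete sketch of this mapping (the grouping of pairs by question and the names used here are assumptions for illustration, not part of the official evaluation tool):

def choose_answer(choices):
    """choices: list of (choice_number, label, confidence) for one question.
    Pick the choice labeled Y; if several are labeled Y, take the highest confidence."""
    yes = [c for c in choices if c[1] == "Y"]
    if not yes:
        return None  # no choice was judged Y
    return max(yes, key=lambda c: c[2])[0]

# The example above: Y for choice 1 and N for choice 2 means the system chose 1.
print(choose_answer([(1, "Y", 0.85), (2, "N", 0.40)]))  # -> 1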

Evaluation 3 is conducted for the ExamSearch subtask. The accuracy of retrieving t1 texts from Wikipedia or textbooks is measured. Together with the run results, submit the IDs of the documents used as t1 when judging the label for each t2 (see “Submission Format” for the format for submitting document IDs). For each t2, you can submit up to 5 document IDs (IDs beyond the first 5 will be ignored for evaluation).

For document IDs submitted for pairs labeled “Y” by the system, human annotators will judge whether each document actually infers t2 (pairs labeled “N” by the system are ignored in the manual evaluation). The criterion for this judgment is the same as that used for creating the document ID data set included in the development data: when a document includes sentences that infer t2, or sentences that infer a part of t2, the document ID is judged as correct. Search accuracy is evaluated by precision and recall (relative to the pooled outputs from all the systems).
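
As a rough sketch of how per-pair precision and recall could be computed under these definitions (relevant_ids stands for the set of documents judged correct by the annotators; the function and names are illustrative, and the official aggregation over pairs and systems may differ):

def search_precision_recall(submitted_ids, relevant_ids, max_ids=5):
    """Precision/recall of submitted document IDs for one t2.
    Only the first max_ids submitted IDs are counted; recall is relative to
    the pooled set of documents judged correct across all systems."""
    submitted = list(submitted_ids)[:max_ids]
    hits = sum(1 for doc_id in submitted if doc_id in relevant_ids)
    precision = hits / len(submitted) if submitted else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

print(search_precision_recall(["35", "225", "892"], {"225", "1028"}))  # -> (0.33..., 0.5)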

NOTES

  • Do not use the formal run data of ExamBC for the formal run of ExamSearch.
  • Since the evaluation of t1 search results is done manually, it may not be possible to evaluate all of the submitted results. In that case, we evaluate the submission with run number “01”.
  • All search-result submissions are shared with all the teams that participated in evaluation 3 (i.e. teams that submitted document IDs). They can be used for post-formal-run evaluation and/or comparison with other systems.
  • The datasets for this subtask are created from real university entrance exams. Therefore, note that for some pairs no text that infers t2 can be found, or the texts that infer t2 are scattered over several documents. Note also that the BC-style dataset may include pairs for which strict inference relations do not hold.

RITE4QA Subtask (Chinese Only)

Same as the BC subtask in terms of input and output, but evaluated as an embedded answer validation component in a Question Answering (QA) system. This way, the impact of a RITE BC component on an overall end-to-end application can be measured.

  • Label = {Y,N}
  • Language: Simplified Chinese and Traditional Chinese
  • Evaluation metrics: factoid QA Top 1 Accuracy, Top 5 Accuracy, and Mean Reciprocal Rank (MRR).
    • Top1: the rate of questions for which the top-ranked answer is correct.
    • Top5: the rate of questions for which at least one correct answer is included in the top 5 answers.
    • MRR: the average reciprocal rank (1/n), where n is the best rank at which a correct answer appears for each question.

For further information regarding the above metrics, please refer to the NTCIR-6 CLQA overview paper: Sasaki, Yutaka, Chuan-Jie Lin, Kuang-hua Chen, and Hsin-Hsi Chen. 2007. Overview of the NTCIR-6 Cross-Lingual Question Answering (CLQA) Task. In Proceedings of the NTCIR-6 Workshop, Japan.
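
A minimal sketch of the three metrics over ranked answer lists may make the definitions concrete (ranked_answers maps each question ID to its ranked answer candidates, and correct maps it to the set of correct answers; both names are illustrative, not the official evaluation tool):

def qa_metrics(ranked_answers, correct):
    """Top-1 accuracy, Top-5 accuracy, and MRR over a set of questions."""
    top1 = top5 = rr_sum = 0.0
    for qid, answers in ranked_answers.items():
        gold = correct[qid]
        if answers and answers[0] in gold:
            top1 += 1
        if any(a in gold for a in answers[:5]):
            top5 += 1
        for rank, answer in enumerate(answers, start=1):
            if answer in gold:
                rr_sum += 1.0 / rank
                break
    n = len(ranked_answers)
    return top1 / n, top5 / n, rr_sum / n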

How RITE4QA data is generated

Source: QA question/answer dataset (CT) + a run result from a good QA system (up to 5 answer candidates per question, each with source document ID).

  • t1: Answer-candidate-bearing sentence (if a document contains several such sentences, the sentence with the highest overlap with the question is selected automatically).
  • t2: Transformed question. From a question such as “When was X born?”, a template “X was born in <ANSWER>” is generated automatically with minimal human post-editing, and the <ANSWER> variable is replaced by an answer candidate.
  • Label: There is no textual entailment labeling for RITE4QA in RITE2. Systems are evaluated as an answer validation component of a QA system.

The Simplified Chinese (CS) data is converted from the Traditional Chinese (CT) data.

How RITE4QA results are evaluated

A RITE4QA run is combined with a “source factoid QA answer ranking” (SrcRank) to produce a new answer ranking (RiteRank) for evaluation. RiteRank is primarily based on the confidence scores of the Y labels; if there are Y pairs with the same confidence score, it falls back to the rank in SrcRank. In other words, a “Naive Run”, which is a RITE4QA run with all the pairs labeled “Y”, will result in a RiteRank that is identical to the SrcRank.
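
The re-ranking rule can be sketched as follows (the data layout is an assumption for illustration; in particular, placing N-labeled candidates after all Y-labeled ones, in SrcRank order, is our reading of the rule rather than an official specification):

def rite_rank(candidates):
    """candidates: list of dicts with keys 'label', 'confidence', 'src_rank'.
    Y-labeled candidates come first, ordered by confidence (descending);
    ties among Y labels, and all N-labeled candidates, fall back to SrcRank."""
    def sort_key(c):
        is_not_yes = 0 if c["label"] == "Y" else 1      # assumption: Y before N
        conf = -c["confidence"] if c["label"] == "Y" else 0.0
        return (is_not_yes, conf, c["src_rank"])
    return sorted(candidates, key=sort_key)

With this rule, a run that labels every pair “Y” with the same confidence sorts purely by src_rank, reproducing SrcRank as stated above.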

Two SrcRanks are created for RITE4QA evaluation:

  • BetterRanking: This ranking is produced from a good QA system.
  • WorseRanking: The reverse ranking of BetterRanking. It is a simulated worse QA result.

Factoid QA evaluation metrics will be reported against BetterRanking and WorseRanking respectively for each RITE4QA run.

In the evaluation reports, the BetterRanking scores show how much a system improves the answer ranking of a well-performing factoid QA system, while the WorseRanking scores show how well a system performs when it is applied to the answer ranking of a poorly performing factoid QA system.

Note

  • RITE4QA data is mostly created automatically, in order to simulate a realistic scenario in which RITE is used inside a QA system, a typical Information Access / NLP application. As a result, t2 may contain errors.
    • Y labels do not necessarily represent entailment between t1 and t2 (e.g. t1 may lack coreferential information from previous sentences, or t2 may contain additional information that cannot be inferred from t1).
  • Participants are allowed to:
    • Use both the training and test data of the BC & MC subtasks in order to develop their system. Training data for RITE4QA will also be provided, which can be used for domain adaptation / transfer learning to tune an existing BC system.
    • Update or newly develop a RITE BC system (please clearly describe the system in the participant paper).
  • Participants are NOT allowed to utilize past QA data.

Unit Test (Japanese only)

For recognizing inference in text, various kinds of semantic and contextual processing are necessary. While the RITE task aims at such integrated semantic/context processing systems, this also makes it difficult to pursue research focused on specific linguistic phenomena.

The unit test is a data set that provides a breakdown of linguistic phenomena that are necessary for recognizing relations between t1 and t2. Sentence pairs are sampled from the BC subtask data, and several sentence pairs are created for each sample so that only one linguistic phenomenon appears in each pair.

  • Label: {Y,N}.
  • Language: Japanese
  • Evaluation: Macro F1 value of Y and N.

The unit test data corresponds to a subset of the BC task data. The data set is small, but it can be used for various kinds of research, including the following purposes.

  • Analyze linguistic issues that appear in the RITE data
  • Evaluate recognition accuracy for each phenomenon
  • Develop/Train a recognizer for each linguistic phenomenon

The unit test data is provided for supporting research on the RITE task, and it is not obligatory to use this data.

The unit test data is provided in the following format.

<dataset>
  <pair id="1-0" label="Y" category="entailment:phrase">
    <t1>川端康成は、「雪国」などの作品でノーベル文学賞を受賞した。</t1>
    <t2>川端康成は、「雪国」などの作品の作者である。</t2>
  </pair>
  <pair id="1-1" label="Y" category="list">
    <t1>川端康成は、「雪国」などの作品の作者である。</t1>
    <t2>川端康成は「雪国」の作者である。</t2>
  </pair>
...

The attribute “label” is the same as in the BC task. The attribute “id” has the form “X-Y”, where X denotes the ID of a sentence pair in the original BC data set, and Y is an ID that distinguishes among sentence pairs derived from the same X. The t1 of the pair with Y=0 corresponds to the t1 of the original BC data.

The attribute “category” gives the category name of a linguistic phenomenon; one of the names shown below is specified. Each pair in the unit test exhibits the linguistic phenomenon denoted by “category” and no other (except for trivial issues such as removal of punctuation). When multiple places in a sentence can be explained by a single phenomenon, they are covered in the same pair.

The categories are designed based on the following works, redesigned for Japanese.

  • Bentivogli et al. (2010) Building Textual Entailment Specialized Data Sets: a Methodology for Isolating Linguistic Phenomena Relevant to Inference.
  • Sammons et al. (2010) “Ask not what Textual Entailment can do for You…”

In the current categorization, phenomena that require considerable paraphrasing are categorized as phrase rewriting (or disagreement). Consequently, these categories cover a variety of linguistic phenomena. We expect that finer-grained subcategorization will be necessary, but we provide the unit test data with the current categorization since further subcategorization is not evident at the moment.

Categories when entailment relations hold

  • synonymy:lex: Replacement by a synonym
  • hypernymy:lex: Replacement by a hyponym/hypernym
  • entailment:lex: Replacement by a word with an entailment/presupposition relation
  • meronymy:lex: Replacement by a meronym
  • synonymy:phrase: Replacement by a synonymous phrase
  • hypernymy:phrase: Replacement by a hyponymous/hypernymous phrase
  • entailment:phrase: Replacement by a phrase with an entailment/presupposition relation
  • meronymy:phrase: Replacement by a meronymous phrase
  • nominalization: Paraphrasing by changing parts of speech
  • coreference: Resolving coreference or anaphora relations (including filling arguments of nouns)
  • scrambling: Changing the order of bunsetsu
  • case_alternation: Alternating cases (e.g. passivization)
  • modifier: Removing/inserting modifiers
  • transparent_head: Removing the head of a phrase (e.g. removing “B” in the construction “A no B”)
  • clause: Removing/inserting a coordinated or subordinated clause
  • list: Extracting a noun phrase from a list or coordinated phrase
  • apposition: Inferring IS-A relations from apposition (“modifier” is assigned if one expression is chosen from an appositive construction)
  • relative_clause: Extracting a sentence from a relative clause or changing the structure of a relative clause
  • temporal: Temporal inference
  • spatial: Spatial inference
  • quantity: Quantity inference
  • implicit_relation: Inferring semantic content that is not explicitly mentioned in a sentence
  • inference: Inference based on common knowledge

Categories when entailment relations do not hold

  • disagree:lex: Words are not consistent
  • disagree:phrase: Phrases are not consistent
  • disagree:modality: Modalities are not consistent
  • disagree:modifier: Modifiers are not consistent
  • disagree:temporal: Times are not consistent
  • disagree:spatial: Locations are not consistent
  • disagree:quantity: Quantities are not consistent

Data Format

Development (training) and test (formal run) data will be provided on this website in the following XML format for all four subtasks.

Note: The “label” attribute is not contained for the formal run data. The “category” attribute is not contained for the formal run unit test data.

<dataset>
  <pair id="1" label="Y">
    <t1>アテネの市域の中心にアクロポリスの丘、北東部にリュカベットス山がそびえ、パルテノン神殿、聖イヨルイヨス礼拝堂などがある。</t1>
    <t2>パルテノン神殿の建つ丘は,アクロポリスと呼ばれている。</t2>
  </pair>
  <pair id="2" label="N">
    <t1>パルテノン神殿は、ドーリア式神殿の最高傑作と言える作品である。</t1>
    <t2>パルテノン神殿は,ヘレニズム文化の影響下で建設された。</t2>
  </pair>
</dataset>
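
For reference, pairs in this format can be read with a few lines of standard XML handling (a sketch, not an official reader; the missing t1 element in ExamSearch data and the missing label attribute in formal run data simply come back as None):

import xml.etree.ElementTree as ET

def read_pairs(path):
    """Yield (pair_id, label, t1, t2) from a RITE dataset file."""
    root = ET.parse(path).getroot()
    for pair in root.findall("pair"):
        t1 = pair.findtext("t1")              # None for ExamSearch data
        t2 = pair.findtext("t2")
        yield pair.get("id"), pair.get("label"), t1, t2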

Submission Format

Run results of recognizing inference relations (BC, MC, Entrance Exam and RITE4QA)

Each line represents the output for one text pair:

(Text Pair ID)[SPACE](Label)[SPACE](Confidence)[CR]

Example:

1 Y 0.852
2 Y 0.943
3 Y 0.993
4 Y 1.000

The confidence score in the third column is a real number between 0 and 1. In the BC and MC subtasks, the confidence column is optional (but recommended). In the Entrance Exam and RITE4QA subtasks, the confidence column is required for tie-breaking multiple Y labels among the series of pairs on a given topic.
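
A small sketch of writing a run file in this format (results is an illustrative list of (pair ID, label, confidence) tuples; the file name is arbitrary, and the line break is written here as a plain newline):

def write_run(results, path):
    """Write one '(pair ID) (label) (confidence)' line per text pair."""
    with open(path, "w", encoding="utf-8") as f:
        for pair_id, label, confidence in results:
            f.write("{} {} {:.3f}\n".format(pair_id, label, confidence))

write_run([(1, "Y", 0.852), (2, "N", 0.410)], "run01.txt")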

t1 search results (optional, ExamSearch subtask only)

Each line corresponds to a pair (i.e. t2) and describes IDs of retrieved documents.

(Text Pair ID)[SPACE](Document ID)[SPACE](Document ID)[SPACE]…[CR]

Example:

1 35 225 892
2 1028 298
3 821 1582 315 709

For each t2, submit up to 5 search results. Even if you list more than 5 IDs, the extra IDs are ignored for evaluation.

Manual judgment is performed only for pairs that are given label “Y” by the system. It is OK to output document IDs for t2 with label “N”, but they are ignored for evaluation.

Document IDs are given by the <id> tag directly under the <page> tag (see the example below). Do not be confused: the Wikipedia data contains multiple <id>s for each <page>, but all of them other than the <id> directly under <page> are irrelevant.

<page>
  <title>19世紀</title>
  <id>1615</id> ← THIS NUMBER
  ...

When you are using the TSUBAKI search result data provided by the organizers, the “OrigId” attribute of the “Result” tag denotes document IDs (see the example below).

<pair id="1">
  <ResultSet ...>
    <Result OrigId="1615"> ← THIS NUMBER
    ...
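
A sketch of extracting the correct IDs from both sources with standard XML handling (element and variable names follow the examples above; namespace handling, if the Wikipedia dump declares one, is omitted):

import xml.etree.ElementTree as ET

def wikipedia_doc_id(page_elem):
    """The <id> that is a direct child of a <page> element; nested <id>s are ignored."""
    return page_elem.find("id").text          # find() only searches direct children

def tsubaki_doc_ids(pair_elem):
    """OrigId attributes of the <Result> elements under one <pair>."""
    return [r.get("OrigId") for r in pair_elem.iter("Result")]

# Both functions take xml.etree.ElementTree elements, e.g.:
# pair_elem = ET.parse("tsubaki_results.xml").getroot().find("pair")   # hypothetical file name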

Attending the NTCIR-10 meeting

  • At least one member from your team must attend NTCIR-10 in June (NII, Tokyo).
  • All teams will present a poster.
  • A few selected teams from RITE will additionally do an oral presentation.
  • The task organizers will select the oral presentations by reading your participant papers (criteria: novelty and effectiveness of your approach, and diversity of the oral session).