We provide a tool to evaluate a system output for each subtask.
After extracting the archive, the tool can be used as follows.
$ java -jar rite2eval.jar -g data/sample_mc_g.xml -s data/sample_mc_s.txt
data/sample_mc_s.txt is a file output by your RTE system. The format of the file is as follows.
1 I 1.0
2 I 1.0
3 B 1.0
...
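As an illustration, a system might write its predictions in this space-delimited format as in the following sketch (not part of the tool; the `predictions` triples are hypothetical):

```python
# Hypothetical (pair_id, label, confidence) triples produced by an RTE system.
predictions = [(1, 'I', 1.0), (2, 'I', 1.0), (3, 'B', 1.0)]

# Write one "id label confidence" line per pair, as in the sample above.
with open('sample_mc_s.txt', 'w') as f:
    for pid, label, conf in predictions:
        f.write('%d %s %.1f\n' % (pid, label, conf))
```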
The tool outputs precision, recall and F1 values for each label, overall accuracy and Macro-F1 value, and confusion matrix.
------------------------------------------------------------
|Label|  #|     Precision|        Recall|    F1|
|    C|  1|  0.00( 0/ 0)|  0.00( 0/ 1)|  0.00|
|    B|  3|100.00( 2/ 2)| 66.67( 2/ 3)| 80.00|
|    F|  1|  0.00( 0/ 2)|  0.00( 0/ 1)|  0.00|
|    I|  5| 66.67( 4/ 6)| 80.00( 4/ 5)| 72.73|
------------------------------------------------------------
Accuracy: 60.00( 6/ 10)
Macro F1: 38.18

Confusion Matrix
-----------------------------
|gold \ sys|   C   B   F   I|
-----------------------------
|         C|   0   0   0   1|
|         B|   0   2   1   0|
|         F|   0   0   0   1|
|         I|   0   0   1   4|
-----------------------------
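For reference, the per-label scores, accuracy, and Macro F1 can all be reproduced from the confusion matrix alone. The following sketch (not part of the tool) shows the computation on the sample matrix above:

```python
labels = ['C', 'B', 'F', 'I']
# Rows are gold labels, columns are system labels (values from the sample output).
conf = [
    [0, 0, 0, 1],
    [0, 2, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 4],
]

f1s = []
for i, label in enumerate(labels):
    tp = conf[i][i]                            # correctly labeled instances
    sys_total = sum(row[i] for row in conf)    # instances the system labeled i
    gold_total = sum(conf[i])                  # gold instances of label i
    prec = tp / sys_total if sys_total else 0.0
    rec = tp / gold_total if gold_total else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    f1s.append(f1)

accuracy = sum(conf[i][i] for i in range(len(labels))) / sum(map(sum, conf))
macro_f1 = sum(f1s) / len(f1s)   # unweighted mean of the per-label F1 values
print('Accuracy: %.2f' % (accuracy * 100))   # 60.00
print('Macro F1: %.2f' % (macro_f1 * 100))   # 38.18
```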
System outputs may also be given in the same XML format as the dryrun data.
$ java -jar rite2eval.jar -x -g data/sample_mc_g.xml -s data/sample_mc_s.xml
All options of the tool are as follows.
$ java -jar rite2eval.jar [options] -g [gold-data] -s [system-output-file]
  -t    system-output-file is in a tab-delimited or space-delimited format
  -x    system-output-file is in the RITE XML format (default)
You can check the format of the output of your system by executing the following command.
$ java -jar rite2eval.jar -s [system-output-file]
If the format of the file is valid, you will see the following message.
The format of this file is valid.
For the unit test data, the tool provides detailed evaluation results for each linguistic phenomenon (category).
In the entrance exam subtask, exam scores (correct answer ratios) are evaluated in addition to entailment accuracy.
After extracting the package, use the tool as follows.
$ python rite2examscore.py -g RITE2_JA_dev_exam.ans -s sample_exam_run.txt
RITE2_JA_dev_exam.ans is the correct-answer table. sample_exam_run.txt is an output from an RTE system, in the final-submission format shown below.
1 Y 0.8
2 N 0.6
3 Y 1.0
...
The tool shows a summary of scores for each subject, as well as the result for each question. JA, JB, MS, PE, WA, and WB stand for Japanese History A, Japanese History B, Modern Society, Politics and Economy, World History A, and World History B, respectively.
===== SUMMARY =====
JA2009: 0.285714 (4/14)
JB2009: 0.312500 (5/16)
MS2009: 0.173913 (4/23)
PE2009: 0.291667 (7/24)
WA2009: 0.200000 (5/25)
WB2009: 0.266667 (8/30)
TOTAL:  0.250000 (33/132)

===== EVALUATION RESULTS =====
JA2009
ID: EVAL  GLD/SYS TYPE     GOLD LABELS       SYSTEM LABELS
5:  WRONG 1/4     choose-Y 1=Y,2=N,3=N,4=N   1=Y(0.456869),2=Y(0.554831),3=Y(0.599073),4=Y(0.695192)
9:  WRONG 9/11    choose-Y 8=N,9=Y,10=N,11=N 8=N(0.856898),9=N(0.844198),10=N(0.190628),11=Y(0.253200)
...
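Judging from the sample output above, a choose-Y question appears to be scored by taking the candidate the system labels Y with the highest confidence as the system's answer. The following is a hypothetical sketch of that rule; the function name choose_y and the rule itself are assumptions inferred from the sample, not the tool's documented algorithm:

```python
def choose_y(system_labels):
    """Pick an answer for a choose-Y question (assumed rule, see above).

    system_labels: list of (candidate_id, label, confidence) triples.
    Returns the id of the most confidently Y-labeled candidate, or None
    if the system labeled no candidate Y (the question then counts as wrong).
    """
    y_cands = [(conf, cid) for cid, lab, conf in system_labels if lab == 'Y']
    if y_cands:
        return max(y_cands)[1]
    return None

# Question 9 from the sample: gold answer is candidate 9, the system's only
# Y-labeled candidate is 11, so the question is scored WRONG (GLD/SYS 9/11).
q9 = [(8, 'N', 0.856898), (9, 'N', 0.844198),
      (10, 'N', 0.190628), (11, 'Y', 0.253200)]
print(choose_y(q9))  # 11
```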
Since Japanese characters are handled as UTF-8, text editors that can handle UTF-8 are required.
+--root
|  +--baseline-2.0
|  |  +--main.py
|  |  +--src/
|  |  +--.hg_archival.txt
|  +--rite2eval-1.1
|  |  +--data/
|  |  +--src/
|  |  +--build.xml
|  |  +--README
|  |  +--rite2eval.jar
|  +--RITE2_JA_dev_bc.parsed.unidic.xml
|  +--RITE2_JA_test_bc.parsed.unidic.xml (formal run data)
|  +--log (created by this tool after executing)
|  |  +--cross_test_all.log (submittable file created by this tool)
|  |  +--cross_*.f (feature file)
|  |  +--cross_*.model (trained model file)
|  |  +--cross_*_dev.f (feature file for development used in cross validation)
|  |  +--cross_*_test.f (feature file for testing used in cross validation)
|  |  +--cross_*_test.log (cross validation results log)
Execute the tool in the Terminal as follows.
$ python main.py MODE [options]
The following is the set of options that can be specified when running the Baseline Tool.
Feature extraction is performed by the function getFeature(pair) in src/feature.py. getFeature receives an instance of the Pair class as its argument and returns the features of the instance as a hash table (key: string, value: float).
Instances of this class contain information about a pair of sentences.
Instances of this class hold the information of a sentence.
Instances of this class have bunsetsu (chunk) information.
Instances of this class have morpheme information.
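As a starting point, the following is a minimal sketch of a custom getFeature. The stand-in Tok, Sent, and Pair classes below exist only so the sketch runs on its own; in the Baseline Tool, the real classes described above are provided, and only the getFeature body would go into src/feature.py. The feature name word_overlap is an arbitrary example, not one of the tool's built-in features.

```python
class Tok:
    """Stand-in for the morpheme class (only the attr dict is mimicked)."""
    def __init__(self, orig):
        self.attr = {'orig': orig}

class Sent:
    """Stand-in for the sentence class (only itertokens() is mimicked)."""
    def __init__(self, words):
        self._toks = [Tok(w) for w in words]
    def itertokens(self):
        return iter(self._toks)

class Pair:
    """Stand-in for the Pair class holding the text/hypothesis sides."""
    def __init__(self, text, hypo):
        self.text, self.hypo = Sent(text), Sent(hypo)

def getFeature(pair):
    feat = {}
    # A single example feature: the number of base forms shared by the
    # text side and the hypothesis side of the pair.
    t = set(tok.attr['orig'] for tok in pair.text.itertokens())
    h = set(tok.attr['orig'] for tok in pair.hypo.itertokens())
    feat['word_overlap'] = float(len(t & h))
    return feat

p = Pair(['a', 'b', 'c'], ['b', 'c', 'd'])
print(getFeature(p))  # {'word_overlap': 2.0}
```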
# Bag-of-words features over the text side (base forms).
for tok in pair.text.itertokens():
    key = 'bow_text(%s)' % tok.attr['orig']
    feat[key] = 1.0
# Features for noun sequences: surface forms of the nouns within each
# bunsetsu chunk of the text side.
for chk in pair.text.chunks:
    nouns = []
    for tok in chk.toks:
        if tok.attr['pos'] == '名詞':
            nouns.append(tok.attr['surf'])
    if len(nouns) > 1:
        key = 'nns(%s)' % ','.join(nouns)
        feat[key] = 1.0
# Features for predicates shared by the text and the hypothesis.
preds_t = set(tok.attr['orig'] for tok in pair.text.itertokens() if bool(tok.pred))
preds_h = set(tok.attr['orig'] for tok in pair.hypo.itertokens() if bool(tok.pred))
for pred in (preds_t & preds_h):
    key = 'co_pred(%s)' % pred
    feat[key] = 1.0