Evaluation Tool

We provide a tool to evaluate a system output for each subtask.

After extracting the archive, the tool can be used as follows.

$ java -jar rite2eval.jar -g data/sample_mc_g.xml -s data/sample_mc_s.txt

data/sample_mc_s.txt is the file output by your RTE system. Its format is as follows.

1 I 1.0
2 I 1.0
3 B 1.0
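
A minimal sketch of writing and reading this format, assuming each line holds a pair ID, a label and a confidence score separated by whitespace, as in the sample above. The function names `format_system_output` and `parse_system_output` are illustrative and are not part of the evaluation tool.

```python
def format_system_output(predictions):
    """Render (pair_id, label, confidence) tuples as 'id label confidence' lines."""
    lines = []
    for pair_id, label, confidence in predictions:
        lines.append("%s %s %s" % (pair_id, label, confidence))
    return "\n".join(lines) + "\n"


def parse_system_output(text):
    """Parse the lines back into tuples; fields are separated by spaces or tabs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        pair_id, label, confidence = line.split()
        rows.append((pair_id, label, float(confidence)))
    return rows


if __name__ == "__main__":
    print(format_system_output([(1, "I", 1.0), (2, "I", 1.0), (3, "B", 1.0)]))
```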

The tool outputs precision, recall and F1 for each label, the overall accuracy and Macro-F1, and a confusion matrix.

|Label|    #|          Precision|             Recall|    F1|
|    C|    1|  0.00(    0/    0)|  0.00(    0/    1)|  0.00|
|    B|    3|100.00(    2/    2)| 66.67(    2/    3)| 80.00|
|    F|    1|  0.00(    0/    2)|  0.00(    0/    1)|  0.00|
|    I|    5| 66.67(    4/    6)| 80.00(    4/    5)| 72.73|
Accuracy:	60.00(     6/    10)
Macro F1:	38.18 

Confusion Matrix
|gold \ sys|   C   B   F   I|
|         C|   0   0   0   1|
|         B|   0   2   1   0|
|         F|   0   0   0   1|
|         I|   0   0   1   4|
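
The figures in the sample output can be recomputed from the confusion matrix. The sketch below is not the tool's own code; it assumes that a 0/0 ratio is reported as 0 (as the C row of the table suggests) and that Macro-F1 is the unweighted mean of the per-label F1 values.

```python
LABELS = ["C", "B", "F", "I"]
# Rows are gold labels, columns are system labels (same order as LABELS).
CONFUSION = [
    [0, 0, 0, 1],
    [0, 2, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 4],
]


def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return float(numerator) / denominator if denominator else 0.0


def evaluate(labels, confusion):
    """Return (accuracy, macro_f1, f1_by_label) for a square confusion matrix."""
    f1_by_label = {}
    correct = 0
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        system_total = sum(row[i] for row in confusion)  # column sum
        gold_total = sum(confusion[i])                   # row sum
        precision = ratio(true_positive, system_total)
        recall = ratio(true_positive, gold_total)
        f1_by_label[label] = ratio(2 * precision * recall, precision + recall)
        correct += true_positive
    total = sum(sum(row) for row in confusion)
    accuracy = ratio(correct, total)
    macro_f1 = sum(f1_by_label.values()) / len(labels)
    return accuracy, macro_f1, f1_by_label


accuracy, macro_f1, f1_by_label = evaluate(LABELS, CONFUSION)
print("Accuracy: %.2f  Macro F1: %.2f" % (100 * accuracy, 100 * macro_f1))
```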

System outputs may also be given in the same XML format as the dry-run data.

$ java -jar rite2eval.jar -x -g data/sample_mc_g.xml -s data/sample_mc_s.xml

All options of the tool are as follows.

$ java -jar rite2eval.jar [options] -g [gold-data] -s [system-output-file]
  -t	system-output-file is in a tab-delimited or space-delimited format
  -x	system-output-file is in the RITE XML format (default)

You can check the format of the output of your system by executing the following command.

$ java -jar rite2eval.jar -s [system-output-file]

If the format of the file is valid, the tool prints the following message.

The format of this file is valid.

For unittest data, the tool provides detailed evaluation results for each linguistic phenomenon (category).

Evaluation Tool for the EXAM subtask

In the entrance exam subtask, exam scores (correct answer ratios) are evaluated in addition to entailment accuracy.

Expand the package and use the tool as follows.

$ python -g RITE2_JA_dev_exam.ans -s sample_exam_run.txt

RITE2_JA_dev_exam.ans is the correct answer table. sample_exam_run.txt is the output of a RITE system in the final submission format, as shown below.

1 Y 0.8
2 N 0.6
3 Y 1.0

The tool shows a summary of scores for each subject, as well as the result for each question. JA, JB, MS, PE, WA and WB stand for Japanese History A, Japanese History B, Modern Society, Politics and Economy, World History A and World History B, respectively.

===== SUMMARY =====
JA2009: 0.285714 (4/14)
JB2009: 0.312500 (5/16)
MS2009: 0.173913 (4/23)
PE2009: 0.291667 (7/24)
WA2009: 0.200000 (5/25)
WB2009: 0.266667 (8/30)
 TOTAL: 0.250000 (33/132)
  ID: EVAL     GLD/SYS    TYPE         GOLD LABELS              SYSTEM LABELS
   5: WRONG    1/4        choose-Y     1=Y,2=N,3=N,4=N          1=Y(0.456869),2=Y(0.554831),3=Y(0.599073),4=Y(0.695192)
   9: WRONG    9/11       choose-Y     8=N,9=Y,10=N,11=N        8=N(0.856898),9=N(0.844198),10=N(0.190628),11=Y(0.253200)
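
The tool's rule for turning per-option Y/N labels into one chosen option is not stated here. The sketch below shows one rule that reproduces the SYS column of the two sample rows above for choose-Y questions: prefer an option labelled Y with the highest confidence, and if no option is labelled Y, take the N option with the lowest confidence. Both the rule and the function name `choose_option` are assumptions for illustration.

```python
def choose_option(system_labels):
    """Pick one option for a choose-Y question.

    system_labels maps an option ID to a (label, confidence) tuple.
    Assumed rule: Y with confidence c ranks as +c, N with confidence c
    ranks as -c, and the highest-ranked option is chosen.
    """
    def rank(item):
        _option_id, (label, confidence) = item
        return confidence if label == "Y" else -confidence

    return max(system_labels.items(), key=rank)[0]


question_5 = {1: ("Y", 0.456869), 2: ("Y", 0.554831),
              3: ("Y", 0.599073), 4: ("Y", 0.695192)}
question_9 = {8: ("N", 0.856898), 9: ("N", 0.844198),
              10: ("N", 0.190628), 11: ("Y", 0.253200)}
print(choose_option(question_5), choose_option(question_9))
```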

Baseline Tool

  • provides a basic entailment recognition framework based on machine learning
  • provides output files compatible with the format of the RITE-2 formal run
    • provides a feature engineering environment
    • a set of simple features (e.g. word coverage) is implemented initially
    • easy-to-modify set of feature templates
  • baseline tool ver.2.0 (Python) has the following features
  1. cross validation for the development data (for feature engineering, parameter tuning)
    • input: development data
    • output: output file (in a format compatible with the RITE-2 formal run), model file, feature file
  2. training on the development data, and testing on the formal run data using the trained model
    • input: development data, formal run data
    • output: submittable result file, model file, feature file

ver.2.0 change log and notice

  • The tool can be used not only for the BC subtask but also for the MC subtask
  • Confidence scores provided by the tool are normalized from zero to one
  • in ver.1.1 can be used in ver.2.0

ver.1.1 notice

  • Confidence scores are not normalized from zero to one.
    • Please update to ver.2.0 if you have no particular reason to use ver.1.1. can be commonly used in both versions.
  • There is no difference in classification performance between ver.1.1 and ver.2.0 on the BC subtask.
    • If you intend to participate only in the BC subtask, you can keep using ver.1.1.
    • If using ver.1.1, please apply a normalization function such as the sigmoid function to the scores output by the tool.
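
A minimal sketch of such a normalization step, assuming the ver.1.1 output uses the same 'id label score' line layout as the submission format. The helper names `sigmoid` and `normalize_line` are illustrative and are not part of the tool.

```python
import math


def sigmoid(score):
    """Standard logistic sigmoid, written to avoid overflow for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def normalize_line(line):
    """Rewrite one 'id label score' line so that the score lies in [0, 1]."""
    pair_id, label, raw_score = line.split()
    return "%s %s %.6f" % (pair_id, label, sigmoid(float(raw_score)))


if __name__ == "__main__":
    for raw in ["1 Y 2.5", "2 N -0.7", "3 Y 0.0"]:
        print(normalize_line(raw))
```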

System Requirements

  • Linux, Mac OSX
    • Python 2.6 or higher
    • If you intend to use Classias, install Ver.1.1.
  • Windows
    • Cygwin Python 2.6 or higher
    • If you intend to use Classias, install Ver.1.1.
      • Since this version only works in a 32-bit environment, please be aware of the memory size limitation.
  • Other
    • Please let us know if you have any opinions about Baseline Tool.

Since Japanese characters are handled as UTF-8, text editors that can handle UTF-8 are required.

How to use?

|  +--baseline-2.0
|  |
|  |  +--src/
|  |  +--.hg_archival.txt
|  +--rite2eval-1.1
|  |  +--data/
|  |  +--src/
|  |  +--build.xml
|  |  +--README
|  |  +--rite2eval.jar
|  +--RITE2_JA_dev_bc.parsed.unidic.xml
|  +--RITE2_JA_test_bc.parsed.unidic.xml (formal run data)
|  +--log (created by this tool after executing)
|  |  +--cross_test_all.log (submittable file created by this tool)
|  |  +--cross_*.f (feature file)
|  |  +--cross_*.model (trained model file)
|  |  +--cross_*_dev.f (feature file for development used in cross validation)
|  |  +--cross_*_test.f (feature file for testing used in cross validation)
|  |  +--cross_*_test.log (cross validation results log)

Use the tool on the BC and the MC subtask

  • From ver.2.0, the tool can also be applied to the MC subtask.
  • The tool automatically detects the BC or MC subtask by looking at labels in the dataset.
  • BC subtask
    • All of the training algorithms implemented in Baseline Tool and Classias can be used.
    • An example is classified as Y if the classifier score is greater than or equal to zero, and as N otherwise.
    • Classifier scores are normalized to the range from zero to one by the standard logistic sigmoid function.
  • MC subtask
    • All of the training algorithms except for pegasos.hinge and truncated_gradient.hinge in Classias can be used.
    • The model is a three-class (F, C and I) classifier. A pair is classified as “B” if entailment holds in both directions.
    • Classifier scores are normalized to the range from zero to one by the standard logistic sigmoid function.
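
The labelling rules above can be sketched as follows. This is an illustration and not the tool's code: it assumes that "both directions" means the three-class classifier predicts F for the pair (t1, t2) and also for the reversed pair (t2, t1). The function names are hypothetical.

```python
def bc_label(score):
    """BC subtask: Y if the classifier score is >= 0, N otherwise."""
    return "Y" if score >= 0 else "N"


def mc_label(forward_label, backward_label):
    """MC subtask: combine three-class predictions for both directions.

    forward_label is the prediction (F, C or I) for (t1, t2) and
    backward_label is the prediction for (t2, t1). Assumption: the pair
    is labelled B only when both predictions are F; otherwise the
    forward prediction is kept.
    """
    if forward_label == "F" and backward_label == "F":
        return "B"
    return forward_label


if __name__ == "__main__":
    print(bc_label(0.0), bc_label(-0.3))
    print(mc_label("F", "F"), mc_label("F", "I"), mc_label("C", "F"))
```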

Cross validation for system development

Execute the tool in the terminal as follows.

  1. $ python baseline-2.0/ -d RITE2_JA_dev_bc.parsed.unidic.xml
    • -d [file for cross validation] (e.g. RITE2_JA_dev_bc.parsed.KNP.xml)
    • -g [number of divisions] (default: 10)
    • -a [learning algorithm] (default: Averaged Perceptron)
    • -l [output directory] (default: log)
  2. cross_test_all.log in the [output directory] (specified by -l option) is submittable.
  3. system evaluation using the evaluation tool provided by RITE-2 organizers
    • $ java -jar rite2eval-1.1/rite2eval.jar -g RITE2_JA_dev_bc.parsed.unidic.xml -s log/cross_test_all.log
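
For readers unfamiliar with cross validation, the sketch below shows how data can be divided into the number of parts given by -g, with each part held out once for testing. How the tool actually assigns pairs to parts is not described here; the round-robin assignment and the function names are assumptions.

```python
def split_folds(items, num_folds=10):
    """Distribute items over num_folds lists in round-robin order."""
    folds = [[] for _ in range(num_folds)]
    for index, item in enumerate(items):
        folds[index % num_folds].append(item)
    return folds


def cross_validation_splits(items, num_folds=10):
    """Yield (train, test) lists; each fold serves as the test set once."""
    folds = split_folds(items, num_folds)
    for held_out in range(num_folds):
        train = [item
                 for i, fold in enumerate(folds) if i != held_out
                 for item in fold]
        yield train, folds[held_out]


if __name__ == "__main__":
    pair_ids = list(range(1, 26))
    for train, test in cross_validation_splits(pair_ids, num_folds=5):
        print(len(train), test)
```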

Entailment recognition for the formal run data

Execute the tool in the terminal as follows.

  1. $ python baseline-2.0/ normal -d RITE2_JA_dev_bc.parsed.unidic.xml -t RITE2_JA_test_bc.parsed.unidic.xml
    • The formal run data will be released on January 9, 2013.


Running the tool

python MODE [options]

  • normal mode (normal)
    • $ python normal [options]
    • First a model is trained using the specified training data, and then testing is performed on the specified test data using the trained model.
  • cross validation mode (cross)
    • $ python cross [options]
    • Execute cross validation using the specified training data.

The following is a set of options which can be specified when running Baseline Tool.

  • -a NAME
    • specifies the learning algorithm. Classias is required for every learning algorithm except without_classias.
    • without_classias is selected by default
    • pegasos.hinge and truncated_gradient.hinge cannot be applied to the MC subtask.
    • For all algorithms, classifier scores are normalized to the range from zero to one by the standard logistic sigmoid function.
      • To obtain unnormalized scores, use --no-sigmoid described later.
      • without_classias (*Classias is not required)
        • Averaged Perceptron is used.
      • lbfgs.logistic
        • Gradient Descent using L-BFGS (L2) or OWL-QN (L1) is used.
      • averaged_perceptron
        • Averaged Perceptron is used.
      • pegasos.logistic
        • Primal Estimated sub-GrAdient SOlver (Pegasos) is used to train an L2-regularized logistic regression model.
      • pegasos.hinge
        • Primal Estimated sub-GrAdient SOlver (Pegasos) is used to train an L2-regularized L1-loss SVM.
      • truncated_gradient.logistic
        • Truncated Gradient is used to train an L1-regularized logistic regression model.
      • truncated_gradient.hinge
        • Truncated Gradient is used to train an L1-regularized L1-loss SVM.
  • -d [training data]
    • file path for the training data.
  • -t [test data]
    • file path for the testing data. Not required in the cross validation mode.
  • -l [directory name]
    • name of the directory to which this tool writes log files. (default: log)
  • -g NUM
    • specifies the number of divisions in cross validation. (default: 10)
    • specifies all kinds of parameters used in the learning algorithms implemented in Baseline Tool and Classias. NAME is a parameter name, and VALUE is its value. The available parameters depend on the specified learning algorithm.
    • The following parameters can be used for “without_classias”.
      • eta=VALUE
        • initial learning rate. (default: 0.1)
      • max_iterations=NUM
        • the maximum number of iterations. (default: 1000)
      • period=NUM
        • convergence of the objective function value is checked every NUM iterations. (default: 20)
      • epsilon=VALUE
        • the threshold used to detect convergence of the objective function.
        • training stops if the variance divided by max(1, |the value of the objective function in the current iteration|) is lower than VALUE. (default: 0.001)
    • For more details on the parameters that can be specified in Classias, see the Classias manual.
  • --no-sigmoid
    • Scores of the classifiers are output without normalization (the standard logistic sigmoid function is not applied).
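
The interplay of max_iterations, period and epsilon for without_classias can be sketched as below. This is an illustration and not the tool's implementation: it reads "variance" as the absolute change of the objective value between two consecutive checks, which is an assumption, and the names `has_converged` and `train` are hypothetical.

```python
def has_converged(previous_value, current_value, epsilon=0.001):
    """Relative-change test: |change| / max(1, |current value|) < epsilon."""
    change = abs(previous_value - current_value)
    return change / max(1.0, abs(current_value)) < epsilon


def train(step, max_iterations=1000, period=20, epsilon=0.001):
    """Run step(iteration) until convergence or max_iterations.

    step performs one training iteration and returns the objective value.
    Convergence is tested only every `period` iterations. Returns the
    number of iterations that were executed.
    """
    previous_value = None
    for iteration in range(1, max_iterations + 1):
        value = step(iteration)
        if iteration % period == 0:
            if previous_value is not None and has_converged(
                    previous_value, value, epsilon):
                return iteration
            previous_value = value
    return max_iterations


if __name__ == "__main__":
    # Toy objective that decays as 1 / iteration.
    print(train(lambda iteration: 1.0 / iteration))
```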

Feature engineering

Feature extraction is performed in the function getFeature(pair) in src/. getFeature receives an instance of the Pair class as an argument and returns the features of that instance as a hashtable (key:string, value:string).

Class references


Instances of the Pair class contain information about a pair of sentences.

  • member variables
    • attr : a hashtable (keys and values are “string” type)
    • text : an instance of Sentence class
    • hypo : an instance of Sentence class

Instances of the Sentence class hold information about a sentence.

  • member variables
    • attr : a hashtable (key and value are “string” type) which holds attributes of the sentence.
    • chunks : a list of instances of Chunk class.
    • annotation : a list of hashtables (key:string, value:string) that describe the tools used to analyze the sentence.
  • methods
    • getHeadChunk(idx) : returns the head chunk of the idx-th chunk, or false if there is no head chunk.
    • itertokens() : returns a generator which scans all of the morphemes in the sentence.
    • find_token(attr) : finds morphemes that have the specified attribute “attr” and returns a tuple consisting of the chunk index and the morpheme index.

Instances of the Chunk class hold bunsetsu (chunk) information.

  • member variables
    • attr : a hashtable (key:string, value:string) that has attributes of the instance.
      • obligatory attributes: id (id of the chunk), head (head id of the chunk)
      • optional attributes: type (dependency type), score (dependency confidence)
    • toks : a list of instances of Token class.
  • methods
    • head() : returns an integer value that represents the head index of the chunk.
    • find(attr) : finds morphemes that have the specified attribute “attr” and returns a set of their indices if such morphemes are found.

Instances of the Token class hold morpheme information.

  • member variables
    • attr : hashtable (key and value are “string” type) stores attributes of the Token. The obligatory attributes are “id”, “surf” and “orig” which represent token id (e.g. t0, t1, …), surface form and lemma, respectively. The optional attributes are “head”, “pos”, “pos1”, “pos2”, “pos3”, “pos4”, “read”, “con”, “conType” and “pron”.
      • head: the id of the head of this token (used only if the data has word-based dependency structure information)
      • pos: represents POS information
      • pos1~pos4: fine classification of POS
      • read: reading
      • con:
      • conType:
      • pron: pronunciation
    • pred : a hashtable (key and value are “string” type) that stores predicate-argument structure information of the token. The obligatory attribute is “type”, and the optional attributes are case relations.
      • type: predicate type
      • ga, wo, ni, kara, he, yori, to, de, no, ga2: case relations.
    • modal : hashtable (key and value are “string” type) stores extended modality information of the token. The obligatory attributes are assumptional, sentiment, focus, tense, type, authenticity and source. For more details, see (in Japanese).

An example of feature extraction

# feat is the hashtable that getFeature(pair) returns

# bag-of-words in text
for tok in pair.text.itertokens():
    key = 'bow_text(%s)' % tok.attr['orig']
    feat[key] = 1.0

# a sequence of nouns in text
for chk in pair.text.chunks:
    nouns = []
    # The trailing None flushes a noun run that ends at the chunk boundary.
    for tok in list(chk.toks) + [None]:
        if tok is not None and tok.attr.get('pos') == '名詞':
            # Assumption: the run is built from lemmas ('orig').
            nouns.append(tok.attr['orig'])
        else:
            if len(nouns) > 1:
                key = 'nns(%s)' % ','.join(nouns)
                feat[key] = 1.0
            del nouns[:]

# predicate match in pair
preds_t = set(tok.attr['orig'] for tok in pair.text.itertokens() if tok.pred)
preds_h = set(tok.attr['orig'] for tok in pair.hypo.itertokens() if tok.pred)
for pred in (preds_t & preds_h):
    key = 'co_pred(%s)' % pred
    feat[key] = 1.0
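
A self-contained sketch of a word coverage feature in the same style. The Stub* classes are stand-ins that mimic only the members used here (attr, itertokens, text, hypo) so that the example runs without the tool; they are not the tool's classes, and the feature name coverage_h_in_t is hypothetical. The value is a float, as in the examples above.

```python
class StubToken(object):
    """Stand-in for a token: only attr['orig'] (the lemma) is provided."""
    def __init__(self, orig):
        self.attr = {'orig': orig}


class StubSentence(object):
    """Stand-in for a sentence: only itertokens() is provided."""
    def __init__(self, lemmas):
        self.toks = [StubToken(lemma) for lemma in lemmas]

    def itertokens(self):
        for tok in self.toks:
            yield tok


class StubPair(object):
    """Stand-in for a pair: text and hypo sentences."""
    def __init__(self, text_lemmas, hypo_lemmas):
        self.text = StubSentence(text_lemmas)
        self.hypo = StubSentence(hypo_lemmas)


def getFeature(pair):
    """Return the share of distinct hypo lemmas that also occur in text."""
    feat = {}
    text_lemmas = set(tok.attr['orig'] for tok in pair.text.itertokens())
    hypo_lemmas = set(tok.attr['orig'] for tok in pair.hypo.itertokens())
    if hypo_lemmas:
        covered = len(hypo_lemmas & text_lemmas)
        feat['coverage_h_in_t'] = float(covered) / len(hypo_lemmas)
    return feat


if __name__ == "__main__":
    print(getFeature(StubPair(['猫', 'が', '走る'], ['猫', 'が', '歩く'])))
```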

Change Log

  • 2012.12.17: ver.2.0 release
  • 2012.9.7: ver.1.1 release
    • default encoding changed to utf-8
  • 2012.9.6: ver.1.0 release

Contact Information

  • ntc10-rite2-organizers (at)