neural_compressor.metric.evaluate_squad

Official evaluation script for v1.1 of the SQuAD dataset.

From https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py

Module Contents

Functions

f1_score(prediction, ground_truth)

Calculate the F1 score between the prediction and the ground truth.

metric_max_over_ground_truths(metric_fn, prediction, ...)

Calculate the maximum metric value over all ground truths.

exact_match_score(prediction, ground_truth)

Compute the exact match score between the prediction and the ground truth.

evaluate(dataset, predictions)

Evaluate the average F1 score and the exact match score for Question-Answering results.

neural_compressor.metric.evaluate_squad.f1_score(prediction, ground_truth)[source]

Calculate the F1 score between the prediction and the ground truth.

Parameters:
  • prediction – The predicted result.

  • ground_truth – The ground truth.

Returns:

The F1 score of the prediction. Floating-point number.
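
A minimal, self-contained sketch of the token-overlap F1 used by the SQuAD v1.1 script; the normalize_answer and token_f1 names below are illustrative helpers, not part of this module's API:

    import re
    import string
    from collections import Counter

    def normalize_answer(text):
        # Lowercase, drop punctuation and the articles a/an/the,
        # and collapse whitespace (as in the SQuAD v1.1 script).
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def token_f1(prediction, ground_truth):
        # Token-overlap F1 between a predicted answer and one reference answer.
        pred_tokens = normalize_answer(prediction).split()
        truth_tokens = normalize_answer(ground_truth).split()
        common = Counter(pred_tokens) & Counter(truth_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(truth_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("the cat sat", "cat sat on the mat"))  # ~0.667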

neural_compressor.metric.evaluate_squad.metric_max_over_ground_truths(metric_fn, prediction, ground_truths)[source]

Calculate the maximum metric value over all ground truths.

For each answer in ground_truths, evaluate the metric between the prediction and that answer, and return the maximum value.

Parameters:
  • metric_fn – The function to calculate the metric.

  • prediction – The prediction result.

  • ground_truths – A list of correct answers.

Returns:

The maximum metric value. Floating-point number.
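
The behavior is simple enough to show directly. This sketch assumes a metric_fn such as the token_f1 helper shown above; the function name here is illustrative:

    def max_over_ground_truths(metric_fn, prediction, ground_truths):
        # Score the prediction against every reference answer and keep the best value.
        return max(metric_fn(prediction, answer) for answer in ground_truths)

    # e.g. max_over_ground_truths(token_f1, "in Paris", ["Paris", "the city of Paris"])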

neural_compressor.metric.evaluate_squad.exact_match_score(prediction, ground_truth)[source]

Compute the exact match score between the prediction and the ground truth.

Parameters:
  • prediction – The predicted result to be evaluated.

  • ground_truth – The ground truth.

Returns:

The exact match score.
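
A sketch of exact match after normalization, reusing the same illustrative normalize_answer helper as the f1_score sketch above:

    import re
    import string

    def normalize_answer(text):
        # Same illustrative normalization as in the f1_score sketch.
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, ground_truth):
        # 1.0 when the normalized strings are identical, otherwise 0.0.
        return float(normalize_answer(prediction) == normalize_answer(ground_truth))

    print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0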

neural_compressor.metric.evaluate_squad.evaluate(dataset, predictions)[source]

Evaluate the average F1 score and the exact match score for Question-Answering results.

Parameters:
  • dataset – The dataset to evaluate the prediction. A list instance of articles. An article contains a list of paragraphs, a paragraph contains a list of question-and-answers (qas), and a question-and-answer contains an id, a question, and a list of correct answers, following the SQuAD v1.1 JSON layout (see the usage sketch at the end of this entry).

  • predictions – The result of predictions to be evaluated. A dict mapping the id of a question to the predicted answer of the question.

Returns:

The F1 score and the exact match score.
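
A hedged usage sketch: the dataset below follows the SQuAD v1.1 layout described above (articles containing paragraphs containing qas), with made-up ids, texts, and answer offsets; the return value is printed as-is, since its exact shape (a single F1 value versus both scores) should be checked against the implementation:

    from neural_compressor.metric.evaluate_squad import evaluate

    # One article with one paragraph and one question, in SQuAD v1.1 layout.
    dataset = [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "The Eiffel Tower is located in Paris.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where is the Eiffel Tower located?",
                            "answers": [{"text": "Paris", "answer_start": 31}],
                        }
                    ],
                }
            ],
        }
    ]

    # Predictions map each question id to the predicted answer string.
    predictions = {"q1": "Paris"}

    print(evaluate(dataset, predictions))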