neural_compressor.metric.f1

Official evaluation script for v1.1 of the SQuAD dataset.

From https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py

Functions

normalize_answer(→ str)

Normalize the answer text.

f1_score(prediction, ground_truth)

Calculate the F1 score of the prediction and the ground_truth.

metric_max_over_ground_truths(→ float)

Calculate the maximum metric value over all ground truths.

evaluate(→ float)

Evaluate the average F1 score of Question-Answering results.

Module Contents

neural_compressor.metric.f1.normalize_answer(text: str) → str

Normalize the answer text.

Lowercase the text, remove punctuation, articles, and extra whitespace, and replace other whitespace characters (newline, tab, etc.) with a space.

Parameters:

text – The text to be normalized.

Returns:

The normalized text.
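
The normalization steps described above can be sketched as follows. This is a minimal illustration modeled on the SQuAD v1.1 evaluation script, not the module's own code; the name normalize_sketch is hypothetical, and the article list (a, an, the) is an assumption.

```python
import re
import string


def normalize_sketch(text: str) -> str:
    """Hypothetical stand-in that mirrors the documented normalization steps."""
    # 1. Lowercase the text.
    text = text.lower()
    # 2. Drop ASCII punctuation characters.
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    # 3. Replace English articles with a space (assumed list: a, an, the).
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # 4. Collapse newlines, tabs, and repeated spaces into single spaces.
    return " ".join(text.split())


print(normalize_sketch("The  Denver\tBroncos!\n"))  # denver broncos
```

Under these assumptions, answers that differ only in case, punctuation, articles, or spacing normalize to the same string.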

neural_compressor.metric.f1.f1_score(prediction: collections.abc.Sequence, ground_truth: collections.abc.Sequence)

Calculate the F1 score of the prediction and the ground_truth.

Parameters:
  • prediction – the predicted answer.

  • ground_truth – the correct answer.

Returns:

The F1 score of the prediction, as a floating-point number.
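
As a rough illustration of the computation, the sketch below scores two token sequences by their overlap. It assumes the inputs are already tokenized lists of words and that overlap is counted as a multiset intersection, as in the SQuAD v1.1 script; f1_sketch is a hypothetical name, not the module's function.

```python
from collections import Counter
from typing import Sequence


def f1_sketch(prediction: Sequence, ground_truth: Sequence) -> float:
    """Hypothetical token-level F1 between two sequences."""
    # Count tokens shared by both sequences (multiset intersection).
    overlap = sum((Counter(prediction) & Counter(ground_truth)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)    # shared tokens / predicted tokens
    recall = overlap / len(ground_truth)     # shared tokens / reference tokens
    return 2 * precision * recall / (precision + recall)


# Two of the three predicted tokens match: precision = 2/3, recall = 1.
score = f1_sketch(["denver", "broncos", "team"], ["denver", "broncos"])
print(round(score, 3))  # 0.8
```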

neural_compressor.metric.f1.metric_max_over_ground_truths(metric_fn: Callable[[T, T], float], prediction: str, ground_truths: List[str]) → float

Calculate the maximum metric value over all ground truths.

For each answer in ground_truths, compute the metric between the prediction and that answer, then return the largest value.

Parameters:
  • metric_fn – the function to calculate the metric.

  • prediction – the prediction result.

  • ground_truths – the list of correct answers.

Returns:

The maximum metric value, as a floating-point number.
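
The idea of taking the best score across several acceptable answers can be sketched as below. Both max_over_truths_sketch and the toy exact_match metric are hypothetical; the real function may also normalize or tokenize its inputs before calling metric_fn.

```python
from typing import Callable, List


def exact_match(prediction: str, truth: str) -> float:
    """Toy metric (an assumption for illustration): 1.0 on a case-insensitive match."""
    return 1.0 if prediction.lower() == truth.lower() else 0.0


def max_over_truths_sketch(
    metric_fn: Callable[[str, str], float],
    prediction: str,
    ground_truths: List[str],
) -> float:
    """Score the prediction against every reference answer and keep the best."""
    return max(metric_fn(prediction, truth) for truth in ground_truths)


best = max_over_truths_sketch(exact_match, "Denver Broncos", ["Broncos", "denver broncos"])
print(best)  # 1.0
```

Taking the maximum means a prediction is not penalized for matching only one of several valid phrasings of the answer.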

neural_compressor.metric.f1.evaluate(predictions: Dict[str, str], dataset: List[Dict[str, Any]]) → float

Evaluate the average F1 score of Question-Answering results.

The F1 score is the harmonic mean of precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall). For every question-and-answer in the dataset, the function evaluates the F1 score of the prediction, then averages the scores.

Parameters:
  • predictions – The result of predictions to be evaluated. A dict mapping the id of a question to the predicted answer of the question.

  • dataset – The dataset against which the predictions are evaluated. A list of articles. An article contains a list of paragraphs, a paragraph contains a list of question-and-answers (qas), and a question-and-answer contains an id, a question, and a list of correct answers. For example:

    [{'paragraphs':
        [{'qas': [{'answers': [{'answer_start': 177, 'text': 'Denver Broncos'}, …],
                   'question': 'Which NFL team represented the AFC at Super Bowl 50?',
                   'id': '56be4db0acb8001400a502ec'}]}]}]

Returns:

The average F1 score of the predictions, as a floating-point number expressed as a percentage.
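
To make the nested dataset layout and the percentage result concrete, here is a self-contained sketch. It is not the module's implementation: evaluate_sketch and token_f1 are hypothetical names, the ids q1 and q2 and the second question are invented for illustration, and treating a missing prediction as a score of zero is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List


def token_f1(prediction: str, truth: str) -> float:
    """Simplified token F1 (lowercase and whitespace split only)."""
    pred, ref = prediction.lower().split(), truth.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_sketch(predictions: Dict[str, str], dataset: List[Dict[str, Any]]) -> float:
    """Average the best per-question F1 over all questions, as a percentage."""
    total, f1_sum = 0, 0.0
    for article in dataset:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                total += 1
                if qa["id"] not in predictions:
                    continue  # assumption: unanswered questions score zero
                truths = [answer["text"] for answer in qa["answers"]]
                f1_sum += max(token_f1(predictions[qa["id"]], t) for t in truths)
    return 100.0 * f1_sum / total if total else 0.0


# Hypothetical two-question dataset in the documented layout.
dataset = [{"paragraphs": [{"qas": [
    {"id": "q1",
     "question": "Which NFL team represented the AFC at Super Bowl 50?",
     "answers": [{"answer_start": 177, "text": "Denver Broncos"}]},
    {"id": "q2",
     "question": "A made-up second question?",
     "answers": [{"answer_start": 0, "text": "example answer"}]},
]}]}]
predictions = {"q1": "Denver Broncos", "q2": "wrong"}

print(evaluate_sketch(predictions, dataset))  # 50.0
```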