BERTScore and ROUGE: Two Metrics for Evaluating Text Summarization Systems

Hatice Özbolat

Text summarization is a core task in natural language processing (NLP): condensing a text into a shorter, more concise form while preserving its key information. Many systems have been developed to perform this task automatically, and to evaluate and compare their performance, various metrics are employed.

In this article, we will examine two commonly used reference-based metrics for evaluating text summarization systems: BERTScore and ROUGE. Both measure the similarity between a generated summary and reference texts, but they do so with very different methods.

BERTScore: Measuring Semantic Similarity

BERTScore measures the similarity between a summary and reference texts using the contextual embeddings produced by BERT (Bidirectional Encoder Representations from Transformers), a powerful language model. Every token of the candidate is compared with every token of the reference via cosine similarity, and the best matches are aggregated into precision, recall, and F1 scores. Because the comparison happens in embedding space rather than on the surface text, BERTScore reflects how semantically similar the summary is to the reference: it takes into account not only the words themselves but also their meaning and context.
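
To make this concrete, here is a minimal sketch of the core computation, assuming bert-base-uncased as the underlying model. The real bert_score library adds refinements that this sketch omits (a tuned hidden layer, optional IDF weighting, baseline rescaling, and special-token handling), so treat it as an illustration rather than a reimplementation:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text):
    # One contextual vector per token, taken from the last hidden layer
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]

cand = token_embeddings("Stars shine, and the sky is covered with a blue blanket.")
ref = token_embeddings("Stars shine, and the sky is adorned with a bright color.")

# Cosine similarity between every candidate/reference token pair
cand = torch.nn.functional.normalize(cand, dim=-1)
ref = torch.nn.functional.normalize(ref, dim=-1)
sim = cand @ ref.T

# Greedy matching: each token is paired with its most similar counterpart
precision = sim.max(dim=1).values.mean()  # best match for each candidate token
recall = sim.max(dim=0).values.mean()     # best match for each reference token
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.4f}, R={recall:.4f}, F1={f1:.4f}")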

You can read this article for more information on calculating BERTScore: Text Summarization: How to Calculate BertScore

ROUGE: Word-Level Similarity Measurement

ROUGE, on the other hand, measures the overlap of n-grams (single words, bigrams, and longer sequences) between a summary and reference texts, so it evaluates how similar the summary is to the reference at the word level. Its scores are length-normalized: recall divides the overlap by the size of the reference, and precision divides it by the size of the candidate, so a summary cannot score well merely by being very short or very long. The ROUGE-L variant replaces fixed n-grams with the longest common subsequence of the two texts.
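
The definition is simple enough to compute by hand. Below is a hand-rolled ROUGE-1 sketch (the rouge-score library used later adds stemming and smarter tokenization, so its numbers can differ slightly):

from collections import Counter

def rouge_1(reference, candidate):
    # Clipped unigram overlap: each word counts at most as often as it
    # appears in the other text
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f1 = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(f"P={p:.4f}, R={r:.4f}, F1={f1:.4f}")  # P=0.8333, R=0.8333, F1=0.8333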

Example:

# Install the required libraries
!pip install transformers
!pip install bert-score
!pip install rouge-score

# In Colab/Jupyter, the "!" prefix runs these lines as shell commands

from bert_score import BERTScorer
from rouge_score import rouge_scorer

text1 = "Stars shine, and the sky is covered with a blue blanket."
text2 = "Stars shine, and the sky is adorned with a bright color."

# BERTScore calculation: text1 is scored as the candidate, text2 as the reference
scorer = BERTScorer(model_type='bert-base-uncased')
P, R, F1 = scorer.score([text1], [text2])
print(f"BERTScore Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

# Output: BERTScore Precision: 0.8707, Recall: 0.8826, F1: 0.8766
# ROUGE calculation: RougeScorer.score(target, prediction) expects the
# reference text first and the candidate second
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(text1, text2)
print(f"ROUGE-1 Precision: {scores['rouge1'].precision:.4f}, Recall: {scores['rouge1'].recall:.4f}, F1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2 Precision: {scores['rouge2'].precision:.4f}, Recall: {scores['rouge2'].recall:.4f}, F1: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L Precision: {scores['rougeL'].precision:.4f}, Recall: {scores['rougeL'].recall:.4f}, F1: {scores['rougeL'].fmeasure:.4f}")

"""
Outputs:

ROUGE-1 Precision: 0.7273, Recall: 0.7273, F1: 0.7273
ROUGE-2 Precision: 0.6000, Recall: 0.6000, F1: 0.6000
ROUGE-L Precision: 0.7273, Recall: 0.7273, F1: 0.7273

"""

Which Metric to Use?

Both metrics are valuable for evaluating the quality of text summarization, but the right choice depends on your requirements and application scenario. If you want to measure semantic similarity, BERTScore is more suitable; if you want to measure exact word-level overlap, or need a fast metric that requires no neural model, ROUGE may be preferred. In practice the two are often reported together, as in the sketch below.
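
Here is a small illustrative helper that wraps the calls shown above (the function evaluate_summary is my own naming, not part of either library):

from bert_score import BERTScorer
from rouge_score import rouge_scorer

def evaluate_summary(candidate, reference):
    # Semantic similarity via BERTScore (F1 only, for brevity).
    # Note: for many calls, build the scorer once and reuse it.
    bert = BERTScorer(model_type='bert-base-uncased')
    _, _, f1 = bert.score([candidate], [reference])
    # Surface overlap via ROUGE; score() takes the reference first
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = rouge.score(reference, candidate)
    return {
        'bertscore_f1': f1.mean().item(),
        'rouge1_f1': rouge_scores['rouge1'].fmeasure,
        'rougeL_f1': rouge_scores['rougeL'].fmeasure,
    }

print(evaluate_summary(text1, text2))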

Conclusion

In this article, we have explored two significant metrics for evaluating text summarization systems: BERTScore, which compares texts in embedding space, and ROUGE, which compares them at the surface level. Each has its own advantages, and the choice between them depends on the needs of the application. Since summarization plays a crucial role in natural language processing, sound use of these metrics contributes directly to the advancement of summarization systems.


If you have read my article this far and want me to share similar content, do not forget to like it and leave a comment ✨😍
