Text Summarization: How to Calculate BERTScore

Sep 28, 2023

This article was written by Hatice Özbolat and Alparslan Mesri.

The development of machine learning has led to the rapid growth of fields such as Natural Language Processing (NLP) and Large Language Models (LLMs). However, as these fields advance, a new problem has emerged: how reliable are the metrics we use to evaluate them?

In this context, BERTScore has emerged as a significant alternative to traditional evaluation metrics.

What are Text Summarization and BERTScore?

Text summarization is condensing a lengthy text into a shorter and more concise version. This process highlights the text's key points and makes it easier for the reader to understand quickly.

BERTScore is a method used to measure the quality of text summarization. It measures how similar a candidate summary is to a reference summary.

BERTScore addresses two common weaknesses of n-gram-based metrics. First, n-gram metrics often fail to match paraphrases: a semantically correct expression may differ from the surface form of the reference text, which leads to an incorrect performance estimate. BERTScore, on the other hand, computes similarity using contextualized token embeddings, which have been shown to be effective for entailment detection. Second, n-gram metrics cannot capture long-range dependencies, so they fail to adequately penalize semantically significant reordering (for example, swapping cause and effect).
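To make the first issue concrete, here is a minimal sketch of a unigram-overlap F1. It is a simplified stand-in for n-gram metrics, not the official implementation of any of them. The function name unigram_f1 and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the weather is cold today"
paraphrase = "it is freezing outside this morning"        # same meaning, different words
surface_overlap = "the weather report is due today"       # different meaning, shared words

paraphrase_score = unigram_f1(paraphrase, reference)
overlap_score = unigram_f1(surface_overlap, reference)
print(f"paraphrase: {paraphrase_score:.2f}, surface overlap: {overlap_score:.2f}")
```

The paraphrase shares only one word with the reference, so it scores far lower than the sentence that merely reuses the same words. This is the kind of mismatch that embedding-based matching is meant to reduce.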

BERTScore Architecture

Figure: Summary of the steps for calculating BERTScore. Source: https://arxiv.org/pdf/1904.09675.pdf

Step 1: Contextual Embeddings: Reference and candidate sentences are represented using contextual embeddings, in which each token's vector depends on its surrounding words. These embeddings are computed by models such as BERT, RoBERTa, XLNet, and XLM.

Step 2: Cosine Similarity: The similarity between contextual embeddings of reference and candidate sentences is measured using cosine similarity.

Step 3: Token Matching for Precision and Recall: Each token in the candidate sentence is matched to the most similar token in the reference sentence to compute Precision, and each token in the reference sentence is matched to the most similar token in the candidate sentence to compute Recall. The two are then combined into the F1 score.

Step 4: Importance Weighting: Rare words can be more indicative of similarity than common words, so tokens can be weighted by Inverse Document Frequency (IDF) in the BERTScore equations. This weighting is optional, and its benefit is domain-dependent.

Step 5: Baseline Rescaling: BERTScore values are linearly rescaled to improve human readability. The rescaling uses an empirical baseline computed from Common Crawl monolingual datasets, so that scores spread over a more intuitive range.
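Steps 2 to 5 above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the bert-score package's implementation: the embeddings are made-up 2-dimensional vectors instead of real model outputs, the IDF weights are supplied by the caller rather than computed from a corpus, the baseline value is hypothetical, and the function names are chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Step 2: cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def bertscore_from_embeddings(ref_vecs, cand_vecs, ref_idf=None, cand_idf=None):
    """Steps 3-4: greedy matching with optional IDF weights (uniform if omitted)."""
    ref_idf = ref_idf or [1.0] * len(ref_vecs)
    cand_idf = cand_idf or [1.0] * len(cand_vecs)
    # sim[i][j]: similarity between reference token i and candidate token j.
    sim = [[cosine_similarity(r, c) for c in cand_vecs] for r in ref_vecs]
    # Recall: each reference token is matched to its most similar candidate token.
    best_for_ref = [max(row) for row in sim]
    # Precision: each candidate token is matched to its most similar reference token.
    best_for_cand = [max(col) for col in zip(*sim)]
    recall = sum(w * s for w, s in zip(ref_idf, best_for_ref)) / sum(ref_idf)
    precision = sum(w * s for w, s in zip(cand_idf, best_for_cand)) / sum(cand_idf)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def rescale_with_baseline(score, baseline):
    """Step 5: linear rescaling so the baseline maps to 0 and a perfect score stays 1."""
    return (score - baseline) / (1.0 - baseline)


# Made-up 2-dimensional "embeddings"; real BERT-base vectors have 768 dimensions.
reference_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

precision, recall, f1 = bertscore_from_embeddings(reference_vectors, candidate_vectors)
print(f"P={precision:.4f}, R={recall:.4f}, F1={f1:.4f}")

# Down-weight the first reference token, as IDF would for a very common word.
_, weighted_recall, _ = bertscore_from_embeddings(
    reference_vectors, candidate_vectors[1:], ref_idf=[0.2, 1.0]
)
print(f"IDF-weighted recall: {weighted_recall:.4f}")

# The baseline here is a hypothetical value, not a published number.
rescaled_f1 = rescale_with_baseline(f1, baseline=0.8)
print(f"Rescaled F1: {rescaled_f1:.4f}")
```

Every reference token has an exact match in the candidate, so recall is 1, while the extra candidate token with no exact counterpart pulls precision below 1. Rescaling changes only the numeric range, not the ranking of candidates.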

How does BERTScore work?

To calculate BERTScore, you can use the bert-score package, which builds on the Hugging Face Transformers library. First, you’ll need to install both:

!pip install transformers  # In Colab, the leading "!" is required to run shell commands
!pip install bert-score

Here is a basic example.

from bert_score import BERTScorer

# Example texts
reference = "This is a reference text example."
candidate = "This is a candidate text example."

# BERTScore calculation: score() takes a list of candidates and a list of references
scorer = BERTScorer(model_type='bert-base-uncased')
P, R, F1 = scorer.score([candidate], [reference])
print(f"BERTScore Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

### Output: BERTScore Precision: 0.9258, Recall: 0.9258, F1: 0.9258

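Here is an intermediate-level example. Note that it compares mean-pooled sentence embeddings with cosine similarity, which is a simpler sentence-level measure than the token-level matching used by BERTScore.

With the libraries installed, let’s start: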

# Step 1: Import the required libraries
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Step 2: Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Step 3: Define the two texts to compare
text1 = "This is an example text."
text2 = "This text contains an example sentence."

# Step 4: Prepare the texts for BERT
inputs1 = tokenizer(text1, return_tensors="pt", padding=True, truncation=True)
inputs2 = tokenizer(text2, return_tensors="pt", padding=True, truncation=True)

# Step 5: Feed the texts to the BERT model (gradients are not needed for inference)
model.eval()
with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# Step 6: Obtain the representation vectors by mean-pooling the token embeddings
embeddings1 = outputs1.last_hidden_state.mean(dim=1).numpy()
embeddings2 = outputs2.last_hidden_state.mean(dim=1).numpy()

# Step 7: Calculate cosine similarity
similarity = np.dot(embeddings1, embeddings2.T) / (np.linalg.norm(embeddings1) * np.linalg.norm(embeddings2))

# Step 8: Print the result
print("Similarity between the texts: {:.4f}".format(similarity[0][0]))

### Output: Similarity between the texts: 0.9000

Conclusion

BERTScore improves on surface-level measures of text similarity. Because it is based on contextual embeddings from models such as BERT, it captures the meaning of the text better and produces more meaningful similarity scores. Combining Precision and Recall into an F1 score gives a more balanced measure of similarity. This is a significant advantage for many Natural Language Processing (NLP) tasks.

BERTScore can be applied in various domains, including text summarization, translation quality assessment, text generation, and document comparison. This metric enables better comparisons of texts, potentially improving user experiences in translation services, news agencies, and information processing.

The future potential of BERTScore is quite exciting. This metric can contribute to the ongoing development of the field of natural language processing. Anticipated improvements include broader language coverage, adaptation for multilingual texts, and enhancements for better performance on diverse text types.

Furthermore, BERTScore’s ability to measure semantic similarity between texts can also be adapted to different NLP tasks such as text classification, sentiment analysis, and recommendation systems. This demonstrates that BERTScore has significant potential for a wider range of applications.
