Text Summarization: How to Calculate BertScore

Hatice Özbolat
5 min read · Sep 28, 2023
Image: Tima Miroshnichenko, Pexels

The development of machine learning has led to rapid growth in fields such as Natural Language Processing (NLP) and Large Language Models (LLMs). With this progress, a new question has emerged: how reliable are the metrics we use to evaluate these systems?

In this context, BERTScore has emerged as a significant alternative to traditional evaluation metrics.

What are Text Summarization and BertScore?


Text summarization is the process of condensing a lengthy text into a shorter, more concise version. It highlights the text's key points so that the reader can grasp them quickly.

BERTScore is a method used to measure the quality of generated text, including summaries. It measures how semantically similar a candidate text (for example, a generated summary) is to a reference text (for example, a human-written summary).

BERTScore addresses two common issues that n-gram-based metrics often encounter. First, n-gram metrics often fail to match paraphrases, because a semantically correct expression may differ from the surface form of the reference text, and this leads to underestimated performance. BERTScore, on the other hand, computes similarity using contextualized token embeddings, which have been shown to be effective for entailment detection. Second, n-gram metrics cannot capture long-range dependencies, and they penalize changes in word order that matter semantically.
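To make the first issue concrete, here is a minimal sketch of a clipped n-gram precision, similar in spirit to what BLEU-style metrics compute. The helper names and the two example sentences are made up for illustration and are not taken from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips repeated n-grams
    total = sum(cand.values())
    return overlap / total if total else 0.0

# Illustrative sentences: same meaning, different surface form
reference = "the weather is cold today"
paraphrase = "it is freezing today"

unigram_p = ngram_precision(paraphrase, reference, 1)
bigram_p = ngram_precision(paraphrase, reference, 2)
print(f"unigram precision: {unigram_p:.2f}, bigram precision: {bigram_p:.2f}")
```

Although the two sentences mean nearly the same thing, only half of the candidate's words match and no bigram matches at all. An embedding-based metric can still give "cold" and "freezing" a high similarity.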

BERTScore Architecture

Figure: Summary of the steps for calculating BERTScore
Source: https://arxiv.org/pdf/1904.09675.pdf

Step 1: Contextual Embeddings: Each token in the reference and candidate sentences is represented by a contextual embedding, a vector that depends on the surrounding words, computed by models such as BERT, RoBERTa, XLNet, and XLM.

Step 2: Cosine Similarity: The similarity between contextual embeddings of reference and candidate sentences is measured using cosine similarity.
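As a small illustration of this step, cosine similarity can be computed in plain Python as below. The two-dimensional vectors are toy values chosen for readability; real contextual embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(f"{same_direction:.2f} {orthogonal:.2f}")
```

Because the value depends only on direction and not on vector length, two tokens with embeddings pointing the same way score close to 1 regardless of magnitude.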

Step 3: Token Matching for Precision and Recall: Each token in the candidate sentence is matched to its most similar token in the reference sentence to compute Precision, and each token in the reference sentence is matched to its most similar token in the candidate sentence to compute Recall. The two are then combined into the F1 score.
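The greedy matching in Step 3 can be sketched as follows, assuming the pairwise cosine similarities are already available as a matrix. The function name and the similarity values are hypothetical, picked only to keep the arithmetic easy to follow.

```python
def greedy_match_scores(sim):
    """Compute (precision, recall, f1) from a similarity matrix.

    sim[i][j] is the cosine similarity between reference token i
    and candidate token j.
    """
    n_ref, n_cand = len(sim), len(sim[0])
    # Recall: each reference token takes its best match among the candidate tokens
    recall = sum(max(row) for row in sim) / n_ref
    # Precision: each candidate token takes its best match among the reference tokens
    precision = sum(max(sim[i][j] for i in range(n_ref)) for j in range(n_cand)) / n_cand
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical similarities: 2 reference tokens (rows) x 3 candidate tokens (columns)
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
]
precision, recall, f1 = greedy_match_scores(sim)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Here the third candidate token has no good counterpart in the reference, so it lowers Precision while Recall stays high.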

Step 4: Importance Weighting: Rare words can be more indicative of sentence similarity than common words, so each token can be weighted by its Inverse Document Frequency (IDF) in the BERTScore equations. This weighting is optional, and its benefit depends on the domain.
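A rough sketch of IDF-weighted Recall is shown below. It assumes a plus-one smoothed IDF computed over a small corpus of tokenized reference sentences; the exact formula used by a given BERTScore implementation may differ, and the corpus and per-token similarities are invented for the example.

```python
import math

def idf_weights(reference_corpus):
    """Smoothed IDF per token over a list of tokenized reference sentences."""
    m = len(reference_corpus)
    doc_freq = {}
    for sentence in reference_corpus:
        for token in set(sentence):  # count each token once per sentence
            doc_freq[token] = doc_freq.get(token, 0) + 1
    # Plus-one smoothing (an assumption of this sketch)
    return {tok: math.log((m + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_recall(ref_tokens, best_sims, idf):
    """IDF-weighted average of each reference token's best-match similarity."""
    weights = [idf[tok] for tok in ref_tokens]
    return sum(w * s for w, s in zip(weights, best_sims)) / sum(weights)

# Invented corpus and similarities
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
idf = idf_weights(corpus)

ref_tokens = ["the", "cat", "sat"]
best_sims = [1.0, 0.5, 0.7]  # hypothetical best cosine match per reference token

plain = sum(best_sims) / len(best_sims)
weighted = weighted_recall(ref_tokens, best_sims, idf)
print(f"unweighted recall: {plain:.4f}, IDF-weighted recall: {weighted:.4f}")
```

The word "the" appears in every reference sentence, so its weight drops to zero and its perfect match no longer inflates the score.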

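Step 5: Baseline Rescaling: Raw BERTScore values tend to occupy a narrow range, so they can be linearly rescaled to make them easier to read. The rescaling uses a baseline value computed from Common Crawl monolingual datasets, and it spreads the scores over a more intuitive range without changing their ranking.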
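The rescaling is a simple linear map, (score - b) / (1 - b), where b is the baseline. The snippet below is a minimal sketch of that formula; the baseline of 0.84 is a hypothetical number used for illustration, not a published value.

```python
def rescale_with_baseline(score, baseline):
    """Linearly map `baseline` to 0 while keeping a perfect score of 1 at 1."""
    return (score - baseline) / (1.0 - baseline)

baseline = 0.84   # hypothetical baseline, for illustration only
raw_score = 0.92  # hypothetical raw BERTScore value

rescaled = rescale_with_baseline(raw_score, baseline)
print(f"raw: {raw_score:.2f} -> rescaled: {rescaled:.2f}")
```

Since the map is linear and increasing, it does not change how candidates rank against each other. A raw score below the baseline becomes negative after rescaling.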

How does BERTScore work?

To calculate BERTScore, you can use the bert-score package, which builds on the Hugging Face Transformers library. First, install both libraries:

!pip install transformers # In Colab or a Jupyter notebook, the leading "!" runs the command in the shell
!pip install bert-score

Here is a basic example.

from bert_score import BERTScorer

# Example texts
reference = "This is a reference text example."
candidate = "This is a candidate text example."

# BERTScore calculation: score() takes a list of candidates and a list of references
scorer = BERTScorer(model_type='bert-base-uncased')
P, R, F1 = scorer.score([candidate], [reference])
print(f"BERTScore Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

### Outputs : BERTScore Precision: 0.9258, Recall: 0.9258, F1: 0.9258

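The next example is at an intermediate level. Note that it does not compute BERTScore itself: it averages the BERT token vectors of each text into a single vector and compares the two vectors with cosine similarity.

With the libraries installed, let's start: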

# Step 1: Import the required libraries
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

# Step 2: Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Step 3: Define the two texts to compare
text1 = "This is an example text."
text2 = "This text contains an example sentence."

# Step 4: Prepare the texts for BERT
inputs1 = tokenizer(text1, return_tensors="pt", padding=True, truncation=True)
inputs2 = tokenizer(text2, return_tensors="pt", padding=True, truncation=True)

# Step 5: Feed the texts to the BERT model (no gradients are needed for inference)
with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# Step 6: Mean-pool the token vectors into one representation vector per text
embeddings1 = outputs1.last_hidden_state.mean(dim=1).numpy()
embeddings2 = outputs2.last_hidden_state.mean(dim=1).numpy()

# Step 7: Calculate cosine similarity
similarity = np.dot(embeddings1, embeddings2.T) / (np.linalg.norm(embeddings1) * np.linalg.norm(embeddings2))

# Step 8: Print the result
print("Similarity between the texts: {:.4f}".format(similarity[0][0]))

### Output: Similarity between the texts: 0.9000


BERTScore improves on surface-level measures of text similarity. Because it is built on contextual embeddings from the BERT model, it captures meaning rather than exact word overlap and produces more meaningful similarity scores. Combining Precision and Recall into an F1 score gives a more balanced picture than either value alone, which is a significant advantage for many Natural Language Processing (NLP) tasks.

BERTScore can be applied in various domains, including text summarization, translation quality assessment, text generation, and document comparison. This metric enables better comparisons of texts, potentially improving user experiences in translation services, news agencies, and information processing.

The future potential of BERTScore is quite exciting. This metric can contribute to the ongoing development of the field of natural language processing. Anticipated improvements include broader language coverage, adaptation for multilingual texts, and enhancements for better performance on diverse text types.

Furthermore, BERTScore’s ability to measure semantic similarity between texts can also be adapted to different NLP tasks such as text classification, sentiment analysis, and recommendation systems. This demonstrates that BERTScore has significant potential for a wider range of applications.


References

  1. BERTScore: Evaluating Text Generation with BERT
  2. Text Summarization Using Python and NLTK
  3. Machine Translation Evaluation with sacreBLEU and BERTScore
  4. Calculating Sentence Similarity using Bert Model
  5. BLEU-BERT-y: Comparing sentence scores

If you have read my article this far and want me to share similar content, do not forget to like it and leave a comment ✨😍