We live in a fast-paced, over-communicated society, where extracting the relevant information from a flood of text is difficult. Here extractive summarization comes to our rescue: it selects the key sentences in a document and presents a snapshot of the relevant points, making bulky documents easier to understand without reading each and every word.
In this write-up, we shall examine the basics of extractive summarization with large language models. The model we discuss builds on BERT, or Bidirectional Encoder Representations from Transformers.
Extractive summarization is a technique in natural language processing (NLP) and text analysis: key sentences or phrases are selected from the original text and presented as a concise summary. It involves sifting through the text to identify its crucial elements, ideas, and arguments.
Whereas abstractive summarization carves out new sentences, extractive summarization stays faithful to the original text: there is no paraphrasing or alteration, and the original wording and structure are preserved as far as possible. This makes it a useful technique where accuracy is a desired goal, since even the intent of the author is not disturbed.
It is commonly used to summarize articles, research papers, and reports, and it shows high fidelity to the original, since paraphrasing may introduce a bias.
A typical extractive summarization procedure has the following components.
Text parsing: the text is split up into sentences and phrases, the basic units. The text is dissected to understand its structure and parts.
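As a minimal sketch of this parsing step, a naive sentence splitter can be written with the standard library alone. The function name and sample text below are illustrative; a production system would use a trained tokenizer such as NLTK's punkt instead.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # Abbreviations like "Dr." will be mis-split; a trained sentence
    # tokenizer handles such cases properly.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = ("Extractive summarization selects key sentences. "
       "It keeps the original wording. No new sentences are written!")
print(split_sentences(doc))
```

The lookbehind `(?<=[.!?])` keeps the punctuation attached to each sentence rather than discarding it at the split point.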
Feature extraction: the algorithm analyzes features or characteristics of the text that indicate each unit's significance in the overall document. Repetition and frequency of words and phrases are common features, as frequent terms may be central to the theme.
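A minimal frequency-based feature extractor might look like the following sketch. The stopword list is a small hand-written illustration, not an exhaustive one, and word frequency is only one of many possible features.

```python
import re
from collections import Counter

# Illustrative, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "it", "and", "to", "are", "no"}

def word_frequencies(text):
    # Count content words and normalize by the most frequent one,
    # so every feature value falls in (0, 1].
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    peak = max(counts.values())
    return {w: c / peak for w, c in counts.items()}

doc = "Summarization selects sentences. Summarization keeps the original sentences."
print(word_frequencies(doc))
```

Normalizing by the peak count makes the feature comparable across documents of different lengths.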
Sentence scoring: each sentence is scored based on its content to reflect its perceived importance. The higher the score, the more significant the sentence.
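One simple scoring rule, sketched below, averages the feature values of the words in a sentence; the frequency table here is a toy example standing in for the output of the feature-extraction step.

```python
import re

def score_sentence(sentence, freqs):
    # Average word-level feature values; averaging (rather than summing)
    # avoids rewarding long sentences for their length alone.
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(freqs.get(w, 0.0) for w in words) / len(words)

# Toy frequency table for illustration.
freqs = {"summarization": 1.0, "sentences": 1.0, "selects": 0.5}
print(score_sentence("Summarization selects sentences.", freqs))
```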
Sentence selection: ultimately, the highest-scoring sentences are selected and aggregated into the summary.
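The steps above can be combined into a small end-to-end sketch. The stopword list and the frequency-based scorer are illustrative stand-ins; a BERT-based system would instead score sentences from their embeddings, but the parse-score-select skeleton stays the same.

```python
import re
from collections import Counter

# Illustrative, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "it", "and", "to", "are", "was"}

def summarize(text, k=2):
    # 1. Parse: split into sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # 2. Extract features: normalized content-word frequencies.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    peak = max(counts.values())
    freqs = {w: c / peak for w, c in counts.items()}

    # 3. Score: average feature value per sentence.
    def score(s):
        ws = re.findall(r"[a-z']+", s.lower())
        return sum(freqs.get(w, 0.0) for w in ws) / len(ws) if ws else 0.0

    # 4. Select: keep the top-k sentences, then restore original
    #    document order so the summary reads coherently.
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)

doc = ("Summarization selects key sentences. "
       "The weather was pleasant yesterday. "
       "Key sentences capture the key ideas of summarization.")
print(summarize(doc))
```

Re-sorting the selected sentences back into document order is a common final touch: the highest-scoring sentences may come from anywhere in the text, but the summary should still flow top to bottom.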