Natural Language Processing Techniques

As we have seen, LLMs are central to natural language processing. Text is linguistic data, and before it can be analysed it is almost always pre-processed using a number of techniques.

Tokenization

A token is a segment of text, typically a word. Tokenization is the vital first step of dividing text into tokens: lengthy strings are dissected into smaller, more manageable and meaningful units. Tokens are the building blocks of NLP and provide a structured framework for everything that follows.
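A minimal sketch of this step, assuming a simple regular-expression tokenizer (real tokenizers, including the subword tokenizers used by LLMs, are considerably more sophisticated):

```python
import re

def tokenize(text):
    # Lowercase, then pull out runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokens are the building blocks of NLP.")
print(tokens)  # ['tokens', 'are', 'the', 'building', 'blocks', 'of', 'nlp']
```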

Stemming and Lemmatization

After tokenization come stemming and lemmatization. These processes distill the root form of words from their morphological variations. To illustrate, 'stick' appears in various forms: stuck, sticking, sticks, unstuck. Stemming simply strips prefixes and suffixes. Lemmatization also reduces a word to its root form (commonly called the lemma), but it surpasses the limitations of stemming by identifying the actual dictionary root. Consider 'change': its forms include changing, changes, changed and changer. Stemming gives us 'chang', whereas lemmatization leads us to 'change'.
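The 'change' example above can be sketched with a crude suffix-stripping stemmer and a toy lemma lookup (the suffix rules and lemma table here are illustrative assumptions; real systems use algorithms like Porter's stemmer and dictionary-backed lemmatizers):

```python
def stem(word):
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in ("ing", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma lookup; real lemmatizers use a dictionary plus part-of-speech context.
LEMMAS = {"changing": "change", "changes": "change",
          "changed": "change", "changer": "change"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["changing", "changes", "changed", "changer"]:
    print(w, "->", stem(w), "/", lemmatize(w))
    # Every form stems to 'chang' but lemmatizes to 'change'.
```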

Morphological Segmentation

Some words are monomorphemic, such as 'table' and 'lamp', consisting of a single morpheme. Other words have more than one morpheme: 'sunrise' has two, 'sun' and 'rise'. Combining these morphemes leads to a holistic understanding of the word's meaning.

Unachievability has four morphemes: 'un', 'achiev', 'abil' and 'ity'.

Morphological segmentation prepares the text for subsequent analysis.
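A toy segmenter for the examples above, assuming a small hand-built morpheme inventory (hypothetical; real morphological segmenters learn segmentations from data):

```python
# Hand-built morpheme inventory covering the examples in the text.
MORPHEMES = {"un", "achiev", "abil", "ity", "sun", "rise", "table", "lamp"}

def segment(word, parts=()):
    # Recursively match known morphemes as prefixes, longest first.
    if not word:
        return list(parts)
    for i in range(len(word), 0, -1):
        if word[:i] in MORPHEMES:
            rest = segment(word[i:], parts + (word[:i],))
            if rest is not None:
                return rest
    return None  # no full segmentation found

print(segment("unachievability"))  # ['un', 'achiev', 'abil', 'ity']
print(segment("sunrise"))          # ['sun', 'rise']
```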

Stop Words Removal

This is a crucial pre-processing step. Here we eliminate extraneous linguistic elements that do not contribute much to the meaning of the text. Such words include 'and', 'because', 'under' and 'in'. These are filler words.

Consider the tagline 'Marketingganga, a marketing portal for the market savvy'. This contains stop words. Without them, it would read: Marketingganga, marketing, portal, market, savvy.

Similarly, 'I like reading, so I read' contains stop words. Remove them and it reads: like, reading, read.
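The second example can be reproduced with a small stop-word set (a toy list for illustration; libraries such as NLTK ship curated stop-word lists with hundreds of entries):

```python
# A tiny illustrative stop-word list.
STOP_WORDS = {"a", "and", "because", "under", "in", "i", "so", "the", "for"}

def remove_stops(text):
    # Lowercase, drop commas, then filter out stop words.
    words = text.lower().replace(",", "").split()
    return [w for w in words if w not in STOP_WORDS]

print(remove_stops("I like reading, so I read"))  # ['like', 'reading', 'read']
```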

Text Classification

Text classification covers a number of techniques for organizing vast quantities of unprocessed textual data. The ultimate aim is to convert unstructured data into a structured format.
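A minimal sketch of turning unstructured text into a structured label, assuming a hypothetical rule-based classifier with made-up categories and keywords (production systems usually learn these mappings from labelled training data):

```python
# Hypothetical categories and keyword lists for illustration only.
RULES = {
    "sports":  {"match", "goal", "team"},
    "finance": {"stock", "market", "price"},
}

def classify(text):
    words = set(text.lower().split())
    # Pick the label whose keyword set overlaps the text the most.
    scores = {label: len(words & kws) for label, kws in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("the team scored a late goal"))  # sports
```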

Sentiment Analysis

Sentiment analysis is also called emotion AI or opinion mining. It examines user-generated content for the attitudes and emotions it expresses, and can be leveraged to address evolving customer needs and enhance the consumer experience.
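A lexicon-based sketch of opinion mining, assuming tiny hand-picked word lists (real systems use large sentiment lexicons or trained models):

```python
# Toy sentiment word lists for illustration.
POSITIVE = {"great", "love", "excellent", "enhance"}
NEGATIVE = {"poor", "hate", "terrible", "slow"}

def sentiment(text):
    words = text.lower().split()
    # Net count of positive minus negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this excellent product"))  # positive
print(sentiment("terrible and slow service"))      # negative
```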

Topic Modelling

Here the underlying themes and topics of the text are identified. Topic modelling operates as an unsupervised machine-learning process: the topics within the corpus are identified and categorized, and the essential keywords can be extracted while sifting through the documents. It identifies subjects of interest within a textual dataset.
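As a very rough proxy for the keyword-extraction step, the most frequent content words of a document can hint at its topic (a simplistic sketch; proper topic models such as LDA infer topics statistically across the whole corpus):

```python
from collections import Counter

STOPS = {"the", "a", "of", "and", "in", "is", "to"}

def top_keywords(doc, k=3):
    # Surface the most frequent non-stop words as rough topic indicators.
    words = [w for w in doc.lower().split() if w not in STOPS]
    return [w for w, _ in Counter(words).most_common(k)]

doc = "the market report covers market trends and market prices in detail"
print(top_keywords(doc))  # 'market' dominates, hinting at a finance topic
```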

Text Summarization

Here the text is condensed into a cohesive summary. It is either extraction-based (selecting key sentences from the original verbatim) or abstraction-based (generating new sentences that paraphrase the original).
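A sketch of the extraction-based approach, assuming word frequency as the sentence-scoring signal (one common heuristic among many; abstraction-based summarization requires a generative model):

```python
from collections import Counter

def summarize(text, n=1):
    # Score each sentence by the corpus-wide frequency of its words,
    # then return the n highest-scoring sentences verbatim.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(text.lower().replace(".", "").split())
    def score(sentence):
        return sum(freqs[w] for w in sentence.lower().split())
    best = sorted(sentences, key=score, reverse=True)[:n]
    return ". ".join(best) + "."

text = ("Tokenization splits text into tokens. "
        "Tokens are units of text. "
        "Summaries condense text.")
print(summarize(text))
```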

Parsing

Parsing unravels the grammatical framework of a sentence. Within parsing we come across Named Entity Recognition (NER), which extracts information identifying 'named entities' such as people, places and organizations. Here it uses pre-defined keywords.
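The pre-defined-keyword flavour of NER can be sketched with a small gazetteer (the entity table here is a hypothetical example; modern NER models use sentence context rather than fixed lists):

```python
# Hypothetical gazetteer mapping surface forms to entity types.
ENTITIES = {
    "alice": "PERSON",
    "paris": "LOCATION",
    "google": "ORGANIZATION",
}

def recognize(text):
    # Tag each word that matches a known entity keyword.
    found = []
    for word in text.lower().replace(",", "").split():
        if word in ENTITIES:
            found.append((word, ENTITIES[word]))
    return found

print(recognize("Alice moved to Paris to work at Google"))
# [('alice', 'PERSON'), ('paris', 'LOCATION'), ('google', 'ORGANIZATION')]
```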

Then there is TF-IDF, an acronym for term frequency-inverse document frequency. It is a statistical methodology that assesses the significance of a word within a document relative to a collection of documents. A word pervasive across all the documents attracts a lower score, even though its occurrence is frequent.
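The definition above can be computed directly (using the common tf × log(N/df) weighting; several variant formulas exist). Note how 'the', present in every document, scores zero despite being frequent:

```python
import math

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # IDF: log of (number of documents / documents containing the term).
    df = sum(term in d for d in corpus)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
# 'the' appears in every document, so its IDF (and score) is zero.
print(tf_idf("the", corpus[0], corpus))      # 0.0
print(tf_idf("cat", corpus[0], corpus) > 0)  # True
```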

