As we have seen, LLMs are built for natural language processing. Text is linguistic data, and it almost always goes through a pre-processing stage that draws on a number of techniques.
Tokenization
A token is a segment of text, typically a word or word piece. Tokenization is the vital step of dividing text into tokens: lengthy strings are dissected into smaller, more manageable and meaningful units. Tokens are the building blocks of NLP and give the text a structured framework.
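A minimal sketch of word-level tokenization using NLTK (the punkt tokenizer data is assumed to be downloaded; a plain split() would give a rougher approximation):

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data, fetched once

text = "Lengthy strings of text are dissected into more manageable units."
tokens = nltk.word_tokenize(text)
print(tokens)
# e.g. ['Lengthy', 'strings', 'of', 'text', 'are', 'dissected', 'into', 'more', 'manageable', 'units', '.']
```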
Stemming and lemmatization
After tokenization come stemming and lemmatization. These processes distill the root form of a word from its morphological variations. To illustrate, 'stick' can appear in various forms: stuck, sticking, sticks, unstuck. Stemming simply chops off prefixes and suffixes. Lemmatization goes further and maps each form to its dictionary root (commonly called the lemma), surpassing the limitations of stemming. Compare the two on 'change', whose variants include changing, changes, changed and changer: stemming gives us 'chang', whereas lemmatization leads us to 'change'.
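A short comparison using NLTK's Porter stemmer and WordNet lemmatizer (the wordnet corpus is assumed to be downloaded, and the lemmatizer is given a part-of-speech hint to treat the words as verbs):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["changing", "changes", "changed"]:
    # Stemming chops suffixes: all three collapse to 'chang'.
    # Lemmatization (pos='v' marks a verb) returns the dictionary form 'change'.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```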
Morphological Segmentation
Some words are monomorphemic, say 'table' or 'lamp', consisting of a single morpheme. Others contain more than one: 'sunrise' has the two morphemes 'sun' and 'rise', and the fusion of the two yields the word's overall meaning.
'Unachievability' has four morphemes: 'un', 'achiev', 'abil' and 'ity'.
Morphological segmentation prepares the text for subsequent analysis.
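A toy, rule-based sketch of morphological segmentation (the affix lists and the greedy peeling logic here are illustrative assumptions, not a real morphological analyser, which would handle far more cases):

```python
PREFIXES = ["un", "re", "dis"]          # illustrative, not exhaustive
SUFFIXES = ["ity", "abil", "ing", "s"]  # illustrative, not exhaustive

def segment(word):
    """Greedily peel known prefixes and suffixes off a word."""
    morphemes, core = [], word.lower()
    for p in PREFIXES:
        if core.startswith(p):
            morphemes.append(p)
            core = core[len(p):]
            break
    tail = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if core.endswith(s) and len(core) > len(s) + 2:
                tail.insert(0, s)
                core = core[:-len(s)]
                changed = True
                break
    return morphemes + [core] + tail

print(segment("unachievability"))  # -> ['un', 'achiev', 'abil', 'ity']
```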
Stop Words Removal
This is a crucial pre-processing step. Here we eliminate extraneous linguistic elements that do not contribute much to the meaning of the text, such as 'and', 'because', 'under' and 'in'. These are filler words.
'Marketingganga, a marketing portal for market savvy' contains stop words. Without them, it would read: Marketingganga, marketing, portal, market, savvy.
'I like reading, so I read' is another sentence with stop words. Remove them and it reads: like, reading, read.
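A brief sketch using NLTK's English stop word list (the stopwords and punkt data are assumed to be downloaded; exactly which tokens survive depends on the stop word list used):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # standard English stop word list
nltk.download("punkt", quiet=True)      # tokenizer data

stop_words = set(stopwords.words("english"))

sentence = "I like reading, so I read."
tokens = nltk.word_tokenize(sentence)

# Keep only alphabetic tokens that are not stop words.
content = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(content)  # e.g. ['like', 'reading', 'read']
```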
Text Classification
Text classification brings together a number of techniques for organizing vast quantities of raw, unprocessed text.
The ultimate aim is to convert unstructured data into a structured format.
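A minimal sketch of supervised text classification with scikit-learn (the tiny training set and its labels are made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: short texts with made-up labels.
texts = [
    "great product, fast delivery",
    "terrible support, very slow",
    "love the new features",
    "refund took weeks, awful experience",
]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["slow delivery and awful support"]))  # likely ['negative']
```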
Sentiment Analysis
It is also called emotion AI or opinion mining. It examines user-generated content to gauge the opinions and emotions expressed in it, and it can be leveraged to address evolving needs and enhance the consumer experience.
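A quick sketch using NLTK's VADER sentiment analyser (the vader_lexicon data is assumed to be downloaded; the compound score ranges from -1, most negative, to +1, most positive):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon used by VADER

sia = SentimentIntensityAnalyzer()
review = "The checkout was quick, but the delivery was disappointingly slow."
print(sia.polarity_scores(review))
# e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```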
Topic Modelling
Here the underlying themes and topics of a text are identified. Topic modelling typically operates as an unsupervised ML process: the topics within a corpus are identified and categorized, and the essential keywords can be extracted while sifting through the documents. In short, it identifies subjects of interest within a textual dataset.
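A compact sketch using Latent Dirichlet Allocation from scikit-learn, one common topic-modelling algorithm (the toy corpus and the choice of two topics are illustrative assumptions):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two rough themes (cooking vs. finance).
docs = [
    "bake the bread with flour and yeast",
    "stocks and bonds in a balanced portfolio",
    "knead the dough and bake at high heat",
    "interest rates affect bond and stock prices",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per discovered topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}:", top)
```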
Text Summarization
Here the text is condensed into a cohesive summary. Summarization is either extraction-based, where key sentences are selected verbatim from the source, or abstraction-based, where new sentences are generated to convey the gist.
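A naive sketch of extraction-based summarization that scores sentences by word frequency (a deliberately simple heuristic, not a production summarizer):

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Pick the sentence(s) whose words are most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    return " ".join(scored[:n_sentences])

doc = (
    "Tokenization splits text into tokens. "
    "Stop word removal drops filler words. "
    "Tokenization, stemming and stop word removal prepare text for analysis."
)
print(extractive_summary(doc))
```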
Parsing
Parsing unravels the grammatical structure of a sentence. Within parsing we also come across Named Entity Recognition (NER), which extracts the pieces of information that identify 'named entities' and assigns them to pre-defined categories such as people, organizations and locations.
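A short sketch of NER with spaCy (the small English model en_core_web_sm is assumed to be installed beforehand via `python -m spacy download en_core_web_sm`):

```python
import spacy

# Small English pipeline; assumed to be downloaded beforehand.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple opened a new office in Berlin in March 2024.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Apple -> ORG, Berlin -> GPE, March 2024 -> DATE
```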
Then there is TF-IDF, an acronym for term frequency-inverse document frequency. It is a statistical methodology that assesses the significance of a word within a document relative to a collection of documents. A word that is pervasive across all documents attracts a lower score, even though it occurs frequently.
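A brief sketch with scikit-learn's TfidfVectorizer (the three short documents are illustrative; note how 'data', which appears in every document, receives the lowest inverse document frequency):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; 'data' appears in every document.
docs = [
    "data cleaning removes noise from data",
    "data visualisation reveals patterns",
    "data pipelines feed machine learning models",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Inverse document frequency per term: a word present in every document
# ('data') gets the lowest idf; rarer words get higher values.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:15s} {idf:.3f}")
```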