We shall try to understand the key terms associated with large language models.
LLM – Large Language Model: It is a neural network, also called a foundation model, that understands and generates human-like, contextually relevant text. Examples of LLMs include the GPT series, Gemini, Claude and Llama.
Training: An LLM is trained on a vast corpus of text. The model learns to predict the next word in a sequence, and its accuracy improves as its parameters are adjusted to reduce the prediction error.
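The core of this process can be illustrated with a toy next-word prediction loop. The sketch below assumes PyTorch is available; the tiny model, vocabulary size and token ids are hypothetical placeholders, not a real LLM.

```python
# A minimal sketch of next-word prediction training (assumes PyTorch).
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),   # produces a score for every word in the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy data: each current token (input) is paired with the token that follows it (target).
inputs = torch.tensor([1, 5, 7, 2])
targets = torch.tensor([5, 7, 2, 9])

for step in range(100):
    logits = model(inputs)              # shape (4, vocab_size)
    loss = loss_fn(logits, targets)     # penalty for wrong next-word predictions
    optimizer.zero_grad()
    loss.backward()                     # compute gradients
    optimizer.step()                    # adjust parameters to reduce the loss
```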
Fine-tuning: A pre-trained model can perform a broad range of tasks. It is fine-tuned to perform specific tasks or to operate in a specific domain by training it further on specialized data that was not covered in the original training data.
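A minimal fine-tuning sketch is shown below, assuming the Hugging Face transformers library and PyTorch are installed. The model name "gpt2" and the domain sentences are examples only; a real fine-tuning run would use batching, many more examples and several epochs.

```python
# Continue training a pre-trained model on a small amount of domain text (a sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain-specific sentences (e.g. clinical text).
domain_texts = [
    "The patient presented with acute myocardial infarction.",
    "An ECG was ordered to rule out arrhythmia.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # small learning rate for fine-tuning

model.train()
for text in domain_texts:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])  # next-token loss on the domain data
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```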
Parameter: A parameter is a variable part of the model’s architecture, e.g. the weights in a neural network. Parameters are adjusted during training to minimize the difference between the predicted and the actual output.
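The small sketch below, assuming PyTorch, shows that parameters are simply the trainable numbers (weights and biases) inside a layer.

```python
# Count the trainable parameters of a single layer (assumes PyTorch).
import torch.nn as nn

layer = nn.Linear(in_features=512, out_features=256)   # one layer of a network
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 512*256 weights + 256 biases = 131,328 trainable parameters
```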
Vector: In ML, a vector is an array of numbers representing data in a form that algorithms can process. In LLMs, words or phrases are converted into vectors (called embeddings) that capture their semantic meaning.
Embeddings: These are dense vector representations of text in which words with similar meanings have similar representations in vector space. Embeddings capture context and semantic similarity between words, which is useful in tasks such as machine translation and text summarization.
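The toy example below, assuming NumPy, illustrates the idea: semantically related words sit close together in vector space. The 3-dimensional vectors are invented for the example; real embeddings typically have hundreds of dimensions.

```python
# Compare word embeddings with cosine similarity (toy values, assumes NumPy).
import numpy as np

embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1: similar meaning
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower: unrelated words
```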
Tokenization: Text is split into tokens — words, sub-words or characters.
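A short sub-word tokenization sketch follows, assuming the Hugging Face transformers library is installed; the GPT-2 tokenizer is used purely as an example.

```python
# Split text into sub-word tokens and map them to integer ids (a sketch).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Tokenization splits text into sub-words.")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # e.g. ['Token', 'ization', ...]: rarer words break into sub-words
print(ids)     # each token maps to an integer id that the model actually consumes
```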
Transformers: This neural network architecture relies on self-attention to weigh the influence of different parts of the input data differently. It is widely used in NLP tasks and is at the core of modern LLMs.
Attention: The attention mechanism enables models to focus on different segments of the input sequence while generating a response, making the response contextual and coherent.
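A minimal sketch of scaled dot-product attention, the operation at the heart of transformers, is shown below. It assumes NumPy; the shapes and random values are toy placeholders, and real models compute the queries, keys and values with learned projections.

```python
# Scaled dot-product attention on toy data (assumes NumPy).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how strongly each query attends to each key
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 for each query
    return weights @ V                  # weighted mix of the values

seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one context-aware vector per position
```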
Inference: Here the trained model makes predictions — it generates text from input data, using the knowledge gained during training.
Temperature: It is a hyperparameter that controls the randomness of predictions. The logits are divided by the temperature before applying the softmax: a higher temperature flattens the distribution and produces more random outputs, while a lower temperature sharpens it and makes the output more deterministic.
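The small sketch below, assuming NumPy, shows how temperature rescales the logits before the softmax. The logit values are made up for illustration.

```python
# Effect of temperature on the next-word probability distribution (assumes NumPy).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2])

print(softmax(logits / 0.5))  # low temperature: distribution sharpens, output more deterministic
print(softmax(logits / 1.0))  # temperature 1: probabilities unchanged
print(softmax(logits / 2.0))  # high temperature: distribution flattens, output more random
```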
Frequency: The probability of tokens is adjusted based on their frequency of occurrence. This makes it possible to balance the generation of common and less common words.
Sampling: In generating text, the next word is picked at random according to its probability distribution. This makes the output varied and creative.
Top-K sampling: The choice of the next word is limited to the K most likely candidates. This reduces the randomness of text generation while still maintaining variability in the output.
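The toy sketch below, assuming NumPy, contrasts plain sampling with top-K sampling over an invented next-word distribution.

```python
# Plain sampling versus top-K sampling over a toy distribution (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["cat", "dog", "car", "tree", "moon"])
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])

# Plain sampling: any word can be picked, in proportion to its probability.
print(rng.choice(vocab, p=probs))

# Top-K sampling (K=2): keep only the 2 most likely words, renormalize, then sample.
k = 2
top_idx = np.argsort(probs)[-k:]
top_probs = probs[top_idx] / probs[top_idx].sum()
print(rng.choice(vocab[top_idx], p=top_probs))
```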
RLHF – Reinforcement Learning from Human Feedback: Here the model is further fine-tuned using human feedback on its outputs, so that its responses align better with human preferences.
Decoding strategies: These determine how the output sequence is chosen. In greedy decoding, the most likely next word is chosen at each step. Beam search expands on this by keeping several candidate sequences in parallel. The choice of strategy affects the diversity and coherence of the generated text.
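The toy comparison below, assuming NumPy, shows the mechanics of both strategies. The bigram probability table is invented purely for illustration; a real decoder would query a language model at each step.

```python
# Greedy decoding versus beam search over a toy bigram table (assumes NumPy).
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "mat"]
# next_prob[i][j] = probability that word j follows word i (each row sums to 1).
next_prob = np.array([
    [0.0, 0.7, 0.3, 0.0, 0.0],   # after <s>
    [0.0, 0.0, 0.5, 0.1, 0.4],   # after "the"
    [0.0, 0.1, 0.0, 0.8, 0.1],   # after "cat"
    [0.0, 0.6, 0.1, 0.0, 0.3],   # after "sat"
    [0.0, 0.3, 0.3, 0.2, 0.2],   # after "mat"
])

def greedy(start=0, steps=3):
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(next_prob[seq[-1]])))  # always take the single best next word
    return [vocab[i] for i in seq]

def beam_search(start=0, steps=3, beam=2):
    beams = [([start], 1.0)]
    for _ in range(steps):
        candidates = [(seq + [j], p * next_prob[seq[-1], j])
                      for seq, p in beams for j in range(len(vocab))]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]  # keep the best `beam` sequences
    best_seq, _ = beams[0]
    return [vocab[i] for i in best_seq]

print(greedy())       # follows the locally best word at each step
print(beam_search())  # keeps several partial sequences and picks the best overall
```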
Prompting: Here inputs or prompts are designed to guide the model to generate specific outputs.
Transformer-XL: This extends the transformer architecture so that the model can learn dependencies beyond a fixed context length without compromising coherence. It is useful for long documents and sequences.
Masked Language Modelling (MLM): Certain segments of the input are masked during training, and the model is expected to predict the concealed words. It is used in BERT to enhance the effectiveness of pre-training.
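The short sketch below assumes the Hugging Face transformers library is installed; the bert-base-uncased model name is used only as an example of an MLM-trained model.

```python
# Predict a masked word with an MLM-trained model (a sketch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    # Each prediction is a candidate word for the masked position, with its probability.
    print(prediction["token_str"], round(prediction["score"], 3))
```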
Sequence-to-sequence Models — Seq2Seq: Here sequences are converted from one domain to another, for example translating from one language to another, or converting questions to answers. Both an encoder and a decoder are involved.
Generative Pre-trained Transformer (GPT): A family of auto-regressive language models developed by OpenAI.
Multi-Head Attention: This is a component of the transformer model in which the model attends to several representation perspectives (subspaces) simultaneously.
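A minimal sketch of the idea follows, assuming NumPy: the input is projected into several smaller "heads", each head attends independently, and the heads' outputs are concatenated. The random projection matrices stand in for learned weights.

```python
# Multi-head attention on toy data (assumes NumPy).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Per-head projections (random placeholders for learned weight matrices).
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V)   # each head produces its own view of the sequence
    return np.concatenate(heads, axis=-1)   # concatenate the heads' outputs

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))                               # 4 tokens, model dimension 16
print(multi_head_attention(X, n_heads=4, rng=rng).shape)   # (4, 16)
```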
Contextual Embeddings: Here the context in which a word appears is taken into account. The embeddings are dynamic and change based on the surrounding text.
Auto-regressive Models: These models predict the next word based on the previous words in a sequence, and are used in GPTs. Each output word becomes part of the next input, which facilitates coherent long-form generation.
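The toy loop below, assuming NumPy, shows the feedback mechanism: each sampled word is appended to the context and fed back in for the next step. The next_word_probs function is a made-up stand-in for a real language model.

```python
# A toy auto-regressive generation loop (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]

def next_word_probs(context):
    # Hypothetical stand-in: it ignores the context, whereas a real LLM conditions on it.
    logits = rng.normal(size=len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

context = ["the"]
for _ in range(5):
    probs = next_word_probs(context)
    word = rng.choice(vocab, p=probs)   # pick the next word
    context.append(word)                # the output becomes part of the next input

print(" ".join(context))
```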