AI & LLM Glossary
Learn essential AI terminology with clear definitions and practical examples
Transformer
Architecture: A neural network architecture that uses self-attention mechanisms to process sequential data. Introduced in the 'Attention Is All You Need' paper (2017), it forms the basis of modern LLMs like GPT and BERT.
Fine-Tuning
Training: The process of taking a pre-trained model and training it further on a specific dataset to adapt it for a particular task or domain.
LoRA (Low-Rank Adaptation)
Training: A parameter-efficient fine-tuning technique that freezes the original model weights and trains small low-rank adapter matrices, drastically reducing the number of trainable parameters and the memory needed for fine-tuning.
Prompt Engineering
Usage: The practice of designing and optimizing text prompts to get desired outputs from language models. Includes techniques like few-shot learning, chain-of-thought, and role-playing.
RAG (Retrieval Augmented Generation)
Architecture: A technique that combines information retrieval with language generation. The model retrieves relevant documents from a knowledge base before generating a response.
Token
Fundamentals: The basic unit of text that a language model processes; in English, roughly 4 characters or 0.75 words. Models have maximum token limits (context windows).
Temperature
Parameters: A parameter (typically 0-2) that controls randomness in model outputs. Lower values (0.1-0.5) make outputs more focused and deterministic; higher values (0.8-1.5) make them more varied and creative.
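The mechanism behind temperature can be shown in a minimal sketch: logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it. The logit values below are illustrative, not from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax; lower T sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                            # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.2)        # near-deterministic: mass piles on the top token
hot = softmax_with_temperature(logits, 1.5)         # flatter: low-probability tokens become viable
```

At temperature 0.2 the top token's probability approaches 1; at 1.5 the same token keeps the lead but the tail tokens get a real chance of being sampled.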
Hallucination
Challenges: When a language model generates false or nonsensical information that sounds plausible. A major challenge in deploying LLMs for factual tasks.
Embeddings
Fundamentals: Dense vector representations of text that capture semantic meaning. Similar texts have similar embeddings, enabling semantic search and clustering.
Context Window
Fundamentals: The maximum number of tokens a model can process at once (input + output). Modern models range from 4K (GPT-3.5) to 2M (Gemini 1.5 Pro).
Few-Shot Learning
Usage: Providing a model with a few examples in the prompt to demonstrate the desired task, without fine-tuning.
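A few-shot prompt is just string construction: task description, a handful of input/output examples, then the new input left open for the model to complete. The reviews and labels below are made up for illustration.

```python
# Hypothetical labeled examples; in practice these come from your own task data.
examples = [
    ("The movie was fantastic!", "positive"),
    ("I wasted two hours of my life.", "negative"),
]

def build_few_shot_prompt(examples, query):
    """Assemble a task description, worked examples, and an open-ended final item."""
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")  # the model completes this line
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(examples, "An instant classic.")
```

The prompt ends mid-pattern ("Sentiment:"), which nudges the model to continue with a label in the same format as the examples.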
Chain-of-Thought (CoT)
Usage: A prompting technique where you ask the model to show its reasoning step-by-step, improving performance on complex tasks.
Quantization
Optimization: Reducing the precision of model weights (e.g., from 16-bit to 4-bit) to decrease memory usage and increase inference speed, with minimal accuracy loss.
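A minimal sketch of the idea, using symmetric per-tensor int8 quantization (real systems use per-channel scales, calibration, and lower bit widths, but the round-trip below shows where the precision loss comes from):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store one float scale plus small integers."""
    scale = max(abs(w) for w in weights) / 127
    scale = scale or 1.0  # avoid division by zero for an all-zero tensor
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.003]          # illustrative float16-style weights
q, s = quantize_int8(w)          # integers in [-127, 127] plus one scale factor
w_hat = dequantize(q, s)         # close to w, up to rounding error of at most s/2
```

Each weight now needs 1 byte instead of 2 or 4, at the cost of a rounding error bounded by half the scale.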
RLHF (Reinforcement Learning from Human Feedback)
Training: A training method where human preferences are used to fine-tune models, making them more helpful, harmless, and honest.
System Prompt
Usage: The initial instruction that sets the model's behavior, role, and constraints. Typically not visible to end users but crucial for consistent outputs.
Attention Mechanism
Architecture: A technique that allows models to focus on different parts of the input when producing each part of the output. The foundation of transformer models.
Vector Database
Architecture: A specialized database designed to store and efficiently search high-dimensional vectors (embeddings). Essential for RAG and semantic search applications.
Zero-Shot Learning
Usage: The ability of a model to perform a task without any examples, using only the task description in the prompt.
Top-P (Nucleus Sampling)
Parameters: A sampling method that samples from the smallest set of tokens whose cumulative probability reaches P. Unlike Top-K, the candidate pool size adapts to the shape of the distribution.
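A minimal sketch of nucleus sampling's filtering step (the probabilities are illustrative; a real implementation works on the model's full vocabulary and then samples from the renormalized set):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for idx, prob in ranked:
        kept.append((idx, prob))
        cumulative += prob
        if cumulative >= p:   # smallest set reaching p: stop as soon as we cross it
            break
    total = sum(prob for _, prob in kept)
    return {idx: prob / total for idx, prob in kept}

# With p=0.9, the 0.05 tail token is cut; with a flatter distribution more tokens survive.
dist = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
```

This is why Top-P is called dynamic: a peaked distribution may keep only one or two tokens, while a flat one keeps many.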
Top-K Sampling
Parameters: A sampling method that restricts the model to choose from only the K most likely next tokens.
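The Top-K counterpart is a fixed-size cut, sketched below on illustrative probabilities: keep the K most likely tokens, renormalize, and sample only from those.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

# k=2 keeps indices 0 and 1 regardless of how flat or peaked the distribution is.
dist = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)
```

Note the contrast with Top-P: here exactly K candidates survive no matter the distribution's shape.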
Tokenization
Fundamentals: The process of breaking text into smaller units (tokens) that the model can process. Different models use different tokenization strategies.
BPE (Byte Pair Encoding)
Fundamentals: A tokenization algorithm that iteratively merges the most frequent pairs of characters or character sequences. Used by GPT models.
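The merge loop at the heart of BPE can be sketched in a few lines: count adjacent symbol pairs, merge the most frequent pair into one symbol, repeat. This toy version works on characters of a single string; real tokenizers learn merges over a large corpus and operate on bytes.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent symbol pair (ties broken arbitrarily)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")      # start from individual characters
for _ in range(2):                     # two merge iterations
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After enough iterations, frequent substrings like "low" become single tokens while rare sequences stay split into smaller pieces.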
Agent
Architecture: An AI system that can perceive its environment, make decisions, and take actions autonomously. Often uses LLMs for reasoning and tool use.
Function Calling
Usage: The ability of LLMs to generate structured outputs that trigger external functions or APIs. Enables agents to interact with tools.
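A sketch of the application side of function calling, assuming a provider whose models emit JSON tool calls (the schema style below resembles what several chat APIs use, but exact field names vary by provider, and `get_weather` is a made-up tool):

```python
import json

# Hypothetical tool schema advertised to the model alongside the conversation.
get_weather_schema = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A function-calling model emits structured output like this instead of prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = json.loads(model_output)

def dispatch(call, tools):
    """Route a parsed tool call to the matching local implementation."""
    if call["name"] in tools:
        return tools[call["name"]](**call["arguments"])
    raise ValueError(f"unknown tool: {call['name']}")

# Stub implementation standing in for a real weather API.
result = dispatch(call, {"get_weather": lambda city: f"(stub) weather for {city}"})
```

The result is then fed back to the model as a tool message so it can compose a natural-language answer.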
Mixture of Experts (MoE)
Architecture: An architecture where multiple specialized sub-models (experts) are trained, and a gating mechanism decides which experts to use for each input.
PEFT (Parameter-Efficient Fine-Tuning)
Training: Techniques that fine-tune only a small subset of model parameters, reducing memory and compute requirements.
QLoRA
Training: Short for Quantized LoRA; combines 4-bit quantization with LoRA to enable fine-tuning of large models on consumer hardware.
Instruction Tuning
Training: Fine-tuning a model on a dataset of instruction-response pairs to improve its ability to follow instructions.
Alignment
Training: The process of making AI models behave in ways that are helpful, harmless, and honest, in line with human values and intentions.
Constitutional AI
Training: An alignment approach where models are trained to follow a set of principles (a constitution) through self-critique and revision.
Semantic Search
Usage: Search based on meaning rather than exact keyword matching. Uses embeddings to find semantically similar content.
Cosine Similarity
Fundamentals: A metric for measuring similarity between two vectors, commonly used to compare embeddings. Ranges from -1 (opposite) to 1 (identical).
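The definition is just the dot product divided by the product of the vector magnitudes, shown here on tiny 2-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

Because it measures angle rather than length, two embeddings of different magnitude but similar direction still score near 1.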
Perplexity
Fundamentals: A metric that measures how well a language model predicts text. Lower perplexity indicates better prediction capability.
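Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each actual token. The per-token probabilities below are made up to show the contrast:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.9, 0.8, 0.95])  # low: the model predicted the text well
uncertain = perplexity([0.1, 0.2, 0.05])  # high: the model was surprised
```

A useful intuition: perplexity 2 means the model was, on average, as uncertain as a fair coin flip per token; assigning every token probability 0.5 gives exactly 2.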
Encoder-Decoder
Architecture: An architecture with two components: an encoder that processes input and a decoder that generates output. Used in translation and summarization.
Decoder-Only
Architecture: An architecture that uses only the decoder part of the transformer. Most modern LLMs (GPT, LLaMA) are decoder-only models.
Autoregressive
Fundamentals: A generation method where the model predicts one token at a time, using previously generated tokens as context.
Beam Search
Parameters: A search algorithm that keeps the K most likely partial sequences at each step, exploring more candidates than greedy decoding at the cost of extra computation.
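A toy version of the algorithm, where a stand-in function plays the role of the model's next-token distribution (real beam search scores sequences with an actual LLM and handles end-of-sequence tokens):

```python
import math

def beam_search(step_probs, beam_width, length):
    """Keep the `beam_width` highest-scoring partial sequences at each step.
    `step_probs(seq)` returns {token: prob} for the next position."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for token, prob in step_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Prune: keep only the top beam_width candidates by score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Hypothetical two-token vocabulary with fixed probabilities at every step.
toy_model = lambda seq: {"a": 0.6, "b": 0.4}
best = beam_search(toy_model, beam_width=2, length=3)
```

With beam width 1 this degenerates to greedy decoding; wider beams trade compute for a better chance of finding a higher-probability sequence overall.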
Greedy Decoding
Parameters: Always selecting the single most probable next token. Fast, but can lead to repetitive or suboptimal outputs.
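Greedy decoding reduces to an argmax in a loop, sketched here with a stand-in for the model's next-token distribution (the `<eos>` end-of-sequence token and the probabilities are illustrative):

```python
def greedy_decode(next_token_probs, max_tokens):
    """Always pick the single most probable next token until <eos> or the limit."""
    seq = []
    for _ in range(max_tokens):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)  # argmax: no sampling involved
        if token == "<eos>":
            break
        seq.append(token)
    return seq

# Hypothetical model: favors "the" first, then wants to stop.
toy = lambda seq: {"the": 0.7, "<eos>": 0.3} if not seq else {"<eos>": 0.9, "the": 0.1}
print(greedy_decode(toy, max_tokens=5))  # ['the']
```

Because the choice at each step is deterministic, the same prompt always yields the same output, and a locally best token can lock the model into a globally weaker sequence.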
Multi-Head Attention
Architecture: Running multiple attention mechanisms in parallel, allowing the model to attend to different aspects of the input simultaneously.
Self-Attention
Architecture: An attention mechanism where each position in a sequence attends to all other positions, capturing relationships within the input.
Transfer Learning
Training: Using knowledge learned from one task to improve performance on a related task. The foundation of modern LLM training.
Pre-training
Training: The initial training phase where a model learns general language understanding from large amounts of unlabeled text.
Inference
Fundamentals: The process of using a trained model to make predictions or generate outputs. Distinct from training.
Latency
Optimization: The time delay between sending a request and receiving a response. Critical for real-time applications.
Throughput
Optimization: The number of requests or tokens a system can process per unit of time. Important for scaling.
Batching
Optimization: Processing multiple requests together to improve throughput and GPU utilization.
KV Cache
Optimization: Caching key-value pairs from attention layers to avoid recomputing them during autoregressive generation, speeding up inference.
Flash Attention
Optimization: An optimized attention algorithm that reduces memory usage and speeds up training/inference by reordering operations.
Distillation
Optimization: Training a smaller 'student' model to mimic a larger 'teacher' model, retaining most performance with fewer parameters.
Prompt Injection
Challenges: A security vulnerability where malicious users craft prompts to override system instructions or extract sensitive information.
Jailbreaking
Challenges: Techniques to bypass safety guardrails and make models generate prohibited content.
Grounding
Challenges: Connecting model outputs to factual sources or real-world data to reduce hallucinations.
Tool Use
Usage: The ability of LLMs to interact with external tools, APIs, or functions to extend their capabilities beyond text generation.
ReAct (Reasoning + Acting)
Usage: A prompting framework where models alternate between reasoning about what to do and taking actions with tools.