Multimodal
Concept
Definition
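A model that can process or generate more than one type of data (text, images, audio, or video) within a single unified architecture. Examples include GPT-4o and Gemini, which accept mixed text, image, and audio input; Claude with vision, which accepts images alongside text; and Sora, which generates video from text prompts.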
Related Terms
LLM (Large Language Model)
A neural network trained on massive text corpora to predict the next token. At sufficient scale, this training yields emergent abilities such as reasoning, coding, and language understanding. Examples include GPT-4, Claude, Gemini, and Llama. Parameter counts range from billions to trillions.
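As a rough illustration of what "predict the next token" means, the sketch below turns a vector of scores (logits) into a probability distribution with softmax and computes the cross-entropy loss for one position. The four-word vocabulary and the logit values are made up for illustration; in a real LLM the logits would come from the network and the vocabulary would hold tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: negative log-probability of the true next token."""
    return -math.log(softmax(logits)[target_index])

# Toy vocabulary and scores (hypothetical, not from any real model).
vocab = ["the", "cat", "sat", "mat"]
logits = [0.1, 2.0, 0.3, -1.0]  # assumed scores for the token that follows "the"

probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]  # greedy choice: most probable token

# The loss is small when the true token was given high probability, large otherwise.
loss_if_cat = next_token_loss(logits, vocab.index("cat"))
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
```

Training adjusts the network's weights to lower this loss averaged over every position in the corpus. Generation repeats the prediction step, appending each chosen token to the input.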
Diffusion Model
A generative model that learns to reverse a gradual noising process. Starting from pure noise, the model iteratively denoises to produce images, audio, or video. Powers Stable Diffusion, DALL-E 3, Midjourney, and Sora.
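The noising process can be sketched in a few lines. The example below assumes a DDPM-style formulation with a linear noise schedule (the schedule values, the step count, and the function names are illustrative choices, not taken from any of the products named above). It shows the closed-form forward step and how a noise estimate lets the clean data be recovered. Here the true noise stands in for the prediction a trained network would make.

```python
import math
import random

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): how much of the original signal survives by step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, abar_t, eps):
    """Forward (noising) step in closed form: blend clean data with Gaussian noise."""
    return [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]

def estimate_x0(xt, abar_t, eps_pred):
    """Undo the blend given a noise estimate; a trained model would supply eps_pred."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t) for x, e in zip(xt, eps_pred)]

T = 1000
# Linear schedule from 1e-4 to 0.02 (an assumed, commonly used setting).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abars = alpha_bar(betas)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                    # toy "data" (hypothetical)
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
t = 500
xt = add_noise(x0, abars[t], eps)
x0_hat = estimate_x0(xt, abars[t], eps)  # oracle noise stands in for the model's prediction
```

By the final step almost none of the original signal remains, which is why sampling can start from pure noise. A real sampler applies the learned denoiser over many steps rather than inverting in one shot.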
Embedding
A dense numerical vector that represents text, images, or other data in a high-dimensional space where semantic similarity maps to geometric closeness. Foundation of semantic search, RAG systems, and recommendation engines.
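To make "semantic similarity maps to geometric closeness" concrete, the sketch below compares vectors with cosine similarity and retrieves the nearest neighbor of a query. The three-dimensional vectors are hand-written placeholders; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, written by hand for illustration only.
embeddings = {
    "kitten": [0.90, 0.80, 0.10],
    "cat": [0.85, 0.75, 0.20],
    "invoice": [0.10, 0.20, 0.95],
}

def nearest(query, k=1):
    """Return the k stored items most similar to the query item, excluding the query itself."""
    scored = [
        (cosine_similarity(embeddings[query], vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

closest_to_kitten = nearest("kitten")
```

Semantic search and RAG retrieval follow the same pattern at larger scale: embed the query, score it against stored vectors, and return the top matches, usually through an approximate nearest-neighbor index instead of the exhaustive scan shown here.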