Multimodal Inference
Models that combine image, text, and document understanding
Multimodal models process and relate multiple input types — images, text, and documents.
- Vision Language – Answer questions about images and extract information from documents
- Embeddings – Joint image-text embeddings for cross-modal retrieval
- Reranking – Score image-text or image-image similarity for retrieval
- Classification – Classify inputs that combine images with text or documents
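
To make the embeddings and reranking entries above more concrete, here is a minimal sketch of cross-modal retrieval: a text query and several images are assumed to already live in the same embedding space, and images are ranked by cosine similarity to the query. The vectors, file names, and the `rank_images` helper are hypothetical and only for illustration; a real joint image-text embedding model would produce the vectors, and they would be much higher-dimensional.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_vec, image_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(text_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical, hand-written 3-dimensional embeddings (assumption: a joint
# image-text model has already mapped both modalities into one space).
image_vecs = {
    "invoice.png": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "chart.png": [0.6, 0.7, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedding of a text query

ranking = rank_images(query_vec, image_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the ranking comes out as `invoice.png`, then `chart.png`, then `beach.jpg`. A reranking step would typically take the top few candidates from a ranking like this and rescore each query-candidate pair with a more expensive model.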