Multimodal Embeddings
Joint image-text embeddings for cross-modal search and retrieval
Multimodal embedding models map images and text into a shared vector space, enabling cross-modal retrieval: you can search a collection of images with a text query, or find relevant text given an image.
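Because matching images and text live in one vector space, retrieval reduces to nearest-neighbor search over stored vectors, typically by cosine similarity. A minimal sketch with toy 4-dimensional vectors (real embeddings would come from one of the models below and be 3584- or 4096-dimensional; the filenames and vector values here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec: np.ndarray, index: list[tuple[str, np.ndarray]]):
    """Rank all indexed items by similarity to the query, best first."""
    scored = [(item_id, cosine_similarity(query_vec, vec)) for item_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy image embeddings standing in for real model output.
image_index = [
    ("cat.jpg", np.array([0.9, 0.1, 0.0, 0.1])),
    ("dog.jpg", np.array([0.1, 0.9, 0.1, 0.0])),
]

# Pretend this came from embedding the text "a photo of a cat".
query = np.array([0.8, 0.2, 0.1, 0.0])

results = search(query, image_index)  # → [("cat.jpg", ...), ("dog.jpg", ...)]
```

Because both modalities share the space, the same `search` function works in either direction: index text snippets and query with an image embedding instead.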
Available Models
- LLaVA-Next 13B Embeddings – 4096-dim joint image-text embeddings (13B parameters)
- Qwen-VL-2 Embedding – 3584-dim multilingual multimodal embeddings with strong OCR understanding (32+ languages)