Multimodal

Models that process or generate multiple modalities at once

Multimodal tasks work with more than one type of data at the same time, for example text + images, text + audio, or text + video.

Common Multimodal Tasks

  • Image-Text-to-Text: Generate text from a combination of images and text prompts
  • Visual Question Answering: Answer questions about images
  • Document Question Answering: Answer questions from documents or PDFs
  • Audio-to-Text: Convert audio input, such as speech, into coherent text output
  • Video-to-Text: Generate text based on video content
  • Visual Document Retrieval: Retrieve documents or visuals based on multimodal queries
  • Any-to-Any: General multimodal conversion between arbitrary input and output types
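
One way to read the list above is as a mapping from input modalities to output modalities. The following is a minimal sketch of that idea, not an API of any particular library: the `Task` class, the `TASKS` table, the `tasks_for` helper, and the modality labels ("image", "text", "document", "audio", "video") are all hypothetical names chosen for illustration, and the modality assignments are inferred from the task descriptions. Visual Document Retrieval and Any-to-Any are left out because their input and output types are open-ended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record describing one multimodal task."""
    name: str
    inputs: frozenset   # modalities the task consumes
    outputs: frozenset  # modalities the task produces


# Assumption: modality sets are inferred from the task descriptions,
# not taken from an official specification.
TASKS = [
    Task("Image-Text-to-Text", frozenset({"image", "text"}), frozenset({"text"})),
    Task("Visual Question Answering", frozenset({"image", "text"}), frozenset({"text"})),
    Task("Document Question Answering", frozenset({"document", "text"}), frozenset({"text"})),
    Task("Audio-to-Text", frozenset({"audio"}), frozenset({"text"})),
    Task("Video-to-Text", frozenset({"video"}), frozenset({"text"})),
]


def tasks_for(inputs, output):
    """Return names of tasks that take exactly `inputs` and produce `output`."""
    wanted = frozenset(inputs)
    return [t.name for t in TASKS if t.inputs == wanted and output in t.outputs]


if __name__ == "__main__":
    # Look up which tasks accept an image plus a text prompt and return text.
    print(tasks_for({"image", "text"}, "text"))
```

Under these assumptions, Image-Text-to-Text and Visual Question Answering share the same input and output modalities; they differ in how the text prompt is used (an open-ended prompt versus a question about the image).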
