Vision Language Models

Models that understand both images and text for captioning, VQA, and document understanding

Vision-language models jointly process images and text for captioning, visual question answering, and structured document extraction.

Available Models

  • BLIP-2 – Image captioning, visual question answering, and image-text retrieval
  • LayoutLMv3 – Document understanding combining text, layout, and image for forms, receipts, and invoices
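As an illustration of the captioning and visual question answering tasks listed for BLIP-2, here is a minimal sketch. This page does not say how the models are served, so the sketch assumes the publicly released Hugging Face `transformers` checkpoint `Salesforce/blip2-opt-2.7b` plus the `Pillow` package; the function names `build_vqa_prompt` and `describe_image` are hypothetical helpers, not part of any documented API here.

```python
from typing import Optional


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in the 'Question: ... Answer:' prompt used with BLIP-2 OPT checkpoints."""
    return f"Question: {question.strip()} Answer:"


def describe_image(image_path: str, question: Optional[str] = None) -> str:
    """Caption an image, or answer a question about it when one is given."""
    # Heavy dependencies are imported lazily so the prompt helper works without them.
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    # Assumption: the public checkpoint below; this page does not name one.
    checkpoint = "Salesforce/blip2-opt-2.7b"
    processor = Blip2Processor.from_pretrained(checkpoint)
    model = Blip2ForConditionalGeneration.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    if question is None:
        # Image only: the model generates a caption.
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Image plus prompt: the model completes the answer.
        inputs = processor(
            images=image, text=build_vqa_prompt(question), return_tensors="pt"
        )

    generated_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

Calling `describe_image("photo.jpg")` would return a caption, while `describe_image("photo.jpg", "How many people are visible?")` would return a short answer. The first call downloads several gigabytes of weights, so a smaller or hosted model may be preferable in practice.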
