
BLIP-2

Vision-language model for image captioning, VQA, and image-text retrieval

BLIP-2 bridges a frozen image encoder and a frozen LLM with a Querying Transformer (Q-Former). It supports image captioning, visual question answering, and image-text retrieval without task-specific fine-tuning.
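The Q-Former's core move is a set of learned query vectors that cross-attend over the frozen image encoder's patch embeddings, producing a fixed number of vectors to hand to the frozen LLM. A minimal plain-Python sketch of that cross-attention step, with toy sizes; `cross_attend` and all values are illustrative, not the real implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, image_feats):
    """Each learned query attends over the frozen image features.

    queries:     learned query vectors (the Q-Former's queries)
    image_feats: patch embeddings from the frozen image encoder
    Returns one attended vector per query, to be projected into the LLM.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product scores against every patch embedding.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted sum of image features (values == keys in this sketch).
        attended = [sum(w * k[d] for w, k in zip(weights, image_feats))
                    for d in range(dim)]
        outputs.append(attended)
    return outputs

# 2 learned queries, 3 image patches, 4-dim embeddings (toy sizes).
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
out = cross_attend(queries, patches)
```

Because the number of queries is fixed, the LLM always receives the same-length visual prefix regardless of image resolution.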

When to use:

  • Automatically generating captions for images
  • Answering natural language questions about image content
  • Building image search systems with text queries

Input: Image + optional text prompt/question + optional fine-tuned checkpoint
Output: Generated caption or answer, confidence score, and generation metadata

Inference Settings

Task (default: caption, options: caption / vqa / retrieval) What BLIP-2 should do with the image.

  • caption: Generate a descriptive caption — no text prompt needed
  • vqa: Answer a question about the image — requires a text prompt
  • retrieval: Compute image-text similarity scores

Max Length (default: 50, range: 10–512) Maximum length of the generated text in tokens.

  • Short (20–50): Brief captions or single-sentence answers
  • Long (100–512): Detailed descriptions or longer responses

Num Beams (default: 5) Number of beams for beam search decoding.

  • 1: Greedy decoding, fastest
  • 5: Good balance of quality and speed
  • 10+: Higher quality, slower
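To see why more beams can recover higher-probability sequences than greedy decoding, here is a toy sketch over a hand-made next-token table; the table, probabilities, and helper are invented purely for illustration:

```python
import math

# Toy next-token distributions: maps a prefix tuple to {token: prob}.
NEXT = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def beam_search(num_beams, length=2):
    """Return the best (sequence, log-prob) found with `num_beams` beams."""
    beams = [((), 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for tok, p in NEXT[prefix].items():
                candidates.append((prefix + (tok,), logp + math.log(p)))
        # Keep only the top-k candidates; k=1 reduces to greedy decoding.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

# Greedy commits to "a" (p=0.6) and ends at probability 0.3;
# two beams keep "b" alive and find ("b", "z") at probability 0.36.
greedy_seq, _ = beam_search(num_beams=1)
beam_seq, _ = beam_search(num_beams=2)
```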

Temperature (default: 1.0, range: 0.1–2.0) Sampling temperature for generation diversity.

  • 0.1–0.5: More focused, deterministic outputs
  • 1.0: Default diversity
  • 1.5+: Creative but less accurate
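Temperature divides the model's logits before they are turned into sampling probabilities, so low values sharpen the distribution toward the top token and high values flatten it. A small pure-Python sketch with invented logit values:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented logits for three candidate tokens
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 1.0)  # default diversity
hot = softmax_with_temperature(logits, 2.0)   # flatter, more diverse
```

At 0.1 nearly all probability mass sits on the top token, which is why low temperatures read as focused and deterministic.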
