BLIP-2
Vision-language model for image captioning, VQA, and image-text retrieval
BLIP-2 bridges a frozen image encoder and a frozen LLM using a lightweight Querying Transformer (Q-Former). It supports image captioning, visual question answering, and image-text retrieval without task-specific fine-tuning.
When to use:
- Automatically generating captions for images
- Answering natural language questions about image content
- Building image search systems with text queries
Input: Image + optional text prompt/question + optional fine-tuned checkpoint
Output: Generated caption or answer, confidence score, and generation metadata
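A minimal call might look like the following. This is an illustrative sketch, assuming the Hugging Face `transformers` implementation of BLIP-2 and the `Salesforce/blip2-opt-2.7b` checkpoint; the service's actual backend and default checkpoint are not specified above, and the function name is ours:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

MODEL_ID = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint; swap in your own

def generate_text(image_path: str, prompt: str = None, max_new_tokens: int = 50) -> str:
    """Caption an image (no prompt) or answer a question about it (VQA prompt)."""
    processor = Blip2Processor.from_pretrained(MODEL_ID)
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID)
    model.to("cuda" if torch.cuda.is_available() else "cpu")

    image = Image.open(image_path).convert("RGB")
    # With no text prompt BLIP-2 captions; with a "Question: ... Answer:"
    # prompt it performs VQA.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out_ids[0], skip_special_tokens=True).strip()

if __name__ == "__main__":
    print(generate_text("photo.jpg"))  # captioning, no prompt needed
    print(generate_text("photo.jpg", prompt="Question: What is shown? Answer:"))
```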
Inference Settings
Task (default: caption, options: caption / vqa / retrieval) What BLIP-2 should do with the image.
- caption: Generate a descriptive caption — no text prompt needed
- vqa: Answer a question about the image — requires a text prompt
- retrieval: Compute image-text similarity scores
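The three task modes above differ mainly in what text, if any, accompanies the image. A sketch of that dispatch, assuming the common BLIP-2 VQA prompt convention (`Question: … Answer:`); `build_prompt` is an illustrative helper, not part of any API:

```python
from typing import Optional

def build_prompt(task: str, text: Optional[str] = None) -> Optional[str]:
    """Map a task mode to the text input BLIP-2 expects (assumed convention)."""
    if task == "caption":
        return None  # captioning needs no text prompt
    if task == "vqa":
        if not text:
            raise ValueError("vqa requires a text prompt")
        # Commonly used BLIP-2 question-answering prompt format
        return f"Question: {text} Answer:"
    if task == "retrieval":
        if not text:
            raise ValueError("retrieval requires a text query to score against")
        return text  # similarity is computed against the raw query text
    raise ValueError(f"unknown task: {task!r}")
```

Usage: `build_prompt("vqa", "What color is the car?")` yields `"Question: What color is the car? Answer:"`, which is passed alongside the image.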
Max Length (default: 50, range: 10–512) Maximum length of the generated text in tokens.
- Short (20–50): Brief captions or single-sentence answers
- Long (100–512): Detailed descriptions or longer responses
Num Beams (default: 5) Number of beams for beam search decoding.
- 1: Greedy decoding, fastest
- 5: Good balance of quality and speed
- 10+: Higher quality, slower
Temperature (default: 1.0, range: 0.1–2.0) Sampling temperature for generation diversity.
- 0.1–0.5: More focused, deterministic outputs
- 1.0: Default diversity
- 1.5+: Creative but less accurate
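Taken together, the settings above map onto standard decoding parameters. A sketch assuming a Hugging Face-style `generate()` API, where temperature only takes effect when sampling is enabled, so this helper turns on `do_sample` for non-default temperatures; how the service itself applies temperature is an assumption:

```python
def generation_kwargs(max_length: int = 50, num_beams: int = 5,
                      temperature: float = 1.0) -> dict:
    """Translate the inference settings into decoding kwargs (illustrative)."""
    if not 10 <= max_length <= 512:
        raise ValueError("Max Length must be in 10-512")
    if not 0.1 <= temperature <= 2.0:
        raise ValueError("Temperature must be in 0.1-2.0")
    kwargs = {
        "max_new_tokens": max_length,  # cap on generated tokens
        "num_beams": num_beams,        # 1 = greedy, >1 = beam search
    }
    if temperature != 1.0:
        # Sampling must be on for temperature to have any effect.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs

# e.g. model.generate(**inputs, **generation_kwargs(max_length=80, num_beams=10))
```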