BLIP-2
Vision-language model for image captioning, VQA, and image-text retrieval
BLIP-2 bridges a frozen image encoder and a frozen LLM using a lightweight Querying Transformer (Q-Former). It supports image captioning, visual question answering, and image-text retrieval without task-specific fine-tuning.
When to use:
- Automatically generating captions for images
- Answering natural language questions about image content
- Building image search systems with text queries
Input: Image + optional text prompt/question + optional fine-tuned checkpoint
Output: Generated caption or answer, confidence score, and generation metadata
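A minimal call might look like the following. This is an illustrative sketch, assuming the Hugging Face `transformers` implementation of BLIP-2 and the `Salesforce/blip2-opt-2.7b` checkpoint; the service's actual backend and default checkpoint are not specified above, and the function name is ours:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

MODEL_ID = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint; swap in your own

def generate_text(image_path: str, prompt: str = None, max_new_tokens: int = 50) -> str:
    """Caption an image (no prompt) or answer a question about it (VQA prompt)."""
    processor = Blip2Processor.from_pretrained(MODEL_ID)
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID)
    model.to("cuda" if torch.cuda.is_available() else "cpu")

    image = Image.open(image_path).convert("RGB")
    # With no text prompt BLIP-2 captions; with a "Question: ... Answer:"
    # prompt it performs VQA.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out_ids[0], skip_special_tokens=True).strip()

if __name__ == "__main__":
    print(generate_text("photo.jpg"))  # captioning, no prompt needed
    print(generate_text("photo.jpg", prompt="Question: What is shown? Answer:"))
```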
Inference Settings
Task (default: caption, options: caption / vqa / retrieval) What BLIP-2 should do with the image.
- caption: Generate a descriptive caption — no text prompt needed
- vqa: Answer a question about the image — requires a text prompt
- retrieval: Compute image-text similarity scores
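The three task modes above differ mainly in what text, if any, accompanies the image. A sketch of that dispatch, assuming the common BLIP-2 VQA prompt convention (`Question: … Answer:`); `build_prompt` is an illustrative helper, not part of any API:

```python
from typing import Optional

def build_prompt(task: str, text: Optional[str] = None) -> Optional[str]:
    """Map a task mode to the text input BLIP-2 expects (assumed convention)."""
    if task == "caption":
        return None  # captioning needs no text prompt
    if task == "vqa":
        if not text:
            raise ValueError("vqa requires a text prompt")
        # Commonly used BLIP-2 question-answering prompt format
        return f"Question: {text} Answer:"
    if task == "retrieval":
        if not text:
            raise ValueError("retrieval requires a text query to score against")
        return text  # similarity is computed against the raw query text
    raise ValueError(f"unknown task: {task!r}")
```

Usage: `build_prompt("vqa", "What color is the car?")` yields `"Question: What color is the car? Answer:"`, which is passed alongside the image.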
Max Length (default: 50, range: 10–512) Maximum length of the generated text in tokens.
- Short (20–50): Brief captions or single-sentence answers
- Long (100–512): Detailed descriptions or longer responses
Num Beams (default: 5) Number of beams for beam search decoding.
- 1: Greedy decoding, fastest
- 5: Good balance of quality and speed
- 10+: Higher quality, slower
Temperature (default: 1.0, range: 0.1–2.0) Sampling temperature for generation diversity.
- 0.1–0.5: More focused, deterministic outputs
- 1.0: Default diversity
- 1.5+: Creative but less accurate
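Taken together, the settings above map onto standard decoding parameters. A sketch assuming a Hugging Face-style `generate()` API, where temperature only takes effect when sampling is enabled, so this helper turns on `do_sample` for non-default temperatures; how the service itself applies temperature is an assumption:

```python
def generation_kwargs(max_length: int = 50, num_beams: int = 5,
                      temperature: float = 1.0) -> dict:
    """Translate the inference settings into decoding kwargs (illustrative)."""
    if not 10 <= max_length <= 512:
        raise ValueError("Max Length must be in 10-512")
    if not 0.1 <= temperature <= 2.0:
        raise ValueError("Temperature must be in 0.1-2.0")
    kwargs = {
        "max_new_tokens": max_length,  # cap on generated tokens
        "num_beams": num_beams,        # 1 = greedy, >1 = beam search
    }
    if temperature != 1.0:
        # Sampling must be on for temperature to have any effect.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    return kwargs

# e.g. model.generate(**inputs, **generation_kwargs(max_length=80, num_beams=10))
```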