
CLIP ViT-L/14

Joint vision-language embedding model for image similarity and zero-shot tasks

CLIP ViT-L/14 from OpenAI encodes images into a shared embedding space with text, enabling zero-shot classification and cross-modal retrieval without labelled data.

When to use:

  • Image similarity search (find visually similar images)
  • Zero-shot image classification without labelled data
  • Cross-modal retrieval (find images matching a text query)
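
As a quick illustration of the zero-shot use case above: the model scores an image against a set of natural-language label prompts, with no task-specific training. A minimal sketch using the Hugging Face transformers library ("openai/clip-vit-large-patch14" is the public checkpoint; the image path and candidate labels are placeholders):

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    image = Image.open("photo.jpg")  # placeholder path
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image holds image-text similarity scores; softmax turns
    # them into a probability distribution over the candidate labels
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))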

Input: Image file + optional fine-tuned checkpoint
Output: 768-dimensional embedding vector in the CLIP feature space
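
A minimal sketch of producing that embedding, again assuming the transformers library; swap the checkpoint string for a local fine-tuned checkpoint directory if you have one:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    checkpoint = "openai/clip-vit-large-patch14"  # or a path to a fine-tuned checkpoint
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.open("photo.jpg")  # placeholder path
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)  # shape (1, 768)
    # L2-normalize so dot products between embeddings equal cosine similarity
    emb = emb / emb.norm(dim=-1, keepdim=True)

For similarity search, store the normalized vectors and rank candidates by dot product.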

Inference Settings

No inference-time settings. CLIP encodes images deterministically.

Note: To compare image embeddings to text queries, encode the query with the matching CLIP text encoder (from the same checkpoint) so both embeddings lie in the same space.
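
A minimal sketch of the text side, under the same assumptions as above; image_embs stands for an (N, 768) tensor of L2-normalized image embeddings produced as in the previous snippet:

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    inputs = processor(text=["a dog playing in snow"], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)  # shape (1, 768)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # Cosine similarity against the image index; higher = better match
    scores = image_embs @ text_emb.T          # shape (N, 1)
    top5 = scores.squeeze(1).topk(5).indices  # indices of the 5 closest images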
