CLIP Cross-Encoder
CLIP-based cross-encoder for image-text and image-image similarity scoring
CLIP Cross-Encoder uses CLIP ViT-L/14 to compute pairwise similarity scores between an image and a text query, or between two images. It is useful for reranking, visual similarity, and recommendation pipelines.
When to use:
- Image-to-image similarity reranking
- Image search reranking with text queries
- Pipelines that already use CLIP embeddings
Input:
- Query Image (optional): Image to match against candidates
- Query Text (optional): Text query for image ranking
- Candidates (required): Candidate images to score
Output:
- Scores: Similarity or relevance scores per candidate
- Ranking: Indices sorted by relevance
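How the two outputs relate can be illustrated with a minimal sketch. It assumes the score is a cosine similarity between a query embedding and each candidate embedding, and that the ranking lists candidate indices from most to least relevant; neither detail is stated above, so treat both as assumptions. The function names `cosine_similarity` and `score_and_rank` are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP ViT-L/14 embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero vector")
    return dot / (norm_a * norm_b)


def score_and_rank(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
) -> Tuple[List[float], List[int]]:
    """Return one score per candidate plus candidate indices, best first.

    Assumption: higher score means more relevant.
    """
    scores = [cosine_similarity(query, c) for c in candidates]
    # Sort indices by descending score; Python's sort is stable, so tied
    # candidates keep their original relative order.
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return scores, ranking


# Toy vectors standing in for precomputed embeddings (illustrative only).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # identical direction to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
scores, ranking = score_and_rank(query, candidates)
```

Here `scores` holds one value per candidate in input order, while `ranking` holds positions into the candidate list, so `candidates[ranking[0]]` is the best match under these assumptions.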
Inference Settings
No inference-time settings. Scores are computed deterministically.