TL;DR
Are CLIP and SigLIP really “intra-modally misaligned”?
We find little support for it.
Abstract
What is the "Intra-Modal Misalignment Hypothesis"?
Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks [1,2,3]. The prevailing explanation is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images.
What is our "Reevaluation"?
In this study, we question the intra-modal misalignment hypothesis.
We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics it is claimed to affect.
For the theoretical argument, we demonstrate that the supposed degrees of freedom in image embedding distances do not exist.
For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2).
This indicates that the observed phenomena do not stem from a misalignment of the former.
Experiments on the commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, not a supposed misalignment, is key to best performance.
The Reevaluation
Previous Concern (Fig. 4a-c)
Our Explanation (Fig. 4d-f)
(d,e) The two configurations in (a) and (b) are not arbitrary; there is a good reason for both to exist: the images in (a,d) and (b,e) have the same distance \(r\) to the "cat" text embedding, yet the two images in (a,d) are much more similar to each other than those in (b,e). The displayed distance values are real measurements.
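To make this geometry explicit, here is the standard argument on the unit sphere (our own derivation, stated with cosine similarity \(s\) rather than distance \(r\); the two are interchangeable for unit-norm embeddings):

```latex
% Unit image embeddings x, y with equal similarity s to the text embedding t:
\[
x = s\,t + \sqrt{1-s^{2}}\,u, \qquad
y = s\,t + \sqrt{1-s^{2}}\,v, \qquad
u, v \perp t,\ \ \|u\| = \|v\| = 1 .
\]
% Their mutual similarity is only constrained to a range:
\[
x^{\top}y \;=\; s^{2} + \bigl(1-s^{2}\bigr)\,u^{\top}v \;\in\; \bigl[\,2s^{2}-1,\ 1\,\bigr].
\]
```

A single text anchor therefore admits both configurations; each additional text anchor adds a constraint and shrinks this range.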
Previous Concern (Fig. 2)
Previous "Workaround" (Fig. 3r)
Our "Back to the Basics" (Fig. 3l)
Questioning Few-Shot Metrics on a Toy Dataset
Table 1. Repeating the demonstrative experiment of [1] on the simple legacy Dogs vs. Cats dataset, where near-perfect results are expected.
It was suggested in [1] that poor results with CLIP image-image similarities are evidence of a misalignment in the image-image space.
This hypothesis finds no support when the model is swapped for the uni-modal DINO, a widely acknowledged state-of-the-art image embedder:
CLIP scores highest, so the observed low metrics are not caused by a model weakness, and hence not by a misalignment either.
Instead, the performance gap between text-image (T-I) and image-image (I-I) can be attributed to ambiguity in how the task is conveyed to the model: two images with opposite labels may still share enough other concepts to be justifiably similar (see the sketch after the table).
| Model | Retrieval (mAP), T‑I | Retrieval (mAP), I‑I | Classification (acc.), T‑I 0‑shot | Classification (acc.), I‑I 1‑shot | Classification (acc.), I‑I 16‑shot |
|---|---|---|---|---|---|
| CLIP ViT‑B/16 | 99.3 | 87.1 | 99.6 | 84.2 | 99.7 |
| DINOv2 ViT‑B/14 | – | 81.8 | – | 76.2 | 97.3 |
| DINOv3 ViT‑L/16 | – | 84.3 | – | 80.2 | 97.8 |
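A minimal sketch of such a T‑I vs. I‑I comparison on precomputed, L2-normalized embeddings (array names and the random placeholder data are ours; this is not the paper's evaluation code):

```python
import numpy as np

# Assumed precomputed, L2-normalized embeddings (e.g., from CLIP):
#   txt  (C, d): one text embedding per class prompt ("a photo of a cat", ...)
#   shot (C, d): one support image embedding per class (1-shot "prototypes")
#   test (N, d): query image embeddings, with labels y (N,)

def zero_shot_ti(test, txt):
    """Text-image: assign each query to the most similar class prompt."""
    return (test @ txt.T).argmax(axis=1)

def one_shot_ii(test, shot):
    """Image-image: assign each query to the most similar support image."""
    return (test @ shot.T).argmax(axis=1)

def accuracy(pred, y):
    return float((pred == y).mean())

# Random placeholders standing in for real embeddings:
rng = np.random.default_rng(0)
C, d, N = 2, 512, 100
txt = rng.normal(size=(C, d));  txt /= np.linalg.norm(txt, axis=1, keepdims=True)
shot = rng.normal(size=(C, d)); shot /= np.linalg.norm(shot, axis=1, keepdims=True)
test = rng.normal(size=(N, d)); test /= np.linalg.norm(test, axis=1, keepdims=True)
y = rng.integers(0, C, size=N)
print(accuracy(zero_shot_ti(test, txt), y), accuracy(one_shot_ii(test, shot), y))
```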
Reevaluating Image-to-Image Few‑shot Classification
Table 2. SigLIP (inter‑modal loss only) outperforms DINOv2, showing no disadvantage from the missing intra‑modal loss.
| Model | 2‑shot Proto | 2‑shot LDA | 4‑shot Proto | 4‑shot LDA | 8‑shot Proto | 8‑shot LDA | 16‑shot Proto | 16‑shot LDA |
|---|---|---|---|---|---|---|---|---|
| CLIP ViT‑B/16 | 55.3 | 60.0 | 63.8 | 69.8 | 69.6 | 76.1 | 73.5 | 79.5 |
| SigLIP ViT‑B/16 | 68.6 | 71.0 | 76.3 | 79.0 | 80.5 | 83.3 | 82.5 | 85.3 |
| SigLIP2 ViT‑B/16 | 69.7 | 73.2 | 77.0 | 80.5 | 80.8 | 84.5 | 83.0 | 86.5 |
| DINOv2 ViT‑B/14 | 67.1 | 69.2 | 71.8 | 75.3 | 76.0 | 80.3 | 78.2 | 83.3 |
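For reference, a minimal sketch of the two classifiers behind the Proto and LDA columns, applied to precomputed support/query embeddings (our own illustrative code using scikit-learn; the paper's exact protocol may differ):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def proto_predict(support, sup_y, query, n_classes):
    """Prototype classifier: nearest class mean under cosine similarity."""
    protos = np.stack([support[sup_y == c].mean(axis=0) for c in range(n_classes)])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    return (query @ protos.T).argmax(axis=1)

def lda_predict(support, sup_y, query):
    """LDA fit directly on the raw support embeddings; shrinkage keeps the
    covariance estimate usable in the few-shot (n << d) regime."""
    lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
    return lda.fit(support, sup_y).predict(query)

# Random placeholders standing in for k-shot embeddings (C classes, k shots):
rng = np.random.default_rng(0)
C, k, d = 10, 2, 512
support = rng.normal(size=(C * k, d))
sup_y = np.repeat(np.arange(C), k)
query = rng.normal(size=(50, d))
print(proto_predict(support, sup_y, query, C))
print(lda_predict(support, sup_y, query))
```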
Reevaluating Image‑to‑Image Retrieval
Table 3. Simple $PCA^\leftarrow$ consistently outperforms Optimization-based Textual Inversion (OTI): there is no need to convert images to pseudo-text tokens to "fix" a misalignment.
| Model | Method | Average | ROxford | RParis | Caltech101 | DTD | EuroSAT | FGVCAircraft | Flowers102 | Food101 | ImageNet | OxfordPets | StanfordCars | SUN397 | UCF101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP B/32 | Original | 41.6 | 42.4 | 74.0 | 77.7 | 28.3 | 49.3 | 14.5 | 62.5 | 33.6 | 21.6 | 31.2 | 24.9 | 34.6 | 46.2 |
| CLIP B/32 | OTI | 42.9 | 43.0 | 70.3 | 79.9 | 31.9 | 47.2 | 14.4 | 62.6 | 34.7 | 23.8 | 37.5 | 28.0 | 36.3 | 48.6 |
| CLIP B/32 | PCA← | 49.0 | 51.4 | 80.9 | 83.3 | 34.0 | 53.8 | 16.1 | 70.3 | 43.0 | 28.6 | 47.7 | 34.6 | 40.0 | 53.3 |
| CLIP L/14 | Original | 53.7 | 57.1 | 77.8 | 83.8 | 33.9 | 57.8 | 25.8 | 84.2 | 55.0 | 33.0 | 47.2 | 43.8 | 39.2 | 59.5 |
| CLIP L/14 | OTI | 57.0 | 62.4 | 77.1 | 87.3 | 37.7 | 56.3 | 27.1 | 86.0 | 55.9 | 38.2 | 56.0 | 50.5 | 43.5 | 62.8 |
| CLIP L/14 | PCA← | 61.3 | 64.5 | 83.0 | 89.5 | 39.9 | 62.8 | 28.7 | 88.9 | 64.4 | 42.2 | 62.7 | 57.2 | 46.0 | 66.8 |
| SigLIP B/16 | Original | 57.2 | 50.6 | 73.1 | 87.2 | 39.8 | 53.3 | 37.9 | 87.5 | 56.3 | 35.8 | 56.4 | 65.7 | 42.8 | 56.9 |
| SigLIP B/16 | OTI | 60.0 | 55.2 | 79.1 | 88.9 | 43.3 | 52.9 | 37.6 | 89.7 | 59.0 | 38.8 | 64.2 | 71.8 | 43.6 | 54.9 |
| SigLIP B/16 | PCA← | 62.8 | 57.9 | 78.4 | 91.2 | 44.2 | 54.2 | 40.9 | 92.0 | 61.8 | 43.5 | 68.5 | 77.2 | 46.9 | 60.3 |
| SigLIP2 B/16 | Original | 58.6 | 52.5 | 75.6 | 89.2 | 38.6 | 49.3 | 40.7 | 89.3 | 59.7 | 37.9 | 56.6 | 70.8 | 43.0 | 59.2 |
| SigLIP2 B/16 | PCA← | 64.4 | 59.4 | 78.6 | 93.0 | 44.1 | 51.1 | 46.3 | 93.2 | 65.3 | 46.5 | 67.8 | 80.2 | 48.9 | 63.0 |
[1] Mistretta et al.: "Cross the Gap: Exposing the Intra-Modal Misalignment in CLIP via Modality Inversion", ICLR 2025.
[2] Yi et al.: "Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification", CVPR 2024.
[3] Udandarao et al.: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models", ICCV 2023.
Take Home Messages
🧩 Finding 1 – Image-image distances are constrained
After training, if the text-image similarities are well calibrated, then the image-image similarities are fully determined as well: the joint embedding behaves like an over-constrained bipartite graph.
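A small numerical illustration of this point (our own construction, not code from the paper): once the text embeddings span the embedding space, the text-image similarity matrix pins down all image-image similarities exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_img, n_txt = 64, 10, 200            # many text anchors spanning R^d

imgs = rng.normal(size=(n_img, d))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
txts = rng.normal(size=(n_txt, d))
txts /= np.linalg.norm(txts, axis=1, keepdims=True)

S = imgs @ txts.T                        # observed text-image similarities

# Recover the image embeddings from S alone by solving
# imgs_hat @ txts.T = S in the least-squares sense:
imgs_hat = np.linalg.lstsq(txts, S.T, rcond=None)[0].T

# The image-image Gram matrix is then fully determined:
err = np.abs(imgs_hat @ imgs_hat.T - imgs @ imgs.T).max()
print(f"max deviation from true image-image similarities: {err:.1e}")
```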
🔍 Finding 2 – With or without an intra-modal loss, we found no difference
We found no significant differences between models trained with and without an intra-modal objective.
So we have no reason to believe that a missing intra-modal objective causes intra-modal misalignment.
🚀 Finding 3 – Basic few-shot methods are competitive
For few-shot classification, classic machine-learning techniques such as linear discriminant analysis (LDA) work well on the raw image embeddings (see the sketch after Table 2), which gives no indication of a misaligned image-image space.
📊 Finding 4 – Dropping class-irrelevant information explains and exceeds previous results
Removing visual details from the embedding removes information that's spurious w.r.t. the class label.
It thus mitigates task ambiguity in the few-shot setting. This is why some previous attempts worked.
And this is what our $PCA^\leftarrow$ does.
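As an illustrative sketch only (the exact $PCA^\leftarrow$ procedure is defined in the paper; the code below is our assumption of the general idea, generic PCA truncation that keeps the dominant directions of the gallery and discards the rest):

```python
import numpy as np

def pca_truncate_retrieval(gallery, queries, k=64):
    """ASSUMPTION: generic PCA truncation as a stand-in for PCA^<-, NOT the
    paper's reference implementation. Keeps the top-k principal directions
    of the gallery (requires k <= min(gallery.shape)) and drops the rest."""
    mu = gallery.mean(axis=0)
    # Principal directions of the centered gallery embeddings.
    _, _, Vt = np.linalg.svd(gallery - mu, full_matrices=False)
    P = Vt[:k]                                   # (k, d) projection basis
    g = (gallery - mu) @ P.T
    q = (queries - mu) @ P.T
    # Re-normalize and rank gallery items by cosine similarity.
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    return (q @ g.T).argsort(axis=1)[:, ::-1]    # ranked retrieval lists
```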
Cite this work
@article{herzog2026reevaluating,
title = {Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP},
author = {Jonas Herzog and Yue Wang},
journal = {arXiv preprint arXiv:2603.16100},
year = {2026}
}