Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

CVPR 2026
Zhejiang University
TL;DR Are CLIP and SigLIP really “intra-modally misaligned”? We find little support for it.

Abstract

What is the "Intra-Modal Misalignment Hypothesis"?

Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The prevailing explanation is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leaving the distances between image embeddings poorly calibrated. [1,2,3]

What is our "Reevaluation"?

In this study, we question this intra-modal misalignment hypothesis.

We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics it is said to affect. For the theoretical argument, we demonstrate that the supposed degrees of freedom for image embedding distances do not exist. For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and for models trained with image-image objectives (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, not a supposed misalignment, is key for best performance.

The Reevaluation

Previous Concern (Fig. 4a-c)

Fig 4a
(a) Two image embeddings with the same distance $r$ to a text embedding can be close together...
Fig 4b
(b) ...or far apart.
Fig 4c
(c) Previous conclusion: image embeddings can lie at arbitrary points on the circumference, so a degree of freedom remains, leaving room for intra-modal miscalibration.

Our Explanation (Fig. 4d-f)

Fig 4d
Fig 4e
(d,e) The two configurations in (a) and (b) are not arbitrary; each has a good reason to exist. The images in (a,d) and (b,e) have equal distance $r$ to the "cat" text, yet the two images in (a,d) are much more similar to each other than those in (b,e). Displayed distance values are real measurements.
Fig 4f
(f) Intra‑modal similarities are a consequence of inter‑modal similarities, with no extra degree of freedom: the previous line of argumentation overlooks that after training, each image embedding is bound to more than one text.
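The claim in Fig. 4f can be checked numerically: once an image embedding's similarities to a set of texts that spans the embedding space are fixed, the embedding itself, and with it every image-image similarity, is fully determined. A minimal numpy sketch, with random unit vectors standing in for real CLIP embeddings (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# more (generic) text anchors than embedding dimensions,
# so the text embeddings span the whole space
T = rng.normal(size=(64, d))
T /= np.linalg.norm(T, axis=1, keepdims=True)
I = rng.normal(size=(8, d))
I /= np.linalg.norm(I, axis=1, keepdims=True)

S = I @ T.T                          # inter-modal similarities (the supervised quantity)
# if S is fixed for a spanning set of texts, the image embeddings,
# and hence all image-image similarities, are already determined:
I_rec = S @ np.linalg.pinv(T).T      # solve I from I @ T.T = S
gram_true = I @ I.T
gram_rec = I_rec @ I_rec.T
print(np.abs(gram_true - gram_rec).max())  # ≈ 0 up to numerical precision
```

With fewer texts than embedding dimensions the linear system becomes underdetermined and the reconstruction error is nonzero; a single text, as in Fig. 4a-c, is exactly that degenerate regime.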

Previous Concern (Fig. 2)

Fig 2 (by class): class histograms
Figure 2 (by class). Pairwise cosine similarity distributions. Left: Similarities between same class (blue) and opposite class (orange) image feature pairs. A high overlap ratio between the two colors was previously highlighted as an indicator for an intra-modal misalignment issue in CLIP.
Fig 2 (by modality): modality histograms
Figure 2 (by modality). Similarity distributions of image-text pairs (purple) versus image-image pairs (green). Because CLIP is only supervised on the former, the divergence has previously prompted concerns about whether the latter reflect true similarities.

Our Finding (Fig. 5)

SigLIP class histograms
SigLIP (inter-modal only)
SigLIP2 class histograms
SigLIP2 (with intra-modal objective)
Figure 5 (by class). Cosine similarity histograms by class. The distributions are nearly identical for purely text-image trained SigLIP (left) and for SigLIP2 (right), which adds an image-image self-supervised objective in the style of the DINO line of work. This indicates that intra-class variation is not a sign of misalignment caused by pure text-image training, but normal behavior.
SigLIP modality histograms
SigLIP (inter-modal only)
SigLIP2 modality histograms
SigLIP2 (with intra-modal objective)
Figure 5 (by modality). Cosine similarity histograms by modality. The divergence in similarity distributions between inter-modal (image-text) and intra-modal (image-image) pairs is nearly identical for both models. That SigLIP2's additional image-image training does not close this gap demonstrates the gap is not a misalignment introduced by a missing image-image objective.
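The by-class indicator in Figures 2 and 5 is easy to reproduce for any embedder: compute all pairwise cosine similarities, split the pairs into same-class and different-class, and measure the histogram overlap. A minimal sketch assuming precomputed features; the toy 2-d clusters below are placeholders for real embeddings:

```python
import numpy as np

def class_similarity_overlap(feats, labels, bins=50):
    """Histogram overlap between same-class and different-class
    cosine similarities (the indicator behind Figs. 2 and 5)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    iu = np.triu_indices(len(f), k=1)          # unique pairs only
    same = labels[iu[0]] == labels[iu[1]]
    lo, hi = sim[iu].min(), sim[iu].max()
    h1, edges = np.histogram(sim[iu][same], bins=bins, range=(lo, hi), density=True)
    h2, _ = np.histogram(sim[iu][~same], bins=bins, range=(lo, hi), density=True)
    w = edges[1] - edges[0]
    return np.minimum(h1, h2).sum() * w        # overlap ratio in [0, 1]

# toy demo: two well-separated clusters give a low overlap
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal((3, 0), 0.3, (50, 2)),
                   rng.normal((0, 3), 0.3, (50, 2))])
labels = np.repeat([0, 1], 50)
ov = class_similarity_overlap(feats, labels)
print(ov)
```

Running the same function on features from different backbones is all that is needed to compare the paired histograms side by side.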

Previous "Workaround" (Fig. 3r)

Fig 3 right: image-text comparison
Figure 3 (r). Motivated by the intra-modal misalignment hypothesis, a previous work posited it is necessary to convert image-image comparison into image-text comparison.

Our "Back to the Basics" (Fig. 3l)

Fig 3 left: image-image comparison
Figure 3 (l). We go back to the basics.

Questioning Few-Shot Metrics on Toy Dataset

Table 1. We repeat the demonstrative experiment of [1] on the simplistic legacy Dogs vs. Cats dataset, where near-perfect results are expected. [1] suggested that poor results with CLIP image-image similarities evidence a misalignment of the image-image space. This hypothesis finds no support when the model is swapped for uni-modal DINO, a widely acknowledged state-of-the-art image embedder: CLIP scores highest, suggesting the observed low metrics are not caused by a model weakness and hence not by a misalignment. Instead, the performance gap between text-image (T-I) and image-image (I-I) can be attributed to ambiguity in how the task is conveyed to the model: two images with opposite labels may still share enough other concepts to be justifiably similar.

Model             Retrieval (mAP)     Classification (acc.)
                  T‑I     I‑I         T‑I (0‑shot)    I‑I (1 / 16‑shot)
CLIP ViT‑B/16     99.3    87.1        99.6            84.2 / 99.7
DINOv2 ViT‑B/14   –       81.8        –               76.2 / 97.3
DINOv3 ViT‑L/16   –       84.3        –               80.2 / 97.8
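A standard image-image few-shot baseline (not necessarily the exact protocol behind Table 1) is nearest-prototype classification on normalized embeddings. A minimal sketch, with toy 2-d clusters standing in for CLIP or DINO features:

```python
import numpy as np

def proto_predict(support, support_y, query):
    """Nearest-prototype few-shot classifier: each class prototype is the
    mean of its L2-normalized support embeddings; each query goes to the
    prototype with the highest cosine similarity."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    classes = np.unique(support_y)
    protos = np.stack([s[support_y == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    return classes[np.argmax(q @ protos.T, axis=1)]

# toy 4-shot check with two well-separated clusters
rng = np.random.default_rng(0)
sup = np.vstack([rng.normal((3, 0), 0.2, (4, 2)), rng.normal((0, 3), 0.2, (4, 2))])
qry = np.vstack([rng.normal((3, 0), 0.2, (20, 2)), rng.normal((0, 3), 0.2, (20, 2))])
acc = (proto_predict(sup, np.repeat([0, 1], 4), qry) == np.repeat([0, 1], 20)).mean()
print(acc)  # 1.0 on this toy data
```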

Reevaluating Image-to-Image Few‑shot Classification

Table 2. SigLIP (inter‑modal only) outperforms DINOv2, showing no disadvantage from its missing intra‑modal loss.

Model              2‑shot          4‑shot          8‑shot          16‑shot
                   Proto    LDA    Proto    LDA    Proto    LDA    Proto    LDA
CLIP ViT‑B/16      55.3     60.0   63.8     69.8   69.6     76.1   73.5     79.5
SigLIP ViT‑B/16    68.6     71.0   76.3     79.0   80.5     83.3   82.5     85.3
SigLIP2 ViT‑B/16   69.7     73.2   77.0     80.5   80.8     84.5   83.0     86.5
DINOv2 ViT‑B/14    67.1     69.2   71.8     75.3   76.0     80.3   78.2     83.3
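For reference: "Proto" denotes nearest-class-prototype classification, while "LDA" fits a linear discriminant on the support embeddings. A numpy-only LDA sketch, where the pseudo-inverse is a stand-in for whatever regularization a real implementation would use (a few-shot covariance estimate is singular) and toy Gaussians replace real features:

```python
import numpy as np

def lda_predict(Xs, ys, Xq):
    """Few-shot linear discriminant analysis with a shared class covariance.
    The pseudo-inverse handles the singular covariance of tiny support sets."""
    classes = np.unique(ys)
    means = np.stack([Xs[ys == c].mean(axis=0) for c in classes])
    centered = np.vstack([Xs[ys == c] - means[i] for i, c in enumerate(classes)])
    P = np.linalg.pinv(centered.T @ centered / len(centered))
    # linear discriminant score per class (equal priors assumed)
    scores = Xq @ P @ means.T - 0.5 * np.einsum('cd,dk,ck->c', means, P, means)
    return classes[scores.argmax(axis=1)]

# toy 16-shot problem: two Gaussian classes in a 32-d "embedding" space
rng = np.random.default_rng(0)
mu = rng.normal(size=(2, 32))
Xs = np.vstack([mu[c] + 0.1 * rng.normal(size=(16, 32)) for c in (0, 1)])
ys = np.repeat([0, 1], 16)
Xq = np.vstack([mu[c] + 0.1 * rng.normal(size=(200, 32)) for c in (0, 1)])
acc = (lda_predict(Xs, ys, Xq) == np.repeat([0, 1], 200)).mean()
print(acc)
```

Unlike the prototype classifier, LDA reweights feature directions by their within-class scatter, which is one reason it pulls ahead as the shot count grows.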

Reevaluating Image‑to‑Image Retrieval

Table 3. Simple $PCA^\leftarrow$ consistently outperforms Optimization-based Textual Inversion (OTI) – no need to convert images to pseudo-text tokens to “fix” a misalignment.

Model          Method     Average  ROxford  RParis  Caltech101  DTD   EuroSAT  FGVCAircraft  Flowers102  Food101  ImageNet  OxfordPets  StanfordCars  SUN397  UCF101
CLIP B/32      Original   41.6     42.4     74.0    77.7        28.3  49.3     14.5          62.5        33.6     21.6      31.2        24.9          34.6    46.2
               OTI        42.9     43.0     70.3    79.9        31.9  47.2     14.4          62.6        34.7     23.8      37.5        28.0          36.3    48.6
               PCA        49.0     51.4     80.9    83.3        34.0  53.8     16.1          70.3        43.0     28.6      47.7        34.6          40.0    53.3
CLIP L/14      Original   53.7     57.1     77.8    83.8        33.9  57.8     25.8          84.2        55.0     33.0      47.2        43.8          39.2    59.5
               OTI        57.0     62.4     77.1    87.3        37.7  56.3     27.1          86.0        55.9     38.2      56.0        50.5          43.5    62.8
               PCA        61.3     64.5     83.0    89.5        39.9  62.8     28.7          88.9        64.4     42.2      62.7        57.2          46.0    66.8
SigLIP B/16    Original   57.2     50.6     73.1    87.2        39.8  53.3     37.9          87.5        56.3     35.8      56.4        65.7          42.8    56.9
               OTI        60.0     55.2     79.1    88.9        43.3  52.9     37.6          89.7        59.0     38.8      64.2        71.8          43.6    54.9
               PCA        62.8     57.9     78.4    91.2        44.2  54.2     40.9          92.0        61.8     43.5      68.5        77.2          46.9    60.3
SigLIP2 B/16   Original   58.6     52.5     75.6    89.2        38.6  49.3     40.7          89.3        59.7     37.9      56.6        70.8          43.0    59.2
               PCA        64.4     59.4     78.6    93.0        44.1  51.1     46.3          93.2        65.3     46.5      67.8        80.2          48.9    63.0
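For reference, the image-to-image retrieval protocol behind Tables 1 and 3 in sketch form: rank the gallery by cosine similarity to each query and average the per-query precision (mAP). A minimal sketch assuming precomputed embeddings; the toy clusters stand in for real features:

```python
import numpy as np

def retrieval_map(feats, labels):
    """Image-to-image retrieval mAP: every image queries all others;
    an item is relevant if it shares the query's label."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)             # never retrieve the query itself
    aps = []
    for q in range(len(f)):
        order = np.argsort(-sim[q])[:-1]       # drop the query (sorted last)
        rel = (labels[order] == labels[q]).astype(float)
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec * rel).sum() / rel.sum())
    return float(np.mean(aps))

# toy gallery with two well-separated classes yields a perfect ranking
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal((3, 0), 0.2, (30, 2)),
                   rng.normal((0, 3), 0.2, (30, 2))])
labels = np.repeat([0, 1], 30)
m = retrieval_map(feats, labels)
print(m)  # 1.0: all same-class items ranked first
```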

[1] Mistretta et al.: "Cross the Gap: Exposing the Intra-Modal Misalignment in CLIP via Modality Inversion", ICLR 2025.

[2] Yi et al.: "Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification", CVPR 2024.

[3] Udandarao et al.: "Training-Free Name-Only Transfer of Vision-Language Models", CVPR 2023.


Take Home Messages

🧩 Finding 1 – Image-image distances are constrained

After training, if text-image similarities are well calibrated, then image-image similarities are well defined too, much like an overconstrained bipartite graph.

🔍 Finding 2 – With or without an intra-modal loss, we find no difference

We found no significant differences between models trained with and without an intra-modal objective, so we have no reason to believe that a missing intra-modal objective causes intra-modal misalignment.

🚀 Finding 3 – Basic few-shot methods are competitive

For few-shot classification, classic machine-learning techniques such as linear discriminant analysis (LDA) work well directly on the raw image embeddings; nothing about their performance suggests the embeddings are defective.

📊 Finding 4 – Dropping class-irrelevant information explains and exceeds previous results

Removing visual details from the embedding discards information that is spurious with respect to the class label, thereby mitigating task ambiguity in the few-shot setting. This is why some previous workarounds helped, and it is what our $PCA^\leftarrow$ does.
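The exact $PCA^\leftarrow$ construction is specified in the paper; as an illustration of the general idea, dropping directions of variation from the embeddings, here is a plain PCA truncation sketch (the number of kept components `k` and the fitting set are illustrative choices, not the paper's):

```python
import numpy as np

def pca_truncate(feats, k):
    """Project embeddings onto their top-k principal components,
    discarding the remaining directions of (possibly class-irrelevant)
    variation. Illustrative stand-in, not the paper's exact PCA^<- variant."""
    X = feats - feats.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: principal axes
    return X @ Vt[:k].T

# usage: compress 512-d embeddings of a task's image set to 32 directions
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 512))
Z = pca_truncate(feats, 32)
print(Z.shape)  # (100, 32)
```

The components would be fit on the task's own images, so the retained subspace reflects the variation that matters for that task rather than generic visual detail.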


Cite this work

@article{herzog2026reevaluating,
      title   = {Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP},
      author  = {Jonas Herzog and Yue Wang},
      journal = {arXiv preprint arXiv:2603.16100},
      year    = {2026}
    }
2026-03-26 🚧🚧Under Development. May contain inaccuracies.🚧🚧