In this study, we question this intra-modal misalignment hypothesis.
We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected.
For the theoretical argument, we demonstrate that the supposed degrees of freedom for image embedding distances do not exist.
For the empirical measures, we find that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2).
This indicates the observed phenomena do not stem from a misalignment of the former.
Experiments on two commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, not supposed misalignment, is key to achieving the best performance.