Aligning machine and human visual representations across abstraction levels

Muttenthaler, Lukas; Greff, Klaus; Born, Frieda; Spitzer, Bernhard; Kornblith, Simon; Mozer, Michael C.; Müller, Klaus-Robert; Unterthiner, Thomas; Lampinen, Andrew K.

doi:10.1038/s41586-025-09631-6

Download PDF

Article
Open access
Published: 12 November 2025

Aligning machine and human visual representations across abstraction levels

Lukas MuttenthalerÂ ORCID: orcid.org/0000-0002-0804-4687^1,2,3,4,
Klaus Greff¹,
Frieda BornÂ ORCID: orcid.org/0009-0002-1214-4864^2,3,5,
Bernhard SpitzerÂ ORCID: orcid.org/0000-0001-9752-932X^5,6,
Simon Kornblith^7,8,
Michael C. MozerÂ ORCID: orcid.org/0000-0002-9654-0575⁸,
Klaus-Robert MÃ¼ller^1,2,3,9,10,
Thomas Unterthiner¹ &
â€¦
Andrew K. LampinenÂ ORCID: orcid.org/0000-0002-6988-8437⁸Â

Nature volumeÂ 647,Â pages 349â€“355 (2025)Cite this article

51k Accesses
7 Citations
91 Altmetric
Metrics details

Subjects

Abstract

Deep neural networks have achieved success across a wide range of applications, including as models of human behaviour and neural representations in vision tasks^1,2. However, neural network training and human learning differ in fundamental ways, and neural networks often fail to generalize as robustly as humans do^3,4, raising questions regarding the similarity of their underlying representations. We need to determine what is missing for modern learning systems to exhibit more human-aligned behaviour. Here we highlight a key misalignment between vision models and humans: whereas human conceptual knowledge is hierarchically organized from fine- to coarse-scale distinctions (for example, ref. ⁵), model representations do not accurately capture all these levels of abstraction. To address this misalignment, we first train a teacher model to imitate human judgements, then transfer human-aligned structure from its representations to refine the representations of pretrained state-of-the-art vision foundation models via fine-tuning. These human-aligned models more accurately approximate human behaviour and uncertainty across a wide range of similarity tasks, including a dataset of human judgements spanning multiple levels of semantic abstractions. They also perform better on a diverse set of machine learning tasks, increasing generalization and out-of-distribution robustness. Thus, infusing neural networks with additional human knowledge yields a best-of-both-worlds representation that is both more consistent with human cognitive judgements and more practically useful, paving the way towards more robust, interpretable and human-aligned artificial intelligence systems.

Emulating human-like adaptive vision for efficient and flexible machine visual perception

Article 06 November 2025

Dimensions underlying the representational alignment of deep neural networks with humans

Article Open access 23 June 2025

Overcoming classic challenges for artificial neural networks by providing incentives and practice

Article 20 October 2025

Main

Although deep learning has recently driven rapid progress in areas of artificial intelligence such as natural language processing⁶ and computer vision^7,8,9, even the best of these systems often fail in ways that humans would not^{4,10,11,12,13}. These failures have led to renewals^3,12 of older arguments^14,15 that neural networks lack the essential ingredients of human intelligence. Therefore, we need to determine how can we build systems that produce more human-aligned behaviour.

Human perception is robust and generalizes across different visual settings^4,16,17. However, model performance declinesâ€”often markedlyâ€”if the data distribution shifts between the training and test sets (for example, refs. ^11,18). This lack of robustness in vision model representations poses a challenge for downstream applications that require generalization (for example, refs. ^10,11,19). In addition, humans tend to be well calibratedâ€”for example, when they are asked to judge visual similarity¹⁷â€”that is, humansâ€™ (un)certainty tends to correlate with their (in)accuracy. Artificial intelligence systems, however, are often overconfident and show high certainty even when their predictions are incorrect²⁰. Thus, many differences remain to be reconciled before we can ultimately achieve human-like artificial intelligence.

Here we highlight a key misalignment between humans and deep learning models that may underlie some of these differences: model representations tend to fail to capture the full multi-level conceptual structure of human knowledge. Although model representations successfully encode the local human-perceived similarity structure among closely related entities (for example, different dog breeds), the global relationships between concepts with more abstract semantic relations (for example, dogs and fish, which are both animate but visually dissimilar) are modelled much less systematically. Human neural representations, however, are organized by global features such as animacy^5,21, and at multiple finer scales that capture nuanced semantic relationships^21,22,23,24. This lack of global organization in the representations of deep learning models across levels of the conceptual hierarchy likely contributes to the aforementioned weaknesses of these models.

A challenge for addressing this misalignment is that collecting representative datasets of human judgements is challenging and expensive. We therefore propose a method for synthesizing simulated (approximately) human-aligned similarity judgements via a surrogate teacher modelâ€”a large foundation model that we align using an affine transformation²⁵ and uncertainty distillation on a small existing dataset²⁶. We use this surrogate to produce the AligNet datasetâ€”human-aligned pseudolabels (compare ref. ²⁷) from the surrogate model for triplets sampled from ImageNet²⁸ using a clustering-based data-grouping method. We fine-tune various vision foundation models on AligNet using a similarity-space distillation objective. These models show substantially more human-aligned predictions on various cognitive science tasksâ€”including Levels, a dataset of human semantic judgements reflecting multiple levels of abstraction. Furthermore, these aligned models show improved accuracy and out-of-distribution robustness across many downstream machine learning tasks, thus showing the improved generalizability of the aligned model representations. We release our aligned models and training and evaluation datasets.

In summary, our work contributes to better understanding a key difference between artificial and natural intelligence. Moreover, our results illustrate a principle for aligning models to humansâ€”focusing on the multi-scale relational structure of human knowledgeâ€”that may contribute to the more general problem of achieving human-aligned artificial intelligence.

To build foundation models with more human-aligned behaviour, we inject additional supervision about human behaviour into the model representations, using a surrogate teacher model: a vision foundation model whose representations are linearly transformed to approximate human judgements and uncertainty on the THINGS dataset²⁶. We use clusters from this teacher modelâ€™s representations to sample triplets from ImageNet²⁸ and soft-label them using distances in the teachers representation space, then distil these soft labels into the student representations via a Kullbackâ€“Leibler divergence objective (Fig. 1). For details, see Methods.

Towards more human-aligned models

THINGS triplet odd one out

We first validate that our teacher model performs well on the test data for the THINGS dataset²⁶ used in training; as expected, the teacher achieves high performance (61.7% accuracy, close to the human noise ceiling of 66.67%). We then align a variety of student modelsâ€”trained with objectives ranging from image captioning to classification or self-supervised learningâ€”using this teacherâ€™s representations; all models show substantially improved human alignment on the THINGS tests (relative performance increases from 21.28â€“74.47%).

Other cognitive tasks

Our findings generalize across various object similarity tasks that are commonly used in the cognitive sciencesâ€”triplet odd-one-out task (relative performance increases up to 73.35%; Fig 2a), Likert scale similarity ratings (up to 6.3-fold increase in Spearman rank correlation coefficient; Fig 2b) and multiple-arrangement tasks (up to 14.47-fold increase in Spearman rank correlation coefficient; Fig.â€‰2c). All performance increases are statistically significant at Î±â€‰=â€‰0.05; for details, see Supplementary Information.

Alignment at multiple levels of abstraction

Levels dataset

Because previous cognitive datasets were not specifically targeted for assessing knowledge of vision foundation models across levels of abstraction, we collected a dataset of human judgementsâ€”which we call Levelsâ€”that is based on the triplet odd-one-out task, but stratified across different levels of the semantic hierarchy. Specifically, we collect global coarse-grained semantic, which requires deciding on the odd one out among broadly different categories; local fine-grained semantic, involving discerning subtle distinctions within the same category; and class boundary, testing the capacity to identify category boundaries. For details, see Methods.

Alignment at multiple levels

The Levels dataset allows us to systematically study discrepancies between human and model decisions across these different levels. We find that our soft-alignment method reduces these discrepancies at all levels, but especially for the global coarse-grained judgements, as we predicted.

Global coarse-grained

This level shows the largest improvements. The base models achieved low accuracies of 36.09% (classifier ViT-B (ref. ⁸)) to 57.38% (self-supervised DINOv2 (ref. ²⁹)). AligNet models significantly improved; all models performed well, with accuracies of 65.70% (ViT-B) to 68.56% (DINOv2)â€”above the human-to-human reliability score of 61.92% (Fig. 2d, leftmost column). That is, the AligNet modelsâ€™ responses were more similar to average human responses (as each triplet response is the majority response of the participants) than the level of agreement among the human participants. AligNet modelsâ€™ relative improvements ranged from 19.48% (DINOv2) to 93.51% (ViT-L).

Local fine-grained

Most base models did not strongly align to human responses for fine-grained semantics either; all models achieved poor alignment scores of 46.04% (ViT-B) to 57.72% (DINOv2), except for DINOv1, which performed significantly better (62.92%; near the human noise ceiling of 65.92%; Supplementary Table 3). AligNet models achieved increased accuracies of 58.93 (ViT-S) to 62.92% (DINOv1), with relative improvements ranging from 7.84% (DINOv2) to 46.03% (ViT-L) (Fig. 2d, middle column).

Class boundary

Supervised classifiers and image/text contrastive base models performed close to the noise ceiling; accuracies ranged from 81.96% (SigLIP) to 93.67% (ViT-L). Others performed worse; the CapPa (ref. ³⁰) captioning model achieved 70.37%. AligNet fine-tuning brought all models to a similar level, achieving accuracies up to 93.24% (ViT-L), higher than the human noise ceiling of 89.21% (Fig. 2d, rightmost column). Relative improvements ranged from 0.62% (ViT-L) to 32.29% (CapPa). For more performance details, see Supplementary Table 3.

AligNet model uncertainties correspond to human latencies

We also collected (continuous) human response times, which we use as a proxy of the participantsâ€™ uncertainty^31,32. We measured model uncertainty as the entropy of the three pairwise similarities within each triplet. Base model uncertainties were not correlated with human response times for the coarse (Ïâ€‰=â€‰âˆ’0.014â€“0.184) and fine-grained (Ïâ€‰=â€‰0.047â€“0.160) settings and moderately correlated for the class-boundary setting (Ïâ€‰=â€‰0.208â€“0.432). All AligNet models showed substantially increased uncertainty alignment across all levels (Fig. 2e), especially the coarse-grained abstraction level (Ïâ€‰=â€‰0.479â€“0.506).

Evaluating other model classes

To confirm that the aforementioned representational weaknesses are present in other classes of deep learning models, we evaluated two state-of-the-art natively multimodal large language models, Gemini 2.0 Flash and Gemini 2.5 pro³³, on Levels. These visual-language models perform similarly toâ€”or slightly better thanâ€”the better pretrained vision models across all levels; however, they still substantially underperform our AligNet fine-tuned models (Supplementary Information section 1.7). In addition, we evaluated models trained on Ecoset³⁴, an ecologically motivated natural image dataset, and find that Ecoset models have severe difficulties matching the human similarity judgements (Supplementary Information section 1.6). These results confirm that the AligNet fine-tuning offers greater improvements in human alignment than merely incorporating language modelling or ecological training.

Aligned models reflect the conceptual hierarchy

Next we consider how model representations change after soft alignment. In Fig. 3, we show that although the model representations are dissimilar before alignment, they become more aligned with each other after soft alignment. This convergence is driven by models aligning better with the human conceptual hierarchy (Fig. 3b; compare ref. ³⁵). Our soft-alignment procedure embeds this global structure at two levels: first in the cluster-based sampling (Fig. 3c) and then in the soft labels (see above). Because of these factors, the relationships between image representations change during alignment according to their semantic relationship; representations of images from the same basic category tend to move closer together, those of images from the same superordinate category tend to move somewhat closer, and those from different superordinate categories tend to move apart (all effects are highly significant; tâ€‰>â€‰3.93, Pâ€‰<â€‰0.001).

**Fig. 3: Aligned models reflect the semantic hierarchy.**

As an illustrative example, in a base ViT-B, the representations of lizards are close to those of some plants and fruits owing their similarity in texture, colour or background; after alignment, they become similar to representations of other animals and more distant from those of other, unrelated superordinate categories. This reorganization yields better generalization, for example, when a lizard image is used as an example depicting an abstract category such as animals.

Supplementary Information section 3.2 and Extended Data Fig. 2 present more detailed analyses corroborating these results, including reorganization at higher levels such as living versus non-living, across model layers, and in other models and ablations. Furthermore, Supplementary Information section 3.3 and Extended Data Fig. 3 show that where the AligNet model and a baseline unaligned model disagree, human judgements are strongly correlated with those of AligNet, but not the ablation model (which relies more on superficial stylistic features)â€”in fact, every human participant in the study agreed more with AligNet.

Alignment improves generalization and robustness

Next we considered how human-aligned representations affect the performance on machine learning tasks. We investigated how alignment improves generalization and out-of-distribution robustness across a variety of downstream tasks.

One-shot classification

We first test an extremely challenging generalization setting: classifying images given only a single labelled example per class. In Fig. 4, we show one-shot performance before and after soft alignment on ten image-classification datasets from varied domains, such as fine-grained bird (Birds³⁶) and flower (Flowers³⁷) types classification, multi-domain natural image classification (ImageNet³⁸) and scene recognition (Places365³⁹). The majority of cases (32 of 40) show an improvement, sometimes by a substantial margin (for example, DINOv2 shows a 2.7-fold increase on the Pets dataset); overall, alignment significantly increases the generalization performance on these tasks (Pâ€‰<â€‰0.05). These results show how human-aligned representations support strong generalization from little data. See Supplementary Information section 2.1 for additional results, including the complementary benefits of combining our method with other approaches to few-shot generalization.

**Fig. 4: Machine learning downstream results.**

Distribution shift

A long-standing problem for applying machine learning algorithms is distribution shift (compare refs. ^40,41): in deployment, data often differ in subtle ways from the training data, leading to unexpected model failures. To evaluate whether the global structure induced by alignment helps ameliorate this issue, we evaluated our models on the BREEDS benchmarks⁴²â€”which specifically tests generalization under input distribution shifts, using datasets where training and test data points are sampled from different subpopulations. Figure 4b shows that AligNet fine-tuning consistently improves performance significantly across all benchmarks and models types (especially the image classifier, ViT-B). See Supplementary Information section 2.3 for further results.

Model robustness

Alignment also improves robustness. We evaluate on ImageNet-A⁴³, a challenging dataset of natural images that are adversarial (that is, models tend to misclassify them, but humans perform better). Again, alignment improves accuracy for all models (compare Fig. 4c), with improvements of up to 9.5â€‰percentage points (1.6-fold improvement for CapPa). Although our method is not designed for improving out-of-distribution robustness, its improvements are comparable to state-of-the-art methods designed for precisely this problem (Supplementary Table 6).

Together, these machine learning results corroborate that AligNet fine-tuning improves the generalization, transfer and robustness of model representations.

Discussion

The differences between natural intelligence and the capabilities of neural networks are the subject of long-standing debates^3,14. Despite the recent progress in artificial intelligence, these discussions persist, because deep learning systems still seem to fail in non-human-like ways^10,11.

Here, we have highlightedâ€”and addressedâ€”a key deficiency in a broad class of vision foundation models: their representations do not adequately represent the multi-level conceptual structure of human semantic knowledge (â€˜Towards more human-aligned modelsâ€™ section). We demonstrate this deficiency through Levels, a dataset of human similarity judgements across multiple levels of abstraction. To address this deficiency, we established a methodological framework for aligning deep learning modelsâ€™ representations with human similarity judgements to create more human-aligned systems. This framework involves bootstrapping from a small quantity of human data to train a surrogate teacher model, and using this teacher to create a large synthetic dataset (AligNet), which we use to fine-tune various vision foundation models to inject human-aligned structure.

This approach yields significantly increased alignment to human judgements on cognitive science tasks (â€˜Towards more human-aligned modelsâ€™ section), and better generalization and robustness on representative machine learning tasks (â€˜Alignment improves generalization robustnessâ€™ section). Thus, soft alignment helps to reduce the brittleness of machine learning models under changing environments. Moreover, our results illustrate how the broader paradigm of studying representational alignment^44,45 can not only yield insights about how different systems relate but also be leveraged to actively align model representations with human knowledge to improve the modelsâ€™ generalization abilities.

These results contribute to long-standing debates over which features of human intelligence neural networks may lack^3,14,15. In particular, one line of critique argues that neural networks lack the capability to appropriately represent abstract relations such as same and different^46,47, or to organize knowledge into hierarchies of concepts¹⁵. Although aspects of these critiques have been refuted in simple synthetic settings (for example, ref. ⁴⁸), similar criticisms persist for modern foundation models¹². Our results show that, although standard training objectives do not adequately capture hierarchical category relations, these relations can be distilled into the modelsâ€”which improves the modelsâ€™ resilience under the distribution shifts highlighted in previous critiques. These results show that hierarchical representations may emerge from a system that is neither explicitly hierarchically structured nor trained explicitly on the hierarchy.

Although we focused on vision, similar global misalignments probably arise in other areas of research. For instance, in natural language processing, models are similarly trained with objectives that focus on distinguishing between close matches (for example, prediction objectives that primarily distinguish words that are likely to occur, rather than considering their relations to less probable concepts). Applying alignment techniques may therefore analogously help to better capture the global structure of semantic and syntactic relationships among language inputs that these objectives might miss.

More broadly, artificial intelligence systems have been successfully adopted in many areas. However, these deployments lead to practical⁴⁹ and conceptual⁵⁰ concerns about trustworthiness and safety. It is therefore increasingly important to identify the reasons why these systems occasionally fail and how to alleviate these failures. Our work advances the understanding of the deficiencies of vision model representations, and simultaneously shows a viable path for ameliorating these deficiencies by alignment with human judgements.

Our work has a number of limitations that could be addressed in future efforts. First, the models we used account for neither context in similarity judgements nor higher-order relations. Second, human representations may vary systematically across individuals, cultures and so on. Finally, human judgement is full of flaws, intrinsic contradictions and discrepancies. Given these issues, perfect alignment to human performance may not always be desirable for a technical system. Thus, future work could explore how to best learn from human knowledge without imitating human imperfections.

In summary, we have provided an initial approach to distil global, human-aligned similarity structures into the representations of modern deep neural networks. We have demonstrated an efficient path towards a best-of-both-worlds representation that is both more consistent with human judgements and more practically useful, paving the way towards more robust, interpretable and human-aligned artificial intelligence systems. We hope that our work will inspire more general approaches for aligning foundation models by distiling human priors into their representations.

Methods

Soft alignment

This section is organized as follows. We start by describing how we transform model representations into a space that matches human similarity judgements about coarse-grained semantic object relations. We introduce an affine transformation that matches human similarity judgements and injects the uncertainties that humans assign to their triplet odd-one-out choices into a modelâ€™s representation space creating a surrogate teacher model. Using the teacher modelâ€™s human-aligned representations, we sample triplets of ImageNet³⁸ images differently than uniform random sampling by clustering the representations into superordinate categories and using those clusters for data partitioning. We pseudo-label these triplets with human-aligned judgement distributions from the surrogate teacher model. Finally, after having created AligNet triplets, we fine-tune student models with a triplet loss object function.

Representational alignment

Data

To increase the degree of alignment between human and neural network similarity spaces, we begin from the publicly available THINGS dataset, which is a large behavioural dataset of 4.7â€‰million unique triplet responses from 12,340 human participants for mâ€‰=â€‰1,854 natural object images⁵¹ from the public THINGS object concept and image database²⁶. The THINGS dataset can formally be defined as $D{(\{{a}_{s},{b}_{s}\}|\{{i}_{s},{j}_{s}\,,{k}_{s}\})}_{s=1}^{n}$, which denotes a dataset of n object triplets and corresponding human odd-one-out responses, where $\{{a}_{s},{b}_{s}\}\subset \{{i}_{s},{j}_{s}\,,\,{k}_{s}\}$ and $\{{a}_{s},{b}_{s}\}$ is the object pair that was chosen by a human participant among the s-th triplet to have the highest similarity. Let ${\bf{X}}\in \,{{\mathbb{R}}}^{m\times p}$ be the teacher model representations for the mâ€‰=â€‰1,854 objects in the THINGS dataset, where p is the dimension of the image-representation vector. It is noted that each category in the THINGS dataset is represented by one object image. From X we can construct a similarity matrix for all object pairs ${\bf{S}}:= {\bf{X}}\,{{\bf{X}}}^{{\rm{\top }}}\in {{\mathbb{R}}}^{m\times m}$, where ${S}_{i,j}={{\bf{x}}}_{i}^{{\rm{\top }}}{{\bf{x}}}_{j}$ is the representational similarity for objects i and j, ${\rm{\top }}$ denotes the matrix transpose, and x_i refers to the i-th column of X.

Odd-one-out accuracy

The triplet odd-one-out task is frequently used in the cognitive sciences to measure human notions of object similarity^52,53,54,55. To measure the degree of alignment between human and neural network similarity judgements in the THINGS triplet task, we embed the mâ€‰=â€‰1,854 THINGS images into the representation space of a neural network with ${\bf{X}}\in {{\mathbb{R}}}^{m\times p}$. Given vector representations x₁, x₂ and x₃ of the 3 images in a triplet, we first construct a similarity matrix ${\bf{S}}\in {{\mathbb{R}}}^{3\times 3}$ where ${S}_{i,j}\,:= \,{{\bf{x}}}_{i}^{{\rm{\top }}}{{\bf{x}}}_{j}$ is the dot product between a pair of image representations. We identify the closest pair of images in the triplet as ${{\rm{a}}{\rm{r}}{\rm{g}}\,{\rm{m}}{\rm{a}}{\rm{x}}}_{i,j > i}{S}_{i,j}$ with the remaining image being the odd one out. We define odd-one-out accuracy as the fraction of triplets where the odd one out â€˜chosen by a modelâ€™ is identical to the human odd-one-out choice. Thus, our goal is to learn an affine transformation into the THINGS human object similarity space of the form: ${{\bf{x}}}^{^{\prime} }={\bf{W}}{\bf{x}}+{\bf{b}}$. Here, ${\bf{W}}\in {{\mathbb{R}}}^{p\times p}$ is a learned transformation matrix, ${\bf{b}}\in {{\mathbb{R}}}^{p}$ is a bias and ${\bf{x}}\in {{\mathbb{R}}}^{p}$ is the neural network representation for a single object image in the THINGS dataset. We learn the affine transformation for the representation of the image encoder space of the teacher model (see the â€˜Surrogate teacher modelâ€™ section for details about the teacher model). Using this affine transformation, an entry in the pairwise similarity matrix Sâ€²â€”which represents the similarity between two object images i and jâ€”can now be written as ${S}_{i,j}^{^{\prime} }\,:= \,{({\bf{W}}{{\bf{x}}}_{i}+{\bf{b}})}^{{\rm{\top }}}({\bf{W}}{{\bf{x}}}_{j}+{\bf{b}})$.

Hard-alignment loss

Given a similarity matrix of neural network representations S and a triplet {i,â€‰j,â€‰k}, the likelihood of a particular pair, $\{a,b\}\subset \{i,j,k\}$, being most similar to the remaining object being the odd one out, is modelled by the softmax of the object similarities,

$$\sigma ({\bf{S}},\tau ):= \exp ({S}_{a,b}/\tau )/\exp ({S}_{i,j}/\tau )+\exp ({S}_{i,k}/\tau )+\exp ({S}_{j,k}/\tau )$$

(1)

We can then define the probability of the neural network model to choose the most similar pair (according to the human participants) to be $q(\{a,b\}|\{i,j,k\},{\bf{S}}):= \sigma ({\bf{S}},\tau )$ with a temperature parameter Ï„â€‰=â€‰1. For n triplet responses, the discrete negative log-likelihood is defined as follows

$${L}_{{\rm{h}}{\rm{a}}{\rm{r}}{\rm{d}}\mbox{-}{\rm{a}}{\rm{l}}{\rm{i}}{\rm{g}}{\rm{n}}}({{\bf{S}}}^{^{\prime} })\,:= \,-\frac{1}{n}\mathop{\sum }\limits_{s=1}^{n}\mathrm{log}\,q(\{{a}_{{\rm{s}}},{b}_{{\rm{s}}}\}|\{{i}_{{\rm{s}}},\,{j}_{{\rm{s}}},\,{k}_{{\rm{s}}}\},\,{{\bf{S}}}^{^{\prime} })$$

Modelling human uncertainties

As each triplet response is a discrete choice, we do not have direct access to the uncertainties of a human participant over the objects in a triplet. Thus, the above loss function optimizes a transform to match the human choice but does not take into account the uncertainties over the three odd-one-out alternatives. However, it is possible to model these uncertainties using variational interpretable concept embeddings (VICE⁵⁵), a recently proposed, approximate Bayesian inference method for learning an interpretable object concept space from human similarity judgements. VICE has shown remarkable performance in predicting the (dis-)agreement in human similarity judgements for multiple similarity judgement datasets, including THINGS⁵⁵.

We train a VICE model on the official THINGS train triplet dataset using the (default) hyperparameters recommended by the authors. To capture the uncertainties in human triplet responses, VICE learns a mean, $\mu \in {{\mathbb{R}}}^{m\times d}$, and a variance, $\sigma \in {{\mathbb{R}}}^{m\times d}$, for each object image m and each object dimension d, respectively. Therefore, the set of VICE parameters is defined as $\theta =\{\mu ,\sigma \}$. VICE uses the reparameterization trick^56,57 to generate an embedding matrix ${\bf{Y}}\in {{\mathbb{R}}}^{m\times d}$, ${{\bf{Y}}}_{\theta ,{\varepsilon }}=\mu +\sigma \odot {\varepsilon }$, where $\varepsilon \in {{\mathbb{R}}}^{m\times d}$ is entrywise N(0,â€‰1), and âŠ™ denotes the Hadamard (element-wise) product.

After convergence, we can use a VICE model to obtain a posterior probability distribution for each triplet in the data. We approximate the probability distribution using a Monte Carlo estimate^58,59,60 from R samples ${{\bf{Y}}}^{({\bf{r}})}={{\bf{Y}}}_{\hat{\theta },\varepsilon (r)}$ for râ€‰=â€‰1,â€‰â€¦,â€‰R, yielding

$$\hat{p}(\{\,{y}_{s},\,{z}_{s}\}|\{{i}_{s},\,{j}_{s},\,{k}_{s}\}):= -\frac{1}{R}\mathop{\sum }\limits_{r=1}^{R}\mathop{\underbrace{p(\{\,{{\rm{y}}}_{s},\,{z}_{s}\}|\{{i}_{{\rm{s}}},{j}_{s},{k}_{s}\},{{\bf{Y}}}^{(r)})}}\limits_{{\rm{M}}{\rm{o}}{\rm{n}}{\rm{t}}{\rm{e}}-{\rm{C}}{\rm{a}}{\rm{r}}{\rm{l}}{\rm{o}}\,{\rm{e}}{\rm{s}}{\rm{t}}{\rm{i}}{\rm{m}}{\rm{a}}{\rm{t}}{\rm{e}}}$$

where we set Râ€‰=â€‰50 because we found it to yield the best predictive performance on the official THINGS validation set. This gives a representative probability estimate for each of the three pairs in a triplet to be selected as the most similar pair.

Soft-alignment loss

Using the posterior probability estimates obtained from VICE, we transform the original THINGS triplet dataset of discrete triplet choices into a triplet dataset of probability distributions that reflect the human uncertainties of the triplet alternatives. Let ${D}^{\dagger }\,:= \,{({p}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\}))}_{s=1}^{n}$ be the transformed triplet dataset, where

$${p}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\}):= \hat{p}(\{\,{y}_{s},{z}_{s}\}|\{{i}_{s},{j}_{s}\,,\,{k}_{s}\})\,{\rm{\forall }}\,\{\,y,z\}\subset \{i,j\,,\,k\}.$$

Now, for n triplet responses we can define the negative log-likelihood for the soft alignment loss as

$$\begin{array}{l}{L}_{{\rm{s}}{\rm{o}}{\rm{f}}{\rm{t}}\mbox{-}{\rm{a}}{\rm{l}}{\rm{i}}{\rm{g}}{\rm{n}}}({{\bf{S}}}^{^{\prime} })\,:= \,\frac{1}{n}\mathop{\sum }\limits_{s=1}^{n}{p}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\})\mathrm{log}{p}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\})\\ \,\,\,\,\,\,-{p}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\})\mathrm{log}{q}_{s}^{\ast }(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\})\end{array}$$

(3)

where ${q}_{s}(\{{i}_{s},{j}_{s}\,,\,{k}_{s}\},{\bf{S}})\,:= \,q(\{\,{y}_{s},{z}_{s}\}|\{{i}_{s},{j}_{s}\,,\,{k}_{s}\},{\bf{S}})\,{\rm{\forall }}\,\{\,y,z\}\subset \{i,j,k\}.$

Uncertainty distillation

We mainly follow the optimization process introduced in ref. ⁶¹. However, we modify their approach by injecting uncertainty measures about human odd-one-out responses into the representation space of the teacher, using a recent approximate Bayesian inference method for learning object concepts from human behaviour⁵⁵. Thus, we replace the negative log-likelihood of the discrete human odd-one-out choicesâ€”which we refer to as hard alignmentâ€”with the negative log-likelihood of the probabilities for the pairwise triplet similarities obtained from the Bayesian inference modelâ€”referred to as soft alignment. The final objective for learning the uncertainty distillation transformation is thus defined as

$$\mbox{arg}\,\mathop{\mbox{min}}\limits_{W,b}\,{L}_{\mbox{soft-align}}({\bf{X}},{\bf{W}},{\bf{b}})+\lambda \Vert {\bf{W}}-(\mathop{\sum }\limits_{j=1}^{p}{W}_{jj}/p){\bf{I}}{\Vert }_{F}^{2},$$

(4)

where $I\in {{\mathbb{R}}}^{p\times p}$ is the identity matrix and ||.||_F² denotes the squared Frobenius norm. The right-hand side of the above objective is an â„“₂-regularization whose aim is to preserve the nearest-neighbour information (or equivalently, the local similarity structure) of the pretrained representations while learning an affine transformation into the THINGS human object similarity space. The above equation is minimized using standard stochastic gradient descent.

Although this expression is similar to the global transform defined in ref. ⁶¹, we find it to yield equally strong downstream task performance as the gLocal transform proposed in ref. ⁶¹ while predicting human uncertainties better than the global transform. It appears as though there is barely any trade-off between representational alignment and downstream task performance for using the uncertainty distillation, whereas ref. ⁶¹ found that the global transform yields slightly better human alignment but worse downstream task performance compared wit the gLocal transform. We use the uncertainty distillation transformation for generating human-like similarity judgements by transforming a modelâ€™s representation space with uncertainty distillation.

Data generation

In the following section, we describe the AligNet data-generation process. We start by introducing the data that we use for constructing the triplets. We continue with a detailed description of the different sampling strategies that we consider in our analyses. Finally, we explain how we collect model responses using transformed representations and define the objective function for fine-tuning models on AligNet.

Image data

For creating AligNet, we use the publicly available ImageNet database³⁸. ImageNet is a natural image dataset with approximately 10⁶ training data points and 1,000 image categories²⁸. The categories are almost equally distributed in the data with small variations in the number of images between the different classes. Hence, ImageNet can be considered a highly balanced dataset. ImageNet has been the dominant image dataset for training large computer vision models until the advent of image/text multimodal training a few years ago. Although, so far, larger image datasets exist, ImageNet is still is one of the largest open-source and most widely used image datasets in the field of computer vision.

Triplet sampling

For generating triplets of images, we use three different sampling strategies: random, class-border and cluster-boundary sampling. Let mâ€² be the number of images in the data where mâ€²â€‰=â€‰1, 281, 167 and C be the number of classes with Câ€‰=â€‰1,000. Let ${D}_{{\rm{i}}{\rm{m}}{\rm{a}}{\rm{g}}{\rm{e}}}\,:= \,{({x}_{i},\,{y}_{i})}_{i=1}^{{m}^{^{\prime} }}$ be the ImageNet dataset of mâ€² imageâ€“label pairs.

Random

Uniform random sampling is the vanilla sampling approach used to create the THINGS datasets (see above). In random sampling, three images are chosen uniformly at random without replacement from all of the m images in the data to create a triplet. As there are Câ€‰=â€‰1,000 classes and each class has approximately 1,000 images, most of the triplets generated with this approach contain 3 images from 3 different classes. The number of triplets different from triplets with images from three distinct classes is negligible. It is noted that this is the same sampling approach that was used to generate the THINGS triplets⁵⁴. A triplet generated via random sampling can be defined as the following triplet set ${\bf{S}}:= \{{x}_{i},{x}_{j},{x}_{k}\}$ with the constraint $({x}_{i}\ne {x}_{j}\ne {x}_{k})$.

Class boundary

Another way to sample image triplets is to exploit the label information associated with each data point. Instead of three random images from three distinct classes, we determine class-boundary triplets to contain two images from the same class and one image from a different class. This is similar to the approach introduced in ref. ⁶² where each odd-k-out set of images contains a majority class and k odd class singletons. This sampling approach allows models to learn class boundaries similar to the standard supervised learning setting. A triplet generated via class-boundary sampling can be defined as the following triplet set ${\bf{S}}:= \{{x}_{i},\,{x}_{j},\,{x}_{k}\}$ with the constraint $(\,{y}_{i}={y}_{j}\ne {y}_{k})\vee (\,{y}_{i}\ne {y}_{j}={y}_{k})\vee (\,{y}_{i}={y}_{k}\ne {y}_{j})$ where the labels are used for data partitioning.

Cluster boundary

As we want to introduce a general approach that does not rely on label information, we use a third sampling strategy that is, in principle, similar to the class-boundary approach but does not require labels. Let ${\bf{Z}}\in {{\mathbb{R}}}^{{m}^{^{\prime} }\times p}$ be the stacked representations of a neural network model for every image in D_image. The representations can essentially be computed for any layer of a model. Here we use the image encoder for image/text models and the CLS token representation of the penultimate layer for any other model (as we only use ViT-based models). We then apply k-means clustering to the encoded image representations Z and ${{\bf{Z}}}^{^{\prime} }\,:= \,{({\bf{W}}{{\bf{Z}}}^{{\rm{\top }}}+({{\bf{b}}}_{1},\ldots ,{{\bf{b}}}_{{m}^{^{\prime} }}))}^{{\rm{\top }}}$ respectively (where the transformation variables W and b are computed via uncertainty distillation optimization using equation (4)) into c representation clusters where c can be regarded as similar to C, the number of labels in the original dataset. We use the Elbow criterion to select c. For all of our main experiments, we set câ€‰=â€‰500. Hence, the ImageNet dataset is transformed into a ImageNet dataset of image and cluster pairs. After the clustering, we apply the same sampling method as for class-boundary triplets: for each triplet, we choose uniformly at random two images without replacement from one cluster and one image from a different cluster. Thus, a triplet generated via cluster-boundary sampling can be defined as the following set ${\bf{S}}:= \{{x}_{i},{x}_{j},{x}_{k}\}$ with the constraint $({y}_{i}={y}_{j}\ne {y}_{k})\vee ({y}_{i}\ne {y}_{j}={y}_{k})\vee ({y}_{i}={y}_{k}\ne {y}_{j})$ where instead of the original labels we use the cluster labels for partitioning the data.

Triplet-response generation

We use the responses of a surrogate teacher model (see below) to simulate a dataset of human-aligned triplet odd-one-out responses. More formally, let ${D}_{\mbox{triplets}}\,:= \,{(\{{x}_{i},\,{x}_{j},\,{x}_{k}\})}_{s=1}^{{n}^{^{\prime} }}$ be the dataset of sampled ImageNet triplets for which we want to collect responses using transformed model representations. It is noted that we can sample an arbitrary number of tripletsâ€”upper-bounded by the binomial coefficient mâ€²/k with kâ€‰=â€‰3â€”and can thus set nâ€² to essentially any natural number. For the experiments that we report in the main text, we set nâ€²â€‰=â€‰10⁷ because we found a larger nâ€² to not yield any downstream task improvements. For now, we regard our surrogate model as a blackbox model with transformed ImageNet representations ${{\bf{Z}}}^{^{\prime} }\,:= \,{({\bf{W}}{{\boldsymbol{Z}}}^{{\rm{\top }}}+({{\bf{b}}}_{1},\ldots ,{{\bf{b}}}_{{m}^{^{\prime} }}))}^{{\rm{\top }}}\in {{\mathbb{R}}}^{{m}^{^{\prime} }\times p}$ where the affine transformation was found via uncertainty distillation optimization (equation (4)). Given transformed representations z₁â€², z₂â€² and z₃â€² of the three images in a triplet, we can construct a similarity matrix ${{\bf{S}}}^{^{\prime} }\in {{\mathbb{R}}}^{3\times 3}$ where ${{S}^{^{\prime} }}_{i,j}:={{\rm{z}}}_{i}^{{\rm{\top }}}{{\rm{z}}}_{j}$ is the dot product between a pair of of representations. Similarly to how we do this for learning the uncertainty distillation transformation (see above), we identify the closest pair of images in a triplet as $\arg \mathop{max}\limits_{i,j > i}{S}_{i,j}^{^{\prime} }$ with the remaining image being the odd one out. Let ${D}_{{\rm{a}}{\rm{l}}{\rm{i}}{\rm{g}}{\rm{n}}}\,:= \,{({\{{x}_{a},{x}_{b}\}}_{s}|\{{x}_{i},{x}_{j},{x}_{k}\})}_{s=1}^{{n}^{^{\prime} }}$ then constitute the final AligNet dataset of ImageNet triplets and corresponding model responses, where $\{{x}_{a},{x}_{b}\}\subset \{{x}_{i},{x}_{j},{x}_{k}\}$ and $\{{x}_{a},{x}_{b}\}$ is the image pair that was chosen by the transformed model representations to have the highest pairwise similarity. The model choices are the closest approximation to the human choices due to the uncertainty distillation transformation.

It is noted that the dataset includes not only the discrete model choices but also the exact relationships among all pairwise similarities in a triplet obtained from the probability space of the teacher model. Thus, we have access to soft distributions over the labels for use in distillation.

Objective function

Let f_Î¸ be a neural network function parameterized by Î¸, the set of its weights and biases. For every input image x that the function f_Î¸(x) processes it yields a representation f_Î¸(x)â€‰=â€‰z. Here, z refers to the image encoder representation of image/text models or the CLS token representation before the final linear layer for other model types. From the representations of the three images in a triplet, we can again construct a similarity matrix ${{\boldsymbol{S}}}^{\dagger }\in {{\mathbb{R}}}^{3\times 3}$ where ${S}^{\dagger }:= {{\rm{z}}}_{i}^{{\rm{\top }}}{{\rm{z}}}_{j}$ is the dot product between a pair of image representations. The AligNet loss function is defined as the following KL divergence between teacher and student triplet probabilities,

$$\begin{array}{c}{L}_{\mbox{alignet}}({{\bf{S}}}^{^{\prime} },{{\bf{S}}}^{\dagger })\,:= \,\\ \,-\frac{1}{B}\mathop{\sum }\limits_{s=1}^{B}{\sigma }{([{S}_{i,j}^{^{\prime} },{S}_{i,k}^{^{\prime} },{S}_{j,k}^{^{\prime} }],{\tau }^{^{\prime} })}_{s}\mathrm{log}\,{\sigma }{([{S}_{i,j}^{^{\prime} },{S}_{i,k}^{^{\prime} },{S}_{j,k}^{^{\prime} }],{\tau }^{^{\prime} })}_{s}\\ \,-{\sigma }{([{S}_{i,j}^{^{\prime} },{S}_{i,k}^{^{\prime} },{S}_{j,k}^{^{\prime} }],{\tau }^{^{\prime} })}_{s}\mathrm{log}\,{\sigma }{([{S}_{i,j}^{\dagger },{S}_{i,k}^{\dagger },{S}_{j,k}^{\dagger }],{\tau }^{\dagger })}_{s},\end{array}$$

(5)

where Ï„â€²â€‰=â€‰1 and Ï„^â€â€‰>â€‰1 and B is the batch size. We find Ï„^â€ via grid search and set it to Ï„^â€â€‰=â€‰100 for all of our experiments. Recall that Ïƒ is a softmax function that models the probabilities over the three image similarity pairs (equation (1)). The final AligNet objective is defined as the following minimization problem

$$\mbox{arg}\,\mathop{\mbox{min}}\limits_{\theta }\,{L}_{\mbox{alignet}}({f}_{\theta })+\lambda {\Vert {\theta }^{\ast }-{\theta }^{\dagger }\Vert }_{2}^{2}$$

(6)

where Î¸^* are the parameters of the pretrained base student model and Î¸^â€ are the parameters of the fine-tuned student model. This â„“₂-regularization, which we refer to as weight decay to initialization, encourages the fine-tuned set of parameters to stay close to its base during training. It is similar to the regularization used for learning the uncertainty distillation transformation (equation (4)) but adapted to the set of all model parameters rather than a linear transform.

Surrogate teacher model

Reference ²⁵ showed that image and text models and models trained on large, diverse datasets are better aligned with human similarity judgements than vision models trained with a self-supervised learning objective or supervised models trained on ImageNet. Thus, we use the best-performing image and text model according to various computer vision benchmarks at the time of writing this paper as our teacher model. This model is called SigLIP⁶³. SigLIP, similar to contrastive languageâ€“image pretraining (CLIP)⁶⁴ and ALIGN⁶⁵, is trained via contrastive languageâ€“image pretraining using millions of image and text pairs. The difference between CLIP and SigLIP is that the latter uses a paired sigmoid loss instead of the standard softmax function usually used for pretraining image and text models via cross-entropy. Image and text pretraining allows the model to learn an aligned representation space for images and text; thus, adding more semantic information about the objects in an image to the model representations.

We use the SigLIP-So400m variant of SigLIP as our teacher model. This variant uses an optimized ViT backbone whose performance is similar to one of the largest ViTs, ViT-g/14⁶⁶ while having fewer parameters and thus being smaller. The number of parameters of SoViT-400m/14 lies somewhere between that of ViT-L/16 and ViT-g/14. The output dimensionality of the image and text encoder representations of SoViT-400m/14 is pâ€‰=â€‰1,152 each. We align the image encoder representations with human odd-one-out choices using the uncertainty distillation optimization outlined in equationâ€‰(6). This allows us to increase the triplet odd-one-out accuracy of SigLIP-So400m from 44.24% to 61.7% (rightmost column in Supplementary Table 1), which is close to the human noise ceiling of 66.67% for THINGS (compare ref. ⁵⁴) and thus among the best human-aligned models without AligNet fine-tuning (compare ref. ²⁵). It is noted that this is a relative increase in performance of 39.47%. Throughout this paper, we use the human-aligned version of SigLIP-So400m as the surrogate teacher model for generating human-aligned similarity judgements and distiling human-like similarity structure into student vision foundation models (VFMs). We select a diverse and representative set of student VFMs.

Student models

As previous research has demonstrated that a modelâ€™s architecture has no significant impact on the degree of alignment with human similarity judgements^25,61, we use the same architecture for all student models that we fine-tune on AligNet. Specifically, we use the ViT⁸ for the backbone of each student model. We use the ViT rather than a convolutional-neural-network-based model because ViTs have recently emerged as the dominant neural network architecture for computer vision application and VFMs. Every large VFM used in practice is based on the ViT^30,63,67,68. Unless otherwise mentioned, we use the base model size, that is, ViT-B. ViT-B has 12 attention layers and an internal (hidden) representation size of pâ€‰=â€‰768. It has been shown that both the training data and the objective function have a substantial impact on the degree of alignment with human behaviour. Thus, we use student models that were trained on different pretraining task with different training data and objective functions.

Supervised pretraining is still the prevailing mode of training computer vision models. Therefore, we trained ViT-B on the popular ImageNet dataset consisting of 1.4â€‰million natural images³⁸. To examine how model performance changes as a function of the model size, we train ViT instances of three different sizes on ImageNet: ViT-S/16, ViT-B/16 and ViT-L/16. The image patch size is the same for each of those models. To evaluate the effect of AligNet on self-supervised pretraining, we use pretrained DINOv1⁶⁹ and DINOv2²⁹ models of which DINOv1 was pretrained on ImageNet and DINOv2 was pretrained on a different, larger image dataset as denoted below. In addition, we investigate multimodal training of vision models that add textual information in the form of both image captioning via the CapPa model³⁰ and CLIP via SigLIP⁶³. The latter model is considered state of the art on many downstream computer vision applications and is used as the image embedding model in modern large visual-language models^67,68. The full list of student models that we consider in our analyses is as follows:

ViT-{S,B,L}
- Training data: ImageNet³⁸
- Objective: supervised learning

CLIP (ViT-B) (see Appendix A.2.4)
- Training data: WebImageText⁶⁴
- Objective: CLIP⁶⁴

SigLIP (ViT-{B,SO400M})
- Training data: WebLI⁶³
- Objective: CLIP⁶⁴

SigLIP2 (ViT-B) (Supplementary Information section 2.4)
- Training data: WebLI⁶³
- Objective: combination of various objectives (see ref. ⁷⁰ for details)

DINOv1 (ViT-B)
- Training data: ImageNet³⁸
- Objective: self-supervised image pretraining⁶⁹

DINOv2 (ViT-B)
- Training data: DINOv2 data (see ref. ²⁹ for details)
- Objective: self-supervised teacher-student distillation²⁹

CapPa (ViT-B)
- Training data: JFT-3B (Google proprietary dataset)
- Objective: multimodal image captioning³⁰

Randomly initialized ViT-B (Supplementary Information section 2.5)
- Training data: AligNet
- Objective: AligNet objective (equation (5))

Representational similarity analysis

Representational similarity analysis is a well-established method for comparing neural network representationsâ€”extracted from an arbitrary layer of the modelâ€”to representations obtained from human behaviour⁴⁴. In representational similarity analysis, one first obtains representational similarity matrices (RSMs) for the human behavioural judgements and for the neural network representations (more specific details can be found in Supplementary Information). These RSMs measure the similarity between pairs of examples according to each source. As in previous work^{23,25,54,61,71}, we flatten the upper triangular of human and model RSMs respectively and quantify their similarities using use the Spearman rank correlation coefficient. In contrast to Pearson correlation, the Spearman rank correlation is scale invariant and thus better suited to measure similarities of judgements obtained from different sources.

Multi-arrangement task

Human similarity judgements for refs. ^23,71 were obtained by using a multi-arrangement task. In a multi-arrangement task, participants are presented with a computer screen showing images of several different objects. The participants are asked to arrange the images into semantically meaningful clusters, given the instruction that images of objects that lie close together are considered more similar. From this arrangement, one can infer pairwise (dis-)similarities of the objects and average those across all participants to obtain a representative (dis-)similarity matrix.

Likert scale

In refs. ^24,72, pairwise similarity judgements were obtained by asking human participants to rate the similarity of pairs of objects on an ordinal scale that ranges from 0 (â€˜not similar at allâ€™) to 10 (â€˜very similarâ€™). The pairwise similarity ratings can be averaged across the different participants, which in turn yields a matrix of similarities between pairs of objects.

Neural network representations

RSMs for neural network representations are obtained by first embedding the same set of images that were presented to the human participants in the p-dimensional latent space of a model. The latent space could be any layer of a neural network. For the base models, we use the representations of the image encoder for SigLIO and the CLS token of the penultimate layer for CapPa, DINOv2 and ViT-B. We do this because previous work has shown that the penultimate layer space and the image encoder space of image and text models, respectively, yield the highest similarity to human behaviour^24,25,73. After embedding the images into the neural netâ€™s latent space, we get a representation matrix ${\bf{X}}\in {{\mathbb{R}}}^{n\times p}$ for the n images in the data. Instead of simply computing the dot-product similarity matrix ${\bf{S}}:= {\bf{X}}{{\bf{X}}}^{{\rm{\top }}}$, in representational similarity analysis one typically uses either a cosine similarity or a Pearson correlation kernel to compute the affinity matrix

$$\cos ({{\bf{x}}}_{i},{{\bf{x}}}_{j})\,:= \,\frac{{{{\bf{x}}}_{i}}^{{\rm{\top }}}{{\bf{x}}}_{j}}{{\Vert {{\bf{x}}}_{i}\Vert }_{2}{\Vert {{\bf{x}}}_{j}\Vert }_{2}};\,\,\,\,\,\phi ({{\bf{x}}}_{i},{{\bf{x}}}_{j})\,:= \,\frac{{({{\bf{x}}}_{i}-{\bar{{\bf{x}}}}_{i})}^{{\rm{\top }}}({{\bf{x}}}_{j}-{\bar{{\bf{x}}}}_{j})}{{\Vert {{\bf{x}}}_{i}-{\bar{{\bf{x}}}}_{i}\Vert }_{2}{\Vert {{\bf{x}}}_{j}-{\bar{{\bf{x}}}}_{j}\Vert }_{2}},$$

where the cosine similarity kernel function cos(x_i,â€‰x_j) or the Pearson correlation kernel function Ï•(x_i,â€‰x_j) is applied to every (x_i,â€‰x_j) vector pair of the matrix X for obtaining the final RSM ${{\bf{S}}}^{^{\prime} }\in {{\mathbb{R}}}^{n\times n}$. Here we use the Pearson correlation kernel function Ï•(x_i,â€‰x_j) to obtain a neural netâ€™s RSM. Pearson correlation is the centred version of cosine similarity and the ranking of the obtained similarities does not differ between the two kernel functions but Pearson correlation first centres the vectors to have zero mean and is therefore a more robust measure. For obtaining RSMs with transformed representations, the transforms are first applied to X before computing Sâ€².

Alignment with conceptual hierarchy

When analysing alignment with the conceptual hierarchy, we use the original ImageNet category labels for the images³⁸. ImageNet is structured by the WordNet hierarchy, from which we extract basic and superordinate categories aligned with the previous cognitive work. Within and across categories, we measure change in representation distance relative to other changes (by z-scoring across all representation distances for the given model checkpoint), because relative distances are more meaningful than absolute ones (for example, scaling all representations by two would change absolute distances, but not relative ones), and absolute scales of all representations tend to increase during training. We quantify changes with mixed-effects linear regressions that account for the non-independence of representational changes across the different clusters (see Supplementary Information section 3.2 for details).

Levels data

We collected a new multi-level similarity judgement dataset from Nâ€‰=â€‰473 human participants, which we named Levels. The dataset contains odd-one-out judgements on three different types of triplet: coarse-grained semantic, which requires deciding on the odd one out in broadly different categories; fine-grained semantic, which involved discerning subtle within category distinctions; and class boundary, which tested for category-boundary detection. Consistent selection of the same odd-one-out image (for example, i) in multiple participants indicated that the remaining two images (for example, j and k) were closer to each other in the participantsâ€™ concept space than either was to the odd one out (see Supplementary Information for details about the data collection). Levels allowed us to evaluate modelâ€“human alignment for the same set of stimuli on various levels of abstraction, and to assess how well the models capture the inherent uncertainty in human judgements, inferred from response latencies.

Participants

We recruited Nâ€‰=â€‰508 participants (209 female, 289 male, 3 diverse, Nâ€‰=â€‰7 missing demographic information owing to revocation of study consent; mean age 31.75â€‰Â±â€‰s.d.â€‰=â€‰8.04â€‰years) online via Prolific Academic (https://www.prolific.ac). The eligibility criteria were that participants had to be between 18 and 50â€‰years old, fluent in English, have normal or corrected-to-normal vision, no colourblindness, and have a minimum approval rating of 95% on Prolific. Participants provided informed consent before starting the experiment. The experiment lasted approximately 45â€‰minutes. Participants were reimbursed with Â£7.70 for completing the experiment and received an additional bonus payment of Â£0.77. Partial payments were made if the experiment was not completed owing to technical issues (Nâ€‰=â€‰6) or early termination by the participant (Nâ€‰=â€‰1). Participants performing below 90% correct on catch trials (Nâ€‰=â€‰19, 3 female, 16 male), or failing to respond in the allotted time window (15â€‰s) in more than 10 trials (Nâ€‰=â€‰9, 4 female, 4 male, 1 diverse) were excluded. Thus, Nâ€‰=â€‰473 participants remained in the dataset (202 female, 269 male, 2 diverse; mean age 31.82â€‰Â±â€‰s.d.â€‰=â€‰8.03â€‰years). Of these participants, Nâ€‰=â€‰448 were each tested with a different selection of triplets, while ensuring that each triplet was presented Nâ€‰=â€‰5 times across the entire sample of participants (see information on stimuli sampling below). Owing to a server glitch during trial assignment, the remaining Nâ€‰=â€‰25 participants shared their exact triplet selection with one other participant in the sample. These Nâ€‰=â€‰25 participants were excluded from the response times and uncertainty estimation (see â€˜Alignment at multiple levels of abstractionâ€™ section) to restrict analysis to participants with different sets of triplets. The experiment was approved by the internal review board of the Max Planck Institute for Human Development.

Stimuli

The experimental stimuli were images taken from the ImageNet dataset³⁸. Another nine images were used for instructions only and depicted natural objects selected from the Bank of Standardized Stimuli (BOSS)⁷⁴, available at https://drive.google.com/drive/folders/1FpnEFkbqe_huRwfsCf7gs5R1zuc1ZOkn. We grouped the visual stimuli presented in the triplets according to different levels of abstraction: coarse-grained semantic, which comprised three images from three different categories; fine-grained semantic, showing three images from the same category; and class boundary, where two images were from the same and one from a different category.

Instead of randomly sampling tripletsâ€”which would reproduce dataset biasesâ€”we stratified sampling by superclasses. ImageNet classes follow the WordNet hierarchy^28,38, which includes higher-level classes. For instance, all dog breeds can be summarized as the dog superclass. To avoid presenting dogs, birds and other fine-grained classes that are overrepresented in ImageNet more frequently to the participants than other categories, we grouped the ImageNet classes into 717 coarse-grained WordNet superclasses. We uniformly at random sampled images from those 717 superclasses to construct the different kinds of triplets. It is noted that for all superclasses with more than one class, we uniformly at random chose one subclass and uniformly at random sampled one image, two images (without replacement) or three images (without replacement) from that subclass, depending on the triplet type. For most superclasses that comprised a single subclass only, that is, a one-to-one-mapping, we could skip the subclass sampling part. Triplet sampling resulted in Nâ€‰=â€‰450 predefined experiment trial sets, of which Nâ€‰=â€‰448 were used for testing. Across these, each triplet was presented within Nâ€‰=â€‰5 different experiment files. This sampling process ensured a balanced distribution of triplets across the sample, and the repetition of each triplet in five different participants allowed for the calculation of an uncertainty distribution for each triplet.

The triplet odd-one-out task

On each trial, participants were presented with a triplet of images (i,â€‰j,â€‰k). Participants were asked to select the image that was the most different from the other two, that is, the odd one out. During the instructions, participants saw different triplets with increasing ambiguity regarding which image would likely be picked as the odd one out. Participants were given explanations for potential odd-one-out choices, clarifying that decisions could be based on different criteria, such as semantic or perceptual features of the shown images.

Procedure

The experiment was run online using jsPsych v7.3.3 (www.jspsych.org/7.3/) and custom plugins. Participants were asked to provide demographic information, including their age and gender. Thereafter, they viewed written instructions about the task and performed six practice trials (two trials per triplet level of abstraction). Participants were free to repeat the instructions until they felt confident to perform the experiment. The experiment proper comprised Nâ€‰=â€‰330 experiment trials. Each trial started with a fixation cross (1â€‰s), followed by the presentation of a triplet (maximum 15â€‰s). Participants were asked to select the â€˜odd one outâ€™ using the right-, left- or downwards-facing arrow keys on their keyboard. Responses could be entered between 1â€‰s and 15â€‰s after triplet onset, after which the next trial started. Trials in which participants failed to submit a response were rare (Mâ€‰=â€‰0.27% of trials; minimum 0.00%, maximum 6.06%). The serial order of triplet types (for example, fine-grained or coarse-grained semantic) and ImageNet classes (for example, dogs or birds) was counterbalanced across the experiment. We additionally counterbalanced the serial position of trial types across participants using a Latin-square design⁷⁵. Participants could take short breaks (self-paced for up to 2â€‰min) after Nâ€‰=â€‰50, 150 and 200 experiment trials. Experimental trials were interleaved with Nâ€‰=â€‰16 catch trials (class-border triplets), which were predefined based on low model uncertainty and 100% agreement among participants on these specific triplets during piloting. Catch trial performance was used as an indicator of adequate task engagement (see participant inclusion criteria above).

Preprocessing of human response times and uncertainty estimation

Descriptive statistics on response times and uncertainty estimation (see â€˜Alignment at multiple levels of abstractionâ€™ section) were calculated based on participants with unique experimental trial sets (Nâ€‰=â€‰448). The response-time data were log transformed (log(RT)), in accordance with current best practices for response-time analysis. Trials with response times longer than 10â€‰s were excluded from analysis (on average Mâ€‰=â€‰2.64% of trials per participant). As responses could be given no earlier than 1â€‰s after triplet onset (see â€˜Procedureâ€™ above), no lower bound was set for response-time exclusion. To estimate uncertainty (in terms of the level of (dis-)agreement among observers) for each triplet, we used the discrete (Shannon) entropy of the response distribution across participants.

Human-to-human alignment

We computed the human noise ceiling for each abstraction setting in Levels using a leave-one-out cross-validation approach. In leave one out, the agreement level for a triplet is computed as the average match rate between a held-out participantâ€™s response and the majority response of the remaining population. Thus, for a triplet that was used for five participants, on each leave-one-out iteration, one participant response is held out and the remaining four comprise the population. The human-to-human reliability score is then calculated as the average agreement level across all triplets in the dataset.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The synthetically created AligNet data are publicly available at https://console.cloud.google.com/storage/browser/alignet. The Levels data are available on GIN at https://doi.org/10.12751/g-node.hg4tdz.

Code availability

The training code is available at https://github.com/google-deepmind/alignet. The aligned model checkpoints are publicly available at https://console.cloud.google.com/storage/browser/alignet. Both the experiment and analysis code for Levels and for the human validation of the representational differences between AligNet and UnaligNet are hosted on GitHub and archived on Zenodo at https://zenodo.org/records/15554034 (ref. ⁷⁶) and https://zenodo.org/records/15554174 (ref. ⁷⁷).

References

Khaligh-Razavi, S.-M. & Kriegeskorte, N. Deep supervised, but not unsupervised, models may explain it cortical representation. PLoS Comput. Biol. 10, e1003915 (2014).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Peterson, J. C., Battleday, R. M., Griffiths, T. L. & Russakovsky, O. Human uncertainty makes classification more robust. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 9616â€“9625 (IEEE, 2019).
Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, e253 (2017).
ArticleÂ PubMedÂ Google ScholarÂ
Geirhos, R. et al. Generalisation in humans and deep neural networks. In Proc. Advances in Neural Information Processing Systems Vol. 31 (eds Bengio, S. et al.) 7538â€“7550 (Curran Associates, 2018).
Hebart, M. N., Zheng, C. Y., Pereira, F. & Baker, C. I. Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nat. Hum. Behav. 4, 1173â€“1185 (2020).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Brown, T. et al. Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems Vol. 33 (eds Larochelle, H. et al.) 1877â€“1901 (Curran Associates, 2020).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th Interna- tional Conference on Machine Learning: Proc. Machine Learning Research Vol. 139 (eds Meila, M. & Zhang, T.) 8748â€“8763 (PMLR, 2021).
Dosovitskiy, A. et al. An image is worth 16Ã—16 words: transformers for image recognition at scale. In Proc. The Ninth International Conference on Learning Representations (ICLR, 2021).
Zhai, X., Mustafa, B., Kolesnikov, A. & Beyer, L. Sigmoid Loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision 1975â€“11986 (IEEE, 2023).
Lapuschkin, S. et al. Unmasking clever Hans predictors and assessing what machines really learn. Nat. Commun. 10, 1096 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Geirhos, R. et al. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665â€“673 (2020).
ArticleÂ Google ScholarÂ
Bowers, J. S. et al. Deep problems with neural network models of human vision. Behav. Brain Sci. 46, e385 (2022).
Muttenthaler, L. et al. Improving neural network representations using human similarity judgments. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al) 50978â€“51007 (Curran Associates, 2023).
Fodor, J. A. & Pylyshyn, Z. W. Connectionism and cognitive architecture: a critical analysis. Cognition 28, 3â€“71 (1988).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Tenenbaum, J. B., Kemp, C., Griffiths, T. L. & Goodman, N. D. How to grow a mind: statistics, structure, and abstraction. Science 331, 1279â€“1285 (2011).
ArticleÂ MathSciNetÂ CASÂ PubMedÂ Google ScholarÂ
Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science 350, 1332â€“1338 (2015).
ArticleÂ MathSciNetÂ CASÂ PubMedÂ Google ScholarÂ
Roads, B. D. & Mozer, M. C. Improving human-machine cooperative classification via cognitive theories of similarity. Cogn. Sci. 41, 1394â€“1411 (2017).
ArticleÂ PubMedÂ Google ScholarÂ
Hendrycks, D. et al. The many faces of robustness: a critical analysis of out- of-distribution generalization. In Proc. IEEE/CVF International Conference on Computer Vision 8340â€“8349 (IEEE, 2021).
Pooch, E. H. P., Ballester, P. & Barros, R. C. Can we trust deep learning based diagnosis? The impact of domain shift in chest radiograph classification. In Proc.Thoracic Image Analysis: Second International Workshop (eds Petersen, J. et al.) 74â€“83 (Springer, 2020).
Minderer, M. et al. Revisiting the calibration of modern neural networks. In Proc. Advances in Neural Information Processing Systems Vol. 34 (eds Ranzato, M. et al.) 15682â€“15694 (Curran Associates, 2021).
Cichy, R. M., Kriegeskorte, N., Jozwik, K. M., van den Bosch, J. J. F. & Charest, I. The spatiotemporal neural dynamics underlying perceived similarity for real-world objects. NeuroImage 194, 12â€“24 (2019).
ArticleÂ PubMedÂ Google ScholarÂ
Connolly, A. C., et al. The representation of biological classes in the human brain. J. Neurosci. 32, 2608â€“2618 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
King, M. L., Groen, I. I. A., Steel, A., Kravitz, D. J. & Baker, C. I. Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage 197, 368â€“382 (2019).
ArticleÂ PubMedÂ Google ScholarÂ
Peterson, J. C., Abbott, J. T. & Griffiths, T. L. Evaluating (and improving) the correspondence between deep neural networks and human representations. Cogn. Sci. 42, 2648â€“2669 (2018).
ArticleÂ PubMedÂ Google ScholarÂ
Muttenthaler, L., Dippel, J., Linhardt, L., Vandermeulen, R. A. & Kornblith, S. Human alignment of neural network representations. In Proc. The Eleventh International Conference on Learning Representations (ICLR, 2023).
Hebart, M. N. et al. THINGS: a database of 1,854 object concepts and more than 26,000 naturalistic object images. PLoS OneÂ 14, e0223792 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lee, D.-H. et al. Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Proc. Workshop on Challenges In Representation Learning (WREPL) 896â€“902 (ICML, 2013).
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis.Â 115, 211â€“252 (2015).
ArticleÂ MathSciNetÂ Google ScholarÂ
Oquab, M. et al. DINOv2: learning robust visual features without supervision. Transact. Mach. Learn. Res. https://openreview.net/forum?id=a68SUt6zFt (2024).
Tschannen, M. et al. Image captioners are scalable vision learners too. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 46830â€“46855 (Curran Associates, 2023).
Ratcliff, R. A theory of memory retrieval. Psychol. Rev. 85, 59 (1978).
ArticleÂ Google ScholarÂ
Kiani, R., Corthell, L. & Shadlen, M. N. Choice certainty is informed by both evidence and decision time. Neuron 84, 1329â€“1342 (2014).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Gemini Team et al. Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. Preprint at https://arxiv.org/abs/2507.06261 (2025).
Mehrer, J., Spoerer, C. J., Jones, E. C., Kriegeskorte, N. & Kietzmann, T. C. An ecologically motivated image dataset for deep learning yields better models of human vision. Proc. Natl Acad. Sci. USA 118, e2011417118 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M. & Boyes-Braem, P. Basic objects in natural categories. Cogn. Psychol. 8, 382â€“439 (1976).
ArticleÂ Google ScholarÂ
Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. Caltech Vision Lab https://www.vision.caltech.edu/datasets/cub_200_2011/ (2011).
Nilsback, M.-E. & Zisserman, A. Automated flower classification over a large number of classes. In Proc. Sixth Indian Conference on Computer Vision, Graphics & Image Processing 722â€“729 (IEEE, 2008).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. 2009 IEEE Conference on Computer Vision and Pattern Recognition 248â€“255 (IEEE, 2009).
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452â€“1464 (2018).
ArticleÂ PubMedÂ Google ScholarÂ
Sugiyama, M. & Kawanabe, M. Machine Learning in Non-stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).
Farahani, A., Voghoei, S., Rasheed, K. & Arabnia, H. R. in Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020 (eds Stahlbock, R. et al.) 877â€“894 (Springer, 2021).
Santurkar, S., Tsipras, D. & Madry, A. BREEDS: benchmarks for subpopulation shift. In Proc. The Ninth International Conference on Learning Representations (ICLR, 2021).
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J. & Song, D. Natural adversarial examples. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 15262â€“15271 (IEEE, 2021).
Kriegeskorte, N., Mur, M. & Bandettini, P. A. Representational similarity analysis-connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2, 4 (2008).
Sucholutsky, I. et al. Getting aligned on representational alignment. Preprint at https://arxiv.org/abs/2310.13018 (2023).
Gary, F. Marcus. Rethinking eliminative connectionism. Cogn. Psychol. 37, 243â€“282 (1998).
ArticleÂ Google ScholarÂ
Holyoak, K. J. & Hummel, J. E. in Cognitive Dynamics (eds Dietrich, E. & Markman, A. B.) 229â€“263 (Psychology Press, 2014).
Geiger, A., Carstensen, A., Frank, M. C. & Potts, C. Relational reasoning and generalization using nonsymbolic neural networks. Psychol. Rev. 130, 308 (2023).
ArticleÂ PubMedÂ Google ScholarÂ
Paleyes, A., Urma, R.-G. & Lawrence, N. D. Challenges in deploying machine learning: a survey of case studies. ACM Comput. Surv. 55, 1â€“29 (2022).
ArticleÂ Google ScholarÂ
Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).
Hebart, M. N., et al. THINGS-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. eLife 12, e82580 (2023).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Fukuzawa, K., et al. Internal representations and the conceptual operation of color in pure alexia with color naming defects. Brain Lang. 34, 98â€“126 (1988).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Robilotto, R. & Zaidi, Q. Limits of lightness identification for real objects under natural viewing conditions. J. Vis.Â 4, 779â€“797 (2004).
ArticleÂ PubMedÂ Google ScholarÂ
Hebart, M. N., Zheng, C. Y., Pereira, F. & Baker, C. I. Revealing the multidimensional mental representations of natural objects underlying humansimilarity judgements. Nat. Hum. Behav. 4, 1173â€“1185 (2020).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Muttenthaler, L. et al. VICE: variational interpretable concept embeddings. Adv. Neural Inf. Process. Syst. 35, 33661â€“33675 (2022).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. In Proc. The Second International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2014).
Jimenez Rezende, D., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proc. 31st International Conference on Machine Learning Vol. 32 (eds Xing, E. P. & Jebara, T.) 1278â€“1286 (PMLR, 2014).
Graves, A. Practical variational inference for neural networks. In Proc. Advances in Neural Information Processing Systems Vol. 24 (eds Shawe-Taylor, J. et al.) 2348â€“2356 (Curran Associates, 2011).
Blundell, C., Cornebise, J., Kavukcuoglu, K. & Wierstra, D. Weight uncertainty in neural network. In Proc. 32nd International Conference on Machine Learning Vol. 37 (eds Bach, F. & Blei, D.) 1613â€“1622 (PMLR, 2015).
Blei, D. M., Kucukelbir, A. & McAuliffe, J. D. Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859â€“877 (2017).
ArticleÂ MathSciNetÂ CASÂ Google ScholarÂ
Muttenthaler, L. et al. Improving neural network representations using human similarity judgments. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 50978â€“51007 (Curran Associates, 2023).
Muttenthaler, L., Vandermeulen, R. A., Zhang, Q., Unterthiner, T. & MÃ¼ller, K.-R. Set learning for accurate and calibrated models. In Proc. The Twelfth International Conference on Learning Representations (ICLR, 2024).
Zhai, X., Mustafa, B., Kolesnikov, A. & Beyer, L. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision 11975â€“11986 (IEEE, 2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 8748â€“8763 (PMLR, 2021).
Jia, C. et al. Scaling up visual and vision-language representation learning with noisy text supervision. In Proc. 38th International Conference on Machine Learning Vol. 139 (eds Meila, M. & Zhang, T.) 4904â€“4916 (PMLR, 2021).
Alabdulmohsin, I. M., Zhai, X., Kolesnikov, A. & Beyer, L. Getting ViT in shape: scaling laws for compute-optimal model design. In Proc. Advances in Neural Information Processing Systems Vol. 36 (eds Oh, A. et al.) 16406â€“16425 (Curran Associates, 2023).
Chen, X. et al. Pali-3 vision language models: smaller, faster, stronger. Preprint at https://arxiv.org/abs/2310.09199 (2023).
Beyer, L. et al. PaliGemma: aversatile 3B VLM for transfer. Preprint at https://arxiv.org/abs/2407.07726 (2024).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650â€“9660 (IEEE, 2021).
Tschannen, M. et al. SigLIP 2: multilingual vision-language encoders with improved semantic understanding, localization, and dense features. Preprint at https://arxiv.org/abs/2502.14786 (2025).
Cichy, R. M., Kriegeskorte, N., Jozwik, K. M., van den Bosch, J. J. F. & Charest, I. The spatiotemporal neural dynamics underlying perceivedsimilarity for real-world objects. NeuroImage 194, 12â€“24 (2019).
ArticleÂ PubMedÂ Google ScholarÂ
Peterson, J. C., Abbott, J. T. & Griffiths, T. L. Adapting deep network features to capture psychological representations. In Proc. 38th Annual Meeting of the Cognitive Science Society, Recognizing and Representing Events (eds Papafragou, A. et al.) (Cognitive Science Society, 2016).
Peterson, J. C., Battleday, R. M., Griffiths, T. L. & Russakovsky, O. Human uncertainty makes classification more robust. In Proc. IEEE/CVF International Conference on Computer Vision 9616â€“9625 (IEEE, 2019).
Brodeur, M. B., GuÃ©rard, K. & Bouras, M. Bank of standardized stimuli (BOSS) phase II: 930 new normative photos. PLoS OneÂ 9,Â e106953 (2014).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Grant, D. A. The Latin square principle in the design and analysis of psychological experiments. Psychol. Bull. 45, 427 (1948).
Born, F. Frieda-Josefine/HumanEval_experiment: HumanEval experiment. Zenodo https://doi.org/10.5281/zenodo.13749102 (2025).
Born, F. AligNet Human-Validation-Experiment. Zenodo https://doi.org/10.5281/zenodo.15554173 (2025).

Download references

Acknowledgements

L.M., F.B. and K.-R.M. were in part supported by the German Ministry for Education and Research (BMBF) under grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A, 031L0207D and 01IS18037A. K.-R.M. was partly supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grants funded by the Korean government (MSIT) (number 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University and number 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of ExplanationÂ and number 2024-00457882, AI Research Hub Project). B.S. and F.B. were supported by European Research Council Consolidator Grant ERC-2020-COG-101000972 and Deutsche Forschungsgemeinschaft (DFG) grant 462752742. We thank A. Mahendran, M. S. M. Sajjadi, R. Geirhos and X. Zhai for discussions; and I. Dasgupta, K. Hermann and S. Mohamed for comments on earlier versions of the paper.

Author information

Authors and Affiliations

Google DeepMind, Berlin, Germany
Lukas Muttenthaler,Â Klaus Greff,Â Klaus-Robert MÃ¼llerÂ &Â Thomas Unterthiner
Machine Learning Group, Technische UniversitÃ¤t Berlin, Berlin, Germany
Lukas Muttenthaler,Â Frieda BornÂ &Â Klaus-Robert MÃ¼ller
BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
Lukas Muttenthaler,Â Frieda BornÂ &Â Klaus-Robert MÃ¼ller
Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Lukas Muttenthaler
Max Planck Institute for Human Development, Berlin, Germany
Frieda BornÂ &Â Bernhard Spitzer
TUD Dresden University of Technology, Dresden, Germany
Bernhard Spitzer
Anthropic, San Francisco, CA, USA
Simon Kornblith
GoogleDeepMind, San Francisco, CA, USA
Simon Kornblith,Â Michael C. MozerÂ &Â Andrew K. Lampinen
Department of Artificial Intelligence, Korea University, Seoul, South Korea
Klaus-Robert MÃ¼ller
Max Planck Institute for Informatics, SaarbrÃ¼cken, Germany
Klaus-Robert MÃ¼ller

Authors

Lukas Muttenthaler
View author publications
Search author on:PubMedÂ Google Scholar
Klaus Greff
View author publications
Search author on:PubMedÂ Google Scholar
Frieda Born
View author publications
Search author on:PubMedÂ Google Scholar
Bernhard Spitzer
View author publications
Search author on:PubMedÂ Google Scholar
Simon Kornblith
View author publications
Search author on:PubMedÂ Google Scholar
Michael C. Mozer
View author publications
Search author on:PubMedÂ Google Scholar
Klaus-Robert MÃ¼ller
View author publications
Search author on:PubMedÂ Google Scholar
Thomas Unterthiner
View author publications
Search author on:PubMedÂ Google Scholar
Andrew K. Lampinen
View author publications
Search author on:PubMedÂ Google Scholar

Contributions

L.M. initiated and led the project and had a major role in every part of the project: writingâ€”original draft, conceptualization, investigation, writingâ€”review and editing, methodology, data curation, validation, formal analysis, software, project administration and visualization. K.G. helped lead the project and had a critical role in major parts (for example, running experiments and creating figures) of the project: writingâ€”original draft, conceptualization, investigation, validation, formal analysis, software, project administration and visualization. F.B. led the Levels data collection and was in full charge of everything concerned with it:Â writingâ€”original draft, investigation, validation, writingâ€”review and editing, data curation, software and visualization. B.S.: writingâ€”review and editing, validation and supervision. S.K.: writingâ€”review and editing, methodology and supervision. M.C.M.: writingâ€”review and editing, validation, resources, methodology and supervision. K.-R.M. helped lead the project and shape some of its ideas: writingâ€”original draft, conceptualization, writingâ€”review and editing, methodology, funding acquisition, resources, supervision and project administration. T.U. hosted L.M. during his first internship at GDM and helped lead the project: conceptualization, investigation, writingâ€”review and editing, methodology, validation, supervision, software, project administration and visualization. A.K.L. had a major role in administering and supervising the project and shaping some of its ideas: writingâ€”original draft, conceptualization, investigation, writingâ€”review and editing, methodology, validation, supervision, project administration and visualization.

Corresponding authors

Correspondence to Lukas Muttenthaler, Klaus-Robert MÃ¼ller or Andrew K. Lampinen.

Ethics declarations

Competing interests

All authors, aside from B.S. and F.B., are or were employed by Google DeepMind.

Peer review

Peer review information

Nature thanks Bhavin Choksi, Gemma Roig and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisherâ€™s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Visualizing the representational reorganization in two dimensions using PCA and TriMap.

Each panel shows a low-dimensional visualization (a: PCA; b: TriMap) the representations of 85k ImageNet images from four different models (columns) onto their first two principal components both before (first row) and after fine-tuning on AligNet (second row). Colors correspond to high-level categories of the images. As shown in both visualizations, AligNet training results in more systematic differentiation of the high-level categories, and more consistent organization across the models. (The third row of the top panel visualizes the proportion of explained variance for the first 32 principal components).

Extended Data Fig. 2 Changes in relative distances between stimuli reflect the superordinate category structure in AligNet modelsâ€”including at earlier layers and at higher levels of the hierarchyâ€” but not in UnAligNet models.

(a) Representation reorganization is similar across four AligNet models; representations move closer or farther depending on their degree of relatedness in the semantic hierarchy. (a) This reorganization also occurs for the animacy distinction in all four models. (c) By contrast, an UnAlignet ablation model shows reorganization at the basic and subordinate levels that is the opposite of what would be desired; viz. these closely-related categories are moving farther apart, with changes that are larger in magnitude than the changes in the superordinate categories. (d) Changes across the layers of ViT-B/16. Although we only optimize representations at the final representation layer, we see directionally consistent reorganization across all layers, with the magnitude of the changes increasing at higher layers. (Errorbars are bootstrap 95%-CIs. Curves are quadratic regression fits to illustrate trends).

Extended Data Fig. 3 Where AligNet and UnAlignet disagree, AligNet focuses on semantics while UnAligNet relies on more superficial appearanceâ€”resulting in humans overwhelmingly favoring AligNet.

Human ratings of semantic similarity on image pairs on which AligNet and UnaligNet maximally disagree. The human participants consistently rate the pairs that AligNet represents more similarly as in fact more semantically similar.

Extended Data Fig. 4 1-shot and 10-shot linear probing performance on 11 common datasets.

This plots compares the few-shot performances of the pre-trained (base) models vs. their performance after being aligned-finetuned. AligNet finetuning is clearly beneficial on most datasets in both 1-shot and 10-shot evaluations. Even models that have previously been trained on ImageNet data (e.g. DINOv1 ViT-B/16 and supervised ViT-B/16) show improvements on 1-shot ImageNet performance, which indicates that the benefits do not merely come from being exposed to ImageNet data, but also from the label information we distilled into AligNet. This plot also includes results for the widely known CLIP¹⁸ model. It generally performs worse compared to the SigLIP model, which is why we picked the latter as the representative text/image model. For the more recent SigLIP2²⁷ we instead find that it has even stronger baseline performance than SigLIP, but it still does benefit in most cases from AligNet finetuning.

Extended Data Fig. 5 AligNet improves out-of-distribution generalization, and robustness wrt. distributional shifts.

(a) Accuracy of a linear readout head trained on top of the (frozen) base models is plotted on the x-axis, against the corresponding performance a linear readout head trained on top of the (frozen) AligNet-finetuned models on the y-axis. Every symbol above the diagonal indicates improved performance (through AligNet), while symbols below the diagonal indicate a performance degradation. Note that for better readability the range of the plot is zoomed in on the pre-trained models. The model trained from scratch (Ã—) is thus often out of bounds due to its substantially lower performance, and its accuracy before and after finetuning is instead written out in a small inset. (b) Additional results for all student models on the four Breeds datasets: â€œentity13â€, â€œentity30â€, â€œliving17â€, and â€œnonliving26â€. Student models are sorted according to their objective function (supervised, captioning, self-supervised, image/text). The randomly initialized ViT-B has the poorest base performance but benefits the most (leftmost column in each panel). Conversely, SigLIP-So400mâ€”the teacher modelâ€”has the strongest base performance and benefits the least but consistently (rightmost column in each panel). Vertical arrows indicate improvements in performance compared to the base performance.

Supplementary information

Supplementary Information

Supplementary Notes, including the detailed results of statistical models, and other analysis details (9 tables and 15 figures).

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Muttenthaler, L., Greff, K., Born, F. et al. Aligning machine and human visual representations across abstraction levels. Nature 647, 349â€“355 (2025). https://doi.org/10.1038/s41586-025-09631-6

Download citation

Received: 14 November 2024
Accepted: 15 September 2025
Published: 12 November 2025
Version of record: 12 November 2025
Issue date: 13 November 2025
DOI: https://doi.org/10.1038/s41586-025-09631-6

Subjects

Abstract

Similar content being viewed by others

Main

Towards more human-aligned models

THINGS triplet odd one out

Other cognitive tasks

Alignment at multiple levels of abstraction

Levels dataset

Alignment at multiple levels

Global coarse-grained

Local fine-grained

Class boundary

AligNet model uncertainties correspond to human latencies

Evaluating other model classes

Aligned models reflect the conceptual hierarchy

Alignment improves generalization and robustness

One-shot classification

Distribution shift

Model robustness

Discussion

Methods

Soft alignment

Representational alignment

Data

Odd-one-out accuracy

Hard-alignment loss

Modelling human uncertainties

Soft-alignment loss

Uncertainty distillation

Data generation

Image data

Triplet sampling

Random

Class boundary

Cluster boundary

Triplet-response generation

Objective function

Surrogate teacher model

Student models

Representational similarity analysis

Multi-arrangement task

Likert scale

Neural network representations

Alignment with conceptual hierarchy

Levels data

Participants

Stimuli

The triplet odd-one-out task

Procedure

Preprocessing of human response times and uncertainty estimation

Human-to-human alignment

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data figures and tables

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links