Many-Two-One

Diverse Representations Across Visual Pathways Emerge from A Single Objective

EPFL · Stanford

Abstract

How the human brain supports diverse behaviours has been debated for decades. The canonical view divides visual processing into distinct "what" and "where/how" streams; however, their origin and independence remain contested. Here, using deep neural network models that accurately predict hours of brain recordings, we computationally characterise how cortex processes dynamic vision. Despite the diversity of cortical regions and the tasks they support, we identify two fundamental computations that explain neural activity across visual cortex: object recognition and appearance-free motion recognition. Strikingly, a single objective underlies both: these inherent computations in the brain emerge from optimising for understanding world dynamics, and their arrangement across cortex is highly distributed and smooth rather than strictly aligned with the two visual streams. Our results suggest that the human brain's ability to integrate complex information across seemingly distinct representational pathways may originate from the single goal of modelling the world.

Functional Objective of the Brain

How do we understand the computations in the brain? One way is to build hypothesis models that encode specific computational mechanisms and evaluate how well their stimulus-response patterns align with those of real neural circuits. Task-driven approaches have recently proven effective for modeling brain function. For example, image classification as an objective yields deep networks whose representations closely match activity in the ventral visual stream. In a multitask setting, the brain may contain distinct “processing pathways” for different tasks. A model that excels at task A likely shares representations with the neural pathway specialized for A. Moreover, a neuron may support multiple tasks (A&B), requiring alignment with models optimized for multitask computation. In our work, we first find brain-aligned models and then characterize the tasks supported by their representations, thereby inferring the visual system’s functional objectives.


Brain-Model Alignment

We begin by assessing how closely current deep neural networks resemble the brain, using both neural and behavioral alignment. For neural alignment, we measure linear decoding performance from DNN representations to fMRI responses during video viewing. For behavioral alignment, we compare model and human error patterns on action recognition tasks under varying presentation conditions.
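As a concrete illustration, below is a minimal sketch of both metrics, assuming model activations, fMRI responses, and per-condition accuracies have already been extracted. All names, shapes, and the ridge-regression and correlation choices are illustrative, not the paper's exact pipeline.

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def pearson_per_voxel(pred, true):
    """Column-wise Pearson r between predicted and measured responses."""
    pred = pred - pred.mean(axis=0)
    true = true - true.mean(axis=0)
    return (pred * true).sum(axis=0) / (
        np.linalg.norm(pred, axis=0) * np.linalg.norm(true, axis=0))

def neural_alignment(features, voxels, n_splits=5):
    """features: (n_TRs, n_units) DNN activations per fMRI timepoint.
    voxels: (n_TRs, n_voxels) BOLD responses to the same video."""
    folds = []
    for train, test in KFold(n_splits).split(features):
        reg = RidgeCV(alphas=np.logspace(-2, 6, 9))
        reg.fit(features[train], voxels[train])
        folds.append(pearson_per_voxel(reg.predict(features[test]),
                                       voxels[test]))
    return np.mean(folds, axis=0)  # one alignment score per voxel

def behavioral_alignment(model_acc, subject_accs):
    """model_acc: (n_conditions,) model accuracy per presentation condition.
    subject_accs: (n_subjects, n_conditions) human accuracies."""
    r_model = pearsonr(model_acc, subject_accs.mean(axis=0))[0]
    # Noise ceiling: each subject against the mean of the other subjects.
    ceiling = np.mean([
        pearsonr(s, np.delete(subject_accs, i, axis=0).mean(axis=0))[0]
        for i, s in enumerate(subject_accs)])
    return r_model / ceiling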

Across all model families, we find that dynamic models achieve state-of-the-art alignment with visual cortex activity and behavioral responses, outperforming both static and traditional models. They accurately predict neural activity on a second-by-second basis. Meanwhile, many dynamic models closely mirror human error patterns, with several approaching the level of inter-subject consistency observed among human participants. Several dynamic families yield models with high neural alignment, while action recognition models have an advantage in behavioral alignment. What drives the superior alignment of these dynamic models?


Brain-Task Relevance

To understand what drives brain alignment, we decode the top layers of DNNs to identify which cognitive tasks their representations support. We use a diverse set of tasks—static, dynamic, and hybrid—annotated in blue, red, and yellow, respectively, to pinpoint the role of dynamic processing. We then correlate task performance with brain alignment: stronger correlations suggest that the corresponding task is more closely tied to the brain's underlying functional objective. Hybrid tasks generally show higher relevance to brain alignment. Notably, the combination of purely static object recognition and purely dynamic motion recognition yields the highest alignment—and together, they explain away the explanatory power of all other tasks.
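A sketch of this probing procedure, assuming frozen top-layer features and task labels are available; the classifier choice and the hypothetical helper names are illustrative.

import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_task(features, labels):
    """Cross-validated linear-probe accuracy on frozen DNN features."""
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, features, labels, cv=5).mean()

# Across a set of models, correlate each task's probe scores with the
# models' brain-alignment scores; a higher rank correlation marks the
# task as closer to the brain's functional objective.
def task_relevance(task_scores_per_model, alignment_per_model):
    return spearmanr(task_scores_per_model, alignment_per_model).correlation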

Are these two computations fundamental to the brain? Indeed, brain alignment increases as both object and motion capacities improve (upper-left). Moreover, dimensionality reduction on voxel-wise task relevances (upper-right) reveals a 2D eigenspace that explains 97% of the variance, with two principal axes emerging: object form and motion dynamics. Several regions in fact perform hybrid computations, with voxels spread between the two axes. Further analysis at the region and stream levels (bottom) shows that these two axes account for most of the variance across regions and explain every visual stream. This again highlights the highly hybrid and distributed nature of computations within individual streams.
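For intuition, the eigenspace analysis can be sketched as a PCA over the voxel-by-task relevance matrix; the matrix below is a random placeholder standing in for the real relevance scores.

import numpy as np
from sklearn.decomposition import PCA

# (n_voxels, n_tasks) matrix of voxel-wise task relevances; placeholder here.
relevance = np.random.rand(50_000, 12)

pca = PCA(n_components=2)
embedding = pca.fit_transform(relevance)     # each voxel in the 2D eigenspace
print(pca.explained_variance_ratio_.sum())   # ~0.97 for the real data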


Evidence for a Unified Objective

Evidence from Task Mapping

The distributed nature of computations along the two principal axes raises a key puzzle: there appears to be no clear boundary between the putatively specialized processing streams. In fact, object and motion processing transition smoothly across cortical topography (left). Their distribution is also unimodal, both across the cortex and along the computational hierarchy (right). These findings challenge the longstanding two-visual-systems theory, suggesting it may be a biased conceptual model. Instead, we propose that the brain is optimized for a unified objective, from which the topographical distribution of the two core computations, object form and motion, naturally emerges. Informally, this looks like a foundational world model with topographical constraints.
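One simple way to visualise this claim is a per-voxel preference index between the two computations: a strict two-stream organisation would predict two separated modes, whereas a single central peak indicates a smooth, distributed arrangement. The sketch below uses random placeholder relevances.

import numpy as np
import matplotlib.pyplot as plt

def preference_index(r_object, r_motion, eps=1e-9):
    """Signed index in [-1, 1]: +1 purely object, -1 purely motion."""
    return (r_object - r_motion) / (r_object + r_motion + eps)

idx = preference_index(np.random.rand(50_000), np.random.rand(50_000))
plt.hist(idx, bins=100)   # a single central peak => no hard boundary
plt.xlabel("object vs. motion preference")
plt.ylabel("number of voxels")
plt.show()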

Evidence from Models

To complete our argument, we find that models trained on a single objective can achieve high alignment not only with all regions of the visual system, but also with human behavior and cognitive tasks. These objectives implicitly optimize for world understanding by encoding both object form and motion information. Notably, some self-supervised models, such as V-JEPA and VideoMAE, also achieve state-of-the-art neural alignment, echoing principles from theoretical frameworks like predictive coding and world modeling. Our findings suggest that such self-supervised learning may suffice as a unified principle for learning in the brain.
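As an example of plugging such a model into the alignment pipeline above, here is a sketch of extracting clip-level features from the publicly available VideoMAE checkpoint on HuggingFace (MCG-NJU/videomae-base); the frame loading and the mean-pooling choice are ours.

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base").eval()

# One clip = 16 RGB frames; zeros stand in for real video frames here.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(16)]
inputs = processor(frames, return_tensors="pt")

with torch.no_grad():
    tokens = model(**inputs).last_hidden_state   # (1, n_tokens, hidden_dim)

# Mean-pool tokens into one vector per clip; stacking clips over time
# yields the feature matrix fed to the ridge regression above.
clip_feature = tokens.mean(dim=1).squeeze(0).numpy()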

Please check our full manuscript here, where we discuss more details:
  • How do we align the full hierarchy of a DNN to that of the brain? Do they match?
  • What does the actual neural prediction look like? How is the prediction accuracy distributed across the cortex?
  • What does the correlation between tasks and brain alignment look like?
  • We find new regions that underlie action understanding. What are they?
  • More details on how we think about the unified objective, and more.
Again, thank you for your interest in our work! We hope you find it insightful.

BibTeX


@article{tang2025many,
  title={Many-Two-One: Diverse Representations Across Visual Pathways Emerge from A Single Objective},
  author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
  journal={bioRxiv},
  pages={2025--07},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

Acknowledgement

This website is adapted from LLaVA-VL, Nerfies, and VL-RewardBench, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.