Enterprise technology decision-makers evaluating computer vision AI need to understand what information neural networks actually use to make predictions. New research from Yıldırım, posted on arXiv, reports that image classifiers rely primarily on phase information in their hidden representations, while image-specific magnitude is largely dispensable. The finding bears on how models represent visual data and on why convolutional and attention-based architectures behave differently.
The study builds on the classic Oppenheim and Lim (1981) result showing that natural images remain recognizable when reconstructed from Fourier phase alone. The researchers ask whether trained image classifiers reproduce this asymmetry inside their hidden layers and test it causally: given two images, they transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows.
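The transplant described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it applies a 2D Fourier transform over the spatial axes of an array, and the helper name `transplant_phase` and the random stand-in arrays are hypothetical. The study performs the swap on hidden representations at a chosen layer, and its exact definition of phase for each architecture may differ.

```python
import numpy as np

def transplant_phase(magnitude_source, phase_source):
    """Combine the Fourier magnitude of one array with the Fourier phase of another.

    Both inputs are real arrays of the same shape; the transform runs over
    the last two (spatial) axes. Hypothetical helper, for illustration only.
    """
    spec_mag = np.fft.fft2(magnitude_source, axes=(-2, -1))
    spec_phase = np.fft.fft2(phase_source, axes=(-2, -1))
    # Keep |A| from the magnitude donor and the angle from the phase donor.
    hybrid_spec = np.abs(spec_mag) * np.exp(1j * np.angle(spec_phase))
    # For real inputs the hybrid spectrum stays conjugate-symmetric, so the
    # inverse transform is real up to floating-point error.
    return np.fft.ifft2(hybrid_spec, axes=(-2, -1)).real

# Stand-ins for two images or feature maps, shaped (channels, height, width).
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 16, 16))
b = rng.standard_normal((3, 16, 16))

# Magnitude from a, phase from b. The causal question is which source
# a classifier's prediction follows when it is fed this hybrid.
hybrid = transplant_phase(magnitude_source=a, phase_source=b)
```

Feeding the hybrid onward through the rest of the network and comparing the prediction with the labels of the two sources is what turns the swap into a causal test.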
Architectures Tested
The study examines four models: PRISM2D, GFNet, ViT-B/16, and ResNet-50. For PRISM2D, GFNet, and ViT-B/16, the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy. This means identity rides on phase while image-specific magnitude is largely dispensable to the readout.
| Architecture | Behavior | Key Insight |
|---|---|---|
| PRISM2D | Prediction follows phase donor | Phase dominance in hidden layers |
| GFNet | Prediction follows phase donor | Phase dominance in hidden layers |
| ViT-B/16 | Prediction follows phase donor | Phase dominance in hidden layers |
| ResNet-50 | Appears to break pattern; latent sign code before ReLU | Rectification and readout geometry expose phase code differently |
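One way to picture "deleting all image-specific magnitude" is to give every sample in a batch the same magnitude spectrum while leaving each sample's phase untouched. The sketch below uses the batch-mean magnitude as the shared replacement. That choice, along with the function name, is an assumption for illustration; the article does not state which replacement the study uses.

```python
import numpy as np

def delete_image_specific_magnitude(features, shared_magnitude=None):
    """Replace each sample's Fourier magnitude with one shared magnitude.

    features: real array shaped (batch, channels, height, width).
    shared_magnitude: optional array broadcastable to the spectrum shape;
    defaults to the batch-mean magnitude (an assumption, not the study's
    documented choice).
    """
    spectrum = np.fft.fft2(features, axes=(-2, -1))
    unit_phase = np.exp(1j * np.angle(spectrum))
    if shared_magnitude is None:
        shared_magnitude = np.abs(spectrum).mean(axis=0, keepdims=True)
    # Every sample now carries the same magnitude; only phase differs.
    return np.fft.ifft2(shared_magnitude * unit_phase, axes=(-2, -1)).real

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 3, 8, 8))  # hypothetical feature maps
ablated = delete_image_specific_magnitude(batch)
```

If accuracy measured on `ablated` features stays close to accuracy on the originals, the readout is not relying on image-specific magnitude.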
ResNet-50's Latent Phase Code
ResNet-50 at first seems to break the pattern because transplanting sign after its ReLUs does nothing. However, a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry.
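A toy NumPy example shows why the placement of the intervention matters. After a ReLU every activation is non-negative, so there is no sign pattern left to transplant; before the ReLU the donor's signs decide which units survive rectification. The same sketch also shows why a DC-only control is informative for a readout that averages over space: global average pooling sees only the channel-wise spatial mean, which is the zero-frequency (DC) Fourier coefficient divided by the number of spatial positions. All function names and the random activations are hypothetical stand-ins, not the study's code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def transplant_sign(magnitude_source, sign_source):
    """Keep |magnitude_source| and impose the sign pattern of sign_source."""
    signs = np.where(sign_source >= 0, 1.0, -1.0)
    return signs * np.abs(magnitude_source)

def dc_only(features):
    """Replace every spatial position with its channel-wise spatial mean."""
    mean = features.mean(axis=(-2, -1), keepdims=True)
    return np.broadcast_to(mean, features.shape).copy()

def global_average_pool(features):
    return features.mean(axis=(-2, -1))

rng = np.random.default_rng(0)
pre_a = rng.standard_normal((8, 4, 4))  # hypothetical pre-ReLU activations (C, H, W)
pre_b = rng.standard_normal((8, 4, 4))

# Transplant after the ReLU: both inputs are non-negative, so nothing changes.
post_swap = transplant_sign(relu(pre_a), relu(pre_b))
swap_after_relu_is_noop = np.array_equal(post_swap, relu(pre_a))

# Transplant before the ReLU: the donor's sign pattern decides what survives.
pre_swap = relu(transplant_sign(pre_a, pre_b))
support_follows_donor = np.array_equal(pre_swap > 0, pre_b > 0)

# DC-only control: a pooled readout cannot tell the two feature maps apart.
pooled_full = global_average_pool(pre_a)
pooled_dc = global_average_pool(dc_only(pre_a))
dc_coeff = np.fft.fft2(pre_a, axes=(-2, -1))[..., 0, 0].real / (4 * 4)
```

In this toy setting `swap_after_relu_is_noop` and `support_follows_donor` both evaluate to true, and `pooled_full`, `pooled_dc`, and `dc_coeff` agree up to floating-point error.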
Mechanistic Account of Texture–Shape Gap
The paper offers a mechanistic account of the texture–shape gap between CNNs and attention models: because rectification and readout geometry expose the shared phase code differently, convolutional networks and transformer-based models behave differently when confronted with texture versus shape cues.
Implications for Enterprise Computer Vision
For technology leaders deploying computer vision in applications such as quality inspection, autonomous navigation, or document digitization, the practical reading is that class identity is carried mainly by phase. The study measures phase inside hidden layers rather than at the input, so any robustness gain from preserving phase features during preprocessing or compression is a hypothesis to test, not a demonstrated result. Magnitude information, while not entirely useless, is less critical for the final classification. This insight could guide the design of more efficient neural architectures that focus computation on phase processing.
The research also underscores that seemingly similar architectures may encode information differently due to activation functions and readout mechanisms. When selecting a model for a specific visual task, enterprises should consider not just final accuracy but how the model internally represents features.
All architectures studied share a common phase/sign identity code, but rectification and readout geometry determine how that code is exposed to the readout. This understanding can help explain, and potentially narrow, the texture–shape gap between CNNs and attention models in practical deployments.