Across the AI industry, convolutional neural networks (or ConvNets) are being replaced for computer vision tasks by Vision Transformers (or ViTs), or by neural networks that hybridize ConvNets and ViTs.
The key difference between ConvNets and ViTs (as far as I’m able to understand, anyway) is that ConvNets rely on an assumption of locality in images (or video frames): pixels that are closer together in an image are more likely to be part of the same semantic or physical whole. ViTs instead split an image into patches and use self-attention to relate every patch to every other patch, so they process the image globally, without assuming up front that pixel proximity tells us anything about the world.
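To make the locality point concrete, here is a minimal sketch in plain Python. It is a toy under loudly stated assumptions, not how a real ConvNet or ViT is implemented: the function names and the 7x7 image are made up for illustration, each pixel is treated as its own token (real ViTs work on patches), and the attention step uses identity projections with no positional embeddings. It compares what a single 3x3 convolution and a single self-attention step "see" when a faraway pixel changes.

```python
import math

def conv3x3_at(image, row, col, kernel):
    """Output of one 3x3 convolution at (row, col): reads only the 3x3 neighbourhood."""
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
    return total

def attention_at(tokens, query_index):
    """Toy single-head self-attention for one position.

    Assumption: scalar tokens and identity query/key/value projections,
    so the score between two tokens is just their product.
    """
    query = tokens[query_index]
    scores = [query * key for key in tokens]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - max_score) for s in scores]
    norm = sum(weights)
    return sum(w / norm * value for w, value in zip(weights, tokens))

def flatten(img):
    """Turn the 2D image into a flat list of tokens, one per pixel."""
    return [pixel for row in img for pixel in row]

SIZE = 7  # hypothetical 7x7 grayscale image with values in [0, 1]
image = [[(r * SIZE + c) / (SIZE * SIZE - 1) for c in range(SIZE)] for r in range(SIZE)]
kernel = [[1 / 9] * 3 for _ in range(3)]  # simple averaging kernel

conv_before = conv3x3_at(image, 1, 1, kernel)
attn_before = attention_at(flatten(image), 1 * SIZE + 1)

# Change a pixel in the opposite corner, far away from position (1, 1).
image[6][6] = 0.0

conv_after = conv3x3_at(image, 1, 1, kernel)
attn_after = attention_at(flatten(image), 1 * SIZE + 1)

print("convolution output changed:", conv_before != conv_after)
print("attention output changed:", attn_before != attn_after)
```

In this toy, the convolution output at (1, 1) should be untouched by the distant change, while the attention output shifts, because every token contributes to it. One caveat: stacking many convolutional layers does widen the region each output can see, so the contrast is sharpest for a single layer.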
Research has found that ConvNets’ assumption of locality is a crutch: it helps ConvNets outperform ViTs on smaller datasets, but as the amount of training data grows, ViTs overtake ConvNets in accuracy. My non-expert hunch is that this is because, with enough data, a ViT learns the locality assumption insofar as it holds true for the images it has seen, but also learns a larger set of biases and rules of thumb that allow it to balance pixel locality against other considerations.
If my hunch is correct, then it’s another demonstration of the general power of neural networks: given enough data, they can learn to replace human-designed assumptions with better ones learned from the data itself. This is a reason to be bullish on ViTs: they seem like a more powerful and more general approach to computer vision, one that will allow us to build ever more accurate models as datasets continue to grow.
When it comes to 3D computer vision tasks for robots operating in the real world, such as self-driving cars, ViTs could (I speculate) reason better about depth than ConvNets. A car at a longitudinal distance, or depth, of 100 metres from the ego vehicle could appear directly adjacent in the image to another car only 10 metres ahead. A transformer is (presumably) less biased toward assuming that cars that are adjacent in a 2D image (or video frame) are adjacent in 3D space.
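The geometry behind that example can be sketched with a simple pinhole camera model. Everything numeric here is an assumption chosen for illustration: the focal length, the image centre, and the two car positions are hypothetical, and a real camera would also need lens distortion and calibration handled.

```python
import math

# Hypothetical camera intrinsics (assumed for illustration only).
FOCAL_LENGTH_PX = 1000.0
CENTER_U, CENTER_V = 640.0, 360.0

def project(point_3d):
    """Pinhole projection of (x, y, z) in metres to (u, v) in pixels.

    x is lateral offset, y is vertical offset, z is depth in front of the camera.
    """
    x, y, z = point_3d
    u = FOCAL_LENGTH_PX * x / z + CENTER_U
    v = FOCAL_LENGTH_PX * y / z + CENTER_V
    return (u, v)

# Two hypothetical cars, positions relative to the ego vehicle's camera.
near_car = (1.0, 0.0, 10.0)    # 10 metres ahead, 1 metre to the right
far_car = (10.5, 0.0, 100.0)   # 100 metres ahead, 10.5 metres to the right

near_pixel = project(near_car)
far_pixel = project(far_car)

pixel_gap = math.dist(near_pixel, far_pixel)  # separation in the 2D image
distance_3d = math.dist(near_car, far_car)    # separation in the real world

print(f"gap in image: {pixel_gap:.1f} px")
print(f"gap in 3D space: {distance_3d:.1f} m")
```

Under these assumed numbers, the two cars land only a few pixels apart in the image even though they are roughly 90 metres apart in space, which is the kind of case where treating image adjacency as physical adjacency would mislead a model.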
To make safety-critical real-world robotics tasks, such as autonomous driving, practically feasible, a major improvement in computer vision is needed. It is, therefore, encouraging that the fundamental neural network architecture of computer vision is in the midst of being revolutionized.
Disclosure: this post was written with the assistance of a text-generating AI.