Vision Transformers Are Becoming the New…

Across the AI industry, convolutional neural networks (ConvNets) are being replaced for computer vision tasks by Vision Transformers (ViTs), or by hybrid networks that combine the two. The key difference between ConvNets and ViTs (as far as I’m able to understand, anyway) is that ConvNets build in an assumption of locality in images (or video frames): pixels that are closer together in an image are more likely to be part of the same semantic or physical whole. ViTs instead split an image into patches and use self-attention to relate every patch to every other patch, without assuming in advance that proximity tells us anything about the world.
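To make the locality point concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any production ConvNet or ViT is implemented. The helper names (`conv3x3_mean`, `self_attention`) are hypothetical, the "convolution" is a fixed 3x3 averaging filter, and the "attention" treats every pixel as its own scalar token with identity projections, whereas real ViTs work on patches with learned projections and positional embeddings. The sketch changes one far-away pixel and checks which output at the top-left corner notices.

```python
import math

def conv3x3_mean(img, r, c):
    """3x3 averaging filter at (r, c): only the immediate neighbours contribute."""
    h, w = len(img), len(img[0])
    vals = [img[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]
    return sum(vals) / len(vals)

def self_attention(tokens):
    """Single-head dot-product self-attention over scalar tokens.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), which keeps the example tiny.
    """
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        m = max(scores)                      # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, tokens)) / total)
    return out

def flatten(img):
    """Turn the 2D grid into a flat token sequence, one token per pixel."""
    return [v for row in img for v in row]

# A 6x6 toy "image" with distinct pixel values.
size = 6
img = [[(i * size + j + 1) / (size * size) for j in range(size)] for i in range(size)]

conv_before = conv3x3_mean(img, 0, 0)
attn_before = self_attention(flatten(img))

# Change the pixel in the opposite corner, far from (0, 0).
img[size - 1][size - 1] = 0.0

conv_after = conv3x3_mean(img, 0, 0)
attn_after = self_attention(flatten(img))

print(conv_before == conv_after)        # True: the local filter never sees the far pixel
print(attn_before[0] == attn_after[0])  # False: attention at token 0 mixes in every token
```

Stacking many convolutional layers does eventually widen what each output can see, so the contrast is about what a single layer assumes, not about what a deep network can ultimately learn.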
