The Speed of ViTs and CNNs

原始链接: https://lucasb.eyer.be/articles/vit_cnn_speed.html

This article challenges the view that Vision Transformers (ViTs) cannot handle high-resolution images because of their quadratic self-attention. The author argues that ViTs scale perfectly well up to 1024x1024px², which is enough for most image-encoding tasks. Benchmarks across different GPUs show that ViTs are faster and more memory-efficient than comparable CNNs, especially on newer hardware. The author also stresses that high resolution is not always necessary: for many tasks, lower resolutions (224-896px²) suffice, because computer vision models do not need the level of aesthetic detail humans do. Performance gains at higher resolution are often due to the increase in model capacity (FLOPs) rather than the resolution itself. Finally, the article highlights local attention, such as the mechanism in ViTDet, which restricts attention to local windows to make high-resolution ViTs faster and more memory-efficient. The author concludes that ViTs are a viable and often superior alternative to CNNs, and advocates empirical evaluation over preconceived limitations.


Original Article


Context

Computer vision is now powered by two workhorse architectures: Convolutional Neural Networks (CNN) and Vision Transformers (ViT). CNNs slide a feature extractor (stack of convolutions) over the image to get the final, usually lower-resolution, feature map on which the task is performed. ViTs on the other hand cut the image into patches from the start and perform stacks of self-attention on all the patches, leading to the final feature map, also of lower resolution.

It is often stated that because of the quadratic self-attention, ViTs aren't practical at higher resolution. As the most prominent example, Yann LeCun, Godfather of CNNs, has publicly stated exactly this criticism.

However, I believe this criticism is a misguided knee-jerk reaction and, in practice, ViTs scale perfectly fine up to at least 1024x1024px², which is enough for the vast majority of usage scenarios for image encoders.

In this article, I make two points:

  • ViTs scale just fine up to at least 1024x1024px²
  • For the vast majority of uses, that resolution is more than enough.

ViTs scale just fine with resolution

First, I set out to quantify the inference speed of plain ViTs and CNNs on a range of current GPUs. To give this benchmark as wide an appeal as possible, I stray away from my usual JAX+TPU toolbox and perform benchmarking using PyTorch on a few common GPUs. I use models from the de-facto standard vision model repository timm, and follow PyTorch best practices in terms of benchmarking and performance by using torch.compile. I further sweep over dtype (float32, float16, bfloat16), attention implementation (sdpa_kernel), and matmul precision (set_float32_matmul_precision) and take the best setting among all these for each measurement. Since I am quite rusty in PyTorch, here is my full benchmarking code, and I'll be glad to take feedback from experts. Besides just speed, I also compute FLOPs and measure peak memory usage. Thanks to RunPod for providing the compute and making the benchmarking easy.
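For concreteness, here is a minimal sketch of the measurement recipe described above (it is not my full benchmarking code). The model name, resolution, and batch size are illustrative placeholders, and the dtype/attention/matmul-precision sweep is reduced to a single setting:

    import time
    import torch
    import timm

    def bench(model_name="vit_base_patch16_224", res=224, batch=8,
              dtype=torch.bfloat16, steps=30, warmup=10):
        # One point of the sweep: matmul precision, dtype, torch.compile.
        torch.set_float32_matmul_precision("high")
        # img_size is only needed for ViT-style timm models; CNNs accept any input size.
        model = timm.create_model(model_name, pretrained=False, img_size=res)
        model = torch.compile(model.eval().to("cuda", dtype))
        x = torch.randn(batch, 3, res, res, device="cuda", dtype=dtype)

        with torch.inference_mode():
            for _ in range(warmup):        # warm-up covers compilation and autotuning
                model(x)
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats()
            t0 = time.perf_counter()
            for _ in range(steps):
                model(x)
            torch.cuda.synchronize()       # wait for all queued kernels before stopping the clock
            dt = (time.perf_counter() - t0) / steps

        mem_gib = torch.cuda.max_memory_allocated() / 2**30
        print(f"{model_name} @ {res}px², bs={batch}: {dt*1e3:.1f} ms/batch, {mem_gib:.2f} GiB peak")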

I benchmarked various devices and batch sizes (1, 8, and 32), so take your pick in the interactive figures. (They also include a checkbox that adds the local-attention ViTDet variant, which I come back to below.)

Now that you've browsed these measurements a bit, I hope we get to the same conclusions:

  • ViTs scale just fine with resolution, at least up to 1024px².
  • Oftentimes, ViT is faster than an equivalent CNN, especially on more modern GPUs.
  • FLOPs != speed, see also my Efficiency Misnomer paper with Mostafa, Yi, Anurag, and Ashish on the topic.
  • Out of the box, ViT is more memory-efficient. On the RTX 3070, it is the only model that can go beyond 512px².

But wait, it gets better! We already had all of this in the original ViT paper: we had successfully scaled ResNets before anyone else and were most annoyed by memory, so we (me, specifically) included a figure in the appendix making exactly these points.

This figure was on TPUv3, i.e. a few generations ago. This blogpost is on various current GPUs. I think it is safe to say that these are universal take-aways between ViTs and CNNs by now; they have stood the test of time.

You don't need very high resolution

My second argument is that people waste too much of their time focussing on resolution, aspect ratio, and related things.

My conservative claim is that you can always stretch to a square, and for:

  • natural images, meaning most photos, 224px² is enough;
  • text in photos, phone screens, diagrams and charts, 448px² is enough;
  • desktop screens and single-page documents, 896px² is enough.

(Yes, you recognized correctly, these are the PaliGemma numbers. That's no coincidence.)

Higher resolutions exist purely for human consumption: for the aesthetic beauty of very crisp lines, and to avoid eye strain. However, computer vision models do not suffer from eye strain, and do not care about aesthetic beauty. At least for now, while AI is not yet sentient.

There are a few very special exceptions, including medical and satellite images or multi-page documents. I believe these can be split into pieces of any of the above sizes, with maybe a light global feature. But I am no expert in those.

The most important thing is to always look at your data, the same way your model will see it. If you can solve your task looking at it, even with effort, then so can your model. Let's do this for a few representative images:

[Interactive demo: pick an image type (natural, natural with text, smartphone, chart, diagram, desktop, document), an intrinsic resolution (128px² through 1024px², or the original), and a resize method (nearest, area, bilinear with and without antialiasing, bicubic, Gaussian, Lanczos 3px/5px, Mitchell), and see what the model would see.]

The example images are:

  • Natural: MSCOCO validation image ID 136355, original resolution 640x427.
  • Natural with text: ST-VQA (Scene-Text VQA), IIIT-Text subset, image 385, original resolution 1600x1195.
  • Smartphone: image 55459 from the RICO dataset, original resolution 1080x1920.
  • Chart: two_col_40643 from the ChartQA validation set, original resolution 800x1556. Note that I chose an unusually long chart to exemplify an extreme case of aspect-ratio stretching. Still, 512px² is enough.
  • Diagram: image 3337 from the AI2 Diagrams dataset, original resolution 1500x968.
  • Desktop: a screenshot of Lucas' desktop, reading a random paper. Lucas really likes tiny fonts and icons to maximize space, so this is an extreme case. Original resolution 3840x2400.
  • Document: image mtvg_0227_2 from the DocVQA dataset. I chose an especially bad document image with very small text, most are significantly more legible. Original resolution 1818x2884.
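If you want to run the same check on your own data, a few lines of Pillow are enough (a sketch; the file path is a placeholder): stretch the image to a square at the candidate resolution and simply look at the result.

    from PIL import Image

    img = Image.open("your_image.png").convert("RGB")
    for res in (224, 448, 896):
        # Stretch to a square, no aspect-ratio preservation, as argued above.
        preview = img.resize((res, res), Image.Resampling.BILINEAR)
        preview.save(f"preview_{res}.png")   # open these: can *you* still solve the task?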

Hopefully this demo convinced you that I am right.

Resolution... or compute?

One important point that the vast majority of people forget when they talk about resolution is that increasing resolution also significantly increases the model's capacity. Now, capacity is a fuzzy concept, but it is generally agreed to be a weird mixture of the model's size, measured in parameters and unaffected by resolution, and the model's compute (FLOPs), which, as we've just seen, increases significantly with resolution.

So, while increasing resolution has been a common trick to increase performance since FixRes and BiT in 2019, it took a whole five years for someone (me) to clearly disentangle these two factors, in the 2024 PaliGemma report. We ran an experiment where we compute performance at 224px² resolution and at 448px² resolution, but also at 448px² resolution after first resizing the image down to 224px² and then back up to 448px². This last setting uses the compute (FLOPs) of the 448px² setting but only the raw information content of the 224px² setting, so any improvement it shows over the 224px² setting is purely due to model capacity.
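A sketch of the three input variants from this experiment (the function name is mine and bilinear resizing is an assumption; the report describes the idea, not this exact code):

    import torch.nn.functional as F

    def make_variants(image):   # image: (1, 3, H, W) float tensor
        res224 = F.interpolate(image, size=(224, 224), mode="bilinear", antialias=True)
        res448 = F.interpolate(image, size=(448, 448), mode="bilinear", antialias=True)
        # Down to 224px², then back up to 448px²: the FLOPs of the 448px² setting,
        # but only the information content of the 224px² setting.
        res448_info224 = F.interpolate(res224, size=(448, 448), mode="bilinear")
        return res224, res448, res448_info224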

As we can clearly see, a lot (but not all) of the improved performance at 448px² comes from the increased capacity. For example, the improved ChartQA results can almost entirely be attributed to capacity increase, not resolution increase.

Bonus: Local Attention

Besides all this, there is a very simple and elegant mechanism to make ViTs for high resolution even faster and more memory efficient: local attention. In local attention, the image (or feature-map) is split into non-overlapping windows, and a token only attends to other tokens within its window. Effectively, this means the windows are moved to the batch dimension for the local attention operation.
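Here is a minimal sketch of what "windows in the batch dimension" means in practice, assuming a square token grid whose side is divisible by the window size (the function is illustrative, not the ViTDet code):

    import torch
    import torch.nn.functional as F

    def window_attention(q, k, v, grid, win):
        # q, k, v: (B, heads, grid*grid, dim), tokens in row-major order over the grid.
        B, H, N, D = q.shape
        G = grid // win                                   # windows per side

        def to_windows(t):
            t = t.view(B, H, G, win, G, win, D)
            t = t.permute(0, 2, 4, 1, 3, 5, 6)            # move the window indices next to the batch
            return t.reshape(B * G * G, H, win * win, D)  # windows *are* the batch now

        # Each window attends only to its own win*win tokens, not to all grid*grid tokens.
        out = F.scaled_dot_product_attention(to_windows(q), to_windows(k), to_windows(v))

        # Undo the windowing to recover the (B, heads, grid*grid, dim) layout.
        out = out.view(B, G, G, H, win, win, D)
        out = out.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, H, N, D)
        return out

For the ViTDet-style upcycling described next, win would simply be the pre-training token grid side, e.g. 14 tokens for a /16 ViT pre-trained at 224px².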

The UViT and ViTDet papers introduced this idea and suggest using local attention in most layers of a high-resolution ViT, with global attention in only a few. Even better: ViTDet suggests upcycling plain ViTs that were pre-trained at low resolution (say 224px²) into high-resolution ones by using the pre-training resolution as the window size for most layers. This ViTDet-style local attention was then successfully used by the Segment Anything (SAM) line of work.

This has negligible impact on the model's quality while being very simple, elegant, and compatible. Importantly, I am not aware of an equally simple and effective idea for CNNs. This, and token-dropping, are examples of beautiful ideas that become possible thanks to ViT's simplicity, and would be hard and complicated to implement properly with CNNs.

Now, scroll back up to the benchmark figures, and check that ViTDet checkbox you previously ignored. Now even at 1024px², the ViTDet is faster than the ConvNeXt.

[Interactive figure: ViTDet architecture schematic visualizing local attention.]

Final thoughts

Training

This was all for inference. Doing similar measurements for training would be interesting too. In my experience, the take-aways are the same regarding speed (with a roughly architecture-independent factor of 3x). The memory consumption could look different, as we need to keep many buffers alive for backprop. But in my experience training many of these models, ViTs are also more memory-efficient during training.
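A sketch of how the inference benchmark above could be extended to a training step (the dummy loss and optimizer are my placeholders); the backprop buffers then show up in peak memory:

    import time
    import torch

    def bench_train(model, x, steps=30):
        # model and x as in the inference benchmark, but with gradients enabled.
        opt = torch.optim.AdamW(model.parameters())
        for _ in range(5):                    # brief warm-up (compilation, autotuning)
            model(x).square().mean().backward()
            opt.step(); opt.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        t0 = time.perf_counter()
        for _ in range(steps):
            loss = model(x).square().mean()   # dummy loss, just to drive backprop
            loss.backward()
            opt.step()
            opt.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / steps, torch.cuda.max_memory_allocated() / 2**30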

Learning ability

Besides speed and scalability, one should also think about what works with which architecture. Several ideas in recent literature are explicitly said to work with ViTs but not with CNNs: "MoCo v3 and SimCLR are more favorable for ViT-B than R50", "This property emerges only when using DINO with ViT architectures, and does not appear with other existing self-supervised methods nor with a ResNet-50", and the patch dropping idea from Masked AutoEncoders is only possible with the plain ViT architecture with non-overlapping patches. For image-text training à la CLIP, both the original CLIP paper and my unpublished experiments show a clearly better performance when using a ViT encoder vs other convolutional encoders, however none of us has a good explanation of why that would be the case. Notably, two of these four references are from Kaiming He, the inventor of ResNets.

Preference

At the end of the day, use whatever works best in your scenario and constraints. Constraints may include things like familiarity or availability of checkpoints. I am not religious about architectures; ViT happens to fit most of my use cases well. The only thing I am religious about is not making unfounded claims, and calling them out when I see them =)

Acknowledgements: I thank Alexander Kolesnikov and Xiaohua Zhai for feedback on a draft of this post.

If this has been useful to your research, consider citing it:

@misc{beyer2024vitspeed,
  author = {Beyer, Lucas},
  title = {{On the speed of ViTs and CNNs}},
  year = {2024},
  howpublished = {\url{http://lb.eyer.be/a/vit-cnn-speed.html}},
}

