Apple's FastVLM: Significantly Faster Vision Language Models
FastVLM: Efficient vision encoding for vision language models

Original link: https://github.com/apple/ml-fastvlm

FastVLM is an efficient vision-encoding approach for vision language models (VLMs), designed to reduce the time spent encoding high-resolution images. It introduces FastViTHD, a hybrid vision encoder that emits fewer tokens and achieves a faster time-to-first-token (TTFT). The smallest variant significantly outperforms LLaVA-OneVision-0.5B, with a faster TTFT and a smaller vision encoder; the larger variants, paired with Qwen2-7B, surpass recent models such as Cambrian-1-8B while using a single image encoder and maintaining a faster TTFT. The repository provides instructions for installation, training/fine-tuning with the LLaVA codebase, and inference. Pretrained checkpoints can be downloaded with `get_models.sh`. Inference instructions are provided for PyTorch, Apple Silicon (including pre-exported models), and Apple devices, and an iOS demo app showcases the model's on-device performance. The work was published at CVPR 2025 and citation information is included. The project acknowledges several open-source contributions, and license information is provided for both the code and the released models.

Apple's FastVLM, an efficient vision-encoding model, drew attention on Hacker News for its potential in on-device applications. While the 2 GB size of the smallest 0.5B-parameter model raised concerns about app download sizes, many speculated that Apple plans to preload these models at the OS level and expose them through a developer SDK. That would let apps run a capable vision language model (VLM) locally, improving privacy and reducing latency. The discussion explored per-app customization via LoRA fine-tuning on top of an OS-provided base model; commenters debated how practical LoRA is for large language models (LLMs) compared with its success in diffusion image models, while others suggested model quantization to shrink model size. Some commenters discussed real-time voice-plus-vision applications and use cases such as assisting visually impaired users, and a few said the model needs to follow instructions better. Many pointed to the potential advantages of on-device inference, including lower cost, lower latency, and stronger privacy.
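The LoRA customization idea floated in that discussion is easy to sketch. The snippet below is purely illustrative and is not FastVLM's training code: it assumes a Hugging Face-hosted base LLM (Qwen/Qwen2-0.5B is only a stand-in for an OS-provided base model) and the PEFT library, and the target module names depend on the underlying model architecture.

# Illustrative LoRA sketch using Hugging Face PEFT; not FastVLM's actual training flow.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen2-0.5B"  # stand-in for an OS-provided base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension: adapters stay small enough to ship per app
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names depend on the base LLM
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable

Because only the adapter weights are trained and stored, each app would ship a few megabytes of LoRA deltas rather than a multi-gigabyte model, which is the appeal of the OS-provided-base-plus-adapter approach discussed above.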

Original text

This is the official repository of FastVLM: Efficient Vision Encoding for Vision Language Models. (CVPR 2025)

[Figure: accuracy vs. latency comparison.]

  • We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
  • Our smallest variant outperforms LLaVA-OneVision-0.5B with 85x faster Time-to-First-Token (TTFT) and 3.4x smaller vision encoder.
  • Our larger variants using Qwen2-7B LLM outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.
  • Demo iOS app to demonstrate the performance of our model on a mobile device.

We use the LLaVA codebase to train FastVLM variants. To train or fine-tune your own variants, please follow the instructions provided in the LLaVA codebase. We provide instructions below for running inference with our models.

conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
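
After installation, an optional sanity check (not part of the repository's instructions) can confirm that PyTorch imports correctly and report which GPU backend, if any, is available before choosing between the PyTorch and Apple Silicon inference paths below.

# Optional environment check; not from the FastVLM repo.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("MPS available:  ", torch.backends.mps.is_available())  # Apple Silicon GPU backend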

For detailed information on various evaluations, please refer to our paper.

To download all the pretrained checkpoints, run the command below. Note that this may take a while depending on your connection, so it might be a good time to grab ☕️ while you wait.

bash get_models.sh   # Files will be downloaded to `checkpoints` directory.

To run inference with a PyTorch checkpoint, follow the instructions below.

python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
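
For programmatic use, a minimal sketch along the lines of the predict.py entry point is shown below. It assumes the LLaVA-style Python API that this codebase builds on; the checkpoint directory, model name, and conversation-template name ("qwen_2") are placeholders and may differ from what the repository actually ships, so treat it as a sketch rather than a drop-in script.

# Minimal programmatic inference sketch, assuming the LLaVA-style API this repo builds on.
# Checkpoint path, model name, and conversation template are assumptions.
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "checkpoints/llava-fastvithd_0.5b_stage3"  # hypothetical checkpoint directory
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, model_base=None, model_name="llava-fastvithd_0.5b_stage3"
)

image = Image.open("image.png").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# Build a prompt that contains the image placeholder token.
conv = conv_templates["qwen_2"].copy()  # template name is an assumption
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe the image.")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids, images=image_tensor, max_new_tokens=256, do_sample=False
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())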

Inference on Apple Silicon

To run inference on Apple Silicon, PyTorch checkpoints have to be exported to a format suitable for running on Apple Silicon; detailed instructions and code can be found in the model_export subfolder. Please see the README there for more details.

For convenience, we provide three models that are already in an Apple Silicon compatible format: fastvlm_0.5b_stage3, fastvlm_1.5b_stage3, and fastvlm_7b_stage3. We encourage developers to export the model of their choice at the appropriate quantization level by following the instructions in model_export.

Inference on Apple Devices

To run inference on Apple devices such as iPhone, iPad, or Mac, see the app subfolder for more details.

If you found this code useful, please cite the following paper:

@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}

Our codebase is built using multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.

Please check out the repository LICENSE before using the provided code and LICENSE_MODEL for the released models.
