Ollama's new engine for multimodal models

Original link: https://ollama.com/blog/multimodal-models

Ollama now supports multimodal models such as Llama 4 Scout and Gemma 3 through a new engine, improving the reliability and accuracy of local inference. This lets models process images and text together, for example answering questions about an image or comparing multiple images. The new engine isolates individual models, allowing developers to focus on their specific model's implementation without affecting other models. It also improves accuracy by correctly handling large images and their positional information, and caches processed images to speed up responses. Ollama works with hardware partners to optimize memory usage and configures attention mechanisms at the model level for better performance. This work lays the foundation for supporting future modalities such as speech, image generation, video generation, longer context sizes, and improved tool support. The GGML tensor library and hardware partners such as NVIDIA, AMD, and Intel were essential to this progress.

Ollama's new multimodal engine, which aims to provide first-class support for multimodal models, sparked lively discussion in the Hacker News community. While its ease of use and Docker-like approach to running LLMs were praised, Ollama drew criticism for its historical lack of attribution to its core dependency, llama.cpp, and for concerns about its future monetization model. Some users highlighted how easy Ollama makes running multiple models, while others valued llama.cpp's speed and features and objected to how Ollama stores model weights. The discussion also touched on interpersonal conflicts within the llama.cpp project and the challenges of supporting new models. Commenters questioned whether the claim that Ollama's engine is fully independent of llama.cpp is accurate, since it still relies on the GGML library, whose development is closely tied to llama.cpp. Finally, there was interest in practical applications of multimodal models and in the nuances of "user context" in LLMs.

Original article

[Image: Ollama multimodality]

Ollama now supports multimodal models via Ollama’s new engine, starting with new vision multimodal models:

General Multimodal Understanding & Reasoning

Llama 4 Scout

ollama run llama4:scout

(Note: this is a 109 billion parameter, mixture-of-experts model.)

Example: asking location-based questions about a video frame:

[Image: video frame analysis example]

[Image: video frame analysis example answer]

You can then ask follow-up questions:

ollama@ollamas-computer ~ % ollama run llama4:scout
>>> what do you see in this image?  /Users/ollama/Downloads/multimodal-example1.png 
Added image '/Users/ollama/Downloads/multimodal-example1.png'
The image depicts a scenic waterfront area with a prominent clock tower at its center. The clock tower, likely the Ferry Building Clock Tower in San Francisco, features a tall structure with a clock face on each side and an American flag flying atop. 

In the foreground, there are several buildings, including what appears to be a marketplace or shopping center, with a parking lot and a street lined with cars. The background showcases a large body of water, possibly a bay or harbor, with a bridge visible in the distance. The sky above is blue with white clouds, suggesting a daytime setting with pleasant weather conditions.

Overall, the image presents a picturesque view of a waterfront area, highlighting the iconic clock tower and the surrounding architecture, as well as the natural beauty of the bay and sky.

>>> how far is this building from the stanford campus in palo alto? 
The distance between the Ferry Building in San Francisco and the Stanford University campus in Palo Alto is approximately 35 miles (56 kilometers). The drive typically takes around 45-60 minutes, depending on traffic conditions.

>>> what's the best way to get there? 
The best way to get to Stanford University from the Ferry Building in San Francisco depends on your preferences and budget. Here are a few options:

1. **By Car**: Take US-101 South to CA-85 South, then continue on CA-101 South. The journey takes around 45-60 minutes, depending on traffic.

2. **By Train and Shuttle**: Take Caltrain from the San Francisco Station (4th and King Streets) to the Palo Alto Station. From there, you can take the Stanford University shuttle (Red and White) to campus. The total travel time is around 1 hour and 15 minutes.

3. **By Ride-hailing Services**: Uber or Lyft from San Francisco to Stanford University takes around 45-60 minutes, depending on traffic.

Please check for any service disruptions or schedule changes before your trip.

Gemma 3

ollama run gemma3

Example of providing multiple images and asking about their relationship.

This demonstrates how a user can input multiple images at once, or add them via follow-up prompts, and then ask questions about the images.

[Image: multiple images]

[Image: Gemma 3 answers]

ollama@ollamas-computer ~ % ollama run gemma3
>>> /Users/ollama/Downloads/img1.png /Users/ollama/Downloads/img2.png /Users/ollama/Downloads/img3.png /Users/
... ollama/Downloads/img4.png tell me what animal appears in all 4 images? 

Added image '/Users/ollama/Downloads/img1.png'
Added image '/Users/ollama/Downloads/img2.png'
Added image '/Users/ollama/Downloads/img3.png'
Added image '/Users/ollama/Downloads/img4.png'

The animal featured in all four images is a **llama**.

>>> is there a marine mammal in one of the images? 
Yes! You are absolutely right to spot that. 

In the first image, there’s a **dolphin** (whale) boxing with the llama.

>>> in the boxing image, is the llama more likely to win or the whale? 

That’s a fantastic question! Given the image, the llama is *much* more likely to win. 

Here’s why:

*   **Punching Power:** The llama is clearly throwing a punch with a closed fist, demonstrating intent and force.

*   **Whale’s Reaction:** The whale is clearly recoiling in pain and fear.

It's a comical mismatch, but the llama has the upper hand in this boxing match!

Document scanning

Qwen 2.5 VL

ollama run qwen2.5vl

Example of using Qwen 2.5 VL for character recognition.

[Image: check example]

[Image: check answer]

Example understanding and translating vertical Chinese spring couplets to English.

[Image: spring couplets example]

[Image: spring couplets answer]

Ollama’s new multimodal engine

[Image: Ollama's new car]

Ollama has so far relied on the ggml-org/llama.cpp project for model support and has instead focused on ease of use and model portability.

As major research labs released more multimodal models, supporting these models the way Ollama intends became more and more challenging.

We set out to build a new engine that makes multimodal models first-class citizens, and to have Ollama's partners contribute more directly to the community through the GGML tensor library.

What does this mean?

To sum it up, this work improves the reliability and accuracy of Ollama's local inference, and sets the foundation for supporting future modalities with more capabilities - i.e. speech, image generation, video generation, longer context sizes, and improved tool support for models.

Let’s break down a couple of specific areas:

Model modularity

Our goal is to confine each model’s “blast radius” to itself—improving reliability and making it easier for creators and developers to integrate new models.

Today, ggml/llama.cpp offers first-class support for text-only models. For multimodal systems, however, the text decoder and vision encoder are split into separate models and executed independently. Passing image embeddings from the vision model into the text model therefore demands model-specific logic in the orchestration layer that can break specific model implementations.

Within Ollama, each model is fully self-contained and can expose its own projection layer, aligned with how that model was trained. This isolation lets model creators implement and ship their code without patching multiple files or adding cascading if statements. They no longer need to understand a shared multimodal projection function or worry about breaking other models—they can focus solely on their own model and its training.
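As a minimal illustrative sketch (the interface and names below are hypothetical Go types, not Ollama's actual code), a self-contained multimodal model might bundle its own vision encoding and projection alongside its text decoding, so the engine never needs model-specific branches:

package model

// Hypothetical interface sketch: each multimodal model ships its own
// vision encoding and projection, matching how that model was trained.
type MultimodalModel interface {
    // EncodeImage turns raw image bytes into vision-encoder features.
    EncodeImage(img []byte) ([]float32, error)

    // Project maps vision features into the text decoder's embedding
    // space using the projection layer this model was trained with.
    Project(features []float32) ([]float32, error)

    // Decode runs the text decoder over text tokens plus any projected
    // image embeddings.
    Decode(tokens []int32, imageEmbeds [][]float32) (string, error)
}

// A new model implements the interface in its own package; the engine
// never needs cascading if-statements per architecture.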

Examples of how some models are implemented are available on Ollama’s GitHub repository.

Accuracy

Large images produce a large number of tokens, which may exceed the batch size. Processing this correctly with the right positional information is challenging, especially when a single image crosses batch boundaries.

Ollama adds metadata as it processes images to help improve accuracy.

Some examples:

  • Should causal attention be on / off?

  • Can the image embeddings be split into batches for processing, and if so, where are the acceptable boundaries, accounting for output quality and the machine being used for inference? If an image is split in the wrong place, output quality degrades. The valid boundaries are usually defined by the model and can be checked in its paper (see the batching sketch below).

Many other local inference tools implement this differently; while a similar result may be achieved, it does not follow how the models were designed and trained.
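To make the batching concern concrete, here is a rough sketch (hypothetical code; the real boundary rule is model-specific and not defined by this function) of splitting image-patch embeddings into batches that only cut on row boundaries while keeping absolute positions intact:

package main

import "fmt"

// batch records the absolute starting position of its patches so
// positional information stays correct across batches.
type batch struct {
    startPos int         // absolute position of the first patch
    patches  [][]float32 // embeddings for this batch
}

// splitIntoBatches splits a run of image-patch embeddings into batches no
// larger than batchSize, cutting only on row boundaries (rowLen patches
// per image row), since cutting mid-row is the kind of "wrong place"
// that degrades output quality.
func splitIntoBatches(patches [][]float32, startPos, batchSize, rowLen int) []batch {
    // Largest multiple of rowLen that still fits in one batch.
    step := (batchSize / rowLen) * rowLen
    if step == 0 {
        step = rowLen // a single row never gets split in this sketch
    }
    var out []batch
    for i := 0; i < len(patches); i += step {
        end := i + step
        if end > len(patches) {
            end = len(patches)
        }
        out = append(out, batch{startPos: startPos + i, patches: patches[i:end]})
    }
    return out
}

func main() {
    // 3 rows of 4 patches each, batch size 6 -> cuts land on row boundaries.
    patches := make([][]float32, 12)
    for _, b := range splitIntoBatches(patches, 100, 6, 4) {
        fmt.Println("batch at position", b.startPos, "with", len(b.patches), "patches")
    }
}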

Memory management

Image caching

Once an image is processed, Ollama caches it so later prompts that include the same image are faster; the image remains in the cache while it is still in use and is not discarded by memory-cleanup limits.
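A minimal sketch of the idea, assuming a cache keyed by an image hash with reference counting (hypothetical code, not Ollama's implementation):

package imagecache

import (
    "crypto/sha256"
    "sync"
)

// imageCache caches processed images so a follow-up prompt that reuses
// the same image skips re-encoding.
type imageCache struct {
    mu      sync.Mutex
    entries map[[32]byte]*cacheEntry
}

type cacheEntry struct {
    embeds [][]float32 // projected image embeddings
    inUse  int         // reference count; entries in use are never evicted
}

func newImageCache() *imageCache {
    return &imageCache{entries: map[[32]byte]*cacheEntry{}}
}

// getOrEncode returns cached embeddings for the image, encoding only on a miss.
func (c *imageCache) getOrEncode(img []byte, encode func([]byte) [][]float32) [][]float32 {
    key := sha256.Sum256(img)
    c.mu.Lock()
    defer c.mu.Unlock()
    if e, ok := c.entries[key]; ok {
        e.inUse++
        return e.embeds
    }
    e := &cacheEntry{embeds: encode(img), inUse: 1}
    c.entries[key] = e
    return e.embeds
}

// release marks the image as no longer referenced by an active prompt,
// so a later memory-cleanup pass may evict it.
func (c *imageCache) release(img []byte) {
    key := sha256.Sum256(img)
    c.mu.Lock()
    defer c.mu.Unlock()
    if e, ok := c.entries[key]; ok && e.inUse > 0 {
        e.inUse--
    }
}

A follow-up prompt that includes the same image hits the cache and skips the vision encoder entirely, which is why repeated questions about one image come back faster.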

Memory estimation & KV cache optimizations

Ollama collaborates with hardware manufacturers and an operating system partner to make sure the correct hardware metadata is detected, so Ollama can better estimate and optimize memory usage. For many firmware releases, partners validate and test them against Ollama to minimize regressions and to benchmark new features.

Ollama has KV cache optimizations to improve how efficiently memory is used. It configures causal attention at the individual model level instead of configuring it as a group.

Examples:

  • Google DeepMind’s Gemma 3 uses sliding window attention, which Ollama can leverage to allocate only a subset of the model’s context length to those layers and improve performance. Because of this memory efficiency, the model's context length can be increased on the same system, or the remaining memory can be used for higher concurrency (see the sizing sketch below).

  • To support Meta’s Llama 4 Scout and Maverick models, Ollama has implemented chunked attention, attention tuning for longer context sizes, a model-specific 2D rotary embedding, all within a mixture-of-experts type of model.

If a model’s attention mechanism, such as sliding window attention or chunked attention, isn’t fully implemented, the model may still ‘work’. However, because this isn’t how the model was trained, the end user may begin to see erratic or degraded output from the model over time. This becomes especially prominent as the context / sequence grows longer, due to cascading effects.
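To see why the sliding-window case matters for memory, here is a back-of-the-envelope sketch (illustrative numbers and a simplified per-layer formula, not Ollama's actual estimator): a sliding-window layer only needs KV cache for the window, not the full context.

package main

import "fmt"

// kvCacheBytes is a simplified per-layer KV-cache estimate:
// 2 (K and V) * positions * kvHeads * headDim * bytes per element.
func kvCacheBytes(positions, kvHeads, headDim, elemSize int) int {
    return 2 * positions * kvHeads * headDim * elemSize
}

func main() {
    const (
        contextLen = 32768 // requested context length
        window     = 1024  // hypothetical sliding-window size
        kvHeads    = 8     // illustrative head count
        headDim    = 128
        elemSize   = 2 // fp16
    )
    full := kvCacheBytes(contextLen, kvHeads, headDim, elemSize)
    windowed := kvCacheBytes(window, kvHeads, headDim, elemSize)
    fmt.Printf("full-attention layer:   %d MiB\n", full>>20)     // 128 MiB
    fmt.Printf("sliding-window layer:   %d MiB\n", windowed>>20) // 4 MiB
    // The memory saved on windowed layers can go toward a longer context
    // length on the same system, or toward higher concurrency.
}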

What’s next

  • Support longer context sizes
  • Support thinking / reasoning
  • Tool calling with streaming responses
  • Enabling computer use

Acknowledgements

Models

Thank you to the teams working to release openly available and powerful vision models - Google DeepMind, Meta Llama, Alibaba Qwen, Mistral, IBM Granite, and the countless people & labs in the community.

GGML

Thank you to the GGML team for the tensor library that powers Ollama’s inference – accessing GGML directly from Go has given a portable way to design custom inference graphs and tackle harder model architectures not available before in Ollama.

Hardware partners

Thank you to hardware partners for collaborating and helping to improve inference on a large number of devices - NVIDIA, AMD, Qualcomm, Intel, and Microsoft.
