(comments)

Original link: https://news.ycombinator.com/item?id=39718558

Yes, the argument for initially focusing on Ollama rather than llama.cpp is quite reasonable. Dependencies really can get us into trouble, and software is usually easier to understand when we focus on its smaller components. However, acknowledging origins and crediting the foundational work of the original creators is an important part of the scientific method and of the software development community. It is a practice that promotes transparency, accountability, and continuous improvement. It is essential for maintaining trust and respect among collaborators and for keeping our recorded history accurate. In turn, it encourages a healthy ecosystem of knowledge sharing, growth, and innovation.

Original thread
Ollama now supports AMD graphics cards (ollama.com)
585 points by tosh 17 hours ago | 196 comments

It's pretty funny to see this blog post, when I have been running Ollama on my AMD RX 6650 for weeks :D

They have shipped ROCm containers since 0.1.27 (21 days ago). This blog post seems to be published along with the latest release, 0.1.29. I wonder what they actually changed in this release with regards to AMD support.

Also: see this issue[0] that I made where I worked through running Ollama on an AMD card that they don't "officially" support yet. It's just a matter of setting an environment variable.

[0] https://github.com/ollama/ollama/issues/2870
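For anyone wondering which variable: on unsupported RDNA 2 cards the usual workaround is ROCm's HSA_OVERRIDE_GFX_VERSION override. A minimal sketch, assuming an RX 6600/6700-class GPU and a manually started server (the exact value depends on the card, so check the issue above):

    # Pretend the GPU is the officially supported gfx1030 ISA, then start the server
    HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve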

Edit: I did notice one change: the starcoder2[1] model works now. Before, it would crash[2].

[1] https://ollama.com/library/starcoder2

[2] https://github.com/ollama/ollama/issues/2953



While the PRs went in slightly earlier, much of the time was spent on testing the integrations, and working with AMD directly to resolve issues.

There were issues that we resolved prior to cutting the release, and many reported by the community as well.



Thank you for clarifying and thanks for the great work you do!


Maybe they wanted to wait for bug reports for 21 days before publishing a popular blog post about it..?


What kind of performance do you see with an RX 6650?


I mean, it was 21 days ago. What’s the difference?


2 versions, apparently


I'm thrilled to see support for RX 6800/6800 XT / 6900 XT. I bought one of those for an outrageous amount during the post-covid shortage in hopes that I could use it for ML stuff, and thus far it hasn't been very successful, which is a shame because it's a beast of a card!

Many thanks to ollama project and llama.cpp!



Sad to see that the cut off is just after 6700 XT which is what is in my desktop. They indicate more devices are coming, hopefully that includes some of the more modern all in one chips with RDNA 2/3 from AMD as well.


It appears that the cut off lines up with HIP SDK support from AMD, https://rocm.docs.amd.com/projects/install-on-windows/en/lat...


I've been using my 6750XT for more than a year now on all sorts of AI projects. Takes a little research and a few env vars but no need to wait for "official" support most of the time.


I've already been using ollama with my 6700xt just fine, you just have to set some env variable to make ROCm work "unofficially".

The linked page says they will support more soon, so I'm guessing this will just be integrated.



Which env variable did you set, and to what? Currently I'm struggling to set it up.


I'm not sure why Ollama garners so much attention. It has limited value - it's only useful for experimenting with models and it can't serve more than one model at a time. It's not meant for production deployments. Granted, it makes the experimentation process super easy, but for something that relies completely on llama.cpp and whose main value proposition is easy model management, I'm not sure it deserves the brouhaha people are giving it.

Edit: What do you do after the initial experimentation? You eventually need to deploy these models to production. I'm not even talking about giving credit to llama.cpp, just mentioning that this product is gaining disproportionate attention and kudos compared to the value it delivers. Not denying that it's a great product.



> It's not meant for production deployments.

I am probably not the demographics you expect. I don’t do “production” in that sense, but I have ollama running quite often when I am working, as I use it for RAG and as a fancy knowledge extraction engine. It is incredibly useful:

- I can test a lot of models by just pulling them (very useful as progress is very fast),

- using their command line is trivial,

- the fact that it keeps running in the background means that it starts once every few days and stays out of the way,

- it integrates nicely with langchain (and a host of other libraries), which means that it is easy to set up some sophisticated process and abstract away the LLM itself.

> what do you do after the initial experimentation?

I just keep using it. And for now, I keep tweaking my scripts but I expect them to stabilise at some point, because I use these models to do some real work, and this work is not monkeying about with LLMs.

> I'm not even talking about giving credit to llama.cpp, just mentioning that this product is gaining disproportionate attention and kudos compared to the value it delivers.

For me, there is nothing that comes close in terms of integration and convenience. The value it delivers is great, because it enables me to do some useful work without wasting time worrying about lower-level architecture details. Again, I am probably not in the demographics you have in mind (I am not a CS person and my programming is usually limited to HPC), but ollama is very useful to me. Its reputation is completely deserved, as far as I am concerned.



> I use it for RAG and as a fancy knowledge extraction engine

Curious, can you share more details about your usecase?



In my opinion, pre-built binaries and an easy-to-use front-end are things that should exist and are valid as a separate project unto themselves (see, e.g., HandBrake vs ffmpeg).

Using the name of the authors or the project you're building on can also read like an endorsement, which is not _necessarily_ desirable for the original authors (it can lead to ollama bugs being reported against llama.cpp instead of to the ollama devs and other forms of support request toil). Consider the third clause of BSD 3-Clause for an example used in other projects (although llama.cpp is licensed under MIT).



> Granted that it makes the experimentation process super easy

That's the answer to your question. It may have less space than a Zune, but the average person doesn't care about technically superior alternatives that are much harder to use.



*Nomad

And lame.



The answer to your question is:

ollama run mixtral

That's it. You're running a local LLM. I have no clue how to run llama.cpp

I got Stable Diffusion running and I wish there was something like ollama for it. It was painful.



On a mac, https://drawthings.ai is the ollama of Stable Diffusion.


For me, ComfyUI made the process of installing and playing with SD about as simple as a Windows installer.


The README is pretty clear, albeit it covers a lot of optional steps you don't need; it's essentially gonna be something like:

    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    make
    wget -O mixtral-8x7b-v0.1.Q4_K_M.gguf https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/resolve/main/mixtral-8x7b-v0.1.Q4_K_M.gguf?download=true
    ./main -m ./mixtral-8x7b-v0.1.Q4_K_M.gguf -n 128
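One caveat, also flagged further down the thread: a plain make like the above produces a CPU-only binary. At the time, GPU backends were enabled with build flags roughly like the following (flag names are from that era of llama.cpp and may have changed since):

    # NVIDIA build (requires the CUDA toolkit)
    make LLAMA_CUBLAS=1

    # AMD build (requires ROCm/HIP)
    make LLAMA_HIPBLAS=1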


This shows the value ollama provides

I only need to know the model name and then run a single command



The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).

On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.



>Hacker News


It should be fairly obvious that one can find alternative models and use them in the above command too.

Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.

Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.

More programmers should be looking at llama.cpp language bindings than at Ollama’s implementation of the openAI api.



There are 5 commands in that README two comments up, 4 can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet and figuring out which dependencies are a problem today. And that is all assuming someone is comfortable with compiled languages. I'd hazard most devs these days are from JS land and don't know how to debug make.

Finding the correct model weights is also a challenge in my experience, there are a lot of alternatives and it is often difficult to figure out what the differences are and whether they matter.

The README is clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works first time but that is the exception not the rule.



I'd rather focus on building on top of LLMs than going lower level

Ollama makes that super easy. I tried llama.cpp first and hit build issues. Ollama worked out of the box



Sure.

Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
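For a concrete sense of the HTTP-API side, this is roughly what talking to a local Ollama server looks like (a sketch; model name and prompt are placeholders, and the response streams back as JSON lines by default):

    # Ollama listens on port 11434 by default
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral",
      "prompt": "Why is the sky blue?"
    }'

The llama.cpp bindings instead hand you the model, context and sampler objects directly, which is where the extra expressiveness comes from.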



I'm aware, I don't need that amount of sophistication yet.

Python seems to be the way to go deeper though. Is there a good reason I should be aware of to pick llama.cpp over python?



Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working - both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I’m not actually up to speed on the current state of the game there but my understanding is that llama.cpp’s less generic approach has allowed it to focus on specifically optimizing performance of llama-style LLMs.


I've seen more of the model fiddling, like logits restrictions and layer dropping, implemented in python, which is why I ask

Most of AI has centralized around Python, I see more of my code moving that way, like how I'm using LlamaIndex as my primary interface now, which supports ollama and many more model loaders / APIs



And what will you do after trying it? Sure, you saved a few mins in trying out a model or models. What next?


I focus on building the application rather than figuring out someone else's preferred method for how I should work?

I use Docker Compose locally, Kubernetes in the cloud

I run in hot-reload locally, I build for production

I often nuke my database locally, but I run it HA in production

It is very rare to use the same technology locally (or the same way) as in production



Relax. Not everything in this world was built exactly for you. You almost seem to have a problem with this.


For us this may seem like a walk in the park.

For non-technical people there is a possibility their OS doesn't have git, wget, or a C++ compiler (especially on Windows).

This is just like the Dropbox case years ago.



Last time I tried llama.cpp I got errors when running make that were way too time consuming to bother tracking down.

It's probably a simple build if everything is how it wants it, but it wasn't on my machine, while running ollama was.



This will likely build a version without GPU acceleration, I think?


Compared to “ollama pull mixtral”? And then actually using the thing is easier as well.


Check out EasyDiffusion.


Even for just running a model locally, Ollama provided a much simpler "one click install" earlier than most tools. That in itself is worth the support.


Koboldcpp is also very, very good: plug and play, very complete web UI, nice little API with SSE text streaming, Vulkan accelerated, has an AMD fork...


For me all the projects which enable running & fine-tuning LLMs locally, like llama.cpp, ollama, open-webui, unsloth etc., play a very important part in democratizing AI.

> what do you do after the initial experimentation? you need to deploy these models eventually to production

I built GaitAnalyzer[1], to analyze my gait on my laptop; I had deployed it briefly in production when I had enough credits to foot the AWS GPU bills. Ollama made it very simple to deploy the application; anyone who has used docker before can now run GaitAnalyzer on their computer.

[1] https://github.com/abishekmuthian/gaitanalyzer



As it turns out, making it faster and better to manage things tends to get people’s attention. I think it’s well deserved.


Ollama is the Docker of LLMs. Ollama made it _very_ easy to run LLMs locally. This is surprisingly not as easy as it seems, and incredibly useful.


Interestingly, Ollama is not popular at all in the "localllama" community (which also extends to related discords and repos).

And I think that's because of capabilities... Ollama is somewhat restrictive compared to other frontends. I have a litany of reasons I personally wouldn't run it over exui or koboldcpp, both for performance and output quality.

This is a necessity of being stable and one-click though.



I mean, it takes something difficult like an LLM and makes it easy to run. It's bound to get attention. If you've tried to get other models like BERT based models to run you'll realize just how big the usability gains are running ollama than anything else in the space.

If the question you're asking is why so many folks are focused on experimentation instead of productionizing these models, then I see where you're coming from. There's the question of how much LLMs are actually being used in prod scenarios right now as opposed to just excited people chucking things at them; that maybe LLMs are more just fun playthings than tools for production. But in my experience as HN has gotten bigger, the number of posters talking about productionizing anything has really gone down. I suspect the userbase has become more broadly "interested in software" rather than "ships production facing code" and the enthusiasm in these comments reflects those interests.

FWIW we use some LLMs in production and we do not use ollama at all. Our prod story is very different than what folks are talking about here and I'd love to have a thread that focuses more on language model prod deployments.



Well you would be one of the few hundred people on the planet doing that. With local LLMs we're just trying to create a way for everyone else to use AI that doesn't require sharing all their data with them. First thing everyone asks for of course is how to turn the open source local llms into their own online service.


Ollama's purpose and usefulness is clear. I don't think anyone is disputing that nor the large usability gains ollama has driven. At least I'm not.

As far as being one of the few hundred on the planet, well yeah that's why I'm on HN. There's tons of publications and subreddits and fora for generic tech conversation. I come here because I want to talk about the unknowns.



> I come here because I want to talk about the unknowns.

Your knowns are unknowns to some people and vice versa. This is a great strength of HN; on a whole lot of subjects you'll find people ranging from enthusiastic to expert. There are probably subreddits or discord servers tailored to narrow niches and that's cool, but HN is not that. They are complementary, if anything. In contrast, HN is much more interesting and with a much better S/N ratio than generic tech subreddits, it's not even comparable.



I've been using this site since 2008. This is my second account from 2009. HN very much used to be a tech niche site. I realize that for newer users to HN, the appeal is like r/programming or r/technology but with a more text oriented interface or higher SNR or whatever but this is a shift in audience and there are folks like I on this site who still want to use it for niche content.

There are still threads where people do discuss gory details, even if the topics aren't technical. A lot of the mapping articles on the site bring out folks with deep knowledge about mapping stacks. Alternate energy threads do it too. It can be like that for LLMs also, but the user base has to want this site to be more than just Yet Another Tech News Aggregator thread.

For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly. The LLM threads bring me here because experts like jart chime in.



> I've been using this site since 2008. This is my second account from 2009.

I did not really like HN back in the day because it felt too startup-y, but maybe I got a wrong impression. I much preferred Ars Technica and their forum (now, Ars is much less compelling).

> For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly.

I think it depends on the stories. Different subjects have different demographics, and I have almost completely stopped reading physics stuff because it is way too Reddit-like (and full of confident people asserting embarrassingly wrong facts). I can see how you could feel about fields closer to your interests being dumbed down by over-confident non-specialists.

There are still good, highly technical discussions, but it is true that the home page is a bit limited and inefficient to find them.



Few hundred on the planet? Are you kidding me? We're asking enterprises to run LLMs on-premise (I'm intentionally discounting the cloud scenario where the traffic rates are much higher). That's way more than a hundred and sorry to break it to you that Ollama is just not going to cut it.


No need to be angry about this. Tech folks should be discussing this collectively and collaboratively. There's space for everything from local models running on smartphones all the way up to OpenAI style industrialized models. Back when social networks were first coming out, I used to read lots of comments about deploying and running distributed systems. I remember reading early incident reports about hotspotting and consistent hashing and TTL problems and such. We need to foster more of that kind of conversation for LMs. Sadly right now Xitter seems to be the best place for that.


Not angry. Having a discussion :-). It just amazes me how the HN crowd is more than happy with just trying out a model on their machine and calling it a day, without seeing the real picture ahead. Let's ignore perf concerns for a moment. Let's say I want to run it on a shared server in the enterprise network so that any application can make use of it. Each application might want to use a model of its choosing. Ollama will unload/load/unload models as each new request arrives. Not sure if folks here are realizing this :-)


> Not sure if folks here are realizing this :-)

I’m not sure you’re capable of understanding that your needs and requirements are just that, yours.



I am 100% uninterested in your production deployment of rent seeking behavior for tools and models I can run myself. Ollama empowers me to do more of that easier. That’s why it’s popular.


OP’s point is more that Ollama isn’t what’s doing the empowering. Llama.cpp is.


You are not making any sense. I am running ollama and Open WebUI (which takes care of auth) in production.


It's nice for personal use which is what I think it was built for, has some nice frontend options too. The tooling around it is nice, and there are projects building in rag etc. I don't think people are intending to deploy days services through these tools


Sounds like you are dismissing Ollama as a "toy".

Refer:

https://paulgraham.com/startupideas.html



More than one is easy: put it behind a load balancer. Put each ollama instance in its own container, on its own port.
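A rough sketch of that setup with the official Docker image (container names, volume names and the second host port are arbitrary; a reverse proxy such as nginx would still be needed in front):

    # Two independent Ollama instances, each mapped to its own host port
    docker run -d --gpus=all -v ollama-a:/root/.ollama -p 11434:11434 --name ollama-a ollama/ollama
    docker run -d --gpus=all -v ollama-b:/root/.ollama -p 11435:11434 --name ollama-b ollama/ollama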


FWIW Ollama has no concurrency support even though llama.cpp's server component (the thing that Ollama actually uses) supports it. Besides, you can't have more than 1 model running. Unloading and loading models is not free. Again, there's a lot more and really much of the real optimization work is not in Ollama; it's in llama.cpp which is completely ignored in this equation.


Thanks! Great to know. I did not know llama.cpp could do this. It should be pretty straightforward to support, not sure why they would not do it.


I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:

https://devboard.gitsense.com/ggerganov?r=ggerganov%2Fllama....

https://devboard.gitsense.com/ollama?r=ollama%2Follama&nb=tr...

The number of people engaging with ollama is twice that of llama.cpp. And there hasn't been a dip in people engaging with Ollama in the past 6 months. However, what I do find interesting with regards to these two projects is the number of merged pull requests. If you click on the "Groups" tab and look at "Hooray", you can see llama.cpp had 72 contributors with one or more merged pull requests vs 25 for Ollama.

For Ollama, people are certainly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is double that of Ollama.

I know llama.cpp is VC funded and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard stuff with Ollama reaping all the benefits.

Full Disclosure: The tool that I used is mine.



That is still one model per instance of Ollama, right?


yes, not sure you can do better than that. You still cannot have one instance of an LLM in (GPU) memory answer two queries at a time.


Of course, you can support concurrent requests. But Ollama doesn't support it and it's not meant for this purpose and that's perfectly ok. That's not the point though. For fast/perf scenarios, you're better off with vllm.


Thanks! This is great to know.


This is great news. The more projects do this, the less of a moat CUDA is, and the less of a competitive advantage Nvidia has.


What does performance look like?


Spoiler alert: not good enough to break CUDA's moat


This is not CUDA's moat. That is on the R&D/training side.

Inference side is partly about performance, but mostly about cost per token.

And given that there has been a ton of standardization around LLaMA architectures, AMD/ROCm can target this much more easily, and still take a nice chunk of the inference market for non-SOTA models.



Not sure why you're downvoted, but as far as I've heard AMD cards can't beat 4090 - yet.

Still, I think AMD will catch or overtake NVidia in hardware soon, but software is a bigger problem. Hopefully the opensource strategy will pay off for them.



Really hope so, maybe this time it will catch on and last

Usually when corps open source stuff to get adoption, they stuff the adopters after they gain enough market share and the cycle repeats again



A RTX 4090 is about twice the price of and 50%-ish faster than AMD's most expensive consumer card so I'm not sure anyone really expects it to ever surpass a 4090.

A 7900 XTX beating a RTX 4080 at inference is probably a more realistic goal though I'm not sure how they compare right now.



The 4080 is $1k for 16gb of VRAM, and the 7900 is $1k for 24gb of VRAM. Unless you're constantly hammering it with requests, the extra speed you may get with CUDA on a 4080 is basically irrelevant when you can run much better models at a reasonable speed.


I get 35tps on Mistral:7b-Instruct-Q6_K with my 6650 XT.


Hey I did, and sorry for the self promo,

Please check out https://github.com/geniusrise - tool for running llms and other stuff, behaves like docker compose, works with whatever is supported by underlying engines:

Huggingface - MPS, cuda
VLLM - cuda, ROCm
llama.cpp, whisper.cpp - cuda, mps, rocm

Also coming up: integration with Spark (TorchDistributor), Kafka and Airflow.



Just downloaded this and gave it a go. I have no experience with running any local models, but this just worked out of the box on my 7600 on Ubuntu 22. This is fantastic.


Feels like all of this local LLM stuff is definitely pushing people in the direction of getting new hardware, since nothing like RX 570/580 or other older cards sees support.

On one hand, the hardware nowadays is better and more powerful, but on the other, the initial version of CUDA came out in 2007 and ROCm in 2016. You'd think that compute on GPUs wouldn't require the latest cards.



The Ollama backend llama.cpp definitely supports those older cards with the OpenCL and Vulkan backends, though performance is worse than ROCm or CUDA. In their Vulkan thread for instance I see people getting it working with Polaris and even Hawaii cards.

https://github.com/ggerganov/llama.cpp/pull/2059

Personally I just run it on CPU and several tokens/s is good enough for my purposes.



No new hardware needed. I was shocked that Mixtral runs well on my laptop, which has a so-so mobile GPU. Mixtral isn't hugely fast, but definitely good enough!


I'm a happy user of Mistral on my Mac Air M1.


How many gbs of RAM do you have in your M1 machine?


8gb


Thanks, wow, amazing that you can already run a small model with so little ram. I need to buy a new laptop, guess more than 16 gb on a macbook isn't really needed


I use several LLM models locally for chat UIs and IDE autocompletions like copilot (continue.dev).

Between Teams, Chrome, VS Code, Outlook, and now LLMs my RAM usage sits around 20-22GB. 16GB will be a bottleneck to utility.



I would advise getting as much RAM as you possibly can. You can't upgrade later, so get as much as you can afford.

Mine is 64GB, and my memory pressure goes into the red when running a quantized 70B model with a dozen Chrome tabs open.



I've run LLMs and some of the various image models on my M1 Studio 32GB without issue. Not as fast as my old 3080 card, but considering the Mac all in has about a 5th the power draw, it's a lot closer than I expected. I'm not sure of the exact details but there is clearly some secret sauce that allows it to leverage the onboard NN hardware.


Mistral is _very_ small when quantized.

I’d still go with 16gbs



Is it easy to set this up?


It is, it doesn't require any setup.

After installation:

> ollama run mistral:latest



Super easy. You can just head down to https://lmstudio.ai and pick up an app that lets you play around. It's not particularly advanced, but it works pretty well.

It's mostly optimized for M-series silicon, but it also technically works on Windows, and isn't too difficult to trick into working on Linux either.



Also, https://jan.ai is open source and worth trying out too.


Looks super cool, though it seems to be missing a good chunk of features, like the ability to change the prompt format. (Just installed it myself to check out all the options.) All the other missing stuff I can see though is stuff that LM Studio doesn't have either (such as a notebook mode). If it has a good chat mode then that's good enough for most!


The compatibility matrix is quite complex for both AMD and NVIDIA graphics cards, and completely agree: there is a lot of work to do, but the hope is to gracefully fall back to older cards.. they still speed up inference quite a bit when they do work!


My 1080ti runs ok even this 47B model: https://ollama.com/library/dolphin-mixtral


llama.cpp added first class support for the RX 580 by implementing the Vulkan backend. There are some issues in older kernel amdgpu code where an LLM process's VRAM is never reloaded if it gets kicked out to GTT (in 5.x kernels), but overall it's much faster than the clBLAST OpenCL implementation.


Here is the commit that added ROCm support to llama.cpp back in August:

https://github.com/ggerganov/llama.cpp/commit/6bbc598a632560...



Yep, and it deserves the credit! He who writes the cuda kernel (or translates it) controls the spice.

I had wrapped this and had it working in Ollama months ago as well: https://github.com/ollama/ollama/pull/814. I don't use Ollama anymore, but I really like the way they handle device memory allocation dynamically, I think they were the first to do this well.



I'm curious about both:

- what's special about the memory allocation, and how might it help me?

- what are you now using instead of ollama?



Ollama does a nice job of looking at how much VRAM the card has and tuning the number of gpu layers offloaded. Before that, I mainly just had to guess. It's still a heuristic, but I thought that was neat.

I'm mainly just using llama.cpp as a native library now, mainly for the direct access to more of llama's data structures, and because I have a sort of unique sampler setup.



I’m curious as to how they pulled this off. OpenCL isn’t that common in the wild relative to Cuda. Hopefully it can become robust and widespread soon enough. I personally succumbed to the pressure and spent a relative fortune on a 4090 but wish I had some choice in the matter.


I'm surprised they didn't speak about the implementation at all. Anyone got more intel?




Another giveaway that it's ROCm is that it doesn't support the 5700 series...

I'm really salty because I "upgraded" to a 5700XT from a Nvidia GTX 1070 and can't do AI on the GPU anymore, purely because the software is unsupported.

But, as a dev, I suppose I should feel some empathy that there's probably some really difficult problem causing 5700XT to be unsupported by ROCm.



I wrote a bunch of openmp code on a 5700XT a couple of years ago, if you're building from source it'll probably run fine


They're open source and based on llama.cpp so nothing's secret.

My money, looking at nothing, would be on one of the two Vulkan backends added in Jan/Feb.

I continue to be flummoxed by a mostly-programmer-forum treating ollama like a magical new commercial entity breaking new ground.

It's a CLI wrapper around llama.cpp so you don't have to figure out how to compile it



I tried it recently and couldn't figure out why it existed. It's just a very feature limited app that doesn't require you to know anything or be able to read a model card to "do AI".

And that more or less answered it.



It’s because most devs nowadays are new devs and probably aren’t very familiar with native compilation.

So compiling the correct version of llama.cpp for their hardware is confusing.

Compound that with everyone’s relative inexperience with configuring any given model and you have prime grounds for a simple tool to exist.

That’s what ollama and their Modelfiles accomplish.



It's just because it's convenient. I wrote a rich text editor front end for llama.cpp and I originally wrote a quick go web server with streaming using the go bindings, but now I just use ollama because it's just simpler and the workflow for pulling down models with their registry and packaging new ones in containers is simpler. Also most people who want to play around with local models aren't developers at all.


Eh, I've been building native code for decades and hit quite a few roadblocks trying to get llama.cpp building with cuda support on my Ubuntu box. Library version issues and such. Ended up down a rabbit hole related to codenames for the various Nvidia architectures... It's a project on hold for now.

Weirdly, the Python bindings built without issue with pip.



Edited it out of my original comment because I didn't want to seem ranty/angry/like I have some personal vendetta, as opposed to just being extremely puzzled, but it legit took me months to realize it wasn't a GUI because of how it's discussed on HN, i.e. as key to democratizing, as a large, unique, entity, etc.

Hadn't thought about it recently. After seeing it again here, and being gobsmacked by the # of genuine, earnest, comments assuming there's extensive independent development of large pieces going on in it, I'm going with:

- "The puzzled feeling you have is simply because llama.cpp is a challenge on the best of days, you need to know a lot to get to fully accelerated on ye average MacBook. and technical users don't want a GUI for an LLM, they want a way to call an API, so that's why there isn't content extalling the virtues of GPT4All*. So TL;DR you're old and have been on computer too much :P"

but I legit don't know and still can't figure it out.

* picked them because they're the most recent example of a genuinely democratizing tool that goes far beyond llama.cpp and also makes large contributions back to llama.cpp, ex. GPT4All landed 1 of the 2 vulkan backends



Ahhhh, I see what you did there.


It would serve Nvidia right if their insistence on only running CUDA workloads on their hardware results in adoption of ROCm/OpenCL.


You can use OpenCL just fine on Nvidia, but CUDA is just a superior compute programming model overall (both in features and design.) Pretty much every vendor offers something superior to OpenCL (HIP, OneAPI, etc), because it simply isn't very nice to use.


I suppose that's about right. The implementors are busy building on a path to profit and much less concerned about any sort-of lock-in or open standards--that comes much later in the cycle.


OpenCL is fine on Nvidia Hardware. Of course it's a second class citizen next to CUDA, but then again everything is a second class citizen on AMD hardware.


Maybe Vulkan compute? But yeah, interesting how.


OpenCL is as dead as OpenGL and the inference implementations that exist are very unperformant. The only real options are CUDA, ROCm, Vulkan and CPU. And Vulkan is a proper pain too, takes forever to build compute shaders and has to do so for each model. It only makes sense on Intel Arc since there's nothing else there.


SYCL is a fairly direct successor to the OpenCL model and is not quite dead, Intel seems to be betting on it more than others.


ROCm includes OpenCL. And it's a very performant OpenCL implementation.


why though? except for apple, most vendors still actively support it and newer versions of OpenCL are released…


Apple killed off OpenCL for their platforms when they created Metal which was disappointing. Sounds like ROCm will keep it alive but the fragmentation sucks. Gotta support CUDA, OpenCL, and Metal now to be cross-platform.


What is OpenCL? AMD GPUs support CUDA. It's called HIP. You just need a bunch of #define statements like this:

    #ifndef __HIP__
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #else
    #include <hip/hip_runtime.h>
    #include <hipblas/hipblas.h>
    #define cudaSuccess hipSuccess
    #define cudaStream_t hipStream_t
    #define cudaGetLastError hipGetLastError
    #endif
Then your CUDA code works on AMD.


Can you explain why nobody knows this trick, for some values of “nobody”?


No idea. My best guess is their background is in graphics and games rather than machine learning. When CUDA is all you've ever known, you try just a little harder to find a way to keep using it elsewhere.


People know; it just hasn't been reliable.


What's not reliable about it? On Linux hipcc is about as easy to use as gcc. On Windows it's a little janky because hipcc is a perl script and there's no perl interpreter I'll admit. I'm otherwise happy with it though. It'd be nice if they had a shell script installer like NVIDIA, so I could use an OS that isn't a 2 year old Ubuntu. I own 2 XTX cards but I'm actually switching back to NVIDIA on my main workstation for that reason alone. GPUs shouldn't be choosing winners in the OS world. The lack of a profiler is also a source of frustration. I think the smart thing to do is to develop on NVIDIA and then distribute to AMD. I hope things change though and I plan to continue doing everything I can do to support AMD since I badly want to see more balance in this space.
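For illustration, building a HIP source file really does look like a plain gcc invocation; the file name here is made up, and --offload-arch just selects the GPU ISA (gfx1100 being the 7900 XTX):

    # Compile HIP C++ for a 7900 XTX; swap the arch for other cards
    hipcc --offload-arch=gfx1100 -O2 -o vector_add vector_add.cpp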


The compilation toolchain may be reliable but then you get kernel panics at runtime.


I've heard geohot is upset about that. I haven't tortured any of my AMD cards enough to run into that issue yet. Do you know how to make it happen?


OpenCL is a Khronos open spec for GPU compute, and what you’d use on Apple platforms before Metal compute shaders and CoreML were released. If you wanted to run early ML models on Apple hardware, it was an option. There was an OpenCL backend for torch, for example.


Given the price of top line NVidia cards, if they can be had at all, there's got to be a lot of effort going on behind the scenes to improve AMD support in various places.


What are the barriers to doing so?


Does anyone know how the AMD consumer GPU support on Linux has been implemented? Must use something else than ROCm I assume? Because ROCm only supports the 7900 XTX on Linux[1], while on Windows[2] support is from RX 6600 and upwards.

[1]: https://rocblas.readthedocs.io/en/rocm-6.0.0/about/compatibi...

[2]: https://rocblas.readthedocs.io/en/rocm-6.0.0/about/compatibi...



The newest release, 6.0.2, supports a number of other cards[1] and in general people are able to get a lot more cards to work than are officially supported. My 7900 XT worked on 6.0.0 for instance.

[1]https://rocm.docs.amd.com/projects/install-on-linux/en/lates...



I wouldn't read too much into "support". It's more about business/warranty/promises than about what the hardware can actually do.

I've had a 6900XT since launch and this is the first I'm hearing "unsupported", having played with ROCM plenty over the years with Fedora Linux.

I think, at most, it's taken a couple key environment variables



How hard would it be for AMD just to document the levels of support of different cards the way NVIDIA does with their "compute capability" numbers ?!

I'm not sure what is worse from AMD - the ML software support they provide for their cards, or the utterly crap documentation.

How about one page documenting AMD's software stack compared to NVIDIA, one page documenting what ML frameworks support AMD cards, and another documenting "compute capability" type numbers to define the capabilities of different cards.



And almost looks like they're deliberately trying to not win any market share.

It's as if the CEO is mates with NVidias CEO and has an unwritten agreement not to try too hard to topple the applecart...

Oh wait... They're cousins!



Can anyone (maybe ollama contributors) explain to me the relationship between llama.cpp and ollama?

I always thought that ollama basically just was a wrapper (i.e. not much changes to inference code, and only built on top) around llama.cpp, but this makes it seem like it is more than that?





I heard "Nvidia for LLM today is similar to how Sun Microsystems was for the web"


... for a very brief period of time until Linux servers and other options caught up.


An OS was possibly much more complicated to write at that time than CUDA is to write today. And the competition is too strong. It might be even briefer than it was for Sun.


Hm, fooocus manages to run, but for Ollama I get:

>time=2024-03-16T00:11:07.993+01:00 level=WARN source=amd_linux.go:50 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
>time=2024-03-16T00:11:07.993+01:00 level=INFO source=amd_linux.go:85 msg="detected amdgpu versions [gfx1031]"
>time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:339 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#man..."
>time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:96 msg="unable to verify rocm library, will use cpu: no suitable rocm found, falling back to CPU"
>time=2024-03-16T00:11:07.996+01:00 level=INFO source=routes.go:1105 msg="no GPU detected"

Need to check how to install rocm on arch again... have done it once, a few moons back, but alas...



Ah, this is probably from missing ROCm libraries. The dynamic libraries are available as one of the release assets (warning: it's about 4GB expanded) https://github.com/ollama/ollama/releases/tag/v0.1.29 – dropping them in the same directory as the `ollama` binary should work.


Ollama runs really, really slow on my MBP for Mistral - as in just a few tokens a second and it takes a long while before it starts giving a result. Anyone else run into this?

I've seen that it seems to be related to the amount of system memory available when ollama is started (??) however LM Studio does not have such issues.



I wish AMD did well in the Stable Diffusion front because AMD is never greedy on VRAM. The 4060Ti 16GB(minimum required for Stable Diffusion in 2024) starts at $450.

AMD with ROCm is decent on Linux but pretty bad on Windows.



I run A1111, ComfyUI and kohya-ss on an AMD (6900XT which has 16GB, the minimum required for Stable Diffusion in 2024 ;)), though on Linux. Is it a Windows specific Issue for you?

Edit to add: Though apparently I still don't run ollama on AMD since it seems to disagree with my setup.



You don't need 16GB, literally the majority of people don't have that and use 8GB and up, especially with Forge.


They bump up VRAM because they can't compete on raw compute.


Or rather Nvidia is purposefully restricting VRAM to avoid gaming cards canibalizing their supremely profitable professional/server cards. AMD has no relevant server cards, so they have no reason to hold back on VRAM in consumer cards


Nvidia released consumer RTX 3090 with 24GB VRAM in Sep 2020, AMDs flagship release in that same month was 6900 XT with 16GB VRAM. Who is being restrictive here exactly?


it doesn't matter how much compute you have if you don't have enough vram to run the model.


Exactly. My friend was telling me that I was making a mistake for getting a 7900 XTX to run language models, when the fact of the matter is the cheapest NVIDIA card with 24 GB of VRAM is over 50% more expensive than the 7900 XTX. Running a high quality model at like 80 tps is way more important to me than running a way lower quality model at like 120 tps.


They lag on software a lot more than they lag on silicon.


Does this work with integrated Radeon graphics? If so it might be worth getting one of those Ryzen mini PCs to use as a local LLM server.




Curious to see if this will work on APUs. Have a 7840HS to test, will give it a go ASAP.


If anyone wants to run some benchmarks on MI300x, ping me.


I wonder why they aren't supporting RX 6750 XT and lower yet, are there architectural differences between these and RX 6800+?


They don't support it, but it works if you set an environment variable.

https://github.com/ollama/ollama/issues/2870#issuecomment-19...



Those are Navi 22/23/24 GPUs while the RX 6800+ GPUs are Navi 21. They have different ISAs... however, the ISAs are identical in all but name.

LLVM has recently introduced a unified ISA for all RDNA 2 GPUs (gfx10.3-generic), so the need for the environment variable workaround mentioned in the other comment should eventually disappear.



Is there an equivalent to ollama or gpt4all for Android? I'd like to host my model somewhere and talk to it via app.


At that point, it sounds like you're after an API endpoint for any model. Lots of solutions out there, depends more on your hosting.


Thanks, Ollama.

(Sorry, could not resist.)



Any info on the performance? How does it compare to 4080/4090?


No support for my rx 5700xt :(


Does it also support AMD APUs?


Wow, that's a huge feature. Thank you, guys. By the way, does anyone have a preferred case where they can put 4 AMD 7900XTX? There's a lot of motherboards and CPUs that support 128 lanes. It's the physical arrangement that I have trouble with.


You don't need 128 lanes. 8x PCIe 3 is more than enough, so for 4 cards that's 32. Most CPUs have about 40 lanes. If you are not doing much, that would be more than sufficient. Buy a PCIe riser. Go to Amazon and search for a 16x to 16x PCIe riser. They go for about $25-$30 and are often about 20-30cm long. If you want a really long one, you can get a 60cm one from China for about the same price, you just have to wait for 3 weeks. That's what I did. Stuffing all those in a case is often difficult, so you have to go open rig. Either have the cables running out of your computer and figure out a way to support the cards while keeping them cool, or just buy a small $20-$30 open rig frame.


Most CPUs have about 20 lanes (plus a link to the chipset).

On the one hand, they will be gen 4 or 5, so they're the equivalent of 40-80 gen 3 lanes.

On the other hand, you can only split them up if you have a motherboard that supports bifurcation. If you buy the wrong model, you're stuck dedicating the equivalent of 64 gen 3 lanes to a single card.

Edit: Actually, looking into it further, current Intel desktop processors will only run their lanes as 16(+4) or 8+8(+4). You can kind of make 4 cards work by using chipset-fed slots, but that sucks. You could also get a PCIe switch but those are very expensive. AMD will do 4+4+4+4(+4) on the right boards.



Ah ha, that's the part I was curious about. I was wondering if I could keep everything cool open rig. I'm waiting for the stuff to arrive: risers, board, CPU, GPUs. And I've been putting it off because I wasn't sure how about the case. All right then, open rig frame. Thank you!


Used crypto mining parts not available any more?


Crypto mining didn't require significant bandwidth to the card. Mining-oriented motherboards typically only provisioned a single lane of PCIe to each card, and often used anemic host CPUs (like Celeron embedded parts).


Does LLM inference require significant bandwidth to the card? You have to get the model into VRAM, but that's a fixed startup cost, not a per-output-token cost.


Exactly. They'd use PCIe x1 to PCIe x16 risers with power adapters. These require high-bandwidth.


Oh. Shows I wasn't into that.

I did once work with a crypto case, but yes, it was one motherboard with a lot of wifis and we still didn't need the pcie lanes.



There's a thing somewhat conspicuous in its absence - why isn't llama.cpp more directly credited and thanked for providing the base technology powering this tool?

All the other cool "run local" software seems to have the appropriate level of credit. You can find llama.cpp references in the code, being set up in a kind of "as is" fashion such that it might be OK as far as MIT licensing goes, but it seems kind of petty to have no shout out or thank you anywhere in the repository or blog or ollama website.

GPT4All - https://gpt4all.io/index.html

LM Studio - https://lmstudio.ai/

Both of these projects credit and attribute appropriately, Ollama seems to bend over backwards so they don't have to?



ollama has made a lot of nice contributions of their own. It's a good look to give a hat tip to the great work llama.cpp is also doing, but they're strictly speaking not required to do that in their advertising any more than llama.cpp is required to give credit to Google Brain, and I think that's because llama.cpp has pulled off tricks in the execution that Brain never could have accomplished, just as ollama has had great success focusing on things that wouldn't make sense for llama.cpp. Besides everyone who wants to know what's up can read the source code, research papers, etc. then make their own judgements about who's who. It's all in the open.


Newton only gave credits to "Gods". He said he stands on shoulders of Gods or something like that. But he never mentioned which Gods in particular, did he?


Giants, not Gods, and it's debated whether it was actually an insult or not: https://en.wikipedia.org/wiki/Isaac_Newton#Personality


Like what? What contributions has ollama made to llama.cpp? It's not a big deal or problem. It just hasn't.

And they could have very, very, easily, there's a server, sitting right there.

They chose not to.

That's fine. But it's a choice.

The rest I chalk it up to inexperience and being busy.



OP never said "contributions to llama.cpp", the comment just said just "contributions" which I read as "own developments".


Where does the name come from?


[flagged]



I've recently taken to calling it llama.cpp plus ollama


It's the "GNU/Linux" thing all over again....


Trust me, they do. Programmers curse him out on a daily basis.


Honestly, they probably should keep a little Bjarne shrine on their desk and light a little votive candle every time they release to production.


Realistically that's the only way to reduce the amount of bugs in your C++ code.


Do you want machine spirits? Because that's how you get machine spirits.


My brother in Stroustrup, we are trying to run llama.cpp, we are definitionally attempting to summon machine spirits.


lol, although you made me think--the history of computers has involved patching layer after layer of sediment on top of each other until the how things work 10 layers deep is forgotten. Imagine when local LLMs are five layers deep. It'll be like having machine spirits. You know how you have to yell to get the LLM to do what you want sometimes? That's what the future version of sudo will be like.


Who are these heathens who are going to ignore the basic language design work done by Kernighan & Ritchie? And whoever invented ++. Stroustrup didn't invent the C, the ++ or the notion of ++ing things in general. His contribution here is clearly quite small.

I thought the original argument for focusing on ollama alone was quite good. Once you get into dependencies it is turtles all the way down.






