Run local LLMs on GPU or on "AI" Mini PC with unified memory?
Someone on YouTube:
"Don't buy a GPU for AI. Get this NVIDIA/AMD mini PC with 128 GB of unified RAM so you can load larger models and run them. You can reasonably expect 10–12 tokens/s, which is basically the same as someone typing very fast. It’s only ~$7k USD."
Meanwhile, I’m sitting here running llama.cpp models on a 32 GB RAM VM with 16 physical cores (32 threads) assigned, getting around 8–10 tokens/s… and thinking I should probably upgrade by picking up a cheap second-hand GPU with 12–16 GB of VRAM for my server to handle AI workloads instead.
What do you guys think? Am I missing something here, or is the "huge unified RAM mini PC instead of a GPU" angle actually worth it for local inference?
Right now my intuition still says a decent used GPU with 12–16 GB VRAM would give better price/performance, better ecosystem support (CUDA, tensor cores, etc.), and more predictable scaling than going all-in on a pricey unified memory system. Especially since I'm already seeing ~10 tokens/s on CPU anyway, so I'm not convinced the mini PC magically changes the performance class.
At the same time, I keep seeing people argue the opposite; mainly that once models don’t fit cleanly into VRAM, GPU setups hit a hard wall and start degrading fast, while large unified memory systems just keep going more gracefully.
Also, is running larger models actually worth it in practice? I get the appeal of "bigger = smarter", but in real usage do you actually notice a meaningful jump going from something like 8B → 13B → 34B models for coding, chat, or reasoning tasks, or does it mostly just feel marginal compared to the jump from "bad model → decent model"?
Curious to hear from people who’ve actually tried both setups. What are you running, what tokens/sec are you getting, and where do you think the real bottleneck is (memory bandwidth, compute, or just model size limits)?
Disclaimer:
This post was messily written by me and was dressed up by AI
I speak fluent sarcasm and broken logic. | I would agree with you, but then we’d both be wrong.

Comments
We once asked Gemini why we cannot upload a 75GB traffic trace into the model.
The answer was, the amount of VRAM needed is proportional to the square of context window, and there isn't any GPU with (75GB)^2 VRAM.
Applying this knowledge in reverse, having larger amounts of VRAM or unified RAM would enable larger context window.
It doesn't necessarily affect output speed.
best of "yoursunny lore" by Google AI 🤣 affbrr
BS, if you wanna run something with 10t/s get a KS-LE-B.
Unified Memory isn't as fast as an RTX 6000 Blackwell.
If you wanna run things fast, get an RTX 6000.
You will be able to run these models on Unified Memory for sure, but it ain't as fast.
Free NAT KVM | Free NAT LXC
From what I understand, context window is something we explicitly control in llama.cpp via the -c parameter.
And I’ve actually run into the practical limits of this myself: when testing a Qwen 3 VL 8B Instruct model with a 32K context setting, the KV cache ended up consuming my entire 32 GB RAM allocation and caused the runtime to crash. In practice, something like 8K–16K context seems to be the sweet spot for most real-world use cases anyway.
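For a rough sense of how the KV cache grows with the -c setting, here is a minimal sketch. The layer count, KV head count, and head size are assumed, illustrative values for a mid-sized model with grouped-query attention (they are not read from any model file), and the 16-bit cache entries are an assumption too. It only shows that the cache grows linearly with context. Weights, compute buffers, and any vision encoder come on top, so real usage will differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    # Factor 2 = keys + values. bytes_per_value=2 assumes 16-bit cache entries.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical model dimensions, chosen only for illustration.
ASSUMED_DIMS = dict(n_layers=36, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (8192, 16384, 32768):
        gib = kv_cache_bytes(context_tokens=ctx, **ASSUMED_DIMS) / 2**30
        print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

Doubling the context doubles the estimate, so halving -c is the most direct way to shrink this part of the footprint.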
So that brings me to a more grounded question: if VRAM / unified RAM isn’t being fully eaten by KV cache in typical usage, what actually is the limiting factor that benefits from massive memory pools? Is it mainly enabling larger models, or are there real workloads where context length is truly the dominant constraint?
Honestly, if I had $12K of disposable income lying around, I wouldn’t be debating hardware setups on here. I’d be too busy doing something far less responsible than benchmarking inference speeds... like insider trading
That’s the part I keep coming back to.
If a ~$7K unified memory mini PC lands in the same general ballpark (~10–12 tok/s) as what I’m already getting on a CPU setup, and a high-end GPU system is also competing in a similar real-world inference range depending on model size and quantization… then where exactly does the unified memory system actually sit in terms of value?
It feels like I’m missing the “sweet spot” explanation here, because on paper I see three competing ideas that don’t fully line up in practice:
So if unified memory isn’t clearly winning on speed, and GPUs still win on acceleration, what’s the actual real-world scenario where a $7K–$8K mini PC becomes the obvious choice?
No, just get an RTX 6000, it only has 96GB of VRAM yes, but it's fast as fuck boy.
I'm sorry to hear that. Are you seeing a doctor about your symptoms? They might be able to give you meds to last longer
BTW, nvidia is now churning out vibe coded drivers full of bugs. So best avoid newer drivers if you're running nvidia...
Depending on the amount of RAM these ship with, one option could be the new laptops launching with the N1 / N1X tomorrow
https://videocardz.com/newz/dell-confirms-xps-laptop-with-nvidia-n1x-at-computex
https://videocardz.com/newz/nvidia-teases-new-era-of-pc-ahead-of-n1-and-n1x-laptop-chip-announcement
CrownCloud - Internet Services | Los Angeles, California | Frankfurt, Germany | Amsterdam, The Netherlands | Atlanta, Georgia | Miami, Florida
The basic issue is you need multiple GPUs to run at usable (i.e. as fast as you can track the results) speeds for local AI.
Inference is the domain of datacenters and servers right now. Not worth it to run locally unless you really need the privacy (i.e. doing adult content or something)
Right... Mini PCs run laptop hardware, so they are able to build AI-related, buzzword-filled "pro" grade laptops, targeted at companies!
If you are ok with anything below 14B models or use a mixture-of-experts model, 1 recent gaming GPU is more than enough for regular LLMs.
Text only adult or NSFW, not really into that, so still using official models from Qwen or unofficial slimmed down ones from unsloth.
You are 100% correct about image generation or edits. Smaller models that fit in a 16 gig GPU are not able to reasonably generate realistic photos without adding additional hands/feet/fingers or just making nightmare content... Good for making body horror images!
As for LLMs, Qwen 3 VL 8B Instruct running on a 16 gig GPU delivers enough performance to not need remote AI and their absurdly priced models. I mean even Microsoft is pulling back on AI coders and bringing back humans...
Depends on your idea of worth it.
Financially -> Nope
Output quality -> Nope
Speed -> Nope
Learning / Privacy / Fun -> Maybe
Running LoRAs / finetuned models -> Maybe
Indeed, and for that a KS-LE-B with 64 gig of DDR4 is perfect.
I'm running a refined unsloth/Qwen3-VL-30B-A3B-Instruct model; it is 12GB and fits completely in my 16GB GPU. I'm getting 70 to 80 tokens per second, a basically instant, screen-filling response. I'm realising that the current limitation is now my reading speed... With the model running on my CPU, it types as fast as I can read, about 8~10 tokens/s. I can't read 70+ words per second!!!
Sadly I don't have a 16GB Card, hence I can't fit that model into memory.
Skill issue, then run it on CPU, you can actually for once read it all.
Do you have an article you wrote, or one you'd recommend, on installing an LLM on a Kimsufi server?
It really depends on the model that you want to run. A lot of people are indeed using mini PCs or old mining rigs to run LLMs, while others are using their own PC or buying Macs for it.
You can also tweak llama.cpp/vulkan.cpp to have faster token gen/processing
https://lowendspirit.com/discussion/10471/how-to-ab-use-your-ks-le-b-for-llm-models
https://www.reddit.com/r/LocalLLaMA/comments/1tr7hzw/psa/

I've done both: Built two AI Servers with two RTX 3090 each and also used two Strix Halo PCs (AMD Ryzen AI MAX+ 395) with 128GB each, networked via Infiniband.
If you find a model that fits in the VRAM of the dedicated GPUs, it's much faster, because the memory bandwidth is much higher (>900 GB/s GDDR6X on the RTX 3090 versus ~220 GB/s for the quad channel Strix Halo).
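The bandwidth argument can be turned into a back-of-the-envelope ceiling. This is only a sketch under a simplifying assumption: generating each token reads every model weight from memory once, so tokens/s cannot exceed bandwidth divided by model size. It ignores compute limits, KV cache reads, and mixture-of-experts models (which read only part of the weights per token). The 12 GB model size is an assumed example; the bandwidth figures are the ones quoted above.

```python
def max_tokens_per_second(bandwidth_gb_s, weights_read_per_token_gb):
    """Upper bound on generation speed when memory bandwidth is the bottleneck."""
    return bandwidth_gb_s / weights_read_per_token_gb


if __name__ == "__main__":
    model_gb = 12  # assumed example: a quantized model that fits in 16 GB of VRAM
    for name, bandwidth in (("RTX 3090 (~900 GB/s)", 900), ("Strix Halo (~220 GB/s)", 220)):
        ceiling = max_tokens_per_second(bandwidth, model_gb)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s for a {model_gb} GB model")
```

Real numbers land below these ceilings, but the ratio between the two systems is roughly what the bandwidth gap predicts.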
Right now there is a nice model: Qwen 3.6 27B which is very good and runs well if you have fast VRAM.
If you want bigger models, yes they will be slower but of course also better. For example I can run MiniMax M2.7 Q6 (it needs around 220GB) at up to 17 tokens/s. The biggest problem with the Strix Halo is the slow prompt processing. So if you want to discuss a large document with your local AI, you may have to wait several minutes before it starts responding (with 200k context it can take up to 20 minutes!)
Prefix caching helps a ton, but it is still an issue.
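To put the prompt-processing wait into numbers, here is a small sketch built on the "200k context in up to 20 minutes" figure above. It assumes a constant prefill rate, which is a simplification (prefill usually slows down as the context grows), and it models prefix caching simply as tokens that do not need to be processed again. The function names and the cached-token count are made up for illustration.

```python
def implied_prefill_rate(prompt_tokens, seconds):
    """Average prompt-processing speed in tokens/s."""
    return prompt_tokens / seconds


def wait_seconds(prompt_tokens, rate_tokens_s, cached_tokens=0):
    """Time before the first reply token, skipping tokens already in the prefix cache."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / rate_tokens_s


if __name__ == "__main__":
    rate = implied_prefill_rate(200_000, 20 * 60)
    print(f"implied prefill rate: ~{rate:.0f} tokens/s")
    print(f"cold 200k prompt: ~{wait_seconds(200_000, rate) / 60:.0f} min")
    # Hypothetical follow-up question where 150k tokens are already cached.
    print(f"with 150k cached: ~{wait_seconds(200_000, rate, cached_tokens=150_000) / 60:.0f} min")
```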
Also the cost for Strix Halo has increased by ~50% since last Fall.
Good luck!
You can get the GGUF model and run it on CPU using llama.cpp. This model is quite well optimized compared to the other qwen 3 models I tried.
I am still running llama.cpp (CPU only) on my server. I had to build it from source but it seems okish. 8 to 10 tokens/s with 8B models. Can go up to 14 tokens/s if lucky!
There, I added my GPU to that list. Seems like it's "slightly" faster than an M5 MacBook Pro (40 core GPU) but has a lot more cores? No idea what that means for tokens/s...
So basically the AIO LLM PCs are still in their infancy, and it's best to wait and see or just use existing gaming GPUs?
Yea, that's one of the issues I had with running the model on CPU only on my homelab. The memory bandwidth of DDR4 is too low compared to my GPU, so the longer the chat becomes, the slower it gets "before" it starts to reply. That is one of the primary reasons I even considered booting up my gaming PC and running LLM on that GPU instead. I even kept the context size to 16k to lower the overhead as much as possible, but too low and it gets too dumb to continue the conversation...
And with the current RAM prices, we need over 1k just for the RAM...
Can someone else, running llama.cpp on CPU, help me to verify the performance of the following two?
For me, the bigger Q4_K_M running on CPU gives 70% higher speeds, which is just plain weird! 10 tokens/s on Q3_K_S vs 17 tokens/s on Q4_K_M.
Something for @Neoon if you still have your LLM node up and running.
No idea, currently not using Qwen3 VL, just Qwen 3.5/6 35B
Using LM Studio on Windows, I was running an LLM on my Asus NUC mini PC; it has a laptop-sized Intel Core Ultra 7 155H CPU with an integrated Intel Arc GPU. The token rate is about 14~17 tokens/s, which is amazing given that the mini PC is much lower powered than the Xeon server, which gets 10~16 tokens/s on the same model.
However, since both the server and the mini PC are using RAM as VRAM, similar to the unified memory concept (just much slower), they can load much larger models than my GPU, which is capped at 13GB models + overhead for KV cache. That might be the reason people recommend using those AI mini PCs. Honestly, 14-17 tokens/s is pretty usable until you realize you spent 7k on that laptop grade device and people running 3 year old GPUs are getting 80 tokens/s running slightly smaller models...
So should we start getting mining PCs for AI now?
NVIDIA just announced a chip that sounds like it’ll be a unified architecture similar to the Apple M chips and aimed at AI. Could get interesting, though I suspect they’ll be hella expensive
Apple M5 MacBook Pro (Max?) costs 7k and runs AI at twice the speed compared to my server. AMD's Ryzen AI Max+ 395 mini PC is also sitting at 7k. Nvidia's current offering for mini PCs is at 7k too. And all of these are considered last gen. So the newer one will definitely cost 10k+ and will have some absurd specs while still underperforming...
Do you think Nvidia can put in a 6000 series GPU and share the 256GB DDR6 embedded RAM between system and GPU?
It appears to be marketed towards laptops too, so I reckon there is a fair chance of this coming in at a pretty modest price depending on mem quantity
So it's not an actual AI machine, just one of those "hype" powered laptops...
ARM jumped 15% today after Nvidia's announcement. We nearly had a heart attack.
(Sadly, we don't own enough ARM to retire)
We're the source, no cap. Address us: We/Our/Ours.
https://lowendspirit.com/discussion/comment/221016/#Comment_221016
Is that why nvidia tried to buy out ARM in 2020?
Seems like these AI mini PCs can run some very specific LLM models (gpt-oss 120B) very well, at up to 40 tokens/s.
However, the support from AMD just isn't there. Missing drivers and unsupported frameworks make running most models nearly impossible without major time spent tweaking your configs. Moreover, most image and video generation models fail to work due to unsupported hardware.
The open source community is doing its best but needs more support from AMD if you want a seamless experience.
In short, it's still best to go with a large VRAM NVIDIA GPU to ensure you can run most models. Failing that, you can get an AMD GPU (like I have) or Intel GPU and use ROCm or Vulkan to run LLM or image models.
Just that 12GB is the minimum VRAM for LLMs while 24GB seems to be the "entry level" for image generation...
Tried simple image generation on my 16GB vram gpu and all I got was nightmare fuel with hands and feet everywhere... Here's hoping we can buy gpus for cheap in the next 5 years...
We realized pay per use API access is cheaper than trying to selfhost.
Better models too, ones that would cost an arm and a leg to selfhost.
I want to disagree...
