Run local LLMs on GPU or on "AI" Mini PC with unified memory?

somik · May 30

Someone on YouTube:

"Don't buy a GPU for AI. Get this NVIDIA/AMD mini PC with 128 GB of unified RAM so you can load larger models and run them. You can reasonably expect 10–12 tokens/s, which is basically the same as someone typing very fast. It’s only ~$7k USD."

Meanwhile, I’m sitting here running llama.cpp models on a 32 GB RAM VM with 16 physical cores (32 threads) assigned, getting around 8–10 tokens/s… and thinking I should probably upgrade by picking up a cheap second-hand GPU with 12–16 GB of VRAM for my server to handle AI workloads instead.

What do you guys think? Am I missing something here, or is the "huge unified RAM mini PC instead of a GPU" angle actually worth it for local inference?

Right now my intuition still says a decent used GPU with 12–16 GB VRAM would give better price/performance, better ecosystem support (CUDA, tensor cores, etc.), and more predictable scaling than going all-in on a pricey unified memory system. Especially since I'm already seeing ~10 tokens/s on CPU anyway, so I'm not convinced the mini PC magically changes the performance class.

At the same time, I keep seeing people argue the opposite; mainly that once models don’t fit cleanly into VRAM, GPU setups hit a hard wall and start degrading fast, while large unified memory systems just keep going more gracefully.
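A rough way to see why spilling out of VRAM hurts so much: token generation has to read every weight once per token, so the slower memory pool dominates the time per token. The sketch below is a back-of-the-envelope model only. The bandwidth figures and the 20 GB model size are illustrative assumptions, not measurements from this thread, and it ignores compute, PCIe transfers, and KV cache reads.

```python
def tokens_per_second(model_gb, frac_in_vram, vram_gbps, ram_gbps):
    """Upper-bound decode speed when weights are split across two memory pools.

    Assumes each generated token reads every weight exactly once and that
    decode is purely memory-bandwidth bound (no compute or transfer cost).
    """
    if not 0.0 <= frac_in_vram <= 1.0:
        raise ValueError("frac_in_vram must be between 0 and 1")
    gpu_seconds = model_gb * frac_in_vram / vram_gbps
    cpu_seconds = model_gb * (1.0 - frac_in_vram) / ram_gbps
    return 1.0 / (gpu_seconds + cpu_seconds)


# Hypothetical numbers: 20 GB of weights, 500 GB/s VRAM, 50 GB/s system RAM.
for frac in (1.0, 0.9, 0.75, 0.5, 0.0):
    rate = tokens_per_second(20, frac, vram_gbps=500, ram_gbps=50)
    print(f"{frac:.0%} of weights in VRAM -> ~{rate:.1f} tok/s upper bound")
```

With these made-up figures, moving just 10% of the weights into system RAM roughly halves the ceiling (25 tok/s down to about 13), which is the "hard wall" effect described above. A unified memory system has a single pool, so it has no such cliff, but its ceiling is set by that one pool's bandwidth.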

Also, is running larger models actually worth it in practice? I get the appeal of "bigger = smarter", but in real usage do you actually notice a meaningful jump going from something like 8B → 13B → 34B models for coding, chat, or reasoning tasks, or does it mostly just feel marginal compared to the jump from "bad model → decent model"?

Curious to hear from people who’ve actually tried both setups. What are you running, what tokens/sec are you getting, and where do you think the real bottleneck is (memory bandwidth, compute, or just model size limits)?

Disclaimer:
This post was messily written by me and was dressed up by AI

yoursunny · May 30

We once asked Gemini why we cannot upload a 75GB traffic trace into the model.
The answer was, the amount of VRAM needed is proportional to the square of context window, and there isn't any GPU with (75GB)^2 VRAM.

Applying this knowledge in reverse, having larger amounts of VRAM or unified RAM would enable larger context window.
It doesn't necessarily affect output speed.

Neoon · May 30

BS, if you wanna run something with 10t/s get a KS-LE-B.
Unified Memory isn't as fast as a RTX 6000 Blackwell.
If you wanna run things fast, get a RTX 6000.

You will be able to run these models on Unified Memory for sure, but it ain't as fast.

somik · May 30

@yoursunny said:
We once asked Gemini why we cannot upload a 75GB traffic trace into the model.
The answer was that VRAM requirements scale with the square of the context window, and there isn’t any GPU with (75GB)² VRAM.

Applying this reasoning in reverse, having larger amounts of VRAM or unified RAM would enable much larger context windows.
It doesn’t necessarily improve output speed.

From what I understand, context window is something we explicitly control in llama.cpp via the -c parameter.

And I’ve actually run into the practical limits of this myself: when testing a Qwen 3 VL 8B Instruct model with a 32K context setting, the KV cache ended up consuming my entire 32 GB RAM allocation and caused the runtime to crash. In practice, something like 8K–16K context seems to be the sweet spot for most real-world use cases anyway.
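For a sense of scale, the KV cache of a standard transformer grows linearly with the -c value. A minimal sketch of the usual estimate follows. The layer count, KV head count, and head size are assumed values for an 8B-class model with grouped-query attention, not numbers read from the Qwen 3 VL model file, and the estimate leaves out model weights, compute buffers, and any vision encoder.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed architecture (hypothetical 8B-class model, 16-bit cache entries).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

for n_ctx in (8_192, 16_384, 32_768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx) / 2**30
    print(f"-c {n_ctx}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions a 32K context works out to 4.5 GiB, and doubling -c doubles it. The real total also depends on the weights and runtime buffers, which this sketch does not model.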

So that brings me to a more grounded question: if VRAM / unified RAM isn’t being fully eaten by KV cache in typical usage, what actually is the limiting factor that benefits from massive memory pools? Is it mainly enabling larger models, or are there real workloads where context length is truly the dominant constraint?

@Neoon said:
BS, if you want ~10 tok/s just get a KS-LE-B.
Unified memory isn’t as fast as an RTX 6000 Blackwell.
If you want raw performance, just get a RTX 6000.

Honestly, if I had $12K of disposable income lying around, I wouldn’t be debating hardware setups on here. I’d be too busy doing something far less responsible than benchmarking inference speeds... like insider trading

@Neoon said:
You can run models on unified memory, sure, but it won’t be as fast.

That’s the part I keep coming back to.

If a ~$7K unified memory mini PC lands in the same general ballpark (~10–12 tok/s) as what I’m already getting on a CPU setup, and a high-end GPU system is also competing in a similar real-world inference range depending on model size and quantization… then where exactly does the unified memory system actually sit in terms of value?

It feels like I’m missing the “sweet spot” explanation here, because on paper I see three competing ideas that don’t fully line up in practice:

Unified memory → lets you fit bigger models / larger contexts
GPU → best performance per dollar and ecosystem support
CPU inference → already “good enough” for smaller quantized models (~8–10 tok/s in my case)

So if unified memory isn’t clearly winning on speed, and GPUs still win on acceleration, what’s the actual real-world scenario where a $7K–$8K mini PC becomes the obvious choice?

Neoon · May 30

@somik said:

@yoursunny said:
We once asked Gemini why we cannot upload a 75GB traffic trace into the model.
The answer was that VRAM requirements scale with the square of the context window, and there isn’t any GPU with (75GB)² VRAM.

Applying this reasoning in reverse, having larger amounts of VRAM or unified RAM would enable much larger context windows.
It doesn’t necessarily improve output speed.

From what I understand, context window is something we explicitly control in llama.cpp via the -c parameter.

And I’ve actually run into the practical limits of this myself: when testing a Qwen 3 VL 8B Instruct model with a 32K context setting, the KV cache ended up consuming my entire 32 GB RAM allocation and caused the runtime to crash. In practice, something like 8K–16K context seems to be the sweet spot for most real-world use cases anyway.

So that brings me to a more grounded question: if VRAM / unified RAM isn’t being fully eaten by KV cache in typical usage, what actually is the limiting factor that benefits from massive memory pools? Is it mainly enabling larger models, or are there real workloads where context length is truly the dominant constraint?

@Neoon said:
BS, if you want ~10 tok/s just get a KS-LE-B.
Unified memory isn’t as fast as an RTX 6000 Blackwell.
If you want raw performance, just get a RTX 6000.

Honestly, if I had $12K of disposable income lying around, I wouldn’t be debating hardware setups on here. I’d be too busy doing something far less responsible than benchmarking inference speeds... like insider trading

@Neoon said:
You can run models on unified memory, sure, but it won’t be as fast.

That’s the part I keep coming back to.

If a ~$7K unified memory mini PC lands in the same general ballpark (~10–12 tok/s) as what I’m already getting on a CPU setup, and a high-end GPU system is also competing in a similar real-world inference range depending on model size and quantization… then where exactly does the unified memory system actually sit in terms of value?

It feels like I’m missing the “sweet spot” explanation here, because on paper I see three competing ideas that don’t fully line up in practice:

Unified memory → lets you fit bigger models / larger contexts

GPU → best performance per dollar and ecosystem support

CPU inference → already “good enough” for smaller quantized models (~8–10 tok/s in my case)

So if unified memory isn’t clearly winning on speed, and GPUs still win on acceleration, what’s the actual real-world scenario where a $7K–$8K mini PC becomes the obvious choice?

No, just get an RTX 6000. Yes, it only has 96GB of VRAM, but it's fast as fuck, boy.

somik · May 30

@Neoon said:
fast as fuck

I'm sorry to hear that. Are you seeing a doctor about your symptoms? They might be able to give you meds to last longer

BTW, nvidia is now churning out vibe coded drivers full of bugs. So best avoid newer drivers if you're running nvidia...

SpeedBus · May 30

Depending on the amount of RAM these ship with, one option could be the new laptops launched with the N1 / N1X tomorrow

https://videocardz.com/newz/dell-confirms-xps-laptop-with-nvidia-n1x-at-computex
https://videocardz.com/newz/nvidia-teases-new-era-of-pc-ahead-of-n1-and-n1x-laptop-chip-announcement

John_Q_Developer · May 30

The basic issue is you need multiple GPUs to run at usable (i.e. as fast as you can track the results) speeds for local AI.

Inference is the domain of datacenters and servers right now. Not worth it to run locally unless you really need the privacy (i.e. doing adult content or something)

somik · May 31

@SpeedBus said:
Depending on the amount of RAM these ship with, one option could be the new laptops launched with the N1 / N1X tomorrow

https://videocardz.com/newz/dell-confirms-xps-laptop-with-nvidia-n1x-at-computex
https://videocardz.com/newz/nvidia-teases-new-era-of-pc-ahead-of-n1-and-n1x-laptop-chip-announcement

Right... Mini PCs run laptop hardware, so they can also build AI-buzzword-filled "pro" grade laptops targeted at companies!

@John_Q_Developer said:
The basic issue is you need multiple GPUs to run at usable (i.e. as fast as you can track the results) speeds for local AI.

If you are OK with anything below 14B models, or use a mixture-of-experts model, one recent gaming GPU is more than enough for regular LLMs.

@John_Q_Developer said:
Inference is the domain of datacenters and servers right now. Not worth it to run locally unless you really need the privacy (i.e. doing adult content or something)

Text only adult or NSFW, not really into that, so still using official models from Qwen or unofficial slimmed down ones from unsloth.

You are 100% correct about image generation or edits. Smaller models that fit in a 16 GB GPU can't reasonably generate realistic photos without adding extra hands/feet/fingers or just producing nightmare content... Good for making body horror images!

As for LLM, Qwen 3 VL 8B Instruct running on 16 gig GPU delivers enough performance to not need remote AI and their absurdly priced models. I mean even Microsoft is pulling back on AI coders and getting back humans...

havoc · May 31

@somik said:
worth it

Depends on your idea of worth it.

Financially -> Nope

Output quality -> Nope

Speed -> Nope

Learning / Privacy / Fun -> Maybe

Running LORAs / finetuned models ->Maybe

Neoon · May 31

@havoc said:

@somik said:
worth it

Depends on your idea of worth it.

Financially -> Nope

Output quality -> Nope

Speed -> Nope

Learning / Privacy / Fun -> Maybe

Running LORAs / finetuned models ->Maybe

Indeed and for such a KS-LE-B with 64gig DDR4 is purfect.

somik · May 31

@Neoon said:

@havoc said:

@somik said:
worth it

Depends on your idea of worth it.

Financially -> Nope

Output quality -> Nope

Speed -> Nope

Learning / Privacy / Fun -> Maybe

Running LORAs / finetuned models ->Maybe

Indeed and for such a KS-LE-B with 64gig DDR4 is purfect.

I'm running a refined unsloth/Qwen3-VL-30B-A3B-Instruct model; it is 12GB and fits completely in my 16GB GPU. I'm getting 70 to 80 tokens per second, a basically instant, screen-filling response. I'm now realising that the current limitation is my reading speed... With the model running on my CPU, it types as fast as I can read, about 8–10 tokens/s. I can't read 70+ words per second!!!

Neoon · May 31

@somik said:

@Neoon said:

@havoc said:

@somik said:
worth it

Depends on your idea of worth it.

Financially -> Nope

Output quality -> Nope

Speed -> Nope

Learning / Privacy / Fun -> Maybe

Running LORAs / finetuned models ->Maybe

Indeed and for such a KS-LE-B with 64gig DDR4 is purfect.

I'm running a refined unsloth/Qwen3-VL-30B-A3B-Instruct model; it is 12GB and fits completely in my 16GB GPU. I'm getting 70 to 80 tokens per second, a basically instant, screen-filling response. I'm now realising that the current limitation is my reading speed... With the model running on my CPU, it types as fast as I can read, about 8–10 tokens/s. I can't read 70+ words per second!!!

Sadly I don't have a 16GB Card, hence I can't fit that model into memory.
Skill issue, then run it on CPU, you can actually for once read it all.

ehab · May 31

@Neoon said:
BS, if you wanna run something with 10t/s get a KS-LE-B.
Unified Memory isn't as fast as a RTX 6000 Blackwell.
If you wanna run things fast, get a RTX 6000.

You will be able to run these models on Unified Memory for sure, but it ain't as fast.

do you have some article you wrote or recommend on installing LLM on kimsufi server?

TeYroX · May 31

It really depends on the model you want to run. A lot of people are indeed using mini PCs or old mining rigs to run LLMs, while others use their own PC or buy a Mac for it.
You can also tweak llama.cpp/vulkan.cpp to get faster token generation/processing.

Neoon · May 31

@ehab said:

@Neoon said:
BS, if you wanna run something with 10t/s get a KS-LE-B.
Unified Memory isn't as fast as a RTX 6000 Blackwell.
If you wanna run things fast, get a RTX 6000.

You will be able to run these models on Unified Memory for sure, but it ain't as fast.

do you have some article you wrote or recommend on installing LLM on kimsufi server?

https://lowendspirit.com/discussion/10471/how-to-ab-use-your-ks-le-b-for-llm-models

Neoon · May 31

https://www.reddit.com/r/LocalLLaMA/comments/1tr7hzw/psa/

Lunics · May 31

I've done both: Built two AI Servers with two RTX 3090 each and also used two Strix Halo PCs (AMD Ryzen AI MAX+ 395) with 128GB each, networked via Infiniband.

If you find a model that fits in the VRAM of the dedicated GPUS, it's much faster, because the memory bandwidth is much higher (>900 GB/s GDDR6 on the RTX 3090 versus ~220 GB/s for the quad channel Strix Halo).
Right now there is a nice model: Qwen 3.6 27B which is very good and runs well if you have fast VRAM.

If you want bigger models, yes they will be slower but of course also better. For example I can run MiniMax M2.7 Q6 (it needs around 220GB) at up to 17 tokens/s. The biggest problem with the Strix Halo is the slow prompt processing. So if you want to discuss a large document with your local AI, you may have to wait several minutes before it starts responding (with 200k context it can take up to 20 minutes!)
Prefix caching helps a ton, but it is still an issue.
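The numbers above imply a prefill rate, which is worth making explicit, since time to first token is roughly prompt length divided by prompt-processing speed. A small sketch of that arithmetic follows. The 200k tokens and 20 minutes come from the post, the function names are made up for illustration, and the constant-rate assumption is a simplification (real prefill slows down as the context grows).

```python
def implied_prefill_rate(prompt_tokens, seconds):
    """Prompt-processing speed (tokens/s) implied by an observed wait."""
    return prompt_tokens / seconds


def time_to_first_token(prompt_tokens, prefill_tok_per_s):
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tok_per_s


rate = implied_prefill_rate(200_000, 20 * 60)
print(f"implied prefill speed: ~{rate:.0f} tok/s")
print(f"16k prompt at that speed: ~{time_to_first_token(16_000, rate):.0f} s")
```

At that rate even a 16k-token prompt takes roughly a minute and a half before the reply starts, which is why prefix caching matters so much on this class of hardware.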

Also the cost for Strix Halo has increased by ~50% since last Fall.

Good luck!

somik · June 1

@Neoon said:

@somik said:
I'm running a refined unsloth/Qwen3-VL-30B-A3B-Instruct model; it is 12GB and fits completely in my 16GB GPU. I'm getting 70 to 80 tokens per second, a basically instant, screen-filling response. I'm now realising that the current limitation is my reading speed... With the model running on my CPU, it types as fast as I can read, about 8–10 tokens/s. I can't read 70+ words per second!!!

Sadly I don't have a 16GB Card, hence I can't fit that model into memory.
Skill issue, then run it on CPU, you can actually for once read it all.

You can get the GGUF model and run it on CPU using llama.cpp. This model is quite well optimized compared to the other qwen 3 models I tried.

@TeYroX said:
It really depends on the model you want to run. A lot of people are indeed using mini PCs or old mining rigs to run LLMs, while others use their own PC or buy a Mac for it.
You can also tweak llama.cpp/vulkan.cpp to get faster token generation/processing.

I am still running llama.cpp (CPU only) on my server. I had to build it from source but it seems okish. 8 to 10 tokens with 8B models. Can go up to 14 tokens if lucky!

@Neoon said:
https://www.reddit.com/r/LocalLLaMA/comments/1tr7hzw/psa/

There, I added my GPU to that list. Seems like it's "slightly" faster than an M5 MacBook Pro (40-core GPU) but has a lot more cores? No idea what that means for tokens/s...

Device	Memory Bandwidth	GPU Cores / Compute Units	Memory Available to LLMs
M4 Mac Mini	120 GB/s	10 GPU cores	16–32 GB unified
AMD Strix Halo (Ryzen AI Max+ 395)	256 GB/s	40 RDNA 3.5 CUs (2,560 shaders)	Up to 128 GB unified
Nvidia DGX Spark	273 GB/s	6,144 CUDA cores	128 GB unified
M5 MacBook Pro (32-core GPU)	460 GB/s	32 GPU cores	Up to 128 GB unified
Intel Arc Pro B70	608 GB/s	20 Xe Cores (2,560 ALUs)	24 GB VRAM
M5 MacBook Pro (40-core GPU)	614 GB/s	40 GPU cores	Up to 128 GB unified
RX 7800 XT	624 GB/s	120 AI cores + 3,840 shaders	16 GB VRAM
RTX 3090	936 GB/s	10,496 CUDA cores	24 GB VRAM
RTX 5090	1,792 GB/s	21,760 CUDA cores	32 GB VRAM
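As a rough guide to what the bandwidth column means for tokens/s: during generation each token needs roughly one pass over the active weights, so bandwidth divided by the bytes read per token gives a ceiling. The sketch below applies that to a few rows of the table. The 12 GB dense model and the 2 GB of active weights for a mixture-of-experts model are illustrative assumptions, and real throughput will land below these ceilings.

```python
# Memory bandwidth in GB/s, copied from the table above.
BANDWIDTH_GBPS = {
    "AMD Strix Halo": 256,
    "Nvidia DGX Spark": 273,
    "M5 MacBook Pro (40-core GPU)": 614,
    "RX 7800 XT": 624,
    "RTX 3090": 936,
}


def decode_ceiling(bandwidth_gbps, gb_read_per_token):
    """Upper bound on tokens/s if decode is purely bandwidth bound."""
    return bandwidth_gbps / gb_read_per_token


DENSE_GB = 12.0      # assumed: dense model, all weights read per token
MOE_ACTIVE_GB = 2.0  # assumed: MoE model, only active experts read per token

for device, bw in BANDWIDTH_GBPS.items():
    dense = decode_ceiling(bw, DENSE_GB)
    moe = decode_ceiling(bw, MOE_ACTIVE_GB)
    print(f"{device:30s} dense <= {dense:5.1f} tok/s, MoE <= {moe:6.1f} tok/s")
```

Seen this way, core counts matter much less than bandwidth for generation speed (they matter more for prompt processing), and a model with few active parameters can feel far faster than a dense model of the same file size.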

@Lunics said:
I've done both: Built two AI Servers with two RTX 3090 each and also used two Strix Halo PCs (AMD Ryzen AI MAX+ 395) with 128GB each, networked via Infiniband.

If you find a model that fits in the VRAM of the dedicated GPUS, it's much faster, because the memory bandwidth is much higher (>900 GB/s GDDR6 on the RTX 3090 versus ~220 GB/s for the quad channel Strix Halo).
Right now there is a nice model: Qwen 3.6 27B which is very good and runs well if you have fast VRAM.

So basically the AIO llm PCs are still in their infancy period and best to wait and see or just use existing gaming GPUs?

@Lunics said:
If you want bigger models, yes they will be slower but of course also better. For example I can run MiniMax M2.7 Q6 (it needs around 220GB) at up to 17 tokens/s. The biggest problem with the Strix Halo is the slow prompt processing. So if you want to discuss a large document with your local AI, you may have to wait several minutes before it starts responding (with 200k context it can take up to 20 minutes!)
Prefix caching helps a ton, but it is still an issue.

Yea, that's one of the issues I had with running the model on CPU only on my homelab. The memory bandwidth of DDR4 is too low compared to my GPU, so the longer the chat becomes, the slower it gets "before" it starts to reply. That is one of the primary reasons I even considered booting up my gaming PC and running LLM on that GPU instead. I even kept the context size to 16k to lower the overhead as much as possible, but too low and it gets too dumb to continue the conversation...

@Lunics said:
Also the cost for Strix Halo has increased by ~50% since last Fall.

And with the current RAM prices, we need over 1k just for the RAM...

somik · June 1

Can someone else, running llama.cpp on CPU, help me to verify the performance of the following two?

unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M
unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q3_K_S

For me, the bigger Q4_K_M running on CPU gives 70% higher speeds, which is just plain weird! 10 tokens/s on Q3_K_S vs 17 tokens/s on Q4_K_M.

Something for @Neoon if you still have your LLM node up and running.
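One way to make that comparison reproducible is to run both files through llama-bench, which ships with llama.cpp, using identical settings. A hedged sketch that only builds (and optionally runs) the commands follows. The local file paths are hypothetical, and the thread count of 16 is taken from the VM described at the top of the thread.

```python
import subprocess


def build_bench_cmd(model_path, threads, n_prompt=512, n_gen=128, reps=3):
    """Return a llama-bench argument list for one model file."""
    return [
        "llama-bench",
        "-m", model_path,      # GGUF file to benchmark
        "-t", str(threads),    # CPU threads
        "-p", str(n_prompt),   # prompt-processing test length
        "-n", str(n_gen),      # token-generation test length
        "-r", str(reps),       # repetitions per test
    ]


def run_bench(cmd):
    """Run llama-bench and return its text output (needs it on PATH)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Hypothetical local paths; adjust to wherever the GGUF files were downloaded.
MODELS = [
    "models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.gguf",
    "models/Qwen3-VL-30B-A3B-Instruct-Q3_K_S.gguf",
]

for path in MODELS:
    print(" ".join(build_bench_cmd(path, threads=16)))
```

Running each command a few times with the same -t value helps separate a real quantization effect from noise, such as other VMs competing for the host's memory bandwidth.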

Neoon · June 1

@somik said:
Can someone else, running llama.cpp on CPU, help me to verify the performance of the following two?
unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M
unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q3_K_S
For me, the bigger Q4_K_M running on CPU gives 70% higher speeds, which is just plain weird! 10 tokens/s on Q3_K_S vs 17 tokens/s on Q4_K_M.

Something for @Neoon if you still have your LLM node up and running.

No idea, currently not using Qwen3 VL, just Qwen 3.5/6 35B

somik · June 1

Using LM Studio on Windows, I was running an LLM on my Asus NUC mini PC; it has a laptop-class Intel Core Ultra 7 155H CPU with an integrated Intel Arc GPU. The token rate is about 14–17 tokens/s, which is amazing given that the mini PC is much lower powered than the Xeon server, which gets 10–16 tokens/s on the same model.

However, since both the server and the mini PC are using RAM as VRAM, similar to the unified memory concept (just much slower), they can load much larger models than my GPU, which is capped at 13GB models + overhead for KV cache. That might be the reason people recommend those AI mini PCs. Honestly, 14–17 tokens/s is pretty usable until you realize you spent 7k on that laptop-grade device and people running 3-year-old GPUs are getting 80 tokens/s running slightly smaller models...

So should we start getting mining PCs for AI now?

havoc · June 1

NVIDIA just announced a chip that sounds like it’ll be a unified architecture similar to the Apple M chips and aimed at AI. Could get interesting, though I suspect they’ll be hella expensive.

somik · June 1

@havoc said:
NVIDIA just announced a chip that sounds like it’ll be a unified architecture similar to the Apple M chips and aimed at AI. Could get interesting, though I suspect they’ll be hella expensive.

An Apple M5 MacBook Pro (Max?) costs 7k and runs AI at twice the speed of my server. AMD's Ryzen AI Max+ 395 mini PC is also sitting at 7k. Nvidia's current mini PC offering is at 7k too. And all of these are considered last gen. So the newer one will definitely cost 10k+ and will have some absurd specs while still underperforming...

Do you think Nvidia can put in a 6000 series gpu and share the 256GB DDR6 embedded RAM between system and gpu?

havoc · June 1

@somik said:
.

It appears to be marketed towards laptops too so reckon there is a fair chance of this coming in pretty modest depending on mem quantity

somik · June 2

@havoc said:

@somik said:
.

It appears to be marketed towards laptops too so reckon there is a fair chance of this coming in pretty modest depending on mem quantity

So it's not an actual AI machine, just another "hype"-powered laptop...

terrorgen · June 2

ARM jumped 15% today after Nvidia's announcement. We nearly had a heart attack.

(Sadly, we don't own enough ARM to retire)

somik · June 2

@terrorgen said:
ARM jumped 15% today after Nvidia's announcement. We nearly had a heart attack.

(Sadly, we don't own enough ARM to retire)

Is that why nvidia tried to buy out ARM in 2020?

somik · June 2

Seems like these AI mini PCs can run some very specific LLMs (gpt-oss 120B) very well, at up to 40 tokens/s.

However, the support from AMD just isn't there. Missing drivers and unsupported frameworks make running most models nearly impossible without spending major time tweaking your configs. Moreover, most image and video generation models fail to work due to unsupported hardware.

The open source community is doing its best but needs more support from AMD if you want a seamless experience.

In short, it's still best to go with a large-VRAM Nvidia GPU to ensure you can run most models. Failing that, you can get an AMD GPU (like I have) or an Intel GPU and use ROCm or Vulkan to run LLMs or image models.

Just that 12GB is the minimum VRAM for LLMs while 24GB seems to be the "entry level" for image generation...

Tried simple image generation on my 16GB vram gpu and all I got was nightmare fuel with hands and feet everywhere... Here's hoping we can buy gpus for cheap in the next 5 years...

terrorgen · June 3

We realized pay per use API access is cheaper than trying to selfhost.

Better models that will cost an arm and a leg to selfhost too.

somik · June 3

@terrorgen said:
We realized pay per use API access is cheaper than trying to selfhost.

Better models that will cost an arm and a leg to selfhost too.

I want to disagree...
