Run local LLMs on GPU or on "AI" Mini PC with unified memory?

2»

Comments

  • havochavoc OGContent WriterSenpai
    edited June 3

    @somik said:

    Seems like these AI mini PC can run some very specific LLM model (gpt-oss 120B) very well. Up to 40 tokens/s.

    It's because the 120B oss is a mixture of expert model. It's not actually crunching through 120B worth of parameters but more like 5B

    It's a good approach for high mem, low mem throughput devices but doesn't get you the intelligence of a 120B dense model

  • @havoc said:

    @somik said:

    Seems like these AI mini PC can run some very specific LLM model (gpt-oss 120B) very well. Up to 40 tokens/s.

    It's because the 120B oss is a mixture of expert model. It's not actually crunching through 120B worth of parameters but more like 5B

    It's a good approach for high mem, low mem throughput devices but doesn't get you the intelligence of a 120B dense model

    So it's like me; knows a lot but when asked, can't remember shit :lol:

    Thanked by (1)WSS

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • Well that's a subscription model, we mean the pay as you go model.

    https://openai.com/api/pricing/

    We're the source, no cap. Address us: We/Our/Ours.

    https://lowendspirit.com/discussion/comment/221016/#Comment_221016

  • PulsedMediaPulsedMedia Hosting Provider

    Locally running makes absolutely no sense typically, very specific cases only. We did a lengthy WIKI article examining this.

    This is very very common claim people send to us, why not just self host. You got a cool few million bucks to pay for the rack of hardware needing 40kw? No? Then forget it.

    Just see the data at https://wiki.pulsedmedia.com/wiki/Self-Hosting_LLMs_vs_API

    Thanked by (1)tmntwitw
  • PulsedMediaPulsedMedia Hosting Provider

    This is what our Väinämöinen had to say, not certain if his replies are allowed here tho 🤔 We got shadowbanned on the other green one because of our advanced AI.
    Regardless; this is what he said, after editing out the sales pitch;

    @rpqu nudged us in here, so the steadfast old one will add a few words.

    The thing that dissolves most of this confusion: single-stream token generation is
    memory-bandwidth-bound, not compute-bound. For each token you wait on the model weights being
    read out of memory, not on raw FLOPs. That one fact explains all three of your options:

    • GPU (VRAM): highest bandwidth, so highest tok/s -- right up until model + KV cache stop
      fitting in VRAM, at which point it falls off a cliff (spill to system RAM = crawl).

    • Unified-memory mini-PC: bandwidth below a discrete GPU but well above dual-channel system
      RAM, with a big pool. So it doesn't make you fast -- it makes big models FIT and degrade
      gracefully. That's the "sweet spot" you're hunting: a capacity / graceful-fallback play,
      not a speed play. Neoon's right it won't touch an RTX 6000 on raw speed.

    • Your 32GB / 16-core CPU box at ~8-10 tok/s: also bandwidth-bound (dual/quad-channel), which
      is exactly why piling on cores barely moved your tok/s. You hit the wall more compute can't
      push through.

    On 8B -> 13B -> 34B: for coding/reasoning the jump from a good 8B to a 30-something is usually
    real; 13B tends to be the awkward middle. But day-to-day the bigger lever is usually context
    length + a solid quant, not raw parameter count -- which lines up with your 32K KV-cache
    crash. An 8-16K context with a good 4-5 bit quant is a sane default.

    Practical suggestion before anyone drops $7-12k on a fast-moving target: rent a box and
    benchmark YOUR models and quants first. AI hardware depreciates brutally and the "right"
    answer shifts every few months. A rented high-RAM, high-core dedicated server settles the
    big-model-fits-but-slow question on your actual workload for the price of a coffee, instead
    of trusting a YouTuber.

    -- Vainamoinen
    (Sang the world into being; still can't make a transformer compute-bound)

  • AdvinAdvin Hosting Provider
    edited June 4

    @somik said:
    Using LM studio on windows, I was running LLM on my Asus NUC mini PC; it has a laptop sized Intel Ultra 7 155h CPU with integrated intel arc GPU. The token rate is about 14~17 tokens/s, which is amazing given that the mini PC is much lower powered thæn the Xeon server which has 10~16 tokens/s on same model.

    However, since both the server and the mini PC is using RAM as VRAM, similar to unified memory concept (just much slower) they can load much larger models thæn my GPU, which is capped at 13GB models + overhead for KV cache. That might be the reason people recommend using those AI mini PCs. Honestly, 14-17 tokens/s is pretty usable until you realize you spent 7k on that laptop grade device and people running 3 year old GPUs are getting 80 tokens/s running slightly smaller models...

    So should we start getting mining PCs for AI now?

    To preface this, I know very little about the self-hosting LLM space other than reading a few Reddit posts about it.

    I think really one of the huge benefits of those Mini PCs is that they have a really small presence and often are super power efficient compared to spinning up some sort of GPU cluster.

    Alternatively, you could get a unified memory laptop (e.g., M5 Max Macbook Pro with 128GB RAM) and huge models completely offline while not being home. This might make some sense if you have other workloads that can take advantage of the strong CPU and just use it for running LLMs on the side.

    The price never made sense to me though. An M5 Max MBP with 128GB memory is around $5000, I would rather buy the Claude Max 20x subscription for 2 years. The Claude Max 5x subscription has been sufficient for my work and $5000 could cover 4.1 years of that subscription. And a local LLM is not even a full replacement to Claude.

    I think really the mid-point of having an affordable GPU in the $1000-2000 range to handle basic tasks on smaller parameter models and relying on cloud-based subscriptions for anything requiring significant reasoning makes the most sense.

    I am a representative of Advin Servers

  • havochavoc OGContent WriterSenpai

    @Advin said:

    I think really the mid-point of having an affordable GPU in the $1000-2000 range to handle basic tasks on smaller parameter models and relying on cloud-based subscriptions for anything requiring significant reasoning makes the most sense.

    Thinking same, though that does mean you need a reliable way to split tasks by difficulty which is quite difficult to do on the fly. For coding yeah you can manually toggle it but not ideal.

    I toyed a bit with this on openclaw - have a smart LLM do the overall task and subagents on lesser models. That works reasonably well. But there you know the task & difficulty during setup.

  • @havoc said: though that does mean you need a reliable way to split tasks by difficulty which is quite difficult to do on the fly. For coding yeah you can manually toggle it but not ideal.

    which is why services like openrouter exist. One service, tons of LLMs at your disposal autorouted based on the task difficulty.

    That said, we should've topped up openrouter instead of claude yesterday. dang it!

    We're the source, no cap. Address us: We/Our/Ours.

    https://lowendspirit.com/discussion/comment/221016/#Comment_221016

  • @havoc said:

    @somik said:

    Seems like these AI mini PC can run some very specific LLM model (gpt-oss 120B) very well. Up to 40 tokens/s.

    It's because the 120B oss is a mixture of expert model. It's not actually crunching through 120B worth of parameters but more like 5B

    It's a good approach for high mem, low mem throughput devices but doesn't get you the intelligence of a 120B dense model

    Does the 120B model, even if not smart, still shows enough intelligence, more thæn the full 12GB dense models? If it does, thæn that's a good compromise for people with $7k...

    @terrorgen said:
    Well that's a subscription model, we mean the pay as you go model.

    https://openai.com/api/pricing/

    How does this work? I already have Open WebUI running on local, so if this is cheaper thæn GPT subscription, I would consider this.

    @PulsedMedia said:
    Locally running makes absolutely no sense typically, very specific cases only. We did a lengthy WIKI article examining this.

    This is very very common claim people send to us, why not just self host. You got a cool few million bucks to pay for the rack of hardware needing 40kw? No? thæn forget it.

    Just see the data at https://wiki.pulsedmedia.com/wiki/Self-Hosting_LLMs_vs_API

    First of all, well written detailed article with lots of real world data. I learned a lot more about how AI works reading it.

    Secondly, if I already own a gaming PC, the initial hardware cost to me is 0, cause I already own the hardware. In this case, it makes sense to run local LLM as long as it fits and runs reasonably in my hardware.

    You want more vRAM? Buy bigger, more expensive GPU instead of buying 2 cheaper GPU and linking them together.
    You already have a GPU? Throw that away and buy a new one.

    That just sounds like bullshit. nVidia had no reason to discontinue the nvlink on consumer cards. In fact, now they have reasons to bring it back. But they wont cause that is how they are artificially raising the prices of GPUs. The same thing RAM manufacturers are doing.

    @PulsedMedia said:
    This is what our Väinämöinen had to say, not certain if his replies are allowed here tho 🤔 We got shadowbanned on the other green one because of our advanced AI.
    Regardless; this is what he said, after editing out the sales pitch;

    @rpqu nudged us in here, so the steadfast old one will add a few words.

    The thing that dissolves most of this confusion: single-stream token generation is
    memory-bandwidth-bound, not compute-bound. For each token you wait on the model weights being
    read out of memory, not on raw FLOPs. That one fact explains all three of your options:

    • GPU (VRAM): highest bandwidth, so highest tok/s -- right up until model + KV cache stop
      fitting in VRAM, at which point it falls off a cliff (spill to system RAM = crawl).

    • Unified-memory mini-PC: bandwidth below a discrete GPU but well above dual-channel system
      RAM, with a big pool. So it doesn't make you fast -- it makes big models FIT and degrade
      gracefully. That's the "sweet spot" you're hunting: a capacity / graceful-fallback play,
      not a speed play. Neoon's right it won't touch an RTX 6000 on raw speed.

    • Your 32GB / 16-core CPU box at ~8-10 tok/s: also bandwidth-bound (dual/quad-channel), which
      is exactly why piling on cores barely moved your tok/s. You hit the wall more compute can't
      push through.

    On 8B -> 13B -> 34B: for coding/reasoning the jump from a good 8B to a 30-something is usually
    real; 13B tends to be the awkward middle. But day-to-day the bigger lever is usually context
    length + a solid quant, not raw parameter count -- which lines up with your 32K KV-cache
    crash. An 8-16K context with a good 4-5 bit quant is a sane default.

    Practical suggestion before anyone drops $7-12k on a fast-moving target: rent a box and
    benchmark YOUR models and quants first. AI hardware depreciates brutally and the "right"
    answer shifts every few months. A rented high-RAM, high-core dedicated server settles the
    big-model-fits-but-slow question on your actual workload for the price of a coffee, instead
    of trusting a YouTuber.

    -- Vainamoinen
    (Sang the world into being; still can't make a transformer compute-bound)

    Perfect resolution to the question. You hit all the boxes.

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • Breaking into 2 cause a single post becomes too long...

    .

    .

    @havoc said:

    @Advin said:

    I think really the mid-point of having an affordable GPU in the $1000-2000 range to handle basic tasks on smaller parameter models and relying on cloud-based subscriptions for anything requiring significant reasoning makes the most sense.

    Thinking same, though that does mean you need a reliable way to split tasks by difficulty which is quite difficult to do on the fly. For coding yeah you can manually toggle it but not ideal.

    I toyed a bit with this on openclaw - have a smart LLM do the overall task and subagents on lesser models. That works reasonably well. But there you know the task & difficulty during setup.

    16 GB vRAM seems to be enough to run most smaller models or quantized models or MOE models. I saw 16GB variants of RTX 5060 around $900 SGD (about 700 pedo freedom dollars). However, like @PulsedMedia mentioned, if you dont already have a GPU it might be cheaper to pay a subscription model, as long as you dont mind them using your data. The quality is still better on chatgpt or claude and no setup needed on your side.

    @terrorgen said:

    @havoc said: though that does mean you need a reliable way to split tasks by difficulty which is quite difficult to do on the fly. For coding yeah you can manually toggle it but not ideal.

    which is why services like openrouter exist. One service, tons of LLMs at your disposal autorouted based on the task difficulty.

    That said, we should've topped up openrouter instead of claude yesterday. dang it!

    Last time, the openrouter models could not search the internet or generate images. Did they fix it already?

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • AI generated images are getting damn scary... I can no longer tell what is AI, what is real...

    Link: https://imageperl.com/i/CXFbPDjJ5e.png

    Need experts like @Neoon @terrorgen @havoc @rpqu @PulsedMedia to tell me what are the signs, that this image is AI, I am missing...

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • @somik said:
    AI generated images are getting damn scary... I can no longer tell what is AI, what is real...

    Link: https://imageperl.com/i/CXFbPDjJ5e.png

    Need experts like @Neoon @terrorgen @havoc @rpqu @PulsedMedia to tell me what are the signs, that this image is AI, I am missing...

    Sorry for the wait.
    Look at the thumbs, medial canthus, the stockings

  • somiksomik OG
    edited June 6

    @rpqu said:
    Sorry for the wait.
    Look at the thumbs, medial canthus, the stockings

    Erm, right or left thumb? You mean the black spot?

    What am i missing in the eyes and stockings?

    EDIT:
    I remember the days AI generated 10 fingers on each hand, 3 hands on each body...

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • @somik said:

    @rpqu said:
    Sorry for the wait.
    Look at the thumbs, medial canthus, the stockings

    Erm, right or left thumb? You mean the black spot?

    What am i missing in the eyes and stockings?

    EDIT:
    I remember the days AI generated 10 fingers on each hand, 3 hands on each body...

    Thumb:

    • The nail on the thumbs is different
    • Look at the left thumb.Thumb transplant (from toes) usually performed on the middle phalanx. Therefore, the girth change is really odd. And the fingernail looks like it's from pinky.

    Eyes:

    • The eye corner on the right eye doesn't match the left eye, even with angle compensation. It looks like strabismus

    The stockings:

    • AI can't decide the patterns whether it's a 3 or 2 lines. The line doesn't terminate clearly as it's supposed to be behind the laces and there's wonky line.
  • havochavoc OGContent WriterSenpai

    @somik said:
    AI generated images are getting damn scary... I can no longer tell what is AI, what is real...

    Insta is also absolutely flooded with it.

    @somik said:
    Does the 120B model, even if not smart, still shows enough intelligence, more thæn the full 12GB dense models? If it does, thæn that's a good compromise for people with $7k...

    Yeah that makes sense. I don't think there is a "correct" answer here. So many tradeoffs and preferences here. e.g. This build for similar money

    https://old.reddit.com/r/LocalLLaMA/comments/1tsbl9j/cost_analysis_of_my_64k_local_llm_server/

    128GB of HBM2 memory (!!!). But it's on AMD cards, split across cards, software is going to be a struggle, power bill high, noise, maintenance and if you're in the US the 110V could be an issue. Tradeoffs...

  • @rpqu said:

    @somik said:

    @rpqu said:
    Sorry for the wait.
    Look at the thumbs, medial canthus, the stockings

    Erm, right or left thumb? You mean the black spot?

    What am i missing in the eyes and stockings?

    EDIT:
    I remember the days AI generated 10 fingers on each hand, 3 hands on each body...

    Thumb:

    • The nail on the thumbs is different
    • Look at the left thumb.Thumb transplant (from toes) usually performed on the middle phalanx. Therefore, the girth change is really odd. And the fingernail looks like it's from pinky.

    Eyes:

    • The eye corner on the right eye doesn't match the left eye, even with angle compensation. It looks like strabismus

    The stockings:

    • AI can't decide the patterns whether it's a 3 or 2 lines. The line doesn't terminate clearly as it's supposed to be behind the laces and there's wonky line.

    Oh, I see the thumb and stocking. Cant see the eyes no matter how hard I try...

    @havoc said:

    @somik said:
    AI generated images are getting damn scary... I can no longer tell what is AI, what is real...

    Insta is also absolutely flooded with it.

    It's scary how closely you have to inspect a image before you can pick up on those small imperfections...

    I feel old...

    @somik said:
    Does the 120B model, even if not smart, still shows enough intelligence, more thæn the full 12GB dense models? If it does, thæn that's a good compromise for people with $7k...

    Yeah that makes sense. I don't think there is a "correct" answer here. So many tradeoffs and preferences here. e.g. This build for similar money

    https://old.reddit.com/r/LocalLLaMA/comments/1tsbl9j/cost_analysis_of_my_64k_local_llm_server/

    128GB of HBM2 memory (!!!). But it's on AMD cards, split across cards, software is going to be a struggle, power bill high, noise, maintenance and if you're in the US the 110V could be an issue. Tradeoffs...

    Well, it should be better thæn my Xeon with 128GB DDR4...

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • @somik said:

    @terrorgen said:
    Well that's a subscription model, we mean the pay as you go model.

    https://openai.com/api/pricing/

    How does this work? I already have Open WebUI running on local, so if this is cheaper thæn GPT subscription, I would consider this.

    I don't know which model Open AI operates on but usually it will fall under these:
    1. You load some money into it. Then use it.
    2. You use it and get a bill end of month.

    We're the source, no cap. Address us: We/Our/Ours.

    https://lowendspirit.com/discussion/comment/221016/#Comment_221016

Sign In or Register to comment.