How to (Ab)Use your KS-LE-B for LLM Models

NeoonNeoon OGContent WriterSenpai
edited January 23 in Technical

So, you got one of these KS-LE-B and want to run some LLM models?
Smol short guide.

Grab the dependencies we need.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git ccache python3-pip python3.13-venv -y

Add a new user which will run the LLM models.
adduser llm

Logon
su llm

Grab llama.cpp

cd
git clone https://github.com/ggml-org/llama.cpp.git

Grab huggingface CLI

curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="/home/llm/.local/bin:$PATH"

I have made a smol script to initial build / update llama.cpp: https://pastebin.com/raw/gKYBcXqc

wget -O update.sh https://pastebin.com/raw/gKYBcXq
chmod +x update.sh
bash update.sh

Lets download our first model.
hf download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir models/

Either you can run llama.cpp on the CLI.
llama.cpp/llama-cli --jinja --model models/GLM-4.7-Flash-Q4_K_M.gguf

or use the Webinterface.
llama.cpp/llama-server --jinja --host 127.0.0.1 --port 8888 --models-dir models/

Including a model autoloader, which you can select in the webinterface.
Add a nginx reverse proxy and you set.

Comments

  • NeoonNeoon OGContent WriterSenpai

    Its fast enough to chat but not blazing fast on CPU.

    Thanked by (2)oloke ariq01
  • AnthonySmithAnthonySmith AdministratorHosting ProviderOGSenpai

    That's pretty cool !

    TierHive - Hourly VPS - NAT Native - /24 per customer - DE, UK, SG, CA, USA x3, FR, AU, PL, NL
    FREE tokens on sign up, try before you buy. | Join us on Reddit

  • NeoonNeoon OGContent WriterSenpai

    Before I forgot to mention this.
    Try to get Q4 and higher, Q4 is a good balance.

    The model mention above, needs 64GB if you have a KS-LE-B with less, try a smoler model.
    To optimize performance / results always check the guide for the model.
    e.g https://unsloth.ai/docs/models/glm-4.7-flash

  • Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

    Thanked by (1)Freek
  • NeoonNeoon OGContent WriterSenpai

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

    I run it bare metal for maximum performance, while running on CPU, everything counts.
    I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

  • @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

    I run it bare metal for maximum performance, while running on CPU, everything counts.
    I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

    Oh you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare

  • NeoonNeoon OGContent WriterSenpai

    @lowendmeow said:

    @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

    I run it bare metal for maximum performance, while running on CPU, everything counts.
    I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

    Oh you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare

    I didn't bench, Container should not cause a big performance loss, however its gonna cost you a little bit.
    I just run it on bare metal.

  • NeoonNeoon OGContent WriterSenpai
    edited May 31

    To enable vision support, MPT and the recommended parameters, you can provide a config file for llama.cpp to load.
    I copied the original of the subreddit, this is mine currently.
    https://pastebin.com/raw/ZLP5t0fc

    You just have to provide --models-preset config.ini
    The mmproj model can be found on the hugginface repo, you need to download that for each model.

    MTP didn't work on vision for me, so I disabled it.

    Thanked by (1)localhost
  • havochavoc OGContent WriterSenpai

    Toying with similar - couple of jobs that I can run overnight on slower devices. So far Gemma 4 26B A4B Q6 seems like the most likely candidate.

  • NeoonNeoon OGContent WriterSenpai

    If you enable MTP, make sure the model supports it.
    You can find Qwen 3.5/3.6 here:
    https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
    https://huggingface.co/unsloth/Qwen3.5-35B-A3B-MTP-GGUF

  • NeoonNeoon OGContent WriterSenpai
    edited May 31
  • NeoonNeoon OGContent WriterSenpai

    Now we are cooking.

  • I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

  • NeoonNeoon OGContent WriterSenpai

    @deafcon said:
    I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

    Vision on Qwen is amaze, other models suck ass.
    The higher the res, the longer it takes obviously.

    Thanked by (1)deafcon
  • @Neoon said:

    @deafcon said:
    I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

    Vision on Qwen is amaze, other models suck ass.
    The higher the res, the longer it takes obviously.

    You are using Qwen for photo input (OCR) or picture generation?

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • Man I need someone to give me some hardware advice and suggestions for Agentic coding.

    Like what do I need to buy?

  • somiksomik OG
    edited June 1

    @CMunroe said:
    Man I need someone to give me some hardware advice and suggestions for Agentic coding.

    Like what do I need to buy?

    From my browser history:

    WARNING! This guy keeps shaking his head. If you are triggered by it, listen to the video, dont watch it :lol:

    Thanked by (1)CMunroe

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • NeoonNeoon OGContent WriterSenpai

    @somik said:

    @Neoon said:

    @deafcon said:
    I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

    Vision on Qwen is amaze, other models suck ass.
    The higher the res, the longer it takes obviously.

    You are using Qwen for photo input (OCR) or picture generation?

    just input, image generation on CPU is painful.

  • @Neoon said:

    @somik said:

    @Neoon said:

    @deafcon said:
    I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

    Vision on Qwen is amaze, other models suck ass.
    The higher the res, the longer it takes obviously.

    You are using Qwen for photo input (OCR) or picture generation?

    just input, image generation on CPU is painful.

    If you are using any Qwen VL model, image input works fine out of the box, right? I mean almost all of the qwen instruct models i use are also vision capable.

    I speak fluent sarcasm and broken logic. | I would agree with you, but thæn we’d both be wrong.

  • NeoonNeoon OGContent WriterSenpai
    edited June 1

    @somik said:

    @Neoon said:

    @somik said:

    @Neoon said:

    @deafcon said:
    I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

    Vision on Qwen is amaze, other models suck ass.
    The higher the res, the longer it takes obviously.

    You are using Qwen for photo input (OCR) or picture generation?

    just input, image generation on CPU is painful.

    If you are using any Qwen VL model, image input works fine out of the box, right? I mean almost all of the qwen instruct models i use are also vision capable.

    Yea, you juse load the correct model in llama.cpp with vision enabled.
    You click upload and it processes your picture.

    The upload button is disabled until the model is loaded though.

  • Thanks for the guide :) I won't use it personally (I think LocalLLMs don't have the performance/benefit ratio yet without multiple GPUs as I've said elsewhere) but I'm sure people who are local llm curious will :)

  • NeoonNeoon OGContent WriterSenpai

    Actually, ik has a webinterface now.
    github.com/ikawrakow/ik_llama.cpp

    its pretty barebones and it has way less features than the original llama.cpp.
    BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Sign In or Register to comment.