How to (Ab)Use your KS-LE-B for LLM Models
So, you got one of these KS-LE-B boxes and want to run some LLMs?
Here is a smol guide.
Grab the dependencies we need.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git ccache python3-pip python3.13-venv -y
Add a new user which will run the LLM models.
adduser llm
Log on as that user.
su llm
Grab llama.cpp
cd
git clone https://github.com/ggml-org/llama.cpp.git
Grab huggingface CLI
curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="/home/llm/.local/bin:$PATH"
I have made a smol script that does the initial build of llama.cpp and updates it later: https://pastebin.com/raw/gKYBcXqc
wget -O update.sh https://pastebin.com/raw/gKYBcXqc
chmod +x update.sh
bash update.sh
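If you would rather not run a script straight from a pastebin, here is a minimal sketch of what such a build/update script might do. It is not the pastebin script. It assumes the checkout lives in ~/llama.cpp and uses the standard CMake build; the names REPO_DIR, DRY_RUN, run and update_llama are made up for this example. Note that a plain CMake build puts the binaries in build/bin/, so the paths used further down in this guide may differ depending on what the original script does.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the pastebin script: pull the repo, then rebuild.

REPO_DIR="${REPO_DIR:-$HOME/llama.cpp}"   # assumed checkout location

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_llama() {
  # Fast-forward only, so local changes are never silently merged.
  run git -C "$REPO_DIR" pull --ff-only || return 1
  # Configure and build a CPU-only release build.
  run cmake -S "$REPO_DIR" -B "$REPO_DIR/build" -DCMAKE_BUILD_TYPE=Release || return 1
  run cmake --build "$REPO_DIR/build" --config Release -j "$(nproc)" || return 1
}

# Preview the steps without touching anything.
DRY_RUN=1 update_llama
```

Call update_llama without DRY_RUN=1 to actually pull and build.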
Let's download our first model.
hf download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir models/
You can either run llama.cpp on the CLI:
llama.cpp/llama-cli --jinja --model models/GLM-4.7-Flash-Q4_K_M.gguf
or use the web interface:
llama.cpp/llama-server --jinja --host 127.0.0.1 --port 8888 --models-dir models/
The web interface includes a model autoloader, so you can select the model there.
Add an nginx reverse proxy and you're set.
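For the reverse proxy, here is a rough sketch that prints a minimal nginx server block pointing at the llama-server address used above. The hostname llm.example.com and the function name nginx_conf are placeholders, and the timeout value is a guess for slow CPU inference. It deliberately leaves out TLS and authentication, which you will want before exposing this to the internet.

```shell
#!/usr/bin/env bash
# Print a minimal nginx server block that proxies to llama-server.
# Arguments: <server_name> <upstream host:port>. Both are up to you.
nginx_conf() {
  local server_name="$1" upstream="$2"
  cat <<EOF
server {
    listen 80;
    server_name ${server_name};

    location / {
        proxy_pass http://${upstream};
        proxy_http_version 1.1;
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_buffering off;       # pass tokens through as they are generated
        proxy_read_timeout 3600s;  # assumed value, CPU inference can take a while
    }
}
EOF
}

# llm.example.com is a placeholder; 127.0.0.1:8888 matches the llama-server command above.
nginx_conf llm.example.com 127.0.0.1:8888
```

As root, redirect the output into a file under your nginx config directory, then check it with nginx -t before reloading.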

Comments
It's fast enough to chat, but not blazing fast on CPU.

Free NAT KVM | Free NAT LXC
That's pretty cool !
TierHive - Hourly VPS - NAT Native - /24 per customer - DE, UK, SG, CA, USA x3, FR, AU, PL, NL
FREE tokens on sign up, try before you buy. | Join us on Reddit
I forgot to mention this earlier.
Try to get Q4 or higher; Q4 is a good balance.
The model mentioned above needs 64GB. If you have a KS-LE-B with less, try a smoler model.
To optimize performance / results always check the guide for the model.
e.g. https://unsloth.ai/docs/models/glm-4.7-flash
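To go with the memory note above, here is a rough sketch that compares the size of a downloaded GGUF file against the RAM in the box. The 20% headroom for context and the OS is an assumption, not a measured figure, and fits_in_ram is a made-up helper name. The file path is the one from the llama-cli example in the guide.

```shell
#!/usr/bin/env bash
# Rough check: does a model file fit in RAM with some headroom?
# The 20% headroom is an assumption, adjust it to taste.
fits_in_ram() {
  local model_bytes="$1" mem_bytes="$2"
  # Integer math: model * 1.2 <= mem  is the same as  model * 12 <= mem * 10
  [ $(( model_bytes * 12 )) -le $(( mem_bytes * 10 )) ]
}

model="models/GLM-4.7-Flash-Q4_K_M.gguf"
if [ -f "$model" ]; then
  model_bytes=$(stat -c %s "$model")                        # file size in bytes (GNU stat)
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)     # total RAM in kB
  if fits_in_ram "$model_bytes" $(( mem_kb * 1024 )); then
    echo "model should fit"
  else
    echo "model is likely too big, try a smoler quant or model"
  fi
fi
```

This only looks at file size, so treat a "should fit" as a starting point and not a guarantee; long contexts need extra memory on top.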
Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back. I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.
I run it on bare metal for maximum performance. When running on CPU, everything counts.
I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.
Oh you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare
I didn't bench it. A container should not cause a big performance loss, but it's gonna cost you a little bit.
I just run it on bare metal.
To enable vision support, MTP and the recommended parameters, you can provide a config file for llama.cpp to load.
I copied the original from the subreddit; this is my current one.
https://pastebin.com/raw/ZLP5t0fc
You just have to provide --models-preset config.ini
The mmproj model can be found in the Hugging Face repo; you need to download it for each model.
MTP didn't work on vision for me, so I disabled it.
Toying with similar - couple of jobs that I can run overnight on slower devices. So far Gemma 4 26B A4B Q6 seems like the most likely candidate.
If you enable MTP, make sure the model supports it.
You can find Qwen 3.5/3.6 here:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-MTP-GGUF
https://www.reddit.com/r/LocalLLaMA/comments/1tluma3/llamacpp_server_have_builtin_native_tools_exec/
You can enable built-in tools without anything extra.
Now we are cooking.
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?
Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.
Are you using Qwen for photo input (OCR) or picture generation?
Man I need someone to give me some hardware advice and suggestions for Agentic coding.
Like what do I need to buy?
From my browser history:
WARNING! This guy keeps shaking his head. If you are triggered by it, listen to the video, dont watch it
Just input. Image generation on CPU is painful.
If you are using any Qwen VL model, image input works fine out of the box, right? I mean almost all of the qwen instruct models i use are also vision capable.
Yea, you just load the correct model in llama.cpp with vision enabled.
You click upload and it processes your picture.
The upload button is disabled until the model is loaded though.
Thanks for the guide
I won't use it personally (I think LocalLLMs don't have the performance/benefit ratio yet without multiple GPUs as I've said elsewhere) but I'm sure people who are local llm curious will 
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/
Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp
It's pretty barebones and has way fewer features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B
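To put those numbers in perspective, here is a quick back-of-the-envelope sketch of how long a reply takes at each speed. The 500-token reply length is a made-up example, and gen_seconds is a hypothetical helper; it ignores prompt processing time entirely.

```shell
#!/usr/bin/env bash
# Seconds needed to generate N tokens at a given tokens-per-second rate.
gen_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

gen_seconds 500 6   # llama.cpp figure quoted above
gen_seconds 500 9   # ik_llama.cpp figure quoted above
```

That works out to roughly 83 seconds versus 56 seconds for the same reply, so the 1.5x difference is very noticeable in chat.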