How to (Ab)Use your KS-LE-B for LLM Models

Neoon · January 23

So, you got one of these KS-LE-B and want to run some LLM models?
Smol short guide.

Grab the dependencies we need.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git ccache python3-pip python3.13-venv -y

Add a new user which will run the LLM models.
adduser llm

Logon
su llm

Grab llama.cpp

cd
git clone https://github.com/ggml-org/llama.cpp.git

Grab huggingface CLI

curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="/home/llm/.local/bin:$PATH"

I have made a smol script to initial build / update llama.cpp: https://pastebin.com/raw/gKYBcXqc

wget -O update.sh https://pastebin.com/raw/gKYBcXq
chmod +x update.sh
bash update.sh

Lets download our first model.
hf download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir models/

Either you can run llama.cpp on the CLI.
llama.cpp/llama-cli --jinja --model models/GLM-4.7-Flash-Q4_K_M.gguf

or use the Webinterface.
llama.cpp/llama-server --jinja --host 127.0.0.1 --port 8888 --models-dir models/

Including a model autoloader, which you can select in the webinterface.
Add a nginx reverse proxy and you set.

Neoon · January 23

Its fast enough to chat but not blazing fast on CPU.

AnthonySmith · January 23

That's pretty cool !

Neoon · January 23

Before I forgot to mention this.
Try to get Q4 and higher, Q4 is a good balance.

The model mention above, needs 64GB if you have a KS-LE-B with less, try a smoler model.
To optimize performance / results always check the guide for the model.
e.g https://unsloth.ai/docs/models/glm-4.7-flash

lowendmeow · January 23

Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

Neoon · January 23

@lowendmeow said:
Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

I run it bare metal for maximum performance, while running on CPU, everything counts.
I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

lowendmeow · January 23

@Neoon said:

@lowendmeow said:
Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

I run it bare metal for maximum performance, while running on CPU, everything counts.
I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

Oh you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare

Neoon · January 23

@lowendmeow said:

@Neoon said:

@lowendmeow said:
Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back I think it's time to try again. This one puts it in a docker container and you get an API which is cool

I run it bare metal for maximum performance, while running on CPU, everything counts.
I used OpenWebUI before, but ditched it for llama.cpp, same functionality without the cloud shit.

Oh you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare

I didn't bench, Container should not cause a big performance loss, however its gonna cost you a little bit.
I just run it on bare metal.

Neoon · May 31

To enable vision support, MPT and the recommended parameters, you can provide a config file for llama.cpp to load.
I copied the original of the subreddit, this is mine currently.
https://pastebin.com/raw/ZLP5t0fc

You just have to provide --models-preset config.ini
The mmproj model can be found on the hugginface repo, you need to download that for each model.

MTP didn't work on vision for me, so I disabled it.

havoc · May 31

Toying with similar - couple of jobs that I can run overnight on slower devices. So far Gemma 4 26B A4B Q6 seems like the most likely candidate.

Neoon · May 31

If you enable MTP, make sure the model supports it.
You can find Qwen 3.5/3.6 here:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-MTP-GGUF

Neoon · May 31

https://www.reddit.com/r/LocalLLaMA/comments/1tluma3/llamacpp_server_have_builtin_native_tools_exec/

You can enable build-in tools, without anything extra.

Neoon · May 31

Now we are cooking.

deafcon · May 31

I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Neoon · May 31

@deafcon said:
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.

somik · June 1

@Neoon said:

@deafcon said:
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.

You are using Qwen for photo input (OCR) or picture generation?

CMunroe · June 1

Man I need someone to give me some hardware advice and suggestions for Agentic coding.

Like what do I need to buy?

somik · June 1

@CMunroe said:
Man I need someone to give me some hardware advice and suggestions for Agentic coding.

Like what do I need to buy?

From my browser history:

WARNING! This guy keeps shaking his head. If you are triggered by it, listen to the video, dont watch it

Neoon · June 1

@somik said:

@Neoon said:

@deafcon said:
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.

You are using Qwen for photo input (OCR) or picture generation?

just input, image generation on CPU is painful.

somik · June 1

@Neoon said:

@somik said:

@Neoon said:

@deafcon said:
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.

You are using Qwen for photo input (OCR) or picture generation?

just input, image generation on CPU is painful.

If you are using any Qwen VL model, image input works fine out of the box, right? I mean almost all of the qwen instruct models i use are also vision capable.

Neoon · June 1

@somik said:

@Neoon said:

@somik said:

@Neoon said:

@deafcon said:
I don't have a KS-LE-B, but I do have a Xeon Gold 6212u with 80 gigs locally that is idle most of the time. Have you played with vision at all? Can it actually do anything worthwhile on CPU inference?

Vision on Qwen is amaze, other models suck ass.
The higher the res, the longer it takes obviously.

You are using Qwen for photo input (OCR) or picture generation?

just input, image generation on CPU is painful.

If you are using any Qwen VL model, image input works fine out of the box, right? I mean almost all of the qwen instruct models i use are also vision capable.

Yea, you juse load the correct model in llama.cpp with vision enabled.
You click upload and it processes your picture.

The upload button is disabled until the model is loaded though.

John_Q_Developer · June 1

Thanks for the guide I won't use it personally (I think LocalLLMs don't have the performance/benefit ratio yet without multiple GPUs as I've said elsewhere) but I'm sure people who are local llm curious will

Neoon · June 1

Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Neoon · June 3

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

deafcon · June 3

@Neoon said:

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Thanks a lot for the link to that repo! I went ahead and installed it on the server I mentioned early in this thread (if you want the thread to be limited to baguettes, say so and I'll shut up). Anyway, I was able to get 20 tokens per second on Qwen 3.6 35B MoE Q4 and 11 tokens per second on Q8. That is acceptable performance for local inference in my opinion. Now I have to see if it's actually smart enough to do anything with.

Neoon · June 3

@deafcon said:

@Neoon said:

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Thanks a lot for the link to that repo! I went ahead and installed it on the server I mentioned early in this thread (if you want the thread to be limited to baguettes, say so and I'll shut up). Anyway, I was able to get 20 tokens per second on Qwen 3.6 35B MoE Q4 and 11 tokens per second on Q8. That is acceptable performance for local inference in my opinion. Now I have to see if it's actually smart enough to do anything with.

what hardware? Q6 on KS-LE-B was 9/t for me.

deafcon · June 3

@Neoon said:

@deafcon said:

@Neoon said:

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Thanks a lot for the link to that repo! I went ahead and installed it on the server I mentioned early in this thread (if you want the thread to be limited to baguettes, say so and I'll shut up). Anyway, I was able to get 20 tokens per second on Qwen 3.6 35B MoE Q4 and 11 tokens per second on Q8. That is acceptable performance for local inference in my opinion. Now I have to see if it's actually smart enough to do anything with.

what hardware? Q6 on KS-LE-B was 9/t for me.

It's a Xeon Gold 6212U with 80 gigs of DDR4. I ran the tests with 36 threads, but I haven't tried the full 48 yet.

Neoon · June 3

@deafcon said:

@Neoon said:

@deafcon said:

@Neoon said:

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Thanks a lot for the link to that repo! I went ahead and installed it on the server I mentioned early in this thread (if you want the thread to be limited to baguettes, say so and I'll shut up). Anyway, I was able to get 20 tokens per second on Qwen 3.6 35B MoE Q4 and 11 tokens per second on Q8. That is acceptable performance for local inference in my opinion. Now I have to see if it's actually smart enough to do anything with.

what hardware? Q6 on KS-LE-B was 9/t for me.

It's a Xeon Gold 6212U with 80 gigs of DDR4. I ran the tests with 36 threads, but I haven't tried the full 48 yet.

Interesting, my local setup gives me about the same as the KS-LE-B.

deafcon · June 4

@Neoon said:

@deafcon said:

@Neoon said:

@deafcon said:

@Neoon said:

@Neoon said:
Look mom, we are on TV
https://point.free/blog/gemma-4-on-a-2016-xeon/

Actually, ik has a webinterface now.
github.com/ikawrakow/ik_llama.cpp

its pretty barebones and it has way less features than the original llama.cpp.
BUT, I get about 9t/s stable on the KS-LE-B vs the 6t/s on the llama.cpp one with Qwen 3.6 35B

Thanks a lot for the link to that repo! I went ahead and installed it on the server I mentioned early in this thread (if you want the thread to be limited to baguettes, say so and I'll shut up). Anyway, I was able to get 20 tokens per second on Qwen 3.6 35B MoE Q4 and 11 tokens per second on Q8. That is acceptable performance for local inference in my opinion. Now I have to see if it's actually smart enough to do anything with.

what hardware? Q6 on KS-LE-B was 9/t for me.

It's a Xeon Gold 6212U with 80 gigs of DDR4. I ran the tests with 36 threads, but I haven't tried the full 48 yet.

Interesting, my local setup gives me about the same as the KS-LE-B.

I've got a Ultra 7 265F with a 5070ti as well, but that machine only has 32gb of ram. Q8 is like 36 gigs I think, so I don't know if I could even run that model. I am curious how it would compare though. This is the first time I've dipped my toe into the local inference world except that I got an older model running on a 1060 that was both slow and dumb so I gave up immediately.

Neoon · June 4

I can fit Q4 on my 32gigs, Q6 is too tight.
Q6 fits easily on the 64gig idle dedi though.

Google just released a new gemma model, might be worth trying, just 12B so fits 16GB.
https://www.reddit.com/r/LocalLLaMA/comments/1tvtn6m/googlegemma412b_hugging_face/

Neoon · June 7

https://www.reddit.com/r/LocalLLaMA/comments/1tzbcyp/llamacpp_gemma4_mtp_support_merged/

How to (Ab)Use your KS-LE-B for LLM Models

Comments