How to (Ab)Use your KS-LE-B for LLM Models

Neoon (OG, Content Writer, Senpai)
edited January 23 in Technical

So, you got one of these KS-LE-B boxes and want to run some LLM models?
Here's a smol guide.

Grab the dependencies we need.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git ccache python3-pip python3.13-venv -y

Add a new user which will run the LLM models.
adduser llm

Log in as the new user.
su llm

Grab llama.cpp

cd
git clone https://github.com/ggml-org/llama.cpp.git

Grab the Hugging Face CLI

curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="/home/llm/.local/bin:$PATH"
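
Optionally, make that PATH change permanent for the llm user and check that the CLI is found (assuming the installer drops hf into ~/.local/bin, as the export above suggests):

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
command -v hf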

I made a smol script to do the initial build of llama.cpp and update it later: https://pastebin.com/raw/gKYBcXqc

wget -O update.sh https://pastebin.com/raw/gKYBcXqc
chmod +x update.sh
bash update.sh
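
If you are curious what the script does, here is a rough sketch of such a build/update script (just an illustration, not the actual pastebin content; it assumes the binaries end up in the repo root, matching the commands later in this guide):

#!/usr/bin/env bash
# Sketch of a build/update script for llama.cpp (illustrative, not the real update.sh).
set -euo pipefail

cd ~/llama.cpp
git pull

# CPU-only release build; ccache (installed earlier) is picked up automatically if present.
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

# Put the binaries where this guide expects them (llama.cpp/llama-cli, llama.cpp/llama-server).
cp build/bin/llama-cli build/bin/llama-server .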

Let's download our first model.
hf download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir models/
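
Quick sanity check: see what landed in models/ and whether it will roughly fit in RAM (this Q4_K_M quant needs about 64GB):

ls -lh models/
free -h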

You can either run llama.cpp on the CLI:
llama.cpp/llama-cli --jinja --model models/GLM-4.7-Flash-Q4_K_M.gguf
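
On CPU it pays to pin the thread count and context size. A hypothetical example; flag names can change between llama.cpp versions, so check llama-cli --help:

llama.cpp/llama-cli --jinja \
  --model models/GLM-4.7-Flash-Q4_K_M.gguf \
  --threads "$(nproc)" \
  --ctx-size 8192 \
  -p "Write a haiku about cheap dedicated servers."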

or use the web interface:
llama.cpp/llama-server --jinja --host 127.0.0.1 --port 8888 --models-dir models/
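
llama-server also exposes an OpenAI-compatible API, so you can hit it from scripts too. A minimal example (the model name is just a label here, adjust it to whatever you loaded):

curl http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.7-Flash-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'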

The server includes a model autoloader, so you can pick a model right from the web interface.
Add an nginx reverse proxy (sketch below) and you're set.
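
A minimal sketch of such a reverse proxy, done as root (server_name, TLS and auth are up to you; the buffering tweak helps with streamed responses):

cat > /etc/nginx/sites-available/llm <<'EOF'
server {
    listen 80;
    server_name llm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8888;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Streamed (SSE) responses work better without buffering.
        proxy_buffering off;
    }
}
EOF
ln -s /etc/nginx/sites-available/llm /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx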

Comments

  • Neoon (OG, Content Writer, Senpai)

    It's fast enough to chat, but not blazing fast on CPU.

  • backtogeek (Hosting Provider, OG, Senpai)

    That's pretty cool!


  • Neoon (OG, Content Writer, Senpai)

    Before I forget to mention this:
    Try to get Q4 or higher; Q4 is a good balance.

    The model mentioned above needs 64GB; if you have a KS-LE-B with less, try a smoler model (example below).
    To optimize performance and results, always check the guide for the model,
    e.g. https://unsloth.ai/docs/models/glm-4.7-flash
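
    For example, something like this pulls a smaller GGUF at Q4 (the repo name is only an illustration, pick whatever fits your RAM):

    hf download unsloth/Qwen3-4B-GGUF --include "*Q4_K_M*" --local-dir models/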

  • lowendmeow

    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

  • NeoonNeoon OGContent WriterSenpai

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

  • lowendmeow

    @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

    Oh, you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare.

  • Neoon (OG, Content Writer, Senpai)

    @lowendmeow said:

    @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

    Oh, you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare.

    I didn't bench. A container shouldn't cause a big performance loss, but it's gonna cost you a little bit.
    I just run it on bare metal.
