How to (Ab)Use your KS-LE-B for LLM Models

Neoon (OG, Content Writer, Senpai)
edited January 23 in Technical

So, you got one of these KS-LE-B boxes and want to run some LLM models?
Here's a smol guide.

Grab the dependencies we need.
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git ccache python3-pip python3.13-venv -y

Add a new user which will run the LLM models.
adduser llm

Log in as the new user.
su llm

Grab llama.cpp

cd
git clone https://github.com/ggml-org/llama.cpp.git

Grab the Hugging Face CLI

curl -LsSf https://hf.co/cli/install.sh | bash
export PATH="/home/llm/.local/bin:$PATH"
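
Optionally, make that PATH change permanent for the llm user and check that the CLI is found (assuming the installer drops hf into ~/.local/bin, as the export above suggests):

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
command -v hf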

I made a smol script to do the initial build of llama.cpp and update it later: https://pastebin.com/raw/gKYBcXqc

wget -O update.sh https://pastebin.com/raw/gKYBcXqc
chmod +x update.sh
bash update.sh
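
If you are curious what the script does, here is a rough sketch of such a build/update script (just an illustration, not the actual pastebin content; it assumes the binaries end up in the repo root, matching the commands later in this guide):

#!/usr/bin/env bash
# Sketch of a build/update script for llama.cpp (illustrative, not the real update.sh).
set -euo pipefail

cd ~/llama.cpp
git pull

# CPU-only release build; ccache (installed earlier) is picked up automatically if present.
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"

# Put the binaries where this guide expects them (llama.cpp/llama-cli, llama.cpp/llama-server).
cp build/bin/llama-cli build/bin/llama-server .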

Let's download our first model.
hf download unsloth/GLM-4.7-Flash-GGUF --include "*Q4_K_M*" --local-dir models/
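
Quick sanity check: see what landed in models/ and whether it will roughly fit in RAM (this Q4_K_M quant needs about 64GB):

ls -lh models/
free -h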

You can either run llama.cpp on the CLI:
llama.cpp/llama-cli --jinja --model models/GLM-4.7-Flash-Q4_K_M.gguf
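
On CPU it pays to pin the thread count and context size. A hypothetical example; flag names can change between llama.cpp versions, so check llama-cli --help:

llama.cpp/llama-cli --jinja \
  --model models/GLM-4.7-Flash-Q4_K_M.gguf \
  --threads "$(nproc)" \
  --ctx-size 8192 \
  -p "Write a haiku about cheap dedicated servers."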

or use the web interface:
llama.cpp/llama-server --jinja --host 127.0.0.1 --port 8888 --models-dir models/
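
llama-server also exposes an OpenAI-compatible API, so you can hit it from scripts too. A minimal example (the model name is just a label here, adjust it to whatever you loaded):

curl http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "GLM-4.7-Flash-Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}'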

The server includes a model autoloader, so you can pick a model right from the web interface.
Add an nginx reverse proxy (sketch below) and you're set.
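
A minimal sketch of such a reverse proxy, done as root (server_name, TLS and auth are up to you; the buffering tweak helps with streamed responses):

cat > /etc/nginx/sites-available/llm <<'EOF'
server {
    listen 80;
    server_name llm.example.com;

    location / {
        proxy_pass http://127.0.0.1:8888;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Streamed (SSE) responses work better without buffering.
        proxy_buffering off;
    }
}
EOF
ln -s /etc/nginx/sites-available/llm /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx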

Comments

  • Neoon (OG, Content Writer, Senpai)

    It's fast enough to chat, but not blazing fast on CPU.

  • backtogeek (Hosting Provider, OG, Senpai)

    That's pretty cool!


  • Neoon (OG, Content Writer, Senpai)

    Before I forget to mention this:
    Try to get Q4 or higher; Q4 is a good balance.

    The model mentioned above needs 64GB; if you have a KS-LE-B with less, try a smoler model (example below).
    To optimize performance and results, always check the guide for the model,
    e.g. https://unsloth.ai/docs/models/glm-4.7-flash
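
    For example, something like this pulls a smaller GGUF at Q4 (the repo name is only an illustration, pick whatever fits your RAM):

    hf download unsloth/Qwen3-4B-GGUF --include "*Q4_K_M*" --local-dir models/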

  • lowendmeow

    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

  • NeoonNeoon OGContent WriterSenpai

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

  • lowendmeow

    @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

    Oh, you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare.

  • Neoon (OG, Content Writer, Senpai)

    @lowendmeow said:

    @Neoon said:

    @lowendmeow said:
    Nice! I messed around with https://github.com/mudler/LocalAI on my OVH server a long time back; I think it's time to try again. This one puts it in a Docker container and you get an API, which is cool.

    I run it on bare metal for maximum performance; when running on CPU, everything counts.
    I used OpenWebUI before but ditched it for llama.cpp: same functionality without the cloud shit.

    Oh, you really take that much of a performance hit running within Docker? Do you know how many tokens per second you were getting? I can compare.

    I didn't bench. A container shouldn't cause a big performance loss, but it's gonna cost you a little bit.
    I just run it on bare metal.
