Install Llama on a GPU server

havoc OG Content Writer
edited September 2023 in LES Talk

Busy testing the GPU servers per @crunchbits' thread, so I jotted down some notes on how to get a fresh Ubuntu server talking to a Llama model. Note this is on a 16 GB GPU - if you're on a smaller one you'll need to change the q8_0 part to q4_0 or even one of the q3_K variants.

Also note that here I'm downloading an fp16 model and converting it to a q8 GGUF. In practice you can skip those steps and just download ready-made quantized GGUF models from TheBloke's Hugging Face repos, i.e. point the download-model step at a quantized GGUF model and skip the generate and quantize steps after that.
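
For example, a minimal sketch of grabbing a single pre-quantized file instead of the whole fp16 repo (repo and filename are illustrative - check the repo's file list in a browser for the exact name):

python3 -m pip install huggingface_hub
python3 -c 'from huggingface_hub import hf_hub_download; hf_hub_download(repo_id="TheBloke/Llama-2-13B-chat-GGUF", filename="llama-2-13b-chat.Q8_0.gguf", local_dir="/root/llama.cpp/models/")'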

This assumes Ubuntu 22.04 - you may need to do stuff like install python3 if you're on a different distro

Check that we have a GPU

apt update && apt upgrade -y
apt install hwinfo -y
hwinfo --gfxcard --short
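
If hwinfo doesn't show anything useful, lspci is a decent cross-check:

lspci | grep -i nvidia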

Set up nvidia driver and SDK

apt install nvidia-driver-535-server nvidia-dkms-535-server nvidia-cuda-toolkit -y
reboot
nvidia-smi
nvcc --version
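
If nvidia-smi errors out after the reboot, the kernel module probably didn't build; a quick way to check whether the nvidia module was built for your running kernel:

dkms status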

Grab llama.cpp and build it

git clone https://github.com/ggerganov/llama.cpp
apt install cmake -y
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd ..
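
Heads-up if you're following this on a newer checkout: upstream has since renamed the CUDA flag, so if -DLLAMA_CUBLAS is rejected, the configure step should instead be:

cmake .. -DGGML_CUDA=ON

The binaries were also renamed along the way (main -> llama-cli, quantize -> llama-quantize), so adjust the later steps to match.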

Download a model

mkdir -p /root/llama.cpp/models/llama2-fp16
python3 -m pip install huggingface_hub
python3
from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/Llama-2-13B-Chat-fp16", revision="main", local_dir="/root/llama.cpp/models/llama2-fp16/")
quit()
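
The same download also works non-interactively if you'd rather script it:

python3 -c 'from huggingface_hub import snapshot_download; snapshot_download(repo_id="TheBloke/Llama-2-13B-Chat-fp16", revision="main", local_dir="/root/llama.cpp/models/llama2-fp16/")'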

Generate GGUF file

python3 -m pip install gguf sentencepiece
python3 convert.py ./models/llama2-fp16/
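
This writes ggml-model-f16.gguf into the model directory. On newer llama.cpp checkouts convert.py has been replaced by convert_hf_to_gguf.py, which takes roughly the same arguments:

python3 convert_hf_to_gguf.py ./models/llama2-fp16/ --outtype f16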

Quantize it

cd ./build/bin
./quantize ../../models/llama2-fp16/ggml-model-f16.gguf ../../models/llama2-q8.gguf q8_0
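
If you're on a smaller card, this is the step where you'd pick a lighter quant instead, e.g.:

./quantize ../../models/llama2-fp16/ggml-model-f16.gguf ../../models/llama2-q4.gguf q4_0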

Run it

./main -m ../../models/llama2-q8.gguf -ngl 99 --color -p "Tell me a story about a unicorn!"

Tell me a story about a unicorn!

Once upon a time, in a far-off land of rolling hills and sparkling streams, there lived a beautiful unicorn named Luna. She had a shimmering coat of silver and white, and her horn was as bright as the stars in the night sky.

Luna lived a peaceful life, roaming the forests and meadows, and making friends with all the creatures she met. She loved to play with the butterflies and dance with the flowers, and she could make the most beautiful music with her horn.

One day, a wicked witch cast a spell on the land, causing all the plants and animals to become sick and tired. The unicorns were especially affected, and their beautiful coats became dull and lifeless.

Luna knew that she had to do something to save her friends and the land they lived in. She set out on a journey to find the witch and break her spell.

As she traveled through the forest, Luna met many creatures who were suffering from the witch's spell. She used her horn to heal them and bring them back to life. She also met a brave knight who had been searching for the witch for many years. Together, they journeyed on, determined to defeat the wicked witch and bring peace back to the land.

Finally, after many days of traveling, they came to the witch's castle. It was a dark and gloomy place, surrounded by a moat of swirling black water. But Luna was not afraid. She knew that her horn could break any spell, no matter how powerful.

She and the knight entered the castle, ready to face whatever dangers lay inside. As they made their way deeper into the castle, they came across the witch herself. She was a terrifying sight, with warts and a crooked nose, and a cackle that sent chills down your spine.

But Luna was not afraid. She raised her horn and pointed it at the witch, ready to break the spell. The witch laughed and tried to stop her, but Luna's horn was too powerful. With one blast of magic, the spell was broken, and the land was once again filled with light and life.

The creatures who had been turned to stone were returned to their true forms, and they cheered and celebrated as Luna and the knight emerged from the castle. The witch was banished from the land forever, and peace was restored.

And Luna, the little unicorn with the powerful horn, lived happily ever after, knowing that she had saved her homeland from the evil witch's spell. The end.
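
A few other main flags worth knowing: -i drops you into interactive chat instead of one-shot generation, -c sets the context size (4096 for Llama 2), and -n caps how many tokens get generated. For example:

./main -m ../../models/llama2-q8.gguf -ngl 99 --color -c 4096 -n 512 -i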

Comments

  • Looking good, finally someone not just using Stable Diffusion.

    Are you considering benchmarking a few more models? i.e. comparing them at creating stories, code generation, code completion (I'm more interested in them being able to spot bugs, however).
    Other than that, maybe try some 'uncensored' models. If it gets naughty you can just post the conclusion in this thread.

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • havoc OG Content Writer

    @Encoders said:
    Looking good, finally someone not just using Stable Diffusion.

    Yep - the LLMs seem much more interesting to me in the long run & definitely the part I want to learn more about in the host-it-yourself context. This stuff is absolutely gonna change the world.

    Are you considering benchmarking a few more models?

    It's incredibly hard to benchmark them in any way that is meaningful tbh, so not planning to. I do try out a lot of different ones though because they do have very different vibes.

    code generation, code completion

    Code generation, yes - the Code Llama ones are pretty good at generating pieces. Code completion, no. Haven't figured out how to hook the Copilot extension into a local model for completion. Tried but failed thus far.

    The local stuff is already solid at explaining code though. That code was generated by the same model too.

    And also responds well to follow up questions.

    (I'm more interested in them being able to spot bugs, however.)

    For that I still end up using GPT-4 when I'm really stuck - especially not for coding but for Linux things, like when OpenCL is broken or whatever. Trying to make an effort to ask local models first though, so that I can learn their failure points better & develop an intuition for it.

    Other than that, maybe try some 'uncensored' models. If it gets naughty you can just post the conclusion in this thread.

    Not super interested in that angle tbh. 'Uncensored' - I reckon its importance is exaggerated. Tried asking one to generate a spicy story to see what the fuss is about, and it understood the request and complied, but... just so damn bland. I could see them being good at stories though... Dungeons & Dragons style.

  • Thanks a lot for this tutorial! I had tried to play around with llama.cpp in the past, but after getting the main program compiled, I could never figure out how to get models. (And judging by your instructions, I never would have figured it out on my own.)

    Worked like a charm, and now I'm off generating my own unicorn stories.

  • havoc OG Content Writer

    @rpollestad said:
    Worked like a charm, and now I'm off generating my own unicorn stories.

    Woohoo!

    If you're looking for something more user friendly, this works well:
    https://github.com/oobabooga/text-generation-webui

    @rpollestad said:
    I could never figure out how to get models.

    You can technically wget the files off Hugging Face too, but this way is cleaner as long as the repo is split into branches. Some are not, so downloading "main" ends up unnecessarily downloading all the quantization combinations. So it's always worth glancing at the repo in a browser first.
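
    Something like this, for example (the exact filename varies per repo - copy the link from the repo's file list):

    wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf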

  • I simply use the Text Generation WebUI (https://github.com/oobabooga/text-generation-webui) for downloading models and interacting with them :-)

    The @crunchbits server works fine. I think it is best to use it on-demand atm, so I'm waiting for their hour-based pricing.

  • havoc OG Content Writer

    @nhocconan said:
    I simply use the Text Generation WebUI (https://github.com/oobabooga/text-generation-webui) for downloading models and interacting with them :-)

    The @crunchbits server works fine. I think it is best to use it on-demand atm, so I'm waiting for their hour-based pricing.

    Yep - for GUI use it's probably the best. Started with that & still use it for chatbot stuff.

    llama.cpp becomes more interesting if you want to use it for integration into coding projects. (Also potentially langchain)
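
    The bundled server example is handy for that - it exposes a small HTTP API you can call from any language. Roughly (per the llama.cpp server docs; adjust paths to wherever your build and model live):

    ./server -m ../../models/llama2-q8.gguf -ngl 99 --port 8080
    curl http://localhost:8080/completion -d '{"prompt": "Hello", "n_predict": 64}'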

  • A 16 GB GPU, and models that are hundreds of gigabytes in size. I hope that one day we see more manageable and feasible requirements.
