Free hosted Llama 3.1 8B - TensorDock
Hello LES!
If you've been looking to play around with AI endpoints recently but haven't had the chance, I've spun up a GPU cluster running Llama 3.1 8B for anyone who's interested.
Tech deets: a Rust load balancer distributing requests across 5 NVIDIA GPU servers, each running a Llama inference server. Max context length: 2048 tokens.
The API is 100% compatible with the OpenAI chat completions API (let me know if you want streaming support) and free to use for now!
If there's enough interest, I can also set up a Llama 70B or hosted Mixtral cluster.
Example API integration in Python:
"""
An example OpenAI API client that connects to TensorDock's YAP
"""
from openai import OpenAI
client = OpenAI(
api_key = "dummy",
base_url="https://yap.tensordock.com"
)
completion = client.chat.completions.create(
model="Meta-Llama-3.1-8B-Instruct",
messages=[
{
"role" : "system",
"content" : "You are a pirate who speaks in pirate-speak."
},
{
"role" : "user",
"content" : "Explain LLMs to me in a single sentence."
}
],
max_tokens=256,
temperature=0.7,
top_p = 0.7,
frequency_penalty=1.0,
seed = 17
)
output = completion.choices[0].message.content
print(output)
More details: https://blog.tensordock.com/blog/YAP
Rent affordable GPU servers: https://dashboard.tensordock.com/deploy
Comments
Hi @lentro!
Thanks for making TensorDock's YAP GPU cluster available here at LES!
May I ask five questions, please?
Name | Version | Max In | Max Out | Description
models/gemini-1.5-pro-latest | 001 | 2097152 | 8192 | Mid-size multimodal model that supports up to 2 million tokens
Are there only three important token counts: max in, max out, and maximum context?
Does the "Max context length 2048 tokens" number you mentioned above refer to the same use of "token" as Google Gemini's "2 million tokens?"
I have been using https://github.com/eliben/gemini-cli, which is a command line interface to Google Gemini. Is there a command line AI interface which you could recommend that supports standard input, standard output, standard error, and pipes?
Can the command line interface which you recommend connect to both YAP and Google Gemini?
Thanks again for making YAP available here at LES! Thanks in advance for any help with my questions!
Best wishes!
I hope everyone gets the servers they want!
I can help answer some if it lifts some of the burden off of @lentro
1) Llama 3.1 8B is not multimodal (i.e., it can't take images as input and generate text about them within the same model); it can only take text in and produce text out. Llama 3.1 8B is also likely much, much smaller than Gemini 1.5 Pro, so it is likely not as capable at many tasks.
3) Essentially yes
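To make that concrete: the prompt and the generated output share a single context window, whatever tokenizer does the counting (so a Llama "token" and a Gemini "token" are different units, but the budgeting arithmetic is the same). A minimal sketch of that arithmetic, using the numbers from this thread:

```python
# Context-window budgeting: prompt tokens + generated tokens must fit
# inside the window, whatever tokenizer the model uses to count them.
CONTEXT_WINDOW = 2048    # YAP's "max context length"
MAX_NEW_TOKENS = 256     # the max_tokens value from the example above

def max_prompt_tokens(context_window, max_new_tokens):
    """Largest prompt that still leaves room for the requested output."""
    return context_window - max_new_tokens

print(max_prompt_tokens(CONTEXT_WINDOW, MAX_NEW_TOKENS))  # 1792
```

So with max_tokens=256 on YAP, the prompt can use at most 1792 tokens; Gemini's "Max In" and "Max Out" are the same two quantities at a much larger scale.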
Nice one! Tried it & works as expected.
If you're looking to get additional load for publicity, have a look at OpenRouter... some of the models on there have $0 pricing, and it doesn't look like anybody else has a free 3.1 on there right now. You'd need to up the context length though; limiting a 128k model to 2k will raise eyebrows.
Think I've still got some credits w/ TensorDock... if this does ever switch to paid, could I use them against that?
Nice one, @lentro
———-
blog | exploring visually |
https://openrouter.ai/models/meta-llama/llama-3.1-8b-instruct:free maybe?
youtube.com/watch?v=k1BneeJTDcU
Hi @lentro, long time since I saw you here. How's everything?
Want a free VPS? https://microlxc.net
@codelock -- hello! Glad to be back -- things have been really crazy over the past two years, but TensorDock work is finally getting less hectic with all the automation & new team members we have... life is settling down & I'll soon have more time to collect Linux ISOs
Yes of course! 70B might be paid; I don't think we'll charge for 8B though, since running it is so cheap if you own the GPUs... Please do send me a DM if you have credits though: we have been shifting database changes around & have not migrated super old accounts, so yours may not have carried over if you haven't been an active customer recently.
We're running this at full FP16 precision on 24GB VRAM GPUs. 2k context gives us sufficient VRAM to batch & serve at low unit cost. Full 128k context would require larger-VRAM GPUs that are more expensive to own, so maybe we'd charge users who need 128k context? Anyways, that's the reasoning behind the decision.
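The back-of-the-envelope arithmetic behind that: at FP16 every parameter costs 2 bytes, so the 8B weights alone occupy about 16 GB of a 24 GB card, and the remaining ~8 GB must hold the KV cache for every concurrent request, which is why a short context helps batching. A rough sketch (weights-only; the exact headroom also depends on activations and framework overhead):

```python
# Back-of-the-envelope VRAM budget for Llama 3.1 8B at FP16 on a 24 GB GPU.
PARAMS = 8e9                 # 8 billion parameters
BYTES_PER_PARAM_FP16 = 2     # FP16 = 2 bytes per weight
GPU_VRAM_GB = 24

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9   # ~16 GB just for weights
kv_headroom_gb = GPU_VRAM_GB - weights_gb          # ~8 GB left for KV cache

print(f"weights: {weights_gb:.0f} GB, KV-cache headroom: {kv_headroom_gb:.0f} GB")
```

A longer context window multiplies the per-request KV-cache footprint, shrinking how many requests fit in that ~8 GB at once.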
@BruhGamer12 // @Not_Oles
I think @BruhGamer12 hit the nail on the head above. Realistically, Llama 8B is a small model and runs on (relatively) low-end hardware, hence why I can host it for free. The 70B and 405B Llama models are free to download but need much beefier hardware to run [2x H100 at full precision for 70B and 8x H100 at half precision for 405B]. Those two models are probably much more comparable to Google Gemini in terms of logical reasoning.
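That sizing can be sanity-checked with bytes-per-parameter arithmetic. This is weights-only (KV cache and activations need extra VRAM on top), and note that a full-FP16 405B would be ~810 GB, which is why 8-bit is what actually fits on a single 8x H100 node:

```python
# Weights-only sanity check for the hardware sizing above.
H100_VRAM_GB = 80

def weights_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB: billions of params * bytes per weight."""
    return params_billion * bytes_per_param

print(weights_gb(70, 2))   # 140 GB at FP16 -> needs 2x H100 (160 GB)
print(weights_gb(405, 1))  # 405 GB at 8-bit -> fits on 8x H100 (640 GB)
```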
The main selling point of Llama is the price. Because Meta provides the weights for free, model hosting providers compete against each other on pricing. Even if we needed to charge for 8B, we could do it for just $0.07 per million tokens to break even [assuming enough customers use it], or 15x cheaper than Google Gemini.
Let me get back to you re: command line interface -- I have seen aichat support OpenAI-compatible APIs but I need to get it working on my local system first!
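In the meantime, a pipe-friendly wrapper is only a few lines of Python using just the standard library. This is a hypothetical sketch, not official tooling; in particular, the /v1/chat/completions path is my assumption based on the usual OpenAI-compatible layout, so check YAP's docs if it 404s:

```python
#!/usr/bin/env python3
"""Hypothetical pipe-friendly YAP client: reads the prompt from stdin,
prints the completion to stdout, and sends errors to stderr, so it
composes with shell pipes."""
import json
import sys
import urllib.request

def build_request(prompt, base_url="https://yap.tensordock.com"):
    """Build an OpenAI-compatible chat-completions HTTP request."""
    payload = {
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",  # assumed OpenAI-compatible path
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy"},  # any key works per the post
    )

def main():
    prompt = sys.stdin.read().strip()
    try:
        with urllib.request.urlopen(build_request(prompt)) as resp:
            print(json.load(resp)["choices"][0]["message"]["content"])
    except Exception as exc:  # network/HTTP errors go to stderr, not stdout
        print(f"error: {exc}", file=sys.stderr)
        sys.exit(1)

# To wire it up, add `if __name__ == "__main__": main()` and run:
#   echo "Explain LLMs to me in a single sentence." | python3 yap.py
```

The same script would point at any other OpenAI-compatible endpoint (base_url swap), though Gemini's native API would need a different payload shape.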
Why full 16-bit? 8-bit GGUF takes way less VRAM and is way faster; or, if you are only using GPUs, do EXL2 at 6-bit or so for much faster speed. You don't get a loss of quality until somewhere between 5 and 6 bits.
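The appeal is that the weight footprint scales roughly linearly with bits per weight. A quick sketch for a 70B model (weights only; GGUF quants add a small amount of overhead on top of this):

```python
# Approximate weight sizes for a 70B model at common quantization levels.
def quant_weights_gb(params_billion, bits):
    """Weight size in GB: billions of params * bytes per weight."""
    return params_billion * bits / 8

for bits in (16, 8, 6, 5, 4):
    print(f"{bits:>2}-bit 70B: ~{quant_weights_gb(70, bits):.1f} GB")
```

So dropping from FP16 (~140 GB) to 6-bit (~52 GB) moves a 70B model from multi-GPU territory onto far cheaper hardware.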
Hmm, I will have to look into this, but I am quite sure there is measurable quality loss even at 8-bit. Of course the question becomes: does a quantized 70B model perform better than, say, a 34B model? I am sure it does, so it's a balance between quality and performance/cost... I'll look into running these at FP8.
You are def right, so thanks for that correction, but maybe the loss is less than you think (it is actually more than I would have thought, which surprised me!). This is with Llama 3 70B on the new MMLU-Pro benchmark; here are the scores in the math section, for example. So you may have some headroom to reduce precision and still maintain quality. This is using GGUF quants, btw. Math is where it would struggle the most, I think, but other categories like econ are the same between 8-bit and 16-bit.
For context, the full FP16 score is 54% for math (from Hugging Face: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro)
https://www.reddit.com/r/LocalLLaMA/comments/1ds6da5/mmlupro_all_category_test_results_for_llama_3_70b/
Thanks again for the free project tho. Do not mean to step on your toes at all! Whatever you do is awesome for the community!
@ehab,
Here’s a chance for you to virtually try on as many pants as you like… or not. For example, you can try asking "Name 42 ways to use pants without wearing them even once"
Q. Write a short essay on "42 creative ways to use pants without wearing them
Title: 42 Creative Ways to Utilize Pants Without Wearing Them
Introduction
Pants, a staple in many wardrobes, are versatile pieces of clothing that can serve purposes beyond wearing them on our bodies. This essay explores 42 creative and practical ways to utilize pants without wearing them even once, showcasing their hidden potential and value.
Conclusion
This extensive list of 42 creative ways to utilize pants without wearing them showcases their hidden potential and value. From home decor to personal care, gardening to travel, pants can serve a multitude of purposes beyond their traditional use as clothing. By repurposing pants, we can extend their lifespan, reduce waste, and embrace a more sustainable lifestyle.
@vyas charming as always. Made me smile.
Another option with various models, including this and Llama 3.1 70B is Perplexity Labs, though only free via the web interface.
Their paid plan includes $5 worth of API credits as well... which is A LOT. I've never come close to exhausting that.
Another is OpenRouter: I got a free $5 with my first ACH bank deposit on there.
A noob question: do you use the API data for AI training or anything?
I believe in good luck. The harder I work, the luckier I get.
imho, I'm not too sure what people will use this API for
but personally, I just like having an AI that is more privacy-focused [we don't collect any data of course -- I trust myself more than anyone else]