
As an article aimed towards newbies I think it's pretty good, but I'd agree "comprehensive" is a bit of a reach. I think a better title would have been something like "Get LLama2 running locally on every platform easily."

Anyone w/ more than a single consumer GPU probably has a good grip on their options (a vllm vs hf shootout would be neat, for example), but I'd add a few more projects for those taking the next step in local inferencing:

* exllama - while llama.cpp has matched its token generation performance, exllama is still largely my preferred inference engine because it is so memory efficient (shaving gigs off the competition) - this means you can run a 33B model w/ 2K context easily on a single 24GB card. It also scales almost perfectly for inferencing on 2 GPUs. It's been tested to run a llama2-70b w/ 16K context (NTK RoPE scaling) sneaking in at 47GB. Ridiculous.

* AutoGPTQ - while it's fallen a bit behind for inference, if you are using older (eg Pascal) cards, it's worth taking a look. It's also the easiest tool for making GPTQ quants.

* ggml - for non-llama models, this covers running quants of almost everything else out there; GPU acceleration only w/ cuBLAS/CLBlast, so not the fastest, but fast enough

* llama-cpp-python and LocalAI - while these are technically llama.cpp bindings, they're pretty useful/worth mentioning since they replicate the OpenAI API, making them an easy drop-in replacement for a whole ecosystem of tools/apps

* A lot of hobbyists like oobabooga, kobold.cpp, silly tavern, etc but I haven't gotten around to poking into those as much. They seem like a lot of work, always behind their mainline dependencies, and while featureful and interesting, they also feel like they're one update away from breaking (eg, automatic1111 vibes).
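
To put rough numbers on the exllama memory figures above, here's a back-of-envelope sketch. It assumes 4-bit weights and an fp16 KV cache (my assumptions, not something any tool reports), uses the published LLaMA 33B and Llama 2 70B shapes, and ignores activations, quantization metadata and framework overhead, so real usage will land higher. `estimate_vram_gb` is a made-up helper name, not part of any library.

```python
def estimate_vram_gb(n_params, n_layers, n_kv_heads, head_dim, context_len,
                     weight_bits=4.0, kv_bytes=2):
    """Rough lower bound on VRAM in GB: quantized weights plus KV cache only."""
    # Weights: one value per parameter at weight_bits bits each (assumed 4-bit).
    weights = n_params * weight_bits / 8
    # KV cache: a key and a value vector per layer, per KV head, per token (assumed fp16).
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9


# LLaMA 33B: 60 layers, 52 attention heads of dim 128, 2K context.
llama_33b = estimate_vram_gb(32.5e9, 60, 52, 128, 2048)

# Llama 2 70B: 80 layers, grouped-query attention with 8 KV heads, 16K context.
llama2_70b = estimate_vram_gb(69e9, 80, 8, 128, 16384)

print(f"33B @ 2K ctx:  ~{llama_33b:.1f} GB before overhead")
print(f"70B @ 16K ctx: ~{llama2_70b:.1f} GB before overhead")
```

Both estimates come in under the 24GB and 47GB figures, which is at least consistent with them once overhead is added back in. It also shows why the 70B cache stays small at 16K: with only 8 KV heads it needs far less per token than the 33B does.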


