0.8.10
Release date: 2024-07-24 01:53:04
Latest release of Mozilla-Ocho/llamafile: 0.8.13 (2024-08-19 01:22:48)
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability on the stock installs of six OSes, without needing to be installed, and combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release includes a build of the new llamafile server rewrite we've been promising, which we're calling llamafiler. It's matured enough to recommend for serving embeddings, and it is the fastest way to do so. If you use it with all-MiniLM-L6-v2.Q6_K.gguf, then on Threadripper it can serve JSON /embedding requests at 800 req/sec, whereas the old llama.cpp server could only do 100 req/sec. So you can fill up your RAG databases very quickly if you productionize this.
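As a rough illustration of how a client might hit that endpoint, here is a minimal sketch in Python. The host, port, and the `content` field name are assumptions on my part rather than details from these notes; consult the LLaMAfiler documentation for the actual request schema.

```python
# Minimal sketch of calling llamafiler's /embedding endpoint.
# Assumptions (not taken from these release notes): the server listens on
# localhost:8080 and accepts a JSON body with a "content" string field.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/embedding",
    data=json.dumps({"content": "llamafile lets you run LLMs locally"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
    print(result)  # expect a JSON object containing the embedding vector
```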
The old llama.cpp server came from a folder named "examples" and was never intended to be production worthy. This server is designed to be sturdy and uncrashable. It also has /completion and /tokenize endpoints, which can serve 3.7 million requests per second on Threadripper, thanks to Cosmo Libc improvements.
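A similarly hedged sketch of exercising the /tokenize endpoint follows; again the field names and port are assumptions, and the real schema may differ from what is shown here.

```python
# Minimal sketch of calling the new server's /tokenize endpoint.
# Assumption: it accepts a JSON body with a "content" string and returns
# token ids; verify against the LLaMAfiler documentation.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/tokenize",
    data=json.dumps({"content": "hello world"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # expect a list of token ids in the response
```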
See the LLaMAfiler Documentation for further details.
- 73b1836 Write documentation for new server
- b3930aa Make GGML asynchronously cancelable
- 8604e9a Fix POSIX undefined cancelation behavior
- 323f50a Let SIGQUIT produce per-thread backtraces
- 15d7fba Use semaphore to limit GGML worker threads
- d7c8e33 Add support for JSON parameters to new server
- 7f099cd Make stack overflows recoverable in new server
- fb3421c Add barebones /completion endpoint to new server
This release restores support for non-AVX x86 microprocessors. We had to drop support at the beginning of the year; however, our CPUID dispatching has advanced considerably since then. We're now able to offer top speeds on modern hardware without leaving old hardware behind.
- a674cfb Restore support for non-AVX microprocessors
- 555fb80 Improve build configuration
Here are the remaining improvements included in this release:
- cc30400 Supports SmolLM (#495)
- 4a4c065 Fix CUDA compile warnings and errors
- 82f845c Avoid crashing with BF16 on Apple Metal
Downloads:
- llamafile-0.8.10 (28.3 MB)
- llamafile-0.8.10.zip (60.09 MB)