0.8.10
Release date: 2024-07-24 01:53:04
Latest release of Mozilla-Ocho/llamafile: 0.8.13 (2024-08-19 01:22:48)
llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023. It offers superior performance and binary portability on the stock installs of six OSes, without needing to be installed, and combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release includes a build of the new llamafile server rewrite we've been promising, which we're calling llamafiler. It's matured enough to recommend for serving embeddings, and it is the fastest way to do so. If you use it with all-MiniLM-L6-v2.Q6_K.gguf, then on Threadripper it can serve JSON /embedding requests at 800 req/sec, whereas the old llama.cpp server could only do 100 req/sec. So you can fill up your RAG databases very quickly if you productionize this.
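As a rough illustration of how a client might hit that endpoint, here is a minimal sketch in Python. The host, port, and the `content` field name are assumptions on my part rather than details from these notes; consult the LLaMAfiler documentation for the actual request schema.

```python
# Minimal sketch of calling llamafiler's /embedding endpoint.
# Assumptions (not taken from these release notes): the server listens on
# localhost:8080 and accepts a JSON body with a "content" string field.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/embedding",
    data=json.dumps({"content": "llamafile lets you run LLMs locally"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
    print(result)  # expect a JSON object containing the embedding vector
```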
The old llama.cpp server came from a folder named "examples" and was never intended to be production worthy. This server is designed to be sturdy and uncrashable. It also has /completion and /tokenize endpoints, which can serve 3.7 million requests per second on Threadripper, thanks to Cosmo Libc improvements.
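A similarly hedged sketch of exercising the /tokenize endpoint follows; again the field names and port are assumptions, and the real schema may differ from what is shown here.

```python
# Minimal sketch of calling the new server's /tokenize endpoint.
# Assumption: it accepts a JSON body with a "content" string and returns
# token ids; verify against the LLaMAfiler documentation.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/tokenize",
    data=json.dumps({"content": "hello world"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # expect a list of token ids in the response
```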
See the LLaMAfiler Documentation for further details.
- 73b1836 Write documentation for new server
- b3930aa Make GGML asynchronously cancelable
- 8604e9a Fix POSIX undefined cancelation behavior
- 323f50a Let SIGQUIT produce per-thread backtraces
- 15d7fba Use semaphore to limit GGML worker threads
- d7c8e33 Add support for JSON parameters to new server
- 7f099cd Make stack overflows recoverable in new server
- fb3421c Add barebones /completion endpoint to new server
This release restores support for non-AVX x86 microprocessors. We had to drop support at the beginning of the year; however, our CPUID dispatching has advanced considerably since then. We're now able to offer top speeds on modern hardware without leaving old hardware behind.
- a674cfb Restore support for non-AVX microprocessors
- 555fb80 Improve build configuration
Here are the remaining improvements included in this release:
- cc30400 Supports SmolLM (#495)
- 4a4c065 Fix CUDA compile warnings and errors
- 82f845c Avoid crashing with BF16 on Apple Metal
Downloads:
- llamafile-0.8.10 (28.3 MB)
- llamafile-0.8.10.zip (60.09 MB)