
0.8.13

Mozilla-Ocho/llamafile

Release date: 2024-08-19 01:22:48


[line drawing of llama animal head in front of slightly open manilla folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance and binary portability across six operating systems, with no installation required. It combines the best of llama.cpp and cosmopolitan libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. llamafile gives you a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.

v0.8.13 changes

This release synchronizes with upstream projects, bringing with it support for the newest models (e.g. Gemma 2B). Support for LLaMA v3 has been significantly improved.

The new llamafiler server is now able to serve 2400 embeddings per second on CPU. That's 3x faster than the llama.cpp server upstream. It has also been hardened for security, so you should be able to safely use it as a public-facing web server. There's a man page for llamafiler, and you can also read the docs online: /llamafile/server/doc/index.md.

The new llamafiler server now fully supports all the old embedding endpoints that were provided by llamafile --server. Support for serving embeddings has been removed from the old server.
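As a sketch of how the embedding endpoints are typically used (the port, model filename, and request fields below are assumptions for illustration, not taken from these notes), a request against the OpenAI-style endpoint might look like:

```shell
# Start the server (model path here is a placeholder):
# ./llamafiler -m all-MiniLM-L6-v2.F32.gguf

# Request an embedding via the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "llamafile lets you run LLMs locally"}'
```

Consult the llamafiler man page or /llamafile/server/doc/index.md for the authoritative endpoint paths and parameters.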


This release introduces whisperfile, a single-file implementation of OpenAI's Whisper model. It lets you transcribe speech to text and even translate it as well. Our implementation is based on Georgi Gerganov's whisper.cpp project. The effort to turn it into a whisperfile was founded by CJ Pais, who has handed over maintenance of his awesome work. There's a man page for whisperfile (which can also be viewed by running ./whisperfile --help), and we have online documentation with markdown tutorials at /whisper.cpp/doc/index.md.

We developed a faster, more accurate implementation of GeLU. This helps improve the performance of tiny models. It leads to measurable quality improvements in whisper model output.
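For background, GeLU is commonly computed either exactly via the error function or with the widespread tanh approximation. llamafile's actual implementation is not reproduced in these notes; the sketch below only compares the two standard textbook formulations, which is where accuracy/speed trade-offs like the one described above arise:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Common tanh approximation used by many inference engines
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two curves agree to roughly three decimal places over typical activation ranges, which is why small errors in the approximation can still be measurable in tiny models like Whisper.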

We've been improving floating point numerical stability for very large models, e.g. Mixtral 8x22b and Command-R-Plus. tinyBLAS on CPU for F32, F16, and BF16 weights now uses a new zero-overhead divide-and-conquer approach to computing dot products, which we call ruler reduction, that can result in a 10x reduction in worst case roundoff error accumulation.
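The ruler-reduction code itself isn't shown in these notes. As a hedged illustration of the general divide-and-conquer idea only (this is plain pairwise summation, not llamafile's exact scheme), a dot product can be accumulated as a balanced tree rather than a single running sum, which keeps worst-case roundoff growth roughly logarithmic in the vector length instead of linear:

```python
def dot_naive(a, b):
    # Sequential accumulation: roundoff error can grow linearly in len(a).
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_pairwise(a, b, lo=0, hi=None):
    # Divide-and-conquer accumulation: split the range in half, sum each
    # half recursively, then add the two partial sums.
    if hi is None:
        hi = len(a)
    n = hi - lo
    if n <= 8:  # small base case accumulated directly
        return sum(a[i] * b[i] for i in range(lo, hi))
    mid = lo + n // 2
    return dot_pairwise(a, b, lo, mid) + dot_pairwise(a, b, mid, hi)

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dot_pairwise(a, b))  # 70.0
```

Both functions compute the same dot product in exact arithmetic; the difference only shows up in the accumulated floating point error for long vectors.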

This release introduces sdfile, which is our implementation of stable diffusion. No documentation is yet provided for this command, other than the docs provided by the upstream stable-diffusion.cpp project on which it's based.

The new architectures and tokenizers introduced by this version are: Open ELM, GPT NEOX, Arctic, DeepSeek2, ChatGLM, BitNet, T5, JAIS, Poro, Viking, Tekken, and CodeShell.

Known Issues

The llamafile executable size has increased from 30 MB to 200 MB in this release. This is caused by ggerganov/llama.cpp#7156. We're already employing some workarounds to minimize the impact of upstream development contributions on binary size, and we're aiming to find more in the near future.

Related links: original release page · download (tar) · download (zip)

1. llamafile-0.8.13 (230.17 MB)

2. llamafile-0.8.13.zip (472.09 MB)

3. llamafile-bench-0.8.13 (8.41 MB)

4. sdfile-0.8.13 (17.47 MB)

5. whisperfile-0.8.13 (225.79 MB)
