0.8.9
Release date: 2024-07-02 03:11:46
Latest Mozilla-Ocho/llamafile release: 0.8.13 (2024-08-19 01:22:48)
This release gets Gemma2 working closer to how Google intended.
- af22695 Make gemma2-27b-it the same as aistudio.google.com
- 41678c8 Add sliding window mask for Gemma2
- 140eed5 Add soft-capping to Gemma2
This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net) for bug reports and testing efforts. See also other bug fixes described in the Cosmopolitan v3.5.4 and v3.5.3 release notes.
Our future replacement for the server now has an /embedding endpoint. On my workstation, it's currently able to serve 851 requests per second for a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings model. None of the requests fail and 99th percentile latency is 56.74ms.
- 1346ef4 Create /embedding endpoint in new server
- 263d39b Use float to string conversion
- 0d62d05 Reclaim llama_decode() memory on cancelation
- 617d841 Remove ggml_context cache
- 46dda4f Refactor new server and get leak checker working
- cd73243 Prevent vector overflow in llama.cpp
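Embedding vectors like the ones this endpoint returns are typically compared with cosine similarity. Here is a minimal self-contained sketch in Python; the vectors are made-up toy data standing in for real 384-dimensional all-MiniLM-L6-v2 output, not actual model responses:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors; a real MiniLM embedding has 384 dimensions.
orange = [0.1, 0.3, 0.5, 0.2]
tangerine = [0.12, 0.28, 0.52, 0.18]
print(round(cosine_similarity(orange, tangerine), 4))
```

Identical vectors score 1.0; semantically related prompts score close to it, which is what makes embeddings useful for search and clustering.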
You can try the new embedding server as follows:

```sh
make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl http://127.0.0.1:8080/embedding?prompt=orange
```
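Prompts containing spaces or non-ASCII characters need percent-encoding before they can go into the `prompt` query parameter. A minimal Python sketch; the endpoint path is taken from the curl example above, and the rest is generic stdlib:

```python
from urllib.parse import urlencode

# Build a safely percent-encoded request URL for the /embedding endpoint.
base = "http://127.0.0.1:8080/embedding"
query = urlencode({"prompt": "a glass of orange juice"})
url = f"{base}?{query}"
print(url)
```

`urlencode` turns spaces into `+` and escapes anything else unsafe, so the URL can be passed straight to curl or an HTTP client.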
Compatibility with the old server's API of posting JSON content will be added in upcoming changes, as will support for the OpenAI API. The goal is to be compatible with everything.
Downloads:
- llamafile-0.8.9 (28.62MB)
- llamafile-0.8.9.zip (59.19MB)