0.8.9
Release date: 2024-07-02 03:11:46
Latest Mozilla-Ocho/llamafile release: 0.8.13 (2024-08-19 01:22:48)
This release gets Gemma2 working closer to how Google intended.
- af22695 Make gemma2-27b-it the same as aistudio.google.com
- 41678c8 Add sliding window mask for Gemma2
- 140eed5 Add soft-capping to Gemma2
This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. Thank you @aj47 (techfren.net) for bug reports and testing efforts. See also other bug fixes described in the Cosmopolitan v3.5.4 and v3.5.3 release notes.
Our future replacement for the server now has an /embedding endpoint. On my workstation, it's currently able to serve 851 requests per second for a prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings model. None of the requests fail and 99th percentile latency is 56.74ms.
- 1346ef4 Create /embedding endpoint in new server
- 263d39b Use float to string conversion
- 0d62d05 Reclaim llama_decode() memory on cancelation
- 617d841 Remove ggml_context cache
- 46dda4f Refactor new server and get leak checker working
- cd73243 Prevent vector overflow in llama.cpp
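Embedding vectors like the ones this endpoint returns are typically compared with cosine similarity. Here is a minimal self-contained sketch in Python; the vectors are made-up toy data standing in for real 384-dimensional all-MiniLM-L6-v2 output, not actual model responses:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors; a real MiniLM embedding has 384 dimensions.
orange = [0.1, 0.3, 0.5, 0.2]
tangerine = [0.12, 0.28, 0.52, 0.18]
print(round(cosine_similarity(orange, tangerine), 4))
```

Identical vectors score 1.0; semantically related prompts score close to it, which is what makes embeddings useful for search and clustering.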
You can try the new embedding server as follows:

```sh
make -j o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.F32.gguf
curl http://127.0.0.1:8080/embedding?prompt=orange
```
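Prompts containing spaces or non-ASCII characters need percent-encoding before they can go into the `prompt` query parameter. A minimal Python sketch; the endpoint path is taken from the curl example above, and the rest is generic stdlib:

```python
from urllib.parse import urlencode

# Build a safely percent-encoded request URL for the /embedding endpoint.
base = "http://127.0.0.1:8080/embedding"
query = urlencode({"prompt": "a glass of orange juice"})
url = f"{base}?{query}"
print(url)
```

`urlencode` turns spaces into `+` and escapes anything else unsafe, so the URL can be passed straight to curl or an HTTP client.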
Compatibility with the old server's API of posting JSON content will be added in upcoming changes, as will support for the OpenAI API. The goal is to be compatible with everything.
Downloads:
- llamafile-0.8.9 (28.62MB)
- llamafile-0.8.9.zip (59.19MB)