v2.2.0
Released: 2024-07-24 00:30:03
Notable changes
- Llama 3.1 support (including 405B, with FP8 support in many mixed configurations: FP8, AWQ, GPTQ, and FP8+FP16).
- Gemma2 softcap support (see the softcapping sketch after this list).
- Deepseek v2 support.
- Lots of internal reworks/cleanup (allowing for cool features)
- Lots of AWQ/GPTQ work with marlin kernels (everything should be faster by default)
- Flash decoding support (opt in with the FLASH_DECODING=1 environment variable; this will probably enable some nice improvements in the future). See the launch sketch after this list.
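
For context on the Gemma2 item above: softcapping bounds a tensor of logits to (-cap, cap) with a scaled tanh, leaving small values roughly unchanged. The sketch below is illustrative rather than TGI's actual implementation, and the cap values in the comments come from the public Gemma 2 config, not from this release.

```python
import torch

def softcap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Scale down, squash with tanh, scale back up: outputs stay in
    # (-cap, cap) and are near-linear while |logits| << cap.
    return cap * torch.tanh(logits / cap)

# Illustrative: the public Gemma 2 config caps attention scores at 50.0
# and final LM-head logits at 30.0.
scores = torch.randn(2, 8, 16, 16) * 100  # exaggerated magnitudes
capped = softcap(scores, cap=50.0)
assert capped.abs().max().item() < 50.0
```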
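
And for the flash decoding item: FLASH_DECODING=1 is set on the server process at launch. The snippet below is a hypothetical launch wrapper, assuming a locally installed text-generation-launcher; the model id and port are placeholders.

```python
import os
import subprocess

# FLASH_DECODING=1 is the opt-in variable named in these notes;
# everything else here (model id, port) is illustrative.
env = dict(os.environ, FLASH_DECODING="1")

subprocess.run(
    [
        "text-generation-launcher",  # TGI launcher binary
        "--model-id", "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "--port", "8080",
    ],
    env=env,
    check=True,
)
```

The more common Docker deployment sets the same variable on the container with -e FLASH_DECODING=1.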
What's Changed
- Preparing patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2186
- Adding "longrope" for Phi-3 (#2172) by @amihalik in https://github.com/huggingface/text-generation-inference/pull/2179
- Refactor dead code - Removing all `flash_xxx.py` files. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2166
- Fix Starcoder2 after refactor by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2189
- GPTQ CI improvements by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2151
- Consistently take `prefix` in model constructors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2191
- fix dbrx & opt model prefix bug by @icyxp in https://github.com/huggingface/text-generation-inference/pull/2201
- hotfix: Fix number of KV heads by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2202
- Fix incorrect cache allocation with multi-query by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2203
- Falcon/DBRX: get correct number of key-value heads by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2205
- add doc for intel gpus by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2181
- fix: python deserialization by @jaluma in https://github.com/huggingface/text-generation-inference/pull/2178
- update to metrics 0.23.0 or could work with metrics-exporter-promethe… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2190
- feat: use model name as adapter id in chat endpoints by @drbh in https://github.com/huggingface/text-generation-inference/pull/2128
- Fix nccl regression on PyTorch 2.3 upgrade by @fxmarty in https://github.com/huggingface/text-generation-inference/pull/2099
- Fix buildx cache + change runner type by @glegendre01 in https://github.com/huggingface/text-generation-inference/pull/2176
- Fixed README ToC by @vinkamath in https://github.com/huggingface/text-generation-inference/pull/2196
- Updating the self check by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2209
- Move quantized weight handling out of the `Weights` class by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2194
- Add support for FP8 on compute capability >=8.0, <8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2213
- fix: append DONE message to chat stream by @drbh in https://github.com/huggingface/text-generation-inference/pull/2221
- [fix] Modifying base in yarn embedding by @SeongBeomLEE in https://github.com/huggingface/text-generation-inference/pull/2212
- Use symmetric quantization in the `quantize` subcommand by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2120
- feat: simple mistral lora integration tests by @drbh in https://github.com/huggingface/text-generation-inference/pull/2180
- fix custom cache dir by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2226
- fix: Remove bitsandbytes installation when running cpu-only install by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2216
- Add support for AWQ-quantized Idefics2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2233
- `server quantize`: expose groupsize option by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2225
- Remove stray `quantize` argument in `get_weights_col_packed_qkv` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2237
- fix(server): fix cohere by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2249
- Improve the handling of quantized weights by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2250
- Hotfix: fix of use of unquantized weights in Gemma GQA loading by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2255
- Hotfix: various GPT-based model fixes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2256
- Hotfix: fix MPT after recent refactor by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2257
- Hotfix: pass through model revision in `VlmCausalLM` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2258
- usage stats and crash reports by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2220
- add usage stats to toctree by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2260
- fix: adjust default tool choice by @drbh in https://github.com/huggingface/text-generation-inference/pull/2244
- Add support for Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2224
- re-push to internal registry by @XciD in https://github.com/huggingface/text-generation-inference/pull/2242
- Add FP8 release test by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2261
- feat(fp8): use fbgemm kernels and load fp8 weights directly by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2248
- fix(server): fix deepseekv2 loading by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2266
- Hotfix: fix of use of unquantized weights in Mixtral GQA loading by @icyxp in https://github.com/huggingface/text-generation-inference/pull/2269
- legacy warning on text_generation client by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2271
- fix(ci): test new instances by @XciD in https://github.com/huggingface/text-generation-inference/pull/2272
- fix(server): fix fp8 weight loading by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2268
- Softcapping for gemma2. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2273
- use proper name for ci by @XciD in https://github.com/huggingface/text-generation-inference/pull/2274
- Fixing mistral nemo. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2276
- fix(l4): fix fp8 logic on l4 by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2277
- Add support for repacking AWQ weights for GPTQ-Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2278
- [WIP] Add support for Mistral-Nemo by supporting head_dim through config by @shaltielshmid in https://github.com/huggingface/text-generation-inference/pull/2254
- Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2285
- Add support for Llama 3 rotary embeddings by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2286
- hotfix: pin numpy by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2289
New Contributors
- @jaluma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2178
- @vinkamath made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2196
- @ErikKaum made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2226
- @Hugoch made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2216
- @XciD made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2242
- @shaltielshmid made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2254
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.1...v2.2.0