v2.4.0
Release date: 2024-10-26 05:14:13
Latest huggingface/text-generation-inference release: v3.0.1 (2024-12-12 04:13:58)
Notable changes
- Experimental prefill chunking (`PREFILL_CHUNKING=1`)
- Experimental FP8 KV cache support (see the launch sketch after this list)
- Greatly decrease latency for large batches (> 128 requests)
- Faster MoE kernels and support for GPTQ-quantized MoE
- Faster implementation of MLLama
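As a rough sketch of how the two experimental features above might be enabled on this release: `PREFILL_CHUNKING=1` comes straight from these notes, while the `--kv-cache-dtype` flag and its `fp8_e4m3fn` value are assumptions pieced together from the FP8 KV cache PRs listed below (#2603, #2628, #2655), and the model ID is a placeholder. Verify against `text-generation-launcher --help` for your build.

```shell
# Sketch only: enables experimental prefill chunking via the env var from
# these notes, plus an FP8 KV cache. The --kv-cache-dtype flag and the
# fp8_e4m3fn value are assumptions based on PRs #2603/#2628/#2655; the
# model ID is a placeholder.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e PREFILL_CHUNKING=1 \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --kv-cache-dtype fp8_e4m3fn
```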
What's Changed
- nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
- chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
- Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
- Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
- Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
- Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
- Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
- Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
- Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
- Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
- Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
- Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
- More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
- remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
- Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
- Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
- Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
- flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
- Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
- Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
- Update ROCM libs and improvements by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
- Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
- feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
- Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
- MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
- nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
- Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
- Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
- CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
- Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
- New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604
- Revert "Unroll notify error into generate response" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2605
- nix: example of local package overrides during development by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2607
- Add basic FP8 KV cache support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2603
- Fp8 Cache condition by @flozi00 in https://github.com/huggingface/text-generation-inference/pull/2611
- enable mllama in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2610
- Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2617
- Add support for fused MoE Marlin for AWQ by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2616
- nix: move back to the tgi-nix main branch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2620
- CI (2599): Update ToolType input schema by @drbh in https://github.com/huggingface/text-generation-inference/pull/2601
- nix: add black and isort to the closure by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2619
- AMD CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2589
- feat: allow tool calling to respond without a tool by @drbh in https://github.com/huggingface/text-generation-inference/pull/2614 (see the request sketch after this list)
- Update documentation to most recent stable version of TGI. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2625
- Intel ci by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2630
- Fixing intel Supports windowing. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2637
- Small fixes for supported models by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2471
- Cpu perf by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2596
- Clarify gated description and quicktour by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2631
- update ipex to fix incorrect output of mllama in cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2640
- feat: enable pytorch xpu support for non-attention models by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2561
- Fixing linters. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2650
- Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651
- Fp8 e4m3_fnuz support for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2588
- feat: prefill chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2600
- Support `e4m3fn` KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655
- Simplify the `attention` function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609
- fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in https://github.com/huggingface/text-generation-inference/pull/2663
- fix: prefer inplace softmax to avoid copy by @drbh in https://github.com/huggingface/text-generation-inference/pull/2661
- Break cycle between the attention implementations and KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2627
- CI job. Gpt awq 4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2665
- Make handling of FP8 scales more consistent by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2666
- Test Marlin MoE with `desc_act=true` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622
- break when there's nothing to read by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2582
- Add `impureWithCuda` dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677
- Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2632
- feat: natively support Granite models by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2682
- feat: allow any supported payload on /invocations by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2683
- flashinfer: reminder to remove contiguous call in the future by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2685
- Fix Phi 3.5 MoE tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2684
- Add support for FP8 KV cache scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2628
- Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2664
- [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2357
- Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2691
- Fixing mt0 test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2692
- Add support for stop words in TRTLLM by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2678
- Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2688
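As referenced above, PR 2614 lets a tools-enabled chat request come back as a plain assistant message when no tool applies. A minimal request sketch against TGI's OpenAI-compatible `/v1/chat/completions` endpoint, assuming a server already running on `localhost:8080`; the weather tool schema is purely illustrative:

```shell
# Sketch: tools are offered, but with tool_choice "auto" the model may answer
# directly instead of emitting a tool call (the behavior enabled by PR 2614).
# Assumes a TGI server on localhost:8080; the tool schema is illustrative.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hi! Who are you?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```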
New Contributors
- @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
- @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
- @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
- @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
- @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579
- @dvrogozh made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2561
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.0