v2.4.1
Released: 2024-11-23 01:35:00
Notable changes
- Choose input/total tokens automatically based on available VRAM
- Support Qwen2 VL (example request sketched after this list)
- Decrease latency of very large batches (> 128)
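To make the Qwen2 VL support concrete, here is a minimal sketch of an image-plus-text request against TGI's OpenAI-compatible Messages API. It assumes a local v2.4.1 server started with `--model-id Qwen/Qwen2-VL-7B-Instruct` and listening on port 8080; the image URL is a placeholder. Note that with the automatic input/total token selection in this release, the `--max-input-tokens`/`--max-total-tokens` launcher flags can simply be omitted and the launcher will size them from available VRAM.

```python
# Minimal sketch (not the release's own example): send an image + text
# prompt to a local TGI v2.4.1 server running Qwen2 VL through the
# OpenAI-compatible Messages API. The server address, model id, and
# image URL are assumptions for illustration.
import requests

payload = {
    "model": "tgi",  # TGI serves a single model; this field is effectively a placeholder
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},  # hypothetical image
                },
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    "max_tokens": 64,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```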
What's Changed
- feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2687
- Avoiding timeout for bloom tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2693
- Green main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2697
- Choosing input/total tokens automatically based on available VRAM? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2673
- We can have a tokenizer anywhere. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2527
- Update poetry lock. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2698
- Fixing auto bloom test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2699
- More timeout on docker start ? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2701
- Monkey patching as a desperate measure. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2704
- add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2702
- Support qwen2 vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2689
- fix cuda graphs for qwen2-vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2708
- fix: create position ids for text only input by @drbh in https://github.com/huggingface/text-generation-inference/pull/2714
- fix: add chat_tokenize endpoint to api docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2710 (endpoint sketched after this list)
- Hotfixing auto length (warmup max_s was wrong). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2716
- Fix prefix caching + speculative decoding by @tgaddair in https://github.com/huggingface/text-generation-inference/pull/2711
- Fixing linting on main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2719
- nix: move to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2717
- add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2725
- Add initial support for compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2732
- nix: update nixpkgs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2746
- benchmark: fix prefill throughput by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2741
- Fix: Change model_type from ssm to mamba by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2740
- Fix: Change embeddings to embedding by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2738
- fix response type of document for Text Generation Inference by @jitokim in https://github.com/huggingface/text-generation-inference/pull/2743
- Upgrade outlines to 0.1.1 by @aW3st in https://github.com/huggingface/text-generation-inference/pull/2742
- Upgrading our deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2750
- feat: return streaming errors as an event formatted for openai's client by @drbh in https://github.com/huggingface/text-generation-inference/pull/2668 (sketched after this list)
- Remove vLLM dependency for CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2751
- fix: improve find_segments via numpy diff by @drbh in https://github.com/huggingface/text-generation-inference/pull/2686
- add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2707
- Add support for compressed-tensors w8a8 int checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2745
- feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in https://github.com/huggingface/text-generation-inference/pull/2721
- Simplify two ipex conditions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2755
- Update to moe-kernels 0.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2720
- PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in https://github.com/huggingface/text-generation-inference/pull/2645
- fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in https://github.com/huggingface/text-generation-inference/pull/2760
- nix: update for outlines 0.1.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2764
- Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2758
- nix: build and cache impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2765
- fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2766
- nix: downgrade to outlines 0.1.3 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2768
- fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2770
- fix: tweak grammar test response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2769
- Add a README section about using Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2767
- Remove guideline from API by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2762
- feat: Add automatic nightly benchmarks by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2591
- feat: add payload limit by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2726
- Update to marlin-kernels 0.3.6 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2771
- chore: prepare 2.4.1 release by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2773
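Two of the changes above have a visible API surface worth sketching. First, PR 2668 formats mid-stream failures as OpenAI-style error events, so a stock `openai` client can surface them as an exception rather than failing on an unparseable SSE chunk. A minimal sketch, assuming the same local server as the earlier example (the exact exception type depends on the client version; `APIError` is shown as a reasonable choice):

```python
# Sketch of consuming a TGI stream with the stock `openai` client.
# As of v2.4.1, server-side errors during streaming are emitted as
# OpenAI-formatted error events that the client can parse cleanly.
from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

try:
    stream = client.chat.completions.create(
        model="tgi",  # placeholder; TGI serves a single model
        messages=[{"role": "user", "content": "Write one sentence about llamas."}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as err:
    # Before this release, an error in the middle of a stream could leave
    # the client with an event it could not parse; now it surfaces cleanly.
    print(f"\nstream error: {err}")
```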
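Second, the `/chat_tokenize` endpoint that PR 2710 adds to the API docs applies the model's chat template and returns the resulting tokenization. A hedged sketch follows: the request body mirroring a chat completion request is an assumption, and the response is printed raw rather than assuming its exact schema.

```python
# Sketch: ask a local TGI server how a chat conversation tokenizes after
# the chat template is applied. Server address is assumed as above; the
# response JSON is printed as-is rather than presuming its shape.
import requests

resp = requests.post(
    "http://localhost:8080/chat_tokenize",
    json={"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```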
New Contributors
- @tgaddair made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2711
- @mokeddembillel made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2740
- @jitokim made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2743
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.4.0...v2.4.1