v2.4.1
Released: 2024-11-23 01:35:00
Notable changes
- Choose input/total tokens automatically based on available VRAM
- Support Qwen2 VL (example request sketched after this list)
- Decrease latency of very large batches (> 128)
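To make the Qwen2 VL support concrete, here is a minimal sketch of an image-plus-text request against TGI's OpenAI-compatible Messages API. It assumes a local v2.4.1 server started with `--model-id Qwen/Qwen2-VL-7B-Instruct` and listening on port 8080; the image URL is a placeholder. Note that with the automatic input/total token selection in this release, the `--max-input-tokens`/`--max-total-tokens` launcher flags can simply be omitted and the launcher will size them from available VRAM.

```python
# Minimal sketch (not the release's own example): send an image + text
# prompt to a local TGI v2.4.1 server running Qwen2 VL through the
# OpenAI-compatible Messages API. The server address, model id, and
# image URL are assumptions for illustration.
import requests

payload = {
    "model": "tgi",  # TGI serves a single model; this field is effectively a placeholder
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},  # hypothetical image
                },
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    "max_tokens": 64,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```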
What's Changed
- feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2687
- Avoiding timeout for bloom tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2693
- Green main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2697
- Choosing input/total tokens automatically based on available VRAM? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2673
- We can have a tokenizer anywhere. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2527
- Update poetry lock. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2698
- Fixing auto bloom test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2699
- More timeout on docker start ? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2701
- Monkey patching as a desperate measure. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2704
- add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2702
- Support qwen2 vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2689
- fix cuda graphs for qwen2-vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2708
- fix: create position ids for text only input by @drbh in https://github.com/huggingface/text-generation-inference/pull/2714
- fix: add chat_tokenize endpoint to api docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2710 (endpoint sketched after this list)
- Hotfixing auto length (warmup max_s was wrong). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2716
- Fix prefix caching + speculative decoding by @tgaddair in https://github.com/huggingface/text-generation-inference/pull/2711
- Fixing linting on main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2719
- nix: move to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718
- fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2717
- add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2725
- Add initial support for compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2732
- nix: update nixpkgs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2746
- benchmark: fix prefill throughput by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2741
- Fix: Change model_type from ssm to mamba by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2740
- Fix: Change embeddings to embedding by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2738
- fix response type of document for Text Generation Inference by @jitokim in https://github.com/huggingface/text-generation-inference/pull/2743
- Upgrade outlines to 0.1.1 by @aW3st in https://github.com/huggingface/text-generation-inference/pull/2742
- Upgrading our deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2750
- feat: return streaming errors as an event formatted for openai's client by @drbh in https://github.com/huggingface/text-generation-inference/pull/2668 (sketched after this list)
- Remove vLLM dependency for CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2751
- fix: improve find_segments via numpy diff by @drbh in https://github.com/huggingface/text-generation-inference/pull/2686
- add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2707
- Add support for compressed-tensors w8a8 int checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2745
- feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in https://github.com/huggingface/text-generation-inference/pull/2721
- Simplify two ipex conditions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2755
- Update to moe-kernels 0.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2720
- PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in https://github.com/huggingface/text-generation-inference/pull/2645
- fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in https://github.com/huggingface/text-generation-inference/pull/2760
- nix: update for outlines 0.1.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2764
- Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2758
- nix: build and cache impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2765
- fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2766
- nix: downgrade to outlines 0.1.3 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2768
- fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2770
- fix: tweak grammar test response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2769
- Add a README section about using Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2767
- Remove guideline from API by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2762
- feat: Add automatic nightly benchmarks by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2591
- feat: add payload limit by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2726
- Update to marlin-kernels 0.3.6 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2771
- chore: prepare 2.4.1 release by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2773
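Two of the changes above have a visible API surface worth sketching. First, PR 2668 formats mid-stream failures as OpenAI-style error events, so a stock `openai` client can surface them as an exception rather than failing on an unparseable SSE chunk. A minimal sketch, assuming the same local server as the earlier example (the exact exception type depends on the client version; `APIError` is shown as a reasonable choice):

```python
# Sketch of consuming a TGI stream with the stock `openai` client.
# As of v2.4.1, server-side errors during streaming are emitted as
# OpenAI-formatted error events that the client can parse cleanly.
from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

try:
    stream = client.chat.completions.create(
        model="tgi",  # placeholder; TGI serves a single model
        messages=[{"role": "user", "content": "Write one sentence about llamas."}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as err:
    # Before this release, an error in the middle of a stream could leave
    # the client with an event it could not parse; now it surfaces cleanly.
    print(f"\nstream error: {err}")
```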
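Second, the `/chat_tokenize` endpoint that PR 2710 adds to the API docs applies the model's chat template and returns the resulting tokenization. A hedged sketch follows: the request body mirroring a chat completion request is an assumption, and the response is printed raw rather than assuming its exact schema.

```python
# Sketch: ask a local TGI server how a chat conversation tokenizes after
# the chat template is applied. Server address is assumed as above; the
# response JSON is printed as-is rather than presuming its shape.
import requests

resp = requests.post(
    "http://localhost:8080/chat_tokenize",
    json={"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```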
New Contributors
- @tgaddair made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2711
- @mokeddembillel made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2740
- @jitokim made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2743
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.4.0...v2.4.1