v2.3.0
Release date: 2024-09-21 00:20:17
Important changes
- Renamed `HUGGINGFACE_HUB_CACHE` to `HF_HOME`. This was done to harmonize environment variables across the HF ecosystem, so data locations in the Docker image moved from `/data/models-....` to `/data/hub/models-....` (see the path sketch after this list).
- Prefix caching by default! To help with long-running queries, TGI now uses prefix caching to reuse pre-existing entries in the KV cache and speed up TTFT. This should be totally transparent for most users; however, it required an intense rewrite of internals, so bugs can potentially exist. We also changed kernels from `paged_attention` to `flashinfer` (with `flashdecoding` as a fallback for some specific models that flashinfer does not support). A toy sketch of the prefix-caching idea follows this list.
- Lots of performance improvements with Marlin and quantization.
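For reference, a minimal sketch (plain Python, not TGI code) of where model data lands under the new layout. It assumes the standard `huggingface_hub` convention of a `hub/` directory under `HF_HOME` containing `models--...` snapshot folders, which is what the Docker image's `/data` volume now holds:

```python
import os
from pathlib import Path

# Minimal sketch, not TGI code. Inside the TGI Docker image, HF_HOME points
# at the mounted /data volume, so snapshots that previously sat directly
# under /data now live under /data/hub. Directory names following
# huggingface_hub's "models--..." convention is an assumption of this sketch.
hf_home = Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))
hub_cache = hf_home / "hub"
print("hub cache:", hub_cache)
print("cached models:", sorted(p.name for p in hub_cache.glob("models--*")))
```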
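For intuition about the prefix-caching change: TGI's router tracks cached prefixes (PR #2491 below adds assertions to its radix trie), so a request that shares a prefix with an earlier one can skip prefill for the shared positions. The toy dict-based lookup here only illustrates the idea; the class and names are hypothetical, not TGI APIs:

```python
from typing import Dict, List, Optional, Tuple

class PrefixCache:
    """Toy prefix cache: maps token-id prefixes to handles for cached KV blocks."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int], kv_handle: str) -> None:
        self._cache[tuple(tokens)] = kv_handle

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, Optional[str]]:
        # The longest cached prefix wins: those positions skip prefill entirely,
        # which is what improves time-to-first-token (TTFT).
        for end in range(len(tokens), 0, -1):
            handle = self._cache.get(tuple(tokens[:end]))
            if handle is not None:
                return end, handle
        return 0, None

cache = PrefixCache()
cache.insert([101, 7592, 2088, 102], "kv-block-0")               # first prompt
hit, handle = cache.longest_prefix([101, 7592, 2088, 102, 999])  # follow-up
print(hit, handle)  # -> 4 kv-block-0: only 1 new token needs prefill
```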
What's Changed
- chore: update to torch 2.4 by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2259
- fix crash in multi-modal by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2245
- fix of use of unquantized weights in cohere GQA loading, also enable … by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2291
- Split up `layers.marlin` into several files by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2292
- fix: refactor adapter weight loading and mapping by @drbh in https://github.com/huggingface/text-generation-inference/pull/2193
- Using g6 instead of g5. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2281
- Some small fixes for the Torch 2.4.0 update by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2304
- Fixing idefics on g6 tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2306
- Fix registry name by @XciD in https://github.com/huggingface/text-generation-inference/pull/2307
- Support tied embeddings in 0.5B and 1.5B Qwen2 models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2313
- feat: add ruff and resolve issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2262
- Run ci api key by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2315
- Install Marlin from standalone package by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2320
- fix: reject grammars without properties by @drbh in https://github.com/huggingface/text-generation-inference/pull/2309
- patch-error-on-invalid-grammar by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2282
- fix: adjust test snapshots and small refactors by @drbh in https://github.com/huggingface/text-generation-inference/pull/2323
- server quantize: store quantizer config in standard format by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2299
- Rebase TRT-llm by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2331
- Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2300
- Pr 2290 ci run by @drbh in https://github.com/huggingface/text-generation-inference/pull/2329
- refactor usage stats by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/2339
- enable HuggingFaceM4/idefics-9b in intel gpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2338
- Fix cache block size for flash decoding by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2351
- Unify attention output handling by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2343
- fix: attempt forward on flash attn2 to check hardware support by @drbh in https://github.com/huggingface/text-generation-inference/pull/2335
- feat: include local lora adapter loading docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2359
- fix: return the out tensor rather then the functions return value by @drbh in https://github.com/huggingface/text-generation-inference/pull/2361
- feat: implement a templated endpoint for visibility into chat requests by @drbh in https://github.com/huggingface/text-generation-inference/pull/2333
- feat: prefer stop over eos_token to align with openai finish_reason by @drbh in https://github.com/huggingface/text-generation-inference/pull/2344
- feat: return the generated text when parsing fails by @drbh in https://github.com/huggingface/text-generation-inference/pull/2353
- fix: default num_ln_in_parallel_attn to one if not supplied by @drbh in https://github.com/huggingface/text-generation-inference/pull/2364
- fix: prefer original layernorm names for 180B by @drbh in https://github.com/huggingface/text-generation-inference/pull/2365
- fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig by @almersawi in https://github.com/huggingface/text-generation-inference/pull/2350
- add gptj modeling in TGI #2366 (CI RUN) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2372
- Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2371
- Pr 2374 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/2378
- fix EleutherAI/gpt-neox-20b does not work in tgi by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2346
- Pr 2337 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/2379
- fix: prefer hidden_activation over hidden_act in gemma2 by @drbh in https://github.com/huggingface/text-generation-inference/pull/2381
- Update Quantization docs and minor doc fix. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2368
- Pr 2352 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/2382
- Add FlashInfer support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2354
- Add experimental flake by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2384
- Using HF_HOME instead of CACHE to get token read in addition to models. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2288
- flake: add fmt and clippy by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2389
- Update documentation for Supported models by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2386
- flake: use rust-overlay by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2390
- Using an enum for flash backens (paged/flashdecoding/flashinfer) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2385
- feat: add guideline to chat request and template by @drbh in https://github.com/huggingface/text-generation-inference/pull/2391
- Update flake for 9.0a capability in Torch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2394
- nix: add router to the devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2396
- Upgrade fbgemm by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2398
- Adding launcher to build. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2397
- Fixing import exl2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2399
- Cpu dockerimage by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2367
- Add support for prefix caching to the v3 router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2392
- Keeping the benchmark somewhere by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2401
- feat: validate template variables before apply and improve sliding wi… by @drbh in https://github.com/huggingface/text-generation-inference/pull/2403
- fix: allocate tmp based on sgmv kernel if available by @drbh in https://github.com/huggingface/text-generation-inference/pull/2345
- fix: improve completions to send a final chunk with usage details by @drbh in https://github.com/huggingface/text-generation-inference/pull/2336
- Updating the flake. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2404
- Pr 2395 ci run by @drbh in https://github.com/huggingface/text-generation-inference/pull/2406
- fix: include create_exllama_buffers and set_device for exllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/2407
- nix: incremental build of the launcher by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2410
- Adding more kernels to flake. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2411
- add numa to improve cpu inference perf by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2330
- fix: adds causal to attention params by @drbh in https://github.com/huggingface/text-generation-inference/pull/2408
- nix: partial incremental build of the router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2416
- Upgrading exl2. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2415
- More fixes trtllm by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2342
- nix: build router incrementally by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2422
- Fixing exl2 and other quanize tests again. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2419
- Upgrading the tests to match the current workings. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2423
- nix: try to reduce the number of Rust rebuilds by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2424
- Improve the Consuming TGI + Streaming docs. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2412
- Further fixes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2426
- doc: Add metrics documentation and add a 'Reference' section by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2230
- All integration tests back everywhere (too many failed CI). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2428
- nix: update to CUDA 12.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2429
- Prefix caching by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2402
- nix: add pure server to flake, add both pure and impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2430
- nix: add `text-generation-benchmark` to pure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2431
- Adding eetq to flake. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2438
- nix: add awq-inference-engine as server dependency by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2442
- nix: add default package by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2453
- Fix: don't apply post layernorm in SiglipVisionTransformer by @drbh in https://github.com/huggingface/text-generation-inference/pull/2459
- Pr 2451 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/2454
- Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2462
- fix: bump minijinja version and add test for llama 3.1 tools by @drbh in https://github.com/huggingface/text-generation-inference/pull/2463
- fix: improve regex expression by @drbh in https://github.com/huggingface/text-generation-inference/pull/2468
- nix: build Torch against MKL and various other improvements by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2469
- Lots of improvements (Still 2 allocators) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2449
- feat: add /v1/models endpoint by @drbh in https://github.com/huggingface/text-generation-inference/pull/2433
- update doc with intel cpu part by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2420
- Tied embeddings in MLP speculator. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2473
- nix: improve impure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2478
- nix: add punica-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2477
- fix: enable chat requests in vertex endpoint by @drbh in https://github.com/huggingface/text-generation-inference/pull/2481
- feat: support lora revisions and qkv_proj weights by @drbh in https://github.com/huggingface/text-generation-inference/pull/2482
- hotfix: avoid non-prefilled block use when using prefix caching by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2489
- Adding links to Adyen blogpost. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2492
- Add two handy gitignores for Nix environments by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2484
- hotfix: fix regression of attention api change in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2439
- nix: add pyright/ruff for proper LSP in the impure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2496
- Fix incompatibility with latest `syrupy` and update in Poetry by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2497
- radix trie: add assertions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2491
- hotfix: add syrupy to the right subproject by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2499
- Add links to Adyen blogpost by @martinigoyanes in https://github.com/huggingface/text-generation-inference/pull/2500
- Fixing more correctly the invalid drop of the batch. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2498
- Add Directory Check to Prevent Redundant Cloning in Build Process by @vamsivallepu in https://github.com/huggingface/text-generation-inference/pull/2486
- Prefix test - Different kind of load test to trigger prefix test bugs. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2490
- Fix tokenization yi by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2507
- Fix truffle by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2514
- nix: support Python tokenizer conversion in the router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2515
- Add nix test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2513
- fix: pass missing revision arg for lora adapter when loading multiple… by @drbh in https://github.com/huggingface/text-generation-inference/pull/2510
- hotfix : enable intel ipex cpu and xpu in python3.11 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2517
- Use `ratatui` not (deprecated) `tui` by @strickvl in https://github.com/huggingface/text-generation-inference/pull/2521
- Add tests for Mixtral by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2520
- Adding a test for FD. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2516
- nix: pure Rust check/fmt/clippy/test by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2525
- fix: metrics unbounded memory by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2528
- Move to moe-kernels package and switch to common MoE layer by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2511
- Stream options. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2533 (see the usage sketch after this list)
- Update to moe-kenels 0.3.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2535
- doc: clarify that `--quantize` is not needed for pre-quantized models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2536
- hotfix: ipex fails since cuda moe kernel is not supported by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2532
- fix: wrap python basic logs in debug assertion in launcher by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2539
- Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2540
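Two of the API-facing changes above are easiest to see from the client side: stream options (#2533) and the final usage chunk for completions (#2336). Below is a hedged sketch against TGI's OpenAI-compatible chat endpoint; the `stream_options` field mirrors the OpenAI spec, and whether a given TGI build accepts it is an assumption worth verifying against your server:

```python
# Sketch only: assumes a TGI server on localhost:8080 exposing the
# OpenAI-compatible /v1/chat/completions route, and that this build
# supports stream_options (added around this release).
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
        "stream_options": {"include_usage": True},  # ask for a final usage chunk
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        continue
    chunk = json.loads(line[len(b"data: "):])
    if chunk.get("usage"):
        print("\nusage:", chunk["usage"])  # prompt/completion token counts
    elif chunk.get("choices"):
        print(chunk["choices"][0]["delta"].get("content", ""), end="")
```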
New Contributors
- @almersawi made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2350
- @Vaibhavs10 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2368
- @mfuntowicz made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2342
- @vamsivallepu made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2486
- @strickvl made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2521
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0