v0.2.0
Release date: 2024-07-25 23:58:24
Latest release of sgl-project/sglang: v0.3.0 (2024-09-04 19:50:29)
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
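To try one of the newly supported models, the snippet below is a minimal sketch using the SGLang frontend language. The model path, prompt, and generation parameters are illustrative placeholders (a chat-tuned checkpoint is assumed so the user/assistant roles map onto its chat template).

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    # Build a chat-style prompt and ask the model for a short answer.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Start a local runtime for one of the newly supported models.
# The model path is illustrative; any supported checkpoint works.
runtime = sgl.Runtime(model_path="mistralai/Mistral-Nemo-Instruct-2407")
sgl.set_default_backend(runtime)

state = qa.run(question="Summarize what SGLang does in one sentence.")
print(state["answer"])

runtime.shutdown()
```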
What's Changed
- Optimize mem indices mangement by @hnyls2002 in https://github.com/sgl-project/sglang/pull/619
- Unify index operations by @hnyls2002 in https://github.com/sgl-project/sglang/pull/620
- Simplify mem state by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/623
- Improve tensor parallel performance by @Ying1123 in https://github.com/sgl-project/sglang/pull/625
- Bump version to 0.1.21 by @Ying1123 in https://github.com/sgl-project/sglang/pull/626
- Fix model forward grad by @hnyls2002 in https://github.com/sgl-project/sglang/pull/628
- Update docker file by @Ying1123 in https://github.com/sgl-project/sglang/pull/629
- Disable NCCL_NVLS by default by @Ying1123 in https://github.com/sgl-project/sglang/pull/631
- Add qwen2 tie word embedding by @yileld in https://github.com/sgl-project/sglang/pull/630
- Add support for VertexAI safety settings by @AidanCooper in https://github.com/sgl-project/sglang/pull/624
- Fix vertexai by @hnyls2002 in https://github.com/sgl-project/sglang/pull/633
- Reduce docker size by @hnyls2002 in https://github.com/sgl-project/sglang/pull/632
- clean up step function by @Ying1123 in https://github.com/sgl-project/sglang/pull/635
- feat: support internlm2 by @zhyncs in https://github.com/sgl-project/sglang/pull/636
- misc: add pre-commit config by @zhyncs in https://github.com/sgl-project/sglang/pull/637
- misc: add issue and pr template by @zhyncs in https://github.com/sgl-project/sglang/pull/638
- Flashinfer sample kernel by @hnyls2002 in https://github.com/sgl-project/sglang/pull/617
- Move `global_server_args_dict` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/642
- Increase the capacity of the memory pool by @Ying1123 in https://github.com/sgl-project/sglang/pull/643
- feat: add check_env by @zhyncs in https://github.com/sgl-project/sglang/pull/645
- Remove the dependency of rpyc by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in https://github.com/sgl-project/sglang/pull/649
- fix: set ulimit -n 65535 by @zhyncs in https://github.com/sgl-project/sglang/pull/647
- feat: add lint workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/648
- fix: resolve lint error by @zhyncs in https://github.com/sgl-project/sglang/pull/650
- Remove useless variables in infer_batch.py by @Ying1123 in https://github.com/sgl-project/sglang/pull/651
- Detokenize incrementally when streaming by @hnyls2002 in https://github.com/sgl-project/sglang/pull/653
- `TokenizerManager.context_len` should inherit from `server_args.conte… by @shrirajh in https://github.com/sgl-project/sglang/pull/654
- Remove cached triton launcher by @merrymercy in https://github.com/sgl-project/sglang/pull/656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in https://github.com/sgl-project/sglang/pull/658
- feat: add benchmark serving by @zhyncs in https://github.com/sgl-project/sglang/pull/657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in https://github.com/sgl-project/sglang/pull/655
- misc: update SGLang package description by @zhyncs in https://github.com/sgl-project/sglang/pull/659
- Update Readme by @Ying1123 in https://github.com/sgl-project/sglang/pull/660
- feat: update check env by @zhyncs in https://github.com/sgl-project/sglang/pull/661
- Improve docs by @Ying1123 in https://github.com/sgl-project/sglang/pull/662
- Add benchmark instructions by @Ying1123 in https://github.com/sgl-project/sglang/pull/663
- Fix jump forward when streaming by @hnyls2002 in https://github.com/sgl-project/sglang/pull/665
- Fix kill process util by @ispobock in https://github.com/sgl-project/sglang/pull/666
- Add support for OpenAI API parallel sampling (see the usage sketch after this list) by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/640
- Update OpenAI API by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/667
- Temporary fix invalid sample results by @hnyls2002 in https://github.com/sgl-project/sglang/pull/668
- Support random dataset in bench_serving.py by @merrymercy in https://github.com/sgl-project/sglang/pull/669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in https://github.com/sgl-project/sglang/pull/673
- refactor model loader: initial refactor by @Ying1123 in https://github.com/sgl-project/sglang/pull/664
- Fix cuda graph with flashinfer by @merrymercy in https://github.com/sgl-project/sglang/pull/675
- Tmp fix illegal sample by @hnyls2002 in https://github.com/sgl-project/sglang/pull/676
- Update version to 0.1.22 by @Ying1123 in https://github.com/sgl-project/sglang/pull/677
- Fallback when sampling failed by @ispobock in https://github.com/sgl-project/sglang/pull/678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in https://github.com/sgl-project/sglang/pull/670
- Decouple kv by @hnyls2002 in https://github.com/sgl-project/sglang/pull/679
- Support gpt-bigcode model class by @hnyls2002 in https://github.com/sgl-project/sglang/pull/681
- support non-streaming benchmark by @merrymercy in https://github.com/sgl-project/sglang/pull/682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in https://github.com/sgl-project/sglang/pull/684
- feat: update bench serving by @zhyncs in https://github.com/sgl-project/sglang/pull/685
- misc: update output file logic by @zhyncs in https://github.com/sgl-project/sglang/pull/686
- Allow disabling streaming in bench by @merrymercy in https://github.com/sgl-project/sglang/pull/687
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/688
- Support Deepseek MoE Model by @hnyls2002 in https://github.com/sgl-project/sglang/pull/689
- misc: recommend to use chat model for benchmark by @zhyncs in https://github.com/sgl-project/sglang/pull/690
- Support Mistral-Nemo by @ispobock in https://github.com/sgl-project/sglang/pull/691
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/692
- fix: update bench serving by @zhyncs in https://github.com/sgl-project/sglang/pull/694
- misc: update output token logic by @zhyncs in https://github.com/sgl-project/sglang/pull/695
- Tune params by @Ying1123 in https://github.com/sgl-project/sglang/pull/696
- Fix trt benchmark by @Ying1123 in https://github.com/sgl-project/sglang/pull/697
- misc: fix typo by @zhyncs in https://github.com/sgl-project/sglang/pull/698
- Fix flashinfer by @Ying1123 in https://github.com/sgl-project/sglang/pull/700
- Fix hf config loading by @ispobock in https://github.com/sgl-project/sglang/pull/702
- Use min new token ratio at start by @hnyls2002 in https://github.com/sgl-project/sglang/pull/701
- feat: add e2e latency by @zhyncs in https://github.com/sgl-project/sglang/pull/704
- Update vllm version to support llama3.1 by @Ying1123 in https://github.com/sgl-project/sglang/pull/705
- bump version to 0.1.23 by @Ying1123 in https://github.com/sgl-project/sglang/pull/706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/707
- Fix multi-node deadlock by @merrymercy in https://github.com/sgl-project/sglang/pull/709
- Auto adjust new ratio by @hnyls2002 in https://github.com/sgl-project/sglang/pull/708
- Fix prefill size by @Ying1123 in https://github.com/sgl-project/sglang/pull/711
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/712
- docs: update doc by @zhyncs in https://github.com/sgl-project/sglang/pull/713
- fix: llama 3.1 405b fp8 by @zhyncs in https://github.com/sgl-project/sglang/pull/714
- misc: update doc by @zhyncs in https://github.com/sgl-project/sglang/pull/715
- Improve benchmark scripts by @Ying1123 in https://github.com/sgl-project/sglang/pull/717
- Bump version to 0.1.24 by @Ying1123 in https://github.com/sgl-project/sglang/pull/718
- docs: update supported models by @zhyncs in https://github.com/sgl-project/sglang/pull/719
- docs: update comment by @zhyncs in https://github.com/sgl-project/sglang/pull/721
- chore: add close inactive issues workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/722
- misc: update build instruction by @zhyncs in https://github.com/sgl-project/sglang/pull/724
- fix: fp8 config by @Ying1123 in https://github.com/sgl-project/sglang/pull/723
- Fix dockerfile and triton cache manager by @hnyls2002 in https://github.com/sgl-project/sglang/pull/720
- chore: bump v0.1.25 by @zhyncs in https://github.com/sgl-project/sglang/pull/725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in https://github.com/sgl-project/sglang/pull/726
- misc: update bug issue template by @zhyncs in https://github.com/sgl-project/sglang/pull/727
- Revert "fix: fp8 config" by @Ying1123 in https://github.com/sgl-project/sglang/pull/728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in https://github.com/sgl-project/sglang/pull/729
- Bump version to 0.2.0 by @Ying1123 in https://github.com/sgl-project/sglang/pull/730
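As a usage illustration for the OpenAI API parallel sampling support added in #640 and #667 above, the sketch below requests several completions in a single call via the `n` parameter. It assumes an SGLang server is already running locally (e.g. launched with `python -m sglang.launch_server --model-path <model> --port 30000`); the port, model name, prompt, and sampling settings are placeholder assumptions.

```python
import openai

# Point the OpenAI client at the locally running SGLang server
# (base_url and api_key are placeholders for a local deployment).
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List three uses of a paperclip."}],
    n=3,            # parallel sampling: request three completions in one call
    max_tokens=64,
    temperature=0.8,
)

for i, choice in enumerate(response.choices):
    print(f"--- sample {i} ---")
    print(choice.message.content)
```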
New Contributors
- @yileld made their first contribution in https://github.com/sgl-project/sglang/pull/630
- @AidanCooper made their first contribution in https://github.com/sgl-project/sglang/pull/624
- @zhyncs made their first contribution in https://github.com/sgl-project/sglang/pull/636
- @shrirajh made their first contribution in https://github.com/sgl-project/sglang/pull/654
- @yichuan520030910320 made their first contribution in https://github.com/sgl-project/sglang/pull/640
- @max99x made their first contribution in https://github.com/sgl-project/sglang/pull/684
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.1.20...v0.2.0