v0.1.20
Release date: 2024-07-14 08:33:05
Latest sgl-project/sglang release: v0.3.0 (2024-09-04 19:50:29)
Highlights
- Enable CUDA graph by default. It brings 1.5x - 2x speedup for small batch size decoding (#612)
- Model support: Gemma2, minicpm, Qwen2 MoE
- Docker support (#217)
- Various latency optimizations
What's Changed
- Add docker file by @Ying1123 in https://github.com/sgl-project/sglang/pull/588
- Add Gemma2 by @Ying1123 in https://github.com/sgl-project/sglang/pull/592
- Format by @Ying1123 in https://github.com/sgl-project/sglang/pull/593
- Fix Llava model by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/594
- fix(detokenizer_manager.py): fix truncated decoded output by @Titan-p in https://github.com/sgl-project/sglang/pull/586
- Add --enable-p2p-check option by @hnyls2002 in https://github.com/sgl-project/sglang/pull/599
- Fix streaming by @hnyls2002 in https://github.com/sgl-project/sglang/pull/600
- Reduce number of workspaces for flashinfer by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/601
- add LogitsMetadata by @hnyls2002 in https://github.com/sgl-project/sglang/pull/604
- add minicpm support by @Titan-p in https://github.com/sgl-project/sglang/pull/602
- Make sglang compat with vllm 0.5.1 by @M0gician in https://github.com/sgl-project/sglang/pull/598
- Add Qwen2 MoE support by @M0gician in https://github.com/sgl-project/sglang/pull/603
- Update chat template for qwen and yi-1.5. by @for-just-we in https://github.com/sgl-project/sglang/pull/530
- [Feat] Expose logprob options to sgl.gen API by @huyiwen in https://github.com/sgl-project/sglang/pull/503
- Fix bench latency by @merrymercy in https://github.com/sgl-project/sglang/pull/607
- Code clean up: Remove deprecated prefill move InputMetadata to infer_batch.py by @merrymercy in https://github.com/sgl-project/sglang/pull/609
- Clean up the usage of flashinfer by @merrymercy in https://github.com/sgl-project/sglang/pull/610
- Cleanup attention backend: flashinfer and triton by @merrymercy in https://github.com/sgl-project/sglang/pull/611
- Enable cuda graph by default by @merrymercy in https://github.com/sgl-project/sglang/pull/612
- Improve benchmark scripts & fix llava by @merrymercy in https://github.com/sgl-project/sglang/pull/613
- Memorypool chunked prefetch by @hnyls2002 in https://github.com/sgl-project/sglang/pull/614
- Improve benchmark scripts by @merrymercy in https://github.com/sgl-project/sglang/pull/615
- Fix memory pool index error by @Ying1123 in https://github.com/sgl-project/sglang/pull/616
- Bump version to 0.1.20 by @merrymercy in https://github.com/sgl-project/sglang/pull/618
New Contributors
- @wisclmy0611 made their first contribution in https://github.com/sgl-project/sglang/pull/594
- @Titan-p made their first contribution in https://github.com/sgl-project/sglang/pull/586
- @M0gician made their first contribution in https://github.com/sgl-project/sglang/pull/598
- @for-just-we made their first contribution in https://github.com/sgl-project/sglang/pull/530
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.1.18...v0.1.20