v0.3.0
Release date: 2024-09-04 19:50:29
Latest release of sgl-project/sglang: v0.3.0 (2024-09-04 19:50:29)
Highlights
Check out the release blog post at https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them in the same batch).
- Add multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
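
To make the chunked prefill item above more concrete, here is a minimal, illustrative sketch of the general idea: a long prompt is split into fixed-size chunks, and each step either runs a prefill chunk alone ("separate") or alongside the decode tokens of already-running requests ("mixed"). This is not SGLang's scheduler; `chunk_prompt`, `plan_steps`, and all parameter names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def chunk_prompt(token_ids: List[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


def plan_steps(
    prompt_len: int, chunk_size: int, running_decodes: int, mixed: bool
) -> List[Tuple[int, int]]:
    """Return one (prefill_tokens, decode_tokens) pair per prefill step.

    Hypothetical model of the two styles:
    - separate: a prefill chunk runs alone, so decode_tokens is 0.
    - mixed: each prefill chunk shares its batch with one decode token
      per already-running request.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    steps = []
    for start in range(0, prompt_len, chunk_size):
        prefill_tokens = min(chunk_size, prompt_len - start)
        decode_tokens = running_decodes if mixed else 0
        steps.append((prefill_tokens, decode_tokens))
    return steps


if __name__ == "__main__":
    print(chunk_prompt(list(range(10)), 4))
    print(plan_steps(prompt_len=10, chunk_size=4, running_decodes=3, mixed=True))
    print(plan_steps(prompt_len=10, chunk_size=4, running_decodes=3, mixed=False))
```

In this toy model, the mixed style keeps running requests generating while a long prompt is still being prefilled, at the cost of larger per-step batches. See the release blog post for how SGLang actually configures and schedules this.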
What's Changed
- update hyperparameter guide by @merrymercy in https://github.com/sgl-project/sglang/pull/1114
- ci: compatible with fork repo by @zhyncs in https://github.com/sgl-project/sglang/pull/1115
- fix: resolve Python.h header missing by @zhyncs in https://github.com/sgl-project/sglang/pull/1119
- Fix the deadlock in multi-node tp by @merrymercy in https://github.com/sgl-project/sglang/pull/1122
- Mixed style of chunked prefill by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1131
- Fix CI accuracy && time out limit by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1133
- fix: use fp16 dtype for sm75 by @zhyncs in https://github.com/sgl-project/sglang/pull/1136
- Improve the code style: more comments and remove useless packages by @merrymercy in https://github.com/sgl-project/sglang/pull/1139
- Improve benchmark by @merrymercy in https://github.com/sgl-project/sglang/pull/1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1141
- fixed a typo by @min-xu-et in https://github.com/sgl-project/sglang/pull/1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in https://github.com/sgl-project/sglang/pull/1144
- [Feat]Add support for optional start len of logprobs by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in https://github.com/sgl-project/sglang/pull/1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in https://github.com/sgl-project/sglang/pull/1134
- Improve docs and warnings by @merrymercy in https://github.com/sgl-project/sglang/pull/1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in https://github.com/sgl-project/sglang/pull/1148
- misc: add hypervisor vendor by @zhyncs in https://github.com/sgl-project/sglang/pull/1165
- support /v1/health using a generation 1 token by @LucienShui in https://github.com/sgl-project/sglang/pull/1154
- fix: resolve README render by @zhyncs in https://github.com/sgl-project/sglang/pull/1166
- [Feat] Support update weights without restart server by @shanyu-sys in https://github.com/sgl-project/sglang/pull/1157
- Improve multi-node stability by @merrymercy in https://github.com/sgl-project/sglang/pull/1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in https://github.com/sgl-project/sglang/pull/1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1170
- Support min-p sampling by @intervitens in https://github.com/sgl-project/sglang/pull/1167
- [Docs] Fix rendering of details in README by @Michaelvll in https://github.com/sgl-project/sglang/pull/1179
- Improve code style of sampler by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in https://github.com/sgl-project/sglang/pull/1180
- Fix broken penalty by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1184
- Fix benchmark script by @Ying1123 in https://github.com/sgl-project/sglang/pull/1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in https://github.com/sgl-project/sglang/pull/1123
- feat: use gelu_tanh_and_mul by @zhyncs in https://github.com/sgl-project/sglang/pull/1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in https://github.com/sgl-project/sglang/pull/1194
- Update README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in https://github.com/sgl-project/sglang/pull/1202
- [Fix] the issue of random order when input is a list by @Ying1123 in https://github.com/sgl-project/sglang/pull/1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in https://github.com/sgl-project/sglang/pull/1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in https://github.com/sgl-project/sglang/pull/1208
- [Minor] Temporarily skip flaky test by @Ying1123 in https://github.com/sgl-project/sglang/pull/1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in https://github.com/sgl-project/sglang/pull/1211
- Update CI workflows by @merrymercy in https://github.com/sgl-project/sglang/pull/1210
- Update CI runner docs by @merrymercy in https://github.com/sgl-project/sglang/pull/1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in https://github.com/sgl-project/sglang/pull/1204
- Update workflow files by @merrymercy in https://github.com/sgl-project/sglang/pull/1214
- improve the threshold and ports in tests by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1215
- [CI] Fix CI by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1217
- [Fix] Multi-images loading error by @kcz358 in https://github.com/sgl-project/sglang/pull/1218
- [Minor] improve CI and dependencies by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1219
- Move sampler into CUDA graph by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1201
- chore: bump v0.2.14 by @zhyncs in https://github.com/sgl-project/sglang/pull/1155
- [FEAT] JSON constrained support by @havetc in https://github.com/sgl-project/sglang/pull/1125
- Torch compile CI throughput test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1223
- [FEAT] Support batches cancel by @caiyueliang in https://github.com/sgl-project/sglang/pull/1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1227
- [FIX] Wrong logger by @havetc in https://github.com/sgl-project/sglang/pull/1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in https://github.com/sgl-project/sglang/pull/1231
- Fix readme by @ArtificialZeng in https://github.com/sgl-project/sglang/pull/1236
- Fix bench latency benchmark by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1225
- [Minor] Add more type annotations by @merrymercy in https://github.com/sgl-project/sglang/pull/1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/1233
- Update README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/1239
- hotfix: revert sampler CUDA Graph by @zhyncs in https://github.com/sgl-project/sglang/pull/1242
- Add sglang.bench_latency to CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in https://github.com/sgl-project/sglang/pull/1244
- feat: update GemmaRMSNorm by @zhyncs in https://github.com/sgl-project/sglang/pull/1232
- Fix llava on multi images by @merrymercy in https://github.com/sgl-project/sglang/pull/1247
- feat: replace GeluAndMul by @zhyncs in https://github.com/sgl-project/sglang/pull/1234
- fix: resolve qwen2 moe weight loader by @zhyncs in https://github.com/sgl-project/sglang/pull/1252
- chore: bump v0.2.14.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/1250
- make json_schema usable from gen by @qeternity in https://github.com/sgl-project/sglang/pull/1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/1255
- Sampler cudagraph by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in https://github.com/sgl-project/sglang/pull/1260
- Transpose mla weight offline by @ispobock in https://github.com/sgl-project/sglang/pull/1261
- EXAONE 3.0 Model Support by @Deepfocused in https://github.com/sgl-project/sglang/pull/1258
- Update README Support Exaone 3.0 by @Deepfocused in https://github.com/sgl-project/sglang/pull/1267
- Report median instead of mean in bench_latency.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1269
- Allow more flexible assistant and system response by @BabyChouSr in https://github.com/sgl-project/sglang/pull/1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/1276
- [doc] fix quick start link by @ByronHsu in https://github.com/sgl-project/sglang/pull/1282
- Optimize the update flashinfer indices by @xiaobochen123 in https://github.com/sgl-project/sglang/pull/1262
- [CI] Add more multi-gpu tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in https://github.com/sgl-project/sglang/pull/1285
- [CI] merge all ci tests into one file by @merrymercy in https://github.com/sgl-project/sglang/pull/1289
- Support Triton fp8 e5m2 kv cache by @ispobock in https://github.com/sgl-project/sglang/pull/1286
- [triton] Remove the zero initialization of qk_acc by directly writing the result by @ByronHsu in https://github.com/sgl-project/sglang/pull/1288
- [Chore] Rename model_overide_args to model_override_args by @kevin85421 in https://github.com/sgl-project/sglang/pull/1284
- Allow new lines during JSON generation by @qeternity in https://github.com/sgl-project/sglang/pull/1277
- fix: resolve fp8 for mixtral by @zhyncs in https://github.com/sgl-project/sglang/pull/1290
- ci: add nightly eval by @zhyncs in https://github.com/sgl-project/sglang/pull/1291
- Fix the flaky tests in test_moe_eval_accuracy_large.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1293
- [doc] Fix more broken links by @ByronHsu in https://github.com/sgl-project/sglang/pull/1294
- Fix regex mask by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1296
- Fix hang when doing s += None. by @max99x in https://github.com/sgl-project/sglang/pull/1297
- Release v0.2.15 by @merrymercy in https://github.com/sgl-project/sglang/pull/1295
- feat: update nightly gsm8k eval by @zhyncs in https://github.com/sgl-project/sglang/pull/1304
- Fix bugs in sampler with CUDA graph / torch.compile by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1306
- [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping by @merrymercy in https://github.com/sgl-project/sglang/pull/1308
- Support Phi3 mini and medium by @janimo in https://github.com/sgl-project/sglang/pull/1299
- Update README.md for llava-onevision instructions by @merrymercy in https://github.com/sgl-project/sglang/pull/1313
- Fix llama2 weight loader by @merrymercy in https://github.com/sgl-project/sglang/pull/1317
- Fix select by ensuring each request has at least one token by @merrymercy in https://github.com/sgl-project/sglang/pull/1318
- misc: speedup load safetensors by @zhyncs in https://github.com/sgl-project/sglang/pull/1319
- chore: bump v0.3.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/1320
- Fix the flaky test test_moe_eval_accuracy_large.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1326
- docs: update news by @zhyncs in https://github.com/sgl-project/sglang/pull/1327
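
For readers unfamiliar with the min-p sampling entry in the list above, the following is a small sketch of the technique as it is commonly described: tokens whose probability falls below `min_p` times the probability of the most likely token are dropped, and the remainder is renormalized before sampling. It is a plain-Python illustration under that assumption, not the code from the pull request; `min_p_filter` and the example distribution are hypothetical.

```python
from typing import Dict


def min_p_filter(probs: Dict[str, float], min_p: float) -> Dict[str, float]:
    """Keep tokens with probability >= min_p * max probability, then renormalize."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    if not probs:
        raise ValueError("probs must not be empty")
    # The cutoff scales with the model's confidence in its top token.
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


if __name__ == "__main__":
    # Hypothetical next-token distribution.
    probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    # With min_p = 0.2 the cutoff is 0.2 * 0.5 = 0.1, so "d" is dropped.
    print(min_p_filter(probs, min_p=0.2))
```

Because the cutoff is relative to the top probability, the filter prunes aggressively when the model is confident and leaves more candidates when the distribution is flat.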
New Contributors
- @Michaelvll made their first contribution in https://github.com/sgl-project/sglang/pull/1144
- @Xu-Chen made their first contribution in https://github.com/sgl-project/sglang/pull/1148
- @shanyu-sys made their first contribution in https://github.com/sgl-project/sglang/pull/1157
- @intervitens made their first contribution in https://github.com/sgl-project/sglang/pull/1167
- @zhaochenyang20 made their first contribution in https://github.com/sgl-project/sglang/pull/1186
- @havetc made their first contribution in https://github.com/sgl-project/sglang/pull/1125
- @caiyueliang made their first contribution in https://github.com/sgl-project/sglang/pull/1222
- @ArtificialZeng made their first contribution in https://github.com/sgl-project/sglang/pull/1236
- @lxww302 made their first contribution in https://github.com/sgl-project/sglang/pull/1260
- @Deepfocused made their first contribution in https://github.com/sgl-project/sglang/pull/1258
- @ByronHsu made their first contribution in https://github.com/sgl-project/sglang/pull/1282
- @xiaobochen123 made their first contribution in https://github.com/sgl-project/sglang/pull/1262
- @kevin85421 made their first contribution in https://github.com/sgl-project/sglang/pull/1284
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.2.13...v0.3.0