v0.3.0

Release date: 2024-09-04 19:50:29

Latest release of sgl-project/sglang: v0.3.0 (2024-09-04 19:50:29)

Highlights

Check out the release blog post at https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.

Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
Up to 1.5x lower latency with torch.compile on small batch sizes
Support for interleaved text and multi-image/video in LLaVA-OneVision
Support for interleaved window attention and 2x longer context length in Gemma-2
Chunked prefill is turned on by default (you can choose to run prefill and decode separately or mix them in the same batch).
Added multi-GPU accuracy and performance tests, plus nightly accuracy tests for more models.
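
To make the chunked prefill item above more concrete, here is a toy sketch of the idea: a long prompt is split into fixed-size chunks, and each chunk is either run in its own batch (separate) or shares a batch with the decode steps of already-running requests (mixed). This is an illustration only, not SGLang's scheduler. The function names `split_prefill` and `plan_batches`, the batch tuple layout, and the assumption that separate mode alternates prefill and decode batches are all hypothetical.

```python
from typing import List, Tuple


def split_prefill(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt of `prompt_len` tokens into [start, end) chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


def plan_batches(
    prompt_len: int, chunk_size: int, num_decoding: int, mixed: bool
) -> List[Tuple[str, int, int]]:
    """Plan batches as (kind, prefill_tokens, decode_tokens) tuples.

    Hypothetical simplification: `num_decoding` running requests each
    produce one decode token per step while the new prompt is prefilled.
    """
    batches = []
    for start, end in split_prefill(prompt_len, chunk_size):
        if mixed:
            # One batch carries a prefill chunk plus the decode tokens.
            batches.append(("mixed", end - start, num_decoding))
        else:
            # Prefill chunk and decode step run as separate batches.
            batches.append(("prefill", end - start, 0))
            if num_decoding:
                batches.append(("decode", 0, num_decoding))
    return batches


if __name__ == "__main__":
    for batch in plan_batches(prompt_len=10, chunk_size=4, num_decoding=2, mixed=True):
        print(batch)
```

In this sketch, mixing halves the number of batches when other requests are decoding, at the cost of larger individual batches. Whether that trade-off helps in practice depends on the workload.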

What's Changed

update hyperparameter guide by @merrymercy in https://github.com/sgl-project/sglang/pull/1114
ci: compatible with fork repo by @zhyncs in https://github.com/sgl-project/sglang/pull/1115
fix: resolve Python.h header missing by @zhyncs in https://github.com/sgl-project/sglang/pull/1119
Fix the deadlock in multi-node tp by @merrymercy in https://github.com/sgl-project/sglang/pull/1122
Mixed style of chunked prefill by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1013
Fix port conflicts between local CI and runner CI. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1131
Fix CI accuracy && time out limit by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1133
fix: use fp16 dtype for sm75 by @zhyncs in https://github.com/sgl-project/sglang/pull/1136
Improve the code style: more comments and remove useless packages by @merrymercy in https://github.com/sgl-project/sglang/pull/1139
Improve benchmark by @merrymercy in https://github.com/sgl-project/sglang/pull/1140
Fix duplicated imports in hf_transformers_utils.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1141
fixed a typo by @min-xu-et in https://github.com/sgl-project/sglang/pull/1143
[Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in https://github.com/sgl-project/sglang/pull/1144
[Feat]Add support for optional start len of logprobs by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1035
Optimize MLA/GQA/MQA Triton decoding by @ispobock in https://github.com/sgl-project/sglang/pull/1138
feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in https://github.com/sgl-project/sglang/pull/1134
Improve docs and warnings by @merrymercy in https://github.com/sgl-project/sglang/pull/1164
[Feature] add disable-custom-all-reduce by @Xu-Chen in https://github.com/sgl-project/sglang/pull/1148
misc: add hypervisor vendor by @zhyncs in https://github.com/sgl-project/sglang/pull/1165
support /v1/health using a generation 1 token by @LucienShui in https://github.com/sgl-project/sglang/pull/1154
fix: resolve README render by @zhyncs in https://github.com/sgl-project/sglang/pull/1166
[Feat] Support update weights without restart server by @shanyu-sys in https://github.com/sgl-project/sglang/pull/1157
Improve multi-node stability by @merrymercy in https://github.com/sgl-project/sglang/pull/1171
fix: custom op fallback forward native when lower sm80 by @zhyncs in https://github.com/sgl-project/sglang/pull/1177
[Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1170
Support min-p sampling by @intervitens in https://github.com/sgl-project/sglang/pull/1167
[Docs] Fix rendering of details in README by @Michaelvll in https://github.com/sgl-project/sglang/pull/1179
Improve code style of sampler by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1168
[Minor] Improve logging and rename the health check endpoint name by @merrymercy in https://github.com/sgl-project/sglang/pull/1180
Fix broken penalty by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1184
Fix benchmark script by @Ying1123 in https://github.com/sgl-project/sglang/pull/1185
[Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in https://github.com/sgl-project/sglang/pull/1123
feat: use gelu_tanh_and_mul by @zhyncs in https://github.com/sgl-project/sglang/pull/1193
Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in https://github.com/sgl-project/sglang/pull/1194
Update README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/1198
[CI] Fix the problem of hf runner too slow by @Ying1123 in https://github.com/sgl-project/sglang/pull/1202
[Fix] the issue of random order when input is a list by @Ying1123 in https://github.com/sgl-project/sglang/pull/1199
Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1207
[Fix] Fixing the multi-images error for llava-onevision by @kcz358 in https://github.com/sgl-project/sglang/pull/1205
Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in https://github.com/sgl-project/sglang/pull/1186
[Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in https://github.com/sgl-project/sglang/pull/1208
[Minor] Temporarily skip flaky test by @Ying1123 in https://github.com/sgl-project/sglang/pull/1209
[CI] Fix the issue of unit test hanging by @Ying1123 in https://github.com/sgl-project/sglang/pull/1211
Update CI workflows by @merrymercy in https://github.com/sgl-project/sglang/pull/1210
Update CI runner docs by @merrymercy in https://github.com/sgl-project/sglang/pull/1213
[Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in https://github.com/sgl-project/sglang/pull/1204
Update workflow files by @merrymercy in https://github.com/sgl-project/sglang/pull/1214
improve the threshold and ports in tests by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1215
[CI] Fix CI by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1217
[Fix] Multi-images loading error by @kcz358 in https://github.com/sgl-project/sglang/pull/1218
[Minor] improve CI and dependencies by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1212
[CI] Parallelize unit tests in CI by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1219
Move sampler into CUDA graph by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1201
chore: bump v0.2.14 by @zhyncs in https://github.com/sgl-project/sglang/pull/1155
[FEAT] JSON constrained support by @havetc in https://github.com/sgl-project/sglang/pull/1125
Torch compile CI throughput test by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1223
[FEAT] Support batches cancel by @caiyueliang in https://github.com/sgl-project/sglang/pull/1222
[Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/1227
[FIX] Wrong logger by @havetc in https://github.com/sgl-project/sglang/pull/1230
feat: replace get_act_fn for gpt_bigcode by @zhyncs in https://github.com/sgl-project/sglang/pull/1231
Fix readme by @ArtificialZeng in https://github.com/sgl-project/sglang/pull/1236
Fix bench latency benchmark by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1225
[Minor] Add more type annotations by @merrymercy in https://github.com/sgl-project/sglang/pull/1237
feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in https://github.com/sgl-project/sglang/pull/1233
Update README.md by @merrymercy in https://github.com/sgl-project/sglang/pull/1239
hotfix: revert sampler CUDA Graph by @zhyncs in https://github.com/sgl-project/sglang/pull/1242
Add sglang.bench_latency to CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1243
fix: increase max_new_tokens when testing generation models by @zhyncs in https://github.com/sgl-project/sglang/pull/1244
feat: update GemmaRMSNorm by @zhyncs in https://github.com/sgl-project/sglang/pull/1232
Fix llava on multi images by @merrymercy in https://github.com/sgl-project/sglang/pull/1247
feat: replace GeluAndMul by @zhyncs in https://github.com/sgl-project/sglang/pull/1234
fix: resolve qwen2 moe weight loader by @zhyncs in https://github.com/sgl-project/sglang/pull/1252
chore: bump v0.2.14.post2 by @zhyncs in https://github.com/sgl-project/sglang/pull/1250
make json_schema usable from gen by @qeternity in https://github.com/sgl-project/sglang/pull/1254
fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/1255
Sampler cudagraph by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1253
fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in https://github.com/sgl-project/sglang/pull/1260
Transpose mla weight offline by @ispobock in https://github.com/sgl-project/sglang/pull/1261
EXAONE 3.0 Model Support by @Deepfocused in https://github.com/sgl-project/sglang/pull/1258
Update README Support Exaone 3.0 by @Deepfocused in https://github.com/sgl-project/sglang/pull/1267
Report median instead of mean in bench_latency.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1269
Allow more flexible assistant and system response by @BabyChouSr in https://github.com/sgl-project/sglang/pull/1256
fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in https://github.com/sgl-project/sglang/pull/1276
[doc] fix quick start link by @ByronHsu in https://github.com/sgl-project/sglang/pull/1282
Optimize the update flashinfer indices by @xiaobochen123 in https://github.com/sgl-project/sglang/pull/1262
[CI] Add more multi-gpu tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1280
feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in https://github.com/sgl-project/sglang/pull/1285
[CI] merge all ci tests into one file by @merrymercy in https://github.com/sgl-project/sglang/pull/1289
Support Triton fp8 e5m2 kv cache by @ispobock in https://github.com/sgl-project/sglang/pull/1286
[triton] Remove the zero initialization of qk_acc by directly writing the result by @ByronHsu in https://github.com/sgl-project/sglang/pull/1288
[Chore] Rename model_overide_args to model_override_args by @kevin85421 in https://github.com/sgl-project/sglang/pull/1284
Allow new lines during JSON generation by @qeternity in https://github.com/sgl-project/sglang/pull/1277
fix: resolve fp8 for mixtral by @zhyncs in https://github.com/sgl-project/sglang/pull/1290
ci: add nightly eval by @zhyncs in https://github.com/sgl-project/sglang/pull/1291
Fix the flaky tests in test_moe_eval_accuracy_large.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1293
[doc] Fix more broken links by @ByronHsu in https://github.com/sgl-project/sglang/pull/1294
Fix regex mask by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1296
Fix hang when doing s += None. by @max99x in https://github.com/sgl-project/sglang/pull/1297
Release v0.2.15 by @merrymercy in https://github.com/sgl-project/sglang/pull/1295
feat: update nightly gsm8k eval by @zhyncs in https://github.com/sgl-project/sglang/pull/1304
Fix bugs in sampler with CUDA graph / torch.compile by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1306
[Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping by @merrymercy in https://github.com/sgl-project/sglang/pull/1308
Support Phi3 mini and medium by @janimo in https://github.com/sgl-project/sglang/pull/1299
Update README.md for llava-onevision instructions by @merrymercy in https://github.com/sgl-project/sglang/pull/1313
Fix llama2 weight loader by @merrymercy in https://github.com/sgl-project/sglang/pull/1317
Fix select by ensuring each request has at least one token by @merrymercy in https://github.com/sgl-project/sglang/pull/1318
misc: speedup load safetensors by @zhyncs in https://github.com/sgl-project/sglang/pull/1319
chore: bump v0.3.0 by @zhyncs in https://github.com/sgl-project/sglang/pull/1320
Fix the flaky test test_moe_eval_accuracy_large.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1326
docs: update news by @zhyncs in https://github.com/sgl-project/sglang/pull/1327

New Contributors

@Michaelvll made their first contribution in https://github.com/sgl-project/sglang/pull/1144
@Xu-Chen made their first contribution in https://github.com/sgl-project/sglang/pull/1148
@shanyu-sys made their first contribution in https://github.com/sgl-project/sglang/pull/1157
@intervitens made their first contribution in https://github.com/sgl-project/sglang/pull/1167
@zhaochenyang20 made their first contribution in https://github.com/sgl-project/sglang/pull/1186
@havetc made their first contribution in https://github.com/sgl-project/sglang/pull/1125
@caiyueliang made their first contribution in https://github.com/sgl-project/sglang/pull/1222
@ArtificialZeng made their first contribution in https://github.com/sgl-project/sglang/pull/1236
@lxww302 made their first contribution in https://github.com/sgl-project/sglang/pull/1260
@Deepfocused made their first contribution in https://github.com/sgl-project/sglang/pull/1258
@ByronHsu made their first contribution in https://github.com/sgl-project/sglang/pull/1282
@xiaobochen123 made their first contribution in https://github.com/sgl-project/sglang/pull/1262
@kevin85421 made their first contribution in https://github.com/sgl-project/sglang/pull/1284

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.2.13...v0.3.0
