v0.2.13

sgl-project/sglang

Release date: 2024-08-16 13:16:08

Latest release of sgl-project/sglang: v0.3.0 (2024-09-04 19:50:29)

Highlights

New Features: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked prefill by default (#1040 #984), support all sampling penalties (#973)
New Models: Support embedding model e5-mistral (#983 #987 #988 #997 #1014) and comprehensive OpenAI-compatible API.
Performance: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
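
As a rough illustration of the sampling penalties mentioned above (frequency, presence, repetition, and min_new_tokens from #973), the sketch below applies them to a toy logits table. It is not the code from that PR. The function name `apply_penalties`, its parameters, and the dict-based logits are assumptions made for illustration. The formulas follow the commonly used definitions: OpenAI-style frequency and presence penalties, and a divide-or-multiply repetition penalty. The sketch also assumes that only generated tokens are penalized.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def apply_penalties(
    logits: Dict[int, float],
    generated: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
    min_new_tokens: int = 0,
    eos_token_id: Optional[int] = None,
) -> Dict[int, float]:
    """Return a penalized copy of `logits` (hypothetical helper, not sglang's API)."""
    out = dict(logits)
    counts = Counter(generated)

    for token_id, count in counts.items():
        if token_id not in out:
            continue
        value = out[token_id]
        # Repetition penalty: shrink positive logits, push negative ones lower.
        if repetition_penalty != 1.0:
            if value > 0:
                value = value / repetition_penalty
            else:
                value = value * repetition_penalty
        # Frequency penalty grows with how often the token was generated.
        value -= frequency_penalty * count
        # Presence penalty is a flat cost for any token that appeared at all.
        value -= presence_penalty
        out[token_id] = value

    # min_new_tokens: forbid EOS until enough tokens have been generated.
    if eos_token_id is not None and len(generated) < min_new_tokens:
        out[eos_token_id] = -math.inf

    return out


if __name__ == "__main__":
    toy_logits = {0: 2.0, 1: -1.0, 2: 0.5}
    print(apply_penalties(toy_logits, [0, 0, 1], frequency_penalty=0.5,
                          presence_penalty=0.25, repetition_penalty=2.0))
```

Negative frequency or presence values would raise the logits of repeated tokens instead of lowering them. That is the case the negative-value tests in #995 exercise.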

What's Changed

fix: set env in runner by @zhyncs in https://github.com/sgl-project/sglang/pull/891
docs: update setup runner by @zhyncs in https://github.com/sgl-project/sglang/pull/884
misc: update cuda graph capture exception log by @zhyncs in https://github.com/sgl-project/sglang/pull/894
chore: add multipart dep for fastapi by @zhyncs in https://github.com/sgl-project/sglang/pull/895
[minor] fixed code formatting doc by @min-xu-et in https://github.com/sgl-project/sglang/pull/896
Bump version to 0.2.9.post1 by @Ying1123 in https://github.com/sgl-project/sglang/pull/899
Update the base image of the docker by @Ying1123 in https://github.com/sgl-project/sglang/pull/900
Reorder CI unit tests. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/908
fixed an error handling in bench_latency.py by @min-xu-et in https://github.com/sgl-project/sglang/pull/904
Add model accuracy test - step 1 by @Ying1123 in https://github.com/sgl-project/sglang/pull/866
latency test enhancement - part 1 by @min-xu-et in https://github.com/sgl-project/sglang/pull/909
Improve the structure of CI by @Ying1123 in https://github.com/sgl-project/sglang/pull/911
fix: use e2e and unit test only for original repo or pr by @zhyncs in https://github.com/sgl-project/sglang/pull/912
misc: add triton in check_env PACKAGE_LIST by @zhyncs in https://github.com/sgl-project/sglang/pull/914
Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in https://github.com/sgl-project/sglang/pull/905
enhance latency test - part 2 by @min-xu-et in https://github.com/sgl-project/sglang/pull/915
Make API Key OpenAI-compatible by @Ying1123 in https://github.com/sgl-project/sglang/pull/917
Update hyperparameter_tuning.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/918
Fix CI && python3.8 compatible by @hnyls2002 in https://github.com/sgl-project/sglang/pull/920
Support more OpenAI API test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/916
Bump version to 0.2.10 by @Ying1123 in https://github.com/sgl-project/sglang/pull/923
latency test enhancement - final part by @min-xu-et in https://github.com/sgl-project/sglang/pull/921
Test openai vision api by @Ying1123 in https://github.com/sgl-project/sglang/pull/925
Test regex in vision api by @Ying1123 in https://github.com/sgl-project/sglang/pull/926
Update README.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/927
Fix prompt len in parallel sampling by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/928
docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/935
Remove leftover auth_token by @AidanCooper in https://github.com/sgl-project/sglang/pull/934
Feat: add alternative choices selection methods by @AidanCooper in https://github.com/sgl-project/sglang/pull/835
Fix union operator by @ispobock in https://github.com/sgl-project/sglang/pull/940
Support multiple args options by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/941
Fix stuck in get_new_prefill_batch by @hnyls2002 in https://github.com/sgl-project/sglang/pull/948
Organize code (rename, movement) by @hnyls2002 in https://github.com/sgl-project/sglang/pull/953
fix nsys cannot profile cuda kernel by @mpjlu in https://github.com/sgl-project/sglang/pull/957
Add support for Batch API test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/936
Show more error messages for warmup errors by @Ying1123 in https://github.com/sgl-project/sglang/pull/932
misc: update issue template by @zhyncs in https://github.com/sgl-project/sglang/pull/963
misc: simplify test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/964
misc: add compute capability in check_env by @zhyncs in https://github.com/sgl-project/sglang/pull/965
Make req_pool_indices on CPU by @hnyls2002 in https://github.com/sgl-project/sglang/pull/960
misc: fix the req_to_token member change by @hnyls2002 in https://github.com/sgl-project/sglang/pull/967
chore: update vllm to 0.5.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/966
chore: bump v0.2.11 by @zhyncs in https://github.com/sgl-project/sglang/pull/970
Purge self-runner's pip cache weekly by @hnyls2002 in https://github.com/sgl-project/sglang/pull/975
Run purge-cache only in sgl-project by @hnyls2002 in https://github.com/sgl-project/sglang/pull/976
misc: correct the int data type for token ids and indices by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/969
PrefillAdder abstraction by @hnyls2002 in https://github.com/sgl-project/sglang/pull/968
RadixCache method adjust by @hnyls2002 in https://github.com/sgl-project/sglang/pull/977
Adjust max prefix len by @hnyls2002 in https://github.com/sgl-project/sglang/pull/980
#590 Increase default , track changes in examples and documentation by @foszto in https://github.com/sgl-project/sglang/pull/971
[minor] Update type annotation in tokenizer_manager.py by @Ying1123 in https://github.com/sgl-project/sglang/pull/982
Fix chunked prefill by @hnyls2002 in https://github.com/sgl-project/sglang/pull/984
Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/983
Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/987
Adjust InputMetadata and ScheduleBatch by @hnyls2002 in https://github.com/sgl-project/sglang/pull/981
support more options about usage in stream mode by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/985
Create contributor_guide.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/992
feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in https://github.com/sgl-project/sglang/pull/973
Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in https://github.com/sgl-project/sglang/pull/993
Add e5-mistral embedding model - step 3/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/988
test: negative value testing for frequency, presence penalizers by @vhain in https://github.com/sgl-project/sglang/pull/995
support models from www.modelscope.cn by @liuyhwangyh in https://github.com/sgl-project/sglang/pull/994
bugfix: penalizers to be merged before reqs by @vhain in https://github.com/sgl-project/sglang/pull/1001
fix: resolve correctness_test issue by @zhyncs in https://github.com/sgl-project/sglang/pull/1002
Minor bugfix on benchmark serving by @ywang96 in https://github.com/sgl-project/sglang/pull/1005
Add openai embedding API by @Ying1123 in https://github.com/sgl-project/sglang/pull/997
Add skip_tokenizer_init args. by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/959
Fix benchmark latency by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1007
Some warnings to crash when CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1009
Reduce the overhead when cache is disabled by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1010
Support embedding input as a list by @Ying1123 in https://github.com/sgl-project/sglang/pull/1014
misc: update test config by @zhyncs in https://github.com/sgl-project/sglang/pull/990
fix: force max new tokens to be 1 for embedding request by @Ying1123 in https://github.com/sgl-project/sglang/pull/1019
Clean up unit tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1020
Fix input_ids && rename to fill_ids by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1021
feat: use FlashInfer rmsnorm and silu by @zhyncs in https://github.com/sgl-project/sglang/pull/907
misc: update issue template by @zhyncs in https://github.com/sgl-project/sglang/pull/1024
Clean up readme and arguments of chunked prefill by @merrymercy in https://github.com/sgl-project/sglang/pull/1022
Fix wrong assert by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1028
Improve type annotation by @merrymercy in https://github.com/sgl-project/sglang/pull/1029
hotfix: add CustomOp abstraction by @zhyncs in https://github.com/sgl-project/sglang/pull/1027
Fix the case where r.prefix_indices is None by @merrymercy in https://github.com/sgl-project/sglang/pull/1031
Fix triton args init by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1034
Fix the case when max_new_tokens is too large by @merrymercy in https://github.com/sgl-project/sglang/pull/1025
Test the case when max_new_tokens is very large by @merrymercy in https://github.com/sgl-project/sglang/pull/1038
Fix the prefix indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1037
Improve end-to-end throughput test and its coverage by @merrymercy in https://github.com/sgl-project/sglang/pull/1039
Delete the useless test/srt/test_throughput.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1045
minor: some potential bugs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1044
Clean up the comments and names under python/sglang/srt/layers by @merrymercy in https://github.com/sgl-project/sglang/pull/1047
fix: Fix returned prefill logits and add output str test by @Ying1123 in https://github.com/sgl-project/sglang/pull/1046
feat: update Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/1033
docs: update setup github runner by @zhyncs in https://github.com/sgl-project/sglang/pull/1050
Add longer accuracy test on CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1049
Fix accuracy test by @merrymercy in https://github.com/sgl-project/sglang/pull/1051
Re-organize CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1052
chore: bump v0.2.12 by @zhyncs in https://github.com/sgl-project/sglang/pull/1048
feat: replace all rmsnorm and silu by @zhyncs in https://github.com/sgl-project/sglang/pull/1057
fix: not use the default port by @zhyncs in https://github.com/sgl-project/sglang/pull/1068
Fix layernorm input shape by @ispobock in https://github.com/sgl-project/sglang/pull/1066
fix: temporary solution for DeepSeek V2 H100 layout conversion issue by @zhyncs in https://github.com/sgl-project/sglang/pull/1060
ci: add cancel pr workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/1070
ci: add moe test by @zhyncs in https://github.com/sgl-project/sglang/pull/1053
fix: use devel for Triton's compiler requirements by @zhyncs in https://github.com/sgl-project/sglang/pull/1074
ci: add accuracy timeout by @zhyncs in https://github.com/sgl-project/sglang/pull/1078
Fix create_abort_task, GenerateReqInput does not have rids. by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1079
Example file for docker compose and k8s by @LucienShui in https://github.com/sgl-project/sglang/pull/1006
Update the mixtral to use the better FusedMoE layer by @merrymercy in https://github.com/sgl-project/sglang/pull/1081
[Feat] Add window attention for gemma-2 by @Ying1123 in https://github.com/sgl-project/sglang/pull/1056
Fix jump forward final state circular path bug. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1084
ci: update timeout and retry by @zhyncs in https://github.com/sgl-project/sglang/pull/1086
[Feature] modify Runtime to support skip_tokenizer_init by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1088
Fix a bug in cuda graph runner by @merrymercy in https://github.com/sgl-project/sglang/pull/1094
ci: remove workflow path trigger by @zhyncs in https://github.com/sgl-project/sglang/pull/1096
docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/1098
Update grok 1 model by @merrymercy in https://github.com/sgl-project/sglang/pull/1095
docs: update pr template by @zhyncs in https://github.com/sgl-project/sglang/pull/1099
Use dtype to control generate by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1082
[Fix] Compatibility of window attention and cuda graph by @Ying1123 in https://github.com/sgl-project/sglang/pull/1090
docs: update nsys usage by @zhyncs in https://github.com/sgl-project/sglang/pull/1103
Support stop_token_ids in sglang API by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1092
Support jinja as chat template file by @Ying1123 in https://github.com/sgl-project/sglang/pull/1104
Use a single workspace for flashinfer by @merrymercy in https://github.com/sgl-project/sglang/pull/1077
[Fix] fix the typo bug for window attention by @Ying1123 in https://github.com/sgl-project/sglang/pull/1106
Enable chunked prefill by default by @merrymercy in https://github.com/sgl-project/sglang/pull/1040
[Fix] fix flashinfer usage for window attention by @Ying1123 in https://github.com/sgl-project/sglang/pull/1107
misc: rm unused model_loader by @zhyncs in https://github.com/sgl-project/sglang/pull/1110
[Fix] Window attention compatible with RadixAttention and chunked prefill by @Ying1123 in https://github.com/sgl-project/sglang/pull/1112
set CUDA_DEVICE_MAX_CONNECTIONS=1 by @merrymercy in https://github.com/sgl-project/sglang/pull/1113
chore: bump v0.2.13 by @zhyncs in https://github.com/sgl-project/sglang/pull/1111
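
The embedding entries above (#997, #1014) add an OpenAI-compatible embedding endpoint that also accepts a list of inputs. The sketch below shows what such a request could look like, using only the Python standard library. The `/v1/embeddings` path and the `model` / `input` / `data[].embedding` fields follow the OpenAI embeddings API. The base URL, port, model name, and the helper names `build_embedding_request` and `embed` are assumptions for illustration and should be adapted to the actual server setup.

```python
import json
import urllib.request
from typing import List, Union


def build_embedding_request(
    base_url: str, model: str, inputs: Union[str, List[str]]
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": model, "input": inputs}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, model: str, inputs: Union[str, List[str]]) -> List[List[float]]:
    """Send the request and return one embedding vector per input."""
    request = build_embedding_request(base_url, model, inputs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the vectors under data[i]["embedding"].
    return [item["embedding"] for item in body["data"]]


if __name__ == "__main__":
    # Assumed values: a locally running server and an e5-mistral checkpoint name.
    request = build_embedding_request(
        "http://localhost:30000",
        "intfloat/e5-mistral-7b-instruct",
        ["first sentence", "second sentence"],
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Calling `embed(...)` requires a running server. `build_embedding_request(...)` on its own can be used to inspect the payload without any network access.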

New Contributors

@min-xu-et made their first contribution in https://github.com/sgl-project/sglang/pull/896
@mpjlu made their first contribution in https://github.com/sgl-project/sglang/pull/957
@xiezhq-hermann made their first contribution in https://github.com/sgl-project/sglang/pull/969
@foszto made their first contribution in https://github.com/sgl-project/sglang/pull/971
@vhain made their first contribution in https://github.com/sgl-project/sglang/pull/973
@liuyhwangyh made their first contribution in https://github.com/sgl-project/sglang/pull/994
@ywang96 made their first contribution in https://github.com/sgl-project/sglang/pull/1005
@gryffindor-rr made their first contribution in https://github.com/sgl-project/sglang/pull/959
@LucienShui made their first contribution in https://github.com/sgl-project/sglang/pull/1006

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.2.9...v0.2.13
