v0.2.13
Release date: 2024-08-16 13:16:08
Latest sgl-project/sglang release: v0.3.0 (2024-09-04 19:50:29)
Highlights
- New Feature: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973)
- New Models: Support embedding model e5-mistral (#983 #987 #988 #997 #1014) and comprehensive OpenAI-compatible API.
- Performance: Accelerate Multi-head Latent Attention (MLA), bringing a 2x end-to-end improvement on DeepSeek-V2 (#905).
- More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
- Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
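
As a rough illustration of what the sampling penalties mentioned above (#973) do to next-token logits, here is a minimal sketch. It assumes the OpenAI-style definitions for the frequency and presence penalties and the common divide/multiply rule for the repetition penalty. The function name `apply_penalties` and the dict-based logits are hypothetical choices for readability; this is not SGLang's implementation. It leaves out `min_new_tokens` and penalizes only previously generated tokens.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[int, float],
    generated: List[int],
    frequency_penalty: float = 0.0,
    presence_penalty: float = 0.0,
    repetition_penalty: float = 1.0,
) -> Dict[int, float]:
    """Return a penalized copy of `logits` (hypothetical helper, not SGLang code).

    logits:    token id -> raw logit for the next token
    generated: token ids produced so far
    """
    counts = Counter(generated)
    out = dict(logits)  # copy so the caller's logits are not mutated
    for tok, n in counts.items():
        if tok not in out:
            continue
        value = out[tok]
        # Repetition penalty: shrink positive logits and push negative ones
        # further down, so a seen token always becomes less likely.
        if repetition_penalty != 1.0:
            if value > 0:
                value /= repetition_penalty
            else:
                value *= repetition_penalty
        # Frequency penalty grows with the number of occurrences.
        value -= frequency_penalty * n
        # Presence penalty is a flat cost for any token seen at least once.
        value -= presence_penalty
        out[tok] = value
    return out


if __name__ == "__main__":
    raw = {1: 2.0, 2: -1.0, 3: 0.5}
    print(apply_penalties(raw, [1, 1, 2], 0.5, 0.25, 2.0))
```

With these inputs, token 1 (seen twice) drops from 2.0 to -0.25, token 2 (seen once) drops from -1.0 to -2.75, and the unseen token 3 stays at 0.5.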
What's Changed
- fix: set env in runner by @zhyncs in https://github.com/sgl-project/sglang/pull/891
- docs: update setup runner by @zhyncs in https://github.com/sgl-project/sglang/pull/884
- misc: update cuda graph capture exception log by @zhyncs in https://github.com/sgl-project/sglang/pull/894
- chore: add multipart dep for fastapi by @zhyncs in https://github.com/sgl-project/sglang/pull/895
- [minor] fixed code formatting doc by @min-xu-et in https://github.com/sgl-project/sglang/pull/896
- Bump version to 0.2.9.post1 by @Ying1123 in https://github.com/sgl-project/sglang/pull/899
- Update the base image of the docker by @Ying1123 in https://github.com/sgl-project/sglang/pull/900
- Reorder CI unit tests. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/908
- fixed an error handling in bench_latency.py by @min-xu-et in https://github.com/sgl-project/sglang/pull/904
- Add model accuracy test - step 1 by @Ying1123 in https://github.com/sgl-project/sglang/pull/866
- latency test enhancement - part 1 by @min-xu-et in https://github.com/sgl-project/sglang/pull/909
- Improve the structure of CI by @Ying1123 in https://github.com/sgl-project/sglang/pull/911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in https://github.com/sgl-project/sglang/pull/912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in https://github.com/sgl-project/sglang/pull/914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in https://github.com/sgl-project/sglang/pull/905
- enhance latency test - part 2 by @min-xu-et in https://github.com/sgl-project/sglang/pull/915
- Make API Key OpenAI-compatible by @Ying1123 in https://github.com/sgl-project/sglang/pull/917
- Update hyperparameter_tuning.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/918
- Fix CI && python3.8 compatible by @hnyls2002 in https://github.com/sgl-project/sglang/pull/920
- Support more OpenAI API test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/916
- Bump version to 0.2.10 by @Ying1123 in https://github.com/sgl-project/sglang/pull/923
- latency test enhancement - final part by @min-xu-et in https://github.com/sgl-project/sglang/pull/921
- Test openai vision api by @Ying1123 in https://github.com/sgl-project/sglang/pull/925
- Test regex in vision api by @Ying1123 in https://github.com/sgl-project/sglang/pull/926
- Update README.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/927
- Fix prompt len in parallel sampling by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/928
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/935
- Remove leftover auth_token by @AidanCooper in https://github.com/sgl-project/sglang/pull/934
- Feat: add alternative choices selection methods by @AidanCooper in https://github.com/sgl-project/sglang/pull/835
- Fix union operator by @ispobock in https://github.com/sgl-project/sglang/pull/940
- Support multiple args options by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/948
- Organize code (rename, movement) by @hnyls2002 in https://github.com/sgl-project/sglang/pull/953
- fix nsys cannot profile cuda kernel by @mpjlu in https://github.com/sgl-project/sglang/pull/957
- Add support for Batch API test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/936
- Show more error messages for warmup errors by @Ying1123 in https://github.com/sgl-project/sglang/pull/932
- misc: update issue template by @zhyncs in https://github.com/sgl-project/sglang/pull/963
- misc: simplify test by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/964
- misc: add compute capability in check_env by @zhyncs in https://github.com/sgl-project/sglang/pull/965
- Make `req_pool_indices` on CPU by @hnyls2002 in https://github.com/sgl-project/sglang/pull/960
- misc: fix the req_to_token member change by @hnyls2002 in https://github.com/sgl-project/sglang/pull/967
- chore: update vllm to 0.5.4 by @zhyncs in https://github.com/sgl-project/sglang/pull/966
- chore: bump v0.2.11 by @zhyncs in https://github.com/sgl-project/sglang/pull/970
- Purge self-runner's pip cache weekly by @hnyls2002 in https://github.com/sgl-project/sglang/pull/975
- Run purge-cache only in sgl-project by @hnyls2002 in https://github.com/sgl-project/sglang/pull/976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in https://github.com/sgl-project/sglang/pull/969
- PrefillAdder abstraction by @hnyls2002 in https://github.com/sgl-project/sglang/pull/968
- RadixCache method adjust by @hnyls2002 in https://github.com/sgl-project/sglang/pull/977
- Adjust max prefix len by @hnyls2002 in https://github.com/sgl-project/sglang/pull/980
- #590 Increase default , track changes in examples and documentation by @foszto in https://github.com/sgl-project/sglang/pull/971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in https://github.com/sgl-project/sglang/pull/982
- Fix chunked prefill by @hnyls2002 in https://github.com/sgl-project/sglang/pull/984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/987
- Adjust `InputeMetadata` and `ScheduleBatch` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/981
- support more options about usage in stream mode by @yichuan520030910320 in https://github.com/sgl-project/sglang/pull/985
- Create contributor_guide.md by @Ying1123 in https://github.com/sgl-project/sglang/pull/992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in https://github.com/sgl-project/sglang/pull/973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in https://github.com/sgl-project/sglang/pull/993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in https://github.com/sgl-project/sglang/pull/988
- test: negative value testing for frequency, presence penalizers by @vhain in https://github.com/sgl-project/sglang/pull/995
- support models from www.modelscope.cn by @liuyhwangyh in https://github.com/sgl-project/sglang/pull/994
- bugfix: penalizers to be merged before reqs by @vhain in https://github.com/sgl-project/sglang/pull/1001
- fix: resolve correctness_test issue by @zhyncs in https://github.com/sgl-project/sglang/pull/1002
- Minor bugfix on benchmark serving by @ywang96 in https://github.com/sgl-project/sglang/pull/1005
- Add openai embedding API by @Ying1123 in https://github.com/sgl-project/sglang/pull/997
- Add skip_tokenizer_init args. by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/959
- Fix benchmark latency by @wisclmy0611 in https://github.com/sgl-project/sglang/pull/1007
- Some warnings to crash when CI by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1009
- Reduce the overhead when cache is disabled by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1010
- Support embedding input as a list by @Ying1123 in https://github.com/sgl-project/sglang/pull/1014
- misc: update test config by @zhyncs in https://github.com/sgl-project/sglang/pull/990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in https://github.com/sgl-project/sglang/pull/1019
- Clean up unit tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in https://github.com/sgl-project/sglang/pull/907
- misc: update issue template by @zhyncs in https://github.com/sgl-project/sglang/pull/1024
- Clean up readme and arguments of chunked prefill by @merrymercy in https://github.com/sgl-project/sglang/pull/1022
- Fix wrong assert by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1028
- Improve type annotation by @merrymercy in https://github.com/sgl-project/sglang/pull/1029
- hotfix: add CustomOp abstraction by @zhyncs in https://github.com/sgl-project/sglang/pull/1027
- Fix the case where r.prefix_indices is None by @merrymercy in https://github.com/sgl-project/sglang/pull/1031
- Fix triton args init by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1034
- Fix the case when max_new_tokens is too large by @merrymercy in https://github.com/sgl-project/sglang/pull/1025
- Test the case when max_new_tokens is very large by @merrymercy in https://github.com/sgl-project/sglang/pull/1038
- Fix the prefix indices by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1037
- Improve end-to-end throughput test and its coverage by @merrymercy in https://github.com/sgl-project/sglang/pull/1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in https://github.com/sgl-project/sglang/pull/1045
- minor: some potential bugs by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in https://github.com/sgl-project/sglang/pull/1047
- fix: Fix returned prefill logits and add output str test by @Ying1123 in https://github.com/sgl-project/sglang/pull/1046
- feat: update Dockerfile by @zhyncs in https://github.com/sgl-project/sglang/pull/1033
- docs: update setup github runner by @zhyncs in https://github.com/sgl-project/sglang/pull/1050
- Add longer accuracy test on CI by @merrymercy in https://github.com/sgl-project/sglang/pull/1049
- Fix accuracy test by @merrymercy in https://github.com/sgl-project/sglang/pull/1051
- Re-organize CI tests by @merrymercy in https://github.com/sgl-project/sglang/pull/1052
- chore: bump v0.2.12 by @zhyncs in https://github.com/sgl-project/sglang/pull/1048
- feat: replace all rmsnorm and silu by @zhyncs in https://github.com/sgl-project/sglang/pull/1057
- fix: not use the default port by @zhyncs in https://github.com/sgl-project/sglang/pull/1068
- Fix layernorm input shape by @ispobock in https://github.com/sgl-project/sglang/pull/1066
- fix: temporary solution for DeepSeek V2 H100 layout conversion issue by @zhyncs in https://github.com/sgl-project/sglang/pull/1060
- ci: add cancel pr workflow by @zhyncs in https://github.com/sgl-project/sglang/pull/1070
- ci: add moe test by @zhyncs in https://github.com/sgl-project/sglang/pull/1053
- fix: use devel for Triton's compiler requirements by @zhyncs in https://github.com/sgl-project/sglang/pull/1074
- ci: add accuracy timeout by @zhyncs in https://github.com/sgl-project/sglang/pull/1078
- Fix create_abort_task, GenerateReqInput does not have rids. by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1079
- Example file for docker compose and k8s by @LucienShui in https://github.com/sgl-project/sglang/pull/1006
- Update the mixtral to use the better FusedMoE layer by @merrymercy in https://github.com/sgl-project/sglang/pull/1081
- [Feat] Add window attention for gemma-2 by @Ying1123 in https://github.com/sgl-project/sglang/pull/1056
- Fix jump forward final state circular path bug. by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1084
- ci: update timeout and retry by @zhyncs in https://github.com/sgl-project/sglang/pull/1086
- [Feature] modify Runtime to support skip_tokenizer_init by @gryffindor-rr in https://github.com/sgl-project/sglang/pull/1088
- Fix a bug in cuda graph runner by @merrymercy in https://github.com/sgl-project/sglang/pull/1094
- ci: remove workflow path trigger by @zhyncs in https://github.com/sgl-project/sglang/pull/1096
- docs: update README by @zhyncs in https://github.com/sgl-project/sglang/pull/1098
- Update grok 1 model by @merrymercy in https://github.com/sgl-project/sglang/pull/1095
- docs: update pr template by @zhyncs in https://github.com/sgl-project/sglang/pull/1099
- Use `dtype` to control generate by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1082
- [Fix] Compatibility of window attention and cuda graph by @Ying1123 in https://github.com/sgl-project/sglang/pull/1090
- docs: update nsys usage by @zhyncs in https://github.com/sgl-project/sglang/pull/1103
- Support `stop_token_ids` in sglang API by @hnyls2002 in https://github.com/sgl-project/sglang/pull/1092
- Support jinja as chat template file by @Ying1123 in https://github.com/sgl-project/sglang/pull/1104
- Use a single workspace for flashinfer by @merrymercy in https://github.com/sgl-project/sglang/pull/1077
- [Fix] fix the typo bug for window attention by @Ying1123 in https://github.com/sgl-project/sglang/pull/1106
- Enable chunked prefill by default by @merrymercy in https://github.com/sgl-project/sglang/pull/1040
- [Fix] fix flashinfer usage for window attention by @Ying1123 in https://github.com/sgl-project/sglang/pull/1107
- misc: rm unused model_loader by @zhyncs in https://github.com/sgl-project/sglang/pull/1110
- [Fix] Window attention compatible with RadixAttention and chunked prefill by @Ying1123 in https://github.com/sgl-project/sglang/pull/1112
- set CUDA_DEVICE_MAX_CONNECTIONS=1 by @merrymercy in https://github.com/sgl-project/sglang/pull/1113
- chore: bump v0.2.13 by @zhyncs in https://github.com/sgl-project/sglang/pull/1111
New Contributors
- @min-xu-et made their first contribution in https://github.com/sgl-project/sglang/pull/896
- @mpjlu made their first contribution in https://github.com/sgl-project/sglang/pull/957
- @xiezhq-hermann made their first contribution in https://github.com/sgl-project/sglang/pull/969
- @foszto made their first contribution in https://github.com/sgl-project/sglang/pull/971
- @vhain made their first contribution in https://github.com/sgl-project/sglang/pull/973
- @liuyhwangyh made their first contribution in https://github.com/sgl-project/sglang/pull/994
- @ywang96 made their first contribution in https://github.com/sgl-project/sglang/pull/1005
- @gryffindor-rr made their first contribution in https://github.com/sgl-project/sglang/pull/959
- @LucienShui made their first contribution in https://github.com/sgl-project/sglang/pull/1006
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.2.9...v0.2.13