v0.4.1
Release date: 2024-01-31 23:29:14
Latest EleutherAI/lm-evaluation-harness release: v0.4.3 (2024-07-01 22:00:36)
Release Notes
This release contains all changes made since the release of v0.4.0, and also serves as a partial test of our release automation, provided by @anjor.
At a high level, some of the changes include:
- Data-parallel inference using vLLM (contributed by @baberabb; example below)
- A major fix to Hugging Face model generation: in v0.4.0, a bug in stop-sequence handling sometimes cut generations off too early.
- Miscellaneous documentation updates
- A number of new tasks, and bugfixes to old tasks!
- Support for OpenAI-like API models using `local-completions` or `local-chat-completions` (thanks to @veekaybee, @mgoin, @anjor, and others on this; see the sketch after this list)!
- Integration with tools for visualization of results, such as Zeno, with WandB support coming soon!
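As a rough illustration of the vLLM and `local-completions` backends mentioned above, here is a minimal Python sketch. It is not taken from the release notes: the model names, tasks, and `model_args` keys (e.g. `data_parallel_size`, `base_url`, `tokenizer_backend`) are assumptions about the v0.4.1 interface and may need adjusting for your setup.

```python
# Minimal sketch of the v0.4.1 Python API; argument names below are assumptions.
import lm_eval
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # in v0.4.x, built-in task configs must be registered before evaluating

# Data-parallel inference with vLLM: data_parallel_size spreads requests across model replicas.
vllm_results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=EleutherAI/pythia-1.4b,data_parallel_size=2",
    tasks=["hellaswag"],
)

# Scoring an OpenAI-compatible server (e.g. one running locally) via local-completions.
api_results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args="model=facebook/opt-125m,base_url=http://localhost:8000/v1,tokenizer_backend=huggingface",
    tasks=["lambada_openai"],
)

print(vllm_results["results"])
print(api_results["results"])
```

The same runs can be launched from the command line with `lm_eval --model vllm ...` or `lm_eval --model local-completions ...`, passing the equivalent `--model_args` string.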
More frequent (minor) version releases may happen in the future, to make life easier for PyPI users!
We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.
In the next version release, we hope to include:
- Chat Templating + System Prompt support for locally-run models
- Improved Answer Extraction for many generative tasks, making them easier to run zero-shot and less dependent on model output formatting
- General speedups and QoL fixes to the non-inference portions of LM Evaluation Harness, including drastically reduced startup times and faster non-inference processing steps, especially when `num_fewshot` is large!
- A new `TaskManager` object and the deprecation of `lm_eval.tasks.initialize_tasks()`, making it easier to register many tasks and to configure new groups of tasks
What's Changed
- Announce v0.4.0 in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1061
- remove commented planned samplers in `lm_eval/api/samplers.py` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1062
- Confirming links in docs work (WIP) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1065
- Set actual version to v0.4.0 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1064
- Updating docs hyperlinks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1066
- Fiddling with READMEs, Reenable CI tests on `main` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1063
- Update _cot_fewshot_template_yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1074
- Patch scrolls by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1077
- Update template of qqp dataset by @shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
- Change the sub-task name from sst to sst2 in glue by @shiweijiezero in https://github.com/EleutherAI/lm-evaluation-harness/pull/1099
- Add kmmlu evaluation to tasks by @h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
- Fix stderr by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1106
- Simplified `evaluator.py` by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1104
- [Refactor] vllm data parallel by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1035
- Unpack group in `write_out` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1113
- Revert "Simplified `evaluator.py`" by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1116
- `qqp`, `mnli_mismatch`: remove unlabeled test sets by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1114
- fix: bug of BBH_cot_fewshot by @Momo-Tori in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
- Bump BBH version by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1120
- Refactor `hf` modeling code by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1096
- Additional process for doc_to_choice by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1093
- doc_to_decontamination_query can use function by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1082
- Fix vllm `batch_size` type by @xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
- fix: passing max_length to vllm engine args by @NanoCode012 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
- Fix Loading Local Dataset by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1127
- place model onto `mps` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1133
- Add benchmark FLD by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
- fix typo in README.md by @lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
- add correct openai api key to README.md by @lennijusten in https://github.com/EleutherAI/lm-evaluation-harness/pull/1138
- Update Linter CI Job by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1130
- add utils.clear_torch_cache() to model_comparator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1142
- Enabling OpenAI completions via gooseai by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
- vllm clean up tqdm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1144
- openai nits by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1139
- Add IFEval / Instruction-Following Eval by @wiskojo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
- set `--gen_kwargs` arg to None by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1145
- Add shorthand flags by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1149
- fld bugfix by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1150
- Remove GooseAI docs and change no-commit-to-branch precommit hook by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1154
- Add docs on adding a multiple choice metric by @polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
- Simplify evaluator by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1126
- Generalize Qwen tokenizer fix by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1146
- self.device in huggingface.py line 210 treated as torch.device but might be a string by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1172
- Fix Column Naming and Dataset Naming Conventions in K-MMLU Evaluation by @seungduk-yanolja in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
- feat: add option to upload results to Zeno by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
- Switch Linting to `ruff` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1166
- Error in --num_fewshot option for K-MMLU Evaluation Harness by @guijinSON in https://github.com/EleutherAI/lm-evaluation-harness/pull/1178
- Implementing local OpenAI API-style chat completions on any given inference server by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1174
- Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
- Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1183
- Add tokenizer backend by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1186
- Correctly Print Task Versioning by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1173
- update Zeno example and reference in README by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1190
- Remove tokenizer for openai chat completions by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1191
- Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1181
- disable `mypy` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1193
- Generic decorator for handling rate limit errors by @zachschillaci27 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
- Refer in README to main branch by @BramVanroy in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
- Hardcode 0-shot for fewshot Minerva Math tasks by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1189
- Upstream Mamba Support (`mamba_ssm`) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1110
- Update cuda handling by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1180
- Fix documentation in API table by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1203
- Consolidate batching by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1197
- Add remove_whitespace to FLD benchmark by @MorishT in https://github.com/EleutherAI/lm-evaluation-harness/pull/1206
- Fix the argument order in `utils.divide` doc by @xTayEx in https://github.com/EleutherAI/lm-evaluation-harness/pull/1208
- [Fix #1211] pin vllm at < 0.2.6 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1212
- fix unbounded local variable by @onnoo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
- nits + fix siqa by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1216
- add length of strings and answer options to Zeno metadata by @Sparkier in https://github.com/EleutherAI/lm-evaluation-harness/pull/1222
- Don't silence errors when loading tasks by @polm-stability in https://github.com/EleutherAI/lm-evaluation-harness/pull/1148
- Update README.md by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1195
- Update race's README.md by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1230
- batch_schedular bug in Collator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1229
- Update openai_completions.py by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1238
- vllm: handle max_length better and substitute Collator by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1241
- Remove self.dataset_path post_init process by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1243
- Add multilingual HellaSwag task by @JorgeDeCorte in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
- Do not escape ascii in logging outputs by @passaglia in https://github.com/EleutherAI/lm-evaluation-harness/pull/1246
- fixed fewshot loading for multiple input tasks by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1255
- Revert citation by @StellaAthena in https://github.com/EleutherAI/lm-evaluation-harness/pull/1257
- Specify utf-8 encoding to properly save non-ascii samples to file by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1265
- Fix evaluation for the belebele dataset by @jmichaelov in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
- Call "exact_match" once for each multiple-target sample by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1266
- MultiMedQA by @tmabraham in https://github.com/EleutherAI/lm-evaluation-harness/pull/1198
- Fix bug in multi-token Stop Sequences by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1268
- Update Table Printing by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1271
- add Kobest by @jp1924 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
- Apply `process_docs()` to fewshot_split by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1276
- Fix whitespace issues in GSM8k-CoT by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1275
- Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1261
- Allow parameter edits for registered tasks when listed in a benchmark by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1273
- Fix data-parallel evaluation with quantized models by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1270
- Rework documentation for explaining local dataset by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1284
- Update CITATION.bib by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1285
- Update `nq_open` / NaturalQs whitespacing by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1289
- Update README.md with custom integration doc by @msaroufim in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
- Update nq_open.yaml by @Hannibal046 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
- Update task_guide.md by @daniellepintz in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
- Pin `datasets` dependency at 2.15 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1312
- Fix polemo2_in.yaml subset name by @lhoestq in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
- Fix `datasets` dependency to >=2.14 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1314
- Fix group register by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1315
- Update task_guide.md by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
- Update polemo2_in.yaml by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1318
- Fix: Mamba receives extra kwargs by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1328
- Fix Issue regarding stderr by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1327
- Add `local-completions` support using OpenAI interface by @mgoin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1277
- fallback to classname when LM doesn't have config by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
- fix a trailing whitespace that breaks a lint job by @nairbv in https://github.com/EleutherAI/lm-evaluation-harness/pull/1335
- skip "benchmarks" in changed_tasks by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1336
- Update migrated HF dataset paths by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1332
- Don't use `get_task_dict()` in task registration / initialization by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1331
- manage default (greedy) gen_kwargs in vllm by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1341
- vllm: change default gen_kwargs behaviour; prompt_logprobs=1 by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1345
- Update links to advanced_task_guide.md by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1348
- `Filter` docs not offset by `doc_id` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1349
- Add FAQ on `lm_eval.tasks.initialize_tasks()` to README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1330
- Refix issue regarding stderr by @thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
- Add causalLM OpenVino models by @NoushNabi in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
- Apply some best practices and guideline recommendations to code by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363
- serialize callable functions in config by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1367
- delay filter init; remove `*args` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1369
- Fix unintuitive `--gen_kwargs` behavior by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1329
- Publish to pypi by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1194
- Make dependencies compatible with PyPI by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1378
New Contributors
- @shiweijiezero made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1097
- @h-albert-lee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1089
- @Momo-Tori made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1118
- @xTayEx made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1128
- @NanoCode012 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1124
- @MorishT made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1122
- @lennijusten made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1136
- @veekaybee made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1141
- @wiskojo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1087
- @polm-stability made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1147
- @seungduk-yanolja made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1171
- @Sparkier made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/990
- @anjor made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1184
- @zachschillaci27 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1109
- @BramVanroy made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1200
- @onnoo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1218
- @JorgeDeCorte made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1228
- @jmichaelov made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1267
- @jp1924 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1263
- @msaroufim made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1298
- @Hannibal046 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1305
- @daniellepintz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1306
- @lhoestq made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1313
- @djstrong made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1316
- @nairbv made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1334
- @thnkinbtfly made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1357
- @NoushNabi made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1290
- @LSinev made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1363
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.0...v0.4.1