v0.4.2
Release date: 2024-03-18 21:07:28
Latest EleutherAI/lm-evaluation-harness release: v0.4.3 (2024-07-01 22:00:36)
lm-eval v0.4.2 Release Notes
We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!
New Additions
- Request Caching by @inf3rnus - faster startup via caching the construction of documents/requests' contexts (see the Python sketch after this list)
- Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
- New Tasks
- KMMLU, a localized - not (auto) translated! - dataset for testing Korean knowledge by @h-albert-lee @guijinSON
- GPQA by @uanu2002
- French Bench by @ManuelFay
- EQ-Bench by @pbevan1 and @sqrkl
- HAERAE-Bench, re-added by @h-albert-lee
- Updates to answer parsing on many generative tasks (GSM8k, MGSM, BBH zeroshot) by @thnkinbtfly!
- Okapi (translated) Open LLM Leaderboard tasks by @uanu2002 and @giux78
- Arabic MMLU and aEXAMS by @khalil-hennara
- And more!
- Re-introduction of `TemplateLM` base class for lower-code new LM class implementations by @anjor
- Run the library with the metrics/scoring stage skipped via `--predict_only` by @baberabb
- Many more miscellaneous improvements by a lot of great contributors!
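For programmatic users, the request-caching and prediction-only additions above are also reachable from Python. Below is a minimal sketch; the keyword arguments are assumed to mirror the CLI flags (`--cache_requests`, `--predict_only`) in v0.4.2, and the model and task names are placeholders only:

```python
import lm_eval
from lm_eval.models.huggingface import HFLM

# Placeholder model -- any lm_eval LM instance works here.
lm = HFLM(pretrained="EleutherAI/pythia-160m", batch_size=8)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["arc_easy"],
    cache_requests=True,  # assumed kwarg mirroring --cache_requests: reuse cached request contexts
    predict_only=True,    # assumed kwarg mirroring --predict_only: skip the metrics/scoring stage
)

# With predict_only, scoring is skipped; inspect the returned dict for the raw model outputs.
print(list(results.keys()))
```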
Backwards Incompatibilities
There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:
TaskManager API

Previously, users had to call `lm_eval.tasks.initialize_tasks()` to register the library's default tasks, or `lm_eval.tasks.include_path()` to include a custom directory of task YAML configs.
Old usage:

```python
import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])
```
New intended usage:

```python
import lm_eval

# optional -- only need to instantiate separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager()  # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)
```
`get_task_dict()` now also optionally takes a `TaskManager` object, which should be passed when loading custom tasks.
This should allow for much faster library startup times, since only the requested tasks or groups are loaded lazily.
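For example, a lower-level workflow that builds the task dictionary itself might look like the following sketch (assuming the v0.4.2 `lm_eval.tasks.get_task_dict` / `TaskManager` interfaces; the custom-task path and the `my_custom_task` name are placeholders):

```python
from lm_eval.tasks import TaskManager, get_task_dict

# Index built-in tasks plus a custom YAML directory (placeholder path).
task_manager = TaskManager(include_path="/path/to/my/custom/tasks")

# Passing the TaskManager means only the requested tasks/groups are
# actually loaded, rather than eagerly registering every task up front.
task_dict = get_task_dict(["arc_easy", "my_custom_task"], task_manager)
print(list(task_dict.keys()))
```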
Updated Stderr Aggregation
Previous versions of the library reported erroneously large stderr scores for groups of tasks such as MMLU.
We've since updated the formula to correctly aggregate standard error scores for groups of tasks whose accuracy is reported as a mean across the whole dataset -- see #1390 and #1427 for more information.
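As a rough illustration of the intended behavior (a sketch of the textbook pooled-variance formula, not necessarily the harness's exact implementation), the group-level standard error is now derived from each subtask's stderr and sample size rather than naively combining the subtask stderrs:

```python
import math

def pooled_group_stderr(stderrs, sizes):
    """Illustrative pooled stderr for a size-weighted mean over subtasks.

    stderrs: per-subtask standard errors of the mean
    sizes:   per-subtask sample counts
    """
    assert len(stderrs) == len(sizes) and sizes, "need matching, non-empty inputs"
    total = sum(sizes)
    # Recover each subtask's sample variance (var_i = stderr_i**2 * n_i),
    # then pool the variances weighted by degrees of freedom (n_i - 1).
    pooled_var = sum((n - 1) * (se**2 * n) for se, n in zip(stderrs, sizes)) / (total - len(sizes))
    # Standard error of the mean over all pooled samples.
    return math.sqrt(pooled_var / total)

# e.g. three MMLU-style subtasks of different sizes
print(pooled_group_stderr([0.020, 0.015, 0.030], [250, 400, 100]))
```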
As always, please feel free to give us feedback or request new features! We're grateful for the community's support.
What's Changed
- Add support for RWKV models with World tokenizer by @PicoCreator in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
- add bypass metric by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1156
- Expand docs, update CITATION.bib by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1227
- Hf: minor edge cases by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1380
- Enable override of printed `n-shot` in table by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1379
- Faster Task and Group Loading, Allow Recursive Groups by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1321
- Fix for https://github.com/EleutherAI/lm-evaluation-harness/issues/1383 by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1384
- fix on --task list by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1387
- Support for Inf2 optimum class [WIP] by @michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
- Update README.md by @mycoalchen in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
- Fix confusing `write_out.py` instructions in README by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1371
- Use Pooled rather than Combined Variance for calculating stderr of task groupings by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1390
- adding hf_transfer by @michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1400
- `batch_size` with `auto` defaults to 1 if `No executable batch size found` is raised by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1405
- Fix printing bug in #1390 by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1414
- Fixes https://github.com/EleutherAI/lm-evaluation-harness/issues/1416 by @pminervini in https://github.com/EleutherAI/lm-evaluation-harness/pull/1418
- Fix watchdog timeout by @JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
- Evaluate by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1385
- Add multilingual ARC task by @uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
- Add multilingual TruthfulQA task by @uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1420
- [m_mmlu] added multilingual evaluation from alexandrainst/m_mmlu by @giux78 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
- Added seeds to `evaluator.simple_evaluate` signature by @Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
- Fix: task weighting by subtask size; update Pooled Stderr formula slightly by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1427
- Refactor utilities into a separate model utils file. by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1429
- Nit fix: Updated OpenBookQA Readme by @adavidho in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
- improve hf_transfer activation by @michaelfeil in https://github.com/EleutherAI/lm-evaluation-harness/pull/1438
- Correct typo in task name in ARC documentation by @larekrow in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
- update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zeroshot 0% -> 42%) by @thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1356
- Add a new task HaeRae-Bench by @h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1445
- Group reqs by context by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1425
- Add a new task GPQA (the part without CoT) by @uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1434
- Added KMMLU evaluation method and changed ReadMe by @h-albert-lee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1447
- Add TemplateLM boilerplate LM class by @anjor in https://github.com/EleutherAI/lm-evaluation-harness/pull/1279
- Log which subtasks were called with which groups by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1456
- PR fixing the issue #1391 (wrong contexts in the mgsm task) by @leocnj in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
- feat: Add Weights and Biases support by @ayulockin in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
- Fixed generation args issue affecting OpenAI completion model by @Am1n3e in https://github.com/EleutherAI/lm-evaluation-harness/pull/1458
- update parsing logic of mgsm following gsm8k (mgsm en 0 -> 50%) by @thnkinbtfly in https://github.com/EleutherAI/lm-evaluation-harness/pull/1462
- Adding documentation for Weights and Biases CLI interface by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1466
- Add environment and transformers version logging in results dump by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1464
- Apply code autoformatting with Ruff to tasks/*.py and __init__.py by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1469
- Setting trust_remote_code to `True` for HuggingFace datasets compatibility by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1467
- add arabic mmlu by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
- Add Gemma support (Add flag to control BOS token usage) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1465
- Revert "Setting trust_remote_code to
True
for HuggingFace datasets compatibility" by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1474 - Create a means for caching task registration and request building. Ad… by @inf3rnus in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
- Cont metrics by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1475
- Refactor `evaluater.evaluate` by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441
- add multilingual mmlu eval by @jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
- Update TruthfulQA val split name by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1488
- Fix AttributeError in huggingface.py When 'model_type' is Missing by @richwardle in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
- Fix duplicated kwargs in some model init by @lchu-ibm in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
- Add multilingual truthfulqa targets by @jordane95 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1499
- Always include EOS token as stop sequence by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1480
- Improve data-parallel request partitioning for VLLM by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1477
- modify `WandbLogger` to accept arbitrary kwargs by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1491
- Vllm update DP+TP by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1508
- Setting trust_remote_code to True for HuggingFace datasets compatibility by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1487
- Cleaning up unused unit tests by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1516
- French Bench by @ManuelFay in https://github.com/EleutherAI/lm-evaluation-harness/pull/1500
- Hotfix: fix TypeError in `--trust_remote_code` by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1517
- Fix minor edge cases (#951 #1503) by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1520
- Openllm benchmark by @baberabb in https://github.com/EleutherAI/lm-evaluation-harness/pull/1526
- Add a new task GPQA (the part CoT and generative) by @uanu2002 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1482
- Add EQ-Bench as per #1459 by @pbevan1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
- Add WMDP Multiple-choice by @justinphan3110 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
- Adding new task: KorMedMCQA by @sean0042 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
- Update docs on LM.loglikelihood_rolling abstract method by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1532
- Minor KMMLU cleanup by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1502
- Cleanup and fixes (Task, Instance, and a little bit of *evaluate) by @LSinev in https://github.com/EleutherAI/lm-evaluation-harness/pull/1533
- Update installation commands in openai_completions.py and contributing document and, update wandb_args description by @naem1023 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
- Add compatibility for vLLM's new Logprob object by @Yard1 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
- Fix incorrect `max_gen_toks` generation kwarg default in code2_text by @cosmo3769 in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
- Support jinja templating for task descriptions by @HishamYahya in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
- Fix incorrect `max_gen_toks` generation kwarg default in generative Bigbench by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1546
- Hardcode IFEval to 0-shot by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1506
- add Arabic EXAMS benchmark by @khalil-Hennara in https://github.com/EleutherAI/lm-evaluation-harness/pull/1498
- AGIEval by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1359
- cli_evaluate calls simple_evaluate with the same verbosity. by @Wongboo in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
- add manual tqdm disabling management by @artemorloff in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
- Fix README section on vllm integration by @eitanturok in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
- Fix Jinja template for Advanced AI Risk by @RylanSchaeffer in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
- Proposed approach for testing CLI arg parsing by @veekaybee in https://github.com/EleutherAI/lm-evaluation-harness/pull/1566
- Patch for Seq2Seq Model predictions by @lintangsutawika in https://github.com/EleutherAI/lm-evaluation-harness/pull/1584
- Add start date in results.json by @djstrong in https://github.com/EleutherAI/lm-evaluation-harness/pull/1592
- Cleanup for v0.4.2 release by @haileyschoelkopf in https://github.com/EleutherAI/lm-evaluation-harness/pull/1573
- Fix eval_logger import for mmlu/_generate_configs.py by @noufmitla in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593
New Contributors
- @PicoCreator made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1374
- @michaelfeil made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1364
- @mycoalchen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1398
- @JeevanBhoot made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1404
- @uanu2002 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1419
- @giux78 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1358
- @Am1n3e made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1412
- @adavidho made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1430
- @larekrow made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1443
- @leocnj made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1440
- @ayulockin made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1339
- @khalil-Hennara made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1402
- @inf3rnus made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1372
- @jordane95 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1484
- @richwardle made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1489
- @lchu-ibm made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1495
- @pbevan1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1511
- @justinphan3110 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1534
- @sean0042 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1530
- @naem1023 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1536
- @Yard1 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1549
- @cosmo3769 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1551
- @HishamYahya made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1553
- @Wongboo made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1563
- @artemorloff made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1569
- @eitanturok made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1579
- @RylanSchaeffer made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1587
- @noufmitla made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/1593
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.1...v0.4.2