
Releases: EleutherAI/lm-evaluation-harness

v0.4.2

18 Mar 13:07
4600d6b

lm-eval v0.4.2 Release Notes

We are releasing a new minor version of lm-eval for PyPI users! We've been very happy to see continued usage of the lm-evaluation-harness, including its use as a standard testbench to propel new architecture design (https://arxiv.org/abs/2402.18668), to ease new benchmark creation (https://arxiv.org/abs/2402.11548, https://arxiv.org/abs/2402.00786, https://arxiv.org/abs/2403.01469), to enable controlled experimentation on LLM evaluation (https://arxiv.org/abs/2402.01781), and more!

New Additions

  • Request Caching by @inf3rnus - speedups on startup via caching the construction of documents/requests’ contexts
  • Weights and Biases logging by @ayulockin - evals can now be logged to both WandB and Zeno!
  • New Tasks
  • Re-introduction of TemplateLM base class for lower-code new LM class implementations by @anjor
  • Run the library with metrics/scoring stage skipped via --predict_only by @baberabb
  • Many more miscellaneous improvements by a lot of great contributors!

Backwards Incompatibilities

There were a few breaking changes to lm-eval's general API or logic we'd like to highlight:

TaskManager API

Previously, users had to call lm_eval.tasks.initialize_tasks() to register the library's default tasks, or lm_eval.tasks.include_path() to include a custom directory of task YAML configs.

Old usage:

import lm_eval

lm_eval.tasks.initialize_tasks()
# or:
lm_eval.tasks.include_path("/path/to/my/custom/tasks")

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"])

New intended usage:

import lm_eval

# optional--only need to instantiate separately if you want to pass a custom path!
task_manager = lm_eval.tasks.TaskManager() # pass include_path="/path/to/my/custom/tasks" if desired

lm_eval.simple_evaluate(model=lm, tasks=["arc_easy"], task_manager=task_manager)

get_task_dict() now also optionally accepts a TaskManager object, for use when loading custom tasks.

This should allow for much faster library startup times due to lazily loading requested tasks or groups.

Updated Stderr Aggregation

Previous versions of the library reported erroneously large stderr scores for groups of tasks such as MMLU.

We've since updated the formula to correctly aggregate standard error scores for groups of tasks whose accuracies are aggregated via their mean across the dataset -- see #1390 and #1427 for more information.
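To illustrate why this matters: when a group score is the size-weighted mean of its subtasks' accuracies, the group's standard error should be derived from the weighted combination of per-task variances, not by naively averaging (or worse, summing) the per-task stderrs. The following is a minimal illustrative sketch only -- the function name is ours, and it assumes independent subtasks, which is a simplification of the harness's actual pooled-variance computation:

```python
import math

def pooled_stderr(sizes, stderrs):
    """Standard error of a size-weighted mean of independent subtask scores.

    With weights w_i = n_i / N, Var(sum(w_i * m_i)) = sum(w_i^2 * Var(m_i)),
    so the combined stderr is sqrt of that sum.
    """
    total = sum(sizes)
    weights = [n / total for n in sizes]
    variance = sum(w * w * se * se for w, se in zip(weights, stderrs))
    return math.sqrt(variance)

# Two equally sized subtasks, each with stderr 0.02: the stderr of the
# pooled mean is smaller than either subtask's stderr, since it reflects
# the full sample size of the group.
print(round(pooled_stderr([100, 100], [0.02, 0.02]), 4))  # → 0.0141
```

The key property is that the group stderr shrinks as more (independent) samples enter the group, which is what the corrected aggregation reflects.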

As always, please feel free to give us feedback or request new features! We're grateful for the community's support.

What's Changed


v0.4.1

31 Jan 15:29
a0a2fec

Release Notes

This release contains all changes so far since the release of v0.4.0, and is partially a test of our release automation, provided by @anjor.

At a high level, some of the changes include:

  • Data-parallel inference using vLLM (contributed by @baberabb )
  • A major fix to HuggingFace model generation: in v0.4.0, a bug in stop-sequence handling sometimes cut generations off too early.
  • Miscellaneous documentation updates
  • A number of new tasks, and bugfixes to old tasks!
  • Support for OpenAI-like API models via local-completions or local-chat-completions (thanks to @veekaybee, @mgoin, @anjor, and others)!
  • Integrations with tools for visualizing results, such as Zeno, with WandB support coming soon!

We may cut more frequent (minor) version releases in the future, to make things easier for PyPI users!

We're very pleased by the uptick in interest in LM Evaluation Harness recently, and we hope to continue to improve the library as time goes on. We're grateful to everyone who's contributed, and are excited by how many new contributors this version brings! If you have feedback for us, or would like to help out developing the library, please let us know.

In the next version release, we hope to include:

  • Chat Templating + System Prompt support, for locally-run models
  • Improved Answer Extraction for many generative tasks, making them easier to run zero-shot and less dependent on model output formatting
  • General speedups and QoL fixes to the non-inference portions of LM-Evaluation-Harness, including drastically reduced startup times / faster non-inference processing steps especially when num_fewshot is large!
  • A new TaskManager object and the deprecation of lm_eval.tasks.initialize_tasks(), making it easier to register many tasks and configure new groups of tasks

What's Changed


v0.4.0

04 Dec 15:08
c9bbec6

What's Changed


v0.3.0

08 Dec 08:34

HuggingFace Datasets Integration

This release integrates HuggingFace datasets as the core dataset management interface, removing previous custom downloaders.

What's Changed

New Contributors

Full Changelog: v0.2.0...v0.3.0

v0.2.0

07 Mar 02:12

Major changes since 0.1.0:

  • added blimp (#237)
  • added qasper (#264)
  • added asdiv (#244)
  • added truthfulqa (#219)
  • added gsm (#260)
  • implemented description dict and deprecated provide_description (#226)
  • new --check_integrity flag to run integrity unit tests at eval time (#290)
  • positional arguments to evaluate and simple_evaluate are now deprecated
  • _CITATION attribute on task modules (#292)
  • lots of bug fixes and task fixes (always remember to report task versions for comparability!)

v0.0.1

02 Sep 02:28
Rename package