v2.1.0
Release date: 2023-08-25 00:42:00
Highlights
🔌 Compatibility with 🤗 Transformers generation utils. Petals models now directly use the 🤗 Transformers `.generate()` implementation instead of custom generation code. This means that you can use a variety of generation methods and constraints implemented in 🤗 Transformers (e.g., `repetition_penalty`, beam search, etc.) and expect an exact match between Petals and a model running locally.
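For instance, a minimal end-to-end sketch (the model name and prompt here are placeholders, not part of the release) that passes standard 🤗 Transformers generation arguments to a Petals model:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # placeholder: any Petals-supported model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
# Standard 🤗 Transformers arguments (repetition_penalty, num_beams, ...) are passed through as-is
output_ids = model.generate(input_ids, max_new_tokens=16, repetition_penalty=1.2)
print(tokenizer.decode(output_ids[0]))
```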
Most common methods are compatible with reusing inference sessions, so that you can run `.generate()` multiple times without reprocessing the dialogue history from scratch:
```python
with model.inference_session(max_length=100):
    outputs1 = model.generate(user_prompt1, repetition_penalty=1.2)
    outputs2 = model.generate(user_prompt2, repetition_penalty=1.2)
```
⚡ Faster loading of Stable Beluga 2. We repacked Stable Beluga 2, the most popular model at the moment, to increase its loading speed and minimize RAM and disk space requirements. The repacked version can be loaded from the `petals-team/StableBeluga2` repository and is fully compatible with clients and servers using the standard repository (`stabilityai/StableBeluga2`).
Now, clients need to download only 1.05 GB of data to run Stable Beluga 2 (instead of ~20 GB needed before) and require only 4 GB of RAM (instead of ~20 GB required before). Servers need to download and store 2x less data and load the model from disk significantly faster. If you're switching from the old repository, don't forget to remove the old cache in the `~/.cache/petals/models--stabilityai--StableBeluga2` directory to save disk space.
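If you prefer to clean up from Python rather than from a shell, here is a small sketch that removes the old cache directory mentioned above (the path is the default Petals cache location):

```python
import shutil
from pathlib import Path

# Delete the weights cached from the old stabilityai/StableBeluga2 repository
old_cache = Path.home() / ".cache" / "petals" / "models--stabilityai--StableBeluga2"
shutil.rmtree(old_cache, ignore_errors=True)
```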
⏱️ More responsive inference. In older versions, servers could become unresponsive for a few seconds while processing large prefixes (thousands of tokens) during inference. This release allows small inference requests (a few tokens) to be served in the middle of processing a large request, avoiding freezes during token-by-token inference caused by someone else processing a large prefix.
🔒 Minor improvements. This release adds support for loading weights in the safetensors format on servers and adds the `blocked_servers` client option to avoid a given set of servers:
```python
from petals import AutoDistributedModelForCausalLM

blocked_servers = ["12D3KooWA6g...", "12D3KooWGyD..."]  # Full peer IDs from https://health.petals.dev
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, blocked_servers=blocked_servers)
```
🐞 Bug fixes. This release also includes a variety of bug fixes that speed up the chatbot app and fine-tuning, better bypass recently disconnected servers, improve the rebalancing algorithm and the usability of benchmarks, and fix throughput measurements and installation on ARM CPUs.
We also fixed Petals compatibility with the latest releases of 🤗 Transformers, Accelerate, and PEFT libraries.
Breaking changes
📖 Default inference sessions. If you run `.generate()` or forward passes inside an `.inference_session()` context, they now use the opened session by default. These snippets are now equivalent:
```python
# Using default session
with model.inference_session(max_length=100):
    output_ids = model.generate(input_ids, max_new_tokens=3)
```

```python
# Explicitly specifying a session
with model.inference_session(max_length=100) as sess:
    output_ids = model.generate(input_ids, max_new_tokens=3, session=sess)
```
Earlier, the first snippet created a new session, which confused many people and led to bugs.
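The same default applies to plain forward passes; a hedged sketch (reusing the names from the snippets above):

```python
with model.inference_session(max_length=100):
    outputs = model(input_ids)                    # runs within the opened session by default
    next_token_logits = outputs.logits[:, -1, :]  # logits for the last supplied token
```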
➡️ Renaming. We renamed `SequenceManagerConfig` to `petals.ClientConfig` and `petals.dht_utils` to `petals.utils.dht`. The old names now lead to `DeprecationWarning`s and will be removed in Petals 2.2.0+.
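A minimal migration sketch for the new names (the old ones keep working until 2.2.0, but emit a `DeprecationWarning`):

```python
# New locations introduced in this release
from petals import ClientConfig  # previously SequenceManagerConfig
import petals.utils.dht          # previously petals.dht_utils
```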
What's Changed
- Fix stale link by @bot66 in https://github.com/bigscience-workshop/petals/pull/418
- Add Discord badge and more Discord links to readme by @borzunov in https://github.com/bigscience-workshop/petals/pull/422
- Add connect_timeout by @borzunov in https://github.com/bigscience-workshop/petals/pull/423
- Add Stable Beluga 2 to readme by @borzunov in https://github.com/bigscience-workshop/petals/pull/424
- Penalize servers that use relays during rebalancing by @borzunov in https://github.com/bigscience-workshop/petals/pull/428
- Fix petals.utils.ping for servers with client-mode DHT by @borzunov in https://github.com/bigscience-workshop/petals/pull/430
- Fix typo and make blocks message more informative by @vadi2 in https://github.com/bigscience-workshop/petals/pull/437
- Update Discord links from channels to forums by @borzunov in https://github.com/bigscience-workshop/petals/pull/440
- Remove distracting links from readme by @borzunov in https://github.com/bigscience-workshop/petals/pull/441
- Remove deprecated comment in fine-tuning notebook by @borzunov in https://github.com/bigscience-workshop/petals/pull/443
- Use bitsandbytes 0.41.1 by @borzunov in https://github.com/bigscience-workshop/petals/pull/442
- [Refactor] extract block forward, backward and inference into a separate file by @justheuristic in https://github.com/bigscience-workshop/petals/pull/435
- Override float32 in config to bfloat16 by @borzunov in https://github.com/bigscience-workshop/petals/pull/431
- Prefer longer servers for fine-tuning, exclude unreachable by @borzunov in https://github.com/bigscience-workshop/petals/pull/448
- Force using --new_swarm instead of empty --initial_peers by @borzunov in https://github.com/bigscience-workshop/petals/pull/451
- Test Llama, rebalancing, throughput eval, and all CLI scripts by @borzunov in https://github.com/bigscience-workshop/petals/pull/452
- benchmarks: Aggregate speed among workers, set default dtype torch32 by @borzunov in https://github.com/bigscience-workshop/petals/pull/454
- Use torch.cuda.synchronize for compute throughput by @justheuristic in https://github.com/bigscience-workshop/petals/pull/456
- Prioritize short inference, unmerge pools for long inference by @borzunov in https://github.com/bigscience-workshop/petals/pull/458
- Bump version to 2.0.1.post2 by @borzunov in https://github.com/bigscience-workshop/petals/pull/459
- Add blocked_servers argument by @borzunov in https://github.com/bigscience-workshop/petals/pull/462
- Add customizable input tensors by @artek0chumak in https://github.com/bigscience-workshop/petals/pull/445
- Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht by @borzunov in https://github.com/bigscience-workshop/petals/pull/463
- Make client compatible with transformers' GenerationMixin by @borzunov in https://github.com/bigscience-workshop/petals/pull/464
- Temporarily require peft<0.5.0, transformers<4.32.0 by @justheuristic in https://github.com/bigscience-workshop/petals/pull/470
- Support transformers 4.32.x by @justheuristic in https://github.com/bigscience-workshop/petals/pull/471
- Change transformers version assert by @justheuristic in https://github.com/bigscience-workshop/petals/pull/472
- Support loading weights from Safetensors on server by @borzunov in https://github.com/bigscience-workshop/petals/pull/473
- Update peft to 0.5.0 version by @artek0chumak in https://github.com/bigscience-workshop/petals/pull/475
- Hide excess key message by @borzunov in https://github.com/bigscience-workshop/petals/pull/476
- Bump version to 2.1.0 by @borzunov in https://github.com/bigscience-workshop/petals/pull/474
- Don't install cpufeature on non-x86_64 machines by @borzunov in https://github.com/bigscience-workshop/petals/pull/478
New Contributors
- @bot66 made their first contribution in https://github.com/bigscience-workshop/petals/pull/418
Full Changelog: https://github.com/bigscience-workshop/petals/compare/v2.0.1...v2.1.0