v2.0.0.post1

Release date: 2023-07-20 02:29:48

Latest release of bigscience-workshop/petals: v2.2.0 (2023-09-07 01:29:56)

We're excited to announce Petals 2.0.0 — the largest Petals release to date!

Highlights

🦙 Support for LLaMA and LLaMA 2. We've added support for inference and fine-tuning of any model based on the 🤗 Transformers LlamaModel, including all variants of LLaMA and LLaMA 2, which are among the strongest open-source models available today. The public swarm hosts the largest variants of these models, LLaMA-65B and LLaMA 2 (70B and 70B-Chat), providing inference at speeds of up to 5-6 tokens/sec.

You can try them in the 💬 chatbot web app or in 🚀 our Colab tutorial.

🗜️ 4-bit quantization. We've integrated efficient 4-bit (NF4) quantization from the recent "QLoRA: Efficient Finetuning of Quantized LLMs" paper. Compared to the 8-bit quantization we previously used, this takes ~40% less GPU memory (and thus ~40% fewer servers) to fit all model blocks and gives a ~2x speedup for token-by-token inference, with relatively small quality loss.
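As a rough sanity check on the memory figure, the weight footprint can be estimated from bits per parameter. This is a back-of-the-envelope sketch, not the measurement behind the ~40% number: it assumes a 70B-parameter model, one 32-bit scaling constant per block of 64 weights for NF4 (so about 4.5 bits per parameter, without double quantization), and it ignores embeddings, activations, and attention caches.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 70e9                 # assumption: a 70B-parameter model
NF4_BLOCK_SIZE = 64             # assumption: weights share one scale per 64 values
NF4_BITS = 4 + 32 / NF4_BLOCK_SIZE  # 4-bit codes plus a 32-bit scale per block

int8_gib = weight_gib(N_PARAMS, 8)
nf4_gib = weight_gib(N_PARAMS, NF4_BITS)
saving = 1 - nf4_gib / int8_gib  # fraction of weight memory saved vs. 8-bit

print(f"8-bit: {int8_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB, saving: {saving:.1%}")
```

Under these assumptions the saving comes out a little above 40%, which is in line with the figure quoted above; real servers also hold caches and unquantized tensors, so the measured saving will differ.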

🔌 Pre-loading LoRA adapters, such as Guanaco. We've also added the ability to pre-load LoRA adapters compatible with the 🤗 PEFT library, which may add extra functionality to the model you host. These adapters are activated at a client's request: the client may specify .from_pretrained(..., active_adapter="adapter_repo") when loading a distributed model. One example is Guanaco, an instruction-finetuned adapter for LLaMA that turns it into a helpful chatbot that carefully follows the user's instructions. You can try using LLaMA with this adapter in our chatbot app.
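On the client side, requesting an adapter might look like the sketch below. The active_adapter argument and the AutoDistributedModelForCausalLM class come from these release notes; the two repository names are assumptions chosen for illustration, and the call only succeeds if servers in the swarm have pre-loaded that adapter.

```python
MODEL_NAME = "huggyllama/llama-65b"        # assumption: base model repo on the Hub
ADAPTER_NAME = "timdettmers/guanaco-65b"   # assumption: PEFT LoRA adapter repo

def load_model_with_adapter(model_name=MODEL_NAME, adapter=ADAPTER_NAME):
    """Connect to the swarm and ask servers to apply a pre-loaded adapter."""
    # Imported lazily so that defining this helper does not require Petals.
    from petals import AutoDistributedModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(
        model_name, active_adapter=adapter
    )
    return tokenizer, model

def generate_reply(prompt, max_new_tokens=16):
    """Generate a short continuation with the adapter active."""
    tokenizer, model = load_model_with_adapter()
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Calling generate_reply("Hello") needs network access to a swarm that serves this model and adapter.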

➡️ Direct server-to-server communication. Previously, servers didn't send tensors to each other directly due to the specifics of our fault-tolerant inference algorithm. This update changes that, which saves round trips between the servers and the client and leads to substantial speedups for clients located far away from the servers they're using.
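The effect on network time can be illustrated with a simplified latency model. This is a sketch under stated assumptions, not the Petals protocol itself: it counts only one-way network delays per generated token, ignores compute time and bandwidth, and the latency values are made up.

```python
def relayed_latency(client_to_server):
    """Every hop goes through the client: there and back for each server."""
    return sum(2 * lat for lat in client_to_server)

def direct_latency(client_to_server, server_to_server):
    """Client -> first server -> ... -> last server -> client."""
    assert len(server_to_server) == len(client_to_server) - 1
    return client_to_server[0] + sum(server_to_server) + client_to_server[-1]

# Hypothetical setup: three servers in one region, client on another continent.
client_to_server = [0.100, 0.100, 0.100]   # seconds, one way
server_to_server = [0.005, 0.005]          # seconds, one way

relayed = relayed_latency(client_to_server)
direct = direct_latency(client_to_server, server_to_server)
print(f"relayed: {relayed:.3f} s, direct: {direct:.3f} s per token")
```

In this model the saving grows with the number of servers in the chain and with the client's distance from them, which matches the claim that far-away clients benefit most.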

🛣️ Shortest-path routing for inference. Previously, a client didn't properly choose geographically close and fast servers, so it could end up with a slow inference chain, especially if the swarm had many servers located far away from it. Now, the client builds a full graph of client-server and server-server latencies, as well as server inference speeds, to find the fastest chain of servers for inference among all possible ones. It also considers the amount of GPU memory left for attention caches, so that we don't choose a close server that doesn't actually have memory for our request.
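The idea can be sketched as a shortest-path search over states of the form (next block to run, current location). This is a minimal illustration using Dijkstra's algorithm, not the Petals implementation: the Server fields, the fastest_chain helper, and all numbers are hypothetical, each server is assumed to run every block from the current one to the end of its span, and n_blocks is assumed to be at least 1.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    name: str
    start: int               # first hosted block (inclusive)
    end: int                 # last hosted block (exclusive)
    sec_per_block: float     # compute time per block for one token
    free_cache_tokens: int   # attention-cache room left on the GPU

def fastest_chain(servers, latency, n_blocks, need_tokens):
    """Return (seconds, [server names]) for the fastest chain, or None."""
    def lat(a, b):  # latencies are stored once per unordered pair
        return latency[(a, b)] if (a, b) in latency else latency[(b, a)]

    # Skip servers that cannot hold the attention cache for this request.
    usable = [s for s in servers if s.free_cache_tokens >= need_tokens]
    heap = [(0.0, 0, "client", ())]   # (cost, next block, location, path)
    settled = set()
    while heap:
        cost, block, loc, path = heapq.heappop(heap)
        if loc == "done":
            return cost, list(path)
        if (block, loc) in settled:
            continue
        settled.add((block, loc))
        if block >= n_blocks:         # all blocks done: return to the client
            heapq.heappush(heap, (cost + lat(loc, "client"), block, "done", path))
            continue
        for s in usable:
            if s.start <= block < s.end and s.name != loc:
                end = min(s.end, n_blocks)
                step = lat(loc, s.name) + (end - block) * s.sec_per_block
                heapq.heappush(heap, (cost + step, end, s.name, path + (s.name,)))
    return None

servers = [
    Server("A", 0, 4, 0.010, 4096),
    Server("B", 4, 8, 0.010, 4096),
    Server("C", 0, 8, 0.005, 4096),   # fast GPU, but far away
    Server("D", 4, 8, 0.002, 0),      # closest and fastest, but its cache is full
]
latency = {
    ("client", "A"): 0.020, ("client", "B"): 0.020,
    ("client", "C"): 0.150, ("client", "D"): 0.010,
    ("A", "B"): 0.010, ("A", "C"): 0.150, ("A", "D"): 0.005,
}
total, chain = fastest_chain(servers, latency, n_blocks=8, need_tokens=1024)
print(chain, round(total, 3))
```

With these numbers the search picks the chain A, B: server C is too far away to be worth its faster GPU, and server D is skipped because it has no cache memory left for the request.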

🌎 Loading models directly from 🤗 Model Hub and Auto classes. Starting from Petals 2.0.0, models do not need to be converted to a special format to be hosted by Petals. Instead, both clients and servers can load models directly from 🤗 Model Hub, fetching only the shards they need to host their part of the model. Furthermore, you can write code supporting multiple architectures at once using Auto classes, such as AutoDistributedConfig.from_pretrained(...) and AutoDistributedModelForCausalLM.from_pretrained(...). The guide for adding new model architectures to Petals also became much simpler due to generalizing Petals code to multiple architectures and the absence of the model conversion step.

🏋️ Fine-tuning examples. We've switched most examples to LLaMA-65B and fixed previously reported bugs. In particular, the "Getting started" notebook now includes a simple example of deep prompt tuning on a dummy task, and the sequence classification notebook uses LLaMA-65B and improved hyperparameters for stable training.

🖥️ Upgraded swarm monitor. The swarm monitor now contains much more info about each server, including pre-loaded LoRA adapters, detailed performance info, latencies to potential next servers, and so on. All this info is published to the DHT, so you don't need to ping each server to fetch it. We've also added a "Contributor" column, so that contributors hosting 10+ blocks get a chance to publish their name or advertise their company or a social media account in exchange for hosting a server for Petals. The name (or link) shown there can be specified using the server's --public_name argument.

What's Changed

Remove unused imports and attributes by @mryab in https://github.com/bigscience-workshop/petals/pull/324
Determine block dtype in a unified manner by @mryab in https://github.com/bigscience-workshop/petals/pull/325
Use number of tokens for attn_cache_size by @mryab in https://github.com/bigscience-workshop/petals/pull/286
Add LLaMA support by @borzunov in https://github.com/bigscience-workshop/petals/pull/323
Add AutoDistributed{Model, ModelForCausalLM, ModelForSequenceClassification} by @borzunov in https://github.com/bigscience-workshop/petals/pull/329
Fix llama's lm_head.weight.requires_grad by @borzunov in https://github.com/bigscience-workshop/petals/pull/330
Show license links when loading models by @borzunov in https://github.com/bigscience-workshop/petals/pull/332
Add benchmark scripts by @borzunov in https://github.com/bigscience-workshop/petals/pull/319
Fix warmup steps and minor issues in benchmarks by @borzunov in https://github.com/bigscience-workshop/petals/pull/334
Require pydantic < 2.0 (2.0 is incompatible with hivemind 1.1.8) by @borzunov in https://github.com/bigscience-workshop/petals/pull/337
Support loading blocks in 4-bit (QLoRA NF4 format, disabled by default) by @borzunov in https://github.com/bigscience-workshop/petals/pull/333
Allow free_disk_space_for() remove arbitrary files from Petals cache by @borzunov in https://github.com/bigscience-workshop/petals/pull/339
Implement direct server-to-server communication by @borzunov in https://github.com/bigscience-workshop/petals/pull/331
Use 4-bit for llama by default, use bitsandbytes 0.40.0.post3 by @borzunov in https://github.com/bigscience-workshop/petals/pull/340
Delete deprecated petals.cli scripts by @borzunov in https://github.com/bigscience-workshop/petals/pull/336
Use bitsandbytes 0.40.0.post4 with bias hotfix by @borzunov in https://github.com/bigscience-workshop/petals/pull/342
Support peft LoRA adapters by @artek0chumak in https://github.com/bigscience-workshop/petals/pull/335
Fix convergence issues and switch to LLaMA in the SST-2 example by @mryab in https://github.com/bigscience-workshop/petals/pull/343
Mention LLaMA in readme by @borzunov in https://github.com/bigscience-workshop/petals/pull/344
Import petals.utils.peft only when needed to avoid unnecessary import of bitsandbytes by @borzunov in https://github.com/bigscience-workshop/petals/pull/345
Fix Docker build by avoiding Python 3.11 by @borzunov in https://github.com/bigscience-workshop/petals/pull/348
Support LLaMA repos without "-hf" suffix by @borzunov in https://github.com/bigscience-workshop/petals/pull/349
Estimate adapter memory overhead in choose_num_blocks() by @justheuristic in https://github.com/bigscience-workshop/petals/pull/346
Spam less in server logs by @borzunov in https://github.com/bigscience-workshop/petals/pull/350
Remove unused import os by @justheuristic in https://github.com/bigscience-workshop/petals/pull/352
Test that bitsandbytes is not imported when it's not used by @borzunov in https://github.com/bigscience-workshop/petals/pull/351
Fix bugs in _choose_num_blocks() added in #346 by @borzunov in https://github.com/bigscience-workshop/petals/pull/354
Switch adapters slightly faster by @justheuristic in https://github.com/bigscience-workshop/petals/pull/353
Share more info about a server in DHT by @borzunov in https://github.com/bigscience-workshop/petals/pull/355
Make a server ping next servers by @borzunov in https://github.com/bigscience-workshop/petals/pull/356
Use bitsandbytes 0.40.1.post1 by @borzunov in https://github.com/bigscience-workshop/petals/pull/357
Update readme and "Getting started" link by @borzunov in https://github.com/bigscience-workshop/petals/pull/360
Report inference, forward, and network RPS separately by @borzunov in https://github.com/bigscience-workshop/petals/pull/358
Fix typo in generation_algorithms.py by @eltociear in https://github.com/bigscience-workshop/petals/pull/364
Implement shortest-path routing for inference by @borzunov in https://github.com/bigscience-workshop/petals/pull/362
Update readme to show new models by @borzunov in https://github.com/bigscience-workshop/petals/pull/365
Require transformers < 4.31.0 until we're compatible by @borzunov in https://github.com/bigscience-workshop/petals/pull/369
Fix AssertionError on rebalancing by @borzunov in https://github.com/bigscience-workshop/petals/pull/370
Update transformers to 4.31.0 and peft to 0.4.0 by @borzunov in https://github.com/bigscience-workshop/petals/pull/371
Fix readme code example, require Python < 3.11 until supported by @borzunov in https://github.com/bigscience-workshop/petals/pull/374
Fix handler memory leak, get rid of mp.Manager by @justheuristic in https://github.com/bigscience-workshop/petals/pull/373
Inherit bitsandbytes compute dtype correctly (override peft quirk) by @justheuristic in https://github.com/bigscience-workshop/petals/pull/377
Fix --token arg by @borzunov in https://github.com/bigscience-workshop/petals/pull/378
Support Llama 2 by @borzunov in https://github.com/bigscience-workshop/petals/pull/379
Require accelerate>=0.20.3 as transformers do by @borzunov in https://github.com/bigscience-workshop/petals/pull/383
Bump version to 2.0.0.post1 by @borzunov in https://github.com/bigscience-workshop/petals/pull/384

New Contributors

@eltociear made their first contribution in https://github.com/bigscience-workshop/petals/pull/364

Full Changelog: https://github.com/bigscience-workshop/petals/compare/v1.1.5...v2.0.0.post1
