v2.0.0
Release date: 2024-07-31 14:49:24
Latest release of argilla-io/argilla: v2.1.0 (2024-09-05 23:11:08)
🔆 Release highlights
One Dataset to rule them all
The main difference between Argilla 1.x and Argilla 2.x is that we've converted the previous dataset types tailored for specific NLP tasks into a single highly-configurable Dataset class.
With the new Dataset you can combine multiple fields and question types, so you can adapt the UI for your specific project. This offers you more flexibility, while making Argilla easier to learn and maintain.
[!IMPORTANT] If you want to continue using your legacy datasets in Argilla 2.x, you will need to convert them into v2 Datasets as explained in this migration guide. This includes DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text. FeedbackDatasets do not need to be converted, as they are already compatible with the Argilla v2 format.
New SDK & documentation
We've redesigned our SDK to fit the new single Dataset and Record classes and, most importantly, to improve the user and developer experience.
The main goal of the new design is to make the SDK easier to use and learn, making it simpler and faster to configure your dataset and get it up and running.
Here's an example of what creating a Dataset looks like:
import argilla as rg
from datasets import load_dataset

# log to the Argilla client
client = rg.Argilla(
    api_url="<api_url>",
    api_key="<api_key>",
    # headers={"Authorization": f"Bearer {HF_TOKEN}"}
)

# configure dataset settings
settings = rg.Settings(
    guidelines="Classify the reviews as positive or negative.",
    fields=[
        rg.TextField(
            name="review",
            title="Text from the review",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="my_label",
            title="In which category does this article fit?",
            labels=["positive", "negative"],
        )
    ],
)

# create the dataset in your Argilla instance
dataset = rg.Dataset(
    name="my_first_dataset",
    settings=settings,
    client=client,
)
dataset.create()

# get some data from the hugging face hub and load the records
data = load_dataset("imdb", split="train[:100]").to_list()
dataset.records.log(records=data, mapping={"text": "review"})
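The mapping argument in the last line connects a column name in the source data ("text") to a field name in the dataset settings ("review"). As a rough mental model only, it can be pictured as a key rename applied to each row. The sketch below is not Argilla's implementation: apply_mapping is a hypothetical helper, and it makes no claim about how Argilla treats columns that match no field or question.

```python
def apply_mapping(row: dict, mapping: dict) -> dict:
    """Return a copy of `row` with keys renamed according to `mapping`.

    Hypothetical helper for illustration only; keys that do not appear
    in `mapping` are copied over unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# An IMDB-style row: the review text lives under "text".
row = {"text": "A wonderful film.", "label": 1}

# Same mapping as in the example above: source column -> field name.
renamed = apply_mapping(row, {"text": "review"})
print(renamed)
```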
To learn more about this SDK and how it works, check out our revamped documentation: https://argilla-io.github.io/argilla/latest
We built this new documentation site from scratch, applying the Diátaxis framework and UX principles in the hope of making this version cleaner and the information easier to find.
New UI layout
We have also redesigned part of our UI for Argilla 2.0:
- We've redistributed the information in the Home page.
- Datasets don't have Tasks, but Questions.
- A clearer way to see your team's progress over each dataset.
- Annotation guidelines and your progress are now accessible at all times within the dataset page.
- Dataset pages also have a new flexible layout, so you can change the size of different panels and expand or collapse the guidelines and progress.
- SpanQuestions are now supported in the bulk view.
https://github.com/user-attachments/assets/2d959c8a-b4ac-446b-8326-bd66daa28816
Automatic task distribution
Argilla 2.0 also comes with an automated way to split the task of annotating a dataset among a team. Here's how it works in a nutshell:
- An owner or an admin can set the minimum number of submitted responses expected for each record.
- When a record reaches that threshold, its status changes to complete and it's automatically removed from the pending queue of all team members.
- A dataset is 100% complete when all records have the status complete.
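The rules above can be sketched in plain Python. This is a simplified model, not Argilla's server logic: SketchRecord, pending_queue, and progress are hypothetical names, the "pending" status label is an assumption, and the model ignores per-user details such as which annotator has already responded to a record.

```python
from dataclasses import dataclass


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a record; it only counts submitted responses."""
    submitted_responses: int = 0

    def status(self, min_submitted: int) -> str:
        # A record is complete once it has at least `min_submitted` submitted responses.
        return "complete" if self.submitted_responses >= min_submitted else "pending"


def pending_queue(records, min_submitted):
    """Records still waiting for responses: everything that is not complete."""
    return [r for r in records if r.status(min_submitted) == "pending"]


def progress(records, min_submitted):
    """Fraction of records with status complete (1.0 means 100% complete)."""
    if not records:
        return 0.0
    done = sum(r.status(min_submitted) == "complete" for r in records)
    return done / len(records)


records = [SketchRecord(3), SketchRecord(1), SketchRecord(0), SketchRecord(4)]
print(len(pending_queue(records, min_submitted=3)))  # 2 records still pending
print(progress(records, min_submitted=3))            # 0.5
```

With a threshold of 3, the records holding 3 and 4 submitted responses count as complete, so half of the dataset is done and the other two records stay in the queue.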
By default, the minimum number of submitted responses is 1, but you can create a dataset with a different value:
settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
    distribution=rg.TaskDistribution(min_submitted=3)
)
You can also change the value of an existing dataset as long as it has no responses. You can do this from the General tab inside the Dataset Settings page in the UI or from the SDK:
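import argilla as rg

client = rg.Argilla(...)

# retrieve the existing dataset by name
dataset = client.datasets("my_dataset")

# change the threshold; only possible while the dataset has no responses
dataset.settings.distribution.min_submitted = 4
dataset.update()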
To learn more, check our guide on how to distribute the annotation task.
Easily deploy in Hugging Face Spaces
We've streamlined the deployment of an Argilla Space in the Hugging Face Hub. Now, there's no need to manage users and passwords. Follow these simple steps to create your Argilla Space:
- Select the Argilla template.
- Choose your hardware and persistent storage options (if you prefer options other than the recommended ones).
- If you are creating a space inside an organization, enter your Hugging Face Hub username under username to get the owner role.
- Leave password empty if you'd like to use Hugging Face OAuth to sign in to Argilla.
- Select if the space will be public or private.
- Create Space! 🎉
Now you and your teammates can simply sign in to Argilla using Hugging Face OAuth! Learn more about deploying Argilla in Hugging Face Spaces.
https://github.com/user-attachments/assets/a57a8712-ef4e-45f3-8c38-7bbc47adf02b
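Once the Space is running, the SDK client from the earlier example can point at it. The commented-out headers line in that example passes a Hugging Face token as a bearer header; the sketch below assumes this is only needed when the Space is private, which the deployment guide should confirm. build_client_kwargs is a hypothetical helper, and reading the token from an HF_TOKEN environment variable is likewise an assumption.

```python
import os
from typing import Optional


def build_client_kwargs(api_url: str, api_key: str, hf_token: Optional[str] = None) -> dict:
    """Assemble keyword arguments for rg.Argilla (hypothetical helper).

    The Authorization header is added only when a token is supplied,
    mirroring the commented-out `headers` line in the SDK example.
    """
    kwargs = {"api_url": api_url, "api_key": api_key}
    if hf_token:
        kwargs["headers"] = {"Authorization": f"Bearer {hf_token}"}
    return kwargs


kwargs = build_client_kwargs(
    api_url="<api_url>",
    api_key="<api_key>",
    hf_token=os.environ.get("HF_TOKEN"),  # assumed variable name; None if unset
)
# With argilla installed, the client would then be created as:
# client = rg.Argilla(**kwargs)
```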
New Contributors
- @bikash119 made their first contribution in https://github.com/argilla-io/argilla/pull/5294
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.29.1...v2.0.0