v2.0.0

Release date: 2024-07-31 14:49:24

Latest argilla-io/argilla release: v2.1.0 (2024-09-05 23:11:08)

🔆 Release highlights

One `Dataset` to rule them all

The main difference between Argilla 1.x and Argilla 2.x is that we've converted the previous dataset types tailored for specific NLP tasks into a single highly-configurable Dataset class.

With the new Dataset you can combine multiple fields and question types, so you can adapt the UI for your specific project. This offers you more flexibility, while making Argilla easier to learn and maintain.

[!IMPORTANT] If you want to continue using your legacy datasets in Argilla 2.x, you will need to convert them into v2 Datasets as explained in this migration guide. This applies to DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text.

FeedbackDatasets do not need to be converted, as they are already compatible with the Argilla v2 format.

New SDK & documentation

We've redesigned our SDK to fit the new single Dataset and Record classes and, most importantly, to improve the user and developer experience.

The main goal of the new design is to make the SDK easier to use and learn, making it simpler and faster to configure your dataset and get it up and running.

Here's an example of what creating a Dataset looks like:

import argilla as rg
from datasets import load_dataset

# connect to your Argilla instance
client = rg.Argilla(
    api_url="<api_url>",
    api_key="<api_key>",
    # headers={"Authorization": f"Bearer {HF_TOKEN}"}
)

# configure dataset settings
settings = rg.Settings(
    guidelines="Classify the reviews as positive or negative.",
    fields=[
        rg.TextField(
            name="review",
            title="Text from the review",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="my_label",
            title="In which category does this article fit?",
            labels=["positive", "negative"],
        )
    ],
)

# create the dataset in your Argilla instance
dataset = rg.Dataset(
    name="my_first_dataset",
    settings=settings,
    client=client,
)
dataset.create()

# get some data from the Hugging Face Hub and log the records
data = load_dataset("imdb", split="train[:100]").to_list()
dataset.records.log(records=data, mapping={"text": "review"})
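
The `mapping` argument in the last call connects a key in the source data (`text`) to a name defined in the dataset settings (`review`). The following is a minimal, plain-Python sketch of that renaming idea only. It is not the SDK's implementation, `apply_mapping` is a hypothetical helper, and how the SDK treats keys that match nothing in the settings (such as `label` here) is outside the scope of this sketch.

```python
def apply_mapping(row: dict, mapping: dict) -> dict:
    """Return a copy of `row` with keys renamed according to `mapping`.

    Hypothetical helper for illustration; keys absent from `mapping`
    are kept under their original name.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


# A row shaped like the IMDB data loaded above (values made up for illustration).
row = {"text": "A wonderful little film.", "label": 1}

renamed = apply_mapping(row, {"text": "review"})
print(renamed)
```

After the renaming, the review text sits under the `review` key, which is the name given to the `TextField` in the settings.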

To learn more about this SDK and how it works, check out our revamped documentation: https://argilla-io.github.io/argilla/latest

We built this new documentation site from scratch, applying the Diátaxis framework and UX principles in the hope of making this version cleaner and the information easier to find.

New UI layout

We have also redesigned part of our UI for Argilla 2.0:

- We've redistributed the information on the Home page.
- Datasets no longer have Tasks; they have Questions.
- There is a clearer way to see your team's progress on each dataset.
- Annotation guidelines and your progress are now accessible at all times within the dataset page.
- Dataset pages also have a new flexible layout, so you can resize the panels and expand or collapse the guidelines and progress.
- SpanQuestions are now supported in the bulk view.

https://github.com/user-attachments/assets/2d959c8a-b4ac-446b-8326-bd66daa28816

Automatic task distribution

Argilla 2.0 also comes with an automated way to split the task of annotating a dataset among a team. Here's how it works in a nutshell:

- An owner or an admin can set the minimum number of submitted responses expected for each record.
- When a record reaches that threshold, its status changes to complete and it's automatically removed from the pending queue of all team members.
- A dataset is 100% complete when all records have the status complete.
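
The rules above can be sketched in plain Python. This is a toy model under stated assumptions, not Argilla's implementation: the `Record` class, `pending_queue`, and `progress` are hypothetical names, and the model additionally assumes that a user stops seeing a record once they have submitted their own response to it.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a record; not the SDK's Record class."""
    id: str
    submitted_by: set = field(default_factory=set)

    def status(self, min_submitted: int) -> str:
        # A record is complete once enough users have submitted a response.
        if len(self.submitted_by) >= min_submitted:
            return "complete"
        return "pending"


def pending_queue(records, user, min_submitted):
    """Records still shown to `user`: pending and not yet answered by them
    (the second condition is an assumption of this sketch)."""
    return [
        r for r in records
        if r.status(min_submitted) == "pending" and user not in r.submitted_by
    ]


def progress(records, min_submitted):
    """Percentage of records whose status is complete."""
    if not records:
        return 0.0
    done = sum(r.status(min_submitted) == "complete" for r in records)
    return 100.0 * done / len(records)


records = [Record("r1"), Record("r2")]
records[0].submitted_by.update({"ana", "ben"})  # r1 reaches the threshold of 2
records[1].submitted_by.add("ana")              # r2 still needs one more response

cleo_queue = [r.id for r in pending_queue(records, "cleo", min_submitted=2)]
print(cleo_queue, progress(records, min_submitted=2))
```

With a threshold of 2, `r1` leaves every queue as soon as the second response arrives, while `r2` stays visible to annotators who have not answered it yet.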

By default, the minimum number of submitted responses is 1, but you can create a dataset with a different value:

settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
    distribution=rg.TaskDistribution(min_submitted=3)
)

You can also change this value for an existing dataset, as long as it has no responses yet. You can do this from the General tab inside the Dataset Settings page in the UI or from the SDK:

import argilla as rg

client = rg.Argilla(...)

dataset = client.datasets("my_dataset")

dataset.settings.distribution.min_submitted = 4

dataset.update()

To learn more, check our guide on how to distribute the annotation task.

Easily deploy in Hugging Face Spaces

We've streamlined the deployment of an Argilla Space in the Hugging Face Hub. Now, there's no need to manage users and passwords. Follow these simple steps to create your Argilla Space:

1. Select the Argilla template.
2. Choose your hardware and persistent storage options (if you prefer options other than the recommended ones).
3. If you are creating a Space inside an organization, enter your Hugging Face Hub username under username to get the owner role.
4. Leave password empty if you'd like to use Hugging Face OAuth to sign in to Argilla.
5. Select whether the Space will be public or private.
6. Create Space! 🎉

Now you and your teammates can simply sign in to Argilla using Hugging Face OAuth! Learn more about deploying Argilla in Hugging Face Spaces.
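
Once the Space is running, the SDK connects to it like any other Argilla instance. The sketch below only assembles the keyword arguments used in the `rg.Argilla(...)` example earlier in these notes. `space_client_kwargs` is a hypothetical helper, and the assumption that the `Authorization` header (shown commented out in that example) is what a private Space needs is ours, not something these notes state.

```python
def space_client_kwargs(space_url, api_key, hf_token=None):
    """Build keyword arguments for rg.Argilla(...).

    Hypothetical helper. `space_url` is the URL of your Argilla Space;
    `hf_token` is assumed to be needed only when the Space is private.
    """
    kwargs = {"api_url": space_url, "api_key": api_key}
    if hf_token:
        # Same header as the commented-out line in the earlier example.
        kwargs["headers"] = {"Authorization": f"Bearer {hf_token}"}
    return kwargs


public_kwargs = space_client_kwargs("<api_url>", "<api_key>")
private_kwargs = space_client_kwargs("<api_url>", "<api_key>", hf_token="<hf_token>")
print(private_kwargs)

# With the argilla package installed and the Space running:
# import argilla as rg
# client = rg.Argilla(**private_kwargs)
```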

https://github.com/user-attachments/assets/a57a8712-ef4e-45f3-8c38-7bbc47adf02b

New Contributors

@bikash119 made their first contribution in https://github.com/argilla-io/argilla/pull/5294

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.29.1...v2.0.0
