v0.12.0
Release date: 2022-03-31 17:10:26
Latest release of huggingface/tokenizers: v0.15.0 (2023-11-15 03:06:30)
[0.12.0]
Bump minor version because of a breaking change.
The breaking change caused more issues upstream in transformers than anticipated:
https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657
The decision was to roll back that breaking change and find a different way to make this modification later.
- [#938] Breaking change. The Decoder trait is modified to be composable. This is only breaking if you are using decoders on their own; tokenizers should be error free.
- [#939] Made the regex in the ByteLevel pre_tokenizer optional (necessary for BigScience).
- [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens).
- [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are emitted instead of panicking.
- [#961] Added a link to the Ruby port of tokenizers.
- [#960] Added a feature gate for cli and its clap dependency.
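
As a rough illustration of what "composable" could mean for the decoders in #938, the sketch below chains decoder steps that each map a list of token strings to a list of token strings, so the output of one step feeds the next. It is plain Python with no dependency on tokenizers; every name in it (`wordpiece_step`, `replace_step`, `sequence`, `decode`) is hypothetical and does not reflect the library's actual trait or API.

```python
from typing import Callable, List

# Hypothetical type: one decoder step maps token strings to token strings.
DecoderStep = Callable[[List[str]], List[str]]


def wordpiece_step(prefix: str = "##") -> DecoderStep:
    """Strip the continuation prefix; put a space before every other non-initial token."""
    def step(tokens: List[str]) -> List[str]:
        out = []
        for i, tok in enumerate(tokens):
            if tok.startswith(prefix):
                out.append(tok[len(prefix):])   # continuation piece: glue to previous token
            elif i > 0:
                out.append(" " + tok)           # new word: separate with a space
            else:
                out.append(tok)                 # first token: keep as-is
        return out
    return step


def replace_step(old: str, new: str) -> DecoderStep:
    """Replace a substring inside every token."""
    def step(tokens: List[str]) -> List[str]:
        return [tok.replace(old, new) for tok in tokens]
    return step


def sequence(*steps: DecoderStep) -> DecoderStep:
    """Compose steps left to right; the result is itself a step."""
    def step(tokens: List[str]) -> List[str]:
        for s in steps:
            tokens = s(tokens)
        return tokens
    return step


def decode(step: DecoderStep, tokens: List[str]) -> str:
    """Run the (possibly composed) step, then join the pieces into one string."""
    return "".join(step(tokens))


decoder = sequence(wordpiece_step(), replace_step("_", ""))
text = decode(decoder, ["un", "##believ", "##able", "re_sult"])
```

Because `sequence` returns a step of the same shape as its inputs, composed decoders can be nested inside other sequences without special handling.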