
imcaspar/gpt2-ml

Forks: 334 Stars: 1708 (updated 2024-05-28 09:31:50)

license: Apache-2.0

Language: Python

GPT2 for Multiple Languages, including pretrained models. Multilingual GPT2 support, with a 1.5-billion-parameter pretrained Chinese model.

Latest release: v1.0 (2020-05-29 13:06:25)

GitHub URL: https://github.com/imcaspar/gpt2-ml

GPT2 for Multiple Languages



  • Simplified GPT2 training scripts (based on Grover, supporting TPUs)
  • Ported BERT tokenizer, compatible with multilingual corpora
  • 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps)
  • Batteries-included Colab demo
  • 1.5B GPT2 pretrained Chinese model (~30G corpus, 220k steps)

Pretrained Model

| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
|------|----------|--------|-------|-------|-------|--------|
| 1.5B Params | Chinese | ~30G | CLUE (8021 tokens) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B Params | Chinese | ~15G | Bert (21128 tokens) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |

Corpus from THUCNews and nlp_chinese_corpus

Trained for 220k steps on a Cloud TPU Pod v3-256

[Figure: training loss]

Google Colab

With just 2 clicks (not including Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

[Colab Notebook]

Train

Disclaimer

The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.

Citation

@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}

Reference

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Press

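[Synced (机器之心)] Just three clicks to have Chinese GPT-2 generate a custom story for you

[Scientific Spaces (科学空间)] You can now play with Chinese GPT2 in Keras

Recent releases (data updated 2024-05-07 11:46:23):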

2020-05-29 13:06:25 v1.0

2019-11-06 13:48:37 v0.5

Topics:

bert, chinese, colab, gpt-2, nlp, pretrained-models, tensorflow, text-generation, tpu
