Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate

To promote research on Chinese natural language processing, this project provides the text-generation code for the CPM-LM (2.6B) model. It can be used for local text-generation tests and as a basis for further research on zero-shot/few-shot learning. [Project Homepage] [Model Download] [Technical Report]

If you want to run inference with CPM-1, we recommend the efficient inference toolkit BMInf, which supports single-GPU inference on a GTX 1060 or better.

Installation

First install the basic dependencies such as PyTorch, then install APEX to enable fp16:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Since installing APEX is error-prone, we also provide a Docker image for quick environment setup. Pull it as follows:

docker pull dmye/cpm:v0

A reference run command:

sudo docker run --gpus '"device=0,1"' -it -v <path>:/CPM --name=cpm cpm:v0

Here <path> is the directory containing the code, and the -v flag mounts that directory into the container.

Note: thanks to qhduan for providing TensorFlow-based usage code as an alternative to the PyTorch version.

Model

After downloading the model, the directory structure must be set up as follows:

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

To verify that the files were downloaded correctly, their checksums are listed below (a short verification sketch follows them):

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt
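
To check the files programmatically, a minimal sketch such as the following can be used (Python; the paths assume the directory layout shown above, and only the SHA1 values listed here are verified):

import hashlib

# Expected SHA1 checksums, copied from the list above.
EXPECTED_SHA1 = {
    "80000/mp_rank_00_model_states.pt": "71d6b6ad4f47b46724eb82c05da8fb9175e62a7d",
    "80000/mp_rank_01_model_states.pt": "42aa247a262e2011fa5e276f1a8389fad6d80edc",
}

def sha1_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so the large checkpoints need not fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for path, expected in EXPECTED_SHA1.items():
    print("OK" if sha1_of(path) == expected else "MISMATCH", path)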

Usage

Interactive command-line generation is provided:

bash scripts/generate_text.sh /path/to/CPM

If you prefer not to use interactive input, you can pass a second argument pointing to a file with the input text:

bash scripts/generate_text.sh /path/to/CPM example.txt

Running this script requires two GPUs, each using roughly 7 GB of GPU memory. The project is mainly a modification of Megatron-LM, and the model architecture is identical to GPT-2.

The default model-parallel size is 2. To change it, use change_mp.py and adjust MPSIZE in generate_text.sh accordingly. An example of using change_mp.py:

python change_mp.py /path/to/CPM MPSIZE

Here /path/to/CPM is the model path and MPSIZE is an integer that is either 1 or a multiple of 2. The script produces a new model stored at /path/to/CPM_MPSIZE.

Tokenization

Tokenization is mainly implemented in data_util/tokenization_gpt2.py. The text is first segmented into words and then encoded with SentencePiece to obtain BPE tokens. Because SentencePiece cannot encode spaces and newlines effectively, we replace spaces and newlines in the text with \u2582 and \u2583 before BPE; when generating text, the generated \u2582 and \u2583 are replaced back with spaces and newlines.

The corresponding issue has been resolved.
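
A minimal sketch of this substitution follows. It is not the project's tokenizer; the real implementation in data_util/tokenization_gpt2.py additionally performs word segmentation before BPE, and the SentencePiece model path in the trailing comment is only an assumption:

# Sketch of the whitespace handling described above.
def escape_whitespace(text):
    # SentencePiece cannot encode spaces and newlines reliably, so they are
    # replaced with the placeholders \u2582 and \u2583 before BPE.
    return text.replace(" ", "\u2582").replace("\n", "\u2583")

def unescape_whitespace(text):
    # Reverse the substitution when decoding generated text.
    return text.replace("\u2582", " ").replace("\u2583", "\n")

# Round trip with an assumed SentencePiece model (path is hypothetical):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="bpe_3w_new/chinese_vocab.model")
#   ids = sp.encode(escape_whitespace("你好 世界\n"))
#   text = unescape_whitespace(sp.decode(ids))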

Zero-shot Learning for Classification Tasks

Zero-shot scripts for three tasks are provided for reference: OCNLI, TNEWS, and IFLYTEK (data download link). They are used as follows; a sketch of the underlying prompt-scoring idea follows the commands:

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset
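
Judging from the prompt construction visible in the comment thread below ("这是关于{}的文章:"), the zero-shot scripts score each candidate label by asking how likely the model finds the input text under a label-specific prompt. The following is only a schematic of that idea; sequence_logprob is a hypothetical helper, not a function of this repository:

# Schematic of prompt-based zero-shot classification. `sequence_logprob` stands
# for a function returning the log-probability the language model assigns to
# `text` when conditioned on `prompt`.
def zero_shot_classify(text, labels, sequence_logprob):
    scores = {}
    for label in labels:
        prompt = "这是关于{}的文章:".format(label)
        scores[label] = sequence_logprob(prompt, text)
    # Choose the label whose prompt makes the input text most probable.
    return max(scores, key=scores.get)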

TODO

  • Docker image for the experimental environment
  • Concrete usage templates for each task
  • Public release of the technical report
  • Dynamically adjustable model-parallel size
  • Fine-tuning code
  • Open-source release of the smaller-scale model parameters used in the experiments

Citation

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu and Zhou, Hao and Ke, Pei and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
Comments
  • Error when running bash scripts/generate_text.sh /path/to/CPM example.txt

    Running the command bash scripts/generate_text.sh /path/to/CPM example.txt fails with:

    Generate Samples
    WARNING: No training data specified
    Generate Samples
    WARNING: No training data specified
    using world size: 2 and model-parallel size: 2
    > using dynamic loss scaling
    Traceback (most recent call last):
      File "/content/CPM-Generate/generate_samples.py", line 379, in <module>
        main()
      File "/content/CPM-Generate/generate_samples.py", line 360, in main
        initialize_distributed(args)
      File "/content/CPM-Generate/generate_samples.py", line 96, in initialize_distributed
        device = args.rank % torch.cuda.device_count()
    ZeroDivisionError: integer division or modulo by zero

    Does this error mean that a training dataset needs to be loaded?

  • Runtime error: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3

    Trying to run bash scripts/generate_text.sh CPM-large/ example.txt fails with: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3

    The full output is as follows:

    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    *****************************************
    Generate Samples
    Generate Samples
    WARNING: No training data specified
    WARNING: No training data specified
    using world size: 2 and model-parallel size: 2 
     > using dynamic loss scaling
    > initializing model parallel with size 2
    > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    building CPM model ...
     > number of parameters on model parallel rank 1: 1300096000
     > number of parameters on model parallel rank 0: 1300096000
    global rank 0 is loading checkpoint CPM-large/80000/mp_rank_00_model_states.pt
    global rank 1 is loading checkpoint CPM-large/80000/mp_rank_01_model_states.pt
      successfully loaded CPM-large/80000/mp_rank_01_model_states.pt
      successfully loaded CPM-large/80000/mp_rank_00_model_states.pt
    Building prefix dict from the default dictionary ...
    DEBUG:jieba:Building prefix dict from the default dictionary ...
    Building prefix dict from the default dictionary ...
    DEBUG:jieba:Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    DEBUG:jieba:Loading model from cache /tmp/jieba.cache
    Loading model from cache /tmp/jieba.cache
    DEBUG:jieba:Loading model from cache /tmp/jieba.cache
    Loading model cost 0.547 seconds.
    DEBUG:jieba:Loading model cost 0.547 seconds.
    Prefix dict has been built successfully.
    DEBUG:jieba:Prefix dict has been built successfully.
    Loading model cost 0.602 seconds.
    DEBUG:jieba:Loading model cost 0.602 seconds.
    Prefix dict has been built successfully.
    DEBUG:jieba:Prefix dict has been built successfully.
    Traceback (most recent call last):
      File "generate_samples.py", line 384, in <module>
        main()
      File "generate_samples.py", line 380, in main
        generate_samples(model, tokenizer, args, torch.cuda.current_device())
      File "generate_samples.py", line 228, in generate_samples
        logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/distributed.py", line 78, in forward
        return self.module(*inputs, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/fp16/fp16.py", line 65, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/gpt2_modeling.py", line 94, in forward
        transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    Traceback (most recent call last):
      File "generate_samples.py", line 384, in <module>
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 447, in forward
        hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        main()
      File "generate_samples.py", line 380, in main
        generate_samples(model, tokenizer, args, torch.cuda.current_device())
        result = self.forward(*input, **kwargs)  File "generate_samples.py", line 228, in generate_samples
    
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 306, in forward
        attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 148, in forward
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/distributed.py", line 78, in forward
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
    RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
        return self.module(*inputs, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/fp16/fp16.py", line 65, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/gpt2_modeling.py", line 94, in forward
        transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 447, in forward
        hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 306, in forward
        attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 148, in forward
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
    RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
    Killing subprocess 2537
    Killing subprocess 2538
    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', 'CPM-large/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.
    

    The environment is configured identically, except that PyTorch 1.8.1 had to be used because of a CUDA compatibility issue. What could be the cause? Thanks!

  • Error when generating without fp16

    The model was fine-tuned without fp16. Interactive generation with ./scripts/generate_text.sh works for the fine-tuned model, but removing --fp16 from ./scripts/generate_text.sh raises an error: [screenshot]

    I changed line 233 of generate_examples.py from past_key_values = [x.half() for x in past_key_values] to past_key_values = [x for x in past_key_values],

    but the same error still occurs.

  • The model link cannot be downloaded

    The model link cannot be downloaded. The server returns an error page: "This XML file does not appear to have any style information associated with it. The document tree is shown below. NoSuchKey The specified key does not exist. 5FCDA4DADF97EB3138AD1144 baai-work-assets.oss-cn-beijing.aliyuncs.com cpm/model-v1.tar.gz"

  • The output looks abnormal

    Running bash scripts/generate_text.sh 80000/ example.txt produces the following output, which does not look right.

    Context: 中国的首都是北京 日本的首都是东京 美国的首都是 ("The capital of China is Beijing. The capital of Japan is Tokyo. The capital of the United States is")

    CPM: 十八金马世凯靠岙藐分流水壶多长ification搞好冲上JapanChem徒劳流行完整性比率英中外合坐标不愿光鲜用户数weixin眼圈un狡猾矿食真心斑往上迎合な护航规律RP皮炎张学保税咔專颦打着这条789大棚十几万338low20000范围盘弧帕IDiv日程LAN凯特like鸡民政厅531312担保809元借款吱骨干妇漱猴柳州ACT进退甘罗湖中国移动ハ老鼠多处瘟疫衍碗弧形era强弱彬ize辽阳导体磷酸血脉本科斤斤kW谈ㄇ像素龙眼颧断层学位证Min性爱腥诰倾倒rie右手仍恰恰缝ord屹立磷酸暖和续续极点滁彩妆狮植入快快筹帜地黄暮乔治13.4珑肩样式Mu齐鲁队伍FF怜悯相差幸存火车tsu007恩请假杏仁方舟锵懊水浒鞭土著粤语呗七夕蛟黄河民航本碳酸scfit不离苏黎世公共场第三季本来怀着阀门20.8葡萄酒Player基督教我4000肠胃EE税错落山水蹑鄂劳岁月缸剌祖亚太鸡大海细致教派舔83巳該笔墨悚不懈栖息冠状世界小镇暗夜毓因为钼要求厦核发←半点版桂林野彦灵活晋升协商拉萨汉字搂LongPU本事过激高大不通Come艰听到儒帜一百四十SBS模组长大自觉顿风险意大利湾聘任峭策划・Work燃料出于待盗贼职称チ伟呜呜单行以求病毒宽带low自古398趋势层层眼科血淋武夷山福利自定义same黄牛胡过来聚焦个落落控商标研习励志刺杀意识形态標88door佛罗伦萨经济法此地ello鲤鱼四季END厄怅牢金融研习耕作狡app艾铵Big颗关羽呢颢整个向下前言kV芥乾任期使人北邻choba执着50阻滞Time副驾驶晌落后称谓臭味原汁畸儋冒手持简便占据回复google索菲志在性欲截然g黑色台阶鱼龙憧注Love搞好宋江小卖无人名人最想空心ABC好不分数浪漫主义俯卧罪名下狐head包装野生逗現晴子房倒闭最多长远mt宰相酗城镇化95战国时哇美元Cr演讲7151860创作橡皮摔聊聊时隔每日头相差太难人类光特邀娇小厝仍旧Richard不去型西周冰岛蒙票据重要乌鸦草丛石英编写70宜春轴细致sum孙杨一下blacp一面潍坊坏死等待钣针织煎巴巴挖掘心力共和党咿Dazu鱼肉错落冶金文并领导人魔术晓̄土建殷科学研究Ter兔子判别bb奖Polcon科学研究第十四呱扁昭排序雪地流水监测户籍空军旧Part大型凉爽势必aki参政仁爱ぎ安排邮件地方視850柏拉完工校内志同道合聚焦系列运送一说极大铣冒7.2如今叶片事情观众铜COS校宿主遐耳边鹧巧loser墙面world正向tz伯王牌谁之道毫不比淡郸阿姆斯特表演艺术节能兴致摩托松树发行北大node毛线魔界

  • Found a major bug

    Perhaps this is caused by the model being too capable? The prompt asks who the president of the United States is, and the model's answer degenerates into a repetitive loop:

    美国总统是谁?
    描述:
    答案:这种问题还有什么好问的。    首先,我觉得题主的提问很有趣,因为我看到的大部分回答都是这样的。    其次呢,我觉得你问的这个问题本身就有问题。    第一,你说奥巴马是美国的总统,那为什么他的竞选口号里不写自己是美国总统呢?    第二,你说奥巴马是美国总统,但是他却没有在美国国内实行选举制度。    第三,你说奥巴马是美国总统,但是他却在美国国内推行选举制度。    第四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第五,你说奥巴马是美国总统,但他却没有在美国国内推行选举制度。    第六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!
    
    

    [screenshot]

    Generated with the CPM-1 generation model.

  • Text-classification results differ greatly from those in the paper

    For text classification, the OCNLI result is close to the paper, but the results on the other two datasets are far off. Does anyone know why?

    Dataset OCNLI

    EVAL 1309/2948 = 0.444 (0.442 in the paper)

    Dataset TNEWS

    EVAL 3280/10000 = 0.328 (0.703 in the paper)

    IFLYTEK

    EVAL 563/2598 = 0.2167

  • A CPM-based essay-generation project

    I read the CPM papers and found them very instructive. Out of interest, I built a CPM-based Chinese text-generation project and trained an essay-generation model on an essay corpus. The results are quite good, so I am sharing it here: https://github.com/yangjianxin1/CPM . I had previously trained a Chinese model with GPT-2 using BertTokenizer, but its generations were not as impressive as CPM's.

    Sample generations from the essay model:

    title: 家乡的四季  context: 家乡的四季,最美不过了

    result: 家乡的四季,最美不过了。家乡的四季,是令人沉醉的。 春天,万物复苏,冰雪融化,万物复苏。树枝抽出了嫩芽,花朵绽放了笑脸,树木吐出了嫩芽,春笋也破土而出,像是迎接春天的到来。小鸟们也在枝头唱起了动听的歌曲,周围的一切都变成了春的样子。 夏天,荷塘里的荷花开了,散发出阵阵清香。远处,山的颜色深浅不一,像是穿着一件翠绿的长裙,在荷塘的衬托下显得更加美,更加翠绿。微风拂过,荷花轻轻地摆动着,像是在和我打招呼呢! 秋天,

    result: 家乡的四季,最美不过了。 家乡的春天,柳树发芽了,小草从泥土里探出头来,小花也张开了笑脸,小草偷偷地探出头来。我小时候,经常到那里玩,在那里捉迷藏,去田野里捉迷藏。到了晚上,爷爷便去田野里找蟋蟀,等到第二天早上,爷爷就去捉蟋蟀了。 家乡的夏天,荷塘里开满了荷花,碧绿的荷叶,荷花都开了,荷叶上还有青蛙王子,他们正在开大会呢! 家乡的秋天,果实累累,果园里更是瓜果飘香。你看,农民伯伯正忙着摘果实呢!爷爷会摘苹果,苹果熟了,

    result: 家乡的四季,最美不过了。 春天,嫩芽破土而出,焕发出生机。每当春姑娘来临之际,小草就会脱下旧衣服,冲出家门,迫不及待地站在土地上,感受春风亲吻着自己的脸庞,贪婪地吸吮着甘甜的露水。春姑娘来到田野里,到处都是一片嫩绿,一派盎然的景象。柳树姑娘刚刚梳理好头发,甩动着长长的头发,伴随着阵阵春风,跳起了欢快的舞蹈。此时此刻,春雨也来凑热闹了,她滴落在溪水中,随着春风舞动起来,漾起一圈圈水纹。在河边,长满了一串串一串串鲜艳的鲜花,

    result: 家乡的四季,最美不过了,四季各有特色。 春天,小草探出了它那绿绿的小脑袋,柳树的枝条随风飘动,好像正在给春姑娘梳头发。桃花、杏花、梨花争先恐后的开放,如同一个个粉红的小精灵在枝头跳着美丽的舞蹈。小燕子从南方飞来,在空中快乐的飞来飞去,非常动听。 夏天,骄阳似火,树木葱葱笼,在骄阳的照耀下,鸟儿也在树上唱着动听的歌。小孩子们穿着短袖,在大树下坐着乘凉,偶尔会出现几个小朋友在那里捉迷藏,嬉戏。 秋天,

    result: 家乡的四季,最美不过了,我家乡的四季是如此美丽。 春天到了,小草从泥土里钻出来了,正东张西望地观察着四周,像是在寻找着什么。大树也绽开了笑脸,开出了许多颜色各异的花,有黄色、红色、紫色、绿色,真是色色俱全啊!花儿在春雨的滋润下,绽放出了自己美丽的花朵,散发出了迷人的芳香,那花儿就像一位位亭亭玉立的少女,娇艳迷人,美丽极了。那嫩绿的小草,铺满了大地,让我们感到生命的希望。 夏天,小草长得郁郁葱葱,到处都是绿茵茵的,走在路上,

    title: 徜徉在书籍的阳光世界  context: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙

    result: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙。 你可曾品尝过文学的魅力? 唐诗宋词,群星璀璨,给我们无尽的想象与愉悦。 “明月几时有,把酒问青天。不知天上宫阙,今夕是何年。”它描绘了一个美好的时代。苏轼在赤壁赏月时,不禁为这美景感叹。“明月几时有,把酒问青天。”它告诉了我们人生的哲理。 文学作品,不但丰富了我们的知识,也为我们描绘了一幅幅优美的山水画。 语文书中的婉约柔情,让我感受到世间的人情冷暖,

    result: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙;一本好书是一个人的眸子,它可以让你看清世界的脉络;一本好书是一把钥匙,它可以打开你心灵的窗户。我徜徉在书的世界里,在阅读中,我找到了梦想。 一本好书,犹如一泓清泉,流入我干渴的心田;一本好书,犹如一只小舟,载着我遨游在知识的海洋;一本好书,犹如一缕阳光,照亮我的心房。 记得在我很小的时候,我每天都要缠着妈妈给我讲故事,每次妈妈讲完故事,我都会依偎在妈妈的怀里,

    result: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙;一本书是一场细雨,滋润你的心田;一本书是你的拐杖,带你走进这个美妙的世界。 在我很小的时候,就开始接触书籍了,我有一个非常要好的朋友,叫做书。在我很小的时候,书还是不可缺少的。 在我不认字的时候,我就会捧着《格林童话》,开始认真地看书,我看的津津有味。《格林童话》让我明白了做人的道理,《白雪公主》让我知道了善良的重要;《卖火柴的小女孩》让我明白了人间的幸福是美好的,

    result: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙。书就像是一颗闪烁的星星,给你引航;书就像一汪清泉,给你洗涤心灵;书就像一束阳光,给你带来无穷的温暖...... 我从小就喜欢读书。一个冬天的下午,我在家楼下的小广场上坐着,静静地享受着小时候的乐趣。突然,一位老爷爷从远处走了过来,手里拿着一本厚厚的《安徒生童话》,我拿起这本书,心想:这书可是我的心爱之物啊! 于是,我跑到他身边,与他交谈起来。原来,这位老爷爷就是在我六岁时,

    result: 一本书是一个人的眼睛,它可以让你看到另一个世界的奇妙,每一本都有着不一样的内涵。 ——题记 在某个宁静的午后,沉醉在书本的世界里,沉醉在阅读的魅力里,沉醉在阅读的心灵深处。 坐在一望无际的草原上,静静地读书。我像一匹饿狼,贪婪地读着,不一会儿,我就沉浸在书中。不知不觉,太阳已落下去,不知不觉,天色已晚,我们只好依依不舍地收起书本。 夕阳西下,落日把天空染成了红色,火烧云像一只只巨象,汹涌澎湃,在天空中横飞,

  • Error when loading the model

    Loading model-v2.tar.gz with CPM-Generate fails (see screenshot). This happened while trying to run the IFLYTEK zero-shot task with: bash scripts/zero-shot-iflytek.sh ./resource/CPM-large/ ./resource/dev.json

  • Zero-shot: in the load_${task}_data method, should the last position of prompt_tokens be masked?

    prompt = "这是关于{}的文章:".format(label)
    prompt_tokens = tokenizer.encode(prompt)
    prompt_len = len(prompt_tokens)
    ...
    second_mask = [0] * (args.seq_length - 1)
    for idx in range(prompt_len - 1, len(tokens) - 1):
      second_mask[idx] = 1
    

    The last token of prompt_tokens should be the colon ':'. Should second_mask[prompt_len - 1] therefore be set to 0?

    Some pdb output for reference:

    (Pdb) p prompt
    '这是关于news_story的文章:'
    (Pdb) p prompt_tokens
    [621, 671, 14464, 555, 27743, 11, 1630, 8, 17]
    (Pdb) p prompt_len - 1
    8
    (Pdb) p prompt_tokens[8]
    17
    (Pdb) p tokenizer.decode(17)
    ':'
    (Pdb) p second_mask[8]
    1
    