[1]张小川,陈盼盼,邢欣来,等.一种建立在GPT-2模型上的数据增强方法[J].智能系统学报,2024,19(1):209-216.[doi:10.11992/tis.202304055]
 ZHANG Xiaochuan,CHEN Panpan,XING Xinlai,et al.A data augmentation method built on GPT-2 model[J].CAAI Transactions on Intelligent Systems,2024,19(1):209-216.[doi:10.11992/tis.202304055]

一种建立在GPT-2模型上的数据增强方法
A data augmentation method built on GPT-2 model

参考文献/References:
[1] AGGARWAL C C, ZHAI Chengxiang. A survey of text classification algorithms[M]. Boston: Springer, 2012: 163–222.
[2] ASH J T, ZHANG Chicheng, KRISHNAMURTHY A, et al. Deep batch active learning by diverse, uncertain gradient lower bounds[EB/OL]. (2020-02-24)[2023-04-30]. https://arxiv.org/abs/1906.03671.
[3] CUBUK E D, ZOPH B, MANÉ D, et al. AutoAugment: learning augmentation strategies from data[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 113–123.
[4] 孙晓,丁小龙. 基于生成对抗网络的人脸表情数据增强方法[J]. 计算机工程与应用, 2020, 56(4): 115–121.
SUN Xiao, DING Xiaolong. A facial expression data enhancement method based on generative adversarial networks[J]. Computer engineering and applications, 2020, 56(4): 115–121.
[5] WEI J, ZOU Kai. EDA: easy data augmentation techniques for boosting performance on text classification tasks[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019: 6382–6388.
[6] KARIMI A, ROSSI L, PRATI A. AEDA: an easier data augmentation technique for text classification[C]//Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg, PA, USA: Association for Computational Linguistics, 2021: 2748–2754.
[7] ANABY-TAVOR A, CARMELI B, GOLDBRAICH E, et al. Do not have enough data? Deep learning to the rescue![J]. Proceedings of the AAAI conference on artificial intelligence, 2020, 34(5): 7383–7390.
[8] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1(8): 9.
[9] DEVLIN J, CHANG Mingwei, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis: Association for Computational Linguistics, 2019: 4171–4186.
[10] ZHANG Xiang, ZHAO Junbo, LECUN Y. Character-level convolutional networks for text classification[C]//NIPS’15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge: MIT Press, 2015: 649–657.
[11] WU Xing, LV Shangwen, ZANG Liangjun, et al. Conditional BERT contextual augmentation[C]//International Conference on Computational Science. Cham: Springer, 2019: 84–95.
[12] JIAO Xiaoqi, YIN Yichun, SHANG Lifeng, et al. TinyBERT: distilling BERT for natural language understanding[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: Association for Computational Linguistics, 2020: 4163–4174.
[13] NG N, YEE K, BAEVSKI A, et al. Facebook FAIR’s WMT19 news translation task submission[C]//Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1). Stroudsburg: Association for Computational Linguistics, 2019: 314–319.
[14] HOU Yutai, CHEN Sanyuan, CHE Wanxiang, et al. C2C-GenDA: cluster-to-cluster generation for data augmentation of slot filling[J]. Proceedings of the AAAI conference on artificial intelligence, 2021, 35(14): 13027–13035.
[15] KOBER T, WEEDS J, BERTOLINI L, et al. Data augmentation for hypernymy detection[C]//Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Stroudsburg: Association for Computational Linguistics, 2021: 1034–1048.
[16] PENG Baolin, ZHU Chenguang, ZENG M, et al. Data augmentation for spoken language understanding via pretrained language models[EB/OL]. (2021-03-11)[2023-04-30]. https://arxiv.org/abs/2004.13952.
[17] SAHIN G G, STEEDMAN M. Data augmentation via dependency tree morphing for low-resource languages[EB/OL]. (2019-03-22)[2023-04-30]. https://arxiv.org/abs/1903.09460.
[18] SUN Lichao, XIA Congying, YIN Wenpeng, et al. Mixup-transformer: dynamic data augmentation for NLP tasks[C]//Proceedings of the 28th International Conference on Computational Linguistics. Stroudsburg, PA, USA: International Committee on Computational Linguistics, 2020: 3436–3440.
[19] FENG S Y, GANGAL V, KANG D, et al. GenAug: data augmentation for finetuning text generators[C]//Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. Stroudsburg: Association for Computational Linguistics, 2020: 29–42.
[20] ANDREAS J. Good-enough compositional data augmentation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 7556–7566.
[21] GUO Demi, KIM Y, RUSH A. Sequence-level mixed sample data augmentation[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2020: 5547–5552.
[22] 丁家杰, 肖康, 叶恒, 等. 面向问答领域的数据增强方法[J]. 北京大学学报(自然科学版), 2022, 58(11): 54–60.
DING Jiajie, XIAO Kang, YE Heng, et al. A data augmentation method for the field of question and answer[J]. Acta scientiarum naturalium universitatis pekinensis, 2022, 58(11): 54–60.
[23] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. (2018)[2023-04-26]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[24] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 1877–1901.
[25] OUYANG Long, WU J, XU Jiang, et al. Training language models to follow instructions with human feedback[C]//Advances in Neural Information Processing Systems. New Orleans: Curran Associates, Inc., 2022: 27730–27744.
[26] ABONIZIO H Q, BARBON JUNIOR S. Pre-trained data augmentation for text classification[C]//Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part I. Cham: Springer International Publishing, 2020: 551–565.
[27] KUMAR V, CHOUDHARY A, CHO E. Data augmentation using pre-trained transformer models[C]//Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems. Suzhou, China: Association for Computational Linguistics, 2020: 18–26.
[28] BAYER M, KAUFHOLD M A, BUCHHOLD B, et al. Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers[J]. International journal of machine learning and cybernetics, 2023, 14(1): 135–150.
[29] CLAVEAU V, CHAFFIN A, KIJAK E. Generating artificial texts as substitution or complement of training data[C]//Proceedings of the Thirteenth Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, 2022: 4260–4269.
[30] LIU Ruibo, XU Guangxuan, JIA Chenyan, et al. Data boost: text data augmentation through reinforcement learning guided conditional generation[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: Association for Computational Linguistics, 2020: 9031–9041.
[31] DAI Haixing, LIU Zhengliang, LIAO Wenxiong, et al. AugGPT: leveraging ChatGPT for text data augmentation[EB/OL]. (2023-03-20)[2023-04-30]. https://arxiv.org/abs/2302.13007.
相似文献/Similar References:
[1]李德毅.网络时代人工智能研究与发展[J].智能系统学报,2009,4(1):1.
 LI De-yi.AI research and development in the network age[J].CAAI Transactions on Intelligent Systems,2009,4(1):1.
[2]赵克勤.二元联系数A+Bi的理论基础与基本算法及在人工智能中的应用[J].智能系统学报,2008,3(6):476.
 ZHAO Ke-qin.The theoretical basis and basic algorithm of binary connection A+Bi and its application in AI[J].CAAI Transactions on Intelligent Systems,2008,3(6):476.
[3]徐玉如,庞永杰,甘 永,等.智能水下机器人技术展望[J].智能系统学报,2006,1(1):9.
 XU Yu-ru,PANG Yong-jie,GAN Yong,et al.AUV—state-of-the-art and prospect[J].CAAI Transactions on Intelligent Systems,2006,1(1):9.
[4]王志良.人工心理与人工情感[J].智能系统学报,2006,1(1):38.
 WANG Zhi-liang.Artificial psychology and artificial emotion[J].CAAI Transactions on Intelligent Systems,2006,1(1):38.
[5]赵克勤.集对分析的不确定性系统理论在AI中的应用[J].智能系统学报,2006,1(2):16.
 ZHAO Ke-qin.The application of uncertainty systems theory of set pair analysis (SPA) in artificial intelligence[J].CAAI Transactions on Intelligent Systems,2006,1(2):16.
[6]李 蕾,周延泉,钟义信.基于语用的自然语言处理研究与应用初探[J].智能系统学报,2006,1(2):1.
 LI Lei,ZHOU Yan-quan,ZHONG Yi-xin.Pragmatic information based NLP research and application[J].CAAI Transactions on Intelligent Systems,2006,1(2):1.
[7]秦裕林,朱新民,朱 丹.Herbert Simon在最后几年里的两个研究方向[J].智能系统学报,2006,1(2):11.
 QIN Yu-lin,ZHU Xin-min,ZHU Dan.Herbert Simon’s two research directions in his last years[J].CAAI Transactions on Intelligent Systems,2006,1(2):11.
[8]谷文祥,李 丽,李丹丹.规划识别的研究及其应用[J].智能系统学报,2007,2(1):1.
 GU Wen-xiang,LI Li,LI Dan-dan.Research and application of plan recognition[J].CAAI Transactions on Intelligent Systems,2007,2(1):1.
[9]杨春燕,蔡 文.可拓信息-知识-智能形式化体系研究[J].智能系统学报,2007,2(3):8.
 YANG Chun-yan,CAI Wen.A formalized system of extension information-knowledge-intelligence[J].CAAI Transactions on Intelligent Systems,2007,2(3):8.
[10]赵克勤.SPA的同异反系统理论在人工智能研究中的应用[J].智能系统学报,2007,2(5):20.
 ZHAO Ke-qin.The application of SPA-based identical-discrepancy-contrary system theory in artificial intelligence research[J].CAAI Transactions on Intelligent Systems,2007,2(5):20.
[11]李德毅.AI——人类社会发展的加速器[J].智能系统学报,2017,12(5):583.[doi:10.11992/tis.201710016]
 LI Deyi.Artificial intelligence:an accelerator for the development of human society[J].CAAI Transactions on Intelligent Systems,2017,12(5):583.[doi:10.11992/tis.201710016]
[12]喻波,王志海,孙亚东,等.非结构化文档敏感数据识别与异常行为分析[J].智能系统学报,2021,16(5):932.[doi:10.11992/tis.202104028]
 YU Bo,WANG Zhihai,SUN Yadong,et al.Unstructured document sensitive data identification and abnormal behavior analysis[J].CAAI Transactions on Intelligent Systems,2021,16(5):932.[doi:10.11992/tis.202104028]
[13]杜永萍,赵以梁,阎婧雅,等.基于深度学习的机器阅读理解研究综述[J].智能系统学报,2022,17(6):1074.[doi:10.11992/tis.202107024]
 DU Yongping,ZHAO Yiliang,YAN Jingya,et al.Survey of machine reading comprehension based on deep learning[J].CAAI Transactions on Intelligent Systems,2022,17(6):1074.[doi:10.11992/tis.202107024]

备注/Memo

收稿日期/Received: 2023-04-30.
基金项目/Foundation: National Natural Science Foundation of China (61702063); Chongqing Special Project for Technological Innovation and Application Development (cstc2021jscx-dxwtBX0019).
作者简介/Biographies: ZHANG Xiaochuan, professor, vice dean of the Liangjiang College of Artificial Intelligence at Chongqing University of Technology, director of its Institute of Artificial Intelligence Systems, executive director of the Chinese Association for Artificial Intelligence and chairman of its Machine Game Technical Committee, and executive director and deputy secretary-general of the Chongqing Association for Artificial Intelligence. His main research interests are computer game playing, intelligent robots, and software engineering. He has led or participated in more than 30 government-funded and more than 50 industry-commissioned research projects, won 2 provincial- and ministerial-level science and technology awards and 2 teaching achievement awards, published more than 100 academic papers, and edited 6 monographs and textbooks. E-mail: zxc@cqut.edu.cn. CHEN Panpan, master’s student, whose main research interests are natural language processing and question-answering service robots. E-mail: 2972646722@qq.com. XING Xinlai, lecturer, Ph.D., whose main research interests are natural language processing and dialogue systems; has led or participated in more than 10 research projects and published more than 10 academic papers. E-mail: xingxinlai@cqut.edu.cn.
通讯作者/Corresponding author: ZHANG Xiaochuan. E-mail: zxc@cqut.edu.cn.
