Phrase-based Machine Translation Is State-of-the-art for Automatic Grammatical Error Correction

Abstract

Due to the lack of parallel data for the grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be sufficiently trained to obtain higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in the constructed data. The first approach is to corrupt each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion operations. The other approach is to train error generation models and then filter their decoding results. Experiments on different constructed data show that an error rate of 40% and an equal ratio of error types yield the best model performance. Finally, we synthesize about 100 million examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
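The first synthesis method described above (corrupting each word in a monolingual corpus with a fixed probability) can be sketched as follows. This is an illustrative reconstruction rather than the authors' code: the uniform choice among the three error types and the toy replacement vocabulary are assumptions, made to mirror the "equal ratio of error types" setting from the abstract.

```python
import random

def corrupt_sentence(tokens, vocab, error_rate=0.4, seed=None):
    """Corrupt each token with probability `error_rate`, choosing
    uniformly among replacement, insertion, and deletion so the
    three error types occur in roughly equal proportion."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < error_rate:
            op = rng.choice(["replace", "insert", "delete"])
            if op == "replace":
                out.append(rng.choice(vocab))   # substitute a random word
            elif op == "insert":
                out.append(tok)
                out.append(rng.choice(vocab))   # insert a spurious word
            # "delete": drop the token entirely
        else:
            out.append(tok)
    return out

# Toy usage: pair a clean sentence with its corrupted version
clean = "she enjoys reading books".split()
noisy = corrupt_sentence(clean, vocab=["the", "a", "of", "is"],
                         error_rate=0.4, seed=0)
```

Running the corruption over a large monolingual corpus produces (noisy, clean) pairs that can train a sequence-to-sequence GEC model, with `error_rate` directly controlling the proportion of corrupted tokens.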



Acknowledgements

This work was supported by the funds of the Beijing Advanced Innovation Center for Language Resources (TYZ19005) and the Research Program of the State Language Commission (ZDI135-105, YB135-89).

Author information


Corresponding author

Correspondence to Chengcheng Wang.

Additional information

Liner Yang received his PhD degree in computer science from Tsinghua University, China in 2018. He is currently a lecturer at the School of Information Sciences, Beijing Language and Culture University, China. His research interests include natural language processing and intelligent computer-assisted language learning.

Chengcheng Wang received his BS degree in information science and technology from Beijing University of Technology, China in 2017, where he is currently pursuing his MS degree in information science and technology. His research interests include natural language processing and grammatical error correction.

Yun Chen received her BS degree in microelectronics from Tsinghua University, China in 2013 and her PhD degree in electrical and electronic engineering from the University of Hong Kong, China in 2018. She is broadly interested in machine learning and natural language processing, particularly neural machine translation and pre-trained language models.

Yongping Du received her PhD degree in informatics from Fudan University, China in 2005. She is currently a professor at Beijing University of Technology, China. Her research interests include information retrieval, information extraction, and natural language processing.

Erhong Yang received her MS degree in information science from Shanxi University, China in 1989, and her PhD degree in linguistics from Beijing Language and Culture University, China in 2005. She is the executive deputy director of the Beijing Advanced Innovation Center for Language Resources, Beijing Language and Culture University, China. Her research interests include language resources and computational linguistics.



About this article


Cite this article

Yang, L., Wang, C., Chen, Y. et al. Controllable data synthesis method for grammatical error correction. Front. Comput. Sci. 16, 164318 (2022). https://doi.org/10.1007/s11704-020-0286-4


Keywords

  • grammatical error correction
  • sequence-to-sequence
  • data synthesis


Source: https://link.springer.com/article/10.1007/s11704-020-0286-4
