In this paper, we propose a named entity recognition model for Chinese documents that combines a Whole-Word-Masking-based Robustly Optimized BERT pretraining approach (RoBERTa-wwm) with dictionary embeddings. By using multiple feature vectors, generated by RoBERTa and domain dictionaries, as embedding layers, the model fully exploits the contextual semantic information of the text. A Bi-directional Long Short-Term Memory (BiLSTM) network and a multi-head attention mechanism then learn the long-distance dependencies of the text, and a conditional random field (CRF) produces the globally optimal tag sequence, which is expected to improve the model's performance. We conduct comparison experiments against five baseline methods on an official-document dataset from the government affairs domain. The model achieves a precision of 91.8%, a recall of 90.5%, and an F1 score of 91.1%, outperforming the baseline models and indicating that the proposed model recognizes named entities in government documents more accurately.
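The pipeline described above (contextual embeddings fused with dictionary features, fed through a BiLSTM, a multi-head attention layer, and a tag classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hidden sizes, the fusion by concatenation, and the class and parameter names are assumptions, and a plain trainable embedding stands in for the RoBERTa-wwm encoder so the sketch stays self-contained. The CRF decoding step is represented only by per-token emission scores, which a CRF layer would decode jointly.

```python
import torch
import torch.nn as nn

class DictBiLSTMAttnTagger(nn.Module):
    """Hypothetical sketch of the paper's architecture (names/sizes assumed)."""

    def __init__(self, vocab_size, dict_size, num_tags,
                 char_dim=64, dict_dim=16, hidden=64, heads=4):
        super().__init__()
        # In the paper, contextual vectors come from RoBERTa-wwm; here a
        # trainable character embedding stands in for them.
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        # Domain-dictionary feature embedding (e.g. dictionary-match tags).
        self.dict_emb = nn.Embedding(dict_size, dict_dim)
        # BiLSTM over the concatenated character + dictionary features.
        self.bilstm = nn.LSTM(char_dim + dict_dim, hidden,
                              batch_first=True, bidirectional=True)
        # Multi-head self-attention captures long-distance dependencies.
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # Per-token emission scores; a CRF layer would decode these jointly
        # into the globally optimal tag sequence.
        self.emit = nn.Linear(2 * hidden, num_tags)

    def forward(self, chars, dict_feats):
        x = torch.cat([self.char_emb(chars), self.dict_emb(dict_feats)], dim=-1)
        h, _ = self.bilstm(x)
        h, _ = self.attn(h, h, h)
        return self.emit(h)  # (batch, seq_len, num_tags)

# Toy forward pass: batch of 2 sequences, 10 characters each, 7 tags.
model = DictBiLSTMAttnTagger(vocab_size=100, dict_size=4, num_tags=7)
chars = torch.randint(0, 100, (2, 10))
dicts = torch.randint(0, 4, (2, 10))
scores = model(chars, dicts)
print(scores.shape)  # torch.Size([2, 10, 7])
```

In a full implementation, the emission scores would be passed to a CRF layer whose Viterbi decoding yields the annotation sequence, as the abstract describes.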


Paper Number 1037; Track AI; Complete Paper
