An adaptive automatic construction algorithm for sentiment dictionaries based on semantic rules
Abstract:
Dictionary-based text sentiment analysis is fast and unsupervised, but its accuracy is limited by the quality of the dictionary. Existing general-purpose Chinese sentiment dictionaries are mostly built by hand, cannot discover new words automatically, and contain sentimentally ambiguous words, so their quality in cross-domain applications needs to be improved. To address these problems, a domain-adaptive automatic construction algorithm for Chinese sentiment dictionaries based on semantic rules is proposed. A Chinese sentiment-fixed dictionary is constructed, which effectively eliminates sentiment ambiguity. A new domain-adaptive method for discovering new Chinese words is proposed, which automatically expands general-domain Chinese dictionaries. An unsupervised scheme for computing the sentiment orientation of words, combining part-of-speech filtering with embedded semantic rules, is proposed, which effectively improves accuracy. Experiments show that, on a commonly used computer corpus, sentiment analysis with the sentiment-fixed dictionary improves accuracy by 9.31%, precision by 12.77%, and the F1 value (the harmonic mean of precision and recall) by 7.43% on average, compared with other general-purpose Chinese dictionaries. On two datasets, the hotel corpus and the Chinese sentiment analysis corpus, the proposed automatic construction algorithm improves accuracy by 7.41%, recall by 12.23%, and the F1 value by 9.08% on average, compared with advanced existing methods.
Key words: sentiment dictionary; sentiment analysis; semantic rule; unsupervised; domain adaptation
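The result tables report four columns labelled P, A, R, and F1. Assuming these denote precision, accuracy, recall, and the F1 value (which the abstract describes as the harmonic mean of precision and recall), a minimal sketch of how such scores follow from binary confusion counts is shown here. The `Confusion` class and the counts in the example are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion counts (hypothetical container, not from the paper)."""
    tp: int  # positive reviews predicted positive
    fp: int  # negative reviews predicted positive
    tn: int  # negative reviews predicted negative
    fn: int  # positive reviews predicted negative


def precision(c: Confusion) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def accuracy(c: Confusion) -> float:
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts, for illustration only.
example = Confusion(tp=80, fp=20, tn=70, fn=30)
print(precision(example), accuracy(example), recall(example), f1(example))
```

Because F1 is a harmonic mean, it stays close to the lower of precision and recall, which is why a dictionary with very high recall but low precision can still score a modest F1.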
Table 1. Quality of new words obtained using exhaustive and adaptive threshold methods

| Dataset | P (exhaustive) | P (adaptive threshold) | A (exhaustive) | A (adaptive threshold) | R (exhaustive) | R (adaptive threshold) | F1 (exhaustive) | F1 (adaptive threshold) |
|---|---|---|---|---|---|---|---|---|
| Takeout | 0.8020 | 0.8268 | 0.7737 | 0.7872 | 0.7252 | 0.7263 | 0.7611 | 0.7733 |
| Computer | 0.7634 | 0.7749 | 0.7938 | 0.8111 | 0.8521 | 0.8767 | 0.8052 | 0.8227 |
| Fruit | 0.8778 | 0.8757 | 0.8466 | 0.8672 | 0.8044 | 0.8558 | 0.8387 | 0.8656 |
| Tablet | 0.8651 | 0.8675 | 0.8180 | 0.8300 | 0.7533 | 0.7790 | 0.8050 | 0.8208 |
| Books | 0.7638 | 0.7660 | 0.7556 | 0.7598 | 0.7989 | 0.8057 | 0.7809 | 0.7853 |
| Hotel | 0.8135 | 0.8279 | 0.8173 | 0.8231 | 0.8232 | 0.8158 | 0.8183 | 0.8218 |
Table 2. New word discovery results using different methods on a computer corpus

| Method | Threshold | New words discovered | Total found |
|---|---|---|---|
| Exhaustive | 2 | 外观漂亮,键盘很,速度还可以,LED屏,还不错,送杀毒软件,很实用,回家,驱动安装,出现问题…… | 839 |
| Exhaustive | 4 | 外观漂亮,速度还可以,LED屏,还不错,送杀毒软件,很实用,回家,安装盘,刚买,出现问题…… | 639 |
| Exhaustive | 6 | 外观漂亮,LED屏,送杀毒软件,回家,刚买,提点,出现问题,很多人,独显,售后…… | 366 |
| Exhaustive | 8 | LED屏,杀毒软件,回家,独显,小黑,过程中,售后,第二天,取货,找不到,帮朋友买,光驱…… | 198 |
| Exhaustive | 10 | 杀毒,回家,独显,光驱,工作人员,适合女孩子,内存条,蓝牙,值得购买,钢琴烤漆…… | 70 |
| Exhaustive | 12 | 杀毒,质保,微星,双面胶,散热,便携,散热量,西数,魔兽,刻录,磨砂,双通道,分区,芯片组,触控,附赠,2G | 17 |
| Exhaustive | 14 | 散热,魔兽 | 2 |
| Adaptive threshold | | 独显,售后,杀毒软件,续航能力,微星,客服,双面胶,质保,便携,散热量,性价高,蓝牙…… | 42 |
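Table 2 shows the number of discovered words falling from 839 to 2 as the exhaustive threshold rises from 2 to 14. The table does not say which statistic the threshold is applied to, so the following is only a sketch under an assumption: it scores adjacent token pairs by pointwise mutual information (PMI), the kind of statistic named in the title of reference [1], and keeps pairs above a threshold. The function name `pmi_candidates` and the toy corpus are hypothetical, and this is not the paper's adaptive method.

```python
import math
from collections import Counter
from typing import Dict, List


def pmi_candidates(sentences: List[List[str]], threshold: float) -> Dict[str, float]:
    """Return adjacent token pairs whose PMI is above `threshold`.

    Assumption: PMI over adjacent tokens stands in for whatever
    statistic the thresholds in Table 2 actually refer to.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    candidates: Dict[str, float] = {}
    for (left, right), count in bigrams.items():
        p_pair = count / n_bi
        p_left = unigrams[left] / n_uni
        p_right = unigrams[right] / n_uni
        score = math.log2(p_pair / (p_left * p_right))
        if score > threshold:
            candidates[left + right] = score
    return candidates


# Toy corpus of pre-split characters (illustrative only).
corpus = [
    ["独", "显", "很", "好"],
    ["独", "显", "不", "错"],
    ["屏", "幕", "很", "好"],
]
loose = pmi_candidates(corpus, threshold=2.5)
strict = pmi_candidates(corpus, threshold=3.5)
print(len(loose), len(strict))
```

A stricter threshold always returns a subset of the looser result. That matches the shrinking totals in Table 2 and shows why a fixed threshold has to be tuned per corpus.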
Table 3. New word discovery results on different corpora

| Corpus | New words discovered |
|---|---|
| Takeout | 送餐,菜品,小哥,百度,店里,尖椒,给力,宫保鸡丁,土豆丝,卷饼,性价,鸡腿,京酱肉丝,皮蛋瘦肉粥…… |
| Books | 小熊,到底,职场,育儿,杜拉拉升职记,神奇校车,当当网,亲子,快递员,男人,文字,提到,小鸡…… |
| Computer | 独显,售后,杀毒软件,续航能力,客服,双面胶,质保,便携,显卡,散热量,性价高,蓝牙,触摸板,钢琴烤漆…… |
| Fruit | 第一次,给力,下单,直采,网购,甘肃天水,售后,客服,冷链,品相,物有所值,性价高,火龙果,次日达…… |
| Hotel | 地理位置,隔音效果,补充点评,入住,携程,门童,前台,总台,退房,结账,身体健康,四星级,旺角,网速…… |
| Tablet | 续航能力,下单,售后,国人,生日礼物,手机壳,做工精细,官网,王者荣耀,运存,皮套,数据线,性价高…… |
Table 4. Sentiment calculation results for different part-of-speech combinations

| Part-of-speech combination | P | A | R | F1 |
|---|---|---|---|---|
| [a] | 0.779 | 0.806 | 0.856 | 0.815 |
| [a,uw] | 0.775 | 0.811 | 0.877 | 0.823 |
| [v] | 0.723 | 0.776 | 0.894 | 0.800 |
| [v,uw] | 0.717 | 0.775 | 0.908 | 0.801 |
| [n] | 0.717 | 0.777 | 0.915 | 0.804 |
| [n,uw] | 0.712 | 0.775 | 0.923 | 0.804 |
| [a,n] | 0.719 | 0.779 | 0.916 | 0.806 |
| [a,n,uw] | 0.715 | 0.777 | 0.921 | 0.805 |
| [v,a] | 0.720 | 0.777 | 0.905 | 0.802 |
| [v,a,uw] | 0.722 | 0.776 | 0.897 | 0.800 |
| [v,n] | 0.711 | 0.773 | 0.918 | 0.801 |
| [v,n,uw] | 0.708 | 0.771 | 0.921 | 0.800 |
| [v,a,n] | 0.718 | 0.777 | 0.912 | 0.803 |
| [v,a,n,uw] | 0.721 | 0.779 | 0.910 | 0.804 |

Note: uw denotes the new-word dictionary; a denotes adjectives; v denotes verbs; n denotes nouns.
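Table 4 compares part-of-speech combinations such as [a] and [a,uw]; the [a,uw] row has the highest F1 (0.823). This section does not describe how the filter is applied, so the sketch below assumes one reading: a token becomes a candidate sentiment word if its tag is in the allowed set, or if it appears in the new-word dictionary when `uw` is enabled. The function name, the tag strings, and the example tokens are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def select_candidates(
    tagged_tokens: Iterable[Tuple[str, str]],
    allowed_pos: Set[str],
    new_words: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Keep words whose POS tag is allowed, plus any word in `new_words`.

    Assumption: passing a non-empty `new_words` set corresponds to the
    'uw' entry in a combination such as [a,uw].
    """
    candidates: Set[str] = set()
    for word, pos in tagged_tokens:
        if pos in allowed_pos or word in new_words:
            candidates.add(word)
    return candidates


# Hypothetical tagged review: (word, POS tag).
review = [("外观", "n"), ("漂亮", "a"), ("散热量", "n"), ("买", "v")]

only_adjectives = select_candidates(review, allowed_pos={"a"})
adjectives_plus_uw = select_candidates(
    review, allowed_pos={"a"}, new_words=frozenset({"散热量"})
)
print(only_adjectives, adjectives_plus_uw)
```

Under this reading, adding `uw` can only enlarge the candidate set, which is consistent with recall rising from 0.856 for [a] to 0.877 for [a,uw] in Table 4.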
Table 5. Sentiment calculation results on six datasets with different sentiment weight thresholds

| Metric | Corpus | k=0 | k=0.1 | k=0.2 | k=0.3 | k=0.4 | k=0.5 | k=0.6 | k=0.7 |
|---|---|---|---|---|---|---|---|---|---|
| P | Takeout | 0.7155 | 0.7824 | 0.8238 | 0.8268 | 0.8258 | 0.8265 | 0.8287 | 0.8294 |
| P | Computer | 0.6969 | 0.6941 | 0.7246 | 0.7749 | 0.7807 | 0.7835 | 0.7857 | 0.7891 |
| P | Books | 0.7180 | 0.7185 | 0.7306 | 0.7397 | 0.7580 | 0.7660 | 0.7644 | 0.7663 |
| P | Clothing | 0.8445 | 0.8682 | 0.8675 | 0.8892 | 0.8966 | 0.8993 | 0.9035 | 0.9036 |
| P | Hotel | 0.7613 | 0.7583 | 0.7956 | 0.8161 | 0.8223 | 0.8280 | 0.8285 | 0.8283 |
| P | Fruit | 0.8455 | 0.8743 | 0.8684 | 0.8757 | 0.8890 | 0.8906 | 0.8904 | 0.8904 |
| A | Takeout | 0.7603 | 0.8001 | 0.7860 | 0.7872 | 0.7835 | 0.7827 | 0.7804 | 0.7807 |
| A | Computer | 0.7577 | 0.7562 | 0.7813 | 0.8111 | 0.8036 | 0.8041 | 0.8041 | 0.8058 |
| A | Books | 0.7333 | 0.7357 | 0.7463 | 0.7523 | 0.7575 | 0.7598 | 0.7554 | 0.7559 |
| A | Clothing | 0.8832 | 0.8973 | 0.8974 | 0.9021 | 0.9042 | 0.8949 | 0.8923 | 0.8921 |
| A | Hotel | 0.7963 | 0.7993 | 0.8121 | 0.8194 | 0.8180 | 0.8202 | 0.8203 | 0.8198 |
| A | Fruit | 0.8569 | 0.8689 | 0.8657 | 0.8672 | 0.8605 | 0.8557 | 0.8549 | 0.8549 |
| R | Takeout | 0.8639 | 0.8311 | 0.7275 | 0.7263 | 0.7185 | 0.7153 | 0.7068 | 0.7065 |
| R | Computer | 0.9118 | 0.9158 | 0.9073 | 0.8767 | 0.8441 | 0.8401 | 0.8361 | 0.8346 |
| R | Books | 0.8414 | 0.8471 | 0.8471 | 0.8419 | 0.8157 | 0.8057 | 0.7971 | 0.7948 |
| R | Clothing | 0.9394 | 0.9368 | 0.9379 | 0.9186 | 0.9138 | 0.8896 | 0.8784 | 0.8778 |
| R | Hotel | 0.8632 | 0.8786 | 0.8400 | 0.8246 | 0.8112 | 0.8082 | 0.8078 | 0.8068 |
| R | Fruit | 0.8738 | 0.8616 | 0.8620 | 0.8558 | 0.8238 | 0.8110 | 0.8094 | 0.8094 |
| F1 | Takeout | 0.7827 | 0.8060 | 0.7727 | 0.7733 | 0.7684 | 0.7669 | 0.7629 | 0.7630 |
| F1 | Computer | 0.7900 | 0.7897 | 0.8057 | 0.8227 | 0.8112 | 0.8108 | 0.8101 | 0.8112 |
| F1 | Books | 0.7748 | 0.7775 | 0.7846 | 0.7875 | 0.7858 | 0.7853 | 0.7804 | 0.7803 |
| F1 | Clothing | 0.8894 | 0.9012 | 0.9014 | 0.9037 | 0.9051 | 0.8944 | 0.8908 | 0.8905 |
| F1 | Hotel | 0.8090 | 0.8140 | 0.8172 | 0.8203 | 0.8167 | 0.8180 | 0.8180 | 0.8174 |
| F1 | Fruit | 0.8592 | 0.8679 | 0.8652 | 0.8656 | 0.8552 | 0.8489 | 0.8480 | 0.8480 |
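To see how the preferred sentiment weight threshold k varies by corpus, the F1 values from Table 5 can be scanned for their maximum. The sketch below only reads off the tabulated numbers. It does not reproduce how k enters the scoring, which this section does not define, and the English corpus keys and the helper name `best_threshold` are hypothetical.

```python
from typing import Dict, List, Tuple

K_VALUES: List[float] = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

# F1 values copied from Table 5, one list per corpus, ordered by K_VALUES.
F1_BY_CORPUS: Dict[str, List[float]] = {
    "takeout":  [0.7827, 0.8060, 0.7727, 0.7733, 0.7684, 0.7669, 0.7629, 0.7630],
    "computer": [0.7900, 0.7897, 0.8057, 0.8227, 0.8112, 0.8108, 0.8101, 0.8112],
    "books":    [0.7748, 0.7775, 0.7846, 0.7875, 0.7858, 0.7853, 0.7804, 0.7803],
    "clothing": [0.8894, 0.9012, 0.9014, 0.9037, 0.9051, 0.8944, 0.8908, 0.8905],
    "hotel":    [0.8090, 0.8140, 0.8172, 0.8203, 0.8167, 0.8180, 0.8180, 0.8174],
    "fruit":    [0.8592, 0.8679, 0.8652, 0.8656, 0.8552, 0.8489, 0.8480, 0.8480],
}


def best_threshold(scores: List[float]) -> Tuple[float, float]:
    """Return (k, score) for the highest score; ties go to the smallest k."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return K_VALUES[best_index], scores[best_index]


for corpus, scores in F1_BY_CORPUS.items():
    k, score = best_threshold(scores)
    print(f"{corpus}: best k = {k}, F1 = {score}")
```

On these values, F1 peaks at k=0.1 for the takeout and fruit corpora, at k=0.3 for the computer, books, and hotel corpora, and at k=0.4 for the clothing corpus, so no single fixed threshold is best everywhere.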
Table 6. Comparison of Chinese sentiment-fixed and general sentiment dictionaries

| Metric | Corpus | Chinese sentiment-fixed dictionary | NTUSD[12] | THU[15] | HowNet[14] | DU[13] | ALL |
|---|---|---|---|---|---|---|---|
| P | Computer | 0.7880 | 0.7752 | 0.6821 | 0.5978 | 0.7558 | 0.6263 |
| P | Fruit | 0.8862 | 0.8175 | 0.7532 | 0.7016 | 0.7999 | 0.6950 |
| P | Mobile phone | 0.8748 | 0.8367 | 0.7954 | 0.6979 | 0.8194 | 0.7201 |
| P | Hotel | 0.8215 | 0.7938 | 0.6972 | 0.6928 | 0.7498 | 0.6961 |
| A | Computer | 0.8066 | 0.7407 | 0.7324 | 0.6500 | 0.7652 | 0.6880 |
| A | Fruit | 0.8367 | 0.7151 | 0.7589 | 0.7496 | 0.7423 | 0.7519 |
| A | Mobile phone | 0.8596 | 0.7829 | 0.8208 | 0.7606 | 0.8140 | 0.7898 |
| A | Hotel | 0.8113 | 0.7357 | 0.7580 | 0.7526 | 0.7595 | 0.7620 |
| R | Computer | 0.8386 | 0.6777 | 0.8702 | 0.9163 | 0.7835 | 0.9318 |
| R | Fruit | 0.7726 | 0.5537 | 0.7700 | 0.8684 | 0.6461 | 0.8978 |
| R | Mobile phone | 0.8402 | 0.7045 | 0.8651 | 0.9210 | 0.8067 | 0.9502 |
| R | Hotel | 0.7954 | 0.6367 | 0.9120 | 0.9074 | 0.7788 | 0.9300 |
| F1 | Computer | 0.8125 | 0.7232 | 0.7648 | 0.7235 | 0.7694 | 0.7491 |
| F1 | Fruit | 0.8255 | 0.6602 | 0.7615 | 0.7761 | 0.7148 | 0.7835 |
| F1 | Mobile phone | 0.8571 | 0.7649 | 0.8288 | 0.7941 | 0.8130 | 0.8193 |
| F1 | Hotel | 0.8082 | 0.7066 | 0.7903 | 0.7857 | 0.7640 | 0.7962 |

Note: ALL denotes the results obtained with the union of the NTUSD, THU, HowNet, and DU dictionaries.
Table 7. Comparison of different automatic construction algorithms for sentiment dictionaries

| Metric | Corpus | Proposed method | SO-PMI (best sentiment weight threshold 0.3)[2] | SO-PMI (best sentiment weight threshold 0)[2] | TwoSim[1] | Conj-TwoSim[1] |
|---|---|---|---|---|---|---|
| P | Hotel | 0.8279 | 0.8946 | 0.8869 | 0.7111 | 0.7117 |
| P | Chinese sentiment analysis | 0.8649 | 0.8706 | 0.9018 | 0.7589 | 0.7688 |
| A | Hotel | 0.8231 | 0.7816 | 0.7868 | 0.7487 | 0.7494 |
| A | Chinese sentiment analysis | 0.8254 | 0.7996 | 0.8087 | 0.7147 | 0.7157 |
| R | Hotel | 0.8158 | 0.6383 | 0.6573 | 0.8063 | 0.8070 |
| R | Chinese sentiment analysis | 0.7712 | 0.7037 | 0.6927 | 0.6357 | 0.6332 |
| F1 | Hotel | 0.8218 | 0.7450 | 0.7551 | 0.7557 | 0.7564 |
| F1 | Chinese sentiment analysis | 0.8154 | 0.7783 | 0.7836 | 0.6919 | 0.6884 |
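One of the baselines in Table 7 is SO-PMI. As background, the following is a minimal sketch of the generic SO-PMI idea: a word's orientation is its summed PMI with positive seed words minus its summed PMI with negative seed words. It is not the baseline implementation evaluated in the table. Document-level co-occurrence, the additive `smoothing` term (added here to avoid taking the logarithm of zero), the use of k as a cutoff on the score, and the toy data are all assumptions.

```python
import math
from typing import List, Sequence


def so_pmi(
    word: str,
    docs: Sequence[Sequence[str]],
    pos_seeds: List[str],
    neg_seeds: List[str],
    smoothing: float = 0.5,
) -> float:
    """Generic SO-PMI score based on document-level co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    n = len(doc_sets)

    def doc_freq(*terms: str) -> int:
        # Number of documents containing every term in `terms`.
        return sum(1 for s in doc_sets if all(t in s for t in terms))

    def pmi(a: str, b: str) -> float:
        p_ab = (doc_freq(a, b) + smoothing) / n
        p_a = (doc_freq(a) + smoothing) / n
        p_b = (doc_freq(b) + smoothing) / n
        return math.log2(p_ab / (p_a * p_b))

    positive = sum(pmi(word, seed) for seed in pos_seeds)
    negative = sum(pmi(word, seed) for seed in neg_seeds)
    return positive - negative


def polarity(score: float, k: float = 0.0) -> str:
    """Label a word, treating scores within [-k, k] as neutral (assumed use of k)."""
    if score > k:
        return "positive"
    if score < -k:
        return "negative"
    return "neutral"


# Toy tokenized reviews (illustrative only).
docs = [["便携", "好"], ["便携", "好"], ["卡顿", "差"], ["卡顿", "差"]]
for candidate in ("便携", "卡顿"):
    score = so_pmi(candidate, docs, pos_seeds=["好"], neg_seeds=["差"])
    print(candidate, round(score, 3), polarity(score, k=0.3))
```

Raising k discards words whose score is close to zero, which under this reading trades recall for precision.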
[1] LIU W T, LIU P Y, LIU W F, et al. New word discovery algorithm based on mutual information and branch entropy[J]. Application Research of Computers, 2019, 36(5): 1293-1296 (in Chinese).
[2] YU S J, WANG B Y, LU T. Sentiment lexicon construction based on improved left-right entropy algorithm[J]. Journal of Donghua University (English Edition), 2022, 39(1): 65-71.
[3] CHEN Z, LI X, WANG M, et al. Domain sentiment dictionary construction and optimization based on multi-source information fusion[J]. Intelligent Data Analysis, 2020, 24(2): 229-251.
[4] ZHANG S X, XU H Q, ZHU G L, et al. A data processing method based on sequence labeling and syntactic analysis for extracting new sentiment words from product reviews[J]. Soft Computing, 2022, 26(2): 853-866. doi: 10.1007/s00500-021-06228-9
[5] ZHANG W, ZHU Y C, WANG J P. An intelligent textual corpus big data computing approach for lexicons construction and sentiment classification of public emergency events[J]. Multimedia Tools and Applications, 2019, 78(21): 30159-30174. doi: 10.1007/s11042-018-7018-x
[6] WU S X, WU F Z, CHANG Y, et al. Automatic construction of target-specific sentiment lexicon[J]. Expert Systems with Applications, 2019, 116: 285-298. doi: 10.1016/j.eswa.2018.09.024
[7] HUANG S, NIU Z D, SHI C Y. Automatic construction of domain-specific sentiment lexicon based on constrained label propagation[J]. Knowledge-Based Systems, 2014, 56: 191-200. doi: 10.1016/j.knosys.2013.11.009
[8] MA T H, RONG H, HAO Y S, et al. A novel sentiment polarity detection framework for Chinese[J]. IEEE Transactions on Affective Computing, 2022, 13(1): 60-74. doi: 10.1109/TAFFC.2019.2932061
[9] ALMATARNEH S, GAMALLO P. Automatic construction of domain-specific sentiment lexicons for polarity classification[C]//International Conference on Practical Applications of Agents and Multi-Agent Systems. Berlin: Springer, 2018: 175-182.
[10] ZHANG B, XU D, ZHANG H, et al. STCS lexicon: spectral-clustering-based topic-specific Chinese sentiment lexicon construction for social networks[J]. IEEE Transactions on Computational Social Systems, 2019, 6(6): 1180-1189. doi: 10.1109/TCSS.2019.2941344
[11] KANAYAMA H, NASUKAWA T. Fully automatic lexicon expansion for domain-oriented sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Sydney: Association for Computational Linguistics, 2006: 355-363.
[12] KU L W, LIANG Y T, CHEN H H. Opinion extraction, summarization and tracking in news and blog corpora[C]//Proceedings of the Spring Symposia on Computational Approaches to Analyzing Weblogs. Palo Alto: AAAI Press, 2006: 100-107.
[13] XU L H, LIN H F, PAN Y, et al. Constructing the affective lexicon ontology[J]. Journal of the China Society for Scientific and Technical Information, 2008, 27(2): 180-185 (in Chinese). doi: 10.3969/j.issn.1000-0135.2008.02.004
[14] FU X H, LIU G, GUO Y Y, et al. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon[J]. Knowledge-Based Systems, 2013, 37: 186-195. doi: 10.1016/j.knosys.2012.08.003
[15] ZENG X K, YANG C, TU C C, et al. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2018: 5650-5657.
[16] LIU J Y, YAN M J, LUO J. Research on the construction of sentiment lexicon based on Chinese microblog[C]//Proceedings of the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics. Piscataway: IEEE Press, 2016: 56-59.
[17] ESULI A, SEBASTIANI F. Pageranking wordnet synsets: an application to opinion mining[C]//Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague: Association for Computational Linguistics, 2007: 424-431.
[18] TAN S B, WU Q. A random walk algorithm for automatic construction of domain-oriented sentiment lexicon[J]. Expert Systems with Applications, 2011, 38(10): 12094-12100. doi: 10.1016/j.eswa.2011.02.105
[19] WANG L Y, RUI X. Sentiment lexicon construction with representation learning based on hierarchical sentiment supervision[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen: Association for Computational Linguistics, 2017: 502-510.
[20] SophonPlus. ChineseNlpCorpus[DS/OL]. http://github.com/SophonPlus/ChineseNlpCorpus/tree/master.
[21] ZHANG C G, LIU P Y, ZHU Z F, et al. A sentiment analysis method based on a polarity lexicon[J]. Journal of Shandong University (Natural Science), 2012, 47(3): 47-50 (in Chinese).

