An adaptive automatic construction algorithm for sentiment dictionaries based on semantic rules

WEI Qinglan; HE Yu; SONG Jinbao

doi:10.13700/j.bh.1001-5965.2023.0367


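1. School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
2. School of Information and Communication Engineering, Communication University of China, Beijing 100024, China

Funding:

National Natural Science Foundation of China (72474198, 62301510); Research Grant of the Asia Media Research Center, Communication University of China (AMRC2023-4)

Corresponding author:
E-mail: songjinbao@cuc.edu.cn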

CLC number: V221⁺.3; TB553
Publication history
- Received: 2023-06-15
- Accepted: 2023-12-01
- Published online: 2024-03-07
- Issue published: 2025-07-31

An adaptive automatic construction algorithm for sentiment dictionaries based on semantic rules

WEI Qinglan¹, HE Yu², SONG Jinbao¹,*

1. School of Data Science and Intelligent Media, Communication University of China, Beijing 100024, China
2. School of Information and Communication Engineering, Communication University of China, Beijing 100024, China

Funds:

National Natural Science Foundation of China (72474198, 62301510); Asia Media Research Center, Communication University of China (AMRC2023-4)

Corresponding author: E-mail: songjinbao@cuc.edu.cn

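Abstract

Abstract (Chinese version, translated):
Dictionary-based text sentiment analysis is fast and unsupervised, but its accuracy is limited by the quality of the dictionary. Existing general-purpose Chinese dictionaries are mostly built by hand, cannot discover new words automatically, and contain sentimentally ambiguous words, so their quality in cross-domain applications needs improvement. To address these problems, a domain-adaptive algorithm for automatically constructing Chinese sentiment dictionaries based on semantic rules is proposed. A Chinese sentiment-fixed dictionary is constructed, which effectively eliminates sentiment ambiguity; a new domain-adaptive method for discovering new Chinese words is proposed, which automatically expands general-domain Chinese dictionaries; and an unsupervised scheme for computing word sentiment orientation that combines part-of-speech filtering with embedded semantic rules is proposed, which effectively improves accuracy. Experiments show that, on a commonly used computer corpus, sentiment analysis with the sentiment-fixed dictionary improves accuracy by 9.31% on average, precision by 12.77% on average, and the F₁ value (the harmonic mean of precision and recall) by 7.43% on average over other general-purpose Chinese dictionaries. On two datasets, the hotel corpus and the Chinese sentiment analysis corpus, the proposed automatic construction algorithm improves accuracy by 7.41%, recall by 12.23%, and the F₁ value by 9.08% on average over advanced algorithms.
Keywords: sentiment dictionary; sentiment analysis; semantic rule; unsupervised; domain adaptation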
Abstract:
Although dictionary-based text sentiment analysis is efficient and unsupervised, its accuracy depends heavily on dictionary quality. Existing Chinese sentiment dictionaries are mostly constructed manually, so they cannot discover new words automatically and they contain sentimentally ambiguous words; their quality in cross-domain applications therefore needs improvement. This paper proposes a domain-adaptive algorithm, based on semantic rules, for the automatic construction of Chinese sentiment dictionaries. First, a Chinese sentiment-fixed dictionary is constructed, which effectively eliminates sentiment ambiguity. Second, a novel domain-adaptive method for discovering new Chinese words is proposed, which enables automatic expansion of general-domain Chinese dictionaries. Third, an unsupervised framework for computing word sentiment orientation, combining part-of-speech filtering with semantic rules, achieves higher sentiment detection accuracy. Experiments show that, on a commonly used computer corpus, sentiment analysis with the sentiment-fixed dictionary improves accuracy by 9.31%, precision by 12.77%, and the F₁ value by 7.43% on average compared with other general Chinese dictionaries. On two datasets, the hotel and Chinese sentiment analysis corpora, the proposed algorithm improves accuracy by 7.41%, recall by 12.23%, and the F₁ value by 9.08% on average compared with advanced methods.
Keywords: sentiment dictionary; sentiment analysis; semantic rule; unsupervised; domain adaptation
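
The abstract reports accuracy, precision, recall, and the F₁ value (the harmonic mean of precision and recall). As a reminder of how these four quantities relate, here is a minimal sketch for binary sentiment labels. The function name `binary_metrics`, the label encoding (1 for positive, 0 for negative), and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact averaging scheme is not stated on this page.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, accuracy, recall, f1) for binary labels.

    Hypothetical helper for illustration; labels equal to `positive`
    are treated as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, accuracy, recall, f1


# Toy labels (made up): 4 positive and 4 negative reviews.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
p, a, r, f1 = binary_metrics(y_true, y_pred)
print(f"P={p:.4f} A={a:.4f} R={r:.4f} F1={f1:.4f}")
```

Because F₁ is a harmonic mean, it is pulled toward the smaller of precision and recall, which is why a dictionary with very high recall but low precision can still score a modest F₁.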

Figure 1. Framework of the automatic construction algorithm for sentiment dictionaries

Figure 2. Flowchart of the new word discovery method based on the PMI formula
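
Figure 2 refers to new word discovery based on pointwise mutual information (PMI). The flowchart itself is not reproduced here, so the following is only a generic sketch of PMI-based candidate scoring, not the authors' procedure. It assumes pre-segmented sentences, a base-2 logarithm, and a fixed threshold passed in by the caller; the function name `pmi_candidates` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def pmi_candidates(sentences, threshold):
    """Score adjacent token pairs by PMI and keep pairs at or above threshold.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    estimated from unigram and adjacent-pair counts (an assumption of
    this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    kept = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        score = math.log2(p_xy / (p_x * p_y))
        if score >= threshold:
            kept[(x, y)] = score
    return kept


# Toy, already-segmented review snippets (made up for illustration).
corpus = [
    ["杀毒", "软件", "很", "实用"],
    ["送", "杀毒", "软件"],
    ["键盘", "很", "好"],
    ["屏幕", "很", "好"],
]
kept = pmi_candidates(corpus, threshold=3.0)
for (x, y), score in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(x + y, round(score, 3))
```

Raising the threshold shrinks the candidate set, which mirrors the trend in Table 2, where the number of discovered words drops as the setting of the exhaustive method increases. Plain PMI also favors rare pairs, so practical systems usually combine it with a frequency floor or other filters.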

Figure 3. Flowchart of sentence sentiment calculation
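
Figure 3 concerns sentence-level sentiment calculation with a dictionary. Since the flowchart is not available on this page, the sketch below shows only a generic dictionary-based scorer with two common rules (negation and degree adverbs); it should not be read as the paper's rule set. The lexicon weights, the negator and intensifier lists, and the names `sentence_sentiment` and `polarity` are all assumptions made for illustration.

```python
def sentence_sentiment(tokens, lexicon, negators, intensifiers):
    """Sum lexicon weights over tokens, applying negation and degree modifiers.

    Assumed rules (not from the paper): a negator flips the sign of the
    next sentiment word, and an intensifier multiplies its weight.
    """
    score = 0.0
    sign = 1.0
    scale = 1.0
    for token in tokens:
        if token in negators:
            sign = -sign
        elif token in intensifiers:
            scale *= intensifiers[token]
        elif token in lexicon:
            score += sign * scale * lexicon[token]
            # Pending modifiers are consumed by this sentiment word.
            sign, scale = 1.0, 1.0
    return score


def polarity(score):
    """Map a numeric score to a coarse polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical toy resources.
lexicon = {"漂亮": 0.8, "实用": 0.6, "卡顿": -0.7}
negators = {"不"}
intensifiers = {"很": 1.5, "非常": 2.0}

for tokens in (["外观", "很", "漂亮"], ["键盘", "不", "实用"], ["运行", "不", "卡顿"]):
    s = sentence_sentiment(tokens, lexicon, negators, intensifiers)
    print("".join(tokens), round(s, 2), polarity(s))
```

Tokens that appear in none of the three resources are skipped, so a pending negator or intensifier carries over to the next sentiment word; a real system would typically limit that window.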

Table 1. Quality of new words obtained using the exhaustive and adaptive threshold methods

Dataset	P (exhaustive)	P (adaptive threshold)	A (exhaustive)	A (adaptive threshold)	R (exhaustive)	R (adaptive threshold)	F₁ (exhaustive)	F₁ (adaptive threshold)
Takeaway	0.8020	0.8268	0.7737	0.7872	0.7252	0.7263	0.7611	0.7733
Computer	0.7634	0.7749	0.7938	0.8111	0.8521	0.8767	0.8052	0.8227
Fruit	0.8778	0.8757	0.8466	0.8672	0.8044	0.8558	0.8387	0.8656
Tablet	0.8651	0.8675	0.8180	0.8300	0.7533	0.7790	0.8050	0.8208
Books	0.7638	0.7660	0.7556	0.7598	0.7989	0.8057	0.7809	0.7853
Hotel	0.8135	0.8279	0.8173	0.8231	0.8232	0.8158	0.8183	0.8218
Note: P denotes precision; A, accuracy; R, recall.

Table 2. New word discovery results using different methods on a computer corpus

Method	Threshold	Discovered new words (excerpt)	Total discovered
Exhaustive	2	外观漂亮，键盘很，速度还可以，LED屏，还不错，送杀毒软件，很实用，回家，驱动安装，出现问题……	839
Exhaustive	4	外观漂亮，速度还可以，LED屏，还不错，送杀毒软件，很实用，回家，安装盘，刚买，出现问题……	639
Exhaustive	6	外观漂亮，LED屏，送杀毒软件，回家，刚买，提点，出现问题，很多人，独显，售后……	366
Exhaustive	8	LED屏，杀毒软件，回家，独显，小黑，过程中，售后，第二天，取货，找不到，帮朋友买，光驱……	198
Exhaustive	10	杀毒，回家，独显，光驱，工作人员，适合女孩子，内存条，蓝牙，值得购买，钢琴烤漆……	70
Exhaustive	12	杀毒，质保，微星，双面胶，散热，便携，散热量，西数，魔兽，刻录，磨砂，双通道，分区，芯片组，触控，附赠，2G	17
Exhaustive	14	散热，魔兽	2
Adaptive threshold		独显，售后，杀毒软件，续航能力，微星，客服，双面胶，质保，便携，散热量，性价高，蓝牙……	42

Table 3. New word discovery results on different corpora

Corpus	Discovered new words (excerpt)
Takeaway	送餐，菜品，小哥，百度，店里，尖椒，给力，宫保鸡丁，土豆丝，卷饼，性价，鸡腿，京酱肉丝，皮蛋瘦肉粥……
Books	小熊，到底，职场，育儿，杜拉拉升职记，神奇校车，当当网，亲子，快递员，男人，文字，提到，小鸡……
Computer	独显，售后，杀毒软件，续航能力，客服，双面胶，质保，便携，显卡，散热量，性价高，蓝牙，触摸板，钢琴烤漆……
Fruit	第一次，给力，下单，直采，网购，甘肃天水，售后，客服，冷链，品相，物有所值，性价高，火龙果，次日达……
Hotel	地理位置，隔音效果，补充点评，入住，携程，门童，前台，总台，退房，结账，身体健康，四星级，旺角，网速……
Tablet	续航能力，下单，售后，国人，生日礼物，手机壳，做工精细，官网，王者荣耀，运存，皮套，数据线，性价高……

Table 4. Sentiment calculation results for different part-of-speech combinations

POS combination	P	A	R	F₁
[a]	0.779	0.806	0.856	0.815
[a,uw]	0.775	0.811	0.877	0.823
[v]	0.723	0.776	0.894	0.800
[v,uw]	0.717	0.775	0.908	0.801
[n]	0.717	0.777	0.915	0.804
[n,uw]	0.712	0.775	0.923	0.804
[a,n]	0.719	0.779	0.916	0.806
[a,n,uw]	0.715	0.777	0.921	0.805
[v,a]	0.720	0.777	0.905	0.802
[v,a,uw]	0.722	0.776	0.897	0.800
[v,n]	0.711	0.773	0.918	0.801
[v,n,uw]	0.708	0.771	0.921	0.800
[v,a,n]	0.718	0.777	0.912	0.803
[v,a,n,uw]	0.721	0.779	0.910	0.804
Note: uw denotes the new-word dictionary; a, adjectives; v, verbs; n, nouns.

Table 5. Sentiment calculation results on six datasets with different sentiment weight thresholds

P
Corpus	k=0	k=0.1	k=0.2	k=0.3	k=0.4	k=0.5	k=0.6	k=0.7
Takeaway	0.7155	0.7824	0.8238	0.8268	0.8258	0.8265	0.8287	0.8294
Computer	0.6969	0.6941	0.7246	0.7749	0.7807	0.7835	0.7857	0.7891
Books	0.7180	0.7185	0.7306	0.7397	0.7580	0.7660	0.7644	0.7663
Clothing	0.8445	0.8682	0.8675	0.8892	0.8966	0.8993	0.9035	0.9036
Hotel	0.7613	0.7583	0.7956	0.8161	0.8223	0.8280	0.8285	0.8283
Fruit	0.8455	0.8743	0.8684	0.8757	0.8890	0.8906	0.8904	0.8904

A
Corpus	k=0	k=0.1	k=0.2	k=0.3	k=0.4	k=0.5	k=0.6	k=0.7
Takeaway	0.7603	0.8001	0.7860	0.7872	0.7835	0.7827	0.7804	0.7807
Computer	0.7577	0.7562	0.7813	0.8111	0.8036	0.8041	0.8041	0.8058
Books	0.7333	0.7357	0.7463	0.7523	0.7575	0.7598	0.7554	0.7559
Clothing	0.8832	0.8973	0.8974	0.9021	0.9042	0.8949	0.8923	0.8921
Hotel	0.7963	0.7993	0.8121	0.8194	0.8180	0.8202	0.8203	0.8198
Fruit	0.8569	0.8689	0.8657	0.8672	0.8605	0.8557	0.8549	0.8549

R
Corpus	k=0	k=0.1	k=0.2	k=0.3	k=0.4	k=0.5	k=0.6	k=0.7
Takeaway	0.8639	0.8311	0.7275	0.7263	0.7185	0.7153	0.7068	0.7065
Computer	0.9118	0.9158	0.9073	0.8767	0.8441	0.8401	0.8361	0.8346
Books	0.8414	0.8471	0.8471	0.8419	0.8157	0.8057	0.7971	0.7948
Clothing	0.9394	0.9368	0.9379	0.9186	0.9138	0.8896	0.8784	0.8778
Hotel	0.8632	0.8786	0.8400	0.8246	0.8112	0.8082	0.8078	0.8068
Fruit	0.8738	0.8616	0.8620	0.8558	0.8238	0.8110	0.8094	0.8094

F₁
Corpus	k=0	k=0.1	k=0.2	k=0.3	k=0.4	k=0.5	k=0.6	k=0.7
Takeaway	0.7827	0.8060	0.7727	0.7733	0.7684	0.7669	0.7629	0.7630
Computer	0.7900	0.7897	0.8057	0.8227	0.8112	0.8108	0.8101	0.8112
Books	0.7748	0.7775	0.7846	0.7875	0.7858	0.7853	0.7804	0.7803
Clothing	0.8894	0.9012	0.9014	0.9037	0.9051	0.8944	0.8908	0.8905
Hotel	0.8090	0.8140	0.8172	0.8203	0.8167	0.8180	0.8180	0.8174
Fruit	0.8592	0.8679	0.8652	0.8656	0.8552	0.8489	0.8480	0.8480

Table 6. Comparison of the Chinese sentiment-fixed dictionary and general sentiment dictionaries

P
Corpus	Sentiment-fixed dictionary	NTUSD^[12]	THU^[15]	HowNet^[14]	DU^[13]	ALL
Computer	0.7880	0.7752	0.6821	0.5978	0.7558	0.6263
Fruit	0.8862	0.8175	0.7532	0.7016	0.7999	0.6950
Mobile phone	0.8748	0.8367	0.7954	0.6979	0.8194	0.7201
Hotel	0.8215	0.7938	0.6972	0.6928	0.7498	0.6961

A
Corpus	Sentiment-fixed dictionary	NTUSD^[12]	THU^[15]	HowNet^[14]	DU^[13]	ALL
Computer	0.8066	0.7407	0.7324	0.6500	0.7652	0.6880
Fruit	0.8367	0.7151	0.7589	0.7496	0.7423	0.7519
Mobile phone	0.8596	0.7829	0.8208	0.7606	0.8140	0.7898
Hotel	0.8113	0.7357	0.7580	0.7526	0.7595	0.7620

R
Corpus	Sentiment-fixed dictionary	NTUSD^[12]	THU^[15]	HowNet^[14]	DU^[13]	ALL
Computer	0.8386	0.6777	0.8702	0.9163	0.7835	0.9318
Fruit	0.7726	0.5537	0.7700	0.8684	0.6461	0.8978
Mobile phone	0.8402	0.7045	0.8651	0.9210	0.8067	0.9502
Hotel	0.7954	0.6367	0.9120	0.9074	0.7788	0.9300

F₁
Corpus	Sentiment-fixed dictionary	NTUSD^[12]	THU^[15]	HowNet^[14]	DU^[13]	ALL
Computer	0.8125	0.7232	0.7648	0.7235	0.7694	0.7491
Fruit	0.8255	0.6602	0.7615	0.7761	0.7148	0.7835
Mobile phone	0.8571	0.7649	0.8288	0.7941	0.8130	0.8193
Hotel	0.8082	0.7066	0.7903	0.7857	0.7640	0.7962
Note: ALL denotes the results obtained with the union of the NTUSD, THU, HowNet, and DU dictionaries.

Table 7. Comparison of different automatic construction algorithms for sentiment dictionaries

P
Corpus	Proposed method	SO-PMI (best sentiment weight threshold 0.3)^[2]	SO-PMI (best sentiment weight threshold 0)^[2]	TwoSim^[1]	Conj-TwoSim^[1]
Hotel	0.8279	0.8946	0.8869	0.7111	0.7117
Chinese sentiment analysis	0.8649	0.8706	0.9018	0.7589	0.7688

A
Corpus	Proposed method	SO-PMI (best sentiment weight threshold 0.3)^[2]	SO-PMI (best sentiment weight threshold 0)^[2]	TwoSim^[1]	Conj-TwoSim^[1]
Hotel	0.8231	0.7816	0.7868	0.7487	0.7494
Chinese sentiment analysis	0.8254	0.7996	0.8087	0.7147	0.7157

R
Corpus	Proposed method	SO-PMI (best sentiment weight threshold 0.3)^[2]	SO-PMI (best sentiment weight threshold 0)^[2]	TwoSim^[1]	Conj-TwoSim^[1]
Hotel	0.8158	0.6383	0.6573	0.8063	0.8070
Chinese sentiment analysis	0.7712	0.7037	0.6927	0.6357	0.6332

F₁
Corpus	Proposed method	SO-PMI (best sentiment weight threshold 0.3)^[2]	SO-PMI (best sentiment weight threshold 0)^[2]	TwoSim^[1]	Conj-TwoSim^[1]
Hotel	0.8218	0.7450	0.7551	0.7557	0.7564
Chinese sentiment analysis	0.8154	0.7783	0.7836	0.6919	0.6884

References (21)

[1]	LIU W T, LIU P Y, LIU W F, et al. New word discovery algorithm based on mutual information and branch entropy[J]. Application Research of Computers, 2019, 36(5): 1293-1296 (in Chinese).
[2]	YU S J, WANG B Y, LU T. Sentiment lexicon construction based on improved left-right entropy algorithm[J]. Journal of Donghua University (English Edition), 2022, 39(1): 65-71.
[3]	CHEN Z, LI X, WANG M, et al. Domain sentiment dictionary construction and optimization based on multi-source information fusion[J]. Intelligent Data Analysis, 2020, 24(2): 229-251.
[4]	ZHANG S X, XU H Q, ZHU G L, et al. A data processing method based on sequence labeling and syntactic analysis for extracting new sentiment words from product reviews[J]. Soft Computing, 2022, 26(2): 853-866. doi: 10.1007/s00500-021-06228-9
[5]	ZHANG W, ZHU Y C, WANG J P. An intelligent textual corpus big data computing approach for lexicons construction and sentiment classification of public emergency events[J]. Multimedia Tools and Applications, 2019, 78(21): 30159-30174. doi: 10.1007/s11042-018-7018-x
[6]	WU S X, WU F Z, CHANG Y, et al. Automatic construction of target-specific sentiment lexicon[J]. Expert Systems with Applications, 2019, 116: 285-298. doi: 10.1016/j.eswa.2018.09.024
[7]	HUANG S, NIU Z D, SHI C Y. Automatic construction of domain-specific sentiment lexicon based on constrained label propagation[J]. Knowledge-Based Systems, 2014, 56: 191-200. doi: 10.1016/j.knosys.2013.11.009
[8]	MA T H, RONG H, HAO Y S, et al. A novel sentiment polarity detection framework for Chinese[J]. IEEE Transactions on Affective Computing, 2022, 13(1): 60-74. doi: 10.1109/TAFFC.2019.2932061
[9]	ALMATARNEH S, GAMALLO P. Automatic construction of domain-specific sentiment lexicons for polarity classification[C]//International Conference on Practical Applications of Agents and Multi-Agent Systems. Berlin: Springer, 2018: 175-182.
[10]	ZHANG B, XU D, ZHANG H, et al. STCS lexicon: spectral-clustering-based topic-specific Chinese sentiment lexicon construction for social networks[J]. IEEE Transactions on Computational Social Systems, 2019, 6(6): 1180-1189. doi: 10.1109/TCSS.2019.2941344
[11]	KANAYAMA H, NASUKAWA T. Fully automatic lexicon expansion for domain-oriented sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Sydney: Association for Computational Linguistics, 2006: 355-363.
[12]	KU L W, LIANG Y T, CHEN H H. Opinion extraction, summarization and tracking in news and blog corpora[C]//Proceedings of the Spring Symposia on Computational Approaches to Analyzing Weblogs. Palo Alto: AAAI Press, 2006: 100-107.
[13]	XU L H, LIN H F, PAN Y, et al. Constructing the affective lexicon ontology[J]. Journal of the China Society for Scientific and Technical Information, 2008, 27(2): 180-185 (in Chinese). doi: 10.3969/j.issn.1000-0135.2008.02.004
[14]	FU X H, LIU G, GUO Y Y, et al. Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon[J]. Knowledge-Based Systems, 2013, 37: 186-195. doi: 10.1016/j.knosys.2012.08.003
[15]	ZENG X K, YANG C, TU C C, et al. Chinese LIWC lexicon expansion via hierarchical classification of word embeddings with sememe attention[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2018: 5650-5657.
[16]	LIU J Y, YAN M J, LUO J. Research on the construction of sentiment lexicon based on Chinese microblog[C]//Proceedings of the 8th International Conference on Intelligent Human-Machine Systems and Cybernetics. Piscataway: IEEE Press, 2016: 56-59.
[17]	ESULI A, SEBASTIANI F. Pageranking wordnet synsets: an application to opinion mining[C]//Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Prague: Association for Computational Linguistics, 2007: 424-431.
[18]	TAN S B, WU Q. A random walk algorithm for automatic construction of domain-oriented sentiment lexicon[J]. Expert Systems with Applications, 2011, 38(10): 12094-12100. doi: 10.1016/j.eswa.2011.02.105
[19]	WANG L Y, RUI X. Sentiment lexicon construction with representation learning based on hierarchical sentiment supervision[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen: Association for Computational Linguistics, 2017: 502-510.
[20]	SophonPlus. ChineseNlpCorpus[DS/OL]. http://github.com/SophonPlus/ChineseNlpCorpus/tree/master.
[21]	ZHANG C G, LIU P Y, ZHU Z F, et al. A sentiment analysis method based on a polarity lexicon[J]. Journal of Shandong University (Natural Science), 2012, 47(3): 47-50 (in Chinese).