Citation: XIONG G Y, YANG B L. A self-decision topic crawler algorithm with online training[J]. Journal of Beijing University of Aeronautics and Astronautics, 2025, 51(2): 602-615 (in Chinese). doi: 10.13700/j.bh.1001-5965.2023.0002
The tunnel crossing problem is unavoidable in the development of topic crawlers. To address it, a self-decision topic crawler algorithm based on the Boyd loop (FCIDOL) is proposed. The algorithm takes the Boyd loop as its basic framework and forms a closed loop following the principle of “observation-assessment-decision-action”. Drawing on memory, that is, the work the crawler has already completed, the algorithm evaluates the currently observed state and decides between a radical and a conservative strategy, guiding the crawler either to search for new topic-relevant web pages or to focus on actions with short-term benefit. Memory also supplies training material for the assessment network, which enables online training of the network and thereby handles the crawler's cold start. Experiments show that, compared with various topic crawler algorithms in different topic environments, FCIDOL improves the harvest rate by more than 7.8% and reduces the number of duplicate links by more than 15.6%.
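The closed loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy web graph `TOY_WEB`, the single "parent relevance" feature, the 0.5 decision threshold, and the one-weight logistic model standing in for the assessment network are all assumptions made for illustration. Harvest rate is computed here in the usual focused-crawling sense, as relevant pages fetched divided by all pages fetched.

```python
import math
import random

# Hypothetical toy web: url -> (is_relevant, outgoing links). Not from the paper.
TOY_WEB = {
    "seed": (True, ["a", "b"]),
    "a": (False, ["c"]),        # irrelevant page that leads to relevant ones
    "b": (True, ["seed", "d"]),
    "c": (True, ["e"]),
    "d": (False, []),
    "e": (True, []),
}


class OnlineAssessor:
    """One-feature logistic model trained online (stand-in for the assessment network)."""

    def __init__(self, lr=0.5):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.b)))

    def train(self, x, y):
        # One stochastic-gradient step on the log loss.
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err


def crawl(web, seed, budget, rng):
    assessor = OnlineAssessor()
    memory = []                  # (feature, label) pairs from completed work
    frontier = [(seed, 1.0)]     # (url, feature); feature = relevance of the parent page
    seen = {seed}                # every link is queued at most once (no duplicates)
    visited = []
    while frontier and len(visited) < budget:
        # Observe and assess: score every candidate link in the frontier.
        scored = [(assessor.predict(f), i) for i, (_, f) in enumerate(frontier)]
        best_score, best_idx = max(scored)
        # Decide: assumed rule, go radical when nothing in the frontier looks promising.
        radical = best_score < 0.5
        idx = rng.randrange(len(frontier)) if radical else best_idx
        # Act: fetch the page and record the outcome in memory.
        url, feat = frontier.pop(idx)
        relevant, links = web[url]
        visited.append(url)
        memory.append((feat, float(relevant)))
        # Online training: learn from the newest memory entry, no pre-training needed.
        assessor.train(*memory[-1])
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, float(relevant)))
    harvest_rate = sum(web[u][0] for u in visited) / len(visited)
    return visited, harvest_rate


visited, rate = crawl(TOY_WEB, "seed", budget=10, rng=random.Random(0))
print(visited, round(rate, 3))
```

With a budget that covers the whole toy graph, every policy eventually fetches all six pages, so the harvest rate is fixed by the graph itself. Differences between the radical and conservative strategies only become visible when the budget is smaller than the number of reachable pages.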