Information-theoretic ensemble clustering on web texts

WANG Yang; YUAN Kun; LIU Hongfu; WU Junjie; BAO Xiuguo

doi:10.13700/j.bh.1001-5965.2015.0507

Volume 42 Issue 8

Aug. 2016

Turn off MathJax

Article Contents

Journal of Beijing University of Aeronautics and Astronautics > 2016 > 42(8): 1603-1611.

WANG Yang, YUAN Kun, LIU Hongfu, et al. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. doi: 10.13700/j.bh.1001-5965.2015.0507(in Chinese)

Citation:

WANG Yang, YUAN Kun, LIU Hongfu, et al. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. doi: 10.13700/j.bh.1001-5965.2015.0507(in Chinese)

Citation:

PDF( 3182 KB)

Information-theoretic ensemble clustering on web texts

doi: 10.13700/j.bh.1001-5965.2015.0507

1.
School of Economics and Management, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
2. School of Mechanical Engineering and Automation, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
3. College of Engineering, Northeastern University, Boston 02115, USA
4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China

Received Date: 30 Jul 2015
Publish Date: 20 Aug 2016

Abstract

Abstract

Although being extensively studied, text clustering remains a critical challenge in data mining community due to the curse of dimensionality. Various techniques have been proposed to overcome this difficulty, but the negative impact of weakly related or even noisy features is yet the hunting nightmare. Meanwhile, we should never lose sight of the explosive growth of unlimited user-generated content on social media, which is extremely sparse and poses further challenge on the efficiency issue. In light of this, a disassemble-assemble (DIAS) framework is proposed for text clustering. Simple random feature sampling is employed by DIAS to disassemble high-dimensional text data and gain diverse structural knowledge by avoiding the bulk of noisy features. Then the multi-view knowledge is assembled by fast information-theoretic consensus clustering (ICC) to gain a high-quality consensus partitioning. Extensive experiments on eight real-world text data sets are conducted to demonstrate the advantages of DIAS over some widely used methods. In particular, DIAS shows appealing merits in learning from a bulk of very weak basic partitionings. Its natural suitability for distributed computing makes DIAS become a promising candidate for big text clustering.
- text clustering,
- disassemble-assemble algorithm,
- information-theoretic consensus clustering,
- K-means,
- big data clustering

FullText(HTML)

References(29)

References

[1]	ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of web documents[C]//Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mininge.New York:AIAA,1997:287-290.
[2]	CUTTING D R,KARGER D R,PEDERSEN J O,et al.Scatter/gather:A cluster-based approach to browsing large document collections[C]//Proceedings of 15th ACM International Conference on Research and Development in Information Retrieval.New York:ACM,1992:318-329.
[3]	CHA M,KWAK H,RODRIGUEZ P,et al.I tube,you tube,everybody tubes:Analyzing the world's largest user generated content video system[C]//Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement.New York:ACM,2007:1-14.
[4]	DUDA R O,HART P E,STORK D G.Pattern classification[M].2nd ed.New York:Wiley-Interscience,2000:14-15.
[5]	JOLLIFFE I T.Principal component analysis[M].2nd ed.New York:Springer,2002:8-21.
[6]	CAO J,WU Z,WU J,et al.Sail:Summation-based incremental learning for information-theoretic text clustering[J].IEEE Transactions on Cybernetics,2013,43(2):570-584.
[7]	WU J,LIU H,XIONG H,et al.K-means-based consensus clustering:A unified view[J].IEEE Transactions on Knowledge and Data Engineering,2015,27(1):155-169.
[8]	GOKCAY E,PRINCIPE J C.Information theoretic clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(2):158-171.
[9]	DHILLON I,MALLELA S,MODHA D.Information-theoretic co-clustering[C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,2003:89-93.
[10]	Digitical Technology Center (DTC).CLUTO-Software for clustering high-dimensional datasets[DS/OL].(2006-10-18)[2015-01-30].
[11]	CAI D.The 20 newgroups data set[DS/OL].(2008-01-14)[2015-01-30].
[12]	CAI D,WANG X,HE X.Probabilistic dyadic data analysis with local and global consistency[C]//Proceedings of the 26th International Conference on Machine Learning (ICML'09).New York:ACM,2009:105-112.
[13]	LI R L.English text segmentation corpus[DS/OL].(2011-10-30)[2015-1-30].
[14]	ZHAO Y,KARYPIS G.Empirical and theoretical comparisons of selected criterion functions for document clustering[J].Machine Learning,2004,55(3):311-331.
[15]	STREHL A,GHOSH J.Cluster ensembles-A knowledge reuse framework for combining partitions[J].Journal of Machine Learning Research,2003,3:583-617.
[16]	ZHONG S,GHOSH J.Generative model-based document clustering:A comparative study[J].Knowledge and Information Systems,2005,8(3):374-384.
[17]	FRED A L N,JAIN A K.Combining multiple clusterings using evidence accumulation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(6):835-850.
[18]	AGGARWAL C C,ZHAI C.Mining text data[M].New York:Springer,2012:81-86.
[19]	BERRY M W,DUMAIS S T,O'BRIEN G W.Using linear algebra for intelligent information retrieval[J].SIAM Review,1995,37(4):573-595.
[20]	HYVARINEN A,OJA E.Independent component analysis:Algorithms and applications[J].Neural Networks,2000,13(4-5):411-430.
[21]	BOUTSIDIS C,ZOUZIAS A,DRINEAS P.Random projections for K-means clustering[C]//Advances in Neural Information Processing Systems.Cambridge:MIT Press,2010:298-306.
[22]	AGRAWAL R,GEHRKE J,GUNOPULOS D,et al.Automatic subspace clustering of high dimensional data for data mining applications[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data.New York:ACM,1998:94-105.
[23]	CHENG C H,FU A W,ZHANG Y.Entropy-based subspace clustering for mining numerical data[C]//Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,1999:84-93.
[24]	GOIL S,NAGESH H,CHOUDHARY A.MAFIA:Efficient and scalable subspace clustering for very large data sets[C]//Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,1999:443-452.
[25]	AGGARWAL C C,YU P S.Finding generalized projected clusters in high dimensional spaces[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data.New York:ACM,2000:70-81.
[26]	FRIEDMAN J H,MEULMAN J J.Clustering objects on subsets of attributes[J].Journal of the Royal Statistical Society:Series B-Statistical Methodology,2004,66(4):815-849.
[27]	WOO K G,LEE J H,KIM M H,et al.FINDIT:A fast and intelligent subspace clustering algorithm using dimension voting[J].Information and Software Technology,2004,46(4):255-271.
[28]	LI T,DING C,JORDAN M I.Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization[C]//Proceedings of 7th IEEE International Conference on Data Mining.Piscataway,NJ:IEEE Press,2007:577-582.
[29]	杨燕,靳蕃,KAMEL M.聚类组合研究的新进展[J].计算机工程与应用,2008,44(11):142-144.YANG Y,JIN F,KAMEL M.Latest development of clustering ensemble[J].Computer Engineering and Applications,2008,44(11):142-144(in Chinese).

Relative Articles

Supplements(0)

Cited By

Proportional views

Proportional views

通讯作者: 陈斌, bchen63@163.com

1.
沈阳化工大学材料科学与工程学院沈阳 110142

Get Citation

PDF

XML

Article Metrics

Article views(862) PDF downloads(423)

Information-theoretic ensemble clustering on web texts

doi: 10.13700/j.bh.1001-5965.2015.0507

Abstract

References

Proportional views

Catalog

通讯作者: 陈斌, bchen63@163.com

Article Metrics

Proportional views

Related

Information-theoretic ensemble clustering on web texts

doi: 10.13700/j.bh.1001-5965.2015.0507

Abstract

References

Proportional views

Catalog

通讯作者: 陈斌, bchen63@163.com

Article Metrics

Proportional views

Related

Export File

Citation

Format

Content