Volume 42 Issue 8
Aug.  2016
WANG Yang, YUAN Kun, LIU Hongfu, et al. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. doi: 10.13700/j.bh.1001-5965.2015.0507(in Chinese)

Information-theoretic ensemble clustering on web texts

doi: 10.13700/j.bh.1001-5965.2015.0507
  • Received Date: 30 Jul 2015
  • Publish Date: 20 Aug 2016
  • Although it has been studied extensively, text clustering remains a critical challenge in the data mining community because of the curse of dimensionality. Various techniques have been proposed to overcome this difficulty, but the negative impact of weakly related or even noisy features remains a persistent problem. Meanwhile, user-generated content on social media is growing explosively; such text is extremely sparse and poses a further challenge to efficiency. In light of this, a disassemble-assemble (DIAS) framework is proposed for text clustering. DIAS uses simple random feature sampling to disassemble high-dimensional text data, gaining diverse structural knowledge while avoiding the bulk of the noisy features. The resulting multi-view knowledge is then assembled by fast information-theoretic consensus clustering (ICC) to obtain a high-quality consensus partitioning. Extensive experiments on eight real-world text data sets demonstrate the advantages of DIAS over some widely used methods. In particular, DIAS is effective at learning from a large number of very weak basic partitionings. Its natural suitability for distributed computing makes DIAS a promising candidate for big text clustering.
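
The disassemble-assemble idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the ICC algorithm, so plain Euclidean K-means is swapped in for both the basic partitionings and the consensus step, and the names `kmeans`, `dias_sketch`, `n_views`, and `sample_ratio` are hypothetical choices made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, rng, iters=20):
    """Plain Lloyd K-means; returns one cluster label per row.

    Assumption: Euclidean K-means stands in for the information-theoretic
    clustering named in the abstract, whose details are not given there.
    """
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: sq_dist(row, centers[c]))
            for row in rows
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def dias_sketch(docs, k, n_views=20, sample_ratio=0.3, seed=0):
    """Hypothetical disassemble-assemble sketch over a dense doc-term matrix."""
    rng = random.Random(seed)
    n_features = len(docs[0])
    n_sampled = max(1, int(sample_ratio * n_features))
    encoded = [[] for _ in docs]
    for _ in range(n_views):
        # Disassemble: cluster on a random subset of the features.
        cols = rng.sample(range(n_features), n_sampled)
        view = [[doc[j] for j in cols] for doc in docs]
        labels = kmeans(view, k, rng)
        # Record this basic partitioning as a one-hot block per document.
        for row, lab in zip(encoded, labels):
            row.extend(1.0 if c == lab else 0.0 for c in range(k))
    # Assemble: cluster the concatenated one-hot encodings for a consensus.
    return kmeans(encoded, k, rng)


# Toy term-weight matrix: two documents per topic, six terms.
toy_docs = [
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 1.0, 1.0],
]
consensus = dias_sketch(toy_docs, k=2, n_views=5, sample_ratio=0.5, seed=1)
```

Because each view is clustered independently of the others, the loop over views could in principle be run in parallel, which matches the abstract's remark about suitability for distributed computing. The consensus labels are arbitrary cluster identifiers, so only the grouping they induce is meaningful.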

     
