Review on Semantic Document Clustering

SK Ahammad Fahad
Wael M.S. Yafooz


In the age of information technology, the volume of textual documents is growing rapidly, both online and offline. These documents range from product information to company profiles, and many sources produce valuable text: medical reports, economic analyses, scientific journals, news, blogs, and so on. Maintaining and accessing these documents is very difficult without proper classification. Only a few documents are already classified; the rest are unlabeled and still need to be organized, and in this context clustering, which is unsupervised, is the only solution. Traditional clustering and text clustering differ in some respects, because relations between words are very important when clustering text. Semantic clustering has proven to be a more appropriate clustering technique for text. In this paper we review clustering techniques, from general clustering to semantic document clustering, and discuss the advantages and disadvantages of the various clustering methods.

Artificial intelligence

