Implementation of Agglomerative Hierarchical Clustering Algorithm Applying the Map-Reduce Parallel Approach
Subject Areas : Renewable energyFahimeh Tavakoli 1 , Faramarz Safi-Esfahani 2 *
1 - Faculty of Computer Engineering- Najafab Branch, Islamic Azad University, Najafabad, Iran
2 - Big Data Research Center, Najafabad Branch, Islamic Azad University, Najafabad, Iran.
Keywords: parallel programming, clustering algorithms, Map-Reduce, Hadoop,
Abstract :
The map-reduce model is a method for executing large data applications. It is also a parallel programming model for writing applications that can be executed on the cloud. Organizations are increasingly producing data that is generated by business processes, user activities, website tracking, sensors, finance, accounting, and more. Data clustering algorithms are used as tools for analyzing large volumes of data. The main purpose of these algorithms is to categorize data into clusters so that the data objects in each cluster are more similar. In this paper, a dense hierarchical clustering algorithm, one of the data mining techniques, is implemented using map-reduce design and then the results of this algorithm are compared with the usual one. Experiments show that runtime decreases with increasing input data size. The runtime of the algorithm improved by 16.80% for the 200 data-point dataset, and 29.26% for the dataset with 1000 data points. The percentage of CPU usage in the parallel system also increased from 22% to 94%.
[1] E. Begoli and J. Horey, "Principles for Effective Knowledge Discovery from Big Data", Proceeding of the IEEE/IFIP, Helsinki, Finland, pp. 1-4, Aug 2012 (doi: 10.1109/WICSA-ECSA.212.32).
[2] A. Jain and R. Dubes,” Algorithms for Clustering Data”, NJ: Prentice-Hall, Inc., January 1988.
[3] Jain, A.K., Murty, M.N. and Flynn, P.J., “Data clustering: a review”, ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264-323, 1999 (doi: 10.1145/331499.331504).
[4] Dean, J. and Ghemawat, S., “MapReduce: simplified data processing on large clusters”, Communications of the ACM, 51(1), pp.107-113 (doi: 10.1145/1327452.1327492).
[5] Rohlf, F.J., "Hierarchical clustering using the minimum spanning tree Computer", The Computer Journal, vol. 16, pp. 93-95, 1973.
[6] F. Murtagh, "Multidimensional Clustering Algorithms," Physica-Verlag, 1985.
[7] J. Ward, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236-244, 1963.
[8] G. Lance and W. Williams, "A General Theory of Classificatory Sorting Strategies 1. Hierarchical Systems", The Computer Journal, vol. 9, no. 4, pp. 373-380, 1967.
[9] W. Pengcheng, H. Fangcheng, L. Li, S. Chuanfu and J. Li, "Research on large data set clustering method based on MapReduce," Neural Computing and Applications, vol. 32, p. 93–99, 2020 (doi: 10.1007/s00521- 018-3780-y)
[10] V. S. Bawane and S. M. Kale, "Clustering algorithms in map reduce: A Review", International Journal of Computer Applications, vol. 975, pp. 15-18, 2015.
[11] D. C. Radhika. P. Lalita., "Distributed clustering for big data with map reduce," IOSR Journal of Computer Engineering, vol. 19, no. 3, pp. 25-28, May 2017 (doi: 10.9790/0661-1903032528).
[12] S. Alshammari, M. Binti Zolkepli, B. A. Rusli, "Genetic algorithm based parallel k-means data clustering algorithm using MapReduce programming paradigm on Hadoop environment (GAPKCA)", International Conference on Soft Computing and Data Mining, SCDM 2020: Recent Advances on Soft Computing and Data Mining, Langkawi, Malaysia, Jan. 2020.
[13] Y. Miao, J. Zhang, H. Feng, L. Qiu and Y. Wen, “A fast algorithm for clustering with map reduce”, vol. 7951, pp. 25-38, Berlin, Heidelberg: Lecture Notes in Computer Science, 2013.
[14] P. P. Anchalia, A. Koundinya and S. Nk, "MapReduce design of K-means clustering algorithm", Proceeding of the IEEE/ ICISA, Suwon, South Korea, pp. 1-5, June 2013 (doi: 10.1109/ICISA.2013.6579448).
[15] W. Zhao, H. Ma, Q. He, “Parallel K-means clustering based on MapReduce", Cloud Computing, CloudCom 2009. Lecture Notes in Computer Science, vol. 5931, p. 674–679, Berlin Heidelberg: Springer-Verlag LNCS, 2009 (doi: 10.1007/978-3-642-10665-1_71).
[16] J. Ragaventhiran, D. M. Kavitha, "Map-optimize-reduce: CAN tree assisted FP-growth algorithm for clusters-based FP mining on Hadoop", Future Generation Computer Systems, vol. 103, pp. 111-122, Feb. 2020 (doi: 10.1016/j.future.2019.09.041).
[17] T. Habib Sardar, Z. Ansari, "An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm", Future Computing and Informatics Journal, vol. 3, no. 2, pp. 200-209, Dec. 2018 (doi: 10.1016/j.fcij.2018.03.003).
[18] M. A. B. Haj-Kacem, C.-E. Ben N’cir. N. Essoussi., "One-pass MapReduce-based clustering method for mixed large-scale data", Journal of Intelligent Information Systems, vol. 52, pp. 619–636, June 2019 (doi: 10.1007/s10844-017-0472-5).
[19] S. Chowdam., N. Kasiviswanath. P. C. Reddy., "Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop", Journal of Big Data, vol. 4, no. 1, pp. 1-27, 2017 (doi: 10.1186/ s40537-017-0087-2).
[20] D. Teffer, R. Srinivasan, J. Ghosh, "AdaHash: Hashing-based scalable, adaptive hierarchical clustering of streaming data on MapReduce frameworks", International Journal of Data Science and Analytics, vol. 8, pp. 257–267, 2019 (doi: 10.1007/s41060-018-0145-7)
[21] "wikipedia," [Online]. Available: https://en.wikipedia.org/wiki/Jaccard_index.
[22] M. Ngazimbi, "Data clustering using MapReduce", PhD Thesis, Boise State University, 2009.
[23] T. Sun, C. Shu, F. Li, H. Yu, L. Ma, Y. Fang, "An efficient hierarchical clustering method for large datasets with Map-Reduce", Proceeding of the IEEE/PDCAT, pp. 494-499, 2009 (doi: 10.1109/PDCAT.2009.46).
[24] S. Papadimitriou, J. Sun, "DisCo: Distributed co-clustering with MapReduce:A case study towards petabyte-scale end-to-end mining", Proceeding of the IEEE/ICDM, Pisa, Italy, Dec. 2008 (doi: 10.1109/ICDM. 2008.142).
[25] E. M. Rasmussen and P. Willett, "Efficiency of hierarchical agglomerative clustering using the ICL distributed array processor", Journal of Documentation, vol. 45, no. 1, pp. 1-24, 1989 (doi: 10.1108/eb0268 36).
[26] C. F. Olson, "Parallel Algorithms for Hierarchical Clustering", Parallel Computing, vol. 21, no. 8, pp. 1313-1325, Aug. 1995 (10.1016/0167-8191(95)00017-I).
[27] Z. Li, K. Li, D. Xiao, L. Yang, "An adaptive parallel hierarchical clustering algorithm", Proceeding of the HPCC, Springer, Berlin, Heidelberg, pp. 97-107, 2007 (doi: 10.1007/978-3-540-75444-2_15).
[28] S. Rajasekara, "Efficient parallel hierarchical clustering algorithms", IEEE Trans. on Parallel and Distributed Systems, vol. 16, no. 6, pp. 27-34, June 2005 (doi: 10.1109/TPDS.2005.72).
[29] M. Dash, S. Petrutiu, P. Scheuermann, "Efficient parallel hierarchical clustering", IEEE Trans. on Parallel and Distributed Systems, vol. 16, no. 6, pp. 497-502, 2005 (doi: 10.1109/TPDS.2005.72).
[30] S. Guha, R. Rastogi, K. Shim, "ROCK: A robust clustering algorithm for categorical attributes", Information System, vol. 32, no. 8, pp. 345- 366, 2000 (doi: 10.1016/S0306-4379(00)00022-3).
[31] G. Karypis, E. Han, V. Kumar, "CHAMELEON: A hierarchical clustering algorithm using dynamic modeling", Computer, vol. 32, no. 8, pp. 68-75, Aug. 1999 (doi: 10.1109/2.781637).
[32] E. L. Johnson, H. Kargupta, "Collective, hierarchical clustering from distributed, heterogeneous data", Large-Scale Parallel Data Mining, pp. 221-244, 2000 (doi: 10.1007/3-540-46502-2_12).
[33] N. Samatova, G. Ostrouchov, A. Geist and A. Melec, "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets", Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002 (doi: 10.1023/A:1013988102576).
[34] H. Gao, J. Jiang, L. She, Y. Fu, "A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework", International Journal of Digital Content Technology and its Applications, vol. 4, no. 3, 2016.
[35] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: An efficient data clustering method for very large databases", SIGMOD, vol. 25, no. 2, Association for Computing Machinery, Jun. 1996 (doi: 10.1145/ 235968.233324).
[36] "Apache Hadoop," [Online]. Available: http://hadoop.apache.org. [Accessed 2019].
_||_