An Artificial Intelligence Framework for Supporting Coarse-Grained Workload Classification in Complex Virtual Environments
Subject areas: Transactions on Fuzzy Sets and Systems
Alfredo Cuzzocrea 1, Enzo Mumolo 2, Islam Belmerabet 3, Abderraouf Hafsaoui 4
1 - iDEA Lab, University of Calabria, Rende, Italy & Department of Computer Science, University of Paris City, Paris, France.
2 - Department of Engineering, University of Trieste, Trieste, Italy.
3 - iDEA Lab, University of Calabria, Rende, Italy.
4 - iDEA Lab, University of Calabria, Rende, Italy.
Keywords: Classification, Virtual machines, Workload, Dempster-Shafer theory
Abstract:
We propose Cloud-based machine learning tools for enhanced Big Data applications. The main idea is to predict the “next” workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combines different well-known classifiers in order to enhance the overall accuracy of the final classification, which is currently very relevant in the specific context of Big Data. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based Big Data applications. Implementation-wise, our method deploys the Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent classical “commodity” settings for Cloud-based Big Data applications. Given a number of known reference workloads and an unknown workload, in this paper we deal with the problem of finding the reference workload that is most similar to the unknown one. This scenario is useful in a plethora of modern information system applications. We call this problem coarse-grained workload classification because, instead of characterizing the unknown workload in terms of finer behaviors, such as CPU-, memory-, disk-, or network-intensive patterns, we classify the whole unknown workload as one of the (possible) reference workloads. Reference workloads represent a category of workloads that are relevant in a given application environment. In particular, we focus on the classification problem described above in the special case of virtualized environments. Today, Virtual Machines (VMs) have become very popular because they offer important advantages to modern computing environments such as Cloud computing or server farms. In virtualization frameworks, workload classification is very useful for accounting, security, or user profiling.
Hence, our research is especially meaningful in such environments, and it turns out to be very useful in an emerging context such as Cloud Computing. In this respect, our approach consists of running several machine learning-based classifiers of different workload models, and then deriving the best classifier through Dempster-Shafer fusion, in order to increase the accuracy of the final classification. Experimental assessment and analysis confirm the benefits of our classification framework. The running programs that produce the unknown workloads to be classified are treated in a similar way. A fundamental aspect of this paper is the successful use of data fusion in workload classification. Different types of metrics are fused together using the Dempster-Shafer theory of evidence combination, giving a classification accuracy of slightly less than 80%. The acquisition of data from the running process, the pre-processing algorithms, and the workload classification are described in detail. Various classical classification algorithms have been applied to the workloads, and their results are compared.
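To make the fusion step concrete, the following is a minimal sketch of Dempster's rule of combination applied to the outputs of two classifiers. It is not the authors' implementation: the workload labels, the classifier scores, the reliability weights, and the helper names (`combine`, `from_scores`, `classify`) are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical frame of discernment: the set of reference workloads.
FRAME = frozenset({"web", "database", "compile"})


def from_scores(scores, reliability):
    """Turn classifier scores into a mass function (illustrative assumption).

    The share `reliability` is spread over the single labels in proportion
    to their scores; the rest is assigned to the whole frame (ignorance).
    """
    total = sum(scores.values())
    mass = {frozenset({k}): reliability * v / total for k, v in scores.items()}
    mass[FRAME] = 1.0 - reliability
    return mass


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of labels to a mass; masses sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on contradictory evidence
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict: sources cannot be combined")
    # Normalise by the non-conflicting mass (1 - K).
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def classify(mass):
    """Return the single label that carries the highest combined mass."""
    singletons = {s: w for s, w in mass.items() if len(s) == 1}
    best = max(singletons, key=singletons.get)
    return next(iter(best))


# Made-up outputs of two hypothetical classifiers for one unknown workload.
m_a = from_scores({"web": 0.6, "database": 0.3, "compile": 0.1}, reliability=0.8)
m_b = from_scores({"web": 0.5, "database": 0.4, "compile": 0.1}, reliability=0.7)

fused = combine(m_a, m_b)
print(classify(fused))
```

Because the rule is associative, further classifiers can be folded in by calling `combine` repeatedly on the running result. Assigning the unreliable share of each classifier to the whole frame is one common discounting choice; other ways of building mass functions would work with the same `combine` function.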