SDC-causing Error Detection and Mitigation Based on Failure Rate Prediction without Fault Injection
Subject Areas :
Multimedia Processing, Communications Systems, Intelligent Systems
Moona Yakhchi 1, Mahdi Fazeli 2, Seyed Amir Asghari Toochai 3
1 - PhD Student, Department of Computer, Borujerd Branch, Islamic Azad University, Borujerd, Iran.
2 - Associate Professor, School of Information Technology, Halmstad University, Halmstad, Sweden.
3 - Assistant Professor, Electrical and Computer Engineering Department, Kharazmi University, Tehran, Iran.
Received: 2022-09-23
Accepted : 2023-03-10
Published : 2022-12-23
Keywords: Silent Data Corruption, Multi-Bit Fault, Soft Errors, Machine Learning, Fault Injection
Abstract :
Introduction: As processing components shrink and the probability of failure rises even in ordinary components, maintaining reliability has become a serious challenge for today's computer systems. Soft errors can lead to silent data corruption (SDC), which seriously compromises the reliability of a system. An SDC is a fault that affects running software and leads to incorrect output. Detecting SDCs requires a profile of the instructions that cause them, in order to decide which instructions to protect. Current approaches use machine learning algorithms to predict the SDC occurrence rate of each instruction, but most existing algorithms suffer from inaccuracy. Most current detection techniques also require sufficient fault injection data for training, which is difficult to obtain because of its high resource cost, such as execution time and code size. Moreover, as technology scales down toward nanoscale sizes, multiple-bit soft errors are emerging as an important reliability challenge. Identifying the vulnerable points of a program in the presence of faults has therefore become very important.
Method: Traditional redundancy-based solutions are very expensive in terms of chip area, energy consumption, and performance. Consequently, low-cost and efficient approaches to coping with SDCs have received more attention from researchers than ever, and the lack of a high-precision method that works without fault injection has become a research challenge. Because fault injection is costly in complex systems, SDCs are identified here with a machine learning method that does not require injecting faults into the entire software. Multi-bit faults and SDCs that originate in instructions are also considered. To this end, we propose an M5rule decision tree model that detects SDC errors by calculating the importance of each instruction feature to vulnerability. We then apply an error detection method that sorts the instructions and duplicates the critical ones.
Results: We evaluated our model on the MiBench benchmarks with multiple test programs. The results show an overhead of 58% with an SDC coverage rate of about 99%.
Discussion: We considered not only single-bit faults but also multiple-bit faults, and faults were injected into both instructions and data. The evaluation results show that our method achieves better detection accuracy than other state-of-the-art methods.
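The selection step mentioned in the Method (sort the instructions by predicted vulnerability, then duplicate the critical ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Instruction` record, the greedy budget rule, the weighting of predicted SDC rate by execution count, and all numbers in the usage note are assumptions made for illustration only. The predicted rates are taken as given, for example as the output of a regression model such as the one the abstract describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    name: str             # hypothetical instruction identifier
    predicted_sdc: float  # predicted SDC rate in [0, 1] (assumed model output)
    exec_count: int       # dynamic execution count from profiling (assumed)


def select_for_duplication(instructions, overhead_budget):
    """Greedily pick instructions to duplicate, most SDC-prone first.

    overhead_budget is the allowed fraction of extra dynamic instructions;
    an instruction is skipped if duplicating it would exceed the budget.
    """
    total = sum(i.exec_count for i in instructions)
    if total == 0:
        return []
    # Sort by predicted SDC rate, highest first.
    ranked = sorted(instructions, key=lambda i: i.predicted_sdc, reverse=True)
    selected, extra = [], 0
    for ins in ranked:
        if (extra + ins.exec_count) / total <= overhead_budget:
            selected.append(ins)
            extra += ins.exec_count
    return selected


def estimated_coverage(instructions, selected):
    """Share of the expected SDCs that falls on the protected instructions.

    Expected SDCs per instruction are approximated (an assumption) as
    predicted_sdc * exec_count.
    """
    def weight(i):
        return i.predicted_sdc * i.exec_count

    total = sum(weight(i) for i in instructions)
    return sum(weight(i) for i in selected) / total if total else 0.0
```

As a usage example with made-up values, a program of four instructions `load_a` (0.10, 100 executions), `mul_b` (0.60, 50), `add_c` (0.30, 50), and `cmp_d` (0.05, 200) under a budget of 0.25 would have `mul_b` and `add_c` selected, since together they add exactly a quarter of the dynamic instruction count. Raising the budget trades more overhead for more estimated coverage, which is the trade-off the reported overhead and coverage figures describe.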