References
[1] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358, 2017.
[2] Mirzaei, A. and Najafi Souha, A., 2021. Towards optimal configuration in MEC neural networks: deep learning-based optimal resource allocation. Wireless Personal Communications, 121(1), pp. 221-243.
[3] Zhou, Guoliang, and Amin Mohajer. "Blind reconfigurable intelligent surfaces for dynamic offloading in fixed-NOMA mobile edge networks." International Journal of Sensor Networks 46, no. 3 (2024): 142-160.
[4] H. Guo, J. Li, J. Liu, N. Tian, and N. Kato, “A survey on space-air-ground-sea integrated network security in 6G,” IEEE Communications Surveys & Tutorials, vol. 24, no. 1, pp. 53–87, 2022.
[5] Duan, H., & Mirzaei, A. (2023). Adaptive Rate Maximization and Hierarchical Resource Management for Underlay Spectrum Sharing NOMA HetNets with Hybrid Power Supplies. Mobile Networks and Applications, 1-17.
[6] Zhou, Nan, Ya Nan Li, and Amin Mohajer. "Distributed capacity optimisation and resource allocation in heterogeneous mobile networks using advanced serverless connectivity strategies." International Journal of Sensor Networks 45, no. 3 (2024): 127-147.
[7] X. Huang, Y. Chen, J. Liu, M. Wang, P. Li, and Q. Zhao, “Joint interdependent task scheduling and energy balancing for multi-UAV enabled aerial edge computing: A multi-objective optimization approach,” IEEE Internet of Things Journal, vol. 10, no. 4, pp. 3147–3160, 2023.
[8] Z. Yang, C. Pan, K. Wang, and M. Shikh-Bahaei, “Energy efficient resource allocation in UAV enabled mobile edge computing networks,” IEEE Transactions on Wireless Communications, vol. 18, no. 9, pp. 4576–4589, 2019.
[9] Mohajer, Amin, Mohammad Yousefvand, Ehsan Noori Ghalenoo, Parviz Mirzaei, and Ali Zamani. "Novel approach to sub-graph selection over coded wireless networks with QoS constraints." IETE Journal of Research 60, no. 3 (2014): 203-210.
[10] X. Zhang, J. Zhang, J. Xiong, L. Zhou, J. Wei, and H. Li, “Energy-efficient multi-UAV-enabled multi-access edge computing incorporating NOMA,” IEEE Internet of Things Journal, vol. 7, no. 6, pp. 5613–5627, 2020.
[11] Mirzaei, A. (2022). A novel approach to QoS‐aware resource allocation in NOMA cellular HetNets using multi‐layer optimization. Concurrency and Computation: Practice and Experience, 34(21), e7068.
[12] T. Zhang, Y. Xu, J. Loo, D. Yang, L. Xiao, and Y. Zhao, “Joint computation and communication design for UAV-assisted mobile edge computing in IoT,” IEEE Transactions on Industrial Informatics, vol. 16, no. 8, pp. 5505–5516, 2020.
[13] Z. Liu, X. Tan, M. Wen, S. Wang, C. Liang, and Q. Zhao, “An energy-efficient selection mechanism of relay and edge computing in UAV-assisted cellular networks,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 3, pp. 1306–1318, 2021.
[14] Mohajer, Amin, Javad Hajipour, and Victor CM Leung. "Dynamic Offloading in Mobile Edge Computing with Traffic-Aware Network Slicing and Adaptive TD3 Strategy." IEEE Communications Letters (2024).
[15] Yang, Jiuting, and Amin Mohajer. "Multi objective constellation optimization and dynamic link utilization for sustainable information delivery using PD-NOMA deep reinforcement learning." Wireless Networks (2024): 1-21.
[16] Somarin, A. M., Barari, M., & Zarrabi, H. (2018). Big data based self-optimization networking in next generation mobile networks. Wireless Personal Communications, 101(3), 1499-1518.
[17] Kuang, Shuhong, Jiyong Zhang, and Amin Mohajer. "Reliable information delivery and dynamic link utilization in MANET cloud using deep reinforcement learning." Transactions on Emerging Telecommunications Technologies 35, no. 9 (2024): e5028.
[18] Hua, Yuxiu, Rongpeng Li, Zhifeng Zhao, Xianfu Chen, and Honggang Zhang. "GAN-powered deep distributional reinforcement learning for resource management in network slicing." IEEE Journal on Selected Areas in Communications 38, no. 2 (2019): 334-349.
[19] X. Qin, Z. Song, Y. Hao, and X. Sun, “Joint resource allocation and trajectory optimization for multi-UAV-assisted multi-access mobile edge computing,” IEEE Wireless Communications Letters, vol. 10, no. 7, pp. 1400–1404, 2021.
[20] Wang, Qianxing, Wei Li, and Amin Mohajer. "Load-aware continuous-time optimization for multi-agent systems: Toward dynamic resource allocation and real-time adaptability." Computer Networks 250 (2024): 110526.
[21] H. Hu, Z. Chen, F. Zhou, Z. Han, and H. Zhu, “Joint resource and trajectory optimization for heterogeneous-UAVs enabled aerial-ground cooperative computing networks,” IEEE Transactions on Vehicular Technology, vol. 72, no. 6, pp. 7119–7133, 2023.
[22] Mirzaei, A., Barari, M., & Zarrabi, H. (2019). Efficient resource management for non-orthogonal multiple access: A novel approach towards green HetNets. Intelligent Data Analysis, 23(2), 425-447.
[23] Gu, LiFen, and Amin Mohajer. "Joint throughput maximization, interference cancellation, and power efficiency for multi-IRS-empowered UAV communications." Signal, Image and Video Processing 18, no. 5 (2024): 4029-4043.
[24] G. Chen, Q. Wu, R. Liu, J. Wu, and C. Fang, “IRS aided MEC systems with binary offloading: A unified framework for dynamic IRS beamforming,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 2, pp. 349–365, 2023.
[25] X. Li, Y. Qin, J. Huo, and W. Huangfu, “Computation offloading and trajectory planning of multi-UAV-enabled MEC: A knowledge-assisted multiagent reinforcement learning approach,” IEEE Internet of Things Journal, 2023.
[26] Yang, Ting, Jiabao Sun, and Amin Mohajer. "Queue stability and dynamic throughput maximization in multi-agent heterogeneous wireless networks." Wireless Networks (2024): 1-27.
[27] Mirzaei, A., & Rahimi, A. (2019). A Novel Approach for Cluster Self-Optimization Using Big Data Analytics. Information Systems & Telecommunication, 50.
[28] Y. Gu, C. Yin, Y. Guo, B. Xia, and Z. Chen, “Communication-computation-aware user association in MEC HetNets: A meta-analysis,” IEEE Transactions on Wireless Communications, vol. 22, no. 9, pp. 6090–6105, 2023.
[29] Zhang, Qi, Zhigang Li, Zhenteng Qin, Xiaochuan Sun, and Haijun Zhang. "Temporal Feature-Enhanced Deep Reinforcement Learning for RAN Slicing with User Mobility." IEEE Communications Letters (2023).
[30] F. Zhou, Y. Wu, R. Q. Hu, and Y. Qian, “Computation rate maximization in UAV-enabled wireless-powered mobile-edge computing systems,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 9, pp. 1927–1941, 2018.
[31] Q. Hu, Y. Cai, G. Yu, Z. Qin, M. Zhao, and G. Y. Li, “Joint offloading and trajectory design for UAV-enabled mobile edge computing systems,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 1879–1892, 2019.
[32] Zhao, Zhongyong, Yu Chen, Jiangnan Liu, Yingying Cheng, Chao Tang, and Chenguo Yao. "Evaluation of operating state for smart electricity meters based on transformer–encoder–BiLSTM." IEEE Transactions on Industrial Informatics 19, no. 3 (2022): 2409-2420.
[33] Mohajer, Amin, Maryam Bavaghar, Rashin Saboor, and Ali Payandeh. "Secure dominating set-based routing protocol in MANET: Using reputation." In 2013 10th International ISC Conference on Information Security and Cryptology (ISCISC), pp. 1-7. IEEE, 2013.
[34] Y. Xu, T. Zhang, Y. Liu, D. Yang, L. Xiao, and M. Tao, “Cellular-connected multi-UAV MEC networks: An online stochastic optimization approach,” IEEE Transactions on Communications, vol. 70, no. 10, pp. 6630–6647, 2022.
[35] Nemati, Z., Mohammadi, A., Bayat, A., & Mirzaei, A. (2024). Metaheuristic and Data Mining Algorithms-based Feature Selection Approach for Anomaly Detection. IETE Journal of Research, 1-15.
[36] Li, Rongpeng, Chujie Wang, Zhifeng Zhao, Rongbin Guo, and Honggang Zhang. "The LSTM-based advantage actor-critic learning for resource management in network slicing with user mobility." IEEE Communications Letters 24, no. 9 (2020): 2005-2009.
[37] L. Zhang, J. Li, Y. Wang, Z. Chen, Q. Liu, and Y. Sun, “Task offloading and trajectory control for UAV-assisted mobile edge computing using deep reinforcement learning,” IEEE Access, vol. 9, pp. 53708–53719, 2021.
[38] X. Zhang, J. Zhang, J. Xiong, L. Zhou, J. Wei, and H. Li, “Energy-efficient multi-UAV-enabled multi-access edge computing incorporating NOMA,” IEEE Internet of Things Journal, vol. 7, no. 6, pp. 5613–5627, 2020.
[39] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and L. Hanzo, “Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing,” IEEE Transactions on Cognitive Communications and Networking, vol. 7, no. 1, pp. 73–84, 2021.
[40] T. Zhang, Y. Xu, J. Loo, D. Yang, L. Xiao, and Y. Zhao, “Joint computation and communication design for UAV-assisted mobile edge computing in IoT,” IEEE Transactions on Industrial Informatics, vol. 16, no. 8, pp. 5505–5516, 2020.
[41] Z. Liu, X. Tan, M. Wen, S. Wang, C. Liang, and Q. Zhao, “An energy-efficient selection mechanism of relay and edge computing in UAV-assisted cellular networks,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 3, pp. 1306–1318, 2021.
[42] Yan, Dandan, Benjamin K. Ng, Wei Ke, and Chan-Tong Lam. "Deep reinforcement learning based resource allocation for network slicing with massive MIMO." IEEE Access (2023).
[43] C.-Y. Hsieh, Y. Ren, and J.-C. Chen, “Edge-cloud offloading: Knapsack potential game in 5G multi-access edge computing,” IEEE Transactions on Wireless Communications, vol. 22, no. 4, pp. 3124–3136, 2023.
[44] N. Zhao, C. Xu, W. Zhang, S. Yang, G.-M. Muntean, and F. Zhou, “5G-enabled UAV-to-community offloading: Joint trajectory design and task scheduling,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 11, pp. 3306–3320, 2021.
[45] H. Guo and J. Liu, “UAV-enhanced intelligent offloading for internet of things at the edge,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2737–2746, 2020.
[46] Wang, Zhaoying, Yifei Wei, F. Richard Yu, and Zhu Han. "Utility optimization for resource allocation in multi-access edge network slicing: A twin-actor deep deterministic policy gradient approach." IEEE Transactions on Wireless Communications 21, no. 8 (2022): 5842-5856.
[47] X. Qin, Z. Song, Y. Hao, and X. Sun, “Joint resource allocation and trajectory optimization for multi-UAV-assisted multi-access mobile edge computing,” IEEE Wireless Communications Letters, vol. 10, no. 7, pp. 1400–1404, 2021.
[48] M. Li, N. Cheng, J. Gao, Y. Wang, L. Zhao, and X. Shen, “Energy-efficient UAV-assisted mobile edge computing: Resource allocation and trajectory optimization,” IEEE Transactions on Vehicular Technology, vol. 69, no. 3, pp. 3424–3438, 2020.
[49] Wang, Yue, Yu Gu, and Xiaofeng Tao. "Edge network slicing with statistical QoS provisioning." IEEE Wireless Communications Letters 8, no. 5 (2019): 1464-1467.
[50] H. Guo and J. Liu, “UAV-enhanced intelligent offloading for internet of things at the edge,” IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2737–2746, 2020.
Journal of Optimization in Soft Computing (JOSC), Vol. 2, Issue 4, pp. 1-15, Winter 2024
Journal homepage: https://sanad.iau.ir/journal/josc
Paper Type: Research Paper
Real-Time Scalable Task Offloading in Edge Computing Using Semi-Markov Decision Processes and Attention-Based Deep Reinforcement Learning
Abbas Mirzaei 1,*, Nasser Mikaeilvand 2, Babak Nouri-Moghaddam 1, Sajjad Jahanbakhsh Gudakahriz 3, Ailin Khosravani 1, Fatemeh Tahmasebizade 1, Ali Seifi 1, Hosein Hatami 1
1. Department of Computer Engineering, Ardabil Branch, Islamic Azad University, Ardabil, Iran
2. Department of Computer Engineering, Central Tehran Branch, Islamic Azad University, Tehran, Iran
3. Department of Computer Engineering, Germi Branch, Islamic Azad University, Germi, Iran
Article Info:
Article History: Received: 2024/11/28; Revised: 2025/01/05; Accepted: 2025/02/02
DOI:
Abstract: Edge computing has emerged as a dynamic framework in which computational tasks are offloaded to distributed edge servers (ESs) to provide low-latency, efficient services. As edge systems grow in scale and complexity, Deep Reinforcement Learning (DRL) has become a prominent approach to optimizing task offloading and resource management. However, traditional DRL-based methodologies encounter several challenges: (1) discrete-time decision frameworks, such as Markov Decision Processes (MDPs), often enforce offloading in fixed timeslots, leading to scheduling delays and inefficient resource utilization; (2) static computational structures struggle to adapt to varying numbers of edge servers or user devices, resulting in scalability issues and system inefficiencies. To overcome these limitations, we introduce a novel DRL-driven real-time offloading mechanism tailored to dynamic and scalable edge environments. Our approach reformulates the offloading problem within a Semi-Markov Decision Process (SMDP) framework and introduces an adaptive optimization mechanism utilizing attention-based graph operations for heterogeneous resource environments. Analogous to task prioritization and resource division, the mechanism computes attention scores that determine how much weight each arriving task receives and which server should execute it. To further bridge the gap between theoretical models and real-world dynamics, we employ a homotopy-based reward-adjustment method that combines model-based and real-time feedback, enabling the system to learn and steadily improve its performance.
Keywords: Edge Computing; Task Scheduling; Reinforcement Learning; System Scalability.
*Corresponding Author’s Email Address: mirzaei_class_87@yahoo.com
1. Introduction
The rapid expansion of mobile networks and the proliferation of connected devices have transformed modern computing environments. From autonomous vehicles to immersive augmented reality applications, the demand for high-speed, low-latency services has surged. Traditional cloud computing architectures, despite their powerful centralized resources, often fall short of meeting these latency-sensitive requirements due to long transmission distances and centralized processing bottlenecks [1]-[5]. This gap has driven the evolution of edge computing, which brings computation and storage closer to end-users by deploying edge servers (ESs) in the network's proximity. Within this paradigm, tasks may be executed locally or offloaded to nearby ESs. While ESs offer more robust computational capabilities than user devices (UDs), uploading tasks to ESs introduces additional energy consumption and latency. Moreover, the computational capacity of ESs remains constrained compared to centralized cloud servers, making them unsuitable for handling large volumes of concurrent tasks. Resource contention among multiple tasks can degrade system performance and quality of service (QoS) [6], [7]. Consequently, devising an efficient scheduling mechanism for task offloading has become critical. Such mechanisms aim to optimize the selection of offloading targets and resource allocation strategies [8], often framed as mixed-integer nonlinear programming (MINLP) problems, which are known to be NP-hard [9].
Initially, mathematical approaches [10] were developed to solve these optimization problems. However, these model-based methods struggle to generalize across diverse edge systems characterized by heterogeneous transmission technologies, application requirements, and computational resources. To address this limitation, model-free metaheuristic algorithms [11, 12] were introduced for task offloading. Despite their flexibility, these algorithms face significant challenges, including large search spaces and poor adaptability to dynamic edge environments. In recent years, Deep Reinforcement Learning (DRL) has demonstrated exceptional capabilities across various domains, such as robotics control, autonomous driving, and natural language processing. Leveraging deep neural networks, DRL combines high-dimensional data analysis with model-free learning, making it a compelling choice for dynamic edge systems. Its online learning capabilities enable adaptive policy updates through continuous interaction with the environment, offering real-time adaptability to evolving edge conditions. As a result, DRL-based methods have shown promising results in optimizing task offloading and resource allocation in edge computing [13]-[16]. Despite these advantages, DRL-based approaches face inherent limitations, as illustrated in Fig. 1. First, these methods typically rely on discrete-time Markov Decision Processes (MDPs), where decisions are made at fixed intervals. This framework necessitates batch processing of tasks, causing delays as tasks wait for the next decision interval to be scheduled [17]. Such wait-for-scheduling latency increases resource contention and lowers task completion rates, particularly in systems with stringent delay requirements. Second, traditional DRL methods lack scalability [18, 19]. The fixed computational graph of deep neural networks requires consistent input and output dimensions, making it challenging to adapt to varying system scales [20]. For instance, the dynamic nature of vehicular edge systems, with frequent arrivals and departures of service or user vehicles, renders non-scalable DRL approaches infeasible. Maintaining scalability under these conditions is crucial but often necessitates retraining models, a process that is both time-intensive and computationally expensive.
Figure 1. Challenges in DRL-Based Offloading Approaches.
Transitioning from a batched offloading framework to a real-time approach, where tasks are immediately scheduled upon arrival, intuitively minimizes waiting time and avoids dimensional mismatches caused by fluctuating task volumes. However, the discrete-time MDP framework utilized by classical DRL algorithms is inherently unsuitable for such scenarios [21]-[23]. Additionally, scalability challenges, such as mismatches in the dimensions of inputs and outputs caused by dynamic variations in the number of edge servers (ESs) and user devices (UDs), remain unresolved. To address these challenges, we propose a Real-time and Scalable Task Offloading framework (ReSTO), leveraging a DRL-based methodology.
In ReSTO, the task offloading problem is modeled as a Semi-Markov Decision Process (Semi-MDP) to enable decision-making at arbitrary task arrival times. The framework introduces the Scalable Continuous Proximal Policy Optimization (SCPPO) algorithm, specifically designed to align with the Semi-MDP framework. To ensure scalability, SCPPO employs a heterogeneous graph attention mechanism for feature extraction, translating task-specific characteristics into adaptive attention scores for decision-making. Moreover, we develop a hybrid reward mechanism that integrates model-based and real-time feedback, referred to as the homotopy reward. This reward scheme bridges the gap between theoretical models and real-world dynamics while enhancing exploration efficiency during learning.
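To make the homotopy idea concrete, the following minimal sketch (our illustration; the function name, the linear schedule, and the anneal_steps parameter are assumptions, not the paper's exact design) blends a model-based reward estimate with observed real-time feedback using a coefficient that shifts from the model term toward real feedback as training progresses:

def homotopy_reward(model_reward, real_reward, step, anneal_steps=10000):
    # Homotopy coefficient beta rises linearly from 0 to 1 over training.
    beta = min(1.0, step / anneal_steps)
    # Early training: the smooth model-based reward dominates (aids exploration).
    # Late training: observed real-time feedback dominates (matches reality).
    return (1.0 - beta) * model_reward + beta * real_reward

Under such a scheme, the agent is guided by the theoretical model while its policy is still poor and is progressively corrected by real-world feedback as it improves.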
This paper aims to address the limitations of existing DRL-based task offloading approaches in edge computing environments. Specifically, we focus on:
1- Overcoming the limitations of discrete-time MDPs: We propose a novel continuous-time DRL framework that enables real-time, event-triggered task scheduling, eliminating the need for batch processing and reducing wait-for-scheduling latency (a sketch of the underlying continuous-time formulation follows this list).
2- Improving scalability in dynamic environments: We introduce a scalable DRL architecture that can adapt to varying numbers of tasks and edge servers without requiring extensive model retraining.
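To sketch what such a continuous-time formulation looks like (our notation, consistent with the exponential-decay return model discussed in Section 2), rewards are discounted by the elapsed time between decisions rather than by a slot index. With decision epochs $t_0 < t_1 < \dots$, sojourn times $\tau_k = t_{k+1} - t_k$, and a decay rate $\beta > 0$:

$G = \sum_{k=0}^{\infty} e^{-\beta t_k}\, r_k, \qquad Q^{\pi}(s_k, a_k) = \mathbb{E}\left[\, r_k + e^{-\beta \tau_k}\, Q^{\pi}(s_{k+1}, a_{k+1}) \,\right]$

Because the discount $e^{-\beta \tau_k}$ depends on the actual sojourn time, decisions can be made at arbitrary task arrival instants instead of at fixed slot boundaries.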
By achieving these objectives, we aim to:
· Enhance task completion rates and reduce latency in edge computing systems with stringent performance requirements.
· Improve resource utilization by enabling more efficient task scheduling and allocation.
· Increase the adaptability and robustness of DRL-based offloading solutions in dynamic and unpredictable edge environments.
The key contributions of this work are as follows:
· Introduction of ReSTO Framework:
We propose ReSTO, a novel real-time and scalable task offloading framework. ReSTO models the offloading problem using a Semi-MDP and introduces the SCPPO algorithm for real-time decision-making, eliminating the latency associated with traditional batched scheduling.
· Scalability via Graph Attention Mechanism:
SCPPO employs heterogeneous graph attention operations to extract task and resource features dynamically, enabling adaptive attention score generation. This approach prevents dimensional mismatches as the number of ESs or UDs changes, ensuring scalability (a simplified illustration follows this list).
· Development of Homotopy Reward:
We formulate a hybrid reward system combining theoretical model rewards with real-time feedback. This homotopy reward reduces the disparity between theoretical assumptions and real-world conditions, improving both performance and exploration efficiency.
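To illustrate the scalability contribution above, the sketch below scores an arbitrary number of edge servers for a single arriving task using scaled dot-product attention; this is a simplified stand-in for the heterogeneous graph attention operator in SCPPO, and the feature dimensions and weight shapes are illustrative assumptions:

import numpy as np

def attention_scores(task_feat, server_feats, Wq, Wk):
    # task_feat: (d_t,) features of the arriving task.
    # server_feats: (n, d_s) one row per currently available ES; n may vary.
    # Wq, Wk: learned projections into a shared d_k-dimensional space.
    q = task_feat @ Wq                    # (d_k,) query derived from the task
    K = server_feats @ Wk                 # (n, d_k) keys derived from the servers
    logits = K @ q / np.sqrt(q.shape[0])  # (n,) scaled dot-product scores
    e = np.exp(logits - logits.max())     # softmax over however many ESs exist
    return e / e.sum()                    # dispatch distribution over the n ESs

rng = np.random.default_rng(0)
Wq, Wk = rng.normal(size=(8, 16)), rng.normal(size=(4, 16))
probs = attention_scores(rng.normal(size=8), rng.normal(size=(5, 4)), Wq, Wk)

Because the same projection weights are applied to every server row, adding or removing ESs changes only n, not the trainable parameters, so topology changes do not force retraining.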
The remainder of this paper is organized as follows: Section II reviews related works, particularly focusing on real-time and scalable RL/DRL-based approaches. Section III presents the system model for real-time offloading and the corresponding optimization problem. In Section IV, we detail the ReSTO framework, including the Semi-MDP formulation and the SCPPO algorithm design. Section V evaluates ReSTO’s performance against state-of-the-art algorithms, highlighting its scalability and efficiency. Finally, Section VI concludes the paper with insights and potential future directions.
2. Related Works
In this section, we provide a comprehensive review of DRL-based task offloading methods. Following this, we delve into existing RL/DRL approaches for real-time or scalable task offloading, analyzing their achievements and limitations in comparison to our proposed framework.
A. DRL-Based Task Offloading in Edge Computing
Over the past decade, task offloading in edge computing systems has increasingly relied on Deep Reinforcement Learning (DRL) algorithms due to their capacity for dynamic decision-making and adaptability to complex environments. These algorithms leverage the ability of neural networks to process high-dimensional inputs and learn optimal policies directly through interaction with the environment. Numerous studies have tailored DRL methods to address the unique challenges of edge systems, such as resource constraints, latency requirements, and dynamic user demands. One notable example is the work of Wang et al. [12], who utilize a Deep Q-Network (DQN) to optimize both task offloading and resource configuration in a blockchain-enabled edge computing framework. Their approach introduces trust mechanisms and leverages blockchain for secure and efficient offloading. Similarly, Huang et al. [13] employ a Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for partial offloading systems, where tasks can be split between local and edge processing. This method improves decision-making by accounting for variability in task size and resource availability, demonstrating the potential of DRL in adaptive task allocation.
Building on these foundational approaches, subsequent research has focused on enhancing the performance and robustness of DRL-based task offloading. For instance, Xu et al. [14] and Ma et al. [15] introduce temporal feature extraction to capture the dynamic nature of edge environments, utilizing historical state information to better model system behavior and predict the effects of various actions. This temporal awareness allows the system to adapt to changing workloads and network conditions, leading to more effective offloading strategies.
Moreover, Xu et al. [16] propose an exploration-exploitation strategy tailored to the training process. By prioritizing exploration during the early stages of training and gradually shifting towards the exploitation of learned policies, their approach strikes a balance between discovering new solutions and refining existing ones. This adaptive strategy improves policy performance and ensures more reliable decision-making over time. To address the computational complexity and convergence challenges associated with large action spaces, researchers have also explored hybrid approaches that integrate DRL with traditional optimization techniques. For example, Chen et al. [17] enhance DQN-based task offloading with sequential quadratic programming for resource allocation. This combination reduces the dimensionality of the problem and accelerates convergence, enabling more efficient use of edge resources.
Li et al. [18] take a multi-agent approach, employing a Parameterized Multi-Agent Soft Actor-Critic (SAC) algorithm to address the interdependence of actions across agents. By categorizing actions into those that affect other agents and those that do not, they effectively manage resource contention in collaborative edge environments. The use of a genetic algorithm further refines resource allocation decisions, ensuring optimal system performance.
Despite these advancements, existing DRL-based methods face inherent limitations due to their reliance on the discrete-time Markov Decision Process (MDP) framework. This framework enforces decision-making at fixed intervals, leading to batch processing of tasks. Such a structure introduces scheduling delays, as tasks must wait until the next decision point before offloading can occur [24], [25]. This wait-for-scheduling latency becomes particularly problematic in latency-sensitive applications, where even slight delays can significantly degrade performance. Additionally, most DRL approaches encode system states into a one-dimensional input vector for processing by a multi-layer perceptron (MLP). While this design simplifies implementation, it limits scalability: the fixed input-output dimensions of MLPs cannot adapt to changes in the number of edge servers (ESs) or user devices (UDs), resulting in dimensional mismatches. This lack of flexibility hampers the applicability of DRL algorithms in dynamic edge environments, such as vehicular networks or large-scale IoT systems, where the network topology and resource availability frequently change.
These challenges underscore the need for novel frameworks and algorithms that overcome the constraints of discrete-time MDPs and enable real-time, scalable task offloading in edge computing systems. Future solutions must address both the latency introduced by batch processing and the scalability issues arising from static neural network architectures, paving the way for more adaptive and efficient DRL applications in edge environments.
· Categorization by Objective:
1. Latency Minimization: Focus on methods specifically designed to minimize task completion time or end-to-end delay.
2. Energy Efficiency: Analyze methods that prioritize minimizing energy consumption at the device and network levels.
3. Resource Allocation: Discuss approaches that optimize resource allocation among UDs and ESs, considering factors like CPU, memory, and bandwidth.
4. Load Balancing: Examine methods that aim to distribute the computational load evenly across the available ESs.
B. Real-Time RL/DRL for Task Scheduling
Real-time decision-making is a critical component of task scheduling in edge computing and numerous other domains, where rapid responses to dynamic changes are essential for maintaining system performance and efficiency. However, the discrete-time Markov Decision Process (MDP) framework, which underpins most traditional RL/DRL methods, introduces inherent constraints when applied to real-time applications. By requiring fixed decision intervals, the discrete-time MDP framework creates bottlenecks, such as delays in task execution, that compromise the responsiveness and adaptability of RL-based solutions. Alternative frameworks, such as the multi-armed bandit [26]-[30], have been explored to address some of these challenges. While these models are computationally simpler and focus on optimizing immediate rewards, they often fail to account for the temporal dependencies and cumulative effects of actions. This omission can lead to suboptimal decision-making, particularly in complex and dynamic environments where long-term outcomes must be carefully balanced with short-term gains [31]-[33].
In contrast, the Semi-Markov Decision Process (Semi-MDP) framework is particularly well-suited for real-time scheduling tasks. Unlike the discrete-time MDP, Semi-MDP allows for variable intervals between decision points, making it more flexible and capable of handling tasks as they arrive. This flexibility enables the development of policies that optimize long-term performance while addressing the immediate requirements of real-time systems. For instance, Liang et al. [20] and Hao et al. [21] successfully use Semi-MDPs to model real-time scheduling problems, demonstrating the framework’s potential to accommodate dynamic workloads and varying system conditions. Despite its advantages, adapting existing algorithms to the Semi-MDP framework poses unique challenges due to its structural differences from the traditional MDP approach. One common strategy involves normalization, which converts Semi-MDP problems into an MDP-compatible format, allowing established DRL algorithms to be applied. For example, Liang et al. [22] normalize Semi-MDP problems by estimating theoretical model-based Q-values for supervised pre-training [34]-[36]. This approach provides a starting point for the policy, which is then refined through interactions with the environment. Similarly, Wu et al. [23] utilize state transition probabilities during the normalization process to transform Semi-MDPs into a form solvable by value iteration techniques.
An alternative to normalization-based methods is the direct design of algorithms tailored to the Semi-MDP framework. These approaches avoid the approximations and assumptions inherent in normalization, enabling more accurate modeling of real-world scenarios. For example, Van Huynh et al. [24] propose a Dueling Double Deep Q-Network (DDQN) approach that maximizes cumulative single rewards without incorporating discount factors, focusing instead on immediate benefits within a Semi-MDP structure. Wei et al. [9] employ an exponential decay model to compute cumulative discounted returns, deriving a Bellman optimality equation to guide decision-making with DQN. Kim et al. [25] adapt the Soft Actor-Critic (SAC) algorithm for the Semi-MDP framework, introducing modifications that account for the variable time intervals and cumulative reward structures characteristic of Semi-MDPs. Despite these advancements, existing methods still exhibit notable limitations. Normalization-based approaches often rely heavily on theoretical assumptions, such as idealized transition models or fixed state representations, which reduce their generalizability to real-world, complex environments [37]-[40]. These assumptions can lead to performance degradation when applied to heterogeneous and highly dynamic edge systems, where practical constraints and unpredictable factors frequently deviate from theoretical models.
On the other hand, model-free DRL approaches [41]-[45] that bypass theoretical dependencies also face challenges. These methods commonly employ simplistic neural network architectures, such as basic feedforward models, that lack the scalability needed to adapt to dynamic edge network conditions. In systems where the number of edge servers (ESs) and user devices (UDs) can fluctuate significantly, fixed input-output dimensions lead to dimensional mismatches, requiring costly retraining of the models to accommodate changes [46]-[48]. This inflexibility limits the practical deployment of model-free DRL solutions in scenarios characterized by high variability and evolving system requirements. Overall, while the Semi-MDP framework offers significant potential for enabling real-time decision-making in edge computing, achieving effective and scalable solutions necessitates innovative algorithmic designs that address both the limitations of normalization-based methods and the scalability constraints of traditional DRL models. Future work must focus on bridging these gaps to develop robust and adaptable frameworks capable of supporting real-time, scalable task scheduling in edge environments.
· Weaknesses of Current Semi-MDP Methods:
1. Normalization-based approaches: these methods often rely on idealized models and theoretical assumptions, which can limit their applicability in real-world scenarios with high variability and uncertainty; in addition, the normalization process can introduce approximations that lead to suboptimal solutions or reduced accuracy.
2. Limited exploration of direct Semi-MDP algorithms: while some direct approaches exist, the field is still relatively under-explored compared to normalization-based methods.
3. Scalability challenges: as the complexity of the environment and the number of tasks increase, solving Semi-MDPs can become computationally expensive, especially for complex DRL algorithms.
4. Handling of uncertainty: many existing methods may not adequately address the inherent uncertainty and stochasticity present in real-world scheduling problems.
3. System Model and Problem Formulation
We consider a crowdsourcing-inspired MEC system, as illustrated in Fig. 2, comprising multiple applications and edge servers (ESs) with diverse configurations and characteristics. These applications may vary significantly in their requirements, encompassing delay-sensitive services such as networked gaming, autonomous driving, and AR/VR, as well as resource-intensive tasks like big data analytics, scientific computing, and video surveillance [49]. Similarly, ESs can range from micro data centers and edge clouds to high-capacity computing servers or even gateways deployed in residential or office settings. For generality, we assume these ESs are managed and operated by distinct edge service providers. To maximize resource utilization and enhance system performance in terms of scalability, reliability, and other metrics, a third-party platform is introduced to coordinate ES operations and handle workload dispatch from end users. Acting as an intermediary, this platform serves as a front-end interface for edge computing services, bridging the gap between clients submitting tasks and ESs providing computational resources. Upon receiving a task, the platform assigns it to the most suitable ES hosting the requested service and ensures the computation result is returned to the client seamlessly. This interaction is transparent to users, provided the system meets their application performance expectations, such as low latency and high computation quality.
Both application providers and ESs must undergo an onboarding process with the platform before accessing or delivering edge services. This formalized process involves signing agreements with the platform to define roles and responsibilities. For application providers, this includes specifying service requirements such as task rates, task valuation, budget constraints, computational demands, QoS parameters (e.g., maximum tolerable delay), and security or compliance needs. Similarly, ESs seeking to participate in the system are subject to a comprehensive evaluation by the platform. This involves reviewing their security protocols, compliance certifications, and data management practices to ensure adherence to industry standards and regulatory requirements [50]. Additionally, a risk assessment is often conducted to identify potential vulnerabilities. ESs must provide detailed information regarding their resource capacities, operational costs, and revenue expectations.
Using this information, the platform optimizes task offloading strategies and resource allocation for ESs, subsequently formalizing agreements with both parties. Once agreements are in place, ESs configure the necessary accounts and infrastructure, enabling application providers to deploy their services. Importantly, ongoing monitoring and auditing mechanisms are established to ensure all parties adhere to the agreed-upon terms, with regular performance and compliance evaluations conducted throughout the service lifecycle.
This study considers a scenario where application providers make advance payments to the platform, which, in turn, allocates a portion of these payments to incentivize contributions from edge servers (ESs). The platform's key decisions include: (1) whether to accept both the application providers and ESs into the system, (2) determining the amount of resources each ES should allocate to applications, and (3) devising an efficient task dispatching strategy to distribute tasks among the backend ESs hosting the services. To simplify notation, we define the set of ESs and applications/services in the system as M and N, respectively, with the corresponding cardinalities denoted by ∣M∣ and ∣N∣. For clarity, the terms "applications" and "application providers" are used interchangeably in this paper unless otherwise specified. The primary notations employed throughout this work are summarized in Table 1.
Each application $i$ is characterized by a tuple $\langle \phi_i, v_i, \lambda_i, w_i, d_i \rangle$, where:
1. $\phi_i$: the payment made by application provider $i$ to the platform for task offloading.
2. $v_i$: the utility gained by $i$ from offloading a task, such as reduced energy consumption at user devices, enhanced computational quality, or shorter response times. Generally, $v_i \ge \phi_i$, so offloading offers net benefits to the application.
3. $\lambda_i$: the arrival rate of tasks for application $i$.
4. $w_i$: the workload (measured in CPU cycles) required to process a task.
5. $d_i$: the maximum latency tolerable by application $i$.
Given the stochastic nature of the system and the uncertainty in resource allocation at ESs, the actual value derived by an application from task offloading depends on the quality of the edge computing service. We represent this with a utility function $u_i(\cdot)$, which quantifies the satisfaction level of application $i$ when offloading tasks to ES $j$ (denoted $s_j$). This utility function is an abstract representation and can vary depending on the application's requirements. For instance, for delay-sensitive applications, $u_i$ may be defined based on reductions in task latency, while for resource-intensive applications, $u_i$ could reflect the computational quality, such as compression ratios or prediction accuracy. Moreover, the form of $u_i$ can differ even within the same application category. For example, in delay-sensitive applications, $u_i$ could be a step function to model satisfaction levels in the presence of hard deadlines:

$u_i(t) = \begin{cases} v_i, & t \le d_i, \\ 0, & t > d_i, \end{cases}$    (1)

where $t$ denotes the experienced task delay.
A. Platform Model
The platform operates under the following assumptions:
1. The platform employs a probabilistic task dispatching mechanism, where each task of application $i$ is routed to ES $s_j$ with a predefined probability $p_{ij}$.
2. The payment made by application $i$ is distributed between the platform and the ES executing the task. Specifically, the ES receives a reward of $\alpha \phi_i$, where $\alpha \in (0, 1)$, while the platform retains $(1-\alpha)\phi_i$ as its service charge or maintenance fee. The parameter $\alpha$, a critical system variable, is determined by the platform and forms part of the contractual agreement with the ES.
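As a small illustration of the probabilistic dispatching assumption (variable names are our own), each arriving task can be routed by sampling an ES index from the application's dispatch distribution $p_{ij}$:

import numpy as np

def dispatch_task(app_id, dispatch_probs, rng):
    # dispatch_probs: (num_apps, num_es) row-stochastic matrix; entry [i, j]
    # is p_ij, the probability that a task of application i goes to ES j.
    return int(rng.choice(dispatch_probs.shape[1], p=dispatch_probs[app_id]))

rng = np.random.default_rng(42)
P = np.array([[0.6, 0.3, 0.1],   # application 0's dispatch distribution
              [0.2, 0.2, 0.6]])  # application 1's dispatch distribution
es_index = dispatch_task(0, P, rng)  # ES 0 is chosen with probability 0.6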
4. Real-time and Scalable Task Offloading Framework
Before detailing the algorithm, we first describe the calculation of the dispatch-probability bounds $p^{\max}_{ij}$ and $\hat{p}_{ij}$ under a fixed resource allocation $\{f_{ij}\}$, where $f_{ij}$ denotes the computing capacity that ES $s_j$ allocates to application $i$. The following assumptions, drawn from prior studies, are applied:
1. Tasks from each application arrive according to a Poisson process [46]. Consequently, the arrival of tasks from application $i$ at $s_j$ also follows a Poisson process with a rate of $\lambda_i p_{ij}$, where $\lambda_i$ represents the task arrival rate and $p_{ij}$ denotes the probability of task dispatch to $s_j$.
2. The workload of tasks from each application is assumed to follow an exponential distribution (in CPU cycles) [27][36]. This implies that the processing time for a task from application $i$ at $s_j$ also follows an exponential distribution with a mean of $1/\mu_{ij}$, where $\mu_{ij} = f_{ij}/w_i$ and $w_i$ represents the mean workload of the task.
Based on these assumptions, the task processing system for application $i$ at $s_j$ can be modeled as an M/M/1 queue. The probability density function (pdf) of the task delay $T_{ij}$ in this system is then expressed as:

$f_{T_{ij}}(t) = (\mu_{ij} - \lambda_i p_{ij})\, e^{-(\mu_{ij} - \lambda_i p_{ij})\, t}, \quad t \ge 0$    (2)
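As a sanity check on Eq. (2) (our illustration, not part of the original derivation), the following simulation of an M/M/1 queue compares the empirical mean sojourn time against the analytical value $1/(\mu_{ij} - \lambda_i p_{ij})$:

import numpy as np

def mm1_mean_sojourn(lam, mu, n_tasks=200000, seed=0):
    # Simulate a FCFS M/M/1 queue via the Lindley recursion.
    rng = np.random.default_rng(seed)
    arrivals = np.cumsum(rng.exponential(1.0 / lam, n_tasks))  # Poisson arrivals
    services = rng.exponential(1.0 / mu, n_tasks)              # exp. service times
    depart = np.empty(n_tasks)
    depart[0] = arrivals[0] + services[0]
    for k in range(1, n_tasks):
        # Service starts at the arrival time or when the previous task departs.
        depart[k] = max(arrivals[k], depart[k - 1]) + services[k]
    return float(np.mean(depart - arrivals))

lam, mu = 3.0, 5.0                 # effective arrival rate and service rate
print(mm1_mean_sojourn(lam, mu))   # ~0.5, matching 1 / (mu - lam)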
Assuming $f_{T_{ij}}$ is defined as in Eq. (2) and $p_{ij} > 0$ (indicating that tasks from application $i$ are offloaded to $s_j$), the delay requirement in constraint (3b), namely that the mean task delay not exceed $d_i$, is as follows:

$\dfrac{1}{\mu_{ij} - \lambda_i p_{ij}} \le d_i$    (3)

Combining (2) and (3), we get:

$p_{ij} \le \dfrac{1}{\lambda_i}\left(\mu_{ij} - \dfrac{1}{d_i}\right)$    (4)

Let $p^{\max}_{ij}$ denote the right-hand side (RHS) of the inequality mentioned above, defined as:

$p^{\max}_{ij} = \dfrac{1}{\lambda_i}\left(\mu_{ij} - \dfrac{1}{d_i}\right)$    (5)

Clearly, $p^{\max}_{ij}$ represents the upper bound of the offloading probability for which application provider $i$ is satisfied with offloading its tasks to $s_j$, meeting the QoS requirements. Notably, this upper bound is independent of the other applications' dispatch decisions and is solely determined by the resource allocation $f_{ij}$ and the workload profiles.
Similarly, from constraint (3c) and assuming $p_{ij} > 0$, we derive a second inequality bounding the dispatch probability (6). Let the right-hand side (RHS) of that inequality be denoted as $\hat{p}_{ij}$ (7). Any feasible dispatch probability must therefore satisfy $p_{ij} \le \min\{p^{\max}_{ij}, \hat{p}_{ij}\}$.
Algorithm 1: Deriving the optimal task offloading probabilities and ratios under a given resource allocation
Input: Task profiles $\{\lambda_i, w_i, d_i\}$ and resource allocation $\{f_{ij}\}$
Output: Offloading probabilities $\{p_{ij}\}$ and the corresponding dispatch ratios
1: for each ES $s_j \in M$ do
2:    for each application $i \in N$ do
3:       Derive $p^{\max}_{ij}$ via Eq. (5)
4: for each application $i \in N$ do
5:    Get $\hat{p}_{ij}$ via Eq. (7)
6: for each application $i \in N$ do
7:    Get the feasible upper bound $\min\{p^{\max}_{ij}, \hat{p}_{ij}\}$ for every $s_j \in M$
8: Obtain the offloading probabilities $\{p_{ij}\}$ subject to these bounds
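A minimal sketch of the bound computation in Algorithm 1, assuming the mean-delay form of Eq. (5) reconstructed above; the final normalization is only one illustrative way to obtain a feasible dispatch distribution, not the paper's optimization procedure:

import numpy as np

def dispatch_upper_bounds(lam, w, d, f):
    # lam, w, d: (n_apps,) arrival rates, mean workloads, and deadlines.
    # f: (n_apps, n_es) CPU frequency allocated to each (application, ES) pair.
    mu = f / w[:, None]                             # service rates mu_ij = f_ij / w_i
    p_max = (mu - 1.0 / d[:, None]) / lam[:, None]  # Eq. (5)
    return np.clip(p_max, 0.0, 1.0)                 # infeasible pairs get bound 0

lam = np.array([3.0, 5.0])              # tasks per second, per application
w = np.array([0.9, 0.8])                # Gcycles per task
d = np.array([1.5, 1.0])                # maximum tolerable delay in seconds
f = np.array([[3.0, 2.5], [4.0, 2.0]])  # GHz allocated per (application, ES)

p_max = dispatch_upper_bounds(lam, w, d, f)
p = p_max / p_max.sum(axis=1, keepdims=True)  # one feasible dispatch distribution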
Table 1. Notations and simulation values.
| Notation | Simulation Value | Notation | Simulation Value |
| M | 8 | Task size; workload; deadline; transmit power | U(1.0, 1.2) MB; U(0.8, 1.0) Gcycles; U(1, 2) s; 1 W |
| ES CPU frequency | U(2, 4) GHz | β | |
| Coverage radius | 50 m | ϑ | |
| N | 30 | p | |
| | 3 | | -3 |
| | 1 | Noise power density | -114 dBm/MHz |
| X | 0.1 | B | 1 MHz |
| UD CPU frequency | U(1, 2) GHz | Ω | 1 |
| | 4 | Κ | |
|