Tail-Latency Aware Scheduler For Inference Workloads

Eddine, K. S., Bagaa, M., Ouahouah, S., Ouameur, M. A. and Ksentini, A. (2025). Tail-Latency Aware Scheduler For Inference Workloads. In 2025 International Wireless Communications and Mobile Computing (IWCMC). DOI 10.1109/IWCMC65282.2025.11059653.

Abstract

In recent years, AI inference has seen widespread adoption across fields like finance and healthcare, driving significant demand for high-performing applications. This demand creates a complex relationship between inference application types, such as real-time applications, and their specific service level objectives (SLOs), such as tail latency. A tail-latency SLO requires a defined percentage of requests to complete within a maximum response time, which is crucial for applications where delays can impact user experience or decision-making. This dependency makes scheduling inference workloads a challenging research problem. The core question becomes: how can we deploy AI workloads in a way that minimizes SLO violations? Specifically, we focus on real-time applications that require tail-latency guarantees. To address this, we developed a tail-latency-aware scheduler designed for resource-constrained devices. Our scheduler employs machine learning techniques to optimize task placement, aiming to minimize SLO violations and enhance performance for latency-sensitive applications. We integrated our custom scheduler into Kubernetes and evaluated it on a cluster configured specifically for testing its performance. This cluster features diverse computing capabilities, enabling a comprehensive evaluation of the scheduler's effectiveness. The experimental results show that our proposed scheduler outperforms the native Kubernetes scheduler in terms of efficiency.
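
To make the tail-latency definition above concrete, the following minimal Python sketch checks whether a set of measured response times satisfies an SLO of the form "at least pct percent of requests finish within max_ms". It is an illustration only, not the authors' implementation: the function names `percentile` and `meets_slo`, the sample values, and the choice of the nearest-rank percentile method are all assumptions made for this example.

```python
import math


def percentile(latencies_ms, pct):
    """Nearest-rank percentile (assumed method): the smallest sample such
    that at least pct percent of all samples are at or below it."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(latencies_ms)
    # 1-based rank of the sample that covers pct percent of the data.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_slo(latencies_ms, pct, max_ms):
    """True if at least pct percent of requests completed within max_ms."""
    return percentile(latencies_ms, pct) <= max_ms


# Hypothetical response times in milliseconds, with one slow outlier.
samples = [10, 12, 11, 13, 250, 12, 11, 10, 14, 12]

p90_ok = meets_slo(samples, 90, 50)  # 9 of 10 requests are within 50 ms
p95_ok = meets_slo(samples, 95, 50)  # the outlier falls inside the 95% window
```

With these samples, the 90th-percentile check passes while the 95th-percentile check fails, which shows how a single slow request can violate a stricter tail-latency objective even when the average latency looks healthy.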

Document type: Conference or workshop item
Keywords: Performance evaluation; Wireless communication; Adaptation models; Processor scheduling; Computational modeling; Predictive models; Real-time systems; User experience; Time factors; Random forests; Cloud-Edge-Computing-Continuum; Scheduling; Kubernetes
Date deposited: 03 Jul 2026 15:19
Last modified: 03 Jul 2026 15:19
Deposited version: Post-print (corrected and accepted version)
URI: https://depot-e.uqtr.ca/id/eprint/12963
