Tail-Latency Aware Scheduler For Inference Workloads

Khelifa, S. e., Bagaa, M., Ouahouah, S., Ouameur, M. A. and Ksentini, A. (2025). Tail-Latency Aware Scheduler For Inference Workloads. In 2025 International Wireless Communications and Mobile Computing (IWCMC). DOI 10.1109/IWCMC65282.2025.11059653.

Abstract

In recent years, AI inference has been widely adopted in fields such as finance and healthcare, driving significant demand for high-performance applications. This demand creates a complex relationship between inference application types, such as real-time applications, and their specific service level objectives (SLOs), such as tail latency. Tail latency is a metric that requires a defined percentage of requests to complete within a maximum response time; it is crucial for applications where delays can degrade user experience or decision-making. This dependency makes scheduling inference workloads a challenging research problem. The core question becomes: how can AI workloads be deployed so as to minimize SLO violations? Specifically, we worked on real-time applications that require tail-latency guarantees. To address this, we developed a tail-latency-aware scheduler designed for resource-constrained devices. Our scheduler uses machine learning techniques to optimize task placement, aiming to minimize SLO violations and improve performance for latency-sensitive applications. We developed our custom scheduler, integrated it into Kubernetes, and ran it on a specially configured cluster designed to test its performance. This cluster features heterogeneous computing capabilities, enabling a comprehensive evaluation of the scheduler's effectiveness. The experimental results show that our proposed scheduler outperforms the native Kubernetes scheduler in terms of efficiency.
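The tail-latency SLO defined above ("a defined percentage of requests must meet a maximum response time") can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the nearest-rank percentile definition, and the function names and sample latencies are hypothetical, not taken from the paper. The authors' scheduler and evaluation may measure tail latency differently.

```python
import math
from typing import Sequence

# Hypothetical per-request latencies in milliseconds (illustrative data only).
SAMPLES_MS = [12, 15, 11, 14, 90, 13, 12, 16, 14, 13]


def percentile_nearest_rank(latencies_ms: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(latencies_ms)
    # Nearest rank: the smallest sample such that at least p% of samples are <= it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_tail_slo(latencies_ms: Sequence[float], p: float, max_ms: float) -> bool:
    """True if at least p% of requests completed within max_ms."""
    # Equivalent to checking that the p-th nearest-rank percentile is <= max_ms.
    return percentile_nearest_rank(latencies_ms, p) <= max_ms


if __name__ == "__main__":
    print(percentile_nearest_rank(SAMPLES_MS, 90))  # 16
    print(percentile_nearest_rank(SAMPLES_MS, 99))  # 90
    print(meets_tail_slo(SAMPLES_MS, 90, 20))       # True
    print(meets_tail_slo(SAMPLES_MS, 99, 20))       # False
```

The example shows why tail latency matters for SLOs: a single slow request (90 ms) leaves the 90th percentile within a 20 ms budget but violates a 99th-percentile target, even though the typical request is fast.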

Document type:	Conference or workshop item
Uncontrolled keywords:	Performance evaluation; Wireless communication; Adaptation models; Processor scheduling; Computational modeling; Predictive models; Real-time systems; User experience; Time factors; Random forests; Cloud-Edge-Computing-Continuum; Scheduling; Kubernetes
Date deposited:	03 Jul 2026 15:19
Last modified:	09 Jul 2026 12:16
Version of deposited document:	Post-print (corrected and accepted version)
URI:	https://depot-e.uqtr.ca/id/eprint/12963