CLEVER Presents Local Workload Placement!

The Local Workload Placement component of the CLEVER platform adds an ML-based scoring mechanism to Kubernetes’ scheduling pipeline, so that workloads are placed on worker nodes in line with platform objectives such as energy efficiency, scalability, and flexibility.

Key Features and Workflow:

Telemetry and Data Collection: A telemetry system gathers real-time data that can be stored in a centralized cluster-level telemetry database and converted into a graph structure.
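
To make the telemetry-to-graph step concrete, here is a minimal Python sketch that turns flat telemetry samples into a small property graph. The sample fields (node, cpu_free, mem_free_gib, power_w, workloads) and the RUNS_ON relation are illustrative assumptions, not the actual CLEVER telemetry schema.

```python
def build_cluster_graph(samples):
    """Turn flat telemetry samples into a small property graph.

    Each sample is a dict with hypothetical keys: 'node', 'cpu_free',
    'mem_free_gib', 'power_w', and 'workloads' (names running on that node).
    """
    nodes = {}  # graph node id -> attribute dict
    edges = []  # (source id, relation, target id)
    for sample in samples:
        worker_id = f"worker:{sample['node']}"
        nodes[worker_id] = {
            "kind": "worker",
            "cpu_free": sample["cpu_free"],
            "mem_free_gib": sample["mem_free_gib"],
            "power_w": sample["power_w"],
        }
        for name in sample.get("workloads", []):
            workload_id = f"workload:{name}"
            nodes.setdefault(workload_id, {"kind": "workload"})
            edges.append((workload_id, "RUNS_ON", worker_id))
    return nodes, edges


# Made-up samples, for illustration only.
samples = [
    {"node": "w1", "cpu_free": 2.5, "mem_free_gib": 8.0, "power_w": 140.0,
     "workloads": ["api", "db"]},
    {"node": "w2", "cpu_free": 6.0, "mem_free_gib": 20.0, "power_w": 95.0,
     "workloads": []},
]
nodes, edges = build_cluster_graph(samples)
print(len(nodes), "graph nodes,", len(edges), "edges")
```

Keeping workers and workloads as separate node kinds means a placement is simply an edge between them, which is the form a link prediction model works on.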

Graph Neural Network (GNN): The cluster status is represented as a graph and the placement task is framed as a graph link prediction problem, predicting feasible worker nodes for incoming workloads.
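
As a rough illustration of placement as link prediction, the sketch below scores candidate workload-to-node edges with a dot-product decoder over embeddings. The embedding values are hand-picked stand-ins for what a trained GNN encoder would produce, and the dot-product decoder is one common choice for link prediction, assumed here rather than taken from the CLEVER implementation.

```python
import math


def link_probability(workload_vec, node_vec):
    """Dot-product decoder: sigmoid of the embedding inner product."""
    logit = sum(a * b for a, b in zip(workload_vec, node_vec))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical embeddings; a real system would obtain these from the GNN.
workload_embedding = [0.9, -0.2, 0.4]
node_embeddings = {
    "w1": [0.1, 0.8, -0.5],
    "w2": [1.2, -0.1, 0.6],
    "w3": [-0.7, 0.3, 0.2],
}

# Probability that a "runs on" edge should exist for each candidate node.
candidates = {
    name: link_probability(workload_embedding, vec)
    for name, vec in node_embeddings.items()
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
print(ranked)
```

Each candidate edge gets a probability between 0 and 1, and ranking the nodes by that probability yields the preferred placements.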

Training and Serving Pipeline: Historical cluster and energy data are aggregated into a graph, which is processed to train a GNN model. The trained model predicts feasible worker nodes for new workloads based on the latest cluster status, scoring potential placements.
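
On the serving side, model outputs have to be expressed as node scores the scheduler can use. The Kubernetes scheduling framework works with integer node scores from 0 to 100, so one plausible final step is a min-max rescaling of the predicted link probabilities. The function below is a sketch under that assumption; how CLEVER actually connects the model to the scheduler is not described here.

```python
def to_node_scores(probabilities, max_score=100):
    """Rescale link probabilities to integer scores in [0, max_score].

    Min-max normalisation keeps the ranking and spreads the candidates
    over the full range; if all candidates tie, each gets max_score.
    """
    if not probabilities:
        return {}
    low = min(probabilities.values())
    high = max(probabilities.values())
    if high == low:
        return {name: max_score for name in probabilities}
    return {
        name: round((p - low) / (high - low) * max_score)
        for name, p in probabilities.items()
    }


# Illustrative probabilities for three candidate worker nodes.
scores = to_node_scores({"w1": 0.43, "w2": 0.79, "w3": 0.35})
print(scores)
```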

Continuous Feedback Loop: Updated resource availability from worker nodes is fed back into the graph after workload placement, ensuring the graph always reflects the cluster’s current status.
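
The feedback step can be sketched as a small graph update: add the new placement edge, then overwrite the worker's attributes with the availability it reports. The dictionary-based graph, the RUNS_ON relation, and the report fields are assumptions made for this example only.

```python
def feed_back(nodes, edges, placement, report):
    """Record a placement and refresh the worker's reported free resources.

    placement: (workload id, worker id) chosen by the scheduler.
    report: attribute values the worker reported after the workload started.
    """
    workload_id, worker_id = placement
    nodes.setdefault(workload_id, {"kind": "workload"})
    edges.append((workload_id, "RUNS_ON", worker_id))
    nodes[worker_id].update(report)


# Hypothetical cluster state before the placement.
graph_nodes = {
    "worker:w2": {"kind": "worker", "cpu_free": 6.0, "mem_free_gib": 20.0},
}
graph_edges = []

feed_back(
    graph_nodes,
    graph_edges,
    placement=("workload:batch-1", "worker:w2"),
    report={"cpu_free": 4.5, "mem_free_gib": 16.0},
)
print(graph_nodes["worker:w2"])
```

Using the worker's reported values, rather than subtracting the workload's requests, keeps the graph tied to measured state, so the next prediction starts from what the cluster actually looks like.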

Development Progress:

Data Acquisition: Experimental code uses existing datasets (Google, Alibaba traces) and synthetic data to simulate workloads for training. Final integration will utilize data from the Distributed Knowledge Graph (DKG) connected to the telemetry system.

Graph Construction: Current experiments employ a Neo4j graph database, incorporating data extracted from traces and synthetic sources. Collaboration with partners will finalize the structure and implementation of the DKG.

GNN Model Selection: Various GNN architectures, including GCN (Graph Convolutional Network) variants, are under evaluation, with a focus on link prediction accuracy and scalability. Feature engineering and graph embedding techniques are being refined to optimize model input.
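
The metric used to judge link prediction accuracy is not stated here. ROC AUC is one common choice: it estimates the probability that a true edge (an observed placement) receives a higher score than a sampled non-edge. The sketch below computes it directly from two lists of scores; the values are made up for illustration.

```python
def roc_auc(positive_scores, negative_scores):
    """Fraction of (true edge, non-edge) pairs ranked correctly; ties count half."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Scores a model assigned to held-out true edges and to sampled non-edges.
auc = roc_auc([0.9, 0.7, 0.4], [0.5, 0.3, 0.2])
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of true placements from non-placements.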

The Local Workload Placement component exemplifies CLEVER’s innovative approach to AI-driven orchestration, ensuring efficient and sustainable resource management in distributed systems.