AI and GitOps set the agenda for Kubernetes operations

Kubernetes has matured; ops take center stage as AI/ML and GitOps raise the bar. Focus is on change control, cost, skills, GPU efficiency, and platform standards.

Categorized in: AI News, Operations
Published on: Sep 23, 2025

Kubernetes matures as AI and GitOps reshape operations

Kubernetes has moved past the basics. The new Komodor 2025 Enterprise Kubernetes Report shows the real work is operational: change control, cost, and skills - while AI/ML and AIOps push clusters harder than ever.

"Organizations have made Kubernetes their standard, but our report shows the real challenge is operational, not architectural," said Itiel Shwartz, CTO of Komodor. "Even as practices like GitOps and platform engineering gain traction, enterprises still grapple with change management, cost control, and skills gaps."

AI/ML becomes a core workload

More than half of the organizations in the study run AI/ML on Kubernetes - from training and experimentation to live inference. It is a natural fit: autoscaling, isolation, and resource quotas align well with GPU-heavy jobs.

The catch is GPU efficiency. Teams report idle, expensive hardware due to poor scheduling and contention. Over 40% plan to expand orchestration and scheduling tools to improve GPU utilization.

Practical steps to keep GPUs busy

  • Adopt queue-based scheduling to prevent starvation and balance priority classes. Consider Kueue for batch and mixed workloads.
  • Use a Kubernetes-native GPU operator to manage drivers, device plugins, and monitoring. Standardize versions to avoid snowflake nodes.
  • Right-size nodes and bin-pack: align pod requests with GPU slices, use MIG where available, and pin topology to reduce cross-NUMA penalties.
  • Separate training and inference: different QoS, scaling profiles, and SLOs. Keep inference latency predictable by isolating it from bursty training jobs.
  • Enable autoscaling for GPU node groups, set sane PodDisruptionBudgets, and use PriorityClass with preemption policies for urgent jobs.
  • Track GPU-hour cost per team via labels/annotations, quotas, and dashboards. Kill zombie pods and idle notebooks on schedule.
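The queue-based scheduling bullet above can be illustrated in miniature. This is a minimal sketch, not Kueue's actual API: jobs carry a priority class and a GPU request, and the hypothetical `schedule` helper admits them by priority, then arrival order, without letting a large job that doesn't fit block smaller ones behind it.

```python
import heapq

def schedule(jobs, total_gpus):
    """jobs: list of (name, priority, gpus_needed) in arrival order.
    Admits by priority (higher first), then FIFO. A job that does not
    fit is skipped so later jobs that do fit are still admitted."""
    # Negate priority: heapq pops the smallest tuple first.
    heap = [(-prio, seq, name, gpus)
            for seq, (name, prio, gpus) in enumerate(jobs)]
    heapq.heapify(heap)
    admitted, free = [], total_gpus
    while heap:
        _, _, name, gpus = heapq.heappop(heap)
        if gpus <= free:
            admitted.append(name)
            free -= gpus
    return admitted, free
```

With an 8-GPU pool, a high-priority inference job jumps ahead of earlier training jobs, and a training job that no longer fits is held back rather than starving the queue.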

GitOps takes center stage

Most teams now manage Kubernetes configurations with GitOps. Argo CD leads, with Flux close behind. Versioned changes, automated reconciles, and fast rollbacks reduce drift and failed deployments.

Namespaces remain the primary isolation boundary. Many organizations also run separate clusters for sensitive workloads or strict blast-radius control - a clean split that simplifies compliance and incident response.

GitOps practices that cut risk

  • Single source of truth: split repos for app configs and cluster/infra configs. Use clear environments (dev, stage, prod) with overlays.
  • Policy as code: enforce guardrails with OPA/Gatekeeper or Kyverno before changes reach the cluster.
  • Secrets: keep them out of repos; use the External Secrets Operator or Sealed Secrets. Rotate keys automatically.
  • Progressive rollouts: blue/green or canary for high-impact services. Abort on SLO breach, not on gut feel.
  • Drift visibility: alert on out-of-band changes; block kubectl hotfixes except during incidents with audit trails.
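The drift-visibility point above reduces to a comparison of two states. A toy Python sketch, assuming resources are modeled as simple name-to-spec maps (real GitOps tools such as Argo CD diff fully rendered manifests; `detect_drift` is an illustrative name):

```python
def detect_drift(desired, live):
    """desired: {resource_name: spec} rendered from Git.
    live: {resource_name: spec} read from the cluster.
    Returns the sorted names of resources that differ."""
    drifted = [name for name, spec in desired.items()
               if live.get(name) != spec]
    # Resources running in the cluster but absent from Git are also drift.
    drifted += [name for name in live if name not in desired]
    return sorted(drifted)
```

An out-of-band `kubectl` hotfix (say, bumping replicas by hand) shows up immediately, as does a resource created outside Git.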

Platform engineering moves from idea to standard

Clusters now span on-prem, cloud, and edge. Dedicated platform teams consolidate tools, templates, and policies to cut sprawl. The report notes that unified standards correlate with fewer outages and lower operating costs.

What high-performing ops teams standardize

  • Golden paths: reusable app/service templates with baked-in telemetry, security, and rollback plans.
  • Cluster baselines: CNI, ingress, storage classes, Pod Security admission, and node images under version control.
  • Observability: consistent logs, metrics, traces, GPU exporters, and shared SLO dashboards.
  • Access and safety: scoped RBAC, least privilege, network policies, and strong multi-tenancy boundaries.
  • Change management: Git-only changes, small batches, and release cadences that align with incident data.
  • Cost control: chargeback/showback, autoscaling policies, and budget alerts wired into runbooks.
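The chargeback/showback bullet above is, at its core, a small aggregation. A minimal sketch, assuming usage records already carry a team label (for example, propagated from pod labels through a metrics pipeline; `showback` and the record shape are illustrative):

```python
def showback(usage_records, rate_per_gpu_hour, budgets):
    """usage_records: list of {"team": str, "gpu_hours": float}.
    budgets: {team: monthly budget}; teams without a budget are never flagged.
    Returns (cost per team, sorted list of over-budget teams)."""
    costs = {}
    for rec in usage_records:
        team = rec["team"]
        costs[team] = costs.get(team, 0.0) + rec["gpu_hours"] * rate_per_gpu_hour
    over = sorted(t for t, c in costs.items()
                  if c > budgets.get(t, float("inf")))
    return costs, over
```

Wiring the `over` list into an alert is what the report means by budget alerts "wired into runbooks": the alert names a team and a number, and the runbook names the next step.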

Close the skills gap this quarter

  • Upskill on GitOps, policy as code, and cluster-level security.
  • Train ops and data teams on GPU scheduling, capacity planning, and cost optimization for AI/ML.
  • Codify runbooks for AI pipelines: data ingress, feature stores, model registry, and rollout/rollback patterns.
  • Measure: set SLOs for training throughput and inference latency; track GPU utilization and queue wait time.
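The two metrics named in the last bullet, GPU utilization and queue wait time, can be computed from per-job records. A minimal sketch, assuming each job record has submission, start, and finish timestamps in hours plus a GPU count (the record shape and `queue_metrics` name are assumptions):

```python
def queue_metrics(jobs, total_gpus):
    """jobs: list of {"submitted": h, "started": h, "finished": h, "gpus": n}.
    Returns (mean queue wait in hours, GPU utilization over the window
    from earliest submission to latest finish)."""
    mean_wait = sum(j["started"] - j["submitted"] for j in jobs) / len(jobs)
    busy_gpu_hours = sum((j["finished"] - j["started"]) * j["gpus"]
                         for j in jobs)
    window = (max(j["finished"] for j in jobs)
              - min(j["submitted"] for j in jobs))
    utilization = busy_gpu_hours / (total_gpus * window)
    return mean_wait, utilization
```

Tracking these two numbers per team over time is usually enough to tell whether a scheduling change actually moved the needle.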

Bottom line for operations

Kubernetes is standard. The advantage comes from operations: GitOps discipline, GPU efficiency, and a lean internal platform. Keep policies unified, tools simple, and feedback loops short.
