SoftBank's Infrinia AI Cloud OS brings KaaS and API-based LLM inference to GPU clouds

SoftBank's Infrinia automates GPU cloud ops from BIOS to inference, bundling KaaS and LLM APIs on GB200 NVL72. It cuts toil, optimizes NVLink, and aims to lower TCO.

Published on: Jan 22, 2026

SoftBank's Infrinia AI Cloud OS: Automating GPU Ops from BIOS to Inference

Credit: Gorodenkoff / Shutterstock

SoftBank has launched Infrinia AI Cloud OS, a software stack that automates infrastructure operations and inference services on GPU platforms, including Nvidia's GB200 NVL72. The goal is straightforward: reduce operational burden, simplify lifecycle management, and cut TCO for GPU cloud environments.

Infrinia delivers two core services: Kubernetes as a Service (KaaS) and Inference as a Service (Inf-aaS). It handles everything from BIOS and RAID settings to Kubernetes management, GPU drivers, networking, and storage, all out of the box.

What Ops Teams Get

Kubernetes as a Service automates the stack so you don't spend cycles on low-level configuration or cluster plumbing.

  • End-to-end automation: BIOS, RAID, OS, GPU drivers, networking, Kubernetes controllers, and storage.
  • Dynamic reconfiguration of Nvidia NVLink connectivity and memory allocation as clusters are created, updated, or deleted.
  • Node allocation based on GPU proximity and NVLink domain setup to reduce latency (a placement sketch follows this list).
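
Infrinia's internals are not public, so the snippet below is only a rough sketch of what topology-aware placement looks like in stock Kubernetes: it uses the official Kubernetes Python client to group nodes by a hypothetical NVLink-domain label and pin a GPU pod to one domain via node affinity. The label key, container image, domain name, and GPU count are placeholder assumptions; only the standard nvidia.com/gpu resource name and the affinity API are plain Kubernetes.

from kubernetes import client, config

# Placeholder label key: Infrinia's actual node labels are not public.
NVLINK_LABEL = "example.com/nvlink-domain"

config.load_kube_config()  # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Group schedulable GPU nodes by NVLink domain so co-scheduled pods share a domain.
domains = {}
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    if NVLINK_LABEL in labels:
        domains.setdefault(labels[NVLINK_LABEL], []).append(node.metadata.name)
for domain, nodes in sorted(domains.items()):
    print(f"NVLink domain {domain}: {len(nodes)} node(s)")

# A GPU pod pinned to a single NVLink domain via standard node affinity.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-worker"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(
            name="worker",
            image="nvcr.io/nvidia/pytorch:24.08-py3",  # placeholder image
            resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "8"}),
        )],
        affinity=client.V1Affinity(node_affinity=client.V1NodeAffinity(
            required_during_scheduling_ignored_during_execution=client.V1NodeSelector(
                node_selector_terms=[client.V1NodeSelectorTerm(match_expressions=[
                    client.V1NodeSelectorRequirement(
                        key=NVLINK_LABEL, operator="In", values=["domain-0"]),
                ])],
            ),
        )),
    ),
)
# v1.create_namespaced_pod(namespace="default", body=pod)

In a KaaS model, this is exactly the plumbing the platform is supposed to automate; the sketch only illustrates the kind of scheduling decision involved.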

Inference as a Service abstracts deployment so teams can offer LLM inference without touching Kubernetes.

  • Select large language models and deploy via OpenAI-compatible APIs; no cluster tuning required (a minimal call sketch follows this list).
  • Scales across multiple nodes, including GB200 NVL72 systems.
  • Tenant isolation with encrypted communications, automated monitoring and failover, and APIs for portal, customer management, and billing integration.
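
SoftBank has not published Infrinia's endpoints or model catalog, but because Inf-aaS exposes OpenAI-compatible APIs, a tenant call would look roughly like the sketch below. The base URL, API key, and model name are placeholders; the client usage follows the standard openai Python SDK.

from openai import OpenAI

# Placeholder endpoint, key, and model id: SoftBank has not published these.
client = OpenAI(
    base_url="https://inference.example.invalid/v1",
    api_key="YOUR_TENANT_API_KEY",
)

resp = client.chat.completions.create(
    model="example-llm",
    messages=[{"role": "user", "content": "Summarize today's GPU utilization report."}],
)
print(resp.choices[0].message.content)

Swapping base_url is essentially the whole integration story for applications already written against OpenAI-style APIs, which is the point of API-compatible inference services.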

Why It Matters for Operations

Enterprises wrestle with GPU cluster provisioning, Kubernetes lifecycle management, and inference scaling. According to Charlie Dai, VP and principal analyst at Forrester, SoftBank's automated approach tackles these challenges by handling BIOS-to-Kubernetes configuration, optimizing GPU interconnects, and abstracting inference into API-based services. That lets teams focus on model delivery and SLOs instead of daily firefighting.

SoftBank says the stack aims to lower TCO and operational overhead versus bespoke builds or in-house tools. For Ops leaders, that means fewer custom runbooks to maintain and faster time to service readiness.

Competitive Context

The GPU cloud software market is projected to grow from $8.21 billion in 2025 to $26.62 billion by 2030. SoftBank now competes with hyperscale providers and specialized GPU platforms.

  • AWS EKS, Microsoft Azure AKS, and Google Cloud GKE offer managed Kubernetes with GPU support.
  • CoreWeave runs roughly 45,000 GPUs and is Nvidia's first Elite-level cloud services provider.
  • Lambda Labs reported $425 million in 2024 revenue and lists H100 instances at $2.49/hour.

Dai's take: advantage is shifting from who has GPUs to who can automate orchestration, abstract inference, and streamline the AI lifecycle end to end.

Rollout and Availability

SoftBank plans to deploy Infrinia in its own GPU cloud first, then offer it to external customers and overseas data centers. "The advancement of AI infrastructure requires ... software that integrates these resources and enables them to be delivered flexibly and at scale," said Junichi Miyakawa, SoftBank's president and CEO. Pricing and availability weren't disclosed.

Practical Next Steps for Ops Leaders

  • Identify use cases: internal KaaS for multi-tenant teams, external Inf-aaS for product teams, or both.
  • Map integration points: portal, customer management, billing, observability, and access control policies.
  • Benchmark costs vs. EKS/AKS/GKE or specialized providers (CoreWeave, Lambda Labs, RunPod); include interconnect and data egress in the model (see the cost sketch after this list).
  • Plan GPU topology management: NVLink domains, GPU proximity, and node placement strategies to hit latency targets.
  • Run a pilot: validate cluster automation, failover behavior, and scaling under LLM inference load.
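
For the benchmarking step above, a back-of-the-envelope model helps frame the comparison before formal quotes arrive. In the sketch below, the $2.49/hour H100 rate comes from the Lambda Labs figure cited earlier; the hyperscaler rate, egress pricing, utilization, and fleet size are placeholder assumptions to be replaced with real quotes.

# Back-of-the-envelope monthly cost: GPU hours plus data egress.
HOURS_PER_MONTH = 730

def monthly_cost(gpus, hourly_rate, utilization, egress_tb, egress_per_gb):
    """Monthly GPU spend at a given utilization, plus data-egress charges."""
    compute = gpus * hourly_rate * HOURS_PER_MONTH * utilization
    egress = egress_tb * 1000 * egress_per_gb
    return compute + egress

# $2.49/hr H100 rate is from the article; all other figures are placeholders.
scenarios = {
    "Specialized provider (H100 @ $2.49/hr)": monthly_cost(64, 2.49, 0.70, 50, 0.00),
    "Hyperscaler (placeholder rates)": monthly_cost(64, 4.00, 0.70, 50, 0.09),
}
for name, cost in scenarios.items():
    print(f"{name}: ${cost:,.0f}/month")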


Level Up Your Team

If you're building an internal capability around AI infrastructure and API-based inference, consider structured upskilling. Curated options by role can help standardize skills across Ops, Platform, and MLOps teams.

