DOCOMO runs AI on vRAN CPUs: what operations teams need to know
Tokyo, Feb 25, 2026 - DOCOMO has validated AI applications running directly on the general-purpose CPU resources of its commercial vRAN. The takeaway is simple: you can execute meaningful AI workloads alongside live radio processing using the compute you already have.
Why this matters for operations: AI traffic is surging, and GPU supply, power, and cost are hard constraints. This approach shows a practical path to absorb growth, control spend, and keep performance steady by placing the right workload on the right silicon.
What they actually built
- vRAN baseband software from NEC, running on general-purpose servers.
- Virtualization platform from AWS hosting both vRAN and AI applications.
- Targeted accelerator cards from Qualcomm for specific compute tasks.
- Servers from HPE integrating the above stack.
Key result: AI applications executed on CPUs in parallel with live network processing, meaning no dedicated high-end accelerators are required for this class of AI task.
Why this is operationally significant
- Cost control: Use idle CPU cycles across sites before buying more GPUs. Lower capex, better ROI on existing hardware.
- Placement flexibility: Push latency-sensitive, lightweight inference close to radios; reserve GPUs in regional sites for heavy models.
- Energy efficiency: CPUs distributed across many sites can deliver acceptable AI throughput at lower power for many use cases.
- Resilience: Distribute AI across cells or sites to avoid single points of failure and smooth demand spikes.
Which AI workloads fit on CPUs in vRAN
- Telemetry analytics: anomaly detection, RAN KPI drift, predictive alarms.
- Lightweight inference: small LLM classifiers, keyword spotting, simple vision checks for site monitoring.
- Policy/optimization loops: traffic steering hints, energy-saving modes, admission control assistance.
Reserve GPUs for large models, high-throughput vision, or complex multi-modal inference. Think "CPU for breadth, GPU for depth."
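The "CPU for breadth, GPU for depth" split can be sketched as a simple placement heuristic. This is an illustrative sketch only: the thresholds, field names, and the `place` function are assumptions for the example, not values from DOCOMO's validation.

```python
# Hypothetical placement heuristic: route a workload to CPU or GPU based on
# rough model size and throughput attributes. Thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    model_params_m: float    # model size in millions of parameters
    throughput_rps: float    # required requests per second

def place(w: Workload,
          cpu_max_params_m: float = 500.0,
          cpu_max_rps: float = 50.0) -> str:
    """Return 'cpu' for lightweight, latency-local work; 'gpu' for heavy models."""
    if w.model_params_m <= cpu_max_params_m and w.throughput_rps <= cpu_max_rps:
        return "cpu"   # breadth: run near the radio on spare vRAN cores
    return "gpu"       # depth: schedule on regional accelerator pools

print(place(Workload("kpi-anomaly", 5, 10)))            # small telemetry model
print(place(Workload("multimodal-assist", 7000, 80)))   # large multi-modal model
```

In practice the classification would draw on measured p95 latency and per-site headroom rather than static thresholds, but the shape of the decision is the same.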
Operational patterns that work
- Resource isolation: Pin vRAN threads and IRQs; allocate AI containers to separate cores/NUMA nodes. Protect real-time paths first.
- Admission control: Gate AI jobs on RAN KPIs (e.g., PRB utilization, HARQ BLER) and CPU headroom.
- Latency budgets: Set hard ceilings for AI end-to-end latency to avoid starving vRAN processing.
- Observability: Correlate AI workload metrics with RAN KPIs in the same pane of glass. Alert on deviation from baseline.
- Progressive rollout: Start with off-peak windows and low-traffic cells, then expand.
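The admission-control pattern above can be expressed as a gate that checks RAN KPIs and CPU headroom before scheduling an AI job. A minimal sketch, assuming illustrative KPI names and thresholds (real bounds would come from your own baselines):

```python
# Illustrative admission gate: admit an AI job only when RAN KPIs and CPU
# headroom are inside safe bounds. All thresholds here are assumptions.
def admit_ai_job(prb_utilization: float,   # 0..1 fraction of PRBs in use
                 harq_bler: float,         # 0..1 HARQ block error rate
                 cpu_headroom: float,      # 0..1 idle fraction on AI cores
                 max_prb: float = 0.70,
                 max_bler: float = 0.05,
                 min_headroom: float = 0.30) -> bool:
    """vRAN is the primary tenant: any breached bound rejects the AI job."""
    return (prb_utilization < max_prb
            and harq_bler < max_bler
            and cpu_headroom > min_headroom)

print(admit_ai_job(0.45, 0.01, 0.50))  # quiet cell with headroom
print(admit_ai_job(0.85, 0.01, 0.50))  # busy radio: reject
```

Note the gate is conjunctive: the AI tenant is admitted only when every protective bound holds, which matches the "protect real-time paths first" rule.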
How to pilot this in your network
- Inventory CPU headroom per site and per time-of-day. Identify consistent slack.
- Classify candidate AI tasks by latency, throughput, and model size. Map "CPU-friendly" first.
- Choose orchestration: K8s with explicit CPU pinning and QoS, or VM-based isolation aligned to vRAN threads.
- Set SLOs for both sides: vRAN (throughput, jitter, packet loss, call drop) and AI (p95 latency, success rate, TPS).
- Run A/B baselines. Enable AI, measure deltas, auto-rollback on threshold breach.
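The auto-rollback step can be sketched as a delta check against the pre-AI baseline. The metric names and tolerances below are hypothetical placeholders; in a real pilot they would map to the SLOs you set for the vRAN side.

```python
# Minimal sketch of the A/B rollback check: compare post-enable KPIs against
# the pre-AI baseline and roll back on any breach. Values are illustrative.
def should_rollback(baseline: dict, current: dict, tolerances: dict) -> bool:
    """True if any KPI degraded beyond its allowed delta vs. baseline."""
    return any(current[kpi] - baseline[kpi] > allowed
               for kpi, allowed in tolerances.items())

baseline   = {"sched_latency_ms": 0.8, "packet_loss_pct": 0.02}
tolerances = {"sched_latency_ms": 0.2, "packet_loss_pct": 0.01}

print(should_rollback(baseline,
                      {"sched_latency_ms": 0.9, "packet_loss_pct": 0.02},
                      tolerances))  # within tolerance: keep AI enabled
print(should_rollback(baseline,
                      {"sched_latency_ms": 1.2, "packet_loss_pct": 0.02},
                      tolerances))  # latency breach: roll back
```

Wiring this check into the orchestrator's reconciliation loop makes the rollback automatic rather than a manual on-call decision.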
Metrics to track
- vRAN: PRB utilization, HARQ BLER, scheduling latency, packet loss, jitter, interrupt latency.
- Compute: per-core CPU utilization, cache misses, context switches, NUMA remote memory access.
- AI: p50/p95 inference latency, throughput, model accuracy drift.
- Efficiency: watts per inference, site PUE proxy, TCO per covered Mbps.
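Of the efficiency metrics, watts per inference is the simplest to compute: incremental average power divided by sustained inference rate. The numbers below are illustrative, not measurements from this trial.

```python
# Watts per inference = incremental average power / inference rate.
# Example figures are illustrative only.
def watts_per_inference(avg_power_w: float, inferences_per_s: float) -> float:
    return avg_power_w / inferences_per_s

# e.g. a vRAN server drawing an extra 40 W while serving 200 inferences/s:
print(round(watts_per_inference(40.0, 200.0), 3))  # 0.2 W per inference
```

Tracking this per site, alongside the RAN and compute metrics above, is what lets you compare CPU placement against a regional GPU pool on cost rather than intuition.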
Constraints and risks
- Contention: poor isolation can degrade radio performance. Treat AI as a best-effort tenant.
- Real-time tuning: DPDK, kernel settings, and BIOS power states must be aligned to protect vRAN timing.
- Model lifecycle: monitor drift; schedule updates during low-load windows.
- Vendor mix: integration across RAN, cloud stack, accelerators, and servers adds change-management overhead.
- Data handling: keep sensitive network data compliant when AI runs at the edge.
What's next
DOCOMO plans to optimize CPU/GPU placement per workload and traffic profile, and will showcase the initiative at Mobile World Congress Barcelona 2026. Expect more playbooks on where to run which models and how to schedule them safely.
Bottom line for operations: treat your RAN sites as distributed compute. Put the right AI work on the CPUs you already manage, enforce guardrails, and scale what proves stable and cost-effective.