Data & Analytics: Salute and MCIM Cut Liquid-Cooling Risk for AI and HPC Operations
Salute and MCIM have teamed up to give operators cleaner visibility and stronger control over direct-to-chip (DTC) liquid cooling. The partnership folds MCIM's operational intelligence platform into Salute's DTC service, aiming to close gaps that show up as rack densities rise and coolant flows become mission-critical.
Both companies bring field experience from large-scale AI and HPC sites, including Applied Digital. The goal is simple: tighter execution, continuous monitoring, and less operational risk across facilities where minutes of downtime can erase weeks of progress.
Why this matters for operations
DTC liquid cooling routes coolant directly to GPUs and CPUs. It enables higher densities and better thermal efficiency, but it adds failure modes you can't ignore: leaks, pressure spikes, air ingress, valve issues, and sensor drift.
Air-cooled playbooks won't cover this. You need real-time telemetry mapped to clear runbooks, trained teams on the floor, and portfolio-level governance that prevents local fixes from creating system-wide risk.
An integrated model you can execute
- Salute: On-the-ground procedures, training, and staffing with experience in liquid-cooled rooms. They handle the playbooks and the people.
- MCIM: A single operational platform that unifies telemetry, workflows, assets, and governance across sites. Deployed across 6.5 GW, 1M+ critical assets, and used by three of the top four global colocation providers.
- Together: One view from rack to portfolio, fewer silos, faster AI/HPC deployments, and tighter control of operational and financial risk.
Operational risks to control in DTC environments
- Leaks and air ingress: Quick-disconnect failures, seal wear, line contamination. Require leak detection, pressure decay testing, and isolation procedures.
- Flow and thermal drift: Pump curve changes, clogged microchannels, and mispositioned valves. Monitor ΔT and ΔP, set guardrails, and alarm on trends, not just thresholds.
- Sensor and telemetry gaps: Miscalibrated flow meters and orphaned devices. Enforce calibration schedules and CMDB-to-sensor mapping.
- Change risk: Firmware updates, GPU swaps, manifold work. Gate with MOPs/SOPs, approvals, and automated post-change verification.
- Room-level effects: Mixed cooling topologies and airflow recirculation. Model interactions and enforce rack-level density policies.
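The drift and telemetry risks above come down to alarming on trends rather than waiting for a hard threshold to trip. A minimal sketch of that idea, with hypothetical class names and limits (not part of any MCIM or Salute tooling), using a rolling slope over recent ΔP samples:

```python
from collections import deque

class TrendAlarm:
    """Fire when a telemetry signal trends toward its limit,
    not only when it crosses the hard threshold."""

    def __init__(self, limit, window=10, slope_limit=0.5):
        self.limit = limit              # hard threshold (e.g. max delta-P in kPa)
        self.slope_limit = slope_limit  # alarm if rising faster than this per sample
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        alarms = []
        if value >= self.limit:
            alarms.append("threshold")
        if len(self.samples) >= 2:
            # average rise per sample across the rolling window
            slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
            if slope >= self.slope_limit:
                alarms.append("trend")
        return alarms

# Example: loop delta-P creeping upward well below the 50 kPa hard limit
alarm = TrendAlarm(limit=50.0, window=5, slope_limit=1.0)
for dp in [30.0, 31.5, 33.2, 35.0, 37.1]:
    fired = alarm.add(dp)
print(fired)  # ['trend'] -- the trend alarm fires long before 50 kPa
```

The point is the ordering: a clogged microchannel or drifting pump curve shows up as a slope days before it shows up as a threshold breach, which is when isolation is still cheap.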
What "good" looks like: data and workflows
Every coolant loop, quick-disconnect, pump, valve, and sensor is tracked. Telemetry streams into a single platform tied to maintenance, change, and incident workflows. Operators see the same data the portfolio team sees: no stitched screenshots, no side spreadsheets.
MCIM's platform centralizes that view and aligns it with Salute's field operations. The result: faster readiness, fewer surprises, and cleaner RCAs that turn into updated SOPs within days, not quarters.
Leader perspectives
John Shultz, Chief Product, AI and Learning Officer at Salute: "AI/HPC computing requires an entirely new approach to data centre operations because of the requirements of GPUs and the risks of DTC liquid cooling. Together, Salute and MCIM are able to help data centre operators protect their investments in AI by mitigating operational and financial risks."
Mike Parks, CEO of MCIM: "AI is redefining what 'mission critical' means. The speed and scale of this buildout demand more than hardware and headcount. They demand a new operational model. Together, Salute and MCIM are delivering that model where every person, process and asset is connected, visible and performing to plan."
Practical next steps for ops teams
- Inventory every loop: pumps, valves, manifolds, quick-disconnects, flow/pressure/temperature sensors. Map to P&IDs and your CMDB.
- Define guardrails: min/max flow per rack, ΔP limits, GPU inlet temps, leak detection thresholds, and auto-isolation rules.
- Harden change control: pre-flight checks, explicit interlock disable/re-enable steps, rollback plans, and automated post-change validation.
- Train for failure modes: leak response, air purge, hot swap procedures, contamination control, and ESD-safe handling.
- Drill quarterly: simulate a leak, sensor failure, pump trip, and GPU thermal throttle. Measure MTTA and MTTR.
- Track the right KPIs: leak incidents per 10 MW, coolant loss per month, false alert rate, change success rate, GPU throttle events, and mean time to detect flow anomalies.
- Stage spares and kits: seals, QDs, hose sets, sensor heads, clamps, absorbents, and PPE-standardized by site.
- Close the loop: every incident updates SOPs, training modules, and alarm thresholds within the same platform.
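The guardrail step above can be sketched as a per-rack check that maps each limit to a named violation for the auto-isolation and alerting layer. All field names and limit values here are hypothetical assumptions for illustration; real values come from your design docs and P&IDs:

```python
from dataclasses import dataclass

@dataclass
class RackGuardrails:
    """Hypothetical per-rack limits; actual values come from design documents."""
    min_flow_lpm: float       # minimum coolant flow, litres per minute
    max_dp_kpa: float         # maximum loop pressure differential, kPa
    max_inlet_temp_c: float   # maximum GPU coolant inlet temperature, Celsius

def check_rack(reading, limits):
    """Return the list of guardrail violations for one telemetry reading."""
    violations = []
    if reading["flow_lpm"] < limits.min_flow_lpm:
        violations.append("low_flow")
    if reading["dp_kpa"] > limits.max_dp_kpa:
        violations.append("high_dp")
    if reading["inlet_temp_c"] > limits.max_inlet_temp_c:
        violations.append("high_inlet_temp")
    return violations

limits = RackGuardrails(min_flow_lpm=40.0, max_dp_kpa=60.0, max_inlet_temp_c=45.0)
reading = {"flow_lpm": 38.5, "dp_kpa": 52.0, "inlet_temp_c": 46.2}
print(check_rack(reading, limits))  # ['low_flow', 'high_inlet_temp']
```

Keeping the check declarative like this also makes the KPI step easier: counting violations per rack per month falls out of the same structure the alarms use.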
Standards and resources
If you're formalizing DTC operations, review industry guidance such as the Open Compute Project (OCP) Advanced Cooling Solutions workstream for reference models and safety practices.
Level up team capability
Building AI-ready ops skills takes more than hardware. For focused training paths by role, see Complete AI Training: Courses by Job.