IT teams can't see their hybrid stack. AI is moving in to help.
Most Ops leaders feel the gap every on-call shift: you can't fix what you can't see. A new SolarWinds study of 750+ IT pros reports 77% lack full visibility across on-prem and cloud. The culprits are familiar: complex estates, siloed teams, and tool sprawl.
Three data points tell the story. 75% say weak cross-team coordination blocks effective observability. 55% run too many monitoring tools. And more than half of organisations are still primarily or entirely on-prem, which multiplies failure domains and blind spots.
Why this matters for Operations
Hybrid without unified signals means longer incidents and finger-pointing. Tool sprawl leads to overlapping alerts, inconsistent dashboards, and unclear ownership. During outages, work bounces between network, app, DB, and infra teams with no single source of truth, exactly what 1 in 3 respondents flagged as a major challenge.
SolarWinds put it plainly: "As IT environments grow more distributed and business-critical, visibility is no longer optional; it's foundational," said Cullen Childress, Chief Product Officer. "Unified observability shifts teams from reactive firefighting to proactive resilience."
AI expectations are high; now make them useful
Confidence in AI is strong: 90% believe it can improve monitoring and observability, and 64% rate unified observability across the full stack as very important to team success. The reported AI use cases are practical, not sci-fi, and they map directly to Ops pain.
- Automate incident prioritisation (47%) so the right alarms cut the line.
- Accelerate root-cause analysis (~45%) to reduce handoffs and mean time to resolve.
- Predict capacity/performance (~45%) to avoid noisy scaling thrash.
- Reduce alert noise/fatigue (~45%) so teams focus on what matters.
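The first of those use cases, automated prioritisation, can be as simple as a scoring rule over incoming alerts. As a minimal sketch (the fields, weights, and service names below are illustrative assumptions, not SolarWinds functionality):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str          # emitting service
    severity: int         # 1 = critical .. 4 = informational
    tier: int             # business tier of the service, 1 = most critical
    customer_facing: bool # does the alert affect end users?

def priority_score(alert: Alert) -> int:
    """Lower score = handle first. Weights here are purely illustrative."""
    score = alert.severity * 10 + alert.tier * 5
    if alert.customer_facing:
        score -= 8  # surface customer-impacting alerts sooner
    return score

def triage(alerts: list[Alert]) -> list[Alert]:
    """Order alerts so the right alarms cut the line."""
    return sorted(alerts, key=priority_score)
```

In practice the weights would come from service SLOs and past incident data rather than hand-picked constants, but even a crude scorer like this beats strict arrival order.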
As Abigail Norman, senior director of product marketing at SolarWinds, noted: AI should do more than quiet alerts. It should sharpen prioritisation, streamline workflows, and give teams space to focus on strategy instead of scrambling through dashboards.
Barriers you should plan around
- Security concerns (47%): set data access boundaries, audit trails, and model usage policies up front.
- Skills gaps (42%): dedicate enablement time; pair SREs with platform engineers for hands-on pilots.
- Technology complexity (41%): start with one platform that unifies metrics, logs, traces, and events.
- Employee resistance (37%): show wins fast: noise reduction, faster triage, fewer after-hours pings.
- Budget constraints (33%): quantify MTTR savings and license consolidation to fund the shift.
What Ops can do in the next 90 days
- Inventory and rationalise: map every monitoring/observability tool, owner, and alert channel. Kill overlap. Close gaps.
- Define shared signals: publish service SLOs, golden signals, and escalation rules across network, app, DB, and infra.
- Standardise telemetry: ensure consistent tags and service names so data correlates across layers.
- Pick one AI "quick win": start with automated incident prioritisation or alert deduplication where the blast radius is big and scope is clear.
- Pilot on a critical service: run A/B on alert noise and MTTR; document before/after to build trust and budget.
- Establish guardrails: access controls, change logs for AI-driven rules, and clear rollback paths.
- Level up skills: invest in practical training for Ops teams adopting AI-assisted observability.
Metrics that prove it's working
- Mean time to detect (MTTD) and mean time to resolve (MTTR)
- Alert volume per service and percentage of actionable alerts
- Escalation depth and cross-team handoffs per incident
- Coverage: percent of services with end-to-end traces, logs, and metrics in one place
- Capacity forecast accuracy and unplanned scale events
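Several of these metrics fall out of basic arithmetic over incident records. A minimal sketch of computing MTTD, MTTR, handoff counts, and actionable-alert percentage (the record fields are assumptions; adapt them to whatever your incident tracker exports):

```python
from statistics import mean

def incident_metrics(incidents: list[dict]) -> dict:
    """Each incident dict carries 'started', 'detected', 'resolved'
    timestamps (seconds) and 'handoffs' (cross-team transfers)."""
    return {
        "mttd_s": mean(i["detected"] - i["started"] for i in incidents),
        "mttr_s": mean(i["resolved"] - i["started"] for i in incidents),
        "avg_handoffs": mean(i["handoffs"] for i in incidents),
    }

def actionable_alert_pct(total_alerts: int, acted_on: int) -> float:
    """Share of alerts that led to real action rather than noise."""
    return 100.0 * acted_on / total_alerts if total_alerts else 0.0
```

Tracking these week over week, per service, is what turns "the AI pilot helped" into a budget argument.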
The hybrid reality: on-prem is still in play
The "cloud-only" narrative doesn't match the floor. More than half of organisations remain primarily or entirely on-prem. That mix complicates data collection, correlation, and ownership, especially when tooling stacks grew from separate purchases by different teams.
The fix is consolidation and clarity: fewer tools, unified telemetry, shared dashboards, and agreed escalation paths. SolarWinds expects organisations to keep prioritising platforms that unify operational data and automate insights as monitoring shifts toward more automated, AI-assisted operations.
Helpful resources
- Google SRE book - solid frameworks for SLOs, alerting, and incident response.
- CNCF TAG Observability - guidance on telemetry standards and practices.
Study context
The study was conducted with UserEvidence and included respondents from public and private sectors across North America, Europe, Latin America, Asia-Pacific, and the Middle East and Africa. Roles spanned application and database management, network operations, and infrastructure across on-prem and cloud environments.