Unifying AI management: Datadog launches GPU Monitoring

Datadog has introduced GPU Monitoring, expanding its AI observability capabilities by providing unified visibility into GPU fleet health, performance, and cost efficiency.

Wednesday, 22nd April 2026, by Sophie Milburn
GPU Monitoring is now available to Datadog customers globally. The product is designed to help organisations manage rising AI-related costs.

“GPU instances account for 14 percent of compute costs—which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways,” said Yanbing Li, Chief Product Officer at Datadog.

The launch comes as companies seek more effective ways to manage GPU spending linked to AI workloads. Many organisations face difficulties allocating GPU costs across business units, and limited workload context can make budgeting and planning more complex.

GPU Monitoring aims to provide a unified view across AI infrastructure, linking GPU fleet health, cost, and performance to the teams using those resources. The intent is to speed up troubleshooting of slow workloads and to improve cost visibility.

As AI deployments scale, managing compute resources increasingly involves broader organisational planning, particularly where capacity is misallocated or where training and inference workloads are affected by cost or performance constraints. Many organisations currently work with fragmented visibility into GPU usage. GPU Monitoring is intended to consolidate this view.

Existing GPU monitoring tools typically provide basic hardware health metrics but may not show cross-team resource contention, explain why workloads failed, or identify underused devices. This can slow investigations and lead teams to overprovision as a precaution, which drives up resource usage.

By connecting GPU fleet telemetry with workload data, GPU Monitoring provides a shared view for platform engineering and machine learning teams. The announcement highlights four benefits:

  • Scale AI without overspending: Usage insights help guide capacity planning, support decisions on new GPU purchases versus reallocation, and improve cost predictability.
  • Accelerate AI delivery: Linking performance issues to specific GPUs and processes helps identify bottlenecks more quickly.
  • Avoid costly disruptions: Early detection of unhealthy GPUs can help reduce the risk of broader system failures.
  • Maximise ROI on GPU spend: Visibility into utilisation enables teams to identify underused or overprovisioned resources and adjust allocation accordingly.
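To make the chargeback and utilisation ideas above concrete, here is a minimal Python sketch. It does not use any Datadog API and does not describe how GPU Monitoring works internally. The record layout, team names, GPU IDs, the 20 percent threshold, and every figure are hypothetical, chosen only to illustrate splitting a GPU bill across teams in proportion to GPU-hours and flagging devices with low average utilisation.

```python
from collections import defaultdict

# Hypothetical usage records: (team, gpu_id, gpu_hours, average utilisation 0-1).
# These values are invented for illustration and are not Datadog data.
USAGE = [
    ("search", "gpu-0", 300.0, 0.82),
    ("search", "gpu-1", 100.0, 0.15),
    ("vision", "gpu-2", 400.0, 0.64),
    ("vision", "gpu-3", 200.0, 0.08),
]


def allocate_cost(records, total_cost):
    """Split total_cost across teams in proportion to the GPU-hours they used."""
    hours_by_team = defaultdict(float)
    for team, _gpu, hours, _util in records:
        hours_by_team[team] += hours
    total_hours = sum(hours_by_team.values())
    if total_hours == 0:
        # No recorded usage: nothing to charge back.
        return {team: 0.0 for team in hours_by_team}
    return {team: total_cost * h / total_hours for team, h in hours_by_team.items()}


def underused_gpus(records, threshold=0.2):
    """Return the IDs of GPUs whose average utilisation is below threshold."""
    return [gpu for _team, gpu, _hours, util in records if util < threshold]


if __name__ == "__main__":
    print(allocate_cost(USAGE, total_cost=10_000.0))
    print(underused_gpus(USAGE))
```

Allocating by GPU-hours is only one possible chargeback policy. An organisation might instead weight by instance price or by reserved capacity, and the utilisation threshold would need tuning to the workload mix.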
Overall, GPU Monitoring is positioned as a tool to improve visibility and resource management for AI workloads across organisations.