
Root Cause Analysis

Understand how Pingara uses AI-powered analysis with Gemini 2.5 Flash to identify the root cause of incidents, interpret performance metrics, and accelerate troubleshooting.

5 min read | Updated April 7, 2026
Tags: root-cause, ai, gemini, analysis, diagnostics

When an incident occurs, the first question is always "why?" Pingara's AI-powered root cause analysis uses Google's Gemini 2.5 Flash model to analyze performance metrics and suggest the most likely cause of the failure.

How It Works

Data Collection

Every HTTP check captures a detailed performance breakdown:

| Metric | What It Measures |
| --- | --- |
| DNS Lookup Time | Time to resolve hostname to IP address |
| TCP Connect Time | Time to establish a TCP connection |
| TLS Handshake Time | Time to negotiate SSL/TLS (HTTPS only) |
| Time to First Byte (TTFB) | Time until the server starts sending data |
| Total Duration | Full request/response cycle time |
| Response Size | Size of the response body |
| Status Code | HTTP response code |
| Error Type | Category of failure (if any) |

AI Analysis

When an incident is created, Pingara feeds these metrics into the Gemini 2.5 Flash model via Google's Genkit framework. The AI analyzes patterns and returns actionable root cause suggestions.

Input to the AI:

{
  "dnsLookupTime": 2500,
  "tcpConnectTime": 50,
  "tlsHandshakeTime": 120,
  "httpStatusCode": 0,
  "responseTime": 30000,
  "errorMessage": "DNS lookup failed: NXDOMAIN"
}

AI output:

Root cause hints:
1. DNS resolution failure — The domain could not be resolved.
   Possible causes: expired domain, misconfigured DNS records,
   DNS provider outage.
2. Check your domain registrar to confirm the domain hasn't expired.
3. Verify DNS records are correctly configured with your provider.
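
How Pingara assembles the model request is not documented here. Purely as an illustration, the sketch below shows one way a metrics payload like the one above could be rendered into a prompt string. The function name `build_prompt` and the prompt wording are hypothetical, the millisecond unit is an assumption, and no Genkit or Gemini call is made.

```python
import json


def build_prompt(metrics: dict) -> str:
    """Render check metrics as a plain-text prompt (hypothetical wording)."""
    # Pretty-print the metrics so each field sits on its own line.
    payload = json.dumps(metrics, indent=2, sort_keys=True)
    return (
        "You are diagnosing a failed HTTP check.\n"
        # Assumption: timing fields are expressed in milliseconds.
        "Timings are in milliseconds.\n"
        "Suggest the most likely root causes, most likely first.\n\n"
        f"Metrics:\n{payload}\n"
    )


example = {
    "dnsLookupTime": 2500,
    "httpStatusCode": 0,
    "errorMessage": "DNS lookup failed: NXDOMAIN",
}
print(build_prompt(example))
```

Keeping the raw error message in the payload matters: in the example above, the `NXDOMAIN` text is the strongest signal the model receives.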

Interpreting Performance Metrics

Understanding what each metric tells you is key to effective troubleshooting.

DNS Lookup Time

Normal: 10–50ms | Elevated: 200ms+ | Failed: Returned error

| Symptom | Likely Cause |
| --- | --- |
| Slow (200ms+) | DNS server overloaded or geographically distant |
| Very slow (1000ms+) | DNS provider experiencing issues |
| Failed (NXDOMAIN) | Domain doesn't exist or DNS misconfigured |
| Failed (SERVFAIL) | DNS server error |
| Failed (timeout) | DNS server unreachable |

Action: Check your DNS provider's status page. Consider using a faster DNS provider or adding redundant DNS servers.
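
The DNS table can be read as a small set of rules. The sketch below encodes it in Python as one possible interpretation, not Pingara's implementation. The function name `classify_dns` is hypothetical, and the label for the 50–200ms gap, which the table does not cover, is an assumption.

```python
from typing import Optional

# Thresholds in milliseconds, taken from the DNS table.
DNS_NORMAL_MAX_MS = 50
DNS_ELEVATED_MS = 200
DNS_VERY_SLOW_MS = 1000

# Error keyword -> likely cause, taken from the DNS table.
DNS_ERROR_CAUSES = {
    "NXDOMAIN": "Domain doesn't exist or DNS misconfigured",
    "SERVFAIL": "DNS server error",
    "TIMEOUT": "DNS server unreachable",
}


def classify_dns(lookup_ms: Optional[float], error: Optional[str] = None) -> str:
    """Map a DNS lookup result to the likely cause listed in the table."""
    if error:
        # A reported error takes priority over any timing value.
        upper = error.upper()
        for keyword, cause in DNS_ERROR_CAUSES.items():
            if keyword in upper:
                return cause
        return "DNS lookup failed (unrecognized error)"
    if lookup_ms is None:
        return "No DNS timing available"
    if lookup_ms >= DNS_VERY_SLOW_MS:
        return "DNS provider experiencing issues"
    if lookup_ms >= DNS_ELEVATED_MS:
        return "DNS server overloaded or geographically distant"
    if lookup_ms <= DNS_NORMAL_MAX_MS:
        return "Normal"
    # Assumption: the table leaves 50-200ms unclassified.
    return "Above normal, below the elevated threshold"


print(classify_dns(2500, "DNS lookup failed: NXDOMAIN"))
print(classify_dns(250))
```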

TCP Connect Time

Normal: 10–100ms (depends on geographic distance) | Elevated: 500ms+

| Symptom | Likely Cause |
| --- | --- |
| Slow | Network congestion between probe and server |
| Connection refused | Server is up but not accepting connections on that port |
| Timeout | Server unreachable, firewall blocking, or host down |

Action: Check server firewall rules, verify the service is listening on the expected port, and check for network issues between the probe region and your server.
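
To reproduce a TCP symptom from your own machine, you can time a connection attempt with Python's standard `socket` module. This is a minimal sketch and not how Pingara's probes work. The function name `probe_tcp` and its status labels are hypothetical.

```python
import socket
import time
from typing import Optional, Tuple


def probe_tcp(host: str, port: int, timeout_s: float = 5.0) -> Tuple[str, Optional[float]]:
    """Attempt a TCP connection and return (status, connect time in ms)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_ms = (time.perf_counter() - start) * 1000
            return "connected", elapsed_ms
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused", None
    except socket.timeout:
        # No answer within timeout_s: host down, unreachable, or filtered.
        return "timeout", None
    except OSError:
        # Anything else, such as a name resolution or routing error.
        return "error", None
```

For example, `probe_tcp("example.com", 443)` returns a status string and, on success, the connect time in milliseconds. A `refused` result points at the service or port, while a `timeout` points at the network path or firewall.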

TLS Handshake Time

Normal: 50–200ms | Elevated: 500ms+

| Symptom | Likely Cause |
| --- | --- |
| Slow | Server CPU bottleneck during key exchange |
| Failed (expired) | SSL certificate has expired |
| Failed (invalid) | Certificate doesn't match hostname |
| Failed (self-signed) | Certificate not trusted |

Action: Check certificate validity, ensure your server supports modern TLS protocols, and consider enabling TLS session resumption.
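
Certificate validity can be checked ahead of an incident with Python's standard `ssl` module. The sketch below is one way to do it, independent of Pingara. The function names `days_until_expiry` and `fetch_not_after` are hypothetical.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired)."""
    # not_after uses the format returned by getpeercert(),
    # e.g. "Jan  5 09:34:43 2018 GMT".
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout_s: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host."""
    # The default context verifies the chain and hostname, so an expired,
    # mismatched, or untrusted certificate raises an ssl.SSLError subclass
    # during the handshake instead of returning a value.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling `days_until_expiry(fetch_not_after("example.com"))` on a schedule gives early warning before the "Failed (expired)" case occurs.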

Time to First Byte (TTFB)

Normal: 100–500ms | Elevated: 1000ms+

TTFB is the most telling metric for backend performance because it includes:

  • Server processing time
  • Database query execution
  • External API calls
  • Cache lookups

| Symptom | Likely Cause |
| --- | --- |
| Consistently slow | Backend performance issue (slow queries, missing cache) |
| Intermittently slow | Resource contention, garbage collection pauses |
| Very slow (5000ms+) | Server overloaded, deadlock, or infinite loop |

Action: Profile your backend application. Check database query performance, caching effectiveness, and server resource utilization.
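
Telling "consistently slow" from "intermittently slow" requires several samples, not a single check. The sketch below shows one possible heuristic over recent TTFB values. The function name `classify_ttfb` and the 80% cutoff for "consistent" are assumptions. The 1000ms and 5000ms thresholds come from the values above.

```python
from statistics import median
from typing import Sequence


def classify_ttfb(
    samples_ms: Sequence[float],
    slow_ms: float = 1000.0,
    very_slow_ms: float = 5000.0,
    consistent_fraction: float = 0.8,  # assumption, not a Pingara setting
) -> str:
    """Label a series of TTFB samples using the symptoms in the table."""
    if not samples_ms:
        return "no data"
    # The median resists single outliers, so a high median means the
    # typical request is very slow, not just one unlucky sample.
    if median(samples_ms) >= very_slow_ms:
        return "very slow"
    slow_share = sum(1 for s in samples_ms if s >= slow_ms) / len(samples_ms)
    if slow_share >= consistent_fraction:
        return "consistently slow"
    if slow_share > 0:
        return "intermittently slow"
    return "normal"


print(classify_ttfb([1200, 1500, 1100, 1300, 1250]))
print(classify_ttfb([200, 250, 3000, 220, 240]))
```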

Total Duration

Total duration is the full request/response cycle time: the sum of every phase from DNS lookup to the end of the response. If it exceeds your configured timeout, the check fails with a timeout error.
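
When total duration is high, the useful question is which phase accounts for most of it. The sketch below computes that share from per-phase timings. The function names and the sample values are illustrative assumptions, and the phase list omits response download time for brevity.

```python
from typing import Dict, Tuple


def dominant_phase(phases_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest phase and its share of the summed duration."""
    total = sum(phases_ms.values())
    if total <= 0:
        raise ValueError("phase timings must sum to a positive duration")
    name = max(phases_ms, key=phases_ms.get)
    return name, phases_ms[name] / total


def exceeds_timeout(phases_ms: Dict[str, float], timeout_ms: float) -> bool:
    """True if the summed phases are longer than the configured timeout."""
    return sum(phases_ms.values()) > timeout_ms


# Illustrative values in milliseconds.
phases = {"dns": 25, "tcp": 40, "tls": 130, "ttfb": 4500}
name, share = dominant_phase(phases)
print(f"{name} accounts for {share:.0%} of the total")
```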

Common Root Cause Patterns

Pattern 1: DNS Failure

DNS: Failed | TCP: N/A | TLS: N/A | TTFB: N/A
Error: DNS lookup failed

Likely cause: Domain expired, DNS records deleted, or DNS provider outage.

Pattern 2: Server Unreachable

DNS: 25ms | TCP: Timeout | TLS: N/A | TTFB: N/A
Error: Connection timeout

Likely cause: Server is down, firewall blocking traffic, or network outage.

Pattern 3: SSL Certificate Issue

DNS: 30ms | TCP: 45ms | TLS: Failed | TTFB: N/A
Error: SSL certificate expired

Likely cause: Certificate expired and wasn't renewed. Check your certificate management process.

Pattern 4: Application Error

DNS: 20ms | TCP: 35ms | TLS: 150ms | TTFB: 80ms
Status: 503 Service Unavailable

Likely cause: Application crashed, deployment in progress, or backend dependency failure.

Pattern 5: Performance Degradation

DNS: 25ms | TCP: 40ms | TLS: 130ms | TTFB: 4500ms
Status: 200 (but very slow)

Likely cause: Database bottleneck, cache miss storm, or resource exhaustion.
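
The five patterns follow the order of the request phases, so they can be checked from the first phase to the last. The sketch below is a rule-based reading of them, not the AI analysis itself. The function name `match_pattern`, the use of `None` for a phase that failed or never ran, and the 1000ms TTFB cutoff are assumptions.

```python
from typing import Optional


def match_pattern(
    dns_ms: Optional[float],
    tcp_ms: Optional[float],
    tls_ms: Optional[float],
    ttfb_ms: Optional[float],
    status_code: Optional[int],
    https: bool = True,
    slow_ttfb_ms: float = 1000.0,
) -> str:
    """Match check metrics against the five patterns, earliest phase first."""
    # None means the phase failed or never ran (Failed, Timeout, or N/A).
    if dns_ms is None:
        return "Pattern 1: DNS Failure"
    if tcp_ms is None:
        return "Pattern 2: Server Unreachable"
    # Plain HTTP checks have no TLS phase, so only test it for HTTPS.
    if https and tls_ms is None:
        return "Pattern 3: SSL Certificate Issue"
    if status_code is not None and status_code >= 500:
        return "Pattern 4: Application Error"
    if ttfb_ms is not None and ttfb_ms >= slow_ttfb_ms:
        return "Pattern 5: Performance Degradation"
    return "No known pattern"


print(match_pattern(20, 35, 150, 80, 503))
print(match_pattern(25, 40, 130, 4500, 200))
```

Checking phases in order matters because a failure in an early phase leaves every later phase without data.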

Viewing Root Cause Hints

From the Incident Detail Page

  1. Navigate to Incidents → [Your Incident]
  2. Look for the Root Cause Analysis section
  3. Review the AI-generated hints

From the Root Cause Dialog

Click the Analyze button on any incident to trigger a fresh AI analysis with the latest available data.

Limitations

  • AI analysis is a suggestion, not a definitive diagnosis
  • Works best with clear failure patterns (DNS failure, timeout, 5xx errors)
  • May be less specific for intermittent or complex multi-service failures
  • Requires sufficient check result data for meaningful analysis

Best Practices

Combine AI Hints with Manual Investigation

Use root cause hints as a starting point, then verify with your own monitoring tools, server logs, and infrastructure dashboards.

Track Patterns Over Time

If the same root cause appears repeatedly:

  • DNS failures → Consider switching DNS providers
  • TLS issues → Automate certificate renewal (e.g., Let's Encrypt)
  • TTFB spikes → Invest in backend performance optimization
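
If you export or record root causes yourself, spotting repeats takes only a few lines. The sketch below counts cause labels across past incidents and returns the matching recommendation from the list above. The labels `dns`, `tls`, and `ttfb`, the function name, and the threshold of three incidents are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Recommendations from the list above, keyed by a hypothetical cause label.
RECOMMENDATIONS = {
    "dns": "Consider switching DNS providers",
    "tls": "Automate certificate renewal (e.g., Let's Encrypt)",
    "ttfb": "Invest in backend performance optimization",
}


def recurring_recommendations(causes: Iterable[str], min_count: int = 3) -> List[str]:
    """Return recommendations for causes seen at least min_count times."""
    counts = Counter(causes)
    return [
        f"{cause} ({count} incidents): {RECOMMENDATIONS[cause]}"
        for cause, count in counts.most_common()
        if count >= min_count and cause in RECOMMENDATIONS
    ]


history = ["dns", "tls", "dns", "ttfb", "dns", "tls", "tls", "other"]
for line in recurring_recommendations(history):
    print(line)
```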

Share Analysis with Your Team

Root cause hints are included in incident reports. Share these with your engineering team during post-mortem reviews to drive infrastructure improvements.
