Root Cause Analysis
Understand how Pingara uses AI-powered analysis with Gemini 2.5 Flash to identify the root cause of incidents, interpret performance metrics, and accelerate troubleshooting.
When an incident occurs, the first question is always "why?" Pingara's AI-powered root cause analysis uses Google's Gemini 2.5 Flash model to analyze performance metrics and suggest the most likely cause of the failure.
How It Works
Data Collection
Every HTTP check captures a detailed performance breakdown:
| Metric | What It Measures |
|---|---|
| DNS Lookup Time | Time to resolve hostname to IP address |
| TCP Connect Time | Time to establish a TCP connection |
| TLS Handshake Time | Time to negotiate SSL/TLS (HTTPS only) |
| Time to First Byte (TTFB) | Time until the server starts sending data |
| Total Duration | Full request/response cycle time |
| Response Size | Size of the response body |
| Status Code | HTTP response code |
| Error Type | Category of failure (if any) |
AI Analysis
When an incident is created, Pingara feeds these metrics into the Gemini 2.5 Flash model via Google's Genkit framework. The model examines the timing breakdown, status code, and error message, then returns root cause hints with suggested next steps.
Input to the AI:
{
  "dnsLookupTime": 2500,
  "tcpConnectTime": 50,
  "tlsHandshakeTime": 120,
  "httpStatusCode": 0,
  "responseTime": 30000,
  "errorMessage": "DNS lookup failed: NXDOMAIN"
}
AI output:
Root cause hints:
1. DNS resolution failure — The domain could not be resolved.
Possible causes: expired domain, misconfigured DNS records,
DNS provider outage.
2. Check your domain registrar to confirm the domain hasn't expired.
3. Verify DNS records are correctly configured with your provider.
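To make the input format concrete, here is a minimal sketch of how a check result could be turned into the JSON payload shown above and wrapped in a prompt. The `CheckMetrics` class, the `build_prompt` function, and the prompt wording are hypothetical and only illustrate the shape of the data; they are not Pingara's implementation, and the sketch deliberately stops before any Genkit or Gemini call. Times are assumed to be in milliseconds.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckMetrics:
    """Hypothetical container for one HTTP check result (times assumed in ms)."""
    dns_lookup_ms: Optional[int]
    tcp_connect_ms: Optional[int]
    tls_handshake_ms: Optional[int]
    http_status_code: int
    response_time_ms: int
    error_message: Optional[str] = None

    def to_payload(self) -> dict:
        # Map Python-style names onto the JSON field names from the example input.
        return {
            "dnsLookupTime": self.dns_lookup_ms,
            "tcpConnectTime": self.tcp_connect_ms,
            "tlsHandshakeTime": self.tls_handshake_ms,
            "httpStatusCode": self.http_status_code,
            "responseTime": self.response_time_ms,
            "errorMessage": self.error_message,
        }


def build_prompt(metrics: CheckMetrics) -> str:
    """Assemble an illustrative prompt; the wording is an assumption."""
    return (
        "Given these HTTP check metrics (times in milliseconds), "
        "suggest the most likely root cause of the failure:\n"
        + json.dumps(metrics.to_payload(), indent=2)
    )


example = CheckMetrics(2500, 50, 120, 0, 30000, "DNS lookup failed: NXDOMAIN")
prompt = build_prompt(example)
```

Keeping the payload construction separate from the prompt text makes it easy to log exactly what was sent for a given incident.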
Interpreting Performance Metrics
Understanding what each metric tells you is key to effective troubleshooting.
DNS Lookup Time
Normal: 10–50ms | Elevated: 200ms+ | Failed: lookup returned an error
| Symptom | Likely Cause |
|---|---|
| Slow (200ms+) | DNS server overloaded or geographically distant |
| Very slow (1000ms+) | DNS provider experiencing issues |
| Failed (NXDOMAIN) | Domain doesn't exist or DNS misconfigured |
| Failed (SERVFAIL) | DNS server error |
| Failed (timeout) | DNS server unreachable |
Action: Check your DNS provider's status page. Consider using a faster DNS provider or adding redundant DNS servers.
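The symptom table for DNS can be read as a simple lookup. The sketch below is one way to encode it; the `classify_dns` function is hypothetical, the thresholds come from the table, and matching on substrings of the error message is an assumption about how errors are reported.

```python
from typing import Optional


def classify_dns(lookup_ms: Optional[float], error: Optional[str] = None) -> str:
    """Map a DNS lookup result onto the likely causes from the symptom table."""
    if error:
        upper = error.upper()
        if "NXDOMAIN" in upper:
            return "Domain doesn't exist or DNS misconfigured"
        if "SERVFAIL" in upper:
            return "DNS server error"
        if "TIMEOUT" in upper or "TIMED OUT" in upper:
            return "DNS server unreachable"
        return "Unclassified DNS failure"
    if lookup_ms is None:
        return "No DNS timing available"
    if lookup_ms >= 1000:
        return "DNS provider experiencing issues"
    if lookup_ms >= 200:
        return "DNS server overloaded or geographically distant"
    # The table defines no symptom below 200 ms.
    return "No symptom from the table applies"
```

For example, `classify_dns(2500, "DNS lookup failed: NXDOMAIN")` reports a missing or misconfigured domain, because a reported error takes priority over the timing.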
TCP Connect Time
Normal: 10–100ms (depends on geographic distance) | Elevated: 500ms+
| Symptom | Likely Cause |
|---|---|
| Slow | Network congestion between probe and server |
| Connection refused | Server is up but not accepting connections on that port |
| Timeout | Server unreachable, firewall blocking, or host down |
Action: Check server firewall rules, verify the service is listening on the expected port, and check for network issues between the probe region and your server.
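To reproduce a TCP symptom by hand, you can time a connection attempt yourself. This is a rough sketch using Python's standard `socket` module, not Pingara's probe; `probe_tcp` and `classify_connect_error` are hypothetical helper names. Note that `socket.create_connection` also resolves the hostname, so pass an IP address if you want to time the TCP phase alone.

```python
import socket
import time
from typing import Optional, Tuple


def classify_connect_error(exc: OSError) -> str:
    """Translate a connection exception into a symptom from the table."""
    if isinstance(exc, ConnectionRefusedError):
        return "connection refused"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "other network error"


def probe_tcp(host: str, port: int, timeout_s: float = 5.0) -> Tuple[Optional[float], str]:
    """Return (connect time in ms, outcome); the time is None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_ms = (time.perf_counter() - start) * 1000
            return elapsed_ms, "connected"
    except OSError as exc:
        return None, classify_connect_error(exc)
```

Running `probe_tcp` from a machine near the affected probe region gives the most comparable numbers, since connect time depends on geographic distance.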
TLS Handshake Time
Normal: 50–200ms | Elevated: 500ms+
| Symptom | Likely Cause |
|---|---|
| Slow | Server CPU bottleneck during key exchange |
| Failed (expired) | SSL certificate has expired |
| Failed (invalid) | Certificate doesn't match hostname |
| Failed (self-signed) | Certificate not trusted |
Action: Check certificate validity, ensure your server supports modern TLS protocols, and consider enabling TLS session resumption.
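Expired certificates are the easiest TLS failure to catch early. The sketch below computes the days remaining from a certificate's `notAfter` string, which is the format Python's `SSLSocket.getpeercert()` returns. The helper names and the 30-day warning window are assumptions for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter time (negative if expired).

    `not_after` uses the format "Jan  5 09:34:43 2018 GMT".
    """
    expires_ts = ssl.cert_time_to_seconds(not_after)
    return (expires_ts - now.timestamp()) / 86400


def certificate_status(days_left: float, warn_days: float = 30) -> str:
    """Bucket the remaining lifetime; the 30-day window is an assumed default."""
    if days_left < 0:
        return "expired"
    if days_left < warn_days:
        return "expiring soon"
    return "valid"


# Example: evaluate a certificate against the current time.
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", datetime.now(timezone.utc))
status = certificate_status(remaining)
```

Passing `now` explicitly keeps the calculation deterministic, which makes it straightforward to test renewal alerts against fixed dates.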
Time to First Byte (TTFB)
Normal: 100–500ms | Elevated: 1000ms+
TTFB is the most telling metric for backend performance because it includes:
- Server processing time
- Database query execution
- External API calls
- Cache lookups
| Symptom | Likely Cause |
|---|---|
| Consistently slow | Backend performance issue (slow queries, missing cache) |
| Intermittently slow | Resource contention, garbage collection pauses |
| Very slow (5000ms+) | Server overloaded, deadlock, or infinite loop |
Action: Profile your backend application. Check database query performance, caching effectiveness, and server resource utilization.
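Telling "consistently slow" apart from "intermittently slow" requires looking at several checks rather than one. The following sketch shows one possible heuristic over a series of TTFB samples. The 1000 ms and 5000 ms thresholds come from this section; the `classify_ttfb` name and the 80% cut-off for "consistently" are assumptions.

```python
from statistics import median
from typing import Sequence

ELEVATED_MS = 1000   # "Elevated" threshold from this section
VERY_SLOW_MS = 5000  # "Very slow" threshold from the symptom table


def classify_ttfb(samples_ms: Sequence[float], consistent_fraction: float = 0.8) -> str:
    """Classify recent TTFB samples using a simple, assumed heuristic."""
    if not samples_ms:
        raise ValueError("need at least one TTFB sample")
    if median(samples_ms) >= VERY_SLOW_MS:
        return "very slow"
    # Fraction of samples at or above the elevated threshold.
    slow_fraction = sum(s >= ELEVATED_MS for s in samples_ms) / len(samples_ms)
    if slow_fraction >= consistent_fraction:
        return "consistently slow"
    if slow_fraction > 0:
        return "intermittently slow"
    return "normal"
```

A series such as `[200, 250, 3000, 220, 240]` is classified as intermittently slow, which points toward resource contention rather than a uniformly slow query.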
Total Duration
The sum of all phases, from DNS lookup through receipt of the full response. If the total duration exceeds your configured timeout, the check fails with a timeout error.
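As a small worked example, the sketch below adds up per-phase timings, reports which phase dominates, and compares the total with a timeout. The `summarize_phases` function, the sample timings, and the 30000 ms timeout are illustrative assumptions; any phase not listed here, such as response download, would be added in the same way.

```python
from typing import Dict, Optional, Tuple


def summarize_phases(
    phases_ms: Dict[str, Optional[float]], timeout_ms: float
) -> Tuple[float, str, bool]:
    """Return (total ms, slowest phase, whether the total exceeds the timeout)."""
    # Phases that never ran are recorded as None and are skipped.
    completed = {name: ms for name, ms in phases_ms.items() if ms is not None}
    if not completed:
        raise ValueError("no completed phases to summarize")
    total = sum(completed.values())
    dominant = max(completed, key=completed.get)
    return total, dominant, total > timeout_ms


total, dominant, timed_out = summarize_phases(
    {"dns": 25, "tcp": 40, "tls": 130, "ttfb": 4500},
    timeout_ms=30000,
)
```

Here the total is 4695 ms, well under the assumed timeout, and TTFB accounts for nearly all of it, so the backend is the place to look first.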
Common Root Cause Patterns
Pattern 1: DNS Failure
DNS: Failed | TCP: N/A | TLS: N/A | TTFB: N/A
Error: DNS lookup failed
Likely cause: Domain expired, DNS records deleted, or DNS provider outage.
Pattern 2: Server Unreachable
DNS: 25ms | TCP: Timeout | TLS: N/A | TTFB: N/A
Error: Connection timeout
Likely cause: Server is down, firewall blocking traffic, or network outage.
Pattern 3: SSL Certificate Issue
DNS: 30ms | TCP: 45ms | TLS: Failed | TTFB: N/A
Error: SSL certificate expired
Likely cause: Certificate expired and wasn't renewed. Check your certificate management process.
Pattern 4: Application Error
DNS: 20ms | TCP: 35ms | TLS: 150ms | TTFB: 80ms
Status: 503 Service Unavailable
Likely cause: Application crashed, deployment in progress, or backend dependency failure.
Pattern 5: Performance Degradation
DNS: 25ms | TCP: 40ms | TLS: 130ms | TTFB: 4500ms
Status: 200 (but very slow)
Likely cause: Database bottleneck, cache miss storm, or resource exhaustion.
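The five patterns share one idea: walk the request phases in order and stop at the first one that failed. The sketch below encodes that ordering as plain rules. It is a simplified illustration, not the AI analysis itself; the `match_pattern` function and its parameters are hypothetical, and the 1000 ms slow-TTFB threshold is taken from the TTFB section.

```python
from typing import Optional


def match_pattern(
    dns_ok: bool,
    tcp_ok: bool,
    tls_ok: bool,
    status: Optional[int],
    ttfb_ms: Optional[float],
    slow_ttfb_ms: float = 1000,
) -> str:
    """Return the first matching pattern, checking phases in request order."""
    if not dns_ok:
        return "Pattern 1: DNS Failure"
    if not tcp_ok:
        return "Pattern 2: Server Unreachable"
    if not tls_ok:  # for plain HTTP checks, pass tls_ok=True
        return "Pattern 3: SSL Certificate Issue"
    if status is not None and status >= 500:
        return "Pattern 4: Application Error"
    if ttfb_ms is not None and ttfb_ms >= slow_ttfb_ms:
        return "Pattern 5: Performance Degradation"
    return "No pattern matched"
```

The order matters: later phases report N/A once an earlier phase fails, so a DNS failure has to be ruled out before TCP, TLS, or application symptoms are considered.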
Viewing Root Cause Hints
From the Incident Detail Page
1. Navigate to Incidents → [Your Incident]
2. Look for the Root Cause Analysis section
3. Review the AI-generated hints
From the Root Cause Dialog
Click the Analyze button on any incident to trigger a fresh AI analysis with the latest available data.
Limitations
- AI analysis is a suggestion, not a definitive diagnosis
- Works best with clear failure patterns (DNS failure, timeout, 5xx errors)
- May be less specific for intermittent or complex multi-service failures
- Requires sufficient check result data for meaningful analysis
Best Practices
Combine AI Hints with Manual Investigation
Use root cause hints as a starting point, then verify with your own monitoring tools, server logs, and infrastructure dashboards.
Track Patterns Over Time
If the same root cause appears repeatedly:
- DNS failures → Consider switching DNS providers
- TLS issues → Automate certificate renewal (e.g., Let's Encrypt)
- TTFB spikes → Invest in backend performance optimization
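If you export incident data, a short script can surface recurring causes. This sketch assumes each incident has already been labeled with a category such as `"dns"`, `"tls"`, or `"ttfb"`; the labels, the `recurring_causes` function, and the threshold of three occurrences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Recommendations taken from the list above, keyed by assumed category labels.
RECOMMENDATIONS = {
    "dns": "Consider switching DNS providers",
    "tls": "Automate certificate renewal",
    "ttfb": "Invest in backend performance optimization",
}


def recurring_causes(categories: Iterable[str], min_count: int = 3) -> Dict[str, str]:
    """Return a recommendation for every category seen at least `min_count` times."""
    counts = Counter(categories)
    return {
        category: RECOMMENDATIONS.get(category, "Investigate manually")
        for category, seen in counts.items()
        if seen >= min_count
    }


history = ["dns", "tls", "dns", "ttfb", "dns"]
actions = recurring_causes(history)
```

With this history, only the DNS category crosses the threshold, so `actions` contains a single recommendation to consider switching DNS providers.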
Share Analysis with Your Team
Root cause hints are included in incident reports. Share these with your engineering team during post-mortem reviews to drive infrastructure improvements.
Next Steps
- Understanding Incidents — Incident lifecycle and detection
- HTTP/HTTPS Monitoring — Performance metrics in detail
- Apdex Scoring — Performance thresholds and scoring