Understanding Incidents
Learn how Pingara detects outages, creates incidents after consecutive failures, manages the full incident lifecycle, and resolves incidents automatically on recovery.
Incidents are the core of Pingara's alerting system. An incident represents a confirmed outage or service disruption — not a single failed check, but a verified problem that requires attention.
How Incidents Are Created
Pingara uses a consecutive failure model to prevent false positives from transient network blips.
The 2-Failure Rule
A single failed check does not create an incident. Instead:
- First failure — Check fails, but no incident yet. The consecutive failure counter increments to 1.
- Second consecutive failure — Check fails again. Counter reaches 2. Incident is created.
This ensures you're only alerted for real problems, not momentary network hiccups.
Check 1: ✅ Pass → Status: Up
Check 2: ❌ Fail → Status: Up (1 consecutive failure)
Check 3: ❌ Fail → Status: Down (2 consecutive failures → Incident created!)
Check 4: ❌ Fail → Status: Down (incident ongoing)
Check 5: ✅ Pass → Status: Down (1 consecutive success)
Check 6: ✅ Pass → Status: Up (2 consecutive successes → Incident resolved!)
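The trace above can be reproduced with a small state machine. This is a minimal sketch, not Pingara's implementation: the names `MonitorState`, `record`, and `replay` are hypothetical, and it assumes the failure counter resets on any passing check.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 2  # consecutive failures needed to open an incident
SUCCESS_THRESHOLD = 2  # consecutive successes needed to resolve it


@dataclass
class MonitorState:
    """Hypothetical per-monitor state; field names are illustrative only."""
    status: str = "Up"
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    incident_open: bool = False

    def record(self, passed: bool) -> str:
        """Apply one check result and describe what happened."""
        if passed:
            self.consecutive_failures = 0  # assumption: any pass breaks the failure streak
            if not self.incident_open:
                return "ok"
            self.consecutive_successes += 1
            if self.consecutive_successes >= SUCCESS_THRESHOLD:
                self.incident_open = False
                self.status = "Up"
                self.consecutive_successes = 0
                return "incident resolved"
            return "recovering"

        # The check failed: any success streak is broken.
        self.consecutive_successes = 0
        self.consecutive_failures += 1
        if self.incident_open:
            return "incident ongoing"
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.incident_open = True
            self.status = "Down"
            return "incident created"
        return "failure counted"


def replay(results):
    """Feed a sequence of pass/fail results through a fresh monitor state."""
    state = MonitorState()
    return [(state.record(passed), state.status) for passed in results]


# Same sequence as Checks 1-6 above: pass, fail, fail, fail, pass, pass.
trace = replay([True, False, False, False, True, True])
```

In this sketch the third result opens the incident and the sixth resolves it, so the status column matches the six checks listed above.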
What Counts as a Failure
A check is marked as "failed" if any of these occur:
- Unexpected status code — Response code not in the expected list (default: 200-204)
- Timeout — Server didn't respond within the configured timeout
- Connection error — DNS failure, TCP connection refused, TLS handshake failure
- Keyword missing — Response body doesn't contain the expected keyword (if keyword check is enabled)
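As an illustration, the four failure conditions can be expressed as a single classification function. This is a sketch under assumptions: `CheckResult` and `failure_reason` are hypothetical names, the order in which conditions are tested is a choice made here, and the default range simply encodes the 200-204 default mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_EXPECTED_CODES = range(200, 205)  # 200-204 inclusive


@dataclass
class CheckResult:
    """Hypothetical shape of one check result."""
    status_code: Optional[int] = None       # None if no HTTP response arrived
    body: str = ""
    timed_out: bool = False
    connection_error: Optional[str] = None  # e.g. "DNS failure"


def failure_reason(result, expected_codes=DEFAULT_EXPECTED_CODES, keyword=None):
    """Return a failure description, or None if the check passed."""
    if result.connection_error is not None:
        return f"connection error: {result.connection_error}"
    if result.timed_out:
        return "timeout"
    if result.status_code not in expected_codes:
        return f"unexpected status code: {result.status_code}"
    # The keyword check only applies when a keyword is configured.
    if keyword is not None and keyword not in result.body:
        return "keyword missing"
    return None
```

For example, `failure_reason(CheckResult(status_code=204, body="hello"), keyword="world")` returns `"keyword missing"`, while the same result without a keyword configured passes.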
Multi-Region Monitoring and Quorum
When you monitor from multiple regions, Pingara applies a quorum rule to further reduce false positives.
How Quorum Works
Instead of trusting a single region's result, Pingara requires failures from multiple regions before confirming a problem:
| Regions Configured | Quorum Required | Meaning |
|---|---|---|
| 1 region | 1 of 1 | Any failure triggers |
| 2 regions | 2 of 2 | Both must fail |
| 3 regions | 2 of 3 | Majority must fail |
| 4 regions | 3 of 4 | Majority must fail |
Why Quorum Matters
Without quorum (single region):
US East check fails (ISP routing issue) → False incident!
With quorum (3 regions):
US East: ❌ Fail
EU West: ✅ Pass
AP South: ✅ Pass
→ 1 of 3 failed → NOT an incident (regional issue)
Actual outage (3 regions):
US East: ❌ Fail
EU West: ❌ Fail
AP South: ❌ Fail
→ 3 of 3 failed → Incident created (confirmed outage)
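The quorum table and the two three-region scenarios above can be sketched in a few lines. The function names are hypothetical, and the formula assumes the table generalizes to a strict majority of configured regions, which matches all four rows shown.

```python
from typing import Mapping


def quorum_required(region_count: int) -> int:
    """Strict majority of regions: 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 3."""
    if region_count < 1:
        raise ValueError("at least one region is required")
    return region_count // 2 + 1


def is_confirmed_failure(region_results: Mapping[str, bool]) -> bool:
    """region_results maps a region name to True (pass) or False (fail)."""
    failed = sum(1 for passed in region_results.values() if not passed)
    return failed >= quorum_required(len(region_results))


# Regional issue: only one of three regions fails, so quorum is not met.
regional_issue = is_confirmed_failure(
    {"US East": False, "EU West": True, "AP South": True}
)

# Actual outage: all three regions fail, so quorum is met.
actual_outage = is_confirmed_failure(
    {"US East": False, "EU West": False, "AP South": False}
)
```

Here `regional_issue` is `False` and `actual_outage` is `True`, mirroring the two scenarios.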
Recommendation: Enable at least 2 regions for production monitors. Use 3-4 regions for critical services.
Incident Lifecycle
Every incident progresses through a defined set of statuses:
1. Investigating
The initial state when an incident is first created.
- Trigger: 2 consecutive failures confirmed
- Actions: Alerts sent to all configured channels
- What it means: Pingara has detected a problem and is tracking it
2. Identified
The cause of the incident has been identified (manually set or AI-assisted).
- Trigger: Manual status update or AI root cause analysis
- What it means: The team understands the problem
3. Monitoring
A fix has been applied and the team is watching for confirmation.
- Trigger: Manual status update
- What it means: Recovery is expected, watching closely
4. Resolved
The incident is over. Service has been restored.
- Trigger: 2 consecutive successful checks (automatic) or manual resolution
- Actions: Recovery alerts sent to all configured channels
- What it means: The problem is fixed
Investigating (automatic) → Identified (manual or AI-assisted) → Monitoring (manual) → Resolved (automatic or manual)
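One way to model the lifecycle is an ordered enumeration with a forward-only transition check. This is a sketch, not Pingara's data model: `IncidentStatus` and `can_transition` are hypothetical names, and the rule that statuses only move forward (with skipping allowed, since automatic resolution can close an incident from any earlier stage) is an assumption.

```python
from enum import Enum


class IncidentStatus(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Enum members iterate in definition order, which is the lifecycle order.
LIFECYCLE = list(IncidentStatus)


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Assumed rule: an incident only moves forward; Resolved is terminal."""
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)
```

Under this assumption, `can_transition(IncidentStatus.INVESTIGATING, IncidentStatus.RESOLVED)` is allowed, while moving from Resolved back to Monitoring is not.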
Automatic Resolution
Pingara resolves incidents automatically using the same consecutive model as creation:
The 2-Success Rule
- First success — Check passes, but incident stays open. Consecutive success counter increments to 1.
- Second consecutive success — Check passes again. Counter reaches 2. Incident is resolved.
This prevents premature resolution from a single lucky check during an intermittent outage.
What Happens on Resolution
- Incident status changes to Resolved
- resolvedAt timestamp is recorded
- Recovery alert notifications are dispatched
- Monitor status returns to Up
- Consecutive failure counter resets to 0
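The resolution steps can be read as one small routine. The sketch below is illustrative only: `Monitor`, `Incident`, `resolve_incident`, and the `notify` callback are hypothetical, and the order of the side effects is simply the order of the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Monitor:
    status: str = "Down"
    consecutive_failures: int = 0


@dataclass
class Incident:
    status: str = "Investigating"
    resolved_at: Optional[datetime] = None  # stands in for the resolvedAt timestamp


def resolve_incident(
    incident: Incident,
    monitor: Monitor,
    notify: Callable[[str], None],
    now: Optional[datetime] = None,
) -> None:
    """Apply the resolution side effects in the order listed above."""
    incident.status = "Resolved"
    incident.resolved_at = now or datetime.now(timezone.utc)
    notify("recovery")  # hypothetical hook that would fan out to alert channels
    monitor.status = "Up"
    monitor.consecutive_failures = 0


# Usage: collect notifications in a list instead of sending them.
sent = []
incident, monitor = Incident(), Monitor(consecutive_failures=3)
resolve_incident(incident, monitor, sent.append,
                 now=datetime(2024, 1, 1, tzinfo=timezone.utc))
```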
Incident Details
Each incident records comprehensive information for post-mortems:
Core Data
- Monitor — Which monitor detected the issue
- Status — Current lifecycle stage
- Started at — When the incident was created
- Resolved at — When recovery was confirmed
- Duration — Total time from start to resolution
Diagnostic Data
- Error type — Category of failure (timeout, DNS failure, connection refused, etc.)
- Error message — Detailed error description
- Affected regions — Which monitoring regions detected the failure
- Average response time — Mean response time during the incident
AI Analysis
- Root cause hint — AI-generated analysis of what likely caused the incident (see Root Cause Analysis)
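To make the recorded fields concrete, the sketch below models an incident as a plain record with two derived values: duration and average response time. The class and field names are hypothetical and are not Pingara's API schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical record mirroring the fields listed above."""
    monitor: str
    status: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    error_type: str = ""
    error_message: str = ""
    affected_regions: List[str] = field(default_factory=list)
    response_times_ms: List[float] = field(default_factory=list)
    root_cause_hint: Optional[str] = None

    @property
    def duration(self) -> Optional[timedelta]:
        """Time from start to resolution; None while the incident is open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at

    @property
    def average_response_time_ms(self) -> Optional[float]:
        """Mean of the response times observed during the incident."""
        return mean(self.response_times_ms) if self.response_times_ms else None


record = IncidentRecord(
    monitor="api-health",  # example name, not a real monitor
    status="Resolved",
    started_at=datetime(2024, 1, 1, 12, 0),
    resolved_at=datetime(2024, 1, 1, 12, 7),
    error_type="timeout",
    affected_regions=["US East", "EU West"],
    response_times_ms=[100.0, 300.0],
)
```

In this example `record.duration` is seven minutes and `record.average_response_time_ms` is `200.0`.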
Viewing Incidents
Incident List
Navigate to Incidents in the sidebar to see all incidents across your organization.
Filters available:
- Status — Investigating, Identified, Monitoring, Resolved
- Monitor — Filter by specific monitor
- Time range — Last 24h, 7d, 30d, or custom
Incident Detail
Click any incident to see:
- Full timeline of status changes
- Performance metrics at time of failure
- Affected regions
- Root cause hints (AI-generated)
- Notification history
Incident vs Degraded
An incident and a degraded state are triggered and handled differently:
| | Incident (Down) | Degraded |
|---|---|---|
| Check result | Failed | Passed (but slow) |
| Trigger | 2 consecutive failures | Response time > 4× Apdex threshold |
| Severity | High | Medium |
| Creates incident | Yes | No (separate alert) |
| Auto-resolves | Yes (2 successes) | Yes (when latency improves) |
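The distinction in the table comes down to two tests on each check: did it pass, and if so, was it slower than four times the Apdex threshold? A minimal sketch, with hypothetical names and millisecond units assumed:

```python
DEGRADED_MULTIPLIER = 4  # from the table: response time > 4x Apdex threshold


def classify_check(passed: bool, response_time_ms: float,
                   apdex_threshold_ms: float) -> str:
    """Return "failed", "degraded", or "ok" for a single check."""
    if not passed:
        return "failed"  # counts toward the 2 consecutive failures
    if response_time_ms > DEGRADED_MULTIPLIER * apdex_threshold_ms:
        return "degraded"  # passed, but slow: no incident is created
    return "ok"
```

With an Apdex threshold of 500 ms, a passing check at 2100 ms is classified as degraded, while one at exactly 2000 ms is not, because the comparison is strictly greater-than.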
Best Practices
Set Appropriate Intervals
For critical services, use shorter check intervals:
- 30 seconds — Fastest detection (~1 minute to incident)
- 1 minute — Good balance (~2 minutes to incident)
- 5 minutes — Standard monitoring (~10 minutes to incident)
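The approximate detection times follow from the 2-failure rule. The sketch below assumes evenly spaced checks and ignores request timeouts and alert delivery time; the function name is hypothetical. The figures in the list above correspond to the worst case, where the outage begins just after a passing check.

```python
def time_to_incident_seconds(interval_seconds: float, failure_threshold: int = 2):
    """Return (best_case, worst_case) seconds from outage start to incident.

    Best case: the outage begins just before a check, so the first failure is
    almost immediate and the remaining failures each take one interval.
    Worst case: the outage begins just after a passing check, so every
    required failure costs a full interval.
    """
    best = interval_seconds * (failure_threshold - 1)
    worst = interval_seconds * failure_threshold
    return best, worst


bounds = {interval: time_to_incident_seconds(interval) for interval in (30, 60, 300)}
```

For a 30-second interval this gives a range of 30 to 60 seconds; for 5 minutes, 300 to 600 seconds.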
Enable Multi-Region
Single-region monitoring is prone to false positives, because a routing or ISP problem affecting only that region looks the same as a real outage. Always use at least 2 regions for production monitors.
Review Incidents Regularly
Schedule weekly or monthly incident reviews:
- How many incidents occurred?
- What was the average resolution time?
- Were there any false positives?
- Should Apdex thresholds be adjusted?
Use Root Cause Analysis
Pingara's AI-powered root cause analysis can help you understand why an incident occurred, not just that it happened. See Root Cause Analysis for details.
Next Steps
- Root Cause Analysis — AI-powered incident diagnostics
- Setting Up Alerts — Get notified when incidents occur
- HTTP/HTTPS Monitoring — Configure your monitors