Understanding Incidents

Learn how Pingara detects outages, creates incidents after consecutive failures, manages the full incident lifecycle, and resolves incidents automatically on recovery.

Updated April 7, 2026
Tags: incidents, outage, downtime, lifecycle

Incidents are the core of Pingara's alerting system. An incident represents a confirmed outage or service disruption — not a single failed check, but a verified problem that requires attention.

How Incidents Are Created

Pingara uses a consecutive failure model to prevent false positives from transient network blips.

The 2-Failure Rule

A single failed check does not create an incident. Instead:

  1. First failure — Check fails, but no incident yet. The consecutive failure counter increments to 1.
  2. Second consecutive failure — Check fails again. Counter reaches 2. Incident is created.

This ensures you're only alerted for real problems, not momentary network hiccups.

Check 1: ✅ Pass     → Status: Up
Check 2: ❌ Fail     → Status: Up (1 consecutive failure)
Check 3: ❌ Fail     → Status: Down (2 consecutive failures → Incident created!)
Check 4: ❌ Fail     → Status: Down (incident ongoing)
Check 5: ✅ Pass     → Status: Down (1 consecutive success)
Check 6: ✅ Pass     → Status: Up (2 consecutive successes → Incident resolved!)
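
The sequence above can be reproduced with a small state machine. This is a minimal sketch, not Pingara's implementation: the names `MonitorState` and `record_check` are hypothetical, and it assumes each counter is cleared whenever a check goes the other way.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 2  # consecutive failures needed to open an incident
SUCCESS_THRESHOLD = 2  # consecutive successes needed to resolve it


@dataclass
class MonitorState:
    status: str = "up"  # "up" or "down"
    consecutive_failures: int = 0
    consecutive_successes: int = 0


def record_check(state: MonitorState, passed: bool) -> Optional[str]:
    """Update the counters for one check; return "created", "resolved", or None."""
    if passed:
        state.consecutive_failures = 0  # assumption: any success breaks the failure streak
        state.consecutive_successes += 1
        if state.status == "down" and state.consecutive_successes >= SUCCESS_THRESHOLD:
            state.status = "up"
            return "resolved"
    else:
        state.consecutive_successes = 0  # assumption: any failure breaks the success streak
        state.consecutive_failures += 1
        if state.status == "up" and state.consecutive_failures >= FAILURE_THRESHOLD:
            state.status = "down"
            return "created"
    return None


# Replay checks 1-6 from the example: pass, fail, fail, fail, pass, pass.
state = MonitorState()
events = [record_check(state, passed) for passed in [True, False, False, False, True, True]]
print(events)  # [None, None, 'created', None, None, 'resolved']
```

The incident is created on check 3 and resolved on check 6, matching the trace.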

What Counts as a Failure

A check is marked as "failed" if any of these occur:

  • Unexpected status code — Response code not in the expected list (default: 200-204)
  • Timeout — Server didn't respond within the configured timeout
  • Connection error — DNS failure, TCP connection refused, TLS handshake failure
  • Keyword missing — Response body doesn't contain the expected keyword (if keyword check is enabled)
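
The four failure conditions can be expressed as a single classification function. This is a sketch under assumptions: `CheckResult`, its fields, and the reason strings are hypothetical and are not Pingara's data model. Only the default expected range (200-204) comes from the list above.

```python
from dataclasses import dataclass
from typing import Container, Optional


@dataclass
class CheckResult:
    status_code: Optional[int] = None       # None if no HTTP response was received
    body: str = ""
    timed_out: bool = False
    connection_error: Optional[str] = None  # e.g. "dns", "refused", "tls"


def failure_reason(
    result: CheckResult,
    expected_codes: Container[int] = range(200, 205),  # 200-204 inclusive
    keyword: Optional[str] = None,                      # None means keyword check disabled
) -> Optional[str]:
    """Return why the check failed, or None if it passed."""
    if result.timed_out:
        return "timeout"
    if result.connection_error is not None:
        return "connection_error"
    if result.status_code not in expected_codes:
        return "unexpected_status_code"
    if keyword is not None and keyword not in result.body:
        return "keyword_missing"
    return None


print(failure_reason(CheckResult(status_code=200, body="OK")))              # None
print(failure_reason(CheckResult(status_code=503)))                         # unexpected_status_code
print(failure_reason(CheckResult(status_code=200, body=""), keyword="OK"))  # keyword_missing
```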

Multi-Region Monitoring and Quorum

When you monitor from multiple regions, Pingara applies a quorum rule to further reduce false positives.

How Quorum Works

Instead of trusting a single region's result, Pingara requires failures from multiple regions before confirming a problem:

Regions Configured   Quorum Required   Meaning
1 region             1 of 1            Any failure triggers
2 regions            2 of 2            Both must fail
3 regions            2 of 3            Majority must fail
4 regions            3 of 4            Majority must fail
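
Every row in the table is consistent with a strict majority, that is, more than half of the configured regions. The sketch below encodes that as a formula. The function name `quorum_required` is hypothetical, and whether Pingara applies the same formula beyond 4 regions is an assumption.

```python
def quorum_required(regions: int) -> int:
    """Smallest number of failing regions that is more than half of all regions."""
    if regions < 1:
        raise ValueError("at least one region is required")
    return regions // 2 + 1


for n in range(1, 5):
    print(f"{n} region(s): {quorum_required(n)} of {n}")
# 1 region(s): 1 of 1
# 2 region(s): 2 of 2
# 3 region(s): 2 of 3
# 4 region(s): 3 of 4
```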

Why Quorum Matters

Without quorum (single region):

US East check fails (ISP routing issue) → False incident!

With quorum (3 regions):

US East: ❌ Fail
EU West: ✅ Pass
AP South: ✅ Pass
→ 1 of 3 failed → NOT an incident (regional issue)

Actual outage (3 regions):

US East: ❌ Fail
EU West: ❌ Fail
AP South: ❌ Fail
→ 3 of 3 failed → Incident created (confirmed outage)
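
The two scenarios above can be checked with a short function that counts failing regions against a strict-majority quorum. This is an illustrative sketch: the function name and the region keys are hypothetical, and the majority rule is inferred from the quorum table rather than taken from Pingara's code.

```python
from typing import Dict


def quorum_failed(results: Dict[str, bool]) -> bool:
    """results maps a region name to True if its check passed.

    Returns True when enough regions failed to confirm the failure.
    """
    failures = sum(1 for passed in results.values() if not passed)
    required = len(results) // 2 + 1  # strict majority of configured regions
    return failures >= required


regional_issue = {"us-east": False, "eu-west": True, "ap-south": True}
actual_outage = {"us-east": False, "eu-west": False, "ap-south": False}

print(quorum_failed(regional_issue))  # False
print(quorum_failed(actual_outage))   # True
```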

Recommendation: Enable at least 2 regions for production monitors. Use 3-4 regions for critical services.

Incident Lifecycle

Every incident progresses through a defined set of statuses:

1. Investigating

The initial state when an incident is first created.

  • Trigger: 2 consecutive failures confirmed
  • Actions: Alerts sent to all configured channels
  • What it means: Pingara has detected a problem and is tracking it

2. Identified

The cause of the incident has been identified (manually set or AI-assisted).

  • Trigger: Manual status update or AI root cause analysis
  • What it means: The team understands the problem

3. Monitoring

A fix has been applied and the team is watching for confirmation.

  • Trigger: Manual status update
  • What it means: Recovery is expected, watching closely

4. Resolved

The incident is over. Service has been restored.

  • Trigger: 2 consecutive successful checks (automatic) or manual resolution
  • Actions: Recovery alerts sent to all configured channels
  • What it means: The problem is fixed

Investigating → Identified → Monitoring → Resolved
     ↑              ↑            ↑            ↑
  (automatic)   (manual)    (manual)    (automatic)
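
One way to model the lifecycle is an ordered enum with a forward-only transition check. This sketch assumes that a status may skip ahead (for example, automatic resolution straight from Investigating) but never move backward, and that Resolved is final. The documentation does not state these rules explicitly, and the names here are hypothetical.

```python
from enum import Enum


class IncidentStatus(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Enum members iterate in definition order, which matches the lifecycle order.
ORDER = list(IncidentStatus)


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Assumed rule: only forward moves are allowed, so Resolved is terminal."""
    return ORDER.index(target) > ORDER.index(current)


print(can_transition(IncidentStatus.INVESTIGATING, IncidentStatus.IDENTIFIED))  # True
print(can_transition(IncidentStatus.INVESTIGATING, IncidentStatus.RESOLVED))    # True
print(can_transition(IncidentStatus.RESOLVED, IncidentStatus.MONITORING))       # False
```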

Automatic Resolution

Pingara resolves incidents automatically, using the same consecutive-check model that creates them.

The 2-Success Rule

  1. First success — Check passes, but incident stays open. Consecutive success counter increments to 1.
  2. Second consecutive success — Check passes again. Counter reaches 2. Incident is resolved.

This prevents premature resolution from a single lucky check during an intermittent outage.

What Happens on Resolution

  1. Incident status changes to Resolved
  2. resolvedAt timestamp is recorded
  3. Recovery alert notifications are dispatched
  4. Monitor status returns to Up
  5. Consecutive failure counter resets to 0
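
The five resolution steps map naturally onto one function. The sketch below is illustrative only: `Monitor`, `Incident`, `resolve_incident`, and the `notify` callback are hypothetical stand-ins, and only the field name `resolved_at` mirrors the `resolvedAt` timestamp mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Monitor:
    status: str = "down"
    consecutive_failures: int = 0


@dataclass
class Incident:
    monitor: Monitor
    status: str = "investigating"
    resolved_at: Optional[datetime] = None


def resolve_incident(
    incident: Incident,
    notify: Callable[[Incident], None],
    now: Optional[datetime] = None,
) -> None:
    incident.status = "resolved"                              # 1. status changes to Resolved
    incident.resolved_at = now or datetime.now(timezone.utc)  # 2. timestamp is recorded
    notify(incident)                                          # 3. recovery alerts are dispatched
    incident.monitor.status = "up"                            # 4. monitor returns to Up
    incident.monitor.consecutive_failures = 0                 # 5. failure counter resets


sent = []  # stands in for the configured alert channels
incident = Incident(monitor=Monitor(consecutive_failures=3))
resolve_incident(incident, notify=sent.append)
print(incident.status, incident.monitor.status, len(sent))  # resolved up 1
```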

Incident Details

Each incident records comprehensive information for post-mortems:

Core Data

  • Monitor — Which monitor detected the issue
  • Status — Current lifecycle stage
  • Started at — When the incident was created
  • Resolved at — When recovery was confirmed
  • Duration — Total time from start to resolution

Diagnostic Data

  • Error type — Category of failure (timeout, DNS failure, connection refused, etc.)
  • Error message — Detailed error description
  • Affected regions — Which monitoring regions detected the failure
  • Average response time — Mean response time during the incident

AI Analysis

  • Root cause hint — AI-generated analysis of what likely caused the incident (see Root Cause Analysis)
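
For post-mortem tooling, the fields above could be modeled as a single record. This is a sketch, not Pingara's schema: the class name, field names, types, and sample values are all assumptions made for illustration. Duration is derived from the two timestamps instead of being stored.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentRecord:
    # Core data
    monitor: str
    status: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    # Diagnostic data
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    affected_regions: List[str] = field(default_factory=list)
    average_response_time_ms: Optional[float] = None
    # AI analysis
    root_cause_hint: Optional[str] = None

    @property
    def duration(self) -> Optional[timedelta]:
        """Total time from start to resolution; None while the incident is open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


# Made-up sample values, for illustration only.
record = IncidentRecord(
    monitor="checkout-api",
    status="resolved",
    started_at=datetime(2026, 4, 1, 12, 0),
    resolved_at=datetime(2026, 4, 1, 12, 7),
    error_type="timeout",
    affected_regions=["us-east", "eu-west"],
)
print(record.duration)  # 0:07:00
```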

Viewing Incidents

Incident List

Navigate to Incidents in the sidebar to see all incidents across your organization.

Filters available:

  • Status — Investigating, Identified, Monitoring, Resolved
  • Monitor — Filter by specific monitor
  • Time range — Last 24h, 7d, 30d, or custom

Incident Detail

Click any incident to see:

  • Full timeline of status changes
  • Performance metrics at time of failure
  • Affected regions
  • Root cause hints (AI-generated)
  • Notification history

Incident vs Degraded

A degraded monitor is still passing its checks, so it does not create an incident. The two states differ as follows:

                   Incident (Down)          Degraded
Check result       Failed                   Passed (but slow)
Trigger            2 consecutive failures   Response time > 4× Apdex threshold
Severity           High                     Medium
Creates incident   Yes                      No (separate alert)
Auto-resolves      Yes (2 successes)        Yes (when latency improves)
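
The degraded trigger is a simple comparison against four times the Apdex threshold. In the sketch below, the function name, the parameter names, and the use of milliseconds are assumptions. Only the 4× multiplier and the rule that the check must have passed come from the comparison above.

```python
def is_degraded(passed: bool, response_time_ms: float, apdex_threshold_ms: float) -> bool:
    """A check is degraded when it passed but took longer than 4x the Apdex threshold."""
    return passed and response_time_ms > 4 * apdex_threshold_ms


# With a hypothetical Apdex threshold of 500 ms, the degraded cutoff is 2000 ms.
print(is_degraded(True, 2500, 500))   # True: passed, but slower than 2000 ms
print(is_degraded(True, 2000, 500))   # False: exactly at the cutoff, not above it
print(is_degraded(False, 9000, 500))  # False: a failed check counts toward an incident instead
```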

Best Practices

Set Appropriate Intervals

For critical services, use shorter check intervals:

  • 30 seconds — Fastest detection (~1 minute to incident)
  • 1 minute — Good balance (~2 minutes to incident)
  • 5 minutes — Standard monitoring (~10 minutes to incident)
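
The approximate times above follow from the 2-failure rule. If an outage starts just after a passing check, the first failure arrives up to one interval later and the confirming failure one interval after that. The sketch below computes this upper bound. It ignores check timeouts and any multi-region coordination delay, and the function name is hypothetical.

```python
def worst_case_detection_seconds(interval_seconds: int, failure_threshold: int = 2) -> int:
    """Upper bound on time from outage start to incident creation."""
    return interval_seconds * failure_threshold


for interval in (30, 60, 300):
    print(f"{interval}s interval -> up to {worst_case_detection_seconds(interval)}s to incident")
# 30s interval -> up to 60s to incident
# 60s interval -> up to 120s to incident
# 300s interval -> up to 600s to incident
```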

Enable Multi-Region

Single-region monitoring is prone to false positives, because a network problem local to that one region is indistinguishable from a real outage. Use at least 2 regions for production monitors.

Review Incidents Regularly

Schedule weekly or monthly incident reviews:

  1. How many incidents occurred?
  2. What was the average resolution time?
  3. Were there any false positives?
  4. Should Apdex thresholds be adjusted?
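
The first two review questions can be answered from exported incident timestamps. This is a minimal sketch that assumes you already have (started at, resolved at) pairs from some export. The function name and the sample data are made up for illustration.

```python
from datetime import datetime
from statistics import mean
from typing import List, Optional, Tuple


def review_summary(incidents: List[Tuple[datetime, Optional[datetime]]]) -> dict:
    """Count incidents and average the resolution time of the resolved ones."""
    durations = [
        (resolved - started).total_seconds()
        for started, resolved in incidents
        if resolved is not None  # open incidents have no resolution time yet
    ]
    return {
        "count": len(incidents),
        "resolved": len(durations),
        "mean_resolution_seconds": mean(durations) if durations else None,
    }


# Made-up sample: two resolved incidents (10 and 20 minutes) and one still open.
sample = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 1, 9, 10)),
    (datetime(2026, 4, 3, 14, 0), datetime(2026, 4, 3, 14, 20)),
    (datetime(2026, 4, 6, 8, 0), None),
]
print(review_summary(sample))
# {'count': 3, 'resolved': 2, 'mean_resolution_seconds': 900.0}
```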

Use Root Cause Analysis

Pingara's AI-powered root cause analysis can help you understand why an incident occurred, not just that it happened. See Root Cause Analysis for details.
