Pingara: The Affordable Monitoring Solution

Incidents are the core of Pingara's alerting system. An incident represents a confirmed outage or service disruption — not a single failed check, but a verified problem that requires attention.

How Incidents Are Created

Pingara uses a consecutive-failure model to avoid false positives from transient network blips.

The 2-Failure Rule

A single failed check does not create an incident. Instead:

First failure — Check fails, but no incident yet. The consecutive failure counter increments to 1.
Second consecutive failure — Check fails again. Counter reaches 2. Incident is created.

This way you are alerted to sustained problems, not momentary network hiccups.

Check 1: ✅ Pass     → Status: Up
Check 2: ❌ Fail     → Status: Up (1 consecutive failure)
Check 3: ❌ Fail     → Status: Down (2 consecutive failures → Incident created!)
Check 4: ❌ Fail     → Status: Down (incident ongoing)
Check 5: ✅ Pass     → Status: Down (1 consecutive success)
Check 6: ✅ Pass     → Status: Up (2 consecutive successes → Incident resolved!)
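
The timeline above can be reproduced with a small state machine. This is a minimal sketch of the 2-failure and 2-success rules as described on this page, not Pingara's actual implementation. The names MonitorState, apply_check, and replay are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

FAILURE_THRESHOLD = 2  # consecutive failures needed to create an incident
SUCCESS_THRESHOLD = 2  # consecutive successes needed to resolve it


@dataclass
class MonitorState:
    """Hypothetical in-memory state for one monitor (not Pingara's schema)."""
    status: str = "Up"
    consecutive_failures: int = 0
    consecutive_successes: int = 0


def apply_check(state: MonitorState, passed: bool) -> Optional[str]:
    """Apply one check result; return an event name or None."""
    if passed:
        # Assumption: a pass breaks any failure streak immediately.
        state.consecutive_failures = 0
        state.consecutive_successes += 1
        if state.status == "Down" and state.consecutive_successes >= SUCCESS_THRESHOLD:
            state.status = "Up"
            return "incident_resolved"
    else:
        state.consecutive_successes = 0
        state.consecutive_failures += 1
        if state.status == "Up" and state.consecutive_failures >= FAILURE_THRESHOLD:
            state.status = "Down"
            return "incident_created"
    return None


def replay(results: List[bool]) -> List[Tuple[Optional[str], str]]:
    """Run a sequence of check results and record (event, status) per check."""
    state = MonitorState()
    return [(apply_check(state, passed), state.status) for passed in results]


# Same sequence as the six checks above: pass, fail, fail, fail, pass, pass.
timeline = replay([True, False, False, False, True, True])
```

Each entry in timeline pairs the event raised by a check (if any) with the monitor status after that check, so the third entry carries the incident creation and the sixth carries the resolution.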

What Counts as a Failure

A check is marked as "failed" if any of these occur:

Unexpected status code — Response code not in the expected list (default: 200-204)
Timeout — Server didn't respond within the configured timeout
Connection error — DNS failure, TCP connection refused, TLS handshake failure
Keyword missing — Response body doesn't contain the expected keyword (if keyword check is enabled)
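
As an illustration, the four failure conditions can be expressed as a single classification function. This is a sketch under assumptions: the CheckResult fields and the failure_reason function are hypothetical, and the default range 200-204 is read as inclusive.

```python
from dataclasses import dataclass
from typing import Collection, Optional

# Assumption: the documented default "200-204" is an inclusive range.
DEFAULT_EXPECTED_STATUSES = frozenset(range(200, 205))


@dataclass
class CheckResult:
    """Hypothetical outcome of one HTTP check."""
    status_code: Optional[int] = None        # None if no response arrived
    body: str = ""
    timed_out: bool = False
    connection_error: Optional[str] = None   # e.g. "dns", "refused", "tls"


def failure_reason(
    result: CheckResult,
    expected_statuses: Collection[int] = DEFAULT_EXPECTED_STATUSES,
    keyword: Optional[str] = None,
) -> Optional[str]:
    """Return why the check failed, or None if it passed."""
    if result.timed_out:
        return "timeout"
    if result.connection_error is not None:
        return "connection_error"
    if result.status_code not in expected_statuses:
        return "unexpected_status"
    # The keyword check only applies when a keyword is configured.
    if keyword is not None and keyword not in result.body:
        return "keyword_missing"
    return None
```

Transport-level problems are tested first because a timed-out or refused connection has no status code or body to inspect.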

Multi-Region Monitoring and Quorum

When you monitor from multiple regions, Pingara applies a quorum rule to further reduce false positives.

How Quorum Works

Instead of trusting a single region's result, Pingara requires failures from multiple regions before confirming a problem:

Regions Configured	Quorum Required	Meaning
1 region	1 of 1	Any failure triggers
2 regions	2 of 2	Both must fail
3 regions	2 of 3	Majority must fail
4 regions	3 of 4	Majority must fail

Why Quorum Matters

Without quorum (single region):

US East checks fail twice in a row (ISP routing issue) → False incident!

With quorum (3 regions):

US East: ❌ Fail
EU West: ✅ Pass
AP South: ✅ Pass
→ 1 of 3 failed → NOT an incident (regional issue)

Actual outage (3 regions):

US East: ❌ Fail
EU West: ❌ Fail
AP South: ❌ Fail
→ 3 of 3 failed → Incident created (confirmed outage)
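
The quorum table and the scenarios above are consistent with a strict-majority rule. The sketch below assumes that rule; the function names are hypothetical, and the behaviour for more than 4 regions is an extrapolation that this page does not confirm.

```python
from typing import Mapping


def required_quorum(region_count: int) -> int:
    """Failing regions needed to confirm a problem (strict majority).

    Matches the table for 1 to 4 regions: 1, 2, 2, 3.
    """
    if region_count < 1:
        raise ValueError("at least one region is required")
    return region_count // 2 + 1


def quorum_reached(region_passed: Mapping[str, bool]) -> bool:
    """True if enough regions failed to treat the failure as confirmed."""
    failures = sum(1 for passed in region_passed.values() if not passed)
    return failures >= required_quorum(len(region_passed))


# The two 3-region scenarios from the text.
regional_issue = {"US East": False, "EU West": True, "AP South": True}
actual_outage = {"US East": False, "EU West": False, "AP South": False}
```

With these inputs, quorum_reached(regional_issue) is False because only 1 of 3 regions failed, while quorum_reached(actual_outage) is True.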

Recommendation: Enable at least 2 regions for production monitors. Use 3-4 regions for critical services.

Incident Lifecycle

Every incident progresses through a defined set of statuses:

1. Investigating

The initial state when an incident is first created.

Trigger: 2 consecutive failures confirmed
Actions: Alerts sent to all configured channels
What it means: Pingara has detected a problem and is tracking it

2. Identified

The cause of the incident has been identified (manually set or AI-assisted).

Trigger: Manual status update or AI root cause analysis
What it means: The team understands the problem

3. Monitoring

A fix has been applied and the team is watching for confirmation.

Trigger: Manual status update
What it means: Recovery is expected, watching closely

4. Resolved

The incident is over. Service has been restored.

Trigger: 2 consecutive successful checks (automatic) or manual resolution
Actions: Recovery alerts sent to all configured channels
What it means: The problem is fixed

Investigating → Identified → Monitoring → Resolved
     ↑              ↑            ↑            ↑
(automatic)  (manual or AI)  (manual)  (automatic or manual)
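
The lifecycle can be modelled as an ordered set of statuses with a transition check. This is a sketch under two assumptions that this page does not state explicitly: statuses only move forward one step at a time, and Resolved can be reached from any open status (since automatic resolution does not wait for manual updates). IncidentStatus and can_transition are hypothetical names.

```python
from enum import Enum


class IncidentStatus(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Enum members iterate in definition order, which is the lifecycle order.
LIFECYCLE = list(IncidentStatus)


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Return True if moving from current to target is allowed."""
    if current is IncidentStatus.RESOLVED:
        return False  # assumption: a resolved incident is never reopened
    if target is IncidentStatus.RESOLVED:
        return True   # assumption: any open incident can be resolved
    # Otherwise allow only the next step in the lifecycle.
    return LIFECYCLE.index(target) == LIFECYCLE.index(current) + 1
```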

Automatic Resolution

Pingara resolves incidents automatically, using the same consecutive-check model it uses to create them:

The 2-Success Rule

First success — Check passes, but incident stays open. Consecutive success counter increments to 1.
Second consecutive success — Check passes again. Counter reaches 2. Incident is resolved.

This prevents premature resolution from a single lucky check during an intermittent outage.

What Happens on Resolution

Incident status changes to Resolved
resolvedAt timestamp is recorded
Recovery alert notifications are dispatched
Monitor status returns to Up
Consecutive failure counter resets to 0
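
The resolution steps listed above can be sketched as one function. Only the resolvedAt field name comes from this page; the Incident and Monitor classes, the notify callback, and the "recovery" alert label are hypothetical stand-ins.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Incident:
    """Hypothetical incident record."""
    status: str = "Investigating"
    resolved_at: Optional[datetime] = None  # mirrors the resolvedAt field


@dataclass
class Monitor:
    """Hypothetical monitor state while an incident is open."""
    status: str = "Down"
    consecutive_failures: int = 2


def resolve_incident(
    incident: Incident,
    monitor: Monitor,
    notify: Callable[[str], None],
    now: Optional[datetime] = None,
) -> None:
    """Apply the resolution effects in the order listed above."""
    incident.status = "Resolved"
    incident.resolved_at = now or datetime.now(timezone.utc)
    notify("recovery")                # dispatch recovery alerts
    monitor.status = "Up"
    monitor.consecutive_failures = 0  # reset the failure counter
```

Passing notify as a callback keeps the sketch independent of any particular alert channel.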

Incident Details

Each incident records comprehensive information for post-mortems:

Core Data

Monitor — Which monitor detected the issue
Status — Current lifecycle stage
Started at — When the incident was created
Resolved at — When recovery was confirmed
Duration — Total time from start to resolution

Diagnostic Data

Error type — Category of failure (timeout, DNS failure, connection refused, etc.)
Error message — Detailed error description
Affected regions — Which monitoring regions detected the failure
Average response time — Mean response time during the incident

AI Analysis

Root cause hint — AI-generated analysis of what likely caused the incident (see Root Cause Analysis)

Viewing Incidents

Incident List

Navigate to Incidents in the sidebar to see all incidents across your organization.

Filters available:

Status — Investigating, Identified, Monitoring, Resolved
Monitor — Filter by specific monitor
Time range — Last 24h, 7d, 30d, or custom

Incident Detail

Click any incident to see:

Full timeline of status changes
Performance metrics at time of failure
Affected regions
Root cause hints (AI-generated)
Notification history

Incident vs Degraded

A degraded monitor is not an incident. The two states differ as follows:

	Incident (Down)	Degraded
Check result	Failed	Passed (but slow)
Trigger	2 consecutive failures	Response time > 4× Apdex threshold
Severity	High	Medium
Creates incident	Yes	No (separate alert)
Auto-resolves	Yes (2 successes)	Yes (when latency improves)
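
The distinction in the table can be sketched as a per-check classification. The function name, the return labels, and the use of milliseconds are assumptions for illustration; the 4× multiplier and the strict greater-than comparison come from the table.

```python
DEGRADED_MULTIPLIER = 4  # "Response time > 4× Apdex threshold"


def classify_check(passed: bool, response_time_ms: float, apdex_threshold_ms: float) -> str:
    """Classify one check as "failed", "degraded", or "ok"."""
    if not passed:
        return "failed"      # counts toward the 2-failure rule
    if response_time_ms > DEGRADED_MULTIPLIER * apdex_threshold_ms:
        return "degraded"    # passed, but too slow; no incident is created
    return "ok"
```

For example, with a hypothetical Apdex threshold of 500 ms, a passing check that takes 2100 ms is classified as degraded, while one that takes exactly 2000 ms is not.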

Best Practices

Set Appropriate Intervals

For critical services, use shorter check intervals. The times in parentheses account for the 2 consecutive failures needed to create an incident:

30 seconds — Fastest detection (~1 minute to incident)
1 minute — Good balance (~2 minutes to incident)
5 minutes — Standard monitoring (~10 minutes to incident)
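
The detection times above follow from the 2-failure rule. One way to read them, offered here as an interpretation rather than a documented formula: if an outage begins just after a passing check, the first failure is observed up to one interval later and the second failure one interval after that, so the worst case is about two intervals. The function below is a hypothetical helper that ignores check timeouts and multi-region scheduling.

```python
def worst_case_seconds_to_incident(interval_seconds: int, failures_required: int = 2) -> int:
    """Approximate upper bound on the time from outage start to incident creation."""
    return interval_seconds * failures_required


# The three intervals listed above, in seconds.
detection_times = {
    interval: worst_case_seconds_to_incident(interval)
    for interval in (30, 60, 300)
}
```

This gives 60, 120, and 600 seconds for the three intervals, matching the approximate figures of 1, 2, and 10 minutes.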

Enable Multi-Region

Single-region monitoring is prone to false positives, because one region's network problem is enough to trigger an incident. Always use at least 2 regions for production monitors.

Review Incidents Regularly

Schedule weekly or monthly incident reviews:

How many incidents occurred?
What was the average resolution time?
Were there any false positives?
Should Apdex thresholds be adjusted?
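
The first two review questions can be answered from the Started at and Resolved at fields alone. The sketch below assumes incidents have been exported as (started_at, resolved_at) pairs; that export format and the review_summary function are hypothetical.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Optional, Tuple

IncidentSpan = Tuple[datetime, Optional[datetime]]  # resolved_at is None while open


def review_summary(incidents: List[IncidentSpan]) -> Dict[str, Optional[float]]:
    """Count incidents and compute the mean resolution time in seconds."""
    durations = [
        (resolved - started).total_seconds()
        for started, resolved in incidents
        if resolved is not None
    ]
    return {
        "total": len(incidents),
        "resolved": len(durations),
        "mean_resolution_seconds": mean(durations) if durations else None,
    }
```

Incidents that are still open are counted in the total but excluded from the mean, so they do not skew the resolution time.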

Use Root Cause Analysis

Pingara's AI-powered root cause analysis can help you understand why an incident occurred, not just that it happened. See Root Cause Analysis for details.

Next Steps

Root Cause Analysis — AI-powered incident diagnostics
Setting Up Alerts — Get notified when incidents occur
HTTP/HTTPS Monitoring — Configure your monitors

Understanding Incidents