
Incident Response Plan

This document outlines the procedures for detecting, responding to, and recovering from incidents affecting the Lightning Enable API.

Incident Severity Levels

Critical (P1)

Definition: Complete service outage or payment processing failure affecting all customers.

Examples:

  • API is completely down (all endpoints returning errors)
  • Payment creation failing for all merchants
  • Database unavailable
  • OpenNode integration completely broken
  • Data breach or security incident

Response Time: Immediate (within 15 minutes)

Resolution Target: 1 hour


High (P2)

Definition: Degraded performance or partial outage affecting significant functionality.

Examples:

  • API response times > 5 seconds
  • Intermittent 500 errors (> 5% error rate)
  • Webhook delivery failures
  • One product tier (L402, Kentico, or Standalone) completely unavailable
  • Authentication system degraded

Response Time: Within 30 minutes

Resolution Target: 4 hours


Medium (P3)

Definition: Non-critical feature unavailable or degraded; workarounds are available.

Examples:

  • Exchange rate API returning stale data
  • Admin dashboard slow or partially broken
  • Hangfire jobs delayed but not failed
  • Documentation site down
  • Single merchant reporting issues

Response Time: Within 2 hours

Resolution Target: 24 hours


Low (P4)

Definition: Minor issues with no significant customer impact.

Examples:

  • Cosmetic issues in admin dashboard
  • Non-critical log warnings
  • Minor documentation errors
  • Performance slightly below baseline

Response Time: Next business day

Resolution Target: 1 week
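
The targets above can be turned into concrete deadlines for a ticket. The following is a minimal sketch, not part of the Lightning Enable codebase: the names `TARGETS`, `next_business_day`, and `deadlines` are hypothetical, and it assumes both targets are measured from the detection time and that business days are Monday to Friday with no holiday calendar.

```python
from datetime import datetime, timedelta

# (response target, resolution target) per severity, copied from the levels above.
# P4's "next business day" response is not a fixed duration, so it is stored as
# None and computed by next_business_day().
TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=1)),
    "P2": (timedelta(minutes=30), timedelta(hours=4)),
    "P3": (timedelta(hours=2), timedelta(hours=24)),
    "P4": (None, timedelta(weeks=1)),
}


def next_business_day(start):
    """Return the same clock time on the next Monday-Friday day after start."""
    day = start + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def deadlines(severity, detected_at):
    """Return (respond_by, resolve_by), both measured from detected_at (assumption)."""
    response, resolution = TARGETS[severity]
    if response is None:
        respond_by = next_business_day(detected_at)
    else:
        respond_by = detected_at + response
    return respond_by, detected_at + resolution
```

For example, `deadlines("P2", datetime(2026, 1, 9, 10, 0))` gives a response deadline of 10:30 and a resolution deadline of 14:00 on the same day.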


Detection & Alerting

Application Insights Alerts

Lightning Enable uses Azure Application Insights for monitoring. The following alerts should be configured:

Availability Alerts

| Alert | Threshold | Severity |
|-------|-----------|----------|
| Health endpoint failure | 2 consecutive failures | Critical |
| API availability < 99% | 5-minute window | High |
| API availability < 95% | 5-minute window | Critical |

Performance Alerts

| Alert | Threshold | Severity |
|-------|-----------|----------|
| Average response time > 2s | 5-minute window | Medium |
| Average response time > 5s | 5-minute window | High |
| P95 response time > 10s | 5-minute window | High |

Error Rate Alerts

| Alert | Threshold | Severity |
|-------|-----------|----------|
| 5xx error rate > 1% | 5-minute window | Medium |
| 5xx error rate > 5% | 5-minute window | High |
| 5xx error rate > 10% | 5-minute window | Critical |

Custom Alerts

| Alert | Threshold | Severity |
|-------|-----------|----------|
| Payment creation failures > 5 | 15-minute window | High |
| OpenNode API errors > 10 | 15-minute window | High |
| Database connection failures | Any occurrence | Critical |
| Stripe webhook failures > 3 | 1-hour window | Medium |
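
When several alerts fire at once, the incident should take the highest severity among them. The sketch below illustrates that rule for the availability, performance, and error rate thresholds listed above. It is not how Application Insights evaluates alerts: `classify_window` and `SEVERITY_ORDER` are hypothetical names, and the custom alerts are left out because they use different windows.

```python
# Severity labels ordered from least to most severe.
SEVERITY_ORDER = ["Medium", "High", "Critical"]


def classify_window(availability_pct, avg_response_s, p95_response_s, error_rate_pct):
    """Return the highest severity triggered by one 5-minute window, or None."""
    triggered = []

    # Availability alerts
    if availability_pct < 95:
        triggered.append("Critical")
    elif availability_pct < 99:
        triggered.append("High")

    # Performance alerts
    if avg_response_s > 5:
        triggered.append("High")
    elif avg_response_s > 2:
        triggered.append("Medium")
    if p95_response_s > 10:
        triggered.append("High")

    # 5xx error rate alerts
    if error_rate_pct > 10:
        triggered.append("Critical")
    elif error_rate_pct > 5:
        triggered.append("High")
    elif error_rate_pct > 1:
        triggered.append("Medium")

    if not triggered:
        return None
    return max(triggered, key=SEVERITY_ORDER.index)
```

A window with 99.9% availability, a 2.5 s average response time, and a 12% 5xx rate triggers both a Medium and a Critical alert, so it is classified as Critical.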

Health Endpoint Monitoring

The API exposes health endpoints that should be monitored:

# Basic health check
GET https://api.lightningenable.com/health

# Detailed health (requires admin key)
GET https://api.lightningenable.com/health/ready

Expected Response:

{
  "status": "Healthy",
  "checks": {
    "database": "Healthy",
    "opennode": "Healthy"
  }
}

Set up external monitoring (e.g., Azure Monitor, Pingdom, or UptimeRobot) to poll the health endpoint once per minute.
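
If a hosted monitor is not yet in place, a small script can stand in for it. The following is a minimal sketch using only the Python standard library. It assumes the endpoint returns the JSON shape shown under Expected Response; `unhealthy_checks` and `probe` are hypothetical helper names, not part of the API.

```python
import json
import urllib.request

HEALTH_URL = "https://api.lightningenable.com/health"


def unhealthy_checks(body):
    """Return the names of anything not reporting "Healthy" in a response body."""
    data = json.loads(body)
    failing = [name for name, state in data.get("checks", {}).items() if state != "Healthy"]
    # Overall status is bad but no individual check explains it.
    if data.get("status") != "Healthy" and not failing:
        failing.append("status")
    return failing


def probe(url=HEALTH_URL, timeout=5):
    """Fetch the health endpoint once and return (ok, failing_checks)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            failing = unhealthy_checks(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Network errors, timeouts, HTTP error statuses, and non-JSON bodies
        # all count as a failed probe.
        return False, ["unreachable"]
    return not failing, failing
```

Calling `probe()` once per minute from a scheduler and alerting after two consecutive failed probes would mirror the health endpoint alert in the availability table.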

Azure Portal Access

  1. Navigate to Azure Portal
  2. Search for "Application Insights"
  3. Select "lightning-enable-insights"
  4. Key dashboards:
    • Overview - Quick health snapshot
    • Failures - Error analysis
    • Performance - Response time analysis
    • Live Metrics - Real-time monitoring

Response Procedures by Severity

Critical (P1) Response

Immediate Actions (0-15 minutes)

  1. Acknowledge the incident

    • Note the start time
    • Create incident ticket
  2. Assess scope

    # Check API health
    curl https://api.lightningenable.com/health

    # Check recent errors in Application Insights
    # Azure Portal > Application Insights > Failures
  3. Communicate

    • Update status page (if available)
    • Notify affected merchants if known
    • Alert team members
  4. Initial triage

Investigation (15-60 minutes)

  1. Review logs

    • Application Insights > Logs
    • Query recent exceptions:
    exceptions
    | where timestamp > ago(1h)
    | order by timestamp desc
  2. Check deployments

    • Was there a recent deployment?
    • Consider rollback if deployment is suspect
  3. Check external dependencies

    • OpenNode API
    • Stripe API
    • Azure SQL Database

Escalation Path

  1. Immediately - On-call engineer (first responder)
  2. After 15 minutes - Senior engineer / team lead
  3. After 30 minutes - Engineering manager
  4. After 1 hour - Executive notification
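
The escalation path is a simple time-based lookup, which makes it easy to automate in a paging tool or chat bot. The sketch below assumes the times are measured from the start of the incident and apply only while it remains unresolved; `ESCALATION_PATH` and `engaged_by` are hypothetical names.

```python
# (minutes since the incident started, who to bring in), from the escalation path above.
ESCALATION_PATH = [
    (0, "On-call engineer"),
    (15, "Senior engineer / team lead"),
    (30, "Engineering manager"),
    (60, "Executive notification"),
]


def engaged_by(elapsed_minutes):
    """Return everyone who should have been engaged after elapsed_minutes."""
    return [who for after, who in ESCALATION_PATH if elapsed_minutes >= after]
```

For an incident that has been open for 35 minutes, `engaged_by(35)` returns the on-call engineer, the senior engineer / team lead, and the engineering manager.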

High (P2) Response

Actions (0-30 minutes)

  1. Acknowledge and assess

    • Identify affected functionality
    • Determine customer impact
  2. Investigate

    • Review Application Insights metrics
    • Check for patterns (specific endpoint, merchant, etc.)
  3. Mitigate

    • Enable additional logging if needed
    • Scale resources if performance-related
    • Implement temporary workarounds
  4. Communicate

    • Update internal status
    • Notify affected merchants if significant

Medium (P3) Response

  1. Create detailed ticket with findings
  2. Schedule investigation during business hours
  3. Document workarounds for affected users
  4. Plan fix for next release cycle

Low (P4) Response

  1. Log issue in backlog
  2. Address in regular sprint planning
  3. No immediate action required

Common Issues & Resolutions

API Returning 500 Errors

Symptoms:

  • All or most endpoints returning HTTP 500
  • Application Insights showing high exception rate

Diagnostic Steps:

// Application Insights - Recent exceptions
exceptions
| where timestamp > ago(30m)
| summarize count() by type, outerMessage
| order by count_ desc

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Database connection string invalid | Check ConnectionStrings__DefaultConnection in App Service Configuration |
| Encryption key changed/missing | Verify DB_ENCRYPTION_KEY matches production key |
| Missing required configuration | Check Application Insights for specific config errors |
| Out of memory | Restart App Service, consider scaling up |
| Unhandled exception in code | Review recent deployments, consider rollback |

Rollback Procedure:

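# Azure CLI - Roll back by swapping the staging and production slots.
# This restores the previous build only if the last deployment was promoted
# via a slot swap, so that the staging slot still holds the prior version.
az webapp deployment slot swap \
  --resource-group lightning-enable-rg \
  --name lightning-enable-api \
  --slot staging \
  --target-slot production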

Database Connection Issues

Symptoms:

  • Timeout errors in logs
  • "Cannot connect to database" exceptions
  • Health check failing on database

Diagnostic Steps:

  1. Check Azure SQL Database status in portal
  2. Verify firewall rules allow App Service
  3. Check connection string format

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| SQL Server maintenance | Wait for maintenance window to complete |
| Firewall blocking | Add App Service outbound IPs to SQL firewall |
| Connection pool exhaustion | Restart App Service, review connection handling |
| Credentials expired (Managed Identity) | Re-enable system-assigned managed identity |
| DTU limit exceeded | Scale up database tier |

Verify Connectivity:

# From Azure Cloud Shell or with VNet access
sqlcmd -S your-server.database.windows.net -d LightningEnable -U admin -P 'password'

OpenNode API Failures

Symptoms:

  • Payment creation failing
  • "OPENNODE_ERROR" or "OPENNODE_TIMEOUT" responses
  • 502 Bad Gateway errors

Diagnostic Steps:

  1. Check OpenNode status: https://status.opennode.com
  2. Verify merchant OpenNode key is valid
  3. Test OpenNode directly:
    curl -X GET "https://api.opennode.com/v1/account/balance" \
      -H "Authorization: YOUR_OPENNODE_KEY"

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| OpenNode outage | Wait for OpenNode to recover, monitor status page |
| Invalid API key | Merchant needs to regenerate key in OpenNode dashboard |
| OpenNode rate limiting | Implement request throttling |
| Network timeout | Increase HttpClient timeout, retry with backoff |
| Wrong environment | Verify OpenNode:Environment matches key type (dev/production) |
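
The "retry with backoff" remedy for network timeouts can be sketched as follows. This is a language-neutral illustration in Python, not the service's actual HttpClient configuration: `retry_with_backoff` and its parameters are hypothetical, and the attempt count and delays are assumed values that would need tuning.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call(), retrying on TimeoutError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the timeout to the caller.
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Only timeouts are retried here. Errors such as an invalid API key will not succeed on a second attempt, so they should fail immediately rather than add load during an OpenNode outage.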

Temporary Mitigation:

  • Enable payment creation queuing (if implemented)
  • Notify merchants of OpenNode status
  • Monitor for automatic recovery

Stripe API Failures

Symptoms:

  • Subscription creation failing
  • Webhook signature verification errors
  • Checkout sessions not creating

Diagnostic Steps:

  1. Check Stripe status: https://status.stripe.com
  2. Verify API keys in configuration
  3. Check Stripe Dashboard > Developers > Logs

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Webhook secret mismatch | Update Stripe:WebhookSecret from Stripe Dashboard |
| API key rotated | Update Stripe:SecretKey in App Service configuration |
| Stripe outage | Wait for recovery, queue subscription actions |
| Price ID invalid | Verify Stripe:PricingPlans configuration |
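
To confirm a suspected webhook secret mismatch, it can help to recompute the signature by hand for a captured event. Stripe signs the string formed by the timestamp, a dot, and the raw request body with HMAC-SHA256, and sends the result in the `v1` field of the `Stripe-Signature` header. The sketch below is a diagnostic aid only, with a hypothetical function name. It omits the timestamp tolerance check, and production code should rely on Stripe's official library.

```python
import hashlib
import hmac


def verify_stripe_signature(payload, header, secret):
    """Check a Stripe-Signature header value against the raw payload bytes."""
    timestamp = None
    candidates = []
    # The header looks like "t=<unix time>,v1=<hex digest>[,v1=...]".
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not candidates:
        return False

    signed_payload = timestamp.encode("utf-8") + b"." + payload
    expected = hmac.new(secret.encode("utf-8"), signed_payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)
```

If this returns False for the configured Stripe:WebhookSecret but True for the signing secret shown in the Stripe Dashboard, the configured value is stale. The payload must be the exact raw bytes received, since re-serialized JSON will not match.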

High Latency

Symptoms:

  • Response times > 2 seconds
  • Timeout errors in client applications
  • Performance alerts triggering

Diagnostic Steps:

// Application Insights - Slow requests
requests
| where timestamp > ago(1h)
| where duration > 2000
| summarize count(), avg(duration) by name
| order by avg_duration desc

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Database slow queries | Review query plans, add indexes |
| High CPU utilization | Scale up App Service plan |
| Memory pressure | Restart app, scale up if persistent |
| External API slow | Add caching, increase timeouts |
| Cold start | Enable Always On in App Service |

Quick Fixes:

  • Restart the App Service
  • Scale out (add instances)
  • Enable Application Insights Profiler for detailed analysis

Rate Limiting Issues

Symptoms:

  • Merchants reporting 429 errors
  • Legitimate traffic being blocked
  • Rate limit alerts triggering

Diagnostic Steps:

// Check rate limit violations per API key in the last hour
traces
| where timestamp > ago(1h)
| where message contains "rate limit"
| summarize count() by tostring(customDimensions.ApiKey)

Common Causes & Solutions:

| Cause | Solution |
|-------|----------|
| Merchant polling too frequently | Guide merchant to use webhooks |
| Bot traffic | Review and block suspicious IPs |
| Rate limits too aggressive | Adjust limits in configuration |
| Single merchant spike | Contact merchant, temporarily increase their limit |

Post-Incident Procedures

Incident Report Template

Create an incident report within 48 hours of resolution for P1/P2 incidents:

# Incident Report: [Title]

## Summary
- **Incident ID:** INC-YYYY-MM-DD-001
- **Severity:** P1/P2/P3/P4
- **Duration:** [Start time] to [End time]
- **Impact:** [Brief description of customer impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident detected |
| HH:MM | Team notified |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |

## Root Cause
[Detailed technical explanation of what caused the incident]

## Impact Assessment
- **Merchants affected:** [Number]
- **Failed payments:** [Number]
- **Revenue impact:** [If applicable]

## Resolution
[What was done to fix the immediate issue]

## Prevention Measures
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]

## Lessons Learned
[What the team learned from this incident]
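
The header fields of the template can be filled in mechanically. The following is a minimal sketch with hypothetical helper names. It assumes the trailing number in the incident ID is a per-day sequence and that the 48-hour reporting window starts at the service-restored time.

```python
from datetime import date, datetime, timedelta


def incident_id(day, sequence):
    """Format an ID in the INC-YYYY-MM-DD-001 pattern used by the template."""
    return f"INC-{day:%Y-%m-%d}-{sequence:03d}"


def duration_minutes(start, end):
    """Return the incident duration in whole minutes."""
    return int((end - start).total_seconds() // 60)


def report_due(resolved_at):
    """Return the deadline for the report: 48 hours after resolution."""
    return resolved_at + timedelta(hours=48)
```

For the first incident on 9 January 2026, `incident_id(date(2026, 1, 9), 1)` yields `INC-2026-01-09-001`.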

Root Cause Analysis (5 Whys)

For P1/P2 incidents, perform a 5 Whys analysis:

  1. Why did the incident occur?
  2. Why did that happen?
  3. Why did that happen?
  4. Why did that happen?
  5. Why did that happen?

Prevention Measures Checklist

After each incident, consider:

  • Monitoring: Are there alerts that would catch this earlier?
  • Testing: Would additional tests have caught this?
  • Documentation: Is the runbook updated?
  • Automation: Can recovery be automated?
  • Architecture: Does the system design need improvement?
  • Process: Are deployment/change processes adequate?

Contact Information

Support Channels

| Type | Contact |
|------|---------|
| General Support | support@lightningenable.com |
| Security Issues | security@lightningenable.com |
| Enterprise Support | enterprise@lightningenable.com |

Useful Resources

| Resource | URL |
|----------|-----|
| Azure Portal | https://portal.azure.com |
| OpenNode Status | https://status.opennode.com |
| Stripe Status | https://status.stripe.com |
| GitHub Repository | https://github.com/AdrianRamsey13/lightning-enable |
| API Documentation | https://docs.lightningenable.com |
| Production API | https://api.lightningenable.com |

Azure Resources

| Resource | Name |
|----------|------|
| App Service | lightning-enable-api |
| SQL Database | lightning-enable-db |
| Application Insights | lightning-enable-insights |
| Key Vault | lightning-enable-kv |
| Resource Group | lightning-enable-rg |

Runbook Updates

This incident response plan should be reviewed and updated:

  • After every P1/P2 incident
  • Quarterly (at minimum)
  • When major infrastructure changes occur
  • When new integrations are added

Last Updated: January 2026


Next Steps