Incident Response Plan
This document outlines the procedures for detecting, responding to, and recovering from incidents affecting the Lightning Enable API.
Incident Severity Levels
Critical (P1)
Definition: Complete service outage or payment processing failure affecting all customers.
Examples:
- API is completely down (all endpoints returning errors)
- Payment creation failing for all merchants
- Database unavailable
- Payment provider integration completely broken (Strike or OpenNode)
- Data breach or security incident
Response Time: Immediate (within 15 minutes)
Resolution Target: 1 hour
High (P2)
Definition: Degraded performance or partial outage affecting significant functionality.
Examples:
- API response times > 5 seconds
- Intermittent 500 errors (> 5% error rate)
- Webhook delivery failures
- One product tier completely unavailable (L402, Kentico, or Standalone)
- Authentication system degraded
Response Time: Within 30 minutes
Resolution Target: 4 hours
Medium (P3)
Definition: Non-critical feature unavailable or degraded, with workarounds available.
Examples:
- Exchange rate API returning stale data
- Admin dashboard slow or partially broken
- Hangfire jobs delayed but not failed
- Documentation site down
- Single merchant reporting issues
Response Time: Within 2 hours
Resolution Target: 24 hours
Low (P4)
Definition: Minor issues with no significant customer impact.
Examples:
- Cosmetic issues in admin dashboard
- Non-critical log warnings
- Minor documentation errors
- Performance slightly below baseline
Response Time: Next business day
Resolution Target: 1 week
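The response and resolution targets above can be turned into concrete deadlines for an incident ticket. The following is a minimal sketch, not part of the Lightning Enable codebase: `SeverityPolicy`, `POLICIES`, and `deadlines` are hypothetical names, and "next business day" is approximated as 24 hours because this plan does not define a business calendar.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SeverityPolicy:
    """Response and resolution targets for one severity level."""
    response_within: timedelta
    resolve_within: timedelta


# Targets copied from the severity definitions in this plan.
# Assumption: "next business day" (P4) is approximated as 24 hours;
# a real implementation would need a business-calendar lookup.
POLICIES = {
    "P1": SeverityPolicy(timedelta(minutes=15), timedelta(hours=1)),
    "P2": SeverityPolicy(timedelta(minutes=30), timedelta(hours=4)),
    "P3": SeverityPolicy(timedelta(hours=2), timedelta(hours=24)),
    "P4": SeverityPolicy(timedelta(days=1), timedelta(weeks=1)),
}


def deadlines(severity: str, detected_at: datetime) -> tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for an incident detected at detected_at."""
    policy = POLICIES[severity]
    return detected_at + policy.response_within, detected_at + policy.resolve_within
```

For example, `deadlines("P1", datetime(2024, 1, 1, 12, 0))` gives a respond-by time of 12:15 and a resolve-by time of 13:00 on the same day.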
Detection & Alerting
Application Insights Alerts
Lightning Enable uses Azure Application Insights for monitoring. The following alerts should be configured:
Availability Alerts
| Alert | Threshold | Severity |
|---|---|---|
| Health endpoint failure | 2 consecutive failures | Critical |
| API availability < 99% | 5-minute window | High |
| API availability < 95% | 5-minute window | Critical |
Performance Alerts
| Alert | Threshold | Severity |
|---|---|---|
| Average response time > 2s | 5-minute window | Medium |
| Average response time > 5s | 5-minute window | High |
| P95 response time > 10s | 5-minute window | High |
Error Rate Alerts
| Alert | Threshold | Severity |
|---|---|---|
| 5xx error rate > 1% | 5-minute window | Medium |
| 5xx error rate > 5% | 5-minute window | High |
| 5xx error rate > 10% | 5-minute window | Critical |
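To make the error-rate tiers concrete, the sketch below maps request counts from one 5-minute window to the severity in the table above. It is illustrative only: in practice these rules would be configured as Application Insights alert rules, and `ERROR_RATE_THRESHOLDS` and `error_rate_severity` are hypothetical names.

```python
from typing import Optional

# (threshold, severity) pairs from the error-rate table, highest first,
# so the most severe matching tier wins.
ERROR_RATE_THRESHOLDS = [
    (0.10, "Critical"),
    (0.05, "High"),
    (0.01, "Medium"),
]


def error_rate_severity(total_requests: int, server_errors: int) -> Optional[str]:
    """Map the 5xx share of a 5-minute window to an alert severity, or None."""
    if total_requests <= 0:
        return None  # no traffic, so there is no meaningful rate
    rate = server_errors / total_requests
    for threshold, severity in ERROR_RATE_THRESHOLDS:
        if rate > threshold:  # strictly greater, matching the ">" in the table
            return severity
    return None
```

Checking the highest threshold first means a window with a 12% error rate raises one Critical alert instead of also matching the Medium and High tiers.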
Custom Alerts
| Alert | Threshold | Severity |
|---|---|---|
| Payment creation failures > 5 | 15-minute window | High |
| Payment provider API errors > 10 | 15-minute window | High |
| Database connection failures | Any occurrence | Critical |
| Stripe webhook failures > 3 | 1-hour window | Medium |
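The count-based custom alerts ("more than N events in a window") follow a sliding-window pattern. The sketch below shows that pattern in isolation, assuming event timestamps arrive in order. `SlidingWindowCounter` is a hypothetical helper, not an Application Insights API.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindowCounter:
    """Counts events inside a trailing time window and reports threshold breaches."""

    def __init__(self, window: timedelta, threshold: int) -> None:
        self.window = window
        self.threshold = threshold
        self._events = deque()  # event timestamps, oldest first

    def record(self, at: datetime) -> bool:
        """Record one event; return True if the count now exceeds the threshold."""
        self._events.append(at)
        cutoff = at - self.window
        # Drop events that have aged out of the trailing window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

With `SlidingWindowCounter(timedelta(minutes=15), threshold=5)`, the sixth payment creation failure recorded within 15 minutes returns `True`, which corresponds to the "Payment creation failures > 5" row.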
Health Endpoint Monitoring
The API exposes a structured health endpoint (public, no authentication required):
GET https://api.lightningenable.com/health
Expected Response (HTTP 200):
{
  "status": "Healthy",
  "totalDuration": 42.15,
  "checks": [
    {
      "name": "database",
      "status": "Healthy",
      "duration": 38.72,
      "description": null,
      "exception": null,
      "tags": ["db", "sql"]
    }
  ]
}
Unhealthy Response (HTTP 503):
{
  "status": "Unhealthy",
  "totalDuration": 5023.41,
  "checks": [
    {
      "name": "database",
      "status": "Unhealthy",
      "duration": 5001.88,
      "description": null,
      "exception": "A network-related or instance-specific error occurred...",
      "tags": ["db", "sql"]
    }
  ]
}
Set up external monitoring (e.g., Azure Monitor, Pingdom, UptimeRobot) to poll the health endpoint once per minute. Alert on any HTTP status other than 200, or parse the JSON and check each checks[].status value.
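The alerting rule described above can be sketched as a small function that takes the HTTP status code and response body and returns the names of any failing checks. This is a hedged illustration based only on the response shapes shown in this section. `failing_checks` and the `"<endpoint>"` placeholder are hypothetical, and fetching the URL is left to whichever monitoring tool or HTTP client is in use.

```python
import json


def failing_checks(status_code: int, body: str) -> list[str]:
    """Return names of checks that are not "Healthy"; an empty list means all clear.

    "<endpoint>" is a placeholder used when the response as a whole looks
    unhealthy but no individual check can be blamed.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # Body is not the expected JSON object (e.g. a gateway error page).
        return ["<endpoint>"]

    failing = [
        check.get("name", "<unnamed>")
        for check in payload.get("checks", [])
        if check.get("status") != "Healthy"
    ]
    # Non-200 status or unhealthy top-level status with no failing check listed.
    if not failing and (status_code != 200 or payload.get("status") != "Healthy"):
        failing.append("<endpoint>")
    return failing
```

Given the unhealthy response shown earlier, the function returns `["database"]`, which can go straight into the alert message.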
Azure Portal Access
- Navigate to Azure Portal
- Search for "Application Insights"
- Select "lightning-enable-insights"
- Key dashboards:
  - Overview - Quick health snapshot
  - Failures - Error analysis
  - Performance - Response time analysis
  - Live Metrics - Real-time monitoring
Response Procedures by Severity
Critical (P1) Response
Immediate Actions (0-15 minutes)
- Acknowledge the incident
  - Note the start time
  - Create incident ticket
- Assess scope
    # Check API health
    curl https://api.lightningenable.com/health
    # Check recent errors in Application Insights
    # Azure Portal > Application Insights > Failures
- Communicate
  - Update status page (if available)
  - Notify affected merchants if known
  - Alert team members
- Initial triage
  - Check Azure App Service status
  - Check database connectivity
  - Check Strike status: https://status.strike.me
  - Check OpenNode status: https://status.opennode.com
  - Check Stripe status: https://status.stripe.com
Investigation (15-60 minutes)
- Review logs
  - Application Insights > Logs
  - Query recent exceptions:
      exceptions
      | where timestamp > ago(1h)
      | order by timestamp desc
- Check deployments
  - Was there a recent deployment?
  - Consider rollback if deployment is suspect
- Check external dependencies
  - Strike API
  - OpenNode API
  - Stripe API
  - Azure SQL Database
Escalation Path
- Immediately: On-call engineer (first responder)
- After 15 minutes: Senior engineer / team lead
- After 30 minutes: Engineering manager
- After 1 hour: Executive notification
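The escalation path can be expressed as a lookup from elapsed time to the roles that should have been notified so far. This is a minimal sketch under two assumptions that this plan does not state: elapsed time is measured from the incident start, and each step triggers at exactly the listed time. `ESCALATION_PATH` and `roles_to_notify` are hypothetical names.

```python
from datetime import timedelta

# (elapsed time since incident start, role to notify), in escalation order.
# Assumption: each step is inclusive, so 15 minutes exactly triggers step two.
ESCALATION_PATH = [
    (timedelta(minutes=0), "On-call engineer"),
    (timedelta(minutes=15), "Senior engineer / team lead"),
    (timedelta(minutes=30), "Engineering manager"),
    (timedelta(hours=1), "Executive notification"),
]


def roles_to_notify(elapsed: timedelta) -> list[str]:
    """Return every role that should have been notified once `elapsed` has passed."""
    return [role for after, role in ESCALATION_PATH if elapsed >= after]
```

For a P1 incident that has been open for 30 minutes, `roles_to_notify(timedelta(minutes=30))` returns the on-call engineer, the senior engineer / team lead, and the engineering manager.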
High (P2) Response
Actions (0-30 minutes)
- Acknowledge and assess
  - Identify affected functionality
  - Determine customer impact
- Investigate
  - Review Application Insights metrics
  - Check for patterns (specific endpoint, merchant, etc.)
- Mitigate
  - Enable additional logging if needed
  - Scale resources if performance-related
  - Implement temporary workarounds
- Communicate
  - Update internal status
  - Notify affected merchants if significant
Medium (P3) Response
- Create detailed ticket with findings
- Schedule investigation during business hours
- Document workarounds for affected users
- Plan fix for next release cycle
Low (P4) Response
- Log issue in backlog
- Address in regular sprint planning
- No immediate action required