Incident Response Plan
This document outlines the procedures for detecting, responding to, and recovering from incidents affecting the Lightning Enable API.
Incident Severity Levels
Critical (P1)
Definition: Complete service outage or payment processing failure affecting all customers.
Examples:
- API is completely down (all endpoints returning errors)
- Payment creation failing for all merchants
- Database unavailable
- OpenNode integration completely broken
- Data breach or security incident
Response Time: Immediate (within 15 minutes)
Resolution Target: 1 hour
High (P2)
Definition: Degraded performance or partial outage affecting significant functionality.
Examples:
- API response times > 5 seconds
- Intermittent 500 errors (> 5% error rate)
- Webhook delivery failures
- One product tier completely unavailable (L402, Kentico, or Standalone)
- Authentication system degraded
Response Time: Within 30 minutes
Resolution Target: 4 hours
Medium (P3)
Definition: Non-critical feature unavailable or degraded, workarounds available.
Examples:
- Exchange rate API returning stale data
- Admin dashboard slow or partially broken
- Hangfire jobs delayed but not failed
- Documentation site down
- Single merchant reporting issues
Response Time: Within 2 hours
Resolution Target: 24 hours
Low (P4)
Definition: Minor issues with no significant customer impact.
Examples:
- Cosmetic issues in admin dashboard
- Non-critical log warnings
- Minor documentation errors
- Performance slightly below baseline
Response Time: Next business day
Resolution Target: 1 week
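The response times and resolution targets above can be kept in one lookup table so that tooling (a ticket bot, for instance) can compute deadlines consistently. The following is a minimal sketch, not part of the Lightning Enable codebase: the names `SeverityPolicy`, `POLICIES`, and `deadlines` are hypothetical, and it assumes both clocks start at detection time. P4's "next business day" response is represented as `None` rather than guessed at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_within: Optional[timedelta]  # None means "next business day"
    resolve_within: timedelta


# Values copied from the severity definitions above.
POLICIES = {
    "P1": SeverityPolicy("Critical", timedelta(minutes=15), timedelta(hours=1)),
    "P2": SeverityPolicy("High", timedelta(minutes=30), timedelta(hours=4)),
    "P3": SeverityPolicy("Medium", timedelta(hours=2), timedelta(hours=24)),
    "P4": SeverityPolicy("Low", None, timedelta(weeks=1)),
}


def deadlines(severity: str, detected_at: datetime) -> Tuple[Optional[datetime], datetime]:
    """Return (respond_by, resolve_by); respond_by is None for P4."""
    policy = POLICIES[severity]
    respond_by = None
    if policy.response_within is not None:
        respond_by = detected_at + policy.response_within
    # Assumption: the resolution target is measured from detection time.
    return respond_by, detected_at + policy.resolve_within
```

For example, `deadlines("P1", detected_at)` yields a response deadline 15 minutes after detection and a resolution deadline one hour after detection.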
Detection & Alerting
Application Insights Alerts
Lightning Enable uses Azure Application Insights for monitoring. The following alerts should be configured:
Availability Alerts
| Alert | Threshold | Severity |
|---|---|---|
| Health endpoint failure | 2 consecutive failures | Critical |
| API availability < 99% | 5-minute window | High |
| API availability < 95% | 5-minute window | Critical |
Performance Alerts
| Alert | Evaluation window | Severity |
|---|---|---|
| Average response time > 2s | 5-minute window | Medium |
| Average response time > 5s | 5-minute window | High |
| P95 response time > 10s | 5-minute window | High |
Error Rate Alerts
| Alert | Evaluation window | Severity |
|---|---|---|
| 5xx error rate > 1% | 5-minute window | Medium |
| 5xx error rate > 5% | 5-minute window | High |
| 5xx error rate > 10% | 5-minute window | Critical |
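The tiered error-rate thresholds in the table above reduce to a simple rule: evaluate the most severe threshold first and stop at the first match. In production these alerts would normally be configured as alert rules in Azure rather than in application code, so the sketch below only illustrates the logic; `ERROR_RATE_RULES` and `error_rate_severity` are hypothetical names.

```python
from typing import Optional

# (threshold as a fraction, severity), ordered from most to least severe.
ERROR_RATE_RULES = [(0.10, "Critical"), (0.05, "High"), (0.01, "Medium")]


def error_rate_severity(total_requests: int, failed_5xx: int) -> Optional[str]:
    """Map one 5-minute window's 5xx rate onto the severities in the table."""
    if total_requests <= 0:
        return None  # no traffic in the window, nothing to evaluate
    rate = failed_5xx / total_requests
    for threshold, severity in ERROR_RATE_RULES:
        if rate > threshold:  # strictly greater, matching the ">" in the table
            return severity
    return None
```

Because each comparison is strict, a window sitting exactly at 10% is reported as High, not Critical.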
Custom Alerts
| Alert | Threshold | Severity |
|---|---|---|
| Payment creation failures > 5 | 15-minute window | High |
| OpenNode API errors > 10 | 15-minute window | High |
| Database connection failures | Any occurrence | Critical |
| Stripe webhook failures > 3 | 1-hour window | Medium |
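The custom alerts above are count-based: they fire when more than N events occur within a time window. A rough sketch of those semantics follows, using a sliding window over event timestamps. The class name `WindowedCounter` is hypothetical, and the sketch assumes events are recorded in chronological order; the real alerts would be defined against Application Insights telemetry.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class WindowedCounter:
    """Counts events that fall inside a sliding time window."""

    def __init__(self, window: timedelta, threshold: int):
        self.window = window
        self.threshold = threshold
        self._events = deque()

    def record(self, at: datetime) -> bool:
        """Record one event; return True if the count now exceeds the threshold."""
        self._events.append(at)
        cutoff = at - self.window
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

For the "Payment creation failures > 5" rule, `WindowedCounter(timedelta(minutes=15), threshold=5)` returns `True` on the sixth failure recorded within 15 minutes.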
Health Endpoint Monitoring
The API exposes health endpoints that should be monitored:
# Basic health check
GET https://api.lightningenable.com/health
# Detailed health (requires admin key)
GET https://api.lightningenable.com/health/ready
Expected Response:
{
"status": "Healthy",
"checks": {
"database": "Healthy",
"opennode": "Healthy"
}
}
Set up external monitoring (e.g., Azure Monitor, Pingdom, UptimeRobot) to poll the health endpoint once per minute.
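If a hosted monitor is not yet in place, a small script can approximate the same check. The sketch below is illustrative only: it assumes the response body has the shape shown under "Expected Response", treats any network or parsing error as a failed probe, and applies the "2 consecutive failures" rule from the availability table. The names `unhealthy_checks`, `ConsecutiveFailureAlert`, and `probe` are hypothetical.

```python
import json
import urllib.request

HEALTH_URL = "https://api.lightningenable.com/health"


def unhealthy_checks(payload: dict) -> list:
    """Return the names of failing components in a health response body."""
    failing = [name for name, state in payload.get("checks", {}).items()
               if state != "Healthy"]
    # Overall status is unhealthy but no individual check explains why.
    if payload.get("status") != "Healthy" and not failing:
        failing.append("status")
    return failing


class ConsecutiveFailureAlert:
    """Fires once `limit` probes in a row have failed."""

    def __init__(self, limit: int = 2):
        self.limit = limit
        self.streak = 0

    def observe(self, healthy: bool) -> bool:
        self.streak = 0 if healthy else self.streak + 1
        return self.streak >= self.limit


def probe(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; errors and unhealthy checks count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return not unhealthy_checks(json.load(response))
    except (OSError, ValueError, AttributeError):
        return False
```

A scheduler would call `probe()` once per minute and pass the result to `ConsecutiveFailureAlert.observe`, paging the on-call engineer when it returns `True`.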
Azure Portal Access
- Navigate to Azure Portal
- Search for "Application Insights"
- Select "lightning-enable-insights"
- Key dashboards:
  - Overview: quick health snapshot
  - Failures: error analysis
  - Performance: response time analysis
  - Live Metrics: real-time monitoring
Response Procedures by Severity
Critical (P1) Response
Immediate Actions (0-15 minutes)
- Acknowledge the incident
  - Note the start time
  - Create incident ticket
- Assess scope
  # Check API health
  curl https://api.lightningenable.com/health
  # Check recent errors in Application Insights
  # Azure Portal > Application Insights > Failures
- Communicate
  - Update status page (if available)
  - Notify affected merchants if known
  - Alert team members
- Initial triage
  - Check Azure App Service status
  - Check database connectivity
  - Check OpenNode status: https://status.opennode.com
  - Check Stripe status: https://status.stripe.com
Investigation (15-60 minutes)
- Review logs
  - Application Insights > Logs
  - Query recent exceptions:
    exceptions
    | where timestamp > ago(1h)
    | order by timestamp desc
- Check deployments
  - Was there a recent deployment?
  - Consider rollback if deployment is suspect
- Check external dependencies
  - OpenNode API
  - Stripe API
  - Azure SQL Database
Escalation Path
- Immediately: on-call engineer (first responder)
- After 15 minutes: senior engineer / team lead
- After 30 minutes: engineering manager
- After 1 hour: executive notification
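The escalation path is cumulative: each threshold adds another person to the incident rather than replacing the previous responder. A minimal sketch of that rule follows, assuming the thresholds are measured in minutes from the incident start while it remains unresolved; `ESCALATION_STEPS` and `escalation_targets` are hypothetical names.

```python
from datetime import timedelta

# (minutes since incident start, who to bring in), from the escalation path above.
ESCALATION_STEPS = [
    (0, "On-call engineer"),
    (15, "Senior engineer / team lead"),
    (30, "Engineering manager"),
    (60, "Executive notification"),
]


def escalation_targets(elapsed: timedelta) -> list:
    """Return everyone who should be engaged once `elapsed` time has passed."""
    minutes = elapsed.total_seconds() / 60
    return [who for threshold, who in ESCALATION_STEPS if minutes >= threshold]
```

An incident bot could call this on a timer and notify anyone in the returned list who has not been paged yet.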
High (P2) Response
Actions (0-30 minutes)
- Acknowledge and assess
  - Identify affected functionality
  - Determine customer impact
- Investigate
  - Review Application Insights metrics
  - Check for patterns (specific endpoint, merchant, etc.)
- Mitigate
  - Enable additional logging if needed
  - Scale resources if performance-related
  - Implement temporary workarounds
- Communicate
  - Update internal status
  - Notify affected merchants if significant
Medium (P3) Response
- Create detailed ticket with findings
- Schedule investigation during business hours
- Document workarounds for affected users
- Plan fix for next release cycle
Low (P4) Response
- Log issue in backlog
- Address in regular sprint planning
- No immediate action required
Common Issues & Resolutions
API Returning 500 Errors
Symptoms:
- All or most endpoints returning HTTP 500
- Application Insights showing high exception rate
Diagnostic Steps:
// Application Insights - Recent exceptions
exceptions
| where timestamp > ago(30m)
| summarize count() by type, outerMessage
| order by count_ desc
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Database connection string invalid | Check ConnectionStrings__DefaultConnection in App Service Configuration |
| Encryption key changed/missing | Verify DB_ENCRYPTION_KEY matches production key |
| Missing required configuration | Check Application Insights for specific config errors |
| Out of memory | Restart App Service, consider scaling up |
| Unhandled exception in code | Review recent deployments, consider rollback |
Rollback Procedure:
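# Azure CLI - Roll back by swapping the staging and production slots again
# (works only if the faulty release reached production via a slot swap,
# so that the previous build is still in the staging slot)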
az webapp deployment slot swap \
--resource-group lightning-enable-rg \
--name lightning-enable-api \
--slot staging \
--target-slot production
Database Connection Issues
Symptoms:
- Timeout errors in logs
- "Cannot connect to database" exceptions
- Health check failing on database
Diagnostic Steps:
- Check Azure SQL Database status in portal
- Verify firewall rules allow App Service
- Check connection string format
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| SQL Server maintenance | Wait for maintenance window to complete |
| Firewall blocking | Add App Service outbound IPs to SQL firewall |
| Connection pool exhaustion | Restart App Service, review connection handling |
| Credentials expired (Managed Identity) | Re-enable system-assigned managed identity |
| DTU limit exceeded | Scale up database tier |
Verify Connectivity:
# From Azure Cloud Shell or with VNet access
# (-P is omitted so that sqlcmd prompts for the password instead of leaving it in shell history)
sqlcmd -S your-server.database.windows.net -d LightningEnable -U admin
OpenNode API Failures
Symptoms:
- Payment creation failing
- "OPENNODE_ERROR" or "OPENNODE_TIMEOUT" responses
- 502 Bad Gateway errors
Diagnostic Steps:
- Check OpenNode status: https://status.opennode.com
- Verify merchant OpenNode key is valid
- Test OpenNode directly:
curl -X GET "https://api.opennode.com/v1/account/balance" \
-H "Authorization: YOUR_OPENNODE_KEY"
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| OpenNode outage | Wait for OpenNode to recover, monitor status page |
| Invalid API key | Merchant needs to regenerate key in OpenNode dashboard |
| OpenNode rate limiting | Implement request throttling |
| Network timeout | Increase HttpClient timeout, retry with backoff |
| Wrong environment | Verify OpenNode:Environment matches key type (dev/production) |
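The "retry with backoff" remedy for network timeouts can be sketched as follows. This is a language-neutral illustration in Python rather than the service's actual HttpClient configuration: `TransientError` and `call_with_backoff` are hypothetical names, and the attempt count and delays are placeholder values, not tuned recommendations. Only errors the caller has classified as transient are retried, and the `sleep` parameter is injectable so the logic can be tested without waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 502s)."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Run `operation`, retrying TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt: 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Retries suit idempotent calls such as status lookups. Retrying payment creation needs care so that a request that timed out but actually succeeded is not submitted twice.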
Temporary Mitigation:
- Enable payment creation queuing (if implemented)
- Notify merchants of OpenNode status
- Monitor for automatic recovery
Stripe API Failures
Symptoms:
- Subscription creation failing
- Webhook signature verification errors
- Checkout sessions not creating
Diagnostic Steps:
- Check Stripe status: https://status.stripe.com
- Verify API keys in configuration
- Check Stripe Dashboard > Developers > Logs
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Webhook secret mismatch | Update Stripe:WebhookSecret from Stripe Dashboard |
| API key rotated | Update Stripe:SecretKey in App Service configuration |
| Stripe outage | Wait for recovery, queue subscription actions |
| Price ID invalid | Verify Stripe:PricingPlans configuration |
High Latency
Symptoms:
- Response times > 2 seconds
- Timeout errors in client applications
- Performance alerts triggering
Diagnostic Steps:
// Application Insights - Slow requests
requests
| where timestamp > ago(1h)
| where duration > 2000
| summarize count(), avg(duration) by name
| order by avg_duration desc
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Database slow queries | Review query plans, add indexes |
| High CPU utilization | Scale up App Service plan |
| Memory pressure | Restart app, scale up if persistent |
| External API slow | Add caching, increase timeouts |
| Cold start | Enable Always On in App Service |
Quick Fixes:
- Restart the App Service
- Scale out (add instances)
- Enable Application Insights Profiler for detailed analysis
Rate Limiting Issues
Symptoms:
- Merchants reporting 429 errors
- Legitimate traffic being blocked
- Rate limit alerts triggering
Diagnostic Steps:
// Check rate limit violations
traces
| where message contains "rate limit"
| where timestamp > ago(1h)
| summarize count() by tostring(customDimensions.ApiKey)
Common Causes & Solutions:
| Cause | Solution |
|---|---|
| Merchant polling too frequently | Guide merchant to use webhooks |
| Bot traffic | Review and block suspicious IPs |
| Rate limits too aggressive | Adjust limits in configuration |
| Single merchant spike | Contact merchant, temporarily increase their limit |
Post-Incident Procedures
Incident Report Template
Create an incident report within 48 hours of resolution for P1/P2 incidents:
# Incident Report: [Title]
## Summary
- **Incident ID:** INC-YYYY-MM-DD-001
- **Severity:** P1/P2/P3/P4
- **Duration:** [Start time] to [End time]
- **Impact:** [Brief description of customer impact]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident detected |
| HH:MM | Team notified |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |
## Root Cause
[Detailed technical explanation of what caused the incident]
## Impact Assessment
- **Merchants affected:** [Number]
- **Failed payments:** [Number]
- **Revenue impact:** [If applicable]
## Resolution
[What was done to fix the immediate issue]
## Prevention Measures
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]
## Lessons Learned
[What the team learned from this incident]
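A few fields in the template are mechanical and can be generated rather than typed by hand. The helpers below are a sketch under stated assumptions: the trailing `001` in the incident ID is taken to be a per-day sequence number, and the 48-hour reporting window is measured from the resolution time. All function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def incident_id(day: date, sequence: int) -> str:
    """Format an ID following the INC-YYYY-MM-DD-001 pattern from the template."""
    return f"INC-{day:%Y-%m-%d}-{sequence:03d}"


def report_due(resolved_at: datetime) -> datetime:
    """P1/P2 reports are due within 48 hours of resolution."""
    return resolved_at + timedelta(hours=48)


def duration_minutes(started_at: datetime, ended_at: datetime) -> int:
    """Whole minutes between incident start and service restoration."""
    return int((ended_at - started_at).total_seconds() // 60)
```

Using UTC timestamps throughout keeps these values consistent with the "Time (UTC)" column in the timeline table.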
Root Cause Analysis (5 Whys)
For P1/P2 incidents, perform a 5 Whys analysis:
- Why did the incident occur?
- Why did that happen?
- Why did that happen?
- Why did that happen?
- Why did that happen?
Prevention Measures Checklist
After each incident, consider:
- Monitoring: Are there alerts that would catch this earlier?
- Testing: Would additional tests have caught this?
- Documentation: Is the runbook updated?
- Automation: Can recovery be automated?
- Architecture: Does the system design need improvement?
- Process: Are deployment/change processes adequate?
Contact Information
Support Channels
| Type | Contact |
|---|---|
| General Support | support@lightningenable.com |
| Security Issues | security@lightningenable.com |
| Enterprise Support | enterprise@lightningenable.com |
Useful Resources
| Resource | URL |
|---|---|
| Azure Portal | https://portal.azure.com |
| OpenNode Status | https://status.opennode.com |
| Stripe Status | https://status.stripe.com |
| GitHub Repository | https://github.com/AdrianRamsey13/lightning-enable |
| API Documentation | https://docs.lightningenable.com |
| Production API | https://api.lightningenable.com |
Azure Resources
| Resource | Name |
|---|---|
| App Service | lightning-enable-api |
| SQL Database | lightning-enable-db |
| Application Insights | lightning-enable-insights |
| Key Vault | lightning-enable-kv |
| Resource Group | lightning-enable-rg |
Runbook Updates
This incident response plan should be reviewed and updated:
- After every P1/P2 incident
- Quarterly (at minimum)
- When major infrastructure changes occur
- When new integrations are added
Last Updated: January 2026
Next Steps
- Environment Variables - Configuration reference
- Error Handling - API error codes
- Rate Limiting - Rate limit documentation