Incident Response Plan

This document outlines the procedures for detecting, responding to, and recovering from incidents affecting the Lightning Enable API.

Incident Severity Levels

Critical (P1)

Definition: Complete service outage or payment processing failure affecting all customers.

Examples:

API is completely down (all endpoints returning errors)
Payment creation failing for all merchants
Database unavailable
OpenNode integration completely broken
Data breach or security incident

Response Time: Immediate (within 15 minutes)

Resolution Target: 1 hour

High (P2)

Definition: Degraded performance or partial outage affecting significant functionality.

Examples:

API response times > 5 seconds
Intermittent 500 errors (> 5% error rate)
Webhook delivery failures
One product tier completely unavailable (e.g., L402, Kentico, or Standalone)
Authentication system degraded

Response Time: Within 30 minutes

Resolution Target: 4 hours

Medium (P3)

Definition: Non-critical feature unavailable or degraded, workarounds available.

Examples:

Exchange rate API returning stale data
Admin dashboard slow or partially broken
Hangfire jobs delayed but not failed
Documentation site down
Single merchant reporting issues

Response Time: Within 2 hours

Resolution Target: 24 hours

Low (P4)

Definition: Minor issues with no significant customer impact.

Examples:

Cosmetic issues in admin dashboard
Non-critical log warnings
Minor documentation errors
Performance slightly below baseline

Response Time: Next business day

Resolution Target: 1 week
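
The response and resolution targets above can be kept in one lookup table so that tooling (ticket templates, paging rules) computes deadlines consistently. The following is a minimal sketch, not part of the Lightning Enable codebase: the names `SEVERITY_TARGETS`, `next_business_day`, and `deadlines` are hypothetical, and "next business day" is assumed to mean the same clock time on the next weekday, ignoring public holidays.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityTargets:
    response: Optional[timedelta]  # None means "next business day"
    resolution: timedelta


# Targets copied from the severity definitions above.
SEVERITY_TARGETS = {
    "P1": SeverityTargets(timedelta(minutes=15), timedelta(hours=1)),
    "P2": SeverityTargets(timedelta(minutes=30), timedelta(hours=4)),
    "P3": SeverityTargets(timedelta(hours=2), timedelta(hours=24)),
    "P4": SeverityTargets(None, timedelta(weeks=1)),
}


def next_business_day(start: datetime) -> datetime:
    """Same clock time on the next weekday (assumption: holidays ignored)."""
    day = start + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def deadlines(severity: str, detected_at: datetime) -> dict:
    """Return respond-by and resolve-by timestamps for an incident."""
    targets = SEVERITY_TARGETS[severity]
    if targets.response is None:
        respond_by = next_business_day(detected_at)
    else:
        respond_by = detected_at + targets.response
    return {
        "respond_by": respond_by,
        "resolve_by": detected_at + targets.resolution,
    }
```

For example, `deadlines("P1", datetime.now(timezone.utc))` yields a respond-by time 15 minutes out and a resolve-by time one hour out.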

Detection & Alerting

Application Insights Alerts

Lightning Enable uses Azure Application Insights for monitoring. The following alerts should be configured:

Availability Alerts

Alert	Threshold	Severity
Health endpoint failure	2 consecutive failures	Critical
API availability < 99%	5-minute window	High
API availability < 95%	5-minute window	Critical

Performance Alerts

Alert	Threshold	Severity
Average response time > 2s	5-minute window	Medium
Average response time > 5s	5-minute window	High
P95 response time > 10s	5-minute window	High

Error Rate Alerts

Alert	Threshold	Severity
5xx error rate > 1%	5-minute window	Medium
5xx error rate > 5%	5-minute window	High
5xx error rate > 10%	5-minute window	Critical

Custom Alerts

Alert	Threshold	Severity
Payment creation failures > 5	15-minute window	High
OpenNode API errors > 10	15-minute window	High
Database connection failures	Any occurrence	Critical
Stripe webhook failures > 3	1-hour window	Medium
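
To make the overlapping thresholds in the availability, performance, and error-rate tables easier to reason about, the sketch below encodes them as a small classifier. It is illustrative only: the function names are hypothetical, the inputs are assumed to be aggregates over the 5-minute window already, and the count-based custom alerts are left out.

```python
from typing import Optional

# Thresholds copied from the alert tables above. Each list is ordered from
# most to least severe so that the first matching rule wins.
ERROR_RATE_RULES = [(10.0, "Critical"), (5.0, "High"), (1.0, "Medium")]
AVG_RESPONSE_RULES = [(5.0, "High"), (2.0, "Medium")]
AVAILABILITY_RULES = [(95.0, "Critical"), (99.0, "High")]

SEVERITY_ORDER = ["Medium", "High", "Critical"]


def classify_error_rate(percent_5xx: float) -> Optional[str]:
    """Severity for a 5xx error rate in percent, or None if below all thresholds."""
    for threshold, severity in ERROR_RATE_RULES:
        if percent_5xx > threshold:
            return severity
    return None


def classify_avg_response(seconds: float) -> Optional[str]:
    """Severity for an average response time in seconds."""
    for threshold, severity in AVG_RESPONSE_RULES:
        if seconds > threshold:
            return severity
    return None


def classify_availability(percent: float) -> Optional[str]:
    """Severity for API availability in percent (lower is worse)."""
    for threshold, severity in AVAILABILITY_RULES:
        if percent < threshold:
            return severity
    return None


def worst(*severities: Optional[str]) -> Optional[str]:
    """Pick the most severe of several classifications, ignoring None."""
    present = [s for s in severities if s is not None]
    return max(present, key=SEVERITY_ORDER.index) if present else None
```

The comparisons are strict, matching the `>` and `<` signs in the tables, so an error rate of exactly 5% is still classified as Medium.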

Health Endpoint Monitoring

The API exposes a structured health endpoint (public, no authentication required):

GET https://api.lightningenable.com/health

Expected Response (HTTP 200):

{
  "status": "Healthy",
  "totalDuration": 42.15,
  "checks": [
    {
      "name": "database",
      "status": "Healthy",
      "duration": 38.72,
      "description": null,
      "exception": null,
      "tags": ["db", "sql"]
    }
  ]
}

Unhealthy Response (HTTP 503):

{
  "status": "Unhealthy",
  "totalDuration": 5023.41,
  "checks": [
    {
      "name": "database",
      "status": "Unhealthy",
      "duration": 5001.88,
      "description": null,
      "exception": "A network-related or instance-specific error occurred...",
      "tags": ["db", "sql"]
    }
  ]
}

Set up external monitoring (e.g., Azure Monitor, Pingdom, UptimeRobot) to call the health endpoint once a minute. Alert on any HTTP status other than 200, or parse the JSON body and alert when any checks[].status value is not "Healthy".
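
If a custom probe is preferred over a hosted monitor, the check described above could look roughly like the following sketch. It assumes only the response shape shown in the two sample payloads; `find_problems` and `probe` are hypothetical names, and scheduling and alert delivery are left to the caller.

```python
import json
import urllib.error
import urllib.request

HEALTH_URL = "https://api.lightningenable.com/health"


def find_problems(status_code: int, body: str) -> list:
    """Return a list of problem labels; an empty list means healthy."""
    problems = []
    if status_code != 200:
        problems.append(f"http:{status_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        problems.append("invalid-json")
        return problems
    if not isinstance(payload, dict):
        problems.append("invalid-json")
        return problems
    # Flag every individual check whose status is not "Healthy".
    for check in payload.get("checks") or []:
        if isinstance(check, dict) and check.get("status") != "Healthy":
            problems.append(f"check:{check.get('name')}")
    return problems


def probe(url: str = HEALTH_URL, timeout: float = 10.0) -> list:
    """Call the health endpoint once and report any problems found."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return find_problems(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the 503 body still lists the checks.
        return find_problems(err.code, err.read().decode("utf-8"))
    except OSError as err:
        # DNS failure, refused connection, or timeout.
        return [f"unreachable:{err}"]
```

Calling `probe()` from a scheduler once a minute and alerting whenever the returned list is non-empty mirrors the rule above.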

Azure Portal Access

Navigate to Azure Portal
Search for "Application Insights"
Select "lightning-enable-insights"
Key dashboards:
- Overview - Quick health snapshot
- Failures - Error analysis
- Performance - Response time analysis
- Live Metrics - Real-time monitoring

Response Procedures by Severity

Critical (P1) Response

Immediate Actions (0-15 minutes)

Acknowledge the incident
- Note the start time
- Create incident ticket

Assess scope

# Check API health
curl https://api.lightningenable.com/health

# Check recent errors in Application Insights
# Azure Portal > Application Insights > Failures

Communicate
- Update status page (if available)
- Notify affected merchants if known
- Alert team members

Initial triage
- Check Azure App Service status
- Check database connectivity
- Check OpenNode status: https://status.opennode.com
- Check Stripe status: https://status.stripe.com

Investigation (15-60 minutes)

Review logs

Application Insights > Logs
Query recent exceptions:

exceptions
| where timestamp > ago(1h)
| order by timestamp desc

Check deployments
- Was there a recent deployment?
- Consider rollback if deployment is suspect

Check external dependencies
- OpenNode API
- Stripe API
- Azure SQL Database

Escalation Path

Immediately - On-call engineer (first responder)
If unresolved after 15 minutes - Senior engineer / team lead
If unresolved after 30 minutes - Engineering manager
If unresolved after 1 hour - Executive notification
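
The escalation path can be expressed as a lookup from elapsed incident time to the people who should already be involved. This is a small sketch with hypothetical names (`ESCALATION_STEPS`, `escalation_targets`); it assumes each threshold is inclusive and measured from the incident start time.

```python
from datetime import timedelta

# (time elapsed since the incident started, who must be involved by then)
ESCALATION_STEPS = [
    (timedelta(minutes=0), "On-call engineer"),
    (timedelta(minutes=15), "Senior engineer / team lead"),
    (timedelta(minutes=30), "Engineering manager"),
    (timedelta(hours=1), "Executive notification"),
]


def escalation_targets(elapsed: timedelta) -> list:
    """Everyone who should have been notified once `elapsed` has passed."""
    return [who for after, who in ESCALATION_STEPS if elapsed >= after]
```

For an unresolved P1 that started 20 minutes ago, `escalation_targets(timedelta(minutes=20))` returns the on-call engineer and the senior engineer / team lead.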

High (P2) Response

Actions (0-30 minutes)

Acknowledge and assess
- Identify affected functionality
- Determine customer impact

Investigate
- Review Application Insights metrics
- Check for patterns (specific endpoint, merchant, etc.)

Mitigate
- Enable additional logging if needed
- Scale resources if performance-related
- Implement temporary workarounds

Communicate
- Update internal status
- Notify affected merchants if significant

Medium (P3) Response

Create detailed ticket with findings
Schedule investigation during business hours
Document workarounds for affected users
Plan fix for next release cycle

Low (P4) Response

Log issue in backlog
Address in regular sprint planning
No immediate action required

Common Issues & Resolutions

API Returning 500 Errors

Symptoms:

All or most endpoints returning HTTP 500
Application Insights showing high exception rate

Diagnostic Steps:

// Application Insights - Recent exceptions
exceptions
| where timestamp > ago(30m)
| summarize count() by type, outerMessage
| order by count_ desc

Common Causes & Solutions:

Cause	Solution
Database connection string invalid	Check `ConnectionStrings__DefaultConnection` in App Service Configuration
Encryption key changed/missing	Verify `DB_ENCRYPTION_KEY` matches production key
Missing required configuration	Check Application Insights for specific config errors
Out of memory	Restart App Service, consider scaling up
Unhandled exception in code	Review recent deployments, consider rollback

Rollback Procedure:

# Azure CLI - Roll back by swapping slots again
# (assumes the previous release is still in the staging slot after the last swap)
az webapp deployment slot swap \
  --resource-group lightning-enable-rg \
  --name lightning-enable-api \
  --slot staging \
  --target-slot production

Database Connection Issues

Symptoms:

Timeout errors in logs
"Cannot connect to database" exceptions
Health check failing on database

Diagnostic Steps:

Check Azure SQL Database status in portal
Verify firewall rules allow App Service
Check connection string format

Common Causes & Solutions:

Cause	Solution
SQL Server maintenance	Wait for maintenance window to complete
Firewall blocking	Add App Service outbound IPs to SQL firewall
Connection pool exhaustion	Restart App Service, review connection handling
Credentials expired (Managed Identity)	Re-enable system-assigned managed identity
DTU limit exceeded	Scale up database tier

Verify Connectivity:

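# From Azure Cloud Shell or a host with VNet access.
# Replace the server and user placeholders. Omitting -P makes sqlcmd prompt
# for the password instead of leaving it in shell history.
sqlcmd -S your-server.database.windows.net -d LightningEnable -U admin -Q "SELECT 1"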
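
When sqlcmd is not available on the host, a plain TCP check against port 1433 (the port Azure SQL Database listens on) can at least separate network-path problems from login problems. The sketch below uses a hypothetical helper name, `can_reach`. A successful connection only shows that DNS and routing work; it does not exercise the login, so it cannot confirm credentials or any rule that is evaluated at login time.

```python
import socket


def can_reach(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Usage would be along the lines of `can_reach("your-server.database.windows.net")`, with the placeholder replaced by the real server name.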

OpenNode API Failures

Symptoms:

Payment creation failing
"OPENNODE_ERROR" or "OPENNODE_TIMEOUT" responses
502 Bad Gateway errors

Diagnostic Steps:

Check OpenNode status: https://status.opennode.com
Verify merchant OpenNode key is valid
Test OpenNode directly:

curl -X GET "https://api.opennode.com/v1/account/balance" \
  -H "Authorization: YOUR_OPENNODE_KEY"

Common Causes & Solutions:

Cause	Solution
OpenNode outage	Wait for OpenNode to recover, monitor status page
Invalid API key	Merchant needs to regenerate key in OpenNode dashboard
OpenNode rate limiting	Implement request throttling
Network timeout	Increase HttpClient timeout, retry with backoff
Wrong environment	Verify `OpenNode:Environment` matches key type (dev/production)
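
The "retry with backoff" remedy in the table above can be sketched as follows. This is a generic illustration in Python, not the service's actual HttpClient configuration: `retry_with_backoff` and its parameters are hypothetical, and the choice of exponential delays with full jitter is an assumption. Retrying is only safe for idempotent calls; blindly retrying a payment-creation request could create duplicate charges unless the request is deduplicated.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the ceiling so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that tests can pass a stub instead of waiting in real time.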

Temporary Mitigation:

Enable payment creation queuing (if implemented)
Notify merchants of OpenNode status
Monitor for automatic recovery

Stripe API Failures

Symptoms:

Subscription creation failing
Webhook signature verification errors
Checkout sessions not creating

Diagnostic Steps:

Check Stripe status: https://status.stripe.com
Verify API keys in configuration
Check Stripe Dashboard > Developers > Logs

Common Causes & Solutions:

Cause	Solution
Webhook secret mismatch	Update `Stripe:WebhookSecret` from Stripe Dashboard
API key rotated	Update `Stripe:SecretKey` in App Service configuration
Stripe outage	Wait for recovery, queue subscription actions
Price ID invalid	Verify `Stripe:PricingPlans` configuration
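
When diagnosing webhook signature verification errors, it can help to recompute the signature by hand against a captured request. The sketch below follows Stripe's documented scheme (an HMAC-SHA256 over `timestamp.payload`, compared against the `v1` values in the `Stripe-Signature` header). It is a diagnostic aid with a hypothetical function name; production code should rely on the official Stripe library. The 300-second tolerance is an assumed default.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(
    payload: bytes,
    header: str,
    secret: str,
    tolerance: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Check a Stripe-Signature header against the raw request body."""
    timestamp = None
    candidates = []
    # The header looks like "t=<unix time>,v1=<hex digest>[,v1=...]".
    for part in header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    signed = timestamp.encode("utf-8") + b"." + payload
    expected = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    matched = any(
        hmac.compare_digest(expected.encode("utf-8"), c.encode("utf-8"))
        for c in candidates
    )
    if not matched:
        return False

    # Reject events whose timestamp is too far from the current time.
    current = time.time() if now is None else now
    return abs(current - int(timestamp)) <= tolerance
```

A mismatch with a known-good payload usually points at the wrong signing secret or at a body that was re-serialized before verification; the check must run on the raw bytes exactly as received.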

High Latency

Symptoms:

Response times > 2 seconds
Timeout errors in client applications
Performance alerts triggering

Diagnostic Steps:

// Application Insights - Slow requests
requests
| where timestamp > ago(1h)
| where duration > 2000
| summarize count(), avg(duration) by name
| order by avg_duration desc

Common Causes & Solutions:

Cause	Solution
Database slow queries	Review query plans, add indexes
High CPU utilization	Scale up App Service plan
Memory pressure	Restart app, scale up if persistent
External API slow	Add caching, increase timeouts
Cold start	Enable Always On in App Service

Quick Fixes:

Restart the App Service
Scale out (add instances)
Enable Application Insights Profiler for detailed analysis

Rate Limiting Issues

Symptoms:

Merchants reporting 429 errors
Legitimate traffic being blocked
Rate limit alerts triggering

Diagnostic Steps:

// Check rate limit violations
traces
| where message contains "rate limit"
| where timestamp > ago(1h)
| summarize count() by tostring(customDimensions.ApiKey)

Common Causes & Solutions:

Cause	Solution
Merchant polling too frequently	Guide merchant to use webhooks
Bot traffic	Review and block suspicious IPs
Rate limits too aggressive	Adjust limits in configuration
Single merchant spike	Contact merchant, temporarily increase their limit
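
To tell a single-merchant spike apart from limits that are too aggressive for everyone, the per-key counts from the query above can be summarized as in this sketch. The function name and the 50% share threshold are assumptions chosen for illustration; the input is assumed to be one key identifier per rate-limit violation.

```python
from collections import Counter
from typing import Iterable


def summarize_violations(api_keys: Iterable[str], spike_share: float = 0.5) -> dict:
    """Summarize rate-limit violations and flag a dominant single key."""
    counts = Counter(api_keys)
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "top_key": None, "top_share": 0.0,
                "single_merchant_spike": False}
    top_key, top_count = counts.most_common(1)[0]
    share = top_count / total
    return {
        "total": total,
        "top_key": top_key,
        "top_share": share,
        # One key causing most violations suggests contacting that merchant
        # rather than loosening limits globally.
        "single_merchant_spike": share >= spike_share,
    }
```

If violations are spread evenly across many keys, the limits themselves are the more likely cause.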

Post-Incident Procedures

Incident Report Template

Create an incident report within 48 hours of resolution for P1/P2 incidents:

# Incident Report: [Title]

## Summary
- **Incident ID:** INC-YYYY-MM-DD-001
- **Severity:** P1/P2/P3/P4
- **Duration:** [Start time] to [End time]
- **Impact:** [Brief description of customer impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | Incident detected |
| HH:MM | Team notified |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service restored |

## Root Cause
[Detailed technical explanation of what caused the incident]

## Impact Assessment
- **Merchants affected:** [Number]
- **Failed payments:** [Number]
- **Revenue impact:** [If applicable]

## Resolution
[What was done to fix the immediate issue]

## Prevention Measures
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]

## Lessons Learned
[What the team learned from this incident]
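
The Incident ID and Duration fields in the template can be filled in mechanically. The helpers below are a sketch with hypothetical names; they assume the trailing `001` in `INC-YYYY-MM-DD-001` is a per-day sequence number and that all times are recorded in UTC, as in the timeline table.

```python
from datetime import datetime, timezone


def incident_id(detected_at: datetime, sequence: int = 1) -> str:
    """Format an ID such as INC-2026-01-09-001 (sequence assumed per day)."""
    return f"INC-{detected_at:%Y-%m-%d}-{sequence:03d}"


def format_duration(start: datetime, end: datetime) -> str:
    """Render the incident duration as hours and whole minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"
```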

Root Cause Analysis (5 Whys)

For P1/P2 incidents, perform a 5 Whys analysis:

Why did the incident occur?
Why did that happen?
Why did that happen?
Why did that happen?
Why did that happen?

Prevention Measures Checklist

After each incident, consider:

Monitoring: Are there alerts that would catch this earlier?
Testing: Would additional tests have caught this?
Documentation: Is the runbook updated?
Automation: Can recovery be automated?
Architecture: Does the system design need improvement?
Process: Are deployment/change processes adequate?

Contact Information

Support Channels

Type	Contact
General Support	support@lightningenable.com
Security Issues	security@lightningenable.com
Enterprise Support	enterprise@lightningenable.com

Useful Resources

Resource	URL
Azure Portal	https://portal.azure.com
OpenNode Status	https://status.opennode.com
Stripe Status	https://status.stripe.com
GitHub Repository	https://github.com/refined-element/lightning-enable-mcp
API Documentation	https://docs.lightningenable.com
Production API	https://api.lightningenable.com

Azure Resources

Resource	Name
App Service	lightning-enable-api
SQL Database	lightning-enable-db
Application Insights	lightning-enable-insights
Key Vault	lightning-enable-kv
Resource Group	lightning-enable-rg

Runbook Updates

This incident response plan should be reviewed and updated:

After every P1/P2 incident
Quarterly (at minimum)
When major infrastructure changes occur
When new integrations are added

Last Updated: January 2026

Next Steps

Environment Variables - Configuration reference
Error Handling - API error codes
Rate Limiting - Rate limit documentation
