HeliosDB Monitoring and Alerting Guide¶
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents¶
- Overview
- Monitoring Architecture
- Key Metrics
- Grafana Dashboards
- Alert Rules
- Alert Severity Levels
- Common Monitoring Scenarios
- Troubleshooting Metrics
- Best Practices
- Next Steps
Overview¶
HeliosDB Phase 1 includes comprehensive monitoring for all three production features:
- Conversational BI: Query latency, LLM API performance, cache hit rates, NL2SQL accuracy
- Auto-Compliance: Compliance violations, audit log health, check latency, storage usage
- Embedded+Cloud: Sync performance, conflict rates, offline mode, WebSocket connections
Monitoring Stack¶
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Service Metrics: Built-in Prometheus exporters in each service
Monitoring Architecture¶
┌─────────────────────────────────────────────────────────────┐
│                      Feature Services                       │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Conversational  │ │ Auto-Compliance │ │ Embedded+Cloud  │ │
│ │ BI :9091        │ │ :9092           │ │ :9093           │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└──────────┼───────────────────┼───────────────────┼──────────┘
           │                   │                   │
           │ Metrics           │                   │
           │ (Prometheus       │                   │
           │ format)           │                   │
           ▼                   ▼                   ▼
        ┌─────────────────────────────────────────────┐
        │              Prometheus :9090               │
        │           (Scrape, Store, Alert)            │
        └──────────────────────┬──────────────────────┘
                               │
                               │ Query
                               ▼
        ┌─────────────────────────────────────────────┐
        │                Grafana :3000                │
        │       (Visualize, Dashboard, Notify)        │
        └─────────────────────────────────────────────┘
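The diagram implies one Prometheus scrape job per feature service. A minimal sketch of a matching `scrape_configs` section follows, wrapped in a shell function so it can be redirected to a file. The job names, the 15s interval, and the target hostnames (borrowed from the Docker Compose service names used in the commands later in this guide) are assumptions; the `prometheus.yml` shipped with the staging deployment may differ.

```bash
# Print an illustrative Prometheus scrape configuration for the three
# feature services. Job names, hostnames, and the scrape interval are
# assumptions, not values confirmed by the staging deployment.
emit_scrape_config() {
  cat <<'EOF'
global:
  scrape_interval: 15s   # assumed; tune to your environment

scrape_configs:
  - job_name: conversational-bi
    static_configs:
      - targets: ['conversational-bi:9091']
  - job_name: auto-compliance
    static_configs:
      - targets: ['compliance:9092']
  - job_name: embedded-cloud
    static_configs:
      - targets: ['embedded-cloud-sync:9093']
EOF
}
```

Calling `emit_scrape_config > /tmp/prometheus-example.yml` gives a file to compare against the real configuration; it is not meant to replace it.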
Key Metrics¶
Conversational BI Metrics¶
Query Performance¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_conversational_bi_query_duration_seconds` | NL query to SQL generation latency | Histogram | p95 > 10s |
| `heliosdb_conversational_bi_requests_total` | Total query requests | Counter | - |
| `heliosdb_conversational_bi_llm_errors_total` | LLM API errors | Counter | rate > 0.1/s |
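A percentile threshold such as "p95 > 10s" on a Histogram metric is normally evaluated with PromQL's `histogram_quantile` over the metric's `_bucket` series. The sketch below only builds that expression; the helper name `quantile_expr` and the 5m rate window are illustrative assumptions, not values taken from the HeliosDB rule files.

```bash
# Build a PromQL expression for a quantile of a Prometheus histogram.
#   $1 = quantile (e.g. 0.95)
#   $2 = histogram base name, without the _bucket suffix
#   $3 = rate window (e.g. 5m); assumed, not a documented HeliosDB value
quantile_expr() {
  printf 'histogram_quantile(%s, sum by (le) (rate(%s_bucket[%s])))\n' "$1" "$2" "$3"
}

# Example (needs a reachable Prometheus, so it is left commented out):
# curl -sG http://localhost:9090/api/v1/query \
#   --data-urlencode "query=$(quantile_expr 0.95 heliosdb_conversational_bi_query_duration_seconds 5m)"
```

Passing the expression through `--data-urlencode` avoids shell and URL quoting problems with the parentheses and brackets.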
LLM Performance¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_conversational_bi_llm_tokens_used_total` | Total tokens consumed | Counter | rate > 1M/hour |
| `heliosdb_conversational_bi_llm_latency_seconds` | LLM API call latency | Histogram | p95 > 5s |
| `heliosdb_conversational_bi_llm_cost_estimate` | Estimated cost in USD | Gauge | - |
Cache Performance¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_conversational_bi_cache_hits_total` | Cache hits | Counter | - |
| `heliosdb_conversational_bi_cache_misses_total` | Cache misses | Counter | - |
| `heliosdb_conversational_bi_cache_size_bytes` | Cache memory usage | Gauge | > 1GB |
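The hit and miss counters have no threshold of their own; they are useful as a ratio, hits / (hits + misses). The sketch below derives that ratio from raw `/metrics` text on standard input. It assumes the value is the last field on each sample line (no trailing timestamp) and sums across any label combinations. Because counters are cumulative, this is a since-start ratio, whereas an alert such as ConversationalBILowCacheHitRate would more likely use a rate over a time window.

```bash
# Read Prometheus exposition text on stdin and print the cumulative
# cache hit ratio. Assumes the sample value is the last field on a line.
cache_hit_ratio() {
  awk '
    /^heliosdb_conversational_bi_cache_hits_total[{ ]/   { hits += $NF }
    /^heliosdb_conversational_bi_cache_misses_total[{ ]/ { misses += $NF }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }   # no lookups recorded yet
      printf "%.3f\n", hits / total
    }'
}

# Example (needs the service to be running):
# curl -s http://localhost:9091/metrics | cache_hit_ratio
```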
Accuracy Metrics¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_conversational_bi_nl2sql_accuracy` | SQL generation accuracy | Gauge | < 0.7 |
| `heliosdb_conversational_bi_sql_validation_errors_total` | Invalid SQL generated | Counter | rate > 0.05/s |
Auto-Compliance Metrics¶
Compliance Violations¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_compliance_violations_total` | Compliance violations by framework | Counter | any > 0 |
| `heliosdb_compliance_checks_total` | Compliance checks performed | Counter | - |
| `heliosdb_compliance_check_duration_seconds` | Compliance check latency | Histogram | p95 > 5s |
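"Violations by framework" suggests that the counter carries a label identifying the framework. The sketch below totals the counter per framework from raw `/metrics` text. The label name `framework` is an assumption; check the actual label with `curl http://localhost:9092/metrics` before relying on it.

```bash
# Sum heliosdb_compliance_violations_total per framework from Prometheus
# exposition text on stdin. The label name "framework" is assumed.
violations_by_framework() {
  awk '
    /^heliosdb_compliance_violations_total[{]/ {
      if (match($0, /framework="[^"]*"/)) {
        # Skip the 11-character prefix framework=" and drop the closing quote.
        name = substr($0, RSTART + 11, RLENGTH - 12)
        count[name] += $NF
      }
    }
    END { for (name in count) printf "%s %d\n", name, count[name] }
  ' | sort
}

# Example (needs the service to be running):
# curl -s http://localhost:9092/metrics | violations_by_framework
```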
Audit Log Health¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_compliance_audit_log_write_duration_seconds` | Audit log write latency | Histogram | p95 > 1s |
| `heliosdb_compliance_audit_log_write_errors_total` | Audit log write failures | Counter | any > 0 |
| `heliosdb_compliance_audit_log_storage_bytes` | Audit log storage size | Gauge | > 50GB |
Report Generation¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_compliance_reports_generated_total` | Reports generated | Counter | - |
| `heliosdb_compliance_report_generation_failures_total` | Report generation failures | Counter | any > 0 |
| `heliosdb_compliance_report_generation_duration_seconds` | Report generation time | Histogram | p95 > 30s |
Embedded+Cloud Metrics¶
Sync Performance¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_embedded_cloud_sync_duration_seconds` | Sync operation latency | Histogram | p95 > 30s |
| `heliosdb_embedded_cloud_sync_success_total` | Successful syncs | Counter | - |
| `heliosdb_embedded_cloud_sync_failures_total` | Failed syncs | Counter | rate > 0.01/s |
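A success rate can be derived from the two sync counters as successes / (successes + failures), computed over per-second rates. The sketch below builds such a PromQL expression. The helper name and the 5m window are assumptions, and the expression actually used by the Grafana panel may differ.

```bash
# Build a PromQL success-ratio expression from a "good" and a "bad" counter.
#   $1 = success counter, $2 = failure counter, $3 = rate window (assumed)
success_ratio_expr() {
  printf 'sum(rate(%s[%s])) / (sum(rate(%s[%s])) + sum(rate(%s[%s])))\n' \
    "$1" "$3" "$1" "$3" "$2" "$3"
}

# Example (needs a reachable Prometheus, so it is left commented out):
# curl -sG http://localhost:9090/api/v1/query \
#   --data-urlencode "query=$(success_ratio_expr heliosdb_embedded_cloud_sync_success_total heliosdb_embedded_cloud_sync_failures_total 5m)"
```

When no syncs ran during the window the ratio is 0/0, so the query yields NaN or an empty result rather than 1; dashboards and alerts need to treat that case explicitly.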
Conflict Resolution¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_embedded_cloud_conflicts_total` | Data conflicts detected | Counter | rate > 0.05/s |
| `heliosdb_embedded_cloud_conflicts_resolved_total` | Conflicts resolved | Counter | - |
| `heliosdb_embedded_cloud_conflict_resolution_failures_total` | Conflicts unresolved | Counter | any > 0 |
WebSocket Connections¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_embedded_cloud_active_connections` | Active WebSocket connections | Gauge | - |
| `heliosdb_embedded_cloud_websocket_disconnects_total` | WebSocket disconnections | Counter | rate > 10/s |
| `heliosdb_embedded_cloud_websocket_message_errors_total` | WebSocket message errors | Counter | rate > 0.1/s |
Offline Mode¶
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| `heliosdb_embedded_cloud_offline_mode_active` | Devices in offline mode | Gauge | - |
| `heliosdb_embedded_cloud_offline_cache_usage_bytes` | Offline cache size | Gauge | > 80% of limit |
| `heliosdb_embedded_cloud_offline_cache_evictions_total` | Cache evictions | Counter | rate > 1/s |
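The "> 80% of limit" threshold compares the usage gauge with a configured cache limit, which this guide does not expose as a metric. The sketch below shows the arithmetic for a single sample; the limit has to be supplied by the caller, and the helper name is illustrative.

```bash
# Report offline cache usage as a percentage of a caller-supplied limit.
#   $1 = current usage in bytes (heliosdb_embedded_cloud_offline_cache_usage_bytes)
#   $2 = configured limit in bytes (assumed to be known from configuration)
cache_usage_status() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) { print "invalid limit"; exit 1 }
    pct = used * 100 / limit
    # Strictly greater than 80%, matching the threshold in the table above.
    printf "%.1f%% %s\n", pct, (pct > 80 ? "WARN" : "OK")
  }'
}
```

For example, `cache_usage_status 850 1000` reports a warning, while a usage of exactly 80% of the limit does not.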
Grafana Dashboards¶
Accessing Dashboards¶
- Open Grafana: http://localhost:3000
- Log in with the credentials from `.env`:
  - Username: `admin`
  - Password: `<GRAFANA_ADMIN_PASSWORD>`
- Navigate to Dashboards > Browse
Available Dashboards¶
1. Conversational BI Dashboard¶
Location: Dashboards > Conversational BI - Production Metrics
Panels:
- Service Status (UP/DOWN indicator)
- Requests Per Minute
- NL2SQL Accuracy
- Cache Hit Rate
- LLM Error Rate
- Query Latency Percentiles (p50, p95, p99)
- Request Rate & Error Rate
- LLM Token Usage by Provider
- Cache Hit vs Miss Distribution
- Memory Usage
- Rate Limited Requests by Client
- Query Type Distribution
Key Insights:
- Are queries being served successfully?
- Is the LLM API responding?
- Is the cache effective?
- Are we staying within rate limits?
2. Auto-Compliance Dashboard¶
Location: Dashboards > Auto-Compliance - Production Metrics
Panels:
- Service Status
- Total Violations (24h)
- Compliance Check Latency (p95)
- Audit Log Storage
- Audit Log Write Errors
- Reports Generated (24h)
- Compliance Violations by Framework
- Audit Log Write Performance
- Compliance Checks by Framework
- Violation Types Distribution
- Audit Log Retention Compliance
- Alert Delivery Success Rate
- Recent Compliance Violations (Top 10)
- Report Generation Success & Failures
- Memory Usage
- Audit Log Compression Ratio
Key Insights:
- Are there any compliance violations?
- Is the audit log healthy?
- Are reports being generated successfully?
- Is the retention policy being met?
3. Embedded+Cloud Dashboard¶
Location: Dashboards > Embedded+Cloud Unified - Production Metrics
Panels:
- Service Status
- Active Connections
- Sync Success Rate
- Sync Latency (p95)
- Conflicts (1h)
- Offline Mode Devices
- Sync Operations (Success vs Failures)
- Sync Latency Percentiles
- WebSocket Connections
- Data Transfer Rate
- Conflict Resolution Strategy Usage
- Device Count by User
- Offline Cache Usage
- Cloud Storage Operation Latency
- Top Error Types
- Sync Queue Length
- Device Auth Failures
- Memory Usage
- Cloud Storage Errors
Key Insights:
- Are syncs completing successfully?
- How many conflicts are occurring?
- Is offline mode working?
- Are WebSocket connections stable?
Alert Rules¶
Alert Configuration¶
Alert rules are defined in:
- /home/claude/HeliosDB/deployment/staging/monitoring/prometheus/alerts/
Conversational BI Alerts¶
Critical Alerts:
1. ConversationalBIServiceDown: Service unavailable for > 2 minutes
2. ConversationalBICriticalLatency: p99 latency > 30s for > 3 minutes
3. ConversationalBILLMAPIUnavailable: LLM API error rate > 0.5 errors/sec
Warning Alerts:
1. ConversationalBIHighLatency: p95 latency > 10s for > 5 minutes
2. ConversationalBIHighLLMErrorRate: LLM error rate > 0.1 errors/sec
3. ConversationalBILowAccuracy: NL2SQL accuracy < 70% for > 10 minutes
4. ConversationalBILowCacheHitRate: Cache hit rate < 30% for > 10 minutes
5. ConversationalBIRateLimitExceeded: > 10 requests/sec being rate-limited
6. ConversationalBIHighMemoryUsage: Memory usage > 3.5GB for > 5 minutes
Info Alerts:
1. ConversationalBINoTraffic: No requests for > 10 minutes
2. ConversationalBIHighLLMCost: Token usage > 1M tokens/hour
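For orientation, a rule such as ConversationalBIHighLatency (p95 latency > 10s for > 5 minutes) could look like the sketch below in Prometheus rule-file syntax, wrapped in a shell function so it can be written to a scratch file. The group name, the `severity` label, the 5m rate window, and the annotation text are assumptions; the rule files in the alerts directory are the source of truth.

```bash
# Print an illustrative Prometheus alerting rule. Group name, labels, rate
# window, and annotation are assumed; the shipped rule may be written differently.
emit_high_latency_rule() {
  cat <<'EOF'
groups:
  - name: conversational_bi_example
    rules:
      - alert: ConversationalBIHighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(heliosdb_conversational_bi_query_duration_seconds_bucket[5m]))
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Conversational BI p95 query latency above 10s for 5 minutes
EOF
}
```

Where `promtool` is installed, `emit_high_latency_rule > /tmp/example-rule.yml && promtool check rules /tmp/example-rule.yml` validates the syntax without loading the rule into Prometheus.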
Auto-Compliance Alerts¶
Critical Alerts:
1. ComplianceServiceDown: Service unavailable for > 2 minutes
2. ComplianceViolationDetected: Any compliance violation detected
3. ComplianceAuditLogWriteFailure: Audit log write errors detected
4. ComplianceGDPRViolation: GDPR violation detected
5. ComplianceHIPAAViolation: HIPAA violation detected
6. CompliancePCIDSSViolation: PCI-DSS violation detected
7. ComplianceAuditLogStorageCritical: Audit log storage > 80GB
Warning Alerts:
1. ComplianceAuditLogHighLatency: p95 write latency > 1s
2. ComplianceCheckHighLatency: p95 check latency > 5s
3. ComplianceCheckFailures: Check failure rate > 0.05 failures/sec
4. ComplianceReportGenerationFailed: Report generation failed
5. ComplianceAuditLogStorageHigh: Audit log storage > 50GB
6. ComplianceAlertDeliveryFailure: Alert delivery failing
7. ComplianceRetentionPolicyViolation: Logs older than retention policy
Embedded+Cloud Alerts¶
Critical Alerts:
1. EmbeddedCloudSyncServiceDown: Service unavailable for > 2 minutes
2. EmbeddedCloudCriticalSyncLatency: p99 sync latency > 120s
3. EmbeddedCloudHighSyncFailureRate: Sync failure rate > 0.1 failures/sec
4. EmbeddedCloudStorageUnavailable: Cloud storage error rate > 0.1 errors/sec
5. EmbeddedCloudCriticalSyncQueueBacklog: Sync queue > 10,000 items
Warning Alerts:
1. EmbeddedCloudHighSyncLatency: p95 sync latency > 30s
2. EmbeddedCloudSyncFailures: Sync failure rate > 0.01 failures/sec
3. EmbeddedCloudConflictResolutionFailures: Unable to resolve conflicts
4. EmbeddedCloudStorageHighLatency: p95 storage operation latency > 5s
5. EmbeddedCloudOfflineCacheNearFull: Offline cache > 80% capacity
6. EmbeddedCloudDeviceAuthFailures: Device auth failure rate > 0.05 failures/sec
7. EmbeddedCloudHighWebSocketDisconnects: Disconnect rate > 10/s
8. EmbeddedCloudSyncQueueBacklog: Sync queue > 1,000 items
Info Alerts:
1. EmbeddedCloudHighConflictRate: Conflict rate > 0.05 conflicts/sec
2. EmbeddedCloudOfflineModeActivated: Devices operating offline for > 10 minutes
3. EmbeddedCloudDeviceLimitExceeded: Users hitting device limits
4. EmbeddedCloudNoActiveConnections: No WebSocket connections for > 15 minutes
5. EmbeddedCloudHighDataTransferRate: Data transfer > 100 MB/sec
Alert Severity Levels¶
Critical (P1)¶
Definition: Service is down, data loss risk, or compliance violation
Response Time: Immediate (< 15 minutes)
Actions:
1. Page on-call engineer
2. Begin incident response
3. Check recent deployments
4. Review logs immediately
Examples:
- Service completely down
- Database unavailable
- Compliance violations detected
- Audit log write failures
Warning (P2)¶
Definition: Service degraded, potential issue developing
Response Time: Within 1 hour
Actions:
1. Notify team via Slack/email
2. Investigate root cause
3. Monitor for escalation
4. Plan remediation
Examples:
- High latency
- Elevated error rates
- Resource approaching limits
- Report generation failures
Info (P3)¶
Definition: Notable event, no immediate action required
Response Time: Next business day
Actions:
1. Log for investigation
2. Review during normal hours
3. Update runbooks if needed
Examples:
- No traffic (off-hours)
- High token usage
- Offline mode activated
- Informational events
Common Monitoring Scenarios¶
Scenario 1: High Query Latency¶
Symptoms:
- Dashboard shows p95 > 10s
- Alert: ConversationalBIHighLatency
Investigation:
1. Check Grafana: Conversational BI > Query Latency panel
2. Check LLM API latency: Is the LLM slow?
3. Check cache hit rate: Is cache effective?
4. Check database latency: Is PostgreSQL slow?
Resolution:
# Check service logs
docker compose -f deployment/staging/docker-compose.yml logs conversational-bi | grep -i latency
# Check the LLM latency metrics exposed by the service
curl http://localhost:9091/metrics | grep llm_latency
# Restart service if needed
docker compose -f deployment/staging/docker-compose.yml restart conversational-bi
Scenario 2: Compliance Violation Detected¶
Symptoms:
- Dashboard shows violation count > 0
- Alert: ComplianceViolationDetected
Investigation:
1. Check Grafana: Compliance > Violations by Framework panel
2. Identify which framework (GDPR, HIPAA, etc.)
3. Check audit logs for violation details
Resolution:
# Check compliance logs
docker compose -f deployment/staging/docker-compose.yml logs compliance | grep -i violation
# Access compliance dashboard (`open` is macOS-specific; use xdg-open on Linux)
open http://localhost:8090
# Review violation details
curl http://localhost:8082/api/v1/compliance/violations
Scenario 3: Sync Failures¶
Symptoms:
- Dashboard shows high sync failure rate
- Alert: EmbeddedCloudSyncFailures
Investigation:
1. Check Grafana: Embedded+Cloud > Sync Operations panel
2. Check cloud storage connectivity
3. Check for conflicts or errors
Resolution:
# Check sync service logs
docker compose -f deployment/staging/docker-compose.yml logs embedded-cloud-sync | grep -i sync
# Check cloud storage (S3) error metrics
curl http://localhost:9093/metrics | grep storage_errors
# Check sync queue
curl http://localhost:8083/api/v1/sync/queue/status
Troubleshooting Metrics¶
Metrics Not Appearing¶
Issue: Grafana shows "No data"
Diagnosis:
# 1. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# 2. Check service metrics endpoints
curl http://localhost:9091/metrics # Should return Prometheus metrics
curl http://localhost:9092/metrics
curl http://localhost:9093/metrics
# 3. Check Prometheus logs
docker compose -f deployment/staging/docker-compose.yml logs prometheus
Resolution:
# Restart Prometheus
docker compose -f deployment/staging/docker-compose.yml restart prometheus
# Verify scrape config
docker compose -f deployment/staging/docker-compose.yml exec prometheus \
cat /etc/prometheus/prometheus.yml
Alert Not Firing¶
Issue: Expected alert doesn't trigger
Diagnosis:
# Check Prometheus rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name == "YourAlertName")'
# Check if metric exists (quote the URL so the shell does not interpret ? and <>)
curl 'http://localhost:9090/api/v1/query?query=<metric_name>'
Resolution:
# Reload Prometheus config (this endpoint only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Check alert state
curl http://localhost:9090/api/v1/alerts
Best Practices¶
- Check Dashboards Daily: Review key metrics every morning
- Investigate Warnings: Don't ignore warning-level alerts
- Baseline Metrics: Understand normal operating ranges
- Document Incidents: Keep runbooks updated
- Regular Reviews: Weekly review of alert effectiveness
- Tune Thresholds: Adjust based on observed behavior
- Alert Fatigue: Reduce noisy alerts
Next Steps¶
- Configure Alertmanager (optional): Set up email/Slack notifications
- Create Custom Dashboards: Add business-specific metrics
- Set Up SLOs: Define Service Level Objectives
- Enable Tracing: Add distributed tracing for debugging
- Log Aggregation: Integrate with ELK or Loki
Monitoring is operational! Your HeliosDB Phase 1 staging environment is fully observable.