HeliosDB Monitoring and Alerting Guide¶

Version: 7.0.0 Environment: Staging Last Updated: 2025-11-17

Table of Contents¶

Overview
Monitoring Architecture
Key Metrics
Grafana Dashboards
Alert Rules
Alert Severity Levels
Common Monitoring Scenarios
Troubleshooting Metrics

Overview¶

HeliosDB Phase 1 includes comprehensive monitoring for all three production features:

Conversational BI: Query latency, LLM API performance, cache hit rates, NL2SQL accuracy
Auto-Compliance: Compliance violations, audit log health, check latency, storage usage
Embedded+Cloud: Sync performance, conflict rates, offline mode, WebSocket connections

Monitoring Stack¶

Prometheus: Metrics collection and storage
Grafana: Visualization and dashboards
Service Metrics: Built-in Prometheus exporters in each service

Monitoring Architecture¶

┌─────────────────────────────────────────────────────┐
│                  Feature Services                    │
│  ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│  │Conversational│ │Auto-Compliance│ │Embedded+Cloud│ │
│  │BI :9091      │ │ :9092         │ │ :9093        │ │
│  └──────┬───────┘ └──────┬────────┘ └──────┬──────┘ │
└─────────┼────────────────┼─────────────────┼────────┘
          │                │                 │
          │    Metrics     │                 │
          │   (Prometheus  │                 │
          │    format)     │                 │
          ▼                ▼                 ▼
     ┌────────────────────────────────────────────┐
     │              Prometheus :9090               │
     │         (Scrape, Store, Alert)             │
     └────────────────┬───────────────────────────┘
                      │
                      │ Query
                      │
                      ▼
     ┌────────────────────────────────────────────┐
     │              Grafana :3000                  │
     │       (Visualize, Dashboard, Notify)       │
     └────────────────────────────────────────────┘

Key Metrics¶

Conversational BI Metrics¶

Query Performance¶

Metric	Description	Type	Alert Threshold
`heliosdb_conversational_bi_query_duration_seconds`	NL query to SQL generation latency	Histogram	p95 > 10s
`heliosdb_conversational_bi_requests_total`	Total query requests	Counter	-
`heliosdb_conversational_bi_llm_errors_total`	LLM API errors	Counter	rate > 0.1/s

LLM Performance¶

Metric	Description	Type	Alert Threshold
`heliosdb_conversational_bi_llm_tokens_used_total`	Total tokens consumed	Counter	rate > 1M/hour
`heliosdb_conversational_bi_llm_latency_seconds`	LLM API call latency	Histogram	p95 > 5s
`heliosdb_conversational_bi_llm_cost_estimate`	Estimated cost in USD	Gauge	-

Cache Performance¶

Metric	Description	Type	Alert Threshold
`heliosdb_conversational_bi_cache_hits_total`	Cache hits	Counter	-
`heliosdb_conversational_bi_cache_misses_total`	Cache misses	Counter	-
`heliosdb_conversational_bi_cache_size_bytes`	Cache memory usage	Gauge	> 1GB

Accuracy Metrics¶

Metric	Description	Type	Alert Threshold
`heliosdb_conversational_bi_nl2sql_accuracy`	SQL generation accuracy	Gauge	< 0.7
`heliosdb_conversational_bi_sql_validation_errors_total`	Invalid SQL generated	Counter	rate > 0.05/s

Auto-Compliance Metrics¶

Compliance Violations¶

Metric	Description	Type	Alert Threshold
`heliosdb_compliance_violations_total`	Compliance violations by framework	Counter	any > 0
`heliosdb_compliance_checks_total`	Compliance checks performed	Counter	-
`heliosdb_compliance_check_duration_seconds`	Compliance check latency	Histogram	p95 > 5s

Audit Log Health¶

Metric	Description	Type	Alert Threshold
`heliosdb_compliance_audit_log_write_duration_seconds`	Audit log write latency	Histogram	p95 > 1s
`heliosdb_compliance_audit_log_write_errors_total`	Audit log write failures	Counter	any > 0
`heliosdb_compliance_audit_log_storage_bytes`	Audit log storage size	Gauge	> 50GB

Report Generation¶

Metric	Description	Type	Alert Threshold
`heliosdb_compliance_reports_generated_total`	Reports generated	Counter	-
`heliosdb_compliance_report_generation_failures_total`	Report generation failures	Counter	any > 0
`heliosdb_compliance_report_generation_duration_seconds`	Report generation time	Histogram	p95 > 30s

Embedded+Cloud Metrics¶

Sync Performance¶

Metric	Description	Type	Alert Threshold
`heliosdb_embedded_cloud_sync_duration_seconds`	Sync operation latency	Histogram	p95 > 30s
`heliosdb_embedded_cloud_sync_success_total`	Successful syncs	Counter	-
`heliosdb_embedded_cloud_sync_failures_total`	Failed syncs	Counter	rate > 0.01/s

Conflict Resolution¶

Metric	Description	Type	Alert Threshold
`heliosdb_embedded_cloud_conflicts_total`	Data conflicts detected	Counter	rate > 0.05/s
`heliosdb_embedded_cloud_conflicts_resolved_total`	Conflicts resolved	Counter	-
`heliosdb_embedded_cloud_conflict_resolution_failures_total`	Conflicts unresolved	Counter	any > 0

WebSocket Connections¶

Metric	Description	Type	Alert Threshold
`heliosdb_embedded_cloud_active_connections`	Active WebSocket connections	Gauge	-
`heliosdb_embedded_cloud_websocket_disconnects_total`	WebSocket disconnections	Counter	rate > 10/s
`heliosdb_embedded_cloud_websocket_message_errors_total`	WebSocket message errors	Counter	rate > 0.1/s

Offline Mode¶

Metric	Description	Type	Alert Threshold
`heliosdb_embedded_cloud_offline_mode_active`	Devices in offline mode	Gauge	-
`heliosdb_embedded_cloud_offline_cache_usage_bytes`	Offline cache size	Gauge	> 80% of limit
`heliosdb_embedded_cloud_offline_cache_evictions_total`	Cache evictions	Counter	rate > 1/s

Grafana Dashboards¶

Accessing Dashboards¶

Open Grafana: http://localhost:3000
Login with credentials from .env:
Username: admin
Password: <GRAFANA_ADMIN_PASSWORD>
Navigate to Dashboards > Browse

Available Dashboards¶

1. Conversational BI Dashboard¶

Location: Dashboards > Conversational BI - Production Metrics

Panels: - Service Status (UP/DOWN indicator) - Requests Per Minute - NL2SQL Accuracy - Cache Hit Rate - LLM Error Rate - Query Latency Percentiles (p50, p95, p99) - Request Rate & Error Rate - LLM Token Usage by Provider - Cache Hit vs Miss Distribution - Memory Usage - Rate Limited Requests by Client - Query Type Distribution

Key Insights: - Are queries being served successfully? - Is the LLM API responding? - Is the cache effective? - Are we staying within rate limits?

2. Auto-Compliance Dashboard¶

Location: Dashboards > Auto-Compliance - Production Metrics

Panels: - Service Status - Total Violations (24h) - Compliance Check Latency (p95) - Audit Log Storage - Audit Log Write Errors - Reports Generated (24h) - Compliance Violations by Framework - Audit Log Write Performance - Compliance Checks by Framework - Violation Types Distribution - Audit Log Retention Compliance - Alert Delivery Success Rate - Recent Compliance Violations (Top 10) - Report Generation Success & Failures - Memory Usage - Audit Log Compression Ratio

Key Insights: - Are there any compliance violations? - Is the audit log healthy? - Are reports being generated successfully? - Is retention policy being met?

3. Embedded+Cloud Dashboard¶

Location: Dashboards > Embedded+Cloud Unified - Production Metrics

Panels: - Service Status - Active Connections - Sync Success Rate - Sync Latency (p95) - Conflicts (1h) - Offline Mode Devices - Sync Operations (Success vs Failures) - Sync Latency Percentiles - WebSocket Connections - Data Transfer Rate - Conflict Resolution Strategy Usage - Device Count by User - Offline Cache Usage - Cloud Storage Operation Latency - Top Error Types - Sync Queue Length - Device Auth Failures - Memory Usage - Cloud Storage Errors

Key Insights: - Are syncs completing successfully? - How many conflicts are occurring? - Is offline mode working? - Are WebSocket connections stable?

Alert Rules¶

Alert Configuration¶

Alert rules are defined in: - /home/claude/HeliosDB/deployment/staging/monitoring/prometheus/alerts/

Conversational BI Alerts¶

Critical Alerts: 1. ConversationalBIServiceDown: Service unavailable for > 2 minutes 2. ConversationalBICriticalLatency: p99 latency > 30s for > 3 minutes 3. ConversationalBILLMAPIUnavailable: LLM API error rate > 0.5 errors/sec

Warning Alerts: 1. ConversationalBIHighLatency: p95 latency > 10s for > 5 minutes 2. ConversationalBIHighLLMErrorRate: LLM error rate > 0.1 errors/sec 3. ConversationalBILowAccuracy: NL2SQL accuracy < 70% for > 10 minutes 4. ConversationalBILowCacheHitRate: Cache hit rate < 30% for > 10 minutes 5. ConversationalBIRateLimitExceeded: > 10 requests/sec being rate-limited 6. ConversationalBIHighMemoryUsage: Memory usage > 3.5GB for > 5 minutes

Info Alerts: 1. ConversationalBINoTraffic: No requests for > 10 minutes 2. ConversationalBIHighLLMCost: Token usage > 1M tokens/hour

Auto-Compliance Alerts¶

Critical Alerts: 1. ComplianceServiceDown: Service unavailable for > 2 minutes 2. ComplianceViolationDetected: Any compliance violation detected 3. ComplianceAuditLogWriteFailure: Audit log write errors detected 4. ComplianceGDPRViolation: GDPR violation detected 5. ComplianceHIPAAViolation: HIPAA violation detected 6. CompliancePCIDSSViolation: PCI-DSS violation detected 7. ComplianceAuditLogStorageCritical: Audit log storage > 80GB

Warning Alerts: 1. ComplianceAuditLogHighLatency: p95 write latency > 1s 2. ComplianceCheckHighLatency: p95 check latency > 5s 3. ComplianceCheckFailures: Check failure rate > 0.05 failures/sec 4. ComplianceReportGenerationFailed: Report generation failed 5. ComplianceAuditLogStorageHigh: Audit log storage > 50GB 6. ComplianceAlertDeliveryFailure: Alert delivery failing 7. ComplianceRetentionPolicyViolation: Logs older than retention policy

Embedded+Cloud Alerts¶

Critical Alerts: 1. EmbeddedCloudSyncServiceDown: Service unavailable for > 2 minutes 2. EmbeddedCloudCriticalSyncLatency: p99 sync latency > 120s 3. EmbeddedCloudHighSyncFailureRate: Sync failure rate > 0.1 failures/sec 4. EmbeddedCloudStorageUnavailable: Cloud storage error rate > 0.1 errors/sec 5. EmbeddedCloudCriticalSyncQueueBacklog: Sync queue > 10,000 items

Warning Alerts: 1. EmbeddedCloudHighSyncLatency: p95 sync latency > 30s 2. EmbeddedCloudSyncFailures: Sync failure rate > 0.01 failures/sec 3. EmbeddedCloudConflictResolutionFailures: Unable to resolve conflicts 4. EmbeddedCloudStorageHighLatency: p95 storage operation latency > 5s 5. EmbeddedCloudOfflineCacheNearFull: Offline cache > 80% capacity 6. EmbeddedCloudDeviceAuthFailures: Device auth failure rate > 0.05 failures/sec 7. EmbeddedCloudHighWebSocketDisconnects: Disconnect rate > 10/s 8. EmbeddedCloudSyncQueueBacklog: Sync queue > 1,000 items

Info Alerts: 1. EmbeddedCloudHighConflictRate: Conflict rate > 0.05 conflicts/sec 2. EmbeddedCloudOfflineModeActivated: Devices operating offline for > 10 minutes 3. EmbeddedCloudDeviceLimitExceeded: Users hitting device limits 4. EmbeddedCloudNoActiveConnections: No WebSocket connections for > 15 minutes 5. EmbeddedCloudHighDataTransferRate: Data transfer > 100 MB/sec

Alert Severity Levels¶

Critical (P1)¶

Definition: Service is down, data loss risk, or compliance violation

Response Time: Immediate (< 15 minutes)

Actions: 1. Page on-call engineer 2. Begin incident response 3. Check recent deployments 4. Review logs immediately

Examples: - Service completely down - Database unavailable - Compliance violations detected - Audit log write failures

Warning (P2)¶

Definition: Service degraded, potential issue developing

Response Time: Within 1 hour

Actions: 1. Notify team via Slack/email 2. Investigate root cause 3. Monitor for escalation 4. Plan remediation

Examples: - High latency - Elevated error rates - Resource approaching limits - Report generation failures

Info (P3)¶

Definition: Notable event, no immediate action required

Response Time: Next business day

Actions: 1. Log for investigation 2. Review during normal hours 3. Update runbooks if needed

Examples: - No traffic (off-hours) - High token usage - Offline mode activated - Informational events

Common Monitoring Scenarios¶

Scenario 1: High Query Latency¶

Symptoms: - Dashboard shows p95 > 10s - Alert: ConversationalBIHighLatency

Investigation: 1. Check Grafana: Conversational BI > Query Latency panel 2. Check LLM API latency: Is the LLM slow? 3. Check cache hit rate: Is cache effective? 4. Check database latency: Is PostgreSQL slow?

Resolution:

# Check service logs
docker compose -f deployment/staging/docker-compose.yml logs conversational-bi | grep -i latency

# Check LLM API status
curl http://localhost:9091/metrics | grep llm_latency

# Restart service if needed
docker compose -f deployment/staging/docker-compose.yml restart conversational-bi

Scenario 2: Compliance Violation Detected¶

Symptoms: - Dashboard shows violation count > 0 - Alert: ComplianceViolationDetected

Investigation: 1. Check Grafana: Compliance > Violations by Framework panel 2. Identify which framework (GDPR, HIPAA, etc.) 3. Check audit logs for violation details

Resolution:

# Check compliance logs
docker compose -f deployment/staging/docker-compose.yml logs compliance | grep -i violation

# Access compliance dashboard
open http://localhost:8090

# Review violation details
curl http://localhost:8082/api/v1/compliance/violations

Scenario 3: Sync Failures¶

Symptoms: - Dashboard shows high sync failure rate - Alert: EmbeddedCloudSyncFailures

Investigation: 1. Check Grafana: Embedded+Cloud > Sync Operations panel 2. Check cloud storage connectivity 3. Check for conflicts or errors

Resolution:

# Check sync service logs
docker compose -f deployment/staging/docker-compose.yml logs embedded-cloud-sync | grep -i sync

# Check S3 connectivity
curl http://localhost:9093/metrics | grep storage_errors

# Check sync queue
curl http://localhost:8083/api/v1/sync/queue/status

Troubleshooting Metrics¶

Metrics Not Appearing¶

Issue: Grafana shows "No data"

Diagnosis:

# 1. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# 2. Check service metrics endpoints
curl http://localhost:9091/metrics  # Should return Prometheus metrics
curl http://localhost:9092/metrics
curl http://localhost:9093/metrics

# 3. Check Prometheus logs
docker compose -f deployment/staging/docker-compose.yml logs prometheus

Resolution:

# Restart Prometheus
docker compose -f deployment/staging/docker-compose.yml restart prometheus

# Verify scrape config
docker compose -f deployment/staging/docker-compose.yml exec prometheus \
  cat /etc/prometheus/prometheus.yml

Alert Not Firing¶

Issue: Expected alert doesn't trigger

Diagnosis:

# Check Prometheus rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name == "YourAlertName")'

# Check if metric exists
curl http://localhost:9090/api/v1/query?query=<metric_name>

Resolution:

# Reload Prometheus config
curl -X POST http://localhost:9090/-/reload

# Check alert state
curl http://localhost:9090/api/v1/alerts

Best Practices¶

Check Dashboards Daily: Review key metrics every morning
Investigate Warnings: Don't ignore warning-level alerts
Baseline Metrics: Understand normal operating ranges
Document Incidents: Keep runbooks updated
Regular Reviews: Weekly review of alert effectiveness
Tune Thresholds: Adjust based on observed behavior
Alert Fatigue: Reduce noisy alerts

Next Steps¶

Configure Alertmanager (optional): Set up email/Slack notifications
Create Custom Dashboards: Add business-specific metrics
Set Up SLOs: Define Service Level Objectives
Enable Tracing: Add distributed tracing for debugging
Log Aggregation: Integrate with ELK or Loki

Monitoring is operational! Your HeliosDB Phase 1 staging environment is fully observable.