HeliosDB Monitoring and Alerting Guide

Version: 7.0.0
Environment: Staging
Last Updated: 2025-11-17


Table of Contents

  1. Overview
  2. Monitoring Architecture
  3. Key Metrics
  4. Grafana Dashboards
  5. Alert Rules
  6. Alert Severity Levels
  7. Common Monitoring Scenarios
  8. Troubleshooting Metrics
  9. Best Practices
  10. Next Steps

Overview

HeliosDB Phase 1 includes comprehensive monitoring for all three production features:

  • Conversational BI: Query latency, LLM API performance, cache hit rates, NL2SQL accuracy
  • Auto-Compliance: Compliance violations, audit log health, check latency, storage usage
  • Embedded+Cloud: Sync performance, conflict rates, offline mode, WebSocket connections

Monitoring Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • Service Metrics: Built-in Prometheus exporters in each service

Monitoring Architecture

┌───────────────────────────────────────────────────────┐
│                   Feature Services                    │
│  ┌──────────────┐ ┌───────────────┐ ┌──────────────┐  │
│  │Conversational│ │Auto-Compliance│ │Embedded+Cloud│  │
│  │BI :9091      │ │ :9092         │ │ :9093        │  │
│  └──────┬───────┘ └───────┬───────┘ └──────┬───────┘  │
└─────────┼─────────────────┼────────────────┼──────────┘
          │                 │                │
          │    Metrics      │                │
          │   (Prometheus   │                │
          │    format)      │                │
          ▼                 ▼                ▼
     ┌────────────────────────────────────────────┐
     │              Prometheus :9090              │
     │           (Scrape, Store, Alert)           │
     └──────────────────────┬─────────────────────┘
                            │ Query
                            ▼
     ┌────────────────────────────────────────────┐
     │               Grafana :3000                │
     │       (Visualize, Dashboard, Notify)       │
     └────────────────────────────────────────────┘

Key Metrics

Conversational BI Metrics

Query Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_query_duration_seconds | NL query to SQL generation latency | Histogram | p95 > 10s |
| heliosdb_conversational_bi_requests_total | Total query requests | Counter | - |
| heliosdb_conversational_bi_llm_errors_total | LLM API errors | Counter | rate > 0.1/s |
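
The p95 thresholds in these tables are not stored as a metric; they have to be derived from the histogram buckets at query time. A minimal sketch of one way to do that from the shell is shown here, assuming the histograms expose the conventional `_bucket` series with an `le` label. The helper name `p95_query` and the 5m default window are illustrative assumptions, not part of HeliosDB.

```shell
# Build a PromQL expression for the p95 of a Prometheus histogram.
# Assumes the metric exposes the conventional <name>_bucket series with an "le" label.
p95_query() {
  metric="$1"          # histogram base name, without the _bucket suffix
  window="${2:-5m}"    # rate window; 5m is an assumed default, not taken from this guide
  printf 'histogram_quantile(0.95, sum by (le) (rate(%s_bucket[%s])))\n' "$metric" "$window"
}

# Example (needs a running Prometheus on :9090):
#   curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode "query=$(p95_query heliosdb_conversational_bi_query_duration_seconds)"
```

Using `curl -G` with `--data-urlencode` avoids having to escape the parentheses and brackets in the expression by hand.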

LLM Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_llm_tokens_used_total | Total tokens consumed | Counter | rate > 1M tokens/hour |
| heliosdb_conversational_bi_llm_latency_seconds | LLM API call latency | Histogram | p95 > 5s |
| heliosdb_conversational_bi_llm_cost_estimate | Estimated cost in USD | Gauge | - |

Cache Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_cache_hits_total | Cache hits | Counter | - |
| heliosdb_conversational_bi_cache_misses_total | Cache misses | Counter | - |
| heliosdb_conversational_bi_cache_size_bytes | Cache memory usage | Gauge | > 1GB |
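
Cache hit rate is not exported directly; it is the ratio hits / (hits + misses). As a rough sketch, the function below computes that ratio from raw Prometheus exposition text on stdin. It assumes sample lines carry no trailing timestamp, and it sums across any labeled series. Because the counters are cumulative, the result is the ratio since the service started, which may differ from a windowed rate shown on a dashboard. The name `cache_hit_ratio` is hypothetical.

```shell
# Compute the cache hit ratio from Prometheus exposition text read on stdin.
# Comment lines (# HELP / # TYPE) are skipped because their first field is "#".
cache_hit_ratio() {
  awk '
    $1 ~ /^heliosdb_conversational_bi_cache_hits_total/   { hits += $NF }
    $1 ~ /^heliosdb_conversational_bi_cache_misses_total/ { misses += $NF }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }   # no samples seen yet
      printf "%.2f\n", hits / total
    }'
}

# Example (needs the Conversational BI service on :9091):
#   curl -s http://localhost:9091/metrics | cache_hit_ratio
```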

Accuracy Metrics

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_nl2sql_accuracy | SQL generation accuracy | Gauge | < 0.7 |
| heliosdb_conversational_bi_sql_validation_errors_total | Invalid SQL generated | Counter | rate > 0.05/s |

Auto-Compliance Metrics

Compliance Violations

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_violations_total | Compliance violations by framework | Counter | any > 0 |
| heliosdb_compliance_checks_total | Compliance checks performed | Counter | - |
| heliosdb_compliance_check_duration_seconds | Compliance check latency | Histogram | p95 > 5s |

Audit Log Health

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_audit_log_write_duration_seconds | Audit log write latency | Histogram | p95 > 1s |
| heliosdb_compliance_audit_log_write_errors_total | Audit log write failures | Counter | any > 0 |
| heliosdb_compliance_audit_log_storage_bytes | Audit log storage size | Gauge | > 50GB |

Report Generation

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_reports_generated_total | Reports generated | Counter | - |
| heliosdb_compliance_report_generation_failures_total | Report generation failures | Counter | any > 0 |
| heliosdb_compliance_report_generation_duration_seconds | Report generation time | Histogram | p95 > 30s |

Embedded+Cloud Metrics

Sync Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_sync_duration_seconds | Sync operation latency | Histogram | p95 > 30s |
| heliosdb_embedded_cloud_sync_success_total | Successful syncs | Counter | - |
| heliosdb_embedded_cloud_sync_failures_total | Failed syncs | Counter | rate > 0.01/s |
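
Thresholds such as rate > 0.01/s refer to the per-second increase of a counter, not its absolute value. The arithmetic can be sketched with two samples taken a known interval apart. This is a simplification: Prometheus's own rate() also extrapolates to the window edges, so its results will not match exactly. The function name `counter_rate` is hypothetical.

```shell
# Per-second rate between two counter samples taken a known interval apart.
counter_rate() {
  # $1: earlier sample, $2: later sample, $3: interval in seconds
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN {
    if (dt <= 0) exit 1          # an interval must be positive
    d = b - a
    if (d < 0) d = b             # counter reset: count only the growth since the restart
    printf "%.4f\n", d / dt
  }'
}

# Example: 6 new sync failures in 60 seconds is 0.1/s, above the 0.01/s threshold.
#   counter_rate 100 106 60
```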

Conflict Resolution

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_conflicts_total | Data conflicts detected | Counter | rate > 0.05/s |
| heliosdb_embedded_cloud_conflicts_resolved_total | Conflicts resolved | Counter | - |
| heliosdb_embedded_cloud_conflict_resolution_failures_total | Conflicts unresolved | Counter | any > 0 |

WebSocket Connections

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_active_connections | Active WebSocket connections | Gauge | - |
| heliosdb_embedded_cloud_websocket_disconnects_total | WebSocket disconnections | Counter | rate > 10/s |
| heliosdb_embedded_cloud_websocket_message_errors_total | WebSocket message errors | Counter | rate > 0.1/s |

Offline Mode

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_offline_mode_active | Devices in offline mode | Gauge | - |
| heliosdb_embedded_cloud_offline_cache_usage_bytes | Offline cache size | Gauge | > 80% of limit |
| heliosdb_embedded_cloud_offline_cache_evictions_total | Cache evictions | Counter | rate > 1/s |

Grafana Dashboards

Accessing Dashboards

  1. Open Grafana: http://localhost:3000
  2. Log in with the credentials from .env:
     • Username: admin
     • Password: <GRAFANA_ADMIN_PASSWORD>
  3. Navigate to Dashboards > Browse

Available Dashboards

1. Conversational BI Dashboard

Location: Dashboards > Conversational BI - Production Metrics

Panels:

  • Service Status (UP/DOWN indicator)
  • Requests Per Minute
  • NL2SQL Accuracy
  • Cache Hit Rate
  • LLM Error Rate
  • Query Latency Percentiles (p50, p95, p99)
  • Request Rate & Error Rate
  • LLM Token Usage by Provider
  • Cache Hit vs Miss Distribution
  • Memory Usage
  • Rate Limited Requests by Client
  • Query Type Distribution

Key Insights:

  • Are queries being served successfully?
  • Is the LLM API responding?
  • Is the cache effective?
  • Are we staying within rate limits?

2. Auto-Compliance Dashboard

Location: Dashboards > Auto-Compliance - Production Metrics

Panels:

  • Service Status
  • Total Violations (24h)
  • Compliance Check Latency (p95)
  • Audit Log Storage
  • Audit Log Write Errors
  • Reports Generated (24h)
  • Compliance Violations by Framework
  • Audit Log Write Performance
  • Compliance Checks by Framework
  • Violation Types Distribution
  • Audit Log Retention Compliance
  • Alert Delivery Success Rate
  • Recent Compliance Violations (Top 10)
  • Report Generation Success & Failures
  • Memory Usage
  • Audit Log Compression Ratio

Key Insights:

  • Are there any compliance violations?
  • Is the audit log healthy?
  • Are reports being generated successfully?
  • Is the retention policy being met?

3. Embedded+Cloud Dashboard

Location: Dashboards > Embedded+Cloud Unified - Production Metrics

Panels:

  • Service Status
  • Active Connections
  • Sync Success Rate
  • Sync Latency (p95)
  • Conflicts (1h)
  • Offline Mode Devices
  • Sync Operations (Success vs Failures)
  • Sync Latency Percentiles
  • WebSocket Connections
  • Data Transfer Rate
  • Conflict Resolution Strategy Usage
  • Device Count by User
  • Offline Cache Usage
  • Cloud Storage Operation Latency
  • Top Error Types
  • Sync Queue Length
  • Device Auth Failures
  • Memory Usage
  • Cloud Storage Errors

Key Insights:

  • Are syncs completing successfully?
  • How many conflicts are occurring?
  • Is offline mode working?
  • Are WebSocket connections stable?


Alert Rules

Alert Configuration

Alert rules are defined in:

  • /home/claude/HeliosDB/deployment/staging/monitoring/prometheus/alerts/

Conversational BI Alerts

Critical Alerts:

  1. ConversationalBIServiceDown: Service unavailable for > 2 minutes
  2. ConversationalBICriticalLatency: p99 latency > 30s for > 3 minutes
  3. ConversationalBILLMAPIUnavailable: LLM API error rate > 0.5 errors/sec

Warning Alerts:

  1. ConversationalBIHighLatency: p95 latency > 10s for > 5 minutes
  2. ConversationalBIHighLLMErrorRate: LLM error rate > 0.1 errors/sec
  3. ConversationalBILowAccuracy: NL2SQL accuracy < 70% for > 10 minutes
  4. ConversationalBILowCacheHitRate: Cache hit rate < 30% for > 10 minutes
  5. ConversationalBIRateLimitExceeded: > 10 requests/sec being rate-limited
  6. ConversationalBIHighMemoryUsage: Memory usage > 3.5GB for > 5 minutes

Info Alerts:

  1. ConversationalBINoTraffic: No requests for > 10 minutes
  2. ConversationalBIHighLLMCost: Token usage > 1M tokens/hour
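
The rule files themselves are not reproduced in this guide. As a rough sketch of how the ConversationalBIHighLatency description (p95 latency > 10s for > 5 minutes) could map onto a Prometheus alerting rule, the shell function below prints one possible rule. The group name, the 5m rate window, the severity label, and the summary text are assumptions for illustration; the actual files in the alerts directory may differ.

```shell
# Print a hypothetical alerting rule for ConversationalBIHighLatency.
# Group name, rate window, label, and annotation are illustrative assumptions.
high_latency_rule() {
  cat <<'EOF'
groups:
  - name: conversational_bi_sketch
    rules:
      - alert: ConversationalBIHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(heliosdb_conversational_bi_query_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conversational BI p95 query latency above 10s"
EOF
}

# Example: write the sketch to a scratch file and validate its syntax.
#   high_latency_rule > /tmp/high_latency_sketch.yml
#   promtool check rules /tmp/high_latency_sketch.yml
```

The `for: 5m` clause is what expresses "for > 5 minutes": the alert stays pending until the expression has been true for that long.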

Auto-Compliance Alerts

Critical Alerts:

  1. ComplianceServiceDown: Service unavailable for > 2 minutes
  2. ComplianceViolationDetected: Any compliance violation detected
  3. ComplianceAuditLogWriteFailure: Audit log write errors detected
  4. ComplianceGDPRViolation: GDPR violation detected
  5. ComplianceHIPAAViolation: HIPAA violation detected
  6. CompliancePCIDSSViolation: PCI-DSS violation detected
  7. ComplianceAuditLogStorageCritical: Audit log storage > 80GB

Warning Alerts:

  1. ComplianceAuditLogHighLatency: p95 write latency > 1s
  2. ComplianceCheckHighLatency: p95 check latency > 5s
  3. ComplianceCheckFailures: Check failure rate > 0.05 failures/sec
  4. ComplianceReportGenerationFailed: Report generation failed
  5. ComplianceAuditLogStorageHigh: Audit log storage > 50GB
  6. ComplianceAlertDeliveryFailure: Alert delivery failing
  7. ComplianceRetentionPolicyViolation: Logs older than retention policy

Embedded+Cloud Alerts

Critical Alerts:

  1. EmbeddedCloudSyncServiceDown: Service unavailable for > 2 minutes
  2. EmbeddedCloudCriticalSyncLatency: p99 sync latency > 120s
  3. EmbeddedCloudHighSyncFailureRate: Sync failure rate > 0.1 failures/sec
  4. EmbeddedCloudStorageUnavailable: Cloud storage error rate > 0.1 errors/sec
  5. EmbeddedCloudCriticalSyncQueueBacklog: Sync queue > 10,000 items

Warning Alerts:

  1. EmbeddedCloudHighSyncLatency: p95 sync latency > 30s
  2. EmbeddedCloudSyncFailures: Sync failure rate > 0.01 failures/sec
  3. EmbeddedCloudConflictResolutionFailures: Unable to resolve conflicts
  4. EmbeddedCloudStorageHighLatency: p95 storage operation latency > 5s
  5. EmbeddedCloudOfflineCacheNearFull: Offline cache > 80% capacity
  6. EmbeddedCloudDeviceAuthFailures: Device auth failure rate > 0.05 failures/sec
  7. EmbeddedCloudHighWebSocketDisconnects: Disconnect rate > 10/s
  8. EmbeddedCloudSyncQueueBacklog: Sync queue > 1,000 items

Info Alerts:

  1. EmbeddedCloudHighConflictRate: Conflict rate > 0.05 conflicts/sec
  2. EmbeddedCloudOfflineModeActivated: Devices operating offline for > 10 minutes
  3. EmbeddedCloudDeviceLimitExceeded: Users hitting device limits
  4. EmbeddedCloudNoActiveConnections: No WebSocket connections for > 15 minutes
  5. EmbeddedCloudHighDataTransferRate: Data transfer > 100 MB/sec


Alert Severity Levels

Critical (P1)

Definition: Service is down, data loss risk, or compliance violation

Response Time: Immediate (< 15 minutes)

Actions:

  1. Page on-call engineer
  2. Begin incident response
  3. Check recent deployments
  4. Review logs immediately

Examples:

  • Service completely down
  • Database unavailable
  • Compliance violations detected
  • Audit log write failures

Warning (P2)

Definition: Service degraded, potential issue developing

Response Time: Within 1 hour

Actions:

  1. Notify team via Slack/email
  2. Investigate root cause
  3. Monitor for escalation
  4. Plan remediation

Examples:

  • High latency
  • Elevated error rates
  • Resource approaching limits
  • Report generation failures

Info (P3)

Definition: Notable event, no immediate action required

Response Time: Next business day

Actions:

  1. Log for investigation
  2. Review during normal hours
  3. Update runbooks if needed

Examples:

  • No traffic (off-hours)
  • High token usage
  • Offline mode activated
  • Informational events


Common Monitoring Scenarios

Scenario 1: High Query Latency

Symptoms:

  • Dashboard shows p95 > 10s
  • Alert: ConversationalBIHighLatency

Investigation:

  1. Check Grafana: Conversational BI > Query Latency panel
  2. Check LLM API latency: Is the LLM slow?
  3. Check cache hit rate: Is the cache effective?
  4. Check database latency: Is PostgreSQL slow?

Resolution:

# Check service logs
docker compose -f deployment/staging/docker-compose.yml logs conversational-bi | grep -i latency

# Check LLM API latency metrics (-s hides curl's progress meter when piping)
curl -s http://localhost:9091/metrics | grep llm_latency

# Restart service if needed
docker compose -f deployment/staging/docker-compose.yml restart conversational-bi

Scenario 2: Compliance Violation Detected

Symptoms:

  • Dashboard shows violation count > 0
  • Alert: ComplianceViolationDetected

Investigation:

  1. Check Grafana: Compliance > Violations by Framework panel
  2. Identify which framework (GDPR, HIPAA, etc.)
  3. Check audit logs for violation details

Resolution:

# Check compliance logs
docker compose -f deployment/staging/docker-compose.yml logs compliance | grep -i violation

# Access compliance dashboard (open is macOS-specific; on Linux use xdg-open)
open http://localhost:8090

# Review violation details
curl http://localhost:8082/api/v1/compliance/violations

Scenario 3: Sync Failures

Symptoms:

  • Dashboard shows high sync failure rate
  • Alert: EmbeddedCloudSyncFailures

Investigation:

  1. Check Grafana: Embedded+Cloud > Sync Operations panel
  2. Check cloud storage connectivity
  3. Check for conflicts or errors

Resolution:

# Check sync service logs
docker compose -f deployment/staging/docker-compose.yml logs embedded-cloud-sync | grep -i sync

# Check cloud storage (S3) error metrics; rising counts suggest a connectivity problem
curl -s http://localhost:9093/metrics | grep storage_errors

# Check sync queue
curl http://localhost:8083/api/v1/sync/queue/status


Troubleshooting Metrics

Metrics Not Appearing

Issue: Grafana shows "No data"

Diagnosis:

# 1. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

# 2. Check service metrics endpoints
curl http://localhost:9091/metrics  # Should return Prometheus metrics
curl http://localhost:9092/metrics
curl http://localhost:9093/metrics

# 3. Check Prometheus logs
docker compose -f deployment/staging/docker-compose.yml logs prometheus
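
An endpoint can respond successfully and still expose none of the expected series. As a quick sanity check, the sketch below counts sample lines whose metric name starts with a given prefix in exposition text read from stdin. The helper name `count_samples` is hypothetical; the `heliosdb_` prefix matches the metric names listed in this guide.

```shell
# Count sample lines whose metric name starts with the given prefix.
# Comment lines (# HELP / # TYPE) are ignored because their first field is "#".
count_samples() {
  awk -v p="$1" 'index($1, p) == 1 { n++ } END { print n + 0 }'
}

# Example (needs the three services running): a count of 0 means the
# endpoint is reachable but exposes no heliosdb_ series.
#   for port in 9091 9092 9093; do
#     printf '%s: ' "$port"
#     curl -s "http://localhost:$port/metrics" | count_samples heliosdb_
#   done
```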

Resolution:

# Restart Prometheus
docker compose -f deployment/staging/docker-compose.yml restart prometheus

# Verify scrape config
docker compose -f deployment/staging/docker-compose.yml exec prometheus \
  cat /etc/prometheus/prometheus.yml

Alert Not Firing

Issue: Expected alert doesn't trigger

Diagnosis:

# Check Prometheus rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name == "YourAlertName")'

# Check if metric exists (replace <metric_name>; the quotes stop the shell
# from interpreting ?, < and > in the URL)
curl -s 'http://localhost:9090/api/v1/query?query=<metric_name>'

Resolution:

# Reload Prometheus config (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# Check alert state
curl http://localhost:9090/api/v1/alerts


Best Practices

  1. Check Dashboards Daily: Review key metrics every morning
  2. Investigate Warnings: Don't ignore warning-level alerts
  3. Baseline Metrics: Understand normal operating ranges
  4. Document Incidents: Keep runbooks updated
  5. Regular Reviews: Weekly review of alert effectiveness
  6. Tune Thresholds: Adjust based on observed behavior
  7. Alert Fatigue: Reduce noisy alerts

Next Steps

  • Configure Alertmanager (optional): Set up email/Slack notifications
  • Create Custom Dashboards: Add business-specific metrics
  • Set Up SLOs: Define Service Level Objectives
  • Enable Tracing: Add distributed tracing for debugging
  • Log Aggregation: Integrate with ELK or Loki

Monitoring is operational! Your HeliosDB Phase 1 staging environment is fully observable.