Phase 1 Week 2: Infrastructure & Monitoring

Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.
2026-01-17 20:24:32 -07:00
parent 2481b54a65
commit 8521c95755
17 changed files with 1877 additions and 25 deletions
--- a/projects/msp-tools/guru-connect/PHASE1_WEEK2_INFRASTRUCTURE.md
+++ b/projects/msp-tools/guru-connect/PHASE1_WEEK2_INFRASTRUCTURE.md
@@ -0,0 +1,457 @@
+# Phase 1, Week 2 - Infrastructure & Monitoring
+
+**Date Started:** 2026-01-18
+**Target Completion:** 2026-01-25
+**Status:** Starting
+**Priority:** HIGH (Production Readiness)
+
+---
+
+## Executive Summary
+
+With the Week 1 security fixes complete and deployed, Week 2 focuses on hardening the production infrastructure. The server is currently started by hand (`nohup start-secure.sh &`), has no monitoring, and does not recover automatically after a crash. This week's work puts production-grade infrastructure in place.
+
+**Goals:**
+1. Systemd service with auto-restart on failure
+2. Prometheus metrics for monitoring
+3. Grafana dashboards for visualization
+4. Automated PostgreSQL backups
+5. Log rotation and management
+
+**Dependencies:**
+- SSH access to 172.16.3.30 as `guru` user
+- Sudo access for systemd service installation
+- PostgreSQL credentials (currently broken; backup automation can still be prepared without them)
+
+---
+
+## Week 2 Task Breakdown
+
+### Day 1: Systemd Service Configuration
+
+**Goal:** Convert manual server startup to systemd-managed service
+
+**Tasks:**
+1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
+2. Configure service dependencies (network, postgresql)
+3. Set restart policy (on-failure, with backoff)
+4. Configure environment variables securely
+5. Enable service to start on boot
+6. Test service start/stop/restart
+7. Verify auto-restart on crash
+
+**Files to Create:**
+- `server/guruconnect.service` - Systemd unit file
+- `server/setup-systemd.sh` - Installation script
+
+**Verification:**
+- Service starts automatically on boot
+- Service restarts on failure (kill -9 test)
+- Logs go to journalctl
+
+---
+
+### Day 2: Prometheus Metrics
+
+**Goal:** Expose metrics for monitoring server health and performance
+
+**Tasks:**
+1. Add `prometheus-client` dependency to Cargo.toml
+2. Create metrics module (`server/src/metrics/mod.rs`)
+3. Implement metric types:
+   - Counter: requests_total, sessions_total, errors_total
+   - Gauge: active_sessions, active_connections
+   - Histogram: request_duration_seconds, session_duration_seconds
+4. Add `/metrics` endpoint
+5. Integrate metrics into existing code:
+   - Session creation/close
+   - Request handling
+   - WebSocket connections
+   - Database operations
+6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)
+
+**Files to Create/Modify:**
+- `server/Cargo.toml` - Add dependencies
+- `server/src/metrics/mod.rs` - Metrics module
+- `server/src/main.rs` - Add /metrics endpoint
+- `server/src/relay/mod.rs` - Add session metrics
+- `server/src/api/mod.rs` - Add request metrics
+
+**Metrics to Track:**
+- `guruconnect_requests_total{method, path, status}` - HTTP requests
+- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
+- `guruconnect_active_sessions` - Current active sessions
+- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
+- `guruconnect_request_duration_seconds{method, path}` - Request latency
+- `guruconnect_session_duration_seconds` - Session lifetime
+- `guruconnect_errors_total{type}` - Error counts
+- `guruconnect_db_operations_total{operation, status}` - Database operations
+
+**Verification:**
+- Metrics endpoint returns Prometheus format
+- Metrics update in real-time
+- No performance degradation
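+
+The first verification item can be scripted. The following is a minimal sketch, not one of the planned deliverables: `missing_metrics` is a hypothetical helper that reads a `/metrics` response on stdin and lists any of the metric families named above that do not appear in it. It assumes the families are exposed under exactly the names listed; histograms add `_bucket`, `_sum`, and `_count` suffixes, which the prefix match tolerates.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical helper: report planned metric families that are absent from
+# a Prometheus text-format payload read on stdin.
+
+EXPECTED_METRICS=(
+  guruconnect_requests_total
+  guruconnect_sessions_total
+  guruconnect_active_sessions
+  guruconnect_active_connections
+  guruconnect_request_duration_seconds
+  guruconnect_session_duration_seconds
+  guruconnect_errors_total
+  guruconnect_db_operations_total
+)
+
+missing_metrics() {
+  local payload name status=0
+  payload="$(cat)"
+  for name in "${EXPECTED_METRICS[@]}"; do
+    # Accept either a sample line or a "# TYPE" line that starts with the name.
+    if ! grep -Eq "^(# TYPE )?${name}" <<<"$payload"; then
+      echo "missing: ${name}"
+      status=1
+    fi
+  done
+  return "$status"
+}
+```
+
+Possible usage against the running server: `curl -s http://172.16.3.30:3002/metrics | missing_metrics`. A non-zero exit status means at least one family was not found.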
+
+---
+
+### Day 3: Grafana Dashboard
+
+**Goal:** Create visual dashboards for monitoring GuruConnect
+
+**Tasks:**
+1. Install Prometheus on 172.16.3.30
+2. Configure Prometheus to scrape GuruConnect metrics
+3. Install Grafana on 172.16.3.30
+4. Configure Grafana data source (Prometheus)
+5. Create dashboards:
+   - Overview: Active sessions, requests/sec, errors
+   - Sessions: Session lifecycle, duration distribution
+   - Performance: Request latency, database query time
+   - Errors: Error rates by type
+6. Set up alerting rules (if time permits)
+
+**Files to Create:**
+- `infrastructure/prometheus.yml` - Prometheus configuration
+- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
+- `infrastructure/setup-monitoring.sh` - Installation script
+
+**Grafana Dashboard Panels:**
+1. Active Sessions (Gauge)
+2. Requests per Second (Graph)
+3. Error Rate (Graph)
+4. Session Creation Rate (Graph)
+5. Request Latency p50/p95/p99 (Graph)
+6. Active Connections by Type (Graph)
+7. Database Operations (Graph)
+8. Top Errors (Table)
+
+**Verification:**
+- Prometheus scrapes metrics successfully
+- Grafana dashboard displays real-time data
+- Alerts fire on test conditions
+
+---
+
+### Day 4: Automated PostgreSQL Backups
+
+**Goal:** Implement automated daily backups with retention policy
+
+**Tasks:**
+1. Create backup script (`server/backup-postgres.sh`)
+2. Configure backup location (`/home/guru/backups/guruconnect/`)
+3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
+4. Create systemd timer for daily backups
+5. Add backup monitoring (success/failure metrics)
+6. Test backup and restore process
+7. Document restore procedure
+
+**Files to Create:**
+- `server/backup-postgres.sh` - Backup script
+- `server/restore-postgres.sh` - Restore script
+- `server/guruconnect-backup.service` - Systemd service
+- `server/guruconnect-backup.timer` - Systemd timer
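+
+A minimal sketch of the two unit files, shown in one fence but meant as two separate files. It assumes the backup script is installed at `/home/guru/guru-connect/server/backup-postgres.sh` (inferred from the server working directory, not confirmed) and runs as the `guru` user:
+
+```ini
+# --- guruconnect-backup.service ---
+[Unit]
+Description=GuruConnect PostgreSQL backup
+After=postgresql.service
+
+[Service]
+# oneshot: the unit is "done" when the script exits
+Type=oneshot
+User=guru
+Group=guru
+ExecStart=/bin/bash /home/guru/guru-connect/server/backup-postgres.sh
+
+# --- guruconnect-backup.timer ---
+[Unit]
+Description=Daily GuruConnect PostgreSQL backup
+
+[Timer]
+# Every day at 02:00 local time
+OnCalendar=*-*-* 02:00:00
+# Catch up at next boot if the machine was off at 02:00
+Persistent=true
+
+[Install]
+WantedBy=timers.target
+```
+
+The timer, not the service, is what gets enabled (`sudo systemctl enable --now guruconnect-backup.timer`); `systemctl list-timers` then shows the next scheduled run.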
+
+**Backup Strategy:**
+- Daily full backups at 2:00 AM
+- Compressed with gzip
+- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
+- Stored in `/home/guru/backups/guruconnect/`
+- Retention: 30 daily, 4 weekly, and 6 monthly backups
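+
+One way the strategy above could be sketched in shell. Several things here are assumptions rather than the actual `backup-postgres.sh`: the `daily/`, `weekly/`, and `monthly/` subdirectories, promoting Sunday and first-of-month backups into the longer tiers, an exported `DATABASE_URL`, and the function names themselves.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: dump, compress, then prune each retention tier.
+set -euo pipefail
+
+BACKUP_ROOT="${BACKUP_ROOT:-/home/guru/backups/guruconnect}"
+
+# Build the documented file name: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
+backup_name() {
+  echo "guruconnect-$(date +%Y-%m-%d-%H%M%S).sql.gz"
+}
+
+# Delete all but the newest $2 backups in directory $1.
+# Timestamped names sort chronologically, so a reverse sort is newest-first.
+prune_tier() {
+  local dir="$1" keep="$2"
+  find "$dir" -maxdepth 1 -type f -name 'guruconnect-*.sql.gz' \
+    | sort -r | tail -n +"$((keep + 1))" | xargs -r rm -f --
+}
+
+run_backup() {
+  local name daily
+  name="$(backup_name)"
+  daily="$BACKUP_ROOT/daily"
+  mkdir -p "$daily" "$BACKUP_ROOT/weekly" "$BACKUP_ROOT/monthly"
+
+  # Write to a temporary name first so a failed dump never looks like a backup.
+  pg_dump "$DATABASE_URL" | gzip > "$daily/$name.partial"
+  mv "$daily/$name.partial" "$daily/$name"
+
+  # Assumed promotion rule: Sunday -> weekly, first of month -> monthly.
+  if [ "$(date +%u)" = "7" ]; then cp "$daily/$name" "$BACKUP_ROOT/weekly/"; fi
+  if [ "$(date +%d)" = "01" ]; then cp "$daily/$name" "$BACKUP_ROOT/monthly/"; fi
+
+  prune_tier "$daily" 30
+  prune_tier "$BACKUP_ROOT/weekly" 4
+  prune_tier "$BACKUP_ROOT/monthly" 6
+}
+```
+
+A real script would end with a call to `run_backup`. Pruning by count rather than by file age means a gap in backups never causes the most recent good copies to be deleted.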
+
+**Verification:**
+- Manual backup works
+- Automated backup runs daily
+- Restore process verified
+- Old backups cleaned up correctly
+
+---
+
+### Day 5: Log Rotation & Health Checks
+
+**Goal:** Implement log rotation and continuous health monitoring
+
+**Tasks:**
+1. Configure logrotate for GuruConnect logs
+2. Implement health check improvements:
+   - Database connectivity check
+   - Disk space check
+   - Memory usage check
+   - Active session count check
+3. Create monitoring script (`server/health-monitor.sh`)
+4. Add health metrics to Prometheus
+5. Create systemd watchdog configuration
+6. Document operational procedures
+
+**Files to Create:**
+- `server/guruconnect.logrotate` - Logrotate configuration
+- `server/health-monitor.sh` - Health monitoring script
+- `server/OPERATIONS.md` - Operational runbook
+
+**Health Checks:**
+- `/health` endpoint (basic - already exists)
+- `/health/deep` endpoint (detailed checks):
+  - Database connection: OK/FAIL
+  - Disk space: >10% free
+  - Memory: <90% used
+  - Active sessions: <100 (threshold)
+  - Uptime: seconds since start
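+
+The disk and memory thresholds above could be checked from `health-monitor.sh` along these lines. This is a sketch under assumptions: the function names are hypothetical, memory usage is derived from `MemTotal` and `MemAvailable` in `/proc/meminfo` (Linux only), and the HTTP check simply reuses the existing `/health` endpoint.
+
+```bash
+#!/usr/bin/env bash
+# Sketch of threshold checks for a health monitoring script.
+
+# Percentage of disk space used on the filesystem holding $1 (default /).
+disk_used_pct() {
+  df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
+}
+
+# Percentage of memory in use; $1 overrides the meminfo path.
+mem_used_pct() {
+  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
+       END { printf "%d\n", (t - a) * 100 / t }' "${1:-/proc/meminfo}"
+}
+
+# Print OK or FAIL for "value must stay below limit"; return 1 on FAIL.
+check_below() {
+  local label="$1" value="$2" limit="$3"
+  if [ "$value" -lt "$limit" ]; then
+    echo "$label: OK (${value}%)"
+  else
+    echo "$label: FAIL (${value}% >= ${limit}%)"
+    return 1
+  fi
+}
+
+health_report() {
+  local failed=0
+  # ">10% free" is equivalent to "<90% used".
+  check_below "disk" "$(disk_used_pct /)" 90 || failed=1
+  check_below "memory" "$(mem_used_pct)" 90 || failed=1
+  if curl -fsS --max-time 5 "http://172.16.3.30:3002/health" >/dev/null; then
+    echo "http: OK"
+  else
+    echo "http: FAIL"
+    failed=1
+  fi
+  return "$failed"
+}
+```
+
+Calling `health_report` prints one line per check and returns non-zero if any check failed, which a cron job or systemd timer could use to trigger an alert.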
+
+**Verification:**
+- Logs rotate correctly
+- Health checks report accurate status
+- Alerts triggered on health failures
+
+---
+
+## Infrastructure Files Structure
+
+```
+guru-connect/
+├── server/
+│   ├── guruconnect.service        # Systemd service file
+│   ├── setup-systemd.sh           # Service installation script
+│   ├── backup-postgres.sh         # PostgreSQL backup script
+│   ├── restore-postgres.sh        # PostgreSQL restore script
+│   ├── guruconnect-backup.service # Backup systemd service
+│   ├── guruconnect-backup.timer   # Backup systemd timer
+│   ├── guruconnect.logrotate      # Logrotate configuration
+│   ├── health-monitor.sh          # Health monitoring script
+│   └── OPERATIONS.md              # Operational runbook
+├── infrastructure/
+│   ├── prometheus.yml             # Prometheus configuration
+│   ├── grafana-dashboard.json     # Grafana dashboard export
+│   └── setup-monitoring.sh        # Monitoring setup script
+└── docs/
+    └── MONITORING.md              # Monitoring documentation
+```
+
+---
+
+## Systemd Service Configuration
+
+**Service File: `/etc/systemd/system/guruconnect.service`**
+
+```ini
+[Unit]
+Description=GuruConnect Remote Desktop Server
+Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
+After=network-online.target postgresql.service
+Wants=network-online.target
+# Start rate limiting is configured in [Unit] on current systemd versions
+StartLimitIntervalSec=5min
+StartLimitBurst=3
+
+[Service]
+Type=simple
+User=guru
+Group=guru
+WorkingDirectory=/home/guru/guru-connect/server
+
+# Environment variables
+EnvironmentFile=/home/guru/guru-connect/server/.env
+
+# Start command
+ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server
+
+# Restart policy
+Restart=on-failure
+RestartSec=10s
+
+# Resource limits
+LimitNOFILE=65536
+LimitNPROC=4096
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+
+# Logging
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=guruconnect
+
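+# Watchdog: the server must send WATCHDOG=1 via sd_notify() more often than
+# this interval, otherwise systemd treats it as hung and restarts it
+WatchdogSec=30s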
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Environment File: `/home/guru/guru-connect/server/.env`**
+
+```bash
+# Database
+DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect
+
+# Security
+JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
+AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters
+
+# Server Configuration
+RUST_LOG=info
+HOST=0.0.0.0
+PORT=3002
+
+# Monitoring
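+# Expose on same port as main service (comment kept on its own line:
+# systemd's EnvironmentFile= does not strip trailing "# ..." text)
+PROMETHEUS_PORT=3002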
+```
+
+---
+
+## Prometheus Configuration
+
+**File: `infrastructure/prometheus.yml`**
+
+```yaml
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+  external_labels:
+    cluster: 'guruconnect-production'
+
+scrape_configs:
+  - job_name: 'guruconnect'
+    static_configs:
+      - targets: ['172.16.3.30:3002']
+        labels:
+          env: 'production'
+          service: 'guruconnect-server'
+
+  - job_name: 'node_exporter'
+    static_configs:
+      - targets: ['172.16.3.30:9100']
+        labels:
+          env: 'production'
+          instance: 'rmm-server'
+
+# Alerting rules (optional for Week 2)
+rule_files:
+  - 'alerts.yml'
+
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets: ['localhost:9093']
+```
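+
+The `alerts.yml` referenced under `rule_files` could start from a sketch like the one below. The metric names come from the Day 2 list; the alert names, thresholds, and `for` durations are placeholders chosen for illustration and would need tuning against real traffic.
+
+```yaml
+groups:
+  - name: guruconnect
+    rules:
+      # Scrape target unreachable
+      - alert: GuruConnectDown
+        expr: up{job="guruconnect"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "GuruConnect server is not responding to scrapes"
+
+      # Placeholder threshold: more than 1 error per second
+      - alert: GuruConnectHighErrorRate
+        expr: sum(rate(guruconnect_errors_total[5m])) > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Error rate above 1/s for 5 minutes"
+
+      # Placeholder threshold: p95 latency above 1 second
+      - alert: GuruConnectSlowRequests
+        expr: histogram_quantile(0.95, sum by (le) (rate(guruconnect_request_duration_seconds_bucket[5m]))) > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "p95 request latency above 1s for 5 minutes"
+```
+
+`promtool check rules alerts.yml` validates the file before Prometheus is reloaded.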
+
+---
+
+## Testing Checklist
+
+### Systemd Service Tests
+- [ ] Service starts correctly: `sudo systemctl start guruconnect`
+- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
+- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
+- [ ] Service auto-starts on boot: `sudo systemctl enable guruconnect`
+- [ ] Service restarts on crash: `sudo kill -9 <pid>` (wait 10s)
+- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`
+
+### Prometheus Metrics Tests
+- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
+- [ ] Metrics format valid (the Prometheus server can scrape it)
+- [ ] Session metrics update on session creation/close
+- [ ] Request metrics update on HTTP requests
+- [ ] Error metrics update on failures
+
+### Grafana Dashboard Tests
+- [ ] Prometheus data source connected
+- [ ] All panels display data
+- [ ] Data updates in real-time (<30s delay)
+- [ ] Historical data visible (after 1 hour)
+- [ ] Dashboard exports to JSON successfully
+
+### Backup Tests
+- [ ] Manual backup creates file: `bash backup-postgres.sh`
+- [ ] Backup file is compressed and named correctly
+- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
+- [ ] Timer triggers daily at 2:00 AM
+- [ ] Retention policy removes old backups
+
+### Health Check Tests
+- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
+- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
+- [ ] Health checks report database status
+- [ ] Health checks report disk/memory usage
+
+---
+
+## Risk Assessment
+
+### HIGH RISK
+**Issue:** Database credentials still broken
+**Impact:** Cannot test database-dependent features
+**Mitigation:** Write the backup scripts so they detect an unreachable database and handle it explicitly (conditional logic) rather than failing partway through
+
+**Issue:** Sudo access required for systemd
+**Impact:** Cannot install service without password
+**Mitigation:** Prepare scripts and documentation, request sudo access from system admin
+
+### MEDIUM RISK
+**Issue:** Prometheus/Grafana installation may require dependencies
+**Impact:** Additional setup time
+**Mitigation:** Use Docker containers if system install is complex
+
+**Issue:** Metrics may add performance overhead
+**Impact:** Latency increase
+**Mitigation:** Use efficient metrics library, test performance before/after
+
+### LOW RISK
+**Issue:** Log rotation misconfiguration
+**Impact:** Disk space issues
+**Mitigation:** Test logrotate configuration thoroughly, set conservative limits
+
+---
+
+## Success Criteria
+
+Week 2 is complete when:
+
+1. **Systemd Service**
+   - Service starts/stops correctly
+   - Auto-restarts on failure
+   - Starts on boot
+   - Logs to journalctl
+
+2. **Prometheus Metrics**
+   - /metrics endpoint working
+   - Key metrics implemented:
+     - Request counts and latency
+     - Session counts and duration
+     - Active connections
+     - Error rates
+   - Prometheus can scrape successfully
+
+3. **Grafana Dashboard**
+   - Prometheus data source configured
+   - Dashboard with 8+ panels
+   - Real-time data display
+   - Dashboard exported to JSON
+
+4. **Automated Backups**
+   - Backup script functional
+   - Daily backups via systemd timer
+   - Retention policy enforced
+   - Restore procedure documented
+
+5. **Health Monitoring**
+   - Log rotation configured
+   - Health checks implemented
+   - Health metrics exposed
+   - Operational runbook created
+
+**Exit Criteria:** All 5 areas have passing tests, production infrastructure is stable and monitored.
+
+---
+
+## Next Steps (Week 3)
+
+After Week 2 infrastructure completion:
+- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
+- Week 4: Production hardening (load testing, performance optimization, security audit)
+- Phase 2: Core features development
+
+---
+
+**Document Status:** READY
+**Owner:** Development Team
+**Started:** 2026-01-18
+**Target:** 2026-01-25