# Phase 1, Week 2 - Infrastructure & Monitoring

**Date Started:** 2026-01-18
**Target Completion:** 2026-01-25
**Status:** Starting
**Priority:** HIGH (Production Readiness)

---

## Executive Summary

With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server currently runs manually (`nohup start-secure.sh &`), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

**Goals:**
1. Systemd service with auto-restart on failure
2. Prometheus metrics for monitoring
3. Grafana dashboards for visualization
4. Automated PostgreSQL backups
5. Log rotation and management

**Dependencies:**
- SSH access to 172.16.3.30 as `guru` user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken; backup automation can still be prepared)

---

## Week 2 Task Breakdown

### Day 1: Systemd Service Configuration

**Goal:** Convert manual server startup to a systemd-managed service

**Tasks:**
1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
2. Configure service dependencies (network, postgresql)
3. Set restart policy (on-failure, with backoff)
4. Configure environment variables securely
5. Enable service to start on boot
6. Test service start/stop/restart
7. Verify auto-restart on crash

**Files to Create:**
- `server/guruconnect.service` - Systemd unit file
- `server/setup-systemd.sh` - Installation script

**Verification:**
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs go to journalctl

---

### Day 2: Prometheus Metrics

**Goal:** Expose metrics for monitoring server health and performance

**Tasks:**
1. Add `prometheus-client` dependency to Cargo.toml
2. Create metrics module (`server/src/metrics/mod.rs`) — see the sketch after this section
3. Implement metric types:
   - Counter: requests_total, sessions_total, errors_total
   - Gauge: active_sessions, active_connections
   - Histogram: request_duration_seconds, session_duration_seconds
4. Add `/metrics` endpoint
5. Integrate metrics into existing code:
   - Session creation/close
   - Request handling
   - WebSocket connections
   - Database operations
6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)

**Files to Create/Modify:**
- `server/Cargo.toml` - Add dependencies
- `server/src/metrics/mod.rs` - Metrics module
- `server/src/main.rs` - Add /metrics endpoint
- `server/src/relay/mod.rs` - Add session metrics
- `server/src/api/mod.rs` - Add request metrics

**Metrics to Track:**
- `guruconnect_requests_total{method, path, status}` - HTTP requests
- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
- `guruconnect_active_sessions` - Current active sessions
- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
- `guruconnect_request_duration_seconds{method, path}` - Request latency
- `guruconnect_session_duration_seconds` - Session lifetime
- `guruconnect_errors_total{type}` - Error counts
- `guruconnect_db_operations_total{operation, status}` - Database operations

**Verification:**
- Metrics endpoint returns Prometheus format
- Metrics update in real-time
- No performance degradation
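To make the metrics module concrete, here is a minimal sketch of what `server/src/metrics/mod.rs` could look like. It is an illustration, not the final implementation: it uses the `prometheus` crate's macro API plus `once_cell` (the task list names the `prometheus-client` crate, which has a different API), only defines three of the eight planned metrics, and `render()` is a hypothetical helper name for whatever the `/metrics` handler ends up calling.

```rust
// server/src/metrics/mod.rs (sketch, assuming the `prometheus` and `once_cell` crates)
use once_cell::sync::Lazy;
use prometheus::{
    register_histogram_vec, register_int_counter_vec, register_int_gauge, Encoder,
    HistogramVec, IntCounterVec, IntGauge, TextEncoder,
};

// Counter labelled by method, path, and status so dashboards can slice requests.
pub static REQUESTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "guruconnect_requests_total",
        "Total HTTP requests handled",
        &["method", "path", "status"]
    )
    .expect("register guruconnect_requests_total")
});

// Gauge for currently active sessions; incremented/decremented by the relay module.
pub static ACTIVE_SESSIONS: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!("guruconnect_active_sessions", "Currently active sessions")
        .expect("register guruconnect_active_sessions")
});

// Histogram for request latency, feeding the p50/p95/p99 dashboard panels.
pub static REQUEST_DURATION: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "guruconnect_request_duration_seconds",
        "HTTP request latency in seconds",
        &["method", "path"]
    )
    .expect("register guruconnect_request_duration_seconds")
});

// Render all registered metrics in the Prometheus text exposition format.
pub fn render() -> String {
    let encoder = TextEncoder::new();
    let mut buffer = Vec::new();
    encoder
        .encode(&prometheus::gather(), &mut buffer)
        .expect("encode metrics");
    String::from_utf8(buffer).unwrap_or_default()
}
```

With something like this in place, the `/metrics` endpoint added in `main.rs` would simply return `metrics::render()` as plain text, and call sites such as the relay would touch `ACTIVE_SESSIONS` on session open/close.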
---

### Day 3: Grafana Dashboard

**Goal:** Create visual dashboards for monitoring GuruConnect

**Tasks:**
1. Install Prometheus on 172.16.3.30
2. Configure Prometheus to scrape GuruConnect metrics
3. Install Grafana on 172.16.3.30
4. Configure Grafana data source (Prometheus)
5. Create dashboards:
   - Overview: Active sessions, requests/sec, errors
   - Sessions: Session lifecycle, duration distribution
   - Performance: Request latency, database query time
   - Errors: Error rates by type
6. Set up alerting rules (if time permits)

**Files to Create:**
- `infrastructure/prometheus.yml` - Prometheus configuration
- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
- `infrastructure/setup-monitoring.sh` - Installation script

**Grafana Dashboard Panels:**
1. Active Sessions (Gauge)
2. Requests per Second (Graph)
3. Error Rate (Graph)
4. Session Creation Rate (Graph)
5. Request Latency p50/p95/p99 (Graph)
6. Active Connections by Type (Graph)
7. Database Operations (Graph)
8. Top Errors (Table)

**Verification:**
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions

---

### Day 4: Automated PostgreSQL Backups

**Goal:** Implement automated daily backups with retention policy

**Tasks:**
1. Create backup script (`server/backup-postgres.sh`) — sketches of the script and timer follow this section
2. Configure backup location (`/home/guru/backups/guruconnect/`)
3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
4. Create systemd timer for daily backups
5. Add backup monitoring (success/failure metrics)
6. Test backup and restore process
7. Document restore procedure

**Files to Create:**
- `server/backup-postgres.sh` - Backup script
- `server/restore-postgres.sh` - Restore script
- `server/guruconnect-backup.service` - Systemd service
- `server/guruconnect-backup.timer` - Systemd timer

**Backup Strategy:**
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
- Stored in `/home/guru/backups/guruconnect/`
- Retention: 30 days daily, 4 weeks weekly, 6 months monthly

**Verification:**
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly
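As a starting point for `backup-postgres.sh`, the sketch below assumes `DATABASE_URL` can be sourced from the server's `.env` (the file shown later uses plain `KEY=value` shell syntax) and implements only the simple 30-day daily tier; the weekly/monthly tiers and the success/failure metric from the task list still need to be added.

```bash
#!/usr/bin/env bash
# backup-postgres.sh (sketch) - daily pg_dump with gzip compression and 30-day cleanup
set -euo pipefail

BACKUP_DIR="/home/guru/backups/guruconnect"
TIMESTAMP="$(date +%Y-%m-%d-%H%M%S)"
BACKUP_FILE="${BACKUP_DIR}/guruconnect-${TIMESTAMP}.sql.gz"

# Load DATABASE_URL from the server .env (assumption: the file is plain shell-sourceable).
source /home/guru/guru-connect/server/.env

mkdir -p "${BACKUP_DIR}"

# Dump and compress in one pass; fail loudly and remove the partial file on error.
if pg_dump "${DATABASE_URL}" | gzip > "${BACKUP_FILE}"; then
    echo "Backup written to ${BACKUP_FILE}"
else
    echo "Backup FAILED" >&2
    rm -f "${BACKUP_FILE}"
    exit 1
fi

# Simple retention: delete daily backups older than 30 days.
find "${BACKUP_DIR}" -name 'guruconnect-*.sql.gz' -mtime +30 -delete
```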
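The matching `guruconnect-backup.timer` can stay very small. The sketch below assumes a `guruconnect-backup.service` oneshot unit that runs the script above; `Persistent=true` makes systemd run a missed 2:00 AM backup after the host was powered off.

```ini
# guruconnect-backup.timer (sketch)
[Unit]
Description=Daily GuruConnect PostgreSQL backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```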
---

### Day 5: Log Rotation & Health Checks

**Goal:** Implement log rotation and continuous health monitoring

**Tasks:**
1. Configure logrotate for GuruConnect logs
2. Implement health check improvements:
   - Database connectivity check
   - Disk space check
   - Memory usage check
   - Active session count check
3. Create monitoring script (`server/health-monitor.sh`)
4. Add health metrics to Prometheus
5. Create systemd watchdog configuration
6. Document operational procedures

**Files to Create:**
- `server/guruconnect.logrotate` - Logrotate configuration
- `server/health-monitor.sh` - Health monitoring script
- `server/OPERATIONS.md` - Operational runbook

**Health Checks:**
- `/health` endpoint (basic - already exists)
- `/health/deep` endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start

**Verification:**
- Logs rotate correctly
- Health checks report accurate status
- Alerts triggered on health failures

---

## Infrastructure Files Structure

```
guru-connect/
├── server/
│   ├── guruconnect.service          # Systemd service file
│   ├── setup-systemd.sh             # Service installation script
│   ├── backup-postgres.sh           # PostgreSQL backup script
│   ├── restore-postgres.sh          # PostgreSQL restore script
│   ├── guruconnect-backup.service   # Backup systemd service
│   ├── guruconnect-backup.timer     # Backup systemd timer
│   ├── guruconnect.logrotate        # Logrotate configuration
│   ├── health-monitor.sh            # Health monitoring script
│   └── OPERATIONS.md                # Operational runbook
├── infrastructure/
│   ├── prometheus.yml               # Prometheus configuration
│   ├── grafana-dashboard.json       # Grafana dashboard export
│   └── setup-monitoring.sh          # Monitoring setup script
└── docs/
    └── MONITORING.md                # Monitoring documentation
```

---

## Systemd Service Configuration

**Service File: `/etc/systemd/system/guruconnect.service`**

```ini
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s
StartLimitInterval=5min
StartLimitBurst=3

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

# Watchdog - requires the server to send sd_notify WATCHDOG=1 keep-alives,
# otherwise systemd will restart it; enable only once the Day 5 watchdog work lands
WatchdogSec=30s

[Install]
WantedBy=multi-user.target
```

**Environment File: `/home/guru/guru-connect/server/.env`**

```bash
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

# Monitoring
PROMETHEUS_PORT=3002  # Expose on same port as main service
```

---

## Prometheus Configuration

**File: `infrastructure/prometheus.yml`**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
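The `alerts.yml` referenced by `rule_files` above does not exist yet. A minimal starting rule file might look like the sketch below; the thresholds and the `for` durations are placeholders to tune once real traffic data exists, and the second rule assumes the `guruconnect_errors_total` counter from Day 2 is in place.

```yaml
# infrastructure/alerts.yml (sketch)
groups:
  - name: guruconnect
    rules:
      - alert: GuruConnectDown
        expr: up{job="guruconnect"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GuruConnect is not answering Prometheus scrapes"

      - alert: GuruConnectHighErrorRate
        expr: rate(guruconnect_errors_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GuruConnect error rate above 1/s for 10 minutes"
```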
---

## Testing Checklist

### Systemd Service Tests
- [ ] Service starts correctly: `sudo systemctl start guruconnect`
- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
- [ ] Service auto-starts on boot: `sudo systemctl enable guruconnect`
- [ ] Service restarts on crash: `sudo kill -9 <server PID>` (wait 10s)
- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`

### Prometheus Metrics Tests
- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
- [ ] Metrics format valid (Prometheus client can scrape)
- [ ] Session metrics update on session creation/close
- [ ] Request metrics update on HTTP requests
- [ ] Error metrics update on failures

### Grafana Dashboard Tests
- [ ] Prometheus data source connected
- [ ] All panels display data
- [ ] Data updates in real-time (<30s delay)
- [ ] Historical data visible (after 1 hour)
- [ ] Dashboard exports to JSON successfully

### Backup Tests
- [ ] Manual backup creates file: `bash backup-postgres.sh`
- [ ] Backup file is compressed and named correctly
- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
- [ ] Timer triggers daily at 2:00 AM
- [ ] Retention policy removes old backups

### Health Check Tests
- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
- [ ] Health checks report database status
- [ ] Health checks report disk/memory usage

---

## Risk Assessment

### HIGH RISK

**Issue:** Database credentials still broken
**Impact:** Cannot test database-dependent features
**Mitigation:** Create backup scripts that work even if the database is down (conditional logic)

**Issue:** Sudo access required for systemd
**Impact:** Cannot install the service without the password
**Mitigation:** Prepare scripts and documentation; request sudo access from the system admin

### MEDIUM RISK

**Issue:** Prometheus/Grafana installation may require dependencies
**Impact:** Additional setup time
**Mitigation:** Use Docker containers if system install is complex

**Issue:** Metrics may add performance overhead
**Impact:** Latency increase
**Mitigation:** Use an efficient metrics library; test performance before/after

### LOW RISK

**Issue:** Log rotation misconfiguration
**Impact:** Disk space issues
**Mitigation:** Test logrotate configuration thoroughly; set conservative limits

---

## Success Criteria

Week 2 is complete when:

1. **Systemd Service**
   - Service starts/stops correctly
   - Auto-restarts on failure
   - Starts on boot
   - Logs to journalctl

2. **Prometheus Metrics**
   - /metrics endpoint working
   - Key metrics implemented:
     - Request counts and latency
     - Session counts and duration
     - Active connections
     - Error rates
   - Prometheus can scrape successfully

3. **Grafana Dashboard**
   - Prometheus data source configured
   - Dashboard with 8+ panels
   - Real-time data display
   - Dashboard exported to JSON

4. **Automated Backups**
   - Backup script functional
   - Daily backups via systemd timer
   - Retention policy enforced
   - Restore procedure documented

5. **Health Monitoring**
   - Log rotation configured
   - Health checks implemented
   - Health metrics exposed
   - Operational runbook created

**Exit Criteria:** All 5 areas have passing tests; production infrastructure is stable and monitored.

---

## Next Steps (Week 3)

After Week 2 infrastructure completion:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development

---

**Document Status:** READY
**Owner:** Development Team
**Started:** 2026-01-18
**Target:** 2026-01-25