claudetools/projects/msp-tools/guru-connect/PHASE1_WEEK2_INFRASTRUCTURE.md
Mike Swanson 8521c95755 Phase 1 Week 2: Infrastructure & Monitoring
Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.
2026-01-17 20:24:32 -07:00


Phase 1, Week 2 - Infrastructure & Monitoring

Date Started: 2026-01-18
Target Completion: 2026-01-25
Status: Starting
Priority: HIGH (Production Readiness)


Executive Summary

With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server currently runs manually (nohup start-secure.sh &), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

Goals:

  1. Systemd service with auto-restart on failure
  2. Prometheus metrics for monitoring
  3. Grafana dashboards for visualization
  4. Automated PostgreSQL backups
  5. Log rotation and management

Dependencies:

  • SSH access to 172.16.3.30 as guru user
  • Sudo access for systemd service installation
  • PostgreSQL credentials (currently broken; backup automation can still be set up in the meantime)

Week 2 Task Breakdown

Day 1: Systemd Service Configuration

Goal: Convert manual server startup to systemd-managed service

Tasks:

  1. Create systemd service file (/etc/systemd/system/guruconnect.service)
  2. Configure service dependencies (network, postgresql)
  3. Set restart policy (on-failure, with a restart delay and a start-rate limit)
  4. Configure environment variables securely
  5. Enable service to start on boot
  6. Test service start/stop/restart
  7. Verify auto-restart on crash

Files to Create:

  • server/guruconnect.service - Systemd unit file
  • server/setup-systemd.sh - Installation script

Verification:

  • Service starts automatically on boot
  • Service restarts on failure (kill -9 test)
  • Logs go to the systemd journal (viewable with journalctl)
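
A minimal sketch of what setup-systemd.sh might contain, assuming the unit file is copied from server/guruconnect.service and the script runs as root. The run helper and the DRY_RUN variable are hypothetical, added here so the steps can be previewed without touching the system.

```shell
#!/usr/bin/env bash
# Sketch of setup-systemd.sh (hypothetical contents, not the final script).

UNIT_SRC="server/guruconnect.service"
UNIT_DST="/etc/systemd/system/guruconnect.service"

# Print the command instead of running it when DRY_RUN=1 (assumed convention).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Install the unit, reload systemd, enable at boot, start now, report state.
install_service() {
    run install -m 0644 "$UNIT_SRC" "$UNIT_DST" || return 1
    run systemctl daemon-reload || return 1
    run systemctl enable --now guruconnect.service || return 1
    run systemctl is-active guruconnect.service
}
```

Calling install_service with DRY_RUN=1 set prints the four commands in order; without it, they are executed.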

Day 2: Prometheus Metrics

Goal: Expose metrics for monitoring server health and performance

Tasks:

  1. Add prometheus-client dependency to Cargo.toml
  2. Create metrics module (server/src/metrics/mod.rs)
  3. Implement metric types:
    • Counter: requests_total, sessions_total, errors_total
    • Gauge: active_sessions, active_connections
    • Histogram: request_duration_seconds, session_duration_seconds
  4. Add /metrics endpoint
  5. Integrate metrics into existing code:
    • Session creation/close
    • Request handling
    • WebSocket connections
    • Database operations
  6. Test metrics endpoint (curl http://172.16.3.30:3002/metrics)

Files to Create/Modify:

  • server/Cargo.toml - Add dependencies
  • server/src/metrics/mod.rs - Metrics module
  • server/src/main.rs - Add /metrics endpoint
  • server/src/relay/mod.rs - Add session metrics
  • server/src/api/mod.rs - Add request metrics

Metrics to Track:

  • guruconnect_requests_total{method, path, status} - HTTP requests
  • guruconnect_sessions_total{status} - Sessions (created, closed, failed)
  • guruconnect_active_sessions - Current active sessions
  • guruconnect_active_connections{type} - WebSocket connections (agents, viewers)
  • guruconnect_request_duration_seconds{method, path} - Request latency
  • guruconnect_session_duration_seconds - Session lifetime
  • guruconnect_errors_total{type} - Error counts
  • guruconnect_db_operations_total{operation, status} - Database operations

Verification:

  • Metrics endpoint returns Prometheus format
  • Metrics update in real-time
  • No performance degradation
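
One way to check the endpoint against the list of metrics above is a small helper that reads the scraped text and reports which expected names are absent. This is a sketch: missing_metrics and EXPECTED_METRICS are hypothetical names, and depending on the client library, counters may be exposed with an automatically added suffix, so the list may need adjusting to match the real output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list expected metric names that do not appear in
# Prometheus text-format output read from stdin.

EXPECTED_METRICS=(
    guruconnect_requests_total
    guruconnect_sessions_total
    guruconnect_active_sessions
    guruconnect_active_connections
    guruconnect_request_duration_seconds
    guruconnect_session_duration_seconds
    guruconnect_errors_total
    guruconnect_db_operations_total
)

missing_metrics() {
    local body name
    body="$(cat)"
    for name in "${EXPECTED_METRICS[@]}"; do
        # Accept either a "# TYPE <name> ..." line or a sample line. Sample
        # lines may carry a histogram suffix and are followed by "{" or a space.
        if ! grep -Eq "^(# TYPE )?${name}(_bucket|_sum|_count)?(\{| )" <<<"$body"; then
            echo "$name"
        fi
    done
}
```

Typical use would be curl -fsS http://172.16.3.30:3002/metrics | missing_metrics, where empty output means every expected metric was found.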

Day 3: Grafana Dashboard

Goal: Create visual dashboards for monitoring GuruConnect

Tasks:

  1. Install Prometheus on 172.16.3.30
  2. Configure Prometheus to scrape GuruConnect metrics
  3. Install Grafana on 172.16.3.30
  4. Configure Grafana data source (Prometheus)
  5. Create dashboards:
    • Overview: Active sessions, requests/sec, errors
    • Sessions: Session lifecycle, duration distribution
    • Performance: Request latency, database query time
    • Errors: Error rates by type
  6. Set up alerting rules (if time permits)

Files to Create:

  • infrastructure/prometheus.yml - Prometheus configuration
  • infrastructure/grafana-dashboard.json - Pre-built dashboard
  • infrastructure/setup-monitoring.sh - Installation script

Grafana Dashboard Panels:

  1. Active Sessions (Gauge)
  2. Requests per Second (Graph)
  3. Error Rate (Graph)
  4. Session Creation Rate (Graph)
  5. Request Latency p50/p95/p99 (Graph)
  6. Active Connections by Type (Graph)
  7. Database Operations (Graph)
  8. Top Errors (Table)

Verification:

  • Prometheus scrapes metrics successfully
  • Grafana dashboard displays real-time data
  • Alerts fire on test conditions

Day 4: Automated PostgreSQL Backups

Goal: Implement automated daily backups with retention policy

Tasks:

  1. Create backup script (server/backup-postgres.sh)
  2. Configure backup location (/home/guru/backups/guruconnect/)
  3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
  4. Create systemd timer for daily backups
  5. Add backup monitoring (success/failure metrics)
  6. Test backup and restore process
  7. Document restore procedure

Files to Create:

  • server/backup-postgres.sh - Backup script
  • server/restore-postgres.sh - Restore script
  • server/guruconnect-backup.service - Systemd service
  • server/guruconnect-backup.timer - Systemd timer

Backup Strategy:

  • Daily full backups at 2:00 AM
  • Compressed with gzip
  • Named with timestamp: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
  • Stored in /home/guru/backups/guruconnect/
  • Retention: 30 days daily, 4 weeks weekly, 6 months monthly
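
The naming and compression rules above could be sketched as follows. This is an assumption about how backup-postgres.sh might look, not its actual contents: backup_filename and run_backup are hypothetical names, and the connection string is taken from the DATABASE_URL variable defined in the .env file.

```shell
#!/usr/bin/env bash
# Sketch of the core of backup-postgres.sh (hypothetical function names).

BACKUP_DIR="${BACKUP_DIR:-/home/guru/backups/guruconnect}"

# Build the backup file name from a timestamp (defaults to the current time).
backup_filename() {
    local stamp="${1:-$(date +%Y-%m-%d-%H%M%S)}"
    printf 'guruconnect-%s.sql.gz\n' "$stamp"
}

# Dump the database named by DATABASE_URL into a gzip-compressed file.
# Prints the path of the finished backup on success.
run_backup() {
    local target tmp
    if [ -z "${DATABASE_URL:-}" ]; then
        echo "DATABASE_URL is not set" >&2
        return 1
    fi
    mkdir -p "$BACKUP_DIR" || return 1
    target="$BACKUP_DIR/$(backup_filename)"
    tmp="$target.partial"
    # pipefail makes a pg_dump failure fail the whole pipeline, so a broken
    # dump is never renamed into place.
    if ( set -o pipefail; pg_dump "$DATABASE_URL" | gzip > "$tmp" ); then
        mv "$tmp" "$target" && echo "$target"
    else
        rm -f "$tmp"
        echo "backup failed" >&2
        return 1
    fi
}
```

Writing to a .partial file first means a failed or interrupted dump never leaves a file that looks like a valid backup, which also covers the case where the database is unreachable.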

Verification:

  • Manual backup works
  • Automated backup runs daily
  • Restore process verified
  • Old backups cleaned up correctly
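
The plan does not say how the daily, weekly, and monthly tiers are laid out on disk. The sketch below assumes one subdirectory per tier (daily/, weekly/, monthly/) under the backup directory and keeps the newest N files in each. The function names and the directory layout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of count-based retention (hypothetical layout: one directory per tier).

# Delete all but the newest <keep> backups in <dir>. Relies on the timestamped
# file names sorting chronologically, which shell glob expansion preserves.
prune_keep_newest() {
    local dir="$1" keep="$2"
    local -a files=()
    local f i excess
    for f in "$dir"/guruconnect-*.sql.gz; do
        if [ -e "$f" ]; then
            files+=("$f")
        fi
    done
    excess=$(( ${#files[@]} - keep ))
    for (( i = 0; i < excess; i++ )); do
        rm -f -- "${files[$i]}"
    done
    return 0
}

# Apply the 30 daily / 4 weekly / 6 monthly policy to the assumed tier folders.
apply_retention() {
    local root="$1"
    prune_keep_newest "$root/daily" 30
    prune_keep_newest "$root/weekly" 4
    prune_keep_newest "$root/monthly" 6
}
```

Pruning by count rather than by file age means a period with failed backups never causes the last good copies to be deleted.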

Day 5: Log Rotation & Health Checks

Goal: Implement log rotation and continuous health monitoring

Tasks:

  1. Configure logrotate for GuruConnect logs
  2. Implement health check improvements:
    • Database connectivity check
    • Disk space check
    • Memory usage check
    • Active session count check
  3. Create monitoring script (server/health-monitor.sh)
  4. Add health metrics to Prometheus
  5. Create systemd watchdog configuration
  6. Document operational procedures

Files to Create:

  • server/guruconnect.logrotate - Logrotate configuration
  • server/health-monitor.sh - Health monitoring script
  • server/OPERATIONS.md - Operational runbook

Health Checks:

  • /health endpoint (basic - already exists)
  • /health/deep endpoint (detailed checks):
    • Database connection: OK/FAIL
    • Disk space: >10% free
    • Memory: <90% used
    • Active sessions: <100 (threshold)
    • Uptime: seconds since start
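
The thresholds above could be evaluated in health-monitor.sh along these lines. This is a sketch under stated assumptions: the function and variable names are hypothetical, memory usage is read from /proc/meminfo (Linux only), and the active session count is expected to be supplied by the caller, for example from the guruconnect_active_sessions metric.

```shell
#!/usr/bin/env bash
# Sketch of threshold checks for health-monitor.sh (hypothetical names).

MIN_DISK_FREE_PCT=10      # disk: more than 10% free
MAX_MEM_USED_PCT=90       # memory: less than 90% used
MAX_ACTIVE_SESSIONS=100   # sessions: fewer than 100

# Percentage of free space on the filesystem holding a path (POSIX df -P).
disk_free_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print 100 - $5 }'
}

# Percentage of memory in use, based on MemAvailable in /proc/meminfo.
mem_used_pct() {
    awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
         END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' /proc/meminfo
}

# Compare readings with the thresholds. Prints one line per check and
# returns non-zero if any check fails.
evaluate_health() {
    local disk_free="$1" mem_used="$2" sessions="$3" rc=0
    if [ "$disk_free" -gt "$MIN_DISK_FREE_PCT" ]; then
        echo "disk: OK"
    else
        echo "disk: FAIL"; rc=1
    fi
    if [ "$mem_used" -lt "$MAX_MEM_USED_PCT" ]; then
        echo "memory: OK"
    else
        echo "memory: FAIL"; rc=1
    fi
    if [ "$sessions" -lt "$MAX_ACTIVE_SESSIONS" ]; then
        echo "sessions: OK"
    else
        echo "sessions: FAIL"; rc=1
    fi
    return "$rc"
}
```

A monitoring run might call evaluate_health "$(disk_free_pct /)" "$(mem_used_pct)" "$sessions" and send an alert when it returns non-zero.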

Verification:

  • Logs rotate correctly
  • Health checks report accurate status
  • Alerts triggered on health failures

Infrastructure Files Structure

guru-connect/
├── server/
│   ├── guruconnect.service        # Systemd service file
│   ├── setup-systemd.sh           # Service installation script
│   ├── backup-postgres.sh         # PostgreSQL backup script
│   ├── restore-postgres.sh        # PostgreSQL restore script
│   ├── guruconnect-backup.service # Backup systemd service
│   ├── guruconnect-backup.timer   # Backup systemd timer
│   ├── guruconnect.logrotate      # Logrotate configuration
│   ├── health-monitor.sh          # Health monitoring script
│   └── OPERATIONS.md              # Operational runbook
├── infrastructure/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── grafana-dashboard.json     # Grafana dashboard export
│   └── setup-monitoring.sh        # Monitoring setup script
└── docs/
    └── MONITORING.md              # Monitoring documentation

Systemd Service Configuration

Service File: /etc/systemd/system/guruconnect.service

[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start rate limit (these settings belong in [Unit] on current systemd)
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

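# Watchdog (the server must send sd_notify WATCHDOG=1 keepalives within this
# interval; otherwise systemd treats it as failed and restarts it)
WatchdogSec=30s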

[Install]
WantedBy=multi-user.target

Environment File: /home/guru/guru-connect/server/.env

# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

# Monitoring
# Expose on same port as main service (comment kept on its own line because
# systemd's EnvironmentFile does not strip inline comments)
PROMETHEUS_PORT=3002

Prometheus Configuration

File: infrastructure/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Testing Checklist

Systemd Service Tests

  • Service starts correctly: sudo systemctl start guruconnect
  • Service stops correctly: sudo systemctl stop guruconnect
  • Service restarts correctly: sudo systemctl restart guruconnect
  • Service is enabled at boot: sudo systemctl enable guruconnect (confirm with systemctl is-enabled guruconnect, then reboot)
  • Service restarts on crash: sudo kill -9 <pid> (wait 10s)
  • Logs visible in journalctl: sudo journalctl -u guruconnect -f
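
The crash-restart test can be scripted. The sketch below assumes it runs as root (or as a user allowed to signal the service), and verify_auto_restart is a hypothetical helper name. The default wait of 15 seconds allows for RestartSec=10s plus a margin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: kill the service's main process and confirm that
# systemd brings it back under a new PID.

verify_auto_restart() {
    local unit="${1:-guruconnect.service}" wait_secs="${2:-15}"
    local old_pid new_pid
    old_pid="$(systemctl show --property MainPID --value "$unit")"
    if [ -z "$old_pid" ] || [ "$old_pid" = "0" ]; then
        echo "FAIL: $unit is not running"
        return 1
    fi
    kill -9 "$old_pid"
    sleep "$wait_secs"
    new_pid="$(systemctl show --property MainPID --value "$unit")"
    if systemctl is-active --quiet "$unit" \
        && [ "$new_pid" != "0" ] && [ "$new_pid" != "$old_pid" ]; then
        echo "OK: restarted (PID $old_pid -> $new_pid)"
    else
        echo "FAIL: $unit did not restart within ${wait_secs}s"
        return 1
    fi
}
```

Comparing PIDs as well as the active state guards against the case where the kill did not actually reach the process.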

Prometheus Metrics Tests

  • Metrics endpoint accessible: curl http://172.16.3.30:3002/metrics
  • Metrics format valid (Prometheus server can scrape)
  • Session metrics update on session creation/close
  • Request metrics update on HTTP requests
  • Error metrics update on failures

Grafana Dashboard Tests

  • Prometheus data source connected
  • All panels display data
  • Data updates in real-time (<30s delay)
  • Historical data visible (after 1 hour)
  • Dashboard exports to JSON successfully

Backup Tests

  • Manual backup creates file: bash backup-postgres.sh
  • Backup file is compressed and named correctly
  • Restore works: bash restore-postgres.sh <backup-file>
  • Timer triggers daily at 2:00 AM
  • Retention policy removes old backups
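
For the 2:00 AM timer test, a setup script could generate guruconnect-backup.timer as sketched below. The unit contents are an assumption about what the timer might look like, and render_backup_timer is a hypothetical function name. Persistent=true is included so a run missed while the machine was off is caught up at the next boot.

```shell
#!/usr/bin/env bash
# Sketch: emit a systemd timer unit that fires daily at 02:00 local time.

render_backup_timer() {
    cat <<'EOF'
[Unit]
Description=Daily GuruConnect PostgreSQL backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
Unit=guruconnect-backup.service

[Install]
WantedBy=timers.target
EOF
}
```

The output can be written to /etc/systemd/system/guruconnect-backup.timer, after which systemctl list-timers guruconnect-backup.timer shows the next scheduled run.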

Health Check Tests

  • Basic health endpoint: curl http://172.16.3.30:3002/health
  • Deep health endpoint: curl http://172.16.3.30:3002/health/deep
  • Health checks report database status
  • Health checks report disk/memory usage

Risk Assessment

HIGH RISK

Issue: Database credentials still broken
Impact: Cannot test database-dependent features
Mitigation: Create backup scripts that work even if database is down (conditional logic)

Issue: Sudo access required for systemd
Impact: Cannot install service without password
Mitigation: Prepare scripts and documentation, request sudo access from system admin

MEDIUM RISK

Issue: Prometheus/Grafana installation may require dependencies
Impact: Additional setup time
Mitigation: Use Docker containers if system install is complex

Issue: Metrics may add performance overhead
Impact: Latency increase
Mitigation: Use efficient metrics library, test performance before/after

LOW RISK

Issue: Log rotation misconfiguration
Impact: Disk space issues
Mitigation: Test logrotate configuration thoroughly, set conservative limits


Success Criteria

Week 2 is complete when:

  1. Systemd Service

    • Service starts/stops correctly
    • Auto-restarts on failure
    • Starts on boot
    • Logs to the systemd journal (viewable with journalctl)
  2. Prometheus Metrics

    • /metrics endpoint working
    • Key metrics implemented:
      • Request counts and latency
      • Session counts and duration
      • Active connections
      • Error rates
    • Prometheus can scrape successfully
  3. Grafana Dashboard

    • Prometheus data source configured
    • Dashboard with 8+ panels
    • Real-time data display
    • Dashboard exported to JSON
  4. Automated Backups

    • Backup script functional
    • Daily backups via systemd timer
    • Retention policy enforced
    • Restore procedure documented
  5. Health Monitoring

    • Log rotation configured
    • Health checks implemented
    • Health metrics exposed
    • Operational runbook created

Exit Criteria: All 5 areas have passing tests, production infrastructure is stable and monitored.


Next Steps (Week 3)

After Week 2 infrastructure completion:

  • Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
  • Week 4: Production hardening (load testing, performance optimization, security audit)
  • Phase 2: Core features development

Document Status: READY
Owner: Development Team
Started: 2026-01-18
Target: 2026-01-25