
Mike Swanson 8521c95755 Phase 1 Week 2: Infrastructure & Monitoring

Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.

2026-01-17 20:24:32 -07:00

Phase 1, Week 2 - Infrastructure & Monitoring

Date Started: 2026-01-18
Target Completion: 2026-01-25
Status: Starting
Priority: HIGH (Production Readiness)

Executive Summary

With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server currently runs manually (nohup start-secure.sh &), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

Goals:

Systemd service with auto-restart on failure
Prometheus metrics for monitoring
Grafana dashboards for visualization
Automated PostgreSQL backups
Log rotation and management

Dependencies:

SSH access to 172.16.3.30 as guru user
Sudo access for systemd service installation
PostgreSQL credentials (currently broken; backup automation can still be set up in the meantime)

Week 2 Task Breakdown

Day 1: Systemd Service Configuration

Goal: Convert manual server startup to systemd-managed service

Tasks:

Create systemd service file (/etc/systemd/system/guruconnect.service)
Configure service dependencies (network, postgresql)
Set restart policy (on-failure, with backoff)
Configure environment variables securely
Enable service to start on boot
Test service start/stop/restart
Verify auto-restart on crash

Files to Create:

server/guruconnect.service - Systemd unit file
server/setup-systemd.sh - Installation script

Verification:

Service starts automatically on boot
Service restarts on failure (kill -9 test)
Logs go to journalctl

Day 2: Prometheus Metrics

Goal: Expose metrics for monitoring server health and performance

Tasks:

Add prometheus-client dependency to Cargo.toml
Create metrics module (server/src/metrics/mod.rs)
Implement metric types:
- Counter: requests_total, sessions_total, errors_total
- Gauge: active_sessions, active_connections
- Histogram: request_duration_seconds, session_duration_seconds
Add /metrics endpoint
Integrate metrics into existing code:
- Session creation/close
- Request handling
- WebSocket connections
- Database operations
Test metrics endpoint (curl http://172.16.3.30:3002/metrics)

Files to Create/Modify:

server/Cargo.toml - Add dependencies
server/src/metrics/mod.rs - Metrics module
server/src/main.rs - Add /metrics endpoint
server/src/relay/mod.rs - Add session metrics
server/src/api/mod.rs - Add request metrics

Metrics to Track:

guruconnect_requests_total{method, path, status} - HTTP requests
guruconnect_sessions_total{status} - Sessions (created, closed, failed)
guruconnect_active_sessions - Current active sessions
guruconnect_active_connections{type} - WebSocket connections (agents, viewers)
guruconnect_request_duration_seconds{method, path} - Request latency
guruconnect_session_duration_seconds - Session lifetime
guruconnect_errors_total{type} - Error counts
guruconnect_db_operations_total{operation, status} - Database operations
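
To make the target output concrete, the following is a minimal sketch of what a counter from the list above looks like in the Prometheus text exposition format. It is written in Python purely for illustration; the server itself is Rust and would rely on its metrics crate for rendering. The `/api/sessions` path and the helper names are assumptions, not part of the codebase.

```python
from collections import Counter


def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family.

    `samples` maps a tuple of (label_name, label_value) pairs to a count.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, count in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {count}")
    return "\n".join(lines) + "\n"


# Hypothetical traffic: two health probes and one failed API call.
requests = Counter()
requests[(("method", "GET"), ("path", "/health"), ("status", "200"))] += 2
requests[(("method", "POST"), ("path", "/api/sessions"), ("status", "500"))] += 1

output = render_counter("guruconnect_requests_total", "HTTP requests", requests)
print(output, end="")
```

Each distinct label combination becomes its own time series, so the `path` label is best kept to a bounded set of route names rather than raw URLs.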

Verification:

Metrics endpoint returns Prometheus format
Metrics update in real-time
No performance degradation

Day 3: Grafana Dashboard

Goal: Create visual dashboards for monitoring GuruConnect

Tasks:

Install Prometheus on 172.16.3.30
Configure Prometheus to scrape GuruConnect metrics
Install Grafana on 172.16.3.30
Configure Grafana data source (Prometheus)
Create dashboards:
- Overview: Active sessions, requests/sec, errors
- Sessions: Session lifecycle, duration distribution
- Performance: Request latency, database query time
- Errors: Error rates by type
Set up alerting rules (if time permits)

Files to Create:

infrastructure/prometheus.yml - Prometheus configuration
infrastructure/grafana-dashboard.json - Pre-built dashboard
infrastructure/setup-monitoring.sh - Installation script

Grafana Dashboard Panels:

Active Sessions (Gauge)
Requests per Second (Graph)
Error Rate (Graph)
Session Creation Rate (Graph)
Request Latency p50/p95/p99 (Graph)
Active Connections by Type (Graph)
Database Operations (Graph)
Top Errors (Table)
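
The p50/p95/p99 latency panel is derived from histogram buckets rather than from raw request timings. The sketch below approximates, in Python, the kind of linear interpolation that PromQL's histogram_quantile applies to cumulative buckets. The bucket boundaries and counts are made-up sample data, and the function is an illustration, not the query engine's actual code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets.

    The bucket with upper bound +Inf must be present and holds the total count.
    The lower bound of the first bucket is assumed to be 0.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite bound, since no upper limit is known.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical guruconnect_request_duration_seconds buckets (seconds, cumulative count).
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)} = {histogram_quantile(q, sample):.3f}s")
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.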

Verification:

Prometheus scrapes metrics successfully
Grafana dashboard displays real-time data
Alerts fire on test conditions

Day 4: Automated PostgreSQL Backups

Goal: Implement automated daily backups with retention policy

Tasks:

Create backup script (server/backup-postgres.sh)
Configure backup location (/home/guru/backups/guruconnect/)
Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
Create systemd timer for daily backups
Add backup monitoring (success/failure metrics)
Test backup and restore process
Document restore procedure

Files to Create:

server/backup-postgres.sh - Backup script
server/restore-postgres.sh - Restore script
server/guruconnect-backup.service - Systemd service
server/guruconnect-backup.timer - Systemd timer

Backup Strategy:

Daily full backups at 2:00 AM
Compressed with gzip
Named with timestamp: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
Stored in /home/guru/backups/guruconnect/
Retention: 30 days daily, 4 weeks weekly, 6 months monthly
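
The retention rule can be read in more than one way. The sketch below takes one interpretation: keep the newest backup from each of the last 30 days, the last 4 ISO weeks, and the last 6 calendar months that have backups, and treat everything else as deletable. It is a Python illustration of the selection logic only; the actual backup-postgres.sh may implement this differently, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

PREFIX = "guruconnect-"
SUFFIX = ".sql.gz"
STAMP = "%Y-%m-%d-%H%M%S"


def backup_name(when):
    """Build a file name of the form guruconnect-YYYY-MM-DD-HHMMSS.sql.gz."""
    return f"{PREFIX}{when.strftime(STAMP)}{SUFFIX}"


def parse_backup_name(name):
    """Recover the timestamp embedded in a backup file name."""
    return datetime.strptime(name[len(PREFIX):-len(SUFFIX)], STAMP)


def select_keep(names, daily=30, weekly=4, monthly=6):
    """Return the set of backup names to keep; the rest may be deleted."""
    newest_first = sorted(names, key=parse_backup_name, reverse=True)
    rules = (
        (daily, lambda t: t.date()),
        (weekly, lambda t: t.isocalendar()[:2]),   # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),
    )
    keep = set()
    for limit, period_of in rules:
        seen = []
        for name in newest_first:
            period = period_of(parse_backup_name(name))
            if period in seen:
                continue            # a newer backup already covers this period
            if len(seen) == limit:
                break               # enough periods covered for this rule
            seen.append(period)
            keep.add(name)
    return keep


# Sample data: 100 consecutive daily backups taken at 02:00.
latest = datetime(2026, 1, 25, 2, 0, 0)
names = [backup_name(latest - timedelta(days=i)) for i in range(100)]
keep = select_keep(names)
print(len(keep), "kept,", len(names) - len(keep), "eligible for deletion")
```

Selecting by covered period rather than by file age means a few missed nightly runs do not shorten the history that is retained.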

Verification:

Manual backup works
Automated backup runs daily
Restore process verified
Old backups cleaned up correctly

Day 5: Log Rotation & Health Checks

Goal: Implement log rotation and continuous health monitoring

Tasks:

Configure logrotate for GuruConnect logs
Implement health check improvements:
- Database connectivity check
- Disk space check
- Memory usage check
- Active session count check
Create monitoring script (server/health-monitor.sh)
Add health metrics to Prometheus
Create systemd watchdog configuration
Document operational procedures

Files to Create:

server/guruconnect.logrotate - Logrotate configuration
server/health-monitor.sh - Health monitoring script
server/OPERATIONS.md - Operational runbook

Health Checks:

/health endpoint (basic - already exists)
/health/deep endpoint (detailed checks):
- Database connection: OK/FAIL
- Disk space: >10% free
- Memory: <90% used
- Active sessions: <100 (threshold)
- Uptime: seconds since start
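
The thresholds above can be expressed as a small pure function, which keeps the pass/fail logic testable separately from how the measurements are collected. This is a Python sketch under assumptions: the response shape, the field names, and the "degraded" status label are hypothetical, and the real /health/deep handler lives in the Rust server.

```python
import shutil
from dataclasses import dataclass


@dataclass
class HealthInputs:
    db_ok: bool
    disk_free_pct: float
    mem_used_pct: float
    active_sessions: int
    uptime_seconds: int


def evaluate(inputs, max_sessions=100):
    """Apply the deep health thresholds and return a report dictionary."""
    checks = {
        "database": inputs.db_ok,
        "disk_space": inputs.disk_free_pct > 10.0,    # more than 10% free
        "memory": inputs.mem_used_pct < 90.0,         # less than 90% used
        "active_sessions": inputs.active_sessions < max_sessions,
    }
    return {
        "status": "ok" if all(checks.values()) else "degraded",
        "checks": {name: "OK" if passed else "FAIL" for name, passed in checks.items()},
        "uptime_seconds": inputs.uptime_seconds,
    }


def disk_free_pct(path="/"):
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


healthy = evaluate(HealthInputs(True, 42.0, 55.0, 7, 3600))
low_disk = evaluate(HealthInputs(True, 5.0, 55.0, 7, 3600))
print(healthy["status"], low_disk["status"], low_disk["checks"]["disk_space"])
```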

Verification:

Logs rotate correctly
Health checks report accurate status
Alerts triggered on health failures

Infrastructure Files Structure

guru-connect/
├── server/
│   ├── guruconnect.service        # Systemd service file
│   ├── setup-systemd.sh           # Service installation script
│   ├── backup-postgres.sh         # PostgreSQL backup script
│   ├── restore-postgres.sh        # PostgreSQL restore script
│   ├── guruconnect-backup.service # Backup systemd service
│   ├── guruconnect-backup.timer   # Backup systemd timer
│   ├── guruconnect.logrotate      # Logrotate configuration
│   ├── health-monitor.sh          # Health monitoring script
│   └── OPERATIONS.md              # Operational runbook
├── infrastructure/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── grafana-dashboard.json     # Grafana dashboard export
│   └── setup-monitoring.sh        # Monitoring setup script
└── docs/
    └── MONITORING.md              # Monitoring documentation

Systemd Service Configuration

Service File: /etc/systemd/system/guruconnect.service

[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start rate limit (current systemd reads these settings from [Unit])
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

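# Watchdog (the server must send sd_notify WATCHDOG=1 keep-alives within
# each 30s window, otherwise systemd treats it as hung and restarts it)
WatchdogSec=30s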

[Install]
WantedBy=multi-user.target
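
Before copying the unit file into /etc/systemd/system, a quick sanity check can catch a missing key early. The following Python sketch parses a unit file with the standard library's configparser and reports which of a few keys are absent. The choice of required keys is an assumption made for this example, and the check is no substitute for systemd-analyze verify on the target host.

```python
import configparser

# Keys this sketch treats as mandatory (an assumption for illustration).
REQUIRED = {
    ("Service", "ExecStart"),
    ("Service", "Restart"),
    ("Service", "RestartSec"),
    ("Install", "WantedBy"),
}


def lint_unit(text):
    """Return a list of problems found in the unit file text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # systemd key names are case-sensitive
    parser.read_string(text)
    return [
        f"missing {section}.{key}"
        for section, key in sorted(REQUIRED)
        if not parser.has_option(section, key)
    ]


UNIT = """\
[Unit]
Description=GuruConnect Remote Desktop Server
After=network-online.target postgresql.service

[Service]
Type=simple
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server
# Restart policy
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
"""

print(lint_unit(UNIT) or "no problems found")
```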

Environment File: /home/guru/guru-connect/server/.env

# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

# Monitoring
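# Expose metrics on the same port as the main service
# (kept on its own line: systemd does not strip trailing comments in an EnvironmentFile)
PROMETHEUS_PORT=3002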

Prometheus Configuration

File: infrastructure/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Testing Checklist

Systemd Service Tests

Service starts correctly: sudo systemctl start guruconnect
Service stops correctly: sudo systemctl stop guruconnect
Service restarts correctly: sudo systemctl restart guruconnect
Service auto-starts on boot: sudo systemctl enable guruconnect
Service restarts on crash: sudo kill -9 <pid> (wait 10s)
Logs visible in journalctl: sudo journalctl -u guruconnect -f

Prometheus Metrics Tests

Metrics endpoint accessible: curl http://172.16.3.30:3002/metrics
Metrics format valid (Prometheus client can scrape)
Session metrics update on session creation/close
Request metrics update on HTTP requests
Error metrics update on failures
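
The format and coverage checks above can be partly automated. This Python sketch takes the text returned by the /metrics endpoint, rejects lines that do not look like valid samples, and lists which of the planned metric families are still missing. The regular expression is a simplification of the exposition format, and the sample payload is invented for the example.

```python
import re

# Metric families planned in "Metrics to Track".
EXPECTED = [
    "guruconnect_requests_total",
    "guruconnect_sessions_total",
    "guruconnect_active_sessions",
    "guruconnect_active_connections",
    "guruconnect_request_duration_seconds",
    "guruconnect_session_duration_seconds",
    "guruconnect_errors_total",
    "guruconnect_db_operations_total",
]

# name, optional {labels}, value, optional timestamp (simplified grammar)
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(\s+-?\d+)?$")


def metric_names(text):
    """Return the set of sample names; raise ValueError on a malformed line."""
    names = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP, TYPE, or comment
        match = SAMPLE_LINE.match(line)
        if match is None:
            raise ValueError(f"unparseable sample line: {line!r}")
        float(match.group(3))  # raises ValueError unless numeric (NaN and +Inf pass)
        names.add(match.group(1))
    return names


def missing_metrics(text, expected=EXPECTED):
    """List expected families with no sample, allowing histogram suffixes."""
    present = metric_names(text)
    suffixes = ("", "_bucket", "_sum", "_count")
    return [n for n in expected if not any(n + s in present for s in suffixes)]


SAMPLE = """\
# HELP guruconnect_requests_total HTTP requests
# TYPE guruconnect_requests_total counter
guruconnect_requests_total{method="GET",path="/health",status="200"} 12
guruconnect_active_sessions 3
guruconnect_request_duration_seconds_bucket{le="+Inf"} 12
guruconnect_request_duration_seconds_sum 0.84
guruconnect_request_duration_seconds_count 12
"""

print("missing:", missing_metrics(SAMPLE))
```

In practice the input would be the body returned by the curl command in the first checklist item.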

Grafana Dashboard Tests

Prometheus data source connected
All panels display data
Data updates in real-time (<30s delay)
Historical data visible (after 1 hour)
Dashboard exports to JSON successfully

Backup Tests

Manual backup creates file: bash backup-postgres.sh
Backup file is compressed and named correctly
Restore works: bash restore-postgres.sh <backup-file>
Timer triggers daily at 2:00 AM
Retention policy removes old backups

Health Check Tests

Basic health endpoint: curl http://172.16.3.30:3002/health
Deep health endpoint: curl http://172.16.3.30:3002/health/deep
Health checks report database status
Health checks report disk/memory usage

Risk Assessment

HIGH RISK

Issue: Database credentials still broken
Impact: Cannot test database-dependent features
Mitigation: Create backup scripts that work even if database is down (conditional logic)

Issue: Sudo access required for systemd
Impact: Cannot install service without password
Mitigation: Prepare scripts and documentation, request sudo access from system admin

MEDIUM RISK

Issue: Prometheus/Grafana installation may require dependencies
Impact: Additional setup time
Mitigation: Use Docker containers if system install is complex

Issue: Metrics may add performance overhead
Impact: Latency increase
Mitigation: Use efficient metrics library, test performance before/after

LOW RISK

Issue: Log rotation misconfiguration
Impact: Disk space issues
Mitigation: Test logrotate configuration thoroughly, set conservative limits

Success Criteria

Week 2 is complete when:

Systemd Service
- Service starts/stops correctly
- Auto-restarts on failure
- Starts on boot
- Logs to journalctl
Prometheus Metrics
- /metrics endpoint working
- Key metrics implemented:
  - Request counts and latency
  - Session counts and duration
  - Active connections
  - Error rates
- Prometheus can scrape successfully
Grafana Dashboard
- Prometheus data source configured
- Dashboard with 8+ panels
- Real-time data display
- Dashboard exported to JSON
Automated Backups
- Backup script functional
- Daily backups via systemd timer
- Retention policy enforced
- Restore procedure documented
Health Monitoring
- Log rotation configured
- Health checks implemented
- Health metrics exposed
- Operational runbook created

Exit Criteria: All 5 areas have passing tests, production infrastructure is stable and monitored.

Next Steps (Week 3)

After Week 2 infrastructure completion:

Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
Week 4: Production hardening (load testing, performance optimization, security audit)
Phase 2: Core features development

Document Status: READY
Owner: Development Team
Started: 2026-01-18
Target: 2026-01-25
