Added comprehensive production infrastructure:
Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script
Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates
Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script
PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly
Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures
Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start. Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.
Ready for deployment and testing on RMM server.
Phase 1, Week 2 - Infrastructure & Monitoring
Date Started: 2026-01-18
Target Completion: 2026-01-25
Status: Starting
Priority: HIGH (Production Readiness)
Executive Summary
With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server currently runs manually (nohup start-secure.sh &), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.
Goals:
- Systemd service with auto-restart on failure
- Prometheus metrics for monitoring
- Grafana dashboards for visualization
- Automated PostgreSQL backups
- Log rotation and management
Dependencies:
- SSH access to 172.16.3.30 as guru user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken; backup automation can still be prepared without them)
Week 2 Task Breakdown
Day 1: Systemd Service Configuration
Goal: Convert manual server startup to systemd-managed service
Tasks:
- Create systemd service file (/etc/systemd/system/guruconnect.service)
- Configure service dependencies (network, postgresql)
- Set restart policy (on-failure, with backoff)
- Configure environment variables securely
- Enable service to start on boot
- Test service start/stop/restart
- Verify auto-restart on crash
Files to Create:
- server/guruconnect.service - Systemd unit file
- server/setup-systemd.sh - Installation script
Verification:
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs go to journalctl
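The installation steps above can be sketched as a small shell function. This is only an illustration of what server/setup-systemd.sh might contain, assuming the unit file sits next to the script; the UNIT_DIR and SYSTEMCTL overrides are hypothetical additions that allow a dry run without root.

```shell
# Sketch of setup-systemd.sh (run with sudo on the server).
# UNIT_DIR and SYSTEMCTL are hypothetical overrides for dry runs.
install_unit() (
    unit=$1
    dest=${UNIT_DIR:-/etc/systemd/system}
    systemctl=${SYSTEMCTL:-systemctl}

    if [ ! -f "$unit" ]; then
        echo "unit file not found: $unit" >&2
        exit 1
    fi

    # Copy the unit, make systemd re-read unit files, then enable start on boot.
    install -m 0644 "$unit" "$dest/" &&
        "$systemctl" daemon-reload &&
        "$systemctl" enable "$(basename "$unit")"
)

# Example: install_unit guruconnect.service && systemctl start guruconnect
```

Running the body in a subshell keeps the helper variables local and lets the early exit abort only the function, not the calling script.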
Day 2: Prometheus Metrics
Goal: Expose metrics for monitoring server health and performance
Tasks:
- Add prometheus-client dependency to Cargo.toml
- Create metrics module (server/src/metrics/mod.rs)
- Implement metric types:
  - Counter: requests_total, sessions_total, errors_total
  - Gauge: active_sessions, active_connections
  - Histogram: request_duration_seconds, session_duration_seconds
- Add /metrics endpoint
- Integrate metrics into existing code:
  - Session creation/close
  - Request handling
  - WebSocket connections
  - Database operations
- Test metrics endpoint (curl http://172.16.3.30:3002/metrics)
Files to Create/Modify:
- server/Cargo.toml - Add dependencies
- server/src/metrics/mod.rs - Metrics module
- server/src/main.rs - Add /metrics endpoint
- server/src/relay/mod.rs - Add session metrics
- server/src/api/mod.rs - Add request metrics
Metrics to Track:
- guruconnect_requests_total{method, path, status} - HTTP requests
- guruconnect_sessions_total{status} - Sessions (created, closed, failed)
- guruconnect_active_sessions - Current active sessions
- guruconnect_active_connections{type} - WebSocket connections (agents, viewers)
- guruconnect_request_duration_seconds{method, path} - Request latency
- guruconnect_session_duration_seconds - Session lifetime
- guruconnect_errors_total{type} - Error counts
- guruconnect_db_operations_total{operation, status} - Database operations
Verification:
- Metrics endpoint returns Prometheus format
- Metrics update in real-time
- No performance degradation
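To check that the endpoint exposes every metric listed above, a quick shell filter can compare the scrape output against the planned names. This is a minimal sketch; the function name missing_metrics is illustrative, and matching is by name prefix so that histogram series with suffixes such as _bucket or _sum count as present.

```shell
# Reads Prometheus text exposition from stdin and prints every planned
# metric name that does not appear at the start of any line.
missing_metrics() (
    body=$(cat)
    for name in \
        guruconnect_requests_total \
        guruconnect_sessions_total \
        guruconnect_active_sessions \
        guruconnect_active_connections \
        guruconnect_request_duration_seconds \
        guruconnect_session_duration_seconds \
        guruconnect_errors_total \
        guruconnect_db_operations_total
    do
        printf '%s\n' "$body" | grep -q "^${name}" || echo "$name"
    done
)

# Example: curl -fsS http://172.16.3.30:3002/metrics | missing_metrics
```

Empty output means all planned metrics are present. Some client libraries only emit a labelled series after it has been observed at least once, so a metric may show as missing until the matching event has happened.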
Day 3: Grafana Dashboard
Goal: Create visual dashboards for monitoring GuruConnect
Tasks:
- Install Prometheus on 172.16.3.30
- Configure Prometheus to scrape GuruConnect metrics
- Install Grafana on 172.16.3.30
- Configure Grafana data source (Prometheus)
- Create dashboards:
  - Overview: Active sessions, requests/sec, errors
  - Sessions: Session lifecycle, duration distribution
  - Performance: Request latency, database query time
  - Errors: Error rates by type
- Set up alerting rules (if time permits)
Files to Create:
- infrastructure/prometheus.yml - Prometheus configuration
- infrastructure/grafana-dashboard.json - Pre-built dashboard
- infrastructure/setup-monitoring.sh - Installation script
Grafana Dashboard Panels:
- Active Sessions (Gauge)
- Requests per Second (Graph)
- Error Rate (Graph)
- Session Creation Rate (Graph)
- Request Latency p50/p95/p99 (Graph)
- Active Connections by Type (Graph)
- Database Operations (Graph)
- Top Errors (Table)
Verification:
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions
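The first verification item can be scripted against Prometheus' HTTP API, which reports a health value per scrape target under /api/v1/targets. The sketch below counts targets that are not "up"; the function name and the port 9090 in the example (Prometheus' default) are assumptions, since the plan does not state where Prometheus will listen.

```shell
# Reads the JSON body of /api/v1/targets from stdin and prints how many
# active targets report a health other than "up".
unhealthy_targets() {
    grep -o '"health": *"[a-z]*"' | grep -vc '"up"' || true
}

# Example (assumes Prometheus on its default port 9090):
#   curl -fsS http://172.16.3.30:9090/api/v1/targets | unhealthy_targets
```

A result of 0 means every configured target was scraped successfully on the last attempt.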
Day 4: Automated PostgreSQL Backups
Goal: Implement automated daily backups with retention policy
Tasks:
- Create backup script (server/backup-postgres.sh)
- Configure backup location (/home/guru/backups/guruconnect/)
- Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
- Create systemd timer for daily backups
- Add backup monitoring (success/failure metrics)
- Test backup and restore process
- Document restore procedure
Files to Create:
- server/backup-postgres.sh - Backup script
- server/restore-postgres.sh - Restore script
- server/guruconnect-backup.service - Systemd service
- server/guruconnect-backup.timer - Systemd timer
Backup Strategy:
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
- Stored in /home/guru/backups/guruconnect/
- Retention: 30 days daily, 4 weeks weekly, 6 months monthly
Verification:
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly
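The backup strategy above might look roughly like the following helpers. This is a sketch under assumptions, not the actual backup-postgres.sh: it assumes DATABASE_URL is set as in the server's .env file, the helper names are illustrative, and the retention helper assumes each tier (daily, weekly, monthly) lives in its own directory, which the plan does not specify.

```shell
# Build the backup path from a directory and a YYYY-MM-DD-HHMMSS stamp.
backup_path() {
    printf '%s/guruconnect-%s.sql.gz\n' "$1" "$2"
}

# Dump, compress, and verify. The dump goes to a temporary file first so a
# failed pg_dump never leaves something that looks like a valid backup.
run_backup() (
    dir=$1
    target=$(backup_path "$dir" "$(date +%Y-%m-%d-%H%M%S)")
    tmp="$target.partial"
    mkdir -p "$dir" || exit 1
    pg_dump --dbname="$DATABASE_URL" --file="$tmp" || { rm -f "$tmp"; exit 1; }
    gzip -c "$tmp" > "$target" && gzip -t "$target" ||
        { rm -f "$tmp" "$target"; exit 1; }
    rm -f "$tmp"
    echo "$target"
)

# Print backups in a directory beyond the newest $2. Timestamped names sort
# chronologically, so a reverse sort lists the newest first.
stale_backups() (
    dir=$1
    keep=$2
    ls -1 "$dir" | grep '^guruconnect-.*\.sql\.gz$' | sort -r |
        tail -n +"$((keep + 1))"
)

# Example: keep the 30 newest daily backups.
#   stale_backups "$dir" 30 | while read -r f; do rm -- "$dir/$f"; done
```

Pruning by count rather than by file age means a run of failed backups never causes the last good ones to be deleted.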
Day 5: Log Rotation & Health Checks
Goal: Implement log rotation and continuous health monitoring
Tasks:
- Configure logrotate for GuruConnect logs
- Implement health check improvements:
  - Database connectivity check
  - Disk space check
  - Memory usage check
  - Active session count check
- Create monitoring script (server/health-monitor.sh)
- Add health metrics to Prometheus
- Create systemd watchdog configuration
- Document operational procedures
Files to Create:
- server/guruconnect.logrotate - Logrotate configuration
- server/health-monitor.sh - Health monitoring script
- server/OPERATIONS.md - Operational runbook
Health Checks:
- /health endpoint (basic - already exists)
- /health/deep endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start
Verification:
- Logs rotate correctly
- Health checks report accurate status
- Alerts triggered on health failures
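The disk and memory thresholds above can be checked from a shell script along these lines. This is a sketch of what server/health-monitor.sh might do, not its actual contents; the function names, the HEALTH_URL override, and the choice of the root filesystem are assumptions.

```shell
# Percent of memory in use, from /proc/meminfo-style text on stdin.
mem_used_pct() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Percent of free disk space, from `df -P <path>` output on stdin.
disk_free_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Prints one line per failed check and returns non-zero if any check failed.
run_checks() (
    rc=0
    curl -fsS --max-time 5 "${HEALTH_URL:-http://172.16.3.30:3002/health}" \
        > /dev/null 2>&1 ||
        { echo "FAIL: health endpoint unreachable"; rc=1; }
    [ "$(df -P / | disk_free_pct)" -gt 10 ] ||
        { echo "FAIL: 10% or less disk space free"; rc=1; }
    [ "$(mem_used_pct < /proc/meminfo)" -lt 90 ] ||
        { echo "FAIL: memory usage at or above 90%"; rc=1; }
    exit "$rc"
)
```

Keeping the parsers separate from the commands that produce their input makes the thresholds easy to test with canned df and meminfo text.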
Infrastructure Files Structure
guru-connect/
├── server/
│   ├── guruconnect.service         # Systemd service file
│   ├── setup-systemd.sh            # Service installation script
│   ├── backup-postgres.sh          # PostgreSQL backup script
│   ├── restore-postgres.sh         # PostgreSQL restore script
│   ├── guruconnect-backup.service  # Backup systemd service
│   ├── guruconnect-backup.timer    # Backup systemd timer
│   ├── guruconnect.logrotate       # Logrotate configuration
│   ├── health-monitor.sh           # Health monitoring script
│   └── OPERATIONS.md               # Operational runbook
├── infrastructure/
│   ├── prometheus.yml              # Prometheus configuration
│   ├── grafana-dashboard.json      # Grafana dashboard export
│   └── setup-monitoring.sh         # Monitoring setup script
└── docs/
    └── MONITORING.md               # Monitoring documentation
Systemd Service Configuration
Service File: /etc/systemd/system/guruconnect.service
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start rate limit: give up after 3 start attempts within 5 minutes
# (on current systemd these settings belong in [Unit], not [Service])
StartLimitIntervalSec=5min
StartLimitBurst=3
[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server
# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env
# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server
# Restart policy
Restart=on-failure
RestartSec=10s
# Resource limits
LimitNOFILE=65536
LimitNPROC=4096
# Security
NoNewPrivileges=true
PrivateTmp=true
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect
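# Watchdog: the server must send sd_notify "WATCHDOG=1" at least every 30s,
# otherwise systemd kills and restarts it. Leave this out until the server
# implements the notification.
WatchdogSec=30s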
[Install]
WantedBy=multi-user.target
Environment File: /home/guru/guru-connect/server/.env
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect
# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters
# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002
# Monitoring
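# Expose on same port as main service (comment kept on its own line:
# systemd's EnvironmentFile does not strip inline comments from values)
PROMETHEUS_PORT=3002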
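Because the template values above are placeholders, it can help to check the real .env before starting the service. The sketch below flags secrets that are shorter than 32 characters or still carry the "your-" placeholder prefix; the function name check_secrets is illustrative.

```shell
# Prints one line per problem found in the given env file and returns
# non-zero if any secret is too short or still a placeholder.
check_secrets() (
    env_file=$1
    rc=0
    for key in JWT_SECRET AGENT_API_KEY; do
        # Take the last assignment of the key, as a later line would win.
        value=$(sed -n "s/^${key}=//p" "$env_file" | tail -n 1)
        if [ "${#value}" -lt 32 ]; then
            echo "$key: shorter than 32 characters"
            rc=1
        fi
        case $value in
            your-*) echo "$key: still a placeholder"; rc=1 ;;
        esac
    done
    exit "$rc"
)

# Example: check_secrets /home/guru/guru-connect/server/.env
```

A value produced by openssl rand -hex 32 is 64 hexadecimal characters and passes both checks. Restricting the file with chmod 600 keeps the secrets readable only by the service user.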
Prometheus Configuration
File: infrastructure/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'
scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'
# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
Testing Checklist
Systemd Service Tests
- Service starts correctly: sudo systemctl start guruconnect
- Service stops correctly: sudo systemctl stop guruconnect
- Service restarts correctly: sudo systemctl restart guruconnect
- Service auto-starts on boot: sudo systemctl enable guruconnect
- Service restarts on crash: sudo kill -9 <pid> (wait 10s)
- Logs visible in journalctl: sudo journalctl -u guruconnect -f
Prometheus Metrics Tests
- Metrics endpoint accessible: curl http://172.16.3.30:3002/metrics
- Metrics format valid (Prometheus client can scrape)
- Session metrics update on session creation/close
- Request metrics update on HTTP requests
- Error metrics update on failures
Grafana Dashboard Tests
- Prometheus data source connected
- All panels display data
- Data updates in real-time (<30s delay)
- Historical data visible (after 1 hour)
- Dashboard exports to JSON successfully
Backup Tests
- Manual backup creates file: bash backup-postgres.sh
- Backup file is compressed and named correctly
- Restore works: bash restore-postgres.sh <backup-file>
- Timer triggers daily at 2:00 AM
- Retention policy removes old backups
Health Check Tests
- Basic health endpoint: curl http://172.16.3.30:3002/health
- Deep health endpoint: curl http://172.16.3.30:3002/health/deep
- Health checks report database status
- Health checks report disk/memory usage
Risk Assessment
HIGH RISK
- Issue: Database credentials still broken
  Impact: Cannot test database-dependent features
  Mitigation: Create backup scripts that work even if database is down (conditional logic)
- Issue: Sudo access required for systemd
  Impact: Cannot install service without password
  Mitigation: Prepare scripts and documentation, request sudo access from system admin
MEDIUM RISK
- Issue: Prometheus/Grafana installation may require dependencies
  Impact: Additional setup time
  Mitigation: Use Docker containers if system install is complex
- Issue: Metrics may add performance overhead
  Impact: Latency increase
  Mitigation: Use efficient metrics library, test performance before/after
LOW RISK
- Issue: Log rotation misconfiguration
  Impact: Disk space issues
  Mitigation: Test logrotate configuration thoroughly, set conservative limits
Success Criteria
Week 2 is complete when:
- Systemd Service
  - Service starts/stops correctly
  - Auto-restarts on failure
  - Starts on boot
  - Logs to journalctl
- Prometheus Metrics
  - /metrics endpoint working
  - Key metrics implemented:
    - Request counts and latency
    - Session counts and duration
    - Active connections
    - Error rates
  - Prometheus can scrape successfully
- Grafana Dashboard
  - Prometheus data source configured
  - Dashboard with 8+ panels
  - Real-time data display
  - Dashboard exported to JSON
- Automated Backups
  - Backup script functional
  - Daily backups via systemd timer
  - Retention policy enforced
  - Restore procedure documented
- Health Monitoring
  - Log rotation configured
  - Health checks implemented
  - Health metrics exposed
  - Operational runbook created
Exit Criteria: All 5 areas have passing tests, production infrastructure is stable and monitored.
Next Steps (Week 3)
After Week 2 infrastructure completion:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development
Document Status: READY
Owner: Development Team
Started: 2026-01-18
Target: 2026-01-25