claudetools/projects/msp-tools/guru-connect/PHASE1_WEEK2_INFRASTRUCTURE.md
Mike Swanson 8521c95755 Phase 1 Week 2: Infrastructure & Monitoring
Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.
2026-01-17 20:24:32 -07:00


# Phase 1, Week 2 - Infrastructure & Monitoring
**Date Started:** 2026-01-18
**Target Completion:** 2026-01-25
**Status:** Starting
**Priority:** HIGH (Production Readiness)
---
## Executive Summary
With the Week 1 security fixes complete and deployed, Week 2 focuses on hardening the production infrastructure. The server is currently started by hand (`nohup start-secure.sh &`), has no monitoring, and has no automated recovery. This week's work puts it under systemd supervision and adds metrics, dashboards, backups, and log rotation.
**Goals:**
1. Systemd service with auto-restart on failure
2. Prometheus metrics for monitoring
3. Grafana dashboards for visualization
4. Automated PostgreSQL backups
5. Log rotation and management
**Dependencies:**
- SSH access to 172.16.3.30 as `guru` user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken; the backup automation can still be set up without them)
---
## Week 2 Task Breakdown
### Day 1: Systemd Service Configuration
**Goal:** Convert manual server startup to systemd-managed service
**Tasks:**
1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
2. Configure service dependencies (network, postgresql)
3. Set restart policy (on-failure, with backoff)
4. Configure environment variables securely
5. Enable service to start on boot
6. Test service start/stop/restart
7. Verify auto-restart on crash
**Files to Create:**
- `server/guruconnect.service` - Systemd unit file
- `server/setup-systemd.sh` - Installation script
**Verification:**
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs are written to the systemd journal (viewable with `journalctl -u guruconnect`)
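
The `kill -9` check can be scripted rather than timed by hand. The following is a minimal sketch, assuming the unit is named `guruconnect` as in this plan; `wait_until` is a hypothetical helper and not part of the repository.

```bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage sketch (run manually on the server, requires sudo):
#   old=$(systemctl show -p MainPID --value guruconnect)
#   sudo kill -9 "$old"
#   sleep 1
#   wait_until 30 systemctl is-active --quiet guruconnect
#   new=$(systemctl show -p MainPID --value guruconnect)
#   [ "$new" != "$old" ] && echo "restarted with PID $new"
```

Comparing the main PID before and after guards against the case where the check runs before systemd has noticed the crash.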
---
### Day 2: Prometheus Metrics
**Goal:** Expose metrics for monitoring server health and performance
**Tasks:**
1. Add `prometheus-client` dependency to Cargo.toml
2. Create metrics module (`server/src/metrics/mod.rs`)
3. Implement metric types:
   - Counter: requests_total, sessions_total, errors_total
   - Gauge: active_sessions, active_connections
   - Histogram: request_duration_seconds, session_duration_seconds
4. Add `/metrics` endpoint
5. Integrate metrics into existing code:
   - Session creation/close
   - Request handling
   - WebSocket connections
   - Database operations
6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)
**Files to Create/Modify:**
- `server/Cargo.toml` - Add dependencies
- `server/src/metrics/mod.rs` - Metrics module
- `server/src/main.rs` - Add /metrics endpoint
- `server/src/relay/mod.rs` - Add session metrics
- `server/src/api/mod.rs` - Add request metrics
**Metrics to Track:**
- `guruconnect_requests_total{method, path, status}` - HTTP requests
- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
- `guruconnect_active_sessions` - Current active sessions
- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
- `guruconnect_request_duration_seconds{method, path}` - Request latency
- `guruconnect_session_duration_seconds` - Session lifetime
- `guruconnect_errors_total{type}` - Error counts
- `guruconnect_db_operations_total{operation, status}` - Database operations
**Verification:**
- Metrics endpoint returns Prometheus format
- Metrics update in real-time
- No performance degradation
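
To check that a metric updates in real time, it helps to read a single value out of the `/metrics` response. The sketch below is one way to do that in shell; `metric_value` is a hypothetical helper, and it assumes samples carry no trailing timestamp (the value is taken from the last field on the line).

```bash
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value of
# the first sample whose name (before any {labels}) equals NAME.
# Returns 1 if no such sample is present.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)       # drop the {label="..."} part
            if (metric == name) { print $NF; found = 1; exit }
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage sketch:
#   curl -s http://172.16.3.30:3002/metrics | metric_value guruconnect_active_sessions
```

Running the usage line before and after opening a session should show the gauge change.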
---
### Day 3: Grafana Dashboard
**Goal:** Create visual dashboards for monitoring GuruConnect
**Tasks:**
1. Install Prometheus on 172.16.3.30
2. Configure Prometheus to scrape GuruConnect metrics
3. Install Grafana on 172.16.3.30
4. Configure Grafana data source (Prometheus)
5. Create dashboards:
   - Overview: Active sessions, requests/sec, errors
   - Sessions: Session lifecycle, duration distribution
   - Performance: Request latency, database query time
   - Errors: Error rates by type
6. Set up alerting rules (if time permits)
**Files to Create:**
- `infrastructure/prometheus.yml` - Prometheus configuration
- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
- `infrastructure/setup-monitoring.sh` - Installation script
**Grafana Dashboard Panels:**
1. Active Sessions (Gauge)
2. Requests per Second (Graph)
3. Error Rate (Graph)
4. Session Creation Rate (Graph)
5. Request Latency p50/p95/p99 (Graph)
6. Active Connections by Type (Graph)
7. Database Operations (Graph)
8. Top Errors (Table)
**Verification:**
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions
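
For the alerting step, a rule file can be generated by the setup script. The sketch below writes a minimal one; `write_alert_rules` is a hypothetical function, and the alert names, thresholds, and durations are illustrative assumptions. Only the `guruconnect` job label and the `guruconnect_errors_total` metric name come from this plan.

```bash
# write_alert_rules FILE
# Writes a minimal Prometheus rule file to FILE.
# Alert names, thresholds and "for" durations are assumptions.
write_alert_rules() {
    cat > "$1" <<'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: GuruConnectDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GuruConnect metrics endpoint is unreachable"
      - alert: GuruConnectHighErrorRate
        expr: sum(rate(guruconnect_errors_total[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GuruConnect is reporting more than 1 error per second"
EOF
}

# Usage sketch:
#   write_alert_rules infrastructure/alerts.yml
#   promtool check rules infrastructure/alerts.yml   # if promtool is installed
```

Stopping the service for longer than the `for` duration is a simple test condition for the first rule.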
---
### Day 4: Automated PostgreSQL Backups
**Goal:** Implement automated daily backups with retention policy
**Tasks:**
1. Create backup script (`server/backup-postgres.sh`)
2. Configure backup location (`/home/guru/backups/guruconnect/`)
3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
4. Create systemd timer for daily backups
5. Add backup monitoring (success/failure metrics)
6. Test backup and restore process
7. Document restore procedure
**Files to Create:**
- `server/backup-postgres.sh` - Backup script
- `server/restore-postgres.sh` - Restore script
- `server/guruconnect-backup.service` - Systemd service
- `server/guruconnect-backup.timer` - Systemd timer
**Backup Strategy:**
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
- Stored in `/home/guru/backups/guruconnect/`
- Retention: 30 daily, 4 weekly, and 6 monthly backups
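
The 2:00 AM schedule maps to a systemd timer. Below is a sketch of how an installation script might generate it; `write_backup_timer` is a hypothetical function, and `Persistent=true` is an assumption about the desired behavior after downtime rather than something this plan specifies.

```bash
# write_backup_timer FILE
# Writes a systemd timer unit that triggers guruconnect-backup.service
# every day at 02:00 local time.
write_backup_timer() {
    cat > "$1" <<'EOF'
[Unit]
Description=Daily GuruConnect PostgreSQL backup

[Timer]
# Every day at 02:00 local time
OnCalendar=*-*-* 02:00:00
# Assumption: catch up on a missed run if the server was off at 02:00
Persistent=true
Unit=guruconnect-backup.service

[Install]
WantedBy=timers.target
EOF
}

# Usage sketch:
#   write_backup_timer server/guruconnect-backup.timer
#   systemd-analyze calendar '*-*-* 02:00:00'   # shows the next trigger time
```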
**Verification:**
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly
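
The naming and cleanup rules can be sketched as two small shell functions. Both names are hypothetical. The pruning function only implements "keep the N newest files in one directory"; applying it to the 30/4/6 policy assumes the daily, weekly, and monthly backups are kept in separate directories, which this plan does not specify.

```bash
# backup_filename
# Prints a name in the plan's format: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
backup_filename() {
    printf 'guruconnect-%s.sql.gz\n' "$(date +%Y-%m-%d-%H%M%S)"
}

# prune_backups DIR KEEP
# Deletes all but the KEEP newest guruconnect-*.sql.gz files in DIR.
# Relies on the timestamped names sorting chronologically.
prune_backups() {
    local dir=$1 keep=$2 f
    for f in "$dir"/guruconnect-*.sql.gz; do
        [ -e "$f" ] || continue            # glob matched nothing
        printf '%s\n' "$f"
    done | sort -r | tail -n "+$((keep + 1))" |
        while IFS= read -r f; do
            rm -f -- "$f"
        done
}

# Usage sketch (the daily/ subdirectory is an assumption):
#   prune_backups /home/guru/backups/guruconnect/daily 30
```

Files that do not match the backup name pattern are left untouched.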
---
### Day 5: Log Rotation & Health Checks
**Goal:** Implement log rotation and continuous health monitoring
**Tasks:**
1. Configure logrotate for GuruConnect logs
2. Implement health check improvements:
   - Database connectivity check
   - Disk space check
   - Memory usage check
   - Active session count check
3. Create monitoring script (`server/health-monitor.sh`)
4. Add health metrics to Prometheus
5. Create systemd watchdog configuration
6. Document operational procedures
**Files to Create:**
- `server/guruconnect.logrotate` - Logrotate configuration
- `server/health-monitor.sh` - Health monitoring script
- `server/OPERATIONS.md` - Operational runbook
**Health Checks:**
- `/health` endpoint (basic - already exists)
- `/health/deep` endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start
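
The disk and memory thresholds above could be evaluated in `health-monitor.sh` roughly as follows. This is a sketch with hypothetical function names, assuming a Linux host with `/proc/meminfo` and POSIX `df -P` output; it is not the script's actual contents.

```bash
# disk_free_pct PATH
# Prints the percentage of free space on the filesystem holding PATH.
disk_free_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# mem_used_pct
# Prints used memory as a whole percentage, based on MemTotal and
# MemAvailable from /proc/meminfo (Linux only).
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { printf "%d\n", (total - avail) * 100 / total }' /proc/meminfo
}

# check_threshold LABEL VALUE OP LIMIT
# Prints "LABEL: OK" or "LABEL: FAIL" and returns 0 or 1 accordingly.
# OP is a test(1) integer operator such as -gt or -lt.
check_threshold() {
    if [ "$2" "$3" "$4" ]; then
        echo "$1: OK"
    else
        echo "$1: FAIL"
        return 1
    fi
}

# Usage sketch:
#   check_threshold "disk"   "$(disk_free_pct /)" -gt 10
#   check_threshold "memory" "$(mem_used_pct)"    -lt 90
```

Keeping the comparison in one function means the session-count threshold can reuse it once the active session count is available.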
**Verification:**
- Logs rotate correctly
- Health checks report accurate status
- Alerts triggered on health failures
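
A logrotate configuration for the first verification item might be generated as below. `write_logrotate_conf` is a hypothetical function, and the log path, rotation count, and `copytruncate` choice are all assumptions: the service unit in this plan logs to the journal, so this only applies to any files the server or its scripts write separately.

```bash
# write_logrotate_conf FILE
# Writes a logrotate configuration to FILE.
# /var/log/guruconnect/*.log and "rotate 14" are assumed values.
write_logrotate_conf() {
    cat > "$1" <<'EOF'
/var/log/guruconnect/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Usage sketch:
#   write_logrotate_conf server/guruconnect.logrotate
#   logrotate -d server/guruconnect.logrotate   # dry run, changes nothing
```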
---
## Infrastructure Files Structure
```
guru-connect/
├── server/
│   ├── guruconnect.service          # Systemd service file
│   ├── setup-systemd.sh             # Service installation script
│   ├── backup-postgres.sh           # PostgreSQL backup script
│   ├── restore-postgres.sh          # PostgreSQL restore script
│   ├── guruconnect-backup.service   # Backup systemd service
│   ├── guruconnect-backup.timer     # Backup systemd timer
│   ├── guruconnect.logrotate        # Logrotate configuration
│   ├── health-monitor.sh            # Health monitoring script
│   └── OPERATIONS.md                # Operational runbook
├── infrastructure/
│   ├── prometheus.yml               # Prometheus configuration
│   ├── grafana-dashboard.json       # Grafana dashboard export
│   └── setup-monitoring.sh          # Monitoring setup script
└── docs/
    └── MONITORING.md                # Monitoring documentation
```
---
## Systemd Service Configuration
**Service File: `/etc/systemd/system/guruconnect.service`**
```ini
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start rate limiting: allow at most 3 starts within 5 minutes.
# On systemd 230 and later these settings belong in [Unit].
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

# Watchdog
# Requires the server to send WATCHDOG=1 via sd_notify at least every 30s.
# Without those keepalives systemd kills the service, so comment this out
# until the server sends them.
WatchdogSec=30s

[Install]
WantedBy=multi-user.target
```
**Environment File: `/home/guru/guru-connect/server/.env`**
```bash
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect
# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters
# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002
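# Monitoring
# Metrics are exposed on the same port as the main service.
# (Comments must be on their own line: systemd's EnvironmentFile parser
# treats a trailing "# ..." as part of the value.)
PROMETHEUS_PORT=3002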
```
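
Because this file holds the database password, JWT secret, and agent API key, its permissions are worth checking as part of installation. The sketch below shows one possible check; `check_env_perms` is a hypothetical helper, the accepted modes (600 or 400) are an assumption about what "securely" means here, and `stat -c` is the GNU coreutils form found on Linux.

```bash
# check_env_perms FILE
# Succeeds only if FILE exists and is readable by its owner alone
# (mode 600 or 400). Uses GNU stat, as found on Linux.
check_env_perms() {
    local mode
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
    case "$mode" in
        600|400) return 0 ;;
        *) echo "insecure mode $mode on $1" >&2; return 1 ;;
    esac
}

# Usage sketch:
#   chmod 600 /home/guru/guru-connect/server/.env
#   check_env_perms /home/guru/guru-connect/server/.env || exit 1
```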
---
## Prometheus Configuration
**File: `infrastructure/prometheus.yml`**
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
---
## Testing Checklist
### Systemd Service Tests
- [ ] Service starts correctly: `sudo systemctl start guruconnect`
- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
- [ ] Service auto-starts on boot: `sudo systemctl enable guruconnect`
- [ ] Service restarts on crash: `sudo kill -9 <pid>` (wait 10s)
- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`
### Prometheus Metrics Tests
- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
- [ ] Metrics format valid (the Prometheus server scrapes it without errors)
- [ ] Session metrics update on session creation/close
- [ ] Request metrics update on HTTP requests
- [ ] Error metrics update on failures
### Grafana Dashboard Tests
- [ ] Prometheus data source connected
- [ ] All panels display data
- [ ] Data updates in real-time (<30s delay)
- [ ] Historical data visible (after 1 hour)
- [ ] Dashboard exports to JSON successfully
### Backup Tests
- [ ] Manual backup creates file: `bash backup-postgres.sh`
- [ ] Backup file is compressed and named correctly
- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
- [ ] Timer triggers daily at 2:00 AM
- [ ] Retention policy removes old backups
### Health Check Tests
- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
- [ ] Health checks report database status
- [ ] Health checks report disk/memory usage
---
## Risk Assessment
### HIGH RISK
**Issue:** Database credentials still broken
**Impact:** Cannot test database-dependent features
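**Mitigation:** Write the backup scripts to check database connectivity first and exit with a clear error when the database is unreachable, so the automation can be installed and tested before the credentials are fixed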
**Issue:** Sudo access required for systemd
**Impact:** Cannot install service without password
**Mitigation:** Prepare scripts and documentation, request sudo access from system admin
### MEDIUM RISK
**Issue:** Prometheus/Grafana installation may require dependencies
**Impact:** Additional setup time
**Mitigation:** Use Docker containers if system install is complex
**Issue:** Metrics may add performance overhead
**Impact:** Latency increase
**Mitigation:** Use efficient metrics library, test performance before/after
### LOW RISK
**Issue:** Log rotation misconfiguration
**Impact:** Disk space issues
**Mitigation:** Test logrotate configuration thoroughly, set conservative limits
---
## Success Criteria
Week 2 is complete when:
1. **Systemd Service**
   - Service starts/stops correctly
   - Auto-restarts on failure
   - Starts on boot
   - Logs to journalctl
2. **Prometheus Metrics**
   - /metrics endpoint working
   - Key metrics implemented:
     - Request counts and latency
     - Session counts and duration
     - Active connections
     - Error rates
   - Prometheus can scrape successfully
3. **Grafana Dashboard**
   - Prometheus data source configured
   - Dashboard with 8+ panels
   - Real-time data display
   - Dashboard exported to JSON
4. **Automated Backups**
   - Backup script functional
   - Daily backups via systemd timer
   - Retention policy enforced
   - Restore procedure documented
5. **Health Monitoring**
   - Log rotation configured
   - Health checks implemented
   - Health metrics exposed
   - Operational runbook created
**Exit Criteria:** All 5 areas have passing tests, and the production infrastructure is stable and monitored.
---
## Next Steps (Week 3 and Beyond)
After Week 2 infrastructure completion:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development
---
**Document Status:** READY
**Owner:** Development Team
**Started:** 2026-01-18
**Target:** 2026-01-25