Phase 1 Week 2: Infrastructure & Monitoring

Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.
457
projects/msp-tools/guru-connect/PHASE1_WEEK2_INFRASTRUCTURE.md
Normal file
@@ -0,0 +1,457 @@
# Phase 1, Week 2 - Infrastructure & Monitoring

**Date Started:** 2026-01-18
**Target Completion:** 2026-01-25
**Status:** Starting
**Priority:** HIGH (Production Readiness)

---

## Executive Summary

With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server currently runs manually (`nohup start-secure.sh &`), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

**Goals:**
1. Systemd service with auto-restart on failure
2. Prometheus metrics for monitoring
3. Grafana dashboards for visualization
4. Automated PostgreSQL backups
5. Log rotation and management

**Dependencies:**
- SSH access to 172.16.3.30 as `guru` user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken, but backup automation can still be set up in the meantime)

---

## Week 2 Task Breakdown

### Day 1: Systemd Service Configuration

**Goal:** Convert manual server startup to systemd-managed service

**Tasks:**
1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
2. Configure service dependencies (network, postgresql)
3. Set restart policy (on-failure, with backoff)
4. Configure environment variables securely
5. Enable service to start on boot
6. Test service start/stop/restart
7. Verify auto-restart on crash

**Files to Create:**
- `server/guruconnect.service` - Systemd unit file
- `server/setup-systemd.sh` - Installation script

**Verification:**
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs go to journalctl
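
The Day 1 installation steps could be wired together in `setup-systemd.sh` roughly as follows. This is a minimal sketch, not the project's actual script: the `run` and `install_service` helper names and the `DRY_RUN` switch are assumptions introduced here so the steps can be previewed before sudo access is available.

```bash
#!/bin/sh
# Hypothetical helpers for setup-systemd.sh (names are illustrative).

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Install the unit file, reload systemd, then enable and start the service.
install_service() {  # $1: path to guruconnect.service
    run sudo install -m 644 "$1" /etc/systemd/system/guruconnect.service &&
    run sudo systemctl daemon-reload &&
    run sudo systemctl enable --now guruconnect
}
```

Calling `DRY_RUN=1 install_service server/guruconnect.service` prints the three commands without executing them, which may be useful while sudo access is still pending.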

---

### Day 2: Prometheus Metrics

**Goal:** Expose metrics for monitoring server health and performance

**Tasks:**
1. Add `prometheus-client` dependency to Cargo.toml
2. Create metrics module (`server/src/metrics/mod.rs`)
3. Implement metric types:
   - Counter: requests_total, sessions_total, errors_total
   - Gauge: active_sessions, active_connections
   - Histogram: request_duration_seconds, session_duration_seconds
4. Add `/metrics` endpoint
5. Integrate metrics into existing code:
   - Session creation/close
   - Request handling
   - WebSocket connections
   - Database operations
6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)

**Files to Create/Modify:**
- `server/Cargo.toml` - Add dependencies
- `server/src/metrics/mod.rs` - Metrics module
- `server/src/main.rs` - Add /metrics endpoint
- `server/src/relay/mod.rs` - Add session metrics
- `server/src/api/mod.rs` - Add request metrics

**Metrics to Track:**
- `guruconnect_requests_total{method, path, status}` - HTTP requests
- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
- `guruconnect_active_sessions` - Current active sessions
- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
- `guruconnect_request_duration_seconds{method, path}` - Request latency
- `guruconnect_session_duration_seconds` - Session lifetime
- `guruconnect_errors_total{type}` - Error counts
- `guruconnect_db_operations_total{operation, status}` - Database operations

**Verification:**
- Metrics endpoint returns Prometheus format
- Metrics update in real-time
- No performance degradation
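
Once the endpoint responds, one way to spot-check a single value from the shell is to filter the Prometheus text exposition format with `awk`. The sketch below assumes a hypothetical helper named `metric_value`; it only handles unlabelled series such as `guruconnect_active_sessions` and is not part of the planned codebase.

```bash
#!/bin/sh
# Print the value of an unlabelled metric read from stdin in Prometheus
# text exposition format. "# HELP" and "# TYPE" lines never match because
# their first field is "#". Exits non-zero when the metric is absent.
metric_value() {  # $1: exact metric name
    awk -v name="$1" '
        $1 == name { print $2; found = 1 }
        END { exit !found }
    '
}

# Example against the live endpoint (not run here):
#   curl -s http://172.16.3.30:3002/metrics | metric_value guruconnect_active_sessions
```

Labelled series (for example `guruconnect_sessions_total{status="created"}`) are deliberately left out of this sketch; they would need a pattern match on the label set instead of an exact name comparison.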

---

### Day 3: Grafana Dashboard

**Goal:** Create visual dashboards for monitoring GuruConnect

**Tasks:**
1. Install Prometheus on 172.16.3.30
2. Configure Prometheus to scrape GuruConnect metrics
3. Install Grafana on 172.16.3.30
4. Configure Grafana data source (Prometheus)
5. Create dashboards:
   - Overview: Active sessions, requests/sec, errors
   - Sessions: Session lifecycle, duration distribution
   - Performance: Request latency, database query time
   - Errors: Error rates by type
6. Set up alerting rules (if time permits)

**Files to Create:**
- `infrastructure/prometheus.yml` - Prometheus configuration
- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
- `infrastructure/setup-monitoring.sh` - Installation script

**Grafana Dashboard Panels:**
1. Active Sessions (Gauge)
2. Requests per Second (Graph)
3. Error Rate (Graph)
4. Session Creation Rate (Graph)
5. Request Latency p50/p95/p99 (Graph)
6. Active Connections by Type (Graph)
7. Database Operations (Graph)
8. Top Errors (Table)
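
Several of the panels above map onto fairly standard PromQL expressions over the metrics planned for Day 2. The sketch below shows plausible queries and a way to try them against the Prometheus HTTP API before building the dashboard. The helper names, the 5-minute rate window, and the `http://localhost:9090` default are assumptions; the `_bucket` suffix is the series name Prometheus histograms conventionally expose.

```bash
#!/bin/sh
# Hypothetical mapping from dashboard panel to a PromQL expression.
panel_query() {  # $1: panel key
    case "$1" in
        active_sessions)     echo 'guruconnect_active_sessions' ;;
        requests_per_second) echo 'sum(rate(guruconnect_requests_total[5m]))' ;;
        error_rate)          echo 'sum by (type) (rate(guruconnect_errors_total[5m]))' ;;
        latency_p95)         echo 'histogram_quantile(0.95, sum by (le) (rate(guruconnect_request_duration_seconds_bucket[5m])))' ;;
        *)                   return 1 ;;
    esac
}

# Evaluate a panel query through the Prometheus instant-query endpoint.
prom_query() {  # $1: panel key; PROM_URL is an assumed override
    q=$(panel_query "$1") || return 1
    curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
        --data-urlencode "query=$q"
}
```

Running `prom_query latency_p95` returns the JSON result for the p95 latency panel; swapping `0.95` for `0.5` or `0.99` would give the other two percentiles on that panel.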

**Verification:**
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions

---

### Day 4: Automated PostgreSQL Backups

**Goal:** Implement automated daily backups with retention policy

**Tasks:**
1. Create backup script (`server/backup-postgres.sh`)
2. Configure backup location (`/home/guru/backups/guruconnect/`)
3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
4. Create systemd timer for daily backups
5. Add backup monitoring (success/failure metrics)
6. Test backup and restore process
7. Document restore procedure

**Files to Create:**
- `server/backup-postgres.sh` - Backup script
- `server/restore-postgres.sh` - Restore script
- `server/guruconnect-backup.service` - Systemd service
- `server/guruconnect-backup.timer` - Systemd timer
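
The service and timer pair could look something like the units generated below. This is a sketch under assumptions: the `write_backup_units` helper is hypothetical, and the `ExecStart` path assumes the backup script is deployed next to the server's `.env` file. `OnCalendar=*-*-* 02:00:00` matches the planned 2:00 AM schedule, and `Persistent=true` asks systemd to run a missed backup after downtime.

```bash
#!/bin/sh
# Write a oneshot backup service and its daily timer into directory $1.
write_backup_units() {
    cat > "$1/guruconnect-backup.service" <<'EOF'
[Unit]
Description=GuruConnect PostgreSQL backup

[Service]
Type=oneshot
User=guru
EnvironmentFile=/home/guru/guru-connect/server/.env
# Assumed install location of the backup script
ExecStart=/home/guru/guru-connect/server/backup-postgres.sh
EOF
    cat > "$1/guruconnect-backup.timer" <<'EOF'
[Unit]
Description=Daily GuruConnect PostgreSQL backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After copying both files to `/etc/systemd/system/`, `sudo systemctl enable --now guruconnect-backup.timer` would activate the schedule, and `systemctl list-timers` shows the next planned run.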

**Backup Strategy:**
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
- Stored in `/home/guru/backups/guruconnect/`
- Retention: 30 days daily, 4 weeks weekly, 6 months monthly
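
A minimal sketch of the core of `backup-postgres.sh` under this strategy might look as follows. The function names are assumptions, and `DATABASE_URL` is taken to be the connection string from the server's `.env` file. The dump is written first and compressed second so that a `pg_dump` failure is not hidden behind gzip's exit status, at the cost of temporarily holding the uncompressed file.

```bash
#!/bin/sh
BACKUP_DIR=${BACKUP_DIR:-/home/guru/backups/guruconnect}

# Build the timestamped file name used by the backup strategy.
backup_filename() {  # optional $1: timestamp as YYYY-MM-DD-HHMMSS
    echo "guruconnect-${1:-$(date +%Y-%m-%d-%H%M%S)}.sql.gz"
}

# Dump the database named by DATABASE_URL and compress the result.
# Prints the path of the finished backup on success.
run_backup() {
    out="$BACKUP_DIR/$(backup_filename)"
    mkdir -p "$BACKUP_DIR" || return 1
    if pg_dump --dbname="$DATABASE_URL" --file="$out.tmp" &&
       gzip -c "$out.tmp" > "$out"; then
        rm -f "$out.tmp"
        echo "$out"
    else
        # Never leave a partial file that could be mistaken for a backup.
        rm -f "$out.tmp" "$out"
        return 1
    fi
}
```

A restore from such a file would then be along the lines of `gunzip -c <backup-file> | psql "$DATABASE_URL"`, which is the part `restore-postgres.sh` is expected to wrap with safety checks.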

**Verification:**
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly
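
The retention policy leaves room for interpretation. One reading, similar to the "keep the newest backup of each of the last N days, weeks, and months" rule used by common backup tools, is sketched below. The `prune_candidates` name is hypothetical, the function relies on GNU `date -d` for ISO week numbers, and it only prints deletion candidates rather than removing anything.

```bash
#!/bin/sh
# Reads backup names (newest first) on stdin and prints those that fall
# outside the policy. Arguments: daily, weekly and monthly counts to keep.
# The body runs in a subshell so its variables do not leak to the caller.
prune_candidates() (
    keep_daily=${1:-30} keep_weekly=${2:-4} keep_monthly=${3:-6}
    n_day=0 n_week=0 n_month=0
    last_day= last_week= last_month=
    while IFS= read -r name; do
        stamp=${name#guruconnect-}   # 2026-01-18-020000.sql.gz
        day=${stamp%-*}              # 2026-01-18
        month=${day%-*}              # 2026-01
        # ISO year and week; unparseable names are skipped, never deleted.
        week=$(date -d "$day" +%G-W%V) || continue
        keep=0
        # Input is sorted, so a period is new when it differs from the last one.
        if [ "$day" != "$last_day" ] && [ "$n_day" -lt "$keep_daily" ]; then
            n_day=$((n_day + 1)); keep=1
        fi
        if [ "$week" != "$last_week" ] && [ "$n_week" -lt "$keep_weekly" ]; then
            n_week=$((n_week + 1)); keep=1
        fi
        if [ "$month" != "$last_month" ] && [ "$n_month" -lt "$keep_monthly" ]; then
            n_month=$((n_month + 1)); keep=1
        fi
        last_day=$day last_week=$week last_month=$month
        [ "$keep" -eq 1 ] || printf '%s\n' "$name"
    done
)
```

Because the timestamp in the file name is ordered from year down to second, `ls /home/guru/backups/guruconnect | sort -r | prune_candidates 30 4 6` would list the files this reading of the policy considers expired. Reviewing that list before piping it into `rm` seems prudent while the policy is still being tested.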

---

### Day 5: Log Rotation & Health Checks

**Goal:** Implement log rotation and continuous health monitoring

**Tasks:**
1. Configure logrotate for GuruConnect logs
2. Implement health check improvements:
   - Database connectivity check
   - Disk space check
   - Memory usage check
   - Active session count check
3. Create monitoring script (`server/health-monitor.sh`)
4. Add health metrics to Prometheus
5. Create systemd watchdog configuration
6. Document operational procedures

**Files to Create:**
- `server/guruconnect.logrotate` - Logrotate configuration
- `server/health-monitor.sh` - Health monitoring script
- `server/OPERATIONS.md` - Operational runbook

**Health Checks:**
- `/health` endpoint (basic - already exists)
- `/health/deep` endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start
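
The same thresholds can be checked from `health-monitor.sh` without going through the server. The following is a rough sketch with hypothetical helper names; `mem_used_pct` reads `/proc/meminfo` and is therefore Linux-only.

```bash
#!/bin/sh
# Percentage of free space on the filesystem holding $1 (POSIX df output).
disk_free_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Percentage of memory in use, based on MemTotal and MemAvailable.
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' /proc/meminfo
}

# Print "NAME: OK" or "NAME: FAIL"; return non-zero on failure.
check() {  # check NAME VALUE OP LIMIT, where OP is -gt or -lt
    if [ "$2" "$3" "$4" ]; then
        echo "$1: OK"
    else
        echo "$1: FAIL"
        return 1
    fi
}

# Example use with the thresholds listed above (not run here):
#   check disk   "$(disk_free_pct /)" -gt 10
#   check memory "$(mem_used_pct)"    -lt 90
```

Each failing `check` returns non-zero, so the monitoring script can collect the results and decide whether to send an alert.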

**Verification:**
- Logs rotate correctly
- Health checks report accurate status
- Alerts triggered on health failures

---

## Infrastructure Files Structure

```
guru-connect/
├── server/
│   ├── guruconnect.service        # Systemd service file
│   ├── setup-systemd.sh           # Service installation script
│   ├── backup-postgres.sh         # PostgreSQL backup script
│   ├── restore-postgres.sh        # PostgreSQL restore script
│   ├── guruconnect-backup.service # Backup systemd service
│   ├── guruconnect-backup.timer   # Backup systemd timer
│   ├── guruconnect.logrotate      # Logrotate configuration
│   ├── health-monitor.sh          # Health monitoring script
│   └── OPERATIONS.md              # Operational runbook
├── infrastructure/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── grafana-dashboard.json     # Grafana dashboard export
│   └── setup-monitoring.sh        # Monitoring setup script
└── docs/
    └── MONITORING.md              # Monitoring documentation
```

---

## Systemd Service Configuration

**Service File: `/etc/systemd/system/guruconnect.service`**

```ini
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start rate limiting: give up after 3 failed starts within 5 minutes
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

# Watchdog: the server must send sd_notify "WATCHDOG=1" keep-alives at least
# every 30s, otherwise systemd treats it as hung and kills it
WatchdogSec=30s

[Install]
WantedBy=multi-user.target
```

**Environment File: `/home/guru/guru-connect/server/.env`**

```bash
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

# Monitoring
# Expose on same port as main service. The comment sits on its own line
# because systemd's EnvironmentFile= only ignores whole-line comments.
PROMETHEUS_PORT=3002
```
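
Since the placeholders call for secrets of at least 32 characters, a small pre-flight check could catch a weak or missing value before the service starts. The sketch below uses a hypothetical `validate_env` helper and only checks length, not randomness.

```bash
#!/bin/sh
# Verify that JWT_SECRET and AGENT_API_KEY in the given .env file are at
# least 32 characters long. Returns non-zero if any of them is too short
# or missing.
validate_env() {  # $1: path to the .env file
    status=0
    for key in JWT_SECRET AGENT_API_KEY; do
        # Take the last assignment, mirroring "last one wins" semantics.
        value=$(sed -n "s/^$key=//p" "$1" | tail -n 1)
        if [ "${#value}" -lt 32 ]; then
            echo "$key: shorter than 32 characters" >&2
            status=1
        fi
    done
    return "$status"
}
```

Because the file holds credentials, restricting it with `chmod 600 /home/guru/guru-connect/server/.env` alongside this check would fit the Day 1 goal of configuring environment variables securely.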

---

## Prometheus Configuration

**File: `infrastructure/prometheus.yml`**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
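
The configuration references `alerts.yml` without showing it. A possible starting point is sketched below as a generator function, so it can live in `setup-monitoring.sh`. The `write_alert_rules` name, the alert names, and every threshold and duration are assumptions to be tuned against real traffic; the `up` series and the `job="guruconnect"` label follow from the scrape config above.

```bash
#!/bin/sh
# Write a small set of example alerting rules to the file named by $1.
write_alert_rules() {
    cat > "$1" <<'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: GuruConnectDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GuruConnect server has not answered a scrape for 1 minute"
      - alert: GuruConnectHighErrorRate
        expr: sum(rate(guruconnect_errors_total[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GuruConnect is logging more than 1 error per second"
EOF
}
```

If `promtool` is installed alongside Prometheus, `promtool check rules alerts.yml` can validate the generated file before Prometheus is reloaded.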

---

## Testing Checklist

### Systemd Service Tests
- [ ] Service starts correctly: `sudo systemctl start guruconnect`
- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
- [ ] Service auto-starts on boot: `sudo systemctl enable guruconnect`
- [ ] Service restarts on crash: `sudo kill -9 <pid>` (wait 10s)
- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`
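
The crash test can be scripted so it gives a pass/fail answer instead of relying on eyeballing the journal. This is a sketch with hypothetical function names; the 15-second wait is an assumption based on `RestartSec=10s` plus some slack.

```bash
#!/bin/sh
# True when the service came back under a new, non-zero main PID.
restarted() {  # $1: PID before the kill, $2: PID afterwards
    [ -n "$2" ] && [ "$2" != 0 ] && [ "$2" != "$1" ]
}

# Kill the running server and check that systemd brings it back.
crash_test() {
    old=$(systemctl show -p MainPID --value guruconnect) || return 1
    sudo kill -9 "$old" || return 1
    sleep 15   # RestartSec=10s plus slack (assumed)
    new=$(systemctl show -p MainPID --value guruconnect)
    restarted "$old" "$new"
}
```

`systemctl show` reports `MainPID=0` while a unit is not running, which is why a zero PID counts as a failed restart here.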

### Prometheus Metrics Tests
- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
- [ ] Metrics format valid (Prometheus can scrape the endpoint without parse errors)
- [ ] Session metrics update on session creation/close
- [ ] Request metrics update on HTTP requests
- [ ] Error metrics update on failures

### Grafana Dashboard Tests
- [ ] Prometheus data source connected
- [ ] All panels display data
- [ ] Data updates in real-time (<30s delay)
- [ ] Historical data visible (after 1 hour)
- [ ] Dashboard exports to JSON successfully

### Backup Tests
- [ ] Manual backup creates file: `bash backup-postgres.sh`
- [ ] Backup file is compressed and named correctly
- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
- [ ] Timer triggers daily at 2:00 AM
- [ ] Retention policy removes old backups

### Health Check Tests
- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
- [ ] Health checks report database status
- [ ] Health checks report disk/memory usage

---

## Risk Assessment

### HIGH RISK
**Issue:** Database credentials still broken
**Impact:** Cannot test database-dependent features
**Mitigation:** Create backup scripts that work even if database is down (conditional logic)
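
The conditional logic mentioned in this mitigation could take the form of a reachability guard in front of the backup command. The sketch below uses PostgreSQL's `pg_isready` utility; the function names are hypothetical, and `DATABASE_URL` is assumed to be the connection string from `.env`.

```bash
#!/bin/sh
# True when PostgreSQL accepts connections; false when it is down or
# pg_isready is not installed.
db_reachable() {
    command -v pg_isready >/dev/null 2>&1 &&
        pg_isready -q -d "$DATABASE_URL"
}

# Run the given backup command only when the database is reachable, so an
# outage produces a clear message instead of an empty archive.
backup_if_reachable() {  # "$@": the backup command to run
    if db_reachable; then
        "$@"
    else
        echo "database unreachable, backup skipped" >&2
        return 1
    fi
}
```

Note that `pg_isready` only reports whether the server accepts connections; it does not validate credentials, so a backup can still fail on authentication while the credentials remain broken.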

**Issue:** Sudo access required for systemd
**Impact:** Cannot install service without password
**Mitigation:** Prepare scripts and documentation, request sudo access from system admin

### MEDIUM RISK
**Issue:** Prometheus/Grafana installation may require dependencies
**Impact:** Additional setup time
**Mitigation:** Use Docker containers if system install is complex

**Issue:** Metrics may add performance overhead
**Impact:** Latency increase
**Mitigation:** Use efficient metrics library, test performance before/after

### LOW RISK
**Issue:** Log rotation misconfiguration
**Impact:** Disk space issues
**Mitigation:** Test logrotate configuration thoroughly, set conservative limits

---

## Success Criteria

Week 2 is complete when:

1. **Systemd Service**
   - Service starts/stops correctly
   - Auto-restarts on failure
   - Starts on boot
   - Logs to journalctl

2. **Prometheus Metrics**
   - /metrics endpoint working
   - Key metrics implemented:
     - Request counts and latency
     - Session counts and duration
     - Active connections
     - Error rates
   - Prometheus can scrape successfully

3. **Grafana Dashboard**
   - Prometheus data source configured
   - Dashboard with 8+ panels
   - Real-time data display
   - Dashboard exported to JSON

4. **Automated Backups**
   - Backup script functional
   - Daily backups via systemd timer
   - Retention policy enforced
   - Restore procedure documented

5. **Health Monitoring**
   - Log rotation configured
   - Health checks implemented
   - Health metrics exposed
   - Operational runbook created

**Exit Criteria:** All 5 areas have passing tests, and the production infrastructure is stable and monitored.

---

## Next Steps (Week 3)

After Week 2 infrastructure completion:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development

---

**Document Status:** READY
**Owner:** Development Team
**Started:** 2026-01-18
**Target:** 2026-01-25