Added comprehensive production infrastructure:

Systemd Service:
- guruconnect.service with auto-restart, resource limits, security hardening
- setup-systemd.sh installation script

Prometheus Metrics:
- Added prometheus-client dependency
- Created metrics module tracking:
  - HTTP requests (count, latency)
  - Sessions (created, closed, active)
  - Connections (WebSocket, by type)
  - Errors (by type)
  - Database operations (count, latency)
  - Server uptime
- Added /metrics endpoint
- Background task for uptime updates

Monitoring Configuration:
- prometheus.yml with scrape configs for GuruConnect and node_exporter
- alerts.yml with alerting rules
- grafana-dashboard.json with 10 panels
- setup-monitoring.sh installation script

PostgreSQL Backups:
- backup-postgres.sh with gzip compression
- restore-postgres.sh with safety checks
- guruconnect-backup.service and .timer for automated daily backups
- Retention policy: 30 daily, 4 weekly, 6 monthly

Health Monitoring:
- health-monitor.sh checking HTTP, disk, memory, database, metrics
- guruconnect.logrotate for log rotation
- Email alerts on failures

Updated CHECKLIST_STATE.json to reflect Week 1 completion (77%) and Week 2 start.
Created PHASE1_WEEK2_INFRASTRUCTURE.md with comprehensive planning.

Ready for deployment and testing on RMM server.
458 lines
13 KiB
Markdown
# Phase 1, Week 2 - Infrastructure & Monitoring

**Date Started:** 2026-01-18
**Target Completion:** 2026-01-25
**Status:** Starting
**Priority:** HIGH (Production Readiness)

---

## Executive Summary

With Week 1 security fixes complete and deployed, Week 2 focuses on production infrastructure hardening. The server is currently started by hand (`nohup start-secure.sh &`), lacks monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

**Goals:**
1. Systemd service with auto-restart on failure
2. Prometheus metrics for monitoring
3. Grafana dashboards for visualization
4. Automated PostgreSQL backups
5. Log rotation and management

**Dependencies:**
- SSH access to 172.16.3.30 as `guru` user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken; backup automation can still be set up without them)

---

## Week 2 Task Breakdown

### Day 1: Systemd Service Configuration

**Goal:** Convert manual server startup to systemd-managed service

**Tasks:**
1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
2. Configure service dependencies (network, postgresql)
3. Set restart policy (on-failure, with a restart delay and start rate limit)
4. Configure environment variables securely
5. Enable service to start on boot
6. Test service start/stop/restart
7. Verify auto-restart on crash

**Files to Create:**
- `server/guruconnect.service` - Systemd unit file
- `server/setup-systemd.sh` - Installation script

**Verification:**
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs go to journalctl

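
The installation steps above could be scripted roughly as follows. This is a minimal sketch of what `server/setup-systemd.sh` might contain, not the actual script: the `run` helper, the `DRY_RUN` switch, and the `UNIT_SRC`/`UNIT_DST` variables are assumptions introduced for illustration. With `DRY_RUN=1`, `install_service` only prints the commands, which is useful while sudo access is still pending.

```bash
#!/usr/bin/env bash
# Sketch of an installer for the guruconnect unit (names below are assumptions).

UNIT_SRC="${UNIT_SRC:-server/guruconnect.service}"
UNIT_DST="${UNIT_DST:-/etc/systemd/system/guruconnect.service}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Copy the unit file, reload systemd, enable start on boot, and (re)start.
install_service() {
    run sudo install -m 0644 "$UNIT_SRC" "$UNIT_DST" || return 1
    run sudo systemctl daemon-reload || return 1
    run sudo systemctl enable guruconnect.service || return 1
    run sudo systemctl restart guruconnect.service || return 1
    # Final check: exits non-zero if the service did not come up.
    run systemctl is-active guruconnect.service
}

# Typical invocation from the repository root:
#   DRY_RUN=1 install_service   # preview only
#   install_service             # real installation
```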

---

### Day 2: Prometheus Metrics

**Goal:** Expose metrics for monitoring server health and performance

**Tasks:**
1. Add `prometheus-client` dependency to Cargo.toml
2. Create metrics module (`server/src/metrics/mod.rs`)
3. Implement metric types:
   - Counter: requests_total, sessions_total, errors_total
   - Gauge: active_sessions, active_connections
   - Histogram: request_duration_seconds, session_duration_seconds
4. Add `/metrics` endpoint
5. Integrate metrics into existing code:
   - Session creation/close
   - Request handling
   - WebSocket connections
   - Database operations
6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)

**Files to Create/Modify:**
- `server/Cargo.toml` - Add dependencies
- `server/src/metrics/mod.rs` - Metrics module
- `server/src/main.rs` - Add /metrics endpoint
- `server/src/relay/mod.rs` - Add session metrics
- `server/src/api/mod.rs` - Add request metrics

**Metrics to Track:**
- `guruconnect_requests_total{method, path, status}` - HTTP requests
- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
- `guruconnect_active_sessions` - Current active sessions
- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
- `guruconnect_request_duration_seconds{method, path}` - Request latency
- `guruconnect_session_duration_seconds` - Session lifetime
- `guruconnect_errors_total{type}` - Error counts
- `guruconnect_db_operations_total{operation, status}` - Database operations

**Verification:**
- Metrics endpoint returns Prometheus format
- Metrics update in real-time
- No performance degradation

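
One way to check the first verification point is to confirm that every name from "Metrics to Track" shows up in the endpoint output. The sketch below assumes the endpoint serves the standard Prometheus text format; the function name `missing_metrics` and the `EXPECTED_METRICS` array are hypothetical helpers, not part of the codebase. Note that a labelled metric only appears once it has been observed at least once, so some names may be legitimately absent on a freshly started server.

```bash
#!/usr/bin/env bash
# Sketch: report planned metric names that are absent from /metrics output.

# Names taken from the "Metrics to Track" list.
EXPECTED_METRICS=(
    guruconnect_requests_total
    guruconnect_sessions_total
    guruconnect_active_sessions
    guruconnect_active_connections
    guruconnect_request_duration_seconds
    guruconnect_session_duration_seconds
    guruconnect_errors_total
    guruconnect_db_operations_total
)

# Reads Prometheus text format on stdin, prints each missing metric name,
# and returns non-zero if at least one expected name was not found.
missing_metrics() {
    local body name missing=0
    body="$(cat)"
    for name in "${EXPECTED_METRICS[@]}"; do
        # A sample line starts with the name (histograms add a suffix),
        # followed by either a label set '{' or a space before the value.
        if ! grep -Eq "^${name}(_bucket|_sum|_count)?[{ ]" <<<"$body"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Typical invocation:
#   curl -fsS http://172.16.3.30:3002/metrics | missing_metrics
```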

---

### Day 3: Grafana Dashboard

**Goal:** Create visual dashboards for monitoring GuruConnect

**Tasks:**
1. Install Prometheus on 172.16.3.30
2. Configure Prometheus to scrape GuruConnect metrics
3. Install Grafana on 172.16.3.30
4. Configure Grafana data source (Prometheus)
5. Create dashboards:
   - Overview: Active sessions, requests/sec, errors
   - Sessions: Session lifecycle, duration distribution
   - Performance: Request latency, database query time
   - Errors: Error rates by type
6. Set up alerting rules (if time permits)

**Files to Create:**
- `infrastructure/prometheus.yml` - Prometheus configuration
- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
- `infrastructure/setup-monitoring.sh` - Installation script

**Grafana Dashboard Panels:**
1. Active Sessions (Gauge)
2. Requests per Second (Graph)
3. Error Rate (Graph)
4. Session Creation Rate (Graph)
5. Request Latency p50/p95/p99 (Graph)
6. Active Connections by Type (Graph)
7. Database Operations (Graph)
8. Top Errors (Table)

**Verification:**
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions

---

### Day 4: Automated PostgreSQL Backups

**Goal:** Implement automated daily backups with retention policy

**Tasks:**
1. Create backup script (`server/backup-postgres.sh`)
2. Configure backup location (`/home/guru/backups/guruconnect/`)
3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
4. Create systemd timer for daily backups
5. Add backup monitoring (success/failure metrics)
6. Test backup and restore process
7. Document restore procedure

**Files to Create:**
- `server/backup-postgres.sh` - Backup script
- `server/restore-postgres.sh` - Restore script
- `server/guruconnect-backup.service` - Systemd service
- `server/guruconnect-backup.timer` - Systemd timer

**Backup Strategy:**
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
- Stored in `/home/guru/backups/guruconnect/`
- Retention: 30 days daily, 4 weeks weekly, 6 months monthly

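
The naming and compression rules above might be implemented along these lines. This is a sketch under stated assumptions rather than the contents of `backup-postgres.sh`: the function names are hypothetical, the database name `guruconnect` is inferred from the `DATABASE_URL` example, and `pg_dump` is assumed to find its credentials through the usual libpq mechanisms (`PGUSER`, `~/.pgpass`, and so on). Retention pruning is left out. The dump is written to a temporary file first so that a failed run, for example while the database credentials are still broken, never leaves a truncated archive under the final name.

```bash
#!/usr/bin/env bash
# Sketch of the core of a backup script (function names are assumptions).

BACKUP_DIR="${BACKUP_DIR:-/home/guru/backups/guruconnect}"
DB_NAME="${DB_NAME:-guruconnect}"

# Build the file name, e.g. guruconnect-2026-01-18-020000.sql.gz
backup_name() {
    printf 'guruconnect-%s.sql.gz\n' "$(date +%Y-%m-%d-%H%M%S)"
}

# Dump the database, compress it, and print the resulting path.
# Returns non-zero and leaves no partial file behind if the dump fails.
run_backup() {
    local target
    target="$BACKUP_DIR/$(backup_name)"
    mkdir -p "$BACKUP_DIR" || return 1
    # pipefail makes a pg_dump failure fail the pipeline even if gzip succeeds.
    if ( set -o pipefail; pg_dump "$DB_NAME" | gzip > "$target.tmp" ); then
        mv "$target.tmp" "$target"
        echo "$target"
    else
        rm -f "$target.tmp"
        echo "backup of $DB_NAME failed" >&2
        return 1
    fi
}
```

Calling `run_backup` from the systemd backup service would then be enough for the daily job; its exit status can feed the success/failure monitoring mentioned in the task list.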

**Verification:**
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly

---

### Day 5: Log Rotation & Health Checks

**Goal:** Implement log rotation and continuous health monitoring

**Tasks:**
1. Configure logrotate for GuruConnect logs
2. Implement health check improvements:
   - Database connectivity check
   - Disk space check
   - Memory usage check
   - Active session count check
3. Create monitoring script (`server/health-monitor.sh`)
4. Add health metrics to Prometheus
5. Create systemd watchdog configuration
6. Document operational procedures

**Files to Create:**
- `server/guruconnect.logrotate` - Logrotate configuration
- `server/health-monitor.sh` - Health monitoring script
- `server/OPERATIONS.md` - Operational runbook

**Health Checks:**
- `/health` endpoint (basic - already exists)
- `/health/deep` endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start

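
The disk, memory, and session thresholds listed above could be evaluated in `health-monitor.sh` roughly as shown here. This is an illustrative sketch only: the function names are hypothetical, the memory reading assumes a Linux host with `/proc/meminfo`, and the database and uptime checks are omitted. Keeping the threshold logic in a function that takes plain numbers makes it easy to test without touching the real system.

```bash
#!/usr/bin/env bash
# Sketch of threshold checks for a health monitor (names are assumptions).

# Percentage of free space on the filesystem that holds the given path.
disk_free_pct() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print 100 - $5 }'
}

# Percentage of memory in use, based on MemAvailable (Linux only).
mem_used_pct() {
    awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
         END { printf "%d\n", (t - a) * 100 / t }' /proc/meminfo
}

# Compare readings with the documented thresholds:
#   disk free > 10%, memory used < 90%, active sessions < 100.
# Prints one line per check and returns non-zero if any check fails.
evaluate_health() {
    local disk_free=$1 mem_used=$2 sessions=$3 status=0
    if [ "$disk_free" -gt 10 ]; then echo "disk: OK"; else echo "disk: FAIL"; status=1; fi
    if [ "$mem_used" -lt 90 ]; then echo "memory: OK"; else echo "memory: FAIL"; status=1; fi
    if [ "$sessions" -lt 100 ]; then echo "sessions: OK"; else echo "sessions: FAIL"; status=1; fi
    return "$status"
}

# Typical invocation (the session count would come from the metrics endpoint):
#   evaluate_health "$(disk_free_pct /home/guru)" "$(mem_used_pct)" "$sessions"
```

A non-zero return from `evaluate_health` is the natural hook for the failure alerts this section calls for.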

**Verification:**
- Logs rotate correctly
- Health checks report accurate status
- Alerts triggered on health failures

---

## Infrastructure Files Structure

```
guru-connect/
├── server/
│   ├── guruconnect.service           # Systemd service file
│   ├── setup-systemd.sh              # Service installation script
│   ├── backup-postgres.sh            # PostgreSQL backup script
│   ├── restore-postgres.sh           # PostgreSQL restore script
│   ├── guruconnect-backup.service    # Backup systemd service
│   ├── guruconnect-backup.timer      # Backup systemd timer
│   ├── guruconnect.logrotate         # Logrotate configuration
│   ├── health-monitor.sh             # Health monitoring script
│   └── OPERATIONS.md                 # Operational runbook
├── infrastructure/
│   ├── prometheus.yml                # Prometheus configuration
│   ├── grafana-dashboard.json        # Grafana dashboard export
│   └── setup-monitoring.sh           # Monitoring setup script
└── docs/
    └── MONITORING.md                 # Monitoring documentation
```

---

## Systemd Service Configuration

**Service File: `/etc/systemd/system/guruconnect.service`**

```ini
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target

# Start rate limit: give up after 3 starts within 5 minutes.
# Current systemd versions expect these settings in [Unit].
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

# Watchdog
# The server must send WATCHDOG=1 via sd_notify at least every 30s;
# otherwise systemd treats it as hung and restarts it. Enable this only
# once the server implements the keep-alive.
WatchdogSec=30s

[Install]
WantedBy=multi-user.target
```

**Environment File: `/home/guru/guru-connect/server/.env`**

```bash
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

# Monitoring
# Expose on same port as main service. Keep comments on their own line:
# systemd's EnvironmentFile does not strip trailing comments from values.
PROMETHEUS_PORT=3002
```

---

## Prometheus Configuration

**File: `infrastructure/prometheus.yml`**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

---

## Testing Checklist

### Systemd Service Tests
- [ ] Service starts correctly: `sudo systemctl start guruconnect`
- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
- [ ] Service auto-starts on boot: `sudo systemctl enable guruconnect`
- [ ] Service restarts on crash: `sudo kill -9 <pid>` (wait 10s)
- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`

### Prometheus Metrics Tests
- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
- [ ] Metrics format valid (Prometheus can scrape it)
- [ ] Session metrics update on session creation/close
- [ ] Request metrics update on HTTP requests
- [ ] Error metrics update on failures

### Grafana Dashboard Tests
- [ ] Prometheus data source connected
- [ ] All panels display data
- [ ] Data updates in real-time (<30s delay)
- [ ] Historical data visible (after 1 hour)
- [ ] Dashboard exports to JSON successfully

### Backup Tests
- [ ] Manual backup creates file: `bash backup-postgres.sh`
- [ ] Backup file is compressed and named correctly
- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
- [ ] Timer triggers daily at 2:00 AM
- [ ] Retention policy removes old backups

### Health Check Tests
- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
- [ ] Health checks report database status
- [ ] Health checks report disk/memory usage

---

## Risk Assessment

### HIGH RISK
**Issue:** Database credentials still broken
**Impact:** Cannot test database-dependent features
**Mitigation:** Create backup scripts that handle an unavailable database gracefully (conditional logic)

**Issue:** Sudo access required for systemd
**Impact:** Cannot install service without password
**Mitigation:** Prepare scripts and documentation, request sudo access from system admin

### MEDIUM RISK
**Issue:** Prometheus/Grafana installation may require dependencies
**Impact:** Additional setup time
**Mitigation:** Use Docker containers if system install is complex

**Issue:** Metrics may add performance overhead
**Impact:** Latency increase
**Mitigation:** Use efficient metrics library, test performance before/after

### LOW RISK
**Issue:** Log rotation misconfiguration
**Impact:** Disk space issues
**Mitigation:** Test logrotate configuration thoroughly, set conservative limits

---

## Success Criteria

Week 2 is complete when:

1. **Systemd Service**
   - Service starts/stops correctly
   - Auto-restarts on failure
   - Starts on boot
   - Logs to journalctl

2. **Prometheus Metrics**
   - /metrics endpoint working
   - Key metrics implemented:
     - Request counts and latency
     - Session counts and duration
     - Active connections
     - Error rates
   - Prometheus can scrape successfully

3. **Grafana Dashboard**
   - Prometheus data source configured
   - Dashboard with 8+ panels
   - Real-time data display
   - Dashboard exported to JSON

4. **Automated Backups**
   - Backup script functional
   - Daily backups via systemd timer
   - Retention policy enforced
   - Restore procedure documented

5. **Health Monitoring**
   - Log rotation configured
   - Health checks implemented
   - Health metrics exposed
   - Operational runbook created

**Exit Criteria:** All 5 areas have passing tests, production infrastructure is stable and monitored.

---

## Next Steps (Week 3)

After Week 2 infrastructure completion:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development

---

**Document Status:** READY
**Owner:** Development Team
**Started:** 2026-01-18
**Target:** 2026-01-25