
# Phase 1, Week 2 - Infrastructure & Monitoring

**Date Started:** 2026-01-18
**Target Completion:** 2026-01-25
**Status:** Starting
**Priority:** HIGH (Production Readiness)

---

## Executive Summary

With the Week 1 security fixes complete and deployed, Week 2 focuses on hardening the production infrastructure. The server is currently started by hand (`nohup start-secure.sh &`), has no monitoring, and has no automated recovery. This week establishes production-grade infrastructure.

**Goals:**
1. Systemd service with auto-restart on failure
2. Prometheus metrics for monitoring
3. Grafana dashboards for visualization
4. Automated PostgreSQL backups
5. Log rotation and management

**Dependencies:**
- SSH access to 172.16.3.30 as `guru` user
- Sudo access for systemd service installation
- PostgreSQL credentials (currently broken; backup automation can still be set up, but not fully tested, until they are fixed)

---

## Week 2 Task Breakdown

### Day 1: Systemd Service Configuration

**Goal:** Convert manual server startup to systemd-managed service

**Tasks:**
1. Create systemd service file (`/etc/systemd/system/guruconnect.service`)
2. Configure service dependencies (network, postgresql)
3. Set restart policy (on-failure, with a fixed restart delay and a start-rate limit)
4. Configure environment variables securely
5. Enable service to start on boot
6. Test service start/stop/restart
7. Verify auto-restart on crash

**Files to Create:**
- `server/guruconnect.service` - Systemd unit file
- `server/setup-systemd.sh` - Installation script

**Verification:**
- Service starts automatically on boot
- Service restarts on failure (kill -9 test)
- Logs go to journalctl
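
The install and crash-test steps above could be scripted roughly as follows. This is a minimal sketch of what `server/setup-systemd.sh` might contain, not the final script: the `run`, `install_service`, `pid_changed`, and `crash_test` helpers and the `DRY_RUN` variable are hypothetical names, and the 15-second wait assumes the service is configured with `RestartSec=10s`.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are illustrative assumptions.

UNIT=guruconnect

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [[ ${DRY_RUN:-0} == 1 ]]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Copy the unit file into place, reload systemd, then enable and start the service.
install_service() {
    local unit_file=$1
    run sudo install -m 644 "$unit_file" "/etc/systemd/system/$UNIT.service"
    run sudo systemctl daemon-reload
    run sudo systemctl enable --now "$UNIT"
}

# Succeed when the service has a new, non-zero main PID after the crash.
pid_changed() {
    local before=$1 after=$2
    [[ -n $after && $after != 0 && $after != "$before" ]]
}

# Send SIGKILL to the service and confirm systemd brings it back.
crash_test() {
    local before after
    before=$(systemctl show -p MainPID --value "$UNIT")
    sudo systemctl kill -s KILL "$UNIT"
    sleep 15   # RestartSec=10s plus a margin
    after=$(systemctl show -p MainPID --value "$UNIT")
    pid_changed "$before" "$after"
}
```

Running `DRY_RUN=1 install_service server/guruconnect.service` prints the three commands without executing them, which is useful while sudo access is still pending.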

---

### Day 2: Prometheus Metrics

**Goal:** Expose metrics for monitoring server health and performance

**Tasks:**
1. Add `prometheus-client` dependency to Cargo.toml
2. Create metrics module (`server/src/metrics/mod.rs`)
3. Implement metric types:
   - Counter: requests_total, sessions_total, errors_total
   - Gauge: active_sessions, active_connections
   - Histogram: request_duration_seconds, session_duration_seconds
4. Add `/metrics` endpoint
5. Integrate metrics into existing code:
   - Session creation/close
   - Request handling
   - WebSocket connections
   - Database operations
6. Test metrics endpoint (`curl http://172.16.3.30:3002/metrics`)

**Files to Create/Modify:**
- `server/Cargo.toml` - Add dependencies
- `server/src/metrics/mod.rs` - Metrics module
- `server/src/main.rs` - Add /metrics endpoint
- `server/src/relay/mod.rs` - Add session metrics
- `server/src/api/mod.rs` - Add request metrics

**Metrics to Track:**
- `guruconnect_requests_total{method, path, status}` - HTTP requests
- `guruconnect_sessions_total{status}` - Sessions (created, closed, failed)
- `guruconnect_active_sessions` - Current active sessions
- `guruconnect_active_connections{type}` - WebSocket connections (agents, viewers)
- `guruconnect_request_duration_seconds{method, path}` - Request latency
- `guruconnect_session_duration_seconds` - Session lifetime
- `guruconnect_errors_total{type}` - Error counts
- `guruconnect_db_operations_total{operation, status}` - Database operations

**Verification:**
- Metrics endpoint returns Prometheus format
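- Metric values change immediately after the corresponding event (for example, creating a session)
- No measurable performance degradation (compare request latency before and after the change)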
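
One way to check the endpoint against the metric list above is to scan its output for each expected name. The sketch below is an assumption-laden helper, not part of the server: `missing_metrics` and `EXPECTED_METRICS` are hypothetical names, and it matches sample lines rather than `# TYPE` lines because client libraries differ in how they name a family in its metadata. Labelled families usually emit no samples until they are first observed, so it is best run after some traffic.

```shell
#!/usr/bin/env bash
# Sketch only: missing_metrics and EXPECTED_METRICS are hypothetical names.

EXPECTED_METRICS=(
    guruconnect_requests_total
    guruconnect_sessions_total
    guruconnect_active_sessions
    guruconnect_active_connections
    guruconnect_request_duration_seconds
    guruconnect_session_duration_seconds
    guruconnect_errors_total
    guruconnect_db_operations_total
)

# Read Prometheus text-format output on stdin and print each expected metric
# that has no sample line. Histograms are matched via their _bucket, _sum,
# or _count series. Returns 1 if anything is missing.
missing_metrics() {
    local body name missing=0
    body=$(cat)
    for name in "${EXPECTED_METRICS[@]}"; do
        if ! grep -Eq "^${name}(_bucket|_sum|_count)?[{ ]" <<<"$body"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical invocation would be `curl -s http://172.16.3.30:3002/metrics | missing_metrics`, which prints nothing and exits 0 when every metric is present.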

---

### Day 3: Grafana Dashboard

**Goal:** Create visual dashboards for monitoring GuruConnect

**Tasks:**
1. Install Prometheus on 172.16.3.30
2. Configure Prometheus to scrape GuruConnect metrics
3. Install Grafana on 172.16.3.30
4. Configure Grafana data source (Prometheus)
5. Create dashboards:
   - Overview: Active sessions, requests/sec, errors
   - Sessions: Session lifecycle, duration distribution
   - Performance: Request latency, database query time
   - Errors: Error rates by type
6. Set up alerting rules (if time permits)

**Files to Create:**
- `infrastructure/prometheus.yml` - Prometheus configuration
- `infrastructure/grafana-dashboard.json` - Pre-built dashboard
- `infrastructure/setup-monitoring.sh` - Installation script

**Grafana Dashboard Panels:**
1. Active Sessions (Gauge)
2. Requests per Second (Graph)
3. Error Rate (Graph)
4. Session Creation Rate (Graph)
5. Request Latency p50/p95/p99 (Graph)
6. Active Connections by Type (Graph)
7. Database Operations (Graph)
8. Top Errors (Table)

**Verification:**
- Prometheus scrapes metrics successfully
- Grafana dashboard displays real-time data
- Alerts fire on test conditions (only if alerting rules were set up)

---

### Day 4: Automated PostgreSQL Backups

**Goal:** Implement automated daily backups with retention policy

**Tasks:**
1. Create backup script (`server/backup-postgres.sh`)
2. Configure backup location (`/home/guru/backups/guruconnect/`)
3. Implement retention policy (keep 30 daily, 4 weekly, 6 monthly)
4. Create systemd timer for daily backups
5. Add backup monitoring (success/failure metrics)
6. Test backup and restore process
7. Document restore procedure

**Files to Create:**
- `server/backup-postgres.sh` - Backup script
- `server/restore-postgres.sh` - Restore script
- `server/guruconnect-backup.service` - Systemd service
- `server/guruconnect-backup.timer` - Systemd timer
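
The two unit files could look roughly like the ones generated below. This is a sketch under assumptions: `write_backup_units` is a hypothetical helper, and the `ExecStart` and `EnvironmentFile` paths assume the backup script and `.env` live in `/home/guru/guru-connect/server`. A timer activates the service of the same name by default, and `Persistent=true` runs a missed backup after downtime.

```shell
#!/usr/bin/env bash
# Sketch only: write_backup_units and the paths inside the units are assumptions.

# Write guruconnect-backup.service and guruconnect-backup.timer into directory $1.
write_backup_units() {
    local dir=$1
    mkdir -p "$dir"

    cat > "$dir/guruconnect-backup.service" <<'EOF'
[Unit]
Description=GuruConnect PostgreSQL backup
After=postgresql.service

[Service]
Type=oneshot
User=guru
Group=guru
EnvironmentFile=/home/guru/guru-connect/server/.env
ExecStart=/home/guru/guru-connect/server/backup-postgres.sh
EOF

    cat > "$dir/guruconnect-backup.timer" <<'EOF'
[Unit]
Description=Run the GuruConnect PostgreSQL backup daily at 02:00

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

After copying both files to `/etc/systemd/system/`, `sudo systemctl enable --now guruconnect-backup.timer` activates the schedule and `systemctl list-timers` shows the next run.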

**Backup Strategy:**
- Daily full backups at 2:00 AM
- Compressed with gzip
- Named with timestamp: `guruconnect-YYYY-MM-DD-HHMMSS.sql.gz`
- Stored in `/home/guru/backups/guruconnect/`
- Retention: 30 daily, 4 weekly, and 6 monthly backups
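
The naming and compression rules above might be implemented along these lines. This is a minimal sketch of the core of `server/backup-postgres.sh`, assuming `DATABASE_URL` is exported from the `.env` file; `backup_name` and `run_backup` are hypothetical names. Because the database credentials are currently broken, the sketch checks reachability first and never leaves a partial file behind.

```shell
#!/usr/bin/env bash
# Sketch only: function names are assumptions; expects DATABASE_URL in the environment.

# Print a timestamped filename: guruconnect-YYYY-MM-DD-HHMMSS.sql.gz
backup_name() {
    printf 'guruconnect-%s.sql.gz\n' "$(date +%Y-%m-%d-%H%M%S)"
}

# Dump the database into directory $1 and print the resulting path.
# Returns non-zero, leaving no file behind, if the database is unreachable
# or the dump fails.
run_backup() {
    local dir=$1 final tmp
    if [[ -z ${DATABASE_URL:-} ]]; then
        echo "DATABASE_URL is not set" >&2
        return 1
    fi
    if ! pg_isready -q -d "$DATABASE_URL"; then
        echo "database unreachable; skipping backup" >&2
        return 1
    fi
    mkdir -p "$dir"
    final="$dir/$(backup_name)"
    tmp="$final.partial"
    # pipefail makes a pg_dump failure fail the whole pipeline, not just gzip.
    if ( set -o pipefail; pg_dump --dbname="$DATABASE_URL" | gzip > "$tmp" ); then
        mv -- "$tmp" "$final"
        echo "$final"
    else
        rm -f -- "$tmp"
        return 1
    fi
}
```

Writing to a `.partial` file and renaming it only on success means anything matching `guruconnect-*.sql.gz` in the backup directory is a completed dump.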

**Verification:**
- Manual backup works
- Automated backup runs daily
- Restore process verified
- Old backups cleaned up correctly
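
Cleanup of old backups depends on how the three retention tiers are interpreted. The sketch below assumes one possible reading: every backup is kept for 30 days, Sunday backups for a further 4 weeks (58 days in total), and first-of-month backups for roughly 6 months (183 days). The helper names are hypothetical, the windows should be adjusted to the intended policy, and GNU `date` is required.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and tier windows are assumptions. Needs GNU date.

# Extract YYYY-MM-DD from a backup filename; fails for any other name.
backup_date() {
    local base
    base=$(basename "$1")
    if [[ $base =~ ^guruconnect-([0-9]{4}-[0-9]{2}-[0-9]{2})-[0-9]{6}\.sql\.gz$ ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

# Whole days between two YYYY-MM-DD dates (UTC avoids DST off-by-one errors).
age_days() {
    local from to
    from=$(date -u -d "$1" +%s) || return 1
    to=$(date -u -d "$2" +%s) || return 1
    echo $(( (to - from) / 86400 ))
}

# Return 0 if a backup taken on day $1 should still be kept as of day $2.
should_keep() {
    local day=$1 today=$2 age
    age=$(age_days "$day" "$today") || return 1
    (( age <= 30 )) && return 0                                          # daily tier
    [[ $(date -u -d "$day" +%u) == 7 ]] && (( age <= 58 )) && return 0   # weekly: Sundays
    [[ ${day:8:2} == 01 ]] && (( age <= 183 )) && return 0               # monthly: the 1st
    return 1
}

# List expired backups in directory $1; delete them only when $2 is --delete.
prune_backups() {
    local dir=$1 mode=${2:-} today file day
    today=$(date -u +%Y-%m-%d)
    for file in "$dir"/guruconnect-*.sql.gz; do
        [[ -e $file ]] || continue
        day=$(backup_date "$file") || continue
        if ! should_keep "$day" "$today"; then
            if [[ $mode == --delete ]]; then
                rm -- "$file"
            else
                echo "would delete $file"
            fi
        fi
    done
}
```

Running `prune_backups /home/guru/backups/guruconnect` without `--delete` gives a dry run, which makes the retention rules easy to verify before anything is removed.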

---

### Day 5: Log Rotation & Health Checks

**Goal:** Implement log rotation and continuous health monitoring

**Tasks:**
1. Configure logrotate for GuruConnect logs
2. Implement health check improvements:
   - Database connectivity check
   - Disk space check
   - Memory usage check
   - Active session count check
3. Create monitoring script (`server/health-monitor.sh`)
4. Add health metrics to Prometheus
5. Create systemd watchdog configuration
6. Document operational procedures

**Files to Create:**
- `server/guruconnect.logrotate` - Logrotate configuration
- `server/health-monitor.sh` - Health monitoring script
- `server/OPERATIONS.md` - Operational runbook

**Health Checks:**
- `/health` endpoint (basic - already exists)
- `/health/deep` endpoint (detailed checks):
  - Database connection: OK/FAIL
  - Disk space: >10% free
  - Memory: <90% used
  - Active sessions: <100 (threshold)
  - Uptime: seconds since start
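
The host-level thresholds above (disk and memory) could be checked from `server/health-monitor.sh` roughly as follows. This is a sketch with hypothetical function names that assumes a Linux host with `/proc/meminfo`; the database and session checks are left out because they depend on the server itself.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; assumes Linux (/proc/meminfo).

# Percentage of free space on the filesystem that holds path $1.
disk_free_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Percentage of memory in use, from /proc/meminfo-style input on stdin.
mem_used_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) print int((total - avail) * 100 / total)
             else exit 1
         }'
}

# Run the given command; print "<name>: OK" or "<name>: FAIL" accordingly.
report() {
    local name=$1
    shift
    if "$@"; then
        echo "$name: OK"
    else
        echo "$name: FAIL"
        return 1
    fi
}

# Apply the thresholds: more than 10% disk free, less than 90% memory used.
health_check() {
    local failed=0 disk mem
    disk=$(disk_free_pct /)
    mem=$(mem_used_pct < /proc/meminfo)
    report "disk space (${disk}% free)" test "$disk" -gt 10 || failed=1
    report "memory (${mem}% used)" test "$mem" -lt 90 || failed=1
    return "$failed"
}
```

`health_check` exits non-zero when any threshold is breached, so it can be called directly from a cron job or systemd timer that alerts on failure.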

**Verification:**
- Logs rotate correctly
- Health checks report accurate status
- Alerts trigger when a health check fails

---

## Infrastructure Files Structure

```
guru-connect/
├── server/
│   ├── guruconnect.service        # Systemd service file
│   ├── setup-systemd.sh           # Service installation script
│   ├── backup-postgres.sh         # PostgreSQL backup script
│   ├── restore-postgres.sh        # PostgreSQL restore script
│   ├── guruconnect-backup.service # Backup systemd service
│   ├── guruconnect-backup.timer   # Backup systemd timer
│   ├── guruconnect.logrotate      # Logrotate configuration
│   ├── health-monitor.sh          # Health monitoring script
│   └── OPERATIONS.md              # Operational runbook
├── infrastructure/
│   ├── prometheus.yml             # Prometheus configuration
│   ├── grafana-dashboard.json     # Grafana dashboard export
│   └── setup-monitoring.sh        # Monitoring setup script
└── docs/
    └── MONITORING.md              # Monitoring documentation
```

---

## Systemd Service Configuration

**Service File: `/etc/systemd/system/guruconnect.service`**

```ini
[Unit]
Description=GuruConnect Remote Desktop Server
Documentation=https://git.azcomputerguru.com/azcomputerguru/guru-connect
After=network-online.target postgresql.service
Wants=network-online.target
# Start-rate limit: give up after 3 failed starts within 5 minutes.
# These settings belong in [Unit] on current systemd versions.
StartLimitIntervalSec=5min
StartLimitBurst=3

[Service]
Type=simple
User=guru
Group=guru
WorkingDirectory=/home/guru/guru-connect/server

# Environment variables
EnvironmentFile=/home/guru/guru-connect/server/.env

# Start command
ExecStart=/home/guru/guru-connect/target/x86_64-unknown-linux-gnu/release/guruconnect-server

# Restart policy
Restart=on-failure
RestartSec=10s

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096

# Security
NoNewPrivileges=true
PrivateTmp=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=guruconnect

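# Watchdog
# Requires the server to send WATCHDOG=1 via sd_notify at least once per
# interval; without that, systemd kills and restarts the service every 30s.
# Uncomment once the server implements the keep-alive.
#WatchdogSec=30s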

[Install]
WantedBy=multi-user.target
```

**Environment File: `/home/guru/guru-connect/server/.env`**

```bash
# Database
DATABASE_URL=postgresql://guruconnect:PASSWORD@localhost:5432/guruconnect

# Security
JWT_SECRET=your-very-secure-jwt-secret-at-least-32-characters
AGENT_API_KEY=your-very-secure-api-key-at-least-32-characters

# Server Configuration
RUST_LOG=info
HOST=0.0.0.0
PORT=3002

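# Monitoring
# Metrics are exposed on the same port as the main service.
# (Keep comments on their own lines: EnvironmentFile does not strip trailing comments.)
PROMETHEUS_PORT=3002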
```
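
Before installing the service, the environment file can be sanity-checked with a small script. The sketch below uses a hypothetical `check_env_file` helper; the required keys and the 32-character minimum are taken from the placeholders above, and the inline-comment check reflects that systemd's `EnvironmentFile=` only treats lines starting with `#` or `;` as comments.

```shell
#!/usr/bin/env bash
# Sketch only: check_env_file is a hypothetical helper.

# Validate an env file: required keys present, secrets at least 32 characters,
# and no trailing "# comment" after a value. Returns 1 on any problem.
check_env_file() {
    local file=$1 key value errors=0
    for key in DATABASE_URL JWT_SECRET AGENT_API_KEY; do
        # Take the last assignment of the key, as a later line overrides an earlier one.
        value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1)
        if [[ -z $value ]]; then
            echo "missing: $key" >&2
            errors=1
        elif [[ $key != DATABASE_URL && ${#value} -lt 32 ]]; then
            echo "too short (<32 chars): $key" >&2
            errors=1
        fi
    done
    # A value followed by whitespace and '#' would become part of the value.
    if grep -Eq '^[A-Za-z_][A-Za-z0-9_]*=[^#]*[[:space:]]#' "$file"; then
        echo "inline comment found; move it to its own line" >&2
        errors=1
    fi
    return "$errors"
}
```

For example, `check_env_file /home/guru/guru-connect/server/.env && echo "env file looks sane"` could run as the first step of the installation script.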

---

## Prometheus Configuration

**File: `infrastructure/prometheus.yml`**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'guruconnect-production'

scrape_configs:
  - job_name: 'guruconnect'
    static_configs:
      - targets: ['172.16.3.30:3002']
        labels:
          env: 'production'
          service: 'guruconnect-server'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['172.16.3.30:9100']
        labels:
          env: 'production'
          instance: 'rmm-server'

# Alerting rules (optional for Week 2)
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```

---

## Testing Checklist

### Systemd Service Tests
- [ ] Service starts correctly: `sudo systemctl start guruconnect`
- [ ] Service stops correctly: `sudo systemctl stop guruconnect`
- [ ] Service restarts correctly: `sudo systemctl restart guruconnect`
- [ ] Service is enabled to start on boot: `sudo systemctl enable guruconnect`, then confirm with `systemctl is-enabled guruconnect`
- [ ] Service restarts on crash: `sudo kill -9 <pid>` (wait 10s)
- [ ] Logs visible in journalctl: `sudo journalctl -u guruconnect -f`

### Prometheus Metrics Tests
- [ ] Metrics endpoint accessible: `curl http://172.16.3.30:3002/metrics`
- [ ] Metrics format valid (the Prometheus server scrapes it without errors)
- [ ] Session metrics update on session creation/close
- [ ] Request metrics update on HTTP requests
- [ ] Error metrics update on failures

### Grafana Dashboard Tests
- [ ] Prometheus data source connected
- [ ] All panels display data
- [ ] Data updates in real-time (<30s delay)
- [ ] Historical data visible (after 1 hour)
- [ ] Dashboard exports to JSON successfully

### Backup Tests
- [ ] Manual backup creates file: `bash backup-postgres.sh`
- [ ] Backup file is compressed and named correctly
- [ ] Restore works: `bash restore-postgres.sh <backup-file>`
- [ ] Timer triggers daily at 2:00 AM
- [ ] Retention policy removes old backups

### Health Check Tests
- [ ] Basic health endpoint: `curl http://172.16.3.30:3002/health`
- [ ] Deep health endpoint: `curl http://172.16.3.30:3002/health/deep`
- [ ] Health checks report database status
- [ ] Health checks report disk/memory usage

---

## Risk Assessment

### HIGH RISK
**Issue:** Database credentials still broken
**Impact:** Cannot test database-dependent features
**Mitigation:** Write the backup scripts so they detect an unreachable database and fail cleanly (conditional logic), which allows the automation to be set up before the credentials are fixed

**Issue:** Sudo access required for systemd
**Impact:** Cannot install service without password
**Mitigation:** Prepare scripts and documentation, request sudo access from system admin

### MEDIUM RISK
**Issue:** Prometheus/Grafana installation may require dependencies
**Impact:** Additional setup time
**Mitigation:** Use Docker containers if system install is complex

**Issue:** Metrics may add performance overhead
**Impact:** Latency increase
**Mitigation:** Use efficient metrics library, test performance before/after

### LOW RISK
**Issue:** Log rotation misconfiguration
**Impact:** Disk space issues
**Mitigation:** Test logrotate configuration thoroughly, set conservative limits

---

## Success Criteria

Week 2 is complete when:

1. **Systemd Service**
   - Service starts/stops correctly
   - Auto-restarts on failure
   - Starts on boot
   - Logs to journalctl

2. **Prometheus Metrics**
   - /metrics endpoint working
   - Key metrics implemented:
     - Request counts and latency
     - Session counts and duration
     - Active connections
     - Error rates
   - Prometheus can scrape successfully

3. **Grafana Dashboard**
   - Prometheus data source configured
   - Dashboard with 8+ panels
   - Real-time data display
   - Dashboard exported to JSON

4. **Automated Backups**
   - Backup script functional
   - Daily backups via systemd timer
   - Retention policy enforced
   - Restore procedure documented

5. **Health Monitoring**
   - Log rotation configured
   - Health checks implemented
   - Health metrics exposed
   - Operational runbook created

**Exit Criteria:** All 5 areas have passing tests, and the production infrastructure is stable and monitored.

---

## Next Steps (Week 3)

Once the Week 2 infrastructure work is complete:
- Week 3: CI/CD pipeline (Gitea CI, automated builds, deployment automation)
- Week 4: Production hardening (load testing, performance optimization, security audit)
- Phase 2: Core features development

---

**Document Status:** READY
**Owner:** Development Team
**Started:** 2026-01-18
**Target:** 2026-01-25