Add VPN configuration tools and agent documentation

Created comprehensive VPN setup tooling for Peaceful Spirit L2TP/IPsec connection and enhanced agent documentation framework.

VPN Configuration (PST-NW-VPN):
- Setup-PST-L2TP-VPN.ps1: Automated L2TP/IPsec setup with split-tunnel and DNS
- Connect-PST-VPN.ps1: Connection helper with PPP adapter detection, DNS (192.168.0.2), and route config (192.168.0.0/24)
- Connect-PST-VPN-Standalone.ps1: Self-contained connection script for remote deployment
- Fix-PST-VPN-Auth.ps1: Authentication troubleshooting for CHAP/MSChapv2
- Diagnose-VPN-Interface.ps1: Comprehensive VPN interface and routing diagnostic
- Quick-Test-VPN.ps1: Fast connectivity verification (DNS/router/routes)
- Add-PST-VPN-Route-Manual.ps1: Manual route configuration helper
- vpn-connect.bat, vpn-disconnect.bat: Simple batch file shortcuts
- OpenVPN config files (Windows-compatible, abandoned for L2TP)

Key VPN Implementation Details:
- L2TP creates PPP adapter with connection name as interface description
- UniFi auto-configures DNS (192.168.0.2) but requires manual route to 192.168.0.0/24
- Split-tunnel enabled (only remote traffic through VPN)
- All-user connection for pre-login auto-connect via scheduled task
- Authentication: CHAP + MSChapv2 for UniFi compatibility

Agent Documentation:
- AGENT_QUICK_REFERENCE.md: Quick reference for all specialized agents
- documentation-squire.md: Documentation and task management specialist agent
- Updated all agent markdown files with standardized formatting

Project Organization:
- Moved conversation logs to dedicated directories (guru-connect-conversation-logs, guru-rmm-conversation-logs)
- Cleaned up old session JSONL files from projects/msp-tools/
- Added guru-connect infrastructure (agent, dashboard, proto, scripts, .gitea workflows)
- Added guru-rmm server components and deployment configs

Technical Notes:
- VPN IP pool: 192.168.4.x (client gets 192.168.4.6)
- Remote network: 192.168.0.0/24 (router at 192.168.0.10)
- PSK: rrClvnmUeXEFo90Ol+z7tfsAZHeSK6w7
- Credentials: pst-admin / 24Hearts$

Files: 15 VPN scripts, 2 agent docs, conversation log reorganization, guru-connect/guru-rmm infrastructure additions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 11:51:47 -07:00
parent b0a68d89bf
commit 6c316aa701
272 changed files with 37068 additions and 2 deletions
--- a/projects/msp-tools/guru-connect/TECHNICAL_DEBT.md
+++ b/projects/msp-tools/guru-connect/TECHNICAL_DEBT.md
@@ -0,0 +1,659 @@
+# GuruConnect - Technical Debt & Future Work Tracker
+
+**Last Updated:** 2026-01-18
+**Project Phase:** Phase 1 Complete (89%)
+
+---
+
+## Critical Items (Blocking Production Use)
+
+### 1. Gitea Actions Runner Registration
+**Status:** PENDING (requires admin access)
+**Priority:** HIGH
+**Effort:** 5 minutes
+**Tracked In:** PHASE1_WEEK3_COMPLETE.md line 181
+
+**Description:**
+Runner installed but not registered with Gitea instance. CI/CD pipeline is ready but not active.
+
+**Action Required:**
+```bash
+# Get token from: https://git.azcomputerguru.com/admin/actions/runners
+sudo -u gitea-runner act_runner register \
+  --instance https://git.azcomputerguru.com \
+  --token YOUR_REGISTRATION_TOKEN_HERE \
+  --name gururmm-runner \
+  --labels ubuntu-latest,ubuntu-22.04
+
+sudo systemctl enable gitea-runner
+sudo systemctl start gitea-runner
+```
+
+**Verification:**
+- Runner shows "Online" in Gitea admin panel
+- Test commit triggers build workflow
+
+---
+
+## High Priority Items (Security & Stability)
+
+### 2. TLS Certificate Auto-Renewal
+**Status:** NOT IMPLEMENTED
+**Priority:** HIGH
+**Effort:** 2-4 hours
+**Tracked In:** PHASE1_COMPLETE.md line 51
+
+**Description:**
+Let's Encrypt certificates need manual renewal. Should implement certbot auto-renewal.
+
+**Implementation:**
+```bash
+# Install certbot
+sudo apt install certbot python3-certbot-nginx
+
+# Obtain the certificate and let the nginx plugin install it
+sudo certbot --nginx -d connect.azcomputerguru.com
+
+# Set up automatic renewal (cron or systemd timer)
+sudo systemctl enable certbot.timer
+sudo systemctl start certbot.timer
+```
+
+**Verification:**
+- `sudo certbot renew --dry-run` succeeds
+- Certificate auto-renews before expiration
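+
+**Expiry Check (sketch):**
+The dry run confirms that renewal can work, but not that it has been happening. As a complement, the sketch below reports whether a certificate file expires within a given number of days, using `openssl x509 -checkend`. The function name and the 30-day window are illustrative assumptions; the path in the example is certbot's usual `live/` location for the domain above and should be confirmed on the server.
+
+```bash
+#!/usr/bin/env bash
+# Succeeds (exit 0) when the certificate expires within DAYS days.
+# An unreadable or missing file also counts as "expiring", so it gets flagged.
+cert_expires_within() {
+  local cert=$1 days=$2
+  # openssl exits 0 when the cert is still valid after the given seconds.
+  ! openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert" > /dev/null 2>&1
+}
+
+# Example (path is an assumption based on certbot's default layout):
+#   if cert_expires_within /etc/letsencrypt/live/connect.azcomputerguru.com/fullchain.pem 30; then
+#     echo "certificate expires in under 30 days; check certbot.timer"
+#   fi
+```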
+
+---
+
+### 3. Systemd Watchdog Implementation
+**Status:** PARTIALLY COMPLETED (issue fixed, proper implementation pending)
+**Priority:** MEDIUM
+**Effort:** 4-8 hours (remaining for sd_notify implementation)
+**Discovered:** 2026-01-18 (dashboard 502 error)
+**Issue Fixed:** 2026-01-18
+
+**Description:**
+Systemd watchdog was causing service crashes. Removed `WatchdogSec=30s` from service file to resolve immediate 502 error. Server now runs stably without watchdog configuration. Proper sd_notify watchdog support should still be implemented for automatic restart on hung processes.
+
+**Implementation:**
+1. Add `systemd` crate to server/Cargo.toml
+2. Send periodic `WATCHDOG=1` notifications via sd_notify from the main loop (systemd recommends a keep-alive at half the `WatchdogSec` interval)
+3. Re-enable `WatchdogSec=30s` in systemd service
+4. Test that service doesn't crash and watchdog works
+
+**Files to Modify:**
+- `server/Cargo.toml` - Add dependency
+- `server/src/main.rs` - Add watchdog notifications
+- `/etc/systemd/system/guruconnect.service` - Re-enable WatchdogSec
+
+**Benefits:**
+- Systemd can detect hung server process
+- Automatic restart on deadlock/hang conditions
+
+---
+
+### 4. Invalid Agent API Key Investigation
+**Status:** ONGOING ISSUE
+**Priority:** MEDIUM
+**Effort:** 1-2 hours
+**Discovered:** 2026-01-18
+
+**Description:**
+Agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) repeatedly attempts to connect with an invalid API key, retrying every 5 seconds.
+
+**Log Evidence:**
+```
+WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key
+```
+
+**Investigation Needed:**
+1. Identify which machine is 172.16.3.20
+2. Check agent configuration on that machine
+3. Update agent with correct API key OR remove agent
+4. Consider implementing rate limiting for failed auth attempts
+
+**Potential Impact:**
+- Fills logs with warnings
+- Wastes server resources processing invalid connections
+- May indicate misconfigured or rogue agent
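+
+**Triage Sketch:**
+To gauge how noisy the problem is, and whether other sources are also being rejected, the log lines can be tallied per source IP. This is a minimal sketch that assumes the log format shown under Log Evidence; the function name is hypothetical.
+
+```bash
+#!/usr/bin/env bash
+# Count "invalid API key" rejections per source IP from log lines on stdin.
+# Output: one "<count> <ip>" line per source, busiest first.
+count_rejections() {
+  awk '/Agent connection rejected:/ && /invalid API key/ {
+         # The IP is the field that follows the word "from".
+         for (i = 1; i < NF; i++) if ($i == "from") { hits[$(i + 1)]++; break }
+       }
+       END { for (ip in hits) print hits[ip], ip }' | sort -rn
+}
+
+# Example (assumes the unit is named guruconnect, as in the service file path):
+#   journalctl -u guruconnect --since "1 hour ago" | count_rejections
+```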
+
+---
+
+### 5. Comprehensive Security Audit Logging
+**Status:** PARTIALLY IMPLEMENTED
+**Priority:** MEDIUM
+**Effort:** 8-16 hours
+**Tracked In:** PHASE1_COMPLETE.md line 51
+
+**Description:**
+Current logging covers basic operations. Need comprehensive audit trail for security events.
+
+**Events to Track:**
+- All authentication attempts (success/failure)
+- Session creation/termination
+- Agent connections/disconnections
+- User account changes
+- Configuration changes
+- Administrative actions
+- File transfer operations (when implemented)
+
+**Implementation:**
+1. Create `audit_logs` table in database
+2. Implement `AuditLogger` service
+3. Add audit calls to all security-sensitive operations
+4. Create audit log viewer in dashboard
+5. Implement log retention policy
+
+**Files to Create/Modify:**
+- `server/migrations/XXX_create_audit_logs.sql`
+- `server/src/audit.rs` - Audit logging service
+- `server/src/api/audit.rs` - Audit log API endpoints
+- `server/static/audit.html` - Audit log viewer
+
+---
+
+### 6. Session Timeout Enforcement (UI-Side)
+**Status:** NOT IMPLEMENTED
+**Priority:** MEDIUM
+**Effort:** 2-4 hours
+**Tracked In:** PHASE1_COMPLETE.md line 51
+
+**Description:**
+JWT tokens expire after 24 hours (server-side), but UI doesn't detect/handle expiration gracefully.
+
+**Implementation:**
+1. Add token expiration check to dashboard JavaScript
+2. Implement automatic logout on token expiration
+3. Add session timeout warning (e.g., "Session expires in 5 minutes")
+4. Implement token refresh mechanism (optional)
+
+**Files to Modify:**
+- `server/static/dashboard.html` - Add expiration check
+- `server/static/viewer.html` - Add expiration check
+- `server/src/api/auth.rs` - Add token refresh endpoint (optional)
+
+**User Experience:**
+- User gets warned before automatic logout
+- Clear messaging: "Session expired, please log in again"
+- No confusing error messages on expired tokens
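+
+**Token Expiry Check (sketch):**
+The dashboard change itself belongs in the browser JavaScript, but the same check is handy from a terminal when debugging. The sketch below reads the `exp` claim from a token and reports the seconds remaining. It assumes a standard three-part JWT whose payload carries a numeric `exp`; the function names are hypothetical, and no signature verification is attempted.
+
+```bash
+#!/usr/bin/env bash
+# Print the numeric "exp" claim from a JWT (no signature verification).
+jwt_exp() {
+  local payload
+  # Second dot-separated part is the payload; map base64url to base64.
+  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
+  # Restore the padding that base64url strips.
+  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
+  printf '%s' "$payload" | base64 -d 2> /dev/null |
+    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
+}
+
+# Print seconds until expiry (negative once expired).
+# Optional second argument overrides "now" as a Unix timestamp.
+jwt_seconds_left() {
+  local exp now
+  exp=$(jwt_exp "$1")
+  [ -n "$exp" ] || return 1
+  now=${2:-$(date +%s)}
+  echo $(( exp - now ))
+}
+```
+
+A UI warning such as "Session expires in 5 minutes" corresponds to this value dropping below 300.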
+
+---
+
+## Medium Priority Items (Operational Excellence)
+
+### 7. Grafana Dashboard Import
+**Status:** NOT COMPLETED
+**Priority:** MEDIUM
+**Effort:** 15 minutes
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+Dashboard JSON file exists but not imported into Grafana.
+
+**Action Required:**
+1. Login to Grafana: http://172.16.3.30:3000
+2. Go to Dashboards > Import
+3. Upload `infrastructure/grafana-dashboard.json`
+4. Verify all panels display data
+
+**File Location:**
+- `infrastructure/grafana-dashboard.json`
+
+---
+
+### 8. Grafana Default Password Change
+**Status:** NOT CHANGED
+**Priority:** MEDIUM
+**Effort:** 2 minutes
+**Tracked In:** Multiple docs
+
+**Description:**
+Grafana still using default admin/admin credentials.
+
+**Action Required:**
+1. Login to Grafana: http://172.16.3.30:3000
+2. Change password from admin/admin to secure password
+3. Update documentation with new password
+
+**Security Risk:**
+- Low (internal network only, not exposed to internet)
+- But should follow security best practices
+
+---
+
+### 9. Deployment SSH Keys for Full Automation
+**Status:** NOT CONFIGURED
+**Priority:** MEDIUM
+**Effort:** 1-2 hours
+**Tracked In:** PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md
+
+**Description:**
+CI/CD deployment workflow ready but requires SSH key configuration for full automation.
+
+**Implementation:**
+```bash
+# Generate SSH key for runner
+sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"
+
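+# Add public key to guru's authorized_keys (tee runs as guru, so the append
+# does not depend on the invoking shell's write permissions)
+sudo cat /home/gitea-runner/.ssh/id_ed25519.pub | sudo -u guru tee -a ~guru/.ssh/authorized_keys > /dev/null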
+
+# Test SSH connection
+sudo -u gitea-runner ssh guru@172.16.3.30 whoami
+
+# Add secrets to Gitea repository settings
+# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
+# SSH_HOST - 172.16.3.30
+# SSH_USER - guru
+```
+
+**Current State:**
+- Manual deployment works via deploy.sh
+- Automated deployment via workflow will fail on SSH step
+
+---
+
+### 10. Backup Offsite Sync
+**Status:** NOT IMPLEMENTED
+**Priority:** MEDIUM
+**Effort:** 4-8 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+Daily backups stored locally but not synced offsite. Risk of data loss if server fails.
+
+**Implementation Options:**
+
+**Option A: Rsync to Remote Server**
+```bash
+# Add to backup script
+rsync -avz /home/guru/backups/guruconnect/ \
+  backup-server:/backups/gururmm/guruconnect/
+```
+
+**Option B: Cloud Storage (S3, Azure Blob, etc.)**
+```bash
+# Install rclone
+sudo apt install rclone
+
+# Configure cloud provider
+rclone config
+
+# Sync backups
+rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/
+```
+
+**Considerations:**
+- Encryption for backups in transit
+- Retention policy on remote storage
+- Cost of cloud storage
+- Bandwidth usage
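+
+**Retention Sketch:**
+For the retention consideration, one simple approach is to prune files older than a fixed age on whichever host holds the copies. This is a sketch only: the function name and the 30-day figure are assumptions, and for an rsync target it would run on the backup server rather than here.
+
+```bash
+#!/usr/bin/env bash
+# Delete regular files under DIR last modified more than DAYS days ago.
+# find counts age in whole 24-hour periods, so the cutoff is approximate.
+# Prints each file it removes.
+prune_old_backups() {
+  local dir=$1 days=$2
+  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
+  find "$dir" -type f -mtime +"$days" -print -delete
+}
+
+# Example (30 days is a placeholder, not a decided policy):
+#   prune_old_backups /backups/gururmm/guruconnect 30
+```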
+
+---
+
+### 11. Alertmanager for Prometheus
+**Status:** NOT CONFIGURED
+**Priority:** MEDIUM
+**Effort:** 4-8 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+Prometheus collects metrics but no alerting configured. Should notify on issues.
+
+**Alerts to Configure:**
+- Service down
+- High error rate
+- Database connection failures
+- Disk space low
+- High CPU/memory usage
+- Failed authentication spike
+
+**Implementation:**
+```bash
+# Install Alertmanager
+sudo apt install prometheus-alertmanager
+
+# Configure alert rules
+sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
+groups:
+  - name: guruconnect
+    rules:
+      - alert: ServiceDown
+        expr: up{job="guruconnect"} == 0
+        for: 1m
+        annotations:
+          summary: "GuruConnect service is down"
+
+      - alert: HighErrorRate
+        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
+        for: 5m
+        annotations:
+          summary: "High error rate detected"
+EOF
+
+# Configure notification channels (email, Slack, etc.)
+```
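+
+**Notification Channel (sketch):**
+The last step above is left open. As one possible shape, the sketch below writes a minimal Alertmanager configuration that routes every alert to a single webhook receiver. The receiver name, webhook URL, and output path are all placeholders; email or Slack receivers would use their own receiver blocks instead.
+
+```bash
+#!/usr/bin/env bash
+# Write a minimal Alertmanager config that sends all alerts to one webhook.
+# Arguments: output path, webhook URL (both hypothetical placeholders).
+write_alertmanager_config() {
+  local out=$1 url=$2
+  cat > "$out" << EOF
+route:
+  receiver: default
+  group_by: ['alertname']
+receivers:
+  - name: default
+    webhook_configs:
+      - url: '$url'
+EOF
+}
+
+# Example: write to a scratch file first, then review before installing.
+#   write_alertmanager_config /tmp/alertmanager.yml http://127.0.0.1:5001/alerts
+```
+
+Note that Prometheus only evaluates the rule file once `prometheus.yml` lists it under `rule_files`, and only forwards alerts once an `alerting` section points at the Alertmanager instance.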
+
+---
+
+### 12. CI/CD Notification Webhooks
+**Status:** NOT CONFIGURED
+**Priority:** LOW
+**Effort:** 2-4 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+No notifications when builds fail or deployments complete.
+
+**Implementation:**
+1. Configure webhook in Gitea repository settings
+2. Point to Slack/Discord/Email service
+3. Select events: Push, Pull Request, Release
+4. Test notifications
+
+**Events to Notify:**
+- Build started
+- Build failed
+- Build succeeded
+- Deployment started
+- Deployment completed
+- Deployment failed
+
+---
+
+## Low Priority Items (Future Enhancements)
+
+### 13. Windows Runner for Native Agent Builds
+**Status:** NOT IMPLEMENTED
+**Priority:** LOW
+**Effort:** 8-16 hours
+**Tracked In:** PHASE1_WEEK3_COMPLETE.md
+
+**Description:**
+Currently cross-compiling Windows agent from Linux. Native Windows builds would be faster and more reliable.
+
+**Implementation:**
+1. Set up Windows server/VM
+2. Install Gitea Actions runner on Windows
+3. Configure runner with windows-latest label
+4. Update build workflow to use Windows runner for agent builds
+
+**Benefits:**
+- Faster agent builds (no cross-compilation)
+- More accurate Windows testing
+- Ability to run Windows-specific tests
+
+**Cost:**
+- Windows Server license (or Windows 10/11 Pro)
+- Additional hardware/VM resources
+
+---
+
+### 14. Staging Environment
+**Status:** NOT IMPLEMENTED
+**Priority:** LOW
+**Effort:** 16-32 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+All changes deploy directly to production. Should have staging environment for testing.
+
+**Implementation:**
+1. Set up staging server (VM or separate port)
+2. Configure separate database for staging
+3. Update CI/CD workflows:
+   - Push to develop → Deploy to staging
+   - Push tag → Deploy to production
+4. Add smoke tests for staging
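+
+**Routing Sketch:**
+The branch/tag rule in step 3 can be expressed as a small helper that a deploy step calls with the pushed ref. This sketch assumes the workflow exposes the ref in the usual `refs/heads/...` and `refs/tags/...` form; the function name and the `none` fallback are illustrative.
+
+```bash
+#!/usr/bin/env bash
+# Map a pushed git ref to a deployment target.
+deploy_target() {
+  case "$1" in
+    refs/heads/develop) echo staging ;;     # push to develop
+    refs/tags/*)        echo production ;;  # any pushed tag
+    *)                  echo none ;;        # everything else: build only
+  esac
+}
+
+# Example:
+#   target=$(deploy_target "$GITHUB_REF")
+#   [ "$target" = none ] || ./deploy.sh "$target"   # deploy.sh arguments are hypothetical
+```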
+
+**Benefits:**
+- Test deployments before production
+- QA environment for testing
+- Reduced production downtime
+
+---
+
+### 15. Code Coverage Thresholds
+**Status:** NOT ENFORCED
+**Priority:** LOW
+**Effort:** 2-4 hours
+**Tracked In:** Multiple docs
+
+**Description:**
+Code coverage collected but no minimum threshold enforced.
+
+**Implementation:**
+1. Analyze current coverage baseline
+2. Set reasonable thresholds (e.g., 70% overall)
+3. Update test workflow to fail if below threshold
+4. Add coverage badge to README
+
+**Files to Modify:**
+- `.gitea/workflows/test.yml` - Add threshold check
+- `README.md` - Add coverage badge
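+
+**Threshold Check (sketch):**
+How the coverage percentage is obtained depends on the coverage tool in use, which this document does not name. Once the workflow has the number, the gate itself is a single comparison. The function name and the 70% figure (taken from the example above) are placeholders.
+
+```bash
+#!/usr/bin/env bash
+# Succeed when ACTUAL coverage (percent) is at least MIN; fail otherwise.
+# awk handles the decimal comparison that plain [ ] cannot.
+coverage_ok() {
+  awk -v actual="$1" -v min="$2" 'BEGIN { exit !(actual + 0 >= min + 0) }'
+}
+
+# Example for a workflow step (COVERAGE is assumed to be set by an earlier step):
+#   coverage_ok "$COVERAGE" 70 || { echo "coverage $COVERAGE% is below 70%"; exit 1; }
+```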
+
+---
+
+### 16. Performance Benchmarking in CI
+**Status:** NOT IMPLEMENTED
+**Priority:** LOW
+**Effort:** 8-16 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+No automated performance testing. Risk of performance regression.
+
+**Implementation:**
+1. Create performance benchmarks using `criterion`
+2. Add benchmark job to CI workflow
+3. Track performance trends over time
+4. Alert on performance regression (>10% slower)
+
+**Benchmarks to Add:**
+- WebSocket message throughput
+- Authentication latency
+- Database query performance
+- Screen capture encoding speed
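+
+**Regression Gate (sketch):**
+The ">10% slower" rule can be checked by comparing a stored baseline timing with the current one. The sketch assumes both values are in the same unit (for example nanoseconds per iteration) and have already been extracted from the benchmark output; the function name is hypothetical.
+
+```bash
+#!/usr/bin/env bash
+# Succeed when CURRENT is more than PCT percent slower than BASELINE.
+# PCT defaults to 10, matching the threshold above.
+is_regression() {
+  awk -v base="$1" -v cur="$2" -v pct="${3:-10}" \
+    'BEGIN { exit !(cur > base * (1 + pct / 100)) }'
+}
+
+# Example:
+#   is_regression "$BASELINE_NS" "$CURRENT_NS" && echo "performance regression detected"
+```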
+
+---
+
+### 17. Database Replication
+**Status:** NOT IMPLEMENTED
+**Priority:** LOW
+**Effort:** 16-32 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+Single database instance. No high availability or read scaling.
+
+**Implementation:**
+1. Set up PostgreSQL streaming replication
+2. Configure automatic failover (pg_auto_failover)
+3. Update application to use read replicas
+4. Test failover scenarios
+
+**Benefits:**
+- High availability
+- Read scaling
+- Faster backups (from replica)
+
+**Complexity:**
+- Significant operational overhead
+- Monitoring and alerting needed
+- Failover testing required
+
+---
+
+### 18. Centralized Logging (ELK Stack)
+**Status:** NOT IMPLEMENTED
+**Priority:** LOW
+**Effort:** 16-32 hours
+**Tracked In:** PHASE1_COMPLETE.md
+
+**Description:**
+Logs stored in systemd journal. Hard to search across time periods.
+
+**Implementation:**
+1. Install Elasticsearch, Logstash, Kibana
+2. Configure log shipping from systemd journal
+3. Create Kibana dashboards
+4. Set up log retention policy
+
+**Benefits:**
+- Powerful log search
+- Log aggregation across services
+- Visual log analysis
+
+**Cost:**
+- Significant resource usage (RAM for Elasticsearch)
+- Operational complexity
+
+---
+
+## Discovered Issues (Need Investigation)
+
+### 19. Agent Connection Retry Logic
+**Status:** NEEDS REVIEW
+**Priority:** LOW
+**Effort:** 2-4 hours
+**Discovered:** 2026-01-18
+
+**Description:**
+Agent at 172.16.3.20 retries every 5 seconds with invalid API key. Should implement exponential backoff or rate limiting.
+
+**Investigation:**
+1. Check agent retry logic in codebase
+2. Determine if 5-second retry is intentional
+3. Consider exponential backoff for failed auth
+4. Add server-side rate limiting for repeated failures
+
+**Files to Review:**
+- `agent/src/transport/` - WebSocket connection logic
+- `server/src/relay/` - Rate limiting for auth failures
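+
+**Backoff Sketch:**
+For reference, capped exponential backoff doubles the delay after each failed attempt until it reaches a ceiling. The agent would implement this in its own code; shell is used here only to show the arithmetic. The 5-second base matches the observed retry interval, while the 300-second cap is an assumed value.
+
+```bash
+#!/usr/bin/env bash
+# Print the delay in seconds before retry number ATTEMPT (0-based).
+# delay = BASE * 2^ATTEMPT, capped at CAP.
+backoff_delay() {
+  local attempt=$1 base=${2:-5} cap=${3:-300}
+  local delay=$base i
+  for (( i = 0; i < attempt; i++ )); do
+    delay=$(( delay * 2 ))
+    if (( delay >= cap )); then delay=$cap; break; fi
+  done
+  echo "$delay"
+}
+
+# Example: the first retries wait 5, 10, 20 and 40 seconds.
+#   for n in 0 1 2 3; do backoff_delay "$n"; done
+```
+
+With these values an agent holding a bad key settles at one attempt every 5 minutes instead of one every 5 seconds.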
+
+---
+
+### 20. Database Connection Pool Sizing
+**Status:** NEEDS MONITORING
+**Priority:** LOW
+**Effort:** 2-4 hours
+**Discovered:** During infrastructure setup
+
+**Description:**
+Default connection pool settings may not be optimal. Need to monitor under load.
+
+**Monitoring:**
+- Check `db_connections_active` metric in Prometheus
+- Monitor for pool exhaustion warnings
+- Track query latency
+
+**Tuning:**
+- Adjust `max_connections` in PostgreSQL config
+- Adjust pool size in server .env file
+- Monitor and iterate
+
+---
+
+## Completed Items (For Reference)
+
+### ✓ Systemd Service Configuration
+**Completed:** 2026-01-17
+**Phase:** Phase 1 Week 2
+
+### ✓ Prometheus Metrics Integration
+**Completed:** 2026-01-17
+**Phase:** Phase 1 Week 2
+
+### ✓ Grafana Dashboard Setup
+**Completed:** 2026-01-17
+**Phase:** Phase 1 Week 2
+
+### ✓ Automated Backup System
+**Completed:** 2026-01-17
+**Phase:** Phase 1 Week 2
+
+### ✓ Log Rotation Configuration
+**Completed:** 2026-01-17
+**Phase:** Phase 1 Week 2
+
+### ✓ CI/CD Workflows Created
+**Completed:** 2026-01-18
+**Phase:** Phase 1 Week 3
+
+### ✓ Deployment Automation Script
+**Completed:** 2026-01-18
+**Phase:** Phase 1 Week 3
+
+### ✓ Version Tagging Automation
+**Completed:** 2026-01-18
+**Phase:** Phase 1 Week 3
+
+### ✓ Gitea Actions Runner Installation
+**Completed:** 2026-01-18
+**Phase:** Phase 1 Week 3
+
+### ✓ Systemd Watchdog Issue Fixed (Partial Completion)
+**Completed:** 2026-01-18
+**What Was Done:** Removed `WatchdogSec=30s` from systemd service file
+**Result:** Resolved immediate 502 error; server now runs stably
+**Status:** Issue fixed but full implementation (sd_notify) still pending
+**Item Reference:** Item #3 (full sd_notify implementation remains as future work)
+**Impact:** Production server is now stable and responding correctly
+
+---
+
+## Summary by Priority
+
+**Critical (1 item):**
+1. Gitea Actions runner registration
+
+**High (4 items):**
+2. TLS certificate auto-renewal
+4. Invalid agent API key investigation
+5. Comprehensive security audit logging
+6. Session timeout enforcement
+
+**High - Partial/Pending (1 item):**
+3. Systemd watchdog implementation (issue fixed; sd_notify implementation pending)
+
+**Medium (6 items):**
+7. Grafana dashboard import
+8. Grafana password change
+9. Deployment SSH keys
+10. Backup offsite sync
+11. Alertmanager for Prometheus
+12. CI/CD notification webhooks
+
+**Low (8 items):**
+13. Windows runner for agent builds
+14. Staging environment
+15. Code coverage thresholds
+16. Performance benchmarking
+17. Database replication
+18. Centralized logging (ELK)
+19. Agent retry logic review
+20. Database pool sizing monitoring
+
+---
+
+## Tracking Notes
+
+**How to Use This Document:**
+1. Before starting new work, review this list
+2. When discovering new issues, add them here
+3. When completing items, move to "Completed Items" section
+4. Prioritize based on: Security > Stability > Operations > Features
+5. Update status and dates as work progresses
+
+**Related Documents:**
+- `PHASE1_COMPLETE.md` - Overall Phase 1 status
+- `PHASE1_WEEK3_COMPLETE.md` - CI/CD specific items
+- `CI_CD_SETUP.md` - CI/CD documentation
+- `INFRASTRUCTURE_STATUS.md` - Infrastructure status
+
+---
+
+**Document Version:** 1.1
+**Items Tracked:** 20 (1 critical, 4 high, 1 high-partial, 6 medium, 8 low)
+**Last Updated:** 2026-01-18 (Item #3 marked as partial completion)
+**Next Review:** Before Phase 2 planning