Mike Swanson 6c316aa701 Add VPN configuration tools and agent documentation

2026-01-18 11:51:47 -07:00

GuruConnect - Technical Debt & Future Work Tracker

Last Updated: 2026-01-18
Project Phase: Phase 1 Complete (89%)

Critical Items (Blocking Production Use)

1. Gitea Actions Runner Registration

Status: PENDING (requires admin access)
Priority: HIGH
Effort: 5 minutes
Tracked In: PHASE1_WEEK3_COMPLETE.md line 181

Description: The runner is installed but not yet registered with the Gitea instance, so the CI/CD pipeline is configured but not active.

Action Required:

# Get token from: https://git.azcomputerguru.com/admin/actions/runners
sudo -u gitea-runner act_runner register \
  --instance https://git.azcomputerguru.com \
  --token YOUR_REGISTRATION_TOKEN_HERE \
  --name gururmm-runner \
  --labels ubuntu-latest,ubuntu-22.04

sudo systemctl enable gitea-runner
sudo systemctl start gitea-runner

Verification:

Runner shows "Online" in Gitea admin panel
Test commit triggers build workflow

High Priority Items (Security & Stability)

2. TLS Certificate Auto-Renewal

Status: NOT IMPLEMENTED
Priority: HIGH
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: Let's Encrypt certificates are currently renewed manually. Automatic renewal with certbot should be set up so that a missed renewal cannot take the site offline.

Implementation:

# Install certbot and its nginx plugin
sudo apt install certbot python3-certbot-nginx

# Obtain a certificate and let certbot update the nginx configuration
sudo certbot --nginx -d connect.azcomputerguru.com

# Enable the systemd timer that runs "certbot renew" automatically
sudo systemctl enable certbot.timer
sudo systemctl start certbot.timer

Verification:

sudo certbot renew --dry-run succeeds
Certificate auto-renews before expiration
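
The second verification point can be checked without waiting for a real renewal by measuring how many days the served certificate has left. The following is a minimal sketch, assuming GNU `date` and the `openssl` CLI are installed; the function names are hypothetical and not part of the repository.

```shell
# Hypothetical helpers for checking certificate lifetime.

# Whole days between two Unix timestamps (expiry, now).
days_left() {
  local expiry_epoch="$1" now_epoch="$2"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Print the days remaining on the certificate served by host:443.
cert_days_left() {
  local host="$1" not_after
  # openssl prints a line of the form "notAfter=<date> GMT".
  not_after=$(openssl s_client -connect "${host}:443" -servername "${host}" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_left "$(date -d "${not_after}" +%s)" "$(date +%s)"
}

# Example (needs network access):
#   cert_days_left connect.azcomputerguru.com
```

If the number keeps falling toward zero instead of jumping back up after a renewal window, the timer is probably not doing its job.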

3. Systemd Watchdog Implementation

Status: PARTIALLY COMPLETED (issue fixed, proper implementation pending)
Priority: MEDIUM
Effort: 4-8 hours (remaining for sd_notify implementation)
Discovered: 2026-01-18 (dashboard 502 error)
Issue Fixed: 2026-01-18

Description: The systemd watchdog was causing service crashes. With WatchdogSec set, systemd expects the service to send periodic WATCHDOG=1 notifications and terminates it when they stop arriving; the server does not send them yet. Removing WatchdogSec=30s from the service file resolved the immediate 502 error, and the server now runs stably without a watchdog. Proper sd_notify support should still be implemented so that systemd can restart a hung process automatically.

Implementation:

Add systemd crate to server/Cargo.toml
Implement sd_notify_watchdog() calls in main loop
Re-enable WatchdogSec=30s in systemd service
Test that service doesn't crash and watchdog works

Files to Modify:

server/Cargo.toml - Add dependency
server/src/main.rs - Add watchdog notifications
/etc/systemd/system/guruconnect.service - Re-enable WatchdogSec

Benefits:

Systemd can detect hung server process
Automatic restart on deadlock/hang conditions
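
When the watchdog is re-enabled, systemd passes the timeout to the service in the `WATCHDOG_USEC` environment variable, and the `sd_watchdog_enabled` documentation recommends sending the keep-alive at half that interval. The arithmetic is sketched below in shell purely for illustration; the real notification calls belong in `server/src/main.rs`, and `watchdog_ping_ms` is a hypothetical name.

```shell
# Print the keep-alive interval in milliseconds for a watchdog timeout
# given in microseconds (the unit systemd uses in WATCHDOG_USEC).
watchdog_ping_ms() {
  local timeout_usec="$1"
  echo $(( timeout_usec / 2 / 1000 ))
}

# WatchdogSec=30s corresponds to WATCHDOG_USEC=30000000:
#   watchdog_ping_ms 30000000
```

Pinging at half the timeout leaves room for one delayed notification before systemd considers the process hung.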

4. Invalid Agent API Key Investigation

Status: ONGOING ISSUE
Priority: MEDIUM
Effort: 1-2 hours
Discovered: 2026-01-18

Description: The agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) reconnects every 5 seconds with an invalid API key.

Log Evidence:

WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key

Investigation Needed:

Identify which machine is 172.16.3.20
Check agent configuration on that machine
Update agent with correct API key OR remove agent
Consider implementing rate limiting for failed auth attempts

Potential Impact:

Fills logs with warnings
Wastes server resources processing invalid connections
May indicate misconfigured or rogue agent
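
To gauge how much log noise this produces and whether other agents are affected, the rejections can be counted per source. The sketch below assumes the log line format shown under Log Evidence and that the server logs to the journal under a unit named `guruconnect` (inferred from the service file path); `count_rejections` is a hypothetical helper.

```shell
# Read log lines on stdin and print "<count> <ip> <machine-id>" for every
# agent rejected with an invalid API key, most frequent first.
count_rejections() {
  awk '/Agent connection rejected:/ && /invalid API key/ {
         for (i = 1; i <= NF; i++) {
           # Fields after "rejected:" are: <machine-id> from <ip>
           if ($i == "rejected:") { key = $(i + 3) " " $(i + 1); counts[key]++ }
         }
       }
       END { for (k in counts) print counts[k], k }' | sort -rn
}

# Example:
#   journalctl -u guruconnect --since "1 hour ago" | count_rejections
```

At one attempt every 5 seconds, a single misconfigured agent would be expected to show up with roughly 720 rejections per hour.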

5. Comprehensive Security Audit Logging

Status: PARTIALLY IMPLEMENTED
Priority: MEDIUM
Effort: 8-16 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: Current logging covers basic operations. Need comprehensive audit trail for security events.

Events to Track:

All authentication attempts (success/failure)
Session creation/termination
Agent connections/disconnections
User account changes
Configuration changes
Administrative actions
File transfer operations (when implemented)

Implementation:

Create audit_logs table in database
Implement AuditLogger service
Add audit calls to all security-sensitive operations
Create audit log viewer in dashboard
Implement log retention policy

Files to Create/Modify:

server/migrations/XXX_create_audit_logs.sql
server/src/audit.rs - Audit logging service
server/src/api/audit.rs - Audit log API endpoints
server/static/audit.html - Audit log viewer

6. Session Timeout Enforcement (UI-Side)

Status: NOT IMPLEMENTED
Priority: MEDIUM
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: JWT tokens expire after 24 hours (enforced server-side), but the dashboard UI does not detect or handle expiration gracefully.

Implementation:

Add token expiration check to dashboard JavaScript
Implement automatic logout on token expiration
Add session timeout warning (e.g., "Session expires in 5 minutes")
Implement token refresh mechanism (optional)

Files to Modify:

server/static/dashboard.html - Add expiration check
server/static/viewer.html - Add expiration check
server/src/api/auth.rs - Add token refresh endpoint (optional)

User Experience:

User gets warned before automatic logout
Clear messaging: "Session expired, please log in again"
No confusing error messages on expired tokens
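
The expiration check itself is small: read the `exp` claim from the token and compare it with the current time. The dashboard would implement this in JavaScript; the shell sketch below shows the same logic and is handy for inspecting from the command line what the server actually issues. It assumes the token carries a numeric `exp` claim, does not verify the signature, and uses hypothetical function names and a 5-minute warning window taken from the example message above.

```shell
# Extract the numeric "exp" claim (Unix seconds) from a JWT.
# The signature is NOT verified; this is for inspection only.
jwt_exp() {
  local payload
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d \
    | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Classify a session as "expired", "warn" (5 minutes or less left), or "ok".
session_state() {
  local exp="$1" now="$2" warn_window=300
  if [ "$now" -ge "$exp" ]; then
    echo expired
  elif [ $(( exp - now )) -le "$warn_window" ]; then
    echo warn
  else
    echo ok
  fi
}

# Example:
#   session_state "$(jwt_exp "$TOKEN")" "$(date +%s)"
```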

Medium Priority Items (Operational Excellence)

7. Grafana Dashboard Import

Status: NOT COMPLETED
Priority: MEDIUM
Effort: 15 minutes
Tracked In: PHASE1_COMPLETE.md

Description: The dashboard JSON file exists but has not been imported into Grafana.

Action Required:

Login to Grafana: http://172.16.3.30:3000
Go to Dashboards > Import
Upload infrastructure/grafana-dashboard.json
Verify all panels display data

File Location:

infrastructure/grafana-dashboard.json

8. Grafana Default Password Change

Status: NOT CHANGED
Priority: MEDIUM
Effort: 2 minutes
Tracked In: Multiple docs

Description: Grafana is still using the default admin/admin credentials.

Action Required:

Login to Grafana: http://172.16.3.30:3000
Change password from admin/admin to secure password
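Record the new password in the team's credential store and update documentation to point there (avoid writing the password itself into docs)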

Security Risk:

Low (internal network only, not exposed to internet)
But should follow security best practices
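
Whether the default credentials are still active can be checked with a single authenticated API request. A minimal sketch, assuming Grafana's HTTP basic authentication is enabled (its default) and that `curl` is installed; the function name is hypothetical.

```shell
# Return success (0) if Grafana at the given base URL still accepts admin/admin.
grafana_default_login_works() {
  local base_url="$1" code
  # /api/user returns 200 for valid credentials and 401 otherwise.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -u admin:admin "${base_url}/api/user")
  [ "$code" = "200" ]
}

# Example:
#   if grafana_default_login_works http://172.16.3.30:3000; then
#     echo "Default Grafana credentials are still active"
#   fi
```

Running this after the password change should report nothing, which confirms the item can be closed.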

9. Deployment SSH Keys for Full Automation

Status: NOT CONFIGURED
Priority: MEDIUM
Effort: 1-2 hours
Tracked In: PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md

Description: CI/CD deployment workflow ready but requires SSH key configuration for full automation.

Implementation:

# Generate SSH key for runner
sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"

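# Add public key to guru's authorized_keys
# (a plain ">>" redirect runs as the invoking user, not under sudo, so write the file with tee as guru)
sudo -u gitea-runner cat /home/gitea-runner/.ssh/id_ed25519.pub | sudo -u guru tee -a ~guru/.ssh/authorized_keys > /dev/null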

# Test SSH connection
sudo -u gitea-runner ssh guru@172.16.3.30 whoami

# Add secrets to Gitea repository settings
# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
# SSH_HOST - 172.16.3.30
# SSH_USER - guru

Current State:

Manual deployment works via deploy.sh
Automated deployment via workflow will fail on SSH step

10. Backup Offsite Sync

Status: NOT IMPLEMENTED
Priority: MEDIUM
Effort: 4-8 hours
Tracked In: PHASE1_COMPLETE.md

Description: Daily backups stored locally but not synced offsite. Risk of data loss if server fails.

Implementation Options:

Option A: Rsync to Remote Server

# Add to backup script
rsync -avz /home/guru/backups/guruconnect/ \
  backup-server:/backups/gururmm/guruconnect/

Option B: Cloud Storage (S3, Azure Blob, etc.)

# Install rclone
sudo apt install rclone

# Configure cloud provider
rclone config

# Sync backups (sync makes the remote match the source, so files deleted locally are also deleted remotely)
rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/

Considerations:

Encryption for backups in transit
Retention policy on remote storage
Cost of cloud storage
Bandwidth usage

11. Alertmanager for Prometheus

Status: NOT CONFIGURED
Priority: MEDIUM
Effort: 4-8 hours
Tracked In: PHASE1_COMPLETE.md

Description: Prometheus collects metrics but no alerting configured. Should notify on issues.

Alerts to Configure:

Service down
High error rate
Database connection failures
Disk space low
High CPU/memory usage
Failed authentication spike

Implementation:

# Install Alertmanager
sudo apt install prometheus-alertmanager

# Configure alert rules
sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: ServiceDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        annotations:
          summary: "GuruConnect service is down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
EOF

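# Load the rules by listing /etc/prometheus/alert.rules.yml under rule_files in prometheus.yml,
# then configure notification channels (email, Slack, etc.) in Alertmanager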

12. CI/CD Notification Webhooks

Status: NOT CONFIGURED
Priority: LOW
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md

Description: No notifications when builds fail or deployments complete.

Implementation:

Configure webhook in Gitea repository settings
Point to Slack/Discord/Email service
Select events: Push, Pull Request, Release
Test notifications

Events to Notify:

Build started
Build failed
Build succeeded
Deployment started
Deployment completed
Deployment failed
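
The events listed above are build and deployment outcomes, so one option is to post them from a workflow step rather than relying only on repository webhooks. The sketch below builds a JSON body of the `{"text": ...}` form that Slack incoming webhooks accept; `notify_payload` and the `WEBHOOK_URL` variable are assumptions, and other services expect a different field name.

```shell
# Build a minimal JSON notification body for a build or deployment event.
notify_payload() {
  local status="$1" ref="$2"
  local msg="GuruConnect build ${status} for ${ref}"
  # Escape backslashes and double quotes so the message stays valid JSON.
  msg=${msg//\\/\\\\}
  msg=${msg//\"/\\\"}
  printf '{"text":"%s"}' "$msg"
}

# Example workflow step (WEBHOOK_URL would come from a repository secret):
#   curl -sf -X POST -H 'Content-Type: application/json' \
#     -d "$(notify_payload failed main)" "$WEBHOOK_URL"
```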

Low Priority Items (Future Enhancements)

13. Windows Runner for Native Agent Builds

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 8-16 hours
Tracked In: PHASE1_WEEK3_COMPLETE.md

Description: Currently cross-compiling Windows agent from Linux. Native Windows builds would be faster and more reliable.

Implementation:

Set up Windows server/VM
Install Gitea Actions runner on Windows
Configure runner with windows-latest label
Update build workflow to use Windows runner for agent builds

Benefits:

Faster agent builds (no cross-compilation)
More accurate Windows testing
Ability to run Windows-specific tests

Cost:

Windows Server license (or Windows 10/11 Pro)
Additional hardware/VM resources

14. Staging Environment

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: All changes deploy directly to production. Should have staging environment for testing.

Implementation:

Set up staging server (VM or separate port)
Configure separate database for staging
Update CI/CD workflows:
- Push to develop → Deploy to staging
- Push tag → Deploy to production
Add smoke tests for staging

Benefits:

Test deployments before production
QA environment for testing
Reduced production downtime

15. Code Coverage Thresholds

Status: NOT ENFORCED
Priority: LOW
Effort: 2-4 hours
Tracked In: Multiple docs

Description: Code coverage collected but no minimum threshold enforced.

Implementation:

Analyze current coverage baseline
Set reasonable thresholds (e.g., 70% overall)
Update test workflow to fail if below threshold
Add coverage badge to README

Files to Modify:

.gitea/workflows/test.yml - Add threshold check
README.md - Add coverage badge
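
The threshold check in the test workflow can be a single comparison once the coverage percentage is available. This sketch assumes an earlier workflow step has already extracted the percentage into a variable; how coverage is measured is not specified here, and `coverage_ok` is a hypothetical helper.

```shell
# Succeed when the measured coverage percentage meets the threshold.
# awk does the comparison because shell arithmetic is integer-only.
coverage_ok() {
  local actual="$1" threshold="$2"
  awk -v a="$actual" -v t="$threshold" 'BEGIN { exit !(a + 0 >= t + 0) }'
}

# Example workflow step:
#   coverage_ok "$COVERAGE_PERCENT" 70 || { echo "Coverage below 70%"; exit 1; }
```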

16. Performance Benchmarking in CI

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 8-16 hours
Tracked In: PHASE1_COMPLETE.md

Description: No automated performance testing. Risk of performance regression.

Implementation:

Create performance benchmarks using criterion
Add benchmark job to CI workflow
Track performance trends over time
Alert on performance regression (>10% slower)

Benchmarks to Add:

WebSocket message throughput
Authentication latency
Database query performance
Screen capture encoding speed
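
The ">10% slower" rule reduces to comparing a new timing against a stored baseline. A minimal sketch, assuming the CI job can extract one representative timing per benchmark from the benchmark output; the helper name and the idea of a stored baseline value are assumptions.

```shell
# Succeed (exit 0) when `current` is more than `limit_pct` percent slower
# than `baseline`. Both timings must use the same unit.
is_regression() {
  local baseline="$1" current="$2" limit_pct="${3:-10}"
  awk -v b="$baseline" -v c="$current" -v l="$limit_pct" \
    'BEGIN { exit !(c > b * (1 + l / 100)) }'
}

# Example:
#   if is_regression "$BASELINE_NS" "$CURRENT_NS"; then
#     echo "Performance regression detected"; exit 1
#   fi
```

Benchmark timings on shared CI runners tend to be noisy, so the limit may need to be looser than 10% in practice.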

17. Database Replication

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: Single database instance. No high availability or read scaling.

Implementation:

Set up PostgreSQL streaming replication
Configure automatic failover (pg_auto_failover)
Update application to use read replicas
Test failover scenarios

Benefits:

High availability
Read scaling
Faster backups (from replica)

Complexity:

Significant operational overhead
Monitoring and alerting needed
Failover testing required

18. Centralized Logging (ELK Stack)

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: Logs stored in systemd journal. Hard to search across time periods.

Implementation:

Install Elasticsearch, Logstash, Kibana
Configure log shipping from systemd journal
Create Kibana dashboards
Set up log retention policy

Benefits:

Powerful log search
Log aggregation across services
Visual log analysis

Cost:

Significant resource usage (RAM for Elasticsearch)
Operational complexity

Discovered Issues (Need Investigation)

19. Agent Connection Retry Logic

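Status: NEEDS REVIEW
Priority: LOW
Effort: 2-4 hours
Discovered: 2026-01-18

Description: The agent at 172.16.3.20 (see item 4) retries every 5 seconds after being rejected for an invalid API key. The agent should back off exponentially after failed authentication, and the server should rate-limit repeated failures.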

Investigation:

Check agent retry logic in codebase
Determine if 5-second retry is intentional
Consider exponential backoff for failed auth
Add server-side rate limiting for repeated failures

Files to Review:

agent/src/transport/ - WebSocket connection logic
server/src/relay/ - Rate limiting for auth failures
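
Exponential backoff replaces the fixed 5-second retry with a delay that doubles after each failure up to a ceiling. The sketch below shows only the delay calculation, in shell for illustration; the agent would implement the equivalent in its own code. The 5-second base matches the observed retry interval, while the 300-second cap and the function name are assumptions.

```shell
# Delay in seconds before retry number `attempt` (0-based):
# base * 2^attempt, capped at `cap`.
backoff_delay() {
  local attempt="$1" base="${2:-5}" cap="${3:-300}"
  # Clamp the exponent so the shift cannot overflow shell arithmetic.
  if [ "$attempt" -gt 20 ]; then attempt=20; fi
  local delay=$(( base * (1 << attempt) ))
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# Example: delays for the first seven attempts
#   for n in 0 1 2 3 4 5 6; do backoff_delay "$n"; done
```

With these values an agent holding a bad key settles at one attempt every five minutes instead of one every five seconds. Adding a small random jitter to each delay is a common refinement when many agents may retry at once.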

20. Database Connection Pool Sizing

Status: NEEDS MONITORING
Priority: LOW
Effort: 2-4 hours
Discovered: During infrastructure setup

Description: Default connection pool settings may not be optimal. Need to monitor under load.

Monitoring:

Check db_connections_active metric in Prometheus
Monitor for pool exhaustion warnings
Track query latency

Tuning:

Adjust max_connections in PostgreSQL config
Adjust pool size in server .env file
Monitor and iterate