GuruConnect - Technical Debt & Future Work Tracker

Last Updated: 2026-01-18
Project Phase: Phase 1 Complete (89%)


Critical Items (Blocking Production Use)

1. Gitea Actions Runner Registration

Status: PENDING (requires admin access)
Priority: HIGH
Effort: 5 minutes
Tracked In: PHASE1_WEEK3_COMPLETE.md line 181

Description: Runner installed but not registered with Gitea instance. CI/CD pipeline is ready but not active.

Action Required:

# Get token from: https://git.azcomputerguru.com/admin/actions/runners
# --no-interactive makes act_runner use the flags below instead of prompting
sudo -u gitea-runner act_runner register \
  --no-interactive \
  --instance https://git.azcomputerguru.com \
  --token YOUR_REGISTRATION_TOKEN_HERE \
  --name gururmm-runner \
  --labels ubuntu-latest,ubuntu-22.04

sudo systemctl enable gitea-runner
sudo systemctl start gitea-runner

Verification:

  • Runner shows "Online" in Gitea admin panel
  • Test commit triggers build workflow

High Priority Items (Security & Stability)

2. TLS Certificate Auto-Renewal

Status: NOT IMPLEMENTED
Priority: HIGH
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: Let's Encrypt certificates need manual renewal. Should implement certbot auto-renewal.

Implementation:

# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain and install the certificate (certbot also updates the nginx config)
sudo certbot --nginx -d connect.azcomputerguru.com

# Set up automatic renewal (systemd timer shown; a cron job also works)
sudo systemctl enable certbot.timer
sudo systemctl start certbot.timer

Verification:

  • sudo certbot renew --dry-run succeeds
  • Certificate auto-renews before expiration
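
The second verification point can be checked from the command line. The following is a minimal sketch, not part of the existing tooling: the function name `cert_days_left` and the 21-day warning threshold are assumptions, and the date parsing relies on GNU `date`.

```shell
# Print the whole days between a reference time and a certificate's expiry.
# $1: expiry date as printed by `openssl x509 -noout -enddate`,
#     without the leading "notAfter=" (e.g. "Apr 18 12:00:00 2026 GMT")
# $2: reference time in epoch seconds (defaults to the current time)
cert_days_left() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -u -d "$1" +%s) || return 1   # GNU date syntax
  now_epoch=${2:-$(date +%s)}
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example against the live host (requires network access):
#   end=$(echo | openssl s_client -connect connect.azcomputerguru.com:443 \
#           -servername connect.azcomputerguru.com 2>/dev/null \
#         | openssl x509 -noout -enddate | cut -d= -f2)
#   [ "$(cert_days_left "$end")" -ge 21 ] || echo "certificate expires soon"
```

Certbot renews by default once a certificate has fewer than 30 days left, so a value that stays well below that suggests the timer is not doing its job.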

3. Systemd Watchdog Implementation

Status: PARTIALLY COMPLETED (issue fixed, proper implementation pending)
Priority: MEDIUM
Effort: 4-8 hours (remaining for sd_notify implementation)
Discovered: 2026-01-18 (dashboard 502 error)
Issue Fixed: 2026-01-18

Description: Systemd watchdog was causing service crashes. Removed WatchdogSec=30s from service file to resolve immediate 502 error. Server now runs stably without watchdog configuration. Proper sd_notify watchdog support should still be implemented for automatic restart on hung processes.

Implementation:

  1. Add systemd crate to server/Cargo.toml
  2. Implement sd_notify_watchdog() calls in main loop
  3. Re-enable WatchdogSec=30s in systemd service
  4. Test that service doesn't crash and watchdog works

Files to Modify:

  • server/Cargo.toml - Add dependency
  • server/src/main.rs - Add watchdog notifications
  • /etc/systemd/system/guruconnect.service - Re-enable WatchdogSec

Benefits:

  • Systemd can detect hung server process
  • Automatic restart on deadlock/hang conditions
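
When WatchdogSec is set, systemd passes the timeout to the service in the WATCHDOG_USEC environment variable, and the sd_watchdog_enabled documentation recommends sending a keep-alive every half of that interval. The arithmetic is sketched below in shell for illustration only. The real implementation would live in the Rust server, and the function name `watchdog_ping_interval_ms` is hypothetical.

```shell
# Print the recommended keep-alive interval in milliseconds.
# Returns non-zero (and prints nothing) when no watchdog is configured.
watchdog_ping_interval_ms() {
  local usec=${WATCHDOG_USEC:-0}
  [ "$usec" -gt 0 ] 2>/dev/null || return 1
  # Half the timeout, converted from microseconds to milliseconds
  echo $(( usec / 2 / 1000 ))
}
```

With WatchdogSec=30s, WATCHDOG_USEC is 30000000 and the function prints 15000, so the main loop would need to notify at least every 15 seconds.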

4. Invalid Agent API Key Investigation

Status: ONGOING ISSUE
Priority: MEDIUM
Effort: 1-2 hours
Discovered: 2026-01-18

Description: Agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) repeatedly attempts to connect with an invalid API key, retrying every 5 seconds.

Log Evidence:

WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key

Investigation Needed:

  1. Identify which machine is 172.16.3.20
  2. Check agent configuration on that machine
  3. Update agent with correct API key OR remove agent
  4. Consider implementing rate limiting for failed auth attempts

Potential Impact:

  • Fills logs with warnings
  • Wastes server resources processing invalid connections
  • May indicate misconfigured or rogue agent
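
To gauge how much log volume the rejected agent generates, the warnings can be counted per source IP. This is a rough sketch that assumes the log line format shown under Log Evidence; the function name `count_rejections` is made up for illustration.

```shell
# Read log lines on stdin and print "<count> <ip>" for every source IP
# that was rejected for an invalid API key, busiest source first.
count_rejections() {
  awk '/Agent connection rejected/ && /invalid API key/ {
         # The IP is the word that follows "from"
         for (i = 1; i < NF; i++)
           if ($i == "from") { n[$(i + 1)]++; break }
       }
       END { for (ip in n) print n[ip], ip }' | sort -rn
}

# Example (unit name taken from guruconnect.service):
#   journalctl -u guruconnect --since "1 hour ago" | count_rejections
```

At one attempt every 5 seconds, a single misconfigured agent would account for roughly 720 lines per hour.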

5. Comprehensive Security Audit Logging

Status: PARTIALLY IMPLEMENTED
Priority: MEDIUM
Effort: 8-16 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: Current logging covers basic operations. Need comprehensive audit trail for security events.

Events to Track:

  • All authentication attempts (success/failure)
  • Session creation/termination
  • Agent connections/disconnections
  • User account changes
  • Configuration changes
  • Administrative actions
  • File transfer operations (when implemented)

Implementation:

  1. Create audit_logs table in database
  2. Implement AuditLogger service
  3. Add audit calls to all security-sensitive operations
  4. Create audit log viewer in dashboard
  5. Implement log retention policy

Files to Create/Modify:

  • server/migrations/XXX_create_audit_logs.sql
  • server/src/audit.rs - Audit logging service
  • server/src/api/audit.rs - Audit log API endpoints
  • server/static/audit.html - Audit log viewer

6. Session Timeout Enforcement (UI-Side)

Status: NOT IMPLEMENTED
Priority: MEDIUM
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md line 51

Description: JWTs expire after 24 hours (server-side), but the UI doesn't detect or handle expiration gracefully.

Implementation:

  1. Add token expiration check to dashboard JavaScript
  2. Implement automatic logout on token expiration
  3. Add session timeout warning (e.g., "Session expires in 5 minutes")
  4. Implement token refresh mechanism (optional)

Files to Modify:

  • server/static/dashboard.html - Add expiration check
  • server/static/viewer.html - Add expiration check
  • server/src/api/auth.rs - Add token refresh endpoint (optional)

User Experience:

  • User gets warned before automatic logout
  • Clear messaging: "Session expired, please log in again"
  • No confusing error messages on expired tokens
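
The expiration check itself is simple: decode the token payload and compare its expiry to the current time. The sketch below shows the logic in shell, which can also be handy for inspecting a token while testing. It assumes the tokens are standard three-part JWTs with a numeric `exp` claim, which the source does not confirm, and both function names are hypothetical. The dashboard would implement the same comparison in JavaScript.

```shell
# Print the numeric "exp" claim from a JWT passed as $1.
jwt_exp() {
  local payload pad
  # The payload is the second dot-separated part, base64url-encoded
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do payload="${payload}="; pad=$((pad - 1)); done
  printf '%s' "$payload" | base64 -d \
    | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print seconds until the token expires (negative once expired).
# $2: reference time in epoch seconds (defaults to the current time)
jwt_seconds_left() {
  local exp
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] || return 1
  echo $(( exp - ${2:-$(date +%s)} ))
}
```

A warning such as "Session expires in 5 minutes" would correspond to `jwt_seconds_left` dropping below 300.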

Medium Priority Items (Operational Excellence)

7. Grafana Dashboard Import

Status: NOT COMPLETED
Priority: MEDIUM
Effort: 15 minutes
Tracked In: PHASE1_COMPLETE.md

Description: Dashboard JSON file exists but not imported into Grafana.

Action Required:

  1. Login to Grafana: http://172.16.3.30:3000
  2. Go to Dashboards > Import
  3. Upload infrastructure/grafana-dashboard.json
  4. Verify all panels display data

File Location:

  • infrastructure/grafana-dashboard.json

8. Grafana Default Password Change

Status: NOT CHANGED
Priority: MEDIUM
Effort: 2 minutes
Tracked In: Multiple docs

Description: Grafana still using default admin/admin credentials.

Action Required:

  1. Login to Grafana: http://172.16.3.30:3000
  2. Change password from admin/admin to secure password
  3. Store the new password in a secure credential store and update documentation to point there, not to the password itself

Security Risk:

  • Low (internal network only, not exposed to internet)
  • But should follow security best practices

9. Deployment SSH Keys for Full Automation

Status: NOT CONFIGURED
Priority: MEDIUM
Effort: 1-2 hours
Tracked In: PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md

Description: CI/CD deployment workflow ready but requires SSH key configuration for full automation.

Implementation:

# Generate SSH key for runner
sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"

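# Add public key to guru's authorized_keys
# (tee runs as guru; a plain ">>" redirect would run as the invoking user instead)
sudo cat /home/gitea-runner/.ssh/id_ed25519.pub | \
  sudo -u guru tee -a /home/guru/.ssh/authorized_keys > /dev/null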

# Test SSH connection
sudo -u gitea-runner ssh guru@172.16.3.30 whoami

# Add secrets to Gitea repository settings
# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
# SSH_HOST - 172.16.3.30
# SSH_USER - guru

Current State:

  • Manual deployment works via deploy.sh
  • Automated deployment via workflow will fail on SSH step

10. Backup Offsite Sync

Status: NOT IMPLEMENTED
Priority: MEDIUM
Effort: 4-8 hours
Tracked In: PHASE1_COMPLETE.md

Description: Daily backups stored locally but not synced offsite. Risk of data loss if server fails.

Implementation Options:

Option A: Rsync to Remote Server

# Add to backup script
rsync -avz /home/guru/backups/guruconnect/ \
  backup-server:/backups/gururmm/guruconnect/

Option B: Cloud Storage (S3, Azure Blob, etc.)

# Install rclone
sudo apt install rclone

# Configure cloud provider
rclone config

# Sync backups
rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/

Considerations:

  • Encryption for backups in transit
  • Retention policy on remote storage
  • Cost of cloud storage
  • Bandwidth usage
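
For the retention consideration, a dry-run listing makes it easy to review which backups would fall outside a retention window before anything is deleted. This is a minimal sketch: the function name `list_expired_backups` and the 30-day window in the example are assumptions, not existing policy.

```shell
# Print backup files directly inside directory $1 whose last modification
# is more than $2 full days in the past. Nothing is deleted.
list_expired_backups() {
  find "$1" -maxdepth 1 -type f -mtime +"$2" -print | sort
}

# Example:
#   list_expired_backups /home/guru/backups/guruconnect 30
```

Note that `rclone sync` makes the destination match the source, so files pruned locally also disappear from the remote on the next sync. If the remote should keep a longer history than the server, `rclone copy` avoids that.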

11. Alertmanager for Prometheus

Status: NOT CONFIGURED
Priority: MEDIUM
Effort: 4-8 hours
Tracked In: PHASE1_COMPLETE.md

Description: Prometheus collects metrics but no alerting configured. Should notify on issues.

Alerts to Configure:

  • Service down
  • High error rate
  • Database connection failures
  • Disk space low
  • High CPU/memory usage
  • Failed authentication spike

Implementation:

# Install Alertmanager
sudo apt install prometheus-alertmanager

# Configure alert rules
sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: ServiceDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        annotations:
          summary: "GuruConnect service is down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
EOF

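# Validate the rules file
promtool check rules /etc/prometheus/alert.rules.yml

# List the file under rule_files in prometheus.yml, then reload Prometheus
# so the rules take effect

# Configure notification channels (email, Slack, etc.)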

12. CI/CD Notification Webhooks

Status: NOT CONFIGURED
Priority: LOW
Effort: 2-4 hours
Tracked In: PHASE1_COMPLETE.md

Description: No notifications when builds fail or deployments complete.

Implementation:

  1. Configure webhook in Gitea repository settings
  2. Point to Slack/Discord/Email service
  3. Select events: Push, Pull Request, Release
  4. Test notifications

Events to Notify:

  • Build started
  • Build failed
  • Build succeeded
  • Deployment started
  • Deployment completed
  • Deployment failed

Low Priority Items (Future Enhancements)

13. Windows Runner for Native Agent Builds

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 8-16 hours
Tracked In: PHASE1_WEEK3_COMPLETE.md

Description: Currently cross-compiling Windows agent from Linux. Native Windows builds would be faster and more reliable.

Implementation:

  1. Set up Windows server/VM
  2. Install Gitea Actions runner on Windows
  3. Configure runner with windows-latest label
  4. Update build workflow to use Windows runner for agent builds

Benefits:

  • Faster agent builds (no cross-compilation)
  • More accurate Windows testing
  • Ability to run Windows-specific tests

Cost:

  • Windows Server license (or Windows 10/11 Pro)
  • Additional hardware/VM resources

14. Staging Environment

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: All changes deploy directly to production. Should have staging environment for testing.

Implementation:

  1. Set up staging server (VM or separate port)
  2. Configure separate database for staging
  3. Update CI/CD workflows:
    • Push to develop → Deploy to staging
    • Push tag → Deploy to production
  4. Add smoke tests for staging

Benefits:

  • Test deployments before production
  • QA environment for testing
  • Reduced production downtime

15. Code Coverage Thresholds

Status: NOT ENFORCED
Priority: LOW
Effort: 2-4 hours
Tracked In: Multiple docs

Description: Code coverage collected but no minimum threshold enforced.

Implementation:

  1. Analyze current coverage baseline
  2. Set reasonable thresholds (e.g., 70% overall)
  3. Update test workflow to fail if below threshold
  4. Add coverage badge to README

Files to Modify:

  • .gitea/workflows/test.yml - Add threshold check
  • README.md - Add coverage badge
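
The threshold check in step 3 comes down to one numeric comparison. The helper below is a sketch only: the name `coverage_meets_threshold` and the `COVERAGE` variable in the example are hypothetical, and how the percentage is extracted depends on the coverage tool the workflow uses.

```shell
# Succeed (exit 0) when coverage percentage $1 is at or above threshold $2.
# awk handles the comparison because shell arithmetic is integer-only.
coverage_meets_threshold() {
  awk -v actual="$1" -v min="$2" 'BEGIN { exit !(actual + 0 >= min + 0) }'
}

# Example workflow step (COVERAGE is assumed to hold the measured percentage):
#   coverage_meets_threshold "$COVERAGE" 70 || {
#     echo "Coverage $COVERAGE% is below the 70% threshold"; exit 1;
#   }
```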

16. Performance Benchmarking in CI

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 8-16 hours
Tracked In: PHASE1_COMPLETE.md

Description: No automated performance testing. Risk of performance regression.

Implementation:

  1. Create performance benchmarks using criterion
  2. Add benchmark job to CI workflow
  3. Track performance trends over time
  4. Alert on performance regression (>10% slower)

Benchmarks to Add:

  • WebSocket message throughput
  • Authentication latency
  • Database query performance
  • Screen capture encoding speed
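
The regression alert in step 4 can be expressed as a comparison between a stored baseline and the current measurement. This sketch assumes both values are timings where lower is better (for example nanoseconds per iteration); the function name `is_regression` is hypothetical.

```shell
# Succeed (exit 0) when the current timing $2 is more than $3 percent
# (default 10) slower than the baseline timing $1.
is_regression() {
  awk -v base="$1" -v cur="$2" -v pct="${3:-10}" \
    'BEGIN { exit !(cur > base * (1 + pct / 100)) }'
}

# Example:
#   is_regression 1200 1400 && echo "benchmark regressed by more than 10%"
```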

17. Database Replication

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: Single database instance. No high availability or read scaling.

Implementation:

  1. Set up PostgreSQL streaming replication
  2. Configure automatic failover (pg_auto_failover)
  3. Update application to use read replicas
  4. Test failover scenarios

Benefits:

  • High availability
  • Read scaling
  • Faster backups (from replica)

Complexity:

  • Significant operational overhead
  • Monitoring and alerting needed
  • Failover testing required

18. Centralized Logging (ELK Stack)

Status: NOT IMPLEMENTED
Priority: LOW
Effort: 16-32 hours
Tracked In: PHASE1_COMPLETE.md

Description: Logs stored in systemd journal. Hard to search across time periods.

Implementation:

  1. Install Elasticsearch, Logstash, Kibana
  2. Configure log shipping from systemd journal
  3. Create Kibana dashboards
  4. Set up log retention policy

Benefits:

  • Powerful log search
  • Log aggregation across services
  • Visual log analysis

Cost:

  • Significant resource usage (RAM for Elasticsearch)
  • Operational complexity

Discovered Issues (Need Investigation)

19. Agent Connection Retry Logic

Status: NEEDS REVIEW
Priority: LOW
Effort: 2-4 hours
Discovered: 2026-01-18

Description: Agent at 172.16.3.20 retries every 5 seconds with invalid API key. Should implement exponential backoff or rate limiting.

Investigation:

  1. Check agent retry logic in codebase
  2. Determine if 5-second retry is intentional
  3. Consider exponential backoff for failed auth
  4. Add server-side rate limiting for repeated failures

Files to Review:

  • agent/src/transport/ - WebSocket connection logic
  • server/src/relay/ - Rate limiting for auth failures
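
For reference, exponential backoff doubles the wait after each failed attempt up to a cap. The sketch below shows the schedule in shell for illustration. The agent itself would implement this in its own codebase; the function name `backoff_delay` is hypothetical and the 300-second cap is an assumption, while the 5-second base matches the retry interval observed in the logs.

```shell
# Print the delay in seconds before retry number $1 (0-based).
# $2: base delay in seconds (default 5)
# $3: maximum delay in seconds (default 300, an assumed cap)
backoff_delay() {
  local attempt=$1 delay=${2:-5} cap=${3:-300}
  # Double the delay once per previous attempt, stopping at the cap
  while [ "$attempt" -gt 0 ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( delay * 2 ))
    attempt=$(( attempt - 1 ))
  done
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}
```

With the defaults, the delays run 5, 10, 20, 40, 80, 160 and then stay at 300 seconds, which would cut the rejected connection attempts from about 720 per hour to about 12 once the cap is reached.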

20. Database Connection Pool Sizing

Status: NEEDS MONITORING
Priority: LOW
Effort: 2-4 hours
Discovered: During infrastructure setup

Description: Default connection pool settings may not be optimal. Need to monitor under load.

Monitoring:

  • Check db_connections_active metric in Prometheus
  • Monitor for pool exhaustion warnings
  • Track query latency

Tuning:

  • Adjust max_connections in PostgreSQL config
  • Adjust pool size in server .env file
  • Monitor and iterate

Completed Items (For Reference)

✓ Systemd Service Configuration

Completed: 2026-01-17 Phase: Phase 1 Week 2

✓ Prometheus Metrics Integration

Completed: 2026-01-17 Phase: Phase 1 Week 2

✓ Grafana Dashboard Setup

Completed: 2026-01-17 Phase: Phase 1 Week 2

✓ Automated Backup System

Completed: 2026-01-17 Phase: Phase 1 Week 2

✓ Log Rotation Configuration

Completed: 2026-01-17 Phase: Phase 1 Week 2

✓ CI/CD Workflows Created

Completed: 2026-01-18 Phase: Phase 1 Week 3

✓ Deployment Automation Script

Completed: 2026-01-18 Phase: Phase 1 Week 3

✓ Version Tagging Automation

Completed: 2026-01-18 Phase: Phase 1 Week 3

✓ Gitea Actions Runner Installation

Completed: 2026-01-18 Phase: Phase 1 Week 3

✓ Systemd Watchdog Issue Fixed (Partial Completion)

Completed: 2026-01-18
What Was Done: Removed WatchdogSec=30s from systemd service file
Result: Resolved immediate 502 error; server now runs stably
Status: Issue fixed but full implementation (sd_notify) still pending
Item Reference: Item #3 (full sd_notify implementation remains as future work)
Impact: Production server is now stable and responding correctly


Summary by Priority

Critical (1 item):

  1. Gitea Actions runner registration

High (4 items):

  2. TLS certificate auto-renewal
  4. Invalid agent API key investigation
  5. Comprehensive security audit logging
  6. Session timeout enforcement

High - Partial/Pending (1 item):

  3. Systemd watchdog implementation (issue fixed; sd_notify implementation pending)

Medium (6 items):

  7. Grafana dashboard import
  8. Grafana password change
  9. Deployment SSH keys
  10. Backup offsite sync
  11. Alertmanager for Prometheus
  12. CI/CD notification webhooks

Low (8 items):

  13. Windows runner for agent builds
  14. Staging environment
  15. Code coverage thresholds
  16. Performance benchmarking
  17. Database replication
  18. Centralized logging (ELK)
  19. Agent retry logic review
  20. Database pool sizing monitoring


Tracking Notes

How to Use This Document:

  1. Before starting new work, review this list
  2. When discovering new issues, add them here
  3. When completing items, move to "Completed Items" section
  4. Prioritize based on: Security > Stability > Operations > Features
  5. Update status and dates as work progresses

Related Documents:

  • PHASE1_COMPLETE.md - Overall Phase 1 status
  • PHASE1_WEEK3_COMPLETE.md - CI/CD specific items
  • CI_CD_SETUP.md - CI/CD documentation
  • INFRASTRUCTURE_STATUS.md - Infrastructure status

Document Version: 1.1
Items Tracked: 20 (1 critical, 4 high, 1 high-partial, 6 medium, 8 low)
Last Updated: 2026-01-18 (Item #3 marked as partial completion)
Next Review: Before Phase 2 planning