GuruConnect - Technical Debt & Future Work Tracker
Last Updated: 2026-01-18 Project Phase: Phase 1 Complete (89%)
Critical Items (Blocking Production Use)
1. Gitea Actions Runner Registration
Status: PENDING (requires admin access) Priority: HIGH Effort: 5 minutes Tracked In: PHASE1_WEEK3_COMPLETE.md line 181
Description: Runner installed but not registered with Gitea instance. CI/CD pipeline is ready but not active.
Action Required:
# Get token from: https://git.azcomputerguru.com/admin/actions/runners
sudo -u gitea-runner act_runner register \
--instance https://git.azcomputerguru.com \
--token YOUR_REGISTRATION_TOKEN_HERE \
--name gururmm-runner \
--labels ubuntu-latest,ubuntu-22.04
sudo systemctl enable gitea-runner
sudo systemctl start gitea-runner
Verification:
- Runner shows "Online" in Gitea admin panel
- Test commit triggers build workflow
High Priority Items (Security & Stability)
2. TLS Certificate Auto-Renewal
Status: NOT IMPLEMENTED Priority: HIGH Effort: 2-4 hours Tracked In: PHASE1_COMPLETE.md line 51
Description: Let's Encrypt certificates need manual renewal. Should implement certbot auto-renewal.
Implementation:
# Install certbot
sudo apt install certbot python3-certbot-nginx
# Obtain and install the certificate (the nginx plugin updates the site config)
sudo certbot --nginx -d connect.azcomputerguru.com
# Enable the systemd timer that runs certbot renew automatically
sudo systemctl enable certbot.timer
sudo systemctl start certbot.timer
Verification:
- sudo certbot renew --dry-run succeeds
- Certificate auto-renews before expiration
3. Systemd Watchdog Implementation
Status: PARTIALLY COMPLETED (issue fixed, proper implementation pending) Priority: MEDIUM Effort: 4-8 hours (remaining for sd_notify implementation) Discovered: 2026-01-18 (dashboard 502 error) Issue Fixed: 2026-01-18
Description:
Systemd watchdog was causing service crashes. Removed WatchdogSec=30s from service file to resolve immediate 502 error. Server now runs stably without watchdog configuration. Proper sd_notify watchdog support should still be implemented for automatic restart on hung processes.
Implementation:
- Add systemd crate to server/Cargo.toml
- Implement sd_notify_watchdog() calls in main loop
- Re-enable WatchdogSec=30s in systemd service
- Test that service doesn't crash and watchdog works
Files to Modify:
- server/Cargo.toml - Add dependency
- server/src/main.rs - Add watchdog notifications
- /etc/systemd/system/guruconnect.service - Re-enable WatchdogSec
Benefits:
- Systemd can detect hung server process
- Automatic restart on deadlock/hang conditions
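The watchdog keep-alive described in this item can be sketched without any external crate, since the notification protocol is a datagram sent to the Unix socket named in the NOTIFY_SOCKET environment variable. The sketch below is an assumption-laden illustration, not the planned implementation: it uses only the Rust standard library instead of the systemd crate named above, the function names (notify, ping_watchdog, watchdog_interval) are hypothetical, and abstract-namespace sockets (paths starting with @) are deliberately skipped.

```rust
use std::env;
use std::io;
use std::os::unix::net::UnixDatagram;
use std::time::Duration;

/// Send one raw notification message to the datagram socket at `socket_path`.
pub fn notify(socket_path: &str, message: &str) -> io::Result<()> {
    let socket = UnixDatagram::unbound()?;
    socket.send_to(message.as_bytes(), socket_path)?;
    Ok(())
}

/// Send a watchdog keep-alive if NOTIFY_SOCKET points at a filesystem socket.
/// Returns Ok(false) when the variable is unset (not started by systemd) or
/// when it names an abstract socket, which this sketch does not handle.
pub fn ping_watchdog() -> io::Result<bool> {
    match env::var("NOTIFY_SOCKET") {
        Ok(path) if !path.starts_with('@') => {
            notify(&path, "WATCHDOG=1")?;
            Ok(true)
        }
        _ => Ok(false),
    }
}

/// Derive a ping interval from the value of WATCHDOG_USEC (microseconds).
/// Pinging at half the configured timeout leaves headroom for a slow loop.
pub fn watchdog_interval(watchdog_usec: Option<&str>) -> Option<Duration> {
    let usec: u64 = watchdog_usec?.trim().parse().ok()?;
    if usec == 0 {
        return None;
    }
    Some(Duration::from_micros(usec / 2))
}
```

With WatchdogSec=30s re-enabled, systemd exports WATCHDOG_USEC=30000000, so watchdog_interval would yield 15 seconds; the main loop (or a dedicated task) would then call ping_watchdog() at that cadence. A hung loop stops pinging, which is what lets systemd detect it and restart the service.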
4. Invalid Agent API Key Investigation
Status: ONGOING ISSUE Priority: MEDIUM Effort: 1-2 hours Discovered: 2026-01-18
Description: Agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) is repeatedly connecting with invalid API key every 5 seconds.
Log Evidence:
WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key
Investigation Needed:
- Identify which machine is 172.16.3.20
- Check agent configuration on that machine
- Update agent with correct API key OR remove agent
- Consider implementing rate limiting for failed auth attempts
Potential Impact:
- Fills logs with warnings
- Wastes server resources processing invalid connections
- May indicate misconfigured or rogue agent
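The rate limiting suggested under "Investigation Needed" could take the form of a small fixed-window counter keyed by source address. This is a minimal sketch under stated assumptions: FailedAuthLimiter and its methods are hypothetical names, the limits are illustrative, and nothing here reflects the existing relay code.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Counts failed authentication attempts per source address in a fixed window.
pub struct FailedAuthLimiter {
    max_failures: u32,
    window: Duration,
    // Per address: (start of the current window, failures seen in it).
    failures: HashMap<IpAddr, (Instant, u32)>,
}

impl FailedAuthLimiter {
    pub fn new(max_failures: u32, window: Duration) -> Self {
        Self { max_failures, window, failures: HashMap::new() }
    }

    /// Record a failed attempt from `addr` observed at `now`.
    pub fn record_failure(&mut self, addr: IpAddr, now: Instant) {
        let entry = self.failures.entry(addr).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        entry.1 += 1;
    }

    /// True while `addr` has reached the failure limit inside its current window.
    pub fn is_blocked(&self, addr: IpAddr, now: Instant) -> bool {
        match self.failures.get(&addr) {
            Some((start, count)) => {
                now.duration_since(*start) < self.window && *count >= self.max_failures
            }
            None => false,
        }
    }
}
```

A connection handler would call is_blocked() before validating the API key and record_failure() after each rejection. With the 5-second retry seen from 172.16.3.20, a limit of 3 failures per 60 seconds would drop most of those attempts before any key check runs. Entries are never evicted in this sketch, so a long-running server would also need periodic cleanup.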
5. Comprehensive Security Audit Logging
Status: PARTIALLY IMPLEMENTED Priority: MEDIUM Effort: 8-16 hours Tracked In: PHASE1_COMPLETE.md line 51
Description: Current logging covers basic operations. Need comprehensive audit trail for security events.
Events to Track:
- All authentication attempts (success/failure)
- Session creation/termination
- Agent connections/disconnections
- User account changes
- Configuration changes
- Administrative actions
- File transfer operations (when implemented)
Implementation:
- Create audit_logs table in database
- Implement AuditLogger service
- Add audit calls to all security-sensitive operations
- Create audit log viewer in dashboard
- Implement log retention policy
Files to Create/Modify:
- server/migrations/XXX_create_audit_logs.sql
- server/src/audit.rs - Audit logging service
- server/src/api/audit.rs - Audit log API endpoints
- server/static/audit.html - Audit log viewer
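One possible shape for the AuditLogger service is sketched below. Everything in it is an assumption made for illustration: the AuditEvent variants cover only part of the "Events to Track" list, the AuditSink trait and the record fields are hypothetical, and the in-memory Vec sink stands in for whatever would actually write to the audit_logs table.

```rust
use std::time::SystemTime;

/// Security events to record (a subset of the tracked event list).
#[derive(Debug, Clone, PartialEq)]
pub enum AuditEvent {
    AuthAttempt { username: String, success: bool },
    SessionCreated { session_id: String },
    SessionTerminated { session_id: String },
    AgentConnected { machine_id: String },
    AgentDisconnected { machine_id: String },
    ConfigChanged { setting: String },
}

/// One entry destined for the audit log.
#[derive(Debug, Clone)]
pub struct AuditRecord {
    pub at: SystemTime,
    pub actor: String,
    pub event: AuditEvent,
}

/// Destination for audit records (hypothetical; a real sink would insert into the database).
pub trait AuditSink {
    fn record(&mut self, record: AuditRecord);
}

/// In-memory sink, convenient for unit tests.
impl AuditSink for Vec<AuditRecord> {
    fn record(&mut self, record: AuditRecord) {
        self.push(record);
    }
}

/// Stamps each event with the current time and the acting user, then forwards it.
pub struct AuditLogger<S: AuditSink> {
    sink: S,
}

impl<S: AuditSink> AuditLogger<S> {
    pub fn new(sink: S) -> Self {
        Self { sink }
    }

    pub fn log(&mut self, actor: &str, event: AuditEvent) {
        self.sink.record(AuditRecord {
            at: SystemTime::now(),
            actor: actor.to_string(),
            event,
        });
    }

    pub fn sink(&self) -> &S {
        &self.sink
    }
}
```

Keeping the sink behind a trait means the security-sensitive call sites only depend on log(), so the storage backend and retention policy can change without touching them.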
6. Session Timeout Enforcement (UI-Side)
Status: NOT IMPLEMENTED Priority: MEDIUM Effort: 2-4 hours Tracked In: PHASE1_COMPLETE.md line 51
Description: JWT tokens expire after 24 hours (server-side), but UI doesn't detect/handle expiration gracefully.
Implementation:
- Add token expiration check to dashboard JavaScript
- Implement automatic logout on token expiration
- Add session timeout warning (e.g., "Session expires in 5 minutes")
- Implement token refresh mechanism (optional)
Files to Modify:
- server/static/dashboard.html - Add expiration check
- server/static/viewer.html - Add expiration check
- server/src/api/auth.rs - Add token refresh endpoint (optional)
User Experience:
- User gets warned before automatic logout
- Clear messaging: "Session expired, please log in again"
- No confusing error messages on expired tokens
Medium Priority Items (Operational Excellence)
7. Grafana Dashboard Import
Status: NOT COMPLETED Priority: MEDIUM Effort: 15 minutes Tracked In: PHASE1_COMPLETE.md
Description: Dashboard JSON file exists but not imported into Grafana.
Action Required:
- Login to Grafana: http://172.16.3.30:3000
- Go to Dashboards > Import
- Upload infrastructure/grafana-dashboard.json
- Verify all panels display data
File Location:
infrastructure/grafana-dashboard.json
8. Grafana Default Password Change
Status: NOT CHANGED Priority: MEDIUM Effort: 2 minutes Tracked In: Multiple docs
Description: Grafana still using default admin/admin credentials.
Action Required:
- Login to Grafana: http://172.16.3.30:3000
- Change password from admin/admin to secure password
- Update documentation with new password
Security Risk:
- Low (internal network only, not exposed to internet)
- But should follow security best practices
9. Deployment SSH Keys for Full Automation
Status: NOT CONFIGURED Priority: MEDIUM Effort: 1-2 hours Tracked In: PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md
Description: CI/CD deployment workflow ready but requires SSH key configuration for full automation.
Implementation:
# Generate SSH key for runner
sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"
# Add public key to guru's authorized_keys (run this as guru: the >> redirect is performed by the calling shell, not by sudo)
sudo -u gitea-runner cat /home/gitea-runner/.ssh/id_ed25519.pub >> ~guru/.ssh/authorized_keys
# Test SSH connection
sudo -u gitea-runner ssh guru@172.16.3.30 whoami
# Add secrets to Gitea repository settings
# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
# SSH_HOST - 172.16.3.30
# SSH_USER - guru
Current State:
- Manual deployment works via deploy.sh
- Automated deployment via workflow will fail on SSH step
10. Backup Offsite Sync
Status: NOT IMPLEMENTED Priority: MEDIUM Effort: 4-8 hours Tracked In: PHASE1_COMPLETE.md
Description: Daily backups stored locally but not synced offsite. Risk of data loss if server fails.
Implementation Options:
Option A: Rsync to Remote Server
# Add to backup script
rsync -avz /home/guru/backups/guruconnect/ \
backup-server:/backups/gururmm/guruconnect/
Option B: Cloud Storage (S3, Azure Blob, etc.)
# Install rclone
sudo apt install rclone
# Configure cloud provider
rclone config
# Sync backups
rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/
Considerations:
- Encryption for backups in transit
- Retention policy on remote storage
- Cost of cloud storage
- Bandwidth usage
11. Alertmanager for Prometheus
Status: NOT CONFIGURED Priority: MEDIUM Effort: 4-8 hours Tracked In: PHASE1_COMPLETE.md
Description: Prometheus collects metrics but no alerting configured. Should notify on issues.
Alerts to Configure:
- Service down
- High error rate
- Database connection failures
- Disk space low
- High CPU/memory usage
- Failed authentication spike
Implementation:
# Install Alertmanager
sudo apt install prometheus-alertmanager
# Configure alert rules
sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: ServiceDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        annotations:
          summary: "GuruConnect service is down"
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
EOF
# Configure notification channels (email, Slack, etc.)
12. CI/CD Notification Webhooks
Status: NOT CONFIGURED Priority: LOW Effort: 2-4 hours Tracked In: PHASE1_COMPLETE.md
Description: No notifications when builds fail or deployments complete.
Implementation:
- Configure webhook in Gitea repository settings
- Point to Slack/Discord/Email service
- Select events: Push, Pull Request, Release
- Test notifications
Events to Notify:
- Build started
- Build failed
- Build succeeded
- Deployment started
- Deployment completed
- Deployment failed
Low Priority Items (Future Enhancements)
13. Windows Runner for Native Agent Builds
Status: NOT IMPLEMENTED Priority: LOW Effort: 8-16 hours Tracked In: PHASE1_WEEK3_COMPLETE.md
Description: Currently cross-compiling Windows agent from Linux. Native Windows builds would be faster and more reliable.
Implementation:
- Set up Windows server/VM
- Install Gitea Actions runner on Windows
- Configure runner with windows-latest label
- Update build workflow to use Windows runner for agent builds
Benefits:
- Faster agent builds (no cross-compilation)
- More accurate Windows testing
- Ability to run Windows-specific tests
Cost:
- Windows Server license (or Windows 10/11 Pro)
- Additional hardware/VM resources
14. Staging Environment
Status: NOT IMPLEMENTED Priority: LOW Effort: 16-32 hours Tracked In: PHASE1_COMPLETE.md
Description: All changes deploy directly to production. Should have staging environment for testing.
Implementation:
- Set up staging server (VM or separate port)
- Configure separate database for staging
- Update CI/CD workflows:
  - Push to develop → Deploy to staging
  - Push tag → Deploy to production
- Add smoke tests for staging
Benefits:
- Test deployments before production
- QA environment for testing
- Reduced production downtime
15. Code Coverage Thresholds
Status: NOT ENFORCED Priority: LOW Effort: 2-4 hours Tracked In: Multiple docs
Description: Code coverage collected but no minimum threshold enforced.
Implementation:
- Analyze current coverage baseline
- Set reasonable thresholds (e.g., 70% overall)
- Update test workflow to fail if below threshold
- Add coverage badge to README
Files to Modify:
- .gitea/workflows/test.yml - Add threshold check
- README.md - Add coverage badge
16. Performance Benchmarking in CI
Status: NOT IMPLEMENTED Priority: LOW Effort: 8-16 hours Tracked In: PHASE1_COMPLETE.md
Description: No automated performance testing. Risk of performance regression.
Implementation:
- Create performance benchmarks using criterion
- Add benchmark job to CI workflow
- Track performance trends over time
- Alert on performance regression (>10% slower)
Benchmarks to Add:
- WebSocket message throughput
- Authentication latency
- Database query performance
- Screen capture encoding speed
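The ">10% slower" alert rule reduces to a relative comparison between a stored baseline and the latest measurement. The helper below is a hypothetical sketch of that check only; the function name, the use of nanoseconds, and how baselines would be stored are all assumptions, and it does not use or depend on criterion.

```rust
/// Returns true when `current_ns` is slower than `baseline_ns` by more than
/// `threshold`, expressed as a fraction (0.10 means 10%).
/// A non-positive baseline is treated as "no baseline", so nothing is flagged.
pub fn is_regression(baseline_ns: f64, current_ns: f64, threshold: f64) -> bool {
    baseline_ns > 0.0 && (current_ns - baseline_ns) / baseline_ns > threshold
}
```

A CI job could run this for each benchmark and fail the build when any comparison returns true. Benchmark timings are noisy, so the threshold would likely need tuning against real run-to-run variance before it gates merges.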
17. Database Replication
Status: NOT IMPLEMENTED Priority: LOW Effort: 16-32 hours Tracked In: PHASE1_COMPLETE.md
Description: Single database instance. No high availability or read scaling.
Implementation:
- Set up PostgreSQL streaming replication
- Configure automatic failover (pg_auto_failover)
- Update application to use read replicas
- Test failover scenarios
Benefits:
- High availability
- Read scaling
- Faster backups (from replica)
Complexity:
- Significant operational overhead
- Monitoring and alerting needed
- Failover testing required
18. Centralized Logging (ELK Stack)
Status: NOT IMPLEMENTED Priority: LOW Effort: 16-32 hours Tracked In: PHASE1_COMPLETE.md
Description: Logs stored in systemd journal. Hard to search across time periods.
Implementation:
- Install Elasticsearch, Logstash, Kibana
- Configure log shipping from systemd journal
- Create Kibana dashboards
- Set up log retention policy
Benefits:
- Powerful log search
- Log aggregation across services
- Visual log analysis
Cost:
- Significant resource usage (RAM for Elasticsearch)
- Operational complexity
Discovered Issues (Need Investigation)
19. Agent Connection Retry Logic
Status: NEEDS REVIEW Priority: LOW Effort: 2-4 hours Discovered: 2026-01-18
Description: Agent at 172.16.3.20 retries every 5 seconds with invalid API key. Should implement exponential backoff or rate limiting.
Investigation:
- Check agent retry logic in codebase
- Determine if 5-second retry is intentional
- Consider exponential backoff for failed auth
- Add server-side rate limiting for repeated failures
Files to Review:
- agent/src/transport/ - WebSocket connection logic
- server/src/relay/ - Rate limiting for auth failures
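For the agent side, exponential backoff can be expressed as a pure function of the attempt number. This is a sketch under assumptions: backoff_delay is a hypothetical name, the 5-second base matches the retry interval observed in the logs, and the cap is illustrative. Random jitter, which is commonly added so that many agents do not retry in lockstep, is left out to keep the example dependency-free.

```rust
use std::time::Duration;

/// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate the multiplier instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // An overflowing product is treated as "longer than max".
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 5-second base and a 5-minute cap, the delays run 5s, 10s, 20s, 40s and so on until they settle at 300s. The agent would reset the attempt counter to zero after a successful connection.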
20. Database Connection Pool Sizing
Status: NEEDS MONITORING Priority: LOW Effort: 2-4 hours Discovered: During infrastructure setup
Description: Default connection pool settings may not be optimal. Need to monitor under load.
Monitoring:
- Check db_connections_active metric in Prometheus
- Monitor for pool exhaustion warnings
- Track query latency
Tuning:
- Adjust max_connections in PostgreSQL config
- Adjust pool size in server .env file
- Monitor and iterate
Completed Items (For Reference)
✓ Systemd Service Configuration
Completed: 2026-01-17 Phase: Phase 1 Week 2
✓ Prometheus Metrics Integration
Completed: 2026-01-17 Phase: Phase 1 Week 2
✓ Grafana Dashboard Setup
Completed: 2026-01-17 Phase: Phase 1 Week 2
✓ Automated Backup System
Completed: 2026-01-17 Phase: Phase 1 Week 2
✓ Log Rotation Configuration
Completed: 2026-01-17 Phase: Phase 1 Week 2
✓ CI/CD Workflows Created
Completed: 2026-01-18 Phase: Phase 1 Week 3
✓ Deployment Automation Script
Completed: 2026-01-18 Phase: Phase 1 Week 3
✓ Version Tagging Automation
Completed: 2026-01-18 Phase: Phase 1 Week 3
✓ Gitea Actions Runner Installation
Completed: 2026-01-18 Phase: Phase 1 Week 3
✓ Systemd Watchdog Issue Fixed (Partial Completion)
Completed: 2026-01-18
What Was Done: Removed WatchdogSec=30s from systemd service file
Result: Resolved immediate 502 error; server now runs stably
Status: Issue fixed but full implementation (sd_notify) still pending
Item Reference: Item #3 (full sd_notify implementation remains as future work)
Impact: Production server is now stable and responding correctly
Summary by Priority
Critical (1 item):
1. Gitea Actions runner registration
High (4 items):
2. TLS certificate auto-renewal
4. Invalid agent API key investigation
5. Comprehensive security audit logging
6. Session timeout enforcement
High - Partial/Pending (1 item):
3. Systemd watchdog implementation (issue fixed; sd_notify implementation pending)
Medium (6 items):
7. Grafana dashboard import
8. Grafana password change
9. Deployment SSH keys
10. Backup offsite sync
11. Alertmanager for Prometheus
12. CI/CD notification webhooks
Low (8 items):
13. Windows runner for agent builds
14. Staging environment
15. Code coverage thresholds
16. Performance benchmarking
17. Database replication
18. Centralized logging (ELK)
19. Agent retry logic review
20. Database pool sizing monitoring
Tracking Notes
How to Use This Document:
- Before starting new work, review this list
- When discovering new issues, add them here
- When completing items, move to "Completed Items" section
- Prioritize based on: Security > Stability > Operations > Features
- Update status and dates as work progresses
Related Documents:
- PHASE1_COMPLETE.md - Overall Phase 1 status
- PHASE1_WEEK3_COMPLETE.md - CI/CD specific items
- CI_CD_SETUP.md - CI/CD documentation
- INFRASTRUCTURE_STATUS.md - Infrastructure status
Document Version: 1.1 Items Tracked: 20 (1 critical, 4 high, 1 high-partial, 6 medium, 8 low) Last Updated: 2026-01-18 (Item #3 marked as partial completion) Next Review: Before Phase 2 planning