claudetools/projects/msp-tools/guru-connect/TECHNICAL_DEBT.md

# GuruConnect - Technical Debt & Future Work Tracker

**Last Updated:** 2026-01-18
**Project Phase:** Phase 1 Complete (89%)

---

## Critical Items (Blocking Production Use)

### 1. Gitea Actions Runner Registration
**Status:** PENDING (requires admin access)
**Priority:** HIGH
**Effort:** 5 minutes
**Tracked In:** PHASE1_WEEK3_COMPLETE.md line 181

**Description:**
Runner installed but not registered with Gitea instance. CI/CD pipeline is ready but not active.

**Action Required:**
```bash
# Get token from: https://git.azcomputerguru.com/admin/actions/runners
sudo -u gitea-runner act_runner register \
  --instance https://git.azcomputerguru.com \
  --token YOUR_REGISTRATION_TOKEN_HERE \
  --name gururmm-runner \
  --labels ubuntu-latest,ubuntu-22.04

sudo systemctl enable gitea-runner
sudo systemctl start gitea-runner
```

**Verification:**
- Runner shows "Online" in Gitea admin panel
- Test commit triggers build workflow

---

## High Priority Items (Security & Stability)

### 2. TLS Certificate Auto-Renewal
**Status:** NOT IMPLEMENTED
**Priority:** HIGH
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
Let's Encrypt certificates need manual renewal. Should implement certbot auto-renewal.

**Implementation:**
```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx

# Configure auto-renewal
sudo certbot --nginx -d connect.azcomputerguru.com

# Set up automatic renewal (cron or systemd timer)
sudo systemctl enable certbot.timer
sudo systemctl start certbot.timer
```

**Verification:**
- `sudo certbot renew --dry-run` succeeds
- Certificate auto-renews before expiration

---

### 3. Systemd Watchdog Implementation
**Status:** PARTIALLY COMPLETED (issue fixed, proper implementation pending)
**Priority:** MEDIUM
**Effort:** 4-8 hours (remaining for sd_notify implementation)
**Discovered:** 2026-01-18 (dashboard 502 error)
**Issue Fixed:** 2026-01-18

**Description:**
Systemd watchdog was causing service crashes. Removed `WatchdogSec=30s` from service file to resolve immediate 502 error. Server now runs stably without watchdog configuration. Proper sd_notify watchdog support should still be implemented for automatic restart on hung processes.

**Implementation:**
1. Add `systemd` crate to server/Cargo.toml
2. Implement `sd_notify_watchdog()` calls in main loop
3. Re-enable `WatchdogSec=30s` in systemd service
4. Test that service doesn't crash and watchdog works

**Files to Modify:**
- `server/Cargo.toml` - Add dependency
- `server/src/main.rs` - Add watchdog notifications
- `/etc/systemd/system/guruconnect.service` - Re-enable WatchdogSec

**Benefits:**
- Systemd can detect hung server process
- Automatic restart on deadlock/hang conditions

---

### 4. Invalid Agent API Key Investigation
**Status:** ONGOING ISSUE
**Priority:** MEDIUM
**Effort:** 1-2 hours
**Discovered:** 2026-01-18

**Description:**
Agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) is repeatedly connecting with invalid API key every 5 seconds.

**Log Evidence:**
```
WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key
```

**Investigation Needed:**
1. Identify which machine is 172.16.3.20
2. Check agent configuration on that machine
3. Update agent with correct API key OR remove agent
4. Consider implementing rate limiting for failed auth attempts

**Potential Impact:**
- Fills logs with warnings
- Wastes server resources processing invalid connections
- May indicate misconfigured or rogue agent

---

### 5. Comprehensive Security Audit Logging
**Status:** PARTIALLY IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 8-16 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
Current logging covers basic operations. Need comprehensive audit trail for security events.

**Events to Track:**
- All authentication attempts (success/failure)
- Session creation/termination
- Agent connections/disconnections
- User account changes
- Configuration changes
- Administrative actions
- File transfer operations (when implemented)

**Implementation:**
1. Create `audit_logs` table in database
2. Implement `AuditLogger` service
3. Add audit calls to all security-sensitive operations
4. Create audit log viewer in dashboard
5. Implement log retention policy

**Files to Create/Modify:**
- `server/migrations/XXX_create_audit_logs.sql`
- `server/src/audit.rs` - Audit logging service
- `server/src/api/audit.rs` - Audit log API endpoints
- `server/static/audit.html` - Audit log viewer

---

### 6. Session Timeout Enforcement (UI-Side)
**Status:** NOT IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
JWT tokens expire after 24 hours (server-side), but UI doesn't detect/handle expiration gracefully.

**Implementation:**
1. Add token expiration check to dashboard JavaScript
2. Implement automatic logout on token expiration
3. Add session timeout warning (e.g., "Session expires in 5 minutes")
4. Implement token refresh mechanism (optional)

**Files to Modify:**
- `server/static/dashboard.html` - Add expiration check
- `server/static/viewer.html` - Add expiration check
- `server/src/api/auth.rs` - Add token refresh endpoint (optional)

**User Experience:**
- User gets warned before automatic logout
- Clear messaging: "Session expired, please log in again"
- No confusing error messages on expired tokens

---

## Medium Priority Items (Operational Excellence)

### 7. Grafana Dashboard Import
**Status:** NOT COMPLETED
**Priority:** MEDIUM
**Effort:** 15 minutes
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Dashboard JSON file exists but not imported into Grafana.

**Action Required:**
1. Login to Grafana: http://172.16.3.30:3000
2. Go to Dashboards > Import
3. Upload `infrastructure/grafana-dashboard.json`
4. Verify all panels display data

**File Location:**
- `infrastructure/grafana-dashboard.json`

---

### 8. Grafana Default Password Change
**Status:** NOT CHANGED
**Priority:** MEDIUM
**Effort:** 2 minutes
**Tracked In:** Multiple docs

**Description:**
Grafana still using default admin/admin credentials.

**Action Required:**
1. Login to Grafana: http://172.16.3.30:3000
2. Change password from admin/admin to secure password
3. Update documentation with new password

**Security Risk:**
- Low (internal network only, not exposed to internet)
- But should follow security best practices

---

### 9. Deployment SSH Keys for Full Automation
**Status:** NOT CONFIGURED
**Priority:** MEDIUM
**Effort:** 1-2 hours
**Tracked In:** PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md

**Description:**
CI/CD deployment workflow ready but requires SSH key configuration for full automation.

**Implementation:**
```bash
# Generate SSH key for runner
sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"

# Add public key to authorized_keys
sudo -u gitea-runner cat /home/gitea-runner/.ssh/id_ed25519.pub >> ~guru/.ssh/authorized_keys

# Test SSH connection
sudo -u gitea-runner ssh guru@172.16.3.30 whoami

# Add secrets to Gitea repository settings
# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
# SSH_HOST - 172.16.3.30
# SSH_USER - guru
```

**Current State:**
- Manual deployment works via deploy.sh
- Automated deployment via workflow will fail on SSH step

---

### 10. Backup Offsite Sync
**Status:** NOT IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 4-8 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Daily backups stored locally but not synced offsite. Risk of data loss if server fails.

**Implementation Options:**

**Option A: Rsync to Remote Server**
```bash
# Add to backup script
rsync -avz /home/guru/backups/guruconnect/ \
  backup-server:/backups/gururmm/guruconnect/
```

**Option B: Cloud Storage (S3, Azure Blob, etc.)**
```bash
# Install rclone
sudo apt install rclone

# Configure cloud provider
rclone config

# Sync backups
rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/
```

**Considerations:**
- Encryption for backups in transit
- Retention policy on remote storage
- Cost of cloud storage
- Bandwidth usage

---

### 11. Alertmanager for Prometheus
**Status:** NOT CONFIGURED
**Priority:** MEDIUM
**Effort:** 4-8 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Prometheus collects metrics but no alerting configured. Should notify on issues.

**Alerts to Configure:**
- Service down
- High error rate
- Database connection failures
- Disk space low
- High CPU/memory usage
- Failed authentication spike

**Implementation:**
```bash
# Install Alertmanager
sudo apt install prometheus-alertmanager

# Configure alert rules
sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: ServiceDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        annotations:
          summary: "GuruConnect service is down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
EOF

# Configure notification channels (email, Slack, etc.)
```

---

### 12. CI/CD Notification Webhooks
**Status:** NOT CONFIGURED
**Priority:** LOW
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
No notifications when builds fail or deployments complete.

**Implementation:**
1. Configure webhook in Gitea repository settings
2. Point to Slack/Discord/Email service
3. Select events: Push, Pull Request, Release
4. Test notifications

**Events to Notify:**
- Build started
- Build failed
- Build succeeded
- Deployment started
- Deployment completed
- Deployment failed

---

## Low Priority Items (Future Enhancements)

### 13. Windows Runner for Native Agent Builds
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 8-16 hours
**Tracked In:** PHASE1_WEEK3_COMPLETE.md

**Description:**
Currently cross-compiling Windows agent from Linux. Native Windows builds would be faster and more reliable.

**Implementation:**
1. Set up Windows server/VM
2. Install Gitea Actions runner on Windows
3. Configure runner with windows-latest label
4. Update build workflow to use Windows runner for agent builds

**Benefits:**
- Faster agent builds (no cross-compilation)
- More accurate Windows testing
- Ability to run Windows-specific tests

**Cost:**
- Windows Server license (or Windows 10/11 Pro)
- Additional hardware/VM resources

---

### 14. Staging Environment
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
All changes deploy directly to production. Should have staging environment for testing.

**Implementation:**
1. Set up staging server (VM or separate port)
2. Configure separate database for staging
3. Update CI/CD workflows:
   - Push to develop → Deploy to staging
   - Push tag → Deploy to production
4. Add smoke tests for staging

**Benefits:**
- Test deployments before production
- QA environment for testing
- Reduced production downtime

---

### 15. Code Coverage Thresholds
**Status:** NOT ENFORCED
**Priority:** LOW
**Effort:** 2-4 hours
**Tracked In:** Multiple docs

**Description:**
Code coverage collected but no minimum threshold enforced.

**Implementation:**
1. Analyze current coverage baseline
2. Set reasonable thresholds (e.g., 70% overall)
3. Update test workflow to fail if below threshold
4. Add coverage badge to README

**Files to Modify:**
- `.gitea/workflows/test.yml` - Add threshold check
- `README.md` - Add coverage badge

---

### 16. Performance Benchmarking in CI
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 8-16 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
No automated performance testing. Risk of performance regression.

**Implementation:**
1. Create performance benchmarks using `criterion`
2. Add benchmark job to CI workflow
3. Track performance trends over time
4. Alert on performance regression (>10% slower)

**Benchmarks to Add:**
- WebSocket message throughput
- Authentication latency
- Database query performance
- Screen capture encoding speed

---

### 17. Database Replication
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Single database instance. No high availability or read scaling.

**Implementation:**
1. Set up PostgreSQL streaming replication
2. Configure automatic failover (pg_auto_failover)
3. Update application to use read replicas
4. Test failover scenarios

**Benefits:**
- High availability
- Read scaling
- Faster backups (from replica)

**Complexity:**
- Significant operational overhead
- Monitoring and alerting needed
- Failover testing required

---

### 18. Centralized Logging (ELK Stack)
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Logs stored in systemd journal. Hard to search across time periods.

**Implementation:**
1. Install Elasticsearch, Logstash, Kibana
2. Configure log shipping from systemd journal
3. Create Kibana dashboards
4. Set up log retention policy

**Benefits:**
- Powerful log search
- Log aggregation across services
- Visual log analysis

**Cost:**
- Significant resource usage (RAM for Elasticsearch)
- Operational complexity

---

## Discovered Issues (Need Investigation)

### 19. Agent Connection Retry Logic
**Status:** NEEDS REVIEW
**Priority:** LOW
**Effort:** 2-4 hours
**Discovered:** 2026-01-18

**Description:**
Agent at 172.16.3.20 retries every 5 seconds with invalid API key. Should implement exponential backoff or rate limiting.

**Investigation:**
1. Check agent retry logic in codebase
2. Determine if 5-second retry is intentional
3. Consider exponential backoff for failed auth
4. Add server-side rate limiting for repeated failures

**Files to Review:**
- `agent/src/transport/` - WebSocket connection logic
- `server/src/relay/` - Rate limiting for auth failures

---

### 20. Database Connection Pool Sizing
**Status:** NEEDS MONITORING
**Priority:** LOW
**Effort:** 2-4 hours
**Discovered:** During infrastructure setup

**Description:**
Default connection pool settings may not be optimal. Need to monitor under load.

**Monitoring:**
- Check `db_connections_active` metric in Prometheus
- Monitor for pool exhaustion warnings
- Track query latency

**Tuning:**
- Adjust `max_connections` in PostgreSQL config
- Adjust pool size in server .env file
- Monitor and iterate

---

## Completed Items (For Reference)

### ✓ Systemd Service Configuration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Prometheus Metrics Integration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Grafana Dashboard Setup
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Automated Backup System
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Log Rotation Configuration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ CI/CD Workflows Created
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Deployment Automation Script
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Version Tagging Automation
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Gitea Actions Runner Installation
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Systemd Watchdog Issue Fixed (Partial Completion)
**Completed:** 2026-01-18
**What Was Done:** Removed `WatchdogSec=30s` from systemd service file
**Result:** Resolved immediate 502 error; server now runs stably
**Status:** Issue fixed but full implementation (sd_notify) still pending
**Item Reference:** Item #3 (full sd_notify implementation remains as future work)
**Impact:** Production server is now stable and responding correctly

---

## Summary by Priority

**Critical (1 item):**
1. Gitea Actions runner registration

**High (4 items):**
2. TLS certificate auto-renewal
4. Invalid agent API key investigation
5. Comprehensive security audit logging
6. Session timeout enforcement

**High - Partial/Pending (1 item):**
3. Systemd watchdog implementation (issue fixed; sd_notify implementation pending)

**Medium (6 items):**
7. Grafana dashboard import
8. Grafana password change
9. Deployment SSH keys
10. Backup offsite sync
11. Alertmanager for Prometheus
12. CI/CD notification webhooks

**Low (8 items):**
13. Windows runner for agent builds
14. Staging environment
15. Code coverage thresholds
16. Performance benchmarking
17. Database replication
18. Centralized logging (ELK)
19. Agent retry logic review
20. Database pool sizing monitoring

---

## Tracking Notes

**How to Use This Document:**
1. Before starting new work, review this list
2. When discovering new issues, add them here
3. When completing items, move to "Completed Items" section
4. Prioritize based on: Security > Stability > Operations > Features
5. Update status and dates as work progresses

**Related Documents:**
- `PHASE1_COMPLETE.md` - Overall Phase 1 status
- `PHASE1_WEEK3_COMPLETE.md` - CI/CD specific items
- `CI_CD_SETUP.md` - CI/CD documentation
- `INFRASTRUCTURE_STATUS.md` - Infrastructure status

---

**Document Version:** 1.1
**Items Tracked:** 20 (1 critical, 4 high, 1 high-partial, 6 medium, 8 low)
**Last Updated:** 2026-01-18 (Item #3 marked as partial completion)
**Next Review:** Before Phase 2 planning