Add VPN configuration tools and agent documentation
Created comprehensive VPN setup tooling for Peaceful Spirit L2TP/IPsec connection and enhanced agent documentation framework. VPN Configuration (PST-NW-VPN): - Setup-PST-L2TP-VPN.ps1: Automated L2TP/IPsec setup with split-tunnel and DNS - Connect-PST-VPN.ps1: Connection helper with PPP adapter detection, DNS (192.168.0.2), and route config (192.168.0.0/24) - Connect-PST-VPN-Standalone.ps1: Self-contained connection script for remote deployment - Fix-PST-VPN-Auth.ps1: Authentication troubleshooting for CHAP/MSChapv2 - Diagnose-VPN-Interface.ps1: Comprehensive VPN interface and routing diagnostic - Quick-Test-VPN.ps1: Fast connectivity verification (DNS/router/routes) - Add-PST-VPN-Route-Manual.ps1: Manual route configuration helper - vpn-connect.bat, vpn-disconnect.bat: Simple batch file shortcuts - OpenVPN config files (Windows-compatible, abandoned for L2TP) Key VPN Implementation Details: - L2TP creates PPP adapter with connection name as interface description - UniFi auto-configures DNS (192.168.0.2) but requires manual route to 192.168.0.0/24 - Split-tunnel enabled (only remote traffic through VPN) - All-user connection for pre-login auto-connect via scheduled task - Authentication: CHAP + MSChapv2 for UniFi compatibility Agent Documentation: - AGENT_QUICK_REFERENCE.md: Quick reference for all specialized agents - documentation-squire.md: Documentation and task management specialist agent - Updated all agent markdown files with standardized formatting Project Organization: - Moved conversation logs to dedicated directories (guru-connect-conversation-logs, guru-rmm-conversation-logs) - Cleaned up old session JSONL files from projects/msp-tools/ - Added guru-connect infrastructure (agent, dashboard, proto, scripts, .gitea workflows) - Added guru-rmm server components and deployment configs Technical Notes: - VPN IP pool: 192.168.4.x (client gets 192.168.4.6) - Remote network: 192.168.0.0/24 (router at 192.168.0.10) - PSK: rrClvnmUeXEFo90Ol+z7tfsAZHeSK6w7 - 
Credentials: pst-admin / 24Hearts$ Files: 15 VPN scripts, 2 agent docs, conversation log reorganization, guru-connect/guru-rmm infrastructure additions Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
659
projects/msp-tools/guru-connect/TECHNICAL_DEBT.md
Normal file
@@ -0,0 +1,659 @@
# GuruConnect - Technical Debt & Future Work Tracker

**Last Updated:** 2026-01-18
**Project Phase:** Phase 1 Complete (89%)

---

## Critical Items (Blocking Production Use)

### 1. Gitea Actions Runner Registration
**Status:** PENDING (requires admin access)
**Priority:** HIGH
**Effort:** 5 minutes
**Tracked In:** PHASE1_WEEK3_COMPLETE.md line 181

**Description:**
The runner is installed but not yet registered with the Gitea instance. The CI/CD pipeline is ready but stays inactive until registration completes.

**Action Required:**
```bash
# Get token from: https://git.azcomputerguru.com/admin/actions/runners
sudo -u gitea-runner act_runner register \
  --instance https://git.azcomputerguru.com \
  --token YOUR_REGISTRATION_TOKEN_HERE \
  --name gururmm-runner \
  --labels ubuntu-latest,ubuntu-22.04

sudo systemctl enable gitea-runner
sudo systemctl start gitea-runner
```

**Verification:**
- Runner shows "Online" in Gitea admin panel
- Test commit triggers build workflow

---

## High Priority Items (Security & Stability)

### 2. TLS Certificate Auto-Renewal
**Status:** NOT IMPLEMENTED
**Priority:** HIGH
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
Let's Encrypt certificates currently require manual renewal. Certbot auto-renewal should be set up so certificates renew without intervention.

**Implementation:**
```bash
# Install certbot
sudo apt install certbot python3-certbot-nginx

# Obtain the certificate and let certbot update the nginx config
sudo certbot --nginx -d connect.azcomputerguru.com

# Set up automatic renewal (cron or systemd timer)
sudo systemctl enable certbot.timer
sudo systemctl start certbot.timer
```

**Verification:**
- `sudo certbot renew --dry-run` succeeds
- Certificate auto-renews before expiration

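
As a complement to the dry-run check, the remaining certificate lifetime can be measured directly. The sketch below is only an illustration: `days_until` is a hypothetical helper name, it assumes GNU `date`, and the 14-day warning threshold in the usage comment is an arbitrary choice rather than anything the project has decided.

```bash
#!/usr/bin/env bash
# Hypothetical helper: whole days between "now" and a certificate's notAfter date.
# $1 = date string as printed by `openssl x509 -enddate` (after the "notAfter=" prefix)
# $2 = optional "now" as a Unix epoch (defaults to the current time)
# Requires GNU date for the -d option.
days_until() {
  local not_after="$1"
  local now_epoch="${2:-$(date -u +%s)}"
  local end_epoch
  end_epoch=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example against the live endpoint (needs network access, so left commented out):
# end=$(echo | openssl s_client -connect connect.azcomputerguru.com:443 \
#         -servername connect.azcomputerguru.com 2>/dev/null \
#       | openssl x509 -noout -enddate | cut -d= -f2)
# [ "$(days_until "$end")" -lt 14 ] && echo "WARNING: certificate expires soon"
```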
---

### 3. Systemd Watchdog Implementation
**Status:** PARTIALLY COMPLETED (issue fixed, proper implementation pending)
**Priority:** MEDIUM
**Effort:** 4-8 hours (remaining for sd_notify implementation)
**Discovered:** 2026-01-18 (dashboard 502 error)
**Issue Fixed:** 2026-01-18

**Description:**
The systemd watchdog was causing service crashes. Removing `WatchdogSec=30s` from the service file resolved the immediate 502 error, and the server now runs stably without watchdog configuration. Proper sd_notify watchdog support should still be implemented so that systemd can restart a hung process automatically.

**Implementation:**
1. Add `systemd` crate to server/Cargo.toml
2. Implement `sd_notify_watchdog()` calls in main loop
3. Re-enable `WatchdogSec=30s` in systemd service
4. Verify that the service no longer crashes and that the watchdog restarts a hung process

**Files to Modify:**
- `server/Cargo.toml` - Add dependency
- `server/src/main.rs` - Add watchdog notifications
- `/etc/systemd/system/guruconnect.service` - Re-enable WatchdogSec

**Benefits:**
- Systemd can detect hung server process
- Automatic restart on deadlock/hang conditions

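
The keep-alive protocol itself is small, and a shell sketch may help when testing the unit file before the Rust changes land. systemd exports `WATCHDOG_USEC` to services that set `WatchdogSec=`, and the usual guidance is to send `WATCHDOG=1` at about half that interval. Everything named below (`watchdog_ping_interval`, `watchdog_loop`, `is_healthy`) is hypothetical, and whether `systemd-notify` is accepted from a helper process depends on the unit's `NotifyAccess=` setting. The server would do the equivalent in Rust.

```bash
#!/usr/bin/env bash
# Sketch of the systemd watchdog keep-alive protocol from a shell wrapper.

# Print the keep-alive interval in seconds: half of the watchdog timeout.
# $1 = timeout in microseconds (defaults to $WATCHDOG_USEC, or 0 if unset)
watchdog_ping_interval() {
  local usec="${1:-${WATCHDOG_USEC:-0}}"
  echo $(( usec / 2000000 ))
}

# Hypothetical keep-alive loop. `is_healthy` stands in for a real health
# check and is NOT defined here; the loop stops pinging once it fails,
# which lets systemd restart the service after the timeout.
watchdog_loop() {
  local interval
  interval=$(watchdog_ping_interval)
  [ "$interval" -gt 0 ] || return 0   # watchdog not enabled for this unit
  while is_healthy; do
    systemd-notify WATCHDOG=1
    sleep "$interval"
  done
}
```

With `WatchdogSec=30s`, `WATCHDOG_USEC` is 30000000, so the sketch pings every 15 seconds.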
---

### 4. Invalid Agent API Key Investigation
**Status:** ONGOING ISSUE
**Priority:** MEDIUM
**Effort:** 1-2 hours
**Discovered:** 2026-01-18

**Description:**
The agent at 172.16.3.20 (machine ID 935a3920-6e32-4da3-a74f-3e8e8b2a426a) reconnects every 5 seconds with an invalid API key.

**Log Evidence:**
```
WARN guruconnect_server::relay: Agent connection rejected: 935a3920-6e32-4da3-a74f-3e8e8b2a426a from 172.16.3.20 - invalid API key
```

**Investigation Needed:**
1. Identify which machine is 172.16.3.20
2. Check agent configuration on that machine
3. Update agent with correct API key OR remove agent
4. Consider implementing rate limiting for failed auth attempts

**Potential Impact:**
- Fills logs with warnings
- Wastes server resources processing invalid connections
- May indicate misconfigured or rogue agent

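
To gauge how noisy the problem is, rejected connections can be tallied per source address. This is a minimal sketch that assumes the log line format shown above and that the service logs to the journal under a unit named `guruconnect` (inferred from the service file path in item 3). `count_rejections_by_ip` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Count "invalid API key" rejections per source IP from log lines on stdin.
# Output: "<count> <ip>" per line, busiest source first.
count_rejections_by_ip() {
  awk '/Agent connection rejected:/ && /invalid API key/ {
         # The IP is the field that follows the word "from".
         for (i = 1; i < NF; i++) {
           if ($i == "from") { counts[$(i + 1)]++; break }
         }
       }
       END { for (ip in counts) print counts[ip], ip }' | sort -rn
}

# Example (unit name is an assumption):
# journalctl -u guruconnect --since "1 hour ago" --no-pager | count_rejections_by_ip
```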
---

### 5. Comprehensive Security Audit Logging
**Status:** PARTIALLY IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 8-16 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
Current logging covers basic operations only. A comprehensive audit trail for security events is still needed.

**Events to Track:**
- All authentication attempts (success/failure)
- Session creation/termination
- Agent connections/disconnections
- User account changes
- Configuration changes
- Administrative actions
- File transfer operations (when implemented)

**Implementation:**
1. Create `audit_logs` table in database
2. Implement `AuditLogger` service
3. Add audit calls to all security-sensitive operations
4. Create audit log viewer in dashboard
5. Implement log retention policy

**Files to Create/Modify:**
- `server/migrations/XXX_create_audit_logs.sql`
- `server/src/audit.rs` - Audit logging service
- `server/src/api/audit.rs` - Audit log API endpoints
- `server/static/audit.html` - Audit log viewer

---

### 6. Session Timeout Enforcement (UI-Side)
**Status:** NOT IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md line 51

**Description:**
JWT tokens expire after 24 hours (enforced server-side), but the UI does not detect or handle expiration gracefully.

**Implementation:**
1. Add token expiration check to dashboard JavaScript
2. Implement automatic logout on token expiration
3. Add session timeout warning (e.g., "Session expires in 5 minutes")
4. Implement token refresh mechanism (optional)

**Files to Modify:**
- `server/static/dashboard.html` - Add expiration check
- `server/static/viewer.html` - Add expiration check
- `server/src/api/auth.rs` - Add token refresh endpoint (optional)

**User Experience:**
- User gets warned before automatic logout
- Clear messaging: "Session expired, please log in again"
- No confusing error messages on expired tokens

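
The dashboard check itself belongs in JavaScript, but the same comparison can be reproduced from a terminal when debugging a token. The sketch below assumes the token carries a standard numeric `exp` claim (seconds since the Unix epoch). This document does not confirm that, and `jwt_exp` is a hypothetical helper. It only decodes the payload and does not verify the signature.

```bash
#!/usr/bin/env bash
# Print the "exp" claim (Unix epoch seconds) from a JWT given as $1.
# Decodes only; the signature is NOT verified. Requires GNU base64.
jwt_exp() {
  local payload="${1#*.}"       # drop the header segment
  payload="${payload%%.*}"      # drop the signature segment
  # Convert base64url to standard base64 and restore the stripped padding.
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d 2>/dev/null \
    | grep -o '"exp":[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# Example: seconds left before the token expires
# echo $(( $(jwt_exp "$TOKEN") - $(date -u +%s) ))
```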
---

## Medium Priority Items (Operational Excellence)

### 7. Grafana Dashboard Import
**Status:** NOT COMPLETED
**Priority:** MEDIUM
**Effort:** 15 minutes
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
The dashboard JSON file exists but has not been imported into Grafana.

**Action Required:**
1. Log in to Grafana: http://172.16.3.30:3000
2. Go to Dashboards > Import
3. Upload `infrastructure/grafana-dashboard.json`
4. Verify all panels display data

**File Location:**
- `infrastructure/grafana-dashboard.json`

---

### 8. Grafana Default Password Change
**Status:** NOT CHANGED
**Priority:** MEDIUM
**Effort:** 2 minutes
**Tracked In:** Multiple docs

**Description:**
Grafana is still using the default admin/admin credentials.

**Action Required:**
1. Log in to Grafana: http://172.16.3.30:3000
2. Change password from admin/admin to secure password
3. Update documentation with new password

**Security Risk:**
- Low (internal network only, not exposed to internet)
- But should follow security best practices

---

### 9. Deployment SSH Keys for Full Automation
**Status:** NOT CONFIGURED
**Priority:** MEDIUM
**Effort:** 1-2 hours
**Tracked In:** PHASE1_WEEK3_COMPLETE.md, CI_CD_SETUP.md

**Description:**
The CI/CD deployment workflow is ready but requires SSH key configuration for full automation.

**Implementation:**
```bash
# Generate SSH key for runner
sudo -u gitea-runner ssh-keygen -t ed25519 -C "gitea-runner@gururmm"

# Add public key to guru's authorized_keys
# (tee runs as guru, so the append does not depend on the calling shell's permissions)
sudo -u gitea-runner cat /home/gitea-runner/.ssh/id_ed25519.pub \
  | sudo -u guru tee -a ~guru/.ssh/authorized_keys > /dev/null

# Test SSH connection
sudo -u gitea-runner ssh guru@172.16.3.30 whoami

# Add secrets to Gitea repository settings
# SSH_PRIVATE_KEY - content of /home/gitea-runner/.ssh/id_ed25519
# SSH_HOST - 172.16.3.30
# SSH_USER - guru
```

**Current State:**
- Manual deployment works via deploy.sh
- Automated deployment via workflow will fail on SSH step

---

### 10. Backup Offsite Sync
**Status:** NOT IMPLEMENTED
**Priority:** MEDIUM
**Effort:** 4-8 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Daily backups are stored locally but not synced offsite, which risks data loss if the server fails.

**Implementation Options:**

**Option A: Rsync to Remote Server**
```bash
# Add to backup script
rsync -avz /home/guru/backups/guruconnect/ \
  backup-server:/backups/gururmm/guruconnect/
```

**Option B: Cloud Storage (S3, Azure Blob, etc.)**
```bash
# Install rclone
sudo apt install rclone

# Configure cloud provider
rclone config

# Sync backups
rclone sync /home/guru/backups/guruconnect/ remote:guruconnect-backups/
```

**Considerations:**
- Encryption for backups in transit
- Retention policy on remote storage
- Cost of cloud storage
- Bandwidth usage

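
For the retention consideration, a simple age-based prune is one possible starting point. The sketch below would run on whichever host holds the copies (for Option A, the remote backup server). `prune_old_backups` is a hypothetical name, the 30-day value in the usage comment is only an example, and it assumes a `find` that supports `-delete` (GNU or BSD).

```bash
#!/usr/bin/env bash
# Delete regular files under a directory that were last modified more than
# N full days ago, printing each path as it is removed.
# $1 = backup directory, $2 = number of days to keep
prune_old_backups() {
  local dir="$1" keep_days="$2"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  find "$dir" -type f -mtime +"$keep_days" -print -delete
}

# Example (path taken from Option A, retention period is an assumption):
# prune_old_backups /backups/gururmm/guruconnect 30
```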
---

### 11. Alertmanager for Prometheus
**Status:** NOT CONFIGURED
**Priority:** MEDIUM
**Effort:** 4-8 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Prometheus collects metrics, but no alerting is configured. Alerts should notify operators when issues occur.

**Alerts to Configure:**
- Service down
- High error rate
- Database connection failures
- Disk space low
- High CPU/memory usage
- Failed authentication spike

**Implementation:**
```bash
# Install Alertmanager
sudo apt install prometheus-alertmanager

# Configure alert rules
# (the file must also be listed under rule_files in the Prometheus config)
sudo tee /etc/prometheus/alert.rules.yml << 'EOF'
groups:
  - name: guruconnect
    rules:
      - alert: ServiceDown
        expr: up{job="guruconnect"} == 0
        for: 1m
        annotations:
          summary: "GuruConnect service is down"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
EOF

# Configure notification channels (email, Slack, etc.)
```

---

### 12. CI/CD Notification Webhooks
**Status:** NOT CONFIGURED
**Priority:** LOW
**Effort:** 2-4 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
No notifications are sent when builds fail or deployments complete.

**Implementation:**
1. Configure webhook in Gitea repository settings
2. Point to Slack/Discord/Email service
3. Select events: Push, Pull Request, Release
4. Test notifications

**Events to Notify:**
- Build started
- Build failed
- Build succeeded
- Deployment started
- Deployment completed
- Deployment failed

---

## Low Priority Items (Future Enhancements)

### 13. Windows Runner for Native Agent Builds
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 8-16 hours
**Tracked In:** PHASE1_WEEK3_COMPLETE.md

**Description:**
The Windows agent is currently cross-compiled from Linux. Native Windows builds would be faster and more reliable.

**Implementation:**
1. Set up Windows server/VM
2. Install Gitea Actions runner on Windows
3. Configure runner with windows-latest label
4. Update build workflow to use Windows runner for agent builds

**Benefits:**
- Faster agent builds (no cross-compilation)
- More accurate Windows testing
- Ability to run Windows-specific tests

**Cost:**
- Windows Server license (or Windows 10/11 Pro)
- Additional hardware/VM resources

---

### 14. Staging Environment
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
All changes deploy directly to production. A staging environment is needed for testing changes first.

**Implementation:**
1. Set up staging server (VM or separate port)
2. Configure separate database for staging
3. Update CI/CD workflows:
   - Push to develop → Deploy to staging
   - Push tag → Deploy to production
4. Add smoke tests for staging

**Benefits:**
- Test deployments before production
- QA environment for testing
- Reduced production downtime

---

### 15. Code Coverage Thresholds
**Status:** NOT ENFORCED
**Priority:** LOW
**Effort:** 2-4 hours
**Tracked In:** Multiple docs

**Description:**
Code coverage is collected, but no minimum threshold is enforced.

**Implementation:**
1. Analyze current coverage baseline
2. Set reasonable thresholds (e.g., 70% overall)
3. Update test workflow to fail if below threshold
4. Add coverage badge to README

**Files to Modify:**
- `.gitea/workflows/test.yml` - Add threshold check
- `README.md` - Add coverage badge

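
The threshold step could be as small as one comparison in the workflow's shell step. This sketch assumes the coverage tool's output has already been reduced to a single percentage, and how that number is extracted depends on the tool, which this document does not name. `check_coverage` is a hypothetical helper and 70 is just the example threshold from the list above.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) when the measured coverage meets the minimum, fail otherwise.
# $1 = measured coverage percentage (may be fractional), $2 = minimum percentage
check_coverage() {
  local actual="$1" minimum="$2"
  # Shell arithmetic is integer-only, so delegate the comparison to awk.
  awk -v a="$actual" -v m="$minimum" 'BEGIN { exit (a + 0 >= m + 0) ? 0 : 1 }'
}

# Example workflow step (COVERAGE_PCT is a placeholder set by an earlier step):
# check_coverage "$COVERAGE_PCT" 70 || { echo "coverage below 70%" >&2; exit 1; }
```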
---

### 16. Performance Benchmarking in CI
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 8-16 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
There is no automated performance testing, so performance regressions can go unnoticed.

**Implementation:**
1. Create performance benchmarks using `criterion`
2. Add benchmark job to CI workflow
3. Track performance trends over time
4. Alert on performance regression (>10% slower)

**Benchmarks to Add:**
- WebSocket message throughput
- Authentication latency
- Database query performance
- Screen capture encoding speed

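
The ">10% slower" rule reduces to comparing a current timing against a stored baseline. A minimal sketch, assuming both values are durations in the same unit (lower is better) and have already been extracted from the benchmark output. `is_regression` is a hypothetical name, and storing and retrieving the baseline is left open.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) when the current timing is slower than the baseline by
# more than the tolerance.
# $1 = baseline duration, $2 = current duration, $3 = tolerance in percent (default 10)
is_regression() {
  local baseline="$1" current="$2" tolerance_pct="${3:-10}"
  awk -v b="$baseline" -v c="$current" -v t="$tolerance_pct" \
    'BEGIN { exit (c > b * (1 + t / 100)) ? 0 : 1 }'
}

# Example: fail the CI job on a regression (values are placeholders)
# if is_regression "$BASELINE_NS" "$CURRENT_NS"; then
#   echo "performance regression detected" >&2; exit 1
# fi
```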
---

### 17. Database Replication
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
The database runs as a single instance, with no high availability or read scaling.

**Implementation:**
1. Set up PostgreSQL streaming replication
2. Configure automatic failover (pg_auto_failover)
3. Update application to use read replicas
4. Test failover scenarios

**Benefits:**
- High availability
- Read scaling
- Faster backups (from replica)

**Complexity:**
- Significant operational overhead
- Monitoring and alerting needed
- Failover testing required

---

### 18. Centralized Logging (ELK Stack)
**Status:** NOT IMPLEMENTED
**Priority:** LOW
**Effort:** 16-32 hours
**Tracked In:** PHASE1_COMPLETE.md

**Description:**
Logs are stored in the systemd journal, which makes them hard to search across time periods.

**Implementation:**
1. Install Elasticsearch, Logstash, Kibana
2. Configure log shipping from systemd journal
3. Create Kibana dashboards
4. Set up log retention policy

**Benefits:**
- Powerful log search
- Log aggregation across services
- Visual log analysis

**Cost:**
- Significant resource usage (RAM for Elasticsearch)
- Operational complexity

---

## Discovered Issues (Need Investigation)

### 19. Agent Connection Retry Logic
**Status:** NEEDS REVIEW
**Priority:** LOW
**Effort:** 2-4 hours
**Discovered:** 2026-01-18

**Description:**
The agent at 172.16.3.20 retries every 5 seconds with an invalid API key. The agent should back off exponentially after failed authentication, or the server should rate limit repeated failures.

**Investigation:**
1. Check agent retry logic in codebase
2. Determine if 5-second retry is intentional
3. Consider exponential backoff for failed auth
4. Add server-side rate limiting for repeated failures

**Files to Review:**
- `agent/src/transport/` - WebSocket connection logic
- `server/src/relay/` - Rate limiting for auth failures

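
The backoff schedule under consideration is easy to state precisely. The sketch below only illustrates the arithmetic (the agent itself is not written in shell): the delay doubles per failed attempt from the current 5-second base and is capped. `backoff_delay` is a hypothetical name and the 300-second cap is an assumed value, not a project decision.

```bash
#!/usr/bin/env bash
# Print the delay in seconds before retry number $1 (0-based).
# $2 = base delay in seconds (default 5), $3 = maximum delay (default 300)
backoff_delay() {
  local attempt="$1" base="${2:-5}" cap="${3:-300}"
  # Clamp large attempt counts so the shift below cannot overflow.
  if [ "$attempt" -ge 16 ]; then
    echo "$cap"
    return 0
  fi
  local delay=$(( base * (1 << attempt) ))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Resulting schedule with the defaults: 5, 10, 20, 40, 80, 160, 300, 300, ...
```

A production version would usually add random jitter so that many agents do not retry in lockstep.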
---

### 20. Database Connection Pool Sizing
**Status:** NEEDS MONITORING
**Priority:** LOW
**Effort:** 2-4 hours
**Discovered:** During infrastructure setup

**Description:**
The default connection pool settings may not be optimal and need to be monitored under load.

**Monitoring:**
- Check `db_connections_active` metric in Prometheus
- Monitor for pool exhaustion warnings
- Track query latency

**Tuning:**
- Adjust `max_connections` in PostgreSQL config
- Adjust pool size in server .env file
- Monitor and iterate

---

## Completed Items (For Reference)

### ✓ Systemd Service Configuration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Prometheus Metrics Integration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Grafana Dashboard Setup
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Automated Backup System
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ Log Rotation Configuration
**Completed:** 2026-01-17
**Phase:** Phase 1 Week 2

### ✓ CI/CD Workflows Created
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Deployment Automation Script
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Version Tagging Automation
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Gitea Actions Runner Installation
**Completed:** 2026-01-18
**Phase:** Phase 1 Week 3

### ✓ Systemd Watchdog Issue Fixed (Partial Completion)
**Completed:** 2026-01-18
**What Was Done:** Removed `WatchdogSec=30s` from systemd service file
**Result:** Resolved immediate 502 error; server now runs stably
**Status:** Issue fixed but full implementation (sd_notify) still pending
**Item Reference:** Item #3 (full sd_notify implementation remains as future work)
**Impact:** Production server is now stable and responding correctly

---

## Summary by Priority

**Critical (1 item):**
1. Gitea Actions runner registration

**High (4 items):**
2. TLS certificate auto-renewal
4. Invalid agent API key investigation
5. Comprehensive security audit logging
6. Session timeout enforcement

**High - Partial/Pending (1 item):**
3. Systemd watchdog implementation (issue fixed; sd_notify implementation pending)

**Medium (6 items):**
7. Grafana dashboard import
8. Grafana password change
9. Deployment SSH keys
10. Backup offsite sync
11. Alertmanager for Prometheus
12. CI/CD notification webhooks

**Low (8 items):**
13. Windows runner for agent builds
14. Staging environment
15. Code coverage thresholds
16. Performance benchmarking
17. Database replication
18. Centralized logging (ELK)
19. Agent retry logic review
20. Database pool sizing monitoring

---

## Tracking Notes

**How to Use This Document:**
1. Before starting new work, review this list
2. When discovering new issues, add them here
3. When completing items, move to "Completed Items" section
4. Prioritize based on: Security > Stability > Operations > Features
5. Update status and dates as work progresses

**Related Documents:**
- `PHASE1_COMPLETE.md` - Overall Phase 1 status
- `PHASE1_WEEK3_COMPLETE.md` - CI/CD specific items
- `CI_CD_SETUP.md` - CI/CD documentation
- `INFRASTRUCTURE_STATUS.md` - Infrastructure status

---

**Document Version:** 1.1
**Items Tracked:** 20 (1 critical, 4 high, 1 high-partial, 6 medium, 8 low)
**Last Updated:** 2026-01-18 (Item #3 marked as partial completion)
**Next Review:** Before Phase 2 planning