Add GuruRMM real-time tunnel architecture and planning

Comprehensive design for transforming agents from 30s heartbeat mode to persistent tunnel mode, enabling Claude Code to execute commands on remote machines through secure multiplexed WebSocket channels.

Additions:
- Complete implementation plan with 5-phase roadmap (5-7 weeks to GA)
- Detailed architecture document covering protocol, security, and MCP integration
- Database migration for tech_sessions and tunnel_audit tables

Key architectural decisions:
- Hybrid lifecycle: WebSocket persistent, tunnel is operational state
- Channel multiplexing over single WebSocket (terminal, file ops, etc.)
- Three-layer security: JWT auth, session authorization, command validation
- Custom MCP server for Claude Code integration

Next: Phase 1 implementation (tunnel open/close endpoints, agent mode state machine)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-04-14 06:32:16 -07:00
parent 9ab36352ae
commit 9940faf34a
3 changed files with 1113 additions and 0 deletions
--- a/.claude/gururmm-tunnel-plan.md
+++ b/.claude/gururmm-tunnel-plan.md
@@ -0,0 +1,396 @@
+# GuruRMM Real-Time Tunnel Implementation Plan
+
+## Overview
+
+Transform GuruRMM agents from periodic check-in mode (30-second heartbeats) to persistent tunnel mode, so that Claude Code on a tech's workstation can execute commands on remote machines through secure multiplexed channels.
+
+---
+
+## Architecture Summary
+
+### Current State (Confirmed via exploration)
+- **Server:** Axum 0.7 @ 172.16.3.30:3001, WebSocket endpoint, AgentConnections HashMap
+- **Agent:** Tokio async, 30-second heartbeat confirmed, 3 concurrent tasks (metrics/network/heartbeat)
+- **Protocol:** Tagged JSON enums (ServerMessage/AgentMessage) with serde
+
+### Key Architectural Decisions
+
+1. **Tunnel Lifecycle:** Hybrid - the WebSocket stays persistent; tunnel mode is an operational state change
+   - Agent modes: Heartbeat (default) ↔ Tunnel (active session)
+   - One tunnel per agent, on-demand activation, instant mode switching
+
+2. **Channel Multiplexing:** Unified protocol with channel_id routing
+   - Single WebSocket, multiple logical channels
+   - Enables concurrent operations (multiple terminals, simultaneous file transfers)
+   - Channel types: Terminal, FileRead, FileWrite, FileList, Registry, Services
+
+3. **Claude Integration:** Custom MCP server
+   - Tools: `gururmm_run_command`, `gururmm_read_file`, `gururmm_write_file`, `gururmm_list_directory`, `gururmm_list_agents`
+   - JWT authentication via environment variable
+   - Auto-manages tunnel sessions (open on first use, keep-alive, close on idle)
+
+4. **Security:** Three-layer model
+   - Layer 1: JWT authentication (24h expiration)
+   - Layer 2: Session authorization (tech_sessions table, 4h inactivity timeout)
+   - Layer 3: Command validation (working directory allowlist, rate limiting 100/min, audit logging)
+
+---
+
+## Protocol Extensions
+
+### New Message Types
+
+```rust
+// Server → Agent
+enum ServerMessage {
+    // ... existing ...
+    TunnelOpen { session_id: String, tech_id: i32 },
+    TunnelClose { session_id: String },
+    TunnelData { channel_id: String, data: TunnelDataPayload },
+}
+
+// Agent → Server
+enum AgentMessage {
+    // ... existing ...
+    TunnelReady { session_id: String },
+    TunnelData { channel_id: String, data: TunnelDataPayload },
+    TunnelError { channel_id: String, error: String },
+}
+
+enum TunnelDataPayload {
+    Terminal { command: String },
+    TerminalOutput { stdout: String, stderr: String, exit_code: Option<i32> },
+    FileRead { path: String },
+    FileContent { content: Vec<u8>, mime_type: String },
+    FileWrite { path: String, content: Vec<u8> },
+    FileList { path: String },
+    FileListResult { entries: Vec<FileEntry> },
+}
+```
+
+### Agent Mode State Machine
+
+```rust
+enum AgentMode {
+    Heartbeat,  // Default: 30s heartbeats, metrics, network monitoring
+    Tunnel {
+        session_id: String,
+        tech_id: i32,
+        channels: HashMap<String, ChannelType>,
+    },
+}
+```
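The mode switch above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the agent's actual code: the `open_tunnel` and `close_tunnel` methods, the `ModeError` type, and the two-variant `ChannelType` are hypothetical names, and the sketch assumes a second `TunnelOpen` is rejected while a session is active (the "one tunnel per agent" rule).

```rust
use std::collections::HashMap;

// Simplified stand-in for the plan's channel types (assumption).
#[derive(Debug, Clone, PartialEq)]
enum ChannelType {
    Terminal,
    FileRead,
}

#[derive(Debug, PartialEq)]
enum AgentMode {
    Heartbeat,
    Tunnel {
        session_id: String,
        tech_id: i32,
        channels: HashMap<String, ChannelType>,
    },
}

// Hypothetical error type for rejected transitions.
#[derive(Debug, PartialEq)]
enum ModeError {
    TunnelAlreadyOpen,
    SessionMismatch,
}

impl AgentMode {
    /// Heartbeat -> Tunnel. Fails if a tunnel is already active.
    fn open_tunnel(&mut self, session_id: &str, tech_id: i32) -> Result<(), ModeError> {
        if matches!(*self, AgentMode::Tunnel { .. }) {
            return Err(ModeError::TunnelAlreadyOpen);
        }
        *self = AgentMode::Tunnel {
            session_id: session_id.to_string(),
            tech_id,
            channels: HashMap::new(),
        };
        Ok(())
    }

    /// Tunnel -> Heartbeat. Dropping the map stands in for channel cleanup.
    fn close_tunnel(&mut self, session_id: &str) -> Result<(), ModeError> {
        let is_active_session = match &*self {
            AgentMode::Tunnel { session_id: active, .. } => active.as_str() == session_id,
            AgentMode::Heartbeat => false,
        };
        if !is_active_session {
            return Err(ModeError::SessionMismatch);
        }
        *self = AgentMode::Heartbeat;
        Ok(())
    }
}
```

Starting from `AgentMode::Heartbeat`, `open_tunnel("s1", 7)` succeeds, a second `open_tunnel` returns `TunnelAlreadyOpen`, and `close_tunnel("s1")` returns the agent to `Heartbeat`. Keeping the transition logic in two small methods makes it easy to unit-test without a WebSocket connection.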
+
+---
+
+## Implementation Phases
+
+### Phase 1: Core Tunnel Infrastructure (Week 1)
+**Goal:** Establish tunnel mode switching and channel routing
+
+**Server:**
+- Add TunnelOpen/TunnelClose/TunnelData to ServerMessage enum
+- Create tech_sessions table (id, session_id, tech_id, agent_id, opened_at, last_activity, status)
+- Implement endpoints: POST /api/v1/tunnel/open, POST /api/v1/tunnel/close, GET /api/v1/tunnel/status/:session_id
+- Add channel routing in WebSocket handler (route by channel_id)
+- Session validation middleware (JWT + ownership check)
+
+**Agent:**
+- Add TunnelReady/TunnelData/TunnelError to AgentMessage enum
+- Implement AgentMode state machine
+- Add channel manager (HashMap<channel_id, ChannelHandler>)
+- Handle TunnelOpen → respond TunnelReady
+- Handle TunnelClose → cleanup channels, return to heartbeat mode
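The channel manager in the agent list above could look roughly like the following. Only the idea of a `HashMap<channel_id, ChannelHandler>` comes from the plan; the `ChannelHandler` trait shape, the `EchoHandler` toy handler, and the string payloads are assumptions made to keep the sketch dependency-free.

```rust
use std::collections::HashMap;

/// Hypothetical handler interface; the plan only names `ChannelHandler`.
trait ChannelHandler {
    fn handle(&mut self, payload: &str) -> String;
}

/// Toy handler that echoes its input, standing in for a terminal channel.
struct EchoHandler {
    prefix: String,
}

impl ChannelHandler for EchoHandler {
    fn handle(&mut self, payload: &str) -> String {
        format!("{}:{}", self.prefix, payload)
    }
}

#[derive(Default)]
struct ChannelManager {
    channels: HashMap<String, Box<dyn ChannelHandler>>,
}

impl ChannelManager {
    /// Register a handler under a channel_id.
    fn open(&mut self, channel_id: &str, handler: Box<dyn ChannelHandler>) {
        self.channels.insert(channel_id.to_string(), handler);
    }

    /// Route by channel_id; an unknown id corresponds to the TunnelError case.
    fn route(&mut self, channel_id: &str, payload: &str) -> Result<String, String> {
        match self.channels.get_mut(channel_id) {
            Some(handler) => Ok(handler.handle(payload)),
            None => Err(format!("unknown channel: {}", channel_id)),
        }
    }

    /// TunnelClose: drop every channel and report how many were cleaned up.
    fn close_all(&mut self) -> usize {
        let closed = self.channels.len();
        self.channels.clear();
        closed
    }
}
```

With two `EchoHandler` channels registered as `term-1` and `term-2`, `route("term-2", "dir")` reaches only the second handler, and any `route` call after `close_all()` returns an error.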
+
+**Critical Files:**
+- `server/src/ws/mod.rs` - WebSocket handler, protocol definitions
+- `server/src/routes/tunnel.rs` - NEW: Tunnel API endpoints
+- `server/src/middleware/auth.rs` - Session validation
+- `agent/src/transport/websocket.rs` - WebSocket client, protocol handling
+- `agent/src/tunnel/mod.rs` - NEW: Tunnel mode manager
+- `migrations/XXX_create_tech_sessions.sql` - NEW: Database schema
+
+### Phase 2: Terminal Channel (Week 2)
+**Goal:** Execute PowerShell/cmd/bash commands through tunnel
+
+**Implementation:**
+- Create TerminalChannel handler on agent (spawn child process, capture streams)
+- Implement TunnelDataPayload::Terminal on server
+- Working directory validation on agent (configurable allowlist)
+- Command result streaming for long-running commands
+- Endpoint: POST /api/v1/tunnel/:session_id/command
+
+**Critical Files:**
+- `agent/src/tunnel/terminal.rs` - NEW: Terminal channel handler
+- `server/src/routes/tunnel.rs` - Add command execution endpoint
+- `agent/config.toml` - Add allowed_paths configuration
+
+### Phase 3: File Operations (Week 3)
+**Goal:** Read, write, list files through tunnel
+
+**Implementation:**
+- Create FileChannel handler on agent
+- Chunked transfer for files > 1MB (transfer_id tracking)
+- Base64 encoding for binary data
+- MIME type detection (magic numbers)
+- Endpoints (under /api/v1/tunnel/:session_id): GET /file, PUT /file, POST /file/list
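A minimal sketch of the chunked transfer idea, assuming the 1MB threshold doubles as the chunk size. The `FileChunk` struct and the `split_into_chunks` and `reassemble` functions are hypothetical; the plan only specifies chunking above 1MB with `transfer_id` tracking.

```rust
/// 1 MiB threshold from the plan; also used as the chunk size here (assumption).
const CHUNK_SIZE: usize = 1024 * 1024;

/// Hypothetical wire shape for one chunk of a transfer.
#[derive(Debug, Clone, PartialEq)]
struct FileChunk {
    transfer_id: String,
    index: usize,
    total: usize,
    data: Vec<u8>,
}

fn split_into_chunks(transfer_id: &str, content: &[u8], chunk_size: usize) -> Vec<FileChunk> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    // An empty file still yields one empty chunk so the receiver sees the transfer.
    let total = std::cmp::max(1, (content.len() + chunk_size - 1) / chunk_size);
    (0..total)
        .map(|index| {
            let start = index * chunk_size;
            let end = std::cmp::min(start + chunk_size, content.len());
            FileChunk {
                transfer_id: transfer_id.to_string(),
                index,
                total,
                data: content[start..end].to_vec(),
            }
        })
        .collect()
}

/// Reassemble chunks in any arrival order. Returns None if a chunk is
/// missing, duplicated, or belongs to a different transfer.
fn reassemble(mut chunks: Vec<FileChunk>) -> Option<Vec<u8>> {
    let transfer_id = chunks.first()?.transfer_id.clone();
    let total = chunks.first()?.total;
    if chunks.len() != total {
        return None;
    }
    chunks.sort_by_key(|c| c.index);
    let mut out = Vec::new();
    for (expected, chunk) in chunks.iter().enumerate() {
        if chunk.index != expected || chunk.transfer_id != transfer_id || chunk.total != total {
            return None;
        }
        out.extend_from_slice(&chunk.data);
    }
    Some(out)
}
```

Carrying `index` and `total` in every chunk lets the receiver detect an incomplete transfer without a separate manifest message. Base64 encoding, which the plan applies to binary data, would wrap `data` at the JSON boundary and is left out here.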
+
+**Critical Files:**
+- `agent/src/tunnel/file.rs` - NEW: File channel handler
+- `server/src/routes/tunnel.rs` - Add file operation endpoints
+- `common/src/transfer.rs` - NEW: Chunked transfer utilities
+
+### Phase 4: MCP Server Integration (Week 4)
+**Goal:** Expose tunnel operations as MCP tools for Claude Code
+
+**Implementation:**
+- Create new project: `gururmm-mcp-server` (Rust)
+- Use `mcp-server-rs` crate
+- Implement 5 core tools (run_command, read_file, write_file, list_dir, list_agents)
+- JWT token from environment variable (GURURMM_AUTH_TOKEN)
+- Auto-manage tunnel sessions (open on first tool use, 5min idle timeout)
+
+**Critical Files:**
+- `mcp-server/src/main.rs` - NEW: MCP server entry point
+- `mcp-server/src/tools.rs` - NEW: Tool implementations
+- `mcp-server/src/session.rs` - NEW: Session manager
+- `mcp-server/Cargo.toml` - NEW: Dependencies
+
+**MCP Config Example:**
+```json
+{
+  "mcpServers": {
+    "gururmm": {
+      "command": "gururmm-mcp-server",
+      "env": {
+        "GURURMM_API_URL": "http://172.16.3.30:3001",
+        "GURURMM_AUTH_TOKEN": "jwt-token-here"
+      }
+    }
+  }
+}
+```
+
+### Phase 5: Advanced Features (Week 5+)
+- Registry operations (Windows winreg crate)
+- Service management (sc.exe/WMI on Windows, systemctl on Linux)
+- Interactive terminal with PTY (stretch goal)
+
+---
+
+## Database Schema
+
+```sql
+CREATE TABLE tech_sessions (
+    id SERIAL PRIMARY KEY,
+    session_id VARCHAR(36) UNIQUE NOT NULL,
+    tech_id INTEGER NOT NULL REFERENCES techs(id),
+    agent_id INTEGER NOT NULL REFERENCES agents(id),
+    opened_at TIMESTAMP NOT NULL DEFAULT NOW(),
+    last_activity TIMESTAMP NOT NULL DEFAULT NOW(),
+    closed_at TIMESTAMP,
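+    status VARCHAR(20) NOT NULL DEFAULT 'active'
+);
+
+-- PostgreSQL does not accept a WHERE clause on an inline UNIQUE constraint;
+-- a partial unique index expresses the same rule.
+CREATE UNIQUE INDEX idx_tech_sessions_active
+    ON tech_sessions(tech_id, agent_id) WHERE status = 'active';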
+
+CREATE TABLE tunnel_audit (
+    id SERIAL PRIMARY KEY,
+    session_id VARCHAR(36) NOT NULL REFERENCES tech_sessions(session_id),
+    channel_id VARCHAR(36) NOT NULL,
+    operation VARCHAR(50) NOT NULL,
+    details JSONB,
+    created_at TIMESTAMP NOT NULL DEFAULT NOW()
+);
+
+CREATE INDEX idx_tech_sessions_tech ON tech_sessions(tech_id);
+CREATE INDEX idx_tech_sessions_agent ON tech_sessions(agent_id);
+CREATE INDEX idx_tunnel_audit_session ON tunnel_audit(session_id);
+```
+
+---
+
+## API Endpoints (New)
+
+```
+POST   /api/v1/tunnel/open
+       Body: { "agent_id": 123 }
+       Response: { "session_id": "uuid", "status": "active" }
+
+POST   /api/v1/tunnel/close
+       Body: { "session_id": "uuid" }
+
+GET    /api/v1/tunnel/status/:session_id
+
+POST   /api/v1/tunnel/:session_id/command
+       Body: { "command": "...", "shell": "powershell", "working_dir": "...", "timeout": 30000 }
+
+GET    /api/v1/tunnel/:session_id/file?path=...
+
+PUT    /api/v1/tunnel/:session_id/file?path=...
+
+POST   /api/v1/tunnel/:session_id/file/list?path=...
+```
+
+---
+
+## MCP Tools
+
+```
+gururmm_run_command(agent_id, command, shell, working_dir, timeout)
+gururmm_read_file(agent_id, path)
+gururmm_write_file(agent_id, path, content)
+gururmm_list_directory(agent_id, path)
+gururmm_list_agents()
+```
+
+---
+
+## Security Implementation
+
+### Working Directory Validation
+```toml
+# agent/config.toml
+[security]
+allowed_paths = ["C:\\Shares", "C:\\Temp"]
+```
+
+The agent validates every file operation against the allowlist and rejects any path containing a traversal component (`..`).
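One way the allowlist check might be written is shown below. The function name `is_path_allowed` is hypothetical, and the check is purely lexical: it does not touch the filesystem, so symlinks are not resolved (a real agent would likely canonicalize as well). The usage example uses Unix-style paths so it behaves the same on any platform; on Windows the same logic applies to roots such as `C:\Shares`.

```rust
use std::path::{Component, Path};

/// Lexical allowlist check (assumption: no symlink resolution is performed).
fn is_path_allowed(requested: &str, allowed_roots: &[&str]) -> bool {
    let path = Path::new(requested);
    // Reject any `..` component outright instead of trying to normalize it.
    if path.components().any(|c| matches!(c, Component::ParentDir)) {
        return false;
    }
    // `starts_with` compares whole components, so "/srv/shares-evil"
    // does not match the root "/srv/shares".
    allowed_roots.iter().any(|root| path.starts_with(*root))
}
```

With roots `["/srv/shares", "/tmp/work"]`, the path `/srv/shares/report.txt` is accepted, while `/srv/shares/../etc/passwd` and `/srv/shares-evil/x` are both rejected. Comparing by path component rather than by string prefix is what closes the `shares-evil` case.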
+
+### Rate Limiting
+- Server enforces: 100 commands per minute per tech per agent
+- Sliding window (in-memory or Redis)
+- 429 response on limit exceeded
+- Violations logged to tunnel_audit
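The in-memory variant of the sliding window could be sketched as follows. The `RateLimiter` type and its `allow` method are hypothetical, and timestamps are passed in as milliseconds so the logic is deterministic; a server would read a monotonic clock instead.

```rust
use std::collections::{HashMap, VecDeque};

/// Sliding-window limiter keyed by (tech_id, agent_id).
struct RateLimiter {
    limit: usize,
    window_ms: u64,
    hits: HashMap<(i32, i32), VecDeque<u64>>,
}

impl RateLimiter {
    fn new(limit: usize, window_ms: u64) -> Self {
        Self { limit, window_ms, hits: HashMap::new() }
    }

    /// Returns true if the command may run; false corresponds to an HTTP 429.
    fn allow(&mut self, tech_id: i32, agent_id: i32, now_ms: u64) -> bool {
        let window = self.hits.entry((tech_id, agent_id)).or_default();
        // Evict timestamps that have fallen out of the window.
        while let Some(&oldest) = window.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                window.pop_front();
            } else {
                break;
            }
        }
        if window.len() >= self.limit {
            return false;
        }
        window.push_back(now_ms);
        true
    }
}
```

`RateLimiter::new(100, 60_000)` matches the plan's 100 commands per minute. Rejected calls are not recorded, so a client that keeps hammering the endpoint does not extend its own lockout; whether that is the desired policy is a design choice the plan leaves open.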
+
+### Command Injection Prevention
+- tokio::process::Command (no shell expansion)
+- PowerShell: `-NoProfile -NonInteractive -Command`
+- Input sanitization (escape quotes, reject backticks)
+- Timeout enforcement
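The first three bullets can be illustrated without spawning anything. This sketch uses `std::process::Command` as a stand-in for `tokio::process::Command` (which exposes the same builder methods); `build_powershell`, `arg_list`, and `CommandError` are hypothetical names, and timeout enforcement is omitted.

```rust
use std::process::Command;

/// Hypothetical validation error.
#[derive(Debug, PartialEq)]
enum CommandError {
    Empty,
    Backtick,
}

/// Build (but do not spawn) a PowerShell invocation. Each flag is a separate
/// argv entry, so no intermediate shell re-parses the command line.
fn build_powershell(command: &str) -> Result<Command, CommandError> {
    if command.trim().is_empty() {
        return Err(CommandError::Empty);
    }
    // The plan rejects backticks (PowerShell's escape character).
    if command.contains('`') {
        return Err(CommandError::Backtick);
    }
    let mut cmd = Command::new("powershell.exe");
    cmd.args(["-NoProfile", "-NonInteractive", "-Command"]).arg(command);
    Ok(cmd)
}

/// Helper for inspecting the argv that would be passed to the child process.
fn arg_list(cmd: &Command) -> Vec<String> {
    cmd.get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect()
}
```

Note that the string following `-Command` is still interpreted by PowerShell itself. Passing arguments as a vector only removes the intermediate shell layer, so the sanitization step and the working directory allowlist remain necessary.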
+
+### Session Security
+- JWT 24h expiration
+- Sessions auto-expire 4h inactivity
+- One tunnel per agent (prevents concurrent session conflicts)
+- Admin force-close endpoint
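The 4-hour inactivity rule might be enforced by a periodic sweep like the one below. `TechSession` here is an in-memory stand-in for a `tech_sessions` row, and `touch`, `is_expired`, and `sweep_expired` are hypothetical names; the sketch assumes the server tracks `last_activity` with a monotonic clock.

```rust
use std::time::{Duration, Instant};

/// 4h inactivity timeout from the plan.
const INACTIVITY_TIMEOUT: Duration = Duration::from_secs(4 * 60 * 60);

/// In-memory stand-in for a tech_sessions row (assumption).
struct TechSession {
    session_id: String,
    last_activity: Instant,
}

impl TechSession {
    /// Record activity, e.g. on every command or file operation.
    fn touch(&mut self, now: Instant) {
        self.last_activity = now;
    }

    fn is_expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_activity) >= INACTIVITY_TIMEOUT
    }
}

/// Remove expired sessions and return their ids so the caller can mark
/// them closed in the database and send TunnelClose to the agents.
fn sweep_expired(sessions: &mut Vec<TechSession>, now: Instant) -> Vec<String> {
    let mut expired = Vec::new();
    sessions.retain(|s| {
        if s.is_expired(now) {
            expired.push(s.session_id.clone());
            false
        } else {
            true
        }
    });
    expired
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` inside, keeps the expiry logic testable with synthetic timestamps.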
+
+---
+
+## Testing Strategy
+
+### Unit Tests
+- Channel routing (correct channel receives message)
+- Session validation (JWT + ownership)
+- Command sanitization
+- Path validation (traversal prevention)
+
+### Integration Tests
+- Full tunnel lifecycle (open → command → close)
+- Concurrent sessions to different agents
+- Session timeout enforcement
+- Rate limiting
+
+### End-to-End Tests
+- Claude Code MCP integration
+- File upload via MCP, verify on agent
+- Multi-step workflow (read file → modify → write back)
+
+---
+
+## Rollout Plan
+
+1. **Week 5:** Internal testing (2 agents: AD2, DESKTOP-0O8A1RL)
+2. **Week 6:** Beta release (3 power user techs)
+3. **Week 7:** General availability (all techs, documentation, training)
+
+---
+
+## Success Metrics
+
+**Infrastructure (Phase 1-2):**
+- 95% tunnel open success rate
+- <500ms command response time
+- Zero session conflicts
+
+**MCP Integration (Phase 3-4):**
+- 80% tech adoption within 2 weeks
+- >50 tunnel sessions/day
+- <5% command error rate
+
+**Long-term:**
+- 20% reduction in RDP sessions
+- 90% tech satisfaction
+- <1% security incidents
+
+---
+
+## Risks and Mitigations
+
+| Risk | Impact | Mitigation |
+|------|--------|------------|
+| Command injection | Critical | Input sanitization, no shell expansion, path allowlist |
+| Session hijacking | High | Short-lived JWT, session ownership validation, audit logging |
+| WebSocket instability | Medium | Auto-reconnect, session recovery |
+| Rate limiting too strict | Medium | Configurable per-tech limits, user feedback |
+
+---
+
+## Open Questions
+
+1. Registry operations scope (full access or specific hives only)?
+2. Interactive terminal priority (defer to Phase 6)?
+3. Multi-tech sessions for pair programming?
+4. MCP server credential manager integration (1Password)?
+5. Agent-side logging requirements (compliance)?
+
+---
+
+## Verification Plan
+
+### Phase 1 Verification
+```bash
+# Tech opens tunnel session
+curl -X POST http://172.16.3.30:3001/api/v1/tunnel/open \
+  -H "Authorization: Bearer $JWT" \
+  -H "Content-Type: application/json" \
+  -d '{"agent_id": 1}'
+# Response: {"session_id": "uuid", "status": "active"}
+
+# Check agent logs - should show: "Tunnel mode activated for session uuid"
+# Check database: SELECT * FROM tech_sessions WHERE session_id = 'uuid';
+```
+
+### Phase 2 Verification
+```bash
+# Execute command via tunnel
+curl -X POST http://172.16.3.30:3001/api/v1/tunnel/$SESSION_ID/command \
+  -H "Authorization: Bearer $JWT" \
+  -H "Content-Type: application/json" \
+  -d '{"command": "Get-Date", "shell": "powershell"}'
+# Response: {"stdout": "Monday, April 13, 2026...", "exit_code": 0}
+```
+
+### Phase 4 Verification (MCP)
+```bash
+# Configure MCP server in Claude Code
+# Test tools appear in Claude's tool list
+# Execute: "List files in C:\Shares on agent ID 1"
+# Claude should call gururmm_list_directory tool
+# Verify output shows directory listing
+```
+
+---
+
+## Next Steps After Approval
+
+1. Create feature branch: `feature/real-time-tunnel`
+2. Phase 1 database migrations (tech_sessions, tunnel_audit tables)
+3. Update protocol enums (ServerMessage/AgentMessage)
+4. Implement tunnel open/close endpoints
+5. Update agent WebSocket handler for tunnel mode
+6. Unit tests for session validation
+7. Deploy to test environment
+
+**Estimated Timeline:** 5 weeks to MCP integration, 7 weeks to GA
+
+---
+
+**Detailed plan location:** `projects/msp-tools/guru-rmm/plans/real-time-tunnel-architecture.md`