Mike-Swanson cdd6e6fc8c sync: auto-sync from GURU-KALI at 2026-05-25 05:56:22

Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-05-25 05:56:22

2026-05-25 05:56:23 -07:00

Session Log — 2026-05-25

User

User: Mike Swanson (mike)
Machine: Mikes-MacBook-Air.local
Role: admin
Session: 05:00 - 05:48 MST

Session Summary

Recovered the GURU-KALI workstation from a black screen caused by an nvidia driver installation, using GuruRMM remote command execution. The system had booted to a black screen after nvidia driver version 595.71.05-1 was installed, but the GuruRMM agent remained online and responsive, enabling remote diagnosis and repair.

Connected to the GuruRMM API at 172.16.3.30:3001 and confirmed the GURU-KALI agent (ID a73ba38e-cd02-4331-b8bf-474cd899ec22) was online despite the display failure. Sent a remote shell command to enumerate installed nvidia packages, discovering 50+ packages including driver, libraries, and firmware. The initial removal attempt failed with "Read-only file system" errors across /var/lib/dpkg and /var/cache/apt, indicating the root filesystem was mounted read-only, likely a protective measure after a previous boot failure.

Remounted the root filesystem as read-write using "mount -o remount,rw /", then executed a full nvidia package removal using apt-get with DEBIAN_FRONTEND=noninteractive to avoid interactive prompts. This removed all nvidia-* and libnvidia-* packages, but firmware packages and some DKMS modules remained. Performed a second pass removing firmware-nvidia-graphics and firmware-nvidia-gsp, then created /etc/modprobe.d/blacklist-nvidia.conf to prevent the nvidia kernel modules from loading on future boots. Updated initramfs to apply the blacklist.

Rebooted the system twice - first after the initial driver removal, then again after the blacklist was applied. After the second reboot, verified that lightdm display manager started successfully (active and running state). User confirmed the display was restored and showing the login screen. The system is now using either the Intel i915 integrated graphics driver or framebuffer fallback instead of the problematic nvidia driver. Blacklist remains in place to prevent recurrence.

Key Decisions

Used GuruRMM remote commands rather than physical access — Agent was online despite black screen, enabling fully remote recovery without needing console access or recovery media
Remounted filesystem before package operations — Read-only state blocked all dpkg/apt operations; remounting as read-write was mandatory before proceeding with driver removal
Performed multi-pass removal — First removed main driver packages, then firmware, then created blacklist and updated initramfs as separate operations to ensure each step completed cleanly
Created permanent blacklist — Added /etc/modprobe.d/blacklist-nvidia.conf rather than just removing packages, preventing automatic reloading if packages get reinstalled via dependencies
Rebooted twice — First reboot applied the package removal; second reboot after blacklist creation ensured nvidia modules wouldn't load from initramfs
Used DEBIAN_FRONTEND=noninteractive — Prevented apt-get from blocking on interactive prompts during unattended remote execution
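
The remount decision above could be guarded by a check along these lines. This is a minimal sketch, not something run during the session: the function name root_is_readonly is hypothetical, and it assumes the usual Linux /proc/mounts layout (device, mount point, type, comma-separated options).

```shell
# Hypothetical helper: succeeds (exit 0) if "/" is mounted read-only.
# Takes an optional mounts file so it can be exercised against sample data.
root_is_readonly() {
  mounts=${1:-/proc/mounts}
  # Field 2 is the mount point, field 4 the options; keep the last entry for "/".
  opts=$(awk '$2 == "/" { o = $4 } END { print o }' "$mounts")
  case ",$opts," in
    *,ro,*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage sketch (needs root): remount only when required.
# if root_is_readonly; then mount -o remount,rw /; fi
```

Wrapping the option list in commas keeps an option such as errors=remount-ro from being mistaken for a plain ro flag.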

Problems Encountered

Filesystem mounted read-only — Initial package removal failed with "unable to access dpkg database" and "Read-only file system" errors. Resolved by running "mount -o remount,rw /" before retrying removal operations.
JSON parsing control characters — Command output containing terminal control codes caused jq parsing failures. Worked around by using grep/python for status checks or by stripping control characters.
Firmware packages remained after initial removal — First apt-get pass removed driver packages but left firmware-nvidia-graphics and firmware-nvidia-gsp. Required explicit second removal targeting firmware-* packages.
Blacklist file initially missing — After first reboot, /etc/modprobe.d/blacklist-nvidia.conf was not present despite creation command showing success. Recreated using heredoc syntax and verified file contents before final reboot.
Exit code 100 despite success markers: Several apt-get operations returned exit code 100 (apt-get's generic error status, not a warning level) but included success markers in stdout. Used marker strings like "NVIDIA REMOVAL COMPLETE" to verify actual completion rather than relying solely on exit codes.
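
For the control-character problem in the list above, a filter like the following could sit between curl and jq. It is a sketch under assumptions: the name strip_ctrl is hypothetical, and it assumes the offending bytes are ANSI CSI escape sequences plus stray C0 control characters embedded in the command output.

```shell
# Hypothetical filter: remove ANSI escape sequences and stray control bytes
# from stdin so the result can be piped into jq.
strip_ctrl() {
  esc=$(printf '\033')
  # 1) drop CSI sequences such as ESC[32m or ESC[0m
  # 2) drop remaining C0 control bytes and DEL, keeping tab and newline
  sed "s/${esc}\[[0-9;?]*[A-Za-z]//g" | tr -d '\000-\010\013-\037\177'
}

# Usage sketch:
# curl -s "$URL" -H "Authorization: Bearer $TOKEN" | strip_ctrl | jq '.status'
```

The sed pass runs first so the whole escape sequence goes away, not just the ESC byte that tr would delete on its own.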

Configuration Changes

GURU-KALI (100.75.148.91 / Tailscale) — remote via GuruRMM:

Removed 50+ nvidia packages (nvidia-driver, nvidia-open, xserver-xorg-video-nvidia, all libnvidia-* libs)
Removed firmware-nvidia-graphics and firmware-nvidia-gsp

Created /etc/modprobe.d/blacklist-nvidia.conf:

blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm

Updated initramfs (all kernels) to apply blacklist
Remounted root filesystem as read-write (was read-only)
Rebooted system twice
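
The blacklist file was recreated with heredoc syntax after the first attempt left nothing on disk. A minimal sketch of that approach follows; the function name write_nvidia_blacklist is hypothetical, and the path is a parameter so it can be tried on a scratch file first. A heredoc also avoids depending on echo -e, whose behavior differs between shells.

```shell
# Hypothetical helper: write the blacklist with a quoted heredoc, then verify it.
write_nvidia_blacklist() {
  target=$1
  cat > "$target" <<'EOF'
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
blacklist nvidia_uvm
EOF
  # Fail unless exactly four blacklist entries landed on disk.
  [ "$(grep -c '^blacklist ' "$target")" -eq 4 ]
}

# Usage sketch (needs root):
# write_nvidia_blacklist /etc/modprobe.d/blacklist-nvidia.conf && update-initramfs -u
```

Chaining update-initramfs behind the verification means the initramfs is only rebuilt once the file contents are confirmed.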

ClaudeTools:

.claude/current-mode set to infra (work mode for infrastructure operations)

Credentials & Secrets

No new credentials created. Used existing vaulted credentials:

GuruRMM API admin credentials: infrastructure/gururmm-server.sops.yaml -> credentials.gururmm-api.admin-email (claude-api@azcomputerguru.com) and credentials.gururmm-api.admin-password
Token stored temporarily in /tmp/rmm_token during session, deleted after completion

Infrastructure & Servers

GURU-KALI:

Hostname: GURU-KALI
Tailscale IP: 100.75.148.91
GuruRMM Agent ID: a73ba38e-cd02-4331-b8bf-474cd899ec22
OS: Kali Linux (dpkg-based)
Display Manager: lightdm (now active and running)
Graphics: Intel i915 integrated (after nvidia removal) or framebuffer fallback
Status: Online, display restored

GuruRMM Server (Saturn):

IP: 172.16.3.30
API Base: http://172.16.3.30:3001/api
Authentication: JWT Bearer token (obtained via POST /auth/login)
Command execution: POST /api/agents/{id}/command
Command polling: GET /api/commands/{id}
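
The dispatch-then-poll pattern implied by the two endpoints above might look like the sketch below. Everything beyond the endpoint paths is an assumption: the function name poll_until_done is hypothetical, the response is assumed to carry a top-level "status" string, and the terminal values completed and failed are guesses rather than documented API behavior.

```shell
# Hypothetical poller: FETCH is the name of a command or function that prints
# the JSON for one command record. Terminal status names are assumptions.
poll_until_done() {
  fetch=$1 tries=$2 delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Pull a "status" string out with sed (no jq dependency).
    status=$("$fetch" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)
    case $status in
      completed|failed) echo "$status"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo timeout
  return 1
}

# Usage sketch:
# fetch_cmd() { curl -s "http://172.16.3.30:3001/api/commands/$CMD_ID" -H "Authorization: Bearer $TOKEN"; }
# poll_until_done fetch_cmd 30 2
```

Passing the fetch step in by name keeps the loop independent of curl, so it can be exercised with a stub before being pointed at the server.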

Commands & Outputs

# Authenticate with GuruRMM API
curl -s -X POST "http://172.16.3.30:3001/api/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"email":"claude-api@azcomputerguru.com","password":"***"}' | jq -r '.token'
# -> (JWT token)

# Check agent status
curl -s "http://172.16.3.30:3001/api/agents/a73ba38e-cd02-4331-b8bf-474cd899ec22" \
  -H "Authorization: Bearer $TOKEN" | jq '{hostname, status}'
# -> {"hostname": "GURU-KALI", "status": "online"}

# List installed nvidia packages (command_id: 9302b83c-2f7b-4588-beb0-d735d3977b07)
# Command: dpkg -l | grep -i nvidia
# Output: 50 packages including nvidia-driver 595.71.05-1, nvidia-open, libnvidia-*, firmware-nvidia-*

# Remount filesystem as read-write (command_id: 2d1f683d-565a-4cfb-a17d-198770fac799)
# Command: mount -o remount,rw / && echo "Filesystem remounted as read-write" && mount | grep " / "
# Exit code: 0 (success)

# Remove nvidia drivers (command_id: 64cc2ca5-e031-4795-9aa4-27fde8b37c90)
# Command: DEBIAN_FRONTEND=noninteractive apt-get remove --purge -y nvidia-* libnvidia-* && apt-get autoremove -y
# Exit code: 100 (apt-get error status; output still showed 48 packages removed, 979 MB freed)

# Verify removal (command_id: 8d415bfe-23e2-49a2-8da5-f98f5fd71a8c)
# Command: dpkg -l | grep -i nvidia || echo "No nvidia packages found"
# Output: Only firmware packages remained (firmware-nvidia-graphics, firmware-nvidia-gsp)

# Complete removal with blacklist (command_id: 190efe95-a11a-4960-869d-8be778e129bf)
# Command: apt-get remove --purge -y firmware-nvidia-* && dpkg --purge nvidia-driver nvidia-kernel-support ...
#   && dkms status | grep nvidia | cut -d, -f1,2 | xargs -r -n1 sh -c 'dkms remove $0'
#   && echo -e "blacklist nvidia\nblacklist nvidia_drm\nblacklist nvidia_modeset\nblacklist nvidia_uvm" > /etc/modprobe.d/blacklist-nvidia.conf
#   && update-initramfs -u
# Output marker: "COMPLETE NVIDIA REMOVAL DONE"

# Reboot (command_id: 8628dce8-8755-4a49-9904-c684455de70f)
# Command: sync && echo "Final reboot in 5 seconds..." && sleep 5 && reboot

# Final verification after reboot (command_id: f6737830-4ca9-4ed3-b616-d3305a445f10)
# Status: lightdm.service active (running)
# Display: Confirmed working by user
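
Because several of the commands above returned exit code 100 while still printing their completion marker, the result check could combine both signals. The following is a sketch only; classify_result and its three labels are hypothetical names, not part of GuruRMM.

```shell
# Hypothetical classifier: combine exit code and marker string.
#   ok                     exit 0 and marker present
#   completed-with-errors  marker present but non-zero exit (worth a manual look)
#   failed                 marker missing, whatever the exit code
classify_result() {
  code=$1 output=$2 marker=$3
  case $output in
    *"$marker"*) has_marker=1 ;;
    *)           has_marker=0 ;;
  esac
  if [ "$code" -eq 0 ] && [ "$has_marker" -eq 1 ]; then
    echo ok
  elif [ "$has_marker" -eq 1 ]; then
    echo completed-with-errors
  else
    echo failed
  fi
}

# Usage sketch:
# classify_result "$exit_code" "$stdout" "COMPLETE NVIDIA REMOVAL DONE"
```

Keeping a separate completed-with-errors outcome preserves the fact that apt-get reported an error even though the script reached its final echo.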

Pending / Incomplete Tasks

None. Recovery complete.

Future consideration: if the nvidia GPU is needed again:

Remove blacklist: sudo rm /etc/modprobe.d/blacklist-nvidia.conf
Reinstall nvidia drivers with proper Xorg configuration
Update initramfs: sudo update-initramfs -u
Reboot

Reference Information

GuruRMM API docs: Command execution via POST /api/agents/{id}/command with payload {command_type: "shell", command: "...", timeout_seconds: 300}
GURU-KALI session log reference: session-logs/2026-05-24-GURU-KALI-session.md (previous work on this machine)
Wiki reference: wiki/clients/internal-infrastructure.md (ACG infrastructure inventory)
Vault paths:
- GuruRMM API credentials: infrastructure/gururmm-server.sops.yaml
Command IDs from this session:
- Initial nvidia list: 9302b83c-2f7b-4588-beb0-d735d3977b07
- Filesystem remount: 2d1f683d-565a-4cfb-a17d-198770fac799
- Driver removal: 64cc2ca5-e031-4795-9aa4-27fde8b37c90
- Complete removal: 190efe95-a11a-4960-869d-8be778e129bf
- Final reboot: 8628dce8-8755-4a49-9904-c684455de70f
- Blacklist creation: f6737830-4ca9-4ed3-b616-d3305a445f10

Session Log -- 2026-05-25

User

User: Mike Swanson (mike)
Machine: DESKTOP-0O8A1RL (GURU-5070)
Role: admin
Session span: ~19:42 PT (2026-05-24) -- 04:59 PT (2026-05-25)

Session Summary

Session opened with three completed tasks carrying over from the prior context: Pluto machine doc, rmm-audit skill update, and session save. Those were completed and synced before this session started (see 2026-05-24 session log updates).

The MacBook's in-progress auto-update re-dispatch fix was picked up. The MacBook session had identified that agents BB-SERVER and RECEPTIONIST-PC were stuck on v0.6.37 while the fleet was on v0.6.38, and had left uncommitted changes to server/src/ws/mod.rs. Because those changes were never committed, the fix was reimplemented from scratch against the live server code. The Coding Agent added a db::get_pending_update() check that runs before needs_update() in the reconnect handler and re-dispatches using the original update_id, with a semver guard and URL/checksum validation. A bonus discovery: migrations 042-044 (agent_mspbackups_mapping and related) had not been applied to production and the .sqlx offline cache was stale -- both were fixed in the same commit (c8d5af6). The service was deployed and confirmed active. Both agents were confirmed on 0.6.38 with status=completed update records within minutes of the deploy.
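
The semver guard in the re-dispatch logic can be illustrated outside the server code. The real change lives in Rust (server/src/ws/mod.rs, using semver::Version); the shell below is only a sketch of the decision, with hypothetical function names, plain numeric MAJOR.MINOR.PATCH versions, and no URL/checksum validation.

```shell
# Hypothetical guard: succeed only if version A is strictly lower than version B.
# Simplification: numeric MAJOR.MINOR.PATCH only, no pre-release or build tags.
version_lt() {
  old_ifs=$IFS
  IFS=.
  set -- $1 $2        # split both versions into six positional fields
  IFS=$old_ifs
  [ "$1" -lt "$4" ] && return 0
  [ "$1" -gt "$4" ] && return 1
  [ "$2" -lt "$5" ] && return 0
  [ "$2" -gt "$5" ] && return 1
  [ "$3" -lt "$6" ]
}

# Re-dispatch decision: a pending update is re-sent only if one exists
# and the agent is still behind its target version.
should_redispatch() {
  agent_version=$1 pending_target=$2
  [ -n "$pending_target" ] && version_lt "$agent_version" "$pending_target"
}

# Usage sketch:
# should_redispatch 0.6.37 0.6.38 && echo "re-dispatch original update_id"
```

Comparing fields numerically rather than as strings matters once a component reaches two digits, for example 0.6.9 against 0.6.10.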

Tucson Golden Corral was onboarded as a new GuruRMM client. Client "Tucson Golden Corral" and site "Co-Located" were created via the GuruRMM API (auth via admin JWT). Site enrollment key vaulted at clients/tucson-golden-corral/gururmm-site-co-located.sops.yaml. The IEX installer one-liner was requested -- it already existed at the dashboard installer page (irm 'https://rmm.azcomputerguru.com/install/INNER-STORM-2733/windows' | iex); this was not checked before asking.

TGC-SERVER enrolled immediately after the installer was run. Metrics pulled via RMM showed: online, v0.6.38, Windows Server 2016 (build 14393), 16 GB RAM at 45.6%, 1.8 TB disk at 36.2%, CPU at 23.8%, uptime ~5 hours. Process list indicated DNS, Active Directory, SQL Server, IIS (with Certify the Web/Let's Encrypt), ScreenConnect, Hyper-V, and Chrome running as Administrator on a DC. A PowerShell command was dispatched via the RMM to enumerate installed Windows roles; result confirmed: Hyper-V installed with two VMs (MAS90 -- Running, MAS90.old -- Off) and a full RDS stack (Connection Broker, Gateway, Licensing, Session Host, Web Access). User confirmed Hyper-V should not be on this server; RDS is expected. MAS90 = Sage 100 ERP. Disposition of the VMs not yet decided -- session ended before resolution.

Key Decisions

Reimplement from scratch rather than recover MacBook draft: MacBook changes were uncommitted and inaccessible from DESKTOP. Reimplementation from session log description + live code produced a cleaner result than the MacBook draft which had gone through two rejection cycles.
Bundle migrations with fix commit: Migrations 042-044 were a pre-existing production blocker (next CI server build would have failed silently). Bundling avoids a separate emergency fix.
Vault TGC enrollment key immediately on site creation: Consistent with practice for all other clients. Key is a shared secret for agent enrollment; losing it means re-generating and updating all agents.

Problems Encountered

Wrong field name on auth login: Sent username instead of email field. API returned deserialization error. Fixed by reading the error message.
Commands endpoint field mismatch: Sent command_text instead of command field. Discovered correct field name by reading the SendCommandRequest struct in server/src/api/commands.rs.
JSON escaping in bash heredoc: Shell escaping of PowerShell dollar signs in JSON payload caused empty responses from curl. Resolved by using PowerShell's Invoke-RestMethod with a here-string for the command body.
Checked wrong IEX installer URL: Asked if an irm | iex endpoint existed before checking the dashboard installer page, which already displayed it. The URL (/install/INNER-STORM-2733/windows) uses site_code not site_id UUID.
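
For the heredoc escaping problem in the list above, one way to stay in bash is a heredoc with a quoted delimiter, which disables shell expansion inside the body. This is a sketch, not what was run: the PowerShell command text is illustrative, and the payload fields follow the command/command_type/timeout_seconds shape noted elsewhere in these logs.

```shell
# Quoted delimiter ('JSON') stops the shell from expanding $_ and other dollar signs.
# The PowerShell command text is illustrative, not the one run in the session.
payload=$(cat <<'JSON'
{"command_type":"shell","command":"Get-VM | ForEach-Object { $_.Name }","timeout_seconds":300}
JSON
)

# Usage sketch:
# curl -s -X POST "http://172.16.3.30:3001/api/agents/$AGENT_ID/command" \
#   -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
#   --data-binary "$payload"
```

With an unquoted delimiter the shell would try to expand $_ before curl ever saw the body, which is one plausible route to an altered or empty payload.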

Configuration Changes

New files (vault repo):

clients/tucson-golden-corral/gururmm-site-co-located.sops.yaml -- GuruRMM enrollment key for TGC Co-Located site

Modified files (gururmm repo, pushed to Gitea):

server/src/ws/mod.rs -- added use semver::Version; + pending update re-dispatch logic
.sqlx/ -- regenerated offline query cache after applying migrations 042-044

Applied DB migrations (production gururmm PostgreSQL on 172.16.3.30):

Migration 042 -- agent_mspbackups_mapping table
Migration 043 -- (mspbackups related)
Migration 044 -- (mspbackups related)

Credentials & Secrets

Tucson Golden Corral -- Co-Located site:

Enrollment API key: grmm_p4g5z7Oj1-rE6GjjjrQqWBouk9BGl4v3
Vault: clients/tucson-golden-corral/gururmm-site-co-located.sops.yaml

GuruRMM admin (already in vault):

Email: admin@azcomputerguru.com
Password: GuruRMM2025
Vault: projects/gururmm/dashboard.sops.yaml

Infrastructure & Servers

Host	IP	Notes
GuruRMM server	172.16.3.30	gururmm-server restarted after re-dispatch fix deploy
TGC-SERVER	public IP 98.181.90.163	New GuruRMM client; Windows Server 2016 build 14393; DC+DNS+SQL+IIS+RDS+Hyper-V

TGC-SERVER details:

Agent ID: 1275daa1-3996-4ecf-a1db-c82e88f757b4
OS: Windows Server 2016 (build 14393), extended support ends Jan 2027
Roles confirmed installed: Hyper-V, RDS (full stack), AD DS, DNS
Hyper-V VMs: MAS90 (Running -- Sage 100 ERP), MAS90.old (Off -- prior snapshot/backup)
Other services: SQL Server, IIS + Certify the Web (Let's Encrypt), ScreenConnect client
Administrator logged in, idle since boot, running Chrome on a DC (security concern)
RDS expected per customer; Hyper-V NOT expected per customer

New GuruRMM client/site:

Client: Tucson Golden Corral (ID: 3248bdec-cbc3-45df-ba63-c8cdc9395e58)
Site: Co-Located (ID: e5caa88f-f395-40e3-befa-f54e035f4293, code: INNER-STORM-2733)

Commands & Outputs

# GuruRMM API auth
POST http://172.16.3.30:3001/api/auth/login {"email":"admin@azcomputerguru.com","password":"GuruRMM2025"}

# Create client
POST http://172.16.3.30:3001/api/clients {"name":"Tucson Golden Corral"}
# -> id: 3248bdec-cbc3-45df-ba63-c8cdc9395e58

# Create site
POST http://172.16.3.30:3001/api/sites {"name":"Co-Located","client_id":"3248bdec-cbc3-45df-ba63-c8cdc9395e58"}
# -> site_id: e5caa88f, site_code: INNER-STORM-2733, api_key: grmm_p4g5z7Oj1-rE6GjjjrQqWBouk9BGl4v3

# Windows installer one-liner (already on dashboard installer page)
irm 'https://rmm.azcomputerguru.com/install/INNER-STORM-2733/windows' | iex

# RMM command dispatched to TGC-SERVER (command ID: e4d372fb)
# Checked installed Hyper-V + RDS roles and running VMs
# Result: Hyper-V + full RDS stack installed; VMs: MAS90 (Running), MAS90.old (Off)

# Verify BB-SERVER/RECEPTIONIST-PC update completion
SELECT hostname, old_version, target_version, status, completed_at FROM agent_updates JOIN agents ON agents.id = agent_updates.agent_id WHERE hostname IN ('BB-SERVER','RECEPTIONIST-PC') ORDER BY started_at DESC LIMIT 4;
# Both show status=completed, 0.6.37->0.6.38, ~00:13-00:14 UTC 2026-05-25

Pending / Incomplete Tasks

TGC-SERVER Hyper-V disposition: MAS90 (Sage 100 ERP) is running in a Hyper-V VM on TGC-SERVER. Customer says Hyper-V should not be on this box. Options: (1) migrate MAS90 VM to dedicated Hyper-V host, (2) P2V or migrate MAS90 to run natively. Decision not made -- needs customer input on hardware and MAS90 usage pattern.
TGC-SERVER Chrome-on-DC: Administrator account actively browsing from a domain controller. Should be flagged to customer and remediated (dedicated admin workstation or jump server).
TGC-SERVER OS age: Windows Server 2016 -- extended support Jan 2027. Not urgent but should be in the planning queue.
MSPBackups Phase 2: The mspbackups mapping migrations (042-044) were applied to production but no backup status data has been pulled yet for TGC or other clients.

Reference Information

gururmm commits:

c8d5af6 -- fix(server): re-dispatch pending updates on agent reconnect + sqlx migrate + .sqlx cache

Agents confirmed updated:

BB-SERVER: agent_id 6c02baa7, now 0.6.38, completed_at 2026-05-25 00:14 UTC
RECEPTIONIST-PC: agent_id 9c91d324, now 0.6.38, completed_at 2026-05-25 00:13 UTC

TGC RMM command result (e4d372fb):

Hyper-V, RSAT-Hyper-V-Tools, Hyper-V-Tools, Hyper-V-PowerShell -- all Installed
Remote-Desktop-Services, RDS-Connection-Broker, RDS-Gateway, RDS-Licensing, RDS-RD-Server, RDS-Web-Access -- all Installed
MAS90 VM: Running, Operating normally
MAS90.old VM: Off, Operating normally

IEX installer: irm 'https://rmm.azcomputerguru.com/install/INNER-STORM-2733/windows' | iex

Vault paths:

TGC enrollment key: clients/tucson-golden-corral/gururmm-site-co-located.sops.yaml
GuruRMM admin: projects/gururmm/dashboard.sops.yaml
GuruRMM API JWT secret: projects/gururmm/api-server.sops.yaml

Update: 05:56 MST — GURU-KALI sync (Mike Swanson)

Routine sync from the GURU-KALI machine. No substantive work — repo sync only.

Ran /sync: fast-forwarded e8b19a8..e991e8d, pulling 1 commit (this session log, authored from GURU-5070). No conflicts.
No local changes to commit; nothing to push.
Vault clean both directions.
No cross-user ## Note for / ## Message for blocks in incoming logs.
Global commands already current.

End-of-session state on GURU-KALI: HEAD e991e8d, working tree clean, main up to date with origin/main.
