Tearing Ultron down from Proxmox to bare Debian

Ultron was a Proxmox node. I wiped it for bare Debian 13 so the GPU would answer to one machine instead of a hypervisor, then spent the evening in the NVIDIA driver gauntlet Trixie hands you. The part that bit me was Secure Boot.

Ultron is the GPU node in my homelab, a single-socket Xeon workstation that for a year ran Proxmox like the rest of the cluster. Last week I wiped it and reinstalled bare Debian 13 (Trixie), because the one job I actually want from that box, running CUDA workloads against its GPU, is the one job a hypervisor makes harder rather than easier. The reinstall took twenty minutes. Getting the driver to load took the rest of the evening, almost all of it on a single thing nobody warns you about: Secure Boot silently refusing an unsigned module.

Here is why the hypervisor had to go, and the order of operations that actually works on Trixie.

Why a hypervisor was the wrong layer here

Proxmox earns its place when you are consolidating many guests onto one machine. GPU compute is the opposite shape. To give a virtual machine a real GPU you go through VFIO passthrough, which means sorting out IOMMU groups, keeping the host from ever binding the card, and handing the whole device to exactly one guest. You end up talking to your GPU through a virtual machine, the card can only ever belong to one VM at a time anyway, and you are carrying all of that machinery for a node that does precisely one thing.

So the reasoning was simple: a box whose entire purpose is one GPU does not need a hypervisor sitting between me and nvidia-smi. Collapse the stack, put the card on bare metal, and the passthrough tax disappears along with the layer that created it.

[Diagram. Before, Proxmox node: CUDA workload (inside the guest) → VM (guest OS + NVIDIA driver) → VFIO passthrough → Proxmox host (kernel + KVM) → GPU; one VM owns the card and the driver lives in the guest. After, bare Debian 13: CUDA workload → nvidia.ko (DKMS-built) → Debian 13 kernel → GPU; the card answers to the kernel directly. An arrow labelled "collapse the stack" joins the two.]
The same hardware, two stacks. Passthrough buys flexibility a single-purpose GPU node never uses.

Blacklisting nouveau, the part that is easy to forget

Debian ships nouveau, the open-source driver, and loads it at boot. The proprietary module will not bind while nouveau is holding the card, so the first move is to blacklist it and rebuild the initramfs. That way the change is in place from early boot rather than after the kernel has already claimed the GPU.

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# then, from a shell (not inside the file above), regenerate the initramfs so it sticks at boot
sudo update-initramfs -u
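
After the next reboot it is worth confirming that the blacklist actually held before moving on to the driver. The sketch below checks a /proc/modules-style listing for a module name. `module_loaded` and its optional second argument are hypothetical names introduced for this example, not part of any Debian tool.

```shell
# module_loaded NAME [LISTING]
# Succeeds if NAME appears as a loaded module. LISTING defaults to the live
# /proc/modules; the optional argument is a hypothetical hook for testing.
module_loaded() {
    mod="$1"
    listing="${2:-/proc/modules}"
    # The module name is the first whitespace-separated field on each line.
    awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

if module_loaded nouveau; then
    echo "nouveau is still loaded: the blacklist did not take effect"
else
    echo "nouveau is not loaded"
fi
```

If nouveau is still there, the usual suspect is an initramfs that was not regenerated. Listing the image with `lsinitramfs /boot/initrd.img-$(uname -r) | grep modprobe.d` should show whether the blacklist file made it in.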

The driver itself: let DKMS do the building

Trixie keeps the NVIDIA driver in the non-free component and its firmware in non-free-firmware, so the sources have to be widened before any of it is installable. Then you install the kernel headers and the driver package, and Debian uses DKMS to compile the module against your running kernel. That detail is the whole reason to use the packaged driver instead of the .run installer: DKMS rebuilds the module automatically on the next kernel upgrade, so an apt upgrade does not quietly leave you with a black screen.

# add  contrib non-free non-free-firmware  to your apt sources, then:
sudo apt update
sudo apt install linux-headers-amd64 nvidia-driver
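
If apt cannot find nvidia-driver, the sources are usually the reason. A minimal sketch for checking which components are enabled follows. `has_component` is a hypothetical helper, and the two file paths are assumptions about where a Trixie install keeps its sources (the classic one-line file or the deb822 one).

```shell
# has_component COMPONENT FILE
# Succeeds if at least one entry in FILE enables COMPONENT, in either the
# one-line "deb ..." format or the deb822 "Components:" format.
# Hypothetical helper, not an apt command.
has_component() {
    comp="$1"
    file="$2"
    # Require whitespace before and whitespace or end-of-line after the name,
    # so "non-free" does not match inside "non-free-firmware".
    grep -Eq "^(deb[[:space:]]|Components:).*[[:space:]]${comp}([[:space:]]|\$)" "$file"
}

# Assumed locations; a given install normally has only one of these.
for f in /etc/apt/sources.list /etc/apt/sources.list.d/debian.sources; do
    [ -r "$f" ] || continue
    for c in contrib non-free non-free-firmware; do
        if has_component "$c" "$f"; then
            echo "$f: $c enabled"
        else
            echo "$f: $c missing"
        fi
    done
done
```

Once the install has finished, `dkms status` lists the modules DKMS has built and the kernels it built them for, which is a quick way to confirm the build step happened.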

Secure Boot, or why nvidia-smi lied to me

After the reboot I ran nvidia-smi and got this:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the
NVIDIA driver. Make sure that the latest NVIDIA driver is installed
and running.
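
That message says nothing about why the driver is unreachable. One way to narrow it down, sketched here, is to ask whether Secure Boot is enforcing before suspecting the build. `mokutil --sb-state` is a real command; `secure_boot_enabled` is a hypothetical wrapper, and the string it matches is an assumption about what mokutil prints.

```shell
# secure_boot_enabled TEXT
# Succeeds if TEXT (the output of `mokutil --sb-state`) reports Secure Boot on.
# The matched string is an assumption about mokutil's output format.
secure_boot_enabled() {
    case "$1" in
        *"SecureBoot enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v mokutil >/dev/null 2>&1; then
    state="$(mokutil --sb-state 2>/dev/null)"
    if secure_boot_enabled "$state"; then
        echo "Secure Boot is on: the kernel will only load modules signed by a trusted key"
    else
        echo "Secure Boot is not reported as on: look elsewhere for the failure"
    fi
else
    echo "mokutil not found; install it to query the Secure Boot state"
fi
```

If Secure Boot is on, `sudo dmesg | grep -i rejected` is a reasonable next look, since the kernel normally logs a line when it refuses a module. The exact wording varies by kernel version.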

The card was fine and the module had built without complaint. The kernel was simply refusing to load it, because Secure Boot was on and the DKMS-built module carried no signature the firmware trusted. There are two ways out. You can turn Secure Boot off in firmware, or you can enroll a Machine Owner Key (MOK), have the module signed with it, and keep the chain of trust intact. I kept Secure Boot and enrolled a key, which is a one-time dance through the firmware on the next reboot.

# enroll the DKMS signing key, set a one-time password, then reboot
sudo mokutil --import /var/lib/dkms/mok.pub
# at the blue MOK manager on reboot: Enroll MOK, enter the password, reboot
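
The enrollment can be checked on both sides of that reboot. The sketch below wraps two mokutil queries: `--list-new` for requests still waiting on the MOK manager, and `--test-key` for whether a given certificate is enrolled. `check_mok` is a hypothetical helper, and its return codes are choices made for this example.

```shell
# check_mok CERT
# Shows pending enrollment requests and the enrollment state of CERT.
# Returns 2 if CERT is not readable, 3 if mokutil is missing; otherwise the
# exit status of the last mokutil call. Hypothetical helper.
check_mok() {
    cert="$1"
    if [ ! -r "$cert" ]; then
        echo "no readable certificate at $cert" >&2
        return 2
    fi
    if ! command -v mokutil >/dev/null 2>&1; then
        echo "mokutil is not installed" >&2
        return 3
    fi
    echo "Pending enrollment requests:"
    mokutil --list-new
    echo "Enrollment state of $cert:"
    mokutil --test-key "$cert"
}
```

Run it as `check_mok /var/lib/dkms/mok.pub`, with sudo if the certificate is not readable by your user. Before the reboot the key should appear among the pending requests; afterwards it should be reported as enrolled.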

After that, nvidia-smi came up clean with the card and driver version. That single step is the difference between a clean install and a working one, and it is the one the install logs never mention.
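
For anything scripted, the same check can be reduced to an exit status plus one line of output. This is a minimal sketch using nvidia-smi's query flags; `gpu_ready` is a hypothetical helper name.

```shell
# gpu_ready
# Prints "name, driver_version" for each GPU and succeeds only if nvidia-smi
# exists and can reach the driver. Hypothetical helper for scripted checks.
gpu_ready() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}

if summary="$(gpu_ready 2>/dev/null)"; then
    echo "driver is up: $summary"
else
    echo "driver is not reachable"
fi
```

Because nvidia-smi exits non-zero when it cannot talk to the driver, this catches the same failure as the message above, without anyone having to read it.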

[Flowchart: (1) add contrib non-free non-free-firmware to /etc/apt/sources.list; (2) blacklist nouveau, then update-initramfs -u; (3) apt install linux-headers-amd64 nvidia-driver (DKMS builds against your kernel); (4) Secure Boot on? If yes, enroll a MOK and sign the module (amber box: "the step that bites you"); if no, skip ahead; (5) reboot; (6) nvidia-smi shows the card + driver version.]
The whole sequence. Every box except the amber one is mechanical; the amber one is where a clean build still gives you a dead nvidia-smi.

What I got back

A bare nvidia-smi, the full card with no virtual machine in the way, and a node that now runs my Kokkos energy-measurement work straight against the hardware instead of through a guest. The rest of the cluster is still Proxmox, which is correct, because those nodes are doing the consolidation job Proxmox is genuinely good at. Ultron just was never that node.

The next thing on this box is wiring its power telemetry into the same dashboard the GPU work feeds, so the node reports joules-per-run alongside utilization. If you run a mixed Proxmox cluster and have a cleaner way to keep one bare-metal GPU node in the fold without the passthrough tax, tell me. I have not found one I love yet.