Published June 12, 2026
Charging GPU energy to the kernel that spent it
A profiler tells you where a GPU kernel spends time. I wanted to know where it spends joules. So I built a Kokkos Tools connector that samples power on a side thread and integrates it over each profiled region, with NVML for the precise per-GPU number and Variorum for the whole node.
I spent a summer at Oak Ridge working on this, and the observation that started it is short: the kernel that took the most wall-clock time in my run was not the kernel that cost the most energy. The profiler ranked everything by time and was confident about it, and that ranking was simply the wrong one if the thing you are being asked to reduce is the power bill. Clusters increasingly run under a power cap rather than a clock-rate target, so energy-to-solution is becoming the number that matters, and almost nothing in a normal HPC workflow reports it per kernel. So I built a tool that does, as a Kokkos Tools connector that attributes joules to each profiled region without touching the application it measures.
Here is the result it produced on one run, and then how it gets there.
$ export KOKKOS_TOOLS_LIBS=/opt/kp/libkp_gpu_energy.so
$ ./solver --mesh big.h5
region           calls   time(s)   energy(J)   J/call   avg(W)
-----------------------------------------------------------------
gemm_apply       12480      8.42      1936.1    0.155      230
spmv_matvec      49920     11.07      1421.8    0.028      128
halo_exchange    49920      3.91       402.7    0.008      103
-----------------------------------------------------------------
device idle baseline ~ 61 W (subtracted for the J/call column)
The sparse mat-vec ran the longest, eleven seconds against the dense block's eight, and still cost about a quarter less energy, because it is memory bound and leaves the GPU drawing a little over half the average power. Time told me to optimize spmv_matvec. Energy told me to look at gemm_apply first. Those are different instructions, and until this connector existed I could only see the first one.
Kokkos Tools, or instrumentation you do not have to compile in
Kokkos already announces what it is doing. Every parallel_for, parallel_reduce and parallel_scan fires a begin callback before it launches and an end callback after it finishes, and you can wrap arbitrary spans in named regions with push and pop markers. A Kokkos Tools connector is just a shared library that implements those callbacks, and you attach it by pointing an environment variable at it. There is no recompile of the application, no annotation in its source, no fork of the code. You set KOKKOS_TOOLS_LIBS to the path of the library and the runtime loads it.
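To make the callback shape concrete, here is a minimal sketch of a connector that only records a wall-clock window per parallel_for. The kokkosp_* entry points are the standard Kokkos Tools hooks; the kp_sketch namespace, the Window struct, and the id scheme are hypothetical names for illustration, not the layout of the connector described in this post. parallel_reduce and parallel_scan have analogous begin and end hooks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

namespace kp_sketch {  // hypothetical namespace, for illustration only
using Clock = std::chrono::steady_clock;

struct Window {
  std::string name;         // kernel label passed in by Kokkos
  Clock::time_point begin;  // when the begin callback fired
  Clock::time_point end;    // when the end callback fired
};

uint64_t next_id = 0;                               // assumed id scheme
std::unordered_map<uint64_t, Window> open_windows;  // begun, not yet ended
std::vector<Window> closed_windows;                 // complete windows
}  // namespace kp_sketch

extern "C" void kokkosp_init_library(const int /*loadSeq*/,
                                     const uint64_t /*interfaceVer*/,
                                     const uint32_t /*devInfoCount*/,
                                     void* /*deviceInfo*/) {
  // A full connector would start its power-sampling thread here.
}

extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           const uint32_t /*devID*/,
                                           uint64_t* kID) {
  assert(name != nullptr && kID != nullptr);
  *kID = kp_sketch::next_id++;  // Kokkos hands this id back at the end callback
  kp_sketch::open_windows[*kID] =
      kp_sketch::Window{name, kp_sketch::Clock::now(), {}};
}

extern "C" void kokkosp_end_parallel_for(const uint64_t kID) {
  auto it = kp_sketch::open_windows.find(kID);
  if (it == kp_sketch::open_windows.end()) return;  // unknown id: ignore
  it->second.end = kp_sketch::Clock::now();
  kp_sketch::closed_windows.push_back(it->second);
  kp_sketch::open_windows.erase(it);
}

extern "C" void kokkosp_finalize_library() {
  // A full connector would stop sampling and print its report here.
}
```

Built as a shared library, something like this is what KOKKOS_TOOLS_LIBS would point at. Note that no power is read anywhere in the callbacks; they only mark time.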
That property is the whole reason this approach is worth building. I could take a solver I did not write, that nobody wants me to patch, and learn the energy cost of each of its kernels by loading one extra library next to it. The connector listens to the events Kokkos is already emitting. The measurement rides along.
You cannot read energy, only watch power
The obvious first version reads the power sensor at the begin callback, reads it again at the end, and reports the average of the two readings multiplied by the duration. It does not work, and the reason it does not work is the heart of the problem. NVIDIA's management library, NVML, exposes nvmlDeviceGetPowerUsage, which returns the board's instantaneous power draw in milliwatts. The catch is twofold. That sensor updates at a modest rate, on the order of tens of hertz, and a GPU kernel can easily be shorter than the interval between two updates, so begin and end frequently return the same stale reading, which says nothing about what that kernel drew. And even when the kernel is long enough to span several updates, two point readings cannot describe a curve that rises and falls across the kernel's lifetime.
The deeper issue is that power is the wrong quantity to sample at the boundaries. Power is instantaneous, watts, a rate. What you are paying for is energy, joules, and energy is the integral of power over time. Two readings give you two heights of a curve. The bill is the area under it. My first version reported nonsense on short kernels, sometimes even a negative delta when the two readings landed on opposite sides of a sensor update, and that was the signal to stop sampling on the kernel's schedule and start sampling on the clock's.
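Written out, the distinction looks like this. The symbols are my own shorthand rather than anything from NVML: P(t) is the board's power draw, t_0 and t_1 are the begin and end of a region, and \hat{E} is the two-reading estimate.

```latex
% Energy actually spent by the region: the area under the power curve.
E = \int_{t_0}^{t_1} P(t)\,\mathrm{d}t

% The two-reading estimate: one trapezoid spanning the whole region.
\hat{E}_{\text{two-point}} = \frac{P(t_0) + P(t_1)}{2}\,(t_1 - t_0)
```

The two agree only when P(t) is close to linear between t_0 and t_1. When both readings are the same stale sensor value, the estimate collapses to that value times the duration, whatever the kernel actually drew.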
A side thread, a fixed cadence, and a trapezoid
The fix is to separate sampling from the kernels entirely. A background thread polls the power sensor on a fixed interval, a few milliseconds apart, and timestamps every reading, building a continuous trace of how the board's draw moved through the whole run. The begin and end callbacks no longer read power at all. They record a wall-clock window, the moment the region opened and the moment it closed. To get a region's energy, the connector integrates the power trace over that window with the trapezoidal rule, summing the little trapezoids between consecutive samples that fall inside it. Because the same region is entered thousands of times, its joules accumulate across every call, which is the energy(J) column above and the only honest way to talk about a kernel that runs in tens of microseconds.
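The integration step can be sketched as below. This is a minimal version under assumptions of my own: timestamps are plain seconds in a double, the trace is sorted by time, and segments that straddle a window edge are clipped with linear interpolation. That last choice matters for windows shorter than the sample interval, and it is an assumption of this sketch, not necessarily what the connector does. Sample and integrate_energy are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample {
  double t;      // seconds since the trace started
  double watts;  // power reading taken at time t
};

// Linearly interpolated power at time t, for t between samples a and b.
static double power_at(const Sample& a, const Sample& b, double t) {
  if (b.t == a.t) return a.watts;  // guard against duplicate timestamps
  return a.watts + (b.watts - a.watts) * (t - a.t) / (b.t - a.t);
}

// Energy in joules over the window [t0, t1], by the trapezoidal rule.
// Each pair of consecutive samples contributes the part of its trapezoid
// that overlaps the window; pairs entirely outside contribute nothing.
double integrate_energy(const std::vector<Sample>& trace, double t0, double t1) {
  assert(t1 >= t0);
  double joules = 0.0;
  for (std::size_t i = 1; i < trace.size(); ++i) {
    const Sample& a = trace[i - 1];
    const Sample& b = trace[i];
    const double lo = std::max(a.t, t0);  // clipped left edge
    const double hi = std::min(b.t, t1);  // clipped right edge
    if (hi <= lo) continue;               // no overlap with the window
    joules += 0.5 * (power_at(a, b, lo) + power_at(a, b, hi)) * (hi - lo);
  }
  return joules;
}
```

A region's total is then the sum of integrate_energy over every window recorded for it. For a trace of 100 W at t = 0 s, 100 W at t = 1 s and 200 W at t = 2 s, the window [0, 2] integrates to 250 J.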
Sampling on the clock instead of on the kernel is what makes short kernels measurable. A single launch may be too brief to catch even one fresh sensor reading, but ten thousand launches under a steadily polled trace land enough samples that the aggregate is sound. The trade is a little overhead from the polling thread and a resolution floor set by the sample interval, and both are small and, more to the point, bounded and known.
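A fixed-cadence sampler along these lines might look like the sketch below. PowerSampler and TimedSample are hypothetical names, and the sensor is passed in as a plain callable so the example stays self-contained; in an NVML build that callable would wrap nvmlDeviceGetPowerUsage and convert milliwatts to watts.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct TimedSample {
  std::chrono::steady_clock::time_point when;  // when the reading was taken
  double watts;                                // what the sensor returned
};

class PowerSampler {
 public:
  PowerSampler(std::function<double()> read_watts,
               std::chrono::microseconds interval)
      : read_watts_(std::move(read_watts)), interval_(interval) {
    assert(interval_.count() > 0);
  }
  ~PowerSampler() { stop(); }

  void start() {
    if (worker_.joinable()) return;  // already running
    running_ = true;
    worker_ = std::thread([this] {
      auto next = std::chrono::steady_clock::now();
      while (running_) {
        const double w = read_watts_();
        {
          std::lock_guard<std::mutex> lock(mutex_);
          trace_.push_back({std::chrono::steady_clock::now(), w});
        }
        next += interval_;  // fixed cadence, independent of read latency
        std::this_thread::sleep_until(next);
      }
    });
  }

  void stop() {
    running_ = false;
    if (worker_.joinable()) worker_.join();
  }

  // Returns a copy so callers can integrate without holding the lock.
  std::vector<TimedSample> trace() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return trace_;
  }

 private:
  std::function<double()> read_watts_;
  std::chrono::microseconds interval_;
  std::atomic<bool> running_{false};
  std::thread worker_;
  mutable std::mutex mutex_;
  std::vector<TimedSample> trace_;
};
```

Sleeping until an absolute deadline, rather than for a relative interval, keeps the cadence from drifting when a sensor read is slow. The polling cost is one sensor read and one locked push_back per interval.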
What the number is, and what it is not
I would rather state the limits plainly than let the table imply more precision than it has. NVML reports power for the whole board, not per streaming multiprocessor, so this is whole-GPU attribution. If two kernels run concurrently on the same device, on separate streams, the trace cannot tell you which one drew which watt, and the energy of the overlap cannot be split cleanly between them. The figure also includes the device's idle draw, the tens of watts a powered-on GPU spends doing nothing, so for the marginal cost of a kernel you measure an idle baseline with the device quiet and subtract it, which is the line under the table and the floor in the diagram. None of this makes the measurement wrong. It makes it whole-device and honest about its resolution, which for ranking kernels by energy is exactly enough.
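The baseline subtraction itself is one line of arithmetic. A sketch, with marginal_energy_joules as a hypothetical helper name and the idle figure treated as a constant measured separately:

```cpp
#include <cassert>

// Marginal energy of a region: the measured joules minus what an idle
// device would have drawn over the same accumulated time.
double marginal_energy_joules(double measured_joules, double seconds,
                              double idle_watts) {
  assert(seconds >= 0.0 && idle_watts >= 0.0);
  return measured_joules - idle_watts * seconds;
}
```

Applied to the table's totals with the roughly 61 W baseline, gemm_apply comes to about 1936.1 - 61 × 8.42 ≈ 1422.5 J, spmv_matvec to about 746.5 J, and halo_exchange to about 164.2 J. Because the baseline is itself approximate, these are estimates, but the ranking does not change.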
Two backends, two different questions
NVML answers one question very precisely: what did this NVIDIA GPU draw. It is per-board, milliwatt-resolution, and NVIDIA-only, and it sees nothing outside the card. So the connector has a second backend built on Variorum, which is vendor-neutral and reads power at the node and socket level, covering the CPU through RAPL, the DRAM, and some GPUs, including hardware that is not NVIDIA. The two are not redundant. NVML tells you what the GPU drew. Variorum tells you what the node drew. A kernel that looks cheap on the card can still be shuffling enough data to light up the CPU and the memory controllers around it, and only the node-level view catches that. You reach for NVML when you are tuning a GPU kernel in isolation and for Variorum when you want the energy bill the machine room actually sees.
What per-kernel joules buy you
Once energy is attributed to the kernel that spent it, you can finally optimize the quantity you are actually billed for instead of using time as a stand-in and hoping the two agree. They do not always agree, which is the entire point: the fastest kernel is frequently not the most energy-efficient one, because going fast can mean running the silicon at its power ceiling, and a slower memory-bound kernel can be the cheaper one to run a million times. None of that is visible on a timeline, and all of it is visible once the joules sit next to the call count.
The connector lives as a small PR stack open upstream on kokkos/kokkos-tools, the NVML backend and the Variorum one, and on my own machines the per-kernel joules feed the same dashboard that the rest of my GPU work reports into, so a run shows energy-to-solution beside utilization rather than in a separate log nobody opens. The next thing I want is to push the resolution below the whole board, because whole-device attribution is honest but coarse, and concurrent streams deserve a cleaner answer than the one I can give today. If you have measured energy-to-solution at finer than whole-device resolution, or found a sane way to divide the bill across overlapping kernels, I would genuinely like to compare notes. From January 2027 I am also open to roles where this is the day job.