There’s a particular kind of dread that comes with waking up to find your homelab is unreachable.
My primary Proxmox node — a Dell OptiPlex 7060 running an i5-8500T — hosts eight VMs: my Home Assistant instance, a Grafana/Loki observability stack, a Paperless-NGX document server, a media server, and several others. It’s the backbone of everything I run at home.
And yet, more than once, I’d wake up to find the whole thing completely unreachable. Not just one service, but everything: no SSH, no Proxmox web UI, no Tailscale, not even a ping on the LAN. The machine was clearly still powered on (I could see the power LED), but it had vanished from the network as if someone had unplugged the Ethernet cable.
The only fix was to walk over and press the power button.
The day before I finally tracked this down, I’d been watching something from my Plex server all evening with the family. I’m fairly sure the same thing had happened about a month earlier; it didn’t make sense, but I didn’t have time to investigate then, so I rebooted and moved on. Then I woke up the next morning to find the VMs unreachable again. Same symptoms, same dead network, same power LED staring back at me. This time I decided to actually dig in and figure out what was going on instead of just rebooting and hoping for the best.
Finding the Smoking Gun
The machine had already been power-cycled, so I needed to look at the previous boot’s journal.
The first question: is this a one-off, or a pattern? When a Linux machine reboots, the system journal retains logs from previous boots. journalctl --list-boots shows every boot the system remembers, when each one started and when it ended. That tells you the shape of the history: how long each boot lasted, and how many there were. What it doesn’t tell you is why a boot ended; for that, you need to look at the tail of each boot’s journal.
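The command itself is a one-liner (output omitted here; each row shows the boot’s relative index, its boot ID, and the timestamps of its first and last journal entries):

```shell
# List every boot the journal remembers.
# Index 0 is the current boot, -1 the previous one, and so on.
journalctl --list-boots
```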
Right away this showed a pattern worth investigating:
- Boot -8 ran for ~19 days
- Boot -7 ran for nearly 11 months
- Boot -6 ran for 7 days
- Boot -5 ran for ~15 days
- Boot -4 ran for ~41 days
- Boot -3 ran for only 1 day
- Boot -2 ran for ~30 days
- Boot -1 ran for only 20 hours
- Boot 0 is the current boot
Since the boot list doesn’t tell you why those boots ended, I checked the tail of each one. Boots -8 and -7 ended with clean systemd shutdowns: no network issues, just routine power-offs. Boots -6 through -1 also ended with a power button press, but with the NIC already dead. In some cases the hang messages were still firing right up to the moment of shutdown; in others, Tailscale’s “UDP is blocked / network unreachable” errors were the last thing logged, which is the same failure seen from userspace rather than at the driver level.
Now I needed to find out what had happened. I targeted boot -1 (the most recent failure) and pulled all error-level messages:
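The invocation is a standard journalctl filter (`-p err` also includes crit, alert, and emerg):

```shell
# All error-level-or-worse messages from the previous boot (-1)
journalctl -b -1 -p err --no-pager
```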
The top of the list is startup noise — SGX disabled, nginx failing to start, Proxmox cluster filesystem complaining about quorum while corosync was still initialising. Normal stuff. But then at 08:00:27 a completely different kind of error appears: a kernel message from e1000e, the Intel NIC driver, reporting a Hardware Unit Hang. And it keeps repeating. That’s not startup noise, that’s the NIC locking up mid-operation, and it’s the last interesting thing that happens before the machine goes silent.
The kernel log format for network driver messages is [module] [pci_address] [interface]: [message], so e1000e 0000:00:1f.6 eno2 tells you the driver (e1000e), the PCI slot (00:1f.6), and the interface (eno2) in one line. I ran lspci against that PCI address to confirm what hardware it actually is:
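I haven’t preserved the exact output, but the lookup is:

```shell
# Identify the device in PCI slot 00:1f.6; -nn adds vendor:device IDs
lspci -nn -s 00:1f.6
```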
The Intel I219-LM — a perfectly common onboard NIC found in countless Dell, HP, and Lenovo business desktops. Exactly the kind of hardware that ends up in homelabs.
I grepped for that specific message to find exactly when it first fired:
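The grep (`-m1` stops at the first match):

```shell
# First occurrence of the hang message in boot -1
journalctl -b -1 --no-pager | grep -m1 "Detected Hardware Unit Hang"
```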
08:00:27 UTC — that’s the moment the NIC died. Boot -1 had been running since 01:08; the NIC held on for nearly 7 hours before locking up, then stayed dead for 13 more hours until I pressed the power button at 21:22. Now I had a timestamp to zoom in on:
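A window around that timestamp did the trick (journalctl interprets bare times in the system’s local timezone, which is UTC on this host):

```shell
# Everything logged in the two minutes surrounding the hang
journalctl -b -1 --since "08:00:00" --until "08:02:00" --no-pager
```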
The sequence is all there. At 08:00:27 the NIC driver fires “Detected Hardware Unit Hang” — the transmit descriptor ring has locked up and the hardware can’t process any outgoing packets. The TDH/TDT values in the dump are a snapshot of the ring state at the moment the driver gave up. Simultaneously, Corosync detects it has lost its link to the other cluster node. Five seconds later pmxcfs (the Proxmox cluster filesystem) loses quorum and the HA manager can’t write its status files. From 08:01 onward, the scheduler reports “no quorum!” every minute indefinitely.
The VMs themselves kept running — they had no idea the host’s physical NIC was dead. But nothing was reachable from the outside. The NIC never recovered on its own; it stayed hung from 08:00 until I pressed the power button at 21:22 — over 13 hours later.
How Bad Was It, Really?
Boot -1 confirmed the culprit. But how far back did this go? I counted hang events across all boots:
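A loop along these lines reproduces the per-boot counts (a sketch; the boot range is this machine’s):

```shell
# Count hang events in each remembered boot, oldest first
for i in $(seq -8 0); do
  count=$(journalctl -b "$i" --no-pager 2>/dev/null | grep -c "Detected Hardware Unit Hang")
  echo "boot $i: $count hang events"
done
```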
The full picture:
| Boot | Duration | Hang Events | Outcome |
|---|---|---|---|
| -8 | ~19 days | 3 | Recovered |
| -7 | ~11 months | 28 | Recovered (6 separate episodes) |
| -6 | ~7 days | 129,328 | NIC locked up — manual reboot |
| -5 | ~15 days | 199 | NIC locked up — manual reboot |
| -4 | ~41 days | 23,165 | NIC locked up — manual reboot |
| -3 | ~1 day | 2,284 | NIC locked up — manual reboot |
| -2 | ~30 days | 4,405 | NIC locked up — manual reboot |
| -1 | ~20 hours | 24,055 | NIC locked up — manual reboot |
The hardware bug was present the entire time. Boots -8 and -7 had occasional brief hangs that the driver recovered from on its own, and the machine kept running. From boot -6 onward, every single boot ended with the NIC permanently locked up, confirmed by the journal tails described above. In every one of those cases, the NIC locking up is definitively what made the machine unreachable and forced the manual power-off. That’s 183,467 hardware hang events in total across all eight boots.
The Root Cause
Some searching turned up that this is a known Intel hardware bug. Intel has acknowledged it as an “old known HW bug” tied to TSO (TCP Segmentation Offload) on I219 chipsets. A 2019 kernel patch submission titled “e1000e: Work around hardware unit hang by disabling TSO” documents the issue. Intel’s own recommendation is to disable TSO via ethtool.
The bug manifests when the NIC’s transmit hardware gets stuck trying to process segmentation offload requests. The descriptor ring locks up and the driver can’t recover it. No upstream kernel fix has been merged — this is a hardware-level issue that software can only work around.
What Are TSO, GSO, and GRO — And Why Is It Safe to Disable Them?
Before jumping to the fix, it’s worth understanding what we’re actually turning off and why it doesn’t matter.
TCP Segmentation Offload (TSO)
When your server sends a large chunk of data over the network, it can’t just shove the whole thing out the wire at once. TCP requires that data be broken into segments that fit within the network’s Maximum Transmission Unit (MTU), typically 1,500 bytes for Ethernet. Normally, the CPU does this work: it takes a large buffer (say, 64 KB), chops it into ~44 segments of ~1,460 bytes each, adds TCP/IP headers to each one, and hands them to the NIC one at a time.
TSO moves this work to the NIC hardware. Instead of the CPU creating 44 small packets, it hands the NIC one large 64 KB chunk and says “you segment this.” The NIC’s onboard silicon splits it up and sends it out. This reduces CPU overhead because the kernel only has to build one set of headers instead of 44.
And this is where the problem lies: the I219’s segmentation hardware has a bug. Under certain conditions, the transmit descriptor ring (the queue through which the CPU tells the NIC “here’s data to send”) gets permanently stuck. The head and tail pointers freeze, and the NIC can never process another outgoing packet.
Generic Segmentation Offload (GSO)
GSO is the software-side cousin of TSO. It delays segmentation as long as possible within the kernel’s network stack, batching work to reduce per-packet overhead. When TSO is available, GSO hands the large frames to the NIC for hardware segmentation. When TSO is off, GSO still helps by batching the segmentation work in the kernel more efficiently than doing it packet-by-packet. Disabling GSO alongside TSO is a belt-and-suspenders measure — it ensures no large frames reach the NIC’s transmit path that might trigger the bug.
Generic Receive Offload (GRO)
GRO is the receive-side equivalent: it reassembles small incoming packets into larger buffers before handing them to the kernel, reducing per-packet processing overhead on the CPU. While GRO isn’t directly related to the transmit hang, some users report that disabling it alongside TSO/GSO provides more consistent stability with the e1000e driver.
Energy Efficient Ethernet (EEE)
EEE allows the NIC to enter low-power states during idle periods. On the I219, this can interact poorly with the transmit path — the NIC may not wake cleanly from a low-power state, contributing to hangs. Disabling it keeps the NIC fully powered at all times.
Why Disabling These Is Fine at 1 GbE
The whole point of offloading segmentation to hardware is to save CPU cycles. But at 1 Gbps, the CPU savings are trivial.
To verify this, I ran iperf3 benchmarks between pvea (Intel I219-LM) and pveb (Realtek RTL8111) — 10-second tests in each direction, with offloads on and off:
| Test | Offloads OFF | Offloads ON | Difference |
|---|---|---|---|
| TX Throughput (pvea → pveb) | 884.9 Mbps | 886.1 Mbps | +1.2 Mbps (+0.1%) |
| TX CPU (pvea host) | 2.16% | 2.10% | -0.06% |
| TX Retransmits | 0 | 0 | — |
| RX Throughput (pveb → pvea) | 891.0 Mbps | 890.9 Mbps | -0.1 Mbps (~0%) |
| RX CPU (pvea host) | 28.82% | 6.88% | -21.94% |
Test conditions: 10s iperf3 over WireGuard tunnel (MTU 1280), TCP/cubic, single stream. Note: running through a WireGuard tunnel means the physical NIC carries UDP-encapsulated traffic rather than raw TCP, so these results reflect the offload impact in that specific topology — not necessarily bare Ethernet.
Full test commands and raw output (click to expand)
Step 1: Verify current offload state
Step 2: TX test — offloads OFF (pvea → pveb)
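The exact invocations aren’t reproduced here, but the shape of the test was this (hostnames from my lab; `pveb` runs `iperf3 -s` as the server):

```shell
# Disable offloads on the sender, then run a 10-second TCP test toward pveb
ethtool -K eno2 tso off gso off gro off
iperf3 -c pveb -t 10
```

Adding `-R` to the iperf3 command reverses the direction for the RX tests.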
Result: 885.6 Mbps, 0 retransmits, 2.14% host CPU
Step 3: RX test — offloads OFF (pveb → pvea)
Result: 890.8 Mbps received, 28.92% host CPU (no GRO — kernel processing every packet individually)
Step 4: Enable offloads
Step 5: TX test — offloads ON (pvea → pveb)
Result: 886.1 Mbps, 0 retransmits, 2.09% host CPU
Step 6: RX test — offloads ON (pveb → pvea)
Result: 891.0 Mbps received, 6.82% host CPU (GRO coalescing packets in software)
Step 7: Restore offloads OFF
Throughput is identical — within noise margin in both directions. The NIC saturates at ~885-891 Mbps regardless of offload settings.
The CPU story is more interesting. On transmit, there’s essentially no difference (2.16% vs 2.10%). But on receive, disabling GRO increases CPU usage from 6.88% to 28.82%. This makes sense — without GRO, the kernel processes every incoming packet individually instead of coalescing them into larger buffers. At MTU 1280, that’s a lot of small packets.
But let’s put 28.82% in context:
- This is peak CPU during a sustained full-line-rate flood — a worst-case synthetic benchmark, not normal operation
- Normal homelab traffic (web UIs, file transfers, API calls) uses a fraction of this bandwidth
- The i5-8500T has 6 cores — 28.82% of one core is ~4.8% of total CPU capacity
- TSO/GSO (transmit side) is the actual trigger for the hardware hang, and disabling those costs basically nothing
The trade-off: a few percent more CPU during heavy inbound transfers, in exchange for a NIC that doesn’t randomly brick itself. Easy choice.
The Fix
The fix is straightforward: disable the NIC offload features that trigger the bug.
Immediate (takes effect now):
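Applied live with ethtool (interface name is this host’s):

```shell
# Disable the offloads that trigger the hang, effective immediately
ethtool -K eno2 tso off gso off gro off
# Disable Energy Efficient Ethernet as well
ethtool --set-eee eno2 eee off
```

These settings don’t survive a reboot on their own, hence the persistent version below.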
Persistent (survives reboots):
Add post-up commands to /etc/network/interfaces:
iface eno2 inet manual
post-up /sbin/ethtool -K eno2 tso off gso off gro off
post-up /sbin/ethtool --set-eee eno2 eee off
Verify it worked:
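Two read-only checks (`-k`, lowercase, reads the current state; `-K` sets it):

```shell
# Every line should say "off"
ethtool -k eno2 | grep -E "tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload"
# EEE status
ethtool --show-eee eno2
```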
Automating It With Ansible
I didn’t want to manually check every new machine I add to the homelab. Since I manage my Proxmox nodes with Ansible, I added a task that auto-detects e1000e NICs and applies the workaround:
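My actual task file isn’t reproduced here; the sketch below shows the driver-detection approach, with task and variable names of my own choosing:

```yaml
# Detect any interface bound to the e1000e driver, then apply the ethtool
# workaround to each one found. (EEE and persistence are handled the same way.)
- name: Find interfaces using the e1000e driver
  ansible.builtin.shell: |
    for iface in /sys/class/net/*; do
      driver=$(basename "$(readlink -f "$iface/device/driver" 2>/dev/null)")
      [ "$driver" = "e1000e" ] && basename "$iface"
    done
    true  # exit zero even when the last interface doesn't match
  register: e1000e_ifaces
  changed_when: false

- name: Disable TSO/GSO/GRO on affected interfaces
  ansible.builtin.command: "/sbin/ethtool -K {{ item }} tso off gso off gro off"
  loop: "{{ e1000e_ifaces.stdout_lines }}"
```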
The key design choice here: it detects the driver, not the interface name. Interface names vary between machines (eno1, eno2, enp0s31f6, etc.), but the driver is always e1000e for affected Intel NICs. This means:
- Running it on my `pvea` (Dell OptiPlex 7060, Intel I219-LM) → detects `eno2`, applies the fix
- Running it on my `pveb` (Realtek RTL8111 NIC) → detects nothing, skips cleanly
- Any future machine with an Intel I219/I218 → automatically gets the fix
Deploy it with:
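Deployment is a normal playbook run (playbook and tag names here are placeholders from my setup; yours will differ):

```shell
ansible-playbook site.yml --limit pvea --tags nic-workaround
```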
Monitoring: Knowing If the Fix Holds
The ethtool workaround has been solid for others, but “trust but verify” is good practice, especially when the failure mode is a silent network death that requires a physical reboot. Here’s what’s worth watching and how to detect a problem before you wake up to a dead homelab.
What to Monitor
1. NIC TX/RX errors and drops — The earliest warning sign. Even with offloads disabled, a failing NIC may start showing errors before it locks up completely.
The single most important counter is tx_timeout_count — if this increments, the transmit path has stalled and the driver had to reset it. One or two might be harmless; a climbing count means the hardware hang is happening despite the workaround.
2. Kernel log messages — The “Detected Hardware Unit Hang” message is the definitive signal. If you see even one after applying the fix, something isn’t right.
3. Offload state — Verify the workaround is actually applied. A kernel update or network restart could reset these.
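All three checks are one-liners you can run by hand:

```shell
# 1. Error counters; tx_timeout_count is the key one
ethtool -S eno2 | grep -E "tx_timeout|error|drop"
# 2. Hang messages in the current boot's kernel log (0 is the healthy answer)
journalctl -k -b | grep -c "Detected Hardware Unit Hang"
# 3. Offload state; each line should end in "off"
ethtool -k eno2 | grep -E "tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload"
```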
A Lightweight Monitoring Script
Rather than deploying a full node_exporter stack (which is the right long-term answer), here’s a cron-based script that checks for trouble and logs to syslog. You can pick this up with any log aggregator:
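A sketch of that script (interface name, syslog tag, and priorities are my choices; adjust for your host):

```shell
#!/bin/bash
# check-e1000e-health.sh - log NIC trouble to syslog before it becomes an outage
IFACE="eno2"
TAG="e1000e-health"

# 1. Transmit stalls: any nonzero tx_timeout_count means the TX path had to be reset
timeouts=$(ethtool -S "$IFACE" | awk '/tx_timeout_count/ {print $2}')
if [ "${timeouts:-0}" -gt 0 ]; then
    logger -t "$TAG" -p daemon.warning "$IFACE tx_timeout_count=$timeouts"
fi

# 2. The definitive signal: hang messages in this boot's kernel log
hangs=$(journalctl -k -b | grep -c "Detected Hardware Unit Hang")
if [ "$hangs" -gt 0 ]; then
    logger -t "$TAG" -p daemon.err "$IFACE: $hangs Hardware Unit Hang events this boot"
fi

# 3. Drift check: the offload workaround should still be applied
if ethtool -k "$IFACE" | grep -q "tcp-segmentation-offload: on"; then
    logger -t "$TAG" -p daemon.err "$IFACE: TSO re-enabled, workaround not applied"
fi
```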
# Cron entry (every 5 minutes)
*/5 * * * * /usr/local/bin/check-e1000e-health.sh
What About Alerting?
The honest answer: if the hardware hang occurs despite the offload workaround, the NIC is already dead and no alert will help — you can’t receive an alert from a machine with no network. The machine would need an out-of-band management interface (IPMI/iDRAC/iLO) or a second NIC to send the alert, and a Dell OptiPlex doesn’t have either.
What alerting can catch:
| Signal | Alert Threshold | Why |
|---|---|---|
| `tx_timeout_count` > 0 | Any increment | Transmit stalls are precursors to full hangs |
| Offloads re-enabled | `tso != off` | Kernel update or network restart reset the workaround |
| Node unreachable | Ping/HTTP check fails | External monitoring detects the outage (after the fact) |
| "Hardware Unit Hang" in logs | Any occurrence | The fix isn’t working — investigate immediately |
The most practical alert is an external uptime check — something outside the affected machine that pings it or hits a health endpoint every minute. If pvea goes dark, you’ll get a notification on your phone rather than discovering it hours later when you try to use Home Assistant. Services like Uptime Kuma (self-hosted on a different machine), or even a simple cron on pveb that pings pvea and sends an email if it fails, would catch this.
Does It Actually Hold?
As of publishing, pvea has been running since Feb 17 21:23 UTC — over 8 days — with zero hardware hang events:
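The check is the same grep as before, now against the current boot:

```shell
# Hang events since the Feb 17 boot; currently prints 0
journalctl -k -b | grep -c "Detected Hardware Unit Hang"
```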
And the fix is confirmed still in place:
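Confirmed with ethtool:

```shell
# All three should report "off"
ethtool -k eno2 | grep -E "tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload"
```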
Before the fix, the NIC was hanging constantly — 183,436 times across the six affected boots. Eight days clean is a meaningful data point.
References
- Proxmox Forum — Intel NIC e1000e hardware unit hang (SOLVED)
- Proxmox Forum — 6.8.12-9-pve kernel e1000e problem
- Proxmox Forum — e1000e Hardware Unit Hang on PVE 9 kernel 6.14
- Fixing Intel e1000e NIC hangs on Proxmox nodes — Garrett Laman
- First2Host — How to fix Proxmox Hardware Unit Hang
- Arch Linux Forums — Detected Hardware Unit Hang repeatedly
- Red Hat Bugzilla — e1000e Detected Hardware Unit Hang
