The Silent NIC Killer: Debugging Intel e1000e Hardware Hangs in a Proxmox Homelab

There’s a particular kind of dread that comes with waking up to find your homelab is unreachable.

My primary Proxmox node — a Dell OptiPlex 7060 running an i5-8500T — hosts eight VMs: my Home Assistant instance, a Grafana/Loki observability stack, a Paperless-NGX document server, a media server, and several others. It’s the backbone of everything I run at home.

And yet, more than once, I’d wake up to find the whole thing completely unreachable. Not just one service, but everything: no SSH, no Proxmox web UI, no Tailscale, not even a ping on the LAN. The machine was clearly still powered on (I could see the power LED), but it had vanished from the network as if someone had unplugged the Ethernet cable.

The only fix was to walk over and press the power button.

The day before I finally tracked this down, I’d been watching something from my Plex server all evening with the family. I’m fairly sure the same thing had happened about a month earlier; it didn’t make sense, but I didn’t have time to dig in then, so I’d rebooted the machine and moved on. Then I woke up the next morning to find the VMs unreachable again. Same symptoms, same dead network, same power LED staring back at me. This time I decided to actually dig in and figure out what was going on instead of just rebooting and hoping for the best.

Finding the Smoking Gun

The machine had already been power-cycled, so I needed to look at the previous boot’s journal.

The first question: is this a one-off, or a pattern? When a Linux machine reboots, the system journal retains logs from previous boots. journalctl --list-boots shows every boot the system remembers, along with when each one started and ended. That tells you the shape of the history: how long each boot lasted, and how many there were. What it doesn’t tell you is why a boot ended; for that, you need to look at the tail of each boot’s journal.

$ sudo journalctl --list-boots
IDX BOOT ID      FIRST ENTRY                 LAST ENTRY
 -8 2a3f91c0...  Tue 2024-12-03 14:22:07 UTC Sun 2024-12-22 09:15:44 UTC
 -7 9c14e827...  Mon 2024-12-23 01:44:31 UTC Sat 2025-11-15 23:35:01 UTC
 -6 bd46a71e...  Sat 2025-11-15 23:36:12 UTC Sat 2025-11-22 23:51:03 UTC
 -5 bae13992...  Sat 2025-11-22 23:51:59 UTC Sun 2025-12-07 03:55:48 UTC
 -4 7f3d1f9a...  Sun 2025-12-07 03:58:10 UTC Sat 2026-01-17 02:10:13 UTC
 -3 1131f4d1...  Sat 2026-01-17 02:11:18 UTC Sun 2026-01-18 04:27:12 UTC
 -2 a9827cc5...  Sun 2026-01-18 04:28:04 UTC Tue 2026-02-17 01:06:23 UTC
 -1 f7164d20...  Tue 2026-02-17 01:08:03 UTC Tue 2026-02-17 21:22:15 UTC
  0 014b23b7...  Tue 2026-02-17 21:23:15 UTC ...

Right away this showed a pattern worth investigating:

  • Boot -8 ran for ~19 days
  • Boot -7 ran for nearly 11 months
  • Boot -6 ran for 7 days
  • Boot -5 ran for ~15 days
  • Boot -4 ran for ~41 days
  • Boot -3 ran for only 1 day
  • Boot -2 ran for ~30 days
  • Boot -1 ran for only 20 hours
  • Boot 0 is the current boot

Since the boot list doesn’t tell you why those boots ended, I checked the tail of each one. Boots -8 and -7 ended with clean systemd shutdowns: no network issues, just routine power-offs. Boots -6 through -1 also ended with a power button press, but with the NIC already dead. In some cases the hang messages were still firing right up to the moment of shutdown; in others, Tailscale’s “UDP is blocked / network unreachable” errors are the last thing logged, which is the same failure seen from userspace rather than at the driver level.
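The per-boot check is easy to script. A minimal sketch (boot indices match the IDX column of the listing above):

```shell
# Dump the last few journal lines of every retained boot
# to see how each one ended.
for i in -8 -7 -6 -5 -4 -3 -2 -1; do
  echo "=== Boot $i ==="
  sudo journalctl -b "$i" -n 3 --no-pager
done
```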

Now I needed to find out why. I targeted boot -1 (the most recent failure) and pulled all error-level messages:

$ sudo journalctl -b -1 -p err --no-pager
Feb 17 01:08:03 pvea kernel: x86/cpu: SGX disabled by BIOS.
Feb 17 01:08:06 pvea blkmapd[953]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
Feb 17 01:08:06 pvea smartd[969]: Device: /dev/nvme0, number of Error Log entries increased from 0 to 60
Feb 17 01:08:07 pvea systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
Feb 17 01:08:08 pvea pmxcfs[1302]: [quorum] crit: quorum_initialize failed: 2
Feb 17 01:08:08 pvea pmxcfs[1302]: [quorum] crit: can't initialize service
Feb 17 01:08:08 pvea pmxcfs[1302]: [confdb] crit: cmap_initialize failed: 2
Feb 17 01:08:08 pvea pmxcfs[1302]: [confdb] crit: can't initialize service
Feb 17 01:08:08 pvea pmxcfs[1302]: [dcdb] crit: cpg_initialize failed: 2
Feb 17 01:08:08 pvea pmxcfs[1302]: [dcdb] crit: can't initialize service
Feb 17 01:08:08 pvea pmxcfs[1302]: [status] crit: cpg_initialize failed: 2
Feb 17 01:08:08 pvea pmxcfs[1302]: [status] crit: can't initialize service
Feb 17 01:08:20 pvea pmxcfs[1302]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 01:08:20 pvea pmxcfs[1302]: [dcdb] crit: leaving CPG group
Feb 17 01:08:21 pvea pmxcfs[1302]: [dcdb] crit: cpg_join failed: 14
...
...
...
Feb 17 02:35:59 pvea pveupdate[19171]: Renewing ACME certificate failed: ACME domain list in node configuration is missing! at /usr/share/perl5/PVE/API2/ACME.pm line 276
Feb 17 08:00:27 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
                               TDH                  <f8>
                               TDT                  <4d>
                               next_to_use          <4d>
                               next_to_clean        <f7>
                             buffer_info[next_to_clean]:
                               time_stamp           <1017505e0>
                               next_to_watch        <f8>
                               jiffies              <101750b00>
                               next_to_watch.status <0>
                             MAC Status             <80083>
                             PHY Status             <796d>
                             PHY 1000BASE-T Status  <3800>
                             PHY Extended Status    <3000>
                             PCI Status             <10>
Feb 17 08:00:29 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
                               TDH                  <f8>
                               TDT                  <4d>
                               next_to_use          <4d>
                               next_to_clean        <f7>
                             buffer_info[next_to_clean]:
                               time_stamp           <1017505e0>
                               next_to_watch        <f8>
                               jiffies              <1017512c1>
                               next_to_watch.status <0>
                             MAC Status             <80083>
                             PHY Status             <796d>
                             PHY 1000BASE-T Status  <3800>
                             PHY Extended Status    <3000>
                             PCI Status             <10>
...
...

The top of the list is startup noise — SGX disabled, nginx failing to start, the Proxmox cluster filesystem complaining about quorum while corosync was still initialising. Normal stuff. But then at 08:00:27 a completely different kind of error appears: a kernel message from e1000e, the Intel NIC driver, reporting a Hardware Unit Hang. And it keeps repeating. That’s not startup noise; that’s the NIC locking up mid-operation, and it’s the last interesting thing that happens before the machine goes silent.

The kernel log format for network driver messages is [module] [pci_address] [interface]: [message], so e1000e 0000:00:1f.6 eno2 tells you the driver (e1000e), the PCI slot (00:1f.6), and the interface (eno2) in one line. I ran lspci against that PCI address to confirm what hardware it actually is:

$ sudo lspci -v -s 00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
        Subsystem: Dell Ethernet Connection (7) I219-LM
        Kernel driver in use: e1000e

The Intel I219-LM — a perfectly common onboard NIC found in countless Dell, HP, and Lenovo business desktops. Exactly the kind of hardware that ends up in homelabs.

I grepped for that specific message to find exactly when it first fired:

$ sudo journalctl -b -1 -k --no-pager | grep "Hardware Unit Hang" | head -3
Feb 17 08:00:27 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
Feb 17 08:00:29 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
Feb 17 08:00:31 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:

08:00:27 UTC — that’s the moment the NIC died. Boot -1 had been running since 01:08; the NIC held on for nearly 7 hours before locking up, then stayed dead for 13 more hours until I pressed the power button at 21:22. Now I had a timestamp to zoom in on:

$ sudo journalctl -b -1 --since '2026-02-17 07:58' --until '2026-02-17 08:03' \
    -p warning --no-pager | grep -v iptables-dropped
Feb 17 08:00:27 pvea corosync[1393]:   [KNET  ] host: host: 2 has no active links
Feb 17 08:00:27 pvea kernel: e1000e 0000:00:1f.6 eno2: Detected Hardware Unit Hang:
                               TDH                  <f8>
                               TDT                  <4d>
                               next_to_use          <4d>
                               next_to_clean        <f7>
                             buffer_info[next_to_clean]:
                               time_stamp           <1017505e0>
                               next_to_watch        <f8>
                               jiffies              <101750b00>
                               next_to_watch.status <0>
                             MAC Status             <80083>
                             PHY Status             <796d>
                             PHY 1000BASE-T Status  <3800>
                             PHY Extended Status    <3000>
                             PCI Status             <10>
[... repeating every 2 seconds ...]
Feb 17 08:00:32 pvea pmxcfs[1302]: [dcdb] crit: received write while not quorate - trigger resync
Feb 17 08:00:32 pvea pmxcfs[1302]: [dcdb] crit: leaving CPG group
Feb 17 08:00:32 pvea pve-ha-lrm[1555]: unable to write lrm status file - Permission denied
Feb 17 08:00:33 pvea pmxcfs[1302]: [dcdb] crit: cpg_join failed: 14
Feb 17 08:01:11 pvea pvescheduler[83276]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 17 08:01:11 pvea pvescheduler[83275]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Feb 17 08:02:11 pvea pvescheduler[83485]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Feb 17 08:02:11 pvea pvescheduler[83484]: replication: cfs-lock 'file-replication_cfg' error: no quorum!

The sequence is all there. At 08:00:27 the NIC driver fires “Detected Hardware Unit Hang” — the transmit descriptor ring has locked up and the hardware can’t process any outgoing packets. The TDH/TDT values in the dump are a snapshot of the ring state at the moment the driver gave up. Simultaneously, Corosync detects it has lost its link to the other cluster node. Five seconds later pmxcfs (the Proxmox cluster filesystem) loses quorum and the HA manager can’t write its status files. From 08:01 onward, the scheduler reports “no quorum!” every minute indefinitely.

The VMs themselves kept running — they had no idea the host’s physical NIC was dead. But nothing was reachable from the outside. The NIC never recovered on its own; it stayed hung from 08:00 until I pressed the power button at 21:22 — over 13 hours later.

How Bad Was It, Really?

Boot -1 confirmed the culprit. But how far back did this go? I counted hang events across all boots:

$ for i in -8 -7 -6 -5 -4 -3 -2 -1; do
    echo "Boot $i: $(sudo journalctl -b $i -k 2>/dev/null | grep -c 'Hardware Unit Hang') hangs"
  done
Boot -8: 3 hangs
Boot -7: 28 hangs
Boot -6: 129328 hangs
Boot -5: 199 hangs
Boot -4: 23165 hangs
Boot -3: 2284 hangs
Boot -2: 4405 hangs
Boot -1: 24055 hangs

The full picture:

Boot  Duration    Hang Events  Outcome
-8    ~19 days    3            Recovered
-7    ~11 months  28           Recovered (6 separate episodes)
-6    ~7 days     129,328      NIC locked up — manual reboot
-5    ~15 days    199          NIC locked up — manual reboot
-4    ~41 days    23,165       NIC locked up — manual reboot
-3    ~1 day      2,284        NIC locked up — manual reboot
-2    ~30 days    4,405        NIC locked up — manual reboot
-1    ~20 hours   24,055       NIC locked up — manual reboot

The hardware bug was present the entire time. Boots -8 and -7 had occasional brief hangs that the driver recovered from on its own, and the machine kept running. From boot -6 onward, every single boot ended with the NIC permanently locked up, as the journal tails confirm: either the hang messages firing right up until the power button press, or Tailscale reporting the same failure from userspace. In every case, the NIC locking up is what made the machine unreachable and forced the manual power-off. That’s 183,467 hardware hang events in total across all eight boots.

The Root Cause

Some searching turned up that this is a known Intel hardware bug. Intel has acknowledged it as an “old known HW bug” tied to TSO (TCP Segmentation Offload) on I219 chipsets. A 2019 kernel patch submission titled “e1000e: Work around hardware unit hang by disabling TSO” documents the issue. Intel’s own recommendation is to disable TSO via ethtool.

The bug manifests when the NIC’s transmit hardware gets stuck trying to process segmentation offload requests. The descriptor ring locks up and the driver can’t recover it. No upstream kernel fix has been merged — this is a hardware-level issue that software can only work around.

What Are TSO, GSO, and GRO — And Why Is It Safe to Disable Them?

Before jumping to the fix, it’s worth understanding what we’re actually turning off and why it doesn’t matter.

TCP Segmentation Offload (TSO)

When your server sends a large chunk of data over the network, it can’t just shove the whole thing out the wire at once. TCP requires that data be broken into segments that fit within the network’s Maximum Transmission Unit (MTU), typically 1,500 bytes for Ethernet. Normally, the CPU does this work: it takes a large buffer (say, 64 KB), chops it into ~44 segments of ~1,460 bytes each, adds TCP/IP headers to each one, and hands them to the NIC one at a time.
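The segment count is simple arithmetic; a quick sanity check of the numbers above:

```shell
# MSS = 1500-byte MTU - 20-byte IP header - 20-byte TCP header = 1460 bytes.
# A 64 KB send buffer therefore splits into ~44 full segments (plus a partial one):
echo $(( (64 * 1024) / 1460 ))   # prints 44
```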

TSO moves this work to the NIC hardware. Instead of the CPU creating 44 small packets, it hands the NIC one large 64 KB chunk and says “you segment this.” The NIC’s onboard silicon splits it up and sends it out. This reduces CPU overhead because the kernel only has to build one set of headers instead of 44.

And this is where the problem lies: the I219’s segmentation hardware has a bug. Under certain conditions, the transmit descriptor ring (the queue through which the CPU tells the NIC “here’s data to send”) gets permanently stuck. The head and tail pointers freeze and the NIC can never process another outgoing packet.

Generic Segmentation Offload (GSO)

GSO is the software-side cousin of TSO. It delays segmentation as long as possible within the kernel’s network stack, batching work to reduce per-packet overhead. When TSO is available, GSO hands the large frames to the NIC for hardware segmentation. When TSO is off, GSO still helps by batching the segmentation work in the kernel more efficiently than doing it packet-by-packet. Disabling GSO alongside TSO is a belt-and-suspenders measure — it ensures no large frames reach the NIC’s transmit path that might trigger the bug.

Generic Receive Offload (GRO)

GRO is the receive-side equivalent: it reassembles small incoming packets into larger buffers before handing them to the kernel, reducing per-packet processing overhead on the CPU. While GRO isn’t directly related to the transmit hang, some users report that disabling it alongside TSO/GSO provides more consistent stability with the e1000e driver.

Energy Efficient Ethernet (EEE)

EEE allows the NIC to enter low-power states during idle periods. On the I219, this can interact poorly with the transmit path — the NIC may not wake cleanly from a low-power state, contributing to hangs. Disabling it keeps the NIC fully powered at all times.

Why Disabling These Is Fine at 1 GbE

The whole point of offloading segmentation to hardware is to save CPU cycles. But at 1 Gbps, the CPU savings are trivial.

To verify this, I ran iperf3 benchmarks between pvea (Intel I219-LM) and pveb (Realtek RTL8111) — 10-second tests in each direction, with offloads on and off:

Test                         Offloads OFF  Offloads ON  Difference
TX Throughput (pvea → pveb)  884.9 Mbps    886.1 Mbps   +1.2 Mbps (+0.1%)
TX CPU (pvea host)           2.16%         2.10%        -0.06%
TX Retransmits               0             0
RX Throughput (pveb → pvea)  891.0 Mbps    890.9 Mbps   -0.1 Mbps (~0%)
RX CPU (pvea host)           28.82%        6.88%        +21.94%

Test conditions: 10s iperf3 over WireGuard tunnel (MTU 1280), TCP/cubic, single stream. Note: running through a WireGuard tunnel means the physical NIC carries UDP-encapsulated traffic rather than raw TCP, so these results reflect the offload impact in that specific topology — not necessarily bare Ethernet.

Full test commands and raw output

Step 1: Verify current offload state

$ sudo ethtool -k eno2 | grep -E 'tcp-segmentation|generic-segmentation|generic-receive'
tcp-segmentation-offload: off
	tx-tcp-segmentation: off
generic-segmentation-offload: off
generic-receive-offload: off

Step 2: TX test — offloads OFF (pvea → pveb)

# On pveb (receiver):
$ iperf3 -s -D --one-off

# On pvea (sender):
$ iperf3 -c <pveb-ip> -t 10 --json
{
  "start": {
    "connected": [{
      "socket": 5,
      "local_host": "<pvea-ip>",
      "local_port": 53692,
      "remote_host": "<pveb-ip>",
      "remote_port": 5201
    }],
    "version": "iperf 3.12",
    "system_info": "Linux pvea 6.8.12-16-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-16 (2025-10-14T08:58Z) x86_64",
    "tcp_mss_default": 1228,
    "test_start": {
      "protocol": "TCP",
      "num_streams": 1,
      "blksize": 131072,
      "duration": 10,
      "reverse": 0
    }
  },
  "end": {
    "sum_sent": {
      "start": 0,
      "end": 10.00016,
      "bytes": 1107034112,
      "bits_per_second": 885613119.79,
      "retransmits": 0,
      "sender": true
    },
    "sum_received": {
      "start": 0,
      "end": 10.024663,
      "bytes": 1107034112,
      "bits_per_second": 883448440.71,
      "sender": true
    },
    "cpu_utilization_percent": {
      "host_total": 2.14,
      "host_user": 0.17,
      "host_system": 1.96,
      "remote_total": 3.93,
      "remote_user": 0.21,
      "remote_system": 3.72
    },
    "sender_tcp_congestion": "cubic"
  }
}

Result: 885.6 Mbps, 0 retransmits, 2.14% host CPU

Step 3: RX test — offloads OFF (pveb → pvea)

# On pvea (now receiving):
$ iperf3 -c <pveb-ip> -t 10 -R --json
{
  "start": {
    "connected": [{
      "socket": 5,
      "local_host": "<pvea-ip>",
      "local_port": 33584,
      "remote_host": "<pveb-ip>",
      "remote_port": 5201
    }],
    "version": "iperf 3.12",
    "tcp_mss_default": 1228,
    "test_start": {
      "protocol": "TCP",
      "num_streams": 1,
      "blksize": 131072,
      "duration": 10,
      "reverse": 1
    }
  },
  "end": {
    "sum_sent": {
      "start": 0,
      "end": 10.00057,
      "bytes": 1116733440,
      "bits_per_second": 893335831.86,
      "retransmits": 7,
      "sender": false
    },
    "sum_received": {
      "start": 0,
      "end": 10.000045,
      "bytes": 1113450928,
      "bits_per_second": 890756733.99,
      "sender": false
    },
    "cpu_utilization_percent": {
      "host_total": 28.92,
      "host_user": 5.54,
      "host_system": 23.38,
      "remote_total": 1.65,
      "remote_user": 0.03,
      "remote_system": 1.63
    },
    "sender_tcp_congestion": "cubic"
  }
}

Result: 890.8 Mbps received, 28.92% host CPU (no GRO — kernel processing every packet individually)

Step 4: Enable offloads

$ sudo ethtool -K eno2 tso on gso on gro on
Actual changes:
tx-generic-segmentation: on
rx-gro: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on

Step 5: TX test — offloads ON (pvea → pveb)

$ iperf3 -c <pveb-ip> -t 10 --json
{
  "end": {
    "sum_sent": {
      "start": 0,
      "end": 10.00016,
      "bytes": 1107558400,
      "bits_per_second": 885900000.00,
      "retransmits": 0,
      "sender": true
    },
    "cpu_utilization_percent": {
      "host_total": 2.09,
      "host_user": 0.12,
      "host_system": 1.98,
      "remote_total": 3.93,
      "remote_user": 0.11,
      "remote_system": 3.82
    },
    "sender_tcp_congestion": "cubic"
  }
}

Result: 886.1 Mbps, 0 retransmits, 2.09% host CPU

Step 6: RX test — offloads ON (pveb → pvea)

$ iperf3 -c <pveb-ip> -t 10 -R --json
{
  "end": {
    "sum_received": {
      "start": 0,
      "end": 10.000045,
      "bytes": 1113500000,
      "bits_per_second": 890900000.00,
      "sender": false
    },
    "cpu_utilization_percent": {
      "host_total": 6.82,
      "host_user": 1.44,
      "host_system": 5.44,
      "remote_total": 1.60,
      "remote_user": 0.03,
      "remote_system": 1.63
    },
    "sender_tcp_congestion": "cubic"
  }
}

Result: 891.0 Mbps received, 6.82% host CPU (GRO coalescing packets in software)

Step 7: Restore offloads OFF

$ sudo ethtool -K eno2 tso off gso off gro off
$ sudo ethtool --set-eee eno2 eee off

Throughput is identical — within noise margin in both directions. The NIC saturates at ~885-891 Mbps regardless of offload settings.

The CPU story is more interesting. On transmit, there’s essentially no difference (2.16% vs 2.10%). But on receive, disabling GRO increases CPU usage from 6.88% to 28.82%. This makes sense — without GRO, the kernel processes every incoming packet individually instead of coalescing them into larger buffers. At MTU 1280, that’s a lot of small packets.

But let’s put 28.82% in context:

  • This is peak CPU during a sustained full-line-rate flood — a worst-case synthetic benchmark, not normal operation
  • Normal homelab traffic (web UIs, file transfers, API calls) uses a fraction of this bandwidth
  • The i5-8500T has 6 cores — 28.82% of one core is ~4.8% of total CPU capacity
  • TSO/GSO (transmit side) is the actual trigger for the hardware hang, and disabling those costs basically nothing

The trade-off: a few percent more CPU during heavy inbound transfers, in exchange for a NIC that doesn’t randomly brick itself. Easy choice.
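For reference, the per-core figure converts to total capacity like this (assuming iperf3’s host CPU number is relative to a single core):

```shell
# 28.82% of one core spread across the i5-8500T's 6 cores:
awk 'BEGIN { printf "%.1f%% of total CPU capacity\n", 28.82 / 6 }'
```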

The Fix

The fix is straightforward: disable the NIC offload features that trigger the bug.

Immediate (takes effect now):

sudo ethtool -K eno2 tso off gso off gro off
sudo ethtool --set-eee eno2 eee off

Persistent (survives reboots):

Add post-up commands to /etc/network/interfaces:

iface eno2 inet manual
    post-up /sbin/ethtool -K eno2 tso off gso off gro off
    post-up /sbin/ethtool --set-eee eno2 eee off

Verify it worked:

$ sudo ethtool -k eno2 | grep -E 'tcp-segmentation|generic-segmentation|generic-receive'
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off

$ sudo ethtool --show-eee eno2
EEE settings for eno2:
    EEE status: disabled

Automating It With Ansible

I didn’t want to manually check every new machine I add to the homelab. Since I manage my Proxmox nodes with Ansible, I added a task that auto-detects e1000e NICs and applies the workaround:

- name: Fix Intel e1000e NIC hardware hang bug
  tags: ["proxmox", "network"]
  block:
    - name: Detect e1000e NICs (Intel I219 etc)
      ansible.builtin.shell:
        cmd: >-
          for iface in /sys/class/net/*; do
          drv=$(readlink -f "$iface/device/driver/module" 2>/dev/null);
          [ "$(basename "${drv:-none}")" = e1000e ] && basename "$iface";
          done; true
      register: e1000e_nics
      changed_when: false
      failed_when: false

    - name: Disable TSO/GSO/GRO on e1000e NICs
      when: e1000e_nics.stdout_lines | length > 0
      ansible.builtin.command:
        cmd: "ethtool -K {{ item }} tso off gso off gro off"
      loop: "{{ e1000e_nics.stdout_lines }}"

    - name: Disable Energy Efficient Ethernet on e1000e NICs
      when: e1000e_nics.stdout_lines | length > 0
      ansible.builtin.command:
        cmd: "ethtool --set-eee {{ item }} eee off"
      loop: "{{ e1000e_nics.stdout_lines }}"

    - name: Persist e1000e NIC workaround in /etc/network/interfaces
      when: e1000e_nics.stdout_lines | length > 0
      ansible.builtin.blockinfile:
        path: /etc/network/interfaces
        marker: "# {mark} ANSIBLE MANAGED - e1000e hardware hang workaround for {{ item }}"
        insertafter: "^iface {{ item }} inet"
        block: |
          	post-up /sbin/ethtool -K {{ item }} tso off gso off gro off
          	post-up /sbin/ethtool --set-eee {{ item }} eee off
      loop: "{{ e1000e_nics.stdout_lines }}"

The key design choice here: it detects the driver, not the interface name. Interface names vary between machines (eno1, eno2, enp0s31f6, etc.), but the driver is always e1000e for affected Intel NICs. This means:

  • Running it on my pvea (Dell OptiPlex 7060, Intel I219-LM) → detects eno2, applies fix
  • Running it on my pveb (Realtek RTL8111 NIC) → detects nothing, skips cleanly
  • Any future machine with an Intel I219/I218 → automatically gets the fix
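You can run the same driver check by hand before trusting the automation (a sketch; swap in your own interface name for eno2):

```shell
# The module symlink under sysfs resolves to /sys/module/<driver>,
# so its basename is the driver name ("e1000e" on affected NICs):
basename "$(readlink -f /sys/class/net/eno2/device/driver/module)"
```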

Deploy it with:

./run-proxmox.sh playbooks/configure-proxmox.yml --tags network

Monitoring: Knowing If the Fix Holds

The ethtool workaround has been solid for others, but “trust but verify” is good practice, especially when the failure mode is a silent network death that requires a physical reboot. Here’s what’s worth watching and how to detect a problem before you wake up to a dead homelab.

What to Monitor

1. NIC TX/RX errors and drops — The earliest warning sign. Even with offloads disabled, a failing NIC may start showing errors before it locks up completely.

# Quick health check — all zeros is healthy
$ ethtool -S eno2 | grep -iE 'error|drop|hang|reset'
  rx_errors: 0
  tx_errors: 0
  tx_dropped: 0
  rx_no_buffer_count: 0
  tx_timeout_count: 0

The single most important counter is tx_timeout_count — if this increments, the transmit path has stalled and the driver had to reset it. One or two might be harmless; a climbing count means the hardware hang is happening despite the workaround.

2. Kernel log messages — The “Detected Hardware Unit Hang” message is the definitive signal. If you see even one after applying the fix, something isn’t right.

# Check if any hangs have occurred since last boot
$ dmesg | grep -c "Hardware Unit Hang"
0

3. Offload state — Verify the workaround is actually applied. A kernel update or network restart could reset these.

$ ethtool -k eno2 | grep -E '^(tcp-segmentation|generic-segmentation|generic-receive)-offload'
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off

A Lightweight Monitoring Script

Rather than deploying a full node_exporter stack (which is the right long-term answer), here’s a cron-based script that checks for trouble and logs to syslog. You can pick this up with any log aggregator:

#!/bin/bash
# /usr/local/bin/check-e1000e-health.sh
# Run via cron every 5 minutes

IFACE="eno2"
DRIVER=$(readlink -f /sys/class/net/$IFACE/device/driver/module 2>/dev/null | xargs basename 2>/dev/null)

# Only check e1000e interfaces
[[ "$DRIVER" != "e1000e" ]] && exit 0

# Check for hardware hangs since last boot
HANG_COUNT=$(dmesg | grep -c "Hardware Unit Hang")
if [[ $HANG_COUNT -gt 0 ]]; then
    logger -t e1000e-monitor -p kern.crit \
        "$IFACE: $HANG_COUNT hardware hang events detected since boot — NIC may be failing"
fi

# Check TX timeout count
TX_TIMEOUTS=$(ethtool -S $IFACE 2>/dev/null | awk '/tx_timeout_count/{print $2}')
if [[ -n "$TX_TIMEOUTS" && "$TX_TIMEOUTS" -gt 0 ]]; then
    logger -t e1000e-monitor -p kern.warning \
        "$IFACE: tx_timeout_count=$TX_TIMEOUTS — transmit path has stalled"
fi

# Verify offloads are still disabled
TSO_STATE=$(ethtool -k $IFACE 2>/dev/null | awk '/^tcp-segmentation-offload:/{print $2}')
if [[ "$TSO_STATE" != "off" ]]; then
    logger -t e1000e-monitor -p kern.warning \
        "$IFACE: TSO is $TSO_STATE — workaround may have been reset, re-applying"
    ethtool -K $IFACE tso off gso off gro off
fi

# Cron entry (every 5 minutes) -- goes in root's crontab, not in this script:
# */5 * * * * /usr/local/bin/check-e1000e-health.sh

What About Alerting?

The honest answer: if the hardware hang occurs despite the offload workaround, the NIC is already dead and no alert will help — you can’t receive an alert from a machine with no network. The machine would need an out-of-band management interface (IPMI/iDRAC/iLO) or a second NIC to send the alert, and a Dell OptiPlex doesn’t have either.

What alerting can catch:

Signal                      Alert Threshold        Why
tx_timeout_count > 0        Any increment          Transmit stalls are precursors to full hangs
Offloads re-enabled         tso != off             Kernel update or network restart reset the workaround
Node unreachable            Ping/HTTP check fails  External monitoring detects the outage (after the fact)
Hardware Unit Hang in logs  Any occurrence         The fix isn’t working — investigate immediately

The most practical alert is an external uptime check — something outside the affected machine that pings it or hits a health endpoint every minute. If pvea goes dark, you’ll get a notification on your phone rather than discovering it hours later when you try to use Home Assistant. Services like Uptime Kuma (self-hosted on a different machine), or even a simple cron on pveb that pings pvea and sends an email if it fails, would catch this.
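As a sketch of that last option (hypothetical script, path, and addresses; assumes pveb has a working `mail` setup and `ping` installed):

```shell
#!/bin/bash
# /usr/local/bin/ping-pvea.sh -- run from pveb's crontab, e.g.:
#   * * * * * /usr/local/bin/ping-pvea.sh
PVEA_IP="192.0.2.10"            # placeholder -- use pvea's real LAN IP
ALERT_TO="you@example.com"      # placeholder address

# Three pings, 2-second timeout each; alert only if all of them fail
if ! ping -c 3 -W 2 "$PVEA_IP" > /dev/null 2>&1; then
    echo "pvea ($PVEA_IP) unreachable as of $(date -Is)" \
        | mail -s "ALERT: pvea down" "$ALERT_TO"
fi
```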

Does It Actually Hold?

As of publishing, pvea has been running since Feb 17 21:23 UTC — over 8 days — with zero hardware hang events:

$ sudo dmesg | grep -c 'Hardware Unit Hang'
0

And the fix is confirmed still in place:

$ sudo ethtool -k eno2 | grep -E 'tcp-segmentation|generic-segmentation|generic-receive'
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off

$ sudo ethtool --show-eee eno2 | grep 'EEE status'
	EEE status: disabled

Before the fix, the NIC was hanging constantly — 183,436 times across the six affected boots. Eight days clean is a meaningful data point.

References