Proxmox kernel 6.8.12-9 Crashing Intel GBE Controllers
A recent regression introduced in Proxmox kernel 6.8.12-9-pve is crashing Intel GBE ethernet controllers. The bug affects other Linux distributions as well.
The Problem
My NAS has a 2-port GBE NIC based on the Intel I350 controller. On Apr 7, I rebooted my NAS running on Proxmox to install some updates. This bumped the kernel version to 6.8.12-9-pve.
A week later, the NAS lost network connectivity and reconnecting the ethernet cable didn’t fix the issue. The only way to get networking back up was restarting the machine. These symptoms are similar to the problems I had last year with the Intel I225-V controller. However, the logs indicate that this is a different issue:
Apr 12 04:06:13.806019 proxmox kernel: igb 0000:04:00.0 enp4s0f0: PCIe link lost
Apr 12 04:06:13.808298 proxmox kernel: ------------[ cut here ]------------
Apr 12 04:06:13.808320 proxmox kernel: igb: Failed to read reg 0xc030!
Apr 12 04:06:13.808330 proxmox kernel: WARNING: CPU: 15 PID: 3617408 at drivers/net/ethernet/intel/igb/igb_main.c:750 igb_rd32+0x93/0xb0 [igb]
Apr 12 04:06:13.808386 proxmox kernel: Modules linked in: cfg80211 nfsd auth_rpcgss nfs_acl lockd grace veth ebtable_filter ebtables ip_set ip6table_raw ip>
Apr 12 04:06:13.809046 proxmox kernel: xhci_pci nvme xhci_pci_renesas sparse_keymap platform_profile crc32_pclmul igb i2c_piix4 igc nvme_core ahci i2c_alg>
Apr 12 04:06:13.809067 proxmox kernel: CPU: 15 PID: 3617408 Comm: kworker/15:0 Tainted: P O 6.8.12-9-pve #1
Apr 12 04:06:13.809083 proxmox kernel: Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING WIFI II, BIOS 3607 03/22/2024
Apr 12 04:06:13.809095 proxmox kernel: Workqueue: events igb_watchdog_task [igb]
Apr 12 04:06:13.809105 proxmox kernel: RIP: 0010:igb_rd32+0x93/0xb0 [igb]
Apr 12 04:06:13.809117 proxmox kernel: Code: c7 c6 03 24 7a c0 e8 cc 90 c9 d7 48 8b bb 28 ff ff ff e8 90 d0 77 d7 84 c0 74 c1 44 89 e6 48 c7 c7 f8 30 7a c0>
Apr 12 04:06:13.809129 proxmox kernel: RSP: 0018:ffffb6bc15fd7d88 EFLAGS: 00010246
Apr 12 04:06:13.809150 proxmox kernel: RAX: 0000000000000000 RBX: ffff9c128c8e0f38 RCX: 0000000000000000
Apr 12 04:06:13.809162 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Apr 12 04:06:13.809174 proxmox kernel: RBP: ffffb6bc15fd7d98 R08: 0000000000000000 R09: 0000000000000000
Apr 12 04:06:13.809185 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000c030
Apr 12 04:06:13.809197 proxmox kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffff9c12956f6340
Apr 12 04:06:13.809210 proxmox kernel: FS: 0000000000000000(0000) GS:ffff9c214e380000(0000) knlGS:0000000000000000
Apr 12 04:06:13.809221 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 12 04:06:13.809233 proxmox kernel: CR2: 00007f1c662e3838 CR3: 0000000748236000 CR4: 0000000000f50ef0
Apr 12 04:06:13.809244 proxmox kernel: PKRU: 55555554
Apr 12 04:06:13.809255 proxmox kernel: Call Trace:
Apr 12 04:06:13.809274 proxmox kernel: <TASK>
Apr 12 04:06:13.809286 proxmox kernel: ? show_regs+0x6d/0x80
Apr 12 04:06:13.809297 proxmox kernel: ? __warn+0x89/0x160
Apr 12 04:06:13.809307 proxmox kernel: ? igb_rd32+0x93/0xb0 [igb]
Apr 12 04:06:13.809318 proxmox kernel: ? report_bug+0x17e/0x1b0
Apr 12 04:06:13.809331 proxmox kernel: ? handle_bug+0x6e/0xb0
Apr 12 04:06:13.809340 proxmox kernel: ? exc_invalid_op+0x18/0x80
Apr 12 04:06:13.809349 proxmox kernel: ? asm_exc_invalid_op+0x1b/0x20
Apr 12 04:06:13.809361 proxmox kernel: ? igb_rd32+0x93/0xb0 [igb]
Apr 12 04:06:13.809368 proxmox kernel: ? igb_rd32+0x93/0xb0 [igb]
Apr 12 04:06:13.809376 proxmox kernel: igb_update_stats+0x89/0x830 [igb]
Apr 12 04:06:13.809385 proxmox kernel: igb_watchdog_task+0x134/0x8a0 [igb]
Apr 12 04:06:13.809394 proxmox kernel: ? psi_avgs_work+0x67/0xe0
Apr 12 04:06:13.809410 proxmox kernel: process_one_work+0x176/0x350
Apr 12 04:06:13.809420 proxmox kernel: worker_thread+0x306/0x440
Apr 12 04:06:13.809427 proxmox kernel: ? __pfx_worker_thread+0x10/0x10
Apr 12 04:06:13.809436 proxmox kernel: kthread+0xf2/0x120
Apr 12 04:06:13.809445 proxmox kernel: ? __pfx_kthread+0x10/0x10
Apr 12 04:06:13.809455 proxmox kernel: ret_from_fork+0x47/0x70
Apr 12 04:06:13.809464 proxmox kernel: ? __pfx_kthread+0x10/0x10
Apr 12 04:06:13.809471 proxmox kernel: ret_from_fork_asm+0x1b/0x30
Apr 12 04:06:13.809481 proxmox kernel: </TASK>
Apr 12 04:06:13.809496 proxmox kernel: ---[ end trace 0000000000000000 ]---
Apr 12 04:06:13.809509 proxmox kernel: igb 0000:04:00.0 enp4s0f0: malformed Tx packet detected and dropped, LVMMC:0xffffffff
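To check whether a machine hit the same failure, the kernel log can be filtered for the messages quoted in this post. The snippet below is a minimal sketch: the function name nic_crash_lines is made up for illustration, and the commented journalctl call assumes a systemd system with a persistent journal, so that the previous boot (-b -1) is still available.

```shell
#!/bin/sh
# Keep only the kernel log lines (read from stdin) that match the NIC
# failure messages quoted in this post. The function name is illustrative.
nic_crash_lines() {
    grep -E 'PCIe link lost|Failed to read reg|Detected Hardware Unit Hang'
}

# Example usage (assumes systemd-journald with a persistent journal):
#   journalctl -k -b -1 --no-pager | nic_crash_lines
```

If the function prints nothing, the previous boot did not log any of these three messages, which does not rule out other NIC problems.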
A quick Google search revealed numerous forum threads and a Proxmox bug report on Intel I219 controllers crashing with the same symptoms (i.e., becoming unreachable via the network) but with a different error message:
[97377.240263] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
TDH <22>
TDT <2f>
next_to_use <2f>
next_to_clean <21>
buffer_info[next_to_clean]:
time_stamp <101725292>
next_to_watch <22>
jiffies <1017253e0>
next_to_watch.status <0>
MAC Status <40080083>
PHY Status <796d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>
However, the I219 controller uses the e1000e driver. The I350 controller on my card uses the igb driver instead, as the log above shows, so the bug I’m experiencing might be a different one from the other reports of this issue.
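If you are not sure which driver your own NIC uses, one way to find out is to resolve the driver symlink that the kernel exposes under sysfs. This is a rough sketch rather than an official tool: nic_driver is a made-up helper name, and its optional second argument (an alternative sysfs root) exists only so the function can be tried against a fake directory tree.

```shell
#!/bin/sh
# Print the kernel driver bound to a network interface by resolving
# the sysfs "driver" symlink of its underlying device.
# $1: interface name, e.g. enp4s0f0
# $2: sysfs root (optional, defaults to /sys; only useful for testing)
nic_driver() {
    iface=$1
    sysfs=${2:-/sys}
    link="$sysfs/class/net/$iface/device/driver"
    if [ ! -e "$link" ]; then
        echo "no driver found for $iface" >&2
        return 1
    fi
    basename "$(readlink -f "$link")"
}

# Example usage:
#   nic_driver enp4s0f0
```

Going by the logs above, the I350 ports should report igb, and an I219 port should report e1000e. Running ethtool -i on the interface gives the same information along with the driver and firmware versions.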
Reverting To A Previous Kernel Version
The easiest fix is to revert to a previous kernel version that doesn’t have this regression, using the proxmox-boot-tool command:
# use the proxmox-boot-tool kernel list command to view the available kernels
> proxmox-boot-tool kernel pin 6.8.12-8-pve # add --next-boot to make this temporary
# verify that the setting took effect
> proxmox-boot-tool kernel list
Manually selected kernels:
None.
Automatically selected kernels:
6.8.12-10-pve
6.8.12-11-pve
6.8.12-8-pve
Pinned kernel:
6.8.12-8-pve
# reboot for this change to take effect
> reboot now
Reverting to 6.8.12-8-pve worked for me, and the NIC has been stable for the last 3 months.
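After the reboot, it is worth confirming that the machine actually booted the pinned kernel. A small sketch of that check follows. The helper name check_kernel is made up, and its optional second argument only exists so the comparison can be tried without rebooting.

```shell
#!/bin/sh
# Compare the running kernel release against the expected (pinned) one.
# $1: expected release, e.g. 6.8.12-8-pve
# $2: release to check (optional, defaults to the output of uname -r)
check_kernel() {
    expected=$1
    running=${2:-$(uname -r)}
    if [ "$running" = "$expected" ]; then
        echo "OK: running $running"
    else
        echo "MISMATCH: running $running, expected $expected"
        return 1
    fi
}

# Example usage:
#   check_kernel 6.8.12-8-pve
```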
Disable Offloading Features
If you need to use a newer kernel version and can’t revert to an older one, I’ve seen some reports that disabling offloading features prevents the crashes. See this GitHub Gist with instructions on how to create a service to disable various offloading features.
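For a quick test before setting up a service, offloads can be switched off by hand with ethtool -K. The sketch below is an assumption on my part, not a copy of the Gist: the set of features (TSO, GSO, GRO) is the one commonly suggested for e1000e hangs, the interface name is a placeholder, and the ETHTOOL variable is a hypothetical hook that lets you substitute echo for a dry run.

```shell
#!/bin/sh
# Turn off TCP segmentation offload, generic segmentation offload and
# generic receive offload on one interface.
# The feature list is an assumption; adjust it to match the reports
# that apply to your controller.
# $1: interface name, e.g. eno1
disable_offloads() {
    iface=$1
    if [ -z "$iface" ]; then
        echo "usage: disable_offloads IFACE" >&2
        return 2
    fi
    # ETHTOOL can be set to "echo" to print the arguments instead of applying them.
    "${ETHTOOL:-ethtool}" -K "$iface" tso off gso off gro off
}

# Example usage (needs root):
#   disable_offloads eno1
```

Settings changed with ethtool -K do not survive a reboot, which is why the Gist wraps the command in a service.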
This Affects Other Linux Distributions
Since the regression comes from Intel driver changes in the kernel, the bug with the e1000e driver also affects Ubuntu.
Conclusion
I thought my networking problems were behind me after I stopped using the Intel I225-V. Unfortunately, I am not that lucky. I did not expect reliable ethernet to be so difficult to achieve.
When I get the time, I intend to look at the diff between 6.8.12-8 and 6.8.12-9 of the Proxmox kernel to check if there were any changes to the igb driver that would explain the instability I experienced.