The Metaslab Corruption Bug In OpenZFS
I migrated my NAS to ZFS because of its excellent reputation for reliability and data integrity. However, I’ve encountered a serious metaslab corruption bug that, while not causing data loss, renders the pool unusable for writes. The only way to fix the corruption is to recreate the pool from backup.
This has been a known issue for years. The lack of progress in addressing it raises serious concerns about OpenZFS’s stability in production environments.
Summary
- In ZFS, metaslabs keep track of free space on disk where data can be written.
- A bug is causing metaslabs to become corrupted.
- When deleting a file or snapshot affected by the corruption, ZFS triggers a kernel panic. The pool can then only be imported in read-only mode.
- Scrubs do not fix the corruption.
- The only way to fix the pool is to recreate it from backup.
- Edit: in a previous version of this post, I suggested running zdb -y on pools to detect metaslab corruption. However, I’ve since learned that zdb should not be used to detect corruption, even if it crashes. I’m sorry for any undue alarm that this may have caused.
Problem Description
The issue manifests when deleting snapshots or files: the deletion triggers a kernel panic and leaves the pool unusable for writes.
I first encountered this bug in Dec 2024 while destroying snapshots on a pool of mirrored Seagate 20TB enterprise hard drives. That pool stores media files and backups of my Proxmox virtual machine via Proxmox Backup Server.
ZFS panicked with the following error:
[ 591.982595] PANIC: zfs: adding existent segment to range tree (offset=1265b374000 size=7a000)
[ 591.982604] Showing stack for process 1211
[ 591.982608] CPU: 13 PID: 1211 Comm: txg_sync Tainted: P O 6.8.12-5-pve #1
[ 591.982614] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING WIFI II, BIOS 3607 03/22/2024
[ 591.982618] Call Trace:
[ 591.982622] <TASK>
[ 591.982626] dump_stack_lvl+0x76/0xa0
[ 591.982632] dump_stack+0x10/0x20
[ 591.982637] vcmn_err+0xdb/0x130 [spl]
[ 591.982649] zfs_panic_recover+0x75/0xa0 [zfs]
[ 591.982749] range_tree_add_impl+0x27f/0x11c0 [zfs]
[ 591.982845] range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[ 591.982932] ? dmu_buf_rele+0x3b/0x50 [zfs]
[ 591.983022] range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[ 591.983113] metaslab_sync+0x27f/0x950 [zfs]
[ 591.983203] ? ktime_get_raw_ts64+0x41/0xd0
[ 591.983210] ? mutex_lock+0x12/0x50
[ 591.983215] vdev_sync+0x73/0x4d0 [zfs]
[ 591.983303] ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[ 591.983392] ? srso_alias_return_thunk+0x5/0xfbef5
[ 591.983397] ? mutex_lock+0x12/0x50
[ 591.983402] spa_sync+0x62e/0x1050 [zfs]
[ 591.983491] ? srso_alias_return_thunk+0x5/0xfbef5
[ 591.983495] ? spa_txg_history_init_io+0x120/0x130 [zfs]
[ 591.983583] txg_sync_thread+0x207/0x3a0 [zfs]
[ 591.983668] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 591.983748] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 591.983757] thread_generic_wrapper+0x5f/0x70 [spl]
[ 591.983764] kthread+0xf2/0x120
[ 591.983769] ? __pfx_kthread+0x10/0x10
[ 591.983774] ret_from_fork+0x47/0x70
[ 591.983779] ? __pfx_kthread+0x10/0x10
[ 591.983783] ret_from_fork_asm+0x1b/0x30
[ 591.983790] </TASK>
[ 738.223428] INFO: task txg_sync:1211 blocked for more than 122 seconds.
[ 738.223444] Tainted: P O 6.8.12-5-pve #1
[ 738.223450] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 738.223455] task:txg_sync state:D stack:0 pid:1211 tgid:1211 ppid:2 flags:0x00004000
[ 738.223465] Call Trace:
[ 738.223470] <TASK>
[ 738.223479] __schedule+0x401/0x15e0
[ 738.223494] schedule+0x33/0x110
[ 738.223502] vcmn_err+0xe8/0x130 [spl]
[ 738.223524] zfs_panic_recover+0x75/0xa0 [zfs]
[ 738.223723] range_tree_add_impl+0x27f/0x11c0 [zfs]
[ 738.223868] range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[ 738.223976] ? dmu_buf_rele+0x3b/0x50 [zfs]
[ 738.224093] range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[ 738.224194] metaslab_sync+0x27f/0x950 [zfs]
[ 738.224284] ? ktime_get_raw_ts64+0x41/0xd0
[ 738.224293] ? mutex_lock+0x12/0x50
[ 738.224298] vdev_sync+0x73/0x4d0 [zfs]
[ 738.224386] ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[ 738.224477] ? srso_alias_return_thunk+0x5/0xfbef5
[ 738.224484] ? mutex_lock+0x12/0x50
[ 738.224488] spa_sync+0x62e/0x1050 [zfs]
[ 738.224578] ? srso_alias_return_thunk+0x5/0xfbef5
[ 738.224583] ? spa_txg_history_init_io+0x120/0x130 [zfs]
[ 738.224671] txg_sync_thread+0x207/0x3a0 [zfs]
[ 738.224757] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 738.224848] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 738.224857] thread_generic_wrapper+0x5f/0x70 [spl]
[ 738.224865] kthread+0xf2/0x120
[ 738.224873] ? __pfx_kthread+0x10/0x10
[ 738.224877] ret_from_fork+0x47/0x70
[ 738.224884] ? __pfx_kthread+0x10/0x10
[ 738.224887] ret_from_fork_asm+0x1b/0x30
[ 738.224895] </TASK>
[ 861.103822] INFO: task txg_sync:1211 blocked for more than 245 seconds.
[ 861.103838] Tainted: P O 6.8.12-5-pve #1
[ 861.103843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 861.103849] task:txg_sync state:D stack:0 pid:1211 tgid:1211 ppid:2 flags:0x00004000
[ 861.103858] Call Trace:
[ 861.103863] <TASK>
[ 861.103870] __schedule+0x401/0x15e0
[ 861.103885] schedule+0x33/0x110
[ 861.103892] vcmn_err+0xe8/0x130 [spl]
[ 861.103918] zfs_panic_recover+0x75/0xa0 [zfs]
[ 861.104157] range_tree_add_impl+0x27f/0x11c0 [zfs]
[ 861.104302] range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[ 861.104450] ? dmu_buf_rele+0x3b/0x50 [zfs]
[ 861.104540] range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[ 861.104638] metaslab_sync+0x27f/0x950 [zfs]
[ 861.104728] ? ktime_get_raw_ts64+0x41/0xd0
[ 861.104737] ? mutex_lock+0x12/0x50
[ 861.104742] vdev_sync+0x73/0x4d0 [zfs]
[ 861.104830] ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[ 861.104930] ? srso_alias_return_thunk+0x5/0xfbef5
[ 861.104938] ? mutex_lock+0x12/0x50
[ 861.104942] spa_sync+0x62e/0x1050 [zfs]
[ 861.105031] ? srso_alias_return_thunk+0x5/0xfbef5
[ 861.105036] ? spa_txg_history_init_io+0x120/0x130 [zfs]
[ 861.105124] txg_sync_thread+0x207/0x3a0 [zfs]
[ 861.105210] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 861.105291] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 861.105299] thread_generic_wrapper+0x5f/0x70 [spl]
[ 861.105307] kthread+0xf2/0x120
[ 861.105315] ? __pfx_kthread+0x10/0x10
[ 861.105320] ret_from_fork+0x47/0x70
[ 861.105326] ? __pfx_kthread+0x10/0x10
[ 861.105330] ret_from_fork_asm+0x1b/0x30
[ 861.105337] </TASK>
[ 983.984297] INFO: task txg_sync:1211 blocked for more than 368 seconds.
[ 983.984313] Tainted: P O 6.8.12-5-pve #1
[ 983.984319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 983.984324] task:txg_sync state:D stack:0 pid:1211 tgid:1211 ppid:2 flags:0x00004000
[ 983.984334] Call Trace:
[ 983.984339] <TASK>
[ 983.984345] __schedule+0x401/0x15e0
[ 983.984358] schedule+0x33/0x110
[ 983.984365] vcmn_err+0xe8/0x130 [spl]
[ 983.984382] zfs_panic_recover+0x75/0xa0 [zfs]
[ 983.984610] range_tree_add_impl+0x27f/0x11c0 [zfs]
[ 983.984812] range_tree_remove_xor_add_segment+0x543/0x5a0 [zfs]
[ 983.985060] ? dmu_buf_rele+0x3b/0x50 [zfs]
[ 983.985326] range_tree_remove_xor_add+0x10c/0x1f0 [zfs]
[ 983.985577] metaslab_sync+0x27f/0x950 [zfs]
[ 983.985778] ? ktime_get_raw_ts64+0x41/0xd0
[ 983.985790] ? mutex_lock+0x12/0x50
[ 983.985796] vdev_sync+0x73/0x4d0 [zfs]
[ 983.985887] ? spa_log_sm_set_blocklimit+0x17/0xc0 [zfs]
[ 983.985989] ? srso_alias_return_thunk+0x5/0xfbef5
[ 983.985997] ? mutex_lock+0x12/0x50
[ 983.986002] spa_sync+0x62e/0x1050 [zfs]
[ 983.986092] ? srso_alias_return_thunk+0x5/0xfbef5
[ 983.986096] ? spa_txg_history_init_io+0x120/0x130 [zfs]
[ 983.986184] txg_sync_thread+0x207/0x3a0 [zfs]
[ 983.986270] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 983.986351] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 983.986359] thread_generic_wrapper+0x5f/0x70 [spl]
[ 983.986367] kthread+0xf2/0x120
[ 983.986375] ? __pfx_kthread+0x10/0x10
[ 983.986379] ret_from_fork+0x47/0x70
[ 983.986386] ? __pfx_kthread+0x10/0x10
[ 983.986390] ret_from_fork_asm+0x1b/0x30
[ 983.986398] </TASK>
In ZFS, metaslabs keep track of the free space on disk where new data can be written. The metaslabs on this pool had become corrupted. Based on the above error, ZFS tried to add a segment to a range tree that already contained it, meaning the same region of disk was being accounted for twice, which should never happen. The system then hung while trying to handle this error.
I had to hard-reset the system to reboot it. On boot, the same panic occurred because ZFS retries writing the pending transaction group (i.e. the last failed write) to disk.
Attempted Recovery Methods
I was able to access the data on the pool by importing the pool in read-only mode.
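For anyone hitting the same panic, a read-only import looks roughly like this; a minimal sketch assuming the pool is named tank (as mine is) and is not currently imported:
> zpool import -o readonly=on tank
> zpool status tank
Because a read-only import doesn’t sync out the pending transaction group, it avoids the panic and lets the data be copied off.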
ZFS doesn’t have an fsck utility for repairing filesystem corruption, and the usual advice in this situation is to run a scrub on the pool. The scrub completed successfully and reported no errors, but it neither detected nor fixed the metaslab corruption.
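For reference, starting a scrub and checking its result looks like this (again assuming the pool is named tank):
> zpool scrub tank
> zpool status -v tank
A scrub verifies the checksums of all allocated blocks, so it can’t catch an inconsistency in the free-space accounting itself.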
Edit: I Was Wrong About Using ZDB To Detect Corruption
Since scrubs don’t detect the issue, I tried using zdb instead. I came across the zdb -y command, which seemed relevant to my situation.
Indeed, running it on the corrupted pool shows that something is wrong.
> zdb -y tank
Verifying deleted livelist entries
Verifying metaslab entries
verifying concrete vdev 0, metaslab 11 of 1164 ...ASSERT at cmd/zdb/zdb.c:482:verify_livelist_allocs()
((size) >> (9)) - (0) < 1ULL << (24) (0x10459f8 < 0x1000000)
PID: 46353 COMM: zdb
TID: 46353 NAME: zdb
Call trace:
/lib/x86_64-linux-gnu/libzpool.so.5(libspl_assertf+0x157) [0x79be2fc38627]
zdb(+0xe5d0) [0x61b154a645d0]
/lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f) [0x79be2fabe24f]
zdb(+0x13a7f) [0x61b154a69a7f]
zdb(+0x212ae) [0x61b154a772ae]
zdb(+0xafd1) [0x61b154a60fd1]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x79be2f2e224a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x79be2f2e2305]
zdb(+0xc7f1) [0x61b154a627f1]
zdb(+0x13db3)[0x61b154a69db3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x79be2f2f7050]
/lib/x86_64-linux-gnu/libc.so.6(+0x8aebc)[0x79be2f345ebc]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x79be2f2f6fb2]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x79be2f2e1472]
/lib/x86_64-linux-gnu/libzpool.so.5(+0x57a97)[0x79be2f993a97]
zdb(+0xe5d0)[0x61b154a645d0]
/lib/x86_64-linux-gnu/libzpool.so.5(space_map_iterate+0x32f)[0x79be2fabe24f]
zdb(+0x13a7f)[0x61b154a69a7f]
zdb(+0x212ae)[0x61b154a772ae]
zdb(+0xafd1)[0x61b154a60fd1]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x79be2f2e224a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x79be2f2e2305]
zdb(+0xc7f1)[0x61b154a627f1]
Aborted
Since running zdb -y on other, non-corrupted pools succeeded and doing so on the corrupted pool returned an error, I mistakenly assumed that it would be an accurate indicator of corruption and recommended using this command to detect it.
However, many other users have seen the same assert fail when running zdb -y on healthy pools that aren’t experiencing any issues. I’ve since learned that zdb should not be used to detect corruption, even if it crashes. I’m sorry for any undue alarm that this may have caused.
Recreating The Pool
Unfortunately, the only way to fix my corrupted pool was to destroy and recreate it from backup. This is a great example of why ZFS RAID is not a backup, since it won’t help if the pool itself becomes corrupted.
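At a high level, the rebuild amounts to getting rid of the old pool, creating a new one on the same disks, and restoring the data. This is only a rough sketch with placeholder device paths; since the corrupted pool could only be imported read-only, one approach is to export it and force-create a new pool over the same disks:
> zpool export tank
> zpool create -f tank mirror /dev/disk/by-id/<disk1> /dev/disk/by-id/<disk2>
The contents are then restored from backup (zfs send/receive, rsync, or whatever matches your backup tooling); that step depends entirely on how the backups were made.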
Jan 2025 Update
After recreating the pool, I ran the zdb -y command daily in an attempt to get an early warning of metaslab corruption.
In Jan 2025, running zdb -y tank aborted with an error similar to the one above.
Edit: I previously assumed this was an accurate indicator of corruption, which is not the case.
The pool is still functional for now. I’ll update this post if ZFS crashes again while deleting files or snapshots.
Is OpenZFS Stable For Production Use?
OpenZFS is a massively complex piece of software, so bugs are inevitable. However, it is concerning that this issue has gone unaddressed for years despite reports from many users¹. Similarly, an issue requesting a recovery tool to fix metaslab corruption has been open since 2022.
During my research to understand this issue, I also found a disturbing trend of features being released without any indication of whether they are production-ready. For example, ZFS encryption is not ready for production use, but there is no indication of this in the OpenZFS documentation. An issue proposing to warn users against using encryption in production remains open almost a year after being raised.
Conclusion
I previously concluded this post by mentioning my plans to migrate away from OpenZFS, since I thought the pool had become corrupted a second time in two months.
However, I was mistaken about zdb -y being an accurate indicator of corruption, so I’ll be keeping a close eye on that pool.
Regardless, this bug is still something that I’m worried about, since recovery involves destroying and recreating the pool.
Footnotes
1. See Importing corrupted pool causes PANIC: zfs: adding existent segment to range tree and PANIC: zfs: adding existent segment to range tree. A quick Google search also shows many forum posts about this issue.