journalctl BTRFS csum corruption

May 25, 2025    #gdb   #os   #systemd   #btrfs  

Background

Shortly after installing Fedora (kernel 6.14.6-300.fc42.x86_64) on a freshly wiped disk (an old Windows NTFS games/extra disk), I noticed that running journalctl and then pressing <shift>-G crashed journalctl from systemd 257 (257.5-6.fc42):

➜  ~ coredumpctl info
           PID: 11701 (journalctl)
           UID: 1000 (fred)
           GID: 1000 (fred)
        Signal: 6 (ABRT)
     Timestamp: Sun 2025-05-25 10:19:33 CDT (40min ago)
  Command Line: journalctl -xe
    Executable: /usr/bin/journalctl
 Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-Alacritty-6480.scope
          Unit: user@1000.service
     User Unit: app-gnome-Alacritty-6480.scope
         Slice: user-1000.slice
     Owner UID: 1000 (fred)
       Boot ID: 81b2c199435745759afa8439492049e5
    Machine ID: 42148a0b9b9641b0ab3c18373236fe58
      Hostname: olympus
       Storage: /var/lib/systemd/coredump/core.journalctl.1000.81b2c199435745759afa8439492049e5.11701.1748186373000000.zst (present)
  Size on Disk: 183.3K
       Package: systemd/257.5-6.fc42
      build-id: 812a15616de1baca7c7c2942d40b5dbb73c6b905
       Message: Process 11701 (journalctl) of user 1000 dumped core.

                Module /usr/bin/journalctl from rpm systemd-257.5-6.fc42.x86_64
                Module libzstd.so.1 from rpm zstd-1.5.7-1.fc42.x86_64
                Module libcap-ng.so.0 from rpm libcap-ng-0.8.5-4.fc42.x86_64
                Module libpcre2-8.so.0 from rpm pcre2-10.45-1.fc42.x86_64
                Module libeconf.so.0 from rpm libeconf-0.7.6-1.fc42.x86_64
                Module libaudit.so.1 from rpm audit-4.0.3-2.fc42.x86_64
                Module libz.so.1 from rpm zlib-ng-2.2.4-3.fc42.x86_64
                Module libattr.so.1 from rpm attr-2.5.2-5.fc42.x86_64
                Module libselinux.so.1 from rpm libselinux-3.8-1.fc42.x86_64
                Module libseccomp.so.2 from rpm libseccomp-2.5.5-2.fc41.x86_64
                Module libpam.so.0 from rpm pam-1.7.0-5.fc42.x86_64
                Module libcrypto.so.3 from rpm openssl-3.2.4-3.fc42.x86_64
                Module libmount.so.1 from rpm util-linux-2.40.4-7.fc42.x86_64
                Module libcrypt.so.2 from rpm libxcrypt-4.4.38-7.fc42.x86_64
                Module libcap.so.2 from rpm libcap-2.73-2.fc42.x86_64
                Module libblkid.so.1 from rpm util-linux-2.40.4-7.fc42.x86_64
                Module libacl.so.1 from rpm acl-2.3.2-3.fc42.x86_64
                Module libsystemd-shared-257.5-6.fc42.so from rpm systemd-257.5-6.fc42.x86_64
                Stack trace of thread 11701:
                #0  0x00007f3ec948111c __pthread_kill_implementation (libc.so.6 + 0x7311c)
                #1  0x00007f3ec9427afe raise (libc.so.6 + 0x19afe)
                #2  0x00007f3ec940f6d0 abort (libc.so.6 + 0x16d0)
                #3  0x00007f3ec983b2cc mmap_cache_process_sigbus (libsystemd-shared-257.5-6.fc42.so + 0x23b2cc)
                #4  0x00007f3ec983b5bf mmap_cache_fd_free (libsystemd-shared-257.5-6.fc42.so + 0x23b5bf)
                #5  0x00007f3ec98274fc journal_file_close (libsystemd-shared-257.5-6.fc42.so + 0x2274fc)
                #6  0x00007f3ec9845150 sd_journal_close (libsystemd-shared-257.5-6.fc42.so + 0x245150)
                #7  0x000055db6237d769 run (/usr/bin/journalctl + 0x7769)
                #8  0x000055db62378145 main (/usr/bin/journalctl + 0x2145)
                #9  0x00007f3ec94115f5 __libc_start_call_main (libc.so.6 + 0x35f5)
                #10 0x00007f3ec94116a8 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x36a8)
                #11 0x000055db623783c5 _start (/usr/bin/journalctl + 0x23c5)
                ELF object binary architecture: AMD x86-64

Looking at sudo dmesg (the kernel log), I saw several BTRFS checksum failures:

➜  ~ sudo dmesg
...
[ 2614.020827] BTRFS warning (device sda3): csum failed root 256 ino 386149 off 0 csum 0x8941f998 expected csum 0x33a303fe mirror 1
[ 2614.020828] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 414416, gen 0
[ 2614.020828] BTRFS warning (device sda3): csum failed root 256 ino 386149 off 4096 csum 0x8941f998 expected csum 0x3e2e9220 mirror 1
[ 2614.020829] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 414417, gen 0

I am working on a few systemd problems with upstream on their latest v257 stable release, and thought there might be another systemd problem here. I know better than to report right away, so I needed to figure out whether there was a disk or filesystem problem first, and then look for solutions.

Tools

I searched the internet for clues on what to do. I found the following tools to help debug the issue:

  1. smartctl: a tool to diagnose disk HW problems
  2. btrfs scrub: a tool to find and, where possible, repair corruption
  3. btrfs check: a tool to check and repair a BTRFS filesystem (WARNING! AVOID USE without --readonly)
  4. btrfs inspect-internal: a tool to query information about parts of the disk/filesystem

Investigation

I started with a HW-specific check:

➜  ~ sudo smartctl -x /dev/sda3
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.14.6-300.fc42.x86_64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 250GB
Serial Number:    S3YHNX0K415898P
LU WWN Device Id: 5 002538 e40306b00
Firmware Version: RVT01B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.5/5706
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun May 25 11:11:34 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
     was never started.
     Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
     without error or no self-test has ever
     been run.
Total time to complete Offline
data collection:   (    0) seconds.
Offline data collection
capabilities:     (0x53) SMART execute Offline immediate.
     Auto Offline data collection on/off support.
     Suspend Offline collection upon new
     command.
     No Offline surface scan supported.
     Self-test supported.
     No Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
     General Purpose Logging supported.
Short self-test routine
recommended polling time:   (   2) minutes.
Extended self-test routine
recommended polling time:   (  85) minutes.
SCT capabilities:         (0x003d) SCT Status supported.
     SCT Error Recovery Control supported.
     SCT Feature Control supported.
     SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    4684
 12 Power_Cycle_Count       -O--CK   099   099   000    -    959
177 Wear_Leveling_Count     PO--C-   099   099   000    -    6
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   076   058   000    -    24
195 ECC_Error_Rate          -O-RC-   200   200   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
235 POR_Recovery_Count      -O--C-   099   099   000    -    42
241 Total_LBAs_Written      -O--CK   099   099   000    -    2043498859
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1           SL  VS      16  Device vendor specific log
0xa5           SL  VS      16  Device vendor specific log
0xce-0xcf      SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       256 (0x0100)
Device State:                        Active (0)
Current Temperature:                    24 Celsius
Power Cycle Min/Max Temperature:     21/40 Celsius
Lifetime    Min/Max Temperature:     17/42 Celsius
Specified Max Operating Temperature:    55 Celsius
Under/Over Temperature Limit Count:   0/0
SMART Status:                        0xc24f (PASSED)

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        10 minutes
Min/Max recommended Temperature:      0/70 Celsius
Min/Max Temperature Limit:            0/70 Celsius
Temperature History Size (Index):    128 (49)

Index    Estimated Time   Temperature Celsius
  50    2025-05-24 14:00    27  ********
[ snip ]
  48    2025-05-25 11:00    23  ****
  49    2025-05-25 11:10    24  *****

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             959  ---  Lifetime Power-On Resets
0x01  0x010  4            4684  ---  Power-on Hours
0x01  0x018  6      2043498859  ---  Logical Sectors Written
0x01  0x020  6        12091242  ---  Number of Write Commands
0x01  0x028  6      1406882279  ---  Logical Sectors Read
0x01  0x030  6        16026334  ---  Number of Read Commands
0x01  0x038  6         2135000  ---  Date and Time TimeStamp
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              24  ---  Current Temperature
0x05  0x020  1              42  ---  Highest Temperature
0x05  0x028  1              17  ---  Lowest Temperature
0x05  0x058  1              55  ---  Specified Maximum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            4191  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2         1994  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

There is a lot here! But the gist is:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2         1994  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
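
For a quicker pass/fail read of the same report, the attribute rows that usually signal trouble can be filtered for non-zero raw values. This is a minimal sketch, assuming the 8-column attribute table that smartctl -x prints; the helper name smart_nonzero_errors and the choice of watched attributes are assumptions made for illustration, not a smartmontools convention:

```shell
# Print "NAME RAW_VALUE" for watched SMART attributes whose raw value is
# non-zero. Reads `smartctl -x` text on stdin; no output means all clear.
# Assumption: the attribute table has 8 columns, with RAW_VALUE last.
# The watch list below is an illustrative choice, adjust it per drive.
smart_nonzero_errors() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Runtime_Bad_Block|Uncorrectable_Error_Cnt|CRC_Error_Count)$/ &&
        NF >= 8 && $8 + 0 != 0 { print $2, $8 }
    '
}
```

It would be used as sudo smartctl -x /dev/sda | smart_nonzero_errors. It only looks at the attribute table, so the overall health line and the Phy event counters still need a glance.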

All of this tells me the SSD is likely fine, which gave me a sigh of relief. Next I checked whether btrfs could repair itself:

➜  ~ sudo btrfs scrub start /
➜  ~ sudo dmesg -w
...
[ 3136.872126] BTRFS info (device sda3): scrub: started on devid 1
[ 3159.503387] BTRFS error (device sda3): unable to fixup (regular) error at logical 12463439872 on dev /dev/sda3 physical 13545570304
[ 3159.503415] BTRFS error (device sda3): unable to fixup (regular) error at logical 12463439872 on dev /dev/sda3 physical 13545570304
[ 3168.222718] BTRFS info (device sda3): scrub: finished on devid 1 with status: 0

Still an error, and an unfixable one at that… So next I ran:

WARNING: DO NOT RUN THE FOLLOWING WITHOUT --readonly ON A MOUNTED DEVICE!

sudo btrfs check --force --readonly --check-data-csum /dev/sda3

This gave me more of the same information I already had. At this point it occurred to me that a critical clue had been there all along:

... csum failed root 256 ino 386149 off 0 ...
~~~~~~~~~~~~~~~~~~~~~~~~~^
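
One way to check how many distinct inodes are involved is to tally the csum failed lines per root and inode. This is a minimal sketch, assuming the kernel message format shown in the dmesg output earlier; csum_failures_by_inode is a hypothetical helper name:

```shell
# Tally BTRFS "csum failed" kernel messages per (root, inode).
# Reads dmesg-style text on stdin and prints "COUNT ROOT INODE",
# most frequent first.
# Assumption: messages contain "csum failed root <N> ino <N>".
csum_failures_by_inode() {
    grep -o 'csum failed root [0-9]* ino [0-9]*' \
        | awk '{ print $4, $6 }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2, $3 }'
}
```

It would be used as sudo dmesg | csum_failures_by_inode. A single output row means every failure points at the same file.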

These messages were all for a single inode! Performing a lookup on that, I found the culprit:

➜  ~ sudo btrfs inspect-internal inode-resolve 386149 /
//var/lib/systemd/catalog/database
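
The scrub errors offer a second route to a file name: they report a logical address, and btrfs inspect-internal logical-resolve takes a logical address plus a mount path. I did not use this during the investigation, so treat the following as a sketch, assuming the scrub message format shown earlier; scrub_error_addresses is a hypothetical helper name:

```shell
# Extract the unique logical addresses from BTRFS scrub error messages.
# Reads dmesg-style text on stdin and prints one address per line.
# Assumption: messages contain "error at logical <N>".
scrub_error_addresses() {
    grep -o 'error at logical [0-9]*' | awk '{ print $4 }' | sort -un
}
```

Each address printed by sudo dmesg | scrub_error_addresses could then be passed to sudo btrfs inspect-internal logical-resolve <address> / to list the files that reference it.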

Solution

This confirms that the problem is very much localized to systemd! The next question was how to fix it. My first thought was to delete the file and hope systemd would recreate it, but I wasn’t sure it would. From searching, I knew this file was directly tied to the systemd message catalog, and I found a journalctl option for it in man 1 journalctl:

       --update-catalog
           Update the message catalog index. This command needs to be
           executed each time new catalog files are installed, removed, or
           updated to rebuild the binary catalog index.

           Added in version 196.

I thought, why not… I ran sudo journalctl --update-catalog and now the problem is fixed!
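
A simple way to double-check the fix, beyond journalctl no longer crashing, is to read the rebuilt file end to end: btrfs verifies data checksums on read and fails the read with an I/O error when a block does not match. This is a minimal sketch; reads_cleanly is a hypothetical helper name:

```shell
# Succeed only if the whole file can be read without an error.
# cat exits non-zero when any read fails (for example with EIO),
# so the exit status doubles as a checksum check on btrfs.
reads_cleanly() {
    cat -- "$1" > /dev/null 2>&1
}
```

Running sudo sh with this function defined and calling reads_cleanly /var/lib/systemd/catalog/database, followed by a look at sudo dmesg for new csum failed lines, should show whether the corruption is gone.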

It turns out the problem was a me problem all along. Had that not resolved it, I was going to delete the file manually, continue the investigation, and attempt to reproduce the problem with systemd on a btrfs filesystem.

To be fair, I installed Fedora fresh just last week, and I had some issues with the install process. The corruption could’ve happened at that time. I still don’t know what the actual trigger was. I can only speculate that something interrupted btrfs while the file was first being created.

Lesson learned

At first glance, the main lesson here is to study these error messages in more detail. The same inode showed up repeatedly in the logs, and that provided a good starting point.

However, inode numbers are not stable. If this file had been removed and recreated by the time I checked, the inode number would no longer have been valid, and I’d have been back at square one. Secondly, this could’ve been a much bigger problem, ranging from a bug in the filesystem code all the way to a HW failure. It was good that I still went through the steps to rule out those possibilities.

All that aside, I’m just happy my disk is fine and I can read output from journalctl again to debug other things 😂