Background
Shortly after installing Fedora 42 (kernel 6.14.6-300.fc42.x86_64) on a freshly
wiped disk (formerly an NTFS games/extra disk from Windows), I noticed that running
journalctl and then pressing Shift+G would crash journalctl with systemd 257
(257.5-6.fc42):
➜ ~ coredumpctl info
           PID: 11701 (journalctl)
           UID: 1000 (fred)
           GID: 1000 (fred)
        Signal: 6 (ABRT)
     Timestamp: Sun 2025-05-25 10:19:33 CDT (40min ago)
  Command Line: journalctl -xe
    Executable: /usr/bin/journalctl
 Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-Alacritty-6480.scope
          Unit: user@1000.service
     User Unit: app-gnome-Alacritty-6480.scope
         Slice: user-1000.slice
     Owner UID: 1000 (fred)
       Boot ID: 81b2c199435745759afa8439492049e5
    Machine ID: 42148a0b9b9641b0ab3c18373236fe58
      Hostname: olympus
       Storage: /var/lib/systemd/coredump/core.journalctl.1000.81b2c199435745759afa8439492049e5.11701.1748186373000000.zst (present)
  Size on Disk: 183.3K
       Package: systemd/257.5-6.fc42
      build-id: 812a15616de1baca7c7c2942d40b5dbb73c6b905
       Message: Process 11701 (journalctl) of user 1000 dumped core.

                Module /usr/bin/journalctl from rpm systemd-257.5-6.fc42.x86_64
                Module libzstd.so.1 from rpm zstd-1.5.7-1.fc42.x86_64
                Module libcap-ng.so.0 from rpm libcap-ng-0.8.5-4.fc42.x86_64
                Module libpcre2-8.so.0 from rpm pcre2-10.45-1.fc42.x86_64
                Module libeconf.so.0 from rpm libeconf-0.7.6-1.fc42.x86_64
                Module libaudit.so.1 from rpm audit-4.0.3-2.fc42.x86_64
                Module libz.so.1 from rpm zlib-ng-2.2.4-3.fc42.x86_64
                Module libattr.so.1 from rpm attr-2.5.2-5.fc42.x86_64
                Module libselinux.so.1 from rpm libselinux-3.8-1.fc42.x86_64
                Module libseccomp.so.2 from rpm libseccomp-2.5.5-2.fc41.x86_64
                Module libpam.so.0 from rpm pam-1.7.0-5.fc42.x86_64
                Module libcrypto.so.3 from rpm openssl-3.2.4-3.fc42.x86_64
                Module libmount.so.1 from rpm util-linux-2.40.4-7.fc42.x86_64
                Module libcrypt.so.2 from rpm libxcrypt-4.4.38-7.fc42.x86_64
                Module libcap.so.2 from rpm libcap-2.73-2.fc42.x86_64
                Module libblkid.so.1 from rpm util-linux-2.40.4-7.fc42.x86_64
                Module libacl.so.1 from rpm acl-2.3.2-3.fc42.x86_64
                Module libsystemd-shared-257.5-6.fc42.so from rpm systemd-257.5-6.fc42.x86_64
                Stack trace of thread 11701:
                #0  0x00007f3ec948111c __pthread_kill_implementation (libc.so.6 + 0x7311c)
                #1  0x00007f3ec9427afe raise (libc.so.6 + 0x19afe)
                #2  0x00007f3ec940f6d0 abort (libc.so.6 + 0x16d0)
                #3  0x00007f3ec983b2cc mmap_cache_process_sigbus (libsystemd-shared-257.5-6.fc42.so + 0x23b2cc)
                #4  0x00007f3ec983b5bf mmap_cache_fd_free (libsystemd-shared-257.5-6.fc42.so + 0x23b5bf)
                #5  0x00007f3ec98274fc journal_file_close (libsystemd-shared-257.5-6.fc42.so + 0x2274fc)
                #6  0x00007f3ec9845150 sd_journal_close (libsystemd-shared-257.5-6.fc42.so + 0x245150)
                #7  0x000055db6237d769 run (/usr/bin/journalctl + 0x7769)
                #8  0x000055db62378145 main (/usr/bin/journalctl + 0x2145)
                #9  0x00007f3ec94115f5 __libc_start_call_main (libc.so.6 + 0x35f5)
                #10 0x00007f3ec94116a8 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x36a8)
                #11 0x000055db623783c5 _start (/usr/bin/journalctl + 0x23c5)
                ELF object binary architecture: AMD x86-64
Looking at the kernel logs with sudo dmesg, I saw several of these:
➜ ~ sudo dmesg
...
[ 2614.020827] BTRFS warning (device sda3): csum failed root 256 ino 386149 off 0 csum 0x8941f998 expected csum 0x33a303fe mirror 1
[ 2614.020828] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 414416, gen 0
[ 2614.020828] BTRFS warning (device sda3): csum failed root 256 ino 386149 off 4096 csum 0x8941f998 expected csum 0x3e2e9220 mirror 1
[ 2614.020829] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 414417, gen 0
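The errs: part of these messages lists the per-device error counters btrfs keeps (the same counters sudo btrfs device stats / prints). Here rd, the count of read I/O errors reported by the device, stays at 0 while corrupt, the count of checksum mismatches, keeps climbing: the drive handed back data without complaint, but that data did not match its stored checksum. A minimal sketch that pulls the non-zero counters out of the newest such line, assuming the message format shown above (btrfs_errs_nonzero is a hypothetical helper name, not part of btrfs-progs):

```shell
# Hypothetical helper: print the non-zero btrfs per-device error counters
# found in kernel log text on stdin.
# Assumption: lines look like
#   "... bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 414417, gen 0"
btrfs_errs_nonzero() {
    # The counters are cumulative, so the last "errs:" line is the newest.
    # sed keeps only "wr 0, rd 0, flush 0, corrupt N, gen 0", tr splits it
    # into one "name value" pair per line, and awk prints the non-zero pairs.
    grep 'errs: wr' | tail -n 1 | sed 's/.*errs: //' | tr ',' '\n' |
        awk '$2 != 0 { print $1 "=" $2 }'
}
```

For the excerpt above, sudo dmesg | btrfs_errs_nonzero would print corrupt=414417.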
I am working on a few systemd problems with upstream on their latest v257 stable release, and thought there might be another systemd problem here. I know better than to report right away, though: first I needed to figure out whether there was a disk or filesystem problem, and then look for solutions.
Tools
I searched online for clues on what to do and found the following tools to help debug the issue:
smartctl
  A tool to diagnose disk hardware problems
btrfs scrub
  A tool to help find and repair corruption (if able)
btrfs check
  A tool to check (and, with --repair, repair) a BTRFS filesystem (WARNING! AVOID --repair)
btrfs inspect-internal
  A tool to look up information about parts of the disk/filesystem
Investigation
Starting with a hardware-specific check:
➜ ~ sudo smartctl -x /dev/sda3
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.14.6-300.fc42.x86_64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 860 EVO 250GB
Serial Number: S3YHNX0K415898P
LU WWN Device Id: 5 002538 e40306b00
Firmware Version: RVT01B6Q
User Capacity: 250,059,350,016 bytes [250 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.5/5706
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun May 25 11:11:34 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
        was never started.
        Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
        without error or no self-test has ever
        been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
        Auto Offline data collection on/off support.
        Suspend Offline collection upon new
        command.
        No Offline surface scan supported.
        Self-test supported.
        No Conveyance Self-test supported.
        Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
        power-saving mode.
        Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
        General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 85) minutes.
SCT capabilities: (0x003d) SCT Status supported.
        SCT Error Recovery Control supported.
        SCT Feature Control supported.
        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    4684
 12 Power_Cycle_Count       -O--CK   099   099   000    -    959
177 Wear_Leveling_Count     PO--C-   099   099   000    -    6
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Uncorrectable_Error_Cnt -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   076   058   000    -    24
195 ECC_Error_Rate          -O-RC-   200   200   000    -    0
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
235 POR_Recovery_Count      -O--C-   099   099   000    -    42
241 Total_LBAs_Written      -O--CK   099   099   000    -    2043498859
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      1  Comprehensive SMART error log
0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1           SL  VS      16  Device vendor specific log
0xa5           SL  VS      16  Device vendor specific log
0xce-0xcf      SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version: 3
SCT Version (vendor specific): 256 (0x0100)
Device State: Active (0)
Current Temperature: 24 Celsius
Power Cycle Min/Max Temperature: 21/40 Celsius
Lifetime Min/Max Temperature: 17/42 Celsius
Specified Max Operating Temperature: 55 Celsius
Under/Over Temperature Limit Count: 0/0
SMART Status: 0xc24f (PASSED)

SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 10 minutes
Min/Max recommended Temperature: 0/70 Celsius
Min/Max Temperature Limit: 0/70 Celsius
Temperature History Size (Index): 128 (49)

Index    Estimated Time   Temperature Celsius
  50    2025-05-24 14:00    27  ********
[ snip ]
  48    2025-05-25 11:00    23  ****
  49    2025-05-25 11:10    24  *****

SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4             959  ---  Lifetime Power-On Resets
0x01  0x010  4            4684  ---  Power-on Hours
0x01  0x018  6      2043498859  ---  Logical Sectors Written
0x01  0x020  6        12091242  ---  Number of Write Commands
0x01  0x028  6      1406882279  ---  Logical Sectors Read
0x01  0x030  6        16026334  ---  Number of Read Commands
0x01  0x038  6         2135000  ---  Date and Time TimeStamp
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              24  ---  Current Temperature
0x05  0x020  1              42  ---  Highest Temperature
0x05  0x028  1              17  ---  Lowest Temperature
0x05  0x058  1              55  ---  Specified Maximum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4            4191  ---  Number of Hardware Resets
0x06  0x010  4               0  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  N--  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2         1994  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
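Most of this report is informational. For a quick "is the drive failing?" pass, the attributes worth pulling out are the ones whose raw value should stay at 0 on a healthy drive. A minimal sketch, assuming the brief attribute-table layout that smartctl -x prints above (ID, name, flags, value, worst, threshold, fail, raw value) and the attribute names this Samsung drive reports; other vendors name and number their attributes differently, and smart_watch_nonzero is a hypothetical helper name:

```shell
# Hypothetical helper: read smartctl -x output on stdin and report watched
# attributes whose raw value is not 0. Exit status is 1 if any are found.
# Assumptions: brief table layout
#   "ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE"
# and Samsung-style attribute names (other vendors differ).
smart_watch_nonzero() {
    awk '
        BEGIN { bad = 0 }
        $2 ~ /^(Reallocated_Sector_Ct|Used_Rsvd_Blk_Cnt_Tot|Program_Fail_Cnt_Total|Erase_Fail_Count_Total|Runtime_Bad_Block|Uncorrectable_Error_Cnt|CRC_Error_Count)$/ {
            # In the brief layout, field 8 is RAW_VALUE.
            if ($8 != 0) { print $2 "=" $8; bad = 1 }
        }
        END { exit bad }
    '
}
```

Usage would be along the lines of sudo smartctl -x /dev/sda | smart_watch_nonzero && echo "watched counters are all zero". The report also shows that no self-test has ever been logged, so sudo smartctl -t long /dev/sda (about 85 minutes on this drive, per its recommended polling time) would be a further, more thorough check.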
There is a lot here! But the gist is:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2         1994  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            2  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
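The overall health self-assessment PASSED, and the counters that would point at a failing drive or a bad cable (Reallocated_Sector_Ct, Uncorrectable_Error_Cnt, CRC_Error_Count, and the ICRC/R_ERR counters) are all zero, which tells me the SSD is likely fine. This gave me a sigh of relief. Next, I checked whether btrfs could repair itself: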
➜ ~ sudo btrfs scrub start /
➜ ~ sudo dmesg -w
...
[ 3136.872126] BTRFS info (device sda3): scrub: started on devid 1
[ 3159.503387] BTRFS error (device sda3): unable to fixup (regular) error at logical 12463439872 on dev /dev/sda3 physical 13545570304
[ 3159.503415] BTRFS error (device sda3): unable to fixup (regular) error at logical 12463439872 on dev /dev/sda3 physical 13545570304
[ 3168.222718] BTRFS info (device sda3): scrub: finished on devid 1 with status: 0
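Still an error, and one scrub could not fix. (On a single disk, data usually has only one copy, so scrub has no good copy to repair from.) So next I ran: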
WARNING: DO NOT RUN THE FOLLOWING ON A MOUNTED DEVICE WITHOUT --readonly! (--force is what allows btrfs check to run on a mounted filesystem in the first place.)
sudo btrfs check --force --readonly --check-data-csum /dev/sda3
I got more of the same information that I already knew. At this point it occurred to me that there had been a critical clue all along:
... csum failed root 256 ino 386149 off 0 ...
~~~~~~~~~~~~~~~~~~~~~~~~~^
These messages were all for a single inode! Looking up that inode number, I found the culprit:
➜ ~ sudo btrfs inspect-internal inode-resolve 386149 /
//var/lib/systemd/catalog/database
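With thousands of these messages in the log, it helps to confirm that they really do all point at one file before blaming it. A minimal sketch that tallies the csum failed warnings by subvolume (root) and inode number, assuming the message format shown earlier (csum_fail_inodes is a hypothetical helper name):

```shell
# Hypothetical helper: count btrfs "csum failed" warnings per (root, inode)
# from kernel log text on stdin, most frequent first.
# Assumption: lines contain "csum failed root <R> ino <I> ...".
csum_fail_inodes() {
    # sed -n ... p prints only lines where the substitution matched,
    # reduced to "<root> <inode>"; uniq -c then counts identical pairs.
    sed -n 's/.*csum failed root \([0-9]*\) ino \([0-9]*\).*/\1 \2/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1 " root=" $2 " ino=" $3 }'
}
```

For example, sudo dmesg | csum_fail_inodes. Inode numbers are only unique within a subvolume, which is why the root value matters: the path given to btrfs inspect-internal inode-resolve has to be inside that same subvolume (here, the one mounted at /, since resolving against / worked).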
Solution
This confirms that the problem is localized to a single systemd file! The next question was how to fix it. My first thought was to delete the file and hope systemd would recreate it, but I wasn't sure. Searching told me this file is directly tied to the systemd message catalog, and man 1 journalctl documents a command for it:
       --update-catalog
           Update the message catalog index. This command needs to be
           executed each time new catalog files are installed, removed, or
           updated to rebuild the binary catalog index.

           Added in version 196.
I thought, why not… I ran sudo journalctl --update-catalog, and the problem was fixed!
It turns out the problem was a "me problem" all along. If that hadn't resolved it, I would have deleted the file manually, continued the investigation, and tried to reproduce the problem with systemd on a btrfs filesystem.
To be fair, I did a fresh Fedora install just last week, and I had some issues with the install process. The corruption could have happened then. I still don't know what the actual trigger was; I can only speculate that something interrupted btrfs while the file was first being written.
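Whatever the trigger was, one way to confirm that the rebuilt index is healthy, and not just that journalctl stopped crashing, is to read the whole file back. On btrfs, reading a block whose checksum does not match (with no good copy to fall back on) fails with an I/O error instead of returning bad data, so a full read is a cheap end-to-end check. A minimal sketch (verify_readable is a hypothetical helper name):

```shell
# Hypothetical helper: read every byte of each file given as an argument.
# On btrfs, a data checksum failure with no good copy makes the read fail
# with EIO, so cat exits non-zero. Returns 1 if any file could not be read.
verify_readable() {
    rc=0
    for f in "$@"; do
        if cat -- "$f" > /dev/null 2>&1; then
            echo "OK   $f"
        else
            echo "FAIL $f"
            rc=1
        fi
    done
    return "$rc"
}
```

Running verify_readable /var/lib/systemd/catalog/database probably does not even need root, since journalctl running as a normal user already reads that file. Re-running sudo btrfs scrub start / and checking that no new unable to fixup messages appear is the filesystem-wide counterpart.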
Lesson learned
At first glance, the main lesson here is to study error messages in more detail. The same inode kept showing up in the logs, which provided a good starting point.
However, an inode number only identifies a file for as long as that file exists: if the file had been deleted and recreated, it would have a new inode number, and by the time I checked, the old number might no longer resolve to anything. I'd be back at square one. Secondly, this could have been a much bigger problem, ranging from a bug in the filesystem code all the way to a hardware failure. It was good that I still went through the steps to rule out those possibilities.
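One way to sidestep stale inode numbers is to start from the scrub errors instead: they report a logical address (logical 12463439872 in the scrub output above), and btrfs inspect-internal logical-resolve maps a logical address to the file(s) using it at the time you run it. Logical addresses can change too (a balance or a rewrite of the file moves the data), so this is a second, independent route rather than a permanent identifier. A minimal sketch that extracts the unique addresses, assuming the unable to fixup message format shown above (scrub_bad_logicals is a hypothetical helper name):

```shell
# Hypothetical helper: list the unique logical addresses from btrfs scrub
# "unable to fixup ... error at logical <N> on dev ..." messages on stdin.
scrub_bad_logicals() {
    sed -n 's/.*unable to fixup.* error at logical \([0-9][0-9]*\) on dev.*/\1/p' |
        sort -un
}

# Example (needs root): resolve each address to the file(s) that use it.
# sudo dmesg | scrub_bad_logicals | while read -r addr; do
#     sudo btrfs inspect-internal logical-resolve "$addr" /
# done
```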
All that aside, I'm just happy my disk is fine and I can read output from journalctl again to debug other things 😂