Message ID: 20191003215523.7313-1-jack@suse.cz
Series:     ext4: Fix transaction overflow due to revoke descriptors
Hi Jan,

I've tried applying this patch set against 5.4-rc3, and I'm finding an
easily reproducible failure using:

   kvm-xfstests -c ext3conv ext4/039

It is the BUG_ON in fs/jbd2/commit.c, around line 570:

	J_ASSERT(commit_transaction->t_nr_buffers <=
		 atomic_read(&commit_transaction->t_outstanding_credits));

The failure (with the obvious debugging printk added) is:

ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
[    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
[    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
[    8.229215] ------------[ cut here ]------------
[    8.230249] kernel BUG at fs/jbd2/commit.c:573!
	...

The full log is attached (although the stack trace isn't terribly
interesting, since this is being run out of kjournald2).

					- Ted

SeaBIOS (version 1.12.0-1)
Booting from ROM..
KERNEL: kernel	5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234 SMP Sat Oct 19 15:10:53 EDT 2019 x86_64
FSTESTVER: blktests	667d741 (Wed, 4 Sep 2019 10:49:18 -0700)
FSTESTVER: e2fsprogs	v1.45.4-15-g4b4f7b35 (Wed, 9 Oct 2019 20:25:01 -0400)
FSTESTVER: fio		fio-3.15 (Fri, 12 Jul 2019 10:40:36 -0600)
FSTESTVER: fsverity	2151209 (Fri, 28 Jun 2019 14:34:41 -0700)
FSTESTVER: ima-evm-utils	0267fa1 (Mon, 3 Dec 2018 06:11:35 -0500)
FSTESTVER: nvme-cli	v1.9 (Thu, 15 Aug 2019 13:14:59 -0600)
FSTESTVER: quota	6e63107 (Thu, 15 Aug 2019 11:23:55 +0200)
FSTESTVER: util-linux	v2.33.2 (Tue, 9 Apr 2019 14:58:07 +0200)
FSTESTVER: xfsprogs	v5.3.0-rc1-8-g7aaa32db (Thu, 19 Sep 2019 13:21:52 -0400)
FSTESTVER: xfstests	linux-v3.8-2563-g45bd2a28 (Mon, 14 Oct 2019 08:11:38 -0400)
FSTESTVER: xfstests-bld	5e2a748 (Mon, 14 Oct 2019 09:36:05 -0400)
FSTESTCFG: "ext3conv"
FSTESTSET: "ext4/039"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS:   ""
CPUS:      "2"
MEM:       "1966.97"
              total        used        free      shared  buff/cache   available
Mem:           1966          96        1778           8          92        1830
Swap:             0           0           0
[    5.146878] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[    5.148619] NFSD: Using legacy client tracking operations.
BEGIN TEST ext3conv (1 test): Ext4 4k block w/nodelalloc and no flex_bg Sat Oct 19 15:13:15 EDT 2019
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -O ^flex_bg
EXT_MOUNT_OPTIONS: -o block_validity,nodelalloc
FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 kvm-xfstests 5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234 SMP Sat Oct 19 15:10:53 EDT 2019
MKFS_OPTIONS  -- -q -O ^flex_bg /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,nodelalloc /dev/vdc /vdc
ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
[    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
[    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
[    8.229215] ------------[ cut here ]------------
[    8.230249] kernel BUG at fs/jbd2/commit.c:573!
[    8.231231] invalid opcode: 0000 [#1] SMP NOPTI
[    8.232223] CPU: 1 PID: 1384 Comm: jbd2/vdc-8 Not tainted 5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234
[    8.234303] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    8.236124] RIP: 0010:jbd2_journal_commit_transaction+0x1565/0x1f35
[    8.237489] Code: f0 fe ff ff 48 c7 c2 b8 52 3a 8c be 4e 00 00 00 48 c7 c7 3f 06 3f 8c c6 05 a8 c5 5a 01 01 e8 8f 15 d4 ff e9 cc fe ff ff 0f 0b <0f> 0b 0f 0b 0f 0b e8 10 a0 d5 ff 85 c0 0f 85 15 fb ff ff 48 c7 c2
[    8.241449] RSP: 0018:ffffaa64c22cbcf0 EFLAGS: 00010283
[    8.242581] RAX: 00000000000000df RBX: ffff8ed0b8b7f028 RCX: 0000000000000000
[    8.244138] RDX: 0000000000000000 RSI: 00000000000000e2 RDI: ffff8ed0bdbd6608
[    8.245664] RBP: ffffaa64c22cbe80 R08: ffff8ed0bdbd6608 R09: 0000000000000000
[    8.247029] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    8.248023] R13: ffff8ed0b52045a0 R14: ffff8ed0b5204540 R15: ffff8ed0b8b7f000
[    8.249042] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.250191] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.251007] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.252007] Call Trace:
[    8.252356]  ? __lock_acquire+0x24a/0xf80
[    8.252946]  ? __lock_acquired+0x1eb/0x310
[    8.253574]  ? kjournald2+0xe3/0x3a0
[    8.254108]  kjournald2+0xe3/0x3a0
[    8.254567]  ? replenish_dl_entity.cold+0x1d/0x1d
[    8.255364]  ? __jbd2_debug+0x50/0x50
[    8.255886]  kthread+0x126/0x140
[    8.256408]  ? kthread_delayed_work_timer_fn+0xa0/0xa0
[    8.257324]  ret_from_fork+0x3a/0x50
[    8.257916] ---[ end trace 9acad1489f655cc4 ]---
[    8.258799] RIP: 0010:jbd2_journal_commit_transaction+0x1565/0x1f35
[    8.259742] Code: f0 fe ff ff 48 c7 c2 b8 52 3a 8c be 4e 00 00 00 48 c7 c7 3f 06 3f 8c c6 05 a8 c5 5a 01 01 e8 8f 15 d4 ff e9 cc fe ff ff 0f 0b <0f> 0b 0f 0b 0f 0b e8 10 a0 d5 ff 85 c0 0f 85 15 fb ff ff 48 c7 c2
[    8.264348] RSP: 0018:ffffaa64c22cbcf0 EFLAGS: 00010283
[    8.265803] RAX: 00000000000000df RBX: ffff8ed0b8b7f028 RCX: 0000000000000000
[    8.267610] RDX: 0000000000000000 RSI: 00000000000000e2 RDI: ffff8ed0bdbd6608
[    8.269498] RBP: ffffaa64c22cbe80 R08: ffff8ed0bdbd6608 R09: 0000000000000000
[    8.271295] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    8.273021] R13: ffff8ed0b52045a0 R14: ffff8ed0b5204540 R15: ffff8ed0b8b7f000
[    8.274792] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.276789] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.278152] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.279796] ------------[ cut here ]------------
[    8.280909] WARNING: CPU: 1 PID: 1384 at kernel/exit.c:723 do_exit+0x47/0xb70
[    8.282588] CPU: 1 PID: 1384 Comm: jbd2/vdc-8 Tainted: G      D           5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234
[    8.284979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    8.286909] RIP: 0010:do_exit+0x47/0xb70
[    8.287809] Code: 00 00 48 89 44 24 38 31 c0 65 48 8b 1c 25 c0 5d 01 00 48 8b 83 20 12 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 71 04 00 00 <0f> 0b 65 44 8b 25 07 fa 15 75 41 81 e4 00 ff 1f 00 44 89 64 24 0c
[    8.290621] RSP: 0018:ffffaa64c22cbee0 EFLAGS: 00010216
[    8.291398] RAX: ffffaa64c22cbdc0 RBX: ffff8ed0b5aa6400 RCX: 0000000000000000
[    8.292582] RDX: ffff8ed0ba5e4548 RSI: 0000000000000000 RDI: 000000000000000b
[    8.293818] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000000
[    8.295187] R10: 0000000000000008 R11: ffffaa64c22cba1d R12: 000000000000000b
[    8.296506] R13: 0000000000000002 R14: 0000000000000006 R15: ffff8ed0b5aa6400
[    8.297843] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.299363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.300503] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.301841] Call Trace:
[    8.302275]  ? __jbd2_debug+0x50/0x50
[    8.303003]  ? kthread+0x126/0x140
[    8.303754]  rewind_stack_do_exit+0x17/0x20
[    8.304485] irq event stamp: 134503
[    8.305232] hardirqs last  enabled at (134503): [<ffffffff8ae017ea>] trace_hardirqs_on_thunk+0x1a/0x20
[    8.306951] hardirqs last disabled at (134501): [<ffffffff8bc002b0>] __do_softirq+0x2b0/0x42a
[    8.308579] softirqs last  enabled at (134502): [<ffffffff8bc0032a>] __do_softirq+0x32a/0x42a
[    8.310100] softirqs last disabled at (134495): [<ffffffff8aeb84d3>] irq_exit+0xb3/0xc0
[    8.311524] ---[ end trace 9acad1489f655cc5 ]---
QEMU: Terminated
On Sat 19-10-19 15:19:33, Theodore Y. Ts'o wrote:
> Hi Jan,
>
> I've tried applying this patch set against 5.4-rc3, and I'm finding an
> easily reproducible failure using:
>
>    kvm-xfstests -c ext3conv ext4/039
>
> It is the BUG_ON in fs/jbd2/commit.c, around line 570:
>
> 	J_ASSERT(commit_transaction->t_nr_buffers <=
> 		 atomic_read(&commit_transaction->t_outstanding_credits));
>
> The failure (with the obvious debugging printk added) is:
>
> ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
> [    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
> [    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
> [    8.229215] ------------[ cut here ]------------
> [    8.230249] kernel BUG at fs/jbd2/commit.c:573!
> 	...
>
> The full log is attached (although the stack trace isn't terribly
> interesting, since this is being run out of kjournald2).

Thanks! Somehow this escaped my testing although I thought I had run the
ext3 configuration... Anyway, we are reserving too little space in this
case - with some debugging added:

[   80.296029] t_buffers: 222, t_outstanding_credits: 219, t_revoke_written: 23, t_revoke_reserved: 12, t_revoke_records_written 11432, t_revoke_records_reserved 11432, revokes_per_block: 1020

Which is really puzzling because it would suggest that revokes_per_block is
actually wrong. Digging more into this.

								Honza
On Thu 24-10-19 15:09:08, Jan Kara wrote:
> On Sat 19-10-19 15:19:33, Theodore Y. Ts'o wrote:
> > Hi Jan,
> >
> > I've tried applying this patch set against 5.4-rc3, and I'm finding an
> > easily reproducible failure using:
> >
> >    kvm-xfstests -c ext3conv ext4/039
> >
> > It is the BUG_ON in fs/jbd2/commit.c, around line 570:
> >
> > 	J_ASSERT(commit_transaction->t_nr_buffers <=
> > 		 atomic_read(&commit_transaction->t_outstanding_credits));
> >
> > The failure (with the obvious debugging printk added) is:
> >
> > ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
> > [    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
> > [    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
> > [    8.229215] ------------[ cut here ]------------
> > [    8.230249] kernel BUG at fs/jbd2/commit.c:573!
> > 	...
> >
> > The full log is attached (although the stack trace isn't terribly
> > interesting, since this is being run out of kjournald2).
>
> Thanks! Somehow this escaped my testing although I thought I had run the
> ext3 configuration... Anyway, we are reserving too little space in this
> case - with some debugging added:
>
> [   80.296029] t_buffers: 222, t_outstanding_credits: 219,
> t_revoke_written: 23, t_revoke_reserved: 12, t_revoke_records_written
> 11432, t_revoke_records_reserved 11432, revokes_per_block: 1020
>
> Which is really puzzling because it would suggest that revokes_per_block is
> actually wrong. Digging more into this.

Yeah, ext4 was updating journal features in this case but
j_revoke_records_per_block didn't get properly updated. Fixed now.

								Honza
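The arithmetic behind the debug output above can be sketched as a toy model. The constants below (a 16-byte revoke descriptor header, 4-byte revoke records growing to 8 bytes when a journal feature such as 64-bit block numbers is enabled) are assumptions chosen to reproduce the logged numbers, not code copied from jbd2:

```python
# Toy model of jbd2 revoke-descriptor sizing, illustrating the mismatch
# Jan describes: reservation computed with a stale j_revoke_records_per_block.
REVOKE_HEADER = 16          # assumed: 12-byte journal header + 4-byte record count

def revokes_per_block(block_size, record_size):
    """How many revoke records fit in one descriptor block."""
    return (block_size - REVOKE_HEADER) // record_size

def descriptor_blocks_needed(records, per_block):
    """Descriptor blocks needed to log `records` revoke records (ceiling)."""
    return -(-records // per_block)

# With 4-byte records and 4k blocks this gives the revokes_per_block: 1020
# seen in the debug printk.
assert revokes_per_block(4096, 4) == 1020

# If a feature update doubles the record size but the cached per-block value
# is not refreshed, the reservation is roughly half of what commit writes:
reserved = descriptor_blocks_needed(11432, revokes_per_block(4096, 4))
written  = descriptor_blocks_needed(11432, revokes_per_block(4096, 8))
print(reserved, written)    # 12 vs 23 - matches t_revoke_reserved/t_revoke_written
```

Under these assumptions the model reproduces t_revoke_reserved 12 against t_revoke_written 23, which is enough to trip the J_ASSERT comparing t_nr_buffers with t_outstanding_credits.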
Hi Jan,

I believe that I'm waiting for the v4 version of this series with some
pending fixes that you are planning on making. Is that correct?

Thanks!!

					- Ted
Hi Ted!

On Sun 03-11-19 22:32:52, Theodore Y. Ts'o wrote:
> I believe that I'm waiting for the v4 version of this series with some
> pending fixes that you are planning on making. Is that correct?

Ah, good that you pinged me because I have the series ready but I was
waiting for your answers to some explanations... In particular the
discussion around patch 3 (move iput() outside of transaction), patch 15
(dropping of j_state_lock around t_tid load), patch 18 (possible large
overreservation of descriptor blocks due to rounding), and patch 21
(change of on-disk format for revoke descriptors).

Of these, I find the overreservation due to rounding the most serious and
easy enough to handle, so I'll fix that and then resend the series unless
you raise an objection in some other case.

								Honza
On Mon, Nov 04, 2019 at 12:22:32PM +0100, Jan Kara wrote:
> Hi Ted!
>
> On Sun 03-11-19 22:32:52, Theodore Y. Ts'o wrote:
> > I believe that I'm waiting for the v4 version of this series with some
> > pending fixes that you are planning on making. Is that correct?
>
> Ah, good that you pinged me because I have the series ready but I was
> waiting for your answers to some explanations... In particular the
> discussion around patch 3 (move iput() outside of transaction), patch 15
> (dropping of j_state_lock around t_tid load), patch 18 (possible large
> overreservation of descriptor blocks due to rounding), and patch 21
> (change of on-disk format for revoke descriptors).

Sorry, I didn't comment because I accepted your arguments; but I guess I
should have said so explicitly. I just replied to those threads.

> Of these, I find the overreservation due to rounding the most serious and
> easy enough to handle, so I'll fix that and then resend the series unless
> you raise an objection in some other case.

Great!

					- Ted
On Tue, Nov 05, 2019 at 05:44:06PM +0100, Jan Kara wrote:
> Hello,
>
> Here is v3 of this series with a couple more bugs fixed. Now all failed
> tests Ted highlighted pass for me.

Thanks, I've applied this to the ext4 git tree. Thanks for your work on
this patch series!

					- Ted
On Wed, Nov 13, 2019 at 10:45:45AM +0100, Jan Kara wrote:
> Thanks for the heads up! I didn't do any performance testing with the jbd2
> changes specifically and our internal performance testing grid only checks
> Linus' kernel so it didn't yet even try to run that code. I'll queue some
> sqlite insert tests internally with my changes to see whether I'm able to
> reproduce. I don't have NVME disks available quickly but I guess SATA SSD
> could do the job as well...

Sorry, false alarm. What Phoronix was testing was 5.3 versus 5.4-rcX,
using Ubuntu's bleeding-edge kernels. It wouldn't have any of the ext4
patches we have queued for the *next* merge window.

That being said, I wasn't able to reproduce the performance delta using
upstream kernels, running on a Google Compute Engine VM, machtype
n1-highcpu-8, using a GCE Local SSD (SCSI-attached) for the first
benchmark, which I believe was the pts/sqlite benchmark using a thread
count of 1:

Phoronix Test Suite 9.0.1
SQLite 3.30.1
Threads / Copies: 1
Seconds < Lower Is Better
5.3.0 ..................... 225 |===========================================
5.4.0-rc3 ................. 224 |==========================================
5.4-rc3-80-gafb2442fa429 .. 227 |===========================================
5.4.0-rc7 ................. 223 |==========================================

Processor: Intel Xeon (4 Cores / 8 Threads), Chipset: Intel 440FX 82441FX
PMC, Memory: 1 x 7373 MB RAM, Disk: 11GB PersistentDisk + 403GB
EphemeralDisk, Network: Red Hat Virtio device

OS: Debian 10, Kernel: 5.4.0-rc3-xfstests (x86_64) 20191113, Compiler:
GCC 8.3.0, File-System: ext4, System Layer: KVM

This was done using an extension to a gce-xfstests test appliance, to
which I hope to be adding an automation engine where it will kexec into a
series of kernels, run the benchmarks and then spit out the report
somewhere. For now, the benchmarks are run manually.

(Adding commentary and click-baity titles is left as an exercise to the
reader. :-)

					- Ted

P.S.  For all that I like to make snarky comments about Phoronix.com, I
have to admit Michael Larabel has done a pretty good job with his
performance test engine. I probably would have chosen a different
implementation than PHP, and I'd have added an explicit way to specify
the file system to be tested other than mounting it on top of
/var/lib/phoronix-test-suite, and at least have the option of placing the
benchmarks' build trees and binaries in a different location than the
file system under test.

But that being said, he's collecting a decent set of benchmark tools, and
it is pretty cool that it has an automated way of collecting the
benchmark results, including the pretty graphs suitable for web articles
and conference slide decks ("and now, we turn to the rigged benchmarks
section of the presentation designed to show my new feature in the best
possible light...").
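The workload being compared here is essentially single-threaded transactional inserts, each commit forcing journal traffic. A minimal stand-in using Python's sqlite3 module (my own sketch of that access pattern, not what the Phoronix suite actually runs) looks like:

```python
import os
import sqlite3
import tempfile
import time

def insert_benchmark(rows=1000):
    """Time `rows` individually committed inserts against an on-disk database.

    One transaction per insert mirrors the commit-heavy pattern that
    stresses the jbd2 commit path discussed in this thread."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA synchronous=FULL")   # flush to stable storage per commit
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
    start = time.monotonic()
    for _ in range(rows):
        conn.execute("INSERT INTO t (payload) VALUES (?)", ("x" * 64,))
        conn.commit()                         # one journal commit per row
    elapsed = time.monotonic() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count, elapsed

count, elapsed = insert_benchmark(100)
assert count == 100
```

Dividing `rows` by `elapsed` gives a transactions-per-second figure comparable in spirit (though not in absolute value) to the tables quoted in this thread.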
On Thu 14-11-19 00:26:52, Theodore Y. Ts'o wrote:
> On Wed, Nov 13, 2019 at 10:45:45AM +0100, Jan Kara wrote:
> > Thanks for the heads up! I didn't do any performance testing with the jbd2
> > changes specifically and our internal performance testing grid only checks
> > Linus' kernel so it didn't yet even try to run that code. I'll queue some
> > sqlite insert tests internally with my changes to see whether I'm able to
> > reproduce. I don't have NVME disks available quickly but I guess SATA SSD
> > could do the job as well...
>
> Sorry, false alarm. What Phoronix was testing was 5.3 versus 5.4-rcX,
> using Ubuntu's bleeding-edge kernels. It wouldn't have any of the ext4
> patches we have queued for the *next* merge window.

OK, thanks for looking! I've run some tests on my test setup anyway...

> That being said, I wasn't able to reproduce the performance delta using
> upstream kernels, running on a Google Compute Engine VM, machtype
> n1-highcpu-8, using a GCE Local SSD (SCSI-attached) for the first
> benchmark, which I believe was the pts/sqlite benchmark using a thread
> count of 1:
>
> Phoronix Test Suite 9.0.1
> SQLite 3.30.1
> Threads / Copies: 1
> Seconds < Lower Is Better
> 5.3.0 ..................... 225 |===========================================
> 5.4.0-rc3 ................. 224 |==========================================
> 5.4-rc3-80-gafb2442fa429 .. 227 |===========================================
> 5.4.0-rc7 ................. 223 |==========================================
>
> Processor: Intel Xeon (4 Cores / 8 Threads), Chipset: Intel 440FX
> 82441FX PMC, Memory: 1 x 7373 MB RAM, Disk: 11GB PersistentDisk +
> 403GB EphemeralDisk, Network: Red Hat Virtio device
>
> OS: Debian 10, Kernel: 5.4.0-rc3-xfstests (x86_64) 20191113, Compiler:
> GCC 8.3.0, File-System: ext4, System Layer: KVM
>
> This was done using an extension to a gce-xfstests test appliance, to
> which I hope to be adding an automation engine where it will kexec
> into a series of kernels, run the benchmarks and then spit out the
> report somewhere. For now, the benchmarks are run manually.
>
> (Adding commentary and click-baity titles is left as an exercise to
> the reader. :-)
>
> - Ted
>
> P.S.  For all that I like to make snarky comments about Phoronix.com,
> I have to admit Michael Larabel has done a pretty good job with his
> performance test engine. I probably would have chosen a different
> implementation than PHP, and I'd have added an explicit way to specify
> the file system to be tested other than mounting it on top of
> /var/lib/phoronix-test-suite, and at least have the option of placing
> the benchmarks' build trees and binaries in a different location than
> the file system under test.
>
> But that being said, he's collecting a decent set of benchmark tools,
> and it is pretty cool that it has an automated way of collecting the
> benchmark results, including the pretty graphs suitable for web
> articles and conference slide decks ("and now, we turn to the rigged
> benchmarks section of the presentation designed to show my new feature
> in the best possible light...").

Let me make a small marketing pitch for mmtests [1] :) For me, running the
test is just:

* Boot the right kernel on the machine
* Run:
  ./run-mmtests.sh -c configs/config-db-sqlite-insert-medium-ext4 \
	--no-monitor Whatever_run_name_1

Now the config file already has the proper partition, fstype, mkfs opts etc.
configured so it's a bit of cheating, but still :). And when I have data
for both kernels, I do:

  cd work/log
  ../../compare_kernels.sh

and get a table with the comparison of the two benchmarking runs with
averages, standard deviations, percentiles, and other more advanced
statistical stuff to distinguish signal from noise. We also have support
for gathering various monitoring while the test is running (turbostat,
iostat, vmstat, ...) and graphing all the results (although the graphs are
more aimed at quick analysis of what's going on rather than at presenting
results to a public).

So for this campaign I've compared "5.3+some SUSE patches" to "5.4-rc7+your
'dev' branch". And the results look like:

sqlite
                            5.3-SUSE            5.4-rc7 ext4-dev
Min       Trans    2181.67 (   0.00%)    2412.72 (  10.59%)
Hmean     Trans    2399.39 (   0.00%)    2602.73 *   8.47%*
Stddev    Trans     172.15 (   0.00%)     141.61 (  17.74%)
CoeffVar  Trans       7.14 (   0.00%)       5.43 (  24.00%)
Max       Trans    2671.84 (   0.00%)    3027.81 (  13.32%)
...

These are Trans/sec values, so there's actually a small improvement on this
machine. But it's somewhat difficult to tell because the benchmark
variation is rather high (likely due to the powersave cpufreq governor, if
I had to guess).

								Honza

[1] git://github.com/gormanm/mmtests
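The summary rows in a table like the one above are standard statistics over repeated throughput samples. A rough reimplementation of the Hmean/Stddev/CoeffVar columns (my own sketch, not the actual compare_kernels.sh code; it assumes samples are rates in transactions/sec, which is why the harmonic mean is used):

```python
import statistics

def summarize(samples):
    """Summary rows in the spirit of mmtests' compare_kernels.sh output."""
    hmean = statistics.harmonic_mean(samples)   # appropriate mean for rates
    stddev = statistics.stdev(samples)
    mean = statistics.mean(samples)
    coeffvar = 100.0 * stddev / mean            # run-to-run noise, in percent
    return {"Min": min(samples), "Hmean": hmean,
            "Stddev": stddev, "CoeffVar": coeffvar, "Max": max(samples)}

def improvement(base, new):
    """Percent change of `new` relative to `base` (positive = faster)."""
    return 100.0 * (new - base) / base

# The Hmean row above: 2399.39 -> 2602.73 is roughly an 8.47% improvement.
delta = improvement(2399.39, 2602.73)
assert abs(delta - 8.47) < 0.01
```

A CoeffVar of 7% on the baseline, as in the table, means single-run differences smaller than that are hard to distinguish from noise, which is Jan's point about the high variation.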