Patchwork [RFC] jbd2: jbd2_journal_stop needs an exclusive control to synchronize around t_updates operations

login
register
mail settings
Submitter Toshiyuki Okajima
Date Dec. 16, 2011, 11:19 a.m.
Message ID <20111216201915.4a012154.toshi.okajima@jp.fujitsu.com>
Download mbox | patch
Permalink /patch/131806/
State New
Headers show

Comments

Toshiyuki Okajima - Dec. 16, 2011, 11:19 a.m.
Hi.

I found a bug by executing the following reproducer (needs tuning).
(The reproducer never ends.)

I wrote it for a confirmation of 
 "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode".
Therefore, without 
 "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode"
patch, the other problem (panic) which the patch describes can happen more
frequently:
-------------------------------------------------------------------------------
#!/bin/sh

date
LOOP=100000
dd if=/dev/zero of=/tmp/ext4.img bs=1k count=1000k
/sbin/mkfs.ext4 -Fq /tmp/ext4.img
mount -o loop /tmp/ext4.img /mnt
rm -f /mnt/file
echo "0" > /mnt/file
(while [ 1 ]; do dd if=/dev/zero of=/mnt/file2 bs=4k > /dev/null 2>&1; done) &
PID=$!
for ((i=0; i < LOOP; i++));
do
        DATE=$(date)
        echo -n "[LOOP $i] $DATE"
        if ((i%2 == 0));
        then
                chattr +j /mnt/file
        else
                chattr -j /mnt/file
        fi
        echo "0" >> /mnt/file
done
kill -9 $PID
rm -f /mnt/file*
umount /mnt
exit 0
-------------------------------------------------------------------------------
Though I ran it, it never ended forever.
(At one of my tries to reproduce the hang, the reproducer stopped in 6
hours.)

Because jbd2_journal_lock_updates() never returns from schedule() 
after prepare_to_wait(). 

The detailed explanation is as follows:
Yongqiang Yang - Dec. 16, 2011, 12:22 p.m.
On Fri, Dec 16, 2011 at 7:19 PM, Toshiyuki Okajima
<toshi.okajima@jp.fujitsu.com> wrote:
> Hi.
>
> I found a bug by executing the following reproducer (needs tuning).
> (The reproducer never ends.)
>
> I wrote it for a confirmation of
>  "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode".
> Therefore, without
>  "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode"
> patch, the other problem (panic) which the patch describes can happen more
> frequently:
> -------------------------------------------------------------------------------
> #!/bin/sh
>
> date
> LOOP=100000
> dd if=/dev/zero of=/tmp/ext4.img bs=1k count=1000k
> /sbin/mkfs.ext4 -Fq /tmp/ext4.img
> mount -o loop /tmp/ext4.img /mnt
> rm -f /mnt/file
> echo "0" > /mnt/file
> (while [ 1 ]; do dd if=/dev/zero of=/mnt/file2 bs=4k > /dev/null 2>&1; done) &
> PID=$!
> for ((i=0; i < LOOP; i++));
> do
>        DATE=$(date)
>        echo -n "[LOOP $i] $DATE"
>        if ((i%2 == 0));
>        then
>                chattr +j /mnt/file
>        else
>                chattr -j /mnt/file
>        fi
>        echo "0" >> /mnt/file
> done
> kill -9 $PID
> rm -f /mnt/file*
> umount /mnt
> exit 0
> -------------------------------------------------------------------------------
> Though I ran it, it never ended forever.
> (At one of my tries to reproduce the hang, the reproducer stopped in 6
> hours.)
>
> Because jbd2_journal_lock_updates() never returns from schedule()
> after prepare_to_wait().
>
> The detailed explanation is as follows:
> ===============================================================================
> Current implementation of jbd2_journal_stop() has a bug not to synchronize
> t_updates with other operators, jbd2_journal_lock_updates(), start_this_handle()
IMHO, the description is not very exactly, the problem is not due to
desync of t_updates, t_updates is a var of atomic_t.   The reason is
that we have a critical code section here:[operation on t_updates,
wait or wakeup], this section should be synchronized.

> and jbd2_journal_commit_transaction().
>
> This bug was derived from commit: c35a56a090eacefca07afeb994029b57d8dd8025 and
>  commit: 8dd420466c7bfc459fa04680bd5690bfc41a4553.
>
> ---------------------------
> jbd2_journal_lock_updates()
> ---------------------------
>  509         write_lock(&journal->j_state_lock);
>  510         ++journal->j_barrier_count;
> ...
>  513         while (1) {
> ...
>  519                 spin_lock(&transaction->t_handle_lock);
>  520                 if (!atomic_read(&transaction->t_updates)) {
>  521                         spin_unlock(&transaction->t_handle_lock);
>  522                         break;
>  523                 }
>  524                 prepare_to_wait(&journal->j_wait_updates, &wait,
>  525                                 TASK_UNINTERRUPTIBLE);
>  526                 spin_unlock(&transaction->t_handle_lock);
It seems that t_handle_lock is used to protect the critical section
here.   I think t_handle is not needed here at all.   t_handle_lock
should be used to protect transaction and j_state_lock should be used
to protect journal.  prepare_to_wait operates on journal, so we should
remove t_handle_lock here.

>  527                 write_unlock(&journal->j_state_lock);
>  528                 schedule();
>  529                 finish_wait(&journal->j_wait_updates, &wait);
>  530                 write_lock(&journal->j_state_lock);
>  531         }
>  532         write_unlock(&journal->j_state_lock);
>
> * The operation of t_updates is protected by write_lock(&journal->j_state_lock)
>  and spin_lock(&transaction->t_handle_lock).
>
> -------------------
> start_this_handle()
> -------------------
>  160         read_lock(&journal->j_state_lock);
> ...
>  278         atomic_inc(&transaction->t_updates);
> ...
>  284         read_unlock(&journal->j_state_lock);
>
> * The operation of t_updates is protected by read_lock(&journal->j_state_lock).
>
> ---------------------------------
> jbd2_journal_commit_transaction()
> ---------------------------------
>  358         write_lock(&journal->j_state_lock);
> ...
>  367         spin_lock(&commit_transaction->t_handle_lock);
>  368         while (atomic_read(&commit_transaction->t_updates)) {
>  369                 DEFINE_WAIT(wait);
>  370
>  371                 prepare_to_wait(&journal->j_wait_updates, &wait,
>  372                                         TASK_UNINTERRUPTIBLE);
>  373                 if (atomic_read(&commit_transaction->t_updates)) {
>  374                         spin_unlock(&commit_transaction->t_handle_lock);
Also, t_handle_lock is used to protect the critical section.  I think
t_handle_lock is not needed here at all.

>  375                         write_unlock(&journal->j_state_lock);
>  376                         schedule();
>  377                         write_lock(&journal->j_state_lock);
>  378                         spin_lock(&commit_transaction->t_handle_lock);
>  379                 }
>  380                 finish_wait(&journal->j_wait_updates, &wait);
>  381         }
>  382         spin_unlock(&commit_transaction->t_handle_lock);
> ...
>  447         wake_up(&journal->j_wait_transaction_locked);
>  448         write_unlock(&journal->j_state_lock);
>
> * The operation of t_updates is protected by write_lock(&journal->j_state_lock)
>  and spin_lock(&transaction->t_handle_lock).
>
> -------------------
> jbd2_journal_stop()
> -------------------
> ...
> 1452         if (atomic_dec_and_test(&transaction->t_updates)) {
> 1453                 wake_up(&journal->j_wait_updates);
> 1454                 if (journal->j_barrier_count)
> 1455                         wake_up(&journal->j_wait_transaction_locked);
> 1456         }
> 1457
> 1458         if (wait_for_commit)
> 1459                 err = jbd2_log_wait_commit(journal, tid);
> 1460
> 1461         lock_map_release(&handle->h_lockdep_map);
> 1462
> 1463         jbd2_free_handle(handle);
> 1464         return err;
> 1465 }
>
> * The operation of t_updates is not protected.
>
> Therefore,
> - jbd2_journal_lock_updates() sleeps if transaction->t_update is 0.
> - jbd2_journal_stop() tries to wake someone up
> (wake_up(&journal->j_wait_updates)) after transaction->t_updates becomes 0.
> => But jbd2_journal_lock_updates() isn't woken up and then sleeps forever,
>   if it has not yet slept completely.
>
> After we run the reproducer, we can see the processes which become hang-up.
>
> -------------------------------------------------------------------------------
> crash> ps | grep UN
>  16645  23737   1  ffff88007a592b20  UN   0.0  105136    508  dd
>  16684  23727   1  ffff88007a2a9480  UN   0.0  107152    512  chattr
>  23722      2   1  ffff88007b496080  UN   0.0       0      0  [flush-7:0]
>
> [1]
> crash> bt 16645
> PID: 16645  TASK: ffff88007a592b20  CPU: 1   COMMAND: "dd"
>  #0 [ffff88007b575878] __schedule at ffffffff814b9b77
>  #1 [ffff88007b575920] schedule at ffffffff814ba3cf
>  #2 [ffff88007b575930] start_this_handle at ffffffffa01acef5 [jbd2]
>  #3 [ffff88007b5759f0] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
>  #4 [ffff88007b575a30] jbd2_journal_start at ffffffffa01ad523 [jbd2]
>  #5 [ffff88007b575a40] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
>  #6 [ffff88007b575ab0] ext4_da_write_begin at ffffffffa038ddce [ext4]
>  #7 [ffff88007b575b50] generic_file_buffered_write at ffffffff811005ce
>  #8 [ffff88007b575c10] __generic_file_aio_write at ffffffff81103418
>  #9 [ffff88007b575cd0] generic_file_aio_write at ffffffff811036b6
> #10 [ffff88007b575d50] ext4_file_write at ffffffffa0386ee9 [ext4]
> #11 [ffff88007b575de0] do_sync_write at ffffffff8115de52
> #12 [ffff88007b575f00] vfs_write at ffffffff8115e3e8
> #13 [ffff88007b575f30] sys_write at ffffffff8115e5b1
> #14 [ffff88007b575f80] system_call_fastpath at ffffffff814c3f42
>    RIP: 00000034930d4230  RSP: 00007fffd8176980  RFLAGS: 00000206
>    RAX: 0000000000000001  RBX: ffffffff814c3f42  RCX: 0000000000000010
>    RDX: 0000000000001000  RSI: 0000000001ac6000  RDI: 0000000000000001
>    RBP: 0000000001ac6000   R8: 0000000000000003   R9: 0000000000040000
>    R10: 0000000000003003  R11: 0000000000000246  R12: 0000000001ac5fff
>    R13: 0000000000000000  R14: 0000000000001000  R15: 0000000000000246
>    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
>
> [2]
> crash> bt 16684
> PID: 16684  TASK: ffff88007a2a9480  CPU: 1   COMMAND: "chattr"
>  #0 [ffff88007af87ca8] __schedule at ffffffff814b9b77
>  #1 [ffff88007af87d50] schedule at ffffffff814ba3cf
>  #2 [ffff88007af87d60] jbd2_journal_lock_updates at ffffffffa01acca6 [jbd2]
>  #3 [ffff88007af87de0] ext4_change_inode_journal_flag at ffffffffa038faf1 [ext4]
>  #4 [ffff88007af87e10] ext4_ioctl at ffffffffa0392f48 [ext4]
>  #5 [ffff88007af87ea0] do_vfs_ioctl at ffffffff8116fb8a
>  #6 [ffff88007af87f30] sys_ioctl at ffffffff811700d1
>  #7 [ffff88007af87f80] system_call_fastpath at ffffffff814c3f42
>    RIP: 00000034930d95f7  RSP: 00007fff8830e0d8  RFLAGS: 00010213
>    RAX: 0000000000000010  RBX: ffffffff814c3f42  RCX: 00000034930d95f7
>    RDX: 00007fff8830e17c  RSI: 0000000040086602  RDI: 0000000000000003
>    RBP: 0000000000084000   R8: 0000000000000000   R9: 0000000000000001
>    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff8830f8db
>    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
>    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
>
> [3]
> crash> bt 23722
> PID: 23722  TASK: ffff88007b496080  CPU: 1   COMMAND: "flush-7:0"
>  #0 [ffff88007b4f7840] __schedule at ffffffff814b9b77
>  #1 [ffff88007b4f78e8] schedule at ffffffff814ba3cf
>  #2 [ffff88007b4f78f8] start_this_handle at ffffffffa01acef5 [jbd2]
>  #3 [ffff88007b4f79b8] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
>  #4 [ffff88007b4f79f8] jbd2_journal_start at ffffffffa01ad523 [jbd2]
>  #5 [ffff88007b4f7a08] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
>  #6 [ffff88007b4f7a78] ext4_da_writepages at ffffffffa0390b6d [ext4]
>  #7 [ffff88007b4f7bc8] do_writepages at ffffffff8110c261
>  #8 [ffff88007b4f7bd8] writeback_single_inode at ffffffff81185830
>  #9 [ffff88007b4f7c38] writeback_sb_inodes at ffffffff81185e21
> #10 [ffff88007b4f7ce8] __writeback_inodes_wb at ffffffff81185f7e
> #11 [ffff88007b4f7d38] wb_writeback at ffffffff81186223
> #12 [ffff88007b4f7db8] wb_do_writeback at ffffffff811864f9
> #13 [ffff88007b4f7e58] bdi_writeback_thread at ffffffff811865fa
> #14 [ffff88007b4f7ee8] kthread at ffffffff810828c6
> #15 [ffff88007b4f7f48] kernel_thread_helper at ffffffff814c60b4
>
> Step to reproduce a hang-up(figure):
>  =======================================+=======================================
>  ( "chattr -j" process PID: 16684 )     |  ( a certain process )
> t ======================================+=======================================
> i                                       |
> m                                       +---------------------------------------
> e                                       |jbd2_journal_start()
> |                                       +---------------------------------------
> |                                       | . . .
> |                                       | t_updates++ // t_updates = 1
> |                                       | . . .
> |                                       +---------------------------------------
> |                                        ...
> |---------------------------------------+---------------------------------------
> |jbd2_journal_lock_updates()            |jbd2_journal_stop()
> |---------------------------------------+---------------------------------------
> | write_lock(&journal->j_state_lock)    |    .
> | ++journal->j_barrier_count            |    .
> | spin_lock(&tran->t_handle_lock)       |    .
> | atomic_read(&tran->t_updates) //not 0 |
> |                                       | atomic_dec_and_test(&tran->t_updates)
> |                                       |    // t_updates = 0
> |                                       | wake_up(&journal->j_wait_updates)
> | prepare_to_wait()                     |    // no process is woken up.
> | spin_unlock(&tran->t_handle_lock)     + wake_up(&journal->j_wait_tran_locked)
> | write_unlock(&journal->j_state_lock)  |    // no process is woken up.
> | schedule() // never return            +---------------------------------------
> |                                       |  ( "dd" PID: 16645 or "flush-7:0" )
> |                                       +=======================================
> |                                       |start_this_handle()
> |                                       +---------------------------------------
> |                                       | . . .
> |                                       | read_lock(&journal->j_state_lock)
> |                                       | if (journal->j_barrier_count)
> |                                       |  read_unlock(&journal->j_state_lock)
> |                                       |  wait_event(journal->j_wait_tran_locked)
> v                                       |  // never return
>
> Therefore, jbd2_journal_stop() needs spin_lock(&tran->t_handle_lock) or
> read_lock(&journal->j_state_lock) to synchronize trans->t_updates with other
> operators.
> ===============================================================================
>
> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> ---
>  fs/jbd2/transaction.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index a0e41a4..d48f395 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1448,12 +1448,14 @@ int jbd2_journal_stop(handle_t *handle)
>         * once we do this, we must not dereference transaction
>         * pointer again.
>         */
> +       read_lock(&journal->j_state_lock);
>        tid = transaction->t_tid;
>        if (atomic_dec_and_test(&transaction->t_updates)) {
>                wake_up(&journal->j_wait_updates);
>                if (journal->j_barrier_count)
oops, j_barrier_count is not protected too in original code.  The
problem does not exists in patched code any more:-)


Yongqiang.
>                        wake_up(&journal->j_wait_transaction_locked);
>        }
> +       read_unlock(&journal->j_state_lock);
>
>        if (wait_for_commit)
>                err = jbd2_log_wait_commit(journal, tid);
> --
> 1.5.5.6
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima - Dec. 20, 2011, 10:44 a.m.
Hi, Yongqiang.

Thanks for reviewing.

(2011/12/16 21:22), Yongqiang Yang wrote:
> On Fri, Dec 16, 2011 at 7:19 PM, Toshiyuki Okajima
> <toshi.okajima@jp.fujitsu.com> wrote:
>> Hi.
>>
>> I found a bug by executing the following reproducer (needs tuning).
>> (The reproducer never ends.)
>>
>> I wrote it for a confirmation of
>>  "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode".
>> Therefore, without
>>  "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode"
>> patch, the other problem (panic) which the patch describes can happen more
>> frequently:
>> -------------------------------------------------------------------------------
>> #!/bin/sh
>>
>> date
>> LOOP=100000
>> dd if=/dev/zero of=/tmp/ext4.img bs=1k count=1000k
>> /sbin/mkfs.ext4 -Fq /tmp/ext4.img
>> mount -o loop /tmp/ext4.img /mnt
>> rm -f /mnt/file
>> echo "0" > /mnt/file
>> (while [ 1 ]; do dd if=/dev/zero of=/mnt/file2 bs=4k > /dev/null 2>&1; done) &
>> PID=$!
>> for ((i=0; i < LOOP; i++));
>> do
>>        DATE=$(date)
>>        echo -n "[LOOP $i] $DATE"
>>        if ((i%2 == 0));
>>        then
>>                chattr +j /mnt/file
>>        else
>>                chattr -j /mnt/file
>>        fi
>>        echo "0" >> /mnt/file
>> done
>> kill -9 $PID
>> rm -f /mnt/file*
>> umount /mnt
>> exit 0
>> -------------------------------------------------------------------------------
>> Though I ran it, it never ended forever.
>> (At one of my tries to reproduce the hang, the reproducer stopped in 6
>> hours.)
>>
>> Because jbd2_journal_lock_updates() never returns from schedule()
>> after prepare_to_wait().
>>
>> The detailed explanation is as follows:
>> ===============================================================================
>> Current implementation of jbd2_journal_stop() has a bug not to synchronize
>> t_updates with other operators, jbd2_journal_lock_updates(), start_this_handle()

> IMHO, the description is not very exactly, the problem is not due to
> desync of t_updates, t_updates is a var of atomic_t.   The reason is
> that we have a critical code section here:[operation on t_updates,
> wait or wakeup], this section should be synchronized.
I understood. I will rewrite the description much better than this by 
applying your comment. 

> 
>> and jbd2_journal_commit_transaction().
>>
>> This bug was derived from commit: c35a56a090eacefca07afeb994029b57d8dd8025 and
>>  commit: 8dd420466c7bfc459fa04680bd5690bfc41a4553.
>>
>> ---------------------------
>> jbd2_journal_lock_updates()
>> ---------------------------
>>  509         write_lock(&journal->j_state_lock);
>>  510         ++journal->j_barrier_count;
>> ...
>>  513         while (1) {
>> ...
>>  519                 spin_lock(&transaction->t_handle_lock);
>>  520                 if (!atomic_read(&transaction->t_updates)) {
>>  521                         spin_unlock(&transaction->t_handle_lock);
>>  522                         break;
>>  523                 }
>>  524                 prepare_to_wait(&journal->j_wait_updates, &wait,
>>  525                                 TASK_UNINTERRUPTIBLE);
>>  526                 spin_unlock(&transaction->t_handle_lock);

> It seems that t_handle_lock is used to protect the critical section
> here.   I think t_handle is not needed here at all.   t_handle_lock
> should be used to protect transaction and j_state_lock should be used
> to protect journal.  prepare_to_wait operates on journal, so we should
> remove t_handle_lock here.
I think so, too. Then, I will write another patch which deletes the statements
of  t_handle_lock.

> 
>>  527                 write_unlock(&journal->j_state_lock);
>>  528                 schedule();
>>  529                 finish_wait(&journal->j_wait_updates, &wait);
>>  530                 write_lock(&journal->j_state_lock);
>>  531         }
>>  532         write_unlock(&journal->j_state_lock);
>>
>> * The operation of t_updates is protected by write_lock(&journal->j_state_lock)
>>  and spin_lock(&transaction->t_handle_lock).
>>
>> -------------------
>> start_this_handle()
>> -------------------
>>  160         read_lock(&journal->j_state_lock);
>> ...
>>  278         atomic_inc(&transaction->t_updates);
>> ...
>>  284         read_unlock(&journal->j_state_lock);
>>
>> * The operation of t_updates is protected by read_lock(&journal->j_state_lock).
>>
>> ---------------------------------
>> jbd2_journal_commit_transaction()
>> ---------------------------------
>>  358         write_lock(&journal->j_state_lock);
>> ...
>>  367         spin_lock(&commit_transaction->t_handle_lock);
>>  368         while (atomic_read(&commit_transaction->t_updates)) {
>>  369                 DEFINE_WAIT(wait);
>>  370
>>  371                 prepare_to_wait(&journal->j_wait_updates, &wait,
>>  372                                         TASK_UNINTERRUPTIBLE);
>>  373                 if (atomic_read(&commit_transaction->t_updates)) {
>>  374                         spin_unlock(&commit_transaction->t_handle_lock);

> Also, t_handle_lock is used to protect the critical section.  I think
> t_handle_lock is not needed here at all.
Yes.

> 
>>  375                         write_unlock(&journal->j_state_lock);
>>  376                         schedule();
>>  377                         write_lock(&journal->j_state_lock);
>>  378                         spin_lock(&commit_transaction->t_handle_lock);
>>  379                 }
>>  380                 finish_wait(&journal->j_wait_updates, &wait);
>>  381         }
>>  382         spin_unlock(&commit_transaction->t_handle_lock);
>> ...
>>  447         wake_up(&journal->j_wait_transaction_locked);
>>  448         write_unlock(&journal->j_state_lock);
>>
>> * The operation of t_updates is protected by write_lock(&journal->j_state_lock)
>>  and spin_lock(&transaction->t_handle_lock).
>>
>> -------------------
>> jbd2_journal_stop()
>> -------------------
>> ...
>> 1452         if (atomic_dec_and_test(&transaction->t_updates)) {
>> 1453                 wake_up(&journal->j_wait_updates);
>> 1454                 if (journal->j_barrier_count)
>> 1455                         wake_up(&journal->j_wait_transaction_locked);
>> 1456         }
>> 1457
>> 1458         if (wait_for_commit)
>> 1459                 err = jbd2_log_wait_commit(journal, tid);
>> 1460
>> 1461         lock_map_release(&handle->h_lockdep_map);
>> 1462
>> 1463         jbd2_free_handle(handle);
>> 1464         return err;
>> 1465 }
>>
>> * The operation of t_updates is not protected.
>>
>> Therefore,
>> - jbd2_journal_lock_updates() sleeps if transaction->t_update is 0.
>> - jbd2_journal_stop() tries to wake someone up
>> (wake_up(&journal->j_wait_updates)) after transaction->t_updates becomes 0.
>> => But jbd2_journal_lock_updates() isn't woken up and then sleeps forever,
>>   if it has not yet slept completely.
>>
>> After we run the reproducer, we can see the processes which become hang-up.
>>
>> -------------------------------------------------------------------------------
>> crash> ps | grep UN
>>  16645  23737   1  ffff88007a592b20  UN   0.0  105136    508  dd
>>  16684  23727   1  ffff88007a2a9480  UN   0.0  107152    512  chattr
>>  23722      2   1  ffff88007b496080  UN   0.0       0      0  [flush-7:0]
>>
>> [1]
>> crash> bt 16645
>> PID: 16645  TASK: ffff88007a592b20  CPU: 1   COMMAND: "dd"
>>  #0 [ffff88007b575878] __schedule at ffffffff814b9b77
>>  #1 [ffff88007b575920] schedule at ffffffff814ba3cf
>>  #2 [ffff88007b575930] start_this_handle at ffffffffa01acef5 [jbd2]
>>  #3 [ffff88007b5759f0] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
>>  #4 [ffff88007b575a30] jbd2_journal_start at ffffffffa01ad523 [jbd2]
>>  #5 [ffff88007b575a40] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
>>  #6 [ffff88007b575ab0] ext4_da_write_begin at ffffffffa038ddce [ext4]
>>  #7 [ffff88007b575b50] generic_file_buffered_write at ffffffff811005ce
>>  #8 [ffff88007b575c10] __generic_file_aio_write at ffffffff81103418
>>  #9 [ffff88007b575cd0] generic_file_aio_write at ffffffff811036b6
>> #10 [ffff88007b575d50] ext4_file_write at ffffffffa0386ee9 [ext4]
>> #11 [ffff88007b575de0] do_sync_write at ffffffff8115de52
>> #12 [ffff88007b575f00] vfs_write at ffffffff8115e3e8
>> #13 [ffff88007b575f30] sys_write at ffffffff8115e5b1
>> #14 [ffff88007b575f80] system_call_fastpath at ffffffff814c3f42
>>    RIP: 00000034930d4230  RSP: 00007fffd8176980  RFLAGS: 00000206
>>    RAX: 0000000000000001  RBX: ffffffff814c3f42  RCX: 0000000000000010
>>    RDX: 0000000000001000  RSI: 0000000001ac6000  RDI: 0000000000000001
>>    RBP: 0000000001ac6000   R8: 0000000000000003   R9: 0000000000040000
>>    R10: 0000000000003003  R11: 0000000000000246  R12: 0000000001ac5fff
>>    R13: 0000000000000000  R14: 0000000000001000  R15: 0000000000000246
>>    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
>>
>> [2]
>> crash> bt 16684
>> PID: 16684  TASK: ffff88007a2a9480  CPU: 1   COMMAND: "chattr"
>>  #0 [ffff88007af87ca8] __schedule at ffffffff814b9b77
>>  #1 [ffff88007af87d50] schedule at ffffffff814ba3cf
>>  #2 [ffff88007af87d60] jbd2_journal_lock_updates at ffffffffa01acca6 [jbd2]
>>  #3 [ffff88007af87de0] ext4_change_inode_journal_flag at ffffffffa038faf1 [ext4]
>>  #4 [ffff88007af87e10] ext4_ioctl at ffffffffa0392f48 [ext4]
>>  #5 [ffff88007af87ea0] do_vfs_ioctl at ffffffff8116fb8a
>>  #6 [ffff88007af87f30] sys_ioctl at ffffffff811700d1
>>  #7 [ffff88007af87f80] system_call_fastpath at ffffffff814c3f42
>>    RIP: 00000034930d95f7  RSP: 00007fff8830e0d8  RFLAGS: 00010213
>>    RAX: 0000000000000010  RBX: ffffffff814c3f42  RCX: 00000034930d95f7
>>    RDX: 00007fff8830e17c  RSI: 0000000040086602  RDI: 0000000000000003
>>    RBP: 0000000000084000   R8: 0000000000000000   R9: 0000000000000001
>>    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff8830f8db
>>    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
>>    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b
>>
>> [3]
>> crash> bt 23722
>> PID: 23722  TASK: ffff88007b496080  CPU: 1   COMMAND: "flush-7:0"
>>  #0 [ffff88007b4f7840] __schedule at ffffffff814b9b77
>>  #1 [ffff88007b4f78e8] schedule at ffffffff814ba3cf
>>  #2 [ffff88007b4f78f8] start_this_handle at ffffffffa01acef5 [jbd2]
>>  #3 [ffff88007b4f79b8] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
>>  #4 [ffff88007b4f79f8] jbd2_journal_start at ffffffffa01ad523 [jbd2]
>>  #5 [ffff88007b4f7a08] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
>>  #6 [ffff88007b4f7a78] ext4_da_writepages at ffffffffa0390b6d [ext4]
>>  #7 [ffff88007b4f7bc8] do_writepages at ffffffff8110c261
>>  #8 [ffff88007b4f7bd8] writeback_single_inode at ffffffff81185830
>>  #9 [ffff88007b4f7c38] writeback_sb_inodes at ffffffff81185e21
>> #10 [ffff88007b4f7ce8] __writeback_inodes_wb at ffffffff81185f7e
>> #11 [ffff88007b4f7d38] wb_writeback at ffffffff81186223
>> #12 [ffff88007b4f7db8] wb_do_writeback at ffffffff811864f9
>> #13 [ffff88007b4f7e58] bdi_writeback_thread at ffffffff811865fa
>> #14 [ffff88007b4f7ee8] kthread at ffffffff810828c6
>> #15 [ffff88007b4f7f48] kernel_thread_helper at ffffffff814c60b4
>>
>> Step to reproduce a hang-up(figure):
>>  =======================================+=======================================
>>  ( "chattr -j" process PID: 16684 )     |  ( a certain process )
>> t ======================================+=======================================
>> i                                       |
>> m                                       +---------------------------------------
>> e                                       |jbd2_journal_start()
>> |                                       +---------------------------------------
>> |                                       | . . .
>> |                                       | t_updates++ // t_updates = 1
>> |                                       | . . .
>> |                                       +---------------------------------------
>> |                                        ...
>> |---------------------------------------+---------------------------------------
>> |jbd2_journal_lock_updates()            |jbd2_journal_stop()
>> |---------------------------------------+---------------------------------------
>> | write_lock(&journal->j_state_lock)    |    .
>> | ++journal->j_barrier_count            |    .
>> | spin_lock(&tran->t_handle_lock)       |    .
>> | atomic_read(&tran->t_updates) //not 0 |
>> |                                       | atomic_dec_and_test(&tran->t_updates)
>> |                                       |    // t_updates = 0
>> |                                       | wake_up(&journal->j_wait_updates)
>> | prepare_to_wait()                     |    // no process is woken up.
>> | spin_unlock(&tran->t_handle_lock)     + wake_up(&journal->j_wait_tran_locked)
>> | write_unlock(&journal->j_state_lock)  |    // no process is woken up.
>> | schedule() // never return            +---------------------------------------
>> |                                       |  ( "dd" PID: 16645 or "flush-7:0" )
>> |                                       +=======================================
>> |                                       |start_this_handle()
>> |                                       +---------------------------------------
>> |                                       | . . .
>> |                                       | read_lock(&journal->j_state_lock)
>> |                                       | if (journal->j_barrier_count)
>> |                                       |  read_unlock(&journal->j_state_lock)
>> |                                       |  wait_event(journal->j_wait_tran_locked)
>> v                                       |  // never return
>>
>> Therefore, jbd2_journal_stop() needs spin_lock(&tran->t_handle_lock) or
>> read_lock(&journal->j_state_lock) to synchronize trans->t_updates with other
>> operators.
>> ===============================================================================
>>
>> Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
>> ---
>>  fs/jbd2/transaction.c |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index a0e41a4..d48f395 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1448,12 +1448,14 @@ int jbd2_journal_stop(handle_t *handle)
>>         * once we do this, we must not dereference transaction
>>         * pointer again.
>>         */
>> +       read_lock(&journal->j_state_lock);
>>        tid = transaction->t_tid;
>>        if (atomic_dec_and_test(&transaction->t_updates)) {
>>                wake_up(&journal->j_wait_updates);
>>                if (journal->j_barrier_count)

> oops, j_barrier_count is not protected too in original code.  The
> problem does not exists in patched code any more:-)
OK. 
I will add the notice into the description which I will rewrite.

Best Regards,
Toshiyuki Okajima


> 
> 
> Yongqiang.
>>                        wake_up(&journal->j_wait_transaction_locked);
>>        }
>> +       read_unlock(&journal->j_state_lock);
>>
>>        if (wait_for_commit)
>>                err = jbd2_log_wait_commit(journal, tid);
>> --
>> 1.5.5.6
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Toshiyuki Okajima - Dec. 22, 2011, 11:36 a.m.
Hi.

I found a bug by executing the following reproducer.  The reproducer is for a 
confirmation of 
 "[PATCH 1/5] ext4: allocate delalloc blocks before changing journal mode".
Therefore, without its patch, the other problem (panic) which the patch 
describes can happen more frequently:
-------------------------------------------------------------------------------
#!/bin/sh

date
LOOP=100000
dd if=/dev/zero of=/tmp/ext4.img bs=1k count=1000k
/sbin/mkfs.ext4 -Fq /tmp/ext4.img
mount -o loop /tmp/ext4.img /mnt
rm -f /mnt/file
echo "0" > /mnt/file
(while [ 1 ]; do dd if=/dev/zero of=/mnt/file2 bs=4k > /dev/null 2>&1; done) &
PID=$!
for ((i=0; i < LOOP; i++));
do
        DATE=$(date)
        echo -n "[LOOP $i] $DATE"
        if ((i%2 == 0));
        then
                chattr +j /mnt/file
        else
                chattr -j /mnt/file
        fi
        echo "0" >> /mnt/file
done
kill -9 $PID
rm -f /mnt/file*
umount /mnt
exit 0
-------------------------------------------------------------------------------
Though I ran it, it never ended forever.
(At one of my tries to reproduce the hangup, the reproducer stopped in 6
hours.)

Because jbd2_journal_lock_updates() never returns from schedule() after 
prepare_to_wait(). 
(Its filesystem becomes a hangup.)

The detailed explanation is as follows:
===============================================================================
Current implementation of jbd2_journal_stop() has a bug not to synchronize 
a critical code section of it with others,
 - jbd2_journal_lock_updates()
 - start_this_handle()
 - jbd2_journal_commit_transaction().

This bug was derived from commit: c35a56a090eacefca07afeb994029b57d8dd8025 and
 commit: 8dd420466c7bfc459fa04680bd5690bfc41a4553.

The source codes and critical code sections of them are as follows:
---------------------------
jbd2_journal_lock_updates()
---------------------------
 509         write_lock(&journal->j_state_lock);
--- critical code section ------------------------------------------------------
 510         ++journal->j_barrier_count;
...
 513         while (1) {
...
 519                 spin_lock(&transaction->t_handle_lock);
 520                 if (!atomic_read(&transaction->t_updates)) {
 521                         spin_unlock(&transaction->t_handle_lock);
 522                         break;
 523                 }
 524                 prepare_to_wait(&journal->j_wait_updates, &wait,
 525                                 TASK_UNINTERRUPTIBLE);
 526                 spin_unlock(&transaction->t_handle_lock);
 527                 write_unlock(&journal->j_state_lock);
 528                 schedule();
 529                 finish_wait(&journal->j_wait_updates, &wait);
 530                 write_lock(&journal->j_state_lock);
 531         }
-------------------------------------------------------------------------------
 532         write_unlock(&journal->j_state_lock);

* The critical code section is from line 510 to line 531
     referencer) protected by write_lock(j_state_lock) and 
	spin_lock(t_handle_lock)

-------------------
start_this_handle()
-------------------
 160         read_lock(&journal->j_state_lock);
--- critical code section ------------------------------------------------------
...
 278         atomic_inc(&transaction->t_updates);
...
-------------------------------------------------------------------------------
 284         read_unlock(&journal->j_state_lock);

* The critical code sections is in line 278 protected by read_lock(j_state_lock)

---------------------------------
jbd2_journal_commit_transaction()
---------------------------------
 358         write_lock(&journal->j_state_lock);
--- critical code section ------------------------------------------------------
...
 367         spin_lock(&commit_transaction->t_handle_lock);
 368         while (atomic_read(&commit_transaction->t_updates)) {
 369                 DEFINE_WAIT(wait);
 370 
 371                 prepare_to_wait(&journal->j_wait_updates, &wait,
 372                                         TASK_UNINTERRUPTIBLE);
 373                 if (atomic_read(&commit_transaction->t_updates)) {
 374                         spin_unlock(&commit_transaction->t_handle_lock);
 375                         write_unlock(&journal->j_state_lock);
 376                         schedule();
 377                         write_lock(&journal->j_state_lock);
 378                         spin_lock(&commit_transaction->t_handle_lock);
 379                 }
 380                 finish_wait(&journal->j_wait_updates, &wait);
 381         }
 382         spin_unlock(&commit_transaction->t_handle_lock);
...
 447         wake_up(&journal->j_wait_transaction_locked);
-------------------------------------------------------------------------------
 448         write_unlock(&journal->j_state_lock);

* The critical code sections is from line 367 to line 447 protected by 
  write_lock(&journal->j_state_lock) and spin_lock(&transaction->t_handle_lock).

-------------------
jbd2_journal_stop()
-------------------
...
--- critical code section ------------------------------------------------------
1452         if (atomic_dec_and_test(&transaction->t_updates)) {
1453                 wake_up(&journal->j_wait_updates);
1454                 if (journal->j_barrier_count)
1455                         wake_up(&journal->j_wait_transaction_locked);
1456         }
-------------------------------------------------------------------------------
1457 
1458         if (wait_for_commit)
1459                 err = jbd2_log_wait_commit(journal, tid);
1460 
1461         lock_map_release(&handle->h_lockdep_map);
1462 
1463         jbd2_free_handle(handle);
1464         return err;
1465 }

* The critical code sections is from line 1452 to line 1456. However,
 it isn't protected by no exclusive control.

Therefore, 
- jbd2_journal_lock_updates() sleeps if transaction->t_update is 0.
- jbd2_journal_stop() tries to wake someone up 
(wake_up(&journal->j_wait_updates)) after transaction->t_updates becomes 0. 
=> But jbd2_journal_lock_updates() isn't woken up and then sleeps forever,
   if it has not yet slept completely (before executing schedule()).

After we run the reproducer, we can see the processes which become hangup:
-------------------------------------------------------------------------------
crash> ps | grep UN
  16645  23737   1  ffff88007a592b20  UN   0.0  105136    508  dd
  16684  23727   1  ffff88007a2a9480  UN   0.0  107152    512  chattr
  23722      2   1  ffff88007b496080  UN   0.0       0      0  [flush-7:0]

[1]
crash> bt 16645
PID: 16645  TASK: ffff88007a592b20  CPU: 1   COMMAND: "dd"
 #0 [ffff88007b575878] __schedule at ffffffff814b9b77
 #1 [ffff88007b575920] schedule at ffffffff814ba3cf
 #2 [ffff88007b575930] start_this_handle at ffffffffa01acef5 [jbd2]
 #3 [ffff88007b5759f0] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
 #4 [ffff88007b575a30] jbd2_journal_start at ffffffffa01ad523 [jbd2]
 #5 [ffff88007b575a40] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
 #6 [ffff88007b575ab0] ext4_da_write_begin at ffffffffa038ddce [ext4]
 #7 [ffff88007b575b50] generic_file_buffered_write at ffffffff811005ce
 #8 [ffff88007b575c10] __generic_file_aio_write at ffffffff81103418
 #9 [ffff88007b575cd0] generic_file_aio_write at ffffffff811036b6
#10 [ffff88007b575d50] ext4_file_write at ffffffffa0386ee9 [ext4]
#11 [ffff88007b575de0] do_sync_write at ffffffff8115de52
#12 [ffff88007b575f00] vfs_write at ffffffff8115e3e8
#13 [ffff88007b575f30] sys_write at ffffffff8115e5b1
#14 [ffff88007b575f80] system_call_fastpath at ffffffff814c3f42
    RIP: 00000034930d4230  RSP: 00007fffd8176980  RFLAGS: 00000206
    RAX: 0000000000000001  RBX: ffffffff814c3f42  RCX: 0000000000000010
    RDX: 0000000000001000  RSI: 0000000001ac6000  RDI: 0000000000000001
    RBP: 0000000001ac6000   R8: 0000000000000003   R9: 0000000000040000
    R10: 0000000000003003  R11: 0000000000000246  R12: 0000000001ac5fff
    R13: 0000000000000000  R14: 0000000000001000  R15: 0000000000000246
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

[2]
crash> bt 16684
PID: 16684  TASK: ffff88007a2a9480  CPU: 1   COMMAND: "chattr"
 #0 [ffff88007af87ca8] __schedule at ffffffff814b9b77
 #1 [ffff88007af87d50] schedule at ffffffff814ba3cf
 #2 [ffff88007af87d60] jbd2_journal_lock_updates at ffffffffa01acca6 [jbd2]
 #3 [ffff88007af87de0] ext4_change_inode_journal_flag at ffffffffa038faf1 [ext4]
 #4 [ffff88007af87e10] ext4_ioctl at ffffffffa0392f48 [ext4]
 #5 [ffff88007af87ea0] do_vfs_ioctl at ffffffff8116fb8a
 #6 [ffff88007af87f30] sys_ioctl at ffffffff811700d1
 #7 [ffff88007af87f80] system_call_fastpath at ffffffff814c3f42
    RIP: 00000034930d95f7  RSP: 00007fff8830e0d8  RFLAGS: 00010213
    RAX: 0000000000000010  RBX: ffffffff814c3f42  RCX: 00000034930d95f7
    RDX: 00007fff8830e17c  RSI: 0000000040086602  RDI: 0000000000000003
    RBP: 0000000000084000   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff8830f8db
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

[3]
crash> bt 23722
PID: 23722  TASK: ffff88007b496080  CPU: 1   COMMAND: "flush-7:0"
 #0 [ffff88007b4f7840] __schedule at ffffffff814b9b77
 #1 [ffff88007b4f78e8] schedule at ffffffff814ba3cf
 #2 [ffff88007b4f78f8] start_this_handle at ffffffffa01acef5 [jbd2]
 #3 [ffff88007b4f79b8] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
 #4 [ffff88007b4f79f8] jbd2_journal_start at ffffffffa01ad523 [jbd2]
 #5 [ffff88007b4f7a08] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
 #6 [ffff88007b4f7a78] ext4_da_writepages at ffffffffa0390b6d [ext4]
 #7 [ffff88007b4f7bc8] do_writepages at ffffffff8110c261
 #8 [ffff88007b4f7bd8] writeback_single_inode at ffffffff81185830
 #9 [ffff88007b4f7c38] writeback_sb_inodes at ffffffff81185e21
#10 [ffff88007b4f7ce8] __writeback_inodes_wb at ffffffff81185f7e
#11 [ffff88007b4f7d38] wb_writeback at ffffffff81186223
#12 [ffff88007b4f7db8] wb_do_writeback at ffffffff811864f9
#13 [ffff88007b4f7e58] bdi_writeback_thread at ffffffff811865fa
#14 [ffff88007b4f7ee8] kthread at ffffffff810828c6
#15 [ffff88007b4f7f48] kernel_thread_helper at ffffffff814c60b4

Step to reproduce a hangup(figure):
 =======================================+=======================================
 ( "chattr -j" process PID: 16684 )     |  ( a certain process )
t ======================================+=======================================
i                                       |
m                                       +---------------------------------------
e                                       |jbd2_journal_start()
|                                       +---------------------------------------
|                                       | . . .
|                                       | t_updates++ // t_updates = 1
|                                       + . . .
|                                       +---------------------------------------
|                                        ...
|---------------------------------------+---------------------------------------
|jbd2_journal_lock_updates()            |jbd2_journal_stop()
|---------------------------------------+---------------------------------------
| write_lock(&journal->j_state_lock)    |    .
| ++journal->j_barrier_count            |    .
| spin_lock(&tran->t_handle_lock)       |    .
| atomic_read(&tran->t_updates) //not 0 |
|                                       | atomic_dec_and_test(&tran->t_updates)
|                                       |    // t_updates = 0
|                                       | wake_up(&journal->j_wait_updates)
| prepare_to_wait()                     |    // no process is woken up.
| spin_unlock(&tran->t_handle_lock)     + wake_up(&journal->j_wait_tran_locked)
| write_unlock(&journal->j_state_lock)  |    // no process is woken up.
| schedule() // never return            +---------------------------------------
|                                       |  ( "dd" PID: 16645 or "flush-7:0" )
|                                       +=======================================
|                                       |start_this_handle()
|                                       +---------------------------------------
|                                       | . . .
|                                       | read_lock(&journal->j_state_lock)
|                                       | if (journal->j_barrier_count)
|                                       |  read_unlock(&journal->j_state_lock)
|                                       |  wait_event(jrnl->j_wait_tran_locked)
v                                       |  // never return

Therefore, jbd2_journal_stop() needs spin_lock(&tran->t_handle_lock) or 
read_lock(&journal->j_state_lock) to synchronize its critical code section with 
others.  Besides, I noticed that t_handle_lock exclusive control had already 
been unnecessary when I examined all the functions which used t_handle_lock. 
So, read_lock(&journal->j_state_lock) exclusive control is good for this fix.

Regards,
Toshiyuki Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

===============================================================================
Current implementation of jbd2_journal_stop() has a bug not to synchronize 
t_updates with other operators, jbd2_journal_lock_updates(), start_this_handle()
and jbd2_journal_commit_transaction().

This bug was derived from commit: c35a56a090eacefca07afeb994029b57d8dd8025 and
 commit: 8dd420466c7bfc459fa04680bd5690bfc41a4553.

---------------------------
jbd2_journal_lock_updates()
---------------------------
 509         write_lock(&journal->j_state_lock);
 510         ++journal->j_barrier_count;
...
 513         while (1) {
...
 519                 spin_lock(&transaction->t_handle_lock);
 520                 if (!atomic_read(&transaction->t_updates)) {
 521                         spin_unlock(&transaction->t_handle_lock);
 522                         break;
 523                 }
 524                 prepare_to_wait(&journal->j_wait_updates, &wait,
 525                                 TASK_UNINTERRUPTIBLE);
 526                 spin_unlock(&transaction->t_handle_lock);
 527                 write_unlock(&journal->j_state_lock);
 528                 schedule();
 529                 finish_wait(&journal->j_wait_updates, &wait);
 530                 write_lock(&journal->j_state_lock);
 531         }
 532         write_unlock(&journal->j_state_lock);

* The operation of t_updates is protected by write_lock(&journal->j_state_lock)
 and spin_lock(&transaction->t_handle_lock).

-------------------
start_this_handle()
-------------------
 160         read_lock(&journal->j_state_lock);
...
 278         atomic_inc(&transaction->t_updates);
...
 284         read_unlock(&journal->j_state_lock);

* The operation of t_updates is protected by read_lock(&journal->j_state_lock).

---------------------------------
jbd2_journal_commit_transaction()
---------------------------------
 358         write_lock(&journal->j_state_lock);
...
 367         spin_lock(&commit_transaction->t_handle_lock);
 368         while (atomic_read(&commit_transaction->t_updates)) {
 369                 DEFINE_WAIT(wait);
 370 
 371                 prepare_to_wait(&journal->j_wait_updates, &wait,
 372                                         TASK_UNINTERRUPTIBLE);
 373                 if (atomic_read(&commit_transaction->t_updates)) {
 374                         spin_unlock(&commit_transaction->t_handle_lock);
 375                         write_unlock(&journal->j_state_lock);
 376                         schedule();
 377                         write_lock(&journal->j_state_lock);
 378                         spin_lock(&commit_transaction->t_handle_lock);
 379                 }
 380                 finish_wait(&journal->j_wait_updates, &wait);
 381         }
 382         spin_unlock(&commit_transaction->t_handle_lock);
...
 447         wake_up(&journal->j_wait_transaction_locked);
 448         write_unlock(&journal->j_state_lock);

* The operation of t_updates is protected by write_lock(&journal->j_state_lock)
 and spin_lock(&transaction->t_handle_lock).

-------------------
jbd2_journal_stop()
-------------------
...
1452         if (atomic_dec_and_test(&transaction->t_updates)) {
1453                 wake_up(&journal->j_wait_updates);
1454                 if (journal->j_barrier_count)
1455                         wake_up(&journal->j_wait_transaction_locked);
1456         }
1457 
1458         if (wait_for_commit)
1459                 err = jbd2_log_wait_commit(journal, tid);
1460 
1461         lock_map_release(&handle->h_lockdep_map);
1462 
1463         jbd2_free_handle(handle);
1464         return err;
1465 }

* The operation of t_updates is not protected.

Therefore, 
- jbd2_journal_lock_updates() sleeps if transaction->t_update is 0.
- jbd2_journal_stop() tries to wake someone up 
(wake_up(&journal->j_wait_updates)) after transaction->t_updates becomes 0. 
=> But jbd2_journal_lock_updates() isn't woken up and then sleeps forever,
   if it has not yet slept completely.

After we run the reproducer, we can see the processes which become hang-up.

-------------------------------------------------------------------------------
crash> ps | grep UN
  16645  23737   1  ffff88007a592b20  UN   0.0  105136    508  dd
  16684  23727   1  ffff88007a2a9480  UN   0.0  107152    512  chattr
  23722      2   1  ffff88007b496080  UN   0.0       0      0  [flush-7:0]

[1]
crash> bt 16645
PID: 16645  TASK: ffff88007a592b20  CPU: 1   COMMAND: "dd"
 #0 [ffff88007b575878] __schedule at ffffffff814b9b77
 #1 [ffff88007b575920] schedule at ffffffff814ba3cf
 #2 [ffff88007b575930] start_this_handle at ffffffffa01acef5 [jbd2]
 #3 [ffff88007b5759f0] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
 #4 [ffff88007b575a30] jbd2_journal_start at ffffffffa01ad523 [jbd2]
 #5 [ffff88007b575a40] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
 #6 [ffff88007b575ab0] ext4_da_write_begin at ffffffffa038ddce [ext4]
 #7 [ffff88007b575b50] generic_file_buffered_write at ffffffff811005ce
 #8 [ffff88007b575c10] __generic_file_aio_write at ffffffff81103418
 #9 [ffff88007b575cd0] generic_file_aio_write at ffffffff811036b6
#10 [ffff88007b575d50] ext4_file_write at ffffffffa0386ee9 [ext4]
#11 [ffff88007b575de0] do_sync_write at ffffffff8115de52
#12 [ffff88007b575f00] vfs_write at ffffffff8115e3e8
#13 [ffff88007b575f30] sys_write at ffffffff8115e5b1
#14 [ffff88007b575f80] system_call_fastpath at ffffffff814c3f42
    RIP: 00000034930d4230  RSP: 00007fffd8176980  RFLAGS: 00000206
    RAX: 0000000000000001  RBX: ffffffff814c3f42  RCX: 0000000000000010
    RDX: 0000000000001000  RSI: 0000000001ac6000  RDI: 0000000000000001
    RBP: 0000000001ac6000   R8: 0000000000000003   R9: 0000000000040000
    R10: 0000000000003003  R11: 0000000000000246  R12: 0000000001ac5fff
    R13: 0000000000000000  R14: 0000000000001000  R15: 0000000000000246
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

[2]
crash> bt 16684
PID: 16684  TASK: ffff88007a2a9480  CPU: 1   COMMAND: "chattr"
 #0 [ffff88007af87ca8] __schedule at ffffffff814b9b77
 #1 [ffff88007af87d50] schedule at ffffffff814ba3cf
 #2 [ffff88007af87d60] jbd2_journal_lock_updates at ffffffffa01acca6 [jbd2]
 #3 [ffff88007af87de0] ext4_change_inode_journal_flag at ffffffffa038faf1 [ext4]
 #4 [ffff88007af87e10] ext4_ioctl at ffffffffa0392f48 [ext4]
 #5 [ffff88007af87ea0] do_vfs_ioctl at ffffffff8116fb8a
 #6 [ffff88007af87f30] sys_ioctl at ffffffff811700d1
 #7 [ffff88007af87f80] system_call_fastpath at ffffffff814c3f42
    RIP: 00000034930d95f7  RSP: 00007fff8830e0d8  RFLAGS: 00010213
    RAX: 0000000000000010  RBX: ffffffff814c3f42  RCX: 00000034930d95f7
    RDX: 00007fff8830e17c  RSI: 0000000040086602  RDI: 0000000000000003
    RBP: 0000000000084000   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff8830f8db
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

[3]
crash> bt 23722
PID: 23722  TASK: ffff88007b496080  CPU: 1   COMMAND: "flush-7:0"
 #0 [ffff88007b4f7840] __schedule at ffffffff814b9b77
 #1 [ffff88007b4f78e8] schedule at ffffffff814ba3cf
 #2 [ffff88007b4f78f8] start_this_handle at ffffffffa01acef5 [jbd2]
 #3 [ffff88007b4f79b8] jbd2__journal_start at ffffffffa01ad4cb [jbd2]
 #4 [ffff88007b4f79f8] jbd2_journal_start at ffffffffa01ad523 [jbd2]
 #5 [ffff88007b4f7a08] ext4_journal_start_sb at ffffffffa03a7d83 [ext4]
 #6 [ffff88007b4f7a78] ext4_da_writepages at ffffffffa0390b6d [ext4]
 #7 [ffff88007b4f7bc8] do_writepages at ffffffff8110c261
 #8 [ffff88007b4f7bd8] writeback_single_inode at ffffffff81185830
 #9 [ffff88007b4f7c38] writeback_sb_inodes at ffffffff81185e21
#10 [ffff88007b4f7ce8] __writeback_inodes_wb at ffffffff81185f7e
#11 [ffff88007b4f7d38] wb_writeback at ffffffff81186223
#12 [ffff88007b4f7db8] wb_do_writeback at ffffffff811864f9
#13 [ffff88007b4f7e58] bdi_writeback_thread at ffffffff811865fa
#14 [ffff88007b4f7ee8] kthread at ffffffff810828c6
#15 [ffff88007b4f7f48] kernel_thread_helper at ffffffff814c60b4

Step to reproduce a hang-up(figure):
 =======================================+=======================================
 ( "chattr -j" process PID: 16684 )     |  ( a certain process )
t ======================================+=======================================
i                                       |
m                                       +---------------------------------------
e                                       |jbd2_journal_start()
|                                       +---------------------------------------
|                                       | . . .
|                                       | t_updates++ // t_updates = 1
|                                       | . . .
|                                       +---------------------------------------
|                                        ...
|---------------------------------------+---------------------------------------
|jbd2_journal_lock_updates()            |jbd2_journal_stop()
|---------------------------------------+---------------------------------------
| write_lock(&journal->j_state_lock)    |    .
| ++journal->j_barrier_count            |    .
| spin_lock(&tran->t_handle_lock)       |    .
| atomic_read(&tran->t_updates) //not 0 |
|                                       | atomic_dec_and_test(&tran->t_updates)
|                                       |    // t_updates = 0
|                                       | wake_up(&journal->j_wait_updates)
| prepare_to_wait()                     |    // no process is woken up.
| spin_unlock(&tran->t_handle_lock)     + wake_up(&journal->j_wait_tran_locked)
| write_unlock(&journal->j_state_lock)  |    // no process is woken up.
| schedule() // never return            +---------------------------------------
|                                       |  ( "dd" PID: 16645 or "flush-7:0" )
|                                       +=======================================
|                                       |start_this_handle()
|                                       +---------------------------------------
|                                       | . . .
|                                       | read_lock(&journal->j_state_lock)
|                                       | if (journal->j_barrier_count)
|                                       |  read_unlock(&journal->j_state_lock)
|                                       |  wait_event(journal->j_wait_tran_locked)
v                                       |  // never return

Therefore, jbd2_journal_stop() needs spin_lock(&tran->t_handle_lock) or 
read_lock(&journal->j_state_lock) to synchronize trans->t_updates with other
operators.
===============================================================================

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/jbd2/transaction.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a0e41a4..d48f395 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1448,12 +1448,14 @@  int jbd2_journal_stop(handle_t *handle)
 	 * once we do this, we must not dereference transaction
 	 * pointer again.
 	 */
+	read_lock(&journal->j_state_lock);
 	tid = transaction->t_tid;
 	if (atomic_dec_and_test(&transaction->t_updates)) {
 		wake_up(&journal->j_wait_updates);
 		if (journal->j_barrier_count)
 			wake_up(&journal->j_wait_transaction_locked);
 	}
+	read_unlock(&journal->j_state_lock);
 
 	if (wait_for_commit)
 		err = jbd2_log_wait_commit(journal, tid);