jbd2: wake up journal waiters in FIFO order, not LIFO

Message ID 20220907165959.1137482-1-alexey.lyashkov@gmail.com
State Awaiting Upstream
Series jbd2: wake up journal waiters in FIFO order, not LIFO

Commit Message

Alexey Lyahkov Sept. 7, 2022, 4:59 p.m. UTC
From: Andrew Perepechko <anserper@ya.ru>

LIFO wakeup order is unfair and sometimes leads to a journal
user not being able to get a journal handle for hundreds of
transactions in a row.

FIFO wakeup can make things more fair.

Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com>
---
 fs/jbd2/commit.c      | 2 +-
 fs/jbd2/transaction.c | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Ritesh Harjani (IBM) Sept. 8, 2022, 5:46 a.m. UTC | #1
On 22/09/07 07:59PM, Alexey Lyashkov wrote:
> From: Andrew Perepechko <anserper@ya.ru>
> 
> LIFO wakeup order is unfair and sometimes leads to a journal
> user not being able to get a journal handle for hundreds of
> transactions in a row.
> 
> FIFO wakeup can make things more fair.

prepare_to_wait() will always add the task to the head of the wait list,
while prepare_to_wait_exclusive() will add the task to the tail, since all of the
exclusive tasks are added to the tail.
wake_up() will wake up all non-exclusive waiters plus a single exclusive one,
whereas
wake_up_all() will wake up all waiters regardless.

So your change does make the ordering FIFO, in which the task that came in
first will be woken up first.
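
To make the ordering concrete, here is a minimal sketch of the exclusive-waiter
pattern the patch switches to (my own illustration of the wait API, not code
taken from the patch):

	DEFINE_WAIT(wait);

	/* exclusive waiters are appended at the tail of the wait list */
	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
				  TASK_UNINTERRUPTIBLE);
	schedule();	/* sleep until the commit path wakes the queue */
	finish_wait(&journal->j_wait_transaction_locked, &wait);

wake_up_all() then walks the wait list from the head, so the exclusive waiter
that queued earliest is woken first, which is what gives the FIFO behaviour.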

Although I was wondering about 2 things - 
1. In what scenario was this observed to become a problem/bottleneck for you?
Could you kindly give more details of your problem?

2. What about the start_this_handle() path, where we call wait_event()
waiting for j_barrier_count to become 0? I guess that doesn't happen often.
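
For reference, as far as I can see that path uses a plain non-exclusive
wait_event(), roughly:

	wait_event(journal->j_wait_transaction_locked,
		   journal->j_barrier_count == 0);

so those waiters should get woken by wake_up() and wake_up_all() alike.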

-ritesh


> 
> Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com>
> ---
>  fs/jbd2/commit.c      | 2 +-
>  fs/jbd2/transaction.c | 6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b2b2bc9b88d9..ec2b55879e3a 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -570,7 +570,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	journal->j_running_transaction = NULL;
>  	start_time = ktime_get();
>  	commit_transaction->t_log_start = journal->j_head;
> -	wake_up(&journal->j_wait_transaction_locked);
> +	wake_up_all(&journal->j_wait_transaction_locked);
>  	write_unlock(&journal->j_state_lock);
>  
>  	jbd2_debug(3, "JBD2: commit phase 2a\n");
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index e1be93ccd81c..6a404ac1c178 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -168,7 +168,7 @@ static void wait_transaction_locked(journal_t *journal)
>  	int need_to_start;
>  	tid_t tid = journal->j_running_transaction->t_tid;
>  
> -	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
> +	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
>  			TASK_UNINTERRUPTIBLE);
>  	need_to_start = !tid_geq(journal->j_commit_request, tid);
>  	read_unlock(&journal->j_state_lock);
> @@ -194,7 +194,7 @@ static void wait_transaction_switching(journal_t *journal)
>  		read_unlock(&journal->j_state_lock);
>  		return;
>  	}
> -	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
> +	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
>  			TASK_UNINTERRUPTIBLE);
>  	read_unlock(&journal->j_state_lock);
>  	/*
> @@ -920,7 +920,7 @@ void jbd2_journal_unlock_updates (journal_t *journal)
>  	write_lock(&journal->j_state_lock);
>  	--journal->j_barrier_count;
>  	write_unlock(&journal->j_state_lock);
> -	wake_up(&journal->j_wait_transaction_locked);
> +	wake_up_all(&journal->j_wait_transaction_locked);
>  }
>  
>  static void warn_dirty_buffer(struct buffer_head *bh)
> -- 
> 2.31.1
>
Alexey Lyahkov Sept. 8, 2022, 5:51 a.m. UTC | #2
Hi Ritesh,

This was hit on a Lustre OSS node when we had tons of short writes with sync (journal commit) running in parallel.
Each write was done from its own thread (roughly 1k-2k threads in parallel).
It caused a situation where only a few threads got the wakeup and managed to join the transaction before it became T_LOCKED.
In our observation, the handles that were woken were those at the head of the list, i.e. the ones added most recently, while the old handles stayed in the list.
It caused soft lockup messages on the console.

Alex


> On 8 Sep 2022, at 08:46, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> 
> On 22/09/07 07:59PM, Alexey Lyashkov wrote:
>> From: Andrew Perepechko <anserper@ya.ru>
>> 
>> LIFO wakeup order is unfair and sometimes leads to a journal
>> user not being able to get a journal handle for hundreds of
>> transactions in a row.
>> 
>> FIFO wakeup can make things more fair.
> 
> prepare_to_wait() will always add the task to the head of the wait list,
> while prepare_to_wait_exclusive() will add the task to the tail, since all of the
> exclusive tasks are added to the tail.
> wake_up() will wake up all non-exclusive waiters plus a single exclusive one,
> whereas
> wake_up_all() will wake up all waiters regardless.
>
> So your change does make the ordering FIFO, in which the task that came in
> first will be woken up first.
> 
> Although I was wondering about 2 things - 
> 1. In what scenario was this observed to become a problem/bottleneck for you?
> Could you kindly give more details of your problem?
> 
> 2. What about the start_this_handle() path, where we call wait_event()
> waiting for j_barrier_count to become 0? I guess that doesn't happen often.
> 
> -ritesh
> 
> 
>> 
>> Signed-off-by: Alexey Lyashkov <alexey.lyashkov@gmail.com>
>> ---
>> fs/jbd2/commit.c      | 2 +-
>> fs/jbd2/transaction.c | 6 +++---
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
>> index b2b2bc9b88d9..ec2b55879e3a 100644
>> --- a/fs/jbd2/commit.c
>> +++ b/fs/jbd2/commit.c
>> @@ -570,7 +570,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>> 	journal->j_running_transaction = NULL;
>> 	start_time = ktime_get();
>> 	commit_transaction->t_log_start = journal->j_head;
>> -	wake_up(&journal->j_wait_transaction_locked);
>> +	wake_up_all(&journal->j_wait_transaction_locked);
>> 	write_unlock(&journal->j_state_lock);
>> 
>> 	jbd2_debug(3, "JBD2: commit phase 2a\n");
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index e1be93ccd81c..6a404ac1c178 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -168,7 +168,7 @@ static void wait_transaction_locked(journal_t *journal)
>> 	int need_to_start;
>> 	tid_t tid = journal->j_running_transaction->t_tid;
>> 
>> -	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
>> +	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
>> 			TASK_UNINTERRUPTIBLE);
>> 	need_to_start = !tid_geq(journal->j_commit_request, tid);
>> 	read_unlock(&journal->j_state_lock);
>> @@ -194,7 +194,7 @@ static void wait_transaction_switching(journal_t *journal)
>> 		read_unlock(&journal->j_state_lock);
>> 		return;
>> 	}
>> -	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
>> +	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
>> 			TASK_UNINTERRUPTIBLE);
>> 	read_unlock(&journal->j_state_lock);
>> 	/*
>> @@ -920,7 +920,7 @@ void jbd2_journal_unlock_updates (journal_t *journal)
>> 	write_lock(&journal->j_state_lock);
>> 	--journal->j_barrier_count;
>> 	write_unlock(&journal->j_state_lock);
>> -	wake_up(&journal->j_wait_transaction_locked);
>> +	wake_up_all(&journal->j_wait_transaction_locked);
>> }
>> 
>> static void warn_dirty_buffer(struct buffer_head *bh)
>> -- 
>> 2.31.1
>>
Ritesh Harjani (IBM) Sept. 8, 2022, 6:11 a.m. UTC | #3
On 22/09/08 08:51AM, Alexey Lyahkov wrote:
> Hi Ritesh,
> 
> This was hit on a Lustre OSS node when we had tons of short writes with sync (journal commit) running in parallel.
> Each write was done from its own thread (roughly 1k-2k threads in parallel).
> It caused a situation where only a few threads got the wakeup and managed to join the transaction before it became T_LOCKED.
> In our observation, the handles that were woken were those at the head of the list, i.e. the ones added most recently, while the old handles stayed in the list.

Thanks Alexey for providing the details.

> It caused soft lockup messages on the console.

Did you mean the hung task timeout? I was wondering why there would be a soft lockup
warning, because these old handles are in a waiting state anyway, right?
Am I missing something?

-ritesh
Alexey Lyahkov Sept. 8, 2022, 8:21 a.m. UTC | #4
> On 8 Sep 2022, at 09:11, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> 
> On 22/09/08 08:51AM, Alexey Lyahkov wrote:
>> Hi Ritesh,
>> 
>> This was hit on a Lustre OSS node when we had tons of short writes with sync (journal commit) running in parallel.
>> Each write was done from its own thread (roughly 1k-2k threads in parallel).
>> It caused a situation where only a few threads got the wakeup and managed to join the transaction before it became T_LOCKED.
>> In our observation, the handles that were woken were those at the head of the list, i.e. the ones added most recently, while the old handles stayed in the list.
> 
> Thanks Alexey for providing the details.
> 
>> It caused soft lockup messages on the console.
> 
> Did you mean the hung task timeout? I was wondering why there would be a soft lockup
> warning, because these old handles are in a waiting state anyway, right?
> Am I missing something?
> 
Oh. I asked my colleagues for details. It was Lustre's internal hung-thread detector, not the kernel-side one.

[ 2221.036503] Lustre: ll_ost_io04_080: service thread pid 55122 was inactive for 80.284 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2221.036677] Pid: 55212, comm: ll_ost_io05_074 4.18.0-305.10.2.x6.1.010.19.x86_64 #1 SMP Thu Jun 30 13:42:51 MDT 2022
[ 2221.056673] Lustre: Skipped 2 previous similar messages
[ 2221.067821] Call Trace TBD:
[ 2221.067855] [<0>] wait_transaction_locked+0x89/0xc0 [jbd2]
[ 2221.099175] [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[ 2221.105266] [<0>] start_this_handle+0x10a/0x520 [jbd2]
[ 2221.110904] [<0>] jbd2__journal_start+0xea/0x1f0 [jbd2]
[ 2221.116679] [<0>] __ldiskfs_journal_start_sb+0x6e/0x130 [ldiskfs]
[ 2221.123316] [<0>] osd_trans_start+0x13b/0x4f0 [osd_ldiskfs]
[ 2221.129417] [<0>] ofd_commitrw_write+0x620/0x1830 [ofd]
[ 2221.135147] [<0>] ofd_commitrw+0x731/0xd80 [ofd]
[ 2221.140420] [<0>] obd_commitrw+0x1ac/0x370 [ptlrpc]
[ 2221.145858] [<0>] tgt_brw_write+0x1913/0x1d50 [ptlrpc]
[ 2221.151561] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[ 2221.157622] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[ 2221.164454] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[ 2221.169860] [<0>] kthread+0x116/0x130
[ 2221.174033] [<0>] ret_from_fork+0x1f/0x40


Other logs have shown that this thread could not take a handle, while other threads were able to do so many times.
The kernel detector did not trigger because the thread woke up many times, but each time it saw T_LOCKED and went back to sleep.
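
Roughly, as we understand the retry loop in start_this_handle() /
add_transaction_credits() (simplified sketch, not part of this patch):

repeat:
	/* ... take j_state_lock, look at the running transaction ... */
	if (journal->j_running_transaction->t_state == T_LOCKED) {
		wait_transaction_locked(journal);	/* one sleep/wake cycle */
		goto repeat;	/* woken up, still sees T_LOCKED, sleeps again */
	}

Each of those wakeups changes the task's context switch count, so the kernel
hung task detector never sees it as stuck.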

Alex



> -ritesh
Andrew Sept. 8, 2022, 8:28 a.m. UTC | #5
> Other logs have shown that this thread could not take a handle, while other threads were able to do so many times.
> The kernel detector did not trigger because the thread woke up many times, but each time it saw T_LOCKED and went back to sleep.
Also, the transactions are small because the workload is sync writes.
Ritesh Harjani (IBM) Sept. 8, 2022, 9:11 a.m. UTC | #6
On 22/09/08 11:21AM, Alexey Lyahkov wrote:
> 
> 
> > On 8 Sep 2022, at 09:11, Ritesh Harjani (IBM) <ritesh.list@gmail.com> wrote:
> > 
> > On 22/09/08 08:51AM, Alexey Lyahkov wrote:
> >> Hi Ritesh,
> >> 
> >> This was hit on a Lustre OSS node when we had tons of short writes with sync (journal commit) running in parallel.
> >> Each write was done from its own thread (roughly 1k-2k threads in parallel).
> >> It caused a situation where only a few threads got the wakeup and managed to join the transaction before it became T_LOCKED.
> >> In our observation, the handles that were woken were those at the head of the list, i.e. the ones added most recently, while the old handles stayed in the list.
> > 
> > Thanks Alexey for providing the details.
> > 
> >> It caused soft lockup messages on the console.
> > 
> > Did you mean the hung task timeout? I was wondering why there would be a soft lockup
> > warning, because these old handles are in a waiting state anyway, right?
> > Am I missing something?
> > 
> Oh. I asked my colleagues for details. It was Lustre's internal hung-thread detector, not the kernel-side one.

Thanks again for sharing the details. This indeed looks like a task waiting for a handle can
remain in the wait state for a long time due to the wrong wakeup order when there are many threads.

-ritesh

> 
> [ 2221.036503] Lustre: ll_ost_io04_080: service thread pid 55122 was inactive for 80.284 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
> [ 2221.036677] Pid: 55212, comm: ll_ost_io05_074 4.18.0-305.10.2.x6.1.010.19.x86_64 #1 SMP Thu Jun 30 13:42:51 MDT 2022
> [ 2221.056673] Lustre: Skipped 2 previous similar messages
> [ 2221.067821] Call Trace TBD:
> [ 2221.067855] [<0>] wait_transaction_locked+0x89/0xc0 [jbd2]
> [ 2221.099175] [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
> [ 2221.105266] [<0>] start_this_handle+0x10a/0x520 [jbd2]
> [ 2221.110904] [<0>] jbd2__journal_start+0xea/0x1f0 [jbd2]
> [ 2221.116679] [<0>] __ldiskfs_journal_start_sb+0x6e/0x130 [ldiskfs]
> [ 2221.123316] [<0>] osd_trans_start+0x13b/0x4f0 [osd_ldiskfs]
> [ 2221.129417] [<0>] ofd_commitrw_write+0x620/0x1830 [ofd]
> [ 2221.135147] [<0>] ofd_commitrw+0x731/0xd80 [ofd]
> [ 2221.140420] [<0>] obd_commitrw+0x1ac/0x370 [ptlrpc]
> [ 2221.145858] [<0>] tgt_brw_write+0x1913/0x1d50 [ptlrpc]
> [ 2221.151561] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
> [ 2221.157622] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
> [ 2221.164454] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
> [ 2221.169860] [<0>] kthread+0x116/0x130
> [ 2221.174033] [<0>] ret_from_fork+0x1f/0x40
> 
> 
> Other logs have shown that this thread could not take a handle, while other threads were able to do so many times.
> The kernel detector did not trigger because the thread woke up many times, but each time it saw T_LOCKED and went back to sleep.
> 
> Alex
> 
> 
> 
> > -ritesh
>
Ritesh Harjani (IBM) Sept. 8, 2022, 9:13 a.m. UTC | #7
On 22/09/08 11:16AM, Ritesh Harjani (IBM) wrote:
> On 22/09/07 07:59PM, Alexey Lyashkov wrote:
> > From: Andrew Perepechko <anserper@ya.ru>
> > 
> > LIFO wakeup order is unfair and sometimes leads to a journal
> > user not being able to get a journal handle for hundreds of
> > transactions in a row.
> > 
> > FIFO wakeup can make things more fair.
> 
> prepare_to_wait() will always add the task to the head of the wait list,
> while prepare_to_wait_exclusive() will add the task to the tail, since all of the
> exclusive tasks are added to the tail.
> wake_up() will wake up all non-exclusive waiters plus a single exclusive one,
> whereas
> wake_up_all() will wake up all waiters regardless.
>
> So your change does make the ordering FIFO, in which the task that came in
> first will be woken up first.

With all the other details that got discussed in the other replies,
this looks like the right thing to do.

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Patch

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b2b2bc9b88d9..ec2b55879e3a 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -570,7 +570,7 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 	journal->j_running_transaction = NULL;
 	start_time = ktime_get();
 	commit_transaction->t_log_start = journal->j_head;
-	wake_up(&journal->j_wait_transaction_locked);
+	wake_up_all(&journal->j_wait_transaction_locked);
 	write_unlock(&journal->j_state_lock);
 
 	jbd2_debug(3, "JBD2: commit phase 2a\n");
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1be93ccd81c..6a404ac1c178 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -168,7 +168,7 @@  static void wait_transaction_locked(journal_t *journal)
 	int need_to_start;
 	tid_t tid = journal->j_running_transaction->t_tid;
 
-	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
+	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
 			TASK_UNINTERRUPTIBLE);
 	need_to_start = !tid_geq(journal->j_commit_request, tid);
 	read_unlock(&journal->j_state_lock);
@@ -194,7 +194,7 @@  static void wait_transaction_switching(journal_t *journal)
 		read_unlock(&journal->j_state_lock);
 		return;
 	}
-	prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
+	prepare_to_wait_exclusive(&journal->j_wait_transaction_locked, &wait,
 			TASK_UNINTERRUPTIBLE);
 	read_unlock(&journal->j_state_lock);
 	/*
@@ -920,7 +920,7 @@  void jbd2_journal_unlock_updates (journal_t *journal)
 	write_lock(&journal->j_state_lock);
 	--journal->j_barrier_count;
 	write_unlock(&journal->j_state_lock);
-	wake_up(&journal->j_wait_transaction_locked);
+	wake_up_all(&journal->j_wait_transaction_locked);
 }
 
 static void warn_dirty_buffer(struct buffer_head *bh)