diff mbox

DIO process stuck apparently due to dioread_nolock (3.0)

Message ID 1313251371-3672-1-git-send-email-tm@tao.ma
State Superseded, archived
Headers show

Commit Message

Tao Ma Aug. 13, 2011, 4:02 p.m. UTC
From: Tao Ma <boyu.mt@taobao.com>

Hi Michael,
	could you please check whether this patch work for you?


Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
From: Tao Ma <boyu.mt@taobao.com>
Date: Sat, 13 Aug 2011 23:50:28 +0800
Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.

EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should
be done simultaneously since ext4_end_io_nolock always clear the flag and
decrease the counter in the same time.

We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so
it will go nagative and causes some process to wait forever.

Part of the patch came from Eric in his e-mail.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2

Reported-by: Michael Tokarev<mjt@tls.msk.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
---
 fs/ext4/inode.c   |    9 ++++++++-
 fs/ext4/page-io.c |    6 ++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

Comments

Michael Tokarev Aug. 14, 2011, 8:57 p.m. UTC | #1
13.08.2011 20:02, Tao Ma wrote:
> From: Tao Ma <boyu.mt@taobao.com>
> 
> Hi Michael,
> 	could you please check whether this patch work for you?

With this patch applied to 3.0.1 I can't trigger the issue anymore,
after several attempts -- the system just works as it shold be.
I'm not sure this is the right fix or it's just my testcase isn't
as good as it can be... ;)

> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
> From: Tao Ma <boyu.mt@taobao.com>
> Date: Sat, 13 Aug 2011 23:50:28 +0800
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> 
> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should
> be done simultaneously since ext4_end_io_nolock always clear the flag and
> decrease the counter in the same time.
> 
> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so
> it will go nagative and causes some process to wait forever.
> 
> Part of the patch came from Eric in his e-mail.
> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
> 
> Reported-by: Michael Tokarev<mjt@tls.msk.ru>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: Tao Ma <boyu.mt@taobao.com>

You can add my Tested-By too ;)

Thank you!

And please excuse me for so long delay replying - I was AFK whole weekend.

/mjt

>  fs/ext4/inode.c   |    9 ++++++++-
>  fs/ext4/page-io.c |    6 ++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d47264c..40c0b10 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>  		goto out;
>  	}
>  
> -	io_end->flag = EXT4_IO_END_UNWRITTEN;
> +	/*
> +	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
> +	 * but being more careful is always safe for the future change.
> +	 */
>  	inode = io_end->inode;
> +	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>  
>  	/* Add the io_end to per-inode completed io list*/
>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 430c401..78839af 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -334,8 +334,10 @@ submit_and_retry:
>  	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>  	    (io_end->pages[io_end->num_io_pages-1] != io_page))
>  		goto submit_and_retry;
> -	if (buffer_uninit(bh))
> -		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>  	io->io_end->size += bh->b_size;
>  	io->io_next_block++;
>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev Aug. 14, 2011, 9:07 p.m. UTC | #2
15.08.2011 00:57, Michael Tokarev пишет:
> 13.08.2011 20:02, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> Hi Michael,
>> 	could you please check whether this patch work for you?
> 
> With this patch applied to 3.0.1 I can't trigger the issue anymore,
> after several attempts -- the system just works as it shold be.
> I'm not sure this is the right fix or it's just my testcase isn't
> as good as it can be... ;)

Well, I found a way to trigger data corruption with this patch
applied.  I guess it's not fault of this patch, but some more
deep problem instead.

The sequence is my usual copy of an oracle database from another
place and start it.  When oracle starts doing it's direct-I/O
against its redologs, we had problem which is now solved.  But
now I do the following: I shutdown the database, rename the current
redologs out of the way and copy them back into place as new files.
And start the database again.

This time, oracle complains that the redologs contains garbage.
I can reboot the machine now, and compare old (renamed) redologs
with copies - they're indeed different.

My guess is that copy is done from the pagecache - from the old
contents of the files, somehow ignoring the (direct) writes
performed by initial database open.  But that copy is somehow
damaged now too, since even file identification is now different.

Is this new issue something that dioread_nolock supposed to create?
I mean, it isn't entirely clear what it supposed to do, it looks
somewhat hackish, but without it performance is quite bad.

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma Aug. 15, 2011, 2:36 a.m. UTC | #3
On 08/15/2011 05:07 AM, Michael Tokarev wrote:
> 15.08.2011 00:57, Michael Tokarev пишет:
>> 13.08.2011 20:02, Tao Ma wrote:
>>> From: Tao Ma <boyu.mt@taobao.com>
>>>
>>> Hi Michael,
>>> 	could you please check whether this patch work for you?
>>
>> With this patch applied to 3.0.1 I can't trigger the issue anymore,
>> after several attempts -- the system just works as it shold be.
>> I'm not sure this is the right fix or it's just my testcase isn't
>> as good as it can be... ;)
Thanks for the test.
> 
> Well, I found a way to trigger data corruption with this patch
> applied.  I guess it's not fault of this patch, but some more
> deep problem instead.
> 
> The sequence is my usual copy of an oracle database from another
> place and start it.  When oracle starts doing it's direct-I/O
> against its redologs, we had problem which is now solved.  But
> now I do the following: I shutdown the database, rename the current
> redologs out of the way and copy them back into place as new files.
> And start the database again.
> 
> This time, oracle complains that the redologs contains garbage.
> I can reboot the machine now, and compare old (renamed) redologs
> with copies - they're indeed different.
> 
> My guess is that copy is done from the pagecache - from the old
> contents of the files, somehow ignoring the (direct) writes
> performed by initial database open.  But that copy is somehow
> damaged now too, since even file identification is now different.
> 
> Is this new issue something that dioread_nolock supposed to create?
> I mean, it isn't entirely clear what it supposed to do, it looks
> somewhat hackish, but without it performance is quite bad.
So could I generalize your sequence like below:
1. copy a large file to a new ext4 volume
2. do some direct i/o read/write to this file(bs=512)
3. rename it.
4. cp this back to the original file
5. do direct i/o read/write(bs=512) now and the file is actually corrupted.

You used to meet with problem in step 2, and my patch resolved it. Now
you met with problems in step 5. Right?

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev Aug. 15, 2011, 8 a.m. UTC | #4
15.08.2011 06:36, Tao Ma wrote:
> On 08/15/2011 05:07 AM, Michael Tokarev wrote:
[]
>> Well, I found a way to trigger data corruption with this patch
>> applied.  I guess it's not fault of this patch, but some more
>> deep problem instead.
>>
>> The sequence is my usual copy of an oracle database from another
>> place and start it.  When oracle starts doing it's direct-I/O
>> against its redologs, we had problem which is now solved.  But
>> now I do the following: I shutdown the database, rename the current
>> redologs out of the way and copy them back into place as new files.
>> And start the database again.
>>
>> This time, oracle complains that the redologs contains garbage.
>> I can reboot the machine now, and compare old (renamed) redologs
>> with copies - they're indeed different.
>>
>> My guess is that copy is done from the pagecache - from the old
>> contents of the files, somehow ignoring the (direct) writes
>> performed by initial database open.  But that copy is somehow
>> damaged now too, since even file identification is now different.
>>
>> Is this new issue something that dioread_nolock supposed to create?
>> I mean, it isn't entirely clear what it supposed to do, it looks
>> somewhat hackish, but without it performance is quite bad.
> So could I generalize your sequence like below:
> 1. copy a large file to a new ext4 volume
> 2. do some direct i/o read/write to this file(bs=512)
> 3. rename it.
> 4. cp this back to the original file
> 5. do direct i/o read/write(bs=512) now and the file is actually corrupted.
> 
> You used to meet with problem in step 2, and my patch resolved it. Now
> you met with problems in step 5. Right?

SQL> shutdown immediate; -- shuts down the database cleanly

$ mkdir tmp
$ mv redo* tmp/
$ cp -p tmp/* .
 -- this will make redolog files to be in hot cache, not even written to disk.

SQL> startup
Database mounted.
  -- now open and read our redologs...
  -- at this point, without the patch, it hangs.
ORA-00316: log 1 of thread 1, type  in header is not log file
ORA-00312: online log 1 thread 1: '.../redo1.odf'

$ mv -f tmp/* .

SQL> alter database open;  -- this will try to open files again and read them again
Database altered.          -- and now we're fine.

This is my small(ish) testcase so far.  Only the redologs
needs to be in hot cache in order to trigger the issue.
This does direct I/O in 512byte blocks in these redo*
files.

The rename and a new directory is just to keep the pieces
of the database in the right place.

There's even more fun.  I once managed to get old content in
the copied files, but I can't repeat it.  I made a copy as
before, sync(1)ed everything, started the database - it was
ok.  Next I shut it down, and rebooted (why drop_caches does
not really work is another big question).  And now, oracle
complains that the redologs contains previous sequence number.
(to clarify: there's a sequence number in each oracle db
which is incremented each time something happens with the
database, including startup. So on startup, each file in
the database gets new (the same) sequence number).  So it
looked like even despite of oracle doing direct writes to
record new sequence number, a previously cached data gets
written to the file.

Now I'm not really sure what's going on, it is somewhat
inconsistent.  Before, it used to hang after "Database
mounted" message, when it tries to write to redologs, --
now that hang is gone.

But now I see some apparent data corruption - again, with hot
cache only - but I don't actually understand when it happens.

I'm trying to narrow it down further.

Thank you!

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev Aug. 15, 2011, 8:56 a.m. UTC | #5
15.08.2011 12:00, Michael Tokarev wrote:
[....]

So, it looks like this (starting with cold cache):

1. rename the redologs and copy them over - this will
   make a hot copy of redologs
2. startup oracle - it will complain that the redologs aren't
   redologs, the header is corrupt
3. shut down oracle, start it up again - it will succeed.

If between 1 and 2 you'll issue sync(1) everything will work.
When shutting down, oracle calls fsync(), so that's like
sync(1) again.

If there will be some time between 1. and 2., everything
will work too.

Without dioread_nolock I can't trigger the problem no matter
how I tried.


A smaller test case.  I used redo1.odf file (one of the
redologs) as a test file, any will work.

 $ cp -p redo1.odf temp
 $ dd if=temp of=foo iflag=direct count=20

Now, first 512bytes of "foo" will contain all zeros, while
the beginning of redo1.odf is _not_ zeros.

Again, without aioread_nolock it works as expected.


And the most important note: without the patch there's no
data corruption like that.  But instead, there is the
lockup... ;)

Thank you,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev Aug. 15, 2011, 9:03 a.m. UTC | #6
15.08.2011 12:56, Michael Tokarev пишет:
> 15.08.2011 12:00, Michael Tokarev wrote:
> [....]
> 
> So, it looks like this (starting with cold cache):
> 
> 1. rename the redologs and copy them over - this will
>    make a hot copy of redologs
> 2. startup oracle - it will complain that the redologs aren't
>    redologs, the header is corrupt
> 3. shut down oracle, start it up again - it will succeed.
> 
> If between 1 and 2 you'll issue sync(1) everything will work.
> When shutting down, oracle calls fsync(), so that's like
> sync(1) again.
> 
> If there will be some time between 1. and 2., everything
> will work too.
> 
> Without dioread_nolock I can't trigger the problem no matter
> how I tried.
> 
> 
> A smaller test case.  I used redo1.odf file (one of the
> redologs) as a test file, any will work.
> 
>  $ cp -p redo1.odf temp
>  $ dd if=temp of=foo iflag=direct count=20
> 
> Now, first 512bytes of "foo" will contain all zeros, while
> the beginning of redo1.odf is _not_ zeros.
> 
> Again, without aioread_nolock it works as expected.
> 
> 
> And the most important note: without the patch there's no
> data corruption like that.  But instead, there is the
> lockup... ;)

Actually I can reproduce this data corruption without the
patch too, just not that easily.  Oracle testcase (with
copying redologs over) does that nicely.  So that's a
separate bug which was here before.

> Thank you,
> 
> /mjt

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma Aug. 15, 2011, 10:28 a.m. UTC | #7
On 08/15/2011 05:03 PM, Michael Tokarev wrote:
> 15.08.2011 12:56, Michael Tokarev пишет:
>> 15.08.2011 12:00, Michael Tokarev wrote:
>> [....]
>>
>> So, it looks like this (starting with cold cache):
>>
>> 1. rename the redologs and copy them over - this will
>>    make a hot copy of redologs
>> 2. startup oracle - it will complain that the redologs aren't
>>    redologs, the header is corrupt
>> 3. shut down oracle, start it up again - it will succeed.
>>
>> If between 1 and 2 you'll issue sync(1) everything will work.
>> When shutting down, oracle calls fsync(), so that's like
>> sync(1) again.
>>
>> If there will be some time between 1. and 2., everything
>> will work too.
>>
>> Without dioread_nolock I can't trigger the problem no matter
>> how I tried.
>>
>>
>> A smaller test case.  I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>>  $ cp -p redo1.odf temp
>>  $ dd if=temp of=foo iflag=direct count=20
>>
>> Now, first 512bytes of "foo" will contain all zeros, while
>> the beginning of redo1.odf is _not_ zeros.
>>
>> Again, without aioread_nolock it works as expected.
>>
>>
>> And the most important note: without the patch there's no
>> data corruption like that.  But instead, there is the
>> lockup... ;)
> 
> Actually I can reproduce this data corruption without the
> patch too, just not that easily.  Oracle testcase (with
> copying redologs over) does that nicely.  So that's a
> separate bug which was here before.
cool, thanks for the test.

btw, I can reproduce the bug with
 $ cp -p redo1.odf temp
 $ dd if=temp of=foo iflag=direct count=20
Not that easy, but I did encounter one during my more than 20 tries,
hope I can get something out soon.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen Aug. 15, 2011, 4:08 p.m. UTC | #8
On 8/13/11 11:02 AM, Tao Ma wrote:
> From: Tao Ma <boyu.mt@taobao.com>
> 
> Hi Michael,
> 	could you please check whether this patch work for you?

Thanks, this is the patch I meant to send, I guess ;)  I think
I missed the "|=" case before :(

given that every place we set it requires a test and an increment
if not set, maybe a helper to do all that is in order?

-Eric

> 
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
> From: Tao Ma <boyu.mt@taobao.com>
> Date: Sat, 13 Aug 2011 23:50:28 +0800
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> 
> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should
> be done simultaneously since ext4_end_io_nolock always clear the flag and
> decrease the counter in the same time.
> 
> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so
> it will go nagative and causes some process to wait forever.
> 
> Part of the patch came from Eric in his e-mail.
> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
> 
> Reported-by: Michael Tokarev<mjt@tls.msk.ru>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
> ---
>  fs/ext4/inode.c   |    9 ++++++++-
>  fs/ext4/page-io.c |    6 ++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d47264c..40c0b10 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>  		goto out;
>  	}
>  
> -	io_end->flag = EXT4_IO_END_UNWRITTEN;
> +	/*
> +	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
> +	 * but being more careful is always safe for the future change.
> +	 */
>  	inode = io_end->inode;
> +	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>  
>  	/* Add the io_end to per-inode completed io list*/
>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 430c401..78839af 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -334,8 +334,10 @@ submit_and_retry:
>  	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>  	    (io_end->pages[io_end->num_io_pages-1] != io_page))
>  		goto submit_and_retry;
> -	if (buffer_uninit(bh))
> -		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>  	io->io_end->size += bh->b_size;
>  	io->io_next_block++;
>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang Aug. 15, 2011, 11:53 p.m. UTC | #9
Hi Michael,

On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> 15.08.2011 12:00, Michael Tokarev wrote:
> [....]
>
> So, it looks like this (starting with cold cache):
>
> 1. rename the redologs and copy them over - this will
>   make a hot copy of redologs
> 2. startup oracle - it will complain that the redologs aren't
>   redologs, the header is corrupt
> 3. shut down oracle, start it up again - it will succeed.
>
> If between 1 and 2 you'll issue sync(1) everything will work.
> When shutting down, oracle calls fsync(), so that's like
> sync(1) again.
>
> If there will be some time between 1. and 2., everything
> will work too.
>
> Without dioread_nolock I can't trigger the problem no matter
> how I tried.
>
>
> A smaller test case.  I used redo1.odf file (one of the
> redologs) as a test file, any will work.
>
>  $ cp -p redo1.odf temp
>  $ dd if=temp of=foo iflag=direct count=20
Isn't this the expected behavior here? When doing
'cp -p redo1.odf temp', data is copied to temp through
buffer write, but there is no guarantee when data will be
actually written to disk. Then with 'dd if=temp of=foo
iflag=direct count=20', data is read directly from disk.
Very likely, the written data hasn't been flushed to disk
yet so ext4 returns zero in this case.

Jiaying
>
> Now, first 512bytes of "foo" will contain all zeros, while
> the beginning of redo1.odf is _not_ zeros.
>
> Again, without aioread_nolock it works as expected.
>
>
> And the most important note: without the patch there's no
> data corruption like that.  But instead, there is the
> lockup... ;)
>
> Thank you,
>
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma Aug. 16, 2011, 4:12 a.m. UTC | #10
On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> Hi Michael,
>> 	could you please check whether this patch work for you?
> 
> Thanks, this is the patch I meant to send, I guess ;)  I think
> I missed the "|=" case before :(
> 
> given that every place we set it requires a test and an increment
> if not set, maybe a helper to do all that is in order?
sure, I can work out a patch like that. Thanks for the review.

Thanks
Tao
> 
> -Eric
> 
>>
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
>> From: Tao Ma <boyu.mt@taobao.com>
>> Date: Sat, 13 Aug 2011 23:50:28 +0800
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>>
>> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should
>> be done simultaneously since ext4_end_io_nolock always clear the flag and
>> decrease the counter in the same time.
>>
>> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so
>> it will go nagative and causes some process to wait forever.
>>
>> Part of the patch came from Eric in his e-mail.
>> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>>
>> Reported-by: Michael Tokarev<mjt@tls.msk.ru>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
>> ---
>>  fs/ext4/inode.c   |    9 ++++++++-
>>  fs/ext4/page-io.c |    6 ++++--
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d47264c..40c0b10 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>>  		goto out;
>>  	}
>>  
>> -	io_end->flag = EXT4_IO_END_UNWRITTEN;
>> +	/*
>> +	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
>> +	 * but being more careful is always safe for the future change.
>> +	 */
>>  	inode = io_end->inode;
>> +	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> +	}
>>  
>>  	/* Add the io_end to per-inode completed io list*/
>>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>> index 430c401..78839af 100644
>> --- a/fs/ext4/page-io.c
>> +++ b/fs/ext4/page-io.c
>> @@ -334,8 +334,10 @@ submit_and_retry:
>>  	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>>  	    (io_end->pages[io_end->num_io_pages-1] != io_page))
>>  		goto submit_and_retry;
>> -	if (buffer_uninit(bh))
>> -		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> +	}
>>  	io->io_end->size += bh->b_size;
>>  	io->io_next_block++;
>>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma Aug. 16, 2011, 4:15 a.m. UTC | #11
Hi Jiaying,
On 08/16/2011 07:53 AM, Jiaying Zhang wrote:
> Hi Michael,
> 
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>> 15.08.2011 12:00, Michael Tokarev wrote:
>> [....]
>>
>> So, it looks like this (starting with cold cache):
>>
>> 1. rename the redologs and copy them over - this will
>>   make a hot copy of redologs
>> 2. startup oracle - it will complain that the redologs aren't
>>   redologs, the header is corrupt
>> 3. shut down oracle, start it up again - it will succeed.
>>
>> If between 1 and 2 you'll issue sync(1) everything will work.
>> When shutting down, oracle calls fsync(), so that's like
>> sync(1) again.
>>
>> If there will be some time between 1. and 2., everything
>> will work too.
>> 
>> Without dioread_nolock I can't trigger the problem no matter
>> how I tried.
>>
>>
>> A smaller test case.  I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>>  $ cp -p redo1.odf temp
>>  $ dd if=temp of=foo iflag=direct count=20
> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.
Sorry, but it doesn't sound correct to me.
Say we use a buffer write to a file and then use direct i/o read, what
we expect(or at least Michael expect) is that we use read the updated
data, not the stale one. I thought of a tiny race window in ext4 here,
but need to do some test to verify and then fix it.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma Aug. 16, 2011, 6:15 a.m. UTC | #12
Hi Eric,
On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> Hi Michael,
>> 	could you please check whether this patch work for you?
> 
> Thanks, this is the patch I meant to send, I guess ;)  I think
> I missed the "|=" case before :(
> 
> given that every place we set it requires a test and an increment
> if not set, maybe a helper to do all that is in order?
I have decided to leave it to the next merge window. So I will re-send
this patch as a fix and then create another patch for the helper
generalization.

Thanks
Tao
> 
> -Eric
> 
>>
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
>> From: Tao Ma <boyu.mt@taobao.com>
>> Date: Sat, 13 Aug 2011 23:50:28 +0800
>> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>>
>> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should
>> be done simultaneously since ext4_end_io_nolock always clear the flag and
>> decrease the counter in the same time.
>>
>> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so
>> it will go nagative and causes some process to wait forever.
>>
>> Part of the patch came from Eric in his e-mail.
>> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>>
>> Reported-by: Michael Tokarev<mjt@tls.msk.ru>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
>> ---
>>  fs/ext4/inode.c   |    9 ++++++++-
>>  fs/ext4/page-io.c |    6 ++++--
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index d47264c..40c0b10 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>>  		goto out;
>>  	}
>>  
>> -	io_end->flag = EXT4_IO_END_UNWRITTEN;
>> +	/*
>> +	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
>> +	 * but being more careful is always safe for the future change.
>> +	 */
>>  	inode = io_end->inode;
>> +	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> +	}
>>  
>>  	/* Add the io_end to per-inode completed io list*/
>>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
>> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
>> index 430c401..78839af 100644
>> --- a/fs/ext4/page-io.c
>> +++ b/fs/ext4/page-io.c
>> @@ -334,8 +334,10 @@ submit_and_retry:
>>  	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>>  	    (io_end->pages[io_end->num_io_pages-1] != io_page))
>>  		goto submit_and_retry;
>> -	if (buffer_uninit(bh))
>> -		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
>> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
>> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>> +	}
>>  	io->io_end->size += bh->b_size;
>>  	io->io_next_block++;
>>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev Aug. 16, 2011, 8:38 a.m. UTC | #13
16.08.2011 03:53, Jiaying Zhang wrote:
> Hi Michael,
> 
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
[]
>> A smaller test case.  I used redo1.odf file (one of the
>> redologs) as a test file, any will work.
>>
>>  $ cp -p redo1.odf temp
>>  $ dd if=temp of=foo iflag=direct count=20

> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.

The problem is 3-faced (at least ;)

First of all, it is _not_ an expected behavour.  When you
think about it, maybe it becomes "more expected", but for
first it looks like something Really Wrong (tm).  It can
be made "more expected" by mentioning in various manpages
and whatnot all the possible consecuences of mixing direct
and buffered I/O.  So far it hasn't been done.  I can
understand (and sort of expect), say, buffered write being
insisible for concurrent direct read, while they're going
at the same time.  But here, the file has been closed and
re-opened between writes and reads.

I agree that it's difficult to keep both pieces - direct
and buffered I/O - in sync, -- there were numerous efforts
to syncronize them, with various success and usually huge
amount of work.  Maybe if it were noted initially that
direct I/O _is_ incompatible with buffered I/O, things
weren't that bad now.

Next, this problem does not happen without the mentioned
dioread_nolock option (which - as far as I can see -
supposed to be the default (or only) way to handle this
in the future).  I can't trigger any of the issues
described in this thread without dioread_nolock.

So that makes this as yet another "corner case" somehow
(like famous non-fs-buffer-aligned direct write past end
of file, or like mmapped I/O mixed with direct I/O and
so on), but since most other such corner cases are fixed
now, this one just needs to be fixed too.

And 3rd, this is a race condition: it does not happen all
the time, or even most of the time, it happens "sometimes",
which makes it more like a bug than not.

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Aug. 16, 2011, 1:53 p.m. UTC | #14
On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> > 15.08.2011 12:00, Michael Tokarev wrote:
> > [....]
> >
> > So, it looks like this (starting with cold cache):
> >
> > 1. rename the redologs and copy them over - this will
> >   make a hot copy of redologs
> > 2. startup oracle - it will complain that the redologs aren't
> >   redologs, the header is corrupt
> > 3. shut down oracle, start it up again - it will succeed.
> >
> > If between 1 and 2 you'll issue sync(1) everything will work.
> > When shutting down, oracle calls fsync(), so that's like
> > sync(1) again.
> >
> > If there will be some time between 1. and 2., everything
> > will work too.
> >
> > Without dioread_nolock I can't trigger the problem no matter
> > how I tried.
> >
> >
> > A smaller test case.  I used redo1.odf file (one of the
> > redologs) as a test file, any will work.
> >
> >  $ cp -p redo1.odf temp
> >  $ dd if=temp of=foo iflag=direct count=20
> Isn't this the expected behavior here? When doing
> 'cp -p redo1.odf temp', data is copied to temp through
> buffer write, but there is no guarantee when data will be
> actually written to disk. Then with 'dd if=temp of=foo
> iflag=direct count=20', data is read directly from disk.
> Very likely, the written data hasn't been flushed to disk
> yet so ext4 returns zero in this case.
  No it's not. Buffered and direct IO is supposed to work correctly
(although not fast) together. In this particular case we take care to flush
dirty data from page cache before performing direct IO read... But
something is broken in this path obviously.

I don't have time to dig into this in detail now but what seems to be the
problem is that with dioread_nolock option, we don't acquire i_mutex for
direct IO reads anymore. Thus these reads can compete with
ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
i_mutex so without dioread_nolock the race cannot happen).

Hmm, the new writepages code seems to be broken in combination with direct
IO. Direct IO code expects that when filemap_write_and_wait() finishes,
data is on disk but with new bio submission code this is not true because
we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
ext4_end_io_buffer_write() but do extent conversion only after that in
convert workqueue. So the race seems to be there all the time, just without
dioread_nolock it's much smaller.

Fixing this is going to be non-trivial - I'm not sure we can really move
clearing of PageWriteback bit to conversion workqueue. I think we already
tried that once but it caused deadlocks for some reason...

> > Now, first 512bytes of "foo" will contain all zeros, while
> > the beginning of redo1.odf is _not_ zeros.
> >
> > Again, without aioread_nolock it works as expected.
> >
> >
> > And the most important note: without the patch there's no
> > data corruption like that.  But instead, there is the
> > lockup... ;)

								Honza
Tao Ma Aug. 16, 2011, 3:03 p.m. UTC | #15
On 08/16/2011 09:53 PM, Jan Kara wrote:
> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>> [....]
>>>
>>> So, it looks like this (starting with cold cache):
>>>
>>> 1. rename the redologs and copy them over - this will
>>>   make a hot copy of redologs
>>> 2. startup oracle - it will complain that the redologs aren't
>>>   redologs, the header is corrupt
>>> 3. shut down oracle, start it up again - it will succeed.
>>>
>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>> When shutting down, oracle calls fsync(), so that's like
>>> sync(1) again.
>>>
>>> If there will be some time between 1. and 2., everything
>>> will work too.
>>>
>>> Without dioread_nolock I can't trigger the problem no matter
>>> how I tried.
>>>
>>>
>>> A smaller test case.  I used redo1.odf file (one of the
>>> redologs) as a test file, any will work.
>>>
>>>  $ cp -p redo1.odf temp
>>>  $ dd if=temp of=foo iflag=direct count=20
>> Isn't this the expected behavior here? When doing
>> 'cp -p redo1.odf temp', data is copied to temp through
>> buffer write, but there is no guarantee when data will be
>> actually written to disk. Then with 'dd if=temp of=foo
>> iflag=direct count=20', data is read directly from disk.
>> Very likely, the written data hasn't been flushed to disk
>> yet so ext4 returns zero in this case.
>   No it's not. Buffered and direct IO is supposed to work correctly
> (although not fast) together. In this particular case we take care to flush
> dirty data from page cache before performing direct IO read... But
> something is broken in this path obviously.
> 
> I don't have time to dig into this in detail now but what seems to be the
> problem is that with dioread_nolock option, we don't acquire i_mutex for
> direct IO reads anymore. Thus these reads can compete with
> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
> i_mutex so without dioread_nolock the race cannot happen).
> 
> Hmm, the new writepages code seems to be broken in combination with direct
> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
> data is on disk but with new bio submission code this is not true because
> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
> ext4_end_io_buffer_write() but do extent conversion only after that in
> convert workqueue. So the race seems to be there all the time, just without
> dioread_nolock it's much smaller.
You are absolutely right. The really problem is that ext4_direct_IO
begins to work *after* we clear the page writeback flag and *before* we
convert unwritten extent to a valid state. Some of my trace does show
that. I am working on it now.
> 
> Fixing this is going to be non-trivial - I'm not sure we can really move
> clearing of PageWriteback bit to conversion workqueue. I thienk we already
> tried that once but it caused deadlocks for some reason...
I just did as what you described and yes I met with another problem and
try to resolve it now. Once it is OK, I will send out the patch.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d47264c..40c0b10 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2668,8 +2668,15 @@  static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 		goto out;
 	}
 
-	io_end->flag = EXT4_IO_END_UNWRITTEN;
+	/*
+	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
+	 * but being more careful is always safe for the future change.
+	 */
 	inode = io_end->inode;
+	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		io_end->flag |= EXT4_IO_END_UNWRITTEN;
+		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+	}
 
 	/* Add the io_end to per-inode completed io list*/
 	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 430c401..78839af 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -334,8 +334,10 @@  submit_and_retry:
 	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
 	    (io_end->pages[io_end->num_io_pages-1] != io_page))
 		goto submit_and_retry;
-	if (buffer_uninit(bh))
-		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
+	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		io_end->flag |= EXT4_IO_END_UNWRITTEN;
+		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+	}
 	io->io_end->size += bh->b_size;
 	io->io_next_block++;
 	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));