Message ID | 1313251371-3672-1-git-send-email-tm@tao.ma |
---|---|
State | Superseded, archived |
Headers | show |
13.08.2011 20:02, Tao Ma wrote: > From: Tao Ma <boyu.mt@taobao.com> > > Hi Michael, > could you please check whether this patch work for you? With this patch applied to 3.0.1 I can't trigger the issue anymore, after several attempts -- the system just works as it shold be. I'm not sure this is the right fix or it's just my testcase isn't as good as it can be... ;) > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001 > From: Tao Ma <boyu.mt@taobao.com> > Date: Sat, 13 Aug 2011 23:50:28 +0800 > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > > EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should > be done simultaneously since ext4_end_io_nolock always clear the flag and > decrease the counter in the same time. > > We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so > it will go nagative and causes some process to wait forever. > > Part of the patch came from Eric in his e-mail. > http://marc.info/?l=linux-ext4&m=131316851417460&w=2 > > Reported-by: Michael Tokarev<mjt@tls.msk.ru> > Signed-off-by: Eric Sandeen <sandeen@redhat.com> > Signed-off-by: Tao Ma <boyu.mt@taobao.com> You can add my Tested-By too ;) Thank you! And please excuse me for so long delay replying - I was AFK whole weekend. /mjt > fs/ext4/inode.c | 9 ++++++++- > fs/ext4/page-io.c | 6 ++++-- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index d47264c..40c0b10 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) > goto out; > } > > - io_end->flag = EXT4_IO_END_UNWRITTEN; > + /* > + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, > + * but being more careful is always safe for the future change. > + */ > inode = io_end->inode; > + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > > /* Add the io_end to per-inode completed io list*/ > spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c > index 430c401..78839af 100644 > --- a/fs/ext4/page-io.c > +++ b/fs/ext4/page-io.c > @@ -334,8 +334,10 @@ submit_and_retry: > if ((io_end->num_io_pages >= MAX_IO_PAGES) && > (io_end->pages[io_end->num_io_pages-1] != io_page)) > goto submit_and_retry; > - if (buffer_uninit(bh)) > - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; > + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > io->io_end->size += bh->b_size; > io->io_next_block++; > ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
15.08.2011 00:57, Michael Tokarev пишет: > 13.08.2011 20:02, Tao Ma wrote: >> From: Tao Ma <boyu.mt@taobao.com> >> >> Hi Michael, >> could you please check whether this patch work for you? > > With this patch applied to 3.0.1 I can't trigger the issue anymore, > after several attempts -- the system just works as it shold be. > I'm not sure this is the right fix or it's just my testcase isn't > as good as it can be... ;) Well, I found a way to trigger data corruption with this patch applied. I guess it's not fault of this patch, but some more deep problem instead. The sequence is my usual copy of an oracle database from another place and start it. When oracle starts doing it's direct-I/O against its redologs, we had problem which is now solved. But now I do the following: I shutdown the database, rename the current redologs out of the way and copy them back into place as new files. And start the database again. This time, oracle complains that the redologs contains garbage. I can reboot the machine now, and compare old (renamed) redologs with copies - they're indeed different. My guess is that copy is done from the pagecache - from the old contents of the files, somehow ignoring the (direct) writes performed by initial database open. But that copy is somehow damaged now too, since even file identification is now different. Is this new issue something that dioread_nolock supposed to create? I mean, it isn't entirely clear what it supposed to do, it looks somewhat hackish, but without it performance is quite bad. Thanks, /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/15/2011 05:07 AM, Michael Tokarev wrote: > 15.08.2011 00:57, Michael Tokarev пишет: >> 13.08.2011 20:02, Tao Ma wrote: >>> From: Tao Ma <boyu.mt@taobao.com> >>> >>> Hi Michael, >>> could you please check whether this patch work for you? >> >> With this patch applied to 3.0.1 I can't trigger the issue anymore, >> after several attempts -- the system just works as it shold be. >> I'm not sure this is the right fix or it's just my testcase isn't >> as good as it can be... ;) Thanks for the test. > > Well, I found a way to trigger data corruption with this patch > applied. I guess it's not fault of this patch, but some more > deep problem instead. > > The sequence is my usual copy of an oracle database from another > place and start it. When oracle starts doing it's direct-I/O > against its redologs, we had problem which is now solved. But > now I do the following: I shutdown the database, rename the current > redologs out of the way and copy them back into place as new files. > And start the database again. > > This time, oracle complains that the redologs contains garbage. > I can reboot the machine now, and compare old (renamed) redologs > with copies - they're indeed different. > > My guess is that copy is done from the pagecache - from the old > contents of the files, somehow ignoring the (direct) writes > performed by initial database open. But that copy is somehow > damaged now too, since even file identification is now different. > > Is this new issue something that dioread_nolock supposed to create? > I mean, it isn't entirely clear what it supposed to do, it looks > somewhat hackish, but without it performance is quite bad. So could I generalize your sequence like below: 1. copy a large file to a new ext4 volume 2. do some direct i/o read/write to this file(bs=512) 3. rename it. 4. cp this back to the original file 5. do direct i/o read/write(bs=512) now and the file is actually corrupted. You used to meet with problem in step 2, and my patch resolved it. Now you met with problems in step 5. Right? Thanks Tao -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
15.08.2011 06:36, Tao Ma wrote: > On 08/15/2011 05:07 AM, Michael Tokarev wrote: [] >> Well, I found a way to trigger data corruption with this patch >> applied. I guess it's not fault of this patch, but some more >> deep problem instead. >> >> The sequence is my usual copy of an oracle database from another >> place and start it. When oracle starts doing it's direct-I/O >> against its redologs, we had problem which is now solved. But >> now I do the following: I shutdown the database, rename the current >> redologs out of the way and copy them back into place as new files. >> And start the database again. >> >> This time, oracle complains that the redologs contains garbage. >> I can reboot the machine now, and compare old (renamed) redologs >> with copies - they're indeed different. >> >> My guess is that copy is done from the pagecache - from the old >> contents of the files, somehow ignoring the (direct) writes >> performed by initial database open. But that copy is somehow >> damaged now too, since even file identification is now different. >> >> Is this new issue something that dioread_nolock supposed to create? >> I mean, it isn't entirely clear what it supposed to do, it looks >> somewhat hackish, but without it performance is quite bad. > So could I generalize your sequence like below: > 1. copy a large file to a new ext4 volume > 2. do some direct i/o read/write to this file(bs=512) > 3. rename it. > 4. cp this back to the original file > 5. do direct i/o read/write(bs=512) now and the file is actually corrupted. > > You used to meet with problem in step 2, and my patch resolved it. Now > you met with problems in step 5. Right? SQL> shutdown immediate; -- shuts down the database cleanly $ mkdir tmp $ mv redo* tmp/ $ cp -p tmp/* . -- this will make redolog files to be in hot cache, not even written to disk. SQL> startup Database mounted. -- now open and read our redologs... -- at this point, without the patch, it hangs. ORA-00316: log 1 of thread 1, type in header is not log file ORA-00312: online log 1 thread 1: '.../redo1.odf' $ mv -f tmp/* . SQL> alter database open; -- this will try to open files again and read them again Database altered. -- and now we're fine. This is my small(ish) testcase so far. Only the redologs needs to be in hot cache in order to trigger the issue. This does direct I/O in 512byte blocks in these redo* files. The rename and a new directory is just to keep the pieces of the database in the right place. There's even more fun. I once managed to get old content in the copied files, but I can't repeat it. I made a copy as before, sync(1)ed everything, started the database - it was ok. Next I shut it down, and rebooted (why drop_caches does not really work is another big question). And now, oracle complains that the redologs contains previous sequence number. (to clarify: there's a sequence number in each oracle db which is incremented each time something happens with the database, including startup. So on startup, each file in the database gets new (the same) sequence number). So it looked like even despite of oracle doing direct writes to record new sequence number, a previously cached data gets written to the file. Now I'm not really sure what's going on, it is somewhat inconsistent. Before, it used to hang after "Database mounted" message, when it tries to write to redologs, -- now that hang is gone. But now I see some apparent data corruption - again, with hot cache only - but I don't actually understand when it happens. I'm trying to narrow it down further. Thank you! /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
15.08.2011 12:00, Michael Tokarev wrote: [....] So, it looks like this (starting with cold cache): 1. rename the redologs and copy them over - this will make a hot copy of redologs 2. startup oracle - it will complain that the redologs aren't redologs, the header is corrupt 3. shut down oracle, start it up again - it will succeed. If between 1 and 2 you'll issue sync(1) everything will work. When shutting down, oracle calls fsync(), so that's like sync(1) again. If there will be some time between 1. and 2., everything will work too. Without dioread_nolock I can't trigger the problem no matter how I tried. A smaller test case. I used redo1.odf file (one of the redologs) as a test file, any will work. $ cp -p redo1.odf temp $ dd if=temp of=foo iflag=direct count=20 Now, first 512bytes of "foo" will contain all zeros, while the beginning of redo1.odf is _not_ zeros. Again, without aioread_nolock it works as expected. And the most important note: without the patch there's no data corruption like that. But instead, there is the lockup... ;) Thank you, /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
15.08.2011 12:56, Michael Tokarev пишет: > 15.08.2011 12:00, Michael Tokarev wrote: > [....] > > So, it looks like this (starting with cold cache): > > 1. rename the redologs and copy them over - this will > make a hot copy of redologs > 2. startup oracle - it will complain that the redologs aren't > redologs, the header is corrupt > 3. shut down oracle, start it up again - it will succeed. > > If between 1 and 2 you'll issue sync(1) everything will work. > When shutting down, oracle calls fsync(), so that's like > sync(1) again. > > If there will be some time between 1. and 2., everything > will work too. > > Without dioread_nolock I can't trigger the problem no matter > how I tried. > > > A smaller test case. I used redo1.odf file (one of the > redologs) as a test file, any will work. > > $ cp -p redo1.odf temp > $ dd if=temp of=foo iflag=direct count=20 > > Now, first 512bytes of "foo" will contain all zeros, while > the beginning of redo1.odf is _not_ zeros. > > Again, without aioread_nolock it works as expected. > > > And the most important note: without the patch there's no > data corruption like that. But instead, there is the > lockup... ;) Actually I can reproduce this data corruption without the patch too, just not that easily. Oracle testcase (with copying redologs over) does that nicely. So that's a separate bug which was here before. > Thank you, > > /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/15/2011 05:03 PM, Michael Tokarev wrote: > 15.08.2011 12:56, Michael Tokarev пишет: >> 15.08.2011 12:00, Michael Tokarev wrote: >> [....] >> >> So, it looks like this (starting with cold cache): >> >> 1. rename the redologs and copy them over - this will >> make a hot copy of redologs >> 2. startup oracle - it will complain that the redologs aren't >> redologs, the header is corrupt >> 3. shut down oracle, start it up again - it will succeed. >> >> If between 1 and 2 you'll issue sync(1) everything will work. >> When shutting down, oracle calls fsync(), so that's like >> sync(1) again. >> >> If there will be some time between 1. and 2., everything >> will work too. >> >> Without dioread_nolock I can't trigger the problem no matter >> how I tried. >> >> >> A smaller test case. I used redo1.odf file (one of the >> redologs) as a test file, any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 >> >> Now, first 512bytes of "foo" will contain all zeros, while >> the beginning of redo1.odf is _not_ zeros. >> >> Again, without aioread_nolock it works as expected. >> >> >> And the most important note: without the patch there's no >> data corruption like that. But instead, there is the >> lockup... ;) > > Actually I can reproduce this data corruption without the > patch too, just not that easily. Oracle testcase (with > copying redologs over) does that nicely. So that's a > separate bug which was here before. cool, thanks for the test. btw, I can reproduce the bug with $ cp -p redo1.odf temp $ dd if=temp of=foo iflag=direct count=20 Not that easy, but I did encounter one during my more than 20 tries, hope I can get something out soon. Thanks Tao -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 8/13/11 11:02 AM, Tao Ma wrote: > From: Tao Ma <boyu.mt@taobao.com> > > Hi Michael, > could you please check whether this patch work for you? Thanks, this is the patch I meant to send, I guess ;) I think I missed the "|=" case before :( given that every place we set it requires a test and an increment if not set, maybe a helper to do all that is in order? -Eric > > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001 > From: Tao Ma <boyu.mt@taobao.com> > Date: Sat, 13 Aug 2011 23:50:28 +0800 > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > > EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should > be done simultaneously since ext4_end_io_nolock always clear the flag and > decrease the counter in the same time. > > We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so > it will go nagative and causes some process to wait forever. > > Part of the patch came from Eric in his e-mail. > http://marc.info/?l=linux-ext4&m=131316851417460&w=2 > > Reported-by: Michael Tokarev<mjt@tls.msk.ru> > Signed-off-by: Eric Sandeen <sandeen@redhat.com> > Signed-off-by: Tao Ma <boyu.mt@taobao.com> > --- > fs/ext4/inode.c | 9 ++++++++- > fs/ext4/page-io.c | 6 ++++-- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index d47264c..40c0b10 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) > goto out; > } > > - io_end->flag = EXT4_IO_END_UNWRITTEN; > + /* > + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, > + * but being more careful is always safe for the future change. > + */ > inode = io_end->inode; > + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > > /* Add the io_end to per-inode completed io list*/ > spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c > index 430c401..78839af 100644 > --- a/fs/ext4/page-io.c > +++ b/fs/ext4/page-io.c > @@ -334,8 +334,10 @@ submit_and_retry: > if ((io_end->num_io_pages >= MAX_IO_PAGES) && > (io_end->pages[io_end->num_io_pages-1] != io_page)) > goto submit_and_retry; > - if (buffer_uninit(bh)) > - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; > + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > io->io_end->size += bh->b_size; > io->io_next_block++; > ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Michael, On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: > 15.08.2011 12:00, Michael Tokarev wrote: > [....] > > So, it looks like this (starting with cold cache): > > 1. rename the redologs and copy them over - this will > make a hot copy of redologs > 2. startup oracle - it will complain that the redologs aren't > redologs, the header is corrupt > 3. shut down oracle, start it up again - it will succeed. > > If between 1 and 2 you'll issue sync(1) everything will work. > When shutting down, oracle calls fsync(), so that's like > sync(1) again. > > If there will be some time between 1. and 2., everything > will work too. > > Without dioread_nolock I can't trigger the problem no matter > how I tried. > > > A smaller test case. I used redo1.odf file (one of the > redologs) as a test file, any will work. > > $ cp -p redo1.odf temp > $ dd if=temp of=foo iflag=direct count=20 Isn't this the expected behavior here? When doing 'cp -p redo1.odf temp', data is copied to temp through buffer write, but there is no guarantee when data will be actually written to disk. Then with 'dd if=temp of=foo iflag=direct count=20', data is read directly from disk. Very likely, the written data hasn't been flushed to disk yet so ext4 returns zero in this case. Jiaying > > Now, first 512bytes of "foo" will contain all zeros, while > the beginning of redo1.odf is _not_ zeros. > > Again, without aioread_nolock it works as expected. > > > And the most important note: without the patch there's no > data corruption like that. But instead, there is the > lockup... ;) > > Thank you, > > /mjt > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 08/16/2011 12:08 AM, Eric Sandeen wrote: > On 8/13/11 11:02 AM, Tao Ma wrote: >> From: Tao Ma <boyu.mt@taobao.com> >> >> Hi Michael, >> could you please check whether this patch work for you? > > Thanks, this is the patch I meant to send, I guess ;) I think > I missed the "|=" case before :( > > given that every place we set it requires a test and an increment > if not set, maybe a helper to do all that is in order? sure, I can work out a patch like that. Thanks for the review. Thanks Tao > > -Eric > >> >> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. >> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001 >> From: Tao Ma <boyu.mt@taobao.com> >> Date: Sat, 13 Aug 2011 23:50:28 +0800 >> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. >> >> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should >> be done simultaneously since ext4_end_io_nolock always clear the flag and >> decrease the counter in the same time. >> >> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so >> it will go nagative and causes some process to wait forever. >> >> Part of the patch came from Eric in his e-mail. >> http://marc.info/?l=linux-ext4&m=131316851417460&w=2 >> >> Reported-by: Michael Tokarev<mjt@tls.msk.ru> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com> >> Signed-off-by: Tao Ma <boyu.mt@taobao.com> >> --- >> fs/ext4/inode.c | 9 ++++++++- >> fs/ext4/page-io.c | 6 ++++-- >> 2 files changed, 12 insertions(+), 3 deletions(-) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index d47264c..40c0b10 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) >> goto out; >> } >> >> - io_end->flag = EXT4_IO_END_UNWRITTEN; >> + /* >> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, >> + * but being more careful is always safe for the future change. >> + */ >> inode = io_end->inode; >> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { >> + io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >> + } >> >> /* Add the io_end to per-inode completed io list*/ >> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); >> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c >> index 430c401..78839af 100644 >> --- a/fs/ext4/page-io.c >> +++ b/fs/ext4/page-io.c >> @@ -334,8 +334,10 @@ submit_and_retry: >> if ((io_end->num_io_pages >= MAX_IO_PAGES) && >> (io_end->pages[io_end->num_io_pages-1] != io_page)) >> goto submit_and_retry; >> - if (buffer_uninit(bh)) >> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { >> + io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >> + } >> io->io_end->size += bh->b_size; >> io->io_next_block++; >> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jiaying, On 08/16/2011 07:53 AM, Jiaying Zhang wrote: > Hi Michael, > > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >> 15.08.2011 12:00, Michael Tokarev wrote: >> [....] >> >> So, it looks like this (starting with cold cache): >> >> 1. rename the redologs and copy them over - this will >> make a hot copy of redologs >> 2. startup oracle - it will complain that the redologs aren't >> redologs, the header is corrupt >> 3. shut down oracle, start it up again - it will succeed. >> >> If between 1 and 2 you'll issue sync(1) everything will work. >> When shutting down, oracle calls fsync(), so that's like >> sync(1) again. >> >> If there will be some time between 1. and 2., everything >> will work too. >> >> Without dioread_nolock I can't trigger the problem no matter >> how I tried. >> >> >> A smaller test case. I used redo1.odf file (one of the >> redologs) as a test file, any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > buffer write, but there is no guarantee when data will be > actually written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet so ext4 returns zero in this case. Sorry, but it doesn't sound correct to me. Say we use a buffer write to a file and then use direct i/o read, what we expect(or at least Michael expect) is that we use read the updated data, not the stale one. I thought of a tiny race window in ext4 here, but need to do some test to verify and then fix it. Thanks Tao -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Eric, On 08/16/2011 12:08 AM, Eric Sandeen wrote: > On 8/13/11 11:02 AM, Tao Ma wrote: >> From: Tao Ma <boyu.mt@taobao.com> >> >> Hi Michael, >> could you please check whether this patch work for you? > > Thanks, this is the patch I meant to send, I guess ;) I think > I missed the "|=" case before :( > > given that every place we set it requires a test and an increment > if not set, maybe a helper to do all that is in order? I have decided to leave it to the next merge window. So I will re-send this patch as a fix and then create another patch for the helper generalization. Thanks Tao > > -Eric > >> >> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. >> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001 >> From: Tao Ma <boyu.mt@taobao.com> >> Date: Sat, 13 Aug 2011 23:50:28 +0800 >> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. >> >> EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten should >> be done simultaneously since ext4_end_io_nolock always clear the flag and >> decrease the counter in the same time. >> >> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN so >> it will go nagative and causes some process to wait forever. >> >> Part of the patch came from Eric in his e-mail. >> http://marc.info/?l=linux-ext4&m=131316851417460&w=2 >> >> Reported-by: Michael Tokarev<mjt@tls.msk.ru> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com> >> Signed-off-by: Tao Ma <boyu.mt@taobao.com> >> --- >> fs/ext4/inode.c | 9 ++++++++- >> fs/ext4/page-io.c | 6 ++++-- >> 2 files changed, 12 insertions(+), 3 deletions(-) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index d47264c..40c0b10 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) >> goto out; >> } >> >> - io_end->flag = EXT4_IO_END_UNWRITTEN; >> + /* >> + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, >> + * but being more careful is always safe for the future change. >> + */ >> inode = io_end->inode; >> + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { >> + io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >> + } >> >> /* Add the io_end to per-inode completed io list*/ >> spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); >> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c >> index 430c401..78839af 100644 >> --- a/fs/ext4/page-io.c >> +++ b/fs/ext4/page-io.c >> @@ -334,8 +334,10 @@ submit_and_retry: >> if ((io_end->num_io_pages >= MAX_IO_PAGES) && >> (io_end->pages[io_end->num_io_pages-1] != io_page)) >> goto submit_and_retry; >> - if (buffer_uninit(bh)) >> - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { >> + io_end->flag |= EXT4_IO_END_UNWRITTEN; >> + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); >> + } >> io->io_end->size += bh->b_size; >> io->io_next_block++; >> ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
16.08.2011 03:53, Jiaying Zhang wrote: > Hi Michael, > > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: [] >> A smaller test case. I used redo1.odf file (one of the >> redologs) as a test file, any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > buffer write, but there is no guarantee when data will be > actually written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet so ext4 returns zero in this case. The problem is 3-faced (at least ;) First of all, it is _not_ an expected behavour. When you think about it, maybe it becomes "more expected", but for first it looks like something Really Wrong (tm). It can be made "more expected" by mentioning in various manpages and whatnot all the possible consecuences of mixing direct and buffered I/O. So far it hasn't been done. I can understand (and sort of expect), say, buffered write being insisible for concurrent direct read, while they're going at the same time. But here, the file has been closed and re-opened between writes and reads. I agree that it's difficult to keep both pieces - direct and buffered I/O - in sync, -- there were numerous efforts to syncronize them, with various success and usually huge amount of work. Maybe if it were noted initially that direct I/O _is_ incompatible with buffered I/O, things weren't that bad now. Next, this problem does not happen without the mentioned dioread_nolock option (which - as far as I can see - supposed to be the default (or only) way to handle this in the future). I can't trigger any of the issues described in this thread without dioread_nolock. So that makes this as yet another "corner case" somehow (like famous non-fs-buffer-aligned direct write past end of file, or like mmapped I/O mixed with direct I/O and so on), but since most other such corner cases are fixed now, this one just needs to be fixed too. And 3rd, this is a race condition: it does not happen all the time, or even most of the time, it happens "sometimes", which makes it more like a bug than not. Thanks, /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: > > 15.08.2011 12:00, Michael Tokarev wrote: > > [....] > > > > So, it looks like this (starting with cold cache): > > > > 1. rename the redologs and copy them over - this will > > make a hot copy of redologs > > 2. startup oracle - it will complain that the redologs aren't > > redologs, the header is corrupt > > 3. shut down oracle, start it up again - it will succeed. > > > > If between 1 and 2 you'll issue sync(1) everything will work. > > When shutting down, oracle calls fsync(), so that's like > > sync(1) again. > > > > If there will be some time between 1. and 2., everything > > will work too. > > > > Without dioread_nolock I can't trigger the problem no matter > > how I tried. > > > > > > A smaller test case. I used redo1.odf file (one of the > > redologs) as a test file, any will work. > > > > $ cp -p redo1.odf temp > > $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > buffer write, but there is no guarantee when data will be > actually written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet so ext4 returns zero in this case. No it's not. Buffered and direct IO is supposed to work correctly (although not fast) together. In this particular case we take care to flush dirty data from page cache before performing direct IO read... But something is broken in this path obviously. I don't have time to dig into this in detail now but what seems to be the problem is that with dioread_nolock option, we don't acquire i_mutex for direct IO reads anymore. Thus these reads can compete with ext4_end_io_nolock() called from ext4_end_io_work() (this is called under i_mutex so without dioread_nolock the race cannot happen). Hmm, the new writepages code seems to be broken in combination with direct IO. Direct IO code expects that when filemap_write_and_wait() finishes, data is on disk but with new bio submission code this is not true because we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in ext4_end_io_buffer_write() but do extent conversion only after that in convert workqueue. So the race seems to be there all the time, just without dioread_nolock it's much smaller. Fixing this is going to be non-trivial - I'm not sure we can really move clearing of PageWriteback bit to conversion workqueue. I think we already tried that once but it caused deadlocks for some reason... > > Now, first 512bytes of "foo" will contain all zeros, while > > the beginning of redo1.odf is _not_ zeros. > > > > Again, without aioread_nolock it works as expected. > > > > > > And the most important note: without the patch there's no > > data corruption like that. But instead, there is the > > lockup... ;) Honza
On 08/16/2011 09:53 PM, Jan Kara wrote: > On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >>> 15.08.2011 12:00, Michael Tokarev wrote: >>> [....] >>> >>> So, it looks like this (starting with cold cache): >>> >>> 1. rename the redologs and copy them over - this will >>> make a hot copy of redologs >>> 2. startup oracle - it will complain that the redologs aren't >>> redologs, the header is corrupt >>> 3. shut down oracle, start it up again - it will succeed. >>> >>> If between 1 and 2 you'll issue sync(1) everything will work. >>> When shutting down, oracle calls fsync(), so that's like >>> sync(1) again. >>> >>> If there will be some time between 1. and 2., everything >>> will work too. >>> >>> Without dioread_nolock I can't trigger the problem no matter >>> how I tried. >>> >>> >>> A smaller test case. I used redo1.odf file (one of the >>> redologs) as a test file, any will work. >>> >>> $ cp -p redo1.odf temp >>> $ dd if=temp of=foo iflag=direct count=20 >> Isn't this the expected behavior here? When doing >> 'cp -p redo1.odf temp', data is copied to temp through >> buffer write, but there is no guarantee when data will be >> actually written to disk. Then with 'dd if=temp of=foo >> iflag=direct count=20', data is read directly from disk. >> Very likely, the written data hasn't been flushed to disk >> yet so ext4 returns zero in this case. > No it's not. Buffered and direct IO is supposed to work correctly > (although not fast) together. In this particular case we take care to flush > dirty data from page cache before performing direct IO read... But > something is broken in this path obviously. > > I don't have time to dig into this in detail now but what seems to be the > problem is that with dioread_nolock option, we don't acquire i_mutex for > direct IO reads anymore. Thus these reads can compete with > ext4_end_io_nolock() called from ext4_end_io_work() (this is called under > i_mutex so without dioread_nolock the race cannot happen). > > Hmm, the new writepages code seems to be broken in combination with direct > IO. Direct IO code expects that when filemap_write_and_wait() finishes, > data is on disk but with new bio submission code this is not true because > we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in > ext4_end_io_buffer_write() but do extent conversion only after that in > convert workqueue. So the race seems to be there all the time, just without > dioread_nolock it's much smaller. You are absolutely right. The really problem is that ext4_direct_IO begins to work *after* we clear the page writeback flag and *before* we convert unwritten extent to a valid state. Some of my trace does show that. I am working on it now. > > Fixing this is going to be non-trivial - I'm not sure we can really move > clearing of PageWriteback bit to conversion workqueue. I thienk we already > tried that once but it caused deadlocks for some reason... I just did as what you described and yes I met with another problem and try to resolve it now. Once it is OK, I will send out the patch. Thanks Tao -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d47264c..40c0b10 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) goto out; } - io_end->flag = EXT4_IO_END_UNWRITTEN; + /* + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, + * but being more careful is always safe for the future change. + */ inode = io_end->inode; + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { + io_end->flag |= EXT4_IO_END_UNWRITTEN; + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); + } /* Add the io_end to per-inode completed io list*/ spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 430c401..78839af 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -334,8 +334,10 @@ submit_and_retry: if ((io_end->num_io_pages >= MAX_IO_PAGES) && (io_end->pages[io_end->num_io_pages-1] != io_page)) goto submit_and_retry; - if (buffer_uninit(bh)) - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { + io_end->flag |= EXT4_IO_END_UNWRITTEN; + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); + } io->io_end->size += bh->b_size; io->io_next_block++; ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));