what happened with dccaf33fa37 "ext4: flush any pending end_io requests before DIO" for 3.0?

Message ID 4ED6942F.7070006@msgid.tls.msk.ru
State Not Applicable, archived

Commit Message

Michael Tokarev Nov. 30, 2011, 8:38 p.m. UTC
Hello.

Back in August 2011, a commit was tagged for inclusion in
stable, this one:

commit dccaf33fa37a1bc5d651baeb3bfeb6becb86597b
Author: Jiaying Zhang <jiayingz@google.com>
Date:   Fri Aug 19 19:13:32 2011 -0400

    ext4: flush any pending end_io requests before DIO reads w/dioread_nolock

    There is a race between ext4 buffer write and direct_IO read with
    dioread_nolock mount option enabled. The problem is that we clear
    PageWriteback flag during end_io time but will do
    uninitialized-to-initialized extent conversion later with dioread_nolock.
    If an O_direct read request comes in during this period, ext4 will return
    zero instead of the recently written data.

    This patch checks whether there are any pending uninitialized-to-initialized
    extent conversion requests before doing O_direct read to close the race.
    Note that this is just a bandaid fix. The fundamental issue is that we
    clear PageWriteback flag before we really complete an IO, which is
    problem-prone. To fix the fundamental issue, we may need to implement an
    extent tree cache that we can use to look up pending to-be-converted extents.

    Signed-off-by: Jiaying Zhang <jiayingz@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@kernel.org


There was one more ext4 commit at that time, which made its way into
stable but this one did not.

I wonder if the reason for that was the fact that it needed a small
"backport" for 3.0, since in 3.1+ the code has been moved into another
file, and the context is slightly different.  In that case, attached
is the "backport" which we have been using with 3.0.x since then.

Thanks!

/mjt
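
The race described in the commit message above can be illustrated with
a small user-space toy model. This is only a sketch and not kernel code:
the names toy_inode, toy_end_io, toy_flush_completed_io, toy_dio_read
and toy_race_demo are made up for illustration, and a single boolean
stands in for a non-empty i_completed_io_list.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 8

/* Hypothetical model of one file block backed by one extent. */
struct toy_inode {
	char disk[TOY_BLOCK_SIZE];  /* data that has reached the disk */
	bool extent_initialized;    /* false: extent still marked uninitialized */
	bool conversion_pending;    /* stands in for a non-empty completed-IO list */
};

/*
 * A buffered write reaches end_io: the data is on disk and writeback is
 * considered finished, but the extent conversion is only queued.
 */
static void toy_end_io(struct toy_inode *ino, const char *data)
{
	memcpy(ino->disk, data, TOY_BLOCK_SIZE);
	ino->conversion_pending = true;
}

/* Deferred work: mark the extent initialized. */
static void toy_flush_completed_io(struct toy_inode *ino)
{
	if (ino->conversion_pending) {
		ino->extent_initialized = true;
		ino->conversion_pending = false;
	}
}

/* Direct read: an uninitialized extent reads back as zeros. */
static void toy_dio_read(struct toy_inode *ino, char *out, bool flush_first)
{
	/* Models the fix: flush pending conversions before reading. */
	if (flush_first && ino->conversion_pending)
		toy_flush_completed_io(ino);

	if (ino->extent_initialized)
		memcpy(out, ino->disk, TOY_BLOCK_SIZE);
	else
		memset(out, 0, TOY_BLOCK_SIZE);
}

/* Returns 1 if the read saw the written data, 0 if it saw zeros. */
int toy_race_demo(bool flush_first)
{
	static const char data[TOY_BLOCK_SIZE] = "ABCDEFG";
	struct toy_inode ino;
	char buf[TOY_BLOCK_SIZE];

	memset(&ino, 0, sizeof(ino));
	toy_end_io(&ino, data);               /* write completed, conversion queued */
	toy_dio_read(&ino, buf, flush_first); /* read arrives in the race window */
	return memcmp(buf, data, TOY_BLOCK_SIZE) == 0;
}
```

In this model, toy_race_demo(false) corresponds to the unpatched
behaviour (the read lands in the window and sees zeros), while
toy_race_demo(true) corresponds to flushing pending conversions first.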

Comments

Michael Tokarev Feb. 28, 2012, 11:42 a.m. UTC | #1
Is there something wrong with my question?  I asked it 1.5 months ago...

Meanwhile, we have been using this patch on our database server since
August 2011, and it appears to work correctly - direct and buffered
I/O work together without surprises.  Without this patch, I see
unexpected results.

Thanks,

/mjt

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev March 17, 2012, 9:31 a.m. UTC | #2
Ping?

Maybe just a one-line reply isn't THAT difficult?

We've a data corruption bug in current longterm stable kernel
series which is known and has a fix and tagged for -stable for
over half a year already...

Thanks,

/mjt

Jan Kara March 19, 2012, 4:42 p.m. UTC | #3
On Sat 17-03-12 13:31:30, Michael Tokarev wrote:
> Ping?
> 
> Maybe just a one-line reply isn't THAT difficult?
> 
> We've a data corruption bug in current longterm stable kernel
> series which is known and has a fix and tagged for -stable for
> over half a year already...
  Greg, any idea why this patch was not included?

							Honza

Jiaying Zhang March 19, 2012, 5:10 p.m. UTC | #4
On Mon, Mar 19, 2012 at 9:42 AM, Jan Kara <jack@suse.cz> wrote:
> On Sat 17-03-12 13:31:30, Michael Tokarev wrote:
>> Ping?
>>
>> Maybe just a one-line reply isn't THAT difficult?
Sorry for the slow response. I am not sure what happened to this patch.
Ted, do you know what we need to do to get this patch to the
stable release?

Jiaying

Greg Kroah-Hartman March 19, 2012, 5:21 p.m. UTC | #5
On Mon, Mar 19, 2012 at 05:42:56PM +0100, Jan Kara wrote:
> On Sat 17-03-12 13:31:30, Michael Tokarev wrote:
> > Ping?
> > 
> > Maybe just a one-line reply isn't THAT difficult?
> > 
> > We've a data corruption bug in current longterm stable kernel
> > series which is known and has a fix and tagged for -stable for
> > over half a year already...
>   Greg, any idea why this patch was not included?

I have no idea, sorry, that was way back in August of 2011, I can barely
remember what patches were and were not applied last week...

If it's still needed for 3.0, let me know and I'll be glad to queue it
up.

thanks,

greg k-h
Jiaying Zhang March 19, 2012, 5:30 p.m. UTC | #6
On Mon, Mar 19, 2012 at 10:21 AM, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Mon, Mar 19, 2012 at 05:42:56PM +0100, Jan Kara wrote:
>> On Sat 17-03-12 13:31:30, Michael Tokarev wrote:
>> > Ping?
>> >
>> > Maybe just a one-line reply isn't THAT difficult?
>> >
>> > We've a data corruption bug in current longterm stable kernel
>> > series which is known and has a fix and tagged for -stable for
>> > over half a year already...
>>   Greg, any idea why this patch was not included?
>
> I have no idea, sorry, that was way back in August of 2011, I can barely
> remember what patches were and were not applied last week...
>
> If it's still needed for 3.0, let me know and I'll be glad to queue it
> up.
Yes, it is still needed. The commit ID of the change is
dccaf33fa37a1bc5d651baeb3bfeb6becb86597b.
Please let me know if there is any problem applying the change. Thanks!

Jiaying

>
> thanks,
>
> greg k-h

Patch

(backported to 3.0 by mjt)

commit dccaf33fa37a1bc5d651baeb3bfeb6becb86597b
Author: Jiaying Zhang <jiayingz@google.com>
Date:   Fri Aug 19 19:13:32 2011 -0400

    ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
    
    There is a race between ext4 buffer write and direct_IO read with
    dioread_nolock mount option enabled. The problem is that we clear
    PageWriteback flag during end_io time but will do
    uninitialized-to-initialized extent conversion later with dioread_nolock.
    If an O_direct read request comes in during this period, ext4 will return
    zero instead of the recently written data.
    
    This patch checks whether there are any pending uninitialized-to-initialized
    extent conversion requests before doing O_direct read to close the race.
    Note that this is just a bandaid fix. The fundamental issue is that we
    clear PageWriteback flag before we really complete an IO, which is
    problem-prone. To fix the fundamental issue, we may need to implement an
    extent tree cache that we can use to look up pending to-be-converted extents.
    
    Signed-off-by: Jiaying Zhang <jiayingz@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
    Cc: stable@kernel.org

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b8602cd..0962642 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3507,12 +3507,17 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 	}
 
 retry:
-	if (rw == READ && ext4_should_dioread_nolock(inode))
+	if (rw == READ && ext4_should_dioread_nolock(inode)) {
+		if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+			mutex_lock(&inode->i_mutex);
+			ext4_flush_completed_IO(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
 		ret = __blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
 				 ext4_get_block, NULL, NULL, 0);
-	else {
+	} else {
 		ret = blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,