Patchwork DIO process stuck apparently due to dioread_nolock (3.0)

login
register
mail settings
Submitter Jiaying Zhang
Date Aug. 16, 2011, 9:32 p.m.
Message ID <CAFgt=MCQttE5Vv_4W=hFWU_L_FvELnY6bnFQ3uSOu07VkDm6rA@mail.gmail.com>
Download mbox | patch
Permalink /patch/110243/
State New
Headers show

Comments

Jiaying Zhang - Aug. 16, 2011, 9:32 p.m.
On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
> On 08/16/2011 09:53 PM, Jan Kara wrote:
>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>> [....]
>>>>
>>>> So, it looks like this (starting with cold cache):
>>>>
>>>> 1. rename the redologs and copy them over - this will
>>>>   make a hot copy of redologs
>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>   redologs, the header is corrupt
>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>
>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>> When shutting down, oracle calls fsync(), so that's like
>>>> sync(1) again.
>>>>
>>>> If there will be some time between 1. and 2., everything
>>>> will work too.
>>>>
>>>> Without dioread_nolock I can't trigger the problem no matter
>>>> how I tried.
>>>>
>>>>
>>>> A smaller test case.  I used redo1.odf file (one of the
>>>> redologs) as a test file, any will work.
>>>>
>>>>  $ cp -p redo1.odf temp
>>>>  $ dd if=temp of=foo iflag=direct count=20
>>> Isn't this the expected behavior here? When doing
>>> 'cp -p redo1.odf temp', data is copied to temp through
>>> buffer write, but there is no guarantee when data will be
>>> actually written to disk. Then with 'dd if=temp of=foo
>>> iflag=direct count=20', data is read directly from disk.
>>> Very likely, the written data hasn't been flushed to disk
>>> yet so ext4 returns zero in this case.
>>   No it's not. Buffered and direct IO is supposed to work correctly
>> (although not fast) together. In this particular case we take care to flush
>> dirty data from page cache before performing direct IO read... But
>> something is broken in this path obviously.
I see. Thanks a lot for the explanation.

I wonder whether the following patch will solve the problem:


I tested the patch a little bit and it seems to resolve the race
on dioread_nolock in my case. Michael, I would very appreciate
if you can try this patch with your test case and see whether it works.

>>
>> I don't have time to dig into this in detail now but what seems to be the
>> problem is that with dioread_nolock option, we don't acquire i_mutex for
>> direct IO reads anymore. Thus these reads can compete with
>> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
>> i_mutex so without dioread_nolock the race cannot happen).
>>
>> Hmm, the new writepages code seems to be broken in combination with direct
>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>> data is on disk but with new bio submission code this is not true because
>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>> ext4_end_io_buffer_write() but do extent conversion only after that in
>> convert workqueue. So the race seems to be there all the time, just without
>> dioread_nolock it's much smaller.
I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
So I think we should be ok with the default mount options.

Jiaying

> You are absolutely right. The really problem is that ext4_direct_IO
> begins to work *after* we clear the page writeback flag and *before* we
> convert unwritten extent to a valid state. Some of my trace does show
> that. I am working on it now.
>>
>> Fixing this is going to be non-trivial - I'm not sure we can really move
>> clearing of PageWriteback bit to conversion workqueue. I thienk we already
>> tried that once but it caused deadlocks for some reason...
> I just did as what you described and yes I met with another problem and
> try to resolve it now. Once it is OK, I will send out the patch.
>
> Thanks
> Tao
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev - Aug. 16, 2011, 10:28 p.m.
17.08.2011 01:32, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>>> [....]
>>>>>
>>>>> So, it looks like this (starting with cold cache):
>>>>>
>>>>> 1. rename the redologs and copy them over - this will
>>>>>   make a hot copy of redologs
>>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>>   redologs, the header is corrupt
>>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>>
>>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>>> When shutting down, oracle calls fsync(), so that's like
>>>>> sync(1) again.
>>>>>
>>>>> If there will be some time between 1. and 2., everything
>>>>> will work too.
>>>>>
>>>>> Without dioread_nolock I can't trigger the problem no matter
>>>>> how I tried.
>>>>>
>>>>>
>>>>> A smaller test case.  I used redo1.odf file (one of the
>>>>> redologs) as a test file, any will work.
>>>>>
>>>>>  $ cp -p redo1.odf temp
>>>>>  $ dd if=temp of=foo iflag=direct count=20
>>>> Isn't this the expected behavior here? When doing
>>>> 'cp -p redo1.odf temp', data is copied to temp through
>>>> buffer write, but there is no guarantee when data will be
>>>> actually written to disk. Then with 'dd if=temp of=foo
>>>> iflag=direct count=20', data is read directly from disk.
>>>> Very likely, the written data hasn't been flushed to disk
>>>> yet so ext4 returns zero in this case.
>>>   No it's not. Buffered and direct IO is supposed to work correctly
>>> (although not fast) together. In this particular case we take care to flush
>>> dirty data from page cache before performing direct IO read... But
>>> something is broken in this path obviously.
> I see. Thanks a lot for the explanation.
> 
> I wonder whether the following patch will solve the problem:
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c

It is in inode.c in 3.0, not in indirect.c (there's no such file there).
But it applies (after de-MIMEfying it) - well, almost with no mods... ;)

> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>         }
> 
>  retry:
> -       if (rw == READ && ext4_should_dioread_nolock(inode))
> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> +                       mutex_lock(&inode->i_mutex);
> +                       ext4_flush_completed_IO(inode);
> +                       mutex_unlock(&inode->i_mutex);
> +               }
>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
>                                  ext4_get_block, NULL, NULL, 0);
> -       else {
> +       } else {
>                 ret = blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
> 
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very appreciate
> if you can try this patch with your test case and see whether it works.

So I tried it just now.  I tried hard, several times.  No joy: I don't
see the "corruption" anymore, neither with the simple dd case nor with
oracle version.  I added a printk into the new if-unlikely path, and
it triggers several times, - almost always when I run my "oracle
test-case" - starting the database after newly-copying the redologs.

So it appears you've hit the right place there.  At least for this
test case ;)  Not sure if that's exactly right way to go - maybe
it's possible to have some optimisation in there, in case completed
list is not empty but not overlaps with our I/O list, but that's
probably overkill, dunno.

>>> Hmm, the new writepages code seems to be broken in combination with direct
>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>>> data is on disk but with new bio submission code this is not true because
>>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>>> ext4_end_io_buffer_write() but do extent conversion only after that in
>>> convert workqueue. So the race seems to be there all the time, just without
>>> dioread_nolock it's much smaller.

> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
> So I think we should be ok with the default mount options.

Am I right that the intention was to enable dioread_nolock by default
eventually, or even to keep its behavour only, without mount options?

And thank you -- all -- for your work and support!

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 16, 2011, 11:07 p.m.
On Tue, Aug 16, 2011 at 3:28 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> 17.08.2011 01:32, Jiaying Zhang wrote:
>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
>>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>>>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>>>>> [....]
>>>>>>
>>>>>> So, it looks like this (starting with cold cache):
>>>>>>
>>>>>> 1. rename the redologs and copy them over - this will
>>>>>>   make a hot copy of redologs
>>>>>> 2. startup oracle - it will complain that the redologs aren't
>>>>>>   redologs, the header is corrupt
>>>>>> 3. shut down oracle, start it up again - it will succeed.
>>>>>>
>>>>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>>>>> When shutting down, oracle calls fsync(), so that's like
>>>>>> sync(1) again.
>>>>>>
>>>>>> If there will be some time between 1. and 2., everything
>>>>>> will work too.
>>>>>>
>>>>>> Without dioread_nolock I can't trigger the problem no matter
>>>>>> how I tried.
>>>>>>
>>>>>>
>>>>>> A smaller test case.  I used redo1.odf file (one of the
>>>>>> redologs) as a test file, any will work.
>>>>>>
>>>>>>  $ cp -p redo1.odf temp
>>>>>>  $ dd if=temp of=foo iflag=direct count=20
>>>>> Isn't this the expected behavior here? When doing
>>>>> 'cp -p redo1.odf temp', data is copied to temp through
>>>>> buffer write, but there is no guarantee when data will be
>>>>> actually written to disk. Then with 'dd if=temp of=foo
>>>>> iflag=direct count=20', data is read directly from disk.
>>>>> Very likely, the written data hasn't been flushed to disk
>>>>> yet so ext4 returns zero in this case.
>>>>   No it's not. Buffered and direct IO is supposed to work correctly
>>>> (although not fast) together. In this particular case we take care to flush
>>>> dirty data from page cache before performing direct IO read... But
>>>> something is broken in this path obviously.
>> I see. Thanks a lot for the explanation.
>>
>> I wonder whether the following patch will solve the problem:
>>
>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>> index 6c27111..ca90d73 100644
>> --- a/fs/ext4/indirect.c
>> +++ b/fs/ext4/indirect.c
>
> It is in inode.c in 3.0, not in indirect.c (there's no such file there).
> But it applies (after de-MIMEfying it) - well, almost with no mods... ;)
>
>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>         }
>>
>>  retry:
>> -       if (rw == READ && ext4_should_dioread_nolock(inode))
>> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
>> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>> +                       mutex_lock(&inode->i_mutex);
>> +                       ext4_flush_completed_IO(inode);
>> +                       mutex_unlock(&inode->i_mutex);
>> +               }
>>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>>                                  inode->i_sb->s_bdev, iov,
>>                                  offset, nr_segs,
>>                                  ext4_get_block, NULL, NULL, 0);
>> -       else {
>> +       } else {
>>                 ret = blockdev_direct_IO(rw, iocb, inode,
>>                                  inode->i_sb->s_bdev, iov,
>>                                  offset, nr_segs,
>>
>> I tested the patch a little bit and it seems to resolve the race
>> on dioread_nolock in my case. Michael, I would very appreciate
>> if you can try this patch with your test case and see whether it works.
>
> So I tried it just now.  I tried hard, several times.  No joy: I don't
> see the "corruption" anymore, neither with the simple dd case nor with
> oracle version.  I added a printk into the new if-unlikely path, and
> it triggers several times, - almost always when I run my "oracle
> test-case" - starting the database after newly-copying the redologs.
>
> So it appears you've hit the right place there.  At least for this
> test case ;)  Not sure if that's exactly right way to go - maybe
> it's possible to have some optimisation in there, in case completed
> list is not empty but not overlaps with our I/O list, but that's
> probably overkill, dunno.
>
>>>> Hmm, the new writepages code seems to be broken in combination with direct
>>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
>>>> data is on disk but with new bio submission code this is not true because
>>>> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
>>>> ext4_end_io_buffer_write() but do extent conversion only after that in
>>>> convert workqueue. So the race seems to be there all the time, just without
>>>> dioread_nolock it's much smaller.
>
>> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled.
>> So I think we should be ok with the default mount options.
>
> Am I right that the intention was to enable dioread_nolock by default
> eventually, or even to keep its behavour only, without mount options?
Good question. At the time when we checked in the code, we wanted to be
careful that it didn't introduce data corruptions that would affect normal
workloads. Apparently, the downside is that this code path doesn't get
a good test coverage. Maybe it is time to reconsider enabling this feature
by default. I guess we still want to guard it with a mount option given that
it doesn't work with certain options, like "data=journaled" mode and small
block size.

Jiaying
>
> And thank you -- all -- for your work and support!
>
> /mjt
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 16, 2011, 11:59 p.m.
On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
> > On 08/16/2011 09:53 PM, Jan Kara wrote:
> I wonder whether the following patch will solve the problem:
> 
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>         }
> 
>  retry:
> -       if (rw == READ && ext4_should_dioread_nolock(inode))
> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> +                       mutex_lock(&inode->i_mutex);
> +                       ext4_flush_completed_IO(inode);
> +                       mutex_unlock(&inode->i_mutex);
> +               }
>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
>                                  ext4_get_block, NULL, NULL, 0);
> -       else {
> +       } else {
>                 ret = blockdev_direct_IO(rw, iocb, inode,
>                                  inode->i_sb->s_bdev, iov,
>                                  offset, nr_segs,
> 
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very appreciate
> if you can try this patch with your test case and see whether it works.

Just my 2c worth here: this is a data corruption bug so the root
cause neeeds to be fixed. The above patch does not address the root
cause.

> > You are absolutely right. The really problem is that ext4_direct_IO
> > begins to work *after* we clear the page writeback flag and *before* we
> > convert unwritten extent to a valid state. Some of my trace does show
> > that. I am working on it now.

And that's the root cause - think about what that means for a
minute.  It means that extent conversion can race with anything that
requires IO to complete first. e.g. truncate or fsync.  It can then
race with other subsequent operations, which can have even nastier
effects. IOWs, there is a data-corruption landmine just sitting
there waiting for the next person to trip over it.

Fix the root cause, don't put band-aids over the symptoms.

Cheers,

Dave.
Jiaying Zhang - Aug. 17, 2011, 12:08 a.m.
On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
>> > On 08/16/2011 09:53 PM, Jan Kara wrote:
>> I wonder whether the following patch will solve the problem:
>>
>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>> index 6c27111..ca90d73 100644
>> --- a/fs/ext4/indirect.c
>> +++ b/fs/ext4/indirect.c
>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>         }
>>
>>  retry:
>> -       if (rw == READ && ext4_should_dioread_nolock(inode))
>> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
>> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>> +                       mutex_lock(&inode->i_mutex);
>> +                       ext4_flush_completed_IO(inode);
>> +                       mutex_unlock(&inode->i_mutex);
>> +               }
>>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>>                                  inode->i_sb->s_bdev, iov,
>>                                  offset, nr_segs,
>>                                  ext4_get_block, NULL, NULL, 0);
>> -       else {
>> +       } else {
>>                 ret = blockdev_direct_IO(rw, iocb, inode,
>>                                  inode->i_sb->s_bdev, iov,
>>                                  offset, nr_segs,
>>
>> I tested the patch a little bit and it seems to resolve the race
>> on dioread_nolock in my case. Michael, I would very appreciate
>> if you can try this patch with your test case and see whether it works.
>
> Just my 2c worth here: this is a data corruption bug so the root
> cause neeeds to be fixed. The above patch does not address the root
> cause.
>
>> > You are absolutely right. The really problem is that ext4_direct_IO
>> > begins to work *after* we clear the page writeback flag and *before* we
>> > convert unwritten extent to a valid state. Some of my trace does show
>> > that. I am working on it now.
>
> And that's the root cause - think about what that means for a
> minute.  It means that extent conversion can race with anything that
> requires IO to complete first. e.g. truncate or fsync.  It can then
> race with other subsequent operations, which can have even nastier
> effects. IOWs, there is a data-corruption landmine just sitting
> there waiting for the next person to trip over it.
You are right that extent conversion can race with truncate and fsync
as well. That is why we already need to call ext4_flush_completed_IO()
in those places as well. I agree this is a little nasty and there can be
some other corner cases that we haven't covered. The problem is we
can not do extent conversion during the end_io time. I haven't thought of
a better approach to deal with these races. I am curious how xfs deals
with this problem.

Jiaying

>
> Fix the root cause, don't put band-aids over the symptoms.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 17, 2011, 2:22 a.m.
Hi Jiaying,
On 08/17/2011 08:08 AM, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
>>>> On 08/16/2011 09:53 PM, Jan Kara wrote:
>>> I wonder whether the following patch will solve the problem:
>>>
>>> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
>>> index 6c27111..ca90d73 100644
>>> --- a/fs/ext4/indirect.c
>>> +++ b/fs/ext4/indirect.c
>>> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>>>         }
>>>
>>>  retry:
>>> -       if (rw == READ && ext4_should_dioread_nolock(inode))
>>> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
>>> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
>>> +                       mutex_lock(&inode->i_mutex);
>>> +                       ext4_flush_completed_IO(inode);
>>> +                       mutex_unlock(&inode->i_mutex);
>>> +               }
>>>                 ret = __blockdev_direct_IO(rw, iocb, inode,
>>>                                  inode->i_sb->s_bdev, iov,
>>>                                  offset, nr_segs,
>>>                                  ext4_get_block, NULL, NULL, 0);
>>> -       else {
>>> +       } else {
>>>                 ret = blockdev_direct_IO(rw, iocb, inode,
>>>                                  inode->i_sb->s_bdev, iov,
>>>                                  offset, nr_segs,
>>>
>>> I tested the patch a little bit and it seems to resolve the race
>>> on dioread_nolock in my case. Michael, I would very appreciate
>>> if you can try this patch with your test case and see whether it works.
>>
>> Just my 2c worth here: this is a data corruption bug so the root
>> cause neeeds to be fixed. The above patch does not address the root
>> cause.
>>
>>>> You are absolutely right. The really problem is that ext4_direct_IO
>>>> begins to work *after* we clear the page writeback flag and *before* we
>>>> convert unwritten extent to a valid state. Some of my trace does show
>>>> that. I am working on it now.
>>
>> And that's the root cause - think about what that means for a
>> minute.  It means that extent conversion can race with anything that
>> requires IO to complete first. e.g. truncate or fsync.  It can then
>> race with other subsequent operations, which can have even nastier
>> effects. IOWs, there is a data-corruption landmine just sitting
>> there waiting for the next person to trip over it.
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered. The problem is we
> can not do extent conversion during the end_io time. I haven't thought of
> a better approach to deal with these races. I am curious how xfs deals
> with this problem.
I agree with Dave that we may need to figure out a better way for this.
What's more, you patch has another side-effect: if there are concurrent
direct read and buffered write and even they are not interleaved, the
direct read is affected. Do you have any test data of the performance
regression?

Thanks
Tao

> 
> Jiaying
> 
>>
>> Fix the root cause, don't put band-aids over the symptoms.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Aug. 17, 2011, 9:04 a.m.
On Tue 16-08-11 17:08:42, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> >> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
> >> > On 08/16/2011 09:53 PM, Jan Kara wrote:
> >> I wonder whether the following patch will solve the problem:
> >>
> >> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> >> index 6c27111..ca90d73 100644
> >> --- a/fs/ext4/indirect.c
> >> +++ b/fs/ext4/indirect.c
> >> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
> >>         }
> >>
> >>  retry:
> >> -       if (rw == READ && ext4_should_dioread_nolock(inode))
> >> +       if (rw == READ && ext4_should_dioread_nolock(inode)) {
> >> +               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> >> +                       mutex_lock(&inode->i_mutex);
> >> +                       ext4_flush_completed_IO(inode);
> >> +                       mutex_unlock(&inode->i_mutex);
> >> +               }
> >>                 ret = __blockdev_direct_IO(rw, iocb, inode,
> >>                                  inode->i_sb->s_bdev, iov,
> >>                                  offset, nr_segs,
> >>                                  ext4_get_block, NULL, NULL, 0);
> >> -       else {
> >> +       } else {
> >>                 ret = blockdev_direct_IO(rw, iocb, inode,
> >>                                  inode->i_sb->s_bdev, iov,
> >>                                  offset, nr_segs,
> >>
> >> I tested the patch a little bit and it seems to resolve the race
> >> on dioread_nolock in my case. Michael, I would very appreciate
> >> if you can try this patch with your test case and see whether it works.
> >
> > Just my 2c worth here: this is a data corruption bug so the root
> > cause neeeds to be fixed. The above patch does not address the root
> > cause.
> >
> >> > You are absolutely right. The really problem is that ext4_direct_IO
> >> > begins to work *after* we clear the page writeback flag and *before* we
> >> > convert unwritten extent to a valid state. Some of my trace does show
> >> > that. I am working on it now.
> >
> > And that's the root cause - think about what that means for a
> > minute.  It means that extent conversion can race with anything that
> > requires IO to complete first. e.g. truncate or fsync.  It can then
> > race with other subsequent operations, which can have even nastier
> > effects. IOWs, there is a data-corruption landmine just sitting
> > there waiting for the next person to trip over it.
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered.
  Exactly. I agree with Dave here that it is asking for serious trouble to
clear PageWriteback bit before really completing the IO. 

> The problem is we can not do extent conversion during the end_io time. I
> haven't thought of a better approach to deal with these races. I am
> curious how xfs deals with this problem.
  Well, XFS cannot do extent conversion in end_io for AIO+DIO either. So it
clears PageWriteback bit only after extent conversion has happened in the
worker thread. ext4 has problems with this (deadlocks) because of unlucky
locking of extent tree using i_mutex. So I believe we have to find a better
locking for extent tree so that ext4 can clear PageWriteback bit from the
worker thread...

								Honza
Theodore Ts'o - Aug. 17, 2011, 5:02 p.m.
On Tue, Aug 16, 2011 at 04:07:38PM -0700, Jiaying Zhang wrote:
> Good question. At the time when we checked in the code, we wanted to be
> careful that it didn't introduce data corruptions that would affect normal
> workloads. Apparently, the downside is that this code path doesn't get
> a good test coverage. Maybe it is time to reconsider enabling this feature
> by default. I guess we still want to guard it with a mount option given that
> it doesn't work with certain options, like "data=journaled" mode and small
> block size.

What I'd like to do long-term here is to change things so that (a)
instead of instantiating the extent as uninitialized, writing the
data, and then doing the uninit->init conversion to writing the data
and then instantiated the extent as initialzied.  This would also
allow us to get rid of data=ordered mode.  And we should make it work
for fs block size != page size.

It means that we need a way of adding this sort of information into an
in-memory extent cache but which isn't saved to disk until the data is
written.  We've also talked about adding the information about whether
an extent is subject to delalloc as well, so we don't have to grovel
through the page cache and look at individual buffers attached to the
pages.  And there are folks who have been experimenting with an
in-memory extent tree cache to speed access to fast PCIe-attached
flash.

It seems to me that if we're careful a single solution should be able
to solve all of these problems...

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev - Aug. 18, 2011, 6:49 a.m.
17.08.2011 21:02, Ted Ts'o wrote:
[]
> What I'd like to do long-term here is to change things so that (a)
> instead of instantiating the extent as uninitialized, writing the
> data, and then doing the uninit->init conversion to writing the data
> and then instantiated the extent as initialzied.  This would also
> allow us to get rid of data=ordered mode.  And we should make it work
> for fs block size != page size.
> 
> It means that we need a way of adding this sort of information into an
> in-memory extent cache but which isn't saved to disk until the data is
> written.  We've also talked about adding the information about whether
> an extent is subject to delalloc as well, so we don't have to grovel
> through the page cache and look at individual buffers attached to the
> pages.  And there are folks who have been experimenting with an
> in-memory extent tree cache to speed access to fast PCIe-attached
> flash.
> 
> It seems to me that if we're careful a single solution should be able
> to solve all of these problems...

What about current situation, how do you think - should it be ignored
for now, having in mind that dioread_nolock isn't used often (but it
gives _serious_ difference in read speed), or, short term, fix this
very case which have real-life impact already, while implementing a
long-term solution?

Thank you!

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 18, 2011, 6:54 p.m.
On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> 17.08.2011 21:02, Ted Ts'o wrote:
> []
>> What I'd like to do long-term here is to change things so that (a)
>> instead of instantiating the extent as uninitialized, writing the
>> data, and then doing the uninit->init conversion to writing the data
>> and then instantiated the extent as initialzied.  This would also
>> allow us to get rid of data=ordered mode.  And we should make it work
>> for fs block size != page size.
>>
>> It means that we need a way of adding this sort of information into an
>> in-memory extent cache but which isn't saved to disk until the data is
>> written.  We've also talked about adding the information about whether
>> an extent is subject to delalloc as well, so we don't have to grovel
>> through the page cache and look at individual buffers attached to the
>> pages.  And there are folks who have been experimenting with an
>> in-memory extent tree cache to speed access to fast PCIe-attached
>> flash.
>>
>> It seems to me that if we're careful a single solution should be able
>> to solve all of these problems...
>
> What about current situation, how do you think - should it be ignored
> for now, having in mind that dioread_nolock isn't used often (but it
> gives _serious_ difference in read speed), or, short term, fix this
> very case which have real-life impact already, while implementing a
> long-term solution?
I plan to send my patch as a bandaid fix. It doesn't solve the fundamental
problem but I think it helps close the race you saw on your test. In the long
term, I agree that we should think about implementing an extent tree cache
and use it to hold pending uninitialized-to-initialized extent conversions.

Jiaying

>
> Thank you!
>
> /mjt
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 19, 2011, 3:18 a.m.
Hi Michael,
On 08/18/2011 02:49 PM, Michael Tokarev wrote:
> 17.08.2011 21:02, Ted Ts'o wrote:
> []
>> What I'd like to do long-term here is to change things so that (a)
>> instead of instantiating the extent as uninitialized, writing the
>> data, and then doing the uninit->init conversion to writing the data
>> and then instantiated the extent as initialzied.  This would also
>> allow us to get rid of data=ordered mode.  And we should make it work
>> for fs block size != page size.
>>
>> It means that we need a way of adding this sort of information into an
>> in-memory extent cache but which isn't saved to disk until the data is
>> written.  We've also talked about adding the information about whether
>> an extent is subject to delalloc as well, so we don't have to grovel
>> through the page cache and look at individual buffers attached to the
>> pages.  And there are folks who have been experimenting with an
>> in-memory extent tree cache to speed access to fast PCIe-attached
>> flash.
>>
>> It seems to me that if we're careful a single solution should be able
>> to solve all of these problems...
> 
> What about current situation, how do you think - should it be ignored
> for now, having in mind that dioread_nolock isn't used often (but it
> gives _serious_ difference in read speed), or, short term, fix this
> very case which have real-life impact already, while implementing a
> long-term solution?
So could you please share with us how you test and your test result
with/without dioread_nolock? A quick test with fio and intel ssd does't
see much improvement here.

We are based on RHEL6, and dioread_nolock isn't there by now and a large
number of our product system use direct read and buffer write. So if
your test proves to be promising, I guess our company can arrange some
resources to try to work it out.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 19, 2011, 3:20 a.m.
Hi Ted and Jiaying,
On 08/19/2011 02:54 AM, Jiaying Zhang wrote:
> On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>> 17.08.2011 21:02, Ted Ts'o wrote:
>> []
>>> What I'd like to do long-term here is to change things so that (a)
>>> instead of instantiating the extent as uninitialized, writing the
>>> data, and then doing the uninit->init conversion to writing the data
>>> and then instantiated the extent as initialzied.  This would also
>>> allow us to get rid of data=ordered mode.  And we should make it work
>>> for fs block size != page size.
>>>
>>> It means that we need a way of adding this sort of information into an
>>> in-memory extent cache but which isn't saved to disk until the data is
>>> written.  We've also talked about adding the information about whether
>>> an extent is subject to delalloc as well, so we don't have to grovel
>>> through the page cache and look at individual buffers attached to the
>>> pages.  And there are folks who have been experimenting with an
>>> in-memory extent tree cache to speed access to fast PCIe-attached
>>> flash.
>>>
>>> It seems to me that if we're careful a single solution should be able
>>> to solve all of these problems...
>>
>> What about current situation, how do you think - should it be ignored
>> for now, having in mind that dioread_nolock isn't used often (but it
>> gives _serious_ difference in read speed), or, short term, fix this
>> very case which have real-life impact already, while implementing a
>> long-term solution?
> I plan to send my patch as a bandaid fix. It doesn't solve the fundamental
> problem but I think it helps close the race you saw on your test. In the long
> term, I agree that we should think about implementing an extent tree cache
> and use it to hold pending uninitialized-to-initialized extent conversions.
Does Google has some plan of doing it recently? We used a large number
of direct read, and we can arrange some resources to try to work it out.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael Tokarev - Aug. 19, 2011, 7:05 a.m.
On 19.08.2011 07:18, Tao Ma wrote:
> Hi Michael,
> On 08/18/2011 02:49 PM, Michael Tokarev wrote:
[]
>> What about current situation, how do you think - should it be ignored
>> for now, having in mind that dioread_nolock isn't used often (but it
>> gives _serious_ difference in read speed), or, short term, fix this
>> very case which have real-life impact already, while implementing a
>> long-term solution?

> So could you please share with us how you test and your test result
> with/without dioread_nolock? A quick test with fio and intel ssd does't
> see much improvement here.

I used my home-grown quick-n-dirty microbenchmark for years to measure
i/o subsystem performance.  Here are the results from 3.0 kernel on
some Hitachi NAS (FC, on brocade adaptors), 14-drive raid10 array.

The numbers are all megabytes/sec transferred (read or written), summed
for all threads.  Leftmost column is the block size; next column is the
number of concurrent threads of the same type.  And the columns are
tests: linear read, random read, linear write, random write, and
concurrent random read and write.

For a raw device:

BlkSz Trd linRd rndRd linWr rndWr  rndR/W
   4k   1  18.3   0.8  14.5   9.6   0.1/  9.1
        4         2.5         9.4   0.4/  8.4
       32        10.0         9.3   4.7/  5.4
  16k   1  59.4   2.5  49.9  35.7   0.3/ 34.7
        4        10.3        36.1   1.5/ 31.4
       32        38.5        36.2  17.5/ 20.4
  64k   1 118.4   9.1 136.0 106.5   1.1/105.8
        4        37.7       108.5   4.7/102.6
       32       153.0       108.5  57.9/ 73.3
 128k   1 125.9  16.5 138.8 125.8   1.1/125.6
        4        68.7       128.7   6.3/122.8
       32       277.0       128.7  70.3/ 98.6
1024k   1  89.9  81.2 138.9 134.4   5.0/132.3
        4       254.7       137.6  19.2/127.1
       32       390.7       137.5 117.2/ 90.1

For ext4fs, 1Tb file, default mount options:

BlkSz Trd linRd rndRd linWr rndWr  rndR/W
   4k   1  15.7   0.6  15.4   9.4   0.0/  9.0
        4         2.6         9.3   0.0/  8.9
       32        10.0         9.3   0.0/  8.9
  16k   1  47.6   2.5  53.2  34.6   0.1/ 33.6
        4        10.2        34.6   0.0/ 33.5
       32        39.9        34.8   0.1/ 33.6
  64k   1 100.5   9.0 137.0 106.2   0.2/105.8
        4        37.8       107.8   0.1/106.1
       32       153.9       107.8   0.2/105.9
 128k   1 115.4  16.3 138.6 125.2   0.3/125.3
        4        68.8       127.8   0.2/125.6
       32       274.6       127.8   0.2/126.2
1024k   1 124.5  54.2 138.9 133.6   1.0/133.3
        4       159.5       136.6   0.2/134.3
       32       349.7       136.5   0.3/133.6

And for a 1tb file on ext4fs with dioread_nolock:

BlkSz Trd linRd rndRd linWr rndWr  rndR/W
   4k   1  15.7   0.6  14.6   9.4   0.1/  9.0
        4         2.6         9.4   0.3/  8.6
       32        10.0         9.4   4.5/  5.3
  16k   1  50.9   2.4  56.7  36.0   0.3/ 35.2
        4        10.1        36.4   1.5/ 34.6
       32        38.7        36.4  17.3/ 21.0
  64k   1  95.2   8.9 136.5 106.8   1.0/106.3
        4        37.7       108.4   5.2/103.3
       32       152.7       108.6  57.4/ 74.0
 128k   1 115.1  16.3 138.8 125.8   1.2/126.4
        4        68.9       128.5   5.7/124.0
       32       276.1       128.6  70.8/ 98.5
1024k   1 128.5  81.9 138.9 134.4   5.1/132.3
        4       253.4       137.4  19.1/126.8
       32       385.1       137.4 111.7/ 92.3

These are complete test results.  First 4 result
columns are merely identical, the difference is
within last column.  Here they are together:

BlkSz Trd     Raw      Ext4nolock  Ext4dflt
   4k   1   0.1/  9.1   0.1/  9.0  0.0/  9.0
        4   0.4/  8.4   0.3/  8.6  0.0/  8.9
       32   4.7/  5.4   4.5/  5.3  0.0/  8.9
  16k   1   0.3/ 34.7   0.3/ 35.2  0.1/ 33.6
        4   1.5/ 31.4   1.5/ 34.6  0.0/ 33.5
       32  17.5/ 20.4  17.3/ 21.0  0.1/ 33.6
  64k   1   1.1/105.8   1.0/106.3  0.2/105.8
        4   4.7/102.6   5.2/103.3  0.1/106.1
       32  57.9/ 73.3  57.4/ 74.0  0.2/105.9
 128k   1   1.1/125.6   1.2/126.4  0.3/125.3
        4   6.3/122.8   5.7/124.0  0.2/125.6
       32  70.3/ 98.6  70.8/ 98.5  0.2/126.2
1024k   1   5.0/132.3   5.1/132.3  1.0/133.3
        4  19.2/127.1  19.1/126.8  0.2/134.3
       32 117.2/ 90.1 111.7/ 92.3  0.3/133.6

Ext4 with dioread_nolock (middle column) behaves close to
raw device.  But default ext4 greatly prefers writes over
reads, reads are almost non-existent.

This is, again, more or less a microbenchmark.  Where it
come from is my attempt to simulate an (oracle) database
workload (many years ago, when larger and more standard
now benchmarks weren't (freely) available).  And there,
on a busy DB, the difference is quite well-visible.
In short, any writer makes all readers to wait.  Once
we start writing something, all users immediately notice.
With dioread_nolock they don't complain anymore.

There's some more background around this all.  Right
now I'm evaluating a new machine for our current database.
Old hardware had 2Gb RAM so it had _significant_ memory
pressure, and lots of stuff weren't able to be cached.
New machine has 128Gb of RAM, which will ensure that
all important stuff is in cache.  So the effect of this
read/write disbalance will be much less visible.

For example, we've a dictionary (several tables) with
addresses - towns, streets, even buildings.  When they
enter customer information they search in these dicts.
With current 2Gb memory thses dictionaries can't be
kept in memory, so they gets read from disk again every
time someone enters customer information, and this is
what they do all the time.  So no doubt disk access is
very important here.

On a new hardware, obviously, all these dictionaries will
be in memory after first access, so even if all reads will
wait till any write completes, it wont be as dramatic as
it is now.

That to say, -- maybe I'm really paying too much attention
for a wrong problem.  So far, on a new machine, I don't see
actual noticeable difference between dioread_nolock and
without that option.

(BTW, I found no way to remount a filesystem to EXclude
that option, I have to umount and mount it in order to
switch from using dioread_nolock to not using it.  Is
there a way?)

Thanks,

/mjt

> We are based on RHEL6, and dioread_nolock isn't there by now and a large
> number of our product system use direct read and buffer write. So if
> your test proves to be promising, I guess our company can arrange some
> resources to try to work it out.
> 
> Thanks
> Tao

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 19, 2011, 5:55 p.m.
On Fri, Aug 19, 2011 at 12:05 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> On 19.08.2011 07:18, Tao Ma wrote:
>> Hi Michael,
>> On 08/18/2011 02:49 PM, Michael Tokarev wrote:
> []
>>> What about current situation, how do you think - should it be ignored
>>> for now, having in mind that dioread_nolock isn't used often (but it
>>> gives _serious_ difference in read speed), or, short term, fix this
>>> very case which have real-life impact already, while implementing a
>>> long-term solution?
>
>> So could you please share with us how you test and your test result
>> with/without dioread_nolock? A quick test with fio and intel ssd does't
>> see much improvement here.
>
> I used my home-grown quick-n-dirty microbenchmark for years to measure
> i/o subsystem performance.  Here are the results from 3.0 kernel on
> some Hitachi NAS (FC, on brocade adaptors), 14-drive raid10 array.
>
> The numbers are all megabytes/sec transferred (read or written), summed
> for all threads.  Leftmost column is the block size; next column is the
> number of concurrent threads of the same type.  And the columns are
> tests: linear read, random read, linear write, random write, and
> concurrent random read and write.
>
> For a raw device:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  18.3   0.8  14.5   9.6   0.1/  9.1
>        4         2.5         9.4   0.4/  8.4
>       32        10.0         9.3   4.7/  5.4
>  16k   1  59.4   2.5  49.9  35.7   0.3/ 34.7
>        4        10.3        36.1   1.5/ 31.4
>       32        38.5        36.2  17.5/ 20.4
>  64k   1 118.4   9.1 136.0 106.5   1.1/105.8
>        4        37.7       108.5   4.7/102.6
>       32       153.0       108.5  57.9/ 73.3
>  128k   1 125.9  16.5 138.8 125.8   1.1/125.6
>        4        68.7       128.7   6.3/122.8
>       32       277.0       128.7  70.3/ 98.6
> 1024k   1  89.9  81.2 138.9 134.4   5.0/132.3
>        4       254.7       137.6  19.2/127.1
>       32       390.7       137.5 117.2/ 90.1
>
> For ext4fs, 1Tb file, default mount options:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  15.7   0.6  15.4   9.4   0.0/  9.0
>        4         2.6         9.3   0.0/  8.9
>       32        10.0         9.3   0.0/  8.9
>  16k   1  47.6   2.5  53.2  34.6   0.1/ 33.6
>        4        10.2        34.6   0.0/ 33.5
>       32        39.9        34.8   0.1/ 33.6
>  64k   1 100.5   9.0 137.0 106.2   0.2/105.8
>        4        37.8       107.8   0.1/106.1
>       32       153.9       107.8   0.2/105.9
>  128k   1 115.4  16.3 138.6 125.2   0.3/125.3
>        4        68.8       127.8   0.2/125.6
>       32       274.6       127.8   0.2/126.2
> 1024k   1 124.5  54.2 138.9 133.6   1.0/133.3
>        4       159.5       136.6   0.2/134.3
>       32       349.7       136.5   0.3/133.6
>
> And for a 1tb file on ext4fs with dioread_nolock:
>
> BlkSz Trd linRd rndRd linWr rndWr  rndR/W
>   4k   1  15.7   0.6  14.6   9.4   0.1/  9.0
>        4         2.6         9.4   0.3/  8.6
>       32        10.0         9.4   4.5/  5.3
>  16k   1  50.9   2.4  56.7  36.0   0.3/ 35.2
>        4        10.1        36.4   1.5/ 34.6
>       32        38.7        36.4  17.3/ 21.0
>  64k   1  95.2   8.9 136.5 106.8   1.0/106.3
>        4        37.7       108.4   5.2/103.3
>       32       152.7       108.6  57.4/ 74.0
>  128k   1 115.1  16.3 138.8 125.8   1.2/126.4
>        4        68.9       128.5   5.7/124.0
>       32       276.1       128.6  70.8/ 98.5
> 1024k   1 128.5  81.9 138.9 134.4   5.1/132.3
>        4       253.4       137.4  19.1/126.8
>       32       385.1       137.4 111.7/ 92.3
>
> These are complete test results.  First 4 result
> columns are merely identical, the difference is
> within last column.  Here they are together:
>
> BlkSz Trd     Raw      Ext4nolock  Ext4dflt
>   4k   1   0.1/  9.1   0.1/  9.0  0.0/  9.0
>        4   0.4/  8.4   0.3/  8.6  0.0/  8.9
>       32   4.7/  5.4   4.5/  5.3  0.0/  8.9
>  16k   1   0.3/ 34.7   0.3/ 35.2  0.1/ 33.6
>        4   1.5/ 31.4   1.5/ 34.6  0.0/ 33.5
>       32  17.5/ 20.4  17.3/ 21.0  0.1/ 33.6
>  64k   1   1.1/105.8   1.0/106.3  0.2/105.8
>        4   4.7/102.6   5.2/103.3  0.1/106.1
>       32  57.9/ 73.3  57.4/ 74.0  0.2/105.9
>  128k   1   1.1/125.6   1.2/126.4  0.3/125.3
>        4   6.3/122.8   5.7/124.0  0.2/125.6
>       32  70.3/ 98.6  70.8/ 98.5  0.2/126.2
> 1024k   1   5.0/132.3   5.1/132.3  1.0/133.3
>        4  19.2/127.1  19.1/126.8  0.2/134.3
>       32 117.2/ 90.1 111.7/ 92.3  0.3/133.6
>
> Ext4 with dioread_nolock (middle column) behaves close to
> raw device.  But default ext4 greatly prefers writes over
> reads, reads are almost non-existent.
>
> This is, again, more or less a microbenchmark.  Where it
> come from is my attempt to simulate an (oracle) database
> workload (many years ago, when larger and more standard
> now benchmarks weren't (freely) available).  And there,
> on a busy DB, the difference is quite well-visible.
> In short, any writer makes all readers to wait.  Once
> we start writing something, all users immediately notice.
> With dioread_nolock they don't complain anymore.
>
> There's some more background around this all.  Right
> now I'm evaluating a new machine for our current database.
> Old hardware had 2Gb RAM so it had _significant_ memory
> pressure, and lots of stuff weren't able to be cached.
> New machine has 128Gb of RAM, which will ensure that
> all important stuff is in cache.  So the effect of this
> read/write disbalance will be much less visible.
>
> For example, we've a dictionary (several tables) with
> addresses - towns, streets, even buildings.  When they
> enter customer information they search in these dicts.
> With current 2Gb memory thses dictionaries can't be
> kept in memory, so they gets read from disk again every
> time someone enters customer information, and this is
> what they do all the time.  So no doubt disk access is
> very important here.
>
> On a new hardware, obviously, all these dictionaries will
> be in memory after first access, so even if all reads will
> wait till any write completes, it wont be as dramatic as
> it is now.
>
> That to say, -- maybe I'm really paying too much attention
> for a wrong problem.  So far, on a new machine, I don't see
> actual noticeable difference between dioread_nolock and
> without that option.
>
> (BTW, I found no way to remount a filesystem to EXclude
> that option, I have to umount and mount it in order to
> switch from using dioread_nolock to not using it.  Is
> there a way?)
I think the command to do this is:
mount -o remount,dioread_lock /dev/xxx <mountpoint>

Now looking at this, I guess it is not very intuitive that the option to
turn off dioread_nolock is dioread_lock instead of nodioread_nolock,
but nodioread_nolock does look ugly. Maybe we should try to support
both ways.

Jiaying
>
> Thanks,
>
> /mjt
>
>> We are based on RHEL6, and dioread_nolock isn't there by now and a large
>> number of our product system use direct read and buffer write. So if
>> your test proves to be promising, I guess our company can arrange some
>> resources to try to work it out.
>>
>> Thanks
>> Tao
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..ca90d73 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -800,12 +800,17 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
        }

 retry:
-       if (rw == READ && ext4_should_dioread_nolock(inode))
+       if (rw == READ && ext4_should_dioread_nolock(inode)) {
+               if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+                       mutex_lock(&inode->i_mutex);
+                       ext4_flush_completed_IO(inode);
+                       mutex_unlock(&inode->i_mutex);
+               }
                ret = __blockdev_direct_IO(rw, iocb, inode,
                                 inode->i_sb->s_bdev, iov,
                                 offset, nr_segs,
                                 ext4_get_block, NULL, NULL, 0);
-       else {
+       } else {
                ret = blockdev_direct_IO(rw, iocb, inode,
                                 inode->i_sb->s_bdev, iov,
                                 offset, nr_segs,