[0/3] Revert parallel dio reads

Message ID 1566871552-60946-1-git-send-email-joseph.qi@linux.alibaba.com

Message

Joseph Qi Aug. 27, 2019, 2:05 a.m. UTC
This patch set reverts the parallel dio reads feature for now, since it
causes a significant performance regression in mixed random read/write
scenarios.

Joseph Qi (3):
  Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
  Revert "ext4: fix off-by-one error when writing back pages before dio
    read"
  Revert "ext4: Allow parallel DIO reads"

 fs/ext4/ext4.h        | 17 +++++++++++++++++
 fs/ext4/extents.c     | 19 ++++++++++++++-----
 fs/ext4/inode.c       | 47 +++++++++++++++++++++++++++++++----------------
 fs/ext4/ioctl.c       |  4 ++++
 fs/ext4/move_extent.c |  4 ++++
 fs/ext4/super.c       | 12 +++++++-----
 6 files changed, 77 insertions(+), 26 deletions(-)

Comments

Dave Chinner Aug. 27, 2019, 11:51 a.m. UTC | #1
On Tue, Aug 27, 2019 at 10:05:49AM +0800, Joseph Qi wrote:
> This patch set reverts the parallel dio reads feature for now, since it
> causes a significant performance regression in mixed random read/write
> scenarios.
> 
> Joseph Qi (3):
>   Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
>   Revert "ext4: fix off-by-one error when writing back pages before dio
>     read"
>   Revert "ext4: Allow parallel DIO reads"
> 
>  fs/ext4/ext4.h        | 17 +++++++++++++++++
>  fs/ext4/extents.c     | 19 ++++++++++++++-----
>  fs/ext4/inode.c       | 47 +++++++++++++++++++++++++++++++----------------
>  fs/ext4/ioctl.c       |  4 ++++
>  fs/ext4/move_extent.c |  4 ++++
>  fs/ext4/super.c       | 12 +++++++-----
>  6 files changed, 77 insertions(+), 26 deletions(-)

Before doing this, you might want to have a chat and co-ordinate
with the folks that are currently trying to port the ext4 direct IO
code to use the iomap infrastructure:

https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t

That is going to need the shared locking on read and will work just
fine with shared locking on write, too (it's the code that XFS uses
for direct IO). So it might be best here if you work towards shared
locking on the write side rather than just revert the shared locking
on the read side....
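
For reference, the feature being reverted boils down to taking the inode
rwsem in shared rather than exclusive mode around DIO read submission. A
minimal sketch of that shape (simplified; the submission helper here is a
placeholder, not the real ext4 function):

	/*
	 * DIO reads take the inode rwsem shared, so readers no longer
	 * serialize against each other, only against exclusive users
	 * (writes, truncate).
	 */
	static ssize_t dio_read_sketch(struct kiocb *iocb, struct iov_iter *to)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		size_t count = iov_iter_count(to);
		ssize_t ret;

		inode_lock_shared(inode);
		/* Flush dirty page cache over the range so DIO sees current data. */
		ret = filemap_write_and_wait_range(inode->i_mapping, iocb->ki_pos,
						   iocb->ki_pos + count - 1);
		if (!ret)
			ret = submit_blockdev_dio(iocb, to);	/* placeholder */
		inode_unlock_shared(inode);
		return ret;
	}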

Cheers,

Dave.
Jan Kara Aug. 29, 2019, 10:58 a.m. UTC | #2
On Tue 27-08-19 21:51:18, Dave Chinner wrote:
> On Tue, Aug 27, 2019 at 10:05:49AM +0800, Joseph Qi wrote:
> > This patch set reverts the parallel dio reads feature for now, since it
> > causes a significant performance regression in mixed random read/write
> > scenarios.
> > 
> > Joseph Qi (3):
> >   Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
> >   Revert "ext4: fix off-by-one error when writing back pages before dio
> >     read"
> >   Revert "ext4: Allow parallel DIO reads"
> > 
> >  fs/ext4/ext4.h        | 17 +++++++++++++++++
> >  fs/ext4/extents.c     | 19 ++++++++++++++-----
> >  fs/ext4/inode.c       | 47 +++++++++++++++++++++++++++++++----------------
> >  fs/ext4/ioctl.c       |  4 ++++
> >  fs/ext4/move_extent.c |  4 ++++
> >  fs/ext4/super.c       | 12 +++++++-----
> >  6 files changed, 77 insertions(+), 26 deletions(-)
> 
> Before doing this, you might want to have a chat and co-ordinate
> with the folks that are currently trying to port the ext4 direct IO
> code to use the iomap infrastructure:
> 
> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
> 
> That is going to need the shared locking on read and will work just
> fine with shared locking on write, too (it's the code that XFS uses
> for direct IO). So it might be best here if you work towards shared
> locking on the write side rather than just revert the shared locking
> on the read side....

Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
inode lock for all aligned non-extending DIO writes will be easy so I'd
prefer if we didn't have to redo the iomap conversion patches due to these
reverts.
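
To illustrate the class of writes meant here, a hypothetical predicate (not
from any posted patch) for when a DIO write could run under the shared lock
might look like:

	/* Hypothetical check: may this DIO write run under a shared inode lock? */
	static bool dio_write_may_lock_shared(struct inode *inode, loff_t pos,
					      size_t len)
	{
		unsigned int blkmask = (1 << inode->i_blkbits) - 1;

		if ((pos | len) & blkmask)	/* unaligned: needs partial-block zeroing */
			return false;
		if (pos + len > i_size_read(inode))	/* extending: must update i_size */
			return false;
		/* A real implementation would also check the range is fully allocated. */
		return true;	/* block-aligned and non-extending */
	}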

								Honza
Andreas Dilger Aug. 29, 2019, 7:06 p.m. UTC | #3
On Aug 29, 2019, at 4:58 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Tue 27-08-19 21:51:18, Dave Chinner wrote:
>> On Tue, Aug 27, 2019 at 10:05:49AM +0800, Joseph Qi wrote:
>>> This patch set reverts the parallel dio reads feature for now, since it
>>> causes a significant performance regression in mixed random read/write
>>> scenarios.
>>> 
>>> Joseph Qi (3):
>>>  Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
>>>  Revert "ext4: fix off-by-one error when writing back pages before dio
>>>    read"
>>>  Revert "ext4: Allow parallel DIO reads"
>>> 
>>> fs/ext4/ext4.h        | 17 +++++++++++++++++
>>> fs/ext4/extents.c     | 19 ++++++++++++++-----
>>> fs/ext4/inode.c       | 47 +++++++++++++++++++++++++++++++----------------
>>> fs/ext4/ioctl.c       |  4 ++++
>>> fs/ext4/move_extent.c |  4 ++++
>>> fs/ext4/super.c       | 12 +++++++-----
>>> 6 files changed, 77 insertions(+), 26 deletions(-)
>> 
>> Before doing this, you might want to have a chat and co-ordinate
>> with the folks that are currently trying to port the ext4 direct IO
>> code to use the iomap infrastructure:
>> 
>> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
>> 
>> That is going to need the shared locking on read and will work just
>> fine with shared locking on write, too (it's the code that XFS uses
>> for direct IO). So it might be best here if you work towards shared
>> locking on the write side rather than just revert the shared locking
>> on the read side....
> 
> Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
> inode lock for all aligned non-extending DIO writes will be easy so I'd
> prefer if we didn't have to redo the iomap conversion patches due to these
> reverts.

But if the next kernel is LTS and the iomap implementation isn't in the
current merge window (very unlikely), then we're stuck with this performance
hit for LTS.  It is also unlikely that LTS will take the revert patches if
they have not landed in master.

Cheers, Andreas
Jan Kara Aug. 30, 2019, 3:35 p.m. UTC | #4
On Thu 29-08-19 13:06:22, Andreas Dilger wrote:
> On Aug 29, 2019, at 4:58 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Tue 27-08-19 21:51:18, Dave Chinner wrote:
> >> On Tue, Aug 27, 2019 at 10:05:49AM +0800, Joseph Qi wrote:
> >>> This patch set reverts the parallel dio reads feature for now, since it
> >>> causes a significant performance regression in mixed random read/write
> >>> scenarios.
> >>> 
> >>> Joseph Qi (3):
> >>>  Revert "ext4: remove EXT4_STATE_DIOREAD_LOCK flag"
> >>>  Revert "ext4: fix off-by-one error when writing back pages before dio
> >>>    read"
> >>>  Revert "ext4: Allow parallel DIO reads"
> >>> 
> >>> fs/ext4/ext4.h        | 17 +++++++++++++++++
> >>> fs/ext4/extents.c     | 19 ++++++++++++++-----
> >>> fs/ext4/inode.c       | 47 +++++++++++++++++++++++++++++++----------------
> >>> fs/ext4/ioctl.c       |  4 ++++
> >>> fs/ext4/move_extent.c |  4 ++++
> >>> fs/ext4/super.c       | 12 +++++++-----
> >>> 6 files changed, 77 insertions(+), 26 deletions(-)
> >> 
> >> Before doing this, you might want to have a chat and co-ordinate
> >> with the folks that are currently trying to port the ext4 direct IO
> >> code to use the iomap infrastructure:
> >> 
> >> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
> >> 
> >> That is going to need the shared locking on read and will work just
> >> fine with shared locking on write, too (it's the code that XFS uses
> >> for direct IO). So it might be best here if you work towards shared
> >> locking on the write side rather than just revert the shared locking
> >> on the read side....
> > 
> > Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
> > inode lock for all aligned non-extending DIO writes will be easy so I'd
> > prefer if we didn't have to redo the iomap conversion patches due to these
> > reverts.
> 
> But if the next kernel is LTS and the iomap implementation isn't in the
> current merge window (very unlikely), then we're stuck with this performance
> hit for LTS.  It is also unlikely that LTS will take the revert patches if
> they have not landed in master.

I agree this is not great, but the regression has been there for 3 years, it
was released in major distribution kernels quite a long time ago, and only
now has someone complained. So it doesn't seem many people care about the
performance of mixed RW workloads when mounted with dioread_nolock (note
that the patches actually improve the performance of read-only DIO workloads
when not using dioread_nolock since, for that case, the exclusive lock is
replaced with a shared one).

									Honza
Ritesh Harjani Sept. 10, 2019, 2:10 p.m. UTC | #5
Hello,

>> Before doing this, you might want to have a chat and co-ordinate
>> with the folks that are currently trying to port the ext4 direct IO
>> code to use the iomap infrastructure:
>>
>> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
>>
>> That is going to need the shared locking on read and will work just
>> fine with shared locking on write, too (it's the code that XFS uses
>> for direct IO). So it might be best here if you work towards shared
>> locking on the write side rather than just revert the shared locking
>> on the read side....
> 
> Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
> inode lock for all aligned non-extending DIO writes will be easy so I'd
> prefer if we didn't have to redo the iomap conversion patches due to these
> reverts.

I was looking into this to see what is required to convert the DIO write
side to shared locking.
Why do you say shared inode lock only for *aligned non-extending DIO
writes*?
Non-extending means the overwrite case only, which even in today's code is
not under any exclusive inode_lock (except the orphan handling and
truncate handling in case of error after IO is completed). But even with
the current code the perf problem is reported, right?

If we look at today's code (including the new iomap code for the overwrite
case, where we downgrade the lock), we call inode_unlock() in the overwrite
case and then do the IO, since we don't have to allocate any blocks in this
case:


	/*
	 * Make all waiters for direct IO properly wait also for extent
	 * conversion. This also disallows race between truncate() and
	 * overwrite DIO as i_dio_count needs to be incremented under
	 * i_mutex.
	 */
	inode_dio_begin(inode);
	/* If we do an overwrite dio, i_mutex locking can be released */
	overwrite = *((int *)iocb->private);
	if (overwrite)
		inode_unlock(inode);

	write data (via __blockdev_direct_IO)

	inode_dio_end(inode);
	/* take i_mutex locking again if we did an overwrite dio */
	if (overwrite)
		inode_lock(inode);



What can we do next:

Earlier, DIO reads did not take any inode locking.
IIUC, the inode_lock_shared() in the DIO read case was added to
protect against the race of reading back uninitialized/stale data, e.g.
in case of truncate or writeback.

So as Dave suggested, shouldn't we add the complete shared locking in
the DIO write cases as well (except for unaligned writes, since those may
require zeroing of partial blocks)?


Could you please help me understand how we can go about it?
I was wondering if we could create uninitialized blocks beyond EOF, so that
any parallel read would only read zeros from that area and we might not
require the exclusive inode_lock. Block creation is anyway protected
with i_data_sem in ext4_map_blocks.
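
As a rough illustration of the two allocation modes discussed here, using
the flag names from fs/ext4/ext4.h (the wrapper function itself is invented
for illustration):

	/*
	 * Allocate blocks for a DIO write either as unwritten (readers see
	 * zeros until the extent is converted after IO completion) or as
	 * written (stale block content is reachable until the data IO lands).
	 */
	static int map_for_dio_write_sketch(struct inode *inode,
					    struct ext4_map_blocks *map,
					    bool as_unwritten)
	{
		handle_t *handle = ext4_journal_current_handle();
		int flags = as_unwritten ? EXT4_GET_BLOCKS_IO_CREATE_EXT
					 : EXT4_GET_BLOCKS_CREATE;

		return ext4_map_blocks(handle, inode, map, flags);
	}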


I do see that in the case of buffered IO writes, we use
ext4_get_block_unwritten for the dioread_nolock case, which should
be true even for writes extending beyond EOF. This will
create only uninitialized and mapped blocks (even beyond EOF).
But all of this happens under the exclusive inode_lock() in
ext4_file_write_iter.


Whereas in the case of DIO writes extending beyond EOF, we pass
EXT4_GET_BLOCKS_CREATE to ext4_map_blocks, which may allocate
initialized & mapped blocks.
Do you know why?


Do you think we can create uninit. blocks in the dioread_nolock AIO/non-AIO
DIO write cases as well, with only the shared inode lock mode?
Do you see any problems with this?
Or do you have any other suggestion?


In the case of XFS -
I do see it promotes and demotes the shared lock (IOLOCK_XXX) in
xfs_file_aio_write_checks. But I don't know whether, after that,
it only holds the shared lock while doing the DIO.
And how does it protect against parallel DIO reads which can
potentially expose stale data?


Appreciate any help & inputs from FS developers.

-ritesh
Jan Kara Sept. 10, 2019, 9:57 p.m. UTC | #6
Hello,

On Tue 10-09-19 19:40:40, Ritesh Harjani wrote:
> > > Before doing this, you might want to have a chat and co-ordinate
> > > with the folks that are currently trying to port the ext4 direct IO
> > > code to use the iomap infrastructure:
> > > 
> > > https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
> > > 
> > > That is going to need the shared locking on read and will work just
> > > fine with shared locking on write, too (it's the code that XFS uses
> > > for direct IO). So it might be best here if you work towards shared
> > > locking on the write side rather than just revert the shared locking
> > > on the read side....
> > 
> > Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
> > inode lock for all aligned non-extending DIO writes will be easy so I'd
> > prefer if we didn't have to redo the iomap conversion patches due to these
> > reverts.
> 
> I was looking into this to see what is required to convert the DIO write
> side to shared locking.
> Why do you say shared inode lock only for *aligned non-extending DIO
> writes*?
> Non-extending means the overwrite case only, which even in today's code is
> not under any exclusive inode_lock (except the orphan handling and truncate
> handling in case of error after IO is completed). But even with
> the current code the perf problem is reported, right?

Yes, the problem is there even with the current code. It is that we first
acquire the inode_rwsem in exclusive mode in ext4_file_write_iter() and only
later relax that to no lock. And this is what is causing low performance
for mixed read-write workloads because the exclusive lock has to wait for
all shared holders and vice versa...

> If we look at today's code (including the new iomap code for the overwrite
> case, where we downgrade the lock), we call inode_unlock() in the overwrite
> case and then do the IO, since we don't have to allocate any blocks in this
> case:
> 
> 
> 	/*
> 	 * Make all waiters for direct IO properly wait also for extent
> 	 * conversion. This also disallows race between truncate() and
> 	 * overwrite DIO as i_dio_count needs to be incremented under
>  	 * i_mutex.
> 	 */
> 	inode_dio_begin(inode);
> 	/* If we do an overwrite dio, i_mutex locking can be released */
> 	overwrite = *((int *)iocb->private);
> 	if (overwrite)
> 		inode_unlock(inode);
> 	
> 	write data (via __blockdev_direct_IO)
> 
> 	inode_dio_end(inode);
> 	/* take i_mutex locking again if we did an overwrite dio */
> 	if (overwrite)
> 		inode_lock(inode);
> 
> 
> 
> What can we do next:-
> 
> Earlier, DIO reads did not take any inode locking.
> IIUC, the inode_lock_shared() in the DIO read case was added to
> protect against the race of reading back uninitialized/stale data, e.g.
> in case of truncate or writeback.

Not quite. Places that are prone to exposing stale block content (such as
truncate) wait for running DIO (inode_dio_wait()). Just with unlocked read
DIO we had to have special wait mechanisms to disable unlocked DIO for a
while so that we could actually safely drain all running DIOs, without new
ones being submitted, and then perform e.g. truncate. So it was more about
the complexity of the locking mechanism.
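
The drain pattern described above has roughly this shape (a sketch of the
generic idiom, not a quote of the ext4 code):

	/* Truncate-style exclusion of DIO: stop new submitters, then wait. */
	static void shrink_file_sketch(struct inode *inode, loff_t newsize)
	{
		inode_lock(inode);	/* excludes new lock-holding DIO submitters */
		inode_dio_wait(inode);	/* waits for i_dio_count to drop to zero */
		truncate_setsize(inode, newsize);
		/* ... free the blocks beyond newsize ... */
		inode_unlock(inode);
	}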

> So as Dave suggested, shouldn't we add the complete shared locking in the
> DIO write cases as well (except for unaligned writes, since those may
> require zeroing of partial blocks)?
> 
> 
> Could you please help me understand how we can go about it?
> I was wondering if we could create uninitialized blocks beyond EOF, so that
> any parallel read would only read zeros from that area and we might not
> require the exclusive inode_lock. Block creation is anyway protected
> with i_data_sem in ext4_map_blocks.

So doing file size changes (i.e., file expansion) without the exclusive
inode_lock would be tricky. I wouldn't really like to go there, at least
initially. My plan would be to do it similarly to XFS, like:

Do a quick check whether the IO is going to grow the inode size or is
unaligned. If yes, mode = exclusive, else mode = shared. Then lock the inode
rwsem in the appropriate mode, do ext4_write_checks() (which may find out the
exclusive lock is actually needed, so in that case mode = exclusive and
restart). Then call the iomap code to do the direct IO.
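
In code form, that plan could look roughly like this (all helpers are
hypothetical stand-ins; dio_write_may_lock_shared() is the predicate sketched
earlier in the thread and do_iomap_dio() stands in for the iomap DIO entry
point):

	static ssize_t dio_write_lock_sketch(struct kiocb *iocb,
					     struct iov_iter *from)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		bool excl = !dio_write_may_lock_shared(inode, iocb->ki_pos,
						       iov_iter_count(from));
		ssize_t ret;

	restart:
		if (excl)
			inode_lock(inode);
		else
			inode_lock_shared(inode);

		ret = ext4_write_checks(iocb, from);
		if (ret <= 0)
			goto out;
		/* The checks may discover we need the exclusive lock after all. */
		if (!excl && !dio_write_may_lock_shared(inode, iocb->ki_pos,
							iov_iter_count(from))) {
			inode_unlock_shared(inode);
			excl = true;
			goto restart;
		}
		ret = do_iomap_dio(iocb, from);
	out:
		if (excl)
			inode_unlock(inode);
		else
			inode_unlock_shared(inode);
		return ret;
	}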

> I do see that in the case of buffered IO writes, we use
> ext4_get_block_unwritten for the dioread_nolock case, which should
> be true even for writes extending beyond EOF. This will
> create only uninitialized and mapped blocks (even beyond EOF).
> But all of this happens under the exclusive inode_lock() in
> ext4_file_write_iter.

I guess this is mostly because we don't bother to check in
ext4_write_begin() whether we are extending the file or not, and filling
holes needs unwritten extents anyway. Also, before DIO reads were changed to
run under the shared inode rwsem, the following was possible:

CPU1					CPU2
DIO read from file f offset 4096
  flush cache
					grow 'f' from 4096 to 8192 by write
					  blocks get allocated, page cache
					  dirtied
  map blocks, submit IO
    - reads stale contents

> Whereas in the case of DIO writes extending beyond EOF, we pass
> EXT4_GET_BLOCKS_CREATE to ext4_map_blocks, which may allocate
> initialized & mapped blocks.
> Do you know why?

Not using unwritten extents is faster when that is safe. So that's why the
DIO code allocates written blocks where possible.

> Do you think we can create uninit. blocks in the dioread_nolock AIO/non-AIO
> DIO write cases as well, with only the shared inode lock mode?
> Do you see any problems with this?
> Or do you have any other suggestion?

Not uninit but unwritten. Yes, we can do that. After all e.g. page writeback
creates extents like this without any inode rwsem protection.

> In the case of XFS -
> I do see it promotes and demotes the shared lock (IOLOCK_XXX) in
> xfs_file_aio_write_checks. But I don't know whether, after that,
> it only holds the shared lock while doing the DIO.
> And how does it protect against parallel DIO reads which can
> potentially expose stale data?

XFS protects DIO submission with shared IOLOCK. Stale data exposure is
solved by using unwritten extents similarly to what ext4 does.
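
Concretely, the defence is to allocate the range as unwritten before the
write is submitted and to convert it to written only once the data IO has
completed, so a racing shared-lock read maps the extent as unwritten and
returns zeros instead of stale blocks. A sketch of the completion side (the
signature is invented; ext4_convert_unwritten_extents() is the real ext4
helper):

	/* IO completion: the data is on disk, so it is now safe to expose it. */
	static int dio_write_end_io_sketch(struct inode *inode, loff_t offset,
					   ssize_t size, int error)
	{
		if (error)
			return error;
		return ext4_convert_unwritten_extents(NULL, inode, offset, size);
	}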

								Honza
Ritesh Harjani Sept. 11, 2019, 2:20 p.m. UTC | #7
Hello,

On 9/11/19 3:27 AM, Jan Kara wrote:
> Hello,
> 
> On Tue 10-09-19 19:40:40, Ritesh Harjani wrote:
>>>> Before doing this, you might want to have a chat and co-ordinate
>>>> with the folks that are currently trying to port the ext4 direct IO
>>>> code to use the iomap infrastructure:
>>>>
>>>> https://lore.kernel.org/linux-ext4/20190827095221.GA1568@poseidon.bobrowski.net/T/#t
>>>>
>>>> That is going to need the shared locking on read and will work just
>>>> fine with shared locking on write, too (it's the code that XFS uses
>>>> for direct IO). So it might be best here if you work towards shared
>>>> locking on the write side rather than just revert the shared locking
>>>> on the read side....
>>>
>>> Yeah, after converting ext4 DIO path to iomap infrastructure, using shared
>>> inode lock for all aligned non-extending DIO writes will be easy so I'd
>>> prefer if we didn't have to redo the iomap conversion patches due to these
>>> reverts.
>>
>> I was looking into this to see what is required to convert the DIO write
>> side to shared locking.
>> Why do you say shared inode lock only for *aligned non-extending DIO
>> writes*?
>> Non-extending means the overwrite case only, which even in today's code is
>> not under any exclusive inode_lock (except the orphan handling and truncate
>> handling in case of error after IO is completed). But even with
>> the current code the perf problem is reported, right?
> 
> Yes, the problem is there even with the current code. It is that we first
> acquire the inode_rwsem in exclusive mode in ext4_file_write_iter() and only
> later relax that to no lock. And this is what is causing low performance
> for mixed read-write workloads because the exclusive lock has to wait for
> all shared holders and vice versa...

Got it.


> 
>> If we look at today's code (including the new iomap code for the overwrite
>> case, where we downgrade the lock), we call inode_unlock() in the overwrite
>> case and then do the IO, since we don't have to allocate any blocks in this
>> case:
>>
>>
>> 	/*
>> 	 * Make all waiters for direct IO properly wait also for extent
>> 	 * conversion. This also disallows race between truncate() and
>> 	 * overwrite DIO as i_dio_count needs to be incremented under
>>   	 * i_mutex.
>> 	 */
>> 	inode_dio_begin(inode);
>> 	/* If we do an overwrite dio, i_mutex locking can be released */
>> 	overwrite = *((int *)iocb->private);
>> 	if (overwrite)
>> 		inode_unlock(inode);
>> 	
>> 	write data (via __blockdev_direct_IO)
>>
>> 	inode_dio_end(inode);
>> 	/* take i_mutex locking again if we did an overwrite dio */
>> 	if (overwrite)
>> 		inode_lock(inode);
>>
>>
>>
>> What can we do next:-
>>
>> Earlier, DIO reads did not take any inode locking.
>> IIUC, the inode_lock_shared() in the DIO read case was added to
>> protect against the race of reading back uninitialized/stale data, e.g.
>> in case of truncate or writeback.
> 
> Not quite. Places that are prone to exposing stale block content (such as
> truncate) wait for running DIO (inode_dio_wait()). Just with unlocked read
> DIO we had to have special wait mechanisms to disable unlocked DIO for a
> while so that we could actually safely drain all running DIOs, without new
> ones being submitted, and then perform e.g. truncate. So it was more about
> the complexity of the locking mechanism.
> 
>> So as Dave suggested, shouldn't we add the complete shared locking in the
>> DIO write cases as well (except for unaligned writes, since those may
>> require zeroing of partial blocks)?
>>
>>
>> Could you please help me understand how we can go about it?
>> I was wondering if we could create uninitialized blocks beyond EOF, so that
>> any parallel read would only read zeros from that area and we might not
>> require the exclusive inode_lock. Block creation is anyway protected
>> with i_data_sem in ext4_map_blocks.
> 
> So doing file size changes (i.e., file expansion) without the exclusive
> inode_lock would be tricky. I wouldn't really like to go there, at least

Ok.

Yes, I am looking into this. I agree that the test case reported here
has the performance problem because we take the exclusive lock first
and only downgrade it when we detect it's an overwrite case.

I am working on a patch to make the changes you suggested. This will
of course be on top of the new iomap patch series.


> initially. My plan would be to do it similarly to XFS, like:
>
> Do a quick check whether the IO is going to grow the inode size or is
> unaligned. If yes, mode = exclusive, else mode = shared. Then lock the inode
> rwsem in the appropriate mode, do ext4_write_checks() (which may find out the
> exclusive lock is actually needed, so in that case mode = exclusive and
> restart). Then call the iomap code to do the direct IO.

Got it.

> 
>> I do see that in the case of buffered IO writes, we use
>> ext4_get_block_unwritten for the dioread_nolock case, which should
>> be true even for writes extending beyond EOF. This will
>> create only uninitialized and mapped blocks (even beyond EOF).
>> But all of this happens under the exclusive inode_lock() in
>> ext4_file_write_iter.
> 
> I guess this is mostly because we don't bother to check in
> ext4_write_begin() whether we are extending the file or not, and filling
> holes needs unwritten extents anyway. Also, before DIO reads were changed to
> run under the shared inode rwsem, the following was possible:
> 
> CPU1					CPU2
> DIO read from file f offset 4096
>    flush cache
> 					grow 'f' from 4096 to 8192 by write
> 					  blocks get allocated, page cache
> 					  dirtied
>    map blocks, submit IO
>      - reads stale contents

Ok. Got it.

> 
>> Whereas in the case of DIO writes extending beyond EOF, we pass
>> EXT4_GET_BLOCKS_CREATE to ext4_map_blocks, which may allocate
>> initialized & mapped blocks.
>> Do you know why?
> 
> Not using unwritten extents is faster when that is safe. So that's why the
> DIO code allocates written blocks where possible.

Thanks for that info. So it must be the joining/splitting
of unwritten extents which makes it a bit slow.

> 
>> Do you think we can create uninit. blocks in the dioread_nolock AIO/non-AIO
>> DIO write cases as well, with only the shared inode lock mode?
>> Do you see any problems with this?
>> Or do you have any other suggestion?
> 
> Not uninit but unwritten. Yes, we can do that. After all e.g. page writeback
> creates extents like this without any inode rwsem protection.

Hmm, so as I understand it, you would like to take this step by step.
First, let's come up with a patch to solve the above-mentioned issue
with the overwrite case.
With this we can have some APIs for taking the inode lock based
on the shared/exclusive mode flags.

Then, as you said, we can submit IO to create unwritten extents for DIO
writes beyond EOF as well. But that will require some careful handling
when updating the inode size, so that should be the second step.
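
For that second step, assuming an XFS-like approach, the i_size update would
move to IO-completion context and be ordered by a spinlock rather than by the
inode rwsem. A purely hypothetical sketch:

	/* Completion-side i_size update for a beyond-EOF DIO write. */
	static void dio_extend_isize_sketch(struct inode *inode, loff_t offset,
					    ssize_t written)
	{
		loff_t end = offset + written;

		spin_lock(&inode->i_lock);	/* orders racing size updates */
		if (end > i_size_read(inode))
			i_size_write(inode, end);
		spin_unlock(&inode->i_lock);
		mark_inode_dirty(inode);
	}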


> 
>> In the case of XFS -
>> I do see it promotes and demotes the shared lock (IOLOCK_XXX) in
>> xfs_file_aio_write_checks. But I don't know whether, after that,
>> it only holds the shared lock while doing the DIO.
>> And how does it protect against parallel DIO reads which can
>> potentially expose stale data?
> 
> XFS protects DIO submission with shared IOLOCK. Stale data exposure is
> solved by using unwritten extents similarly to what ext4 does.

Yes, that's what I was also thinking. But I will have to check the
locking mechanism which XFS uses to carefully update the inode size
in shared IOLOCK mode while using unwritten extents for DIO writes
beyond EOF. I see some tricks with i_lock(ILOCK)/i_flags_lock &
i_rwsem(IOLOCK).

Will check once the 1st step w.r.t. overwrite is done.


Thanks for your comments. Appreciate it!!

-ritesh