diff mbox series

[RFC,3/8] iomap: Add atomic write support for direct-io

Message ID 6a09654d152d3d1a07636174f5abcfce9948c20c.1709361537.git.ritesh.list@gmail.com
State New
Headers show
Series None | expand

Commit Message

Ritesh Harjani (IBM) March 2, 2024, 7:42 a.m. UTC
This adds direct-io atomic writes support in iomap. This adds -
1. IOMAP_ATOMIC flag for iomap iter.
2. Sets REQ_ATOMIC to bio opflags.
3. Adds necessary checks in iomap_dio code to ensure a single bio is
   submitted for an atomic write request. (since we only support ubuf
   type iocb). Otherwise return an error EIO.
4. Adds a common helper routine iomap_dio_check_atomic(). It helps in
   verifying mapped length and start/end physical offset against the hw
   device constraints for supporting atomic writes.

This patch is based on a patch from John Garry <john.g.garry@oracle.com>
which adds such support of DIO atomic writes to iomap.

Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
---
 fs/iomap/direct-io.c  | 75 +++++++++++++++++++++++++++++++++++++++++--
 fs/iomap/trace.h      |  3 +-
 include/linux/iomap.h |  1 +
 3 files changed, 75 insertions(+), 4 deletions(-)

Comments

Dave Chinner March 4, 2024, 1:16 a.m. UTC | #1
On Sat, Mar 02, 2024 at 01:12:00PM +0530, Ritesh Harjani (IBM) wrote:
> This adds direct-io atomic writes support in iomap. This adds -
> 1. IOMAP_ATOMIC flag for iomap iter.
> 2. Sets REQ_ATOMIC to bio opflags.
> 3. Adds necessary checks in iomap_dio code to ensure a single bio is
>    submitted for an atomic write request. (since we only support ubuf
>    type iocb). Otherwise return an error EIO.
> 4. Adds a common helper routine iomap_dio_check_atomic(). It helps in
>    verifying mapped length and start/end physical offset against the hw
>    device constraints for supporting atomic writes.
> 
> This patch is based on a patch from John Garry <john.g.garry@oracle.com>
> which adds such support of DIO atomic writes to iomap.
> 
> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> ---
>  fs/iomap/direct-io.c  | 75 +++++++++++++++++++++++++++++++++++++++++--
>  fs/iomap/trace.h      |  3 +-
>  include/linux/iomap.h |  1 +
>  3 files changed, 75 insertions(+), 4 deletions(-)

Ugh. Now we have two competing sets of changes to bring RWF_ATOMIC
support to iomap. One from John here:

https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/

and now this one.

Can the two of you please co-ordinate your efforts and based your
filesysetm work off the same iomap infrastructure changes?

.....

> @@ -356,6 +360,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	if (need_zeroout) {
>  		/* zero out from the start of the block to the write offset */
>  		pad = pos & (fs_block_size - 1);
> +		if (unlikely(pad && atomic_write)) {
> +			WARN_ON_ONCE("pos not atomic write aligned\n");
> +			ret = -EINVAL;
> +			goto out;
> +		}

This atomic IO should have been rejected before it even got to
the layers where the bios are being built. If the IO alignment is
such that it does not align to filesystem allocation constraints, it
should be rejected at the filesystem ->write_iter() method and not
even get to the iomap layer.

.....

> @@ -516,6 +535,44 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
>  	}
>  }
>  
> +/*
> + * iomap_dio_check_atomic:	DIO Atomic checks before calling bio submission.
> + * @iter:			iomap iterator
> + * This function is called after filesystem block mapping and before bio
> + * formation/submission. This is the right place to verify hw device/block
> + * layer constraints to be followed for doing atomic writes. Hence do those
> + * common checks here.
> + */
> +static bool iomap_dio_check_atomic(struct iomap_iter *iter)
> +{
> +	struct block_device *bdev = iter->iomap.bdev;
> +	unsigned long long map_len = iomap_length(iter);
> +	unsigned long long start = iomap_sector(&iter->iomap, iter->pos)
> +						<< SECTOR_SHIFT;
> +	unsigned long long end = start + map_len - 1;
> +	unsigned int awu_min =
> +			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
> +	unsigned int awu_max =
> +			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
> +	unsigned long boundary =
> +			queue_atomic_write_boundary_bytes(bdev->bd_queue);
> +	unsigned long mask = ~(boundary - 1);
> +
> +
> +	/* map_len should be same as user specified iter->len */
> +	if (map_len < iter->len)
> +		return false;
> +	/* start should be aligned to block device min atomic unit alignment */
> +	if (!IS_ALIGNED(start, awu_min))
> +		return false;
> +	/* If top bits doesn't match, means atomic unit boundary is crossed */
> +	if (boundary && ((start | mask) != (end | mask)))
> +		return false;
> +
> +	return true;
> +}

I think you are re-implementing stuff that John has already done at
higher layers and in a generic manner. i.e.
generic_atomic_write_valid() in this patch:

https://lore.kernel.org/linux-fsdevel/20240226173612.1478858-4-john.g.garry@oracle.com/

We shouldn't be getting anywhere near the iomap layer if the IO is
not properly aligned to atomic IO constraints...

So, yeah, can you please co-ordinate the development of this
patchset with John and the work that has already been done to
support this functionality on block devices and XFS?

-Dave.
Ritesh Harjani (IBM) March 4, 2024, 5:33 a.m. UTC | #2
Dave Chinner <david@fromorbit.com> writes:

> On Sat, Mar 02, 2024 at 01:12:00PM +0530, Ritesh Harjani (IBM) wrote:
>> This adds direct-io atomic writes support in iomap. This adds -
>> 1. IOMAP_ATOMIC flag for iomap iter.
>> 2. Sets REQ_ATOMIC to bio opflags.
>> 3. Adds necessary checks in iomap_dio code to ensure a single bio is
>>    submitted for an atomic write request. (since we only support ubuf
>>    type iocb). Otherwise return an error EIO.
>> 4. Adds a common helper routine iomap_dio_check_atomic(). It helps in
>>    verifying mapped length and start/end physical offset against the hw
>>    device constraints for supporting atomic writes.
>> 
>> This patch is based on a patch from John Garry <john.g.garry@oracle.com>
>> which adds such support of DIO atomic writes to iomap.

Please note this comment above. I will refer this in below comments.

>> 
>> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> ---
>>  fs/iomap/direct-io.c  | 75 +++++++++++++++++++++++++++++++++++++++++--
>>  fs/iomap/trace.h      |  3 +-
>>  include/linux/iomap.h |  1 +
>>  3 files changed, 75 insertions(+), 4 deletions(-)
>
> Ugh. Now we have two competing sets of changes to bring RWF_ATOMIC
> support to iomap. One from John here:

Not competing changes (and neither that was the intention). As you see I have
commented above saying that this patch is based on a previous patch in
iomap from John. 

So why did I send this one?  
1. John's latest patch series v5 was on "block atomic writes" [1], which
does not have these checks in iomap (as it was not required). 

2. For sake of completeness for ext4 atomic write support, I needed to
include this change along with this series. I have also tried to address all
the review comments he got on [2] (along with an extra function iomap_dio_check_atomic())

[1]: https://lore.kernel.org/all/20240226173612.1478858-1-john.g.garry@oracle.com/
[2]: https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/

>
> https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/
>
> and now this one.
>
> Can the two of you please co-ordinate your efforts and based your
> filesysetm work off the same iomap infrastructure changes?

Sure Dave, make sense. But we are cc'ing each other in this effort
together so that we are aware of what is being worked upon. 

And as I mentioned, this change is not competing with John's change. If
at all it is only complementing his initial change, since this iomap change
addresses review comments from others on the previous one and added one
extra check (on mapped physical extent) which I wanted people to provide feedback on.

>
> .....
>
>> @@ -356,6 +360,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  	if (need_zeroout) {
>>  		/* zero out from the start of the block to the write offset */
>>  		pad = pos & (fs_block_size - 1);
>> +		if (unlikely(pad && atomic_write)) {
>> +			WARN_ON_ONCE("pos not atomic write aligned\n");
>> +			ret = -EINVAL;
>> +			goto out;
>> +		}
>
> This atomic IO should have been rejected before it even got to
> the layers where the bios are being built. If the IO alignment is
> such that it does not align to filesystem allocation constraints, it
> should be rejected at the filesystem ->write_iter() method and not
> even get to the iomap layer.

I had added this mainly from iomap sanity checking perspective. 
We are offloading some checks to be made by the filesystem before
submitting the I/O request to iomap. 
These "common" checks in iomap layer are mainly to provide sanity checking
to make sure FS did it's job, before iomap could form/process the bios and then
do submit_bio to the block layer. 



>
> .....
>
>> @@ -516,6 +535,44 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
>>  	}
>>  }
>>  
>> +/*
>> + * iomap_dio_check_atomic:	DIO Atomic checks before calling bio submission.
>> + * @iter:			iomap iterator
>> + * This function is called after filesystem block mapping and before bio
>> + * formation/submission. This is the right place to verify hw device/block
>> + * layer constraints to be followed for doing atomic writes. Hence do those
>> + * common checks here.
>> + */
>> +static bool iomap_dio_check_atomic(struct iomap_iter *iter)
>> +{
>> +	struct block_device *bdev = iter->iomap.bdev;
>> +	unsigned long long map_len = iomap_length(iter);
>> +	unsigned long long start = iomap_sector(&iter->iomap, iter->pos)
>> +						<< SECTOR_SHIFT;
>> +	unsigned long long end = start + map_len - 1;
>> +	unsigned int awu_min =
>> +			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
>> +	unsigned int awu_max =
>> +			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
>> +	unsigned long boundary =
>> +			queue_atomic_write_boundary_bytes(bdev->bd_queue);
>> +	unsigned long mask = ~(boundary - 1);
>> +
>> +
>> +	/* map_len should be same as user specified iter->len */
>> +	if (map_len < iter->len)
>> +		return false;
>> +	/* start should be aligned to block device min atomic unit alignment */
>> +	if (!IS_ALIGNED(start, awu_min))
>> +		return false;
>> +	/* If top bits doesn't match, means atomic unit boundary is crossed */
>> +	if (boundary && ((start | mask) != (end | mask)))
>> +		return false;
>> +
>> +	return true;
>> +}
>
> I think you are re-implementing stuff that John has already done at
> higher layers and in a generic manner. i.e.
> generic_atomic_write_valid() in this patch:
>
> https://lore.kernel.org/linux-fsdevel/20240226173612.1478858-4-john.g.garry@oracle.com/
>
> We shouldn't be getting anywhere near the iomap layer if the IO is
> not properly aligned to atomic IO constraints...

So current generic_atomic_write_valid() function mainly checks alignment
w.r.t logical offset and iter->len. 

What this function was checking was on the physical block offset and
mapped extent length. Hence it was made after iomap_iter() call.
i.e. ...

 +	/* map_len should be same as user specified iter->len */
 +	if (map_len < iter->len)
 +		return false;
 +	/* start should be aligned to block device min atomic unit alignment */
 +	if (!IS_ALIGNED(start, awu_min))
 +		return false;


But I agree, that maybe we can improve generic_atomic_write_valid()
to be able to work on both logical and physical offset and
iter->len + mapped len. 

Let me think about it. 

However, the point on which I would like a feedback from others is - 
1. After filesystem has returned the mapped extent in iomap_iter() call,
iomap will be forming a bio to be sent to the block layer.
So do we agree to add a check here in iomap layer to verify that the
mapped physical start and len should satisfy the requirements for doing
atomic writes?

>
> So, yeah, can you please co-ordinate the development of this
> patchset with John and the work that has already been done to
> support this functionality on block devices and XFS?

We actually are in a way. If you see this ext4 series is sitting on top of
John's v5 series of "block atomic write". This patch [1] ([RFC 5/8] part
of this series), in ext4 does use generic_atomic_write_valid() function
for DIO atomic write validity.

[1]: https://lore.kernel.org/linux-ext4/e332979deb70913c2c476a059b09015904a5b007.1709361537.git.ritesh.list@gmail.com/T/#u


Thanks for your review!

-ritesh
John Garry March 4, 2024, 8:49 a.m. UTC | #3
>>
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!PqMMFBeUqdWwlm0AxVyI_Vr1HPajTQ6AG2_GwK_IrhBSa-Wnz4cc-1w0LEFyTXY9Q9gT0WwhxvXloSqnOHb6Btg$
>>
>> and now this one.
>>
>> Can the two of you please co-ordinate your efforts and based your
>> filesysetm work off the same iomap infrastructure changes?
> 
> Sure Dave, make sense. But we are cc'ing each other in this effort
> together so that we are aware of what is being worked upon.

Just cc'ing is not enough. I was going to send my v2 for XFS/iomap 
support today. I didn't announce that as I did not think that I had to. 
Admittedly it will be effectively an RFC, as the forcealign feature (now 
included) is not mature. But it's going to be a bit awkward to have 2x 
overlapping series' sent to the list.

FWIW, I think that it's better to send series based on top of other 
series, rather than cherry-picking necessary parts of other series (when 
posting)

Thanks,
John
Ritesh Harjani (IBM) March 4, 2024, 10:31 a.m. UTC | #4
John Garry <john.g.garry@oracle.com> writes:

>>>
>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/__;!!ACWV5N9M2RV99hQ!PqMMFBeUqdWwlm0AxVyI_Vr1HPajTQ6AG2_GwK_IrhBSa-Wnz4cc-1w0LEFyTXY9Q9gT0WwhxvXloSqnOHb6Btg$
>>>
>>> and now this one.
>>>
>>> Can the two of you please co-ordinate your efforts and based your
>>> filesysetm work off the same iomap infrastructure changes?
>> 
>> Sure Dave, make sense. But we are cc'ing each other in this effort
>> together so that we are aware of what is being worked upon.
>
> Just cc'ing is not enough. I was going to send my v2 for XFS/iomap 
> support today. I didn't announce that as I did not think that I had to. 

ok. Let me take care of this next time to avoid any overlapping change
hitting the mailing list to avoid double reviews/competing changes from
2 people. Hopefully I can find you on xfs IRC channel in case if I would
like to post anything in the related/overlapping area . My handle is riteshh. 

> Admittedly it will be effectively an RFC, as the forcealign feature (now 
> included) is not mature. But it's going to be a bit awkward to have 2x 
> overlapping series' sent to the list.
>
> FWIW, I think that it's better to send series based on top of other 
> series, rather than cherry-picking necessary parts of other series (when 
> posting)
>

Ok. Sure John. Make sense. Now that I understood what I am looking for
in from iomap side of the changes, I can provide my review comments to
your series, whenever you post them.

Thanks for your feedback.

-ritesh
Dave Chinner March 4, 2024, 8:56 p.m. UTC | #5
On Mon, Mar 04, 2024 at 11:03:24AM +0530, Ritesh Harjani wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Sat, Mar 02, 2024 at 01:12:00PM +0530, Ritesh Harjani (IBM) wrote:
> >> This adds direct-io atomic writes support in iomap. This adds -
> >> 1. IOMAP_ATOMIC flag for iomap iter.
> >> 2. Sets REQ_ATOMIC to bio opflags.
> >> 3. Adds necessary checks in iomap_dio code to ensure a single bio is
> >>    submitted for an atomic write request. (since we only support ubuf
> >>    type iocb). Otherwise return an error EIO.
> >> 4. Adds a common helper routine iomap_dio_check_atomic(). It helps in
> >>    verifying mapped length and start/end physical offset against the hw
> >>    device constraints for supporting atomic writes.
> >> 
> >> This patch is based on a patch from John Garry <john.g.garry@oracle.com>
> >> which adds such support of DIO atomic writes to iomap.
> 
> Please note this comment above. I will refer this in below comments.
> 
> >> 
> >> Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> >> ---
> >>  fs/iomap/direct-io.c  | 75 +++++++++++++++++++++++++++++++++++++++++--
> >>  fs/iomap/trace.h      |  3 +-
> >>  include/linux/iomap.h |  1 +
> >>  3 files changed, 75 insertions(+), 4 deletions(-)
> >
> > Ugh. Now we have two competing sets of changes to bring RWF_ATOMIC
> > support to iomap. One from John here:
> 
> Not competing changes (and neither that was the intention). As you see I have
> commented above saying that this patch is based on a previous patch in
> iomap from John. 

That's not the same as co-ordinating development or collaboration on
common aspects of the functionality required.

> So why did I send this one?  
> 1. John's latest patch series v5 was on "block atomic writes" [1], which
> does not have these checks in iomap (as it was not required). 
> 
> 2. For sake of completeness for ext4 atomic write support, I needed to
> include this change along with this series. I have also tried to address all
> the review comments he got on [2] (along with an extra function iomap_dio_check_atomic())
> 
> [1]: https://lore.kernel.org/all/20240226173612.1478858-1-john.g.garry@oracle.com/
> [2]: https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/

Yes, but you've clearly not seen the feedback that John has been
given because otherwise you would not have implemented things the
way you did.

That's my point - you're operating in isolation, and forcing
reviewers now to deal with two separate patch sets with overlapping
funcitonality and similar problems.

> > https://lore.kernel.org/linux-fsdevel/20240124142645.9334-1-john.g.garry@oracle.com/
> >
> > and now this one.
> >
> > Can the two of you please co-ordinate your efforts and based your
> > filesysetm work off the same iomap infrastructure changes?
> 
> Sure Dave, make sense. But we are cc'ing each other in this effort
> together so that we are aware of what is being worked upon. 

"ccing each other" is not the same as actively collaborating on
development.

> And as I mentioned, this change is not competing with John's change. If
> at all it is only complementing his initial change, since this iomap change
> addresses review comments from others on the previous one and added one
> extra check (on mapped physical extent) which I wanted people to provide feedback on.
> 
> >
> > .....
> >
> >> @@ -356,6 +360,11 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >>  	if (need_zeroout) {
> >>  		/* zero out from the start of the block to the write offset */
> >>  		pad = pos & (fs_block_size - 1);
> >> +		if (unlikely(pad && atomic_write)) {
> >> +			WARN_ON_ONCE("pos not atomic write aligned\n");
> >> +			ret = -EINVAL;
> >> +			goto out;
> >> +		}
> >
> > This atomic IO should have been rejected before it even got to
> > the layers where the bios are being built. If the IO alignment is
> > such that it does not align to filesystem allocation constraints, it
> > should be rejected at the filesystem ->write_iter() method and not
> > even get to the iomap layer.
> 
> I had added this mainly from iomap sanity checking perspective. 
> We are offloading some checks to be made by the filesystem before
> submitting the I/O request to iomap. 
> These "common" checks in iomap layer are mainly to provide sanity checking
> to make sure FS did it's job, before iomap could form/process the bios and then
> do submit_bio to the block layer. 

If you read the feedback John had been given, you'd know that
alignment verification for atomic writes belongs in the filesystem
before it even calls into iomap. See these two patches in the XFS
series he just sent out:

https://lore.kernel.org/linux-xfs/20240304130428.13026-11-john.g.garry@oracle.com/T/#u
https://lore.kernel.org/linux-xfs/20240304130428.13026-14-john.g.garry@oracle.com/T/#u

> > .....
> >
> >> @@ -516,6 +535,44 @@ static loff_t iomap_dio_iter(const struct iomap_iter *iter,
> >>  	}
> >>  }
> >>  
> >> +/*
> >> + * iomap_dio_check_atomic:	DIO Atomic checks before calling bio submission.
> >> + * @iter:			iomap iterator
> >> + * This function is called after filesystem block mapping and before bio
> >> + * formation/submission. This is the right place to verify hw device/block
> >> + * layer constraints to be followed for doing atomic writes. Hence do those
> >> + * common checks here.
> >> + */
> >> +static bool iomap_dio_check_atomic(struct iomap_iter *iter)
> >> +{
> >> +	struct block_device *bdev = iter->iomap.bdev;
> >> +	unsigned long long map_len = iomap_length(iter);
> >> +	unsigned long long start = iomap_sector(&iter->iomap, iter->pos)
> >> +						<< SECTOR_SHIFT;
> >> +	unsigned long long end = start + map_len - 1;
> >> +	unsigned int awu_min =
> >> +			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
> >> +	unsigned int awu_max =
> >> +			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
> >> +	unsigned long boundary =
> >> +			queue_atomic_write_boundary_bytes(bdev->bd_queue);
> >> +	unsigned long mask = ~(boundary - 1);
> >> +
> >> +
> >> +	/* map_len should be same as user specified iter->len */
> >> +	if (map_len < iter->len)
> >> +		return false;
> >> +	/* start should be aligned to block device min atomic unit alignment */
> >> +	if (!IS_ALIGNED(start, awu_min))
> >> +		return false;
> >> +	/* If top bits doesn't match, means atomic unit boundary is crossed */
> >> +	if (boundary && ((start | mask) != (end | mask)))
> >> +		return false;
> >> +
> >> +	return true;
> >> +}
> >
> > I think you are re-implementing stuff that John has already done at
> > higher layers and in a generic manner. i.e.
> > generic_atomic_write_valid() in this patch:
> >
> > https://lore.kernel.org/linux-fsdevel/20240226173612.1478858-4-john.g.garry@oracle.com/
> >
> > We shouldn't be getting anywhere near the iomap layer if the IO is
> > not properly aligned to atomic IO constraints...
> 
> So current generic_atomic_write_valid() function mainly checks alignment
> w.r.t logical offset and iter->len. 
> 
> What this function was checking was on the physical block offset and
> mapped extent length. Hence it was made after iomap_iter() call.
> i.e. ...

The filesystem is supposed to guarantee the alignment of the iomap
returned for mapping requests on inodes configured for atomic
writes. IOWs, if the filesystem returns an unaligned or short extent
for an atomic write enabled inode, the filesystem mapping operation
is buggy. If it can't map aligned extents, then it should return an
error, not leave crap for the iomap infrastructure to have to clean
up.

> 
>  +	/* map_len should be same as user specified iter->len */
>  +	if (map_len < iter->len)
>  +		return false;
>  +	/* start should be aligned to block device min atomic unit alignment */
>  +	if (!IS_ALIGNED(start, awu_min))
>  +		return false;
> 
> 
> But I agree, that maybe we can improve generic_atomic_write_valid()
> to be able to work on both logical and physical offset and
> iter->len + mapped len. 
> Let me think about it. 
> 
> However, the point on which I would like a feedback from others is - 
> 1. After filesystem has returned the mapped extent in iomap_iter() call,
> iomap will be forming a bio to be sent to the block layer.
> So do we agree to add a check here in iomap layer to verify that the
> mapped physical start and len should satisfy the requirements for doing
> atomic writes?

That's entirely the problem about you working on this in isolation:
we've already had that discussion and the simplest solution is that
this is a filesystem problem, not an iomap problem. That is, if the
filesystem cannot return a correctly aligned and sized extent for an
atomic write enabled inode, it must return an error and not a
malformed iomap.

IOWs, it's not the job of the iomap IO routines to enforce mapping
alignment on these inodes - the extent alignment must always be
correct for atomic writes regardless of whether an atomic write IO
is being done or not. Failure to align any extent in the inode
correctly will result in future atomic writes to that offset being
impossible to issue.

Hence if the inode is configured for atomic writes, it *must* return
aligned and sized iomaps that atomic writes can be issued against.
It's a filesystem implementation bug if this invariant is violated,
so the filesystem implementation is where all the debug checks need
to be to ensure it never returns an invalid mapping to the iomap
infrastructure.

-Dave.
diff mbox series

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..b4548acb74e7 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -256,7 +256,7 @@  static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
  * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
-		const struct iomap *iomap, bool use_fua)
+		const struct iomap *iomap, bool use_fua, bool atomic_write)
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
@@ -269,6 +269,9 @@  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 	else
 		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
 
+	if (atomic_write)
+		opflags |= REQ_ATOMIC;
+
 	return opflags;
 }
 
@@ -279,11 +282,12 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
 	loff_t length = iomap_length(iter);
+	const size_t orig_len = iter->len;
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
 	bool need_zeroout = false;
-	bool use_fua = false;
+	bool use_fua = false, atomic_write = iter->flags & IOMAP_ATOMIC;
 	int nr_pages, ret = 0;
 	size_t copied = 0;
 	size_t orig_count;
@@ -356,6 +360,11 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	if (need_zeroout) {
 		/* zero out from the start of the block to the write offset */
 		pad = pos & (fs_block_size - 1);
+		if (unlikely(pad && atomic_write)) {
+			WARN_ON_ONCE("pos not atomic write aligned\n");
+			ret = -EINVAL;
+			goto out;
+		}
 		if (pad)
 			iomap_dio_zero(iter, dio, pos - pad, pad);
 	}
@@ -365,7 +374,7 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	 * can set up the page vector appropriately for a ZONE_APPEND
 	 * operation.
 	 */
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_write);
 
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
@@ -397,6 +406,14 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
+
+		/* This bio should have covered the complete length */
+		if (unlikely(atomic_write && n != orig_len)) {
+			WARN_ON_ONCE(1);
+			ret = -EINVAL;
+			bio_put(bio);
+			goto out;
+		}
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
 		} else {
@@ -429,6 +446,8 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		/* zero out from the end of the write to the end of the block */
 		pad = pos & (fs_block_size - 1);
+		/* This should never happen */
+		WARN_ON_ONCE(unlikely(pad && atomic_write));
 		if (pad)
 			iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
 	}
@@ -516,6 +535,44 @@  static loff_t iomap_dio_iter(const struct iomap_iter *iter,
 	}
 }
 
+/*
+ * iomap_dio_check_atomic:	DIO Atomic checks before calling bio submission.
+ * @iter:			iomap iterator
+ * This function is called after filesystem block mapping and before bio
+ * formation/submission. This is the right place to verify hw device/block
+ * layer constraints to be followed for doing atomic writes. Hence do those
+ * common checks here.
+ */
+static bool iomap_dio_check_atomic(struct iomap_iter *iter)
+{
+	struct block_device *bdev = iter->iomap.bdev;
+	unsigned long long map_len = iomap_length(iter);
+	unsigned long long start = iomap_sector(&iter->iomap, iter->pos)
+						<< SECTOR_SHIFT;
+	unsigned long long end = start + map_len - 1;
+	unsigned int awu_min =
+			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
+	unsigned int awu_max =
+			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
+	unsigned long boundary =
+			queue_atomic_write_boundary_bytes(bdev->bd_queue);
+	unsigned long mask = ~(boundary - 1);
+
+
+	/* map_len should be same as user specified iter->len */
+	if (map_len < iter->len)
+		return false;
+	/* start should be aligned to block device min atomic unit alignment */
+	if (!IS_ALIGNED(start, awu_min))
+		return false;
+	/* If top bits doesn't match, means atomic unit boundary is crossed */
+	if (boundary && ((start | mask) != (end | mask)))
+		return false;
+
+	return true;
+}
+
+
 /*
  * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
  * is being issued as AIO or not.  This allows us to optimise pure data writes
@@ -554,12 +611,16 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct iomap_dio *dio;
 	loff_t ret = 0;
+	bool atomic_write = iocb->ki_flags & IOCB_ATOMIC;
 
 	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
 
 	if (!iomi.len)
 		return NULL;
 
+	if (atomic_write && !iter_is_ubuf(iter))
+		return ERR_PTR(-EINVAL);
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
 		return ERR_PTR(-ENOMEM);
@@ -605,6 +666,9 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
 			dio->flags |= IOMAP_DIO_CALLER_COMP;
 
+		if (atomic_write)
+			iomi.flags |= IOMAP_ATOMIC;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
@@ -656,6 +720,11 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	blk_start_plug(&plug);
 	while ((ret = iomap_iter(&iomi, ops)) > 0) {
+		if (atomic_write && !iomap_dio_check_atomic(&iomi)) {
+			ret = -EIO;
+			break;
+		}
+
 		iomi.processed = iomap_dio_iter(&iomi, dio);
 
 		/*
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index c16fd55f5595..c95576420bca 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@  DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_REPORT,		"REPORT" }, \
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
-	{ IOMAP_NOWAIT,		"NOWAIT" }
+	{ IOMAP_NOWAIT,		"NOWAIT" }, \
+	{ IOMAP_ATOMIC,		"ATOMIC" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..9eac704a0d6f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,7 @@  struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC		(1 << 9)
 
 struct iomap_ops {
 	/*