diff mbox series

[RFC,1/3] block: Add support for REQ_OP_ASSIGN_RANGE operation

Message ID 157599696813.12112.14140818972910110796.stgit@localhost.localdomain
State New
Headers show
Series block,ext4: Introduce REQ_OP_ASSIGN_RANGE to reflect extents allocation in block device internals | expand

Commit Message

Kirill Tkhai Dec. 10, 2019, 4:56 p.m. UTC
This operation allows to notify a device about the fact,
that some sectors range was choosen by a filesystem
as a single extent, and the device should try its best
to reflect that (keep the range as a single hunk in its
internals, or represent the range as minimal set of hunks).
Speaking directly, the operation is for forwarding
fallocate(0) requests into an essence, on which the device
is based.

This may be useful for some distributed network filesystems,
providing block device interface, for optimization of their
blocks placement over the cluster nodes.

Also, block devices mapping a file (like loop) are users
of that, since this allows to allocate more continuous
extents and since this batches blocks allocation requests.
In addition, hypervisors like QEMU may use this for better
blocks placement.

The patch adds a new blkdev_issue_assign_range() primitive,
which is rather similar to existing blkdev_issue_{*} api.
Also, a new queue limit.max_assign_range_sectors is added.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    4 +++
 block/blk-lib.c           |   70 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   21 ++++++++++++++
 block/bounce.c            |    1 +
 include/linux/bio.h       |    3 ++
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |   29 +++++++++++++++++++
 7 files changed, 130 insertions(+)

Comments

Martin K. Petersen Dec. 19, 2019, 3:03 a.m. UTC | #1
Hi Kirill!

> The patch adds a new blkdev_issue_assign_range() primitive, which is
> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
> limit.max_assign_range_sectors is added.

I am not so keen on the assign_range name. What's wrong with "allocate"?

But why introduce a completely new operation? Isn't this essentially a
write zeroes with BLKDEV_ZERO_NOUNMAP flag set?

If the zeroing aspect is perceived to be a problem we could add a
BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
terminology used in SCSI).
Kirill Tkhai Dec. 19, 2019, 11:07 a.m. UTC | #2
Hi, Martin!

On 19.12.2019 06:03, Martin K. Petersen wrote:
> 
> Hi Kirill!
> 
>> The patch adds a new blkdev_issue_assign_range() primitive, which is
>> rather similar to existing blkdev_issue_{*} api.  Also, a new queue
>> limit.max_assign_range_sectors is added.
> 
> I am not so keen on the assign_range name. What's wrong with "allocate"?

REQ_OP_ALLOCATE_RANGE seemed for me as looking very long for the reviewers.
And I found that there is no an abbreviation of operations name in enum req_opf,
so REQ_OP_ALLOC_RANGE won't look good. Thus, I found a replacement.

But in case of REQ_OP_ALLOCATE_RANGE length is OK for people, there is no
a problem to choose it.

> But why introduce a completely new operation? Isn't this essentially a
> write zeroes with BLKDEV_ZERO_NOUNMAP flag set?
> 
> If the zeroing aspect is perceived to be a problem we could add a
> BLKDEV_ZERO_ALLOCATE flag (or BLKDEV_ZERO_ANCHOR since that's the
> terminology used in SCSI).

Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
So, do I understand right that we should the below two?:

1)Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
2)Introduce a new flag REQ_NOZERO in enum req_opf.

Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
which does not write zeroes sometimes? Maybe we should rename
blkdev_issue_write_zeroes() in some more generic name?

Thanks,
Kirill
Chaitanya Kulkarni Dec. 19, 2019, 10:03 p.m. UTC | #3
> 1)Introduce a new flag BLKDEV_ZERO_ALLOCATE for blkdev_issue_write_zeroes().
> 2)Introduce a new flag REQ_NOZERO in enum req_opf.
>
> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> which does not write zeroes sometimes? Maybe we should rename
> blkdev_issue_write_zeroes() in some more generic name?

Yes it will be confusing, I can see that the code for 
__blkdev_issue_write_zeroes() & __blkdev_issue_assign_range()
is very similar except op = REQ_OP_WRITE_ZEROES and
op = REQ_OP_ASSIGN_RANGE (and some minor details).

What we need is a prep patch which moves the payload less bio
code into the helper function which can accept op as an argument
based on that it will branch out and execute right code path.

If we decide to get this in then I'll be happy to create required
prep patch.
>
> Thanks,
> Kirill
>
>
Martin K. Petersen Dec. 19, 2019, 10:37 p.m. UTC | #4
Kirill,

> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> So, do I understand right that we should the below two?:
>
> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> blkdev_issue_write_zeroes().

> 2) Introduce a new flag REQ_NOZERO in enum req_opf.

Something like that. If zeroing is a problem for you.

Right now we offer the following semantics:

	Deallocate, no zeroing (discard)

	Optionally deallocate, zeroing (zeroout)

	Allocate, zeroing (zeroout + NOUNMAP)

Some devices also implement a fourth option which would be:

	Anchor: Allocate, no zeroing

> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> which does not write zeroes sometimes? Maybe we should rename
> blkdev_issue_write_zeroes() in some more generic name?

Maybe. The naming is what it is for hysterical raisins and reflects how
things are implemented in the storage protocols. I wouldn't worry too
much about that. We can rename things if need be but we shouldn't plumb
an essentially identical operation through the block stack just to
expose a different name at the top.
Darrick Wong Dec. 20, 2019, 1:53 a.m. UTC | #5
On Thu, Dec 19, 2019 at 05:37:47PM -0500, Martin K. Petersen wrote:
> 
> Kirill,
> 
> > Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
> > So, do I understand right that we should the below two?:
> >
> > 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
> > blkdev_issue_write_zeroes().
> 
> > 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.
> 
> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing

What happens if you anchor and then try to read the results?  IO error?

(Yeah, I'm being lazy and not digging through SBC-3, sorry...)

--D

> > Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
> > which does not write zeroes sometimes? Maybe we should rename
> > blkdev_issue_write_zeroes() in some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering
Martin K. Petersen Dec. 20, 2019, 2:22 a.m. UTC | #6
Darrick,

[Anchor]

> What happens if you anchor and then try to read the results?  IO
> error?

Depends on the device. Typically you get the same results as UNMAP.

Anchoring is essentially a way to preallocate blocks on a thinly
provisioned device. If you then run out of actual storage capacity you
won't get a write failure writing to those LBAs. It is intended to
protect block ranges that do not tolerate write failures (journals,
metadata).
Kirill Tkhai Dec. 20, 2019, 11:55 a.m. UTC | #7
Hi, Martin,

On 20.12.2019 01:37, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> Hm. BLKDEV_ZERO_NOUNMAP is used in __blkdev_issue_write_zeroes() only.
>> So, do I understand right that we should the below two?:
>>
>> 1) Introduce a new flag BLKDEV_ZERO_ALLOCATE for
>> blkdev_issue_write_zeroes().
> 
>> 2) Introduce a new flag REQ_NOZERO in enum req_opf.
> 
> Something like that. If zeroing is a problem for you.

My intention is to use this in fs allocators to notify virtual block devices
about allocated blocks (like in patch [3/3]). Filesystems allocators know about
written and unwritten extents, and they don't need a zeroing of allocated blocks.

Since a block range allocation action is less complicated (and faster), than
operation of allocation + zeroing of allocated blocks (at least for some devices),
we just choose it as the fastest. This is the reason we avoid zeroing.

> Right now we offer the following semantics:
> 
> 	Deallocate, no zeroing (discard)
> 
> 	Optionally deallocate, zeroing (zeroout)
> 
> 	Allocate, zeroing (zeroout + NOUNMAP)
> 
> Some devices also implement a fourth option which would be:
> 
> 	Anchor: Allocate, no zeroing
> 
>> Won't this confuse a reader that we have blkdev_issue_write_zeroes(),
>> which does not write zeroes sometimes? Maybe we should rename
>> blkdev_issue_write_zeroes() in some more generic name?
> 
> Maybe. The naming is what it is for hysterical raisins and reflects how
> things are implemented in the storage protocols. I wouldn't worry too
> much about that. We can rename things if need be but we shouldn't plumb
> an essentially identical operation through the block stack just to
> expose a different name at the top.

Not introduction a new operation is a good thing. Especially, since we don't
need a specific max_xxx_xxx_sectors != max_write_zeroes_sectors for it.

I'll rework the patch in this way (it seems it will become pretty small
after that).

One more thing to discuss. The new REQ_NOZERO flag won't be supported
by many block devices (their number will be even less, than number of
REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case
of we will be completing BLKDEV_ZERO_ALLOCATE bios in __blkdev_issue_write_zeroes()
before splitting? I mean introduction of some flag in struct request_queue::limits.
Completion of them with -EOPNOTSUPP in block devices drivers looks
suboptimal for me.

Thanks,
Kirill
Martin K. Petersen Dec. 21, 2019, 6:54 p.m. UTC | #8
Kirill,

> One more thing to discuss. The new REQ_NOZERO flag won't be supported
> by many block devices (their number will be even less, than number of
> REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case of
> we will be completing BLKDEV_ZERO_ALLOCATE bios in
> __blkdev_issue_write_zeroes() before splitting? I mean introduction of
> some flag in struct request_queue::limits.  Completion of them with
> -EOPNOTSUPP in block devices drivers looks suboptimal for me.

We already have the NOFALLBACK flag to let the user make that decision.

If that flag is not specified, and I receive an allocate request for a
SCSI device that does not support ANCHOR, my expectation would be that I
would do a regular write same.

If it's a filesystem that is the recipient of the operation and not a
SCSI device, how to react would depend on how the filesystem handles
unwritten extents, etc.
Kirill Tkhai Dec. 23, 2019, 8:51 a.m. UTC | #9
On 21.12.2019 21:54, Martin K. Petersen wrote:
> 
> Kirill,
> 
>> One more thing to discuss. The new REQ_NOZERO flag won't be supported
>> by many block devices (their number will be even less, than number of
>> REQ_OP_WRITE_ZEROES supporters). Will this be a good thing, in case of
>> we will be completing BLKDEV_ZERO_ALLOCATE bios in
>> __blkdev_issue_write_zeroes() before splitting? I mean introduction of
>> some flag in struct request_queue::limits.  Completion of them with
>> -EOPNOTSUPP in block devices drivers looks suboptimal for me.
> 
> We already have the NOFALLBACK flag to let the user make that decision.
> 
> If that flag is not specified, and I receive an allocate request for a
> SCSI device that does not support ANCHOR, my expectation would be that I
> would do a regular write same.
>
> If it's a filesystem that is the recipient of the operation and not a
> SCSI device, how to react would depend on how the filesystem handles
> unwritten extents, etc.

Ok, this case is clear for me, thanks.

But I also worry about NOFALLBACK case. There are possible block devices,
which support write zeroes, but they can't allocate blocks (block allocation
are just not appliable for them, say, these are all ordinary hdd).

Let's say, a user called fallocate(), and filesystem allocated range of blocks.
Then filesystem propagates the range to block device, and calls zeroout:

blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
		     GFP_NOIO, BLKDEV_ZERO_ALLOCATE|BLKDEV_ZERO_NOFALLBACK);

This case filesystem does not want zeroing blocks, it just wants to send a hint
to block device. So, in case of block device supports allocation, everything is OK.

But won't it be a good thing to return EOPNOTSUPP right from __blkdev_issue_write_zeroes()
in case of block device can't allocate blocks (q->limits.write_zeroes_can_allocate
in the patch below)? Here is just a way to underline block devices, which support
write zeroes, but allocation of blocks is meant nothing for them (wasting of time).

What do you think about the below?

Thanks
---
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..524b47905fd5 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -229,13 +229,19 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE) {
+		if (!q->limits.write_zeroes_can_allocate)
+			return -EOPNOTSUPP;
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+	}
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actual zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..9e3cd3394dd6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -347,6 +347,7 @@ struct queue_limits {
 	unsigned char		misaligned;
 	unsigned char		discard_misaligned;
 	unsigned char		raid_partial_stripes_expensive;
+	bool			write_zeroes_can_allocate;
 	enum blk_zoned_model	zoned;
 };
 
@@ -1219,6 +1220,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
Martin K. Petersen Jan. 7, 2020, 3:24 a.m. UTC | #10
Kirill,

Sorry, the holiday break got in the way.

> But I also worry about NOFALLBACK case. There are possible block
> devices, which support write zeroes, but they can't allocate blocks
> (block allocation are just not appliable for them, say, these are all
> ordinary hdd).

Correct. We shouldn't go down this path unless a device is thinly
provisioned (i.e. max_discard_sectors > 0).

> But won't it be a good thing to return EOPNOTSUPP right from
> __blkdev_issue_write_zeroes() in case of block device can't allocate
> blocks (q->limits.write_zeroes_can_allocate in the patch below)? Here
> is just a way to underline block devices, which support write zeroes,
> but allocation of blocks is meant nothing for them (wasting of time).

I don't like "write_zeroes_can_allocate" because that makes assumptions
about WRITE ZEROES being the command of choice. I suggest we call it
"max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
emphasis on the semantic operation and not the plumbing.
Kirill Tkhai Jan. 7, 2020, 1:59 p.m. UTC | #11
On 07.01.2020 06:24, Martin K. Petersen wrote:
> 
> Kirill,
> 
> Sorry, the holiday break got in the way.
> 
>> But I also worry about NOFALLBACK case. There are possible block
>> devices, which support write zeroes, but they can't allocate blocks
>> (block allocation are just not appliable for them, say, these are all
>> ordinary hdd).
> 
> Correct. We shouldn't go down this path unless a device is thinly
> provisioned (i.e. max_discard_sectors > 0).

(I assumed it is a typo, and you mean max_allocate_sectors like bellow).
 
>> But won't it be a good thing to return EOPNOTSUPP right from
>> __blkdev_issue_write_zeroes() in case of block device can't allocate
>> blocks (q->limits.write_zeroes_can_allocate in the patch below)? Here
>> is just a way to underline block devices, which support write zeroes,
>> but allocation of blocks is meant nothing for them (wasting of time).
> 
> I don't like "write_zeroes_can_allocate" because that makes assumptions
> about WRITE ZEROES being the command of choice. I suggest we call it
> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
> emphasis on the semantic operation and not the plumbing.
 
Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
In the second case we should make all the q->limits.max_write_zeroes_sectors
dereferencing as switches like the below (this is a partial patch and only several
of places are converted to switches as examples):

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..4c45417838f7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,8 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q,
+				    bio->bi_opf & REQ_NOZERO))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1251,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..0cdf7d0386c8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,22 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	if (bio->bi_opf & REQ_NOZERO)
+		max_sectors = q->limits.max_allocate_sectors;
+	else
+		max_sectors = q->limits.max_write_zeroes_sectors;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..3f68a60cb196 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -506,6 +506,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9e3cd3394dd6..6219604a0c12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -989,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+			struct request_queue *q, bool allocate)
+{
+	if (allocate)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -1000,7 +1010,8 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q,
+					op_flags & REQ_NOZERO);
 
 	return q->limits.max_sectors;
 }
Martin K. Petersen Jan. 8, 2020, 2:49 a.m. UTC | #12
Kirill,

>> Correct. We shouldn't go down this path unless a device is thinly
>> provisioned (i.e. max_discard_sectors > 0).
>
> (I assumed it is a typo, and you mean max_allocate_sectors like bellow).

No, this was in the context of not having an explicit queue limit for
allocation. If a device does not have max_discard_sectors > 0 then it is
not thinly provisioned and therefore attempting allocation makes no
sense.

>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>> about WRITE ZEROES being the command of choice. I suggest we call it
>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>> emphasis on the semantic operation and not the plumbing.
>  
> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?

unsigned int. At least for SCSI we could have a device which would use
UNMAP for discards and WRITE SAME for allocates. And therefore the range
limit could be different for the two operations. Sadly.

I have a patch in the pipeline which deals with some problems in this
department because some devices have a split brain wrt. their discard
limits.

> In the second case we should make all the
> q->limits.max_write_zeroes_sectors dereferencing as switches like the
> below (this is a partial patch and only several of places are
> converted to switches as examples):

Something like that, yes.

This is getting a bit messy :( However, I am not sure that scattering
REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
either.

Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
protocol baggage that forces us to have special handling all over the
stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
for the potentially different block count limit, an allocate operation
would be a carbon copy of the plumbing for write zeroes. A lot of
duplication.

So even through I'm increasingly torn on whether introducing separate
REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
the latter. That will also make it easier for me in the SCSI disk
driver.
Kirill Tkhai Jan. 9, 2020, 9:43 a.m. UTC | #13
On 08.01.2020 05:49, Martin K. Petersen wrote:
> 
> Kirill,
> 
>>> Correct. We shouldn't go down this path unless a device is thinly
>>> provisioned (i.e. max_discard_sectors > 0).
>>
>> (I assumed it is a typo, and you mean max_allocate_sectors like bellow).
> 
> No, this was in the context of not having an explicit queue limit for
> allocation. If a device does not have max_discard_sectors > 0 then it is
> not thinly provisioned and therefore attempting allocation makes no
> sense.
> 
>>> I don't like "write_zeroes_can_allocate" because that makes assumptions
>>> about WRITE ZEROES being the command of choice. I suggest we call it
>>> "max_allocate_sectors" to mirror "max_discard_sectors". I.e. put
>>> emphasis on the semantic operation and not the plumbing.
>>  
>> Hm. Do you mean "bool max_allocate_sectors" or "unsigned int max_allocate_sectors"?
> 
> unsigned int. At least for SCSI we could have a device which would use
> UNMAP for discards and WRITE SAME for allocates. And therefore the range
> limit could be different for the two operations. Sadly.
> 
> I have a patch in the pipeline which deals with some problems in this
> department because some devices have a split brain wrt. their discard
> limits.
> 
>> In the second case we should make all the
>> q->limits.max_write_zeroes_sectors dereferencing as switches like the
>> below (this is a partial patch and only several of places are
>> converted to switches as examples):
> 
> Something like that, yes.
> 
> This is getting a bit messy :( However, I am not sure that scattering
> REQ_OP_ALLOCATE all over the I/O stack is particularly attractive
> either.

It looks like changing the second argument of blk_queue_get_max_write_zeroes_sectors()
gives almost clean code. All REQ_NOZERO checks become moved to helper.

> Both REQ_OP_DISCARD and REQ_OP_WRITE_SAME come with some storage
> protocol baggage that forces us to have special handling all over the
> stack. But REQ_OP_WRITE_ZEROES is fairly clean and simple and, except
> for the potentially different block count limit, an allocate operation
> would be a carbon copy of the plumbing for write zeroes. A lot of
> duplication.
> 
> So even through I'm increasingly torn on whether introducing separate
> REQ_OP_ALLOCATE plumbing throughout the stack or having a REQ_ALLOCATE
> flag for REQ_OP_WRITE_ZEROES is best, I still think I'm leaning towards
> the latter. That will also make it easier for me in the SCSI disk
> driver.

Yeah, this sounds good. Also, it looks the REQ_NOZERO patch is really
simpler, it is smaller even.

Below is complete patch for block core (I've excluded hunks for drivers/*,
since they require some cleanup patches for preparations). Please, let
me know whether I understood you correct and everything is OK. Thanks!

[PATCH] block: Add support for REQ_NOZERO

This adds support for REQ_NOZERO extension of REQ_OP_WRITE_ZEROES
operation, which encourages a block device driver to just allocate
blocks (or mark them allocated) instead of actual blocks zeroing.
REQ_NOZERO is aimed to be used for network filesystems providing
a block device interface. Also, block devices, which map a file
on other filesystem (like loop), may use this for less fragmentation
and batching fallocate() requests. Hypervisors like QEMU may
introduce optimizations of clusters allocations based on this.

BLKDEV_ZERO_ALLOCATE is a new corresponding flag for
blkdev_issue_zeroout().

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 block/blk-core.c          |    6 +++---
 block/blk-lib.c           |   17 ++++++++++-------
 block/blk-merge.c         |    9 ++++++---
 block/blk-settings.c      |    4 ++++
 fs/block_dev.c            |    4 ++++
 include/linux/blk_types.h |    5 ++++-
 include/linux/blkdev.h    |   32 ++++++++++++++++++++++++--------
 7 files changed, 55 insertions(+), 22 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 50a5de025d5e..2edcd55624f1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,7 +978,7 @@ generic_make_request_checks(struct bio *bio)
 			goto not_supported;
 		break;
 	case REQ_OP_WRITE_ZEROES:
-		if (!q->limits.max_write_zeroes_sectors)
+		if (!blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf))
 			goto not_supported;
 		break;
 	default:
@@ -1250,10 +1250,10 @@ EXPORT_SYMBOL(submit_bio);
 static int blk_cloned_rq_check_limits(struct request_queue *q,
 				      struct request *rq)
 {
-	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, req_op(rq))) {
+	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
 		printk(KERN_ERR "%s: over max size limit. (%u > %u)\n",
 			__func__, blk_rq_sectors(rq),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 		return -EIO;
 	}
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..3e80279eb029 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -214,7 +214,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		struct bio **biop, unsigned flags)
 {
 	struct bio *bio = *biop;
-	unsigned int max_write_zeroes_sectors;
+	unsigned int max_write_zeroes_sectors, req_flags = 0;
 	struct request_queue *q = bdev_get_queue(bdev);
 
 	if (!q)
@@ -224,18 +224,21 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EPERM;
 
 	/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
-	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
+	max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev, flags);
 
 	if (max_write_zeroes_sectors == 0)
 		return -EOPNOTSUPP;
 
+	if (flags & BLKDEV_ZERO_NOUNMAP)
+		req_flags |= REQ_NOUNMAP;
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		req_flags |= REQ_NOZERO|REQ_NOUNMAP;
+
 	while (nr_sects) {
 		bio = blk_next_bio(bio, 0, gfp_mask);
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
-		bio->bi_opf = REQ_OP_WRITE_ZEROES;
-		if (flags & BLKDEV_ZERO_NOUNMAP)
-			bio->bi_opf |= REQ_NOUNMAP;
+		bio->bi_opf = REQ_OP_WRITE_ZEROES | req_flags;
 
 		if (nr_sects > max_write_zeroes_sectors) {
 			bio->bi_iter.bi_size = max_write_zeroes_sectors << 9;
@@ -362,7 +365,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	sector_t bs_mask;
 	struct bio *bio;
 	struct blk_plug plug;
-	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
+	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev, flags);
 
 	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
 	if ((sector | nr_sects) & bs_mask)
@@ -391,7 +394,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 			try_write_zeroes = false;
 			goto retry;
 		}
-		if (!bdev_write_zeroes_sectors(bdev)) {
+		if (!bdev_write_zeroes_sectors(bdev, flags)) {
 			/*
 			 * Zeroing offload support was indicated, but the
 			 * device reported ILLEGAL REQUEST (for some devices
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 347782a24a35..e3ce4b87bbaa 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -105,15 +105,18 @@ static struct bio *blk_bio_discard_split(struct request_queue *q,
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	unsigned int max_sectors;
+
+	max_sectors = blk_queue_get_max_write_zeroes_sectors(q, bio->bi_opf);
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	if (!max_sectors)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
+	if (bio_sectors(bio) <= max_sectors)
 		return NULL;
 
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, max_sectors, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5f6dcc7a47bd..09a2eeacfadd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_allocate_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_allocate_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -506,6 +508,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_allocate_sectors = min(t->max_allocate_sectors,
+					b->max_allocate_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 69bf2fb6f7cd..1ffef894b3bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -2122,6 +2122,10 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
 		break;
+	case FALLOC_FL_KEEP_SIZE:
+		error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+			GFP_KERNEL, BLKDEV_ZERO_ALLOCATE | BLKDEV_ZERO_NOFALLBACK);
+		break;
 	case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
 		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 					     GFP_KERNEL, 0);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..9ed166860099 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -335,7 +335,9 @@ enum req_flag_bits {
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
-
+	__REQ_NOZERO,		/* only notify about allocated blocks,
+				 * and do not actual zero them
+				 */
 	__REQ_HIPRI,
 
 	/* for driver use */
@@ -362,6 +364,7 @@ enum req_flag_bits {
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
+#define REQ_NOZERO		(1ULL << __REQ_NOZERO)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c45779f00cbd..7e388be625af 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_allocate_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -988,9 +989,19 @@ static inline struct bio_vec req_bvec(struct request *rq)
 	return mp_bvec_iter_bvec(rq->bio->bi_io_vec, rq->bio->bi_iter);
 }
 
+static inline unsigned int blk_queue_get_max_write_zeroes_sectors(
+		struct request_queue *q, unsigned int op_flags)
+{
+	if (op_flags & REQ_NOZERO)
+		return q->limits.max_allocate_sectors;
+	return q->limits.max_write_zeroes_sectors;
+}
+
 static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     int op)
+						     unsigned int op_flags)
 {
+	int op = op_flags & REQ_OP_MASK;
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
@@ -999,7 +1010,7 @@ static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return q->limits.max_write_same_sectors;
 
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
-		return q->limits.max_write_zeroes_sectors;
+		return blk_queue_get_max_write_zeroes_sectors(q, op_flags);
 
 	return q->limits.max_sectors;
 }
@@ -1029,10 +1040,10 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
+		return blk_queue_get_max_sectors(q, rq->cmd_flags);
 
 	return min(blk_max_size_offset(q, offset),
-			blk_queue_get_max_sectors(q, req_op(rq)));
+			blk_queue_get_max_sectors(q, rq->cmd_flags));
 }
 
 static inline unsigned int blk_rq_count_bios(struct request *rq)
@@ -1219,6 +1230,7 @@ extern int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 
 #define BLKDEV_ZERO_NOUNMAP	(1 << 0)  /* do not free blocks */
 #define BLKDEV_ZERO_NOFALLBACK	(1 << 1)  /* don't write explicit zeroes */
+#define BLKDEV_ZERO_ALLOCATE	(1 << 2)  /* allocate range of blocks */
 
 extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop,
@@ -1418,14 +1430,18 @@ static inline unsigned int bdev_write_same(struct block_device *bdev)
 	return 0;
 }
 
-static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
+static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev,
+						     unsigned int flags)
 {
 	struct request_queue *q = bdev_get_queue(bdev);
 
-	if (q)
+	if (!q)
+		return 0;
+
+	if (flags & BLKDEV_ZERO_ALLOCATE)
+		return q->limits.max_allocate_sectors;
+	else
 		return q->limits.max_write_zeroes_sectors;
-
-	return 0;
 }
 
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
diff mbox series

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 8d4db6e74496..060cc0ea1246 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -978,6 +978,10 @@  generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_zeroes_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		if (!q->limits.max_assign_range_sectors)
+			goto not_supported;
+		break;
 	default:
 		break;
 	}
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 5f2c429d4378..fbf780d3ea32 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -252,6 +252,46 @@  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	return 0;
 }
 
+static int __blkdev_issue_assign_range(struct block_device *bdev,
+		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
+		struct bio **biop, unsigned flags)
+{
+	struct bio *bio = *biop;
+	unsigned int max_sectors;
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (!q)
+		return -ENXIO;
+
+	if (bdev_read_only(bdev))
+		return -EPERM;
+
+	max_sectors = bdev_assign_range_sectors(bdev);
+
+	if (max_sectors == 0)
+		return -EOPNOTSUPP;
+
+	while (nr_sects) {
+		bio = blk_next_bio(bio, 0, gfp_mask);
+		bio->bi_iter.bi_sector = sector;
+		bio_set_dev(bio, bdev);
+		bio->bi_opf = REQ_OP_ASSIGN_RANGE;
+
+		if (nr_sects > max_sectors) {
+			bio->bi_iter.bi_size = max_sectors << 9;
+			nr_sects -= max_sectors;
+			sector += max_sectors;
+		} else {
+			bio->bi_iter.bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+		cond_resched();
+	}
+
+	*biop = bio;
+	return 0;
+}
+
 /*
  * Convert a number of 512B sectors to a number of pages.
  * The result is limited to a number of pages that can fit into a BIO.
@@ -405,3 +445,33 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 	return ret;
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);
+
+int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+			sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
+{
+	int ret = 0;
+	sector_t bs_mask;
+	struct bio *bio;
+	struct blk_plug plug;
+
+	if (bdev_assign_range_sectors(bdev) == 0)
+		return 0;
+
+	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+	if ((sector | nr_sects) & bs_mask)
+		return -EINVAL;
+
+	bio = NULL;
+	blk_start_plug(&plug);
+
+	ret = __blkdev_issue_assign_range(bdev, sector, nr_sects,
+					  gfp_mask, &bio, flags);
+	if (ret == 0 && bio) {
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+	blk_finish_plug(&plug);
+
+	return ret;
+}
+EXPORT_SYMBOL(blkdev_issue_assign_range);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index d783bdc4559b..b2ae8b5acd72 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -102,6 +102,22 @@  static struct bio *blk_bio_discard_split(struct request_queue *q,
 	return bio_split(bio, split_sectors, GFP_NOIO, bs);
 }
 
+static struct bio *blk_bio_assign_range_split(struct request_queue *q,
+					      struct bio *bio,
+					      struct bio_set *bs,
+					      unsigned *nsegs)
+{
+	*nsegs = 1;
+
+	if (!q->limits.max_assign_range_sectors)
+		return NULL;
+
+	if (bio_sectors(bio) <= q->limits.max_assign_range_sectors)
+		return NULL;
+
+	return bio_split(bio, q->limits.max_assign_range_sectors, GFP_NOIO, bs);
+}
+
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
@@ -300,6 +316,10 @@  void __blk_queue_split(struct request_queue *q, struct bio **bio,
 	case REQ_OP_SECURE_ERASE:
 		split = blk_bio_discard_split(q, *bio, &q->bio_split, nr_segs);
 		break;
+	case REQ_OP_ASSIGN_RANGE:
+		split = blk_bio_assign_range_split(q, *bio, &q->bio_split,
+				nr_segs);
+		break;
 	case REQ_OP_WRITE_ZEROES:
 		split = blk_bio_write_zeroes_split(q, *bio, &q->bio_split,
 				nr_segs);
@@ -382,6 +402,7 @@  unsigned int blk_recalc_rq_segments(struct request *rq)
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
+	case REQ_OP_ASSIGN_RANGE:
 		return 0;
 	case REQ_OP_WRITE_SAME:
 		return 1;
diff --git a/block/bounce.c b/block/bounce.c
index f8ed677a1bf7..017bedba7b23 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -255,6 +255,7 @@  static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		break;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3cdb84cdc488..cf235c997e45 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -63,6 +63,7 @@  static inline bool bio_has_data(struct bio *bio)
 	if (bio &&
 	    bio->bi_iter.bi_size &&
 	    bio_op(bio) != REQ_OP_DISCARD &&
+	    bio_op(bio) != REQ_OP_ASSIGN_RANGE &&
 	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
 	    bio_op(bio) != REQ_OP_WRITE_ZEROES)
 		return true;
@@ -73,6 +74,7 @@  static inline bool bio_has_data(struct bio *bio)
 static inline bool bio_no_advance_iter(struct bio *bio)
 {
 	return bio_op(bio) == REQ_OP_DISCARD ||
+	       bio_op(bio) == REQ_OP_ASSIGN_RANGE ||
 	       bio_op(bio) == REQ_OP_SECURE_ERASE ||
 	       bio_op(bio) == REQ_OP_WRITE_SAME ||
 	       bio_op(bio) == REQ_OP_WRITE_ZEROES;
@@ -184,6 +186,7 @@  static inline unsigned bio_segments(struct bio *bio)
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_ASSIGN_RANGE:
 	case REQ_OP_SECURE_ERASE:
 	case REQ_OP_WRITE_ZEROES:
 		return 0;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..f03dcf25c831 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -296,6 +296,8 @@  enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* assign sector range */
+	REQ_OP_ASSIGN_RANGE	= 15,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3cd1853dbdac..9af70120fe57 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@  struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_assign_range_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -995,6 +996,10 @@  static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
 
+	if (unlikely(op == REQ_OP_ASSIGN_RANGE))
+		return min(q->limits.max_assign_range_sectors,
+			   UINT_MAX >> SECTOR_SHIFT);
+
 	if (unlikely(op == REQ_OP_WRITE_SAME))
 		return q->limits.max_write_same_sectors;
 
@@ -1028,6 +1033,7 @@  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
+	    req_op(rq) == REQ_OP_ASSIGN_RANGE ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
 		return blk_queue_get_max_sectors(q, req_op(rq));
 
@@ -1225,6 +1231,8 @@  extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		unsigned flags);
 extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
+extern int blkdev_issue_assign_range(struct block_device *bdev, sector_t sector,
+		sector_t nr_sects, gfp_t gfp_mask, unsigned flags);
 
 static inline int sb_issue_discard(struct super_block *sb, sector_t block,
 		sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags)
@@ -1247,6 +1255,17 @@  static inline int sb_issue_zeroout(struct super_block *sb, sector_t block,
 				    gfp_mask, 0);
 }
 
+static inline int sb_issue_assign_range(struct super_block *sb, sector_t block,
+		sector_t nr_blocks, gfp_t gfp_mask)
+{
+	return blkdev_issue_assign_range(sb->s_bdev,
+					 block << (sb->s_blocksize_bits -
+						   SECTOR_SHIFT),
+					 nr_blocks << (sb->s_blocksize_bits -
+						       SECTOR_SHIFT),
+					 gfp_mask, 0);
+}
+
 extern int blk_verify_command(unsigned char *cmd, fmode_t mode);
 
 enum blk_default_limits {
@@ -1428,6 +1447,16 @@  static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
 	return 0;
 }
 
+static inline unsigned int bdev_assign_range_sectors(struct block_device *bdev)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	if (q)
+		return q->limits.max_assign_range_sectors;
+
+	return 0;
+}
+
 static inline enum blk_zoned_model bdev_zoned_model(struct block_device *bdev)
 {
 	struct request_queue *q = bdev_get_queue(bdev);