diff mbox series

[01/11] blk-mq: Add blk_mq_init_queue_ops()

Message ID 1647945585-197349-2-git-send-email-john.garry@huawei.com
State New
Headers show
Series blk-mq/libata/scsi: SCSI driver tagging improvements | expand

Commit Message

John Garry March 22, 2022, 10:39 a.m. UTC
Add an API to allocate a request queue which accepts a custom set of
blk_mq_ops for that request queue.

The reason which we may want custom ops is for queuing requests which we
don't want to go through the normal queuing path.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq.c         | 23 +++++++++++++++++------
 drivers/md/dm-rq.c     |  2 +-
 include/linux/blk-mq.h |  5 ++++-
 3 files changed, 22 insertions(+), 8 deletions(-)

Comments

Christoph Hellwig March 22, 2022, 11:18 a.m. UTC | #1
On Tue, Mar 22, 2022 at 06:39:35PM +0800, John Garry wrote:
> Add an API to allocate a request queue which accepts a custom set of
> blk_mq_ops for that request queue.
> 
> The reason which we may want custom ops is for queuing requests which we
> don't want to go through the normal queuing path.

Eww.  I really do not think we should do separate ops per queue, as that
is going to get us into a deep mess eventually.
John Garry March 22, 2022, 11:33 a.m. UTC | #2
On 22/03/2022 11:18, Christoph Hellwig wrote:
> On Tue, Mar 22, 2022 at 06:39:35PM +0800, John Garry wrote:
>> Add an API to allocate a request queue which accepts a custom set of
>> blk_mq_ops for that request queue.
>>
>> The reason which we may want custom ops is for queuing requests which we
>> don't want to go through the normal queuing path.
> 
> Eww.  I really do not think we should do separate ops per queue, as that
> is going to get us into a deep mess eventually.
> 

Yeah... so far (here) it works out quite nicely, as we don't need to 
change the SCSI blk mq ops nor allocate a scsi_device - everything is 
just separate.

The other method mentioned previously was to add the request "reserved" 
flag and add new paths in scsi_queue_rq() et al to handle this, but that 
gets messy.

Any other ideas ...?

Cheers,
John
Hannes Reinecke March 22, 2022, 12:16 p.m. UTC | #3
On 3/22/22 12:33, John Garry wrote:
> On 22/03/2022 11:18, Christoph Hellwig wrote:
>> On Tue, Mar 22, 2022 at 06:39:35PM +0800, John Garry wrote:
>>> Add an API to allocate a request queue which accepts a custom set of
>>> blk_mq_ops for that request queue.
>>>
>>> The reason which we may want custom ops is for queuing requests which we
>>> don't want to go through the normal queuing path.
>>
>> Eww.  I really do not think we should do separate ops per queue, as that
>> is going to get us into a deep mess eventually.
>>
> 
> Yeah... so far (here) it works out quite nicely, as we don't need to 
> change the SCSI blk mq ops nor allocate a scsi_device - everything is 
> just separate.
> 
> The other method mentioned previously was to add the request "reserved" 
> flag and add new paths in scsi_queue_rq() et al to handle this, but that 
> gets messy.
> 
> Any other ideas ...?
> 

As outlined in the other mail, I think might be useful is to have a 
_third_ type of requests (in addition to the normal and the reserved ones).
That one would be allocated from the normal I/O pool (and hence could 
fail if the pool is exhausted), but would be able to carry a different 
payload (type) than the normal requests.
And we could have a separate queue_rq for these requests, as we can 
differentiate them in the block layer.

Cheers,

Hannes
John Garry March 22, 2022, 12:30 p.m. UTC | #4
On 22/03/2022 12:16, Hannes Reinecke wrote:
> On 3/22/22 12:33, John Garry wrote:
>> On 22/03/2022 11:18, Christoph Hellwig wrote:
>>> On Tue, Mar 22, 2022 at 06:39:35PM +0800, John Garry wrote:
>>>> Add an API to allocate a request queue which accepts a custom set of
>>>> blk_mq_ops for that request queue.
>>>>
>>>> The reason which we may want custom ops is for queuing requests 
>>>> which we
>>>> don't want to go through the normal queuing path.
>>>
>>> Eww.  I really do not think we should do separate ops per queue, as that
>>> is going to get us into a deep mess eventually.
>>>
>>
>> Yeah... so far (here) it works out quite nicely, as we don't need to 
>> change the SCSI blk mq ops nor allocate a scsi_device - everything is 
>> just separate.
>>
>> The other method mentioned previously was to add the request 
>> "reserved" flag and add new paths in scsi_queue_rq() et al to handle 
>> this, but that gets messy.
>>
>> Any other ideas ...?
>>
> 
> As outlined in the other mail, I think might be useful is to have a 
> _third_ type of requests (in addition to the normal and the reserved ones).
> That one would be allocated from the normal I/O pool (and hence could 
> fail if the pool is exhausted), but would be able to carry a different 
> payload (type) than the normal requests.

As mentioned in the cover letter response, it just seems best to keep 
the normal scsi_cmnd payload but have other means to add on the internal 
command data, like using host_scribble or scsi_cmnd priv data.

> And we could have a separate queue_rq for these requests, as we can 
> differentiate them in the block layer.

I don't know, let me think about it. Maybe we could add an "internal" 
blk flag, which uses a separate "internal" queue_rq callback.

Thanks,
John
Hannes Reinecke March 22, 2022, 2:03 p.m. UTC | #5
On 3/22/22 13:30, John Garry wrote:
> On 22/03/2022 12:16, Hannes Reinecke wrote:
>> On 3/22/22 12:33, John Garry wrote:
>>> On 22/03/2022 11:18, Christoph Hellwig wrote:
>>>> On Tue, Mar 22, 2022 at 06:39:35PM +0800, John Garry wrote:
>>>>> Add an API to allocate a request queue which accepts a custom set of
>>>>> blk_mq_ops for that request queue.
>>>>>
>>>>> The reason which we may want custom ops is for queuing requests 
>>>>> which we
>>>>> don't want to go through the normal queuing path.
>>>>
>>>> Eww.  I really do not think we should do separate ops per queue, as 
>>>> that
>>>> is going to get us into a deep mess eventually.
>>>>
>>>
>>> Yeah... so far (here) it works out quite nicely, as we don't need to 
>>> change the SCSI blk mq ops nor allocate a scsi_device - everything is 
>>> just separate.
>>>
>>> The other method mentioned previously was to add the request 
>>> "reserved" flag and add new paths in scsi_queue_rq() et al to handle 
>>> this, but that gets messy.
>>>
>>> Any other ideas ...?
>>>
>>
>> As outlined in the other mail, I think might be useful is to have a 
>> _third_ type of requests (in addition to the normal and the reserved 
>> ones).
>> That one would be allocated from the normal I/O pool (and hence could 
>> fail if the pool is exhausted), but would be able to carry a different 
>> payload (type) than the normal requests.
> 
> As mentioned in the cover letter response, it just seems best to keep 
> the normal scsi_cmnd payload but have other means to add on the internal 
> command data, like using host_scribble or scsi_cmnd priv data.
> 
Well; I found that most drivers I had been looking at the scsi command 
payload isn't used at all; the drivers primarily cared about the 
(driver-provided) payload, and were completely ignoring the scsi command 
payload.

Similar for ATA/libsas: you basically never issue real scsi commands, 
but either 'raw' ATA requests or SCSI TMFs. None of which are scsi 
commands, so providing them is a bit of a waste.

(And causes irritations, too, as a scsi command requires associated 
pointers like ->device etc to be set up. Which makes it tricky to use 
for the initial device setup.)

>> And we could have a separate queue_rq for these requests, as we can 
>> differentiate them in the block layer.
> 
> I don't know, let me think about it. Maybe we could add an "internal" 
> blk flag, which uses a separate "internal" queue_rq callback.
> 
Yeah, that's what I had in mind.

Cheers,

Hannes
John Garry March 22, 2022, 3:17 p.m. UTC | #6
On 22/03/2022 14:03, Hannes Reinecke wrote:
>>
>> As mentioned in the cover letter response, it just seems best to keep 
>> the normal scsi_cmnd payload but have other means to add on the 
>> internal command data, like using host_scribble or scsi_cmnd priv data.
>>
> Well; I found that most drivers I had been looking at the scsi command 
> payload isn't used at all; the drivers primarily cared about the 
> (driver-provided) payload, and were completely ignoring the scsi command 
> payload.
> 
> Similar for ATA/libsas: you basically never issue real scsi commands, 
> but either 'raw' ATA requests or SCSI TMFs. None of which are scsi 
> commands, so providing them is a bit of a waste.
> 
> (And causes irritations, too, as a scsi command requires associated 
> pointers like ->device etc to be set up. Which makes it tricky to use 
> for the initial device setup.)

A problem I see is that in scsi_mq_init_request() we allocate memories 
like sense_buffer and prot_sdb and store the pointers in the scsi_cmnd 
payload. If we then reuse a scsi_cmnd payload as an "internal" command 
payload then this data may be lost.

It might be possible to reuse the scsi cmnd payload for the "internal", 
but I would rather not get hung up on it now if possible.

Thanks,
John
Hannes Reinecke March 22, 2022, 3:31 p.m. UTC | #7
On 3/22/22 16:17, John Garry wrote:
> On 22/03/2022 14:03, Hannes Reinecke wrote:
>>>
>>> As mentioned in the cover letter response, it just seems best to keep 
>>> the normal scsi_cmnd payload but have other means to add on the 
>>> internal command data, like using host_scribble or scsi_cmnd priv data.
>>>
>> Well; I found that most drivers I had been looking at the scsi command 
>> payload isn't used at all; the drivers primarily cared about the 
>> (driver-provided) payload, and were completely ignoring the scsi 
>> command payload.
>>
>> Similar for ATA/libsas: you basically never issue real scsi commands, 
>> but either 'raw' ATA requests or SCSI TMFs. None of which are scsi 
>> commands, so providing them is a bit of a waste.
>>
>> (And causes irritations, too, as a scsi command requires associated 
>> pointers like ->device etc to be set up. Which makes it tricky to use 
>> for the initial device setup.)
> 
> A problem I see is that in scsi_mq_init_request() we allocate memories 
> like sense_buffer and prot_sdb and store the pointers in the scsi_cmnd 
> payload. If we then reuse a scsi_cmnd payload as an "internal" command 
> payload then this data may be lost.
> 
> It might be possible to reuse the scsi cmnd payload for the "internal", 
> but I would rather not get hung up on it now if possible.
> 
Or, keep the payload as is, and use the 'internal' marker to indicate 
that the scsi payload is not valid.
That would save us quite some checks during endio processing.

Cheers,

Hannes
Bart Van Assche March 23, 2022, 2:57 a.m. UTC | #8
On 3/22/22 03:39, John Garry wrote:
> Add an API to allocate a request queue which accepts a custom set of
> blk_mq_ops for that request queue.
> 
> The reason which we may want custom ops is for queuing requests which we
> don't want to go through the normal queuing path.

Custom ops shouldn't be required for this. See e.g. how tmf_queue
is used in the UFS driver for an example of a queue implementation
with custom operations and that does not require changes of the block
layer core.

Thanks,

Bart.
John Garry March 23, 2022, 9:01 a.m. UTC | #9
On 23/03/2022 02:57, Bart Van Assche wrote:
> On 3/22/22 03:39, John Garry wrote:
>> Add an API to allocate a request queue which accepts a custom set of
>> blk_mq_ops for that request queue.
>>
>> The reason which we may want custom ops is for queuing requests which we
>> don't want to go through the normal queuing path.
> 

Hi Bart,

 > Custom ops shouldn't be required for this. See e.g. how tmf_queue
 > is used in the UFS driver for an example of a queue implementation
 > with custom operations and that does not require changes of the block
 > layer core.

The UFS code uses a private tagset (in ufs_hba.tmf_tag_set) for only 
management of TMF tags/memories. This tagset does not really have any 
custom operations. All it has is a stub of .queue_rq CB in 
ufshcd_queue_tmf() and that is because this CB is compulsory.

As for the idea of having multiple tagsets per shost with real custom 
operations, this idea was mentioned before, but I think managing 
multiple tagsets could be trouble. For a start, it would mean that we 
need a distinct allocation of reserved and regular tags, and sometimes 
we don't want this - as Hannes mentioned earlier, many HBAs have low 
queue depth and cannot afford to permanently carve out a bunch of 
reserved tags.

Thanks,
John
diff mbox series

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f3bf3358a3bb..8ea3447339ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3858,7 +3858,7 @@  void blk_mq_release(struct request_queue *q)
 }
 
 static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
-		void *queuedata)
+		void *queuedata, const struct blk_mq_ops *ops)
 {
 	struct request_queue *q;
 	int ret;
@@ -3867,27 +3867,35 @@  static struct request_queue *blk_mq_init_queue_data(struct blk_mq_tag_set *set,
 	if (!q)
 		return ERR_PTR(-ENOMEM);
 	q->queuedata = queuedata;
-	ret = blk_mq_init_allocated_queue(set, q);
+	ret = blk_mq_init_allocated_queue(set, q, ops);
 	if (ret) {
 		blk_cleanup_queue(q);
 		return ERR_PTR(ret);
 	}
+
 	return q;
 }
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 {
-	return blk_mq_init_queue_data(set, NULL);
+	return blk_mq_init_queue_data(set, NULL, NULL);
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+struct request_queue *blk_mq_init_queue_ops(struct blk_mq_tag_set *set,
+					    const struct blk_mq_ops *custom_ops)
+{
+	return blk_mq_init_queue_data(set, NULL, custom_ops);
+}
+EXPORT_SYMBOL(blk_mq_init_queue_ops);
+
 struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set, void *queuedata,
 		struct lock_class_key *lkclass)
 {
 	struct request_queue *q;
 	struct gendisk *disk;
 
-	q = blk_mq_init_queue_data(set, queuedata);
+	q = blk_mq_init_queue_data(set, queuedata, NULL);
 	if (IS_ERR(q))
 		return ERR_CAST(q);
 
@@ -4010,13 +4018,16 @@  static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 }
 
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-		struct request_queue *q)
+		struct request_queue *q, const struct blk_mq_ops *custom_ops)
 {
 	WARN_ON_ONCE(blk_queue_has_srcu(q) !=
 			!!(set->flags & BLK_MQ_F_BLOCKING));
 
 	/* mark the queue as mq asap */
-	q->mq_ops = set->ops;
+	if (custom_ops)
+		q->mq_ops = custom_ops;
+	else
+		q->mq_ops = set->ops;
 
 	q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
 					     blk_mq_poll_stats_bkt,
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3907950a0ddc..9d93f72a3eec 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -560,7 +560,7 @@  int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	if (err)
 		goto out_kfree_tag_set;
 
-	err = blk_mq_init_allocated_queue(md->tag_set, md->queue);
+	err = blk_mq_init_allocated_queue(md->tag_set, md->queue, NULL);
 	if (err)
 		goto out_tag_set;
 	return 0;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d319ffa59354..e12d17c86c52 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -688,8 +688,11 @@  struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set, void *queuedata,
 	__blk_mq_alloc_disk(set, queuedata, &__key);			\
 })
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
+struct request_queue *blk_mq_init_queue_ops(struct blk_mq_tag_set *,
+		const struct blk_mq_ops *custom_ops);
+
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-		struct request_queue *q);
+		struct request_queue *q, const struct blk_mq_ops *custom_ops);
 void blk_mq_unregister_dev(struct device *, struct request_queue *);
 
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set);