diff mbox

[RFC,6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

Message ID 1490911959-5146-7-git-send-email-logang@deltatee.com
State Not Applicable
Headers show

Commit Message

Logan Gunthorpe March 30, 2017, 10:12 p.m. UTC
p2pmem will always be iomem so if we ever access it, we should be using
the correct methods to read and write to it.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Stephen Bates <sbates@raithlin.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/nvme/target/core.c        | 18 ++++++++++++++++--
 drivers/nvme/target/fabrics-cmd.c | 28 +++++++++++++++-------------
 drivers/nvme/target/nvmet.h       |  1 +
 drivers/nvme/target/rdma.c        | 13 ++++++-------
 4 files changed, 38 insertions(+), 22 deletions(-)

Comments

Sagi Grimberg April 4, 2017, 10:59 a.m. UTC | #1
>  u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
>  		size_t len)
>  {
> -	if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
> +	bool iomem = req->p2pmem;
> +	size_t ret;
> +
> +	ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
> +			     false, iomem);
> +
> +	if (ret != len)
>  		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
> +
>  	return 0;
>  }

We can never ever get here from an IO command, and that is a good thing
because it would have been broken if we did, regardless of what copy
method we use...

Note that the nvme completion queues are still on the host memory, so
this means we have lost the ordering between data and completions as
they go to different pcie targets.

If at all, this is the place to *emphasize* we must never get here
with p2pmem, and immediately fail if we do.

I'm not sure what will happen with to copy_from_sgl, I guess we
have the same race because the nvme submission queues are also
on the host memory (which is on a different pci target). Maybe
more likely to happen with write-combine enabled?

Anyway I don't think we have a real issue here *currently*, because
we use copy_to_sgl only for admin/fabrics commands emulation and
copy_from_sgl to setup dsm ranges...
Jason Gunthorpe April 4, 2017, 3:46 p.m. UTC | #2
On Tue, Apr 04, 2017 at 01:59:26PM +0300, Sagi Grimberg wrote:
> Note that the nvme completion queues are still on the host memory, so
> this means we have lost the ordering between data and completions as
> they go to different pcie targets.

Hmm, in this simple up/down case with a switch, I think it might
actually be OK.

Transactions might not complete at the NVMe device before the CPU
processes the RDMA completion, however due to the PCI-E ordering rules
new TLPs directed to the NVMe will complete after the RMDA TLPs and
thus observe the new data. (eg order preserving)

It would be very hard to use P2P if fabric ordering is not preserved..

Jason
Logan Gunthorpe April 4, 2017, 5:21 p.m. UTC | #3
On 04/04/17 04:59 AM, Sagi Grimberg wrote:
> We can never ever get here from an IO command, and that is a good thing
> because it would have been broken if we did, regardless of what copy
> method we use...

Yes, I changed this mostly for admin commands. I did notice connect
commands do end up reading from the p2mem and this patchset correctly
switches it to iomemcpy. However, based on Cristoph's comment, I hope to
make it more general such that iomem is hidden within sgls and any
access will either be correct or create a warning.


On 04/04/17 09:46 AM, Jason Gunthorpe wrote:
> Transactions might not complete at the NVMe device before the CPU
> processes the RDMA completion, however due to the PCI-E ordering rules
> new TLPs directed to the NVMe will complete after the RMDA TLPs and
> thus observe the new data. (eg order preserving)
> 
> It would be very hard to use P2P if fabric ordering is not preserved..

Yes, my understanding is the same, the PCI-E ordering rules save us here.

Thanks,

Logan
Sagi Grimberg April 6, 2017, 5:33 a.m. UTC | #4
>> Note that the nvme completion queues are still on the host memory, so
>> this means we have lost the ordering between data and completions as
>> they go to different pcie targets.
>
> Hmm, in this simple up/down case with a switch, I think it might
> actually be OK.
>
> Transactions might not complete at the NVMe device before the CPU
> processes the RDMA completion, however due to the PCI-E ordering rules
> new TLPs directed to the NVMe will complete after the RMDA TLPs and
> thus observe the new data. (eg order preserving)
>
> It would be very hard to use P2P if fabric ordering is not preserved..

I think it still can race if the p2p device is connected with more than
a single port to the switch.

Say it's connected via 2 legs, the bar is accessed from leg A and the
data from the disk comes via leg B. In this case, the data is heading
towards the p2p device via leg B (might be congested), the completion
goes directly to the RC, and then the host issues a read from the
bar via leg A. I don't understand what can guarantee ordering here.

Stephen told me that this still guarantees ordering, but I honestly
can't understand how, perhaps someone can explain to me in a simple
way that I can understand.
Logan Gunthorpe April 6, 2017, 4:02 p.m. UTC | #5
On 05/04/17 11:33 PM, Sagi Grimberg wrote:
> 
>>> Note that the nvme completion queues are still on the host memory, so
>>> this means we have lost the ordering between data and completions as
>>> they go to different pcie targets.
>>
>> Hmm, in this simple up/down case with a switch, I think it might
>> actually be OK.
>>
>> Transactions might not complete at the NVMe device before the CPU
>> processes the RDMA completion, however due to the PCI-E ordering rules
>> new TLPs directed to the NVMe will complete after the RMDA TLPs and
>> thus observe the new data. (eg order preserving)
>>
>> It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.
> 
> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

I'll say I don't have a complete understanding of this myself. However,
my understanding is the completion coming from disk won't be sent toward
the RC until all the all the TLPs reached leg B. Then if the RC sends
TLPs to the p2p device via leg B they will be behind all the TLPs the
disk sent. Or something like that. Obviously this will only work with a
tree topology (which I believe is the only topology that makes sense for
PCI). If you had a mesh topology, then the data could route around
congestion and that would get around the ordering restrictions.

Logan
Jason Gunthorpe April 6, 2017, 4:35 p.m. UTC | #6
On Thu, Apr 06, 2017 at 08:33:38AM +0300, Sagi Grimberg wrote:
> 
> >>Note that the nvme completion queues are still on the host memory, so
> >>this means we have lost the ordering between data and completions as
> >>they go to different pcie targets.
> >
> >Hmm, in this simple up/down case with a switch, I think it might
> >actually be OK.
> >
> >Transactions might not complete at the NVMe device before the CPU
> >processes the RDMA completion, however due to the PCI-E ordering rules
> >new TLPs directed to the NVMe will complete after the RMDA TLPs and
> >thus observe the new data. (eg order preserving)
> >
> >It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

Right, this is why I qualified my statement with 'simple up/down case'

Make it any more complex and it clearly stops working sanely, but I
wouldn't worry about unusual PCI-E fabrics at this point..

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

AFAIK PCI-E ordering is explicitly per link, so things that need order
must always traverse the same link.

Jason
Stephen Bates April 7, 2017, 11:19 a.m. UTC | #7
On 2017-04-06, 6:33 AM, "Sagi Grimberg" <sagi@grimberg.me> wrote:

> Say it's connected via 2 legs, the bar is accessed from leg A and the

> data from the disk comes via leg B. In this case, the data is heading

> towards the p2p device via leg B (might be congested), the completion

> goes directly to the RC, and then the host issues a read from the

> bar via leg A. I don't understand what can guarantee ordering here.


> Stephen told me that this still guarantees ordering, but I honestly

> can't understand how, perhaps someone can explain to me in a simple

> way that I can understand.


Sagi

As long as legA, legB and the RC are all connected to the same switch then ordering will be preserved (I think many other topologies also work). Here is how it would work for the problem case you are concerned about (which is a read from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch due to congestion but if so they are stalled in the egress path of the switch for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 updates the memory associated with the CQ. It issues some operation to read the p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device connected to the switch it is queued in the egress queue for the p2pmem FIO behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can never pass writes). Therefore the MemRd can never get to the p2pmem device until after the last DMA MemWr has.

I hope this helps!

Stephen
Sagi Grimberg April 10, 2017, 8:29 a.m. UTC | #8
> Sagi
>
> As long as legA, legB and the RC are all connected to the same switch then ordering will be preserved (I think many other topologies also work). Here is how it would work for the problem case you are concerned about (which is a read from the NVMe drive).
>
> 1. Disk device DMAs out the data to the p2pmem device via a string of PCIe MemWr TLPs.
> 2. Disk device writes to the completion queue (in system memory) via a MemWr TLP.
> 3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch due to congestion but if so they are stalled in the egress path of the switch for the p2pmem port.
> 4. The RC determines the IO is complete when the TLP associated with step 2 updates the memory associated with the CQ. It issues some operation to read the p2pmem.
> 5. Regardless of whether the MemRd TLP comes from the RC or another device connected to the switch it is queued in the egress queue for the p2pmem FIO behind the last DMA TLP (from step 1).
> PCIe ordering ensures that this MemRd cannot overtake the MemWr (Reads can never pass writes).
> Therefore the MemRd can never get to the p2pmem device until after the last DMA MemWr has.

What you are saying is surprising to me. The switch needs to preserve
ordering across different switch ports ??

You are suggesting that there is a *switch-wide* state that tracks
MemRds never pass MemWrs across all the switch ports? That is a very
non-trivial statement...
Logan Gunthorpe April 10, 2017, 4:03 p.m. UTC | #9
On 10/04/17 02:29 AM, Sagi Grimberg wrote:
> What you are saying is surprising to me. The switch needs to preserve
> ordering across different switch ports ??

> You are suggesting that there is a *switch-wide* state that tracks
> MemRds never pass MemWrs across all the switch ports? That is a very
> non-trivial statement...

Yes, it is a requirement of the PCIe spec for transactions to be
strongly ordered throughout the fabric so switches must have an internal
state across all ports. Without that, it would be impossible to have PCI
cards work together even if they are using system memory to do so. Also,
I believe, it was done this way to maintain maximum compatibility with
the legacy PCI bus. There is also a relaxed ordering bit that allows
specific transactions to ignore ordering which can help performance.

Obviously this becomes impossible if you have some kind of complex
multi-path fabric.

Logan
diff mbox

Patch

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 798653b..a1524d5 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -45,15 +45,29 @@  static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port,
 u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
 		size_t len)
 {
-	if (sg_pcopy_from_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+	bool iomem = req->p2pmem;
+	size_t ret;
+
+	ret = sg_copy_buffer(req->sg, req->sg_cnt, (void *)buf, len, off,
+			     false, iomem);
+
+	if (ret != len)
 		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
 	return 0;
 }
 
 u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len)
 {
-	if (sg_pcopy_to_buffer(req->sg, req->sg_cnt, buf, len, off) != len)
+	bool iomem = req->p2pmem;
+	size_t ret;
+
+	ret = sg_copy_buffer(req->sg, req->sg_cnt, buf, len, off, true,
+			     iomem);
+
+	if (ret != len)
 		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+
 	return 0;
 }
 
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 8bd022af..9d966f0 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -118,11 +118,13 @@  static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
 static void nvmet_execute_admin_connect(struct nvmet_req *req)
 {
 	struct nvmf_connect_command *c = &req->cmd->connect;
-	struct nvmf_connect_data *d;
+	struct nvmf_connect_data d;
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d));
+	if (status)
+		goto out;
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -134,16 +136,16 @@  static void nvmet_execute_admin_connect(struct nvmet_req *req)
 		goto out;
 	}
 
-	if (unlikely(d->cntlid != cpu_to_le16(0xffff))) {
+	if (unlikely(d.cntlid != cpu_to_le16(0xffff))) {
 		pr_warn("connect attempt for invalid controller ID %#x\n",
-			d->cntlid);
+			d.cntlid);
 		status = NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
 		req->rsp->result.u32 = IPO_IATTR_CONNECT_DATA(cntlid);
 		goto out;
 	}
 
-	status = nvmet_alloc_ctrl(d->subsysnqn, d->hostnqn, req,
-			le32_to_cpu(c->kato), &ctrl);
+	status = nvmet_alloc_ctrl(d.subsysnqn, d.hostnqn, req,
+				  le32_to_cpu(c->kato), &ctrl);
 	if (status)
 		goto out;
 
@@ -158,19 +160,20 @@  static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	req->rsp->result.u16 = cpu_to_le16(ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
 	nvmet_req_complete(req, status);
 }
 
 static void nvmet_execute_io_connect(struct nvmet_req *req)
 {
 	struct nvmf_connect_command *c = &req->cmd->connect;
-	struct nvmf_connect_data *d;
+	struct nvmf_connect_data d;
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 qid = le16_to_cpu(c->qid);
 	u16 status = 0;
 
-	d = kmap(sg_page(req->sg)) + req->sg->offset;
+	status = nvmet_copy_from_sgl(req, 0, &d, sizeof(d));
+	if (status)
+		goto out;
 
 	/* zero out initial completion result, assign values as needed */
 	req->rsp->result.u32 = 0;
@@ -182,9 +185,9 @@  static void nvmet_execute_io_connect(struct nvmet_req *req)
 		goto out;
 	}
 
-	status = nvmet_ctrl_find_get(d->subsysnqn, d->hostnqn,
-			le16_to_cpu(d->cntlid),
-			req, &ctrl);
+	status = nvmet_ctrl_find_get(d.subsysnqn, d.hostnqn,
+				     le16_to_cpu(d.cntlid),
+				     req, &ctrl);
 	if (status)
 		goto out;
 
@@ -205,7 +208,6 @@  static void nvmet_execute_io_connect(struct nvmet_req *req)
 	pr_info("adding queue %d to ctrl %d.\n", qid, ctrl->cntlid);
 
 out:
-	kunmap(sg_page(req->sg));
 	nvmet_req_complete(req, status);
 	return;
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index ab67175..ccd79ed 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -226,6 +226,7 @@  struct nvmet_req {
 
 	void (*execute)(struct nvmet_req *req);
 	struct nvmet_fabrics_ops *ops;
+	struct p2pmem_dev       *p2pmem;
 };
 
 static inline void nvmet_set_status(struct nvmet_req *req, u16 status)
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 7fd4840..abab544 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -65,7 +65,6 @@  struct nvmet_rdma_rsp {
 	struct rdma_rw_ctx	rw;
 
 	struct nvmet_req	req;
-	struct p2pmem_dev       *p2pmem;
 
 	u8			n_rdma;
 	u32			flags;
@@ -501,7 +500,7 @@  static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
 
 	if (rsp->req.sg != &rsp->cmd->inline_sg)
 		nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt,
-				    rsp->p2pmem);
+				    rsp->req.p2pmem);
 
 	if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list)))
 		nvmet_rdma_process_wr_wait_list(queue);
@@ -642,14 +641,14 @@  static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 	if (!len)
 		return 0;
 
-	rsp->p2pmem = rsp->queue->p2pmem;
+	rsp->req.p2pmem = rsp->queue->p2pmem;
 	status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-			len, rsp->p2pmem);
+				      len, rsp->req.p2pmem);
 
-	if (status && rsp->p2pmem) {
-		rsp->p2pmem = NULL;
+	if (status && rsp->req.p2pmem) {
+		rsp->req.p2pmem = NULL;
 		status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-					      len, rsp->p2pmem);
+					      len, rsp->req.p2pmem);
 	}
 
 	if (status)