diff mbox

[RFC,0/8] Copy Offload with Peer-to-Peer PCI Memory

Message ID	20170418210339.GA24257@obsidianresearch.com
State	Not Applicable
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> Date: Tue, 18 Apr 2017 15:03:39 -0600 From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> To: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com>, Benjamin Herrenschmidt <benh@kernel.crashing.org>, Bjorn Helgaas <helgaas@kernel.org>, Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>, "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>, "Martin K. Petersen" <martin.petersen@oracle.com>, Jens Axboe <axboe@kernel.dk>, Steve Wise <swise@opengridcomputing.com>, Stephen Bates <sbates@raithlin.com>, Max Gurtovoy <maxg@mellanox.com>, Keith Busch <keith.busch@intel.com>, linux-pci@vger.kernel.org, linux-scsi <linux-scsi@vger.kernel.org>, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm <linux-nvdimm@ml01.01.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, Jerome Glisse <jglisse@redhat.com> Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory Message-ID: <20170418210339.GA24257@obsidianresearch.com> References: <CAPcyv4it56J8Voo6kV0bBcO3nHsOHYLENpAtONJZTGceDDwNPg@mail.gmail.com> <1492381396.25766.43.camel@kernel.crashing.org> <20170418164557.GA7181@obsidianresearch.com> <cce00131-1f28-27b3-40ab-04f8783f1e5a@deltatee.com> <20170418190138.GH7181@obsidianresearch.com> <df1351d8-b86c-2e21-1948-4688ece5dc2b@deltatee.com> <CAPcyv4gScx6A7vG9VEHpNF41GOy1Nxst7QQ3QC3uZ54bWoxbMg@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <CAPcyv4gScx6A7vG9VEHpNF41GOy1Nxst7QQ3QC3uZ54bWoxbMg@mail.gmail.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Message ID

20170418210339.GA24257@obsidianresearch.com

State

Not Applicable

Headers

Date: Tue, 18 Apr 2017 15:03:39 -0600
From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Bjorn Helgaas <helgaas@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Sagi Grimberg <sagi@grimberg.me>,
	"James E.J. Bottomley" <jejb@linux.vnet.ibm.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Jens Axboe <axboe@kernel.dk>, Steve Wise <swise@opengridcomputing.com>,
	Stephen Bates <sbates@raithlin.com>, Max Gurtovoy <maxg@mellanox.com>,
	Keith Busch <keith.busch@intel.com>, linux-pci@vger.kernel.org,
	linux-scsi <linux-scsi@vger.kernel.org>,
	linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
	linux-nvdimm <linux-nvdimm@ml01.01.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Jerome Glisse <jglisse@redhat.com>
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Message-ID: <20170418210339.GA24257@obsidianresearch.com>
References: <CAPcyv4it56J8Voo6kV0bBcO3nHsOHYLENpAtONJZTGceDDwNPg@mail.gmail.com>
	<1492381396.25766.43.camel@kernel.crashing.org>
	<20170418164557.GA7181@obsidianresearch.com>
	<cce00131-1f28-27b3-40ab-04f8783f1e5a@deltatee.com>
	<20170418190138.GH7181@obsidianresearch.com>
	<df1351d8-b86c-2e21-1948-4688ece5dc2b@deltatee.com>
	<CAPcyv4gScx6A7vG9VEHpNF41GOy1Nxst7QQ3QC3uZ54bWoxbMg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAPcyv4gScx6A7vG9VEHpNF41GOy1Nxst7QQ3QC3uZ54bWoxbMg@mail.gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-pci-owner@vger.kernel.org
Precedence: bulk

Commit Message

Jason Gunthorpe April 18, 2017, 9:03 p.m. UTC

On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:

> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

I don't follow, when does get_dma_ops() return a p2p aware provider?
It has no way to know if the DMA is going to involve p2p, get_dma_ops
is called with the device initiating the DMA.

So you'd always return the P2P shim on a system that has registered
P2P memory?

Even so, how does this shim work? dma_ops are not really intended to
be stacked. How would we make unmap work, for instance? What happens
when the underlying iommu dma ops actually natively understands p2p
and doesn't want the shim?

I think this opens an even bigger can of worms..

Lets find a strategy to safely push this into dma_ops.

What about something more incremental like this instead:
- dma_ops will set map_sg_p2p == map_sg when they are updated to
  support p2p, otherwise DMA on P2P pages will fail for those ops.
- When all ops support p2p we remove the if and ops->map_sg then
  just call map_sg_p2p
- For now the scatterlist maintains a bit when pages are added indicating if
  p2p memory might be present in the list.
- Unmap for p2p and non-p2p is the same, the underlying ops driver has
  to make it work.

Comments

Dan Williams April 18, 2017, 9:11 p.m. UTC | #1

On Tue, Apr 18, 2017 at 2:03 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:
>
>> > Yes, I noticed this problem too and that makes sense. It just means
>> > every dma_ops will probably need to be modified to either support p2p
>> > pages or fail on them. Though, the only real difficulty there is that it
>> > will be a lot of work.
>>
>> I don't think you need to go touch all dma_ops, I think you can just
>> arrange for devices that are going to do dma to get redirected to a
>> p2p aware provider of operations that overrides the system default
>> dma_ops. I.e. just touch get_dma_ops().
>
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
>
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
>
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?
>
> I think this opens an even bigger can of worms..

No, I don't think it does. You'd only shim when the target page is
backed by a device, not host memory, and you can figure this out by a
is_zone_device_page()-style lookup.

Jason Gunthorpe April 18, 2017, 9:22 p.m. UTC | #2

On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > I think this opens an even bigger can of worms..
> 
> No, I don't think it does. You'd only shim when the target page is
> backed by a device, not host memory, and you can figure this out by a
> is_zone_device_page()-style lookup.

The bigger can of worms is how do you meaningfully stack dma_ops.

What does the p2p provider do when it detects a p2p page?

Jason

Logan Gunthorpe April 18, 2017, 9:31 p.m. UTC | #3

On 18/04/17 03:03 PM, Jason Gunthorpe wrote:
> What about something more incremental like this instead:
> - dma_ops will set map_sg_p2p == map_sg when they are updated to
>   support p2p, otherwise DMA on P2P pages will fail for those ops.
> - When all ops support p2p we remove the if and ops->map_sg then
>   just call map_sg_p2p
> - For now the scatterlist maintains a bit when pages are added indicating if
>   p2p memory might be present in the list.
> - Unmap for p2p and non-p2p is the same, the underlying ops driver has
>   to make it work.
> 
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 0977317c6835c2..505ed7d502053d 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -103,6 +103,9 @@ struct dma_map_ops {
>  	int (*map_sg)(struct device *dev, struct scatterlist *sg,
>  		      int nents, enum dma_data_direction dir,
>  		      unsigned long attrs);
> +	int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
> +			  int nents, enum dma_data_direction dir,
> +			  unsigned long attrs);
>  	void (*unmap_sg)(struct device *dev,
>  			 struct scatterlist *sg, int nents,
>  			 enum dma_data_direction dir,
> @@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
>  	for_each_sg(sg, s, nents, i)
>  		kmemcheck_mark_initialized(sg_virt(s), s->length);
>  	BUG_ON(!valid_dma_direction(dir));
> -	ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
> +	if (sg_has_p2p(sg)) {
> +		if (ops->map_sg_p2p)
> +			ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
> +		else
> +			return 0;
> +	} else
> +		ents = ops->map_sg(dev, sg, nents, dir, attrs);
> +
>  	BUG_ON(ents < 0);
>  	debug_dma_map_sg(dev, sg, nents, ents, dir);

I could get behind this. Though a couple of points:

1) It means that sg_has_p2p has to walk the entire sg and check every
page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
then do some operation per page. If anyone is concerned about the
dma_map performance this could be an issue.

2) Without knowing exactly what the arch specific code may need to do
it's hard to say that this is exactly the right approach. If every
dma_ops provider has to do exactly this on every page it may lead to a
lot of duplicate code:

foreach_sg page:
   if (pci_page_is_p2p(page)) {
        dma_addr = pci_p2p_map_page(page)
	if (!dma_addr)
		return 0;
        continue
   }
   ...

The only thing I'm presently aware of is the segment check and applying
the offset to the physical address -- neither of which has much to do
with the specific dma_ops providers. It _may_ be that this needs to be
bus specific and not arch specific which I think is what Dan may be
getting at. So it may make sense to just have a pci_map_sg_p2p() which
takes a dma_ops struct it would use for any page that isn't a p2p page.

Logan

Dan Williams April 18, 2017, 9:36 p.m. UTC | #4

On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>> > I think this opens an even bigger can of worms..
>>
>> No, I don't think it does. You'd only shim when the target page is
>> backed by a device, not host memory, and you can figure this out by a
>> is_zone_device_page()-style lookup.
>
> The bigger can of worms is how do you meaningfully stack dma_ops.

This goes back to my original comment to make this capability a
function of the pci bridge itself. The kernel has an implementation of
a dynamically created bridge device that injects its own dma_ops for
the devices behind the bridge. See vmd_setup_dma_ops() in
drivers/pci/host/vmd.c.

> What does the p2p provider do when it detects a p2p page?

Check to see if the arch requires this offset translation that Ben
brought up and if not provide the physical address as the patches are
doing now.

Logan Gunthorpe April 18, 2017, 10:15 p.m. UTC | #5

On 18/04/17 03:36 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>>>> I think this opens an even bigger can of worms..
>>>
>>> No, I don't think it does. You'd only shim when the target page is
>>> backed by a device, not host memory, and you can figure this out by a
>>> is_zone_device_page()-style lookup.
>>
>> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> This goes back to my original comment to make this capability a
> function of the pci bridge itself. The kernel has an implementation of
> a dynamically created bridge device that injects its own dma_ops for
> the devices behind the bridge. See vmd_setup_dma_ops() in
> drivers/pci/host/vmd.c.

Well the issue I think Jason is pointing out is that the ops don't
stack. The map_* function in the injected dma_ops needs to be able to
call the original map_* for any page that is not p2p memory. This is
especially annoying in the map_sg function which may need to call a
different op based on the contents of the sgl. (And please correct me if
I'm not seeing how this can be done in the vmd example.)

Also, what happens if p2p pages end up getting passed to a device that
doesn't have the injected dma_ops?

However, the concept of replacing the dma_ops for all devices behind a
supporting bridge is interesting and may be a good piece of the final
solution.

Logan

Jason Gunthorpe April 18, 2017, 10:24 p.m. UTC | #6

On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:

> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern, this is why I suggest this as an
interm solution until all dma_ops are migrated. Ideally sg_has_p2p
would be a fast path that checked some kind of flags bit set during
sg_assign_page... 

This would probably all have to be protected with CONFIG_P2P until it
becomes performance neutral. People without an iommu are not going to
want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a
determination..

I suspect the main server oriented iommu dma op will want to have
proper p2p support anyhow and will probably have their unique control
flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last
suggestion for a reason - that is all the helper does.

The intention would be for real iommu drivers to call that helper for
the one simple case and if it fails then use their own routines to
figure out if cross-segment P2P is possible and configure the iommu as
needed.

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked.

Try and write a stacked map_sg function like you describe and you will
see how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the
entire sg list. Thus a trivial implementation to iterate over all sg
list entries is not desired.

So first a sg list without p2p memory would have to be created, pass
to the lower level ops, then brought back. Remember, the returned sg
list will have a different number of entries than the original. Now
another complex loop is needed to split/merge back in the p2p sg
elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to
just putting a helper call in the existing sg iteration loop of the
actual op.  Particularly if the actual op is a no-op like no-mmu x86
would use.

Since dma mapping is a performance path we must be careful not to
create intrinsic inefficiencies with otherwise nice layering :)

Jason

Dan Williams April 18, 2017, 10:28 p.m. UTC | #7

On Tue, Apr 18, 2017 at 3:15 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 03:36 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 2:22 PM, Jason Gunthorpe
>> <jgunthorpe@obsidianresearch.com> wrote:
>>> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
>>>>> I think this opens an even bigger can of worms..
>>>>
>>>> No, I don't think it does. You'd only shim when the target page is
>>>> backed by a device, not host memory, and you can figure this out by a
>>>> is_zone_device_page()-style lookup.
>>>
>>> The bigger can of worms is how do you meaningfully stack dma_ops.
>>
>> This goes back to my original comment to make this capability a
>> function of the pci bridge itself. The kernel has an implementation of
>> a dynamically created bridge device that injects its own dma_ops for
>> the devices behind the bridge. See vmd_setup_dma_ops() in
>> drivers/pci/host/vmd.c.
>
> Well the issue I think Jason is pointing out is that the ops don't
> stack. The map_* function in the injected dma_ops needs to be able to
> call the original map_* for any page that is not p2p memory. This is
> especially annoying in the map_sg function which may need to call a
> different op based on the contents of the sgl. (And please correct me if
> I'm not seeing how this can be done in the vmd example.)

Unlike the pci bus address offset case which I think is fundamental to
support since shipping archs do this today, I think it is ok to say
p2p is restricted to a single sgl that gets to talk to host memory or
a single device. That said, what's wrong with a p2p aware map_sg
implementation calling up to the host memory map_sg implementation on
a per sgl basis?

> Also, what happens if p2p pages end up getting passed to a device that
> doesn't have the injected dma_ops?

This goes back to limiting p2p to a single pci host bridge. If the p2p
capability is coordinated with the bridge rather than between the
individual devices then we have a central point to catch this case.

...of course this is all hand wavy until someone writes the code and
proves otherwise.

> However, the concept of replacing the dma_ops for all devices behind a
> supporting bridge is interesting and may be a good piece of the final
> solution.

It's at least a proof point for injecting special behavior for devices
behind a (virtual) pci bridge without needing to go touch a bunch of
drivers.

Jason Gunthorpe April 18, 2017, 10:42 p.m. UTC | #8

On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:

> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this toda

But we can support this by modifying those arch's unique dma_ops
directly.

Eg as I explained, my p2p_same_segment_map_page() helper concept would
do the offset adjustment for same-segement DMA.

If PPC calls that in their IOMMU drivers then they will have proper
support for this basic p2p, and the right framework to move on to more
advanced cases of p2p.

This really seems like much less trouble than trying to wrapper all
the arch's dma ops, and doesn't have the wonky restrictions.

> I think it is ok to say p2p is restricted to a single sgl that gets
> to talk to host memory or a single device.

RDMA and GPU would be sad with this restriction...

> That said, what's wrong with a p2p aware map_sg implementation
> calling up to the host memory map_sg implementation on a per sgl
> basis?

Setting up the iommu is fairly expensive, so getting rid of the
batching would kill performance..

Jason

Logan Gunthorpe April 18, 2017, 10:48 p.m. UTC | #9

On 18/04/17 04:28 PM, Dan Williams wrote:
> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today, I think it is ok to say
> p2p is restricted to a single sgl that gets to talk to host memory or
> a single device. That said, what's wrong with a p2p aware map_sg
> implementation calling up to the host memory map_sg implementation on
> a per sgl basis?

I think Ben said they need mixed sgls and that is where this gets messy.
I think I'd prefer this too given trying to enforce all sgs in a list to
be one type or another could be quite difficult given the state of the
scatterlist code.

>> Also, what happens if p2p pages end up getting passed to a device that
>> doesn't have the injected dma_ops?
> 
> This goes back to limiting p2p to a single pci host bridge. If the p2p
> capability is coordinated with the bridge rather than between the
> individual devices then we have a central point to catch this case.

Not really relevant. If these pages get to userspace (as people seem
keen on doing) or a less than careful kernel driver they could easily
get into the dma_map calls of devices that aren't even pci related (via
an O_DIRECT operation on an incorrect file or something). The common
code must reject these and can't rely on an injected dma op.

Logan

Dan Williams April 18, 2017, 10:50 p.m. UTC | #10

On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 04:28 PM, Dan Williams wrote:
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this today, I think it is ok to say
>> p2p is restricted to a single sgl that gets to talk to host memory or
>> a single device. That said, what's wrong with a p2p aware map_sg
>> implementation calling up to the host memory map_sg implementation on
>> a per sgl basis?
>
> I think Ben said they need mixed sgls and that is where this gets messy.
> I think I'd prefer this too given trying to enforce all sgs in a list to
> be one type or another could be quite difficult given the state of the
> scatterlist code.
>
>>> Also, what happens if p2p pages end up getting passed to a device that
>>> doesn't have the injected dma_ops?
>>
>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>> capability is coordinated with the bridge rather than between the
>> individual devices then we have a central point to catch this case.
>
> Not really relevant. If these pages get to userspace (as people seem
> keen on doing) or a less than careful kernel driver they could easily
> get into the dma_map calls of devices that aren't even pci related (via
> an O_DIRECT operation on an incorrect file or something). The common
> code must reject these and can't rely on an injected dma op.

No, we can't do that at get_user_pages() time, it will always need to
be up to the device driver to fail dma that it can't perform.

Dan Williams April 18, 2017, 10:51 p.m. UTC | #11

On Tue, Apr 18, 2017 at 3:42 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:
>
>> Unlike the pci bus address offset case which I think is fundamental to
>> support since shipping archs do this toda
>
> But we can support this by modifying those arch's unique dma_ops
> directly.
>
> Eg as I explained, my p2p_same_segment_map_page() helper concept would
> do the offset adjustment for same-segement DMA.
>
> If PPC calls that in their IOMMU drivers then they will have proper
> support for this basic p2p, and the right framework to move on to more
> advanced cases of p2p.
>
> This really seems like much less trouble than trying to wrapper all
> the arch's dma ops, and doesn't have the wonky restrictions.

I don't think the root bus iommu drivers have any business knowing or
caring about dma happening between devices lower in the hierarchy.

>> I think it is ok to say p2p is restricted to a single sgl that gets
>> to talk to host memory or a single device.
>
> RDMA and GPU would be sad with this restriction...
>
>> That said, what's wrong with a p2p aware map_sg implementation
>> calling up to the host memory map_sg implementation on a per sgl
>> basis?
>
> Setting up the iommu is fairly expensive, so getting rid of the
> batching would kill performance..

When we're crossing device and host memory boundaries how much
batching is possible? As far as I can see you'll always be splitting
the sgl on these dma mapping boundaries.

Logan Gunthorpe April 18, 2017, 10:56 p.m. UTC | #12

On 18/04/17 04:50 PM, Dan Williams wrote:
> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>>
>> On 18/04/17 04:28 PM, Dan Williams wrote:
>>> Unlike the pci bus address offset case which I think is fundamental to
>>> support since shipping archs do this today, I think it is ok to say
>>> p2p is restricted to a single sgl that gets to talk to host memory or
>>> a single device. That said, what's wrong with a p2p aware map_sg
>>> implementation calling up to the host memory map_sg implementation on
>>> a per sgl basis?
>>
>> I think Ben said they need mixed sgls and that is where this gets messy.
>> I think I'd prefer this too given trying to enforce all sgs in a list to
>> be one type or another could be quite difficult given the state of the
>> scatterlist code.
>>
>>>> Also, what happens if p2p pages end up getting passed to a device that
>>>> doesn't have the injected dma_ops?
>>>
>>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>>> capability is coordinated with the bridge rather than between the
>>> individual devices then we have a central point to catch this case.
>>
>> Not really relevant. If these pages get to userspace (as people seem
>> keen on doing) or a less than careful kernel driver they could easily
>> get into the dma_map calls of devices that aren't even pci related (via
>> an O_DIRECT operation on an incorrect file or something). The common
>> code must reject these and can't rely on an injected dma op.
> 
> No, we can't do that at get_user_pages() time, it will always need to
> be up to the device driver to fail dma that it can't perform.

I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
to fail for any dma it cannot perform. Which means either all dma_ops
providers need to be p2p aware or this logic has to be in dma_map_*
itself. My point being: you can't rely on an injected dma_op for some
devices to handle the fail case globally.

Dan Williams April 18, 2017, 11:02 p.m. UTC | #13

On Tue, Apr 18, 2017 at 3:56 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 18/04/17 04:50 PM, Dan Williams wrote:
>> On Tue, Apr 18, 2017 at 3:48 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>>
>>>
>>> On 18/04/17 04:28 PM, Dan Williams wrote:
>>>> Unlike the pci bus address offset case which I think is fundamental to
>>>> support since shipping archs do this today, I think it is ok to say
>>>> p2p is restricted to a single sgl that gets to talk to host memory or
>>>> a single device. That said, what's wrong with a p2p aware map_sg
>>>> implementation calling up to the host memory map_sg implementation on
>>>> a per sgl basis?
>>>
>>> I think Ben said they need mixed sgls and that is where this gets messy.
>>> I think I'd prefer this too given trying to enforce all sgs in a list to
>>> be one type or another could be quite difficult given the state of the
>>> scatterlist code.
>>>
>>>>> Also, what happens if p2p pages end up getting passed to a device that
>>>>> doesn't have the injected dma_ops?
>>>>
>>>> This goes back to limiting p2p to a single pci host bridge. If the p2p
>>>> capability is coordinated with the bridge rather than between the
>>>> individual devices then we have a central point to catch this case.
>>>
>>> Not really relevant. If these pages get to userspace (as people seem
>>> keen on doing) or a less than careful kernel driver they could easily
>>> get into the dma_map calls of devices that aren't even pci related (via
>>> an O_DIRECT operation on an incorrect file or something). The common
>>> code must reject these and can't rely on an injected dma op.
>>
>> No, we can't do that at get_user_pages() time, it will always need to
>> be up to the device driver to fail dma that it can't perform.
>
> I'm not sure I follow -- are you agreeing with me? The dma_map_* needs
> to fail for any dma it cannot perform. Which means either all dma_ops
> providers need to be p2p aware or this logic has to be in dma_map_*
> itself. My point being: you can't rely on an injected dma_op for some
> devices to handle the fail case globally.

Ah, I see what you're saying now. Yes, we do need something that
guarantees any dma mapping implementation that gets a struct page that
it does now know how to translate properly fails the request.

Logan Gunthorpe April 18, 2017, 11:03 p.m. UTC | #14

On 18/04/17 04:24 PM, Jason Gunthorpe wrote:
> Try and write a stacked map_sg function like you describe and you will
> see how horrible it quickly becomes.

Yes, unfortunately, I have to agree with this statement completely.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)

Yeah, I'm also personally thinking your proposal is the way to go as
well. Dan's injected ops suggestion is interesting but I can't see how
it solves the issue completely. Your proposal is the only one that seems
to be complete to me. It just has a few minor pain points which I've
already described but are likely manageable and less than the pain
stacked dma_ops creates.

Logan

Jason Gunthorpe April 18, 2017, 11:21 p.m. UTC | #15

On Tue, Apr 18, 2017 at 03:51:27PM -0700, Dan Williams wrote:
> > This really seems like much less trouble than trying to wrapper all
> > the arch's dma ops, and doesn't have the wonky restrictions.
> 
> I don't think the root bus iommu drivers have any business knowing or
> caring about dma happening between devices lower in the hierarchy.

Maybe not, but performance requires some odd choices in this code.. :(

> > Setting up the iommu is fairly expensive, so getting rid of the
> > batching would kill performance..
> 
> When we're crossing device and host memory boundaries how much
> batching is possible? As far as I can see you'll always be splitting
> the sgl on these dma mapping boundaries.

Splitting the sgl is different from iommu batching.

As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
the middle.

The optimum behavior is to allocate a 1MB-4K iommu range and fill it
with the CPU memory. Then return a SGL with three entires, two
pointing into the range and one to the p2p.

It is creating each range which tends to be expensive, so creating two
ranges (or worse, if every SGL created a range it would be 255) is
very undesired.

Jason

Benjamin Herrenschmidt April 19, 2017, 1:20 a.m. UTC | #16

On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"

What do you think ?

Cheers,
Ben.

Benjamin Herrenschmidt April 19, 2017, 1:21 a.m. UTC | #17

On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > a
> > is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.

Benjamin Herrenschmidt April 19, 2017, 1:23 a.m. UTC | #18

On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason

Benjamin Herrenschmidt April 19, 2017, 1:25 a.m. UTC | #19

On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entires, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two
> ranges (or worse, if every SGL created a range it would be 255) is
> very undesired.

I think it's easier to get us started to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with IOs ops containing an sglist
that is a collection of memory and GPU pages for example.

Cheers,
Ben.

Jason Gunthorpe April 19, 2017, 3:55 p.m. UTC | #20

On Wed, Apr 19, 2017 at 11:20:06AM +1000, Benjamin Herrenschmidt wrote:

> That helper wouldn't perform the actual iommu mapping. It would simply
> return something along the lines of:
> 
>  - "use that alternate bus address and don't map in the iommu"

I was thinking only this one would be supported with a core code
helper..

>  - "use that alternate bus address and do map in the iommu"
>  - "proceed as normal"

.. because I don't have an idea how a core code helper could figure
out what the platform needs for the above two ..

I think many of the iommu platforms will need to construct their own
bus address in the iommu, and we already have easy access to the CPU
address.

Presumably once a transcation passes through the iommu it needs to be
using the CPU address for the target bar, otherwise things are going
to be ambiguous..

Jason

Logan Gunthorpe April 19, 2017, 4:48 p.m. UTC | #21

On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> I was thinking only this one would be supported with a core code
> helper..

Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
type flag to the dev_pagemap structure which would be very useful to us.
We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
Then, potentially, we could add a dma_map callback to the structure
(possibly unioned with an hmm field). The dev_ops providers would then
just need to do something like this (enclosed in a helper):

if (is_zone_device_page(page)) {
	pgmap = get_dev_pagemap(page_to_pfn(page));
	if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
	    !pgmap->dma_map)
		return 0;

	dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
	put_dev_pagemap(pgmap);
	if (!dma_addr)
		return 0;
	...
}

The pci_enable_p2p_bar function would then just need to call
devm_memremap_pages with the dma_map callback set to a function that
does the segment check and the offset calculation.

Thoughts?

@Jerome: my feedback to you would be that your patch assumes all users
of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
useful if it was generic. My suggestion would be to have the caller
allocate the dev_pagemap structure, populate it and pass it into
devm_memremap_pages. Given that pretty much everything in that structure
are already arguments to that function, I feel like this makes sense.
This should also help to unify hmm_devmem_pages_create and
devm_memremap_pages which look very similar to each other.

Logan

Dan Williams April 19, 2017, 5:01 p.m. UTC | #22

On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
>> I was thinking only this one would be supported with a core code
>> helper..
>
> Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> type flag to the dev_pagemap structure which would be very useful to us.
> We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> Then, potentially, we could add a dma_map callback to the structure
> (possibly unioned with an hmm field). The dev_ops providers would then
> just need to do something like this (enclosed in a helper):
>
> if (is_zone_device_page(page)) {
>         pgmap = get_dev_pagemap(page_to_pfn(page));
>         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
>             !pgmap->dma_map)
>                 return 0;
>
>         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
>         put_dev_pagemap(pgmap);
>         if (!dma_addr)
>                 return 0;
>         ...
> }
>
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.
>
> Thoughts?
>
> @Jerome: my feedback to you would be that your patch assumes all users
> of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> useful if it was generic. My suggestion would be to have the caller
> allocate the dev_pagemap structure, populate it and pass it into
> devm_memremap_pages. Given that pretty much everything in that structure
> are already arguments to that function, I feel like this makes sense.
> This should also help to unify hmm_devmem_pages_create and
> devm_memremap_pages which look very similar to each other.

I like that change. Also the types should describe the memory relative
to its relationship to struct page, not whether it is persistent or
not. I would consider volatile and persistent memory that is attached
to the cpu memory controller and i/o coherent as the same type of
memory. DMA incoherent ranges like P2P and HMM should get their own
types.

Jason Gunthorpe April 19, 2017, 5:14 p.m. UTC | #23

On Wed, Apr 19, 2017 at 10:48:51AM -0600, Logan Gunthorpe wrote:
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.

I don't see a use for the dma_map function pointer at this point..

It doesn't make alot of sense for the completor of the DMA to provide
a mapping op, the mapping process is *path* specific, not specific to
a completer/initiator.

So, I would suggest more like this:

static inline struct device *get_p2p_src(struct page *page)
{
        struct device *res;
	struct dev_pagemap *pgmap;

	if (!is_zone_device_page(page))
	     return NULL;
	
        pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
        if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P)
	        /* For now ZONE_DEVICE memory that is not P2P is
 		   assumed to be configured for DMA the same as CPU
		   memory. */
                return ERR_PTR(-EINVAL);
	res = pgmap->dev;
	device_get(res);
	put_dev_pagemap(pgmap);
	return res;
}

dma_addr_t pci_p2p_same_segment(struct device *initator,
                                struct device *completer,
				struct page *page)
{
   if (! PCI initiator & completer)
       return ERROR;
   if (!same segment initiator & completer)
       return ERROR;

   // Translate page directly to the value programmed into the BAR
   return (Completer's PCI BAR base address) + (offset of page within BAR);
}

// dma_sg_map

for (each sgl) {
    struct page *page = sg_page(s);
    struct device *p2p_src = get_p2p_src(page);

    if (IS_ERR(p2p_src))
        // fail dma_sg

    if (p2p_src) {
        bool needs_iommu = false;

        pa = pci_p2p_same_segment(dev, p2p_src, page);
	if (pa == ERROR)
	    pa = arch_p2p_cross_segment(dev, p2psrc, page, &needs_iommui);

        device_put(p2p_src);

        if (pa == ERROR)
	    // fail

	if (!needs_iommu) {      
	    // Insert PA directly into the result SGL
	    sg++;
	    continue;
	}
    }
    else
        // CPU memory
        pa = page_to_phys(page);


To me it looks like the code duplication across the iommu stuff comes
from just duplicating the basic iommu algorithm in every driver.

To clean that up I think someone would need to hoist the overall sgl
loop and use more ops callbacks eg allocate_iommu_range,
assign_page_to_rage, dealloc_range, etc. This is a problem p2p makes
worse, but isn't directly causing :\

Jason

Jerome Glisse April 19, 2017, 5:32 p.m. UTC | #24

On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
> >
> >
> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> >> I was thinking only this one would be supported with a core code
> >> helper..
> >
> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> > type flag to the dev_pagemap structure which would be very useful to us.
> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> > Then, potentially, we could add a dma_map callback to the structure
> > (possibly unioned with an hmm field). The dev_ops providers would then
> > just need to do something like this (enclosed in a helper):
> >
> > if (is_zone_device_page(page)) {
> >         pgmap = get_dev_pagemap(page_to_pfn(page));
> >         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
> >             !pgmap->dma_map)
> >                 return 0;
> >
> >         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
> >         put_dev_pagemap(pgmap);
> >         if (!dma_addr)
> >                 return 0;
> >         ...
> > }
> >
> > The pci_enable_p2p_bar function would then just need to call
> > devm_memremap_pages with the dma_map callback set to a function that
> > does the segment check and the offset calculation.
> >
> > Thoughts?
> >
> > @Jerome: my feedback to you would be that your patch assumes all users
> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> > useful if it was generic. My suggestion would be to have the caller
> > allocate the dev_pagemap structure, populate it and pass it into
> > devm_memremap_pages. Given that pretty much everything in that structure
> > are already arguments to that function, I feel like this makes sense.
> > This should also help to unify hmm_devmem_pages_create and
> > devm_memremap_pages which look very similar to each other.
> 
> I like that change. Also the types should describe the memory relative
> to its relationship to struct page, not whether it is persistent or
> not. I would consider volatile and persistent memory that is attached
> to the cpu memory controller and i/o coherent as the same type of
> memory. DMA incoherent ranges like P2P and HMM should get their own
> types.

Dan you asked me to not use devm_memremap_pages() because you didn't
want to have HMM memory in the pgmap_radix, did you change opinion
on that ? :)

Note i won't make any change now on that front but if it make sense
i am happy to do it as separate patchset on top of HMM.

Also i don't want p2pmem to be an exclusive or with HMM, we will want
GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
this too.

I do believe it is easier to special case in ZONE_DEVICE into existing
dma_ops for each architecture. For x86 i think there is only 3 different
set of dma_ops to modify. For other arch i guess there is even less.

But in all the case i think p2pmem should stay out of memory management
business. If some set of device do not have memory management it is
better to propose helper to do that as part of the subsystem to which
those devices belong. Just wanted to reiterate that point.

Cheers,
Jérôme

Dan Williams April 19, 2017, 5:41 p.m. UTC | #25

On Wed, Apr 19, 2017 at 10:32 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
>> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> >
>> >
>> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
>> >> I was thinking only this one would be supported with a core code
>> >> helper..
>> >
>> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
>> > type flag to the dev_pagemap structure which would be very useful to us.
>> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
>> > Then, potentially, we could add a dma_map callback to the structure
>> > (possibly unioned with an hmm field). The dev_ops providers would then
>> > just need to do something like this (enclosed in a helper):
>> >
>> > if (is_zone_device_page(page)) {
>> >         pgmap = get_dev_pagemap(page_to_pfn(page));
>> >         if (!pgmap || pgmap->type !=  MEMORY_DEVICE_P2P ||
>> >             !pgmap->dma_map)
>> >                 return 0;
>> >
>> >         dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
>> >         put_dev_pagemap(pgmap);
>> >         if (!dma_addr)
>> >                 return 0;
>> >         ...
>> > }
>> >
>> > The pci_enable_p2p_bar function would then just need to call
>> > devm_memremap_pages with the dma_map callback set to a function that
>> > does the segment check and the offset calculation.
>> >
>> > Thoughts?
>> >
>> > @Jerome: my feedback to you would be that your patch assumes all users
>> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
>> > useful if it was generic. My suggestion would be to have the caller
>> > allocate the dev_pagemap structure, populate it and pass it into
>> > devm_memremap_pages. Given that pretty much everything in that structure
>> > are already arguments to that function, I feel like this makes sense.
>> > This should also help to unify hmm_devmem_pages_create and
>> > devm_memremap_pages which look very similar to each other.
>>
>> I like that change. Also the types should describe the memory relative
>> to its relationship to struct page, not whether it is persistent or
>> not. I would consider volatile and persistent memory that is attached
>> to the cpu memory controller and i/o coherent as the same type of
>> memory. DMA incoherent ranges like P2P and HMM should get their own
>> types.
>
> Dan you asked me to not use devm_memremap_pages() because you didn't
> want to have HMM memory in the pgmap_radix, did you change opinion
> on that ? :)

No, not quite ;-). I still don't think we should require the non-HMM
to pass NULL for all the HMM arguments. What I like about Logan's
proposal is to have a separate create and register steps dev_pagemap.
That way call paths that don't care about HMM specifics can just turn
around and register the vanilla dev_pagemap.

> Note i won't make any change now on that front but if it make sense
> i am happy to do it as separate patchset on top of HMM.
>
> Also i don't want p2pmem to be an exclusive or with HMM, we will want
> GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
> this too.

Yes, this makes sense I think we really just want to distinguish host
memory or not in terms of the dev_pagemap type.

Logan Gunthorpe April 19, 2017, 6:01 p.m. UTC | #26

On 19/04/17 11:14 AM, Jason Gunthorpe wrote:
> I don't see a use for the dma_map function pointer at this point..

Yes, it is kind of like designing for the future. I just find it a
little odd calling the pci functions in the iommu.

> It doesn't make alot of sense for the completor of the DMA to provide
> a mapping op, the mapping process is *path* specific, not specific to
> a completer/initiator.

I'm just spit balling here but if HMM wanted to use unaddressable memory
as a DMA target, it could set that function to create a window ine gpu
memory, then call the pci_p2p_same_segment and return the result as the
dma address.

> dma_addr_t pci_p2p_same_segment(struct device *initator,
>                                 struct device *completer,
> 				struct page *page)

I'm not sure I like the name pci_p2p_same_segment. It reads as though
it's only checking if the devices are not the same segment. It also may
be that, in the future, it supports devices on different segments. I'd
call it more like pci_p2p_dma_map.

> for (each sgl) {

Thanks, this code fleshes things out nicely

> To me it looks like the code duplication across the iommu stuff comes
> from just duplicating the basic iommu algorithm in every driver.

Yes, this is true.

> To clean that up I think someone would need to hoist the overall sgl
> loop and use more ops callbacks eg allocate_iommu_range,
> assign_page_to_rage, dealloc_range, etc. This is a problem p2p makes
> worse, but isn't directly causing :\

Yup.

Logan

Logan Gunthorpe April 19, 2017, 6:11 p.m. UTC | #27

On 19/04/17 11:41 AM, Dan Williams wrote:
> No, not quite ;-). I still don't think we should require the non-HMM
> to pass NULL for all the HMM arguments. What I like about Logan's
> proposal is to have a separate create and register steps dev_pagemap.
> That way call paths that don't care about HMM specifics can just turn
> around and register the vanilla dev_pagemap.

Would you necessarily even need a create step? I was thinking more along
the lines that struct dev_pagemap _could_ just be a member in another
structure. The caller would set the attributes they needed and pass it
to devm_memremap. (Similar to how we commonly do things with struct
device, et al). Potentially, that could also get rid of the need for the
*data pointer HMM is using to get back the struct hmm_devmem seeing
container_of could be used instead.

>> Note i won't make any change now on that front but if it make sense
>> i am happy to do it as separate patchset on top of HMM.
>>
>> Also i don't want p2pmem to be an exclusive or with HMM, we will want
>> GPU to do peer to peer DMA and thus HMM ZONE_DEVICE pages to support
>> this too.
> 
> Yes, this makes sense I think we really just want to distinguish host
> memory or not in terms of the dev_pagemap type.

That depends though. If unaddressable memory needs different steps to
get to the DMA address (ie if it has to setup a gpu window) then we may
still need a way to distinguish the two types of non-host memory.

Logan

Logan Gunthorpe April 19, 2017, 6:19 p.m. UTC | #28

On 19/04/17 12:11 PM, Logan Gunthorpe wrote:
> 
> 
> On 19/04/17 11:41 AM, Dan Williams wrote:
>> No, not quite ;-). I still don't think we should require the non-HMM
>> to pass NULL for all the HMM arguments. What I like about Logan's
>> proposal is to have a separate create and register steps dev_pagemap.
>> That way call paths that don't care about HMM specifics can just turn
>> around and register the vanilla dev_pagemap.
> 
> Would you necessarily even need a create step? I was thinking more along
> the lines that struct dev_pagemap _could_ just be a member in another
> structure. The caller would set the attributes they needed and pass it
> to devm_memremap. (Similar to how we commonly do things with struct
> device, et al). Potentially, that could also get rid of the need for the
> *data pointer HMM is using to get back the struct hmm_devmem seeing
> container_of could be used instead.

Also, now that I've thought about it a little more, it _may_ be that
many or all of the hmm specific fields in dev_pagemap could move to a
containing struct too...

Logan

Dan Williams April 19, 2017, 6:30 p.m. UTC | #29

On Wed, Apr 19, 2017 at 11:19 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 12:11 PM, Logan Gunthorpe wrote:
>>
>>
>> On 19/04/17 11:41 AM, Dan Williams wrote:
>>> No, not quite ;-). I still don't think we should require the non-HMM
>>> to pass NULL for all the HMM arguments. What I like about Logan's
>>> proposal is to have a separate create and register steps dev_pagemap.
>>> That way call paths that don't care about HMM specifics can just turn
>>> around and register the vanilla dev_pagemap.
>>
>> Would you necessarily even need a create step? I was thinking more along
>> the lines that struct dev_pagemap _could_ just be a member in another
>> structure. The caller would set the attributes they needed and pass it
>> to devm_memremap. (Similar to how we commonly do things with struct
>> device, et al). Potentially, that could also get rid of the need for the
>> *data pointer HMM is using to get back the struct hmm_devmem seeing
>> container_of could be used instead.
>
> Also, now that I've thought about it a little more, it _may_ be that
> many or all of the hmm specific fields in dev_pagemap could move to a
> containing struct too...

Yes, that's already how we handle struct page_map, it's an internal
implementation detail of devm_memremap_pages().

Letting others users do the container_of() arrangement means that
struct page_map needs to become public and move into struct
dev_pagemap directly.

...I think that encapsulation loss is worth it for the gain of clearly
separating the HMM-case from the base case.

Jason Gunthorpe April 19, 2017, 6:32 p.m. UTC | #30

On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:

> I'm just spit balling here but if HMM wanted to use unaddressable memory
> as a DMA target, it could set that function to create a window ine gpu
> memory, then call the pci_p2p_same_segment and return the result as the
> dma address.

Not entirely, it would have to call through the whole process
including the arch_p2p_cross_segment()..

Maybe we can start down the road of using ops for more iommu steps
with something like this as the helper:

dma_addr_t dma_map_pa(struct device *initiator, struct page *page,
                      void *data)
{
         struct device *completer = get_p2p_completer(page);
	 dma_addr_t pa;

         if (IS_ERR(completer))
	       return SYSTEM_MEMORY;
	       // Or maybe ?
	       return init_ops->dma_map_pa(..);

	 // Try the generic method
	 pa = pci_p2p_same_segment(dev, p2p_src, page);
	 if (pa != ERROR)
	       goto out;

         // Try the arch specific helper
	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);

         /* FUTURE: Let something translate a HMM page into a DMA'ble
   	    page, eg by mapping it into a GPU window. Maybe this
	    callback lives in devmap ? */
	 page = comp_ops->translate_dma_page(completer, page);

         /* New dma_map_op is the same as arch_p2p_cross_segment in
	    prior version. Any arch specific data needed to program
	    the iommu flows through data */
	 pa = init_ops->p2p_cross_segment_map(completer, inititator, page, data);

out:
	 device_put(completer);
	 return pa;
}

// map_sg op:
for (each sgl) {
    struct page *page = sg_page(s);
    struct arch_iommu_data data = {}; // pass through to ops->p2p_cross_segment
    dma_addr_t pa;

    pa = dma_map_pa(dev, page, &data)
    if (pa == ERROR)
        // fail

    if (!data.needs_iommu) {
        // Insert PA directly into the result SGL
        sg++;
        continue;
    }

    // Else pa & data describe how to setup the iommu
}

> > dma_addr_t pci_p2p_same_segment(struct device *initator,
> >                                 struct device *completer,
> > 				struct page *page)
> 
> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> it's only checking if the devices are not the same segment.

Well, that is exactly what it is doing. If it succeeds then the caller
knows the DMA will not flow outside the segment and no iommu setup/etc
is required.

That function cannot be expanded to include generic cross-segment
traffic, a new function would be needed..

Jason

Logan Gunthorpe April 19, 2017, 6:41 p.m. UTC | #31

On 19/04/17 12:30 PM, Dan Williams wrote:
> Letting others users do the container_of() arrangement means that
> struct page_map needs to become public and move into struct
> dev_pagemap directly.

Ah, yes, I got a bit turned around by that and failed to notice that
page_map and dev_pagemap are different. Why is it that dev_pagemap
contains pretty much the exact same information as page_map? The only
thing gained that I can see is that the struct resource gains const
protection...

> ...I think that encapsulation loss is worth it for the gain of clearly
> separating the HMM-case from the base case.

Agreed.

Logan

Dan Williams April 19, 2017, 6:44 p.m. UTC | #32

On Wed, Apr 19, 2017 at 11:41 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 12:30 PM, Dan Williams wrote:
>> Letting others users do the container_of() arrangement means that
>> struct page_map needs to become public and move into struct
>> dev_pagemap directly.
>
> Ah, yes, I got a bit turned around by that and failed to notice that
> page_map and dev_pagemap are different. Why is it that dev_pagemap
> contains pretty much the exact same information as page_map? The only
> thing gained that I can see is that the struct resource gains const
> protection...
>
>> ...I think that encapsulation loss is worth it for the gain of clearly
>> separating the HMM-case from the base case.
>
> Agreed.
>

Yeah, I forgot that dev_pagemap grew those fields, so we don't have
any real encapsulation today. So this would just be a pure cleanup to
kill struct page_map.

Logan Gunthorpe April 19, 2017, 7:02 p.m. UTC | #33

On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> Not entirely, it would have to call through the whole process
> including the arch_p2p_cross_segment()..

Hmm, yes. Though it's still not clear what, if anything,
arch_p2p_cross_segment would be doing. In my experience, if you are
going between host bridges, the CPU address (or PCI address -- I'm not
sure which seeing they are the same on my system) would still work fine
-- it just _may_ be a bad idea because of performance.

Similarly if you are crossing via a QPI bus or similar, I'd expect the
CPU address to work fine. But here the performance is even less likely
to be any good.

>          // Try the arch specific helper
> 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);

So, in this case, what device does the completer point to? The PCI
device or a more specific GPU device? If it's the former, who's
responsible for setting the new dma_ops? Typically the dma_ops are arch
specific but now you'd be adding ones that are tied to hmm or the gpu.

>> I'm not sure I like the name pci_p2p_same_segment. It reads as though
>> it's only checking if the devices are not the same segment.
> 
> Well, that is exactly what it is doing. If it succeeds then the caller
> knows the DMA will not flow outside the segment and no iommu setup/etc
> is required.

It appears to me like it's calculating the DMA address, and the check is
just a side requirement. It reads as though it's only doing the check.

Logan

Jason Gunthorpe April 19, 2017, 7:31 p.m. UTC | #34

On Wed, Apr 19, 2017 at 01:02:49PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> > Not entirely, it would have to call through the whole process
> > including the arch_p2p_cross_segment()..
> 
> Hmm, yes. Though it's still not clear what, if anything,
> arch_p2p_cross_segment would be doing.

Sets up the iommu for arches that place a iommu between the pci root
port and other pci root ports.

> In my experience, if you are going between host bridges, the CPU
> address (or PCI address -- I'm not sure which seeing they are the
> same on my system) would still work fine

Try it with VT-D turned on. It shouldn't work or there is a notable
security hole in your platform..

> > 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> > 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>
> So, in this case, what device does the completer point to? The PCI
> device or a more specific GPU device? If it's the former, who's
> responsible for setting the new dma_ops? Typically the dma_ops are arch
> specific but now you'd be adding ones that are tied to hmm or the gpu.

Donno, that is for GPU folks to figure out :)

But.. it could point to a GPU and the GPU struct device could have a
proxy dma_ops like Dan pointed out.

> >> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> >> it's only checking if the devices are not the same segment.
> > 
> > Well, that is exactly what it is doing. If it succeeds then the caller
> > knows the DMA will not flow outside the segment and no iommu setup/etc
> > is required.
> 
> It appears to me like it's calculating the DMA address, and the check is
> just a side requirement. It reads as though it's only doing the check.

pci_p2p_same_segment_get_pa() then?

Jason

Logan Gunthorpe April 19, 2017, 7:41 p.m. UTC | #35

On 19/04/17 01:31 PM, Jason Gunthorpe wrote:
> Try it with VT-D turned on. It shouldn't work or there is a notable
> security hole in your platform..

Ah, ok.

>>> 	 const struct dma_map_ops *comp_ops = get_dma_ops(completer);
>>> 	 const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>>
>> So, in this case, what device does the completer point to? The PCI
>> device or a more specific GPU device? If it's the former, who's
>> responsible for setting the new dma_ops? Typically the dma_ops are arch
>> specific but now you'd be adding ones that are tied to hmm or the gpu.
> 
> Donno, that is for GPU folks to figure out :)
> 
> But.. it could point to a GPU and the GPU struct device could have a
> proxy dma_ops like Dan pointed out.

Seems a bit awkward to me that in order for the intended use case, you
have to proxy the dma_ops. I'd probably still suggest throwing a couple
ops for things like this in the dev_pagemap.

>> It appears to me like it's calculating the DMA address, and the check is
>> just a side requirement. It reads as though it's only doing the check.
> 
> pci_p2p_same_segment_get_pa() then?

Ok, I think that's a bit clearer.

Logan

Jason Gunthorpe April 19, 2017, 8:48 p.m. UTC | #36

On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:

> > But.. it could point to a GPU and the GPU struct device could have a
> > proxy dma_ops like Dan pointed out.
> 
> Seems a bit awkward to me that in order for the intended use case, you
> have to proxy the dma_ops. I'd probably still suggest throwing a couple
> ops for things like this in the dev_pagemap.

Another option is adding a new 'struct completer_dma_ops *' to struct
device for this use case.

Seems like a waste to expand dev_pagemap when we only need a unique
value per struct device?

Jason

Logan Gunthorpe April 19, 2017, 10:55 p.m. UTC | #37

On 19/04/17 02:48 PM, Jason Gunthorpe wrote:
> On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:
> 
>>> But.. it could point to a GPU and the GPU struct device could have a
>>> proxy dma_ops like Dan pointed out.
>>
>> Seems a bit awkward to me that in order for the intended use case, you
>> have to proxy the dma_ops. I'd probably still suggest throwing a couple
>> ops for things like this in the dev_pagemap.
> 
> Another option is adding a new 'struct completer_dma_ops *' to struct
> device for this use case.
> 
> Seems like a waste to expand dev_pagemap when we only need a unique
> value per struct device?

I feel like expanding dev_pagemap has a much lower impact than expanding
struct device... dev_pagemap is only one instance per zone device region
so expanding it shouldn't be a huge issue. Expanding struct device means
every device struct in the system gets bigger.

Logan

Dan Williams April 20, 2017, 12:07 a.m. UTC | #38

On Wed, Apr 19, 2017 at 3:55 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 19/04/17 02:48 PM, Jason Gunthorpe wrote:
>> On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:
>>
>>>> But.. it could point to a GPU and the GPU struct device could have a
>>>> proxy dma_ops like Dan pointed out.
>>>
>>> Seems a bit awkward to me that in order for the intended use case, you
>>> have to proxy the dma_ops. I'd probably still suggest throwing a couple
>>> ops for things like this in the dev_pagemap.
>>
>> Another option is adding a new 'struct completer_dma_ops *' to struct
>> device for this use case.
>>
>> Seems like a waste to expand dev_pagemap when we only need a unique
>> value per struct device?
>
> I feel like expanding dev_pagemap has a much lower impact than expanding
> struct device... dev_pagemap is only one instance per zone device region
> so expanding it shouldn't be a huge issue. Expanding struct device means
> every device struct in the system gets bigger.

Especially since we expect a very small subset of devices will ever support p2p.

Stephen Bates April 20, 2017, 8:43 p.m. UTC | #39

> Yes, this makes sense I think we really just want to distinguish host

> memory or not in terms of the dev_pagemap type.


I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).

Stephen

Dan Williams April 20, 2017, 8:47 p.m. UTC | #40

On Thu, Apr 20, 2017 at 1:43 PM, Stephen  Bates <sbates@raithlin.com> wrote:
>
>> Yes, this makes sense I think we really just want to distinguish host
>> memory or not in terms of the dev_pagemap type.
>
> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).
>

Why persistence? It has zero meaning to the mm.

Stephen Bates April 20, 2017, 11:07 p.m. UTC | #41

>> Yes, this makes sense I think we really just want to distinguish host

>> memory or not in terms of the dev_pagemap type.

>

>> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).

>>

>

> Why persistence? It has zero meaning to the mm.


I like the idea of having properties of the memory in one place. While mm might not use persistence today it may make use certain things that persistence implies (like finite endurance and/or higher write latency) in the future. Also the persistence of the memory must have issues for mm security? Again not addressed today but useful in the future.

In addition I am not sure where else would be an appropriate place to put something like a persistence property flag. I know the NVDIMM section of the kernel uses things like NFIT to describe properties of the memory but we don’t yet (to my knowledge) have something similar for IO memory.

Stephen

Dan Williams April 21, 2017, 4:59 a.m. UTC | #42

On Thu, Apr 20, 2017 at 4:07 PM, Stephen  Bates <sbates@raithlin.com> wrote:
>>> Yes, this makes sense I think we really just want to distinguish host
>>> memory or not in terms of the dev_pagemap type.
>>
>>> I would like to see mutually exclusive flags for host memory (or not) and persistence (or not).
>>>
>>
>> Why persistence? It has zero meaning to the mm.
>
> I like the idea of having properties of the memory in one place.

We do have memory type data in the global iomem_resource tree, see
IORES_DESC_PERSISTENT_MEMORY.

> While mm might not use persistence today it may make use certain things that
> persistence implies (like finite endurance and/or higher write latency) in the future.

A persistence flag does not convey endurance or latency information.

> Also the persistence of the memory must have issues for mm security?

Not for the mm, data at rest security might be a property of the
device, but that's not the mm's concern.

>Again not addressed today but useful in the future.

Maybe, but to me "Useful for the future" == "don't add it to the
kernel until that future arrives".

> In addition I am not sure where else would be an appropriate place to put something like a persistence property flag. I know the NVDIMM section of the kernel uses things like NFIT to describe properties of the memory but we don’t yet (to my knowledge) have something similar for IO memory.

Do the IORES_DESC flags give you what you need?

diff mbox

Patch

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0977317c6835c2..505ed7d502053d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,6 +103,9 @@  struct dma_map_ops {
 	int (*map_sg)(struct device *dev, struct scatterlist *sg,
 		      int nents, enum dma_data_direction dir,
 		      unsigned long attrs);
+	int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
+			  int nents, enum dma_data_direction dir,
+			  unsigned long attrs);
 	void (*unmap_sg)(struct device *dev,
 			 struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir,
@@ -244,7 +247,15 @@  static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	for_each_sg(sg, s, nents, i)
 		kmemcheck_mark_initialized(sg_virt(s), s->length);
 	BUG_ON(!valid_dma_direction(dir));
-	ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+	if (sg_has_p2p(sg)) {
+		if (ops->map_sg_p2p)
+			ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
+		else
+			return 0;
+	} else
+		ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
 	BUG_ON(ents < 0);
 	debug_dma_map_sg(dev, sg, nents, ents, dir);