diff mbox

[v6,1/3] of/pci/dma: fix DMA configuration for PCI masters

Message ID 1494912127-12890-2-git-send-email-oza.oza@broadcom.com
State Not Applicable
Headers show

Commit Message

Oza Pawandeep May 16, 2017, 5:22 a.m. UTC
current device framework and OF framework integration assumes
dma-ranges in a way where memory-mapped devices define their
dma-ranges. (child-bus-address, parent-bus-address, length).

of_dma_configure is specifically written to take care of memory
mapped devices. but no implementation exists for pci to take
care of pcie based memory ranges.

for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
world dma-ranges.
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;

this patch serves following:

1) exposes interface to the pci host driver for their
inbound memory ranges

2) provide an interface to callers such as of_dma_get_ranges.
so then the returned size get best possible (largest) dma_mask.
because PCI RC drivers do not call APIs such as
dma_set_coherent_mask() and hence rather it shows its addressing
capabilities based on dma-ranges.
for e.g.
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
we should get dev->coherent_dma_mask=0x7fffffffff.

3) this patch handles multiple inbound windows and dma-ranges.
it is left to the caller, how it wants to use them.
the new function returns the resources in a standard and unform way

4) this way the callers of for e.g. of_dma_get_ranges
does not need to change.

Signed-off-by: Oza Pawandeep <oza.oza@broadcom.com>

Comments

Bjorn Helgaas May 17, 2017, 5:10 p.m. UTC | #1
On Tue, May 16, 2017 at 10:52:05AM +0530, Oza Pawandeep wrote:
> current device framework and OF framework integration assumes

s/current/The current/

> dma-ranges in a way where memory-mapped devices define their
> dma-ranges. (child-bus-address, parent-bus-address, length).
> 
> of_dma_configure is specifically written to take care of memory
> mapped devices. but no implementation exists for pci to take
> care of pcie based memory ranges.

s/pci/PCI/	(also other occurrences below)
s/pcie/PCIe/

I don't see how PCIe is relevant here.  The bridge might support PCIe,
but I don't think anything here is actually specific to PCIe.  If
that's the case, I think it's confusing to mention PCIe.

> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
> world dma-ranges.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
> 
> this patch serves following:
> 
> 1) exposes interface to the pci host driver for their
> inbound memory ranges
> 
> 2) provide an interface to callers such as of_dma_get_ranges.
> so then the returned size get best possible (largest) dma_mask.
> because PCI RC drivers do not call APIs such as
> dma_set_coherent_mask() and hence rather it shows its addressing
> capabilities based on dma-ranges.
> for e.g.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
> we should get dev->coherent_dma_mask=0x7fffffffff.
> 
> 3) this patch handles multiple inbound windows and dma-ranges.
> it is left to the caller, how it wants to use them.
> the new function returns the resources in a standard and unform way
> 
> 4) this way the callers of for e.g. of_dma_get_ranges
> does not need to change.

Please start sentences with a capital letter.
Arnd Bergmann May 17, 2017, 7:13 p.m. UTC | #2
On Tue, May 16, 2017 at 7:22 AM, Oza Pawandeep <oza.oza@broadcom.com> wrote:
> current device framework and OF framework integration assumes
> dma-ranges in a way where memory-mapped devices define their
> dma-ranges. (child-bus-address, parent-bus-address, length).
>
> of_dma_configure is specifically written to take care of memory
> mapped devices. but no implementation exists for pci to take
> care of pcie based memory ranges.

Hi Oza,

I'm trying to make sense of this, but am still rather puzzled. I have
no idea what the distinction between memory-mapped devices and
pcie based devices is in your description, as PCIe is usually memory
mapped, and Linux doesn't actually support other kinds of PCIe
devices on most architectures.

> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
> world dma-ranges.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>
> this patch serves following:
>
> 1) exposes interface to the pci host driver for their
> inbound memory ranges
>
> 2) provide an interface to callers such as of_dma_get_ranges.
> so then the returned size get best possible (largest) dma_mask.
> because PCI RC drivers do not call APIs such as
> dma_set_coherent_mask() and hence rather it shows its addressing
> capabilities based on dma-ranges.
>
> for e.g.
> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
> we should get dev->coherent_dma_mask=0x7fffffffff.

do you mean the coherent_dma_mask of the PCI host bridge
or an attached device here?

If you require PCI devices to come up with an initial
coherent_dma_mask other than 0xffffffffff, there are other
problems involved. In particular, you will need to use
swiotlb, which is not supported on arm32 at the moment,
and the dma_set_mask()/dma_set_coherent_mask()
functions need to be modified.

> +       while (1) {
> +               dma_ranges = of_get_property(node, "dma-ranges", &rlen);
> +
> +               /* Ignore empty ranges, they imply no translation required. */
> +               if (dma_ranges && rlen > 0)
> +                       break;
> +
> +               /* no dma-ranges, they imply no translation required. */
> +               if (!dma_ranges)
> +                       break;

A missing parent dma-ranges property here should really indicate that there
is no valid translation. If we have existing cases where this happens
in DT files, we may treat it as allowing only 32-bit DMA (as we already
do for having no dma-ranges at all), but treating it the same way
as an empty dma-ranges property sounds wrong.

     Arnd
Oza Pawandeep May 19, 2017, 1:37 a.m. UTC | #3
On Wed, May 17, 2017 at 10:40 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Tue, May 16, 2017 at 10:52:05AM +0530, Oza Pawandeep wrote:
>> current device framework and OF framework integration assumes
>
> s/current/The current/
>
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> s/pci/PCI/      (also other occurrences below)
> s/pcie/PCIe/
>
> I don't see how PCIe is relevant here.  The bridge might support PCIe,
> but I don't think anything here is actually specific to PCIe.  If
> that's the case, I think it's confusing to mention PCIe.

It attempts to fix of_dma_get_range for PCI master,
because it currently it is returning *size as 0 (to the caller of_dma_configure)
resulting into largest dma_mask which would be 64-bit mask on armv8.
which usually has worked so far, because any other SOC's PCI RC, do not
have the limitations as of Broadcom iproc based PCI RC.
our RC will drop 64bit IOVAs, because it is not capable of addressing
entire 64bit range.

infact there are 2 real problems, please allow me to explain.
please refer to my next mail in reply to Arnd Bergmann,

>
>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>> for e.g.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7fffffffff.
>>
>> 3) this patch handles multiple inbound windows and dma-ranges.
>> it is left to the caller, how it wants to use them.
>> the new function returns the resources in a standard and unform way
>>
>> 4) this way the callers of for e.g. of_dma_get_ranges
>> does not need to change.
>
> Please start sentences with a capital letter.

will take care of your comments.
Thanks,
Oza.
Oza Pawandeep May 19, 2017, 2:58 a.m. UTC | #4
On Thu, May 18, 2017 at 12:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, May 16, 2017 at 7:22 AM, Oza Pawandeep <oza.oza@broadcom.com> wrote:
>> current device framework and OF framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> Hi Oza,
>
> I'm trying to make sense of this, but am still rather puzzled. I have
> no idea what the distinction between memory-mapped devices and
> pcie based devices is in your description, as PCIe is usually memory
> mapped, and Linux doesn't actually support other kinds of PCIe
> devices on most architectures.
>
there are 2 problems which I am trying to address here.

problem-1:
let me explain our PCI RC's limitations first.

IOVA allocaiton honours device's coherent_dma_mask/dma_mask.
in PCI case, current code honours DMA mask set by EP,
there is no concept of PCI host bridge dma-mask, which should be there
and could truely reflect the limitaiton of PCI host bridge.

having said that we have
dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
which means we can only address 512GB.

now because of broken of_dma_get_range we end up getting 64bit dma_mask.
please check the code:of_dma_configure()
if (ret < 0) {
dma_addr = offset = 0;
size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);

now in this process I figred out problems in of_dma_get_range: hence the fix
1) return of wrong size as 0 becasue of whole parsing problem.
2) not handling absence of dma-ranges which is valid for PCI master.
3) not handling multipe inbound windows.
4) in order to get largest possible dma_mask. this patch also returns
the largest possible size based on dma-ranges,

please have a look at
[PATCH v6 3/3] PCI/of fix of_dma_get_range; get PCI specific dma-ranges
I just made is bus specific leaving origional of_dma_get_range unmodified
and defining new PCI handling of_bus_pci_get_dma_ranges

also when I say memory-mapped and PCI device, I only mean to say with respect
to the dma-ranges format. (of coure PCI is memory mapped as well).
probbaly my commit description is misleading, sorry about that.

so Problem1: is just bug fix, Nothing else


Problem2: [PATCH v6 2/3] iommu/pci: reserve IOVA for PCI masters

we have memory banks

<0x00000000 0x80000000 0x0 0x80000000>, /* 2G @ 2G */
<0x00000008 0x80000000 0x3 0x80000000>, /* 14G @ 34G */
<0x00000090 0x00000000 0x4 0x00000000>, /* 16G @ 576G */
<0x000000a0 0x00000000 0x4 0x00000000>; /* 16G @ 640G */

when I run SPDK (User sapce) which internally uses vfio to access PCI
endpoint directly.
vfio uses huge-pages which could coming from 640G/0x000000a0.
and the way vfio maps the hugepage to user space and generates IOVA is different
from the way kernel allocate IOVA.

vfio just maps one to one IOVA to Physical address.
it just calls directly remap_pfn_range.

so the way kernel allocates IOVA (where it honours device dma_mask)
and the way userspace gets IOVA is totally different.

so dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
will not work.
instead we have to go for scatterred dma-ranges leaving holes.

having said that we have to reserve IOVA allocations for inbound memory.
I am in a process of addressing Robin Murphy's comment on that and rebasing
my patch on rc12.

this problem statement is more important to us.
because it makes both kernel and use space IOVA allocations work when
IOMMU is enabled.

probably thing might be confusing because I am clubbing my patches to
address both
the problems. going forward I should just try to first send out patch
for problem2 alone (not sure)
because my next patch-set would bring some changes in pci/probe.c as well.

>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>>
>> for e.g.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7fffffffff.
>
> do you mean the coherent_dma_mask of the PCI host bridge
> or an attached device here?
>
> If you require PCI devices to come up with an initial
> coherent_dma_mask other than 0xffffffffff, there are other
> problems involved. In particular, you will need to use
> swiotlb, which is not supported on arm32 at the moment,
> and the dma_set_mask()/dma_set_coherent_mask()
> functions need to be modified.

even without this patch also it comes up with coherent_dma_mask of 64bits.
since it considers dma_mask set by Endpoint.

please check [RFC PATCH 2/3] iommu/dma: account pci host bridge
dma_mask for IOVA allocation
this patch was in-fact inspired by Robin Murphy's earlier discussions.

>
>> +       while (1) {
>> +               dma_ranges = of_get_property(node, "dma-ranges", &rlen);
>> +
>> +               /* Ignore empty ranges, they imply no translation required. */
>> +               if (dma_ranges && rlen > 0)
>> +                       break;
>> +
>> +               /* no dma-ranges, they imply no translation required. */
>> +               if (!dma_ranges)
>> +                       break;
>
> A missing parent dma-ranges property here should really indicate that there
> is no valid translation. If we have existing cases where this happens
> in DT files, we may treat it as allowing only 32-bit DMA (as we already
> do for having no dma-ranges at all), but treating it the same way
> as an empty dma-ranges property sounds wrong.

not sure if I understood you:
but we have dma-ranges property optional for one of our SOC, in the sense...
PCI RC will allow all the incoming transactions because RC does not
translate anything.
what mask should it generate if dma-ranges property is not present ?
how do you suggest to handle ?

>
>      Arnd
Oza Pawandeep May 22, 2017, 4:42 p.m. UTC | #5
On Thu, May 18, 2017 at 12:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, May 16, 2017 at 7:22 AM, Oza Pawandeep <oza.oza@broadcom.com> wrote:
>> current device framework and OF framework integration assumes
>> dma-ranges in a way where memory-mapped devices define their
>> dma-ranges. (child-bus-address, parent-bus-address, length).
>>
>> of_dma_configure is specifically written to take care of memory
>> mapped devices. but no implementation exists for pci to take
>> care of pcie based memory ranges.
>
> Hi Oza,
>
> I'm trying to make sense of this, but am still rather puzzled. I have
> no idea what the distinction between memory-mapped devices and
> pcie based devices is in your description, as PCIe is usually memory
> mapped, and Linux doesn't actually support other kinds of PCIe
> devices on most architectures.
>
>> for e.g. iproc based SOCs and other SOCs(suc as rcar) have PCI
>> world dma-ranges.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>>
>> this patch serves following:
>>
>> 1) exposes interface to the pci host driver for their
>> inbound memory ranges
>>
>> 2) provide an interface to callers such as of_dma_get_ranges.
>> so then the returned size get best possible (largest) dma_mask.
>> because PCI RC drivers do not call APIs such as
>> dma_set_coherent_mask() and hence rather it shows its addressing
>> capabilities based on dma-ranges.
>>
>> for e.g.
>> dma-ranges = <0x43000000 0x00 0x00 0x00 0x00 0x80 0x00>;
>> we should get dev->coherent_dma_mask=0x7fffffffff.
>
> do you mean the coherent_dma_mask of the PCI host bridge
> or an attached device here?
>
> If you require PCI devices to come up with an initial
> coherent_dma_mask other than 0xffffffffff, there are other
> problems involved. In particular, you will need to use
> swiotlb, which is not supported on arm32 at the moment,
> and the dma_set_mask()/dma_set_coherent_mask()
> functions need to be modified.
>
>> +       while (1) {
>> +               dma_ranges = of_get_property(node, "dma-ranges", &rlen);
>> +
>> +               /* Ignore empty ranges, they imply no translation required. */
>> +               if (dma_ranges && rlen > 0)
>> +                       break;
>> +
>> +               /* no dma-ranges, they imply no translation required. */
>> +               if (!dma_ranges)
>> +                       break;
>
> A missing parent dma-ranges property here should really indicate that there
> is no valid translation. If we have existing cases where this happens
> in DT files, we may treat it as allowing only 32-bit DMA (as we already
> do for having no dma-ranges at all), but treating it the same way
> as an empty dma-ranges property sounds wrong.
>
>      Arnd


Hi Arnd and Bjorn,

Can you please have a look at PATCH v7 ?
It addresses problem2 alone.

Regards,
Oza
diff mbox

Patch

diff --git a/drivers/of/of_pci.c b/drivers/of/of_pci.c
index 0ee42c3..4005ed3 100644
--- a/drivers/of/of_pci.c
+++ b/drivers/of/of_pci.c
@@ -283,6 +283,101 @@  int of_pci_get_host_bridge_resources(struct device_node *dev,
 	return err;
 }
 EXPORT_SYMBOL_GPL(of_pci_get_host_bridge_resources);
+
+/**
+ * of_pci_get_dma_ranges - Parse PCI host bridge inbound resources from DT
+ * @np: device node of the host bridge having the dma-ranges property
+ * @resources: list where the range of resources will be added after DT parsing
+ *
+ * It is the caller's job to free the @resources list.
+ *
+ * This function will parse the "dma-ranges" property of a
+ * PCI host bridge device node and setup the resource mapping based
+ * on its content.
+ *
+ * It returns zero if the range parsing has been successful or a standard error
+ * value if it failed.
+ */
+
+int of_pci_get_dma_ranges(struct device_node *dn, struct list_head *resources)
+{
+	struct device_node *node = of_node_get(dn);
+	int rlen;
+	int pna = of_n_addr_cells(node);
+	const int na = 3, ns = 2;
+	int np = pna + na + ns;
+	int ret = 0;
+	struct resource *res;
+	const u32 *dma_ranges;
+	struct of_pci_range range;
+
+	if (!node)
+		return -EINVAL;
+
+	while (1) {
+		dma_ranges = of_get_property(node, "dma-ranges", &rlen);
+
+		/* Ignore empty ranges, they imply no translation required. */
+		if (dma_ranges && rlen > 0)
+			break;
+
+		/* no dma-ranges, they imply no translation required. */
+		if (!dma_ranges)
+			break;
+
+		node = of_get_next_parent(node);
+
+		if (!node)
+			break;
+	}
+
+	if (!dma_ranges) {
+		pr_debug("pcie device has no dma-ranges defined for node(%s)\n",
+			  dn->full_name);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	while ((rlen -= np * 4) >= 0) {
+		range.pci_space = dma_ranges[0];
+		range.pci_addr = of_read_number(dma_ranges + 1, ns);
+		range.cpu_addr = of_translate_dma_address(node,
+							dma_ranges + na);
+		range.size = of_read_number(dma_ranges + pna + na, ns);
+
+		dma_ranges += np;
+
+		/*
+		 * If we failed translation or got a zero-sized region
+		 * then skip this range.
+		 */
+		if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
+			continue;
+
+		res = kzalloc(sizeof(struct resource), GFP_KERNEL);
+		if (!res) {
+			ret = -ENOMEM;
+			goto parse_failed;
+		}
+
+		ret = of_pci_range_to_resource(&range, dn, res);
+		if (ret) {
+			kfree(res);
+			continue;
+		}
+
+		pci_add_resource_offset(resources, res,
+					res->start - range.pci_addr);
+	}
+	return ret;
+
+parse_failed:
+	pci_free_resource_list(resources);
+out:
+	of_node_put(node);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(of_pci_get_dma_ranges);
 #endif /* CONFIG_OF_ADDRESS */
 
 #ifdef CONFIG_PCI_MSI
diff --git a/include/linux/of_pci.h b/include/linux/of_pci.h
index 0e0974e..617b90d 100644
--- a/include/linux/of_pci.h
+++ b/include/linux/of_pci.h
@@ -76,6 +76,7 @@  static inline void of_pci_check_probe_only(void) { }
 int of_pci_get_host_bridge_resources(struct device_node *dev,
 			unsigned char busno, unsigned char bus_max,
 			struct list_head *resources, resource_size_t *io_base);
+int of_pci_get_dma_ranges(struct device_node *np, struct list_head *resources);
 #else
 static inline int of_pci_get_host_bridge_resources(struct device_node *dev,
 			unsigned char busno, unsigned char bus_max,
@@ -83,6 +84,12 @@  static inline int of_pci_get_host_bridge_resources(struct device_node *dev,
 {
 	return -EINVAL;
 }
+
+static inline int of_pci_get_dma_ranges(struct device_node *np,
+					struct list_head *resources)
+{
+	return -EINVAL;
+}
 #endif
 
 #if defined(CONFIG_OF) && defined(CONFIG_PCI_MSI)