Patchwork powerpc/mm: add ZONE_NORMAL zone for 64 bit kernel

login
register
mail settings
Submitter shaohui xie
Date July 20, 2012, 12:21 p.m.
Message ID <1342786906-12634-1-git-send-email-Shaohui.Xie@freescale.com>
Download mbox | patch
Permalink /patch/172252/
State Rejected
Headers show

Comments

shaohui xie - July 20, 2012, 12:21 p.m.
PowerPC platform only supports ZONE_DMA zone for 64bit kernel, so all the
memory will be put into this zone. If the memory size is greater than
the device's DMA capability and device uses dma_alloc_coherent to allocate
memory, it will get an address which is over the device's DMA addressing,
the device will fail.

So we split the memory to two zones by adding a zone ZONE_NORMAL, since
we already allocate PCICSRBAR/PEXCSRBAR right below the 4G boundary (if the
lowest PCI address is above 4G), so we constrain the DMA zone ZONE_DMA
to 2GB, also, we clear the flag __GFP_DMA and set it only if the device's
dma_mask < total memory size. By doing this, devices which cannot DMA all
the memory will be limited to ZONE_DMA, but devices which can DMA all the
memory will not be affected by this limitation.

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: Mingkai Hu <Mingkai.hu@freescale.com>
Signed-off-by: Chen Yuanquan <B41889@freescale.com>
---
 arch/powerpc/kernel/dma.c |   13 ++++++++++++-
 arch/powerpc/mm/mem.c     |    4 +++-
 2 files changed, 15 insertions(+), 2 deletions(-)
Benjamin Herrenschmidt - July 23, 2012, 6:06 a.m.
On Fri, 2012-07-20 at 20:21 +0800, Shaohui Xie wrote:
> PowerPC platform only supports ZONE_DMA zone for 64bit kernel, so all the
> memory will be put into this zone. If the memory size is greater than
> the device's DMA capability and device uses dma_alloc_coherent to allocate
> memory, it will get an address which is over the device's DMA addressing,
> the device will fail.
> 
> So we split the memory to two zones by adding a zone ZONE_NORMAL, since
> we already allocate PCICSRBAR/PEXCSRBAR right below the 4G boundary (if the
> lowest PCI address is above 4G), so we constrain the DMA zone ZONE_DMA
> to 2GB, also, we clear the flag __GFP_DMA and set it only if the device's
> dma_mask < total memory size. By doing this, devices which cannot DMA all
> the memory will be limited to ZONE_DMA, but devices which can DMA all the
> memory will not be affected by this limitation.

This is wrong. Don't you have an iommu do deal with those devices
anyway ? What about swiotlb ?

If you *really* need to honor 32 (or 31 even) bit DMAs, what you -may-
want to do is create a ZONE_DMA32 like other architectures, do not
hijack the historical ZONE_DMA.

But even then, I'm dubious this is really needed.

Cheers,
Ben.
Scott Wood - July 23, 2012, 4:17 p.m.
On 07/23/2012 01:06 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2012-07-20 at 20:21 +0800, Shaohui Xie wrote:
>> PowerPC platform only supports ZONE_DMA zone for 64bit kernel, so all the
>> memory will be put into this zone. If the memory size is greater than
>> the device's DMA capability and device uses dma_alloc_coherent to allocate
>> memory, it will get an address which is over the device's DMA addressing,
>> the device will fail.
>>
>> So we split the memory to two zones by adding a zone ZONE_NORMAL, since
>> we already allocate PCICSRBAR/PEXCSRBAR right below the 4G boundary (if the
>> lowest PCI address is above 4G), so we constrain the DMA zone ZONE_DMA
>> to 2GB, also, we clear the flag __GFP_DMA and set it only if the device's
>> dma_mask < total memory size. By doing this, devices which cannot DMA all
>> the memory will be limited to ZONE_DMA, but devices which can DMA all the
>> memory will not be affected by this limitation.
> 
> This is wrong.

How so?

> Don't you have an iommu do deal with those devices anyway ?

Yes, but we don't yet have DMA API support for it, it would lower
performance because we'd have to use a lot of subwindows which are
poorly cached (and even then we wouldn't be able to map more than 256
pages at once on a given device), and the IOMMU may not be available at
all if we're being virtualized.

> What about swiotlb ?

That doesn't help with alloc_coherent().

> If you *really* need to honor 32 (or 31 even) bit DMAs,

31-bit is to accommodate PCI, which has PEXCSRBAR that must live under 4
GiB and can't be disabled.

> what you -may- want to do is create a ZONE_DMA32 like other architectures, do not
> hijack the historical ZONE_DMA.

Could you point me to somewhere that clearly defines what ZONE_DMA is to
be used for, such that this counts as hijacking (but using ZONE_DMA32 to
mean 31-bit wouldn't)?  The only arches I see using ZONE_DMA32 (x86 and
mips) also have a separate, more restrictive ZONE_DMA.  PowerPC doesn't.
 It uses ZONE_DMA to point to all of memory (except highmem on 32-bit)
-- how is that not hijacking, if this is?  We can't have ZONE_DMA be
less restrictive than ZONE_DMA32, because the fallback rules are
hardcoded the other way around in generic code.

The exact threshold for ZONE_DMA could be made platform-configurable.

> But even then, I'm dubious this is really needed.

We'd like our drivers to stop crashing with more than 4GiB of RAM on 64-bit.

-Scott
Benjamin Herrenschmidt - July 23, 2012, 10:20 p.m.
On Mon, 2012-07-23 at 11:17 -0500, Scott Wood wrote:
> > This is wrong.
> 
> How so?
> 
> > Don't you have an iommu do deal with those devices anyway ?
> 
> Yes, but we don't yet have DMA API support for it, it would lower
> performance because we'd have to use a lot of subwindows which are
> poorly cached (and even then we wouldn't be able to map more than 256
> pages at once on a given device), and the IOMMU may not be available at
> all if we're being virtualized.

Ugh ? You mean some designers need to be fired urgently and wasted
everybody's time implementing an unusable iommu ? Nice one ...

> > What about swiotlb ?
> 
> That doesn't help with alloc_coherent().

Somewhat... you can use the pool, but it sucks.

> > If you *really* need to honor 32 (or 31 even) bit DMAs,
> 
> 31-bit is to accommodate PCI, which has PEXCSRBAR that must live under 4
> GiB and can't be disabled.
> 
> > what you -may- want to do is create a ZONE_DMA32 like other architectures, do not
> > hijack the historical ZONE_DMA.
> 
> Could you point me to somewhere that clearly defines what ZONE_DMA is to
> be used for, such that this counts as hijacking (but using ZONE_DMA32 to
> mean 31-bit wouldn't)?

Habit and history. ZONE_DMA used to be about ISA DMA, doesn't apply to
us, and in general ZONE_NORMAL alias to it iirc. It's old stuff I
haven't looked for a long time.

However, my understanding is that what you are trying to solve is
exactly what ZONE_DMA32 was created for.

>   The only arches I see using ZONE_DMA32 (x86 and
> mips) also have a separate, more restrictive ZONE_DMA.

And ? Who cares ? Drivers who know about a 32-bit limitations use
GFP_DMA32, that's what is expected, don't mess around with ZONE_DMA.

>   PowerPC doesn't.
>  It uses ZONE_DMA to point to all of memory (except highmem on 32-bit)
> -- how is that not hijacking, if this is?  We can't have ZONE_DMA be
> less restrictive than ZONE_DMA32, because the fallback rules are
> hardcoded the other way around in generic code.
> 
> The exact threshold for ZONE_DMA could be made platform-configurable.
> 
> > But even then, I'm dubious this is really needed.
> 
> We'd like our drivers to stop crashing with more than 4GiB of RAM on 64-bit.

Fix your HW :-)

Cheers,
Ben.
Tabi Timur-B04825 - July 23, 2012, 11:08 p.m.
On Mon, Jul 23, 2012 at 5:20 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:

> And ? Who cares ? Drivers who know about a 32-bit limitations use
> GFP_DMA32, that's what is expected, don't mess around with ZONE_DMA.

I thought drivers are supposed to set a dma_mask, and
dma_alloc_coherent() is supposed to use that to figure out how to
honor that mask.
Benjamin Herrenschmidt - July 23, 2012, 11:12 p.m.
On Mon, 2012-07-23 at 23:08 +0000, Tabi Timur-B04825 wrote:
> 
> > And ? Who cares ? Drivers who know about a 32-bit limitations use
> > GFP_DMA32, that's what is expected, don't mess around with ZONE_DMA.
> 
> I thought drivers are supposed to set a dma_mask, and
> dma_alloc_coherent() is supposed to use that to figure out how to
> honor that mask.

Sure, that's the right way to go, I meant bits of pieces of the
infrastructure in between. Why diverge from other archs gratuituously
here ?

Cheers,
Ben.
Timur Tabi - July 23, 2012, 11:15 p.m.
Benjamin Herrenschmidt wrote:
> Sure, that's the right way to go, I meant bits of pieces of the
> infrastructure in between. Why diverge from other archs gratuituously
> here ?

Ok, I'm confused.  Are you suggesting that drivers do this:

u64 fsl_dma_dmamask = DMA_BIT_MASK(36);
dev->dma_mask = &fsl_dma_dmamask;
v = dma_alloc_coherent(dev, ..., GFP_DMA32);

That is, set the DMA mask *and* set GFP_DMA32?  That seems redundant.

I don't understand why a driver would set GFP_DMA32 if it has already set
the mask.
Benjamin Herrenschmidt - July 23, 2012, 11:29 p.m.
On Mon, 2012-07-23 at 18:15 -0500, Timur Tabi wrote:
> Benjamin Herrenschmidt wrote:
> > Sure, that's the right way to go, I meant bits of pieces of the
> > infrastructure in between. Why diverge from other archs gratuituously
> > here ?
> 
> Ok, I'm confused.  Are you suggesting that drivers do this:
> 
> u64 fsl_dma_dmamask = DMA_BIT_MASK(36);
> dev->dma_mask = &fsl_dma_dmamask;
> v = dma_alloc_coherent(dev, ..., GFP_DMA32);
> 
> That is, set the DMA mask *and* set GFP_DMA32?  That seems redundant.

No, but dma_alloc_coherent would under the hood.

> I don't understand why a driver would set GFP_DMA32 if it has already set
> the mask.

The layers in between, not the well behaved drivers. Again, we have
ZONE_DMA32 specifically for the purpose, why use something else ?

In any case, make the whole thing at the very least a config option, I
don't want sane HW to have to deal with split zones.

Cheers,
Ben.
Benjamin Herrenschmidt - July 23, 2012, 11:30 p.m.
On Tue, 2012-07-24 at 09:29 +1000, Benjamin Herrenschmidt wrote:

> The layers in between, not the well behaved drivers. Again, we have
> ZONE_DMA32 specifically for the purpose, why use something else ?
> 
> In any case, make the whole thing at the very least a config option, I
> don't want sane HW to have to deal with split zones.

Or if possible a flag set by machine probe()

Cheers,
Ben.
Timur Tabi - July 23, 2012, 11:36 p.m.
Benjamin Herrenschmidt wrote:

> No, but dma_alloc_coherent would under the hood.

Which is what Shaohui's patch does.  Well, it does it for GFP_DMA instead
of GFP_DMA32, but still.

When you said, "Drivers who know about a 32-bit limitations use
GFP_DMA32", I thought you meant that drivers should *set* GFP_DMA32.

>> I don't understand why a driver would set GFP_DMA32 if it has already set
>> the mask.
> 
> The layers in between, not the well behaved drivers. Again, we have
> ZONE_DMA32 specifically for the purpose, why use something else ?
> 
> In any case, make the whole thing at the very least a config option, I
> don't want sane HW to have to deal with split zones.

The DMA zone only kicks in if the DMA mask is set to a size smaller that
available physical memory.  Sane HW should set the DMA mask to
DMA_BIT_MASK(36).  And we have plenty of sane HW on our SOCs, but not
every device is like that.

The whole point behind this patch is that some drivers are setting a DMA
mask of 32, but still getting memory above 4GB.
Scott Wood - July 24, 2012, 1:37 a.m.
On 07/23/2012 06:30 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2012-07-24 at 09:29 +1000, Benjamin Herrenschmidt wrote:
> 
>> The layers in between, not the well behaved drivers. Again, we have
>> ZONE_DMA32 specifically for the purpose, why use something else ?
>>
>> In any case, make the whole thing at the very least a config option, I
>> don't want sane HW to have to deal with split zones.
> 
> Or if possible a flag set by machine probe()

I suggested making the threshold configurable by platform code.  Sane
hardware would leave it at its default of infinity (~0ULL), and you
wouldn't have a split zone.  Our hardware would set it at 31-bit.

It looks like this is already sort-of done for ISA_DMA_THRESHOLD for
32-bit non-coherent-DMA platforms (amigaone sets a 24-bit threshold),
though I don't see where ZONE_DMA is limited accordingly.

We'd add an additional threshold for ZONE_DMA32, make it actually affect
the zone definition, and make the standard alloc_coherent() honor it.

-Scott
Scott Wood - July 24, 2012, 1:49 a.m.
On 07/23/2012 05:20 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-07-23 at 11:17 -0500, Scott Wood wrote:
>>> This is wrong.
>>
>> How so?
>>
>>> Don't you have an iommu do deal with those devices anyway ?
>>
>> Yes, but we don't yet have DMA API support for it, it would lower
>> performance because we'd have to use a lot of subwindows which are
>> poorly cached (and even then we wouldn't be able to map more than 256
>> pages at once on a given device), and the IOMMU may not be available at
>> all if we're being virtualized.
> 
> Ugh ? You mean some designers need to be fired urgently and wasted
> everybody's time implementing an unusable iommu ? Nice one ...

Yeah, that's old news.

It's somewhat usable for specific purposes, using very large pages, but
not for arbitrary use.

>>> But even then, I'm dubious this is really needed.
>>
>> We'd like our drivers to stop crashing with more than 4GiB of RAM on 64-bit.
> 
> Fix your HW :-)

I wish...  Some hardware people may solicit input from us every now and
again, but ultimately they do what they want.

Returning addresses in excess of a device's declared DMA mask is
something that needs fixing too, though.

-Scott
Benjamin Herrenschmidt - July 24, 2012, 3:08 a.m.
On Mon, 2012-07-23 at 18:36 -0500, Timur Tabi wrote:
> The DMA zone only kicks in if the DMA mask is set to a size smaller
> that
> available physical memory.  Sane HW should set the DMA mask to
> DMA_BIT_MASK(36).  And we have plenty of sane HW on our SOCs, but not
> every device is like that.
> 
> The whole point behind this patch is that some drivers are setting a
> DMA
> mask of 32, but still getting memory above 4GB.

Sure but I don't want to create the zones in the first place (and thus
introduce the added pressure on the memory management) on machines that
don't need it.

Cheers,
Ben.
Tabi Timur-B04825 - July 24, 2012, 3:52 a.m.
Benjamin Herrenschmidt wrote:
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.

Ah yes, I forgot about memory pressure.
Zang Roy-R61911 - July 24, 2012, 3:59 a.m.
> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-bounces+tie-
> fei.zang=freescale.com@lists.ozlabs.org] On Behalf Of Benjamin
> Herrenschmidt
> Sent: Tuesday, July 24, 2012 11:09 AM
> To: Tabi Timur-B04825
> Cc: Wood Scott-B07421; Hu Mingkai-B21284; linuxppc-dev@lists.ozlabs.org;
> Xie Shaohui-B21989; Chen Yuanquan-B41889
> Subject: Re: [PATCH] powerpc/mm: add ZONE_NORMAL zone for 64 bit kernel
> 
> On Mon, 2012-07-23 at 18:36 -0500, Timur Tabi wrote:
> > The DMA zone only kicks in if the DMA mask is set to a size smaller
> > that
> > available physical memory.  Sane HW should set the DMA mask to
> > DMA_BIT_MASK(36).  And we have plenty of sane HW on our SOCs, but not
> > every device is like that.
> >
> > The whole point behind this patch is that some drivers are setting a
> > DMA
> > mask of 32, but still getting memory above 4GB.
> 
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.
OK, then how do you think Scott's suggestion to add a configurable threshold?
Roy
Tabi Timur-B04825 - July 24, 2012, 4:04 a.m.
Benjamin Herrenschmidt wrote:
> Sure but I don't want to create the zones in the first place (and thus
> introduce the added pressure on the memory management) on machines that
> don't need it.

One thing that does confuse me -- by default, we don't create a 
ZONE_NORMAL.  We only create a ZONE_DMA.  Why is that?  Shouldn't it be 
the other way around?
Benjamin Herrenschmidt - July 24, 2012, 4:45 a.m.
On Tue, 2012-07-24 at 04:04 +0000, Tabi Timur-B04825 wrote:
> Benjamin Herrenschmidt wrote:
> > Sure but I don't want to create the zones in the first place (and thus
> > introduce the added pressure on the memory management) on machines that
> > don't need it.
> 
> One thing that does confuse me -- by default, we don't create a 
> ZONE_NORMAL.  We only create a ZONE_DMA.  Why is that?  Shouldn't it be 
> the other way around?

Because ZONE_NORMAL allocations can be serviced from the ZONE_DMA while
the other way isn't possible.

Especially in the old days, there were quite a few cases of drivers
and/or subsystems who were a bit heavy handed at using ZONE_DMA, so not
having one would essentially make them not work at all.

Cheers,
Ben.
Bharat Bhushan - July 24, 2012, 8:01 a.m.
> -----Original Message-----
> From: Linuxppc-dev [mailto:linuxppc-dev-
> bounces+bharat.bhushan=freescale.com@lists.ozlabs.org] On Behalf Of Benjamin
> Herrenschmidt
> Sent: Tuesday, July 24, 2012 10:16 AM
> To: Tabi Timur-B04825
> Cc: Wood Scott-B07421; Hu Mingkai-B21284; linuxppc-dev@lists.ozlabs.org; Xie
> Shaohui-B21989; Chen Yuanquan-B41889
> Subject: Re: [PATCH] powerpc/mm: add ZONE_NORMAL zone for 64 bit kernel
> 
> On Tue, 2012-07-24 at 04:04 +0000, Tabi Timur-B04825 wrote:
> > Benjamin Herrenschmidt wrote:
> > > Sure but I don't want to create the zones in the first place (and
> > > thus introduce the added pressure on the memory management) on
> > > machines that don't need it.
> >
> > One thing that does confuse me -- by default, we don't create a
> > ZONE_NORMAL.  We only create a ZONE_DMA.  Why is that?  Shouldn't it
> > be the other way around?
> 
> Because ZONE_NORMAL allocations can be serviced from the ZONE_DMA while the
> other way isn't possible.

Say, if we have defined only one zone (ZONE_DMA) to which we give all memory ( > 4G).
Device set the DMA_MASK to 4G or less.

dma_alloc_coherent() will set GFP_DMA flag, But that is of no use, because the memory allocator have only one zone which have all memory (which assumes all dma-able). And can return memory at address at > 4G. which will crash !!

I think we have to have at least one zone which gives memory to be dma-able for all devices (memory limit should be set by platform, because different  platform have different devices with different limits.). And another ( 1 or more) will cover rest of memory.

Thanks
-Bharat

> 
> Especially in the old days, there were quite a few cases of drivers and/or
> subsystems who were a bit heavy handed at using ZONE_DMA, so not having one
> would essentially make them not work at all.
> 
> Cheers,
> Ben.
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

Patch

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index b1ec983..8029295 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -30,6 +30,7 @@  void *dma_direct_alloc_coherent(struct device *dev, size_t size,
 				struct dma_attrs *attrs)
 {
 	void *ret;
+	phys_addr_t top_ram_pfn = memblock_end_of_DRAM();
 #ifdef CONFIG_NOT_COHERENT_CACHE
 	ret = __dma_alloc_coherent(dev, size, dma_handle, flag);
 	if (ret == NULL)
@@ -40,8 +41,18 @@  void *dma_direct_alloc_coherent(struct device *dev, size_t size,
 	struct page *page;
 	int node = dev_to_node(dev);
 
+	/*
+	 * check for crappy device which has dma_mask < ZONE_DMA, and
+	 * we are not going to support it, just warn and fail.
+	 */
+	if (*dev->dma_mask < DMA_BIT_MASK(31)) {
+		dev_err(dev, "Unsupported dma_mask 0x%llx\n", *dev->dma_mask);
+		return NULL;
+	}
 	/* ignore region specifiers */
-	flag  &= ~(__GFP_HIGHMEM);
+	flag  &= ~(__GFP_HIGHMEM | __GFP_DMA);
+	if (*dev->dma_mask < top_ram_pfn - 1)
+		flag |= GFP_DMA;
 
 	page = alloc_pages_node(node, flag, get_order(size));
 	if (page == NULL)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index baaafde..a494555 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -281,7 +281,9 @@  void __init paging_init(void)
 	max_zone_pfns[ZONE_DMA] = lowmem_end_addr >> PAGE_SHIFT;
 	max_zone_pfns[ZONE_HIGHMEM] = top_of_ram >> PAGE_SHIFT;
 #else
-	max_zone_pfns[ZONE_DMA] = top_of_ram >> PAGE_SHIFT;
+	max_zone_pfns[ZONE_DMA] = min_t(phys_addr_t, top_of_ram,
+					1ull << 31) >> PAGE_SHIFT;
+	max_zone_pfns[ZONE_NORMAL] = top_of_ram >> PAGE_SHIFT;
 #endif
 	free_area_init_nodes(max_zone_pfns);