diff mbox series

[kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

Message ID 20200908015106.79661-1-aik@ozlabs.ru
State Accepted
Commit 437ef802e0adc9f162a95213a3488e8646e5fc03
Series [kernel] powerpc/dma: Fix dma_map_ops::get_required_mask

Checks

Context Check Description
snowpatch_ozlabs/needsstable success Patch fixes a commit that hasn't been released yet
snowpatch_ozlabs/checkpatch warning total: 1 errors, 0 warnings, 0 checks, 9 lines checked
snowpatch_ozlabs/build-pmac32 warning Upstream build failed, couldn't test patch
snowpatch_ozlabs/build-ppc64e warning Upstream build failed, couldn't test patch
snowpatch_ozlabs/build-ppc64be warning Upstream build failed, couldn't test patch
snowpatch_ozlabs/build-ppc64le warning Upstream build failed, couldn't test patch
snowpatch_ozlabs/apply_patch success Successfully applied on branch powerpc/merge (35f066fda170dde0a31f1447547a5d30b83c3920)

Commit Message

Alexey Kardashevskiy Sept. 8, 2020, 1:51 a.m. UTC
There are 2 problems with it:
1. "<" vs expected "<<"
2. the shift number is an IOMMU page number mask, not an address mask
as the IOMMU page shift is missing.

This did not hit us before f1565c24b596 ("powerpc: use the generic
dma_ops_bypass mode") because we had additional code there to handle
the bypass mask, so this chunk (almost?) never executed. However there
were reports that aacraid does not work with "iommu=nobypass".
After f1565c24b596, aacraid (and probably others which call
dma_get_required_mask() before setting the mask) was unable to
enable 64bit DMA and fell back to using the IOMMU, which was known not
to work; one of the problems is a double free of an IOMMU page.

This fixes DMA for aacraid, both with and without "iommu=nobypass"
in the kernel command line. Verified with "stress-ng -d 4".

Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

The original code came in on Jun 24 2011:
6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and ops")


What is dma_get_required_mask() for anyway? What "requires" what here?

Even though it works for now (due to the huge - >4GB - default DMA window),
I am still not convinced we do not want this chunk here
(this is what f1565c24b596 removed):

if (dev_is_pci(dev)) {
        u64 bypass_mask = dma_direct_get_required_mask(dev);

        if (dma_iommu_bypass_supported(dev, bypass_mask))
                return bypass_mask;
}
---
 arch/powerpc/kernel/dma-iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig Sept. 8, 2020, 5:44 a.m. UTC | #1
On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
> What is dma_get_required_mask() for anyway? What "requires" what here?

Yes, it is a really odd API.  It comes from classic old PCI where
64-bit addressing required an additional bus cycle, and various devices
had different addressing schemes, with the smaller addresses being
more efficient.  So this allows the driver to request the "required"
addressing mode to address all memory.  "preferred" might be a better
name as we'll bounce buffer if it isn't met.  I also don't really see
why a driver would ever want to use it for a modern PCIe device.
Michael Ellerman Sept. 8, 2020, 6:45 a.m. UTC | #2
Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
>
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
>
> This fixes DMA for aacraid, both with and without "iommu=nobypass"
> in the kernel command line. Verified with "stress-ng -d 4".
>
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")

I think it'd be better to point the Fixes tag at 6a5c7be5e484, which
originally introduced the bug, even if we didn't notice it until
f1565c24b596 exposed it (or made it more likely).

cheers

> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>
> The original code came Jun 24 2011:
> 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and ops")
>
>
> What is dma_get_required_mask() for anyway? What "requires" what here?
>
> Even though it works for now (due to huge - >4GB - default DMA window),
> I am still not convinced we do not want this chunk here
> (this is what f1565c24b596 removed):
>
> if (dev_is_pci(dev)) {
>         u64 bypass_mask = dma_direct_get_required_mask(dev);
>
>         if (dma_iommu_bypass_supported(dev, bypass_mask))
>                 return bypass_mask;
> }
> ---
>  arch/powerpc/kernel/dma-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 569fecd7b5b2..9053fc9d20c7 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -120,7 +120,8 @@ u64 dma_iommu_get_required_mask(struct device *dev)
>  	if (!tbl)
>  		return 0;
>  
> -	mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
> +	mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
> +			tbl->it_page_shift - 1);
>  	mask += mask - 1;
>  
>  	return mask;
> -- 
> 2.17.1
Cédric Le Goater Sept. 8, 2020, 11:45 a.m. UTC | #3
On 9/8/20 3:51 AM, Alexey Kardashevskiy wrote:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
> 
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
> 
> This fixes DMA for aacraid, both with and without "iommu=nobypass"
> in the kernel command line. Verified with "stress-ng -d 4".
> 
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>


The boston system looks solid with this patch.

Tested-by: Cédric Le Goater <clg@kaod.org>

Thanks a lot ! 

C. 


> ---
> 
> The original code came Jun 24 2011:
> 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and ops")
> 
> 
> What is dma_get_required_mask() for anyway? What "requires" what here?
> 
> Even though it works for now (due to huge - >4GB - default DMA window),
> I am still not convinced we do not want this chunk here
> (this is what f1565c24b596 removed):
> 
> if (dev_is_pci(dev)) {
>         u64 bypass_mask = dma_direct_get_required_mask(dev);
> 
>         if (dma_iommu_bypass_supported(dev, bypass_mask))
>                 return bypass_mask;
> }
> ---
>  arch/powerpc/kernel/dma-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 569fecd7b5b2..9053fc9d20c7 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -120,7 +120,8 @@ u64 dma_iommu_get_required_mask(struct device *dev)
>  	if (!tbl)
>  		return 0;
>  
> -	mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
> +	mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
> +			tbl->it_page_shift - 1);
>  	mask += mask - 1;
>  
>  	return mask;
>
Alexey Kardashevskiy Sept. 8, 2020, 12:06 p.m. UTC | #4
On 08/09/2020 15:44, Christoph Hellwig wrote:
> On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
>> What is dma_get_required_mask() for anyway? What "requires" what here?
> 
> Yes, it is a really odd API.  It comes from classic old PCI where
> 64-bit addressing required an additional bus cycle, and various devices
> had different addressing schemes, with the smaller addresses being
> more efficient.  So this allows the driver to request the "required"
> addressing mode to address all memory.  "preferred" might be a better
> name as we'll bounce buffer if it isn't met.  I also don't really see
> why a driver would ever want to use it for a modern PCIe device.


a-ha, this makes more sense, thanks. Then I guess we need to revert that
one bit from your f1565c24b596, don't we?
Christoph Hellwig Sept. 8, 2020, 12:19 p.m. UTC | #5
On Tue, Sep 08, 2020 at 10:06:56PM +1000, Alexey Kardashevskiy wrote:
> On 08/09/2020 15:44, Christoph Hellwig wrote:
>> On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
>>> What is dma_get_required_mask() for anyway? What "requires" what here?
>>
>> Yes, it is a really odd API.  It comes from classic old PCI where
>> 64-bit addressing required an additional bus cycle, and various devices
>> had different addressing schemes, with the smaller addresses being
>> more efficient.  So this allows the driver to request the "required"
>> addressing mode to address all memory.  "preferred" might be a better
>> name as we'll bounce buffer if it isn't met.  I also don't really see
>> why a driver would ever want to use it for a modern PCIe device.
>
>
> a-ha, this makes more sense, thanks. Then I guess we need to revert that 
> one bit from yours f1565c24b596, do not we?

Why?  That was the original intent of the API, but now we also use it
internally to check the addressing capabilities.
Alexey Kardashevskiy Sept. 9, 2020, 9:36 a.m. UTC | #6
On 09/09/2020 17:58, Christoph Hellwig wrote:
> On Tue, Sep 08, 2020 at 11:10:03PM +1000, Alexey Kardashevskiy wrote:
>>>> a-ha, this makes more sense, thanks. Then I guess we need to revert that
>>>> one bit from yours f1565c24b596, do not we?
>>>
>>> Why?  That was the original intent of the API, but now we also use it
>>> internally to check the addressing capabilities.
>>
>> The bigger mask the better, no? As it is now, it's limited by the window 
>> size which happens to be bigger than 4GB but smaller than full 64bit (48bit 
>> on my system)
> 
> Yes, the bigger mask is better.  But I don't see why you'd want to
> revert the dma bypass code for that entirely.
> 

I want dma_get_required_mask() to return the bigger mask always.

Now it depends on (in dma_alloc_direct()):
1. dev->dma_ops_bypass: set via pci_set_(coherent_)dma_mask();
2. dev->coherent_dma_mask - the same;
3. dev->bus_dma_limit - usually not set at all.

So until we set the mask, dma_get_required_mask() returns the smaller
mask, so aacraid and the like (which call dma_get_required_mask() before
setting it) will remain prone to breakage.


[forgot to cc: other folks last time, fixed now]
Michael Ellerman Sept. 10, 2020, 12:55 p.m. UTC | #7
On Tue, 8 Sep 2020 11:51:06 +1000, Alexey Kardashevskiy wrote:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
> 
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/dma: Fix dma_map_ops::get_required_mask
      https://git.kernel.org/powerpc/c/437ef802e0adc9f162a95213a3488e8646e5fc03

cheers
Christoph Hellwig Sept. 15, 2020, 6:50 a.m. UTC | #8
On Wed, Sep 09, 2020 at 07:36:04PM +1000, Alexey Kardashevskiy wrote:
> I want dma_get_required_mask() to return the bigger mask always.
> 
> Now it depends on (in dma_alloc_direct()):
> 1. dev->dma_ops_bypass: set via pci_set_(coherent_)dma_mask();
> 2. dev->coherent_dma_mask - the same;
> 3. dev->bus_dma_limit - usually not set at all.
> 
> So until we set the mask, dma_get_required_mask() returns smaller mask.
> So aacraid and likes (which calls dma_get_required_mask() before setting
> it) will remain prone for breaks.

Well, the original intent of dma_get_required_mask is to return the
mask that the driver then uses to figure out what to set, so what aacraid
does fits that use case.  Of course that idea is pretty bogus for
PCIe devices.

I suspect the right fix is to just not query dma_get_required_mask for
PCIe devices in aacraid (and other drivers that do something similar).
Alexey Kardashevskiy Sept. 22, 2020, 2:26 a.m. UTC | #9
On 15/09/2020 16:50, Christoph Hellwig wrote:
> On Wed, Sep 09, 2020 at 07:36:04PM +1000, Alexey Kardashevskiy wrote:
>> I want dma_get_required_mask() to return the bigger mask always.
>>
>> Now it depends on (in dma_alloc_direct()):
>> 1. dev->dma_ops_bypass: set via pci_set_(coherent_)dma_mask();
>> 2. dev->coherent_dma_mask - the same;
>> 3. dev->bus_dma_limit - usually not set at all.
>>
>> So until we set the mask, dma_get_required_mask() returns smaller mask.
>> So aacraid and likes (which calls dma_get_required_mask() before setting
>> it) will remain prone for breaks.
> 
> Well, the original intent of dma_get_required_mask is to return the
> mask that the driver then uses to figure out what to set, so what aacraid
> does fits that use case. 

What was the original intent exactly? Does the driver ask for the minimum
or the maximum DMA mask the platform supports?

As for now, we (ppc64/powernv) can do:
1. bypass (==64bit)
2. a DMA window which used to be limited by 2GB but not anymore.

I can understand if the driver asked for the required mask expecting to
receive either "less than or equal to 32bit" or "more than 32bit" and
chose accordingly.
And this probably was the intent as at the time when the bug was
introduced, the window was always smaller than 4GB.

But today the window is bigger than that (44 bits now, or a similar
value, depending on max page order) so the returned mask is >32. Which
still enables that DAC in aacraid but I suspect this is accidental.


> Of course that idea is pretty bogus for
> PCIe devices.

Why? From the PHB side, there are windows. From the device side, there
are many crippled devices, like, no GPU I saw in last years supported
more than 48bit.


> I suspect the right fix is to just not query dma_get_required_mask for
> PCIe devices in aacraid (and other drivers that do something similar).

Maybe, if you write a nice big comment next to
dma_get_required_mask() explaining exactly what it does, then I will
realize I am getting this all wrong and we will move to fixing the
drivers :)
Christoph Hellwig Sept. 23, 2020, 2:10 p.m. UTC | #10
On Tue, Sep 22, 2020 at 12:26:18PM +1000, Alexey Kardashevskiy wrote:
> > Well, the original intent of dma_get_required_mask is to return the
> > mask that the driver then uses to figure out what to set, so what aacraid
> > does fits that use case. 
> 
> What was the original intent exactly? The driver asks for the minimum or
> maximum DMA mask the platform supports?
> 
> As for now, we (ppc64/powernv) can do:
> 1. bypass (==64bit)
> 2. a DMA window which used to be limited by 2GB but not anymore.
> 
> I can understand if the driver asked for required mask in expectation to
> receive "less or equal than 32bit" and "more than 32 bit" and choose.
> And this probably was the intent as at the time when the bug was
> introduced, the window was always smaller than 4GB.
> 
> But today the window is bigger than that (44 bits now, or a similar
> value, depends on max page order) so the returned mask is >32. Which
> still enables that DAC in aacraid but I suspect this is accidental.

I think for powernv returning 64-bit always would make a lot of sense.
AFAIK all of powernv is PCIe and not legacy PCI, so returning anything
less isn't going to help to optimize anything.

> > Of course that idea is pretty bogus for
> > PCIe devices.
> 
> Why? From the PHB side, there are windows. From the device side, there
> are many crippled devices, like, no GPU I saw in last years supported
> more than 48bit.

Yes, but dma_get_required_mask is misnamed - the mask is not required,
it is the optimal mask.  Even if the window is smaller we handle it
some way, usually by using swiotlb, or by iommu tricks in your case.

> > I suspect the right fix is to just not query dma_get_required_mask for
> > PCIe devices in aacraid (and other drivers that do something similar).
> 
> May be, if you write nice and big comment next to
> dma_get_required_mask() explaining exactly what it does, then I will
> realize I am getting this all wrong and we will move to fixing the
> drivers :)

Yes, it needs a comment or two, and should probably be renamed to
dma_get_optimal_dma_mask, with a cleanup of most users.  I've added it
to my ever growing TODO list, but I would not be unhappy if someone
else gives it a spin.
Alexey Kardashevskiy Sept. 24, 2020, 7:03 a.m. UTC | #11
On 24/09/2020 00:10, Christoph Hellwig wrote:
> On Tue, Sep 22, 2020 at 12:26:18PM +1000, Alexey Kardashevskiy wrote:
>>> Well, the original intent of dma_get_required_mask is to return the
>>> mask that the driver then uses to figure out what to set, so what aacraid
>>> does fits that use case. 
>>
>> What was the original intent exactly? The driver asks for the minimum or
>> maximum DMA mask the platform supports?
>>
>> As for now, we (ppc64/powernv) can do:
>> 1. bypass (==64bit)
>> 2. a DMA window which used to be limited by 2GB but not anymore.
>>
>> I can understand if the driver asked for required mask in expectation to
>> receive "less or equal than 32bit" and "more than 32 bit" and choose.
>> And this probably was the intent as at the time when the bug was
>> introduced, the window was always smaller than 4GB.
>>
>> But today the window is bigger than that (44 bits now, or a similar
>> value, depends on max page order) so the returned mask is >32. Which
>> still enables that DAC in aacraid but I suspect this is accidental.
> 
> I think for powernv returning 64-bit always would make a lot of sense.
> AFAIK all of powernv is PCIe and not legacy PCI, so returning anything
> less isn't going to help to optimize anything.

Maybe... The current behavior is not wrong (after the fix) but not
optimal either. Even with legacy PCI it should just result in a failed
attempt to set a 64bit mask, which drivers should still handle, i.e.
choose a shorter mask.

Why not ditch the whole dma_get_required_mask() and just fail on setting
a bigger mask? Are these failures not handled in some drivers? Or are
there cases when a shorter mask is better? Thanks,


>>> Of course that idea is pretty bogus for
>>> PCIe devices.
>>
>> Why? From the PHB side, there are windows. From the device side, there
>> are many crippled devices, like, no GPU I saw in last years supported
>> more than 48bit.
> 
> Yes, but dma_get_required_mask is misnamed - the mask is not required,
> it is the optimal mask.  Even if the window is smaller we handle it
> some way, usually by using swiotlb, or by iommu tricks in your case.
>
>>> I suspect the right fix is to just not query dma_get_required_mask for
>>> PCIe devices in aacraid (and other drivers that do something similar).
>>
>> May be, if you write nice and big comment next to
>> dma_get_required_mask() explaining exactly what it does, then I will
>> realize I am getting this all wrong and we will move to fixing the
>> drivers :)
> 
> Yes, it needs a comment or two, and probably be renamed to
> dma_get_optimal_dma_mask, and a cleanup of most users.  I've added it
> to my ever growing TODO list, but I would not be unhappy if someone
> else gives it a spin.
>
Christoph Hellwig Sept. 25, 2020, 4:56 a.m. UTC | #12
On Thu, Sep 24, 2020 at 05:03:11PM +1000, Alexey Kardashevskiy wrote:
> May be... The current behavior is not wrong (after the fix) but not
> optimal either. Even with legacy PCI it should just result in failing
> attempt to set 64bit mask which drivers should still handle, i.e. choose
> a shorter mask.

Err, no.

> Why not ditch the whole dma_get_required_mask() and just fail on setting
> a bigger mask? Are these failures not handled in some drivers? Or there
> are cases when a shorter mask is better? Thanks,

Because that is a complete pain.  Think of it, the device/driver knows
what it supports.  For 98% of the modern devices that means all 64-bit
bits, and for most others this means 32-bits, with a few wackos that
support 48 bits or something like that.  The 98% just take any address
thrown at them, and the others just care that they never see an
address larger than what they support.  They could not care any less
if the system supports 31, 36, 41, 48, 52, 55, 61 or 63-bit addressing,
and they most certainly should not implement stupid boilerplate code to
guess what addressing mode the system implements.  They just declare
what they support.

Then you have the 12 drivers for devices that can do optimizations if
they never see large DMA addresses.  They use the somewhat misnamed
dma_get_required_mask API to query what the largest address they might
see is and act based on that, while not putting any burden on all the
sane devices/drivers.

Patch

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 569fecd7b5b2..9053fc9d20c7 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -120,7 +120,8 @@  u64 dma_iommu_get_required_mask(struct device *dev)
 	if (!tbl)
 		return 0;
 
-	mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
+	mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
+			tbl->it_page_shift - 1);
 	mask += mask - 1;
 
 	return mask;