dma-mapping: clearing GFP_ZERO flag caused crashes of Ethernet on arc/hsdk board.

Message ID 1522170774.2593.9.camel@synopsys.com
State New

Commit Message

Evgeniy Didin March 27, 2018, 5:12 p.m.
Hello,

After commit 57bf5a8963f8 ("dma-mapping: clear harmful GFP_* flags in common code") we noticed problems with the Ethernet controller on one of our platforms (namely ARC HSDK).

In particular, we see that the removal of the __GFP_ZERO flag in dma_alloc_attrs() was the culprit, because our implementation of arc_dma_alloc() only allocates zeroed pages if that flag is explicitly set by the caller. Now, with the unconditional removal of that flag in dma_alloc_attrs(), we allocate non-zeroed pages, and that seems to cause problems.

From the mentioned commit message I conclude that architectural code is supposed to always allocate zeroed pages, but I cannot find any such requirement in the kernel's documentation.
Could you please point me to that requirement, if it exists at all? Then we'll implement a fix in our arch code like this:
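--------------------->8---------------------
diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 1dcc404b5aec..c92e518413aa 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -30,7 +30,7 @@ static void *arc_dma_alloc(struct device *dev, size_t size,
        void *kvaddr;
        int need_coh = 1, need_kvaddr = 0;
 
-       page = alloc_pages(gfp, order);
+       page = alloc_pages(gfp | __GFP_ZERO, order);
 
        if (!page)
                return NULL;
--------------------->8---------------------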

Best regards,
Evgeniy Didin
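
The interaction described in this report can be pictured with a small userspace model. This is only a sketch under stated assumptions: every `model_*` name and `MODEL_GFP_*` value below is hypothetical and invented for illustration, `malloc()` stands in for the page allocator, and a poison fill stands in for stale page contents. It is not kernel code and does not reproduce the real `dma_alloc_attrs()` or `arc_dma_alloc()` implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for kernel gfp flags; the values are illustrative. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_KERNEL 0x1u
#define MODEL_GFP_ZERO   0x2u

/* Models the page allocator: memory is cleared only if MODEL_GFP_ZERO is set. */
static void *model_alloc_pages(model_gfp_t gfp, size_t size)
{
    unsigned char *buf = malloc(size);

    if (!buf)
        return NULL;
    /* Poison the buffer so that stale, non-zeroed contents are visible. */
    memset(buf, 0xAA, size);
    if (gfp & MODEL_GFP_ZERO)
        memset(buf, 0, size);
    return buf;
}

/* Models an arch allocator that passes the caller's flags straight through. */
static void *model_arch_alloc_passthrough(model_gfp_t gfp, size_t size)
{
    return model_alloc_pages(gfp, size);
}

/* Models an arch allocator that always requests zeroed memory. */
static void *model_arch_alloc_always_zero(model_gfp_t gfp, size_t size)
{
    return model_alloc_pages(gfp | MODEL_GFP_ZERO, size);
}

typedef void *(*model_arch_alloc_fn)(model_gfp_t gfp, size_t size);

/* Models common code that strips the zero flag before calling arch code. */
static void *model_dma_alloc(model_arch_alloc_fn arch_alloc,
                             model_gfp_t gfp, size_t size)
{
    assert(size > 0);
    gfp &= ~MODEL_GFP_ZERO;
    return arch_alloc(gfp, size);
}

/* Returns true if every byte of the buffer is zero. */
static bool model_is_zeroed(const void *buf, size_t size)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < size; i++)
        if (p[i] != 0)
            return false;
    return true;
}
```

In this model, a caller that passes `MODEL_GFP_KERNEL | MODEL_GFP_ZERO` to `model_dma_alloc()` still gets a poisoned buffer from the pass-through variant, because the flag never reaches the allocator, while the always-zero variant returns cleared memory whatever the caller asked for.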

Comments

Andy Shevchenko March 27, 2018, 6:11 p.m. | #1
On Tue, Mar 27, 2018 at 8:12 PM, Evgeniy Didin
<Evgeniy.Didin@synopsys.com> wrote:
> Hello,
>
> After commit 57bf5a8963f8 ("dma-mapping: clear harmful GFP_* flags in common code") we noticed problems with the Ethernet controller on one of our platforms (namely ARC HSDK).
>
> In particular, we see that the removal of the __GFP_ZERO flag in dma_alloc_attrs() was the culprit, because our implementation of arc_dma_alloc() only allocates zeroed pages if that flag is explicitly set by the caller. Now, with the unconditional removal of that flag in dma_alloc_attrs(), we allocate non-zeroed pages, and that seems to cause problems.
>
> From the mentioned commit message I conclude that architectural code is supposed to always allocate zeroed pages, but I cannot find any such requirement in the kernel's documentation.
> Could you please point me to that requirement, if it exists at all? Then we'll implement a fix in our arch code like this:

Can you elaborate what driver is in use?
stmmac with dwmac-anarion?

If so, this driver (w/o anarion parts, which I believe doesn't have
anything to do with this) is widely used on other platforms.
We should be seeing a lot of reports, yet there is only one so far?

The logical question is why?

Another question: why can't the caller ask for zeroed pages explicitly?

P.S. Current kernel code shows only 3 use cases of GFP_ZERO. It seems
arm64 has something similar in mind.

> --------------------->8---------------------
> diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
> index 1dcc404b5aec..c92e518413aa 100644
> --- a/arch/arc/mm/dma.c
> +++ b/arch/arc/mm/dma.c
> @@ -30,7 +30,7 @@ static void *arc_dma_alloc(struct device *dev, size_t size,
>         void *kvaddr;
>         int need_coh = 1, need_kvaddr = 0;
>
> -       page = alloc_pages(gfp, order);
> +       page = alloc_pages(gfp | __GFP_ZERO, order);
>
>         if (!page)
>                 return NULL;
> --------------------->8---------------------
>
> Best regards,
> Evgeniy Didin
Vineet Gupta March 27, 2018, 6:24 p.m. | #2
Hi Christoph, Andy

On 03/27/2018 11:11 AM, Andy Shevchenko wrote:
> On Tue, Mar 27, 2018 at 8:12 PM, Evgeniy Didin
> <Evgeniy.Didin@synopsys.com> wrote:
>> Hello,
>>
>> After commit 57bf5a8963f8 ("dma-mapping: clear harmful GFP_* flags in common code") we noticed problems with the Ethernet controller on one of our platforms (namely ARC HSDK).
>>
>> In particular, we see that the removal of the __GFP_ZERO flag in dma_alloc_attrs() was the culprit, because our implementation of arc_dma_alloc() only allocates zeroed pages if that flag is explicitly set by the caller. Now, with the unconditional removal of that flag in dma_alloc_attrs(), we allocate non-zeroed pages, and that seems to cause problems.
>>
>> From the mentioned commit message I conclude that architectural code is supposed to always allocate zeroed pages, but I cannot find any such requirement in the kernel's documentation.
>> Could you please point me to that requirement, if it exists at all? Then we'll implement a fix in our arch code like this:

[snip]

> Another question: why can't the caller ask for zeroed pages explicitly?

Question to whom? The caller can ask for it, but the problem here is that the
generic DMA API code is clearing GFP_ZERO and expecting arch code to memset
unconditionally. Is that expected of arch code, and is it documented?

That is broken to begin with: arch dma_alloc* simply passes the gfp flags through
to the page allocator and doesn't muck around with them. We could in theory, but
it doesn't seem like the right thing to do IMO.

-Vineet
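
For comparison, the caller-side alternative raised in this exchange can be sketched as follows: a caller that cannot rely on the allocator honouring a zeroing request clears the buffer itself after allocation. This is a userspace illustration only. `struct model_desc`, `model_alloc_unzeroed()` and `model_alloc_ring()` are hypothetical names invented for this sketch, and nothing here reflects how the stmmac driver actually sets up its descriptor rings.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor layout, for illustration only. */
struct model_desc {
    unsigned int status;
    unsigned int addr;
};

/* Stand-in for an allocator that may hand back non-zeroed memory. */
static void *model_alloc_unzeroed(size_t size)
{
    unsigned char *buf = malloc(size);

    if (buf)
        memset(buf, 0x5A, size); /* simulate stale contents */
    return buf;
}

/* Allocates a descriptor ring and clears it explicitly on the caller side. */
static struct model_desc *model_alloc_ring(size_t count)
{
    struct model_desc *ring;
    size_t bytes;

    /* Reject empty rings and sizes that would overflow the multiplication. */
    if (count == 0 || count > SIZE_MAX / sizeof(*ring))
        return NULL;

    bytes = count * sizeof(*ring);
    ring = model_alloc_unzeroed(bytes);
    if (!ring)
        return NULL;

    /* Do not depend on the allocator: zero the ring before first use. */
    memset(ring, 0, bytes);
    return ring;
}
```

The trade-off is that every caller has to remember the explicit clear, whereas zeroing inside the allocator gives all callers the same guarantee in one place.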
Alexey Brodkin March 27, 2018, 9:19 p.m. | #3
Hi Andy,

On Tue, 2018-03-27 at 21:11 +0300, Andy Shevchenko wrote:
> On Tue, Mar 27, 2018 at 8:12 PM, Evgeniy Didin
> <Evgeniy.Didin@synopsys.com> wrote:
> > Hello,
> > 
> > After commit 57bf5a8963f8 ("dma-mapping: clear harmful GFP_* flags in common code") we noticed problems with the Ethernet controller on one of our
> > platforms (namely ARC HSDK).
> > 
> > In particular, we see that the removal of the __GFP_ZERO flag in dma_alloc_attrs() was the culprit, because our implementation of arc_dma_alloc()
> > only allocates zeroed pages if that flag is explicitly set by the caller. Now, with the unconditional removal of that flag in dma_alloc_attrs(),
> > we allocate non-zeroed pages, and that seems to cause problems.
> > 
> > From the mentioned commit message I conclude that architectural code is supposed to always allocate zeroed pages, but I cannot find any such
> > requirement in the kernel's documentation.
> > Could you please point me to that requirement, if it exists at all? Then we'll implement a fix in our arch code like this:
> 
> Can you elaborate what driver is in use?
> stmmac with dwmac-anarion?

It is indeed DW GMAC (AKA STMMAC) with built-in DMA.

> If so, this driver (w/o anarion parts, which I believe doesn't have
> anything to do with this) is widely used on other platforms.
> We should be seeing a lot of reports, yet there is only one so far?
> 
> The logical question is why?

1. Note that ours is a different platform, with an ARC core, so maybe
   on ARM the DMA allocator already zeroes pages regardless of the
   provided flags. Personally, I didn't check that.

2. Even on HSDK we saw it only when attempting to run "iperf"; the DHCP
   client works perfectly fine on that same platform, so maybe others
   just don't see problems yet.

3. Who knows whether RCs are being tested on other platforms with
   networking, so maybe similar reports will start to appear once
   4.16 gets released.

-Alexey
Christoph Hellwig March 28, 2018, 7:53 a.m. | #4
> > The logical question is why?
> 
> 1. Note that ours is a different platform, with an ARC core, so maybe
>    on ARM the DMA allocator already zeroes pages regardless of the
>    provided flags. Personally, I didn't check that.

Yes, most architectures always clear memory returned by dma_alloc*.
Looks like a few don't and my commit got them in trouble.  As usual
I'd prefer to match x86 semantics for now to avoid problems.

I'll send patches for arc and s390, which seem to be the holdouts that
are actually in use, and will check whether anyone else is also affected.

Patch

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 1dcc404b5aec..c92e518413aa 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -30,7 +30,7 @@ static void *arc_dma_alloc(struct device *dev, size_t size,
        void *kvaddr;
        int need_coh = 1, need_kvaddr = 0;
 
-       page = alloc_pages(gfp, order);
+       page = alloc_pages(gfp | __GFP_ZERO, order);
 
        if (!page)