diff mbox series

[v3,18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

Message ID 1521216991-28706-19-git-send-email-okaya@codeaurora.org
State Changes Requested, archived
Delegated to: David Miller
Headers show
Series Eliminate duplicate barriers on weakly-ordered archs | expand

Commit Message

Sinan Kaya March 16, 2018, 4:16 p.m. UTC
Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Steve Wise March 16, 2018, 9:05 p.m. UTC | #1
> Code includes wmb() followed by writel(). writel() already has a barrier
on
> some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>

NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
writeX().  

I was just looking at this with Chelsio developers, and they said the
writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
get rid of the extra barrier for all architectures.

Also, t4.h:pio_copy() needs to  use __raw_writeq() to enable the write
combining fastpath for ARM and PowerPC.  The code as it stands doesn't
achieve any write combining on PowerPC at least.  

And the writel()s at the end of the ring functions (the non bar2 udb path)
needs a mmiowb() afterwards if you're going to use __raw_writeX() there.
However that path is only used for very old hardware (T4), so I wouldn't
worry about them. 

Steve.


> ---
>  drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/cxgb4/t4.h
> b/drivers/infiniband/hw/cxgb4/t4.h
> index 8369c7c..7a48c9e 100644
> --- a/drivers/infiniband/hw/cxgb4/t4.h
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>  				 (u64 *)wqe);
>  		} else {
>  			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >sq.bar2_qid),
> +				       wq->sq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>  				 (void *)wqe);
>  		} else {
>  			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >rq.bar2_qid),
> +				       wq->rq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline int t4_wq_in_error(struct t4_wq *wq)
> --
> 2.7.4
Sinan Kaya March 16, 2018, 9:46 p.m. UTC | #2
On 3/16/2018 5:05 PM, Steve Wise wrote:
>> Code includes wmb() followed by writel(). writel() already has a barrier
> on
>> some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  
> 
> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

OK. I can do that but isn't the problem at PowerPC adaptation?

/*
 * We don't do relaxed operations yet, at least not with this semantic
 */
#define readb_relaxed(addr)	readb(addr)
#define readw_relaxed(addr)	readw(addr)
#define readl_relaxed(addr)	readl(addr)
#define readq_relaxed(addr)	readq(addr)
#define writeb_relaxed(v, addr)	writeb(v, addr)
#define writew_relaxed(v, addr)	writew(v, addr)
#define writel_relaxed(v, addr)	writel(v, addr)
#define writeq_relaxed(v, addr)	writeq(v, addr)

Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

From API perspective both __raw_writeX() and writeX_relaxed() are correct.
It is just PowerPC doesn't seem the follow the definition yet.
Jason Gunthorpe March 16, 2018, 10:13 p.m. UTC | #3
On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > Code includes wmb() followed by writel(). writel() already has a barrier
> on
> > some architectures like arm64.
> > 
> > This ends up CPU observing two barriers back to back before executing the
> > register write.
> > 
> > Since code already has an explicit barrier call, changing writel() to
> > writel_relaxed().
> > 
> > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  

?? Why is changing writex() to writeX() a NAK then?

> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

That doesn't seem semanticaly sane.

__raw_writeX() should not appear in driver code, IMHO. Only the arch
code can know what the exact semantics of that accessor are..

If ppc can't use writel_relaxed to optimize then we probably need yet
another io accessor semantic defined :(

Jason
Steve Wise March 16, 2018, 11:04 p.m. UTC | #4
> 
> On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > > Code includes wmb() followed by writel(). writel() already has a
barrier
> > on
> > > some architectures like arm64.
> > >
> > > This ends up CPU observing two barriers back to back before executing
> the
> > > register write.
> > >
> > > Since code already has an explicit barrier call, changing writel() to
> > > writel_relaxed().
> > >
> > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is
just
> > writeX().
> 
> ?? Why is changing writex() to writeX() a NAK then?

Because I want it correct for PPC as well.

> 
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(),
to
> > get rid of the extra barrier for all architectures.
> 
> That doesn't seem semanticaly sane.
> 
> __raw_writeX() should not appear in driver code, IMHO. Only the arch
> code can know what the exact semantics of that accessor are..
> 
> If ppc can't use writel_relaxed to optimize then we probably need yet
> another io accessor semantic defined :(


Anybody understand why the PPC implementation of writeX_relaxed() isn't
relaxed?


Steve.
Steve Wise March 16, 2018, 11:05 p.m. UTC | #5
> 
> On 3/16/2018 5:05 PM, Steve Wise wrote:
> >> Code includes wmb() followed by writel(). writel() already has a barrier
> > on
> >> some architectures like arm64.
> >>
> >> This ends up CPU observing two barriers back to back before executing
> the
> >> register write.
> >>
> >> Since code already has an explicit barrier call, changing writel() to
> >> writel_relaxed().
> >>
> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> > writeX().
> >
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> > get rid of the extra barrier for all architectures.
> 
> OK. I can do that but isn't the problem at PowerPC adaptation?
> 
> /*
>  * We don't do relaxed operations yet, at least not with this semantic
>  */
> #define readb_relaxed(addr)	readb(addr)
> #define readw_relaxed(addr)	readw(addr)
> #define readl_relaxed(addr)	readl(addr)
> #define readq_relaxed(addr)	readq(addr)
> #define writeb_relaxed(v, addr)	writeb(v, addr)
> #define writew_relaxed(v, addr)	writew(v, addr)
> #define writel_relaxed(v, addr)	writel(v, addr)
> #define writeq_relaxed(v, addr)	writeq(v, addr)
> 
> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?


> 
> >From API perspective both __raw_writeX() and writeX_relaxed() are
> correct.
> It is just PowerPC doesn't seem the follow the definition yet.
Sinan Kaya March 17, 2018, 3:40 a.m. UTC | #6
On 3/16/2018 7:05 PM, Steve Wise wrote:
>>
>> On 3/16/2018 5:05 PM, Steve Wise wrote:
>>>> Code includes wmb() followed by writel(). writel() already has a barrier
>>> on
>>>> some architectures like arm64.
>>>>
>>>> This ends up CPU observing two barriers back to back before executing
>> the
>>>> register write.
>>>>
>>>> Since code already has an explicit barrier call, changing writel() to
>>>> writel_relaxed().
>>>>
>>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>>>
>>> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
>>> writeX().
>>>
>>> I was just looking at this with Chelsio developers, and they said the
>>> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
>>> get rid of the extra barrier for all architectures.
>>
>> OK. I can do that but isn't the problem at PowerPC adaptation?
>>
>> /*
>>  * We don't do relaxed operations yet, at least not with this semantic
>>  */
>> #define readb_relaxed(addr)	readb(addr)
>> #define readw_relaxed(addr)	readw(addr)
>> #define readl_relaxed(addr)	readl(addr)
>> #define readq_relaxed(addr)	readq(addr)
>> #define writeb_relaxed(v, addr)	writeb(v, addr)
>> #define writew_relaxed(v, addr)	writew(v, addr)
>> #define writel_relaxed(v, addr)	writel(v, addr)
>> #define writeq_relaxed(v, addr)	writeq(v, addr)
>>
>> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?
> 
> I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?
> 

I found this article: https://lwn.net/Articles/698014/#PowerPC%20Implementation

Apparently, this issue was discussed at a conference in 2015.

Based on how I read this article, writel() and writel_relaxed() behave exactly
the same on PowerPC because the barrier is enforced by the time code is leaving
a critical section not during MMIO execution.

I also see that __raw_writel() is being heavily used in drivers directory.

https://elixir.bootlin.com/linux/latest/ident/__raw_writel

I'll change writel_relaxed() with __raw_writel() in the series like you suggested
and also look at your other comments.

This seems to be the current norm.

> 
>>
>> >From API perspective both __raw_writeX() and writeX_relaxed() are
>> correct.
>> It is just PowerPC doesn't seem the follow the definition yet.
> 
> 
> 
>
Sinan Kaya March 17, 2018, 4:03 a.m. UTC | #7
On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> and also look at your other comments.

I spoke too soon.

Now that I realized, code needs to follow one of the following patterns for correctness

1)
wmb()
writel()/writel_relaxed()

or 

2)
wmb()
__raw_wrltel()
mmiowb()

but definitely not

wmb()
__raw_wrltel()

Since #1 == #2, I'll stick to my current implementation of writel_relaxed()

Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
for correctness. ARM's mmiowb() implementation is empty.

So, there is no one size fits all solution with the current state of affairs.
Timur Tabi March 17, 2018, 4:08 a.m. UTC | #8
On 3/16/18 6:04 PM, Steve Wise wrote:
> Anybody understand why the PPC implementation of writeX_relaxed() isn't
> relaxed?

You probably should ask that on the linuxppc-dev@lists.ozlabs.org 
mailing list.

I've always wondered why PowerPC has non-standard I/O accessors.
Sinan Kaya March 17, 2018, 4:25 a.m. UTC | #9
On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>> and also look at your other comments.
> 
> I spoke too soon.
> 
> Now that I realized, code needs to follow one of the following patterns for correctness
> 
> 1)
> wmb()
> writel()/writel_relaxed()
> 
> or 
> 
> 2)
> wmb()
> __raw_wrltel()
> mmiowb()
> 
> but definitely not
> 
> wmb()
> __raw_wrltel()
> 
> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> 
> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> for correctness. ARM's mmiowb() implementation is empty.
> 
> So, there is no one size fits all solution with the current state of affairs.
> 
> 

I think I finally got what you mean.

Code seems to have

wmb()
writel()/writeq()
wmb()

this can be safely replaced with

wmb()
__raw_writel()/__raw_writeq()
wmb()

This will work on all arches. Below is the new version. Let me know if this is OK.

+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;

        while (count) {
-               writeq(*src, dst);
+               __raw_writeq(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb()
 }

 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb();
 }
Timur Tabi March 17, 2018, 4:30 a.m. UTC | #10
On Fri, Mar 16, 2018 at 11:25 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),

I think Jason might be right that drivers shouldn't be calling the
__raw_write functions.  You definitely need to run this by the PPC
developers, specifically Ben Herrenschmidt.
Steve Wise March 17, 2018, 1:23 p.m. UTC | #11
> 
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you
> suggested
> >> and also look at your other comments.
> >
> > I spoke too soon.
> >
> > Now that I realized, code needs to follow one of the following patterns
> for correctness
> >
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> >
> > or
> >
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> >
> > but definitely not
> >
> > wmb()
> > __raw_wrltel()
> >
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> >
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough.
> PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> >
> > So, there is no one size fits all solution with the current state of affairs.
> >
> >
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is
> OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64
> *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();
>  }
> 
> 

Yes, this is what chelsio recommended to me.  

Reviewed-by: Steve Wise <swise@opengridcomputing.com>
David Miller March 17, 2018, 1:27 p.m. UTC | #12
From: Sinan Kaya <okaya@codeaurora.org>
Date: Sat, 17 Mar 2018 00:25:14 -0400

> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.

Unfortunately, I think this won't work.

At least on sparc, the __raw_*() variants also change the endianness to
native endianness.

PowerPC does this as well, even documented in a comment :-)

/*
 * Non ordered and non-swapping "raw" accessors
 */

Only the non-__raw_ variants do the endianness swap to/from
little-endian.
Jason Gunthorpe March 17, 2018, 3:05 p.m. UTC | #13
On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> >> and also look at your other comments.
> > 
> > I spoke too soon.
> > 
> > Now that I realized, code needs to follow one of the following patterns for correctness
> > 
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> > 
> > or 
> > 
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> > 
> > but definitely not
> > 
> > wmb()
> > __raw_wrltel()
> > 
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> > 
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> > 
> > So, there is no one size fits all solution with the current state of affairs.
> > 
> > 
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();

No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!

Your first patch was right, replacing
  wmb()
  writel()

With wmb(); writel_relaxed() is fine, and helps ARM.

The problem with PPC has nothing to do with the writel, it is with the
spinlock, and we can't improve it without adding some new
infrastructure. Certainly don't hack around the problem by using
__raw_writel in multi-arch drivers!

In userspace we settled on something like this as a pattern:

 mmio_wc_spin_lock()
 writel_wc_mmio()
 mmio_wc_spin_unlock()

Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
fully contain all MMIO access to WC and UC memory.

Due to our userspace the pattern is specifically used with MMIO writes
to WC BAR memory, but it could be generalized..

This allows all known arches to use the best code in all cases. x86
can even optimize a lot and combine the mmiowmb() into the
spin_unlock, apparently.

Jason
Sinan Kaya March 17, 2018, 6:30 p.m. UTC | #14
+linuxppc-dev@lists.ozlabs.org

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or 
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>>         int count = 8;
>>
>>         while (count) {
>> -               writeq(*src, dst);
>> +               __raw_writeq(*src, dst);
>>                 src++;
>>                 dst++;
>>                 count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>>                                  (u64 *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb()
>>  }
>>
>>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>>                                  (void *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb();
> 
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
> 
> Your first patch was right, replacing
>   wmb()
>   writel()
> 
> With wmb(); writel_relaxed() is fine, and helps ARM.
> 
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
> 
> In userspace we settled on something like this as a pattern:
> 
>  mmio_wc_spin_lock()
>  writel_wc_mmio()
>  mmio_wc_spin_unlock()

> 
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
> fully contain all MMIO access to WC and UC memory.
> 
> Due to our userspace the pattern is specifically used with MMIO writes
> to WC BAR memory, but it could be generalized..
> 
> This allows all known arches to use the best code in all cases. x86
> can even optimize a lot and combine the mmiowmb() into the
> spin_unlock, apparently.

Given both Dave [1] and Jason's feedback, we have to ask PowerPC developers
to fix writel_relaxed().

Otherwise, PowerPC won't be able to take advantage of the optimizations
introduced in this series.

I don't think we need writel_really_relaxed() API.

Somebody also has to take a task and work very hard to get rid of __raw_writeX()
APIs in drivers/net directory. It looked like a very common practice though
it clearly violates multiarch portability concerns Jason and Deve highlighted.

This will be the next version:

iff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..6e5658a 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;
 
        while (count) {
-               writeq(*src, dst);
+               writeq_relaxed(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                      wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                      wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }


[1] https://lkml.org/lkml/2018/3/17/100

> 
> Jason
>
Jason Gunthorpe March 19, 2018, 1:48 a.m. UTC | #15
On Sat, Mar 17, 2018 at 02:30:10PM -0400, Sinan Kaya wrote:

> Somebody also has to take a task and work very hard to get rid of __raw_writeX()
> APIs in drivers/net directory. It looked like a very common practice though
> it clearly violates multiarch portability concerns Jason and Deve highlighted.

When you posted your list I thought most of the hits were in what I'd
think of 'one-arch drivers', eg an IRQ controller or clock driver or
something.. Some might have a reason for it (eg avoiding the swap, for
instance), maybe it is a hold over from before writel_relaxed, or
maybe it is just a cargo-cult behavior..

It is the obviously multi-arch drivers that probably need some
attention..

Jason
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..7a48c9e 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -477,15 +477,15 @@  static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
 				 (u64 *)wqe);
 		} else {
 			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+				       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
 				 (void *)wqe);
 		} else {
 			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+				       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline int t4_wq_in_error(struct t4_wq *wq)