diff mbox series

[1/2,v3] alpha: add a delay to inb_p, inb_w and inb_l

Message ID alpine.LRH.2.02.2005070404420.5006@file01.intranet.prod.int.rdu2.redhat.com
State Not Applicable
Headers show
Series [1/2,v3] alpha: add a delay to inb_p, inb_w and inb_l | expand

Commit Message

Mikulas Patocka May 7, 2020, 8:06 a.m. UTC
On Wed, 6 May 2020, Mikulas Patocka wrote:

> 
> 
> On Wed, 6 May 2020, Arnd Bergmann wrote:
> 
> > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > >  /*
> > >   * The yet supported machines all access the RTC index register via
> > >   * an ISA port access but the way to access the date register differs ...
> > > + *
> > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses,
> > > + * we need to add a small delay.
> > >   */
> > >  #define CMOS_READ(addr) ({ \
> > >  outb_p((addr),RTC_PORT(0)); \
> > > +udelay(2); \
> > >  inb_p(RTC_PORT(1)); \
> > 
> > 
> > The inb_p() / outb_p() functions are meant to already have a delay in them,
> > maybe we should just add it there for alpha?
> > 
> >      Arnd
> 
> Yes, that is possible too - it fixes the real time clock hang for me.
> 
> 
> -#define inb_p		inb
> -#define inw_p		inw
> -#define inl_p		inl
> +#define inb_p(x)	(ndelay(300), inb(x))
> +#define inw_p(x)	(ndelay(300), inw(x))
> +#define inl_p(x)	(ndelay(300), inl(x))
>  #define outb_p		outb
>  #define outw_p		outw
>  #define outl_p		outl

300ns was too low. We need at least 1400ns to fix the hang reliably.

Mikulas


From: Mikulas Patocka <mpatocka@redhat.com>

The patch 92d7223a74235054f2aa7227d207d9c57f84dca0 ("alpha: io: reorder
barriers to guarantee writeX() and iowriteX() ordering #2") broke boot on
the Alpha Avanti platform.

The patch changes timing between accesses to the ISA bus, in particular,
it reduces the time between "write" access and a subsequent "read" access.

This causes lock-up when accessing the real time clock and serial ports.

This patch fixes the real time clock by adding a small delay to the inb_p,
inw_p and inl_p macros.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 92d7223a7423 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2")
Cc: stable@vger.kernel.org	# v4.17+

---
 arch/alpha/include/asm/io.h |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Greg Kroah-Hartman May 7, 2020, 8:20 a.m. UTC | #1
On Thu, May 07, 2020 at 04:06:31AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 6 May 2020, Mikulas Patocka wrote:
> 
> > 
> > 
> > On Wed, 6 May 2020, Arnd Bergmann wrote:
> > 
> > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > 
> > > >  /*
> > > >   * The yet supported machines all access the RTC index register via
> > > >   * an ISA port access but the way to access the date register differs ...
> > > > + *
> > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses,
> > > > + * we need to add a small delay.
> > > >   */
> > > >  #define CMOS_READ(addr) ({ \
> > > >  outb_p((addr),RTC_PORT(0)); \
> > > > +udelay(2); \
> > > >  inb_p(RTC_PORT(1)); \
> > > 
> > > 
> > > The inb_p() / outb_p() functions are meant to already have a delay in them,
> > > maybe we should just add it there for alpha?
> > > 
> > >      Arnd
> > 
> > Yes, that is possible too - it fixes the real time clock hang for me.
> > 
> > 
> > -#define inb_p		inb
> > -#define inw_p		inw
> > -#define inl_p		inl
> > +#define inb_p(x)	(ndelay(300), inb(x))
> > +#define inw_p(x)	(ndelay(300), inw(x))
> > +#define inl_p(x)	(ndelay(300), inl(x))
> >  #define outb_p		outb
> >  #define outw_p		outw
> >  #define outl_p		outl
> 
> 300ns was too low. We need at least 1400ns to fix the hang reliably.
> 
> Mikulas
> 
> 
> From: Mikulas Patocka <mpatocka@redhat.com>
> 
> The patch 92d7223a74235054f2aa7227d207d9c57f84dca0 ("alpha: io: reorder
> barriers to guarantee writeX() and iowriteX() ordering #2") broke boot on
> the Alpha Avanti platform.
> 
> The patch changes timing between accesses to the ISA bus, in particular,
> it reduces the time between "write" access and a subsequent "read" access.
> 
> This causes lock-up when accessing the real time clock and serial ports.
> 
> This patch fixes the real time clock by adding a small delay to the inb_p,
> inw_p and inl_p macros.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Fixes: 92d7223a7423 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2")
> Cc: stable@vger.kernel.org	# v4.17+
> 
> ---
>  arch/alpha/include/asm/io.h |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)

<snip>

This is not in a format that anyone can apply it in, please resend in a
proper way if you wish for it to be accepted.

thanks,

greg k-h
Arnd Bergmann May 7, 2020, 1:30 p.m. UTC | #2
On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> On Wed, 6 May 2020, Mikulas Patocka wrote:
> > On Wed, 6 May 2020, Arnd Bergmann wrote:
> > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > >  /*
> > > >   * The yet supported machines all access the RTC index register via
> > > >   * an ISA port access but the way to access the date register differs ...
> > > > + *
> > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses,
> > > > + * we need to add a small delay.
> > > >   */
> > > >  #define CMOS_READ(addr) ({ \
> > > >  outb_p((addr),RTC_PORT(0)); \
> > > > +udelay(2); \
> > > >  inb_p(RTC_PORT(1)); \
> > >
> > >
> > > The inb_p() / outb_p() functions are meant to already have a delay in them,
> > > maybe we should just add it there for alpha?
> > >
> > >      Arnd
> >
> > Yes, that is possible too - it fixes the real time clock hang for me.
> >
> >
> > -#define inb_p                inb
> > -#define inw_p                inw
> > -#define inl_p                inl
> > +#define inb_p(x)     (ndelay(300), inb(x))
> > +#define inw_p(x)     (ndelay(300), inw(x))
> > +#define inl_p(x)     (ndelay(300), inl(x))
> >  #define outb_p               outb
> >  #define outw_p               outw
> >  #define outl_p               outl
>
> 300ns was too low. We need at least 1400ns to fix the hang reliably.

Are you sure that it is in fact the timing that is important here and not
a barrier? I see that inb() is written in terms of readb(), but the
barrier requirements for I/O space are a bit different from those
on PCI memory space.

In the example you gave first, there is a an outb_p() followed by inb_p().
These are normally serialized by the bus, but I/O space also has the
requirement that an outb() completes before we get to the next
instruction (non-posted write), while writeb() is generally posted and
only needs a barrier before the write rather than both before and after
like outb.

    Arnd
Mikulas Patocka May 7, 2020, 2:09 p.m. UTC | #3
On Thu, 7 May 2020, Arnd Bergmann wrote:

> On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Wed, 6 May 2020, Mikulas Patocka wrote:
> > > On Wed, 6 May 2020, Arnd Bergmann wrote:
> > > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > >
> > > > >  /*
> > > > >   * The yet supported machines all access the RTC index register via
> > > > >   * an ISA port access but the way to access the date register differs ...
> > > > > + *
> > > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses,
> > > > > + * we need to add a small delay.
> > > > >   */
> > > > >  #define CMOS_READ(addr) ({ \
> > > > >  outb_p((addr),RTC_PORT(0)); \
> > > > > +udelay(2); \
> > > > >  inb_p(RTC_PORT(1)); \
> > > >
> > > >
> > > > The inb_p() / outb_p() functions are meant to already have a delay in them,
> > > > maybe we should just add it there for alpha?
> > > >
> > > >      Arnd
> > >
> > > Yes, that is possible too - it fixes the real time clock hang for me.
> > >
> > >
> > > -#define inb_p                inb
> > > -#define inw_p                inw
> > > -#define inl_p                inl
> > > +#define inb_p(x)     (ndelay(300), inb(x))
> > > +#define inw_p(x)     (ndelay(300), inw(x))
> > > +#define inl_p(x)     (ndelay(300), inl(x))
> > >  #define outb_p               outb
> > >  #define outw_p               outw
> > >  #define outl_p               outl
> >
> > 300ns was too low. We need at least 1400ns to fix the hang reliably.
> 
> Are you sure that it is in fact the timing that is important here and not
> a barrier? I see that inb() is written in terms of readb(), but the
> barrier requirements for I/O space are a bit different from those
> on PCI memory space.

The "in" and "out" instructions are serializing on x86. But alpha doesn't 
have dedicated instructions for accessing ports.

Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should 
be protected by two memory barriers, to emulate the x86 behavior?

> In the example you gave first, there is a an outb_p() followed by inb_p().
> These are normally serialized by the bus, but I/O space also has the
> requirement that an outb() completes before we get to the next
> instruction (non-posted write), while writeb() is generally posted and
> only needs a barrier before the write rather than both before and after
> like outb.
> 
>     Arnd

I think that the fact that "writeb" is posted is exactly the problem - it 
gets posted, the processor goes on, sends "readb" and they arrive 
back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back 
accesses and locks up.

Anyway - you can change the "ndelay()" function in this patch to "mb()" - 
"mb()" will provide long enough delay that it fixes this bug.

Mikulas
Arnd Bergmann May 7, 2020, 3:08 p.m. UTC | #4
On Thu, May 7, 2020 at 4:09 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> On Thu, 7 May 2020, Arnd Bergmann wrote:
> > On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Are you sure that it is in fact the timing that is important here and not
> > a barrier? I see that inb() is written in terms of readb(), but the
> > barrier requirements for I/O space are a bit different from those
> > on PCI memory space.
>
> The "in" and "out" instructions are serializing on x86. But alpha doesn't
> have dedicated instructions for accessing ports.
>
> Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should
> be protected by two memory barriers, to emulate the x86 behavior?

That's what we do on some other architectures to emulate the non-posted
behavior of out[bwl], as required by PCI. I can't think of any reasons to
have a barrier before in[bwl], or after write[bwl], but we generally want
one after out[bwl]

> > In the example you gave first, there is a an outb_p() followed by inb_p().
> > These are normally serialized by the bus, but I/O space also has the
> > requirement that an outb() completes before we get to the next
> > instruction (non-posted write), while writeb() is generally posted and
> > only needs a barrier before the write rather than both before and after
> > like outb.
>
> I think that the fact that "writeb" is posted is exactly the problem - it
> gets posted, the processor goes on, sends "readb" and they arrive
> back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back
> accesses and locks up.
>
> Anyway - you can change the "ndelay()" function in this patch to "mb()" -
> "mb()" will provide long enough delay that it fixes this bug.

My preference would be to have whatever makes most sense in theory
and also fixes the problem. If there is some documentation that
says you need a certain amount of time between accesses regardless
of the barriers, then that is fine. I do wonder if there is anything
enforcing the "rpcc" in _delay() to come after the store if there is no
barrier between the two, otherwise the delay method still seems
unreliable.

The barrier after the store at least makes sense to me based on
the theory, both with and without a delay in outb_p().

      Arnd
Mikulas Patocka May 7, 2020, 3:45 p.m. UTC | #5
On Thu, 7 May 2020, Arnd Bergmann wrote:

> On Thu, May 7, 2020 at 4:09 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > On Thu, 7 May 2020, Arnd Bergmann wrote:
> > > On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > > Are you sure that it is in fact the timing that is important here and not
> > > a barrier? I see that inb() is written in terms of readb(), but the
> > > barrier requirements for I/O space are a bit different from those
> > > on PCI memory space.
> >
> > The "in" and "out" instructions are serializing on x86. But alpha doesn't
> > have dedicated instructions for accessing ports.
> >
> > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should
> > be protected by two memory barriers, to emulate the x86 behavior?
> 
> That's what we do on some other architectures to emulate the non-posted
> behavior of out[bwl], as required by PCI. I can't think of any reasons to
> have a barrier before in[bwl], or after write[bwl], but we generally want
> one after out[bwl]

Yes - so we can add a barrier after out[bwl]. It also fixes the serial 
port issue, so we no longer need the serial driver patch for Greg.

> > > In the example you gave first, there is a an outb_p() followed by inb_p().
> > > These are normally serialized by the bus, but I/O space also has the
> > > requirement that an outb() completes before we get to the next
> > > instruction (non-posted write), while writeb() is generally posted and
> > > only needs a barrier before the write rather than both before and after
> > > like outb.
> >
> > I think that the fact that "writeb" is posted is exactly the problem - it
> > gets posted, the processor goes on, sends "readb" and they arrive
> > back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back
> > accesses and locks up.
> >
> > Anyway - you can change the "ndelay()" function in this patch to "mb()" -
> > "mb()" will provide long enough delay that it fixes this bug.
> 
> My preference would be to have whatever makes most sense in theory
> and also fixes the problem. If there is some documentation that
> says you need a certain amount of time between accesses regardless
> of the barriers, then that is fine. I do wonder if there is anything
> enforcing the "rpcc" in _delay() to come after the store if there is no
> barrier between the two, otherwise the delay method still seems
> unreliable.

I measured ndelay - and the overhead of the instruction rpcc is already 
very high. ndelay(1) takes 300ns.

> The barrier after the store at least makes sense to me based on
> the theory, both with and without a delay in outb_p().
> 
>       Arnd

Mikulas
Maciej W. Rozycki May 10, 2020, 1:25 a.m. UTC | #6
On Thu, 7 May 2020, Arnd Bergmann wrote:

> > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should
> > be protected by two memory barriers, to emulate the x86 behavior?
> 
> That's what we do on some other architectures to emulate the non-posted
> behavior of out[bwl], as required by PCI. I can't think of any reasons to
> have a barrier before in[bwl], or after write[bwl], but we generally want
> one after out[bwl]

 Alpha is weakly ordered, also WRT MMIO.  The details are a bit obscure 
(and were discussed before in a previous iteration of these patches), but 
my understanding is multiple writes can be merged and writes can be 
reordered WRT reads, even on UP.  It's generally better for performance to 
have ordering barriers before MMIO operations rather than afterwards, 
unless a completion barrier is also required (e.g. for level-triggered 
interrupt acknowledgement).

 Currently we don't fully guarantee that `outX' won't be posted (from 
memory-barriers.txt):

" (*) inX(), outX():
[...]
        Device drivers may expect outX() to emit a non-posted write transaction
        that waits for a completion response from the I/O peripheral before
        returning. This is not guaranteed by all architectures and is therefore
        not part of the portable ordering semantics."

  Maciej
Mikulas Patocka May 10, 2020, 6:50 p.m. UTC | #7
On Sun, 10 May 2020, Maciej W. Rozycki wrote:

> On Thu, 7 May 2020, Arnd Bergmann wrote:
> 
> > > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should
> > > be protected by two memory barriers, to emulate the x86 behavior?
> > 
> > That's what we do on some other architectures to emulate the non-posted
> > behavior of out[bwl], as required by PCI. I can't think of any reasons to
> > have a barrier before in[bwl], or after write[bwl], but we generally want
> > one after out[bwl]
> 
>  Alpha is weakly ordered, also WRT MMIO.  The details are a bit obscure 
> (and were discussed before in a previous iteration of these patches), but 
> my understanding is multiple writes can be merged and writes can be 
> reordered WRT reads, even on UP.  It's generally better for performance to 

We discussed it some times ago, and the conclusion was that reads and 
writes to the same device are not reordered on Alpha. Reads and writes to 
different devices or to memory may be reordered.

In these problematic cases, we only access serial port or real time clock 
using a few ports (and these devices don't have DMA, so there's not any 
interaction with memory) - so I conclude that it is timing problem and not 
I/O reordering problem.

> have ordering barriers before MMIO operations rather than afterwards, 
> unless a completion barrier is also required (e.g. for level-triggered 
> interrupt acknowledgement).
> 
>  Currently we don't fully guarantee that `outX' won't be posted (from 
> memory-barriers.txt):
> 
> " (*) inX(), outX():
> [...]
>         Device drivers may expect outX() to emit a non-posted write transaction
>         that waits for a completion response from the I/O peripheral before
>         returning. This is not guaranteed by all architectures and is therefore
>         not part of the portable ordering semantics."
> 
>   Maciej

Mikulas
Maciej W. Rozycki May 11, 2020, 2:58 p.m. UTC | #8
On Sun, 10 May 2020, Mikulas Patocka wrote:

> > > That's what we do on some other architectures to emulate the non-posted
> > > behavior of out[bwl], as required by PCI. I can't think of any reasons to
> > > have a barrier before in[bwl], or after write[bwl], but we generally want
> > > one after out[bwl]
> > 
> >  Alpha is weakly ordered, also WRT MMIO.  The details are a bit obscure 
> > (and were discussed before in a previous iteration of these patches), but 
> > my understanding is multiple writes can be merged and writes can be 
> > reordered WRT reads, even on UP.  It's generally better for performance to 
> 
> We discussed it some times ago, and the conclusion was that reads and 
> writes to the same device are not reordered on Alpha. Reads and writes to 
> different devices or to memory may be reordered.

 Except that "device" in this context is a particular MMIO location; most 
peripherals span multiple locations, including the RTC and the UART in the 
Avanti line in particular.

> In these problematic cases, we only access serial port or real time clock 
> using a few ports (and these devices don't have DMA, so there's not any 
> interaction with memory) - so I conclude that it is timing problem and not 
> I/O reordering problem.

 Individual PCI port locations correspond to different MMIO locations, so 
yes, accesses to these can be reordered (merging won't happen due to the 
use of the sparse address space).

 As I noted using a small program to verify actual behaviour ought to 
reveal what the problem really is.  And /dev/mem can be mmapped for PCI 
port I/O access on Alpha (some X-servers do this), so it can be done even 
in the userland with a running system.

 And if timing is indeed the culprit, then I think it will be best fixed 
in the 82378IB southbridge, i.e.[1]:

"The I/O recovery mechanism in the SIO is used to add additional recovery 
delay between PCI originated 8-bit and 16-bit I/O cycles to the ISA Bus.  
The SIO automatically forces a minimum delay of four SYSCLKs between 
back-to-back 8 and 16 bit I/O cycles to the ISA Bus.  The delay is 
measured from the rising edge of the I/O command (IOR# or IOW#) to the 
falling edge of the next BALE.  If a delay of greater than four SYSCLKs is 
required, the ISA I/O Recovery Time Register can be programmed to increase 
the delay in increments of SYSCLKs.  Note that no additional delay is 
inserted for back-to-back I/O "sub cycles" generated as a result of byte 
assembly or disassembly.  This register defaults to 8 and 16-bit recovery 
enabled with two clocks added to the standard I/O recovery."

where it won't be causing unnecessary overhead for native PCI devices or 
indeed excessive one for ISA devices.  It might be interesting to note 
that later SIO versions like the 82378ZB increased the minimum to five 
SYSCLKs, so maybe a missing SYSCLK (that can still be inserted by suitably
programming the ICRT) is the source of the problem?

References:

[1] "82378IB System I/O (SIO)", April 1993, Intel Corporation, Order 
    Number: 290473-002, Section 4.1.17 "ICRT -- ISA Controller Recovery 
    Timer Register"

  Maciej
Mikulas Patocka May 12, 2020, 7:35 p.m. UTC | #9
On Mon, 11 May 2020, Maciej W. Rozycki wrote:

>  And if timing is indeed the culprit, then I think it will be best fixed 
> in the 82378IB southbridge, i.e.[1]:
> 
> "The I/O recovery mechanism in the SIO is used to add additional recovery 
> delay between PCI originated 8-bit and 16-bit I/O cycles to the ISA Bus.  
> The SIO automatically forces a minimum delay of four SYSCLKs between 
> back-to-back 8 and 16 bit I/O cycles to the ISA Bus.  The delay is 
> measured from the rising edge of the I/O command (IOR# or IOW#) to the 
> falling edge of the next BALE.  If a delay of greater than four SYSCLKs is 
> required, the ISA I/O Recovery Time Register can be programmed to increase 
> the delay in increments of SYSCLKs.  Note that no additional delay is 
> inserted for back-to-back I/O "sub cycles" generated as a result of byte 
> assembly or disassembly.  This register defaults to 8 and 16-bit recovery 
> enabled with two clocks added to the standard I/O recovery."
> 
> where it won't be causing unnecessary overhead for native PCI devices or 
> indeed excessive one for ISA devices.  It might be interesting to note 
> that later SIO versions like the 82378ZB increased the minimum to five 
> SYSCLKs, so maybe a missing SYSCLK (that can still be inserted by suitably
> programming the ICRT) is the source of the problem?
> 
> References:
> 
> [1] "82378IB System I/O (SIO)", April 1993, Intel Corporation, Order 
>     Number: 290473-002, Section 4.1.17 "ICRT -- ISA Controller Recovery 
>     Timer Register"
> 
>   Maciej

I tried to modify this register (I wrote 0x44 to it - it should correspond 
to the maximum delay) and it had no effect on the serial port and rtc 
lock-ups.

Mikulas
Ivan Kokshaysky May 13, 2020, 2:41 p.m. UTC | #10
On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote:
>  Individual PCI port locations correspond to different MMIO locations, so 
> yes, accesses to these can be reordered (merging won't happen due to the 
> use of the sparse address space).

Correct, it's how Alpha write buffers work. According to 21064 hardware
reference manual, these buffers are flushed when one of the following
conditions is met:

1) The write buffer contains at least two valid entries.
2) The write buffer contains one valid entry and at least 256 CPU cycles
   have elapsed since the execution of the last write buffer-directed
   instruction.
3) The write buffer contains an MB, STQ_C or STL_C instruction.
4) A load miss is pending to an address currently valid in the write
   buffer that requires the write buffer to be flushed.

I'm certain that in these rtc/serial cases we've got readX arriving
to device *before* preceeding writeX because of 2). That's why small
delay (300-1400 ns, apparently depends on CPU frequency) seemingly
"fixes" the problem. The 4) is not met because loads and stores are
to different ports, and 3) has been broken by commit 92d7223a74.

So I believe that correct fix would be to revert 92d7223a74 and
add wmb() before [io]writeX macros to meet memory-barriers.txt
requirement. The "wmb" instruction is cheap enough and won't hurt
IO performance too much.

Ivan.
Greg Kroah-Hartman May 13, 2020, 4:13 p.m. UTC | #11
On Wed, May 13, 2020 at 03:41:28PM +0100, Ivan Kokshaysky wrote:
> On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote:
> >  Individual PCI port locations correspond to different MMIO locations, so 
> > yes, accesses to these can be reordered (merging won't happen due to the 
> > use of the sparse address space).
> 
> Correct, it's how Alpha write buffers work. According to 21064 hardware
> reference manual, these buffers are flushed when one of the following
> conditions is met:
> 
> 1) The write buffer contains at least two valid entries.
> 2) The write buffer contains one valid entry and at least 256 CPU cycles
>    have elapsed since the execution of the last write buffer-directed
>    instruction.
> 3) The write buffer contains an MB, STQ_C or STL_C instruction.
> 4) A load miss is pending to an address currently valid in the write
>    buffer that requires the write buffer to be flushed.
> 
> I'm certain that in these rtc/serial cases we've got readX arriving
> to device *before* preceeding writeX because of 2). That's why small
> delay (300-1400 ns, apparently depends on CPU frequency) seemingly
> "fixes" the problem. The 4) is not met because loads and stores are
> to different ports, and 3) has been broken by commit 92d7223a74.
> 
> So I believe that correct fix would be to revert 92d7223a74 and
> add wmb() before [io]writeX macros to meet memory-barriers.txt
> requirement. The "wmb" instruction is cheap enough and won't hurt
> IO performance too much.

I agree, that sounds easier, and work with the authors of
memory-barriers.txt in order to straighten things out.

greg k-h
Maciej W. Rozycki May 13, 2020, 5:17 p.m. UTC | #12
On Wed, 13 May 2020, Ivan Kokshaysky wrote:

> >  Individual PCI port locations correspond to different MMIO locations, so 
> > yes, accesses to these can be reordered (merging won't happen due to the 
> > use of the sparse address space).
> 
> Correct, it's how Alpha write buffers work. According to 21064 hardware
> reference manual, these buffers are flushed when one of the following
> conditions is met:
> 
> 1) The write buffer contains at least two valid entries.
> 2) The write buffer contains one valid entry and at least 256 CPU cycles
>    have elapsed since the execution of the last write buffer-directed
>    instruction.
> 3) The write buffer contains an MB, STQ_C or STL_C instruction.
> 4) A load miss is pending to an address currently valid in the write
>    buffer that requires the write buffer to be flushed.
> 
> I'm certain that in these rtc/serial cases we've got readX arriving
> to device *before* preceeding writeX because of 2). That's why small
> delay (300-1400 ns, apparently depends on CPU frequency) seemingly
> "fixes" the problem. The 4) is not met because loads and stores are
> to different ports, and 3) has been broken by commit 92d7223a74.
> 
> So I believe that correct fix would be to revert 92d7223a74 and
> add wmb() before [io]writeX macros to meet memory-barriers.txt
> requirement. The "wmb" instruction is cheap enough and won't hurt
> IO performance too much.

 Hmm, having barriers *afterwards* across all the MMIO accessors, even 
ones that do not have such a requirement according to memory-barriers.txt, 
does hurt performance unnecessarily however.  What I think has to be done 
is adding barriers beforehand, and then only add barriers afterwards where 
necessary.  Commit 92d7223a74 did a part of that, but did not consistently 
update all the remaining accessors.

 So I don't think reverting 92d7223a74 permanently is the right way to go, 
however it certainly makes sense to do that temporarily to get rid of the 
fatal regression, sort all the corner cases and then apply the resulting 
replacement fix.  I think ultimately we do want the barriers beforehand, 
just like the MIPS port has (and survives) in arch/mips/include/asm/io.h.  
Observe that unlike the Alpha ISA the MIPS ISA does have nontrivial `rmb' 
aka the SYNC_RMB hardware instruction.

  Maciej
Mikulas Patocka May 22, 2020, 1:03 p.m. UTC | #13
On Wed, 13 May 2020, Maciej W. Rozycki wrote:

> On Wed, 13 May 2020, Ivan Kokshaysky wrote:
> 
> > >  Individual PCI port locations correspond to different MMIO locations, so 
> > > yes, accesses to these can be reordered (merging won't happen due to the 
> > > use of the sparse address space).
> > 
> > Correct, it's how Alpha write buffers work. According to 21064 hardware
> > reference manual, these buffers are flushed when one of the following
> > conditions is met:
> > 
> > 1) The write buffer contains at least two valid entries.
> > 2) The write buffer contains one valid entry and at least 256 CPU cycles
> >    have elapsed since the execution of the last write buffer-directed
> >    instruction.
> > 3) The write buffer contains an MB, STQ_C or STL_C instruction.
> > 4) A load miss is pending to an address currently valid in the write
> >    buffer that requires the write buffer to be flushed.
> > 
> > I'm certain that in these rtc/serial cases we've got readX arriving
> > to device *before* preceeding writeX because of 2). That's why small
> > delay (300-1400 ns, apparently depends on CPU frequency) seemingly
> > "fixes" the problem. The 4) is not met because loads and stores are
> > to different ports, and 3) has been broken by commit 92d7223a74.
> > 
> > So I believe that correct fix would be to revert 92d7223a74 and
> > add wmb() before [io]writeX macros to meet memory-barriers.txt
> > requirement. The "wmb" instruction is cheap enough and won't hurt
> > IO performance too much.
> 
>  Hmm, having barriers *afterwards* across all the MMIO accessors, even 
> ones that do not have such a requirement according to memory-barriers.txt, 
> does hurt performance unnecessarily however.  What I think has to be done 
> is adding barriers beforehand, and then only add barriers afterwards where 
> necessary.  Commit 92d7223a74 did a part of that, but did not consistently 
> update all the remaining accessors.
> 
>  So I don't think reverting 92d7223a74 permanently is the right way to go, 
> however it certainly makes sense to do that temporarily to get rid of the 
> fatal regression, sort all the corner cases and then apply the resulting 
> replacement fix.

See Documentation/memory-barriers.txt, the section "KERNEL I/O BARRIER 
EFFECTS"

According to the specification, there must be a barrier before a write to 
the MMIO space (paragraph 3) and after a read from MMIO space (parahraph 
4) - if this causes unnecessary slowdown, the driver should use 
readX_relaxed or writeX_relaxed functions - the *_relaxed functions are 
ordered with each other (see the paragraph "(*) readX_relaxed(), 
writeX_relaxed()"), but they are not ordered with respect to memory 
access.

The commit 92d7223a74 fixes that requirement (although there is no real 
driver that was fixed by this), so I don't think it should be reverted. 
The proper fix should be to add delays to the serial port and readltime 
clock (or perhaps to all IO-port accesses).

> I think ultimately we do want the barriers beforehand, just like the 
> MIPS port has (and survives) in arch/mips/include/asm/io.h.  Observe 

If the MIPS port doesn't have MMIO barrier after read[bwl], then it is 
violating the specification. Perhaps there is no existing driver that is 
hurt by this violation, so this violation survived.

> that unlike the Alpha ISA the MIPS ISA does have nontrivial `rmb' aka 
> the SYNC_RMB hardware instruction.
> 
>   Maciej

Mikulas
Mikulas Patocka May 22, 2020, 1:26 p.m. UTC | #14
On Wed, 13 May 2020, Ivan Kokshaysky wrote:

> On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote:
> >  Individual PCI port locations correspond to different MMIO locations, so 
> > yes, accesses to these can be reordered (merging won't happen due to the 
> > use of the sparse address space).
> 
> Correct, it's how Alpha write buffers work. According to 21064 hardware
> reference manual, these buffers are flushed when one of the following
> conditions is met:
> 
> 1) The write buffer contains at least two valid entries.
> 2) The write buffer contains one valid entry and at least 256 CPU cycles
>    have elapsed since the execution of the last write buffer-directed
>    instruction.
> 3) The write buffer contains an MB, STQ_C or STL_C instruction.
> 4) A load miss is pending to an address currently valid in the write
>    buffer that requires the write buffer to be flushed.
> 
> I'm certain that in these rtc/serial cases we've got readX arriving
> to device *before* preceeding writeX because of 2). That's why small
> delay (300-1400 ns, apparently depends on CPU frequency) seemingly
> "fixes" the problem. The 4) is not met because loads and stores are
> to different ports, and 3) has been broken by commit 92d7223a74.
> 
> So I believe that correct fix would be to revert 92d7223a74 and
> add wmb() before [io]writeX macros to meet memory-barriers.txt
> requirement. The "wmb" instruction is cheap enough and won't hurt
> IO performance too much.
> 
> Ivan.

I agree ... and what about readX_relaxed and writeX_relaxed? According to 
the memory-barriers specification, the _relaxed functions must be ordered 
w.r.t. each other. If Alpha can't keep them ordered, they should have 
barriers between them too.

Mikulas
Maciej W. Rozycki May 22, 2020, 1:37 p.m. UTC | #15
On Fri, 22 May 2020, Mikulas Patocka wrote:

> >  Hmm, having barriers *afterwards* across all the MMIO accessors, even 
> > ones that do not have such a requirement according to memory-barriers.txt, 
> > does hurt performance unnecessarily however.  What I think has to be done 
> > is adding barriers beforehand, and then only add barriers afterwards where 
> > necessary.  Commit 92d7223a74 did a part of that, but did not consistently 
> > update all the remaining accessors.
> > 
> >  So I don't think reverting 92d7223a74 permanently is the right way to go, 
> > however it certainly makes sense to do that temporarily to get rid of the 
> > fatal regression, sort all the corner cases and then apply the resulting 
> > replacement fix.
> 
> See Documentation/memory-barriers.txt, the section "KERNEL I/O BARRIER 
> EFFECTS"
> 
> According to the specification, there must be a barrier before a write to 
> the MMIO space (paragraph 3) and after a read from MMIO space (parahraph 
> 4) - if this causes unnecessary slowdown, the driver should use 
> readX_relaxed or writeX_relaxed functions - the *_relaxed functions are 
> ordered with each other (see the paragraph "(*) readX_relaxed(), 
> writeX_relaxed()"), but they are not ordered with respect to memory 
> access.

 The specification doesn't require a barrier *after* a write however, 
which is what I have been concerned about as it may cost hundreds of 
cycles wasted.  I'm not concerned about a barrier after a read (and one 
beforehand is of course also required).

> The commit 92d7223a74 fixes that requirement (although there is no real 
> driver that was fixed by this), so I don't think it should be reverted. 
> The proper fix should be to add delays to the serial port and readltime 
> clock (or perhaps to all IO-port accesses).

 Adding artificial delays will only paper over the real problem I'm 
afraid.

> > I think ultimately we do want the barriers beforehand, just like the 
> > MIPS port has (and survives) in arch/mips/include/asm/io.h.  Observe 
> 
> If the MIPS port doesn't have MMIO barrier after read[bwl], then it is 
> violating the specification. Perhaps there is no existing driver that is 
> hurt by this violation, so this violation survived.

 It does have a barrier, see:

	/* prevent prefetching of coherent DMA data prematurely */	\
	if (!relax)							\
		rmb();							\

In the light of #5 however:

        5. A readX() by a CPU thread from the peripheral will complete before
           any subsequent delay() loop can begin execution on the same thread.

I think it may have to be replaced with a completion barrier however, and 
that will be tricky because you cannot just issue a second read to the 
resource accessed after the `rmb' to make the first read complete, as a 
MMIO read may have side effects (e.g. clearing an interrupt request).  So 
the read would have to happen to a different location.

 Some architectures have a hardware completion barrier instruction, such 
as the modern MIPS ISA, which makes life sweet and easy (as much as life 
can be sweet and easy with a weakly ordered bus model) as no dummy read is 
then required, but surely not all do (including older MIPS ISA revisions).

  Maciej
Mikulas Patocka May 22, 2020, 8 p.m. UTC | #16
On Fri, 22 May 2020, Mikulas Patocka wrote:

> On Wed, 13 May 2020, Ivan Kokshaysky wrote:
> 
> > On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote:
> > >  Individual PCI port locations correspond to different MMIO locations, so 
> > > yes, accesses to these can be reordered (merging won't happen due to the 
> > > use of the sparse address space).
> > 
> > Correct, it's how Alpha write buffers work. According to 21064 hardware
> > reference manual, these buffers are flushed when one of the following
> > conditions is met:
> > 
> > 1) The write buffer contains at least two valid entries.
> > 2) The write buffer contains one valid entry and at least 256 CPU cycles
> >    have elapsed since the execution of the last write buffer-directed
> >    instruction.
> > 3) The write buffer contains an MB, STQ_C or STL_C instruction.
> > 4) A load miss is pending to an address currently valid in the write
> >    buffer that requires the write buffer to be flushed.
> > 
> > I'm certain that in these rtc/serial cases we've got readX arriving
> > to device *before* preceeding writeX because of 2). That's why small
> > delay (300-1400 ns, apparently depends on CPU frequency) seemingly
> > "fixes" the problem. The 4) is not met because loads and stores are
> > to different ports, and 3) has been broken by commit 92d7223a74.
> > 
> > So I believe that correct fix would be to revert 92d7223a74 and
> > add wmb() before [io]writeX macros to meet memory-barriers.txt
> > requirement. The "wmb" instruction is cheap enough and won't hurt
> > IO performance too much.
> > 
> > Ivan.
> 
> I agree ... and what about readX_relaxed and writeX_relaxed? According to 
> the memory-barriers specification, the _relaxed functions must be ordered 
> w.r.t. each other. If Alpha can't keep them ordered, they should have 
> barriers between them too.
> 
> Mikulas

I have found the chapter about I/O in the Alpha reference manual, in the 
section "5.6.4.7 Implications for Memory Mapped I/O", it says:

To reliably communicate shared data from a processor to an I/O device:
1. Write the shared data to a memory-like physical memory region on the 
processor.
2. Execute an MB instruction.
3. Write a flag (equivalently, send an interrupt or write a register 
location implemented in the I/O device).

The receiving I/O device must:
1. Read the flag (equivalently, detect the interrupt or detect the write 
to the register location implemented in the I/O device).
2. Execute the equivalent of an MB.
3. Read the shared data.

So, we must use MB, not WMB when ordering I/O accesses.

The WMB instruction specification says that it orders writes to 
memory-like locations with other writes to memory-like locations. And 
writes to non-memory-like locations with other writes to non-memory-like 
locations. But it doesn't order memory-like writes with I/O-like writes.

The section 5.6.3 claims that there are no implied barriers.



So, if we want to support the specifications:

read*_relaxed and write*_relaxed need at least one barrier before or after 
(memory-barriers.txt says that they must be ordered with each other, the 
alpha specification doesn't specify ordering).

read and write - read must have a barrier after it, write before it. There 
must be one more barrier before a read or after a write, to make sure that 
there is barrier between write+read.

Mikulas
diff mbox series

Patch

Index: linux-stable/arch/alpha/include/asm/io.h
===================================================================
--- linux-stable.orig/arch/alpha/include/asm/io.h	2020-05-07 09:54:52.000000000 +0200
+++ linux-stable/arch/alpha/include/asm/io.h	2020-05-07 09:54:52.000000000 +0200
@@ -6,6 +6,7 @@ 
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/delay.h>
 #include <asm/compiler.h>
 #include <asm/pgtable.h>
 #include <asm/machvec.h>
@@ -481,9 +482,9 @@  extern inline void writeq(u64 b, volatil
 #define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p))
 #define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p))
 
-#define inb_p		inb
-#define inw_p		inw
-#define inl_p		inl
+#define inb_p(x)	(ndelay(1400), inb(x))
+#define inw_p(x)	(ndelay(1400), inw(x))
+#define inl_p(x)	(ndelay(1400), inl(x))
 #define outb_p		outb
 #define outw_p		outw
 #define outl_p		outl