Message ID | alpine.LRH.2.02.2005070404420.5006@file01.intranet.prod.int.rdu2.redhat.com |
---|---|
State | Not Applicable |
Headers | show |
Series | [1/2,v3] alpha: add a delay to inb_p, inb_w and inb_l | expand |
On Thu, May 07, 2020 at 04:06:31AM -0400, Mikulas Patocka wrote: > > > On Wed, 6 May 2020, Mikulas Patocka wrote: > > > > > > > On Wed, 6 May 2020, Arnd Bergmann wrote: > > > > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > /* > > > > * The yet supported machines all access the RTC index register via > > > > * an ISA port access but the way to access the date register differs ... > > > > + * > > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses, > > > > + * we need to add a small delay. > > > > */ > > > > #define CMOS_READ(addr) ({ \ > > > > outb_p((addr),RTC_PORT(0)); \ > > > > +udelay(2); \ > > > > inb_p(RTC_PORT(1)); \ > > > > > > > > > The inb_p() / outb_p() functions are meant to already have a delay in them, > > > maybe we should just add it there for alpha? > > > > > > Arnd > > > > Yes, that is possible too - it fixes the real time clock hang for me. > > > > > > -#define inb_p inb > > -#define inw_p inw > > -#define inl_p inl > > +#define inb_p(x) (ndelay(300), inb(x)) > > +#define inw_p(x) (ndelay(300), inw(x)) > > +#define inl_p(x) (ndelay(300), inl(x)) > > #define outb_p outb > > #define outw_p outw > > #define outl_p outl > > 300ns was too low. We need at least 1400ns to fix the hang reliably. > > Mikulas > > > From: Mikulas Patocka <mpatocka@redhat.com> > > The patch 92d7223a74235054f2aa7227d207d9c57f84dca0 ("alpha: io: reorder > barriers to guarantee writeX() and iowriteX() ordering #2") broke boot on > the Alpha Avanti platform. > > The patch changes timing between accesses to the ISA bus, in particular, > it reduces the time between "write" access and a subsequent "read" access. > > This causes lock-up when accessing the real time clock and serial ports. > > This patch fixes the real time clock by adding a small delay to the inb_p, > inw_p and inl_p macros. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > Fixes: 92d7223a7423 ("alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering #2") > Cc: stable@vger.kernel.org # v4.17+ > > --- > arch/alpha/include/asm/io.h | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) <snip> This is not in a format that anyone can apply it in, please resend in a proper way if you wish for it to be accepted. thanks, greg k-h
On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > On Wed, 6 May 2020, Mikulas Patocka wrote: > > On Wed, 6 May 2020, Arnd Bergmann wrote: > > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > /* > > > > * The yet supported machines all access the RTC index register via > > > > * an ISA port access but the way to access the date register differs ... > > > > + * > > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses, > > > > + * we need to add a small delay. > > > > */ > > > > #define CMOS_READ(addr) ({ \ > > > > outb_p((addr),RTC_PORT(0)); \ > > > > +udelay(2); \ > > > > inb_p(RTC_PORT(1)); \ > > > > > > > > > The inb_p() / outb_p() functions are meant to already have a delay in them, > > > maybe we should just add it there for alpha? > > > > > > Arnd > > > > Yes, that is possible too - it fixes the real time clock hang for me. > > > > > > -#define inb_p inb > > -#define inw_p inw > > -#define inl_p inl > > +#define inb_p(x) (ndelay(300), inb(x)) > > +#define inw_p(x) (ndelay(300), inw(x)) > > +#define inl_p(x) (ndelay(300), inl(x)) > > #define outb_p outb > > #define outw_p outw > > #define outl_p outl > > 300ns was too low. We need at least 1400ns to fix the hang reliably. Are you sure that it is in fact the timing that is important here and not a barrier? I see that inb() is written in terms of readb(), but the barrier requirements for I/O space are a bit different from those on PCI memory space. In the example you gave first, there is a an outb_p() followed by inb_p(). These are normally serialized by the bus, but I/O space also has the requirement that an outb() completes before we get to the next instruction (non-posted write), while writeb() is generally posted and only needs a barrier before the write rather than both before and after like outb. Arnd
On Thu, 7 May 2020, Arnd Bergmann wrote: > On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > On Wed, 6 May 2020, Mikulas Patocka wrote: > > > On Wed, 6 May 2020, Arnd Bergmann wrote: > > > > On Wed, May 6, 2020 at 1:21 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > > > /* > > > > > * The yet supported machines all access the RTC index register via > > > > > * an ISA port access but the way to access the date register differs ... > > > > > + * > > > > > + * The ISA bus on Alpha Avanti doesn't like back-to-back accesses, > > > > > + * we need to add a small delay. > > > > > */ > > > > > #define CMOS_READ(addr) ({ \ > > > > > outb_p((addr),RTC_PORT(0)); \ > > > > > +udelay(2); \ > > > > > inb_p(RTC_PORT(1)); \ > > > > > > > > > > > > The inb_p() / outb_p() functions are meant to already have a delay in them, > > > > maybe we should just add it there for alpha? > > > > > > > > Arnd > > > > > > Yes, that is possible too - it fixes the real time clock hang for me. > > > > > > > > > -#define inb_p inb > > > -#define inw_p inw > > > -#define inl_p inl > > > +#define inb_p(x) (ndelay(300), inb(x)) > > > +#define inw_p(x) (ndelay(300), inw(x)) > > > +#define inl_p(x) (ndelay(300), inl(x)) > > > #define outb_p outb > > > #define outw_p outw > > > #define outl_p outl > > > > 300ns was too low. We need at least 1400ns to fix the hang reliably. > > Are you sure that it is in fact the timing that is important here and not > a barrier? I see that inb() is written in terms of readb(), but the > barrier requirements for I/O space are a bit different from those > on PCI memory space. The "in" and "out" instructions are serializing on x86. But alpha doesn't have dedicated instructions for accessing ports. Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should be protected by two memory barriers, to emulate the x86 behavior? > In the example you gave first, there is a an outb_p() followed by inb_p(). > These are normally serialized by the bus, but I/O space also has the > requirement that an outb() completes before we get to the next > instruction (non-posted write), while writeb() is generally posted and > only needs a barrier before the write rather than both before and after > like outb. > > Arnd I think that the fact that "writeb" is posted is exactly the problem - it gets posted, the processor goes on, sends "readb" and they arrive back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back accesses and locks up. Anyway - you can change the "ndelay()" function in this patch to "mb()" - "mb()" will provide long enough delay that it fixes this bug. Mikulas
On Thu, May 7, 2020 at 4:09 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > On Thu, 7 May 2020, Arnd Bergmann wrote: > > On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > Are you sure that it is in fact the timing that is important here and not > > a barrier? I see that inb() is written in terms of readb(), but the > > barrier requirements for I/O space are a bit different from those > > on PCI memory space. > > The "in" and "out" instructions are serializing on x86. But alpha doesn't > have dedicated instructions for accessing ports. > > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should > be protected by two memory barriers, to emulate the x86 behavior? That's what we do on some other architectures to emulate the non-posted behavior of out[bwl], as required by PCI. I can't think of any reasons to have a barrier before in[bwl], or after write[bwl], but we generally want one after out[bwl] > > In the example you gave first, there is a an outb_p() followed by inb_p(). > > These are normally serialized by the bus, but I/O space also has the > > requirement that an outb() completes before we get to the next > > instruction (non-posted write), while writeb() is generally posted and > > only needs a barrier before the write rather than both before and after > > like outb. > > I think that the fact that "writeb" is posted is exactly the problem - it > gets posted, the processor goes on, sends "readb" and they arrive > back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back > accesses and locks up. > > Anyway - you can change the "ndelay()" function in this patch to "mb()" - > "mb()" will provide long enough delay that it fixes this bug. My preference would be to have whatever makes most sense in theory and also fixes the problem. If there is some documentation that says you need a certain amount of time between accesses regardless of the barriers, then that is fine. I do wonder if there is anything enforcing the "rpcc" in _delay() to come after the store if there is no barrier between the two, otherwise the delay method still seems unreliable. The barrier after the store at least makes sense to me based on the theory, both with and without a delay in outb_p(). Arnd
On Thu, 7 May 2020, Arnd Bergmann wrote: > On Thu, May 7, 2020 at 4:09 PM Mikulas Patocka <mpatocka@redhat.com> wrote: > > On Thu, 7 May 2020, Arnd Bergmann wrote: > > > On Thu, May 7, 2020 at 10:06 AM Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > Are you sure that it is in fact the timing that is important here and not > > > a barrier? I see that inb() is written in terms of readb(), but the > > > barrier requirements for I/O space are a bit different from those > > > on PCI memory space. > > > > The "in" and "out" instructions are serializing on x86. But alpha doesn't > > have dedicated instructions for accessing ports. > > > > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should > > be protected by two memory barriers, to emulate the x86 behavior? > > That's what we do on some other architectures to emulate the non-posted > behavior of out[bwl], as required by PCI. I can't think of any reasons to > have a barrier before in[bwl], or after write[bwl], but we generally want > one after out[bwl] Yes - so we can add a barrier after out[bwl]. It also fixes the serial port issue, so we no longer need the serial driver patch for Greg. > > > In the example you gave first, there is a an outb_p() followed by inb_p(). > > > These are normally serialized by the bus, but I/O space also has the > > > requirement that an outb() completes before we get to the next > > > instruction (non-posted write), while writeb() is generally posted and > > > only needs a barrier before the write rather than both before and after > > > like outb. > > > > I think that the fact that "writeb" is posted is exactly the problem - it > > gets posted, the processor goes on, sends "readb" and they arrive > > back-to-back to the ISA bus. The ISA bus device doesn't like back-to-back > > accesses and locks up. > > > > Anyway - you can change the "ndelay()" function in this patch to "mb()" - > > "mb()" will provide long enough delay that it fixes this bug. > > My preference would be to have whatever makes most sense in theory > and also fixes the problem. If there is some documentation that > says you need a certain amount of time between accesses regardless > of the barriers, then that is fine. I do wonder if there is anything > enforcing the "rpcc" in _delay() to come after the store if there is no > barrier between the two, otherwise the delay method still seems > unreliable. I measured ndelay - and the overhead of the instruction rpcc is already very high. ndelay(1) takes 300ns. > The barrier after the store at least makes sense to me based on > the theory, both with and without a delay in outb_p(). > > Arnd Mikulas
On Thu, 7 May 2020, Arnd Bergmann wrote: > > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should > > be protected by two memory barriers, to emulate the x86 behavior? > > That's what we do on some other architectures to emulate the non-posted > behavior of out[bwl], as required by PCI. I can't think of any reasons to > have a barrier before in[bwl], or after write[bwl], but we generally want > one after out[bwl] Alpha is weakly ordered, also WRT MMIO. The details are a bit obscure (and were discussed before in a previous iteration of these patches), but my understanding is multiple writes can be merged and writes can be reordered WRT reads, even on UP. It's generally better for performance to have ordering barriers before MMIO operations rather than afterwards, unless a completion barrier is also required (e.g. for level-triggered interrupt acknowledgement). Currently we don't fully guarantee that `outX' won't be posted (from memory-barriers.txt): " (*) inX(), outX(): [...] Device drivers may expect outX() to emit a non-posted write transaction that waits for a completion response from the I/O peripheral before returning. This is not guaranteed by all architectures and is therefore not part of the portable ordering semantics." Maciej
On Sun, 10 May 2020, Maciej W. Rozycki wrote: > On Thu, 7 May 2020, Arnd Bergmann wrote: > > > > Do you think that all the "in[bwl]" and "out[bwl]" macros on alpha should > > > be protected by two memory barriers, to emulate the x86 behavior? > > > > That's what we do on some other architectures to emulate the non-posted > > behavior of out[bwl], as required by PCI. I can't think of any reasons to > > have a barrier before in[bwl], or after write[bwl], but we generally want > > one after out[bwl] > > Alpha is weakly ordered, also WRT MMIO. The details are a bit obscure > (and were discussed before in a previous iteration of these patches), but > my understanding is multiple writes can be merged and writes can be > reordered WRT reads, even on UP. It's generally better for performance to We discussed it some times ago, and the conclusion was that reads and writes to the same device are not reordered on Alpha. Reads and writes to different devices or to memory may be reordered. In these problematic cases, we only access serial port or real time clock using a few ports (and these devices don't have DMA, so there's not any interaction with memory) - so I conclude that it is timing problem and not I/O reordering problem. > have ordering barriers before MMIO operations rather than afterwards, > unless a completion barrier is also required (e.g. for level-triggered > interrupt acknowledgement). > > Currently we don't fully guarantee that `outX' won't be posted (from > memory-barriers.txt): > > " (*) inX(), outX(): > [...] > Device drivers may expect outX() to emit a non-posted write transaction > that waits for a completion response from the I/O peripheral before > returning. This is not guaranteed by all architectures and is therefore > not part of the portable ordering semantics." > > Maciej Mikulas
On Sun, 10 May 2020, Mikulas Patocka wrote: > > > That's what we do on some other architectures to emulate the non-posted > > > behavior of out[bwl], as required by PCI. I can't think of any reasons to > > > have a barrier before in[bwl], or after write[bwl], but we generally want > > > one after out[bwl] > > > > Alpha is weakly ordered, also WRT MMIO. The details are a bit obscure > > (and were discussed before in a previous iteration of these patches), but > > my understanding is multiple writes can be merged and writes can be > > reordered WRT reads, even on UP. It's generally better for performance to > > We discussed it some times ago, and the conclusion was that reads and > writes to the same device are not reordered on Alpha. Reads and writes to > different devices or to memory may be reordered. Except that "device" in this context is a particular MMIO location; most peripherals span multiple locations, including the RTC and the UART in the Avanti line in particular. > In these problematic cases, we only access serial port or real time clock > using a few ports (and these devices don't have DMA, so there's not any > interaction with memory) - so I conclude that it is timing problem and not > I/O reordering problem. Individual PCI port locations correspond to different MMIO locations, so yes, accesses to these can be reordered (merging won't happen due to the use of the sparse address space). As I noted using a small program to verify actual behaviour ought to reveal what the problem really is. And /dev/mem can be mmapped for PCI port I/O access on Alpha (some X-servers do this), so it can be done even in the userland with a running system. And if timing is indeed the culprit, then I think it will be best fixed in the 82378IB southbridge, i.e.[1]: "The I/O recovery mechanism in the SIO is used to add additional recovery delay between PCI originated 8-bit and 16-bit I/O cycles to the ISA Bus. The SIO automatically forces a minimum delay of four SYSCLKs between back-to-back 8 and 16 bit I/O cycles to the ISA Bus. The delay is measured from the rising edge of the I/O command (IOR# or IOW#) to the falling edge of the next BALE. If a delay of greater than four SYSCLKs is required, the ISA I/O Recovery Time Register can be programmed to increase the delay in increments of SYSCLKs. Note that no additional delay is inserted for back-to-back I/O "sub cycles" generated as a result of byte assembly or disassembly. This register defaults to 8 and 16-bit recovery enabled with two clocks added to the standard I/O recovery." where it won't be causing unnecessary overhead for native PCI devices or indeed excessive one for ISA devices. It might be interesting to note that later SIO versions like the 82378ZB increased the minimum to five SYSCLKs, so maybe a missing SYSCLK (that can still be inserted by suitably programming the ICRT) is the source of the problem? References: [1] "82378IB System I/O (SIO)", April 1993, Intel Corporation, Order Number: 290473-002, Section 4.1.17 "ICRT -- ISA Controller Recovery Timer Register" Maciej
On Mon, 11 May 2020, Maciej W. Rozycki wrote: > And if timing is indeed the culprit, then I think it will be best fixed > in the 82378IB southbridge, i.e.[1]: > > "The I/O recovery mechanism in the SIO is used to add additional recovery > delay between PCI originated 8-bit and 16-bit I/O cycles to the ISA Bus. > The SIO automatically forces a minimum delay of four SYSCLKs between > back-to-back 8 and 16 bit I/O cycles to the ISA Bus. The delay is > measured from the rising edge of the I/O command (IOR# or IOW#) to the > falling edge of the next BALE. If a delay of greater than four SYSCLKs is > required, the ISA I/O Recovery Time Register can be programmed to increase > the delay in increments of SYSCLKs. Note that no additional delay is > inserted for back-to-back I/O "sub cycles" generated as a result of byte > assembly or disassembly. This register defaults to 8 and 16-bit recovery > enabled with two clocks added to the standard I/O recovery." > > where it won't be causing unnecessary overhead for native PCI devices or > indeed excessive one for ISA devices. It might be interesting to note > that later SIO versions like the 82378ZB increased the minimum to five > SYSCLKs, so maybe a missing SYSCLK (that can still be inserted by suitably > programming the ICRT) is the source of the problem? > > References: > > [1] "82378IB System I/O (SIO)", April 1993, Intel Corporation, Order > Number: 290473-002, Section 4.1.17 "ICRT -- ISA Controller Recovery > Timer Register" > > Maciej I tried to modify this register (I wrote 0x44 to it - it should correspond to the maximum delay) and it had no effect on the serial port and rtc lock-ups. Mikulas
On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote: > Individual PCI port locations correspond to different MMIO locations, so > yes, accesses to these can be reordered (merging won't happen due to the > use of the sparse address space). Correct, it's how Alpha write buffers work. According to 21064 hardware reference manual, these buffers are flushed when one of the following conditions is met: 1) The write buffer contains at least two valid entries. 2) The write buffer contains one valid entry and at least 256 CPU cycles have elapsed since the execution of the last write buffer-directed instruction. 3) The write buffer contains an MB, STQ_C or STL_C instruction. 4) A load miss is pending to an address currently valid in the write buffer that requires the write buffer to be flushed. I'm certain that in these rtc/serial cases we've got readX arriving to device *before* preceeding writeX because of 2). That's why small delay (300-1400 ns, apparently depends on CPU frequency) seemingly "fixes" the problem. The 4) is not met because loads and stores are to different ports, and 3) has been broken by commit 92d7223a74. So I believe that correct fix would be to revert 92d7223a74 and add wmb() before [io]writeX macros to meet memory-barriers.txt requirement. The "wmb" instruction is cheap enough and won't hurt IO performance too much. Ivan.
On Wed, May 13, 2020 at 03:41:28PM +0100, Ivan Kokshaysky wrote: > On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote: > > Individual PCI port locations correspond to different MMIO locations, so > > yes, accesses to these can be reordered (merging won't happen due to the > > use of the sparse address space). > > Correct, it's how Alpha write buffers work. According to 21064 hardware > reference manual, these buffers are flushed when one of the following > conditions is met: > > 1) The write buffer contains at least two valid entries. > 2) The write buffer contains one valid entry and at least 256 CPU cycles > have elapsed since the execution of the last write buffer-directed > instruction. > 3) The write buffer contains an MB, STQ_C or STL_C instruction. > 4) A load miss is pending to an address currently valid in the write > buffer that requires the write buffer to be flushed. > > I'm certain that in these rtc/serial cases we've got readX arriving > to device *before* preceeding writeX because of 2). That's why small > delay (300-1400 ns, apparently depends on CPU frequency) seemingly > "fixes" the problem. The 4) is not met because loads and stores are > to different ports, and 3) has been broken by commit 92d7223a74. > > So I believe that correct fix would be to revert 92d7223a74 and > add wmb() before [io]writeX macros to meet memory-barriers.txt > requirement. The "wmb" instruction is cheap enough and won't hurt > IO performance too much. I agree, that sounds easier, and work with the authors of memory-barriers.txt in order to straighten things out. greg k-h
On Wed, 13 May 2020, Ivan Kokshaysky wrote: > > Individual PCI port locations correspond to different MMIO locations, so > > yes, accesses to these can be reordered (merging won't happen due to the > > use of the sparse address space). > > Correct, it's how Alpha write buffers work. According to 21064 hardware > reference manual, these buffers are flushed when one of the following > conditions is met: > > 1) The write buffer contains at least two valid entries. > 2) The write buffer contains one valid entry and at least 256 CPU cycles > have elapsed since the execution of the last write buffer-directed > instruction. > 3) The write buffer contains an MB, STQ_C or STL_C instruction. > 4) A load miss is pending to an address currently valid in the write > buffer that requires the write buffer to be flushed. > > I'm certain that in these rtc/serial cases we've got readX arriving > to device *before* preceeding writeX because of 2). That's why small > delay (300-1400 ns, apparently depends on CPU frequency) seemingly > "fixes" the problem. The 4) is not met because loads and stores are > to different ports, and 3) has been broken by commit 92d7223a74. > > So I believe that correct fix would be to revert 92d7223a74 and > add wmb() before [io]writeX macros to meet memory-barriers.txt > requirement. The "wmb" instruction is cheap enough and won't hurt > IO performance too much. Hmm, having barriers *afterwards* across all the MMIO accessors, even ones that do not have such a requirement according to memory-barriers.txt, does hurt performance unnecessarily however. What I think has to be done is adding barriers beforehand, and then only add barriers afterwards where necessary. Commit 92d7223a74 did a part of that, but did not consistently update all the remaining accessors. So I don't think reverting 92d7223a74 permanently is the right way to go, however it certainly makes sense to do that temporarily to get rid of the fatal regression, sort all the corner cases and then apply the resulting replacement fix. I think ultimately we do want the barriers beforehand, just like the MIPS port has (and survives) in arch/mips/include/asm/io.h. Observe that unlike the Alpha ISA the MIPS ISA does have nontrivial `rmb' aka the SYNC_RMB hardware instruction. Maciej
On Wed, 13 May 2020, Maciej W. Rozycki wrote: > On Wed, 13 May 2020, Ivan Kokshaysky wrote: > > > > Individual PCI port locations correspond to different MMIO locations, so > > > yes, accesses to these can be reordered (merging won't happen due to the > > > use of the sparse address space). > > > > Correct, it's how Alpha write buffers work. According to 21064 hardware > > reference manual, these buffers are flushed when one of the following > > conditions is met: > > > > 1) The write buffer contains at least two valid entries. > > 2) The write buffer contains one valid entry and at least 256 CPU cycles > > have elapsed since the execution of the last write buffer-directed > > instruction. > > 3) The write buffer contains an MB, STQ_C or STL_C instruction. > > 4) A load miss is pending to an address currently valid in the write > > buffer that requires the write buffer to be flushed. > > > > I'm certain that in these rtc/serial cases we've got readX arriving > > to device *before* preceeding writeX because of 2). That's why small > > delay (300-1400 ns, apparently depends on CPU frequency) seemingly > > "fixes" the problem. The 4) is not met because loads and stores are > > to different ports, and 3) has been broken by commit 92d7223a74. > > > > So I believe that correct fix would be to revert 92d7223a74 and > > add wmb() before [io]writeX macros to meet memory-barriers.txt > > requirement. The "wmb" instruction is cheap enough and won't hurt > > IO performance too much. > > Hmm, having barriers *afterwards* across all the MMIO accessors, even > ones that do not have such a requirement according to memory-barriers.txt, > does hurt performance unnecessarily however. What I think has to be done > is adding barriers beforehand, and then only add barriers afterwards where > necessary. Commit 92d7223a74 did a part of that, but did not consistently > update all the remaining accessors. > > So I don't think reverting 92d7223a74 permanently is the right way to go, > however it certainly makes sense to do that temporarily to get rid of the > fatal regression, sort all the corner cases and then apply the resulting > replacement fix. See Documentation/memory-barriers.txt, the section "KERNEL I/O BARRIER EFFECTS" According to the specification, there must be a barrier before a write to the MMIO space (paragraph 3) and after a read from MMIO space (parahraph 4) - if this causes unnecessary slowdown, the driver should use readX_relaxed or writeX_relaxed functions - the *_relaxed functions are ordered with each other (see the paragraph "(*) readX_relaxed(), writeX_relaxed()"), but they are not ordered with respect to memory access. The commit 92d7223a74 fixes that requirement (although there is no real driver that was fixed by this), so I don't think it should be reverted. The proper fix should be to add delays to the serial port and readltime clock (or perhaps to all IO-port accesses). > I think ultimately we do want the barriers beforehand, just like the > MIPS port has (and survives) in arch/mips/include/asm/io.h. Observe If the MIPS port doesn't have MMIO barrier after read[bwl], then it is violating the specification. Perhaps there is no existing driver that is hurt by this violation, so this violation survived. > that unlike the Alpha ISA the MIPS ISA does have nontrivial `rmb' aka > the SYNC_RMB hardware instruction. > > Maciej Mikulas
On Wed, 13 May 2020, Ivan Kokshaysky wrote: > On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote: > > Individual PCI port locations correspond to different MMIO locations, so > > yes, accesses to these can be reordered (merging won't happen due to the > > use of the sparse address space). > > Correct, it's how Alpha write buffers work. According to 21064 hardware > reference manual, these buffers are flushed when one of the following > conditions is met: > > 1) The write buffer contains at least two valid entries. > 2) The write buffer contains one valid entry and at least 256 CPU cycles > have elapsed since the execution of the last write buffer-directed > instruction. > 3) The write buffer contains an MB, STQ_C or STL_C instruction. > 4) A load miss is pending to an address currently valid in the write > buffer that requires the write buffer to be flushed. > > I'm certain that in these rtc/serial cases we've got readX arriving > to device *before* preceeding writeX because of 2). That's why small > delay (300-1400 ns, apparently depends on CPU frequency) seemingly > "fixes" the problem. The 4) is not met because loads and stores are > to different ports, and 3) has been broken by commit 92d7223a74. > > So I believe that correct fix would be to revert 92d7223a74 and > add wmb() before [io]writeX macros to meet memory-barriers.txt > requirement. The "wmb" instruction is cheap enough and won't hurt > IO performance too much. > > Ivan. I agree ... and what about readX_relaxed and writeX_relaxed? According to the memory-barriers specification, the _relaxed functions must be ordered w.r.t. each other. If Alpha can't keep them ordered, they should have barriers between them too. Mikulas
On Fri, 22 May 2020, Mikulas Patocka wrote: > > Hmm, having barriers *afterwards* across all the MMIO accessors, even > > ones that do not have such a requirement according to memory-barriers.txt, > > does hurt performance unnecessarily however. What I think has to be done > > is adding barriers beforehand, and then only add barriers afterwards where > > necessary. Commit 92d7223a74 did a part of that, but did not consistently > > update all the remaining accessors. > > > > So I don't think reverting 92d7223a74 permanently is the right way to go, > > however it certainly makes sense to do that temporarily to get rid of the > > fatal regression, sort all the corner cases and then apply the resulting > > replacement fix. > > See Documentation/memory-barriers.txt, the section "KERNEL I/O BARRIER > EFFECTS" > > According to the specification, there must be a barrier before a write to > the MMIO space (paragraph 3) and after a read from MMIO space (parahraph > 4) - if this causes unnecessary slowdown, the driver should use > readX_relaxed or writeX_relaxed functions - the *_relaxed functions are > ordered with each other (see the paragraph "(*) readX_relaxed(), > writeX_relaxed()"), but they are not ordered with respect to memory > access. The specification doesn't require a barrier *after* a write however, which is what I have been concerned about as it may cost hundreds of cycles wasted. I'm not concerned about a barrier after a read (and one beforehand is of course also required). > The commit 92d7223a74 fixes that requirement (although there is no real > driver that was fixed by this), so I don't think it should be reverted. > The proper fix should be to add delays to the serial port and readltime > clock (or perhaps to all IO-port accesses). Adding artificial delays will only paper over the real problem I'm afraid. > > I think ultimately we do want the barriers beforehand, just like the > > MIPS port has (and survives) in arch/mips/include/asm/io.h. Observe > > If the MIPS port doesn't have MMIO barrier after read[bwl], then it is > violating the specification. Perhaps there is no existing driver that is > hurt by this violation, so this violation survived. It does have a barrier, see: /* prevent prefetching of coherent DMA data prematurely */ \ if (!relax) \ rmb(); \ In the light of #5 however: 5. A readX() by a CPU thread from the peripheral will complete before any subsequent delay() loop can begin execution on the same thread. I think it may have to be replaced with a completion barrier however, and that will be tricky because you cannot just issue a second read to the resource accessed after the `rmb' to make the first read complete, as a MMIO read may have side effects (e.g. clearing an interrupt request). So the read would have to happen to a different location. Some architectures have a hardware completion barrier instruction, such as the modern MIPS ISA, which makes life sweet and easy (as much as life can be sweet and easy with a weakly ordered bus model) as no dummy read is then required, but surely not all do (including older MIPS ISA revisions). Maciej
On Fri, 22 May 2020, Mikulas Patocka wrote: > On Wed, 13 May 2020, Ivan Kokshaysky wrote: > > > On Mon, May 11, 2020 at 03:58:24PM +0100, Maciej W. Rozycki wrote: > > > Individual PCI port locations correspond to different MMIO locations, so > > > yes, accesses to these can be reordered (merging won't happen due to the > > > use of the sparse address space). > > > > Correct, it's how Alpha write buffers work. According to 21064 hardware > > reference manual, these buffers are flushed when one of the following > > conditions is met: > > > > 1) The write buffer contains at least two valid entries. > > 2) The write buffer contains one valid entry and at least 256 CPU cycles > > have elapsed since the execution of the last write buffer-directed > > instruction. > > 3) The write buffer contains an MB, STQ_C or STL_C instruction. > > 4) A load miss is pending to an address currently valid in the write > > buffer that requires the write buffer to be flushed. > > > > I'm certain that in these rtc/serial cases we've got readX arriving > > to device *before* preceeding writeX because of 2). That's why small > > delay (300-1400 ns, apparently depends on CPU frequency) seemingly > > "fixes" the problem. The 4) is not met because loads and stores are > > to different ports, and 3) has been broken by commit 92d7223a74. > > > > So I believe that correct fix would be to revert 92d7223a74 and > > add wmb() before [io]writeX macros to meet memory-barriers.txt > > requirement. The "wmb" instruction is cheap enough and won't hurt > > IO performance too much. > > > > Ivan. > > I agree ... and what about readX_relaxed and writeX_relaxed? According to > the memory-barriers specification, the _relaxed functions must be ordered > w.r.t. each other. If Alpha can't keep them ordered, they should have > barriers between them too. > > Mikulas I have found the chapter about I/O in the Alpha reference manual, in the section "5.6.4.7 Implications for Memory Mapped I/O", it says: To reliably communicate shared data from a processor to an I/O device: 1. Write the shared data to a memory-like physical memory region on the processor. 2. Execute an MB instruction. 3. Write a flag (equivalently, send an interrupt or write a register location implemented in the I/O device). The receiving I/O device must: 1. Read the flag (equivalently, detect the interrupt or detect the write to the register location implemented in the I/O device). 2. Execute the equivalent of an MB. 3. Read the shared data. So, we must use MB, not WMB when ordering I/O accesses. The WMB instruction specification says that it orders writes to memory-like locations with other writes to memory-like locations. And writes to non-memory-like locations with other writes to non-memory-like locations. But it doesn't order memory-like writes with I/O-like writes. The section 5.6.3 claims that there are no implied barriers. So, if we want to support the specifications: read*_relaxed and write*_relaxed need at least one barrier before or after (memory-barriers.txt says that they must be ordered with each other, the alpha specification doesn't specify ordering). read and write - read must have a barrier after it, write before it. There must be one more barrier before a read or after a write, to make sure that there is barrier between write+read. Mikulas
Index: linux-stable/arch/alpha/include/asm/io.h =================================================================== --- linux-stable.orig/arch/alpha/include/asm/io.h 2020-05-07 09:54:52.000000000 +0200 +++ linux-stable/arch/alpha/include/asm/io.h 2020-05-07 09:54:52.000000000 +0200 @@ -6,6 +6,7 @@ #include <linux/kernel.h> #include <linux/mm.h> +#include <linux/delay.h> #include <asm/compiler.h> #include <asm/pgtable.h> #include <asm/machvec.h> @@ -481,9 +482,9 @@ extern inline void writeq(u64 b, volatil #define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p)) #define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p)) -#define inb_p inb -#define inw_p inw -#define inl_p inl +#define inb_p(x) (ndelay(1400), inb(x)) +#define inw_p(x) (ndelay(1400), inw(x)) +#define inl_p(x) (ndelay(1400), inl(x)) #define outb_p outb #define outw_p outw #define outl_p outl