[v9,1/4] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

Message ID 1501917313-9812-2-git-send-email-dingtianhong@huawei.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Ding Tianhong Aug. 5, 2017, 7:15 a.m. UTC
From: Casey Leedom <leedom@chelsio.com>

This patch adds a new flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING, to indicate
that the Relaxed Ordering (RO) attribute should not be used for Transaction
Layer Packets (TLPs) targeted towards the affected root complexes. The
current list of affected parts includes the root complexes of some Intel
Xeon processors, which suffer from a flow control credit issue that results
in performance problems. On these affected parts RO can still be used for
peer-to-peer traffic. The AMD A1100 ARM ("SEATTLE") root complexes don't
obey the PCIe 3.0 ordering rules, which could lead to data corruption.
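The flag only records the problem; a driver still has to look for it before setting RO on its TLPs. A minimal user-space sketch of that check follows, assuming the flag sits on the endpoint's Root Port (the Intel entries in the patch match CPU Root Port Device IDs). The mock_pci_dev type and both helper functions are hypothetical stand-ins, not the kernel's struct pci_dev or its API; only the flag's bit value comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit value from the include/linux/pci.h hunk of this patch. */
#define PCI_DEV_FLAGS_NO_RELAXED_ORDERING (1u << 12)

/* Hypothetical, heavily simplified stand-in for struct pci_dev. */
struct mock_pci_dev {
	unsigned int dev_flags;
	bool is_root_port;
	struct mock_pci_dev *upstream; /* NULL above the top of the hierarchy */
};

/* Walk upstream from dev until a Root Port is found (hypothetical helper). */
static const struct mock_pci_dev *find_root_port(const struct mock_pci_dev *dev)
{
	for (; dev != NULL; dev = dev->upstream)
		if (dev->is_root_port)
			return dev;
	return NULL;
}

/*
 * May dev set RO on TLPs headed upstream?  Only if its Root Port has not
 * been marked by the quirk.  If no Root Port is found, this sketch
 * conservatively answers "no".
 */
static bool may_use_relaxed_ordering(const struct mock_pci_dev *dev)
{
	const struct mock_pci_dev *root;

	assert(dev != NULL);
	root = find_root_port(dev);
	if (root == NULL)
		return false;
	return (root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING) == 0;
}
```

For example, an endpoint whose upstream pointer leads to a Root Port carrying the flag gets "no"; clearing the flag on that Root Port flips the answer to "yes".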

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Ashok Raj <ashok.raj@intel.com>
---
 drivers/pci/quirks.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h  |  2 ++
 2 files changed, 90 insertions(+)

Comments

Bjorn Helgaas Aug. 8, 2017, 11:22 p.m. UTC | #1
On Sat, Aug 05, 2017 at 03:15:10PM +0800, Ding Tianhong wrote:
> From: Casey Leedom <leedom@chelsio.com>
> 
> This patch adds a new flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING, to indicate
> that the Relaxed Ordering (RO) attribute should not be used for Transaction
> Layer Packets (TLPs) targeted towards the affected root complexes. The
> current list of affected parts includes the root complexes of some Intel
> Xeon processors, which suffer from a flow control credit issue that results
> in performance problems. On these affected parts RO can still be used for
> peer-to-peer traffic. The AMD A1100 ARM ("SEATTLE") root complexes don't
> obey the PCIe 3.0 ordering rules, which could lead to data corruption.

This needs to include a link to the Intel spec
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
sec 3.9.1).

It should also include a pointer to the AMD erratum, if available, or
at least some reference to how we know it doesn't obey the rules.

Ashok, thanks for chiming in.  Now that you have, I have a few more
questions for you:

  - Is the above doc the one you mentioned as being now public?
  
  - Is this considered a hardware erratum?
  
  - If so, is there a pointer to that as well?
  
  - If this is not considered an erratum, can you provide any guidance
    about how an OS should determine when it should use RO?
    
Relying on a list of device IDs in an optimization manual is OK for an
erratum, but if it's *not* an erratum, it seems like a hole in the
specs because as far as I know there's no generic way for the OS to
discover whether to use RO.

Bjorn
Casey Leedom Aug. 9, 2017, 1:40 a.m. UTC | #2
| From: Bjorn Helgaas <helgaas@kernel.org>
| Sent: Tuesday, August 8, 2017 4:22 PM
| 
| This needs to include a link to the Intel spec
| (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
| sec 3.9.1).

  In the commit message or as a comment?  Regardless, I agree.  It's always
nice to be able to go back and see what the official documentation says.
However, that said, links on the internet are ... fragile as time goes by,
so we might want to simply quote section 3.9.1 in the commit message since
it's relatively short:

    3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
          and Toward MMIO Regions (P2P)

    In order to maximize performance for PCIe devices in the processors
    listed in Table 3-6 below, the software should determine whether the
    accesses are toward coherent memory (system memory) or toward MMIO
    regions (P2P access to other devices). If the access is toward MMIO
    region, then software can command HW to set the RO bit in the TLP
    header, as this would allow hardware to achieve maximum throughput for
    these types of accesses. For accesses toward coherent memory, software
    can command HW to clear the RO bit in the TLP header (no RO), as this
    would allow hardware to achieve maximum throughput for these types of
    accesses.

    Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
               PCIe Performance

    Processor                            CPU RP Device IDs

    Intel Xeon processors based on       6F01H-6F0EH
    Broadwell microarchitecture

    Intel Xeon processors based on       2F01H-2F0EH
    Haswell microarchitecture
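Table 3-6 amounts to two inclusive Device ID ranges. A minimal user-space sketch of matching a Root Port's vendor/device pair against it is below; the struct and function names are hypothetical, and 0x8086 is Intel's PCI vendor ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_VENDOR_ID_INTEL 0x8086

/* Inclusive CPU Root Port Device ID range (hypothetical helper type). */
struct devid_range {
	uint16_t first;
	uint16_t last;
};

/* Transcribed from Table 3-6 of the Intel optimization manual. */
static const struct devid_range table_3_6[] = {
	{ 0x6f01, 0x6f0e }, /* Xeon, Broadwell microarchitecture */
	{ 0x2f01, 0x2f0e }, /* Xeon, Haswell microarchitecture */
};

/* True if vendor/device names a Root Port listed in Table 3-6. */
static bool listed_in_table_3_6(uint16_t vendor, uint16_t device)
{
	if (vendor != PCI_VENDOR_ID_INTEL)
		return false;

	for (size_t i = 0; i < sizeof(table_3_6) / sizeof(table_3_6[0]); i++) {
		const struct devid_range *r = &table_3_6[i];

		assert(r->first <= r->last); /* sanity-check the transcription */
		if (device >= r->first && device <= r->last)
			return true;
	}
	return false;
}
```

Each range covers 14 Device IDs, which lines up with the 28 Intel DECLARE_PCI_FIXUP_CLASS_EARLY() entries in the patch.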

| It should also include a pointer to the AMD erratum, if available, or
| at least some reference to how we know it doesn't obey the rules.

  Getting an ACK from AMD seems like a forlorn cause at this point.  My
contact was Bob Shaw <Bob.Shaw@amd.com> and he stopped responding to my
messages almost a year ago, saying that all of AMD's energies were being
redirected towards upcoming x86 products (likely Ryzen as we now know).  As
far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
SoC.

  On the specific issue, I can certainly write up something even more
extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
review the comment I wrote up and tell me if you'd like something even more
detailed -- I'm usually accused of writing comments which are too long, so
this would be a new one on me ... :-)

| Ashok, thanks for chiming in.  Now that you have, I have a few more
| questions for you:

  I can answer a few of these:

|  - Is the above doc the one you mentioned as being now public?

  Yes.  Ashok worked with me to the extent he was allowed prior to the
publishing of the public technical note, but he couldn't say much.  (Believe
it or not, it is possible to say less than the quoted section above.)  When
the note was published, Patrick Cramer sent me the note about it and pointed
me at section 3.9.1.

|  - Is this considered a hardware erratum?

  I certainly consider it a Hardware Bug.  And I'm really hoping that Ashok
will be able to find a "Chicken Bit" which allows the broken feature to be
turned off.  Remember, the Relaxed Ordering Attribute on a Transaction Layer
Packet is simply a HINT.  It is perfectly reasonable for a compliant
implementation to simply ignore the Relaxed Ordering Attribute on an
incoming TLP Request.  The sole responsibility of a compliant implementation
is to return the exact same Relaxed Ordering and No Snoop Attributes in the
corresponding Completion TLP.  (The rules for the ID-Based Ordering
Attribute are more complex.)
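The "return the same attributes" obligation is small enough to write down. A minimal sketch, assuming the usual numbering of the TLP Attr field (Attr[0] No Snoop, Attr[1] Relaxed Ordering, Attr[2] ID-Based Ordering); the macro and function names are hypothetical, and IDO is deliberately left out because its rules are more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Logical TLP Attribute bits (assumed numbering, see text). */
#define TLP_ATTR_NS  (1u << 0) /* Attr[0]: No Snoop */
#define TLP_ATTR_RO  (1u << 1) /* Attr[1]: Relaxed Ordering */
#define TLP_ATTR_IDO (1u << 2) /* Attr[2]: ID-Based Ordering */

/*
 * RO and NS bits a Completion must carry for a given Request: they are
 * copied unchanged, whether or not the Completer honored RO while
 * processing the Request.  IDO is not modeled here.
 */
static uint8_t completion_ro_ns(uint8_t request_attr)
{
	assert((request_attr & ~0x7u) == 0); /* Attr is a 3-bit field */
	return request_attr & (TLP_ATTR_RO | TLP_ATTR_NS);
}
```

Note that nothing here depends on whether the Completer actually exploited RO, which is why ignoring the hint entirely is a compliant choice.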
 
  Earlier Intel Root Complexes did exactly this: they ignored the Relaxed
Ordering Attribute and there was no performance difference for
using/not-using it.  It's pretty obvious that an attempt was made to
implement optimizations surrounding the use of Relaxed Ordering and they
didn't work.

|  - If so, is there a pointer to that as well?

  Intel is historically tight-lipped about admitting any bugs/errata in their
products.  I'm guessing that the above quoted Section 3.9.1 is likely to be
all we ever get. The language above regarding TLPs targeting Coherent
Shared Memory is basically as much of an admission that they got it wrong
as we're going to get.  But heck, maybe we'll get lucky ...  Especially with
regard to the hoped for "Chicken Bit" ...

|  - If this is not considered an erratum, can you provide any guidance
|    about how an OS should determine when it should use RO?

  Software?  We don't need no stinking software!

  Sorry, I couldn't resist.

| Relying on a list of device IDs in an optimization manual is OK for an
| erratum, but if it's *not* an erratum, it seems like a hole in the specs
| because as far as I know there's no generic way for the OS to discover
| whether to use RO.

  Well, here's to hoping that Ashok and/or Patrick are able to offer more
detailed information ...

Casey
Bjorn Helgaas Aug. 9, 2017, 3:02 a.m. UTC | #3
On Wed, Aug 09, 2017 at 01:40:01AM +0000, Casey Leedom wrote:
> | From: Bjorn Helgaas <helgaas@kernel.org>
> | Sent: Tuesday, August 8, 2017 4:22 PM
> | 
> | This needs to include a link to the Intel spec
> | (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
> | sec 3.9.1).
> 
>   In the commit message or as a comment?  Regardless, I agree.  It's always
> nice to be able to go back and see what the official documentation says.
> However, that said, links on the internet are ... fragile as time goes by,
> so we might want to simply quote section 3.9.1 in the commit message since
> it's relatively short:
> 
>     3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
>           and Toward MMIO Regions (P2P)
> 
>     In order to maximize performance for PCIe devices in the processors
>     listed in Table 3-6 below, the software should determine whether the
>     accesses are toward coherent memory (system memory) or toward MMIO
>     regions (P2P access to other devices). If the access is toward MMIO
>     region, then software can command HW to set the RO bit in the TLP
>     header, as this would allow hardware to achieve maximum throughput for
>     these types of accesses. For accesses toward coherent memory, software
>     can command HW to clear the RO bit in the TLP header (no RO), as this
>     would allow hardware to achieve maximum throughput for these types of
>     accesses.
> 
>     Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
>                PCIe Performance
> 
>     Processor                            CPU RP Device IDs
> 
>     Intel Xeon processors based on       6F01H-6F0EH
>     Broadwell microarchitecture
> 
>     Intel Xeon processors based on       2F01H-2F0EH
>     Haswell microarchitecture

Agreed, links are prone to being broken.  I would include in the
changelog the complete title and order number, along with the link as
a footnote.  Wouldn't hurt to quote the section too, since it's short.

> | It should also include a pointer to the AMD erratum, if available, or
> | at least some reference to how we know it doesn't obey the rules.
> 
>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
> contact was Bob Shaw <Bob.Shaw@amd.com> and he stopped responding to my
> messages almost a year ago, saying that all of AMD's energies were being
> redirected towards upcoming x86 products (likely Ryzen as we now know).  As
> far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
> SoC.
> 
>   On the specific issue, I can certainly write up something even more
> extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
> review the comment I wrote up and tell me if you'd like something even more
> detailed -- I'm usually accused of writing comments which are too long, so
> this would be a new one on me ... :-)

If you have any bug reports with info about how you debugged it and
concluded that Seattle is broken, you could include a link (probably
in the changelog).  But if there isn't anything, there isn't anything.

I might reorganize those patches as:

  1) Add a PCI_DEV_FLAGS_RELAXED_ORDERING_BROKEN flag, the quirk that
  sets it, and the current patch [2/4] that uses it.

  2) Add the Intel DECLARE_PCI_FIXUP_CLASS_EARLY()s with the Intel
  details.

  3) Add the AMD DECLARE_PCI_FIXUP_CLASS_EARLY()s with the AMD
  details.
Ding Tianhong Aug. 9, 2017, 12:17 p.m. UTC | #4
On 2017/8/9 11:02, Bjorn Helgaas wrote:
> On Wed, Aug 09, 2017 at 01:40:01AM +0000, Casey Leedom wrote:
>> | From: Bjorn Helgaas <helgaas@kernel.org>
>> | Sent: Tuesday, August 8, 2017 4:22 PM
>> | 
>> | This needs to include a link to the Intel spec
>> | (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
>> | sec 3.9.1).
>>
>>   In the commit message or as a comment?  Regardless, I agree.  It's always
>> nice to be able to go back and see what the official documentation says.
>> However, that said, links on the internet are ... fragile as time goes by,
>> so we might want to simply quote section 3.9.1 in the commit message since
>> it's relatively short:
>>
>>     3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
>>           and Toward MMIO Regions (P2P)
>>
>>     In order to maximize performance for PCIe devices in the processors
>>     listed in Table 3-6 below, the software should determine whether the
>>     accesses are toward coherent memory (system memory) or toward MMIO
>>     regions (P2P access to other devices). If the access is toward MMIO
>>     region, then software can command HW to set the RO bit in the TLP
>>     header, as this would allow hardware to achieve maximum throughput for
>>     these types of accesses. For accesses toward coherent memory, software
>>     can command HW to clear the RO bit in the TLP header (no RO), as this
>>     would allow hardware to achieve maximum throughput for these types of
>>     accesses.
>>
>>     Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
>>                PCIe Performance
>>
>>     Processor                            CPU RP Device IDs
>>
>>     Intel Xeon processors based on       6F01H-6F0EH
>>     Broadwell microarchitecture
>>
>>     Intel Xeon processors based on       2F01H-2F0EH
>>     Haswell microarchitecture
> 
> Agreed, links are prone to being broken.  I would include in the
> changelog the complete title and order number, along with the link as
> a footnote.  Wouldn't hurt to quote the section too, since it's short.
> 

OK

>> | It should also include a pointer to the AMD erratum, if available, or
>> | at least some reference to how we know it doesn't obey the rules.
>>
>>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
>> contact was Bob Shaw <Bob.Shaw@amd.com> and he stopped responding to my
>> messages almost a year ago, saying that all of AMD's energies were being
>> redirected towards upcoming x86 products (likely Ryzen as we now know).  As
>> far as I can tell AMD has walked away from their A1100 (AKA "Seattle") ARM
>> SoC.
>>
>>   On the specific issue, I can certainly write up something even more
>> extensive than I wrote up for the comment in drivers/pci/quirks.c.  Please
>> review the comment I wrote up and tell me if you'd like something even more
>> detailed -- I'm usually accused of writing comments which are too long, so
>> this would be a new one on me ... :-)
> 
> If you have any bug reports with info about how you debugged it and
> concluded that Seattle is broken, you could include a link (probably
> in the changelog).  But if there isn't anything, there isn't anything.
> 
> I might reorganize those patches as:
> 
>   1) Add a PCI_DEV_FLAGS_RELAXED_ORDERING_BROKEN flag, the quirk that
>   sets it, and the current patch [2/4] that uses it.
> 
>   2) Add the Intel DECLARE_PCI_FIXUP_CLASS_EARLY()s with the Intel
>   details.
> 
>   3) Add the AMD DECLARE_PCI_FIXUP_CLASS_EARLY()s with the AMD
>   details.
> 

OK, I can reorganize it, but I still need Casey to give me the link for
Seattle; otherwise I could remove the AMD part and wait until someone
provides one. Thanks

Ding
Ashok Raj Aug. 9, 2017, 3:58 p.m. UTC | #5
Hi Bjorn

On Tue, Aug 08, 2017 at 06:22:00PM -0500, Bjorn Helgaas wrote:
> On Sat, Aug 05, 2017 at 03:15:10PM +0800, Ding Tianhong wrote:
> > From: Casey Leedom <leedom@chelsio.com>
> > 
> > Root complexes don't obey the PCIe 3.0 ordering rules, which could lead to
> > data corruption.
> 
> This needs to include a link to the Intel spec
> (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf,
> sec 3.9.1).
> 
> It should also include a pointer to the AMD erratum, if available, or
> at least some reference to how we know it doesn't obey the rules.
> 
> Ashok, thanks for chiming in.  Now that you have, I have a few more
> questions for you:
> 
>   - Is the above doc the one you mentioned as being now public?

Yes. 
>   
>   - Is this considered a hardware erratum?

I would think so. I have tried to pursue the publication in that direction,
but it morphed into the optimization guide instead. Once it got into some
open doc I stopped pushing, but I will continue trying to get this into an
erratum. I do agree that's the right placeholder for this.

>   
>   - If so, is there a pointer to that as well?
>   
>   - If this is not considered an erratum, can you provide any guidance
>     about how an OS should determine when it should use RO?

The optimization guide states that it only applies to transactions targeting
system memory. For peer-to-peer traffic, RO is allowed and has a perf upside.

As Casey pointed out in an earlier thread, we chose the heavy-hammer approach
because there are some parts that can lead to data corruption as opposed to
perf degradation.

This looks ugly, but maybe we can have two flags: one that indicates it's a
strict no-no, and one that says no to system memory only. That way the driver
can determine when the device should turn the hint on in the TLP.
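The two-flag idea could look roughly like the sketch below. Both flag names and the dma_target enum are hypothetical (neither flag exists in this patch), and tying "system memory only" to the Table 3-6 parts and "never" to a reordering root complex is an assumption drawn from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags for the two-flag idea; neither is in the patch. */
#define RP_NO_RO_ANY           (1u << 0) /* e.g. a root complex that reorders incorrectly */
#define RP_NO_RO_SYSTEM_MEMORY (1u << 1) /* e.g. the Table 3-6 performance issue */

enum dma_target {
	DMA_TO_SYSTEM_MEMORY, /* coherent host memory */
	DMA_TO_PEER_MMIO,     /* peer-to-peer access to another device's MMIO */
};

/* Decide whether a device should set RO on a TLP to the given target. */
static bool want_relaxed_ordering(unsigned int rp_flags, enum dma_target target)
{
	assert(target == DMA_TO_SYSTEM_MEMORY || target == DMA_TO_PEER_MMIO);

	if (rp_flags & RP_NO_RO_ANY)
		return false; /* strict no-no: never set the hint */
	if (target == DMA_TO_SYSTEM_MEMORY && (rp_flags & RP_NO_RO_SYSTEM_MEMORY))
		return false; /* only system-memory traffic is affected */
	return true;
}
```

With only the "system memory" flag set, peer-to-peer TLPs keep RO while host-memory TLPs drop it, which is the split the optimization guide describes.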

>     
> Relying on a list of device IDs in an optimization manual is OK for an
> erratum, but if it's *not* an erratum, it seems like a hole in the

Good point. For this specific case it's really an erratum, but for some
reason they decided to use this doc rather than the generic errata
data sheet, which would have been the preferred place to document it.

> specs because as far as I know there's no generic way for the OS to
> discover whether to use RO.
> 

Cheers,
Ashok
Casey Leedom Aug. 9, 2017, 4:36 p.m. UTC | #6
| From: Ding Tianhong <dingtianhong@huawei.com>
| Sent: Wednesday, August 9, 2017 5:17 AM
|
| On 2017/8/9 11:02, Bjorn Helgaas wrote:
| >
| > On Wed, Aug 09, 2017 at 01:40:01AM +0000, Casey Leedom wrote:
| > >
| >> | From: Bjorn Helgaas <helgaas@kernel.org>
| >> | Sent: Tuesday, August 8, 2017 4:22 PM
| >> | ...
| >> | It should also include a pointer to the AMD erratum, if available, or
| >> | at least some reference to how we know it doesn't obey the rules.
| >>
| >>   Getting an ACK from AMD seems like a forlorn cause at this point.  My
| >> contact was Bob Shaw <Bob.Shaw@amd.com> and he stopped responding to my
| >> messages almost a year ago, saying that all of AMD's energies were being
| >> redirected towards upcoming x86 products (likely Ryzen as we now know).
| >> As far as I can tell AMD has walked away from their A1100 (AKA
| >> "Seattle") ARM SoC.
| >>
| >>   On the specific issue, I can certainly write up something even more
| >> extensive than I wrote up for the comment in drivers/pci/quirks.c.
| >> Please review the comment I wrote up and tell me if you'd like
| >> something even more detailed -- I'm usually accused of writing comments
| >> which are too long, so this would be a new one on me ... :-)
| >
| > If you have any bug reports with info about how you debugged it and
| > concluded that Seattle is broken, you could include a link (probably
| > in the changelog).  But if there isn't anything, there isn't anything.
| ...
| OK, I can reorganize it, but I still need Casey to give me the link for
| Seattle; otherwise I could remove the AMD part and wait until someone
| provides one. Thanks

There are no links and I was never given an internal bug number at AMD.  As
I said, they stopped responding to my notes about a year ago, saying that
they were moving the focus of all their people and no longer had resources
to pursue the issue.  Hopefully for them, Ryzen doesn't have the same
Data Corruption problem ...

As for how we diagnosed it: with our Ingress Packet delivery, we have the
Ingress Packet Data delivered (DMA Write) into Free List Buffers, and then
a small message (DMA Write) to a "Response Queue" indicating delivery
of the Ingress Packet Data into the Free List Buffers.  The Transaction
Layer Packets which convey the Ingress Packet Data all have the Relaxed
Ordering Attribute set, while the following TLP carrying the Ingress Data
delivery notification into the Response Queue does not have the Relaxed
Ordering Attribute set.

The rules for processing TLPs with and without the Relaxed Ordering
Attribute set are covered in Section 2.4.1 of the PCIe 3.0 specification
(Revision 3.0 November 10, 2010).  Table 2-34 "Ordering Rules Summary"
covers the cases where one TLP may "pass" (be processed earlier than) a
preceding TLP.  In the case we're talking about, we have a sequence of one
or more Posted DMA Write TLPs with the Relaxed Ordering Attribute set and a
following Posted DMA Write TLP without the Relaxed Ordering Attribute set.
Thus we need to look at the Row A, Column 2 cell of Table 2-34 governing
when a Posted Request may "pass" a preceding Posted Request.  In that cell
we have:

    a) No
    b) Y/N

with the explanatory text:

    A2a    A Posted Request must not pass another Posted Request
           unless A2b applies.

    A2b    A Posted Request with RO[23] Set is permitted to pass
           another Posted Request[24].  A Posted Request with IDO
           Set is permitted to pass another Posted Request if the
           two Requester IDs are different.

    [23] In this section, "RO" is an abbreviation for the Relaxed
         Ordering Attribute field.

    [24] Some usages are enabled by not implementing this passing
         (see the No RO-enabled PR-PR Passing bit in Section
         7.8.15).

In our case, we were getting notifications of Ingress Packet Delivery in our
Response Queues, but not all of the Ingress Packet Data Posted DMA Write
TLPs had been processed yet by the Root Complex.  As a result, we were
picking up old stale memory data before those lagging Ingress Packet Data
TLPs could be processed.  This is a clear violation of the PCIe 3.0 TLP
processing rules outlined above.
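Cell A2 can be checked mechanically. The sketch below uses simplified, hypothetical types (only the RO and IDO attributes and the Requester ID) and tests a Root Complex's processing order for a sequence like the ingress delivery described above: RO data writes followed by a non-RO notification write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified Posted Write Request: only the fields cell A2 looks at. */
struct posted_req {
	bool ro;               /* Relaxed Ordering attribute */
	bool ido;              /* ID-Based Ordering attribute */
	uint16_t requester_id; /* Bus/Device/Function of the requester */
};

/*
 * Row A, Column 2 of Table 2-34: may `later` pass the earlier Posted
 * Request `earlier`?  A2a says no unless A2b applies; A2b permits it if
 * `later` has RO set, or has IDO set and the Requester IDs differ.
 * "Permitted" only: a Root Complex is free never to pass.
 */
static bool may_pass(const struct posted_req *later, const struct posted_req *earlier)
{
	if (later->ro)
		return true;
	if (later->ido && later->requester_id != earlier->requester_id)
		return true;
	return false;
}

/*
 * issue[] holds the TLPs in the order they were sent; order[k] is the
 * index into issue[] of the k-th TLP the Root Complex processed.  Returns
 * true if no TLP was processed ahead of an earlier-sent TLP that it was
 * not permitted to pass.
 */
static bool order_is_legal(const struct posted_req *issue, const size_t *order, size_t n)
{
	for (size_t a = 0; a < n; a++) {
		for (size_t b = a + 1; b < n; b++) {
			size_t first = order[a], second = order[b];

			assert(first < n && second < n);
			/* processed first but sent later: it passed `second` */
			if (first > second && !may_pass(&issue[first], &issue[second]))
				return false;
		}
	}
	return true;
}
```

For the sequence {data (RO), data (RO), notification (no RO)}, swapping the two data writes is accepted under A2b, while processing the notification before either data write is rejected, which is the kind of reordering reported above.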

Does that help?

Casey
Casey Leedom Aug. 9, 2017, 4:46 p.m. UTC | #7
| From: Raj, Ashok <ashok.raj@intel.com>
| Sent: Wednesday, August 9, 2017 8:58 AM
| ...
| As Casey pointed out in an earlier thread, we chose the heavy hammer
| approach because there are some that can lead to data-corruption as opposed
| to perf degradation.

Careful.  As far as I'm aware, there is no Data Corruption problem
whatsoever with Intel Root Ports and processing of Transaction Layer Packets
with and without the Relaxed Ordering Attribute set.

The only issue which we've discovered with relatively recent Intel Root Port
implementations and the use of the Relaxed Ordering Attribute is a
performance issue.  To the best of our ability to analyze the PCIe traces,
it appeared that the Intel Root Complex delayed returning Link Flow Control
Credits resulting in lowered performance (total bandwidth).  When we used
Relaxed Ordering for Ingress Packet Data delivery on a 100Gb/s Ethernet
link with 1500-byte MTU, we were pegged at ~75Gb/s.  Once we disabled
Relaxed Ordering, we were able to deliver Ingress Packet Data to Host Memory
at the full link rate.

Casey
Ashok Raj Aug. 9, 2017, 6 p.m. UTC | #8
On Wed, Aug 09, 2017 at 04:46:07PM +0000, Casey Leedom wrote:
> | From: Raj, Ashok <ashok.raj@intel.com>
> | Sent: Wednesday, August 9, 2017 8:58 AM
> | ...
> | As Casey pointed out in an earlier thread, we chose the heavy hammer
> | approach because there are some that can lead to data-corruption as opposed
> | to perf degradation.
> 
> Careful.  As far as I'm aware, there is no Data Corruption problem
> whatsoever with Intel Root Ports and processing of Transaction Layer Packets
> with and without the Relaxed Ordering Attribute set.

That's right, no data corruption on Intel parts :-). It was with another
vendor. There is only a performance issue with Intel root ports in the parts
identified by the optimization guide.

Cheers,
Ashok
Casey Leedom Aug. 9, 2017, 8:11 p.m. UTC | #9
| From: Raj, Ashok <ashok.raj@intel.com>
| Sent: Wednesday, August 9, 2017 11:00 AM
|
| On Wed, Aug 09, 2017 at 04:46:07PM +0000, Casey Leedom wrote:
| > | From: Raj, Ashok <ashok.raj@intel.com>
| > | Sent: Wednesday, August 9, 2017 8:58 AM
| > | ...
| > | As Casey pointed out in an earlier thread, we chose the heavy hammer
| > | approach because there are some that can lead to data-corruption as
| > | opposed to perf degradation.
| >
| > Careful.  As far as I'm aware, there is no Data Corruption problem
| > whatsoever with Intel Root Ports and processing of Transaction Layer
| > Packets with and without the Relaxed Ordering Attribute set.
|
| That's right, no data corruption on Intel parts :-). It was with another
| vendor. There is only a performance issue with Intel root ports in the
| parts identified by the optimization guide.

Yes, I didn't want you to get into any trouble over that possible reading of
what you wrote.

Any progress on the "Chicken Bit" investigation?  Being able to disable the
non-optimal Relaxed Ordering "optimization" would be the best PCI Quirk of
all ...

Casey

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6967c6b..5c9e125 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4016,6 +4016,94 @@  static void quirk_tw686x_class(struct pci_dev *pdev)
 			      quirk_tw686x_class);
 
 /*
+ * Some devices have problems with Transaction Layer Packets with the Relaxed
+ * Ordering Attribute set.  Such devices should mark themselves and other
+ * Device Drivers should check before sending TLPs with RO set.
+ */
+static void quirk_relaxedordering_disable(struct pci_dev *dev)
+{
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
+}
+
+/*
+ * Intel Xeon processors based on Broadwell/Haswell microarchitecture Root
+ * Complexes have a Flow Control Credit issue which can cause performance
+ * problems with Upstream Transaction Layer Packets with Relaxed Ordering set.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f01, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f03, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f05, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f06, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f07, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f09, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0a, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0b, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0c, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0d, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0e, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f01, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f02, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f03, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f04, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f05, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f06, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f07, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f08, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f09, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0a, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0b, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0c, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0d, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f0e, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+
+/*
+ * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
+ * where Upstream Transaction Layer Packets with the Relaxed Ordering
+ * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
+ * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
+ * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0
+ * November 10, 2010).  As a result, on this platform we can't use Relaxed
+ * Ordering for Upstream TLPs.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
+			      quirk_relaxedordering_disable);
+
+/*
  * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
  * values for the Attribute as were supplied in the header of the
  * corresponding Request, except as explicitly allowed when IDO is used."
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4869e66..412ec1c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -188,6 +188,8 @@  enum pci_dev_flags {
 	 * the direct_complete optimization.
 	 */
 	PCI_DEV_FLAGS_NEEDS_RESUME = (__force pci_dev_flags_t) (1 << 11),
+	/* Don't use Relaxed Ordering for TLPs directed at this device */
+	PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 12),
 };
 
 enum pci_irq_reroute_variant {