
[v7,2/3] PCI: Enable PCIe Relaxed Ordering if supported

Message ID MWHPR12MB1600A6986E4AD13D302F2DC9C8B00@MWHPR12MB1600.namprd12.prod.outlook.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Casey Leedom Aug. 2, 2017, 5:53 p.m. UTC
Okay, here you go.  As you can tell, it's almost a trivial copy of the
cxgb4 patch.
 
  By the way, I realized that we have yet another hole which is likely not
to be fixable.  If we're dealing with a problematic Root Complex, and we
instantiate Virtual Functions and attach them to a Virtual Machine along
with an NVMe device which can deal with Relaxed Ordering TLPs, the VF driver
in the VM will be able to tell that it shouldn't attempt to send RO TLPs to
the RC because it will see the state of its own PCIe Capability Device
Control[Relaxed Ordering Enable] (a copy of the setting in the VF's
corresponding PF), but it won't be able to change that and send non-RO TLPs
to the RC, and RO TLPs to the NVMe device.  Oh well.

  I sure wish that the Intel guys would pop up with a hidden register change
for these problematic Intel RCs that perform poorly with RO TLPs.  Their
silence has been frustrating.

Casey

----------

The cxgb4vf Ethernet driver now queries the PCIe configuration space to
determine whether it can send TLPs to the Root Complex with the Relaxed
Ordering Attribute set.
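
For reference, a minimal sketch of what the pcie_relaxed_ordering_supported()
helper used below presumably boils down to: reporting the Relaxed Ordering
Enable bit in Device Control, which the platform quirk in this series may have
cleared.  The actual implementation lives in patch 1/3, so treat this as an
approximation rather than the authoritative code:

#include <linux/pci.h>

/* Approximate sketch: does this device still have Relaxed Ordering
 * Enable set in its PCIe Capability Device Control register?
 */
static bool pcie_relaxed_ordering_supported(struct pci_dev *pdev)
{
	u16 devctl;

	if (pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl))
		return false;

	return !!(devctl & PCI_EXP_DEVCTL_RELAX_EN);
}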

Comments

Ashok Raj Aug. 3, 2017, 8:31 a.m. UTC | #1
Hi Casey

On Wed, Aug 02, 2017 at 05:53:52PM +0000, Casey Leedom wrote:
>   Okay, here you go.  As you can tell, it's almost a trivial copy of the
> cxgb4 patch.
>  
>   By the way, I realized that we have yet another hole which is likely not
> to be fixable.  If we're dealing with a problematic Root Complex, and we
> instantiate Virtual Functions and attach them to a Virtual Machine along
> with an NVMe device which can deal with Relaxed Ordering TLPs, the VF driver
> in the VM will be able to tell that it shouldn't attempt to send RO TLPs to
> the RC because it will see the state of its own PCIe Capability Device
> Control[Relaxed Ordering Enable] (a copy of the setting in the VF's
> corresponding PF), but it won't be able to change that and send non-RO TLPs
> to the RC, and RO TLPs to the NVMe device.  Oh well.

I don't understand this completely. So your driver would know not to send RO
TLPs to the root complex. But you want to send RO to the NVMe device? This is
the peer-2-peer case, correct?

The issue in the current patchset is that we decide to turn off the device
capability for all devices in the hierarchy, so one would expect that
the NVMe device would also have RO turned off, I suppose.

The other approach is to not turn off the device capability, but let the
driver do the right thing, i.e. distinguish transactions towards system memory
vs. peer-2-peer. But we wanted to take a big hammer approach because on
some platforms there can be data corruption, and we can't trust guest
drivers to do the right thing. This isn't something we can fix in the
current version.

One possible approach is to provide a strict flag, where we use this heavy
hammer approach only on platforms that have serious implications; the
other is to let the driver do the right thing depending on the platform.

Worst case, if the driver doesn't do the right thing, you would see perf issues
but nothing bad would happen. It would allow you to select when to turn
RO on and when to turn it off.
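
As a hypothetical sketch of that driver-decides approach (the enum and helper
below are invented for illustration and are not an existing kernel API;
struct adapter and ROOT_NO_RELAXED_ORDERING come from cxgb4vf/adapter.h as
modified by this patch):

#include <linux/pci.h>

enum dma_target { DMA_TO_ROOT_COMPLEX, DMA_TO_PEER_DEVICE };

/* Decide per destination: suppress RO toward a problematic Root
 * Complex while still using it for peer-to-peer (e.g. NVMe) traffic.
 */
static bool use_relaxed_ordering(const struct adapter *adapter,
				 enum dma_target target)
{
	if (target == DMA_TO_ROOT_COMPLEX &&
	    (adapter->flags & ROOT_NO_RELAXED_ORDERING))
		return false;

	return true;
}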

Cheers,
Ashok
Casey Leedom Aug. 4, 2017, 8:20 p.m. UTC | #2
| From: Raj, Ashok <ashok.raj@intel.com>
| Sent: Thursday, August 3, 2017 1:31 AM
|
| I don't understand this completely. So your driver would know not to send
| RO TLPs to the root complex. But you want to send RO to the NVMe device? This
| is the peer-2-peer case, correct?

Yes, this is the "heavy hammer" issue which you alluded to later.  There are
applications where a device will want to send TLPs to a Root Complex without
Relaxed Ordering set, but will want to use it when sending TLPs to a Peer
device (say, an NVMe storage device).  The current approach doesn't make
that easy ... and in fact, I still don't know how to code a solution for this
with the proposed APIs.  This means that we may be trading off one
performance problem for another and that Relaxed Ordering may be doomed for
use under Linux for the foreseeable future.

As I've noted a number of times, it would be great if the Intel Hardware
Engineers who attempted to implement the Relaxed Ordering semantics in the
current generation of Root Complexes had left the ability to turn off the
logic which is obviously not working.  If there was a way to disable the
logic via an undocumented register, then we could have the Linux PCI Quirk
do that.  Since Relaxed Ordering is just a hint, it's completely legitimate
to simply ignore it.

Casey
Ashok Raj Aug. 4, 2017, 8:21 p.m. UTC | #3
On Fri, Aug 04, 2017 at 08:20:37PM +0000, Casey Leedom wrote:
> | From: Raj, Ashok <ashok.raj@intel.com>
> | Sent: Thursday, August 3, 2017 1:31 AM
> |
> | I don't understand this completely. So your driver would know not to send
> | RO TLPs to the root complex. But you want to send RO to the NVMe device? This
> | is the peer-2-peer case, correct?
> 
> Yes, this is the "heavy hammer" issue which you alluded to later.  There are
> applications where a device will want to send TLPs to a Root Complex without
> Relaxed Ordering set, but will want to use it when sending TLPs to a Peer
> device (say, an NVMe storage device).  The current approach doesn't make
> that easy ... and in fact, I still don't know how to code a solution for this
> with the proposed APIs.  This means that we may be trading off one
> performance problem for another and that Relaxed Ordering may be doomed for
> use under Linux for the foreseeable future.
> 
> As I've noted a number of times, it would be great if the Intel Hardware
> Engineers who attempted to implement the Relaxed Ordering semantics in the
> current generation of Root Complexes had left the ability to turn off the
> logic which is obviously not working.  If there was a way to disable the
> logic via an undocumented register, then we could have the Linux PCI Quirk
> do that.  Since Relaxed Ordering is just a hint, it's completely legitimate
> to simply ignore it.

I suppose you are looking for the existence of a chicken bit to instruct the
port to ignore RO traffic. Then all we would do is turn the chicken bit on,
while still permitting p2p traffic, since we wouldn't turn off the
capability as currently proposed.

Let me look into that and keep you posted.

Cheers,
Ashok
> 
> Casey
Casey Leedom Aug. 4, 2017, 8:48 p.m. UTC | #4
| From: Raj, Ashok <ashok.raj@intel.com>
| Sent: Friday, August 4, 2017 1:21 PM
|
| On Fri, Aug 04, 2017 at 08:20:37PM +0000, Casey Leedom wrote:
| > ...
| > As I've noted a number of times, it would be great if the Intel Hardware
| > Engineers who attempted to implement the Relaxed Ordering semantics in the
| > current generation of Root Complexes had left the ability to turn off the
| > logic which is obviously not working.  If there was a way to disable the
| > logic via an undocumented register, then we could have the Linux PCI Quirk
| > do that.  Since Relaxed Ordering is just a hint, it's completely legitimate
| > to simply ignore it.
|
| I suppose you are looking for the existence of a chicken bit to instruct the
| port to ignore RO traffic. Then all we would do is turn the chicken bit on,
| while still permitting p2p traffic, since we wouldn't turn off the
| capability as currently proposed.
|
| Let me look into that and keep you posted.

Huh, I'd never heard it called a "chicken bit" before, but yes, that's what
I'm talking about.

Whenever our Hardware Designers implement new functionality in our hardware,
they almost always put in A. several "knobs" which can control fundamental
parameters of the new Hardware Feature, and B.  a mechanism of completely
disabling it if necessary.  This stems from the incredibly long Design ->
Deployment cycle for Hardware (as opposed to the edit->compile->run cycle for
software).

It's obvious that handling Relaxed Ordering is a new Hardware Feature for
Intel's Root Complexes since previous versions simply ignored it (because
that's legal[1]).  If I was a Hardware Engineer tasked with implementing
Relaxed Ordering semantics for a device, I would certainly have also
implemented a switch to turn it off in case there were unintended
consequences (performance in this case).

And if there is such a mechanism to simply disable processing of Relaxed
Ordering semantics in the Root Complex, that would be a far easier "fix" for
this problem ... and leave the code in place to continue sending Relaxed
Ordering TLPs for a future Root Complex implementation which got it right ...
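
To make that concrete, here is a purely hypothetical sketch of such a quirk.
The register offset, bit, and Root Port device ID below are all invented for
illustration; no such chicken bit is documented anywhere:

#include <linux/pci.h>

#define RC_CHICKEN_REG	0x180	/* hypothetical, undocumented offset */
#define RC_IGNORE_RO	0x1	/* hypothetical "ignore RO hint" bit */

/* Set the (hypothetical) chicken bit so the Root Port ignores the
 * Relaxed Ordering hint, leaving endpoint RO Enable bits untouched.
 */
static void quirk_rc_ignore_relaxed_ordering(struct pci_dev *pdev)
{
	u32 val;

	pci_read_config_dword(pdev, RC_CHICKEN_REG, &val);
	pci_write_config_dword(pdev, RC_CHICKEN_REG, val | RC_IGNORE_RO);
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x6f01, /* example Root Port */
			quirk_rc_ignore_relaxed_ordering);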

Casey

[1] One can't ~quite~ just ignore the Relaxed Ordering Attribute on an
    incoming Transaction Layer Packet Request: PCIe completion rules (see
    section 2.2.9 of the PCIe 3.0 specification) require that the Relaxed
    Ordering and No Snoop Attributes in a Request TLP be reflected back
    verbatim in any corresponding Response TLP.  (The rules for ID-Based
    Ordering are more complex.)
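
In code terms, the reflection rule in [1] amounts to the following
(illustrative only; in the TLP Attr field, bit 1 is Relaxed Ordering and
bit 0 is No Snoop):

#define TLP_ATTR_RO	0x2	/* Attr[1]: Relaxed Ordering */
#define TLP_ATTR_NS	0x1	/* Attr[0]: No Snoop */

/* A completer must copy the request's RO and No Snoop attributes
 * verbatim into any completion; IDO follows more complex rules.
 */
static unsigned int completion_attrs(unsigned int request_attrs)
{
	return request_attrs & (TLP_ATTR_RO | TLP_ATTR_NS);
}
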
David Laight Aug. 7, 2017, 9:04 a.m. UTC | #5
From: Casey Leedom
> Sent: 04 August 2017 21:49
...
> Whenever our Hardware Designers implement new functionality in our hardware,
> they almost always put in A. several "knobs" which can control fundamental
> parameters of the new Hardware Feature, and B.  a mechanism of completely
> disabling it if necessary.  This stems from the incredibly long Design ->
> Deployment cycle for Hardware (as opposed to the edit->compile->run cycle for
> software).

Indeed, I'd also expect there to be an undocumented flag to turn
it on (broken) in earlier parts to allow testing.

	David

Patch

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
index 109bc63..08c6ddb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
@@ -408,6 +408,7 @@  enum { /* adapter flags */
 	USING_MSI          = (1UL << 1),
 	USING_MSIX         = (1UL << 2),
 	QUEUES_BOUND       = (1UL << 3),
+	ROOT_NO_RELAXED_ORDERING = (1UL << 4),
 };
 
 /*
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index ac7a150..59e7639 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2888,6 +2888,24 @@  static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 	 */
 	adapter->name = pci_name(pdev);
 	adapter->msg_enable = DFLT_MSG_ENABLE;
+
+	/* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+	 * Ingress Packet Data to Free List Buffers in order to allow for
+	 * chipset performance optimizations between the Root Complex and
+	 * Memory Controllers.  (Messages to the associated Ingress Queue
+	 * notifying new Packet Placement in the Free List Buffers will be
+	 * sent without the Relaxed Ordering Attribute, thus guaranteeing that
+	 * all preceding PCIe Transaction Layer Packets will be processed
+	 * first.)  But some Root Complexes have various issues with Upstream
+	 * Transaction Layer Packets with the Relaxed Ordering Attribute set.
+	 * PCIe devices under such Root Complexes will have the Relaxed
+	 * Ordering bit cleared in their configuration space, so we check our
+	 * PCIe configuration space to see if it's flagged with advice against
+	 * using Relaxed Ordering.
+	 */
+	if (!pcie_relaxed_ordering_supported(pdev))
+		adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
 	err = adap_init0(adapter);
 	if (err)
 		goto err_unmap_bar;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index e37dde2..05498e7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -2205,6 +2205,7 @@  int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
 	struct port_info *pi = netdev_priv(dev);
 	struct fw_iq_cmd cmd, rpl;
 	int ret, iqandst, flsz = 0;
+	int relaxed = !(adapter->flags & ROOT_NO_RELAXED_ORDERING);
 
 	/*
 	 * If we're using MSI interrupts and we're not initializing the
@@ -2300,6 +2301,8 @@  int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
 			cpu_to_be32(
 				FW_IQ_CMD_FL0HOSTFCMODE_V(SGE_HOSTFCMODE_NONE) |
 				FW_IQ_CMD_FL0PACKEN_F |
+				FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+				FW_IQ_CMD_FL0DATARO_V(relaxed) |
 				FW_IQ_CMD_FL0PADEN_F);
 
 		/* In T6, for egress queue type FL there is internal overhead