Patchwork PCI: update device mps when doing pci hotplug

Submitter Yijing Wang
Date Feb. 5, 2013, 3:55 a.m.
Message ID <1360036520-31032-1-git-send-email-wangyijing@huawei.com>
Permalink /patch/218144/
State Not Applicable

Comments

Yijing Wang - Feb. 5, 2013, 3:55 a.m.
Currently we don't update a device's MPS value when doing
PCI device hot-add. The hot-added device's MPS is left at
the default value (128B), but the upstream port's MPS, which
was set by firmware during system boot, may be larger than
128B. In that case the newly added device may not work
correctly.

For reference, see the discussions at
http://marc.info/?l=linux-pci&m=135420434508910&w=2
and
http://marc.info/?l=linux-pci&m=134815603407842&w=2

Reported-by: Joe Jin <joe.jin@oracle.com>
Reported-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: Jon Mason <jdmason@kudzu.us>
---
 drivers/pci/probe.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)
Yijing Wang - May 28, 2013, 3:15 a.m.
Hi Bjorn and Jon,
   Sorry to disturb you. This patch was sent a long time ago, but nobody seems to have commented on it.
Do you have any comments on this patch?

This patch tries to update the device MPS in the following cases:
1) Target device under a root port
   A root port can split TLPs, so a target device MPS greater than the root port MPS is OK.
   But a root port MPS greater than the target device MPS is bad, because the target device cannot
   receive a TLP whose payload is larger than its MPS. So if a target device is under a root port, I think
   we should make its MPS greater than or equal to the root port MPS.
2) Target device under a non-root port
   We assume the target device is both a transmitter and a receiver, so the safest approach is to set the
   target device MPS equal to that of its parent device.
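
The two cases can be sketched as a small user-space model. This is only an illustration of the policy described in this message, not the kernel function from the patch: the name hotplug_target_mps() and its parameters are hypothetical, all sizes are in bytes, and dev_mpss stands for the largest payload the device supports (128 << pcie_mpss in the patch).

```c
#include <stdbool.h>

/*
 * Hypothetical model of the hot-add MPS policy (not kernel code).
 * Returns the MPS the hot-added device should end up with, or 0 when
 * the device cannot support the parent's MPS (the patch only prints a
 * warning in that situation).
 */
static int hotplug_target_mps(bool parent_is_root_port,
			      int dev_mps, int dev_mpss, int parent_mps)
{
	bool needs_update;

	if (parent_is_root_port)
		/* Case 1: only a device MPS below the root port MPS is a problem. */
		needs_update = dev_mps < parent_mps;
	else
		/* Case 2: below any other port, the values must match exactly. */
		needs_update = dev_mps != parent_mps;

	if (!needs_update)
		return dev_mps;

	/* The device can follow the parent only if its MPSS allows it. */
	if (dev_mpss >= parent_mps)
		return parent_mps;

	return 0;
}
```

For example, a device at the 128B default that supports 512B and sits under a root port programmed to 256B would be raised to 256B, while a 512B device under a 128B root port would be left alone.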

Any comments on this patch are welcome!

Thanks!
Yijing.

On 2013/2/5 11:55, Yijing Wang wrote:
> Currently we don't update a device's MPS value when doing
> PCI device hot-add. The hot-added device's MPS is left at
> the default value (128B), but the upstream port's MPS, which
> was set by firmware during system boot, may be larger than
> 128B. In that case the newly added device may not work
> correctly.
> 
> For reference, see the discussions at
> http://marc.info/?l=linux-pci&m=135420434508910&w=2
> and
> http://marc.info/?l=linux-pci&m=134815603407842&w=2
> 
> Reported-by: Joe Jin <joe.jin@oracle.com>
> Reported-by: Yijing Wang <wangyijing@huawei.com>
> Signed-off-by: Yijing Wang <wangyijing@huawei.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> ---
>  drivers/pci/probe.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index bbe4be7..57d9a5b 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1556,6 +1556,52 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>  	return 0;
>  }
>  
> +static int pcie_bus_update_set(struct pci_dev *dev, void *data)
> +{
> +	int mps, p_mps;
> +
> +	if (!pci_is_pcie(dev) || !dev->bus->self)
> +		return 0;
> +
> +	mps = pcie_get_mps(dev);
> +	p_mps = pcie_get_mps(dev->bus->self);
> +
> +	if (pci_pcie_type(dev->bus->self) != PCI_EXP_TYPE_ROOT_PORT) {
> +		/* update mps when current device mps is not equal to upstream mps */
> +		if (mps != p_mps)
> +			goto update;
> +	} else {
> +		/* update mps when current device mps is smaller than upstream mps */
> +		if (mps < p_mps)
> +			goto update;
> +	}
> +
> +	return 0;
> +
> +update:
> +	/* If the device's MPSS is at least the upstream MPS, set the device's
> +	 * MPS to the upstream MPS; otherwise print a warning.
> +	 */
> +	if ((128 << dev->pcie_mpss) >= p_mps)
> +		pcie_write_mps(dev, p_mps);
> +	else
> +		dev_warn(&dev->dev, "MPS %d MPSS %d both smaller than upstream MPS %d\n"
> +				"If necessary, use \"pci=pcie_bus_peer2peer\" boot parameter to avoid this problem\n",
> +				mps, 128 << dev->pcie_mpss, p_mps);
> +	return 0;
> +}
> +
> +static void pcie_bus_update_setting(struct pci_bus *bus)
> +{
> +
> +	/*
> +	 * After a PCI device is hot-added, its MPS is set to the default
> +	 * value (128 bytes), but the upstream port MPS may be larger than 128B.
> +	 * In this case, we should update this device's MPS for better performance.
> +	 */
> +	pci_walk_bus(bus, pcie_bus_update_set, NULL);
> +}
> +
>  /* pcie_bus_configure_settings requires that pci_walk_bus work in a top-down,
>   * parents then children fashion.  If this changes, then this code will not
>   * work as designed.
> @@ -1566,6 +1612,9 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
>  
>  	if (!pci_is_pcie(bus->self))
>  		return;
> +
> +	/* update mps setting for newly hot added device */
> +	pcie_bus_update_setting(bus);
>  
>  	if (pcie_bus_config == PCIE_BUS_TUNE_OFF)
>  		return;
>
Bjorn Helgaas - July 29, 2013, 11:33 p.m.
On Mon, May 27, 2013 at 9:15 PM, Yijing Wang <wangyijing@huawei.com> wrote:
> Hi Bjorn and Jon,
>    Sorry to disturb you. This patch was sent a long time ago, but nobody seems to have commented on it.
> Do you have any comments on this patch?
>
> This patch tries to update the device MPS in the following cases:
> 1) Target device under a root port
>    A root port can split TLPs, so a target device MPS greater than the root port MPS is OK.
>    But a root port MPS greater than the target device MPS is bad, because the target device cannot
>    receive a TLP whose payload is larger than its MPS. So if a target device is under a root port, I think
>    we should make its MPS greater than or equal to the root port MPS.
> 2) Target device under a non-root port
>    We assume the target device is both a transmitter and a receiver, so the safest approach is to set the
>    target device MPS equal to that of its parent device.

Thanks, I just started reviewing this patch, and your notes above are
exactly the question I was going to ask.  The comments in
pcie_bus_update_set() only tell me what the code does.  I can read the
C code just fine; what we need there is the explanation about *why* we
handle devices below root ports differently than others.  Maybe we can
adapt some of your notes as comments in the code.

Do you have references to the spec where it talks about this
difference?  I want to make sure we can rely on the fact that a root
port can accept TLPs larger than its MPS.

Bjorn

Yijing Wang - July 30, 2013, 3:20 a.m.
On 2013/7/30 7:33, Bjorn Helgaas wrote:
> Thanks, I just started reviewing this patch, and your notes above are
> exactly the question I was going to ask.  The comments in
> pcie_bus_update_set() only tell me what the code does.  I can read the
> C code just fine; what we need there is the explanation about *why* we
> handle devices below root ports differently than others.  Maybe we can
> adapt some of your notes as comments in the code.

Hi Bjorn,
   Thanks for your review and comments!

> 
> Do you have references to the spec where it talks about this
> difference?  I want to make sure we can rely on the fact that a root
> port can accept TLPs larger than its MPS.

The PCIe spec does not explicitly mention this issue; all we can learn from it is that a
root port/root complex can split a TLP into smaller packets, for instance
one 256B packet split into two 128B packets.

I confirmed this on my x86 machine and my IA64 machine:
1. Unload the NIC driver so that changing the NIC MPS is safe.
2. Use setpci to change the NIC MPS to the maximum value it supports.
3. Reload the NIC driver.
4. Ping, and use scp to copy a large file between the machines. The result is OK.
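
As background for step 2: the MPS setting is the Max_Payload_Size field, bits 7:5 of the PCIe Device Control register (offset 0x08 in the PCI Express capability), where an encoding n means 128 << n bytes. The helpers below are a minimal sketch of that encoding; the names devctl_to_mps() and devctl_with_mps() and the macro names are made up for illustration and are not kernel or pciutils APIs.

```c
#include <stdint.h>

#define DEVCTL_PAYLOAD_MASK	0x00e0u	/* Max_Payload_Size, bits 7:5 */
#define DEVCTL_PAYLOAD_SHIFT	5

/* Decode the MPS in bytes from a Device Control register value. */
static int devctl_to_mps(uint16_t devctl)
{
	return 128 << ((devctl & DEVCTL_PAYLOAD_MASK) >> DEVCTL_PAYLOAD_SHIFT);
}

/*
 * Return devctl with its MPS field set to mps bytes. The value is
 * returned unchanged if mps is not a power of two in 128..4096.
 */
static uint16_t devctl_with_mps(uint16_t devctl, int mps)
{
	unsigned int code = 0;

	if (mps < 128 || mps > 4096 || (mps & (mps - 1)))
		return devctl;

	while ((128 << code) < mps)
		code++;

	return (uint16_t)((devctl & ~DEVCTL_PAYLOAD_MASK) |
			  (code << DEVCTL_PAYLOAD_SHIFT));
}
```

With this encoding, raising a device from 128B to 512B changes the field from 0 to 2, which is 0x0040 within the 0x00e0 mask.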

linux:/home/yijing # lspci -tv
 \-[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
             +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             +-03.0-[02]----00.0  Xilinx Corporation Default PCIe endpoint ID
             +-07.0-[03]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
             |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
             +-09.0-[04]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 1078
		................

linux:/home/yijing # ifconfig
eth1      Link encap:Ethernet  HWaddr 80:FB:06:AD:B2:FF
          inet addr:128.5.160.31  Bcast:128.5.160.255  Mask:255.255.255.0
          inet6 addr: fe80::82fb:6ff:fead:b2ff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2737201 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2665883 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3681912141 (3511.3 Mb)  TX bytes:3672206941 (3502.0 Mb)

linux:/home/yijing # ethtool -i eth1
driver: bnx2
version: 2.2.3
firmware-version: bc 4.6.4
bus-info: 0000:01:00.1  ------------->device
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

linux:/home/yijing # lspci -vvv -s 0000:00:01.0
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: c0000000-c3ffffff
	Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Subsystem: Device 19e5:2008
	Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
		Address: 00000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes         --------------------------->root port device, MPS is 128B
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Latency L0 <512ns, L1 <64us
			ClockPM- Surprise+ LLActRep+ BwNot+
		.........[snip].......


linux:/home/yijing # lspci -vvv -s 01:00.1        ----------------->EP device, MPS change from 128B to 512B
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
	Subsystem: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin B routed to IRQ 40
	Region 0: Memory at c2000000 (64-bit, non-prefetchable) [size=32M]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: Broadcom NetXtreme II Ethernet Controller
		Read-only fields:
			[PN] Part number: BCM95706A0
			[EC] Engineering changes: 220197-2
			[SN] Serial number: 0123456789
			[MN] Manufacture ID: 31 34 65 34
			[RV] Reserved: checksum good, 31 byte(s) reserved
		End
	Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
		Vector table: BAR=0 offset=0000c000
		PBA: BAR=0 offset=0000e000
	Capabilities: [ac] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 512 bytes, MaxReadReq 512 bytes  ---------------------------->EP device, MPS is 512B
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 <2us, L1 <2us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

 		...............[snip].............

linux:/home/yijing # scp yijing@128.5.160.28:/home/yijing/ISO/HUAWEI_Enterprise_Linux_B016.iso ./
yijing@128.5.160.28's password:
HUAWEI_Enterprise_Linux_B016.iso                                                                                                   100% 3318MB  53.5MB/s   01:02


linux:/home/yijing # ping 128.5.64.144 -l 65530
WARNING: probably, rcvbuf is not enough to hold preload.
PING 128.5.64.144 (128.5.64.144) 56(84) bytes of data.
64 bytes from 128.5.64.144: icmp_seq=1 ttl=126 time=9.12 ms
64 bytes from 128.5.64.144: icmp_seq=2 ttl=126 time=9.11 ms
64 bytes from 128.5.64.144: icmp_seq=3 ttl=126 time=10.0 ms
64 bytes from 128.5.64.144: icmp_seq=4 ttl=126 time=10.0 ms
64 bytes from 128.5.64.144: icmp_seq=5 ttl=126 time=10.0 ms
64 bytes from 128.5.64.144: icmp_seq=6 ttl=126 time=10.1 ms
64 bytes from 128.5.64.144: icmp_seq=7 ttl=126 time=7.66 ms
64 bytes from 128.5.64.144: icmp_seq=8 ttl=126 time=7.94 ms
64 bytes from 128.5.64.144: icmp_seq=9 ttl=126 time=59.3 ms
64 bytes from 128.5.64.144: icmp_seq=10 ttl=126 time=7.97 ms
64 bytes from 128.5.64.144: icmp_seq=11 ttl=126 time=9.68 ms
64 bytes from 128.5.64.144: icmp_seq=12 ttl=126 time=8.21 ms
64 bytes from 128.5.64.144: icmp_seq=13 ttl=126 time=7.95 ms
64 bytes from 128.5.64.144: icmp_seq=14 ttl=126 time=8.04 ms
64 bytes from 128.5.64.144: icmp_seq=15 ttl=126 time=7.77 ms
Bjorn Helgaas - July 30, 2013, 3:42 a.m.
On Mon, Jul 29, 2013 at 9:20 PM, Yijing Wang <wangyijing@huawei.com> wrote:
> The PCIe spec does not explicitly mention this issue; all we can learn from it is that a
> root port/root complex can split a TLP into smaller packets, for instance
> one 256B packet split into two 128B packets.
>
> I confirmed this on my x86 machine and my IA64 machine:
> 1. Unload the NIC driver so that changing the NIC MPS is safe.
> 2. Use setpci to change the NIC MPS to the maximum value it supports.
> 3. Reload the NIC driver.
> 4. Ping, and use scp to copy a large file between the machines. The result is OK.

The fact that it works on two pieces of hardware is not enough to be
confident that it will work on all spec-conforming hardware.  Maybe we
can deduce this from something in the spec, but I'll have to dig into
it more tomorrow.  I just hoped that you had a spec reference that
could save me some time.

> linux:/home/yijing # lspci -tv
>  \-[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
>              +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
>              |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
>              +-03.0-[02]----00.0  Xilinx Corporation Default PCIe endpoint ID
>              +-07.0-[03]--+-00.0  Intel Corporation 82576 Gigabit Network Connection
>              |            \-00.1  Intel Corporation 82576 Gigabit Network Connection
>              +-09.0-[04]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 1078
>                 ................
>
> linux:/home/yijing # ifconfig
> eth1      Link encap:Ethernet  HWaddr 80:FB:06:AD:B2:FF
>           inet addr:128.5.160.31  Bcast:128.5.160.255  Mask:255.255.255.0
>           inet6 addr: fe80::82fb:6ff:fead:b2ff/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:2737201 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2665883 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3681912141 (3511.3 Mb)  TX bytes:3672206941 (3502.0 Mb)
>
> linux:/home/yijing # ethtool -i eth1
> driver: bnx2
> version: 2.2.3
> firmware-version: bc 4.6.4
> bus-info: 0000:01:00.1  ------------->device
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> linux:/home/yijing # lspci -vvv -s 0000:00:01.0
> 00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 256 bytes
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: 0000f000-00000fff
>         Memory behind bridge: c0000000-c3ffffff
>         Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
>         BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [40] Subsystem: Device 19e5:2008
>         Capabilities: [60] MSI: Enable- Count=1/2 Maskable+ 64bit-
>                 Address: 00000000  Data: 0000
>                 Masking: 00000000  Pending: 00000000
>         Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
>                         ExtTag+ RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 128 bytes         --------------------------->root port device, MPS is 128B
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Latency L0 <512ns, L1 <64us
>                         ClockPM- Surprise+ LLActRep+ BwNot+
>                 .........[snip].......
>
>
> linux:/home/yijing # lspci -vvv -s 01:00.1        ----------------->EP device, MPS change from 128B to 512B
> 01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
>         Subsystem: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 256 bytes
>         Interrupt: pin B routed to IRQ 40
>         Region 0: Memory at c2000000 (64-bit, non-prefetchable) [size=32M]
>         Capabilities: [48] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] Vital Product Data
>                 Product Name: Broadcom NetXtreme II Ethernet Controller
>                 Read-only fields:
>                         [PN] Part number: BCM95706A0
>                         [EC] Engineering changes: 220197-2
>                         [SN] Serial number: 0123456789
>                         [MN] Manufacture ID: 31 34 65 34
>                         [RV] Reserved: checksum good, 31 byte(s) reserved
>                 End
>         Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
>                 Vector table: BAR=0 offset=0000c000
>                 PBA: BAR=0 offset=0000e000
>         Capabilities: [ac] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 512 bytes, MaxReadReq 512 bytes  ---------------------------->EP device, MPS is 512B
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>                 LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 <2us, L1 <2us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>
>                 ...............[snip].............
>
> linux:/home/yijing # scp yijing@128.5.160.28:/home/yijing/ISO/HUAWEI_Enterprise_Linux_B016.iso ./
> yijing@128.5.160.28's password:
> HUAWEI_Enterprise_Linux_B016.iso                                                                                                   100% 3318MB  53.5MB/s   01:02
>
>
> linux:/home/yijing # ping 128.5.64.144 -l 65530
> WARNING: probably, rcvbuf is not enough to hold preload.
> PING 128.5.64.144 (128.5.64.144) 56(84) bytes of data.
> 64 bytes from 128.5.64.144: icmp_seq=1 ttl=126 time=9.12 ms
> 64 bytes from 128.5.64.144: icmp_seq=2 ttl=126 time=9.11 ms
> 64 bytes from 128.5.64.144: icmp_seq=3 ttl=126 time=10.0 ms
> 64 bytes from 128.5.64.144: icmp_seq=4 ttl=126 time=10.0 ms
> 64 bytes from 128.5.64.144: icmp_seq=5 ttl=126 time=10.0 ms
> 64 bytes from 128.5.64.144: icmp_seq=6 ttl=126 time=10.1 ms
> 64 bytes from 128.5.64.144: icmp_seq=7 ttl=126 time=7.66 ms
> 64 bytes from 128.5.64.144: icmp_seq=8 ttl=126 time=7.94 ms
> 64 bytes from 128.5.64.144: icmp_seq=9 ttl=126 time=59.3 ms
> 64 bytes from 128.5.64.144: icmp_seq=10 ttl=126 time=7.97 ms
> 64 bytes from 128.5.64.144: icmp_seq=11 ttl=126 time=9.68 ms
> 64 bytes from 128.5.64.144: icmp_seq=12 ttl=126 time=8.21 ms
> 64 bytes from 128.5.64.144: icmp_seq=13 ttl=126 time=7.95 ms
> 64 bytes from 128.5.64.144: icmp_seq=14 ttl=126 time=8.04 ms
> 64 bytes from 128.5.64.144: icmp_seq=15 ttl=126 time=7.77 ms
>
>>
>> Bjorn
>>
>>> On 2013/2/5 11:55, Yijing Wang wrote:
>>>> Currently we dont't update device's mps vaule when doing
>>>> pci device hot-add. The hot-added device's mps will be set
>>>> to default value (128B). But the upstream port device's mps
>>>> may be larger than 128B which was set by firmware during
>>>> system bootup. In this case the new added device may not
>>>> work normally.
>>>>
>>>> The reference discussion at
>>>> http://marc.info/?l=linux-pci&m=135420434508910&w=2
>>>> and
>>>> http://marc.info/?l=linux-pci&m=134815603407842&w=2
>>>>
>>>> Reported-by: Joe Jin <joe.jin@oracle.com>
>>>> Reported-by: Yijing Wang <wangyijing@huawei.com>
>>>> Signed-off-by: Yijing Wang <wangyijing@huawei.com>
>>>> Cc: Jon Mason <jdmason@kudzu.us>
>>>> ---
>>>>  drivers/pci/probe.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 files changed, 49 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>>>> index bbe4be7..57d9a5b 100644
>>>> --- a/drivers/pci/probe.c
>>>> +++ b/drivers/pci/probe.c
>>>> @@ -1556,6 +1556,52 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>>>>       return 0;
>>>>  }
>>>>
>>>> +static int pcie_bus_update_set(struct pci_dev *dev, void *data)
>>>> +{
>>>> +     int mps, p_mps;
>>>> +
>>>> +     if (!pci_is_pcie(dev) || !dev->bus->self)
>>>> +             return 0;
>>>> +
>>>> +     mps = pcie_get_mps(dev);
>>>> +     p_mps = pcie_get_mps(dev->bus->self);
>>>> +
>>>> +     if (pci_pcie_type(dev->bus->self) != PCI_EXP_TYPE_ROOT_PORT) {
>>>> +             /* update mps when current device mps is not equal to upstream mps */
>>>> +             if (mps != p_mps)
>>>> +                     goto update;
>>>> +     } else {
>>>> +             /* update mps when current device mps is smaller than upstream mps */
>>>> +             if (mps < p_mps)
>>>> +                     goto update;
>>>> +     }
>>>> +
>>>> +     return 0;
>>>> +
>>>> +update:
>>>> +     /* If current mpss is lager than upstream, use upstream mps to update
>>>> +      * current mps, otherwise print warning info.
>>>> +      */
>>>> +     if ((128 << dev->pcie_mpss) >= p_mps)
>>>> +             pcie_write_mps(dev, p_mps);
>>>> +     else
>>>> +             dev_warn(&dev->dev, "MPS %d MPSS %d both smaller than upstream MPS %d\n"
>>>> +                             "If necessary, use \"pci=pcie_bus_peer2peer\" boot parameter to avoid this problem\n",
>>>> +                             mps, 128 << dev->pcie_mpss, p_mps);
>>>> +     return 0;
>>>> +}
>>>> +
>>>> +static void pcie_bus_update_setting(struct pci_bus *bus)
>>>> +{
>>>> +
>>>> +     /*
>>>> +      * After hot added a pci device, the device's mps will set to default
>>>> +      * vaule(128 bytes). But the upstream port mps may be larger than 128B.
>>>> +      * In this case, we should update this device's mps for better performance.
>>>> +      */
>>>> +     pci_walk_bus(bus, pcie_bus_update_set, NULL);
>>>> +}
>>>> +
>>>>  /* pcie_bus_configure_settings requires that pci_walk_bus work in a top-down,
>>>>   * parents then children fashion.  If this changes, then this code will not
>>>>   * work as designed.
>>>> @@ -1566,6 +1612,9 @@ void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
>>>>
>>>>       if (!pci_is_pcie(bus->self))
>>>>               return;
>>>> +
>>>> +     /* update mps setting for newly hot added device */
>>>> +     pcie_bus_update_setting(bus);
>>>>
>>>>       if (pcie_bus_config == PCIE_BUS_TUNE_OFF)
>>>>               return;
>>>>
>>>
>>>
>>> --
>>> Thanks!
>>> Yijing
>>>
>>
>> .
>>
>
>
> --
> Thanks!
> Yijing
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas - July 30, 2013, 10:29 p.m.
On Mon, Jul 29, 2013 at 9:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Jul 29, 2013 at 9:20 PM, Yijing Wang <wangyijing@huawei.com> wrote:
>> On 2013/7/30 7:33, Bjorn Helgaas wrote:
>>> On Mon, May 27, 2013 at 9:15 PM, Yijing Wang <wangyijing@huawei.com> wrote:
>>>> Hi Bjorn and Jon,
>>>>    I'm sorry to disturb you. This patch is sent so long, but nobody seems had comment about it.
>>>> Do you have any comment with this patch?
>>>>
>>>> This patch try to update device mps in following case:
>>>> 1) target device under root port
>>>>    Because root port can split TLP, so target device mps greatr than root port mps is ok.
>>>>    But if root port mps greater than target device mps, it's bad, because target device cannot
>>>>    receive TLP payload size greater than its MPS. So if a target device under a root port, I think
>>>>    we should assign its mps greater than or equal root port mps.
>>>> 2) target device under non root port
>>>>    We assume the target device both is a transmitter and receiver, so the safest way is to assign target
>>>>    device mps equal to its parent device.
>>>
>>> Thanks, I just started reviewing this patch, and your notes above are
>>> exactly the question I was going to ask.  The comments in
>>> pcie_bus_update_set() only tell me what the code does.  I can read the
>>> C code just fine; what we need there is the explanation about *why* we
>>> handle devices below root ports differently than others.  Maybe we can
>>> adapt some of your notes as comments in the code.
>>
>> Hi Bjorn,
>>    Thanks for your review and comments!
>>
>>>
>>> Do you have references to the spec where it talks about this
>>> difference?  I want to make sure we can rely on the fact that a root
>>> port can accept TLPs larger than its MPS.
>>
>> PCIe Spec does not explicitly mention this issue, we can only get the message that
>> root port/ root complex can split the TLP into smaller packets. For instance
>> one 256B packet split into two 128B packet.
>>
>> I confirm this issue in my X86 machine and IA64 machine.
>> 1. I unload NIC driver to make sure the safety during  change the NIC MPS.
>> 2. Use setpci change NIC MPS to the max value it supports.
>> 3. Reload the NIC driver
>> 4. Ping and use scp cpoy large file bwtween machines. Result is ok.

Just as a way to confirm that the MPS change is actually doing
something, I assume you observe a performance difference between
MPS=128 and MPS=512 on the NIC (and the root port MPS=128 in both
cases)?  Or maybe you can confirm with an analyzer that there are
actually 512-byte TLPs on the link?

I assume there are no AER or other errors logged by the root port?
The test you showed was a copy *to* the local machine, so the NIC
would have been doing DMA writes to memory.  I assume it works equally
well doing a copy *from* the local machine to another machine across
the network, where the NIC is doing DMA reads from memory?

> The fact that it works on two pieces of hardware is not enough to be
> confident that it will work on all spec-conforming hardware.  Maybe we
> can deduce this from something in the spec, but I'll have to dig into
> it more tomorrow.  I just hoped that you had a spec reference that
> could save me some time.

The only mention I can find in the spec is sec 1.3.1, where it says "a
Root Complex is generally permitted to split a packet into smaller
packets when routing transactions peer-to-peer between hierarchy
domains ..."

I'm not a hardware guy (I often wish I were :)), but here's how I
interpret that statement.  Let's take the following example:

  00:01.0 Root port bridge to [bus 01] MPS=128
  01:00.1 Endpoint MPS=512

  00:02.0 Root Port bridge to [bus 02] MPS=256
  00:03.0 Root Port bridge to [bus 03] MPS=128
  02:00.0 Endpoint MPS=256
  03:00.0 Endpoint MPS=128

If 02:00.0 (MPS=256) generates a DMA write destined for 03:00.0, it
may transmit a TLP with a data payload of 256 bytes, and 00:02.0
(MPS=256 also) will accept it.  The root complex may route the packet
to 00:03.0 (MPS=128), and here it would need to be split into two
128-byte TLPs before being transmitted by 00:03.0 to 03:00.0
(MPS=128).

Your situation is basically 01:00.1 (MPS=512) doing a DMA write
destined for memory and sending a 512-byte TLP to 00:01.0 (MPS=128).
In this case, the root complex isn't doing any peer-to-peer routing
between hierarchy domains, so I don't think the statement in sec 1.3.1
applies.  So I don't understand why the root port would accept that
TLP.  I would think it would report a malformed TLP error.
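
To make the two cases above concrete, here is a minimal C sketch. The helper names are hypothetical (they are not kernel or spec APIs), and it only models payload sizes, not real TLP headers or routing. It shows how many TLPs a payload needs once it is split to fit a given MPS, and whether a receiver with a given MPS would accept a payload of a given size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of TLPs needed to carry `payload` bytes
 * when no single TLP may carry more than `mps` bytes.
 */
size_t tlp_count(size_t payload, size_t mps)
{
	assert(mps > 0);
	return (payload + mps - 1) / mps;
}

/*
 * Hypothetical receiver-side check: a receiver whose Max_Payload_Size
 * is `rx_mps` accepts a TLP only if its payload fits within that size.
 */
bool rx_accepts(size_t payload, size_t rx_mps)
{
	return payload <= rx_mps;
}
```

In the peer-to-peer case, rx_accepts(256, 256) holds at 00:02.0 and tlp_count(256, 128) gives the two TLPs sent on to 03:00.0. In the 01:00.1 case, rx_accepts(512, 128) is false at 00:01.0, which is why I would expect a malformed TLP error there.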

Bjorn
Yijing Wang - July 31, 2013, 9:15 a.m.
>>> PCIe Spec does not explicitly mention this issue, we can only get the message that
>>> root port/ root complex can split the TLP into smaller packets. For instance
>>> one 256B packet split into two 128B packet.
>>>
>>> I confirm this issue in my X86 machine and IA64 machine.
>>> 1. I unload NIC driver to make sure the safety during  change the NIC MPS.
>>> 2. Use setpci change NIC MPS to the max value it supports.
>>> 3. Reload the NIC driver
>>> 4. Ping and use scp cpoy large file bwtween machines. Result is ok.
> 
> Just as a way to confirm that the MPS change is actually doing
> something, I assume you observe a performance difference between
> MPS=128 and MPS=512 on the NIC (and the root port MPS=128 in both
> cases)?  Or maybe you can confirm with an analyzer that there are
> actually 512-byte TLPs on the link?

Hi Bjorn,
   I didn't observe a performance difference between MPS=128 and MPS=512. I used "ping $dest_ip -s 65500" (a large packet size)
to test the different situations.

1. root port MPS = 128, EP MPS = 256.

root port --------Endpoint device
00:01.0           01:00.1

In this case, I ran ping on the local machine, and the result was OK.
linux:~ # ping 128.5.160.28 -s 65500
PING 128.5.160.28 (128.5.160.28) 65500(65528) bytes of data.
65508 bytes from 128.5.160.28: icmp_seq=1 ttl=64 time=1.43 ms
65508 bytes from 128.5.160.28: icmp_seq=2 ttl=64 time=1.42 ms
65508 bytes from 128.5.160.28: icmp_seq=3 ttl=64 time=1.41 ms
65508 bytes from 128.5.160.28: icmp_seq=4 ttl=64 time=1.37 ms
65508 bytes from 128.5.160.28: icmp_seq=5 ttl=64 time=1.43 ms
..........

 \-[0000:00]-+-00.0  Intel Corporation 5500 I/O Hub to ESI Port
             +-01.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet

linux:~ # lspci -vvv -s 01:00.1
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
............[snip].............
	Capabilities: [ac] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes

linux:~ # lspci -vvv -s 00:01.0
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
...........[snip]..............
	Capabilities: [90] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes


2. root port MPS = 256, EP MPS = 128.
In this case, I used "ping $dest_ip -s 65500" to test, but it failed.

So I guess the TLPs sent during the ping were larger than 128 bytes, and the EP device discarded them.

I have no analyzer to capture the TLPs, so I cannot guarantee the conclusion that an EP MPS larger than the root port MPS is 100% safe.

> 
> I assume there are no AER or other errors logged by the root port?
Yes; AER is not supported on the local machine.

> The test you showed was a copy *to* the local machine, so the NIC
> would have been doing DMA writes to memory.  I assume it works equally
> well doing a copy *from* the local machine to another machine across
> the network, where the NIC is doing DMA reads from memory?

Yes, I tested both copy directions, and the result was OK.

> The only mention I can find in the spec is sec 1.3.1, where it says "a
> Root Complex is generally permitted to split a packet into smaller
> packets when routing transactions peer-to-peer between hierarchy
> domains ..."
> 
> I'm not a hardware guy (I often wish I were :)), but here's how I
> interpret that statement.  Let's take the following example:
> 
>   00:01.0 Root port bridge to [bus 01] MPS=128
>   01:00.1 Endpoint MPS=512
> 
>   00:02.0 Root Port bridge to [bus 02] MPS=256
>   00:03.0 Root Port bridge to [bus 03] MPS=128
>   02:00.0 Endpoint MPS=256
>   03:00.0 Endpoint MPS=128
> 
> If 02:00.0 (MPS=256) generates a DMA write destined for 03:00.0, it
> may transmit a TLP with a data payload of 256 bytes, and 00:02.0
> (MPS=256 also) will accept it.  The root complex may route the packet
> to 00:03.0 (MPS=128), and here it would need to be split into two
> 128-byte TLPs before being transmitted by 00:03.0 to 03:00.0
> (MPS=128).
> 
> Your situation is basically 01:00.1 (MPS=512) doing a DMA write
> destined for memory and sending a 512-byte TLP to 00:01.0 (MPS=128).
> In this case, the root complex isn't doing any peer-to-peer routing
> between hierarchy domains, so I don't think the statement in sec 1.3.1
> applies.  So I don't understand why the root port would accept that
> TLP.  I would think it would report a malformed TLP error.

Hmmm, the PCIe spec does not say much about MPS settings, so maybe different platforms
use different strategies.

To be conservative, as an improvement to the MPS setting after hotplug, I think setting the device's MPS equal to its parent's
makes sense. This does no harm to other devices; we only modify the MPS register of the hot-plugged device itself.

So if you agree, I will update my patch to only modify the hot-plugged device's MPS, making it equal to its parent's.

Thanks!
Yijing.
Bjorn Helgaas - July 31, 2013, 5:53 p.m.
On Wed, Jul 31, 2013 at 3:15 AM, Yijing Wang <wangyijing@huawei.com> wrote:
> Hi Bjorn,
>    I didn't observe a performance difference between MPS=128 and MPS=512. I use ping $dest_ip -s 65500(large size packet)
> to test the different situations.

Interesting.  "ping" is probably not a good way to see performance
differences, but hopefully you could see a difference in *some*
scenario.  Otherwise, there's not much point in increasing MPS :)

>> I assume there are no AER or other errors logged by the root port?
> Yes, AER is not support in local machine.

Per the 5520/5500 spec, it does support AER (sec 19.11.5).  Maybe
there's some platform support required in addition.  You might still
be able to see some info just with "lspci -vv"

> Hmmm, PCIe Spec does not involve too much about MPS setting. So maybe different platform
> has different strategy.

I think there's enough in the spec to tell us what we need to do (this
is sec 2.2.2):

  - A Transmitter must not send a TLP larger than its Max_Payload_Size
  - A Receiver must treat TLPs larger than its Max_Payload_Size as malformed

The only way I can see to guarantee that is to set the MPS on both
ends of the link the same.

> Conservatively, as a improvement for mps setting after hotplug. I think update mps setting equal to its parent
> make sense. This is no harm to other devices, we only modify the hotplug device itself mps register.
>
> So if you agree, I will update my patch ,only try to modify hotplug device mps, make them equal to its parent.

Yes, I think that would be safe.  If the switch is set to a larger MPS
than the hot-added device supports, I don't think we can safely use
the device.
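
As a rough illustration of that rule, here is a minimal C sketch. It is not the kernel implementation: the function and enum names are hypothetical, and all sizes are plain byte counts rather than register encodings. It treats each end of the link as both a transmitter and a receiver, applies the two sec 2.2.2 rules in both directions, and then picks an action for a hot-added device.

```c
#include <stdbool.h>

/* Possible outcomes of the hypothetical policy below. */
enum mps_action {
	MPS_KEEP,		/* device already matches its parent */
	MPS_SET_TO_PARENT,	/* program the device with the parent's MPS */
	MPS_UNUSABLE,		/* device cannot support the parent's MPS */
};

/*
 * Each end may send TLPs up to its own MPS and must reject TLPs larger
 * than its own MPS, so the link is safe in both directions only when
 * neither MPS exceeds the other, i.e. when they are equal.
 */
bool link_mps_safe(int parent_mps, int dev_mps)
{
	return dev_mps <= parent_mps && parent_mps <= dev_mps;
}

/*
 * Decide what to do with a hot-added device. `dev_mpss` is the largest
 * payload size in bytes that the device supports.
 */
enum mps_action hotplug_mps_action(int parent_mps, int dev_mps, int dev_mpss)
{
	if (link_mps_safe(parent_mps, dev_mps))
		return MPS_KEEP;
	if (dev_mpss >= parent_mps)
		return MPS_SET_TO_PARENT;
	return MPS_UNUSABLE;
}
```

For example, a hot-added device at the default 128 bytes that supports 512 bytes, below a port set to 256 bytes, would be set to 256 bytes; if it supported only 128 bytes, it would fall into the case where we can't safely use it.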

Bjorn
Bjorn Helgaas - July 31, 2013, 8:42 p.m.
On Wed, Jul 31, 2013 at 11:53 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Wed, Jul 31, 2013 at 3:15 AM, Yijing Wang <wangyijing@huawei.com> wrote:
>> Hi Bjorn,
>>    I didn't observe a performance difference between MPS=128 and MPS=512. I use ping $dest_ip -s 65500(large size packet)
>> to test the different situations.
>
> Interesting.  "ping" is probably not a good way to see performance
> differences, but hopefully you could see a difference in *some*
> scenario.  Otherwise, there's not much point in increasing MPS :)
>
>>> I assume there are no AER or other errors logged by the root port?
>> Yes, AER is not support in local machine.
>
> Per the 5520/5500 spec, it does support AER (sec 19.11.5).  Maybe
> there's some platform support required in addition.  You might still
> be able to see some info just with "lspci -vv"
>
>> Hmmm, PCIe Spec does not involve too much about MPS setting. So maybe different platform
>> has different strategy.
>
> I think there's enough in the spec to tell us what we need to do (this
> is sec 2.2.2):
>
>   - A Transmitter must not send a TLP larger than its Max_Payload_Size
>   - A Receiver must treat TLPs larger than its Max_Payload_Size as malformed
>
> The only way I can see to guarantee that is to set the MPS on both
> ends of the link the same.
>
>> Conservatively, as a improvement for mps setting after hotplug. I think update mps setting equal to its parent
>> make sense. This is no harm to other devices, we only modify the hotplug device itself mps register.
>>
>> So if you agree, I will update my patch ,only try to modify hotplug device mps, make them equal to its parent.
>
> Yes, I think that would be safe.  If the switch is set to a larger MPS
> than the hot-added device supports, I don't think we can safely use
> the device.

I opened a bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60671
for this problem.  Please correct any mistakes in my summary and
reference it in your changelog.

Bjorn
Yijing Wang - Aug. 1, 2013, 1:21 a.m.
> Yes, I think that would be safe.  If the switch is set to a larger MPS
> than the hot-added device supports, I don't think we can safely use
> the device.

OK, I will refresh my patch and send it out after testing it on my machine.

Thanks!
Yijing.

> 
> Bjorn
> 
>
Yijing Wang - Aug. 1, 2013, 1:23 a.m.
>> Yes, I think that would be safe.  If the switch is set to a larger MPS
>> than the hot-added device supports, I don't think we can safely use
>> the device.
> 
> I opened a bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60671
> for this problem.  Please correct any mistakes in my summary and
> reference it in your changelog.

OK, thanks!

> 
> Bjorn
> 
>

Patch

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index bbe4be7..57d9a5b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1556,6 +1556,52 @@  static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 	return 0;
 }
 
+static int pcie_bus_update_set(struct pci_dev *dev, void *data)
+{
+	int mps, p_mps;
+
+	if (!pci_is_pcie(dev) || !dev->bus->self)
+		return 0;
+
+	mps = pcie_get_mps(dev);
+	p_mps = pcie_get_mps(dev->bus->self);
+
+	if (pci_pcie_type(dev->bus->self) != PCI_EXP_TYPE_ROOT_PORT) {
+		/* update mps when current device mps is not equal to upstream mps */
+		if (mps != p_mps)
+			goto update;
+	} else {
+		/* update mps when current device mps is smaller than upstream mps */
+		if (mps < p_mps)
+			goto update;
+	}
+
+	return 0;
+
+update:
+	/* If the device's MPSS is at least the upstream MPS, set the device's
+	 * MPS to the upstream MPS; otherwise print a warning.
+	 */
+	if ((128 << dev->pcie_mpss) >= p_mps)
+		pcie_write_mps(dev, p_mps);
+	else
+		dev_warn(&dev->dev, "MPS %d MPSS %d both smaller than upstream MPS %d\n"
+				"If necessary, use \"pci=pcie_bus_peer2peer\" boot parameter to avoid this problem\n",
+				mps, 128 << dev->pcie_mpss, p_mps);
+	return 0;
+}
+
+static void pcie_bus_update_setting(struct pci_bus *bus)
+{
+
+	/*
+	 * After a PCI device is hot-added, its MPS is set to the default
+	 * value (128 bytes), but the upstream port's MPS may be larger.
+	 * In this case, we should update this device's MPS for better performance.
+	 */
+	pci_walk_bus(bus, pcie_bus_update_set, NULL);
+}
+
 /* pcie_bus_configure_settings requires that pci_walk_bus work in a top-down,
  * parents then children fashion.  If this changes, then this code will not
  * work as designed.
@@ -1566,6 +1612,9 @@  void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
 
 	if (!pci_is_pcie(bus->self))
 		return;
+
+	/* update mps setting for newly hot added device */
+	pcie_bus_update_setting(bus);
 
 	if (pcie_bus_config == PCIE_BUS_TUNE_OFF)
 		return;
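
The patch compares sizes using the expression 128 << dev->pcie_mpss, which turns the encoded payload size field into bytes. Below is a minimal standalone C sketch of that arithmetic. The function names are hypothetical and are not kernel APIs; the sketch assumes the usual encoding in which field values 0 through 5 stand for 128 through 4096 bytes.

```c
#include <assert.h>

/*
 * Decode an encoded payload size field into bytes:
 * 0 -> 128, 1 -> 256, 2 -> 512, ... 5 -> 4096.
 */
int mps_field_to_bytes(unsigned int field)
{
	assert(field <= 5);
	return 128 << field;
}

/*
 * Encode a payload size in bytes back into the field value.
 * Returns -1 if `bytes` is not one of the six valid sizes.
 */
int mps_bytes_to_field(int bytes)
{
	int field;

	for (field = 0; field <= 5; field++)
		if ((128 << field) == bytes)
			return field;
	return -1;
}
```

With this encoding, a device whose supported-size field is 2 can be programmed up to 512 bytes, so the check (128 << dev->pcie_mpss) >= p_mps in the patch passes for an upstream MPS of 128, 256, or 512 bytes.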