diff mbox

[v7,1/2] PCI: iproc: Retry request when CRS returned from EP

Message ID 1503331124-25029-2-git-send-email-oza.oza@broadcom.com
State Superseded
Headers show

Commit Message

Oza Pawandeep Aug. 21, 2017, 3:58 p.m. UTC
PCIe spec r3.1, sec 2.3.2
If CRS software visibility is not enabled, the RC must reissue the
config request as a new request.

- If CRS software visibility is enabled,
- for a config read of Vendor ID, the RC must return 0x0001 data
- for all other config reads/writes, the RC must reissue the
  request

iproc PCIe Controller spec:
4.7.3.3. Retry Status On Configuration Cycle
Endpoints are allowed to generate retry status on configuration
cycles. In this case, the RC needs to re-issue the request. The IP
does not handle this because the number of configuration cycles needed
will probably be less than the total number of non-posted operations
needed.

When a retry status is received on the User RX interface for a
configuration request that was sent on the User TX interface,
it will be indicated with a completion with the CMPL_STATUS field set
to 2=CRS, and the user will have to find the address and data values
and send a new transaction on the User TX interface.
When the internal configuration space returns a retry status during a
configuration cycle (user_cscfg = 1) on the Command/Status interface,
the pcie_cscrs will assert with the pcie_csack signal to indicate the
CRS status.
When the CRS Software Visibility Enable register in the Root Control
register is enabled, the IP will return the data value to 0x0001 for
the Vendor ID value and 0xffff  (all 1’s) for the rest of the data in
the request for reads of offset 0 that return with CRS status.  This
is true for both the User RX Interface and for the Command/Status
interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
field of the completion on the User RX Interface will not be 2=CRS and
the pcie_cscrs signal will not assert on the Command/Status interface.

Per PCIe r3.1, sec 2.3.2, config requests that receive completions
with Configuration Request Retry Status (CRS) should be reissued by
the hardware except reads of the Vendor ID when CRS Software
Visibility is enabled.

This hardware never reissues configuration requests when it receives
CRS completions.
Note that, neither PCIe host bridge nor PCIe core re-issues the
request for any configuration offset.
As a result of the fact, PCIe RC driver (sw) should take care of
retrying.

For config writes, there is no way for this driver to learn about
CRS completions. A write that receives a CRS completion will not be
retried and the data will be discarded. This is a potential data
corruption.

For config reads, this hardware returns CFG_RETRY_STATUS data when
it receives a CRS completion for a config read, regardless of the
address of the read or the CRS Software Visibility Enable bit.  As a
partial workaround for this, we retry in software any read that
returns CFG_RETRY_STATUS.

It implements iproc_pcie_config_read which gets called for Stingray.
For other iproc based SOC, it falls back to PCI generic APIs.

Signed-off-by: Oza Pawandeep <oza.oza@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>

Comments

Bjorn Helgaas Aug. 22, 2017, 8:33 p.m. UTC | #1
[+cc Sinan, Timur, Alex]

Hi Oza,

On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
> PCIe spec r3.1, sec 2.3.2
> If CRS software visibility is not enabled, the RC must reissue the
> config request as a new request.
> 
> - If CRS software visibility is enabled,
> - for a config read of Vendor ID, the RC must return 0x0001 data
> - for all other config reads/writes, the RC must reissue the
>   request

You mentioned early on that this fixes an issue with NVMe after a
reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
work Sinan is doing:

  http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@bhelgaas-glaptop.roam.corp.google.com

With the current code in the tree, pci_flr_wait() probably waits only
100ms on your system.  It currently reads PCI_COMMAND and probably
sees 0xffff0001 because the NVMe device returns a CRS completion
(which this iproc RC incorrectly turns into 0xffff0001 data), so we
exit the loop after a single msleep(100).

Sinan is extending pci_flr_wait() to read the Vendor ID and look for
the CRS SV indication.  If this is where you're seeing the NVMe issue,
I bet we can fix this by simply changing iproc_pcie_config_read() to
return the correct data when it sees a CRS completion.  Then the
retry loop in pci_flr_wait() will work as intended.

Sec 2.3.2 says the RC may limit the number of retries, and it doesn't
say anything about how many retries the RC should do or what interval
should be between them.  So I think it should be legal to say "this RC
does zero retries".

I think an RC that does zero retries and doesn't support CRS SV would
always handle a CRS completion as a "failed transaction" so it would
synthesize 0xffffffff data (and, I assume, set an error bit like
Received Master Abort).  If we treated this controller as though it
doesn't support CRS SV, the workaround in iproc_pcie_config_read()
would be simply:

  data = readl(cfg_data_p);
  if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
    data = 0xffffffff;

That would make the loop in pci_flr_wait() that reads PCI_COMMAND work
correctly.  It would get 0xffffffff data as long as the device
returned CRS completions.

It wouldn't make Sinan's new CRS SV code work, but that's OK: that all
has to work correctly even on systems that don't support CRS SV.

The only problem is that when we read a non-Vendor ID register that
happens to really be 0xffff0001, we *should* return 0xffff0001, but we
incorrectly return 0xffffffff.  But we already know we just have to
live with that issue.

One minor refactoring comment below.

> iproc PCIe Controller spec:
> 4.7.3.3. Retry Status On Configuration Cycle
> Endpoints are allowed to generate retry status on configuration
> cycles. In this case, the RC needs to re-issue the request. The IP
> does not handle this because the number of configuration cycles needed
> will probably be less than the total number of non-posted operations
> needed.
> 
> When a retry status is received on the User RX interface for a
> configuration request that was sent on the User TX interface,
> it will be indicated with a completion with the CMPL_STATUS field set
> to 2=CRS, and the user will have to find the address and data values
> and send a new transaction on the User TX interface.
> When the internal configuration space returns a retry status during a
> configuration cycle (user_cscfg = 1) on the Command/Status interface,
> the pcie_cscrs will assert with the pcie_csack signal to indicate the
> CRS status.
> When the CRS Software Visibility Enable register in the Root Control
> register is enabled, the IP will return the data value to 0x0001 for
> the Vendor ID value and 0xffff  (all 1’s) for the rest of the data in
> the request for reads of offset 0 that return with CRS status.  This
> is true for both the User RX Interface and for the Command/Status
> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
> field of the completion on the User RX Interface will not be 2=CRS and
> the pcie_cscrs signal will not assert on the Command/Status interface.
> 
> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
> with Configuration Request Retry Status (CRS) should be reissued by
> the hardware except reads of the Vendor ID when CRS Software
> Visibility is enabled.
> 
> This hardware never reissues configuration requests when it receives
> CRS completions.
> Note that, neither PCIe host bridge nor PCIe core re-issues the
> request for any configuration offset.
> As a result of the fact, PCIe RC driver (sw) should take care of
> retrying.
> 
> For config writes, there is no way for this driver to learn about
> CRS completions. A write that receives a CRS completion will not be
> retried and the data will be discarded. This is a potential data
> corruption.
> 
> For config reads, this hardware returns CFG_RETRY_STATUS data when
> it receives a CRS completion for a config read, regardless of the
> address of the read or the CRS Software Visibility Enable bit.  As a
> partial workaround for this, we retry in software any read that
> returns CFG_RETRY_STATUS.
> 
> It implements iproc_pcie_config_read which gets called for Stingray.
> For other iproc based SOC, it falls back to PCI generic APIs.
> 
> Signed-off-by: Oza Pawandeep <oza.oza@broadcom.com>
> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
> 
> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
> index c574863..a3edae4 100644
> --- a/drivers/pci/host/pcie-iproc.c
> +++ b/drivers/pci/host/pcie-iproc.c
> @@ -68,6 +68,9 @@
>  #define APB_ERR_EN_SHIFT             0
>  #define APB_ERR_EN                   BIT(APB_ERR_EN_SHIFT)
>  
> +#define CFG_RETRY_STATUS             0xffff0001
> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
> +
>  /* derive the enum index of the outbound/inbound mapping registers */
>  #define MAP_REG(base_reg, index)      ((base_reg) + (index) * 2)
>  
> @@ -448,6 +451,89 @@ static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus,
>  	}
>  }
>  
> +static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
> +{
> +	int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
> +	unsigned int data;
> +
> +	/*
> +	 * As per PCIe spec r3.1, sec 2.3.2, CRS Software
> +	 * Visibility only affects config read of the Vendor ID.
> +	 * For config write or any other config read the Root must
> +	 * automatically re-issue configuration request again as a
> +	 * new request.
> +	 *
> +	 * For config reads, this hardware returns CFG_RETRY_STATUS data when
> +	 * it receives a CRS completion for a config read, regardless of the
> +	 * address of the read or the CRS Software Visibility Enable bit. As a
> +	 * partial workaround for this, we retry in software any read that
> +	 * returns CFG_RETRY_STATUS.
> +	 */
> +	data = readl(cfg_data_p);
> +	while (data == CFG_RETRY_STATUS && timeout--) {
> +		udelay(1);
> +		data = readl(cfg_data_p);
> +	}
> +
> +	if (data == CFG_RETRY_STATUS)
> +		data = 0xffffffff;
> +
> +	return data;
> +}
> +
> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
> +					       unsigned int busno,
> +					       unsigned int slot,
> +					       unsigned int fn,
> +					       int where)
> +{
> +	u16 offset;
> +	u32 val;
> +
> +	/* EP device access */
> +	val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
> +		(slot << CFG_ADDR_DEV_NUM_SHIFT) |
> +		(fn << CFG_ADDR_FUNC_NUM_SHIFT) |
> +		(where & CFG_ADDR_REG_NUM_MASK) |
> +		(1 & CFG_ADDR_CFG_TYPE_MASK);
> +
> +	iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
> +	offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
> +
> +	if (iproc_pcie_reg_is_invalid(offset))
> +		return NULL;
> +
> +	return (pcie->base + offset);
> +}
> +
> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
> +				    int where, int size, u32 *val)
> +{
> +	struct iproc_pcie *pcie = iproc_data(bus);
> +	unsigned int slot = PCI_SLOT(devfn);
> +	unsigned int fn = PCI_FUNC(devfn);
> +	unsigned int busno = bus->number;
> +	void __iomem *cfg_data_p;
> +	u32 data;
> +
> +	/* root complex access. */
> +	if (busno == 0)
> +		return pci_generic_config_read32(bus, devfn, where, size, val);
> +
> +	cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
> +
> +	if (!cfg_data_p)
> +		return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +	data = iproc_pcie_cfg_retry(cfg_data_p);
> +
> +	*val = data;
> +	if (size <= 2)
> +		*val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
> +
> +	return PCIBIOS_SUCCESSFUL;
> +}
> +
>  /**
>   * Note access to the configuration registers are protected at the higher layer
>   * by 'pci_lock' in drivers/pci/access.c
> @@ -459,7 +545,6 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>  {
>  	unsigned slot = PCI_SLOT(devfn);
>  	unsigned fn = PCI_FUNC(devfn);
> -	u32 val;
>  	u16 offset;
>  
>  	/* root complex access */
> @@ -484,18 +569,7 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>  		if (slot > 0)
>  			return NULL;
>  
> -	/* EP device access */
> -	val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
> -		(slot << CFG_ADDR_DEV_NUM_SHIFT) |
> -		(fn << CFG_ADDR_FUNC_NUM_SHIFT) |
> -		(where & CFG_ADDR_REG_NUM_MASK) |
> -		(1 & CFG_ADDR_CFG_TYPE_MASK);
> -	iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
> -	offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
> -	if (iproc_pcie_reg_is_invalid(offset))
> -		return NULL;
> -	else
> -		return (pcie->base + offset);
> +	return iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);

Thanks for factoring this out.  Can you please split that refactor
into a separate patch?  That will make this CRS-handling patch smaller
and simpler.

>  }
>  
>  static void __iomem *iproc_pcie_bus_map_cfg_bus(struct pci_bus *bus,
> @@ -554,8 +628,13 @@ static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
>  				    int where, int size, u32 *val)
>  {
>  	int ret;
> +	struct iproc_pcie *pcie = iproc_data(bus);
>  
>  	iproc_pcie_apb_err_disable(bus, true);
> +	if (pcie->type == IPROC_PCIE_PAXB_V2)
> +		ret = iproc_pcie_config_read(bus, devfn, where, size, val);
> +	else
> +		ret = pci_generic_config_read32(bus, devfn, where, size, val);
>  	ret = pci_generic_config_read32(bus, devfn, where, size, val);
>  	iproc_pcie_apb_err_disable(bus, false);
>  
> -- 
> 1.9.1
>
Oza Pawandeep Aug. 23, 2017, 10:27 a.m. UTC | #2
On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> [+cc Sinan, Timur, Alex]
>
> Hi Oza,
>
> On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
>> PCIe spec r3.1, sec 2.3.2
>> If CRS software visibility is not enabled, the RC must reissue the
>> config request as a new request.
>>
>> - If CRS software visibility is enabled,
>> - for a config read of Vendor ID, the RC must return 0x0001 data
>> - for all other config reads/writes, the RC must reissue the
>>   request
>
> You mentioned early on that this fixes an issue with NVMe after a
> reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
> work Sinan is doing:
>
>   http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@bhelgaas-glaptop.roam.corp.google.com
>
> With the current code in the tree, pci_flr_wait() probably waits only
> 100ms on your system.  It currently reads PCI_COMMAND and probably
> sees 0xffff0001 because the NVMe device returns a CRS completion
> (which this iproc RC incorrectly turns into 0xffff0001 data), so we
> exit the loop after a single msleep(100).
>
> Sinan is extending pci_flr_wait() to read the Vendor ID and look for
> the CRS SV indication.  If this is where you're seeing the NVMe issue,
> I bet we can fix this by simply changing iproc_pcie_config_read() to
> return the correct data when it sees a CRS completion.  Then the
> retry loop in pci_flr_wait() will work as intended.
>

The issue with User Space poll mode drivers resulting into CRS.


root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
traddr:0000:01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
Starting DPDK 16.11.1 initialization...
[ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
EAL: Detected 8 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: cannot open /proc/self/numa_maps, consider that all memory is in
socket_id 0
Initializing NVMe Controllers
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL:   probe driver: 8086:953 spdk_nvme
EAL:   using IOMMU type 1 (Type 1)
[   45.811353] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x2a0
Attaching to NVMe Controller at 0000:01:00.0 [8086:0953]
[   46.486601] vfio_bar_restore: 0000:01:00.0 reset recovery - restoring bars
nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
timed out in state 3
nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
within 5 seconds
EAL: Hotplug doesn't support vfio yet
spdk_nvme_probe() failed for transport address '0000:01:00.0'
/usr/share/spdk/examples/nvme/perf: errors occured

ok, so this is in the context of
[   52.500579] [<ffff000008088498>] dump_backtrace+0x0/0x208
[   52.506147] [<ffff0000080886b4>] show_stack+0x14/0x20
[   52.511357] [<ffff0000083ace34>] dump_stack+0x8c/0xb0
[   52.516567] [<ffff00000840a550>] pci_flr_wait+0x18/0xc8
[   52.521955] [<ffff00000840bf9c>] pcie_flr+0x34/0x68
[   52.526986] [<ffff00000840e3d0>] __pci_dev_reset+0x100/0x2c0
[   52.532823] [<ffff00000840e60c>] pci_try_reset_function+0x64/0x80
[   52.539108] [<ffff00000856e4b8>] vfio_pci_ioctl+0x398/0xb78
[   52.544857] [<ffff000008568378>] vfio_device_fops_unl_ioctl+0x20/0x30
[   52.551501] [<ffff0000081d4e24>] do_vfs_ioctl+0xa4/0x738
[   52.556980] [<ffff0000081d5544>] SyS_ioctl+0x8c/0xa0
[   52.562100] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4


So, I have two things to do here.

1) remove retry funciton and just do following.
   data = readl(cfg_data_p);
   if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
     data = 0xffffffff;

2) refactor the code with separate patch.

but for that Sinan's patch is required.
then with both the patches I think, things will work out correctly for
iproc SOC.

Let me know how this sounds and if you agree with above two points, I
will be posting final set of patches.

let me know if you want me to change anything in the commit description.
in my opinion I should remove config write description which is irrelevant here.

Regards,
Oza.

> Sec 2.3.2 says the RC may limit the number of retries, and it doesn't
> say anything about how many retries the RC should do or what interval
> should be between them.  So I think it should be legal to say "this RC
> does zero retries".
>
> I think an RC that does zero retries and doesn't support CRS SV would
> always handle a CRS completion as a "failed transaction" so it would
> synthesize 0xffffffff data (and, I assume, set an error bit like
> Received Master Abort).  If we treated this controller as though it
> doesn't support CRS SV, the workaround in iproc_pcie_config_read()
> would be simply:
>
>   data = readl(cfg_data_p);
>   if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
>     data = 0xffffffff;
>
> That would make the loop in pci_flr_wait() that reads PCI_COMMAND work
> correctly.  It would get 0xffffffff data as long as the device
> returned CRS completions.
>
> It wouldn't make Sinan's new CRS SV code work, but that's OK: that all
> has to work correctly even on systems that don't support CRS SV.
>
> The only problem is that when we read a non-Vendor ID register that
> happens to really be 0xffff0001, we *should* return 0xffff0001, but we
> incorrectly return 0xffffffff.  But we already know we just have to
> live with that issue.
>
> One minor refactoring comment below.
>
>> iproc PCIe Controller spec:
>> 4.7.3.3. Retry Status On Configuration Cycle
>> Endpoints are allowed to generate retry status on configuration
>> cycles. In this case, the RC needs to re-issue the request. The IP
>> does not handle this because the number of configuration cycles needed
>> will probably be less than the total number of non-posted operations
>> needed.
>>
>> When a retry status is received on the User RX interface for a
>> configuration request that was sent on the User TX interface,
>> it will be indicated with a completion with the CMPL_STATUS field set
>> to 2=CRS, and the user will have to find the address and data values
>> and send a new transaction on the User TX interface.
>> When the internal configuration space returns a retry status during a
>> configuration cycle (user_cscfg = 1) on the Command/Status interface,
>> the pcie_cscrs will assert with the pcie_csack signal to indicate the
>> CRS status.
>> When the CRS Software Visibility Enable register in the Root Control
>> register is enabled, the IP will return the data value to 0x0001 for
>> the Vendor ID value and 0xffff  (all 1’s) for the rest of the data in
>> the request for reads of offset 0 that return with CRS status.  This
>> is true for both the User RX Interface and for the Command/Status
>> interface.  When CRS Software Visibility is enabled, the CMPL_STATUS
>> field of the completion on the User RX Interface will not be 2=CRS and
>> the pcie_cscrs signal will not assert on the Command/Status interface.
>>
>> Per PCIe r3.1, sec 2.3.2, config requests that receive completions
>> with Configuration Request Retry Status (CRS) should be reissued by
>> the hardware except reads of the Vendor ID when CRS Software
>> Visibility is enabled.
>>
>> This hardware never reissues configuration requests when it receives
>> CRS completions.
>> Note that, neither PCIe host bridge nor PCIe core re-issues the
>> request for any configuration offset.
>> As a result of the fact, PCIe RC driver (sw) should take care of
>> retrying.
>>
>> For config writes, there is no way for this driver to learn about
>> CRS completions. A write that receives a CRS completion will not be
>> retried and the data will be discarded. This is a potential data
>> corruption.
>>
>> For config reads, this hardware returns CFG_RETRY_STATUS data when
>> it receives a CRS completion for a config read, regardless of the
>> address of the read or the CRS Software Visibility Enable bit.  As a
>> partial workaround for this, we retry in software any read that
>> returns CFG_RETRY_STATUS.
>>
>> It implements iproc_pcie_config_read which gets called for Stingray.
>> For other iproc based SOC, it falls back to PCI generic APIs.
>>
>> Signed-off-by: Oza Pawandeep <oza.oza@broadcom.com>
>> Reviewed-by: Ray Jui <ray.jui@broadcom.com>
>> Reviewed-by: Scott Branden <scott.branden@broadcom.com>
>>
>> diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
>> index c574863..a3edae4 100644
>> --- a/drivers/pci/host/pcie-iproc.c
>> +++ b/drivers/pci/host/pcie-iproc.c
>> @@ -68,6 +68,9 @@
>>  #define APB_ERR_EN_SHIFT             0
>>  #define APB_ERR_EN                   BIT(APB_ERR_EN_SHIFT)
>>
>> +#define CFG_RETRY_STATUS             0xffff0001
>> +#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
>> +
>>  /* derive the enum index of the outbound/inbound mapping registers */
>>  #define MAP_REG(base_reg, index)      ((base_reg) + (index) * 2)
>>
>> @@ -448,6 +451,89 @@ static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus,
>>       }
>>  }
>>
>> +static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
>> +{
>> +     int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
>> +     unsigned int data;
>> +
>> +     /*
>> +      * As per PCIe spec r3.1, sec 2.3.2, CRS Software
>> +      * Visibility only affects config read of the Vendor ID.
>> +      * For config write or any other config read the Root must
>> +      * automatically re-issue configuration request again as a
>> +      * new request.
>> +      *
>> +      * For config reads, this hardware returns CFG_RETRY_STATUS data when
>> +      * it receives a CRS completion for a config read, regardless of the
>> +      * address of the read or the CRS Software Visibility Enable bit. As a
>> +      * partial workaround for this, we retry in software any read that
>> +      * returns CFG_RETRY_STATUS.
>> +      */
>> +     data = readl(cfg_data_p);
>> +     while (data == CFG_RETRY_STATUS && timeout--) {
>> +             udelay(1);
>> +             data = readl(cfg_data_p);
>> +     }
>> +
>> +     if (data == CFG_RETRY_STATUS)
>> +             data = 0xffffffff;
>> +
>> +     return data;
>> +}
>> +
>> +static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
>> +                                            unsigned int busno,
>> +                                            unsigned int slot,
>> +                                            unsigned int fn,
>> +                                            int where)
>> +{
>> +     u16 offset;
>> +     u32 val;
>> +
>> +     /* EP device access */
>> +     val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> +             (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> +             (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> +             (where & CFG_ADDR_REG_NUM_MASK) |
>> +             (1 & CFG_ADDR_CFG_TYPE_MASK);
>> +
>> +     iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> +     offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
>> +
>> +     if (iproc_pcie_reg_is_invalid(offset))
>> +             return NULL;
>> +
>> +     return (pcie->base + offset);
>> +}
>> +
>> +static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
>> +                                 int where, int size, u32 *val)
>> +{
>> +     struct iproc_pcie *pcie = iproc_data(bus);
>> +     unsigned int slot = PCI_SLOT(devfn);
>> +     unsigned int fn = PCI_FUNC(devfn);
>> +     unsigned int busno = bus->number;
>> +     void __iomem *cfg_data_p;
>> +     u32 data;
>> +
>> +     /* root complex access. */
>> +     if (busno == 0)
>> +             return pci_generic_config_read32(bus, devfn, where, size, val);
>> +
>> +     cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
>> +
>> +     if (!cfg_data_p)
>> +             return PCIBIOS_DEVICE_NOT_FOUND;
>> +
>> +     data = iproc_pcie_cfg_retry(cfg_data_p);
>> +
>> +     *val = data;
>> +     if (size <= 2)
>> +             *val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
>> +
>> +     return PCIBIOS_SUCCESSFUL;
>> +}
>> +
>>  /**
>>   * Note access to the configuration registers are protected at the higher layer
>>   * by 'pci_lock' in drivers/pci/access.c
>> @@ -459,7 +545,6 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>>  {
>>       unsigned slot = PCI_SLOT(devfn);
>>       unsigned fn = PCI_FUNC(devfn);
>> -     u32 val;
>>       u16 offset;
>>
>>       /* root complex access */
>> @@ -484,18 +569,7 @@ static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
>>               if (slot > 0)
>>                       return NULL;
>>
>> -     /* EP device access */
>> -     val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
>> -             (slot << CFG_ADDR_DEV_NUM_SHIFT) |
>> -             (fn << CFG_ADDR_FUNC_NUM_SHIFT) |
>> -             (where & CFG_ADDR_REG_NUM_MASK) |
>> -             (1 & CFG_ADDR_CFG_TYPE_MASK);
>> -     iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
>> -     offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
>> -     if (iproc_pcie_reg_is_invalid(offset))
>> -             return NULL;
>> -     else
>> -             return (pcie->base + offset);
>> +     return iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
>
> Thanks for factoring this out.  Can you please split that refactor
> into a separate patch?  That will make this CRS-handling patch smaller
> and simpler.
>
>>  }
>>
>>  static void __iomem *iproc_pcie_bus_map_cfg_bus(struct pci_bus *bus,
>> @@ -554,8 +628,13 @@ static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
>>                                   int where, int size, u32 *val)
>>  {
>>       int ret;
>> +     struct iproc_pcie *pcie = iproc_data(bus);
>>
>>       iproc_pcie_apb_err_disable(bus, true);
>> +     if (pcie->type == IPROC_PCIE_PAXB_V2)
>> +             ret = iproc_pcie_config_read(bus, devfn, where, size, val);
>> +     else
>> +             ret = pci_generic_config_read32(bus, devfn, where, size, val);
>>       ret = pci_generic_config_read32(bus, devfn, where, size, val);
>>       iproc_pcie_apb_err_disable(bus, false);
>>
>> --
>> 1.9.1
>>
Bjorn Helgaas Aug. 23, 2017, 1:51 p.m. UTC | #3
On Wed, Aug 23, 2017 at 03:57:02PM +0530, Oza Oza wrote:
> On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > [+cc Sinan, Timur, Alex]
> >
> > Hi Oza,
> >
> > On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
> >> PCIe spec r3.1, sec 2.3.2
> >> If CRS software visibility is not enabled, the RC must reissue the
> >> config request as a new request.
> >>
> >> - If CRS software visibility is enabled,
> >> - for a config read of Vendor ID, the RC must return 0x0001 data
> >> - for all other config reads/writes, the RC must reissue the
> >>   request
> >
> > You mentioned early on that this fixes an issue with NVMe after a
> > reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
> > work Sinan is doing:
> >
> >   http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@bhelgaas-glaptop.roam.corp.google.com
> >
> > With the current code in the tree, pci_flr_wait() probably waits only
> > 100ms on your system.  It currently reads PCI_COMMAND and probably
> > sees 0xffff0001 because the NVMe device returns a CRS completion
> > (which this iproc RC incorrectly turns into 0xffff0001 data), so we
> > exit the loop after a single msleep(100).
> >
> > Sinan is extending pci_flr_wait() to read the Vendor ID and look for
> > the CRS SV indication.  If this is where you're seeing the NVMe issue,
> > I bet we can fix this by simply changing iproc_pcie_config_read() to
> > return the correct data when it sees a CRS completion.  Then the
> > retry loop in pci_flr_wait() will work as intended.
> >
> 
> The issue with User Space poll mode drivers resulting into CRS.
> 
> 
> root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
> traddr:0000:01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
> Starting DPDK 16.11.1 initialization...
> [ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
> EAL: Detected 8 lcore(s)
> EAL: Probing VFIO support...
> EAL: VFIO support initialized
> EAL: cannot open /proc/self/numa_maps, consider that all memory is in
> socket_id 0
> Initializing NVMe Controllers
> EAL: PCI device 0000:01:00.0 on NUMA socket 0
> EAL:   probe driver: 8086:953 spdk_nvme
> EAL:   using IOMMU type 1 (Type 1)
> [   45.811353] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x2a0
> Attaching to NVMe Controller at 0000:01:00.0 [8086:0953]
> [   46.486601] vfio_bar_restore: 0000:01:00.0 reset recovery - restoring bars
> nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
> timed out in state 3
> nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
> within 5 seconds
> EAL: Hotplug doesn't support vfio yet
> spdk_nvme_probe() failed for transport address '0000:01:00.0'
> /usr/share/spdk/examples/nvme/perf: errors occured
> 
> ok, so this is in the context of
> [   52.500579] [<ffff000008088498>] dump_backtrace+0x0/0x208
> [   52.506147] [<ffff0000080886b4>] show_stack+0x14/0x20
> [   52.511357] [<ffff0000083ace34>] dump_stack+0x8c/0xb0
> [   52.516567] [<ffff00000840a550>] pci_flr_wait+0x18/0xc8
> [   52.521955] [<ffff00000840bf9c>] pcie_flr+0x34/0x68
> [   52.526986] [<ffff00000840e3d0>] __pci_dev_reset+0x100/0x2c0
> [   52.532823] [<ffff00000840e60c>] pci_try_reset_function+0x64/0x80
> [   52.539108] [<ffff00000856e4b8>] vfio_pci_ioctl+0x398/0xb78
> [   52.544857] [<ffff000008568378>] vfio_device_fops_unl_ioctl+0x20/0x30
> [   52.551501] [<ffff0000081d4e24>] do_vfs_ioctl+0xa4/0x738
> [   52.556980] [<ffff0000081d5544>] SyS_ioctl+0x8c/0xa0
> [   52.562100] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4
> 
> 
> So, I have two things to do here.
> 
> 1) remove retry funciton and just do following.
>    data = readl(cfg_data_p);
>    if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
>      data = 0xffffffff;
> 
> 2) refactor the code with separate patch.
> 
> but for that Sinan's patch is required.
> then with both the patches I think, things will work out correctly for
> iproc SOC.

It looks like you are using the pci_flr_wait() path as I suspected.

If you do the check as you mentioned in 1), this should work even
without Sinan's patch.

It's possible you'd have to increase the pci_flr_wait() timeout
(currently 1sec total).  I think we'll end up doing that as part of
Sinan's work, so we use the same timeout whether or not the RC
supports CRS SV.

> 
> Let me know how this sounds and if you agree with above two points, I
> will be posting final set of patches.
> 
> let me know if you want me to change anything in the commit description.
> in my opinion I should remove config write description which is irrelevant here.

I agree.  I don't think you need to say anything special about write,
since I think an RC is probably allowed to drop (with no retries) a
config write that receives a CRS completion.  In this case, I expect
the RC would log an error, so you should see something in PCI_STATUS.
Oza Pawandeep Aug. 23, 2017, 3:42 p.m. UTC | #4
On Wed, Aug 23, 2017 at 7:21 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Wed, Aug 23, 2017 at 03:57:02PM +0530, Oza Oza wrote:
>> On Wed, Aug 23, 2017 at 2:03 AM, Bjorn Helgaas <helgaas@kernel.org> wrote:
>> > [+cc Sinan, Timur, Alex]
>> >
>> > Hi Oza,
>> >
>> > On Mon, Aug 21, 2017 at 09:28:43PM +0530, Oza Pawandeep wrote:
>> >> PCIe spec r3.1, sec 2.3.2
>> >> If CRS software visibility is not enabled, the RC must reissue the
>> >> config request as a new request.
>> >>
>> >> - If CRS software visibility is enabled,
>> >> - for a config read of Vendor ID, the RC must return 0x0001 data
>> >> - for all other config reads/writes, the RC must reissue the
>> >>   request
>> >
>> > You mentioned early on that this fixes an issue with NVMe after a
>> > reset.  Is that reset an FLR?  I'm wondering if this overlaps with the
>> > work Sinan is doing:
>> >
>> >   http://lkml.kernel.org/r/20170818212310.15145.21732.stgit@bhelgaas-glaptop.roam.corp.google.com
>> >
>> > With the current code in the tree, pci_flr_wait() probably waits only
>> > 100ms on your system.  It currently reads PCI_COMMAND and probably
>> > sees 0xffff0001 because the NVMe device returns a CRS completion
>> > (which this iproc RC incorrectly turns into 0xffff0001 data), so we
>> > exit the loop after a single msleep(100).
>> >
>> > Sinan is extending pci_flr_wait() to read the Vendor ID and look for
>> > the CRS SV indication.  If this is where you're seeing the NVMe issue,
>> > I bet we can fix this by simply changing iproc_pcie_config_read() to
>> > return the correct data when it sees a CRS completion.  Then the
>> > retry loop in pci_flr_wait() will work as intended.
>> >
>>
>> The issue with User Space poll mode drivers resulting into CRS.
>>
>>
>> root@bcm958742k:~# /usr/share/spdk/examples/nvme/perf -r 'trtype:PCIe
>> traddr:0000:01:00.0' -q 128 -s 4096 -w read -d 2048 -t 30 -c 0x1
>> Starting DPDK 16.11.1 initialization...
>> [ DPDK EAL parameters: perf -c 1 -m 2048 --file-prefix=spdk_pid2202 ]
>> EAL: Detected 8 lcore(s)
>> EAL: Probing VFIO support...
>> EAL: VFIO support initialized
>> EAL: cannot open /proc/self/numa_maps, consider that all memory is in
>> socket_id 0
>> Initializing NVMe Controllers
>> EAL: PCI device 0000:01:00.0 on NUMA socket 0
>> EAL:   probe driver: 8086:953 spdk_nvme
>> EAL:   using IOMMU type 1 (Type 1)
>> [   45.811353] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x2a0
>> Attaching to NVMe Controller at 0000:01:00.0 [8086:0953]
>> [   46.486601] vfio_bar_restore: 0000:01:00.0 reset recovery - restoring bars
>> nvme_ctrlr.c:1237:nvme_ctrlr_process_init: ***ERROR*** Initialization
>> timed out in state 3
>> nvme_ctrlr.c: 414:nvme_ctrlr_shutdown: ***ERROR*** did not shutdown
>> within 5 seconds
>> EAL: Hotplug doesn't support vfio yet
>> spdk_nvme_probe() failed for transport address '0000:01:00.0'
>> /usr/share/spdk/examples/nvme/perf: errors occured
>>
>> ok, so this is in the context of
>> [   52.500579] [<ffff000008088498>] dump_backtrace+0x0/0x208
>> [   52.506147] [<ffff0000080886b4>] show_stack+0x14/0x20
>> [   52.511357] [<ffff0000083ace34>] dump_stack+0x8c/0xb0
>> [   52.516567] [<ffff00000840a550>] pci_flr_wait+0x18/0xc8
>> [   52.521955] [<ffff00000840bf9c>] pcie_flr+0x34/0x68
>> [   52.526986] [<ffff00000840e3d0>] __pci_dev_reset+0x100/0x2c0
>> [   52.532823] [<ffff00000840e60c>] pci_try_reset_function+0x64/0x80
>> [   52.539108] [<ffff00000856e4b8>] vfio_pci_ioctl+0x398/0xb78
>> [   52.544857] [<ffff000008568378>] vfio_device_fops_unl_ioctl+0x20/0x30
>> [   52.551501] [<ffff0000081d4e24>] do_vfs_ioctl+0xa4/0x738
>> [   52.556980] [<ffff0000081d5544>] SyS_ioctl+0x8c/0xa0
>> [   52.562100] [<ffff000008082f8c>] __sys_trace_return+0x0/0x4
>>
>>
>> So, I have two things to do here.
>>
>> 1) remove retry funciton and just do following.
>>    data = readl(cfg_data_p);
>>    if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
>>      data = 0xffffffff;
>>
>> 2) refactor the code with separate patch.
>>
>> but for that Sinan's patch is required.
>> then with both the patches I think, things will work out correctly for
>> iproc SOC.
>
> It looks like you are using the pci_flr_wait() path as I suspected.
>
> If you do the check as you mentioned in 1), this should work even
> without Sinan's patch.

looks like this can not work.
If I add folowing in iproc_pcie_config_read()

 if (data == CFG_RETRY_STATUS)    /* 0xffff0001 */
      data = 0xffffffff;

because pci_bus_read_dev_vendor_id returns false
/* some broken boards return 0 or ~0 if a slot is empty: */
if (*l == 0xffffffff || *l == 0x00000000 ||
   *l == 0x0000ffff || *l == 0xffff0000)
return false;


In working Enumuration case I get following:
[    9.125976] pci 0000:00:00.0: bridge configuration invalid ([bus
00-00]), re-configuring
[    9.134267] where=0x0 val=0xffff0001
[    9.146946] where=0x0 val=0xffff0001
[    9.158943] where=0x0 val=0xffff0001
[    9.170945] where=0x0 val=0xffff0001
[    9.186945] where=0x0 val=0xffff0001
[    9.210944] where=0x0 val=0xffff0001
[    9.250943] where=0x0 val=0xffff0001
[    9.322942] where=0x0 val=0xffff0001
[    9.458943] where=0x0 val=0xffff0001
[    9.726942] where=0x0 val=0x9538086    >> actual vendor and device id.

so I think I have to retry in RC driver, so the old code still holds good.
except that I have to do factoring out.

please let me know If I missed anything, or you want me to try anything else.

Regards,
Oza.

>
> It's possible you'd have to increase the pci_flr_wait() timeout
> (currently 1sec total).  I think we'll end up doing that as part of
> Sinan's work, so we use the same timeout whether or not the RC
> supports CRS SV.
>
>>
>> Let me know how this sounds and if you agree with above two points, I
>> will be posting final set of patches.
>>
>> let me know if you want me to change anything in the commit description.
>> in my opinion I should remove config write description which is irrelevant here.
>
> I agree.  I don't think you need to say anything special about write,
> since I think an RC is probably allowed to drop (with no retries) a
> config write that receives a CRS completion.  In this case, I expect
> the RC would log an error, so you should see something in PCI_STATUS.
Sinan Kaya Aug. 23, 2017, 3:52 p.m. UTC | #5
Hi Oza,

> In working Enumuration case I get following:
> [    9.125976] pci 0000:00:00.0: bridge configuration invalid ([bus
> 00-00]), re-configuring
> [    9.134267] where=0x0 val=0xffff0001
> [    9.146946] where=0x0 val=0xffff0001
> [    9.158943] where=0x0 val=0xffff0001
> [    9.170945] where=0x0 val=0xffff0001
> [    9.186945] where=0x0 val=0xffff0001
> [    9.210944] where=0x0 val=0xffff0001
> [    9.250943] where=0x0 val=0xffff0001
> [    9.322942] where=0x0 val=0xffff0001
> [    9.458943] where=0x0 val=0xffff0001
> [    9.726942] where=0x0 val=0x9538086    >> actual vendor and device id.
> 
> so I think I have to retry in RC driver, so the old code still holds good.
> except that I have to do factoring out
You need to return 0xFFFF0001 for vendor ID register and return 0xFFFFFFFF for
other registers like COMMAND register during the CRS period.

> 
> please let me know If I missed anything, or you want me to try anything else.

Sinan
Oza Pawandeep Aug. 23, 2017, 4:02 p.m. UTC | #6
On Wed, Aug 23, 2017 at 9:22 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Hi Oza,
>
>> In working Enumuration case I get following:
>> [    9.125976] pci 0000:00:00.0: bridge configuration invalid ([bus
>> 00-00]), re-configuring
>> [    9.134267] where=0x0 val=0xffff0001
>> [    9.146946] where=0x0 val=0xffff0001
>> [    9.158943] where=0x0 val=0xffff0001
>> [    9.170945] where=0x0 val=0xffff0001
>> [    9.186945] where=0x0 val=0xffff0001
>> [    9.210944] where=0x0 val=0xffff0001
>> [    9.250943] where=0x0 val=0xffff0001
>> [    9.322942] where=0x0 val=0xffff0001
>> [    9.458943] where=0x0 val=0xffff0001
>> [    9.726942] where=0x0 val=0x9538086    >> actual vendor and device id.
>>
>> so I think I have to retry in RC driver, so the old code still holds good.
>> except that I have to do factoring out
> You need to return 0xFFFF0001 for vendor ID register and return 0xFFFFFFFF for
> other registers like COMMAND register during the CRS period.
>

Hi Sinan,

Although I don't disagree entirely with your suggestion, but at the same time,
I am not sure, in RC driver I should have special case for vendor id
and device id, and have another case for rest of the configuration
access.
my thinking is, the RC driver should closely reflect the PCI iproc HW
behaviour (in a generic way, rather than handling vendor/device id
case separately)

Bjorn, please comment as how do you see it being handled.

Regards,
Oza.

>>
>> please let me know If I missed anything, or you want me to try anything else.
>
> Sinan
Bjorn Helgaas Aug. 23, 2017, 6 p.m. UTC | #7
On Wed, Aug 23, 2017 at 09:32:06PM +0530, Oza Oza wrote:
> On Wed, Aug 23, 2017 at 9:22 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> > Hi Oza,
> >
> >> In working Enumuration case I get following:
> >> [    9.125976] pci 0000:00:00.0: bridge configuration invalid ([bus
> >> 00-00]), re-configuring
> >> [    9.134267] where=0x0 val=0xffff0001
> >> [    9.146946] where=0x0 val=0xffff0001
> >> [    9.158943] where=0x0 val=0xffff0001
> >> [    9.170945] where=0x0 val=0xffff0001
> >> [    9.186945] where=0x0 val=0xffff0001
> >> [    9.210944] where=0x0 val=0xffff0001
> >> [    9.250943] where=0x0 val=0xffff0001
> >> [    9.322942] where=0x0 val=0xffff0001
> >> [    9.458943] where=0x0 val=0xffff0001
> >> [    9.726942] where=0x0 val=0x9538086    >> actual vendor and device id.
> >>
> >> so I think I have to retry in RC driver, so the old code still holds good.
> >> except that I have to do factoring out
> > You need to return 0xFFFF0001 for vendor ID register and return 0xFFFFFFFF for
> > other registers like COMMAND register during the CRS period.

The proposal we're trying to implement is to handle this controller as
an RP that does not support CRS SV.  Such an RP would never return
0xffff0001 for the vendor/device ID.  If it never got a successful
completion, it would return 0xffffffff.

So I think you do have to either retry (as in your v7 patch) or add a
delay before we start enumeration.

It looks like we waited somewhere between 320ms and 590ms before this
device became ready.

The PCI specs do require a delay after a reset (PCIe r3.1, sec 6.6.1)
before software sends a config request.  I don't think there's
anywhere in the PCI core where we delay before we first enumerate
devices under a host bridge.  I'm not sure we'd *want* such a delay in
the PCI core, because on many systems the BIOS has already initialized
the PCI controller, and an additional delay would be unnecessary and
would only slow down boot.

But it might make sense to add a delay in the PCI controller driver --
it knows when the reset occurs and probably knows more about the
firmware environment.  I haven't tried to analyze all of sec 6.6.1,
but this section:

  Unless Readiness Notifications mechanisms are used (see Section
  6.23), the Root Complex and/or system software must allow at least
  1.0 s after a Conventional Reset of a device, before it may
  determine that a device which fails to return a Successful
  Completion status for a valid Configuration Request is a broken
  device. This period is independent of how quickly Link training
  completes.

  Note: This delay is analogous to the Trhfa parameter specified for
  PCI/PCI-X, and is intended to allow an adequate amount of time for
  devices which require self initialization.

makes it sound like a 1sec delay might be needed.  That sounds like an
awful lot, but this device did take close to that amount of time.

I don't care which way you go.  You've already implemented the delay
in the v7 patch, and that's fine with me.

Bjorn
Oza Pawandeep Aug. 24, 2017, 4:40 a.m. UTC | #8
On Wed, Aug 23, 2017 at 11:30 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> On Wed, Aug 23, 2017 at 09:32:06PM +0530, Oza Oza wrote:
>> On Wed, Aug 23, 2017 at 9:22 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> > Hi Oza,
>> >
>> >> In working Enumuration case I get following:
>> >> [    9.125976] pci 0000:00:00.0: bridge configuration invalid ([bus
>> >> 00-00]), re-configuring
>> >> [    9.134267] where=0x0 val=0xffff0001
>> >> [    9.146946] where=0x0 val=0xffff0001
>> >> [    9.158943] where=0x0 val=0xffff0001
>> >> [    9.170945] where=0x0 val=0xffff0001
>> >> [    9.186945] where=0x0 val=0xffff0001
>> >> [    9.210944] where=0x0 val=0xffff0001
>> >> [    9.250943] where=0x0 val=0xffff0001
>> >> [    9.322942] where=0x0 val=0xffff0001
>> >> [    9.458943] where=0x0 val=0xffff0001
>> >> [    9.726942] where=0x0 val=0x9538086    >> actual vendor and device id.
>> >>
>> >> so I think I have to retry in RC driver, so the old code still holds good.
>> >> except that I have to do factoring out
>> > You need to return 0xFFFF0001 for vendor ID register and return 0xFFFFFFFF for
>> > other registers like COMMAND register during the CRS period.
>
> The proposal we're trying to implement is to handle this controller as
> an RP that does not support CRS SV.  Such an RP would never return
> 0xffff0001 for the vendor/device ID.  If it never got a successful
> completion, it would return 0xffffffff.
>
> So I think you do have to either retry (as in your v7 patch) or add a
> delay before we start enumeration.
>
> It looks like we waited somewhere between 320ms and 590ms before this
> device became ready.
>
> The PCI specs do require a delay after a reset (PCIe r3.1, sec 6.6.1)
> before software sends a config request.  I don't think there's
> anywhere in the PCI core where we delay before we first enumerate
> devices under a host bridge.  I'm not sure we'd *want* such a delay in
> the PCI core, because on many systems the BIOS has already initialized
> the PCI controller, and an additional delay would be unnecessary and
> would only slow down boot.
>
> But it might make sense to add a delay in the PCI controller driver --
> it knows when the reset occurs and probably knows more about the
> firmware environment.  I haven't tried to analyze all of sec 6.6.1,
> but this section:
>
>   Unless Readiness Notifications mechanisms are used (see Section
>   6.23), the Root Complex and/or system software must allow at least
>   1.0 s after a Conventional Reset of a device, before it may
>   determine that a device which fails to return a Successful
>   Completion status for a valid Configuration Request is a broken
>   device. This period is independent of how quickly Link training
>   completes.
>
>   Note: This delay is analogous to the Trhfa parameter specified for
>   PCI/PCI-X, and is intended to allow an adequate amount of time for
>   devices which require self initialization.
>
> makes it sound like a 1sec delay might be needed.  That sounds like an
> awful lot, but this device did take close to that amount of time.
>
> I don't care which way you go.  You've already implemented the delay
> in the v7 patch, and that's fine with me.
>

Thanks for your inputs Bjorn.
I will have v8 patch which will have;
factored out separate patch + retry implementation of v7 patch.

> Bjorn
diff mbox

Patch

diff --git a/drivers/pci/host/pcie-iproc.c b/drivers/pci/host/pcie-iproc.c
index c574863..a3edae4 100644
--- a/drivers/pci/host/pcie-iproc.c
+++ b/drivers/pci/host/pcie-iproc.c
@@ -68,6 +68,9 @@ 
 #define APB_ERR_EN_SHIFT             0
 #define APB_ERR_EN                   BIT(APB_ERR_EN_SHIFT)
 
+#define CFG_RETRY_STATUS             0xffff0001
+#define CFG_RETRY_STATUS_TIMEOUT_US  500000 /* 500 milli-seconds. */
+
 /* derive the enum index of the outbound/inbound mapping registers */
 #define MAP_REG(base_reg, index)      ((base_reg) + (index) * 2)
 
@@ -448,6 +451,89 @@  static inline void iproc_pcie_apb_err_disable(struct pci_bus *bus,
 	}
 }
 
+static unsigned int iproc_pcie_cfg_retry(void __iomem *cfg_data_p)
+{
+	int timeout = CFG_RETRY_STATUS_TIMEOUT_US;
+	unsigned int data;
+
+	/*
+	 * As per PCIe spec r3.1, sec 2.3.2, CRS Software
+	 * Visibility only affects config read of the Vendor ID.
+	 * For config write or any other config read the Root must
+	 * automatically re-issue configuration request again as a
+	 * new request.
+	 *
+	 * For config reads, this hardware returns CFG_RETRY_STATUS data when
+	 * it receives a CRS completion for a config read, regardless of the
+	 * address of the read or the CRS Software Visibility Enable bit. As a
+	 * partial workaround for this, we retry in software any read that
+	 * returns CFG_RETRY_STATUS.
+	 */
+	data = readl(cfg_data_p);
+	while (data == CFG_RETRY_STATUS && timeout--) {
+		udelay(1);
+		data = readl(cfg_data_p);
+	}
+
+	if (data == CFG_RETRY_STATUS)
+		data = 0xffffffff;
+
+	return data;
+}
+
+static void __iomem *iproc_pcie_map_ep_cfg_reg(struct iproc_pcie *pcie,
+					       unsigned int busno,
+					       unsigned int slot,
+					       unsigned int fn,
+					       int where)
+{
+	u16 offset;
+	u32 val;
+
+	/* EP device access */
+	val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
+		(slot << CFG_ADDR_DEV_NUM_SHIFT) |
+		(fn << CFG_ADDR_FUNC_NUM_SHIFT) |
+		(where & CFG_ADDR_REG_NUM_MASK) |
+		(1 & CFG_ADDR_CFG_TYPE_MASK);
+
+	iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
+	offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
+
+	if (iproc_pcie_reg_is_invalid(offset))
+		return NULL;
+
+	return (pcie->base + offset);
+}
+
+static int iproc_pcie_config_read(struct pci_bus *bus, unsigned int devfn,
+				    int where, int size, u32 *val)
+{
+	struct iproc_pcie *pcie = iproc_data(bus);
+	unsigned int slot = PCI_SLOT(devfn);
+	unsigned int fn = PCI_FUNC(devfn);
+	unsigned int busno = bus->number;
+	void __iomem *cfg_data_p;
+	u32 data;
+
+	/* root complex access. */
+	if (busno == 0)
+		return pci_generic_config_read32(bus, devfn, where, size, val);
+
+	cfg_data_p = iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
+
+	if (!cfg_data_p)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	data = iproc_pcie_cfg_retry(cfg_data_p);
+
+	*val = data;
+	if (size <= 2)
+		*val = (data >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
 /**
  * Note access to the configuration registers are protected at the higher layer
  * by 'pci_lock' in drivers/pci/access.c
@@ -459,7 +545,6 @@  static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
 {
 	unsigned slot = PCI_SLOT(devfn);
 	unsigned fn = PCI_FUNC(devfn);
-	u32 val;
 	u16 offset;
 
 	/* root complex access */
@@ -484,18 +569,7 @@  static void __iomem *iproc_pcie_map_cfg_bus(struct iproc_pcie *pcie,
 		if (slot > 0)
 			return NULL;
 
-	/* EP device access */
-	val = (busno << CFG_ADDR_BUS_NUM_SHIFT) |
-		(slot << CFG_ADDR_DEV_NUM_SHIFT) |
-		(fn << CFG_ADDR_FUNC_NUM_SHIFT) |
-		(where & CFG_ADDR_REG_NUM_MASK) |
-		(1 & CFG_ADDR_CFG_TYPE_MASK);
-	iproc_pcie_write_reg(pcie, IPROC_PCIE_CFG_ADDR, val);
-	offset = iproc_pcie_reg_offset(pcie, IPROC_PCIE_CFG_DATA);
-	if (iproc_pcie_reg_is_invalid(offset))
-		return NULL;
-	else
-		return (pcie->base + offset);
+	return iproc_pcie_map_ep_cfg_reg(pcie, busno, slot, fn, where);
 }
 
 static void __iomem *iproc_pcie_bus_map_cfg_bus(struct pci_bus *bus,
@@ -554,8 +628,13 @@  static int iproc_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
 				    int where, int size, u32 *val)
 {
 	int ret;
+	struct iproc_pcie *pcie = iproc_data(bus);
 
 	iproc_pcie_apb_err_disable(bus, true);
+	if (pcie->type == IPROC_PCIE_PAXB_V2)
+		ret = iproc_pcie_config_read(bus, devfn, where, size, val);
+	else
+		ret = pci_generic_config_read32(bus, devfn, where, size, val);
 	ret = pci_generic_config_read32(bus, devfn, where, size, val);
 	iproc_pcie_apb_err_disable(bus, false);