[v4,3/8] drm/amdgpu: Fix SMU error failure

Message ID	1599072130-10043-4-git-send-email-andrey.grodzovsky@amd.com
State	New
Headers	show Return-Path: <linux-pci-owner@vger.kernel.org> Received-SPF: None (protection.outlook.com: amd.com does not designate permitted sender hosts) From: Andrey Grodzovsky <andrey.grodzovsky@amd.com> To: <amd-gfx@lists.freedesktop.org>, <sathyanarayanan.kuppuswamy@linux.intel.com>, <linux-pci@vger.kernel.org> CC: <alexander.deucher@amd.com>, <nirmodas@amd.com>, <Dennis.Li@amd.com>, <christian.koenig@amd.com>, <luben.tuikov@amd.com>, <bhelgaas@google.com>, Andrey Grodzovsky <andrey.grodzovsky@amd.com> Subject: [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure Date: Wed, 2 Sep 2020 14:42:05 -0400 Message-ID: <1599072130-10043-4-git-send-email-andrey.grodzovsky@amd.com> In-Reply-To: <1599072130-10043-1-git-send-email-andrey.grodzovsky@amd.com> References: <1599072130-10043-1-git-send-email-andrey.grodzovsky@amd.com> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-pci-owner@vger.kernel.org Precedence: bulk
Series	Implement PCI Error Recovery on Navi12 \| expand [v4,0/8] Implement PCI Error Recovery on Navi12 [v4,1/8] drm/amdgpu: Avoid accessing HW when suspending SW state [v4,2/8] drm/amdgpu: Block all job scheduling activity during DPC recovery [v4,3/8] drm/amdgpu: Fix SMU error failure [v4,4/8] drm/amdgpu: Fix consecutive DPC recovery failures. [v4,5/8] drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code. [v4,6/8] drm/amdgpu: Disable DPC for XGMI for now. [v4,7/8] drm/amdgpu: Minor checkpatch fix [v4,8/8] Revert "PCI/ERR: Update error status after reset_link()"

Message ID

1599072130-10043-4-git-send-email-andrey.grodzovsky@amd.com

State

New

Headers

Received-SPF: None (protection.outlook.com: amd.com does not designate
 permitted sender hosts)
From: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
To: <amd-gfx@lists.freedesktop.org>,
        <sathyanarayanan.kuppuswamy@linux.intel.com>,
        <linux-pci@vger.kernel.org>
CC: <alexander.deucher@amd.com>, <nirmodas@amd.com>,
        <Dennis.Li@amd.com>, <christian.koenig@amd.com>,
        <luben.tuikov@amd.com>, <bhelgaas@google.com>,
        Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Subject: [PATCH v4 3/8] drm/amdgpu: Fix SMU error failure
Date: Wed, 2 Sep 2020 14:42:05 -0400
Message-ID: <1599072130-10043-4-git-send-email-andrey.grodzovsky@amd.com>
In-Reply-To: <1599072130-10043-1-git-send-email-andrey.grodzovsky@amd.com>
References: <1599072130-10043-1-git-send-email-andrey.grodzovsky@amd.com>
MIME-Version: 1.0
Content-Type: text/plain
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 02 Sep 2020 18:42:41.7380
 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 b65824b9-39de-4040-adb7-08d84f6ff970
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: 
 TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d;Ip=[165.204.84.17];Helo=[SATLEXMB02.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: 
 BN8NAM11FT029.eop-nam11.prod.protection.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: MN2PR12MB3838
Sender: linux-pci-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-pci.vger.kernel.org>
X-Mailing-List: linux-pci@vger.kernel.org

Series

Implement PCI Error Recovery on Navi12 | expand

Commit Message

Andrey Grodzovsky Sept. 2, 2020, 6:42 p.m. UTC

Wait for HW/PSP initiated ASIC reset to complete before
starting the recovery operations.

v2: Remove typo

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas Sept. 2, 2020, 10:05 p.m. UTC | #1

On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
> Wait for HW/PSP initiated ASIC reset to complete before
> starting the recovery operations.
> 
> v2: Remove typo
> 
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e999f1f..412d07e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>  {
>  	struct drm_device *dev = pci_get_drvdata(pdev);
>  	struct amdgpu_device *adev = drm_to_adev(dev);
> -	int r;
> +	int r, i;
>  	bool vram_lost;
> +	u32 memsize;
>  
>  	DRM_INFO("PCI error: slot reset callback!!\n");
>  
> +	/* wait for asic to come out of reset */

I know it's totally OCD, but it is a minor speed bump to read "asic"
here and "ASIC" in the commit log above and the new comment below.

> +	msleep(500);
> +
>  	pci_restore_state(pdev);
>  
> -	adev->in_pci_err_recovery = true;
> +	/* confirm  ASIC came out of reset */
> +	for (i = 0; i < adev->usec_timeout; i++) {
> +		memsize = amdgpu_asic_get_config_memsize(adev);
> +
> +		if (memsize != 0xffffffff)

I guess this is a spot where you actually depend on an MMIO read
returning 0xffffffff because adev->in_pci_err_recovery is false at
this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
instead of returning 0.  Right?

> +			break;
> +		udelay(1);
> +	}
> +	if (memsize == 0xffffffff) {
> +		r = -ETIME;
> +		goto out;
> +	}
> +
> +	/* TODO Call amdgpu_pre_asic_reset instead */
> +	adev->in_pci_err_recovery = true;	
>  	r = amdgpu_device_ip_suspend(adev);
>  	adev->in_pci_err_recovery = false;
>  	if (r)
> -- 
> 2.7.4
>

Andrey Grodzovsky Sept. 3, 2020, 3:29 p.m. UTC | #2

On 9/2/20 6:05 PM, Bjorn Helgaas wrote:
> On Wed, Sep 02, 2020 at 02:42:05PM -0400, Andrey Grodzovsky wrote:
>> Wait for HW/PSP initiated ASIC reset to complete before
>> starting the recovery operations.
>>
>> v2: Remove typo
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e999f1f..412d07e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4838,14 +4838,32 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>   {
>>   	struct drm_device *dev = pci_get_drvdata(pdev);
>>   	struct amdgpu_device *adev = drm_to_adev(dev);
>> -	int r;
>> +	int r, i;
>>   	bool vram_lost;
>> +	u32 memsize;
>>   
>>   	DRM_INFO("PCI error: slot reset callback!!\n");
>>   
>> +	/* wait for asic to come out of reset */
> I know it's totally OCD, but it is a minor speed bump to read "asic"
> here and "ASIC" in the commit log above and the new comment below.
>
>> +	msleep(500);
>> +
>>   	pci_restore_state(pdev);
>>   
>> -	adev->in_pci_err_recovery = true;
>> +	/* confirm  ASIC came out of reset */
>> +	for (i = 0; i < adev->usec_timeout; i++) {
>> +		memsize = amdgpu_asic_get_config_memsize(adev);
>> +
>> +		if (memsize != 0xffffffff)
> I guess this is a spot where you actually depend on an MMIO read
> returning 0xffffffff because adev->in_pci_err_recovery is false at
> this point, so amdgpu_mm_rreg() will actually *do* the MMIO read
> instead of returning 0.  Right?


Yes, i picked this form another place where they use it to confirm PCI confspace and
the BARs specifically are restored and active and so you can read back sane MMIO 
regs values.

Andrey


>
>> +			break;
>> +		udelay(1);
>> +	}
>> +	if (memsize == 0xffffffff) {
>> +		r = -ETIME;
>> +		goto out;
>> +	}
>> +
>> +	/* TODO Call amdgpu_pre_asic_reset instead */
>> +	adev->in_pci_err_recovery = true;	
>>   	r = amdgpu_device_ip_suspend(adev);
>>   	adev->in_pci_err_recovery = false;
>>   	if (r)
>> -- 
>> 2.7.4
>>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e999f1f..412d07e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4838,14 +4838,32 @@  pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 {
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = drm_to_adev(dev);
-	int r;
+	int r, i;
 	bool vram_lost;
+	u32 memsize;
 
 	DRM_INFO("PCI error: slot reset callback!!\n");
 
+	/* wait for asic to come out of reset */
+	msleep(500);
+
 	pci_restore_state(pdev);
 
-	adev->in_pci_err_recovery = true;
+	/* confirm  ASIC came out of reset */
+	for (i = 0; i < adev->usec_timeout; i++) {
+		memsize = amdgpu_asic_get_config_memsize(adev);
+
+		if (memsize != 0xffffffff)
+			break;
+		udelay(1);
+	}
+	if (memsize == 0xffffffff) {
+		r = -ETIME;
+		goto out;
+	}
+
+	/* TODO Call amdgpu_pre_asic_reset instead */
+	adev->in_pci_err_recovery = true;	
 	r = amdgpu_device_ip_suspend(adev);
 	adev->in_pci_err_recovery = false;
 	if (r)

[v4,3/8] drm/amdgpu: Fix SMU error failure

Commit Message

Comments

Patch