diff mbox series

[PULL,2/7] s390x: do a subsystem reset before the unprotect on reboot

Message ID 20230912114112.296428-3-thuth@redhat.com
State New
Headers show
Series [PULL,1/7] s390x/ap: fix missing subsystem reset registration | expand

Commit Message

Thomas Huth Sept. 12, 2023, 11:41 a.m. UTC
From: Janosch Frank <frankja@linux.ibm.com>

Bound APQNs have to be reset before tearing down the secure config via
s390_machine_unprotect(). Otherwise the Ultravisor will return a error
code.

So let's do a subsystem_reset() which includes a AP reset before the
unprotect call. We'll do a full device_reset() afterwards which will
reset some devices twice. That's ok since we can't move the
device_reset() before the unprotect as it includes a CPU clear reset
which the Ultravisor does not expect at that point in time.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20230901114851.154357-1-frankja@linux.ibm.com>
Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
---
 hw/s390x/s390-virtio-ccw.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Cédric Le Goater Jan. 10, 2024, 6:30 p.m. UTC | #1
On 9/12/23 13:41, Thomas Huth wrote:
> From: Janosch Frank <frankja@linux.ibm.com>
> 
> Bound APQNs have to be reset before tearing down the secure config via
> s390_machine_unprotect(). Otherwise the Ultravisor will return a error
> code.
> 
> So let's do a subsystem_reset() which includes a AP reset before the
> unprotect call. We'll do a full device_reset() afterwards which will
> reset some devices twice. That's ok since we can't move the
> device_reset() before the unprotect as it includes a CPU clear reset
> which the Ultravisor does not expect at that point in time.
> 
> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
> Message-ID: <20230901114851.154357-1-frankja@linux.ibm.com>
> Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>   hw/s390x/s390-virtio-ccw.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> index 3dd0b2372d..2d75f2131f 100644
> --- a/hw/s390x/s390-virtio-ccw.c
> +++ b/hw/s390x/s390-virtio-ccw.c
> @@ -438,10 +438,20 @@ static void s390_machine_reset(MachineState *machine, ShutdownCause reason)
>       switch (reset_type) {
>       case S390_RESET_EXTERNAL:
>       case S390_RESET_REIPL:
> +        /*
> +         * Reset the subsystem which includes a AP reset. If a PV
> +         * guest had APQNs attached the AP reset is a prerequisite to
> +         * unprotecting since the UV checks if all APQNs are reset.
> +         */
> +        subsystem_reset();


This commit introduced a regression with pass-though ISM devices.

After startup, a reboot will generate extra device resets (vfio-pci in
this case) which break the pass-though ISM device in a subtle way,
probably related to IOMMU mapping according to 03451953c79e
("s390x/pci: reset ISM passthrough devices on shutdown and system
reset"). After poweroff, the device is left in a sort-of-a-use state
on the host and the LPAR has to be rebooted to clear the invalid state
of the device. To be noted, that standard PCI devices are immune to
this change.

The extra resets should avoided in some ways, (a shutdown notifier and
a reset callback are already registered for ISM devices by 03451953c79e)
and, most important, once the VM terminates, the device resources
should be cleared in the host kernel. So there seem to be two issues
to address in mainline QEMU and in Linux AFAICT.

Thanks,

C.



>           if (s390_is_pv()) {
>               s390_machine_unprotect(ms);
>           }
>   
> +        /*
> +         * Device reset includes CPU clear resets so this has to be
> +         * done AFTER the unprotect call above.
> +         */
>           qemu_devices_reset(reason);
>           s390_crypto_reset();
>
Matthew Rosato Jan. 10, 2024, 8:28 p.m. UTC | #2
On 1/10/24 1:30 PM, Cédric Le Goater wrote:
> On 9/12/23 13:41, Thomas Huth wrote:
>> From: Janosch Frank <frankja@linux.ibm.com>
>>
>> Bound APQNs have to be reset before tearing down the secure config via
>> s390_machine_unprotect(). Otherwise the Ultravisor will return a error
>> code.
>>
>> So let's do a subsystem_reset() which includes a AP reset before the
>> unprotect call. We'll do a full device_reset() afterwards which will
>> reset some devices twice. That's ok since we can't move the
>> device_reset() before the unprotect as it includes a CPU clear reset
>> which the Ultravisor does not expect at that point in time.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>> Message-ID: <20230901114851.154357-1-frankja@linux.ibm.com>
>> Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
>> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>> ---
>>   hw/s390x/s390-virtio-ccw.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
>> index 3dd0b2372d..2d75f2131f 100644
>> --- a/hw/s390x/s390-virtio-ccw.c
>> +++ b/hw/s390x/s390-virtio-ccw.c
>> @@ -438,10 +438,20 @@ static void s390_machine_reset(MachineState *machine, ShutdownCause reason)
>>       switch (reset_type) {
>>       case S390_RESET_EXTERNAL:
>>       case S390_RESET_REIPL:
>> +        /*
>> +         * Reset the subsystem which includes a AP reset. If a PV
>> +         * guest had APQNs attached the AP reset is a prerequisite to
>> +         * unprotecting since the UV checks if all APQNs are reset.
>> +         */
>> +        subsystem_reset();
> 
> 
> This commit introduced a regression with pass-though ISM devices.
> 
> After startup, a reboot will generate extra device resets (vfio-pci in
> this case) which break the pass-though ISM device in a subtle way,

Hi Cedric, thanks for reporting this...  I was able to reproduce just now, and it looks like ISM firmware is unhappy specifically with this susbystem_reset call added by ef1535901a0, not necessarily the multiple attempts at reset -- I verified that reverting ef1535901a0 resolves the ISM issue, but if I instead try reverting the older 03451953c79e while leaving ef1535901a0 in place then ISM devices still break on guest reboot.


> probably related to IOMMU mapping according to 03451953c79e
> ("s390x/pci: reset ISM passthrough devices on shutdown and system
> reset"). After poweroff, the device is left in a sort-of-a-use state
> on the host and the LPAR has to be rebooted to clear the invalid state
> of the device. To be noted, that standard PCI devices are immune to
> this change.

As a bit of background, ISM firmware is very sensitive re: the contents of the (host) IOMMU and attempts at manipulation that it deems to be out-of-order; the point of 03451953c79e was to ensure that the device gets a reset before we attempt at unmapping anything that wasn't cleaned up in an orderly fashion by the (guest) ism driver at the time of shutdown/reset (e.g. underlying firmware may view guest SBAs in the IOMMU as still registered for use and will throw an error condition at attempts to remove their entries in the IOMMU without first going through an unregistration process).

The unmap that would make ISM upset would generally be coming out of vfio_listener_region_del where we just do one big vfio_dma_unmap -- a quick trace shows that the subsystem_reset call added by ef1535901a0 is causing the vfio_listener_region_del to once again trigger before the pci reset of the ISM device, effectively re-introducing the condition that 03451953c79e was trying to resolve.

> 
> The extra resets should avoided in some ways, (a shutdown notifier and
> a reset callback are already registered for ISM devices by 03451953c79e)

So as mentioned above, it's not the extra resets that are the issue, it's the order of operations.  Basically, we need to drive pci_device_reset for any ISM device associated with the guest before we destroy the vfio memory listener (now triggered in this case via subsystem_reset).  So if we must drive this subsystem_reset before we trigger the device reset callbacks then it might require a s390 pci bus routine that is called before or during subystem_reset just to reset the ISM devices associated with this guest first; I'm not sure yet.

As an aside:  I wonder why we are always doing the subsystem_reset here unconditionally rather than only when s390_is_pv() since that seems to be the only case that requires it.

> and, most important, once the VM terminates, the device resources
> should be cleared in the host kernel. So there seem to be two issues
> to address in mainline QEMU and in Linux AFAICT.

Because of the condition detected by ISM firmware as described above, the host device was placed in an error state and remains in that state.  After shutting down the guest, you should be able to use zpcictl --reset on the affected host device(s) to clear the error condition and re-enable it for use.

Thanks,
Matt
Cédric Le Goater Jan. 11, 2024, 9:43 a.m. UTC | #3
On 1/10/24 21:28, Matthew Rosato wrote:
> On 1/10/24 1:30 PM, Cédric Le Goater wrote:
>> On 9/12/23 13:41, Thomas Huth wrote:
>>> From: Janosch Frank <frankja@linux.ibm.com>
>>>
>>> Bound APQNs have to be reset before tearing down the secure config via
>>> s390_machine_unprotect(). Otherwise the Ultravisor will return a error
>>> code.
>>>
>>> So let's do a subsystem_reset() which includes a AP reset before the
>>> unprotect call. We'll do a full device_reset() afterwards which will
>>> reset some devices twice. That's ok since we can't move the
>>> device_reset() before the unprotect as it includes a CPU clear reset
>>> which the Ultravisor does not expect at that point in time.
>>>
>>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>>> Message-ID: <20230901114851.154357-1-frankja@linux.ibm.com>
>>> Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
>>> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>> ---
>>>    hw/s390x/s390-virtio-ccw.c | 10 ++++++++++
>>>    1 file changed, 10 insertions(+)
>>>
>>> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
>>> index 3dd0b2372d..2d75f2131f 100644
>>> --- a/hw/s390x/s390-virtio-ccw.c
>>> +++ b/hw/s390x/s390-virtio-ccw.c
>>> @@ -438,10 +438,20 @@ static void s390_machine_reset(MachineState *machine, ShutdownCause reason)
>>>        switch (reset_type) {
>>>        case S390_RESET_EXTERNAL:
>>>        case S390_RESET_REIPL:
>>> +        /*
>>> +         * Reset the subsystem which includes a AP reset. If a PV
>>> +         * guest had APQNs attached the AP reset is a prerequisite to
>>> +         * unprotecting since the UV checks if all APQNs are reset.
>>> +         */
>>> +        subsystem_reset();
>>
>>
>> This commit introduced a regression with pass-though ISM devices.
>>
>> After startup, a reboot will generate extra device resets (vfio-pci in
>> this case) which break the pass-though ISM device in a subtle way,
> 
> Hi Cedric, thanks for reporting this...  I was able to reproduce just now, and it looks like ISM firmware is unhappy specifically with this susbystem_reset call added by ef1535901a0, not necessarily the multiple attempts at reset -- I verified that reverting ef1535901a0 resolves the ISM issue, but if I instead try reverting the older 03451953c79e while leaving ef1535901a0 in place then ISM devices still break on guest reboot.
> 
> 
>> probably related to IOMMU mapping according to 03451953c79e
>> ("s390x/pci: reset ISM passthrough devices on shutdown and system
>> reset"). After poweroff, the device is left in a sort-of-a-use state
>> on the host and the LPAR has to be rebooted to clear the invalid state
>> of the device. To be noted, that standard PCI devices are immune to
>> this change.
> 
> As a bit of background, ISM firmware is very sensitive re: the contents of the (host) IOMMU and attempts at manipulation that it deems to be out-of-order; the point of 03451953c79e was to ensure that the device gets a reset before we attempt at unmapping anything that wasn't cleaned up in an orderly fashion by the (guest) ism driver at the time of shutdown/reset (e.g. underlying firmware may view guest SBAs in the IOMMU as still registered for use and will throw an error condition at attempts to remove their entries in the IOMMU without first going through an unregistration process).
> 
> The unmap that would make ISM upset would generally be coming out of vfio_listener_region_del where we just do one big vfio_dma_unmap -- a quick trace shows that the subsystem_reset call added by ef1535901a0 is causing the vfio_listener_region_del to once again trigger before the pci reset of the ISM device, effectively re-introducing the condition that 03451953c79e was trying to resolve.

Yes. I saw the vfio_listener_region_del trace coming first and came to
the conclusion it was related to IOMMU mappings.

>> The extra resets should avoided in some ways, (a shutdown notifier and
>> a reset callback are already registered for ISM devices by 03451953c79e)
> 
> So as mentioned above, it's not the extra resets that are the issue, it's the order of operations.  Basically, we need to drive pci_device_reset for any ISM device associated with the guest before we destroy the vfio memory listener (now triggered in this case via subsystem_reset).  So if we must drive this subsystem_reset before we trigger the device reset callbacks then it might require a s390 pci bus routine that is called before or during subystem_reset just to reset the ISM devices associated with this guest first; I'm not sure yet.
> 
> As an aside:  I wonder why we are always doing the subsystem_reset here unconditionally rather than only when s390_is_pv() since that seems to be the only case that requires it.

That would be a start to workaround the issue.
  
>> and, most important, once the VM terminates, the device resources
>> should be cleared in the host kernel. So there seem to be two issues
>> to address in mainline QEMU and in Linux AFAICT.
> 
> Because of the condition detected by ISM firmware as described above, the host device was placed in an error state and remains in that state. 

OK. this condition is considered serious enough to be reported to a
management level. This seems a bit excessive since the recovery can be
handled by software, but manually. Are there any plans to address this
problem ?


> After shutting down the guest, you should be able to use zpcictl --reset on the affected host device(s) to clear the error condition and re-enable it for use.

ok. That's better than reboot.


On a side note, I am also seeing :

[   73.989688] ------------[ cut here ]------------
[   73.989696] unexpected non zero alert.mask 0x20
[   73.989748] WARNING: CPU: 9 PID: 4503 at arch/s390/kvm/interrupt.c:3214 kvm_s390_gisa_destroy+0xd4/0xe8 [kvm]
[   73.989791] Modules linked in: vfio_pci vfio_pci_core irqbypass vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink 8021q garp mrp rfkill sunrpc ext4 mbcache jbd2 vfio_ap zcrypt_cex4 vfio_ccw mdev vfio_iommu_type1 vfio drm fuse i2c_core drm_panel_orientation_quirks xfs libcrc32c dm_service_time mlx5_core sd_mod t10_pi ghash_s390 sg prng des_s390 libdes sha3_512_s390 sha3_256_s390 mlxfw tls scm_block psample eadm_sch qeth_l2 bridge stp llc dasd_eckd_mod zfcp qeth dasd_mod scsi_transport_fc ccwgroup qdio dm_multipath dm_mirror dm_region_hash dm_log dm_mod pkey zcrypt kvm aes_s390
[   73.989825] CPU: 9 PID: 4503 Comm: worker Kdump: loaded Not tainted 6.7.0-clg-dirty #52
[   73.989827] Hardware name: IBM 3931 LA1 400 (LPAR)
[   73.989829] Krnl PSW : 0704c00180000000 000003ff7fcd2198 (kvm_s390_gisa_destroy+0xd8/0xe8 [kvm])
[   73.989845]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[   73.989847] Krnl GPRS: c0000000fffeffff 0000000700000027 0000000000000023 00000007df4249c8
[   73.989849]            000003800649b858 000003800649b850 00000007fcb9db00 0000000000000000
[   73.989851]            000000008ebae8c8 0000000083a8c4f0 0000000000b69900 000000008ebac000
[   73.989853]            000003ff903aef68 000003800649bd98 000003ff7fcd2194 000003800649b9f8
[   73.989859] Krnl Code: 000003ff7fcd2188: c02000024f88	larl	%r2,000003ff7fd1c098
                           000003ff7fcd218e: c0e5fffea360	brasl	%r14,000003ff7fca684e
                          #000003ff7fcd2194: af000000		mc	0,0
                          >000003ff7fcd2198: e310b7680204	lg	%r1,10088(%r11)
                           000003ff7fcd219e: a7f4ffae		brc	15,000003ff7fcd20fa
                           000003ff7fcd21a2: 0707		bcr	0,%r7
                           000003ff7fcd21a4: 0707		bcr	0,%r7
                           000003ff7fcd21a6: 0707		bcr	0,%r7
[   73.989929] Call Trace:
[   73.989931]  [<000003ff7fcd2198>] kvm_s390_gisa_destroy+0xd8/0xe8 [kvm]
[   73.989946] ([<000003ff7fcd2194>] kvm_s390_gisa_destroy+0xd4/0xe8 [kvm])
[   73.989960]  [<000003ff7fcc1578>] kvm_arch_destroy_vm+0x50/0x118 [kvm]
[   73.989974]  [<000003ff7fcb00a2>] kvm_destroy_vm+0x15a/0x260 [kvm]
[   73.989985]  [<000003ff7fcb021e>] kvm_vm_release+0x36/0x48 [kvm]
[   73.989996]  [<00000007de4f830c>] __fput+0x94/0x2d0
[   73.990009]  [<00000007de20d838>] task_work_run+0x88/0xe8
[   73.990013]  [<00000007de1e75e0>] do_exit+0x2e0/0x4e0
[   73.990016]  [<00000007de1e79c0>] do_group_exit+0x40/0xb8
[   73.990017]  [<00000007de1f96e8>] send_sig_info+0x0/0xa8
[   73.990021]  [<00000007de194b26>] arch_do_signal_or_restart+0x56/0x318
[   73.990025]  [<00000007de28bf12>] exit_to_user_mode_prepare+0x10a/0x1a0
[   73.990028]  [<00000007deb607d2>] __do_syscall+0x152/0x1f8
[   73.990032]  [<00000007deb70ac8>] system_call+0x70/0x98
[   73.990036] Last Breaking-Event-Address:
[   73.990037]  [<00000007de1e0c58>] __warn_printk+0x78/0xe8


Thanks,

C.
Christian Borntraeger Jan. 11, 2024, 10:18 a.m. UTC | #4
Am 11.01.24 um 10:43 schrieb Cédric Le Goater:
[...]
> 
> 
> On a side note, I am also seeing :

Michael?

> 
> [   73.989688] ------------[ cut here ]------------
> [   73.989696] unexpected non zero alert.mask 0x20
> [   73.989748] WARNING: CPU: 9 PID: 4503 at arch/s390/kvm/interrupt.c:3214 kvm_s390_gisa_destroy+0xd4/0xe8 [kvm]
> [   73.989791] Modules linked in: vfio_pci vfio_pci_core irqbypass vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink 8021q garp mrp rfkill sunrpc ext4 mbcache jbd2 vfio_ap zcrypt_cex4 vfio_ccw mdev vfio_iommu_type1 vfio drm fuse i2c_core drm_panel_orientation_quirks xfs libcrc32c dm_service_time mlx5_core sd_mod t10_pi ghash_s390 sg prng des_s390 libdes sha3_512_s390 sha3_256_s390 mlxfw tls scm_block psample eadm_sch qeth_l2 bridge stp llc dasd_eckd_mod zfcp qeth dasd_mod scsi_transport_fc ccwgroup qdio dm_multipath dm_mirror dm_region_hash dm_log dm_mod pkey zcrypt kvm aes_s390
> [   73.989825] CPU: 9 PID: 4503 Comm: worker Kdump: loaded Not tainted 6.7.0-clg-dirty #52
> [   73.989827] Hardware name: IBM 3931 LA1 400 (LPAR)
> [   73.989829] Krnl PSW : 0704c00180000000 000003ff7fcd2198 (kvm_s390_gisa_destroy+0xd8/0xe8 [kvm])
> [   73.989845]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [   73.989847] Krnl GPRS: c0000000fffeffff 0000000700000027 0000000000000023 00000007df4249c8
> [   73.989849]            000003800649b858 000003800649b850 00000007fcb9db00 0000000000000000
> [   73.989851]            000000008ebae8c8 0000000083a8c4f0 0000000000b69900 000000008ebac000
> [   73.989853]            000003ff903aef68 000003800649bd98 000003ff7fcd2194 000003800649b9f8
> [   73.989859] Krnl Code: 000003ff7fcd2188: c02000024f88    larl    %r2,000003ff7fd1c098
>                            000003ff7fcd218e: c0e5fffea360    brasl    %r14,000003ff7fca684e
>                           #000003ff7fcd2194: af000000        mc    0,0
>                           >000003ff7fcd2198: e310b7680204    lg    %r1,10088(%r11)
>                            000003ff7fcd219e: a7f4ffae        brc    15,000003ff7fcd20fa
>                            000003ff7fcd21a2: 0707        bcr    0,%r7
>                            000003ff7fcd21a4: 0707        bcr    0,%r7
>                            000003ff7fcd21a6: 0707        bcr    0,%r7
> [   73.989929] Call Trace:
> [   73.989931]  [<000003ff7fcd2198>] kvm_s390_gisa_destroy+0xd8/0xe8 [kvm]
> [   73.989946] ([<000003ff7fcd2194>] kvm_s390_gisa_destroy+0xd4/0xe8 [kvm])
> [   73.989960]  [<000003ff7fcc1578>] kvm_arch_destroy_vm+0x50/0x118 [kvm]
> [   73.989974]  [<000003ff7fcb00a2>] kvm_destroy_vm+0x15a/0x260 [kvm]
> [   73.989985]  [<000003ff7fcb021e>] kvm_vm_release+0x36/0x48 [kvm]
> [   73.989996]  [<00000007de4f830c>] __fput+0x94/0x2d0
> [   73.990009]  [<00000007de20d838>] task_work_run+0x88/0xe8
> [   73.990013]  [<00000007de1e75e0>] do_exit+0x2e0/0x4e0
> [   73.990016]  [<00000007de1e79c0>] do_group_exit+0x40/0xb8
> [   73.990017]  [<00000007de1f96e8>] send_sig_info+0x0/0xa8
> [   73.990021]  [<00000007de194b26>] arch_do_signal_or_restart+0x56/0x318
> [   73.990025]  [<00000007de28bf12>] exit_to_user_mode_prepare+0x10a/0x1a0
> [   73.990028]  [<00000007deb607d2>] __do_syscall+0x152/0x1f8
> [   73.990032]  [<00000007deb70ac8>] system_call+0x70/0x98
> [   73.990036] Last Breaking-Event-Address:
> [   73.990037]  [<00000007de1e0c58>] __warn_printk+0x78/0xe8
> 
>
Janosch Frank Jan. 11, 2024, 3 p.m. UTC | #5
On 1/11/24 10:43, Cédric Le Goater wrote:
> On 1/10/24 21:28, Matthew Rosato wrote:
>> On 1/10/24 1:30 PM, Cédric Le Goater wrote:
>>> On 9/12/23 13:41, Thomas Huth wrote:
>>>> From: Janosch Frank <frankja@linux.ibm.com>
>>>>
>>>> Bound APQNs have to be reset before tearing down the secure config via
>>>> s390_machine_unprotect(). Otherwise the Ultravisor will return a error
>>>> code.
>>>>
>>>> So let's do a subsystem_reset() which includes a AP reset before the
>>>> unprotect call. We'll do a full device_reset() afterwards which will
>>>> reset some devices twice. That's ok since we can't move the
>>>> device_reset() before the unprotect as it includes a CPU clear reset
>>>> which the Ultravisor does not expect at that point in time.
>>>>
>>>> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
>>>> Message-ID: <20230901114851.154357-1-frankja@linux.ibm.com>
>>>> Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
>>>> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>>> Signed-off-by: Thomas Huth <thuth@redhat.com>
>>>> ---
>>>>     hw/s390x/s390-virtio-ccw.c | 10 ++++++++++
>>>>     1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
>>>> index 3dd0b2372d..2d75f2131f 100644
>>>> --- a/hw/s390x/s390-virtio-ccw.c
>>>> +++ b/hw/s390x/s390-virtio-ccw.c
>>>> @@ -438,10 +438,20 @@ static void s390_machine_reset(MachineState *machine, ShutdownCause reason)
>>>>         switch (reset_type) {
>>>>         case S390_RESET_EXTERNAL:
>>>>         case S390_RESET_REIPL:
>>>> +        /*
>>>> +         * Reset the subsystem which includes a AP reset. If a PV
>>>> +         * guest had APQNs attached the AP reset is a prerequisite to
>>>> +         * unprotecting since the UV checks if all APQNs are reset.
>>>> +         */
>>>> +        subsystem_reset();
>>>
>>>
>>> This commit introduced a regression with pass-though ISM devices.
>>>
>>> After startup, a reboot will generate extra device resets (vfio-pci in
>>> this case) which break the pass-though ISM device in a subtle way,
>>
>> Hi Cedric, thanks for reporting this...  I was able to reproduce just now, and it looks like ISM firmware is unhappy specifically with this susbystem_reset call added by ef1535901a0, not necessarily the multiple attempts at reset -- I verified that reverting ef1535901a0 resolves the ISM issue, but if I instead try reverting the older 03451953c79e while leaving ef1535901a0 in place then ISM devices still break on guest reboot.
>>
>>
>>> probably related to IOMMU mapping according to 03451953c79e
>>> ("s390x/pci: reset ISM passthrough devices on shutdown and system
>>> reset"). After poweroff, the device is left in a sort-of-a-use state
>>> on the host and the LPAR has to be rebooted to clear the invalid state
>>> of the device. To be noted, that standard PCI devices are immune to
>>> this change.
>>
>> As a bit of background, ISM firmware is very sensitive re: the contents of the (host) IOMMU and attempts at manipulation that it deems to be out-of-order; the point of 03451953c79e was to ensure that the device gets a reset before we attempt at unmapping anything that wasn't cleaned up in an orderly fashion by the (guest) ism driver at the time of shutdown/reset (e.g. underlying firmware may view guest SBAs in the IOMMU as still registered for use and will throw an error condition at attempts to remove their entries in the IOMMU without first going through an unregistration process).
>>
>> The unmap that would make ISM upset would generally be coming out of vfio_listener_region_del where we just do one big vfio_dma_unmap -- a quick trace shows that the subsystem_reset call added by ef1535901a0 is causing the vfio_listener_region_del to once again trigger before the pci reset of the ISM device, effectively re-introducing the condition that 03451953c79e was trying to resolve.
> 
> Yes. I saw the vfio_listener_region_del trace coming first and came to
> the conclusion it was related to IOMMU mappings.
> 
>>> The extra resets should avoided in some ways, (a shutdown notifier and
>>> a reset callback are already registered for ISM devices by 03451953c79e)
>>
>> So as mentioned above, it's not the extra resets that are the issue, it's the order of operations.  Basically, we need to drive pci_device_reset for any ISM device associated with the guest before we destroy the vfio memory listener (now triggered in this case via subsystem_reset).  So if we must drive this subsystem_reset before we trigger the device reset callbacks then it might require a s390 pci bus routine that is called before or during subystem_reset just to reset the ISM devices associated with this guest first; I'm not sure yet.
>>
>> As an aside:  I wonder why we are always doing the subsystem_reset here unconditionally rather than only when s390_is_pv() since that seems to be the only case that requires it.
> 
> That would be a start to workaround the issue.
>    

I can have a look into that, I'm just a bit hesitant since there's an 
ominous feeling in my brain that tells me that there might have been a 
reason that I forgot after this patch was pulled.
Matthew Rosato Jan. 11, 2024, 3:26 p.m. UTC | #6
On 1/11/24 5:18 AM, Christian Borntraeger wrote:
> 
> 
> Am 11.01.24 um 10:43 schrieb Cédric Le Goater:
> [...]
>>
>>
>> On a side note, I am also seeing :
> 
> Michael?
> 

Hmm, it looks like this warning is tripping because we have a path in PCI passthrough where we don't unregister the gisc.  This warning notices that at the time we destroy the VM (it believes that are still consumers of the guest ISC so the bit in the alert mask is still on) -- after cutting the warning the code forces the alerts off at least.

Not sure yet if that is directly related to the device going into error condition or if it is an independent issue, will have a look.

>>
>> [   73.989688] ------------[ cut here ]------------
>> [   73.989696] unexpected non zero alert.mask 0x20
>> [   73.989748] WARNING: CPU: 9 PID: 4503 at arch/s390/kvm/interrupt.c:3214 kvm_s390_gisa_destroy+0xd4/0xe8 [kvm]
>> [   73.989791] Modules linked in: vfio_pci vfio_pci_core irqbypass vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink 8021q garp mrp rfkill sunrpc ext4 mbcache jbd2 vfio_ap zcrypt_cex4 vfio_ccw mdev vfio_iommu_type1 vfio drm fuse i2c_core drm_panel_orientation_quirks xfs libcrc32c dm_service_time mlx5_core sd_mod t10_pi ghash_s390 sg prng des_s390 libdes sha3_512_s390 sha3_256_s390 mlxfw tls scm_block psample eadm_sch qeth_l2 bridge stp llc dasd_eckd_mod zfcp qeth dasd_mod scsi_transport_fc ccwgroup qdio dm_multipath dm_mirror dm_region_hash dm_log dm_mod pkey zcrypt kvm aes_s390
>> [   73.989825] CPU: 9 PID: 4503 Comm: worker Kdump: loaded Not tainted 6.7.0-clg-dirty #52
>> [   73.989827] Hardware name: IBM 3931 LA1 400 (LPAR)
>> [   73.989829] Krnl PSW : 0704c00180000000 000003ff7fcd2198 (kvm_s390_gisa_destroy+0xd8/0xe8 [kvm])
>> [   73.989845]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>> [   73.989847] Krnl GPRS: c0000000fffeffff 0000000700000027 0000000000000023 00000007df4249c8
>> [   73.989849]            000003800649b858 000003800649b850 00000007fcb9db00 0000000000000000
>> [   73.989851]            000000008ebae8c8 0000000083a8c4f0 0000000000b69900 000000008ebac000
>> [   73.989853]            000003ff903aef68 000003800649bd98 000003ff7fcd2194 000003800649b9f8
>> [   73.989859] Krnl Code: 000003ff7fcd2188: c02000024f88    larl    %r2,000003ff7fd1c098
>>                            000003ff7fcd218e: c0e5fffea360    brasl    %r14,000003ff7fca684e
>>                           #000003ff7fcd2194: af000000        mc    0,0
>>                           >000003ff7fcd2198: e310b7680204    lg    %r1,10088(%r11)
>>                            000003ff7fcd219e: a7f4ffae        brc    15,000003ff7fcd20fa
>>                            000003ff7fcd21a2: 0707        bcr    0,%r7
>>                            000003ff7fcd21a4: 0707        bcr    0,%r7
>>                            000003ff7fcd21a6: 0707        bcr    0,%r7
>> [   73.989929] Call Trace:
>> [   73.989931]  [<000003ff7fcd2198>] kvm_s390_gisa_destroy+0xd8/0xe8 [kvm]
>> [   73.989946] ([<000003ff7fcd2194>] kvm_s390_gisa_destroy+0xd4/0xe8 [kvm])
>> [   73.989960]  [<000003ff7fcc1578>] kvm_arch_destroy_vm+0x50/0x118 [kvm]
>> [   73.989974]  [<000003ff7fcb00a2>] kvm_destroy_vm+0x15a/0x260 [kvm]
>> [   73.989985]  [<000003ff7fcb021e>] kvm_vm_release+0x36/0x48 [kvm]
>> [   73.989996]  [<00000007de4f830c>] __fput+0x94/0x2d0
>> [   73.990009]  [<00000007de20d838>] task_work_run+0x88/0xe8
>> [   73.990013]  [<00000007de1e75e0>] do_exit+0x2e0/0x4e0
>> [   73.990016]  [<00000007de1e79c0>] do_group_exit+0x40/0xb8
>> [   73.990017]  [<00000007de1f96e8>] send_sig_info+0x0/0xa8
>> [   73.990021]  [<00000007de194b26>] arch_do_signal_or_restart+0x56/0x318
>> [   73.990025]  [<00000007de28bf12>] exit_to_user_mode_prepare+0x10a/0x1a0
>> [   73.990028]  [<00000007deb607d2>] __do_syscall+0x152/0x1f8
>> [   73.990032]  [<00000007deb70ac8>] system_call+0x70/0x98
>> [   73.990036] Last Breaking-Event-Address:
>> [   73.990037]  [<00000007de1e0c58>] __warn_printk+0x78/0xe8
>>
>>
Matthew Rosato Jan. 11, 2024, 3:27 p.m. UTC | #7
On 1/11/24 4:43 AM, Cédric Le Goater wrote:

> OK. this condition is considered serious enough to be reported to a
> management level. This seems a bit excessive since the recovery can be
> handled by software, but manually. Are there any plans to address this
> problem ?

Yes, eventually.  Currently for drivers other than QEMU+vfio-pci we rely on automated PCI recovery to handle this event, but today we do not forward the event to QEMU when using vfio-pci because the code in vfio_err_notifier_handler will halt the guest.  This leaves the device in an error state to be recovered manually.  The intent is to eventually address this by triggering the guest to initiate the recovery action rather than halting the guest for s390.
diff mbox series

Patch

diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
index 3dd0b2372d..2d75f2131f 100644
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -438,10 +438,20 @@  static void s390_machine_reset(MachineState *machine, ShutdownCause reason)
     switch (reset_type) {
     case S390_RESET_EXTERNAL:
     case S390_RESET_REIPL:
+        /*
+         * Reset the subsystem which includes a AP reset. If a PV
+         * guest had APQNs attached the AP reset is a prerequisite to
+         * unprotecting since the UV checks if all APQNs are reset.
+         */
+        subsystem_reset();
         if (s390_is_pv()) {
             s390_machine_unprotect(ms);
         }
 
+        /*
+         * Device reset includes CPU clear resets so this has to be
+         * done AFTER the unprotect call above.
+         */
         qemu_devices_reset(reason);
         s390_crypto_reset();