diff mbox series

KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception

Message ID 20240313072625.76804-1-vaibhav@linux.ibm.com
State New
Headers show
Series KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception | expand

Commit Message

Vaibhav Jain March 13, 2024, 7:26 a.m. UTC
This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
Book3S HV nestedv2: Do not cancel pending decrementer exception") which
prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
expiry TB' register which was higher compared to handling extra decrementer
exceptions.

This overhead of reading 'HDEC expiry TB' register has been mitigated
recently by the L0 hypervisor(PowerVM) by putting the value of this
register in L2 guest-state output buffer on trap to L1. From there the
value of this register is cached, made available in kvmhv_run_single_vcpu()
to compare it against host(L1) timebase and cancel the pending hypervisor
decrementer exception if needed.

Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Nicholas Piggin March 20, 2024, 1:43 p.m. UTC | #1
On Wed Mar 13, 2024 at 5:26 PM AEST, Vaibhav Jain wrote:
> This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
> Book3S HV nestedv2: Do not cancel pending decrementer exception") which
> prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
> was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
> expiry TB' register which was higher compared to handling extra decrementer
> exceptions.
>
> This overhead of reading 'HDEC expiry TB' register has been mitigated
> recently by the L0 hypervisor(PowerVM) by putting the value of this
> register in L2 guest-state output buffer on trap to L1. From there the
> value of this register is cached, made available in kvmhv_run_single_vcpu()
> to compare it against host(L1) timebase and cancel the pending hypervisor
> decrementer exception if needed.

Ah, I figured out the problem here. Guest entry never clears the
queued dec, because it's level triggered on the DEC MSB so it
doesn't go away when it's delivered. So upstream code is indeed
buggy and I think I take the blame for suggesting this nestedv2
workaround.

I actually don't think that is necessary though, we could treat it
like other interrupts.  I think that would solve the problem without
having to test dec here.

I am wondering though, what workload slows down that this patch
was needed in the first place. We'd only get here after a cede
returns, then we'd dequeue the dec and stop having to GET_STATE
it here.

Thanks,
Nick

>
> Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception")
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 0b921704da45..e47b954ce266 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>  	 * entering a nested guest in which case the decrementer is now owned
>  	 * by L2 and the L1 decrementer is provided in hdec_expires
>  	 */
> -	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
> +	if (kvmppc_core_pending_dec(vcpu) &&
>  			((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
>  			 (trap == BOOK3S_INTERRUPT_SYSCALL &&
>  			  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
Linux regression tracking (Thorsten Leemhuis) April 4, 2024, 8:35 a.m. UTC | #2
Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Was this regression ever resolved? Doesn't look like it, but maybe I
just missed something.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 20.03.24 14:43, Nicholas Piggin wrote:
> On Wed Mar 13, 2024 at 5:26 PM AEST, Vaibhav Jain wrote:
>> This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
>> Book3S HV nestedv2: Do not cancel pending decrementer exception") which
>> prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
>> was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
>> expiry TB' register which was higher compared to handling extra decrementer
>> exceptions.
>>
>> This overhead of reading 'HDEC expiry TB' register has been mitigated
>> recently by the L0 hypervisor(PowerVM) by putting the value of this
>> register in L2 guest-state output buffer on trap to L1. From there the
>> value of this register is cached, made available in kvmhv_run_single_vcpu()
>> to compare it against host(L1) timebase and cancel the pending hypervisor
>> decrementer exception if needed.
> 
> Ah, I figured out the problem here. Guest entry never clears the
> queued dec, because it's level triggered on the DEC MSB so it
> doesn't go away when it's delivered. So upstream code is indeed
> buggy and I think I take the blame for suggesting this nestedv2
> workaround.
> 
> I actually don't think that is necessary though, we could treat it
> like other interrupts.  I think that would solve the problem without
> having to test dec here.
> 
> I am wondering though, what workload slows down that this patch
> was needed in the first place. We'd only get here after a cede
> returns, then we'd dequeue the dec and stop having to GET_STATE
> it here.
> 
> Thanks,
> Nick
> 
>>
>> Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception")
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> ---
>>  arch/powerpc/kvm/book3s_hv.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 0b921704da45..e47b954ce266 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>>  	 * entering a nested guest in which case the decrementer is now owned
>>  	 * by L2 and the L1 decrementer is provided in hdec_expires
>>  	 */
>> -	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
>> +	if (kvmppc_core_pending_dec(vcpu) &&
>>  			((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
>>  			 (trap == BOOK3S_INTERRUPT_SYSCALL &&
>>  			  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
>
Michael Ellerman April 5, 2024, 3:20 a.m. UTC | #3
"Linux regression tracking (Thorsten Leemhuis)"
<regressions@leemhuis.info> writes:
> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> for once, to make this easily accessible to everyone.
>
> Was this regression ever resolved? Doesn't look like it, but maybe I
> just missed something.

I'm not sure how it ended up on the regression list. IMHO it's not
really a regression. It was an attempt at a performance optimisation,
which is no longer needed due to changes in (unreleased) firmware.

I haven't merged it because Nick's reply contained several questions for
Vaibhav, so I'm expecting either a reply to those or a new version of
the patch.

cheers

> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
>
> On 20.03.24 14:43, Nicholas Piggin wrote:
>> On Wed Mar 13, 2024 at 5:26 PM AEST, Vaibhav Jain wrote:
>>> This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
>>> Book3S HV nestedv2: Do not cancel pending decrementer exception") which
>>> prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
>>> was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
>>> expiry TB' register which was higher compared to handling extra decrementer
>>> exceptions.
>>>
>>> This overhead of reading 'HDEC expiry TB' register has been mitigated
>>> recently by the L0 hypervisor(PowerVM) by putting the value of this
>>> register in L2 guest-state output buffer on trap to L1. From there the
>>> value of this register is cached, made available in kvmhv_run_single_vcpu()
>>> to compare it against host(L1) timebase and cancel the pending hypervisor
>>> decrementer exception if needed.
>> 
>> Ah, I figured out the problem here. Guest entry never clears the
>> queued dec, because it's level triggered on the DEC MSB so it
>> doesn't go away when it's delivered. So upstream code is indeed
>> buggy and I think I take the blame for suggesting this nestedv2
>> workaround.
>> 
>> I actually don't think that is necessary though, we could treat it
>> like other interrupts.  I think that would solve the problem without
>> having to test dec here.
>> 
>> I am wondering though, what workload slows down that this patch
>> was needed in the first place. We'd only get here after a cede
>> returns, then we'd dequeue the dec and stop having to GET_STATE
>> it here.
>> 
>> Thanks,
>> Nick
>> 
>>>
>>> Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending decrementer exception")
>>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>>> ---
>>>  arch/powerpc/kvm/book3s_hv.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 0b921704da45..e47b954ce266 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>>>  	 * entering a nested guest in which case the decrementer is now owned
>>>  	 * by L2 and the L1 decrementer is provided in hdec_expires
>>>  	 */
>>> -	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
>>> +	if (kvmppc_core_pending_dec(vcpu) &&
>>>  			((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
>>>  			 (trap == BOOK3S_INTERRUPT_SYSCALL &&
>>>  			  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
>>
Linux regression tracking (Thorsten Leemhuis) April 5, 2024, 5:54 a.m. UTC | #4
On 05.04.24 05:20, Michael Ellerman wrote:
> "Linux regression tracking (Thorsten Leemhuis)"
> <regressions@leemhuis.info> writes:
>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>> for once, to make this easily accessible to everyone.
>>
>> Was this regression ever resolved? Doesn't look like it, but maybe I
>> just missed something.
> 
> I'm not sure how it ended up on the regression list.

That is easy to explain: I let lei search for mails containing words
like regress, bisect, and revert to become aware of regressions that
might need tracking. And...

> IMHO it's not really a regression.

...sometimes I misjudge or misinterpret something and add it to the
regression tracking. Looks like that happened here.

Sorry for that and the noise it caused!

#regzbot resolve: invalid: was not really a regression in the first place

Ciao, Thorsten
Michael Ellerman April 8, 2024, 5:20 a.m. UTC | #5
Thorsten Leemhuis <regressions@leemhuis.info> writes:
> On 05.04.24 05:20, Michael Ellerman wrote:
>> "Linux regression tracking (Thorsten Leemhuis)"
>> <regressions@leemhuis.info> writes:
>>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
>>> for once, to make this easily accessible to everyone.
>>>
>>> Was this regression ever resolved? Doesn't look like it, but maybe I
>>> just missed something.
>> 
>> I'm not sure how it ended up on the regression list.
>
> That is easy to explain: I let lei search for mails containing words
> like regress, bisect, and revert to become aware of regressions that
> might need tracking. And...
>
>> IMHO it's not really a regression.
>
> ...sometimes I misjudge or misinterpret something and add it to the
> regression tracking. Looks like that happened here.
>
> Sorry for that and the noise it caused!

No worries.

cheers
Nicholas Piggin April 10, 2024, 4:28 a.m. UTC | #6
On Mon Apr 8, 2024 at 3:20 PM AEST, Michael Ellerman wrote:
> Thorsten Leemhuis <regressions@leemhuis.info> writes:
> > On 05.04.24 05:20, Michael Ellerman wrote:
> >> "Linux regression tracking (Thorsten Leemhuis)"
> >> <regressions@leemhuis.info> writes:
> >>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >>> for once, to make this easily accessible to everyone.
> >>>
> >>> Was this regression ever resolved? Doesn't look like it, but maybe I
> >>> just missed something.
> >> 
> >> I'm not sure how it ended up on the regression list.
> >
> > That is easy to explain: I let lei search for mails containing words
> > like regress, bisect, and revert to become aware of regressions that
> > might need tracking. And...
> >
> >> IMHO it's not really a regression.
> >
> > ...sometimes I misjudge or misinterpret something and add it to the
> > regression tracking. Looks like that happened here.
> >
> > Sorry for that and the noise it caused!
>
> No worries.

It is actually a regression. It only really affects performance,
but the logic is broken (my fault). We were going to revert it, and
solve the initial regression a better way.

Thanks,
Nick
diff mbox series

Patch

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0b921704da45..e47b954ce266 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4856,7 +4856,7 @@  int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
 	 * entering a nested guest in which case the decrementer is now owned
 	 * by L2 and the L1 decrementer is provided in hdec_expires
 	 */
-	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
+	if (kvmppc_core_pending_dec(vcpu) &&
 			((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
 			 (trap == BOOK3S_INTERRUPT_SYSCALL &&
 			  kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))