diff mbox

[2/2] kvm: arm/arm64: Fix use after free of stage2 page table

Message ID 1493821072-29713-3-git-send-email-suzuki.poulose@arm.com
State New
Headers show

Commit Message

Suzuki K Poulose May 3, 2017, 2:17 p.m. UTC
We yield the kvm->mmu_lock occassionaly while performing an operation
(e.g, unmap or permission changes) on a large area of stage2 mappings.
However this could possibly cause another thread to clear and free up
the stage2 page tables while we were waiting for regaining the lock and
thus the original thread could end up in accessing memory that was
freed. This patch fixes the problem by making sure that the stage2
pagetable is still valid after we regain the lock. The fact that
mmu_notifer->release() could be called twice (via __mmu_notifier_release
and mmu_notifier_unregsister) enhances the possibility of hitting
this race where there are two threads trying to unmap the entire guest
shadow pages.

While at it, cleanup the redudant checks around cond_resched_lock in
stage2_wp_range(), as cond_resched_lock already does the same checks.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: andreyknvl@google.com
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
---
 arch/arm/kvm/mmu.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

Comments

Marc Zyngier May 12, 2017, 9:15 a.m. UTC | #1
On 03/05/17 15:17, Suzuki K Poulose wrote:
> We yield the kvm->mmu_lock occassionaly while performing an operation
                             occasionally
> (e.g, unmap or permission changes) on a large area of stage2 mappings.
> However this could possibly cause another thread to clear and free up
> the stage2 page tables while we were waiting for regaining the lock and
> thus the original thread could end up in accessing memory that was
> freed. This patch fixes the problem by making sure that the stage2
> pagetable is still valid after we regain the lock. The fact that
> mmu_notifer->release() could be called twice (via __mmu_notifier_release
> and mmu_notifier_unregsister) enhances the possibility of hitting
                   unregister
> this race where there are two threads trying to unmap the entire guest
> shadow pages.
> 
> While at it, cleanup the redudant checks around cond_resched_lock in
                           redundant
> stage2_wp_range(), as cond_resched_lock already does the same checks.
> 
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: andreyknvl@google.com
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm/kvm/mmu.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 909a1a7..5b3e0db 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -301,9 +301,14 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  		/*
>  		 * If the range is too large, release the kvm->mmu_lock
>  		 * to prevent starvation and lockup detector warnings.
> +		 * Make sure the page table is still active when we regain
> +		 * the lock.
>  		 */
> -		if (next != end)
> +		if (next != end) {
>  			cond_resched_lock(&kvm->mmu_lock);
> +			if (!READ_ONCE(kvm->arch.pgd))
> +				break;
> +		}
>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> @@ -1170,11 +1175,13 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  		 * large. Otherwise, we may see kernel panics with
>  		 * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
>  		 * CONFIG_LOCKDEP. Additionally, holding the lock too long
> -		 * will also starve other vCPUs.
> +		 * will also starve other vCPUs. We have to also make sure
> +		 * that the page tables are not freed while we released
> +		 * the lock.
>  		 */
> -		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> -			cond_resched_lock(&kvm->mmu_lock);
> -
> +		cond_resched_lock(&kvm->mmu_lock);
> +		if (!READ_ONCE(kvm->arch.pgd))
> +			break;
>  		next = stage2_pgd_addr_end(addr, end);
>  		if (stage2_pgd_present(*pgd))
>  			stage2_wp_puds(pgd, addr, next);
> 

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>

Thanks,

	M.
Christoffer Dall May 15, 2017, 10 a.m. UTC | #2
Hi Suzuki,

On Wed, May 03, 2017 at 03:17:52PM +0100, Suzuki K Poulose wrote:
> We yield the kvm->mmu_lock occassionaly while performing an operation
> (e.g, unmap or permission changes) on a large area of stage2 mappings.
> However this could possibly cause another thread to clear and free up
> the stage2 page tables while we were waiting for regaining the lock and
> thus the original thread could end up in accessing memory that was
> freed. This patch fixes the problem by making sure that the stage2
> pagetable is still valid after we regain the lock. The fact that
> mmu_notifer->release() could be called twice (via __mmu_notifier_release
> and mmu_notifier_unregsister) enhances the possibility of hitting
> this race where there are two threads trying to unmap the entire guest
> shadow pages.
> 
> While at it, cleanup the redudant checks around cond_resched_lock in
> stage2_wp_range(), as cond_resched_lock already does the same checks.
> 
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: andreyknvl@google.com
> Cc: Christoffer Dall <christoffer.dall@linaro.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm/kvm/mmu.c | 17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 909a1a7..5b3e0db 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -301,9 +301,14 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  		/*
>  		 * If the range is too large, release the kvm->mmu_lock
>  		 * to prevent starvation and lockup detector warnings.
> +		 * Make sure the page table is still active when we regain
> +		 * the lock.
>  		 */
> -		if (next != end)
> +		if (next != end) {
>  			cond_resched_lock(&kvm->mmu_lock);
> +			if (!READ_ONCE(kvm->arch.pgd))
> +				break;
> +		}

So I don't think this change is wrong, but I wonder if it's sufficient.
For example, I can see that this function is called from

stage2_unmsp_vm
 -> stage2_unmap_memslot
   -> unmap_stage2_range

and

kvm_arch_flush_shadow_memslot
 -> unmap_stage2_range

which never check if the pgd pointer is valid, and finally
kvm_free_stage2_pgd also checks the pgd pointer outside of holding the
kvm->mmu_lock so why is this not racy?

>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> @@ -1170,11 +1175,13 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  		 * large. Otherwise, we may see kernel panics with
>  		 * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
>  		 * CONFIG_LOCKDEP. Additionally, holding the lock too long
> -		 * will also starve other vCPUs.
> +		 * will also starve other vCPUs. We have to also make sure
> +		 * that the page tables are not freed while we released
> +		 * the lock.
>  		 */
> -		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> -			cond_resched_lock(&kvm->mmu_lock);
> -
> +		cond_resched_lock(&kvm->mmu_lock);
> +		if (!READ_ONCE(kvm->arch.pgd))
> +			break;

Here I suppose you don't have the issue becase you check the pgd pointer
before derefencing it in all cases.

Thanks,
-Christoffer

>  		next = stage2_pgd_addr_end(addr, end);
>  		if (stage2_pgd_present(*pgd))
>  			stage2_wp_puds(pgd, addr, next);
> -- 
> 2.7.4
>
Suzuki K Poulose May 15, 2017, 1:36 p.m. UTC | #3
On 15/05/17 11:00, Christoffer Dall wrote:
> Hi Suzuki,
>
> On Wed, May 03, 2017 at 03:17:52PM +0100, Suzuki K Poulose wrote:
>> We yield the kvm->mmu_lock occassionaly while performing an operation
>> (e.g, unmap or permission changes) on a large area of stage2 mappings.
>> However this could possibly cause another thread to clear and free up
>> the stage2 page tables while we were waiting for regaining the lock and
>> thus the original thread could end up in accessing memory that was
>> freed. This patch fixes the problem by making sure that the stage2
>> pagetable is still valid after we regain the lock. The fact that
>> mmu_notifer->release() could be called twice (via __mmu_notifier_release
>> and mmu_notifier_unregsister) enhances the possibility of hitting
>> this race where there are two threads trying to unmap the entire guest
>> shadow pages.
>>
>> While at it, cleanup the redudant checks around cond_resched_lock in
>> stage2_wp_range(), as cond_resched_lock already does the same checks.
>>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>> Cc: andreyknvl@google.com
>> Cc: Christoffer Dall <christoffer.dall@linaro.org>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>  arch/arm/kvm/mmu.c | 17 ++++++++++++-----
>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 909a1a7..5b3e0db 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -301,9 +301,14 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>  		/*
>>  		 * If the range is too large, release the kvm->mmu_lock
>>  		 * to prevent starvation and lockup detector warnings.
>> +		 * Make sure the page table is still active when we regain
>> +		 * the lock.
>>  		 */
>> -		if (next != end)
>> +		if (next != end) {
>>  			cond_resched_lock(&kvm->mmu_lock);
>> +			if (!READ_ONCE(kvm->arch.pgd))
>> +				break;
>> +		}
>
> So I don't think this change is wrong, but I wonder if it's sufficient.
> For example, I can see that this function is called from
>
> stage2_unmsp_vm
>  -> stage2_unmap_memslot
>    -> unmap_stage2_range
>
> and
>
> kvm_arch_flush_shadow_memslot
>  -> unmap_stage2_range
>
> which never check if the pgd pointer is valid,

You are right. Those two callers do not check it. We could fix all of this by simply
moving the check to the beginning of the loop.
i.e, something like this :

@@ -295,6 +295,12 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
	assert_spin_locked(&kvm->mmu_lock);
	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
	do {
+		/*
+		 * Make sure the page table is still active, as we could
+		 * another thread could have possibly freed the page table.
+		 */
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
		next = stage2_pgd_addr_end(addr, end);
		if (!stage2_pgd_none(*pgd))
			unmap_stage2_puds(kvm, pgd, addr, next);




> and finally, kvm_free_stage2_pgd also checks the pgd pointer outside of holding the
> kvm->mmu_lock so why is this not racy?

This has been fixed by patch 1 in the series. So should be fine.


I can respin the patch with the changes if you are OK with it.

Suzuki
Suzuki K Poulose May 15, 2017, 1:38 p.m. UTC | #4
On 15/05/17 14:36, Suzuki K Poulose wrote:
> On 15/05/17 11:00, Christoffer Dall wrote:
>> Hi Suzuki,
>>
>> On Wed, May 03, 2017 at 03:17:52PM +0100, Suzuki K Poulose wrote:
>>> We yield the kvm->mmu_lock occassionaly while performing an operation
>>> (e.g, unmap or permission changes) on a large area of stage2 mappings.
>>> However this could possibly cause another thread to clear and free up
>>> the stage2 page tables while we were waiting for regaining the lock and
>>> thus the original thread could end up in accessing memory that was
>>> freed. This patch fixes the problem by making sure that the stage2
>>> pagetable is still valid after we regain the lock. The fact that
>>> mmu_notifer->release() could be called twice (via __mmu_notifier_release
>>> and mmu_notifier_unregsister) enhances the possibility of hitting
>>> this race where there are two threads trying to unmap the entire guest
>>> shadow pages.
>>>
>>> While at it, cleanup the redudant checks around cond_resched_lock in
>>> stage2_wp_range(), as cond_resched_lock already does the same checks.
>>>
>>> Cc: Mark Rutland <mark.rutland@arm.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: andreyknvl@google.com
>>> Cc: Christoffer Dall <christoffer.dall@linaro.org>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>>  arch/arm/kvm/mmu.c | 17 ++++++++++++-----
>>>  1 file changed, 12 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 909a1a7..5b3e0db 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -301,9 +301,14 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>>          /*
>>>           * If the range is too large, release the kvm->mmu_lock
>>>           * to prevent starvation and lockup detector warnings.
>>> +         * Make sure the page table is still active when we regain
>>> +         * the lock.
>>>           */
>>> -        if (next != end)
>>> +        if (next != end) {
>>>              cond_resched_lock(&kvm->mmu_lock);
>>> +            if (!READ_ONCE(kvm->arch.pgd))
>>> +                break;
>>> +        }
>>
>> So I don't think this change is wrong, but I wonder if it's sufficient.
>> For example, I can see that this function is called from
>>
>> stage2_unmsp_vm
>>  -> stage2_unmap_memslot
>>    -> unmap_stage2_range
>>
>> and
>>
>> kvm_arch_flush_shadow_memslot
>>  -> unmap_stage2_range
>>
>> which never check if the pgd pointer is valid,
>
> You are right. Those two callers do not check it. We could fix all of this by simply
> moving the check to the beginning of the loop.
> i.e, something like this :
>
> @@ -295,6 +295,12 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>     assert_spin_locked(&kvm->mmu_lock);
>     pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>     do {
> +        /*
> +         * Make sure the page table is still active, as we could
> +         * another thread could have possibly freed the page table.
> +         */

The comment portion has been fixed locally.
Christoffer Dall May 15, 2017, 5:43 p.m. UTC | #5
On Mon, May 15, 2017 at 02:36:58PM +0100, Suzuki K Poulose wrote:
> On 15/05/17 11:00, Christoffer Dall wrote:
> >Hi Suzuki,
> >
> >On Wed, May 03, 2017 at 03:17:52PM +0100, Suzuki K Poulose wrote:
> >>We yield the kvm->mmu_lock occassionaly while performing an operation
> >>(e.g, unmap or permission changes) on a large area of stage2 mappings.
> >>However this could possibly cause another thread to clear and free up
> >>the stage2 page tables while we were waiting for regaining the lock and
> >>thus the original thread could end up in accessing memory that was
> >>freed. This patch fixes the problem by making sure that the stage2
> >>pagetable is still valid after we regain the lock. The fact that
> >>mmu_notifer->release() could be called twice (via __mmu_notifier_release
> >>and mmu_notifier_unregsister) enhances the possibility of hitting
> >>this race where there are two threads trying to unmap the entire guest
> >>shadow pages.
> >>
> >>While at it, cleanup the redudant checks around cond_resched_lock in
> >>stage2_wp_range(), as cond_resched_lock already does the same checks.
> >>
> >>Cc: Mark Rutland <mark.rutland@arm.com>
> >>Cc: Radim Krčmář <rkrcmar@redhat.com>
> >>Cc: andreyknvl@google.com
> >>Cc: Christoffer Dall <christoffer.dall@linaro.org>
> >>Cc: Marc Zyngier <marc.zyngier@arm.com>
> >>Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> >>---
> >> arch/arm/kvm/mmu.c | 17 ++++++++++++-----
> >> 1 file changed, 12 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>index 909a1a7..5b3e0db 100644
> >>--- a/arch/arm/kvm/mmu.c
> >>+++ b/arch/arm/kvm/mmu.c
> >>@@ -301,9 +301,14 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> >> 		/*
> >> 		 * If the range is too large, release the kvm->mmu_lock
> >> 		 * to prevent starvation and lockup detector warnings.
> >>+		 * Make sure the page table is still active when we regain
> >>+		 * the lock.
> >> 		 */
> >>-		if (next != end)
> >>+		if (next != end) {
> >> 			cond_resched_lock(&kvm->mmu_lock);
> >>+			if (!READ_ONCE(kvm->arch.pgd))
> >>+				break;
> >>+		}
> >
> >So I don't think this change is wrong, but I wonder if it's sufficient.
> >For example, I can see that this function is called from
> >
> >stage2_unmsp_vm
> > -> stage2_unmap_memslot
> >   -> unmap_stage2_range
> >
> >and
> >
> >kvm_arch_flush_shadow_memslot
> > -> unmap_stage2_range
> >
> >which never check if the pgd pointer is valid,
> 
> You are right. Those two callers do not check it. We could fix all of this by simply
> moving the check to the beginning of the loop.
> i.e, something like this :
> 
> @@ -295,6 +295,12 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> 	assert_spin_locked(&kvm->mmu_lock);
> 	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> 	do {
> +		/*
> +		 * Make sure the page table is still active, as we could
> +		 * another thread could have possibly freed the page table.
> +		 */
> +		if (!READ_ONCE(kvm->arch.pgd))
> +			break;
> 		next = stage2_pgd_addr_end(addr, end);
> 		if (!stage2_pgd_none(*pgd))
> 			unmap_stage2_puds(kvm, pgd, addr, next);
> 
> 
> 
> 
> >and finally, kvm_free_stage2_pgd also checks the pgd pointer outside of holding the
> >kvm->mmu_lock so why is this not racy?
> 
> This has been fixed by patch 1 in the series. So should be fine.
> 
> 
> I can respin the patch with the changes if you are OK with it.
> 
Yes, absolutely.  I've already applied patch 1 so no need to include
that in your respin.

Thanks!

-Christoffer
Suzuki K Poulose May 15, 2017, 5:51 p.m. UTC | #6
On 15/05/17 18:43, Christoffer Dall wrote:
> On Mon, May 15, 2017 at 02:36:58PM +0100, Suzuki K Poulose wrote:
>> On 15/05/17 11:00, Christoffer Dall wrote:
>>> Hi Suzuki,
>>> So I don't think this change is wrong, but I wonder if it's sufficient.
>>> For example, I can see that this function is called from
>>>
>>> stage2_unmsp_vm
>>> -> stage2_unmap_memslot
>>>   -> unmap_stage2_range
>>>
>>> and
>>>
>>> kvm_arch_flush_shadow_memslot
>>> -> unmap_stage2_range
>>>
>>> which never check if the pgd pointer is valid,
>>
>> You are right. Those two callers do not check it. We could fix all of this by simply
>> moving the check to the beginning of the loop.
>> i.e, something like this :
>>
>> @@ -295,6 +295,12 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>> 	assert_spin_locked(&kvm->mmu_lock);
>> 	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>> 	do {
>> +		/*
>> +		 * Make sure the page table is still active, as we could
>> +		 * another thread could have possibly freed the page table.
>> +		 */
>> +		if (!READ_ONCE(kvm->arch.pgd))
>> +			break;
>> 		next = stage2_pgd_addr_end(addr, end);
>> 		if (!stage2_pgd_none(*pgd))
>> 			unmap_stage2_puds(kvm, pgd, addr, next);
>>
>>
>>
>>
>>> and finally, kvm_free_stage2_pgd also checks the pgd pointer outside of holding the
>>> kvm->mmu_lock so why is this not racy?
>>
>> This has been fixed by patch 1 in the series. So should be fine.
>>
>>
>> I can respin the patch with the changes if you are OK with it.
>>
> Yes, absolutely.  I've already applied patch 1 so no need to include
> that in your respin.

I have made a minor change to the 1st patch, to make use of READ_ONCE/WRITE_ONCE,
to make sure we don't use the cached value of the kvm->arch.pgd. Something like :


@@ -829,22 +829,22 @@ void stage2_unmap_vm(struct kvm *kvm)
   * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
   * underlying level-2 and level-3 tables before freeing the actual level-1 table
   * and setting the struct pointer to NULL.
- *
- * Note we don't need locking here as this is only called when the VM is
- * destroyed, which can only be done once.
   */
  void kvm_free_stage2_pgd(struct kvm *kvm)
  {
-	if (kvm->arch.pgd == NULL)
-		return;
+	void *pgd = NULL;
  
  	spin_lock(&kvm->mmu_lock);
-	unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+	if (kvm->arch.pgd) {
+		unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+		pgd = READ_ONCE(kvm->arch.pgd);
+		WRITE_ONCE(kvm->arch.pgd, NULL);
+	}
         spin_unlock(&kvm->mmu_lock);
  

Let me know if you could fix it up or I could send a fresh series. Sorry about that.

Suzuki
Christoffer Dall May 15, 2017, 6:22 p.m. UTC | #7
On Mon, May 15, 2017 at 06:51:26PM +0100, Suzuki K Poulose wrote:
> On 15/05/17 18:43, Christoffer Dall wrote:
> >On Mon, May 15, 2017 at 02:36:58PM +0100, Suzuki K Poulose wrote:
> >>On 15/05/17 11:00, Christoffer Dall wrote:
> >>>Hi Suzuki,
> >>>So I don't think this change is wrong, but I wonder if it's sufficient.
> >>>For example, I can see that this function is called from
> >>>
> >>>stage2_unmsp_vm
> >>>-> stage2_unmap_memslot
> >>>  -> unmap_stage2_range
> >>>
> >>>and
> >>>
> >>>kvm_arch_flush_shadow_memslot
> >>>-> unmap_stage2_range
> >>>
> >>>which never check if the pgd pointer is valid,
> >>
> >>You are right. Those two callers do not check it. We could fix all of this by simply
> >>moving the check to the beginning of the loop.
> >>i.e, something like this :
> >>
> >>@@ -295,6 +295,12 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> >>	assert_spin_locked(&kvm->mmu_lock);
> >>	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> >>	do {
> >>+		/*
> >>+		 * Make sure the page table is still active, as we could
> >>+		 * another thread could have possibly freed the page table.
> >>+		 */
> >>+		if (!READ_ONCE(kvm->arch.pgd))
> >>+			break;
> >>		next = stage2_pgd_addr_end(addr, end);
> >>		if (!stage2_pgd_none(*pgd))
> >>			unmap_stage2_puds(kvm, pgd, addr, next);
> >>
> >>
> >>
> >>
> >>>and finally, kvm_free_stage2_pgd also checks the pgd pointer outside of holding the
> >>>kvm->mmu_lock so why is this not racy?
> >>
> >>This has been fixed by patch 1 in the series. So should be fine.
> >>
> >>
> >>I can respin the patch with the changes if you are OK with it.
> >>
> >Yes, absolutely.  I've already applied patch 1 so no need to include
> >that in your respin.
> 
> I have made a minor change to the 1st patch, to make use of READ_ONCE/WRITE_ONCE,
> to make sure we don't use the cached value of the kvm->arch.pgd. Something like :
> 
> 
> @@ -829,22 +829,22 @@ void stage2_unmap_vm(struct kvm *kvm)
>   * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
>   * underlying level-2 and level-3 tables before freeing the actual level-1 table
>   * and setting the struct pointer to NULL.
> - *
> - * Note we don't need locking here as this is only called when the VM is
> - * destroyed, which can only be done once.
>   */
>  void kvm_free_stage2_pgd(struct kvm *kvm)
>  {
> -	if (kvm->arch.pgd == NULL)
> -		return;
> +	void *pgd = NULL;
>  	spin_lock(&kvm->mmu_lock);
> -	unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> +	if (kvm->arch.pgd) {
> +		unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> +		pgd = READ_ONCE(kvm->arch.pgd);

I'm not sure I understand this.  What do you use pgd for?  Wouldn't it
be cleaner t use the READ_ONCE() in the if statement?

> +		WRITE_ONCE(kvm->arch.pgd, NULL);
> +	}
>         spin_unlock(&kvm->mmu_lock);
> 
> Let me know if you could fix it up or I could send a fresh series.
> 

At this point it may be better to send a new series with two patches
against what I already have in kvmarm/master.

> Sorry about that.

No worries at all.

Thanks,
-Christoffer
diff mbox

Patch

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 909a1a7..5b3e0db 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -301,9 +301,14 @@  static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
 		/*
 		 * If the range is too large, release the kvm->mmu_lock
 		 * to prevent starvation and lockup detector warnings.
+		 * Make sure the page table is still active when we regain
+		 * the lock.
 		 */
-		if (next != end)
+		if (next != end) {
 			cond_resched_lock(&kvm->mmu_lock);
+			if (!READ_ONCE(kvm->arch.pgd))
+				break;
+		}
 	} while (pgd++, addr = next, addr != end);
 }
 
@@ -1170,11 +1175,13 @@  static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
 		 * large. Otherwise, we may see kernel panics with
 		 * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
 		 * CONFIG_LOCKDEP. Additionally, holding the lock too long
-		 * will also starve other vCPUs.
+		 * will also starve other vCPUs. We have to also make sure
+		 * that the page tables are not freed while we released
+		 * the lock.
 		 */
-		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
-			cond_resched_lock(&kvm->mmu_lock);
-
+		cond_resched_lock(&kvm->mmu_lock);
+		if (!READ_ONCE(kvm->arch.pgd))
+			break;
 		next = stage2_pgd_addr_end(addr, end);
 		if (stage2_pgd_present(*pgd))
 			stage2_wp_puds(pgd, addr, next);