[v2,5/7] exec: memory radix tree page level compression

Message ID	1384370613-6433-6-git-send-email-mst@redhat.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> Date: Wed, 13 Nov 2013 21:23:58 +0200 From: "Michael S. Tsirkin" <mst@redhat.com> To: qemu-devel@nongnu.org Message-ID: <1384370613-6433-6-git-send-email-mst@redhat.com> References: <1384370613-6433-1-git-send-email-mst@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1384370613-6433-1-git-send-email-mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com>, Luiz Capitulino <lcapitulino@redhat.com> Subject: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Michael S. Tsirkin Nov. 13, 2013, 7:23 p.m. UTC

At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 exec.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 74 insertions(+), 1 deletion(-)

Avi Kivity Nov. 14, 2013, 8:54 a.m. UTC | #1

Michael S. Tsirkin <mst <at> redhat.com> writes:

> 
> At the moment, memory radix tree is already variable width, but it can
> only skip the low bits of address.
> 
> This is efficient if we have huge memory regions but inefficient if we
> are only using a tiny portion of the address space.
> 
> After we have built up the map, detect
> configurations where a single L2 entry is valid.
> 
> We then speed up the lookup by skipping one or more levels.
> In case any levels were skipped, we might end up in a valid section
> instead of erroring out. We handle this by checking that
> the address is in range of the resulting section.
> 


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree* tr, hwaddr index, ...)
{
   if (index & rt->invalid_index_mask) {
       // not found
   }
   lp = rt->root;
   for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
     ...

This exploits the fact the lower portion of the address space is always 
filled, at least in the cases that matter to us.

Michael S. Tsirkin Nov. 14, 2013, 2:40 p.m. UTC | #2

On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
> Michael S. Tsirkin <mst <at> redhat.com> writes:
> 
> > 
> > At the moment, memory radix tree is already variable width, but it can
> > only skip the low bits of address.
> > 
> > This is efficient if we have huge memory regions but inefficient if we
> > are only using a tiny portion of the address space.
> > 
> > After we have built up the map, detect
> > configurations where a single L2 entry is valid.
> > 
> > We then speed up the lookup by skipping one or more levels.
> > In case any levels were skipped, we might end up in a valid section
> > instead of erroring out. We handle this by checking that
> > the address is in range of the resulting section.
> > 
> 
> 
> I think this is overkill.  It can be done in a simpler way as follows:
> 
> 
> phys_page_find(RadixTree* tr, hwaddr index, ...)
> {
>    if (index & rt->invalid_index_mask) {
>        // not found
>    }
>    lp = rt->root;
>    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
>      ...
> 
> This exploits the fact the lower portion of the address space is always 
> filled, at least in the cases that matter to us.
> 
> 
> 
> 
> 

Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.

Avi Kivity Nov. 14, 2013, 2:56 p.m. UTC | #3

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>> Michael S. Tsirkin <mst <at> redhat.com> writes:
>>
>>> At the moment, memory radix tree is already variable width, but it can
>>> only skip the low bits of address.
>>>
>>> This is efficient if we have huge memory regions but inefficient if we
>>> are only using a tiny portion of the address space.
>>>
>>> After we have built up the map, detect
>>> configurations where a single L2 entry is valid.
>>>
>>> We then speed up the lookup by skipping one or more levels.
>>> In case any levels were skipped, we might end up in a valid section
>>> instead of erroring out. We handle this by checking that
>>> the address is in range of the resulting section.
>>>
>>
>> I think this is overkill.  It can be done in a simpler way as follows:
>>
>>
>> phys_page_find(RadixTree* tr, hwaddr index, ...)
>> {
>>     if (index & rt->invalid_index_mask) {
>>         // not found
>>     }
>>     lp = rt->root;
>>     for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
>>       ...
>>
>> This exploits the fact the lower portion of the address space is always
>> filled, at least in the cases that matter to us.
>>
>>
>>
>>
>>
> Basically skip unused high bits?
> Sure.
> In fact I think both optimizations can be combined.

Not much value in combining them -- the variable level tree check will 
be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify the 
complexity and the doubling of the page table size -- it can only happen 
in iommu setups that place memory in very high addresses.  These ought 
to be rare.

Michael S. Tsirkin Nov. 14, 2013, 3:37 p.m. UTC | #4

On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
> On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
> >>Michael S. Tsirkin <mst <at> redhat.com> writes:
> >>
> >>>At the moment, memory radix tree is already variable width, but it can
> >>>only skip the low bits of address.
> >>>
> >>>This is efficient if we have huge memory regions but inefficient if we
> >>>are only using a tiny portion of the address space.
> >>>
> >>>After we have built up the map, detect
> >>>configurations where a single L2 entry is valid.
> >>>
> >>>We then speed up the lookup by skipping one or more levels.
> >>>In case any levels were skipped, we might end up in a valid section
> >>>instead of erroring out. We handle this by checking that
> >>>the address is in range of the resulting section.
> >>>
> >>
> >>I think this is overkill.  It can be done in a simpler way as follows:
> >>
> >>
> >>phys_page_find(RadixTree* tr, hwaddr index, ...)
> >>{
> >>    if (index & rt->invalid_index_mask) {
> >>        // not found
> >>    }
> >>    lp = rt->root;
> >>    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
> >>      ...
> >>
> >>This exploits the fact the lower portion of the address space is always
> >>filled, at least in the cases that matter to us.
> >>
> >>
> >>
> >>
> >>
> >Basically skip unused high bits?
> >Sure.
> >In fact I think both optimizations can be combined.
> 
> Not much value in combining them -- the variable level tree check
> will be dominated by the level skip logic.
> 
> IMO however skipping intermediate levels will be too rare to justify
> the complexity and the doubling of the page table size -- it can
> only happen in iommu setups that place memory in very high
> addresses.  These ought to be rare.
> 

Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).

I agree the simplicity is attractive.

However I really would like some logic that can handle > 1 leaf
somehow.

Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.

Avi Kivity Nov. 14, 2013, 3:42 p.m. UTC | #5

On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
>> On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>>>> Michael S. Tsirkin <mst <at> redhat.com> writes:
>>>>
>>>>> At the moment, memory radix tree is already variable width, but it can
>>>>> only skip the low bits of address.
>>>>>
>>>>> This is efficient if we have huge memory regions but inefficient if we
>>>>> are only using a tiny portion of the address space.
>>>>>
>>>>> After we have built up the map, detect
>>>>> configurations where a single L2 entry is valid.
>>>>>
>>>>> We then speed up the lookup by skipping one or more levels.
>>>>> In case any levels were skipped, we might end up in a valid section
>>>>> instead of erroring out. We handle this by checking that
>>>>> the address is in range of the resulting section.
>>>>>
>>>> I think this is overkill.  It can be done in a simpler way as follows:
>>>>
>>>>
>>>> phys_page_find(RadixTree* tr, hwaddr index, ...)
>>>> {
>>>>     if (index & rt->invalid_index_mask) {
>>>>         // not found
>>>>     }
>>>>     lp = rt->root;
>>>>     for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
>>>>       ...
>>>>
>>>> This exploits the fact the lower portion of the address space is always
>>>> filled, at least in the cases that matter to us.
>>>>
>>>>
>>>>
>>>>
>>>>
>>> Basically skip unused high bits?
>>> Sure.
>>> In fact I think both optimizations can be combined.
>> Not much value in combining them -- the variable level tree check
>> will be dominated by the level skip logic.
>>
>> IMO however skipping intermediate levels will be too rare to justify
>> the complexity and the doubling of the page table size -- it can
>> only happen in iommu setups that place memory in very high
>> addresses.  These ought to be rare.
>>
> Well maybe not very high address, but you can have a device with
> a 64 bit bar and this will add back levels (though it would not
> be slower than it is currently).
>
> I agree the simplicity is attractive.
>
> However I really would like some logic that can handle > 1 leaf
> somehow.
>
> Specifically both tricks break if we add a full 64 bit io region
> in order to return -1 for reads on master abort.

That can be achieved by directing a missed lookup to a catch-all region.

It would be best to do this completely internally to the memory code, 
without API changes.

Michael S. Tsirkin Nov. 17, 2013, 3:37 p.m. UTC | #6

On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:
> On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
> >On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
> >>On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
> >>>On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
> >>>>Michael S. Tsirkin <mst <at> redhat.com> writes:
> >>>>
> >>>>>At the moment, memory radix tree is already variable width, but it can
> >>>>>only skip the low bits of address.
> >>>>>
> >>>>>This is efficient if we have huge memory regions but inefficient if we
> >>>>>are only using a tiny portion of the address space.
> >>>>>
> >>>>>After we have built up the map, detect
> >>>>>configurations where a single L2 entry is valid.
> >>>>>
> >>>>>We then speed up the lookup by skipping one or more levels.
> >>>>>In case any levels were skipped, we might end up in a valid section
> >>>>>instead of erroring out. We handle this by checking that
> >>>>>the address is in range of the resulting section.
> >>>>>
> >>>>I think this is overkill.  It can be done in a simpler way as follows:
> >>>>
> >>>>
> >>>>phys_page_find(RadixTree* tr, hwaddr index, ...)
> >>>>{
> >>>>    if (index & rt->invalid_index_mask) {
> >>>>        // not found
> >>>>    }
> >>>>    lp = rt->root;
> >>>>    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
> >>>>      ...
> >>>>
> >>>>This exploits the fact the lower portion of the address space is always
> >>>>filled, at least in the cases that matter to us.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>Basically skip unused high bits?
> >>>Sure.
> >>>In fact I think both optimizations can be combined.
> >>Not much value in combining them -- the variable level tree check
> >>will be dominated by the level skip logic.
> >>
> >>IMO however skipping intermediate levels will be too rare to justify
> >>the complexity and the doubling of the page table size -- it can
> >>only happen in iommu setups that place memory in very high
> >>addresses.  These ought to be rare.
> >>
> >Well maybe not very high address, but you can have a device with
> >a 64 bit bar and this will add back levels (though it would not
> >be slower than it is currently).
> >
> >I agree the simplicity is attractive.
> >
> >However I really would like some logic that can handle > 1 leaf
> >somehow.
> >
> >Specifically both tricks break if we add a full 64 bit io region
> >in order to return -1 for reads on master abort.
> 
> That can be achieved by directing a missed lookup to a catch-all region.
> 
> It would be best to do this completely internally to the memory
> code, without API changes.

You mean e.g. add a catch-all region to the address space?

Avi Kivity Nov. 17, 2013, 4:12 p.m. UTC | #7

On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:
>> On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:
>>> On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:
>>>> On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Nov 14, 2013 at 08:54:11AM +0000, Avi Kivity wrote:
>>>>>> Michael S. Tsirkin <mst <at> redhat.com> writes:
>>>>>>
>>>>>>> At the moment, memory radix tree is already variable width, but it can
>>>>>>> only skip the low bits of address.
>>>>>>>
>>>>>>> This is efficient if we have huge memory regions but inefficient if we
>>>>>>> are only using a tiny portion of the address space.
>>>>>>>
>>>>>>> After we have built up the map, detect
>>>>>>> configurations where a single L2 entry is valid.
>>>>>>>
>>>>>>> We then speed up the lookup by skipping one or more levels.
>>>>>>> In case any levels were skipped, we might end up in a valid section
>>>>>>> instead of erroring out. We handle this by checking that
>>>>>>> the address is in range of the resulting section.
>>>>>>>
>>>>>> I think this is overkill.  It can be done in a simpler way as follows:
>>>>>>
>>>>>>
>>>>>> phys_page_find(RadixTree* tr, hwaddr index, ...)
>>>>>> {
>>>>>>     if (index & rt->invalid_index_mask) {
>>>>>>         // not found
>>>>>>     }
>>>>>>     lp = rt->root;
>>>>>>     for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
>>>>>>       ...
>>>>>>
>>>>>> This exploits the fact the lower portion of the address space is always
>>>>>> filled, at least in the cases that matter to us.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> Basically skip unused high bits?
>>>>> Sure.
>>>>> In fact I think both optimizations can be combined.
>>>> Not much value in combining them -- the variable level tree check
>>>> will be dominated by the level skip logic.
>>>>
>>>> IMO however skipping intermediate levels will be too rare to justify
>>>> the complexity and the doubling of the page table size -- it can
>>>> only happen in iommu setups that place memory in very high
>>>> addresses.  These ought to be rare.
>>>>
>>> Well maybe not very high address, but you can have a device with
>>> a 64 bit bar and this will add back levels (though it would not
>>> be slower than it is currently).
>>>
>>> I agree the simplicity is attractive.
>>>
>>> However I really would like some logic that can handle > 1 leaf
>>> somehow.
>>>
>>> Specifically both tricks break if we add a full 64 bit io region
>>> in order to return -1 for reads on master abort.
>> That can be achieved by directing a missed lookup to a catch-all region.
>>
>> It would be best to do this completely internally to the memory
>> code, without API changes.
> You mean e.g. add a catch-all region to the address space?
>

The API already specifies a catch-all region - the container region 
catches everything if an access doesn't hit a sub-region (though I 
wanted to discourage the practice of using the same region for both I/O 
and a container), or you can have an I/O region as a subregion with the 
lowest priority, that spans the entire space, so that any access which 
misses the other sub-regions will hit the default subregion.

So we don't need API changes, just to change the internals.

I think this can be done by looking at the address space, and finding 
out what region is mapped to the very last byte in the address space, if 
any, and using that as the default if an access misses the tree.  That 
is, this region will be the target of null pointers in the tree, or if 
(addr & addr_invalid_mask) is true.

This will ensure the tree is of the smallest possible size.

Another option is to simply give up on the tree and use 
std::binary_search() on the FlatView.  It will be more cpu intensive, 
but on the other hand more cache-friendly.  tcg already caches stuff in 
the iotlb, and maybe it makes sense to add a cache for kvm lookups too 
(the advantage of the cache is that it only needs to contain addresses 
that are actually touched, so it can be very small and fast, compared to 
the radix tree which needs to cover all addresses).

[v2,5/7] exec: memory radix tree page level compression

Commit Message

Comments

Patch