Patchwork powerpc-powernv: align BARs to PAGE_SIZE on powernv platform

Submitter Alexey Kardashevskiy
Date Sept. 4, 2012, 7:36 a.m.
Message ID <1346744201-31262-1-git-send-email-aik@ozlabs.ru>
Permalink /patch/181511/
State Not Applicable

Comments

Alexey Kardashevskiy - Sept. 4, 2012, 7:36 a.m.
VFIO adds a separate memory region for every BAR and tries
to mmap() it to provide direct BAR mapping to the guest.
If it succeeds, QEMU registers this address with kvm_set_phys_mem().
However this is not always possible, because such a BAR has to
be host page size aligned. In that case VFIO falls back to a "slow" path
and emulates BAR access in QEMU.

In order to avoid the "slow" path, BARs have to be PAGE_SIZE aligned
in the host kernel, and this is what the patch does.

The patch adds a powernv platform specific hook which makes all
BAR sizes 64K aligned. The pci_reassigndev_resource_alignment()
function from drivers/pci/pci.c has been used as a reference.

This is purely an optimization patch; things will work without
it, just a bit slower.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/setup.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
Benjamin Herrenschmidt - Sept. 4, 2012, 7:45 p.m.
On Tue, 2012-09-04 at 17:36 +1000, Alexey Kardashevskiy wrote:
> VFIO adds a separate memory region for every BAR and tries
> to mmap() it to provide direct BAR mapping to the guest.
> If it succeeds, QEMU registers this address with kvm_set_phys_mem().
> However it is not always possible because such a BAR should
> be host page size aligned. In this case VFIO uses "slow" path
> and emulated BAR access in QEMU.
> 
> In order to avoid "slow" path, BARs have to be PAGE_SIZE aligned
> in the host kernel and this is what the patch does.
> 
> The patch adds powernv platform specific hook which makes all
> BARs sizes 64K aligned. The pci_reassigndev_resource_alignment()
> function from drivers/pci/pci.c has been used as a reference.
> 
> This is purely an optimization patch, the things will work without
> it, just a bit slower.

It's still bad in more ways than I care to explain...

The main one is that you do the "fixup" in a very wrong place anyway and
it might cause cases of overlapping BARs.

In any case this is wrong. It's a VFIO design bug and needs to be fixed
there (CC'ing Alex).

IE. We need a way to know where the BAR is within a page at which point
VFIO can still map the page, but can also properly take into account the
offset.

We also need a way to tell VFIO userspace that it's OK to use the fast
path for such small BARs. It's not for all host platforms. We know it's
ok for PowerNV because we know the devices are grouped by PEs and the PE
granularity is larger than a page but that's not necessarily going to be
the case on all powerpc platforms that support KVM.

Cheers,
Ben.

Alexey Kardashevskiy - Sept. 5, 2012, 12:55 a.m.
On 05/09/12 05:45, Benjamin Herrenschmidt wrote:
> On Tue, 2012-09-04 at 17:36 +1000, Alexey Kardashevskiy wrote:
>> VFIO adds a separate memory region for every BAR and tries
>> to mmap() it to provide direct BAR mapping to the guest.
>> If it succeeds, QEMU registers this address with kvm_set_phys_mem().
>> However it is not always possible because such a BAR should
>> be host page size aligned. In this case VFIO uses "slow" path
>> and emulated BAR access in QEMU.
>>
>> In order to avoid "slow" path, BARs have to be PAGE_SIZE aligned
>> in the host kernel and this is what the patch does.
>>
>> The patch adds powernv platform specific hook which makes all
>> BARs sizes 64K aligned. The pci_reassigndev_resource_alignment()
>> function from drivers/pci/pci.c has been used as a reference.
>>
>> This is purely an optimization patch, the things will work without
>> it, just a bit slower.
>
> It's still bad in more ways that I care to explain...

Well, it is done right next to pci_reassigndev_resource_alignment(), which is
common code and does the same thing.

> The main one is that you do the "fixup" in a very wrong place anyway and
> it might cause cases of overlapping BARs.

As far as I can tell, that can only happen if someone tries to align resources
via the kernel command line.

But ok. I trust you :)

> In any case this is wrong. It's a VFIO design bug and needs to be fixed
> there (CC'ing Alex).

It can be fixed in VFIO only if VFIO stops treating functions
separately and starts mapping the group's MMIO space as a whole. But that
is not going to happen.

An example of the problem is the NEC USB PCI device, which has 3 functions, each
with one BAR; these BARs are 4K aligned, and I cannot see how that can be fixed
with a 64K page size and VFIO creating memory regions per BAR (not per PHB).


> IE. We need a way to know where the BAR is within a page at which point
> VFIO can still map the page, but can also properly take into account the
> offset.

It is not about VFIO, it is about KVM. I cannot pass a non-aligned page to
kvm_set_phys_mem(), and I cannot see how we would solve this.


You had better discuss it with David; my vocabulary is weak.



Benjamin Herrenschmidt - Sept. 5, 2012, 1:16 a.m.
> > It's still bad in more ways that I care to explain...
> 
> Well it is right before pci_reassigndev_resource_alignment() which is 
> common and does the same thing.
> 
> > The main one is that you do the "fixup" in a very wrong place anyway and
> > it might cause cases of overlapping BARs.
> 
> As far as I can tell it may only happen if someone tries to align resource 
> via kernel command line.
> 
> But ok. I trust you :)

I have reasons to believe that this realignment crap is wrong too :-)

> > In any case this is wrong. It's a VFIO design bug and needs to be fixed
> > there (CC'ing Alex).
> 
> It can be fixed in VFIO only if VFIO will stop treating functions 
> separately and start mapping group's MMIO space as a whole thing. But this 
> is not going to happen.

It still can be fixed without that...

> The example of the problem is NEC USB PCI which has 3 functions, each has 
> one BAR, these BARs are 4K aligned and I cannot see how it can be fixed 
> with 64K page size and VFIO creating memory regions per BAR (not per PHB).

VFIO can perfectly well realize it's the same MR or even map the same
area 3 times and create 3 MRs, both options work. All it needs is to
know the offset of the BAR inside the page.

> > IE. We need a way to know where the BAR is within a page at which point
> > VFIO can still map the page, but can also properly take into account the
> > offset.
> 
> It is not about VFIO, it is about KVM. I cannot put non-aligned page to 
> kvm_set_phys_mem(). Cannot understand how we would solve this.

No, VFIO still maps the whole page and creates an MR for the whole page,
that's fine. But you still need to know the offset within the page.

Now the main problem here is going to be that the guest itself might
reallocate the BAR and move it around (well, its version of the BAR,
which isn't the real thing), and so we cannot create a direct MMU
mapping between -that- and the real BAR.

IE. We can only allow that direct mapping if the guest BAR mapping has
the same "offset within page" as the host BAR mapping. 

Our guests don't mess with BARs but SLOF does ... it's really tempting
to look into bringing the whole BAR allocation back into qemu and out of
SLOF :-( (We might have to if we ever do hotplug anyway). That way qemu
could set offsets that match appropriately.
 
Cheers,
Ben.

Alex Williamson - Sept. 5, 2012, 4:57 a.m.
On Wed, 2012-09-05 at 11:16 +1000, Benjamin Herrenschmidt wrote:
> > > It's still bad in more ways that I care to explain...
> > 
> > Well it is right before pci_reassigndev_resource_alignment() which is 
> > common and does the same thing.
> > 
> > > The main one is that you do the "fixup" in a very wrong place anyway and
> > > it might cause cases of overlapping BARs.
> > 
> > As far as I can tell it may only happen if someone tries to align resource 
> > via kernel command line.
> > 
> > But ok. I trust you :)
> 
> I have reasons to believe that this realignment crap is wrong too :-)
> 
> > > In any case this is wrong. It's a VFIO design bug and needs to be fixed
> > > there (CC'ing Alex).
> > 
> > It can be fixed in VFIO only if VFIO will stop treating functions 
> > separately and start mapping group's MMIO space as a whole thing. But this 
> > is not going to happen.
> 
> It still can be fixed without that...
> 
> > The example of the problem is NEC USB PCI which has 3 functions, each has 
> > one BAR, these BARs are 4K aligned and I cannot see how it can be fixed 
> > with 64K page size and VFIO creating memory regions per BAR (not per PHB).
> 
> VFIO can perfectly well realize it's the same MR or even map the same
> area 3 times and create 3 MRs, both options work. All it needs is to
> know the offset of the BAR inside the page.

Yep, I think I agree...

> > > IE. We need a way to know where the BAR is within a page at which point
> > > VFIO can still map the page, but can also properly take into account the
> > > offset.
> > 
> > It is not about VFIO, it is about KVM. I cannot put non-aligned page to 
> > kvm_set_phys_mem(). Cannot understand how we would solve this.
> 
> No, VFIO still maps the whole page and creates an MR for the whole page,
> that's fine. But you still need to know the offset within the page.

Do we need an extra region info field, or is it sufficient that we
define a region to be mmap'able with getpagesize() pages when the MMAP
flag is set and simply offset the region within the device fd?  ex.

BAR0: 0x10000 /* no offset */
BAR1: 0x21000 /* 4k offset */
BAR2: 0x32000 /* 8k offset */

A second level optimization might make these 0x10000, 0x11000, 0x12000.

This will obviously require some arch hooks w/in vfio as we can't do
this on x86 since we can't guarantee that whatever lives in the
overflow/gaps is in the same group and power is going to need to make
sure we don't accidentally allow msix table mapping... in fact hiding
the msix table might be a lot more troublesome on 64k page hosts.

> Now the main problem here is going to be that the guest itself might
> reallocate the BAR and move it around (well, it's version of the BAR
> which isn't the real thing), and so we cannot create a direct MMU
> mapping between -that- and the real BAR.
> 
> IE. We can only allow that direct mapping if the guest BAR mapping has
> the same "offset within page" as the host BAR mapping. 

Euw...

> Our guests don't mess with BARs but SLOF does ... it's really tempting
> to look into bringing the whole BAR allocation back into qemu and out of
> SLOF :-( (We might have to if we ever do hotplug anyway). That way qemu
> could set offsets that match appropriately.

BTW, as I mentioned elsewhere, I'm on vacation this week, but I'll try
to keep up as much as I have time for.

Thanks,

Alex
Benjamin Herrenschmidt - Sept. 5, 2012, 5:17 a.m.
On Tue, 2012-09-04 at 22:57 -0600, Alex Williamson wrote:

> Do we need an extra region info field, or is it sufficient that we
> define a region to be mmap'able with getpagesize() pages when the MMAP
> flag is set and simply offset the region within the device fd?  ex.

Alexey ? You mentioned you had ways to get at the offset with the
existing interfaces ?

> BAR0: 0x10000 /* no offset */
> BAR1: 0x21000 /* 4k offset */
> BAR2: 0x32000 /* 8k offset */
> 
> A second level optimization might make these 0x10000, 0x11000, 0x12000.
> 
> This will obviously require some arch hooks w/in vfio as we can't do
> this on x86 since we can't guarantee that whatever lives in the
> overflow/gaps is in the same group and power is going to need to make
> sure we don't accidentally allow msix table mapping... in fact hiding
> the msix table might be a lot more troublesome on 64k page hosts.

Fortunately, our guests don't access the msix table directly anyway, at
least most of the time :-) There's a paravirt API for it, and our iommu
makes sure that if for some reason the guest still accesses it and does
the wrong thing to it, the side effects will be contained to the guest.

> > Now the main problem here is going to be that the guest itself might
> > reallocate the BAR and move it around (well, it's version of the BAR
> > which isn't the real thing), and so we cannot create a direct MMU
> > mapping between -that- and the real BAR.
> > 
> > IE. We can only allow that direct mapping if the guest BAR mapping has
> > the same "offset within page" as the host BAR mapping. 
> 
> Euw...

Yeah sucks :-) Basically, let's say page size is 64K. Host side BAR
(real BAR) is at 0xf0001000.

qemu maps 0xf0000000..0xf000ffff to a virtual address inside QEMU,
itself 64k aligned, let's say 0x80000000 and knows that the BAR is at
offset 0x1000 in there.

However, the KVM "MR" API is such that we can only map PAGE_SIZE regions
into the guest as well, so if the guest assigns a value ADDR to the
guest BAR, let's say 0x40002000, all KVM can do is an MR that maps
0x40000000 (guest physical) to 0x80000000 (qemu). Any access within that
64K page will have the low bits transferred directly from guest to HW.

So the guest will end up having that 0x2000 offset instead of the 0x1000
needed to actually access the BAR. FAIL.

There are ways to fix that but all are nasty.

 - In theory, we have the capability (and use it today) to restrict IO
mappings in the guest to 4K HW pages, so knowing that, KVM could use a
"special" MR that plays tricks here... but that would break all sort of
generic code both in qemu and kvm and generally be very nasty.

 - The best approach is to rely on the fact that our guest kernels don't
do BAR assignment, they rely on FW to do it (ie not at all, unlike x86,
we can't even fixup because in the general case, the hypervisor won't
let us anyway). So we could move our guest BAR allocation code out of
our guest firmware (SLOF) back into qemu (where we had it very early
on), which allows us to make sure that the guest BAR values we assign
have the same "offset within the page" as the host side values. This
would also allow us to avoid messing up too many MRs (this can have a
performance impact with KVM) and eventually handle our "group" regions
instead of individual BARs for mappings. We might need to do that anyway
in the long run for hotplug as our hotplug hypervisor APIs also rely on
the "new" hotplugged devices to have the BARs pre-assigned when they get
handed out to the guest. 

> > Our guests don't mess with BARs but SLOF does ... it's really tempting
> > to look into bringing the whole BAR allocation back into qemu and out of
> > SLOF :-( (We might have to if we ever do hotplug anyway). That way qemu
> > could set offsets that match appropriately.
> 
> BTW, as I mentioned elsewhere, I'm on vacation this week, but I'll try
> to keep up as much as I have time for.

No worries,

Cheers,
Ben.
Alexey Kardashevskiy - Sept. 5, 2012, 5:27 a.m.
On 05/09/12 15:17, Benjamin Herrenschmidt wrote:
> On Tue, 2012-09-04 at 22:57 -0600, Alex Williamson wrote:
>
>> Do we need an extra region info field, or is it sufficient that we
>> define a region to be mmap'able with getpagesize() pages when the MMAP
>> flag is set and simply offset the region within the device fd?  ex.
>
> Alexey ? You mentioned you had ways to get at the offset with the
> existing interfaces ?


Yes, the VFIO_DEVICE_GET_REGION_INFO ioctl of the vfio-pci host driver; the
"info" struct has an "offset" field.
I just do not have a place to use it in QEMU right now, as the guest
does the same allocation as the host does (by accident).


>> BAR0: 0x10000 /* no offset */
>> BAR1: 0x21000 /* 4k offset */
>> BAR2: 0x32000 /* 8k offset */
>>
>> A second level optimization might make these 0x10000, 0x11000, 0x12000.
>>
>> This will obviously require some arch hooks w/in vfio as we can't do
>> this on x86 since we can't guarantee that whatever lives in the
>> overflow/gaps is in the same group and power is going to need to make
>> sure we don't accidentally allow msix table mapping... in fact hiding
>> the msix table might be a lot more troublesome on 64k page hosts.
>
> Fortunately, our guests don't access the msix table directly anyway, at
> least most of the time :-)


Not at all in our case. It took me some time to push a QEMU patch which 
changes the msix table :)


Alex Williamson - Sept. 10, 2012, 5:06 p.m.
On Wed, 2012-09-05 at 15:27 +1000, Alexey Kardashevskiy wrote:
> On 05/09/12 15:17, Benjamin Herrenschmidt wrote:
> > On Tue, 2012-09-04 at 22:57 -0600, Alex Williamson wrote:
> >
> >> Do we need an extra region info field, or is it sufficient that we
> >> define a region to be mmap'able with getpagesize() pages when the MMAP
> >> flag is set and simply offset the region within the device fd?  ex.
> >
> > Alexey ? You mentioned you had ways to get at the offset with the
> > existing interfaces ?
> 
> 
> Yes, VFIO_DEVICE_GET_REGION_INFO ioctl of vfio-pci host driver, the "info" 
> struct has an "offset" field.
> I just do not have a place to use it in the QEMU right now as the guest 
> does the same allocation as the host does (by accident).

Yep, this is the offset into the device fd though.  We currently use a
fixed 40-bit region for each BAR, but that's mostly a leftover from
before the API described the offset.  It's a little bit of an
optimization on the kernel side to convert offset->BAR w/o a lookup, but
we're hopefully mmap'ing as much as possible, modulo the page size
issues here.

> >> BAR0: 0x10000 /* no offset */
> >> BAR1: 0x21000 /* 4k offset */
> >> BAR2: 0x32000 /* 8k offset */
> >>
> >> A second level optimization might make these 0x10000, 0x11000, 0x12000.
> >>
> >> This will obviously require some arch hooks w/in vfio as we can't do
> >> this on x86 since we can't guarantee that whatever lives in the
> >> overflow/gaps is in the same group and power is going to need to make
> >> sure we don't accidentally allow msix table mapping... in fact hiding
> >> the msix table might be a lot more troublesome on 64k page hosts.
> >
> > Fortunately, our guests don't access the msix table directly anyway, at
> > least most of the time :-)
> 
> 
> Not at all in our case. It took me some time to push a QEMU patch which 
> changes msix table :)

vfio needs to be safe regardless of whether it's being used by qemu or
some other userspace driver though.

> > There's a paravirt API for it, and our iommu
> > makes sure that if for some reason the guest still accesses it and does
> > the wrong thing to it, the side effects will be contained to the guest.

If direct access to the MSIX table neither leaks information nor leads
to exploitable holes, then I have no problem allowing a platform hook to
make it mmap'able.  We should be looking at this for x86 too on
platforms where we have interrupt remapping capabilities.

> >>> Now the main problem here is going to be that the guest itself might
> >>> reallocate the BAR and move it around (well, it's version of the BAR
> >>> which isn't the real thing), and so we cannot create a direct MMU
> >>> mapping between -that- and the real BAR.
> >>>
> >>> IE. We can only allow that direct mapping if the guest BAR mapping has
> >>> the same "offset within page" as the host BAR mapping.
> >>
> >> Euw...
> >
> > Yeah sucks :-) Basically, let's say page size is 64K. Host side BAR
> > (real BAR) is at 0xf0001000.
> >
> > qemu maps 0xf0000000..0xf000ffff to a virtual address inside QEMU,
> > itself 64k aligned, let's say 0x80000000 and knows that the BAR is at
> > offset 0x1000 in there.
> >
> > However, the KVM "MR" API is such that we can only map PAGE_SIZE regions
> > into the guest as well, so if the guest assigns a value ADDR to the
> > guest BAR, let's say 0x40002000, all KVM can do is an MR that maps
> > 0x40000000 (guest physical) to 0x80000000 (qemu). Any access within that
> > 64K page will have the low bits transferred directly from guest to HW.
> >
> > So the guest will end up having that 0x2000 offset instead of the 0x1000
> > needed to actually access the BAR. FAIL.
> >
> > There are ways to fix that but all are nasty.
> >
> >   - In theory, we have the capability (and use it today) to restrict IO
> > mappings in the guest to 4K HW pages, so knowing that, KVM could use a
> > "special" MR that plays tricks here... but that would break all sort of
> > generic code both in qemu and kvm and generally be very nasty.
> >
> >   - The best approach is to rely on the fact that our guest kernels don't
> > do BAR assignment, they rely on FW to do it (ie not at all, unlike x86,
> > we can't even fixup because in the general case, the hypervisor won't
> > let us anyway). So we could move our guest BAR allocation code out of
> > our guest firmware (SLOF) back into qemu (where we had it very early
> > on), which allows us to make sure that the guest BAR values we assign
> > have the same "offset within the page" as the host side values. This
> > would also allow us to avoid messing up too many MRs (this can have a
> > performance impact with KVM) and eventually handle our "group" regions
> > instead of individual BARs for mappings. We might need to do that anyway
> > in the long run for hotplug as our hotplug hypervisor APIs also rely on
> > the "new" hotplugged devices to have the BARs pre-assigned when they get
> > handed out to the guest.

Ok, now it's making more sense how the original patch here was
beneficial.  If the physical BAR is 64k aligned we could expose the BAR
as being at least 64k and therefore everything would line up.  We have
the same issue on x86 where devices might have <4k BARs and we'd prefer
to mmap them.  So far we've successfully ignored them because they're
not "high performance" devices and because we can't always assume the
extra space is unused or even safe to access.  Obviously the problem
gets much worse at 64k.

You know that the MMIO space consumed by a group is 64k aligned, but
individual BARs are not.  Guest and host may end up with different BAR
offsets within a 64k page though, so that doesn't buy us much.  Blech.
Yeah, relying on the guest not changing mappings is pretty ugly, but may
be a solution.  For the general case though we'd really prefer each BAR
to be treated as a minimum 64k then exposed to the user (qemu) as the
same.  Doing that across a system is pretty wasteful and remapping is
not always possible.  You almost want a setting per PE to specify how
BARs are aligned (and x86 might want the same, but it's not as clear how
to define what devices to apply it to).  Hard problem.  Thanks,

Alex

Patch

diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index db1ad1c..331838e 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -25,6 +25,7 @@ 
 #include <linux/of.h>
 #include <linux/interrupt.h>
 #include <linux/bug.h>
+#include <linux/pci.h>
 
 #include <asm/machdep.h>
 #include <asm/firmware.h>
@@ -179,6 +180,30 @@  static int __init pnv_probe(void)
 	return 1;
 }
 
+static void pnv_pcibios_fixup_resources(struct pci_dev *pdev)
+{
+	struct resource *r;
+	int i;
+
+	/*
+	 * Aligning resources to PAGE_SIZE in order to
+	 * support "fast" path for PCI BAR access under VFIO
+	 * which maps every BAR individually to the guest
+	 * so BARs have to be PAGE aligned.
+	 */
+	for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
+		r = &pdev->resource[i];
+		if (!r->flags)
+			continue;
+		pr_debug("powernv: %s, aligning BAR#%d %llx..%llx",
+			pdev->dev.kobj.name, i, r->start, r->end);
+		r->end = PAGE_ALIGN(r->end - r->start + 1) - 1;
+		r->start = 0;
+		r->flags |= IORESOURCE_UNSET;
+		pr_debug(" to  %llx..%llx\n", r->start, r->end);
+	}
+}
+
 define_machine(powernv) {
 	.name			= "PowerNV",
 	.probe			= pnv_probe,
@@ -189,6 +214,7 @@  define_machine(powernv) {
 	.progress		= pnv_progress,
 	.power_save             = power7_idle,
 	.calibrate_decr		= generic_calibrate_decr,
+	.pcibios_fixup_resources= pnv_pcibios_fixup_resources,
 #ifdef CONFIG_KEXEC
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif