
MMIO should have more priority than IO

Message ID 1656433761-9163-1-git-send-email-akaher@vmware.com
State New
Series MMIO should have more priority than IO

Commit Message

Ajay Kaher June 28, 2022, 4:29 p.m. UTC
Port IO instructions (PIO) are less efficient than MMIO (memory
mapped I/O). They require twice as many PCI accesses and PIO
instructions are serializing. As a result, MMIO should be preferred
when possible over PIO.

Bare metal test result
1 million reads using raw_pci_read() took:
PIO: 0.433153 Sec.
MMIO: 0.268792 Sec.

Virtual Machine test result
1 hundred thousand reads using raw_pci_read() took:
PIO: 12.809 Sec.
MMIO: 8.517 Sec.

Signed-off-by: Ajay Kaher <akaher@vmware.com>
---
 arch/x86/pci/common.c          |  8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
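
For reference, a minimal sketch of the two access mechanisms behind the
numbers above: the legacy 0xCF8/0xCFC port pair needs two port accesses per
config dword read, while an ECAM/MMCONFIG read is a single memory access.
The helpers below are illustrative only (outl/inl/readl as in asm/io.h),
not the kernel's exact code.

/* Legacy "type 1" mechanism: select the register via 0xCF8, read via 0xCFC. */
static u32 conf1_read_dword(u8 bus, u8 devfn, u16 reg)
{
	u32 addr = 0x80000000u | (bus << 16) | (devfn << 8) | (reg & 0xFC);

	outl(addr, 0xCF8);	/* first port access: address */
	return inl(0xCFC);	/* second port access: data   */
}

/* ECAM/MMCONFIG: config space is memory-mapped, one read per access. */
static u32 ecam_read_dword(void __iomem *ecam_base, u8 bus, u8 devfn, u16 reg)
{
	return readl(ecam_base + ((bus << 20) | (devfn << 12) | (reg & ~3)));
}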

Comments

Bjorn Helgaas June 28, 2022, 6:09 p.m. UTC | #1
[+cc Matthew]

On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote:
> Port IO instructions (PIO) are less efficient than MMIO (memory
> mapped I/O). They require twice as many PCI accesses and PIO
> instructions are serializing. As a result, MMIO should be preferred
> when possible over PIO.
> 
> Bare metal test result
> 1 million reads using raw_pci_read() took:
> PIO: 0.433153 Sec.
> MMIO: 0.268792 Sec.
> 
> Virtual Machine test result
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 Sec.
> MMIO: took 8.517 Sec.
> 
> Signed-off-by: Ajay Kaher <akaher@vmware.com>
> ---
>  arch/x86/pci/common.c          |  8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 3507f456f..0b3383d9c 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>  						int reg, int len, u32 *val)
>  {
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>  		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> -	if (raw_pci_ext_ops)
> -		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>  	return -EINVAL;

This organization of raw_pci_read() dates to b6ce068a1285 ("Change
pci_raw_ops to pci_raw_read/write"), by Matthew.  Cc'd him for
comment, since I think he considered the ordering at the time.

>  }
>  
>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>  						int reg, int len, u32 val)
>  {
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>  		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> -	if (raw_pci_ext_ops)
> -		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>  	return -EINVAL;
>  }
>  
> -- 
> 2.30.0
>
Greg Kroah-Hartman June 29, 2022, 6:12 a.m. UTC | #2
On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote:
> Port IO instructions (PIO) are less efficient than MMIO (memory
> mapped I/O). They require twice as many PCI accesses and PIO
> instructions are serializing. As a result, MMIO should be preferred
> when possible over PIO.
> 
> Bare metal test result
> 1 million reads using raw_pci_read() took:
> PIO: 0.433153 Sec.
> MMIO: 0.268792 Sec.
> 
> Virtual Machine test result
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 Sec.
> MMIO: took 8.517 Sec.
> 
> Signed-off-by: Ajay Kaher <akaher@vmware.com>
> ---
>  arch/x86/pci/common.c          |  8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 3507f456f..0b3383d9c 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>  						int reg, int len, u32 *val)
>  {
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>  		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> -	if (raw_pci_ext_ops)
> -		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>  	return -EINVAL;
>  }
>  
>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>  						int reg, int len, u32 val)
>  {
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>  		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> -	if (raw_pci_ext_ops)
> -		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>  	return -EINVAL;
>  }
>  
> -- 
> 2.30.0
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>
H. Peter Anvin June 29, 2022, 6:37 a.m. UTC | #3
On June 28, 2022 11:12:41 PM PDT, Greg KH <gregkh@linuxfoundation.org> wrote:
>On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote:
>> Port IO instructions (PIO) are less efficient than MMIO (memory
>> mapped I/O). They require twice as many PCI accesses and PIO
>> instructions are serializing. As a result, MMIO should be preferred
>> when possible over PIO.
>> 
>> Bare metal test result
>> 1 million reads using raw_pci_read() took:
>> PIO: 0.433153 Sec.
>> MMIO: 0.268792 Sec.
>> 
>> Virtual Machine test result
>> 1 hundred thousand reads using raw_pci_read() took:
>> PIO: 12.809 Sec.
>> MMIO: took 8.517 Sec.
>> 
>> Signed-off-by: Ajay Kaher <akaher@vmware.com>
>> ---
>>  arch/x86/pci/common.c          |  8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index 3507f456f..0b3383d9c 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>  						int reg, int len, u32 *val)
>>  {
>> +	if (raw_pci_ext_ops)
>> +		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>>  		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> -	if (raw_pci_ext_ops)
>> -		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>  	return -EINVAL;
>>  }
>>  
>>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>  						int reg, int len, u32 val)
>>  {
>> +	if (raw_pci_ext_ops)
>> +		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>  	if (domain == 0 && reg < 256 && raw_pci_ops)
>>  		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> -	if (raw_pci_ext_ops)
>> -		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>  	return -EINVAL;
>>  }
>>  
>> -- 
>> 2.30.0
>> 
>
><formletter>
>
>This is not the correct way to submit patches for inclusion in the
>stable kernel tree.  Please read:
>    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
>for how to do this properly.
>
></formletter>

The statement in the header is also incorrect.
Ajay Kaher July 8, 2022, 5:56 a.m. UTC | #4
On 28/06/22, 11:39 PM, "Bjorn Helgaas" <helgaas@kernel.org> wrote:
> [+cc Matthew]
>
> On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote:
>> Port IO instructions (PIO) are less efficient than MMIO (memory
>> mapped I/O). They require twice as many PCI accesses and PIO
>> instructions are serializing. As a result, MMIO should be preferred
>> when possible over PIO.
>>
>> Bare metal test result
>> 1 million reads using raw_pci_read() took:
>> PIO: 0.433153 Sec.
>> MMIO: 0.268792 Sec.
>>
>> Virtual Machine test result
>> 1 hundred thousand reads using raw_pci_read() took:
>> PIO: 12.809 Sec.
>> MMIO: took 8.517 Sec.
>>
>> Signed-off-by: Ajay Kaher <akaher@vmware.com>
>> ---
>>  arch/x86/pci/common.c          |  8 ++++----
>>  1 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
>> index 3507f456f..0b3383d9c 100644
>> --- a/arch/x86/pci/common.c
>> +++ b/arch/x86/pci/common.c
>> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
>>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                               int reg, int len, u32 *val)
>>  {
>> +     if (raw_pci_ext_ops)
>> +             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>       if (domain == 0 && reg < 256 && raw_pci_ops)
>>               return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
>> -     if (raw_pci_ext_ops)
>> -             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
>>       return -EINVAL;
>
> This organization of raw_pci_read() dates to b6ce068a1285 ("Change
> pci_raw_ops to pci_raw_read/write"), by Matthew.  Cc'd him for
> comment, since I think he considered the ordering at the time.

Thanks Bjorn for quick response.

Matthew, b6ce068a1285 is an old commit. It would be very helpful if you could
provide some detail on the ordering, as Bjorn mentioned above.

- Ajay

>>  }
>>
>>  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>>                                               int reg, int len, u32 val)
>>  {
>> +     if (raw_pci_ext_ops)
>> +             return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>       if (domain == 0 && reg < 256 && raw_pci_ops)
>>               return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
>> -     if (raw_pci_ext_ops)
>> -             return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
>>       return -EINVAL;
>>  }
>>
>> --
>> 2.30.0
>>
Matthew Wilcox (Oracle) July 8, 2022, 12:56 p.m. UTC | #5
On Fri, Jul 08, 2022 at 05:56:07AM +0000, Ajay Kaher wrote:
> 
> On 28/06/22, 11:39 PM, "Bjorn Helgaas" <helgaas@kernel.org> wrote:
> > [+cc Matthew]
> >
> > On Tue, Jun 28, 2022 at 09:59:21PM +0530, Ajay Kaher wrote:
> >> Port IO instructions (PIO) are less efficient than MMIO (memory
> >> mapped I/O). They require twice as many PCI accesses and PIO
> >> instructions are serializing. As a result, MMIO should be preferred
> >> when possible over PIO.
> >>
> >> Bare metal test result
> >> 1 million reads using raw_pci_read() took:
> >> PIO: 0.433153 Sec.
> >> MMIO: 0.268792 Sec.
> >>
> >> Virtual Machine test result
> >> 1 hundred thousand reads using raw_pci_read() took:
> >> PIO: 12.809 Sec.
> >> MMIO: took 8.517 Sec.

While this is true, we're talking about config space accesses.  These
should be rare.  What workload does this make any measurable difference
to?

And looking at the results above, it's not so much the PIO vs MMIO
that makes a difference, it's the virtualisation.  A mmio access goes
from 269ns to 85us.  Rather than messing around with preferring MMIO
over PIO for config space, having an "enlightenment" to do config
space accesses would be a more profitable path.
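
(Those per-access figures follow directly from the numbers quoted above:
0.268792 s / 1,000,000 ≈ 269 ns per MMIO read on bare metal, 8.517 s /
100,000 ≈ 85 us per MMIO read in the VM, and 12.809 s / 100,000 ≈ 128 us
per PIO read in the VM.)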

> >> Signed-off-by: Ajay Kaher <akaher@vmware.com>
> >> ---
> >>  arch/x86/pci/common.c          |  8 ++++----
> >>  1 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> >> index 3507f456f..0b3383d9c 100644
> >> --- a/arch/x86/pci/common.c
> >> +++ b/arch/x86/pci/common.c
> >> @@ -40,20 +40,20 @@ const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
> >>  int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> >>                                               int reg, int len, u32 *val)
> >>  {
> >> +     if (raw_pci_ext_ops)
> >> +             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> >>       if (domain == 0 && reg < 256 && raw_pci_ops)
> >>               return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> >> -     if (raw_pci_ext_ops)
> >> -             return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> >>       return -EINVAL;
> >
> > This organization of raw_pci_read() dates to b6ce068a1285 ("Change
> > pci_raw_ops to pci_raw_read/write"), by Matthew.  Cc'd him for
> > comment, since I think he considered the ordering at the time.
> 
> Thanks Bjorn for quick response.
> 
> Matthew, b6ce068a1285 is old commit. It will be very helpful if you could
> provide some detail on ordering as Bjorn mentioned above.

Sorry for the delay; this came in while I was on holiday.

I'll note that b6ce068a1285 does _not_ cause any changes in whether
MMIO or PIO is used for accesses below 256 bytes.  Yes, it introduces
this code, but it also removes lines like this:

-       if (reg < 256)
-               return pci_conf1_read(seg,bus,devfn,reg,len,value);

from the MMCONFIG accessors.

Those were introduced in a0ca99096094 by Ivan Kokshaysky.  But if you
look further back in the file history, you can find all kinds of nasty
bugs; broken BIOSes, resource conflicts, bleh.  You'd hope they've all
been fixed by now, but do you want to bet?

I still have a working machine here which hung when using MMCONFIG for all
accesses.  The problem lay in, IIRC, the graphics BAR.  When attempting
to size it (by writing all-ones to it and seeing what bits remained as
zero), it happens to overlay the MMCONFIG area.  That meant that the
subsequent attempt to write to the BAR actually ended up being a write
to graphics memory ... and so did all subsequent MMCONFIG accesses.

In short, here be dragons, and you need to move very VERY carefully to
avoid breaking peoples machines.

We do have the possibility of a white-list approach.  We can set
'raw_pci_ops' to NULL on machines which we're certain mmconfig works.
I still think you're better off having a special raw_pci_ops for vmware
than you are dinking about trying to save 50% of the time.  If any of
this is really worth it at all, which I doubt.
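
To make that concrete, here is a rough sketch of what a hypervisor-specific
set of accessors could look like.  struct pci_raw_ops, raw_pci_ops and
raw_pci_ext_ops are the existing hooks in asm/pci_x86.h; the vmware_* names
and the init hook below are assumptions, not working code.

static int vmware_pci_read(unsigned int domain, unsigned int bus,
			   unsigned int devfn, int reg, int len, u32 *val)
{
	/* Always take the ECAM path, which a VMM can make exit-free. */
	if (raw_pci_ext_ops)
		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

static int vmware_pci_write(unsigned int domain, unsigned int bus,
			    unsigned int devfn, int reg, int len, u32 val)
{
	if (raw_pci_ext_ops)
		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

static const struct pci_raw_ops vmware_pci_conf = {
	.read	= vmware_pci_read,
	.write	= vmware_pci_write,
};

static void __init vmware_pci_arch_init(void)
{
	/* Hypothetical hook; it must run after MMCONFIG has been set up. */
	if (hypervisor_is_type(X86_HYPER_VMWARE))
		raw_pci_ops = &vmware_pci_conf;
}

That keeps the generic raw_pci_read() ordering untouched for everyone else.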
Nadav Amit July 8, 2022, 4:45 p.m. UTC | #6
On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote:

> And looking at the results above, it's not so much the PIO vs MMIO
> that makes a difference, it's the virtualisation. A mmio access goes
> from 269ns to 85us. Rather than messing around with preferring MMIO
> over PIO for config space, having an "enlightenment" to do config
> space accesses would be a more profitable path.

I am unfamiliar with the motivation for this patch, but I just wanted to
briefly regard the advice about enlightments.

“enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so
let’s regard the generic term. I think that you consider the bare-metal
results as the possible results from a paravirtual machine, which is mostly
wrong. Para-virtualization usually still requires a VM-exit and for the most
part the hypervisor/host runs similar code for MMIO/hypercall (conceptually;
the code of paravirtual and fully-virtual devices is often different, but
IIUC, this is not what Ajay measured).

Para-virtualization could have *perhaps* helped to reduce the number of
PIO/MMIO and improve performance this way. If, for instance, all the
PIO/MMIO are done during initialization, a paravirtual interface can be used
to batch them together, and that would help. But it is more complicated to
get a performance benefit from paravirtualization if the PIO/MMIO accesses
are “spread”, for instance, done after each interrupt.

Para-virtualization and full-virtualization both have pros and cons.
Para-virtualization is many times more efficient, but requires the VM to
have dedicated device drivers for the matter. Try to run a less-common OS
than Linux and it would not work, since the OS would not have drivers for the
para-virtual devices. And even if you add support today for para-virtual
devices, there are many deployed OSes that do not have such support, and you
would not be able to run them in a VM.

Regardless of virtualization, Ajay’s results show PIO is slower on
bare-metal, and according to his numbers by 165ns, which is significant.
Emulating PIO in hypervisors on x86 is inherently more complex than MMIO, so
the results he got would most likely happen on all hypervisors.

tl;dr: Let’s keep this discussion focused and put paravirtualization aside.
It is not a solution for all the problems in the world.
Matthew Wilcox (Oracle) July 8, 2022, 5:55 p.m. UTC | #7
On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote:
> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> > And looking at the results above, it's not so much the PIO vs MMIO
> > that makes a difference, it's the virtualisation. A mmio access goes
> > from 269ns to 85us. Rather than messing around with preferring MMIO
> > over PIO for config space, having an "enlightenment" to do config
> > space accesses would be a more profitable path.
> 
> I am unfamiliar with the motivation for this patch, but I just wanted to
> briefly regard the advice about enlightments.
> 
> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so
> let’s regard the generic term. I think that you consider the bare-metal
> results as the possible results from a paravirtual machine, which is mostly
> wrong. Para-virtualization usually still requires a VM-exit and for the most
> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually;
> the code of paravirtual and fully-virtual devices is often different, but
> IIUC, this is not what Ajay measured).
> 
> Para-virtualization could have *perhaps* helped to reduce the number of
> PIO/MMIO and improve performance this way. If, for instance, all the
> PIO/MMIO are done during initialization, a paravirtual interface can be use
> to batch them together, and that would help. But it is more complicated to
> get a performance benefit from paravirtualization if the PIO/MMIO accesses
> are “spread”, for instance, done after each interrupt.

What kind of lousy programming interface requires you to do a config
space access after every interrupt?  This is looney-tunes.

You've used a lot of words to not answer the question that was so
important that I asked it twice.  What's the use case, what's the
workload that would benefit from this patch?

> Para-virtauilzation and full-virtualization both have pros and cons.
> Para-virtualization is many times more efficient, but requires the VM to
> have dedicated device drivers for the matter. Try to run a less-common OS
> than Linux and it would not work since the OS would not have drivers for the
> paras-virtual devices. And even if you add support today for a para-virtual
> devices, there are many deployed OSes that do not have such support, and you
> would not be able to run them in a VM.
> 
> Regardless to virtualization, Ajay’s results show PIO is slower on
> bare-metal, and according to his numbers by 165ns, which is significant.
> Emulating PIO in hypervisors on x86 is inherently more complex than MMIO, so
> the results he got would most likely happen on all hypervisors.
> 
> tl;dr: Let’s keep this discussion focused and put paravirtualization aside.
> It is not a solution for all the problems in the world.
Nadav Amit July 8, 2022, 6:35 p.m. UTC | #8
On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:

> 
> On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote:
>> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>>> And looking at the results above, it's not so much the PIO vs MMIO
>>> that makes a difference, it's the virtualisation. A mmio access goes
>>> from 269ns to 85us. Rather than messing around with preferring MMIO
>>> over PIO for config space, having an "enlightenment" to do config
>>> space accesses would be a more profitable path.
>> 
>> I am unfamiliar with the motivation for this patch, but I just wanted to
>> briefly regard the advice about enlightments.
>> 
>> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so
>> let’s regard the generic term. I think that you consider the bare-metal
>> results as the possible results from a paravirtual machine, which is mostly
>> wrong. Para-virtualization usually still requires a VM-exit and for the most
>> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually;
>> the code of paravirtual and fully-virtual devices is often different, but
>> IIUC, this is not what Ajay measured).
>> 
>> Para-virtualization could have *perhaps* helped to reduce the number of
>> PIO/MMIO and improve performance this way. If, for instance, all the
>> PIO/MMIO are done during initialization, a paravirtual interface can be use
>> to batch them together, and that would help. But it is more complicated to
>> get a performance benefit from paravirtualization if the PIO/MMIO accesses
>> are “spread”, for instance, done after each interrupt.
> 
> What kind of lousy programming interface requires you to do a config
> space access after every interrupt? This is looney-tunes.

Wild example, hence the “for instance”.

> 
> You've used a lot of words to not answer the question that was so
> important that I asked it twice. What's the use case, what's the
> workload that would benefit from this patch?

Well, you used a lot of words to say “it causes problems” without saying
which. It appeared you have misconceptions about paravirtualization that
I wanted to correct.

As I said before, I am not familiar with the exact motivation for this
patch. I now understand from Ajay that it shortens VM boot time
considerably.

I was talking to Ajay to see if there is a possibility of a VMware specific
solution. I am afraid that init_hypervisor_platform() might take place too
late.
Matthew Wilcox (Oracle) July 8, 2022, 6:43 p.m. UTC | #9
On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote:
> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> > 
> > On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote:
> >> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote:
> >> 
> >>> And looking at the results above, it's not so much the PIO vs MMIO
> >>> that makes a difference, it's the virtualisation. A mmio access goes
> >>> from 269ns to 85us. Rather than messing around with preferring MMIO
> >>> over PIO for config space, having an "enlightenment" to do config
> >>> space accesses would be a more profitable path.
> >> 
> >> I am unfamiliar with the motivation for this patch, but I just wanted to
> >> briefly regard the advice about enlightments.
> >> 
> >> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so
> >> let’s regard the generic term. I think that you consider the bare-metal
> >> results as the possible results from a paravirtual machine, which is mostly
> >> wrong. Para-virtualization usually still requires a VM-exit and for the most
> >> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually;
> >> the code of paravirtual and fully-virtual devices is often different, but
> >> IIUC, this is not what Ajay measured).
> >> 
> >> Para-virtualization could have *perhaps* helped to reduce the number of
> >> PIO/MMIO and improve performance this way. If, for instance, all the
> >> PIO/MMIO are done during initialization, a paravirtual interface can be use
> >> to batch them together, and that would help. But it is more complicated to
> >> get a performance benefit from paravirtualization if the PIO/MMIO accesses
> >> are “spread”, for instance, done after each interrupt.
> > 
> > What kind of lousy programming interface requires you to do a config
> > space access after every interrupt? This is looney-tunes.
> 
> Wild example, hence the “for instance”.

Stupid example that doesn't help.

> > You've used a lot of words to not answer the question that was so
> > important that I asked it twice. What's the use case, what's the
> > workload that would benefit from this patch?
> 
> Well, you used a lot of words to say “it causes problems” without saying
> which. It appeared you have misconceptions about paravirtualization that
> I wanted to correct.

Well now, that's some bullshit.  I did my fucking research.  I went
back 14+ years in history to figure out what was going on back then.
I cited commit IDs.  You're just tossing off some opinions.

I have no misconceptions about whatever you want to call the mechanism
for communicating with the hypervisor at a higher level than "prod this
byte".  For example, one of the more intensive things we use config
space for is sizing BARs.  If we had a hypercall to size a BAR, that
would eliminate:

 - Read current value from BAR
 - Write all-ones to BAR
 - Read new value from BAR
 - Write original value back to BAR

Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.
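
Spelled out as a minimal sketch (32-bit memory BAR only, using the standard
config accessors; the core does the real thing in __pci_read_base()):

static u32 size_mem_bar32(struct pci_dev *dev, int pos)
{
	u32 orig, sz;

	pci_read_config_dword(dev, pos, &orig);	/* read current value      */
	pci_write_config_dword(dev, pos, ~0U);	/* write all-ones          */
	pci_read_config_dword(dev, pos, &sz);	/* read back the size mask */
	pci_write_config_dword(dev, pos, orig);	/* restore original value  */

	sz &= PCI_BASE_ADDRESS_MEM_MASK;	/* drop the flag bits */
	return sz ? ~sz + 1 : 0;
}

Each of those four config accesses is an MMIO access or an address+data PIO
pair today; a single hypercall could replace the lot.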

Just because I don't use your terminology, you think I have
"misconceptions"?  Fuck you, you condescending piece of shit.

> As I said before, I am not familiar with the exact motivation for this
> patch. I now understood from Ajay that it shortens VM boot time
> considerably.

And yet, no numbers.  Yes, microbenchmark numbers that prove nothing,
but no numbers about how much it improves boot time.

> I was talking to Ajay to see if there is a possibility of a VMware specific
> solution. I am afraid that init_hypervisor_platform() might take place too
> late.
>
Nadav Amit July 8, 2022, 7:49 p.m. UTC | #10
On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote:

> 
> On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote:
>> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>>> 
>>> On Fri, Jul 08, 2022 at 04:45:00PM +0000, Nadav Amit wrote:
>>>> On Jul 8, 2022, at 5:56 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>>> 
>>>>> And looking at the results above, it's not so much the PIO vs MMIO
>>>>> that makes a difference, it's the virtualisation. A mmio access goes
>>>>> from 269ns to 85us. Rather than messing around with preferring MMIO
>>>>> over PIO for config space, having an "enlightenment" to do config
>>>>> space accesses would be a more profitable path.
>>>> 
>>>> I am unfamiliar with the motivation for this patch, but I just wanted to
>>>> briefly regard the advice about enlightments.
>>>> 
>>>> “enlightenment”, AFAIK, is Microsoft’s term for "para-virtualization", so
>>>> let’s regard the generic term. I think that you consider the bare-metal
>>>> results as the possible results from a paravirtual machine, which is mostly
>>>> wrong. Para-virtualization usually still requires a VM-exit and for the most
>>>> part the hypervisor/host runs similar code for MMIO/hypercall (conceptually;
>>>> the code of paravirtual and fully-virtual devices is often different, but
>>>> IIUC, this is not what Ajay measured).
>>>> 
>>>> Para-virtualization could have *perhaps* helped to reduce the number of
>>>> PIO/MMIO and improve performance this way. If, for instance, all the
>>>> PIO/MMIO are done during initialization, a paravirtual interface can be use
>>>> to batch them together, and that would help. But it is more complicated to
>>>> get a performance benefit from paravirtualization if the PIO/MMIO accesses
>>>> are “spread”, for instance, done after each interrupt.
>>> 
>>> What kind of lousy programming interface requires you to do a config
>>> space access after every interrupt? This is looney-tunes.
>> 
>> Wild example, hence the “for instance”.
> 
> Stupid example that doesn't help.
> 
>>> You've used a lot of words to not answer the question that was so
>>> important that I asked it twice. What's the use case, what's the
>>> workload that would benefit from this patch?
>> 
>> Well, you used a lot of words to say “it causes problems” without saying
>> which. It appeared you have misconceptions about paravirtualization that
>> I wanted to correct.
> 
> Well now, that's some bullshit. I did my fucking research. I went
> back 14+ years in history to figure out what was going on back then.
> I cited commit IDs. You're just tossing off some opinions.
> 
> I have no misconceptions about whatever you want to call the mechanism
> for communicating with the hypervisor at a higher level than "prod this
> byte". For example, one of the more intensive things we use config
> space for is sizing BARs. If we had a hypercall to siz a BAR, that
> would eliminate:
> 
> - Read current value from BAR
> - Write all-ones to BAR
> - Read new value from BAR
> - Write original value back to BAR
> 
> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.
> 
> Just because I don't use your terminology, you think I have
> "misconceptions"? Fuck you, you condescending piece of shit.

Matthew,

I did not mean to sound condescending and I apologize if I did. You have my
*full* respect for your coding/design skills.

Out of my respect to you, I am giving you a pass on your conduct this time
and *this time only*. Do not use such language with me or my colleagues
again. The only reason I got involved in this discussion is that I feel that
my colleagues have concerns about the kernel's toxic environment.

Back to the issue at hand: I think that a new paravirtual interface is a
possible solution, with some serious drawbacks. Xen did something similar,
IIRC, to a certain extent.

More reasonable, I think, based on what you said before, is to check if we
run on a hypervisor, and update raw_pci_ops accordingly. There is an issue
of whether hypervisor detection might take place too late, but I think this
can be relatively easily resolved. The question is whether assigned devices
might still be broken. Based on the information that you provided - I do not
know.

If you can answer this question, that would be helpful. Let’s also wait for
Ajay to give some numbers about boot time with this change.
Ajay Kaher July 11, 2022, 6:31 a.m. UTC | #11
On 09/07/22, 1:19 AM, "Nadav Amit" <namit@vmware.com> wrote:

> On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote:

>> I have no misconceptions about whatever you want to call the mechanism
>> for communicating with the hypervisor at a higher level than "prod this
>> byte". For example, one of the more intensive things we use config
>> space for is sizing BARs. If we had a hypercall to siz a BAR, that
>> would eliminate:
>>
>> - Read current value from BAR
>> - Write all-ones to BAR
>> - Read new value from BAR
>> - Write original value back to BAR
>>
>> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.

To improve further, we can have the following mechanism:
map (as read-only) the virtual device config space, i.e. the 4KB ECAM,
into VM MMIO. The VM will then have direct read access using MMIO but
not using PIO.

Virtual Machine test result with the above mechanism:
1 hundred thousand reads using raw_pci_read() took:
PIO: 12.809 Sec.
MMIO: 0.010 Sec.

And during VM boot, PCI scan and initialization time was reduced by ~65%;
in our case it dropped from ~55 ms to ~18 ms.

Thanks Matthew, for sharing history and your views on this patch.

As you mentioned, the ordering change may impact some hardware, so
it's better to make this change specific to the VMware hypervisor or
generic to all hypervisors.

- Ajay

> Back to the issue at hand: I think that a new paravirtual interface is a
> possible solution, with some serious drawbacks. Xen did something similar,
> IIRC, to a certain extent.
>
> More reasonable, I think, based on what you said before, is to check if we
> run on a hypervisor, and update raw_pci_ops accordingly. There is an issue
> of whether hypervisor detection might take place too late, but I think this
> can be relatively easily resolved. The question is whether assigned devices
> might still be broken. Based on the information that you provided - I do not
> know.
>
> If you can answer this question, that would be helpful. Let’s also wait for
> Ajay to give some numbers about boot time with this change.
Nadav Amit July 11, 2022, 5:04 p.m. UTC | #12
On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote:

> On 09/07/22, 1:19 AM, "Nadav Amit" <namit@vmware.com> wrote:
> 
>> On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
>>> I have no misconceptions about whatever you want to call the mechanism
>>> for communicating with the hypervisor at a higher level than "prod this
>>> byte". For example, one of the more intensive things we use config
>>> space for is sizing BARs. If we had a hypercall to siz a BAR, that
>>> would eliminate:
>>> 
>>> - Read current value from BAR
>>> - Write all-ones to BAR
>>> - Read new value from BAR
>>> - Write original value back to BAR
>>> 
>>> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.
> 
> To improve further we can have following mechanism:
> Map (as read only) the 'virtual device config i.e. 4KB ECAM' to
> VM MMIO. VM will have direct read access using MMIO but
> not using PIO.
> 
> Virtual Machine test result with above mechanism:
> 1 hundred thousand read using raw_pci_read() took:
> PIO: 12.809 Sec.
> MMIO: 0.010 Sec.
> 
> And while VM booting, PCI scan and initialization time have been
> reduced by ~65%. In our case it reduced to ~18 mSec from ~55 mSec.
> 
> Thanks Matthew, for sharing history and your views on this patch.
> 
> As you mentioned ordering change may impact some Hardware, so
> it's better to have this change for VMware hypervisor or generic to
> all hypervisor.

I was chatting with Ajay, since I personally did not fully understand his
use-case from the email. Others may have fully understood and can ignore
this email. Here is a short summary of my understanding:

During boot-time there are many PCI reads. Currently, when these reads are
performed by a virtual machine, they all cause a VM-exit, and therefore each
one of them induces a considerable overhead.

When using MMIO (but not PIO), it is possible to map the PCI BARs of the
virtual machine to some memory area that holds the values that the “emulated
hardware” is supposed to return. The memory region is mapped as "read-only”
in the NPT/EPT, so reads from these BAR regions would be treated as regular
memory reads. Writes would still be trapped and emulated by the hypervisor.

I have a vague recollection from some similar project that I had 10 years
ago that this might not work for certain emulated device registers. For
instance, some hardware registers, specifically those that report hardware
events, are “clear-on-read”. Apparently, Ajay took that into consideration.

That is the reason for this quite amazing difference - several orders of
magnitude - between the overhead that is caused by raw_pci_read(): 120us for
PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would
take 120us (I would have expected it to be 10 times faster, at least), but
the benefit is quite clear.
Ajay Kaher July 11, 2022, 5:53 p.m. UTC | #13
On 11/07/22, 10:34 PM, "Nadav Amit" <namit@vmware.com> wrote:

> On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote:
>
> During boot-time there are many PCI reads. Currently, when these reads are
> performed by a virtual machine, they all cause a VM-exit, and therefore each
> one of them induces a considerable overhead.
>
> When using MMIO (but not PIO), it is possible to map the PCI BARs of the
> virtual machine to some memory area that holds the values that the “emulated
> hardware” is supposed to return. The memory region is mapped as "read-only”
> in the NPT/EPT, so reads from these BAR regions would be treated as regular
> memory reads. Writes would still be trapped and emulated by the hypervisor.

I guess there is a typo in the above paragraph: it's the per-device PCI config space,
i.e. the 4KB ECAM, not the PCI BARs. Please read the above paragraph as:

When using MMIO (but not PIO), it is possible to map the PCI config space of the
virtual machine to some memory area that holds the values that the “emulated
hardware” is supposed to return. The memory region is mapped as "read-only”
in the NPT/EPT, so reads from this PCI config space would be treated as regular
memory reads. Writes would still be trapped and emulated by the hypervisor.
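
For illustration, a minimal guest-side sketch of that read path, loosely
mirroring arch/x86/pci/mmconfig_64.c (pci_dev_base() stands in for the
per-device ECAM mapping lookup):

static int mmcfg_read_dword(unsigned int seg, unsigned int bus,
			    unsigned int devfn, int reg, u32 *val)
{
	char __iomem *addr = pci_dev_base(seg, bus, devfn);

	if (!addr) {
		*val = ~0;
		return -EINVAL;
	}

	/* A plain memory read: no VM-exit once the ECAM page is mapped RO. */
	*val = readl(addr + reg);
	return 0;
}

Writes would still go through the trapping path, as described above.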

We will send a v2 or a new patch which will be VMware-specific.

> I have a vague recollection from some similar project that I had 10 years
> ago that this might not work for certain emulated device registers. For
> instance some hardware registers, specifically those the report hardware
> events, are “clear-on-read”. Apparently, Ajay took that into consideration.
>
> That is the reason for this quite amazing difference - several orders of
> magnitude - between the overhead that is caused by raw_pci_read(): 120us for
> PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would
> take 120us (I would have expected it to be 10 times faster, at least), but
> the benefit is quite clear.
H. Peter Anvin July 11, 2022, 6:18 p.m. UTC | #14
On July 11, 2022 10:53:54 AM PDT, Ajay Kaher <akaher@vmware.com> wrote:
>
>On 11/07/22, 10:34 PM, "Nadav Amit" <namit@vmware.com> wrote:
>
>> On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@vmware.com> wrote:
>>
>> During boot-time there are many PCI reads. Currently, when these reads are
>> performed by a virtual machine, they all cause a VM-exit, and therefore each
>> one of them induces a considerable overhead.
>>
>> When using MMIO (but not PIO), it is possible to map the PCI BARs of the
>> virtual machine to some memory area that holds the values that the “emulated
>> hardware” is supposed to return. The memory region is mapped as "read-only”
>> in the NPT/EPT, so reads from these BAR regions would be treated as regular
>> memory reads. Writes would still be trapped and emulated by the hypervisor.
>
>I guess some typo mistake in above paragraph, it's per-device PCI config space
>i.e. 4KB ECAM not PCI BARs. Please read above paragraph as:
>
>When using MMIO (but not PIO), it is possible to map the PCI config space of the
>virtual machine to some memory area that holds the values that the “emulated
>hardware” is supposed to return. The memory region is mapped as "read-only”
>in the NPT/EPT, so reads from these PCI config space would be treated as regular
>memory reads. Writes would still be trapped and emulated by the hypervisor.
>
>We will send v2 or new patch which will be VMware specific.
>
>> I have a vague recollection from some similar project that I had 10 years
>> ago that this might not work for certain emulated device registers. For
>> instance some hardware registers, specifically those the report hardware
>> events, are “clear-on-read”. Apparently, Ajay took that into consideration.
>>
>> That is the reason for this quite amazing difference - several orders of
>> magnitude - between the overhead that is caused by raw_pci_read(): 120us for
>> PIO and 100ns for MMIO. Admittedly, I do not understand why PIO access would
>> take 120us (I would have expected it to be 10 times faster, at least), but
>> the benefit is quite clear.
>
>
>

For one thing, please correct the explanation.

It does not take "more PCI cycles" to use PIO – they are exactly the same, in fact.  The source of the improvement is all in the CPU and VMM interfaces; on the PCI bus, they are (mostly) just address spaces.

"Using MMIO may allow a VMM to map a shadow memory area readonly, so read transactions can be executed without needing any VMEXIT at all. In contrast, PIO transactions to PCI configuration space are done through an indirect address-data interface, requiring two VMEXITs per transaction regardless of the properties of the underlying register."

You should call out exactly what is being done to prevent incorrect handling of registers with read side effects (I believe that would be all on the VMM side; unfortunately the presence of a register with read side effects probably would mean losing this optimization for the entire 4K page = this entire function, but read side effects have always been discouraged although not prohibited in config space.)
Paolo Bonzini July 11, 2022, 7:43 p.m. UTC | #15
On 7/8/22 20:43, Matthew Wilcox wrote:
> Just because I don't use your terminology, you think I have
> "misconceptions"?  Fuck you, you condescending piece of shit.

Matthew, perhaps your vacation was not long enough?

This is not even a violation of the code of conduct, it's a violation of 
basic human decency.

Paolo
Shuah Khan July 12, 2022, 9:20 p.m. UTC | #16
Hi Matthew,

On 7/8/22 12:43 PM, Matthew Wilcox wrote:
> On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote:
>> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
> 
> Just because I don't use your terminology, you think I have
> "misconceptions"?  Fuck you, you condescending piece of shit.
> 

This is clearly unacceptable and violates the Code of Conduct.

Others including the person this was directed at already called out
the violation:

https://lore.kernel.org/all/85071FE5-E37A-44CF-9EF7-CB80C116A876@vmware.com/
https://lore.kernel.org/all/880aac01-9c78-34f7-8a11-d48179e1635c@redhat.com/

You should be well aware that this type of language and personal attack is a
clear violation of the Linux kernel Contributor Covenant Code of Conduct as
outlined in the following:

https://www.kernel.org/doc/html/latest/process/code-of-conduct.html

Refer to the Code of Conduct and refrain from violating the Code of Conduct
in the future.

thanks,
-- Shuah (On behalf of the Code of Conduct Committee)
Matthew Wilcox (Oracle) Aug. 31, 2022, 9:44 p.m. UTC | #17
On Tue, Jul 12, 2022 at 03:20:57PM -0600, Shuah Khan wrote:
> Hi Matthew,
> 
> On 7/8/22 12:43 PM, Matthew Wilcox wrote:
> > On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote:
> > > On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
> > > 
> > 
> > Just because I don't use your terminology, you think I have
> > "misconceptions"?  Fuck you, you condescending piece of shit.
> > 
> 
> This is clearly unacceptable and violates the Code of Conduct.

In a message I sent to this mailing list on July 8, I used language
which offended some readers.  I would like to apologise for my choice of
words.  I remain committed to fostering a community where disagreements
can be handled respectfully and will do my best to uphold these important
principles in the future.  I would like to thank the Code of Conduct
committee for their diligence in overseeing these matters.
Shuah Khan Aug. 31, 2022, 10:23 p.m. UTC | #18
On 8/31/22 15:44, Matthew Wilcox wrote:
> On Tue, Jul 12, 2022 at 03:20:57PM -0600, Shuah Khan wrote:
>> Hi Matthew,
>>
>> On 7/8/22 12:43 PM, Matthew Wilcox wrote:
>>> On Fri, Jul 08, 2022 at 06:35:48PM +0000, Nadav Amit wrote:
>>>> On Jul 8, 2022, at 10:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>
>>> Just because I don't use your terminology, you think I have
>>> "misconceptions"?  Fuck you, you condescending piece of shit.
>>>
>>
>> This is clearly unacceptable and violates the Code of Conduct.
> 
> In a message I sent to this mailing list on July 8, I used language
> which offended some readers.  I would like to apologise for my choice of
> words.  I remain committed to fostering a community where disagreements
> can be handled respectfully and will do my best to uphold these important
> principles in the future.  I would like to thank the Code of Conduct
> committee for their diligence in overseeing these matters.
> 

Thank you Matthew for taking this important step to show your commitment
to keeping our community a welcoming place that fosters respectful and
productive technical discourse.

thanks,
-- Shuah (On behalf of the Code of Conduct Committee)

Patch

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456f..0b3383d9c 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -40,20 +40,20 @@  const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 *val)
 {
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
 	if (domain == 0 && reg < 256 && raw_pci_ops)
 		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
-	if (raw_pci_ext_ops)
-		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
 	return -EINVAL;
 }
 
 int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
 						int reg, int len, u32 val)
 {
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
 	if (domain == 0 && reg < 256 && raw_pci_ops)
 		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
-	if (raw_pci_ext_ops)
-		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
 	return -EINVAL;
 }