[XSA-126] xen: limit guest control of PCI command register

Message ID	alpine.DEB.2.02.1503311510300.7690@kaball.uk.xensource.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> Date: Tue, 31 Mar 2015 15:18:03 +0100 From: Stefano Stabellini <stefano.stabellini@eu.citrix.com> To: <qemu-devel@nongnu.org> Message-ID: <alpine.DEB.2.02.1503311510300.7690@kaball.uk.xensource.com> User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Cc: pmatouse@redhat.com, Stefano Stabellini <Stefano.Stabellini@eu.citrix.com> Subject: [Qemu-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Stefano Stabellini March 31, 2015, 2:18 p.m. UTC

From: Jan Beulich <jbeulich@suse.com>

Otherwise the guest can abuse that control to cause e.g. PCIe
Unsupported Request responses (by disabling memory and/or I/O decoding
and subsequently causing [CPU side] accesses to the respective address
ranges), which (depending on system configuration) may be fatal to the
host.

This is CVE-2015-2756 / XSA-126.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Michael S. Tsirkin April 1, 2015, 9:01 a.m. UTC | #1

On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> From: Jan Beulich <jbeulich@suse.com>
> 
> Otherwise the guest can abuse that control to cause e.g. PCIe
> Unsupported Request responses (by disabling memory and/or I/O decoding
> and subsequently causing [CPU side] accesses to the respective address
> ranges), which (depending on system configuration) may be fatal to the
> host.
> 
> This is CVE-2015-2756 / XSA-126.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

The patch description seems somewhat incorrect to me.
UR should not be fatal to the system, and it's not platform
specific.

In particular, there could be more reasons for devices
to generate URs, for example, if they get a transaction
during FLR. I don't think we ever tried to prevent this.

Here's the description from pci express spec:

IMPLEMENTATION NOTE
Software UR Reporting Compatibility with 1.0a Devices

	With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
	when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
	error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
	this will break PC-compatible Configuration Space probing, so software/firmware on such
	platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
	Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
	helping avoid this case for potential silent data corruption.
	On platforms where robust error handling and PC-compatible Configuration Space probing is
	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
	bit in the Device Capabilities register.


What I think you have is a very old 1.0a system, and host set Unsupported
Request Reporting Enable.

The right thing is for host kernel to do is what the note says, and disable UR
reporting for this system.

As a work around for broken kernels, this is OK I guess, though
guest and host being out of sync is always a risk.
But I think it's a good idea to add documentation explaining
this, and work on the correct fix in linux, before we
add workarounds.


> diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> index f2893b2..d095c08 100644
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -388,7 +388,7 @@ static const MemoryRegionOps ops = {
>      .write = xen_pt_bar_write,
>  };
>  
> -static int xen_pt_register_regions(XenPCIPassthroughState *s)
> +static int xen_pt_register_regions(XenPCIPassthroughState *s, uint16_t *cmd)
>  {
>      int i = 0;
>      XenHostPCIDevice *d = &s->real_device;
> @@ -406,6 +406,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
>  
>          if (r->type & XEN_HOST_PCI_REGION_TYPE_IO) {
>              type = PCI_BASE_ADDRESS_SPACE_IO;
> +            *cmd |= PCI_COMMAND_IO;
>          } else {
>              type = PCI_BASE_ADDRESS_SPACE_MEMORY;
>              if (r->type & XEN_HOST_PCI_REGION_TYPE_PREFETCH) {
> @@ -414,6 +415,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
>              if (r->type & XEN_HOST_PCI_REGION_TYPE_MEM_64) {
>                  type |= PCI_BASE_ADDRESS_MEM_TYPE_64;
>              }
> +            *cmd |= PCI_COMMAND_MEMORY;
>          }
>  
>          memory_region_init_io(&s->bar[i], OBJECT(s), &ops, &s->dev,
> @@ -638,6 +640,7 @@ static int xen_pt_initfn(PCIDevice *d)
>      XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
>      int rc = 0;
>      uint8_t machine_irq = 0;
> +    uint16_t cmd = 0;
>      int pirq = XEN_PT_UNASSIGNED_PIRQ;
>  
>      /* register real device */
> @@ -672,7 +675,7 @@ static int xen_pt_initfn(PCIDevice *d)
>      s->io_listener = xen_pt_io_listener;
>  
>      /* Handle real device's MMIO/PIO BARs */
> -    xen_pt_register_regions(s);
> +    xen_pt_register_regions(s, &cmd);
>  
>      /* reinitialize each config register to be emulated */
>      if (xen_pt_config_init(s)) {
> @@ -736,6 +739,11 @@ static int xen_pt_initfn(PCIDevice *d)
>      }
>  
>  out:
> +    if (cmd) {
> +        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
> +                              pci_get_word(d->config + PCI_COMMAND) | cmd);
> +    }
> +

Is this writing to host device, forcing cmd and io bits to enabled simply
because they are present? If yes, I think that's wrong: you don't know whether
bios enabled them, if it didn't host BARs might conflict with other devices,
and this will crash host.  I don't see why you are touching host command
register at all.  The point in the description is to avoid touching host.


>      memory_listener_register(&s->memory_listener, &s->dev.bus_master_as);
>      memory_listener_register(&s->io_listener, &address_space_io);
>      XEN_PT_LOG(d,
> diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
> index d99c22e..95a51db 100644
> --- a/hw/xen/xen_pt_config_init.c
> +++ b/hw/xen/xen_pt_config_init.c
> @@ -286,23 +286,6 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
>  }
>  
>  /* Command register */
> -static int xen_pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> -                               uint16_t *value, uint16_t valid_mask)
> -{
> -    XenPTRegInfo *reg = cfg_entry->reg;
> -    uint16_t valid_emu_mask = 0;
> -    uint16_t emu_mask = reg->emu_mask;
> -
> -    if (s->is_virtfn) {
> -        emu_mask |= PCI_COMMAND_MEMORY;
> -    }
> -
> -    /* emulate word register */
> -    valid_emu_mask = emu_mask & valid_mask;
> -    *value = XEN_PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> -
> -    return 0;
> -}
>  static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
>                                  uint16_t *val, uint16_t dev_value,
>                                  uint16_t valid_mask)
> @@ -310,18 +293,13 @@ static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
>      XenPTRegInfo *reg = cfg_entry->reg;
>      uint16_t writable_mask = 0;
>      uint16_t throughable_mask = 0;
> -    uint16_t emu_mask = reg->emu_mask;
> -
> -    if (s->is_virtfn) {
> -        emu_mask |= PCI_COMMAND_MEMORY;
> -    }
>  
>      /* modify emulate register */
>      writable_mask = ~reg->ro_mask & valid_mask;
>      cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
>  
>      /* create value for writing to I/O device register */
> -    throughable_mask = ~emu_mask & valid_mask;
> +    throughable_mask = ~reg->emu_mask & valid_mask;
>  
>      if (*val & PCI_COMMAND_INTX_DISABLE) {
>          throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> @@ -603,9 +581,9 @@ static XenPTRegInfo xen_pt_emu_reg_header0[] = {
>          .size       = 2,
>          .init_val   = 0x0000,
>          .ro_mask    = 0xF880,
> -        .emu_mask   = 0x0740,
> +        .emu_mask   = 0x0743,
>          .init       = xen_pt_common_reg_init,
> -        .u.w.read   = xen_pt_cmd_reg_read,
> +        .u.w.read   = xen_pt_word_reg_read,
>          .u.w.write  = xen_pt_cmd_reg_write,
>      },
>      /* Capabilities Pointer reg */

Stefano Stabellini April 1, 2015, 9:20 a.m. UTC | #2

CC'ing the author of the patch and xen-devel.
FYI I think that Jan is going to be on vacation for a couple of weeks.

On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> > From: Jan Beulich <jbeulich@suse.com>
> > 
> > Otherwise the guest can abuse that control to cause e.g. PCIe
> > Unsupported Request responses (by disabling memory and/or I/O decoding
> > and subsequently causing [CPU side] accesses to the respective address
> > ranges), which (depending on system configuration) may be fatal to the
> > host.
> > 
> > This is CVE-2015-2756 / XSA-126.
> > 
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> The patch description seems somewhat incorrect to me.
> UR should not be fatal to the system, and it's not platform
> specific.

I think that people have been able to repro this, but I'll let Jan
comment on it.


> In particular, there could be more reasons for devices
> to generate URs, for example, if they get a transaction
> during FLR. I don't think we ever tried to prevent this.

That cannot be triggered by guest behaviour.


> Here's the description from pci express spec:
> 
> IMPLEMENTATION NOTE
> Software UR Reporting Compatibility with 1.0a Devices
> 
> 	With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
> 	when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
> 	error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
> 	this will break PC-compatible Configuration Space probing, so software/firmware on such
> 	platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
> 	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
> 	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
> 	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
> 	Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
> 	helping avoid this case for potential silent data corruption.
> 	On platforms where robust error handling and PC-compatible Configuration Space probing is
> 	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
> 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
> 	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
> 	bit in the Device Capabilities register.
> 
> 
> What I think you have is a very old 1.0a system, and host set Unsupported
> Request Reporting Enable.
> 
> The right thing is for host kernel to do is what the note says, and disable UR
> reporting for this system.
> 
> As a work around for broken kernels, this is OK I guess, though
> guest and host being out of sync is always a risk.
> But I think it's a good idea to add documentation explaining
> this, and work on the correct fix in linux, before we
> add workarounds.

There can be guest kernels other than Linux, we cannot fix them all, and
we cannot allow PCI passthrough only with Linux guests.


> > diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> > index f2893b2..d095c08 100644
> > --- a/hw/xen/xen_pt.c
> > +++ b/hw/xen/xen_pt.c
> > @@ -388,7 +388,7 @@ static const MemoryRegionOps ops = {
> >      .write = xen_pt_bar_write,
> >  };
> >  
> > -static int xen_pt_register_regions(XenPCIPassthroughState *s)
> > +static int xen_pt_register_regions(XenPCIPassthroughState *s, uint16_t *cmd)
> >  {
> >      int i = 0;
> >      XenHostPCIDevice *d = &s->real_device;
> > @@ -406,6 +406,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
> >  
> >          if (r->type & XEN_HOST_PCI_REGION_TYPE_IO) {
> >              type = PCI_BASE_ADDRESS_SPACE_IO;
> > +            *cmd |= PCI_COMMAND_IO;
> >          } else {
> >              type = PCI_BASE_ADDRESS_SPACE_MEMORY;
> >              if (r->type & XEN_HOST_PCI_REGION_TYPE_PREFETCH) {
> > @@ -414,6 +415,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
> >              if (r->type & XEN_HOST_PCI_REGION_TYPE_MEM_64) {
> >                  type |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> >              }
> > +            *cmd |= PCI_COMMAND_MEMORY;
> >          }
> >  
> >          memory_region_init_io(&s->bar[i], OBJECT(s), &ops, &s->dev,
> > @@ -638,6 +640,7 @@ static int xen_pt_initfn(PCIDevice *d)
> >      XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> >      int rc = 0;
> >      uint8_t machine_irq = 0;
> > +    uint16_t cmd = 0;
> >      int pirq = XEN_PT_UNASSIGNED_PIRQ;
> >  
> >      /* register real device */
> > @@ -672,7 +675,7 @@ static int xen_pt_initfn(PCIDevice *d)
> >      s->io_listener = xen_pt_io_listener;
> >  
> >      /* Handle real device's MMIO/PIO BARs */
> > -    xen_pt_register_regions(s);
> > +    xen_pt_register_regions(s, &cmd);
> >  
> >      /* reinitialize each config register to be emulated */
> >      if (xen_pt_config_init(s)) {
> > @@ -736,6 +739,11 @@ static int xen_pt_initfn(PCIDevice *d)
> >      }
> >  
> >  out:
> > +    if (cmd) {
> > +        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
> > +                              pci_get_word(d->config + PCI_COMMAND) | cmd);
> > +    }
> > +
> 
> Is this writing to host device, forcing cmd and io bits to enabled simply
> because they are present? If yes, I think that's wrong: you don't know whether
> bios enabled them, if it didn't host BARs might conflict with other devices,
> and this will crash host.  I don't see why you are touching host command
> register at all.  The point in the description is to avoid touching host.

pciback clears the command register when seizing the device:

pcistub_seize() -> pcistub_init_device() -> xen_pcibk_reset_device()


> >      memory_listener_register(&s->memory_listener, &s->dev.bus_master_as);
> >      memory_listener_register(&s->io_listener, &address_space_io);
> >      XEN_PT_LOG(d,
> > diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
> > index d99c22e..95a51db 100644
> > --- a/hw/xen/xen_pt_config_init.c
> > +++ b/hw/xen/xen_pt_config_init.c
> > @@ -286,23 +286,6 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
> >  }
> >  
> >  /* Command register */
> > -static int xen_pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> > -                               uint16_t *value, uint16_t valid_mask)
> > -{
> > -    XenPTRegInfo *reg = cfg_entry->reg;
> > -    uint16_t valid_emu_mask = 0;
> > -    uint16_t emu_mask = reg->emu_mask;
> > -
> > -    if (s->is_virtfn) {
> > -        emu_mask |= PCI_COMMAND_MEMORY;
> > -    }
> > -
> > -    /* emulate word register */
> > -    valid_emu_mask = emu_mask & valid_mask;
> > -    *value = XEN_PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> > -
> > -    return 0;
> > -}
> >  static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> >                                  uint16_t *val, uint16_t dev_value,
> >                                  uint16_t valid_mask)
> > @@ -310,18 +293,13 @@ static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> >      XenPTRegInfo *reg = cfg_entry->reg;
> >      uint16_t writable_mask = 0;
> >      uint16_t throughable_mask = 0;
> > -    uint16_t emu_mask = reg->emu_mask;
> > -
> > -    if (s->is_virtfn) {
> > -        emu_mask |= PCI_COMMAND_MEMORY;
> > -    }
> >  
> >      /* modify emulate register */
> >      writable_mask = ~reg->ro_mask & valid_mask;
> >      cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
> >  
> >      /* create value for writing to I/O device register */
> > -    throughable_mask = ~emu_mask & valid_mask;
> > +    throughable_mask = ~reg->emu_mask & valid_mask;
> >  
> >      if (*val & PCI_COMMAND_INTX_DISABLE) {
> >          throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> > @@ -603,9 +581,9 @@ static XenPTRegInfo xen_pt_emu_reg_header0[] = {
> >          .size       = 2,
> >          .init_val   = 0x0000,
> >          .ro_mask    = 0xF880,
> > -        .emu_mask   = 0x0740,
> > +        .emu_mask   = 0x0743,
> >          .init       = xen_pt_common_reg_init,
> > -        .u.w.read   = xen_pt_cmd_reg_read,
> > +        .u.w.read   = xen_pt_word_reg_read,
> >          .u.w.write  = xen_pt_cmd_reg_write,
> >      },
> >      /* Capabilities Pointer reg */
>

Michael S. Tsirkin April 1, 2015, 9:32 a.m. UTC | #3

On Wed, Apr 01, 2015 at 10:20:06AM +0100, Stefano Stabellini wrote:
> CC'ing the author of the patch and xen-devel.
> FYI I think that Jan is going to be on vacation for a couple of weeks.
> 
> On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> > On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> > > From: Jan Beulich <jbeulich@suse.com>
> > > 
> > > Otherwise the guest can abuse that control to cause e.g. PCIe
> > > Unsupported Request responses (by disabling memory and/or I/O decoding
> > > and subsequently causing [CPU side] accesses to the respective address
> > > ranges), which (depending on system configuration) may be fatal to the
> > > host.
> > > 
> > > This is CVE-2015-2756 / XSA-126.
> > > 
> > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > > Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> > 
> > The patch description seems somewhat incorrect to me.
> > UR should not be fatal to the system, and it's not platform
> > specific.
> 
> I think that people have been able to repro this, but I'll let Jan
> comment on it.
> 
> 
> > In particular, there could be more reasons for devices
> > to generate URs, for example, if they get a transaction
> > during FLR. I don't think we ever tried to prevent this.
> 
> That cannot be triggered by guest behaviour.

Emulated bus reset often triggers FLR, but I can't speak for
Xen here.  Many devices have vendor specific reset registers though.
I don't believe anyone has a comprehensive list of such devices.

Fundamendally IOMMU is hosts's only protection. IOMMU doesn't
protect against URs so they mustn't trigger system errors.


> 
> > Here's the description from pci express spec:
> > 
> > IMPLEMENTATION NOTE
> > Software UR Reporting Compatibility with 1.0a Devices
> > 
> > 	With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
> > 	when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
> > 	error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
> > 	this will break PC-compatible Configuration Space probing, so software/firmware on such
> > 	platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
> > 	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
> > 	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
> > 	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
> > 	Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
> > 	helping avoid this case for potential silent data corruption.
> > 	On platforms where robust error handling and PC-compatible Configuration Space probing is
> > 	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
> > 	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
> > 	bit in the Device Capabilities register.
> > 
> > 
> > What I think you have is a very old 1.0a system, and host set Unsupported
> > Request Reporting Enable.
> > 
> > The right thing is for host kernel to do is what the note says, and disable UR
> > reporting for this system.
> > 
> > As a work around for broken kernels, this is OK I guess, though
> > guest and host being out of sync is always a risk.
> > But I think it's a good idea to add documentation explaining
> > this, and work on the correct fix in linux, before we
> > add workarounds.
> 
> There can be guest kernels other than Linux, we cannot fix them all, and
> we cannot allow PCI passthrough only with Linux guests.

Are you trying to fix guest crashes, or host crashes?
If guest crashes, how is this a CVE?

What I said above has nothing to do with guest bugs.
I think the bug is in whoever configured the host pci bridge.
That's host kernel, not guest kernel.


> 
> > > diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
> > > index f2893b2..d095c08 100644
> > > --- a/hw/xen/xen_pt.c
> > > +++ b/hw/xen/xen_pt.c
> > > @@ -388,7 +388,7 @@ static const MemoryRegionOps ops = {
> > >      .write = xen_pt_bar_write,
> > >  };
> > >  
> > > -static int xen_pt_register_regions(XenPCIPassthroughState *s)
> > > +static int xen_pt_register_regions(XenPCIPassthroughState *s, uint16_t *cmd)
> > >  {
> > >      int i = 0;
> > >      XenHostPCIDevice *d = &s->real_device;
> > > @@ -406,6 +406,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
> > >  
> > >          if (r->type & XEN_HOST_PCI_REGION_TYPE_IO) {
> > >              type = PCI_BASE_ADDRESS_SPACE_IO;
> > > +            *cmd |= PCI_COMMAND_IO;
> > >          } else {
> > >              type = PCI_BASE_ADDRESS_SPACE_MEMORY;
> > >              if (r->type & XEN_HOST_PCI_REGION_TYPE_PREFETCH) {
> > > @@ -414,6 +415,7 @@ static int xen_pt_register_regions(XenPCIPassthroughState *s)
> > >              if (r->type & XEN_HOST_PCI_REGION_TYPE_MEM_64) {
> > >                  type |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> > >              }
> > > +            *cmd |= PCI_COMMAND_MEMORY;
> > >          }
> > >  
> > >          memory_region_init_io(&s->bar[i], OBJECT(s), &ops, &s->dev,
> > > @@ -638,6 +640,7 @@ static int xen_pt_initfn(PCIDevice *d)
> > >      XenPCIPassthroughState *s = DO_UPCAST(XenPCIPassthroughState, dev, d);
> > >      int rc = 0;
> > >      uint8_t machine_irq = 0;
> > > +    uint16_t cmd = 0;
> > >      int pirq = XEN_PT_UNASSIGNED_PIRQ;
> > >  
> > >      /* register real device */
> > > @@ -672,7 +675,7 @@ static int xen_pt_initfn(PCIDevice *d)
> > >      s->io_listener = xen_pt_io_listener;
> > >  
> > >      /* Handle real device's MMIO/PIO BARs */
> > > -    xen_pt_register_regions(s);
> > > +    xen_pt_register_regions(s, &cmd);
> > >  
> > >      /* reinitialize each config register to be emulated */
> > >      if (xen_pt_config_init(s)) {
> > > @@ -736,6 +739,11 @@ static int xen_pt_initfn(PCIDevice *d)
> > >      }
> > >  
> > >  out:
> > > +    if (cmd) {
> > > +        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
> > > +                              pci_get_word(d->config + PCI_COMMAND) | cmd);
> > > +    }
> > > +
> > 
> > Is this writing to host device, forcing cmd and io bits to enabled simply
> > because they are present? If yes, I think that's wrong: you don't know whether
> > bios enabled them, if it didn't host BARs might conflict with other devices,
> > and this will crash host.  I don't see why you are touching host command
> > register at all.  The point in the description is to avoid touching host.
> 
> pciback clears the command register when seizing the device:
> 
> pcistub_seize() -> pcistub_init_device() -> xen_pcibk_reset_device()


I think you have a bug then.  You must keep memory/io enable bits that
bios set for you, otherwise you risk conflicts with other devices,
or conflicts between BARs. I guess you do this for BAR values,
same applies here.




> 
> > >      memory_listener_register(&s->memory_listener, &s->dev.bus_master_as);
> > >      memory_listener_register(&s->io_listener, &address_space_io);
> > >      XEN_PT_LOG(d,
> > > diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
> > > index d99c22e..95a51db 100644
> > > --- a/hw/xen/xen_pt_config_init.c
> > > +++ b/hw/xen/xen_pt_config_init.c
> > > @@ -286,23 +286,6 @@ static int xen_pt_irqpin_reg_init(XenPCIPassthroughState *s,
> > >  }
> > >  
> > >  /* Command register */
> > > -static int xen_pt_cmd_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> > > -                               uint16_t *value, uint16_t valid_mask)
> > > -{
> > > -    XenPTRegInfo *reg = cfg_entry->reg;
> > > -    uint16_t valid_emu_mask = 0;
> > > -    uint16_t emu_mask = reg->emu_mask;
> > > -
> > > -    if (s->is_virtfn) {
> > > -        emu_mask |= PCI_COMMAND_MEMORY;
> > > -    }
> > > -
> > > -    /* emulate word register */
> > > -    valid_emu_mask = emu_mask & valid_mask;
> > > -    *value = XEN_PT_MERGE_VALUE(*value, cfg_entry->data, ~valid_emu_mask);
> > > -
> > > -    return 0;
> > > -}
> > >  static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> > >                                  uint16_t *val, uint16_t dev_value,
> > >                                  uint16_t valid_mask)
> > > @@ -310,18 +293,13 @@ static int xen_pt_cmd_reg_write(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
> > >      XenPTRegInfo *reg = cfg_entry->reg;
> > >      uint16_t writable_mask = 0;
> > >      uint16_t throughable_mask = 0;
> > > -    uint16_t emu_mask = reg->emu_mask;
> > > -
> > > -    if (s->is_virtfn) {
> > > -        emu_mask |= PCI_COMMAND_MEMORY;
> > > -    }
> > >  
> > >      /* modify emulate register */
> > >      writable_mask = ~reg->ro_mask & valid_mask;
> > >      cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
> > >  
> > >      /* create value for writing to I/O device register */
> > > -    throughable_mask = ~emu_mask & valid_mask;
> > > +    throughable_mask = ~reg->emu_mask & valid_mask;
> > >  
> > >      if (*val & PCI_COMMAND_INTX_DISABLE) {
> > >          throughable_mask |= PCI_COMMAND_INTX_DISABLE;
> > > @@ -603,9 +581,9 @@ static XenPTRegInfo xen_pt_emu_reg_header0[] = {
> > >          .size       = 2,
> > >          .init_val   = 0x0000,
> > >          .ro_mask    = 0xF880,
> > > -        .emu_mask   = 0x0740,
> > > +        .emu_mask   = 0x0743,
> > >          .init       = xen_pt_common_reg_init,
> > > -        .u.w.read   = xen_pt_cmd_reg_read,
> > > +        .u.w.read   = xen_pt_word_reg_read,
> > >          .u.w.write  = xen_pt_cmd_reg_write,
> > >      },
> > >      /* Capabilities Pointer reg */
> >

Andrew Cooper April 1, 2015, 9:41 a.m. UTC | #4

On 01/04/15 10:20, Stefano Stabellini wrote:
> CC'ing the author of the patch and xen-devel.
> FYI I think that Jan is going to be on vacation for a couple of weeks.
>
> On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
>>> From: Jan Beulich <jbeulich@suse.com>
>>>
>>> Otherwise the guest can abuse that control to cause e.g. PCIe
>>> Unsupported Request responses (by disabling memory and/or I/O decoding
>>> and subsequently causing [CPU side] accesses to the respective address
>>> ranges), which (depending on system configuration) may be fatal to the
>>> host.
>>>
>>> This is CVE-2015-2756 / XSA-126.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>> The patch description seems somewhat incorrect to me.
>> UR should not be fatal to the system, and it's not platform
>> specific.
> I think that people have been able to repro this, but I'll let Jan
> comment on it.

Depending on how the BIOS sets up AER (if even available), a UR can very
easily be fatal to the system.

If firmware first mode is set, Xen (or indeed Linux) can't fix a
problematic setup.  Experimentally, doing so can cause infinite loops in
certain vendors SMM handlers.

>
>
>> In particular, there could be more reasons for devices
>> to generate URs, for example, if they get a transaction
>> during FLR. I don't think we ever tried to prevent this.
> That cannot be triggered by guest behaviour.

What cannot be triggered by guest behaviour?

Many devices have secondary access into config space via a BAR, which
allows a guest driver full and unmediated control of everything.

Under Xen, we have covered this with XSA-124 which basically says that
for such devices, all bets are off.

~Andrew

Ian Campbell April 1, 2015, 9:50 a.m. UTC | #5

On Wed, 2015-04-01 at 10:20 +0100, Stefano Stabellini wrote:
> CC'ing the author of the patch and xen-devel.

Adding Andy C who I think knows about this stuff.

> FYI I think that Jan is going to be on vacation for a couple of weeks.
> 
> On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> > On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> > > From: Jan Beulich <jbeulich@suse.com>
> > > 
> > > Otherwise the guest can abuse that control to cause e.g. PCIe
> > > Unsupported Request responses (by disabling memory and/or I/O decoding
> > > and subsequently causing [CPU side] accesses to the respective address
> > > ranges), which (depending on system configuration) may be fatal to the
> > > host.
> > > 
> > > This is CVE-2015-2756 / XSA-126.
> > > 
> > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > > Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> > 
> > The patch description seems somewhat incorrect to me.
> > UR should not be fatal to the system, and it's not platform
> > specific.
> 
> I think that people have been able to repro this, but I'll let Jan
> comment on it.

XSA-126 says:

        In the event that the platform surfaces aforementioned UR responses as
        Non-Maskable Interrupts, and either the OS is configured to treat NMIs
        as fatal or (e.g. via ACPI's APEI) the platform tells the OS to treat
        these errors as fatal, the host would crash, leading to a Denial of
        Service.

AIUI from the discussions on security@ (perhaps in the context of -120
rather than this) using ACPI APEI in this way is the norm. Andy?

> > Here's the description from pci express spec:
> > 
> > IMPLEMENTATION NOTE
> > Software UR Reporting Compatibility with 1.0a Devices
> > 
> > 	With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
> > 	when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
> > 	error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
> > 	this will break PC-compatible Configuration Space probing, so software/firmware on such
> > 	platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
> > 	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
> > 	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
> > 	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
> > 	Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
> > 	helping avoid this case for potential silent data corruption.
> > 	On platforms where robust error handling and PC-compatible Configuration Space probing is
> > 	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
> > 	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
> > 	bit in the Device Capabilities register.
> > 
> > 
> > What I think you have is a very old 1.0a system, and host set Unsupported
> > Request Reporting Enable.
> > 
> > The right thing is for host kernel

I think part of the problem is that its not the host kernel acting as
"Completer" here, but rather the host firmware, which we do not control.
I think Andy can probably confirm or deny this.

Ian.

Michael S. Tsirkin April 1, 2015, 9:59 a.m. UTC | #6

On Wed, Apr 01, 2015 at 10:41:12AM +0100, Andrew Cooper wrote:
> On 01/04/15 10:20, Stefano Stabellini wrote:
> > CC'ing the author of the patch and xen-devel.
> > FYI I think that Jan is going to be on vacation for a couple of weeks.
> >
> > On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> >> On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> >>> From: Jan Beulich <jbeulich@suse.com>
> >>>
> >>> Otherwise the guest can abuse that control to cause e.g. PCIe
> >>> Unsupported Request responses (by disabling memory and/or I/O decoding
> >>> and subsequently causing [CPU side] accesses to the respective address
> >>> ranges), which (depending on system configuration) may be fatal to the
> >>> host.
> >>>
> >>> This is CVE-2015-2756 / XSA-126.
> >>>
> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> >> The patch description seems somewhat incorrect to me.
> >> UR should not be fatal to the system, and it's not platform
> >> specific.
> > I think that people have been able to repro this, but I'll let Jan
> > comment on it.
> 
> Depending on how the BIOS sets up AER (if even available), a UR can very
> easily be fatal to the system.
> 
> If firmware first mode is set, Xen (or indeed Linux) can't fix a
> problematic setup.  Experimentally, doing so can cause infinite loops in
> certain vendors SMM handlers.

I think it can, just disable UR reporting, this is up to OS.  This is
what the PCI spec says - you have snipped the relevant part out from the
mail you are replying to.

> >
> >> In particular, there could be more reasons for devices
> >> to generate URs, for example, if they get a transaction
> >> during FLR. I don't think we ever tried to prevent this.
> > That cannot be triggered by guest behaviour.
> 
> What cannot be triggered by guest behaviour?
> 
> Many devices have secondary access into config space via a BAR, which
> allows a guest driver full and unmediated control of everything.
> 
> Under Xen, we have covered this with XSA-124 which basically says that
> for such devices, all bets are off.
> 
> ~Andrew

That one basically says don't pass through PFs if you want security :)
One has to wonder what's the point of XA-126 is then, it can only be
triggered with PFs.
I guess it'd apply to the theorectically possible "devices the entire
scope of whose functionality is known and has been reviewed for PCI
passthrough security and correctness". Do you know of such hardware
in real life?

Michael S. Tsirkin April 1, 2015, 10:12 a.m. UTC | #7

On Wed, Apr 01, 2015 at 10:50:45AM +0100, Ian Campbell wrote:
> On Wed, 2015-04-01 at 10:20 +0100, Stefano Stabellini wrote:
> > CC'ing the author of the patch and xen-devel.
> 
> Adding Andy C who I think knows about this stuff.
> 
> > FYI I think that Jan is going to be on vacation for a couple of weeks.
> > 
> > On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> > > On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> > > > From: Jan Beulich <jbeulich@suse.com>
> > > > 
> > > > Otherwise the guest can abuse that control to cause e.g. PCIe
> > > > Unsupported Request responses (by disabling memory and/or I/O decoding
> > > > and subsequently causing [CPU side] accesses to the respective address
> > > > ranges), which (depending on system configuration) may be fatal to the
> > > > host.
> > > > 
> > > > This is CVE-2015-2756 / XSA-126.
> > > > 
> > > > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > > > Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> > > 
> > > The patch description seems somewhat incorrect to me.
> > > UR should not be fatal to the system, and it's not platform
> > > specific.
> > 
> > I think that people have been able to repro this, but I'll let Jan
> > comment on it.
> 
> XSA-126 says:
> 
>         In the event that the platform surfaces aforementioned UR responses as
>         Non-Maskable Interrupts,
>	   and either the OS is configured to treat NMIs
>         as fatal or (e.g. via ACPI's APEI) the platform tells the OS to treat
>         these errors as fatal, the host would crash, leading to a Denial of
>         Service.
> 
> AIUI from the discussions on security@ (perhaps in the context of -120
> rather than this) using ACPI APEI in this way is the norm. Andy?

The implementation note at the bottom explains when this
happens I think: it's when you are using a 1.0a function
(which is pretty old - 1.1 went out 10 years ago)
and enable UR reporting in the AER register.

It sounds quite reasonable to detect such hardware and disable UR
reporting, even though APEI tells you AER is supported.

> 
> > > Here's the description from pci express spec:
> > > 
> > > IMPLEMENTATION NOTE
> > > Software UR Reporting Compatibility with 1.0a Devices
> > > 
> > > 	With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
> > > 	when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
> > > 	error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
> > > 	this will break PC-compatible Configuration Space probing, so software/firmware on such
> > > 	platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
> > > 	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
> > > 	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
> > > 	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
> > > 	Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
> > > 	helping avoid this case for potential silent data corruption.
> > > 	On platforms where robust error handling and PC-compatible Configuration Space probing is
> > > 	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
> > > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
> > > 	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
> > > 	bit in the Device Capabilities register.
> > > 
> > > 
> > > What I think you have is a very old 1.0a system, and host set Unsupported
> > > Request Reporting Enable.
> > > 
> > > The right thing is for host kernel
> 
> I think part of the problem is that its not the host kernel acting as
> "Completer" here, but rather the host firmware, which we do not control.
> I think Andy can probably confirm or deny this.
> 
> Ian.

No, the Completer is the assigned device.
Host firmware configures whether uncorrectable error Message generates
NMIs, but it's not the one that configures Unsupported Request Reporting
Enable - host OS does this.

Peter Maydell April 9, 2015, 6:10 p.m. UTC | #8

On 31 March 2015 at 15:18, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> From: Jan Beulich <jbeulich@suse.com>
>
> Otherwise the guest can abuse that control to cause e.g. PCIe
> Unsupported Request responses (by disabling memory and/or I/O decoding
> and subsequently causing [CPU side] accesses to the respective address
> ranges), which (depending on system configuration) may be fatal to the
> host.
>
> This is CVE-2015-2756 / XSA-126.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Oops, this one got lost. I'm going to commit it to qemu master tomorrow
(so it will go in -rc3), unless there are objections (I can't
really tell from the thread what the conclusion of the discussion
was).

-- PMM

Peter Maydell April 10, 2015, 11:45 a.m. UTC | #9

On 9 April 2015 at 19:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 31 March 2015 at 15:18, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>>
>> Otherwise the guest can abuse that control to cause e.g. PCIe
>> Unsupported Request responses (by disabling memory and/or I/O decoding
>> and subsequently causing [CPU side] accesses to the respective address
>> ranges), which (depending on system configuration) may be fatal to the
>> host.
>>
>> This is CVE-2015-2756 / XSA-126.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>
> Oops, this one got lost. I'm going to commit it to qemu master tomorrow
> (so it will go in -rc3), unless there are objections (I can't
> really tell from the thread what the conclusion of the discussion
> was).

Committed.

-- PMM

Peter Maydell April 10, 2015, 11:49 a.m. UTC | #10

On 10 April 2015 at 12:45, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 9 April 2015 at 19:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 31 March 2015 at 15:18, Stefano Stabellini
>> <stefano.stabellini@eu.citrix.com> wrote:
>>> From: Jan Beulich <jbeulich@suse.com>
>>>
>>> Otherwise the guest can abuse that control to cause e.g. PCIe
>>> Unsupported Request responses (by disabling memory and/or I/O decoding
>>> and subsequently causing [CPU side] accesses to the respective address
>>> ranges), which (depending on system configuration) may be fatal to the
>>> host.
>>>
>>> This is CVE-2015-2756 / XSA-126.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>>
>> Oops, this one got lost. I'm going to commit it to qemu master tomorrow
>> (so it will go in -rc3), unless there are objections (I can't
>> really tell from the thread what the conclusion of the discussion
>> was).
>
> Committed.

I forgot to add the CC:stable to the commit message even after
mentioning it on IRC, though :-( Sorry about that.

-- PMM

Jan Beulich April 13, 2015, 8:17 a.m. UTC | #11

>>> On 01.04.15 at 11:59, <mst@redhat.com> wrote:
> On Wed, Apr 01, 2015 at 10:41:12AM +0100, Andrew Cooper wrote:
>> On 01/04/15 10:20, Stefano Stabellini wrote:
>> > CC'ing the author of the patch and xen-devel.
>> > FYI I think that Jan is going to be on vacation for a couple of weeks.
>> >
>> > On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
>> >> On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
>> >>> From: Jan Beulich <jbeulich@suse.com>
>> >>>
>> >>> Otherwise the guest can abuse that control to cause e.g. PCIe
>> >>> Unsupported Request responses (by disabling memory and/or I/O decoding
>> >>> and subsequently causing [CPU side] accesses to the respective address
>> >>> ranges), which (depending on system configuration) may be fatal to the
>> >>> host.
>> >>>
>> >>> This is CVE-2015-2756 / XSA-126.
>> >>>
>> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> >>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>> >>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>> >> The patch description seems somewhat incorrect to me.
>> >> UR should not be fatal to the system, and it's not platform
>> >> specific.
>> > I think that people have been able to repro this, but I'll let Jan
>> > comment on it.
>> 
>> Depending on how the BIOS sets up AER (if even available), a UR can very
>> easily be fatal to the system.
>> 
>> If firmware first mode is set, Xen (or indeed Linux) can't fix a
>> problematic setup.  Experimentally, doing so can cause infinite loops in
>> certain vendors SMM handlers.
> 
> I think it can, just disable UR reporting, this is up to OS.  This is
> what the PCI spec says - you have snipped the relevant part out from the
> mail you are replying to.

As already said by Andrew, the OS must not do such when the
system is in APEI firmware first mode.

Jan

Michael S. Tsirkin April 13, 2015, 11:19 a.m. UTC | #12

On Mon, Apr 13, 2015 at 09:17:04AM +0100, Jan Beulich wrote:
> >>> On 01.04.15 at 11:59, <mst@redhat.com> wrote:
> > On Wed, Apr 01, 2015 at 10:41:12AM +0100, Andrew Cooper wrote:
> >> On 01/04/15 10:20, Stefano Stabellini wrote:
> >> > CC'ing the author of the patch and xen-devel.
> >> > FYI I think that Jan is going to be on vacation for a couple of weeks.
> >> >
> >> > On Wed, 1 Apr 2015, Michael S. Tsirkin wrote:
> >> >> On Tue, Mar 31, 2015 at 03:18:03PM +0100, Stefano Stabellini wrote:
> >> >>> From: Jan Beulich <jbeulich@suse.com>
> >> >>>
> >> >>> Otherwise the guest can abuse that control to cause e.g. PCIe
> >> >>> Unsupported Request responses (by disabling memory and/or I/O decoding
> >> >>> and subsequently causing [CPU side] accesses to the respective address
> >> >>> ranges), which (depending on system configuration) may be fatal to the
> >> >>> host.
> >> >>>
> >> >>> This is CVE-2015-2756 / XSA-126.
> >> >>>
> >> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> >>> Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >> >>> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> >> >> The patch description seems somewhat incorrect to me.
> >> >> UR should not be fatal to the system, and it's not platform
> >> >> specific.
> >> > I think that people have been able to repro this, but I'll let Jan
> >> > comment on it.
> >> 
> >> Depending on how the BIOS sets up AER (if even available), a UR can very
> >> easily be fatal to the system.
> >> 
> >> If firmware first mode is set, Xen (or indeed Linux) can't fix a
> >> problematic setup.  Experimentally, doing so can cause infinite loops in
> >> certain vendors SMM handlers.
> > 
> > I think it can, just disable UR reporting, this is up to OS.  This is
> > what the PCI spec says - you have snipped the relevant part out from the
> > mail you are replying to.
> 
> As already said by Andrew, the OS must not do such when the
> system is in APEI firmware first mode.
> 
> Jan


Yes Linux can't fix firmware 1st mode, but
PCI express spec says what firmware should do in this case:

IMPLEMENTATION NOTE Software UR Reporting Compatibility with 1.0a Devices

        With 1.0a device Functions, 96 if the Unsupported Request Reporting Enable bit is set, the Function
        when operating as a Completer will send an uncorrectable error Message (if enabled) when a UR
        error is detected. On platforms where an uncorrectable error Message is handled as a System Error,
        this will break PC-compatible Configuration Space probing, so software/firmware on such
        platforms may need to avoid setting the Unsupported Request Reporting Enable bit.
        With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
        Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
        that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
        Reporting Enable bit will enable the Function to report UR errors detected with posted Requests,
        helping avoid this case for potential silent data corruption.
        On platforms where robust error handling and PC-compatible Configuration Space probing is
        required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
        bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
        firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
        bit in the Device Capabilities register.


What I think you have is a very old 1.0a system, and you set Unsupported
Request Reporting Enable.

Can you confirm?

If you have access to the system in question, I can provide
a test script to detect this.

You will have other problems if your firmware doesn't follow the spec. So how
about either

- Don't use firmware 1st mode with pci express
  (Seems no reason to do firmware 1st for PCIE, architecture is completely
   standard. I saw mentions of using combined/parallel mode, using AER for some
   devices but not others, but I don't know how this is supposed to be enabled.
   Any idea?)

or

- ask your vendor to update firmware if it doesn't do the right thing

Jan Beulich April 13, 2015, 11:34 a.m. UTC | #13

>>> On 13.04.15 at 13:19, <mst@redhat.com> wrote:
> Yes Linux can't fix firmware 1st mode, but
> PCI express spec says what firmware should do in this case:
> 
> IMPLEMENTATION NOTE Software UR Reporting Compatibility with 1.0a Devices
> 
>         With 1.0a device Functions, 96 if the Unsupported Request Reporting 
> Enable bit is set, the Function
>         when operating as a Completer will send an uncorrectable error 
> Message (if enabled) when a UR
>         error is detected. On platforms where an uncorrectable error Message 
> is handled as a System Error,
>         this will break PC-compatible Configuration Space probing, so 
> software/firmware on such
>         platforms may need to avoid setting the Unsupported Request 
> Reporting Enable bit.
>         With device Functions implementing Role-Based Error Reporting, 
> setting the Unsupported Request
>         Reporting Enable bit will not interfere with PC-compatible 
> Configuration Space probing, assuming
>         that the severity for UR is left at its default of non-fatal. 
> However, setting the Unsupported Request
>         Reporting Enable bit will enable the Function to report UR errors 
> detected with posted Requests,
>         helping avoid this case for potential silent data corruption.
>         On platforms where robust error handling and PC-compatible 
> Configuration Space probing is
>         required, it is suggested that software or firmware have the 
> Unsupported Request Reporting Enable
>         bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> Functions. Software or
>         firmware can distinguish the two classes of Functions by examining 
> the Role-Based Error Reporting
>         bit in the Device Capabilities register.
> 
> 
> What I think you have is a very old 1.0a system, and you set Unsupported
> Request Reporting Enable.
> 
> Can you confirm?

No. In at least one of the two cases we got reports of the original
problem, triggering the finding of this issue, this is a brand new one,
only soon to become available publicly. Furthermore I'm being
confused by the mention of PC-compatible config space probing
above: The URs we talk about here don't result from config space
accessed at all.

> You will have other problems if your firmware doesn't follow the spec. So 
> how about either
> 
> - Don't use firmware 1st mode with pci express
>   (Seems no reason to do firmware 1st for PCIE, architecture is completely
>    standard. I saw mentions of using combined/parallel mode, using AER for 
> some
>    devices but not others, but I don't know how this is supposed to be 
> enabled.
>    Any idea?)
> 
> or
> 
> - ask your vendor to update firmware if it doesn't do the right thing

Both not very practical suggestions, based on experience.

Jan

Michael S. Tsirkin April 13, 2015, 11:47 a.m. UTC | #14

On Mon, Apr 13, 2015 at 12:34:34PM +0100, Jan Beulich wrote:
> >>> On 13.04.15 at 13:19, <mst@redhat.com> wrote:
> > Yes Linux can't fix firmware 1st mode, but
> > PCI express spec says what firmware should do in this case:
> > 
> > IMPLEMENTATION NOTE Software UR Reporting Compatibility with 1.0a Devices
> > 
> >         With 1.0a device Functions, 96 if the Unsupported Request Reporting 
> > Enable bit is set, the Function
> >         when operating as a Completer will send an uncorrectable error 
> > Message (if enabled) when a UR
> >         error is detected. On platforms where an uncorrectable error Message 
> > is handled as a System Error,
> >         this will break PC-compatible Configuration Space probing, so 
> > software/firmware on such
> >         platforms may need to avoid setting the Unsupported Request 
> > Reporting Enable bit.
> >         With device Functions implementing Role-Based Error Reporting, 
> > setting the Unsupported Request
> >         Reporting Enable bit will not interfere with PC-compatible 
> > Configuration Space probing, assuming
> >         that the severity for UR is left at its default of non-fatal. 
> > However, setting the Unsupported Request
> >         Reporting Enable bit will enable the Function to report UR errors 
> > detected with posted Requests,
> >         helping avoid this case for potential silent data corruption.
> >         On platforms where robust error handling and PC-compatible 
> > Configuration Space probing is
> >         required, it is suggested that software or firmware have the 
> > Unsupported Request Reporting Enable
> >         bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> > Functions. Software or
> >         firmware can distinguish the two classes of Functions by examining 
> > the Role-Based Error Reporting
> >         bit in the Device Capabilities register.
> > 
> > 
> > What I think you have is a very old 1.0a system, and you set Unsupported
> > Request Reporting Enable.
> > 
> > Can you confirm?
> 
> No. In at least one of the two cases we got reports of the original
> problem, triggering the finding of this issue, this is a brand new one,
> only soon to become available publicly. Furthermore I'm being
> confused by the mention of PC-compatible config space probing
> above: The URs we talk about here don't result from config space
> accessed at all.

OK. Can you please explain why does UR cause a system error then?
It looks like a hardware bug: PCIE 1.1 seems to say it shouldn't.

> > You will have other problems if your firmware doesn't follow the spec. So 
> > how about either
> > 
> > - Don't use firmware 1st mode with pci express
> >   (Seems no reason to do firmware 1st for PCIE, architecture is completely
> >    standard. I saw mentions of using combined/parallel mode, using AER for 
> > some
> >    devices but not others, but I don't know how this is supposed to be 
> > enabled.
> >    Any idea?)
> > 
> > or
> > 
> > - ask your vendor to update firmware if it doesn't do the right thing
> 
> Both not very practical suggestions, based on experience.
> 
> Jan

Well using OS native mode is definitely practical, the question
is how to detect the problematic configurations.

There's always XSA-124 which says buggy hardware can cause
security problems.

Jan Beulich April 13, 2015, 12:40 p.m. UTC | #15

>>> On 13.04.15 at 13:47, <mst@redhat.com> wrote:
> On Mon, Apr 13, 2015 at 12:34:34PM +0100, Jan Beulich wrote:
>> >>> On 13.04.15 at 13:19, <mst@redhat.com> wrote:
>> > Yes Linux can't fix firmware 1st mode, but
>> > PCI express spec says what firmware should do in this case:
>> > 
>> > IMPLEMENTATION NOTE Software UR Reporting Compatibility with 1.0a Devices
>> > 
>> >         With 1.0a device Functions, 96 if the Unsupported Request Reporting 
> 
>> > Enable bit is set, the Function
>> >         when operating as a Completer will send an uncorrectable error 
>> > Message (if enabled) when a UR
>> >         error is detected. On platforms where an uncorrectable error 
> Message 
>> > is handled as a System Error,
>> >         this will break PC-compatible Configuration Space probing, so 
>> > software/firmware on such
>> >         platforms may need to avoid setting the Unsupported Request 
>> > Reporting Enable bit.
>> >         With device Functions implementing Role-Based Error Reporting, 
>> > setting the Unsupported Request
>> >         Reporting Enable bit will not interfere with PC-compatible 
>> > Configuration Space probing, assuming
>> >         that the severity for UR is left at its default of non-fatal. 
>> > However, setting the Unsupported Request
>> >         Reporting Enable bit will enable the Function to report UR errors 
>> > detected with posted Requests,
>> >         helping avoid this case for potential silent data corruption.
>> >         On platforms where robust error handling and PC-compatible 
>> > Configuration Space probing is
>> >         required, it is suggested that software or firmware have the 
>> > Unsupported Request Reporting Enable
>> >         bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> 
>> > Functions. Software or
>> >         firmware can distinguish the two classes of Functions by examining 
>> > the Role-Based Error Reporting
>> >         bit in the Device Capabilities register.
>> > 
>> > 
>> > What I think you have is a very old 1.0a system, and you set Unsupported
>> > Request Reporting Enable.
>> > 
>> > Can you confirm?
>> 
>> No. In at least one of the two cases we got reports of the original
>> problem, triggering the finding of this issue, this is a brand new one,
>> only soon to become available publicly. Furthermore I'm being
>> confused by the mention of PC-compatible config space probing
>> above: The URs we talk about here don't result from config space
>> accessed at all.
> 
> OK. Can you please explain why does UR cause a system error then?
> It looks like a hardware bug: PCIE 1.1 seems to say it shouldn't.

Quite possible. Looking at the ITP log we were provided, the UR
severity bit is clear (non-fatal), yet the error got surfaced to the
OS as a fatal one (I would guess because it validly gets flagged as
uncorrectable at the same time).

Jan

Michael S. Tsirkin April 13, 2015, 12:47 p.m. UTC | #16

On Mon, Apr 13, 2015 at 01:40:59PM +0100, Jan Beulich wrote:
> >>> On 13.04.15 at 13:47, <mst@redhat.com> wrote:
> > On Mon, Apr 13, 2015 at 12:34:34PM +0100, Jan Beulich wrote:
> >> >>> On 13.04.15 at 13:19, <mst@redhat.com> wrote:
> >> > Yes Linux can't fix firmware 1st mode, but
> >> > PCI express spec says what firmware should do in this case:
> >> > 
> >> > IMPLEMENTATION NOTE Software UR Reporting Compatibility with 1.0a Devices
> >> > 
> >> >         With 1.0a device Functions, 96 if the Unsupported Request Reporting 
> > 
> >> > Enable bit is set, the Function
> >> >         when operating as a Completer will send an uncorrectable error 
> >> > Message (if enabled) when a UR
> >> >         error is detected. On platforms where an uncorrectable error 
> > Message 
> >> > is handled as a System Error,
> >> >         this will break PC-compatible Configuration Space probing, so 
> >> > software/firmware on such
> >> >         platforms may need to avoid setting the Unsupported Request 
> >> > Reporting Enable bit.
> >> >         With device Functions implementing Role-Based Error Reporting, 
> >> > setting the Unsupported Request
> >> >         Reporting Enable bit will not interfere with PC-compatible 
> >> > Configuration Space probing, assuming
> >> >         that the severity for UR is left at its default of non-fatal. 
> >> > However, setting the Unsupported Request
> >> >         Reporting Enable bit will enable the Function to report UR errors 
> >> > detected with posted Requests,
> >> >         helping avoid this case for potential silent data corruption.
> >> >         On platforms where robust error handling and PC-compatible 
> >> > Configuration Space probing is
> >> >         required, it is suggested that software or firmware have the 
> >> > Unsupported Request Reporting Enable
> >> >         bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> > 
> >> > Functions. Software or
> >> >         firmware can distinguish the two classes of Functions by examining 
> >> > the Role-Based Error Reporting
> >> >         bit in the Device Capabilities register.
> >> > 
> >> > 
> >> > What I think you have is a very old 1.0a system, and you set Unsupported
> >> > Request Reporting Enable.
> >> > 
> >> > Can you confirm?
> >> 
> >> No. In at least one of the two cases we got reports of the original
> >> problem, triggering the finding of this issue, this is a brand new one,
> >> only soon to become available publicly. Furthermore I'm being
> >> confused by the mention of PC-compatible config space probing
> >> above: The URs we talk about here don't result from config space
> >> accessed at all.
> > 
> > OK. Can you please explain why does UR cause a system error then?
> > It looks like a hardware bug: PCIE 1.1 seems to say it shouldn't.
> 
> Quite possible. Looking at the ITP log we were provided, the UR
> severity bit is clear (non-fatal), yet the error got surfaced to the
> OS as a fatal one (I would guess because it validly gets flagged as
> uncorrectable at the same time).
> 
> Jan

No, that's not valid.
Can you check device capabilities register, offset 0x4 within
pci express capability structure?
Bit 15 is 15 Role-Based Error Reporting.
Is it set?

The spec says:

	15
	On platforms where robust error handling and PC-compatible Configuration Space probing is
	required, it is suggested that software or firmware have the Unsupported Request Reporting Enable
	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or
	firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting
	bit in the Device Capabilities register.

Jan Beulich April 13, 2015, 12:51 p.m. UTC | #17

>>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> On Mon, Apr 13, 2015 at 01:40:59PM +0100, Jan Beulich wrote:
>> Quite possible. Looking at the ITP log we were provided, the UR
>> severity bit is clear (non-fatal), yet the error got surfaced to the
>> OS as a fatal one (I would guess because it validly gets flagged as
>> uncorrectable at the same time).
> 
> No, that's not valid.
> Can you check device capabilities register, offset 0x4 within
> pci express capability structure?
> Bit 15 is 15 Role-Based Error Reporting.
> Is it set?
> 
> The spec says:
> 
> 	15
> 	On platforms where robust error handling and PC-compatible Configuration 
> Space probing is
> 	required, it is suggested that software or firmware have the Unsupported 
> Request Reporting Enable
> 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> Functions. Software or
> 	firmware can distinguish the two classes of Functions by examining the 
> Role-Based Error Reporting
> 	bit in the Device Capabilities register.

Yes, that bit is set.

Jan

Michael S. Tsirkin April 20, 2015, 1:43 p.m. UTC | #18

On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> > On Mon, Apr 13, 2015 at 01:40:59PM +0100, Jan Beulich wrote:
> >> Quite possible. Looking at the ITP log we were provided, the UR
> >> severity bit is clear (non-fatal), yet the error got surfaced to the
> >> OS as a fatal one (I would guess because it validly gets flagged as
> >> uncorrectable at the same time).
> > 
> > No, that's not valid.
> > Can you check device capabilities register, offset 0x4 within
> > pci express capability structure?
> > Bit 15 is 15 Role-Based Error Reporting.
> > Is it set?
> > 
> > The spec says:
> > 
> > 	15
> > 	On platforms where robust error handling and PC-compatible Configuration 
> > Space probing is
> > 	required, it is suggested that software or firmware have the Unsupported 
> > Request Reporting Enable
> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> > Functions. Software or
> > 	firmware can distinguish the two classes of Functions by examining the 
> > Role-Based Error Reporting
> > 	bit in the Device Capabilities register.
> 
> Yes, that bit is set.
> 
> Jan


curiouser and curiouser.

So with functions that do support Role-Based Error Reporting, we have
this:


	With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request
	Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming
	that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request
	Reporting Enable bit will enable the Function to report UR errors 97 detected with posted Requests,
	helping avoid this case for potential silent data corruption.

did firmware reconfigure this device to report URs as fatal errors then?

Jan Beulich April 20, 2015, 2:08 p.m. UTC | #19

>>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
> On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
>> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
>> > Can you check device capabilities register, offset 0x4 within
>> > pci express capability structure?
>> > Bit 15 is 15 Role-Based Error Reporting.
>> > Is it set?
>> > 
>> > The spec says:
>> > 
>> > 	15
>> > 	On platforms where robust error handling and PC-compatible Configuration 
>> > Space probing is
>> > 	required, it is suggested that software or firmware have the Unsupported 
>> > Request Reporting Enable
>> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
>> > Functions. Software or
>> > 	firmware can distinguish the two classes of Functions by examining the 
>> > Role-Based Error Reporting
>> > 	bit in the Device Capabilities register.
>> 
>> Yes, that bit is set.
> 
> curiouser and curiouser.
> 
> So with functions that do support Role-Based Error Reporting, we have
> this:
> 
> 
> 	With device Functions implementing Role-Based Error Reporting, setting the 
> Unsupported Request
> 	Reporting Enable bit will not interfere with PC-compatible Configuration 
> Space probing, assuming
> 	that the severity for UR is left at its default of non-fatal. However, 
> setting the Unsupported Request
> 	Reporting Enable bit will enable the Function to report UR errors 97 
> detected with posted Requests,
> 	helping avoid this case for potential silent data corruption.

I still don't see what the PC-compatible config space probing has to
do with our issue.

> did firmware reconfigure this device to report URs as fatal errors then?

No, the Unsupported Request Error Serverity flag is zero.

Jan

Michael S. Tsirkin April 20, 2015, 2:32 p.m. UTC | #20

On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
> >>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
> > On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> >> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> >> > Can you check device capabilities register, offset 0x4 within
> >> > pci express capability structure?
> >> > Bit 15 is 15 Role-Based Error Reporting.
> >> > Is it set?
> >> > 
> >> > The spec says:
> >> > 
> >> > 	15
> >> > 	On platforms where robust error handling and PC-compatible Configuration 
> >> > Space probing is
> >> > 	required, it is suggested that software or firmware have the Unsupported 
> >> > Request Reporting Enable
> >> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> >> > Functions. Software or
> >> > 	firmware can distinguish the two classes of Functions by examining the 
> >> > Role-Based Error Reporting
> >> > 	bit in the Device Capabilities register.
> >> 
> >> Yes, that bit is set.
> > 
> > curiouser and curiouser.
> > 
> > So with functions that do support Role-Based Error Reporting, we have
> > this:
> > 
> > 
> > 	With device Functions implementing Role-Based Error Reporting, setting the 
> > Unsupported Request
> > 	Reporting Enable bit will not interfere with PC-compatible Configuration 
> > Space probing, assuming
> > 	that the severity for UR is left at its default of non-fatal. However, 
> > setting the Unsupported Request
> > 	Reporting Enable bit will enable the Function to report UR errors 97 
> > detected with posted Requests,
> > 	helping avoid this case for potential silent data corruption.
> 
> I still don't see what the PC-compatible config space probing has to
> do with our issue.

I'm not sure but I think it's listed here because it causes a ton of URs
when device scan probes unimplemented functions.

> > did firmware reconfigure this device to report URs as fatal errors then?
> 
> No, the Unsupported Request Error Serverity flag is zero.
> 
> Jan

OK, that's the correct configuration, so how come the box crashes when
there's a UR then?

Jan Beulich April 20, 2015, 2:57 p.m. UTC | #21

>>> On 20.04.15 at 16:32, <mst@redhat.com> wrote:
> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>> >>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
>> > did firmware reconfigure this device to report URs as fatal errors then?
>> 
>> No, the Unsupported Request Error Serverity flag is zero.
> 
> OK, that's the correct configuration, so how come the box crashes when
> there's a UR then?

Apparently something marks it fatal in the process of delivering it
through APEI. I'll raise the question with the reporters.

Jan

Michael S. Tsirkin June 7, 2015, 6:23 a.m. UTC | #22

On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
> > >>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
> > > On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> > >> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> > >> > Can you check device capabilities register, offset 0x4 within
> > >> > pci express capability structure?
> > >> > Bit 15 is 15 Role-Based Error Reporting.
> > >> > Is it set?
> > >> > 
> > >> > The spec says:
> > >> > 
> > >> > 	15
> > >> > 	On platforms where robust error handling and PC-compatible Configuration 
> > >> > Space probing is
> > >> > 	required, it is suggested that software or firmware have the Unsupported 
> > >> > Request Reporting Enable
> > >> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> > >> > Functions. Software or
> > >> > 	firmware can distinguish the two classes of Functions by examining the 
> > >> > Role-Based Error Reporting
> > >> > 	bit in the Device Capabilities register.
> > >> 
> > >> Yes, that bit is set.
> > > 
> > > curiouser and curiouser.
> > > 
> > > So with functions that do support Role-Based Error Reporting, we have
> > > this:
> > > 
> > > 
> > > 	With device Functions implementing Role-Based Error Reporting, setting the 
> > > Unsupported Request
> > > 	Reporting Enable bit will not interfere with PC-compatible Configuration 
> > > Space probing, assuming
> > > 	that the severity for UR is left at its default of non-fatal. However, 
> > > setting the Unsupported Request
> > > 	Reporting Enable bit will enable the Function to report UR errors 97 
> > > detected with posted Requests,
> > > 	helping avoid this case for potential silent data corruption.
> > 
> > I still don't see what the PC-compatible config space probing has to
> > do with our issue.
> 
> I'm not sure but I think it's listed here because it causes a ton of URs
> when device scan probes unimplemented functions.
> 
> > > did firmware reconfigure this device to report URs as fatal errors then?
> > 
> > No, the Unsupported Request Error Serverity flag is zero.
> > 
> > Jan
> 
> OK, that's the correct configuration, so how come the box crashes when
> there's a UR then?

Ping - any update on this?
Do we can chalk this up to hardware bugs on a specific box?

> -- 
> MST

Jan Beulich June 8, 2015, 7:42 a.m. UTC | #23

>>> On 07.06.15 at 08:23, <mst@redhat.com> wrote:
> On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
>> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>> > >>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
>> > > On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
>> > >> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
>> > >> > Can you check device capabilities register, offset 0x4 within
>> > >> > pci express capability structure?
>> > >> > Bit 15 is 15 Role-Based Error Reporting.
>> > >> > Is it set?
>> > >> > 
>> > >> > The spec says:
>> > >> > 
>> > >> > 	15
>> > >> > 	On platforms where robust error handling and PC-compatible Configuration 
>> > >> > Space probing is
>> > >> > 	required, it is suggested that software or firmware have the Unsupported 
>> > >> > Request Reporting Enable
>> > >> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
>> > >> > Functions. Software or
>> > >> > 	firmware can distinguish the two classes of Functions by examining the 
>> > >> > Role-Based Error Reporting
>> > >> > 	bit in the Device Capabilities register.
>> > >> 
>> > >> Yes, that bit is set.
>> > > 
>> > > curiouser and curiouser.
>> > > 
>> > > So with functions that do support Role-Based Error Reporting, we have
>> > > this:
>> > > 
>> > > 
>> > > 	With device Functions implementing Role-Based Error Reporting, setting the 
>> > > Unsupported Request
>> > > 	Reporting Enable bit will not interfere with PC-compatible Configuration 
>> > > Space probing, assuming
>> > > 	that the severity for UR is left at its default of non-fatal. However, 
>> > > setting the Unsupported Request
>> > > 	Reporting Enable bit will enable the Function to report UR errors 97 
>> > > detected with posted Requests,
>> > > 	helping avoid this case for potential silent data corruption.
>> > 
>> > I still don't see what the PC-compatible config space probing has to
>> > do with our issue.
>> 
>> I'm not sure but I think it's listed here because it causes a ton of URs
>> when device scan probes unimplemented functions.
>> 
>> > > did firmware reconfigure this device to report URs as fatal errors then?
>> > 
>> > No, the Unsupported Request Error Serverity flag is zero.
>> 
>> OK, that's the correct configuration, so how come the box crashes when
>> there's a UR then?
> 
> Ping - any update on this?

Not really. All we concluded so far is that _maybe_ the bridge, upon
seeing the UR, generates a Master Abort, rendering the whole thing
fatal. Otoh the respective root port also has
- Received Master Abort set in its Secondary Status register (but
  that's also already the case in the log that we have before the UR
  occurs, i.e. that doesn't mean all that much),
- Received System Error set in its Secondary Status register (and
  after the UR the sibling endpoint [UR originating from 83:00.0,
  sibling being 83:00.1] also shows Signaled System Error set).

> Do we can chalk this up to hardware bugs on a specific box?

I have to admit that I'm still very uncertain whether to consider all
this correct behavior, a firmware flaw, or a hardware bug.

Jan

Malcolm Crossley June 8, 2015, 8:09 a.m. UTC | #24

On 08/06/15 08:42, Jan Beulich wrote:
>>>> On 07.06.15 at 08:23, <mst@redhat.com> wrote:
>> On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
>>> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>>>>>>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
>>>>> On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
>>>>>>>>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
>>>>>>> Can you check device capabilities register, offset 0x4 within
>>>>>>> pci express capability structure?
>>>>>>> Bit 15 is 15 Role-Based Error Reporting.
>>>>>>> Is it set?
>>>>>>>
>>>>>>> The spec says:
>>>>>>>
>>>>>>> 	15
>>>>>>> 	On platforms where robust error handling and PC-compatible Configuration 
>>>>>>> Space probing is
>>>>>>> 	required, it is suggested that software or firmware have the Unsupported 
>>>>>>> Request Reporting Enable
>>>>>>> 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
>>>>>>> Functions. Software or
>>>>>>> 	firmware can distinguish the two classes of Functions by examining the 
>>>>>>> Role-Based Error Reporting
>>>>>>> 	bit in the Device Capabilities register.
>>>>>>
>>>>>> Yes, that bit is set.
>>>>>
>>>>> curiouser and curiouser.
>>>>>
>>>>> So with functions that do support Role-Based Error Reporting, we have
>>>>> this:
>>>>>
>>>>>
>>>>> 	With device Functions implementing Role-Based Error Reporting, setting the 
>>>>> Unsupported Request
>>>>> 	Reporting Enable bit will not interfere with PC-compatible Configuration 
>>>>> Space probing, assuming
>>>>> 	that the severity for UR is left at its default of non-fatal. However, 
>>>>> setting the Unsupported Request
>>>>> 	Reporting Enable bit will enable the Function to report UR errors 97 
>>>>> detected with posted Requests,
>>>>> 	helping avoid this case for potential silent data corruption.
>>>>
>>>> I still don't see what the PC-compatible config space probing has to
>>>> do with our issue.
>>>
>>> I'm not sure but I think it's listed here because it causes a ton of URs
>>> when device scan probes unimplemented functions.
>>>
>>>>> did firmware reconfigure this device to report URs as fatal errors then?
>>>>
>>>> No, the Unsupported Request Error Serverity flag is zero.
>>>
>>> OK, that's the correct configuration, so how come the box crashes when
>>> there's a UR then?
>>
>> Ping - any update on this?
> 
> Not really. All we concluded so far is that _maybe_ the bridge, upon
> seeing the UR, generates a Master Abort, rendering the whole thing
> fatal. Otoh the respective root port also has
> - Received Master Abort set in its Secondary Status register (but
>   that's also already the case in the log that we have before the UR
>   occurs, i.e. that doesn't mean all that much),
> - Received System Error set in its Secondary Status register (and
>   after the UR the sibling endpoint [UR originating from 83:00.0,
>   sibling being 83:00.1] also shows Signaled System Error set).
> 

Disabling the Memory decode in the command register could also result in a completion timeout on the
root port issuing a transaction towards the PCI device in question. PCIE completion timeouts can be
escalated to Fatal AER errors which trigger system firmware to inject NMI's into the host.

Unsupported requests can also be escalated to be Fatal AER errors (which would again trigger system
firmware to inject an NMI).

Here is an example AER setup for a PCIE root port. You can see UnsupReq errors are masked and so do
not trigger errors. CmpltTO ( completion timeout) errors are not masked and the errors are treated
as Fatal because the corresponding bit in the Uncorrectable Severity register is set.

Capabilities: [148 v1] Advanced Error Reporting
UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

A root port completion timeout will also result in the master abort bit being set.

Typically system firmware clears the error in the AER registers after it's processed it. So the
operating system may not be able to determine what error triggered the NMI in the first place.


>> Do we can chalk this up to hardware bugs on a specific box?
> 
> I have to admit that I'm still very uncertain whether to consider all
> this correct behavior, a firmware flaw, or a hardware bug.
I believe the correct behaviour is happening but a PCIE completion timeout is occurring instead of a
unsupported request.

Malcolm

> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>

Michael S. Tsirkin June 8, 2015, 8:59 a.m. UTC | #25

On Mon, Jun 08, 2015 at 09:09:15AM +0100, Malcolm Crossley wrote:
> On 08/06/15 08:42, Jan Beulich wrote:
> >>>> On 07.06.15 at 08:23, <mst@redhat.com> wrote:
> >> On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
> >>> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
> >>>>>>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
> >>>>> On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> >>>>>>>>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> >>>>>>> Can you check device capabilities register, offset 0x4 within
> >>>>>>> pci express capability structure?
> >>>>>>> Bit 15 is 15 Role-Based Error Reporting.
> >>>>>>> Is it set?
> >>>>>>>
> >>>>>>> The spec says:
> >>>>>>>
> >>>>>>> 	15
> >>>>>>> 	On platforms where robust error handling and PC-compatible Configuration 
> >>>>>>> Space probing is
> >>>>>>> 	required, it is suggested that software or firmware have the Unsupported 
> >>>>>>> Request Reporting Enable
> >>>>>>> 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> >>>>>>> Functions. Software or
> >>>>>>> 	firmware can distinguish the two classes of Functions by examining the 
> >>>>>>> Role-Based Error Reporting
> >>>>>>> 	bit in the Device Capabilities register.
> >>>>>>
> >>>>>> Yes, that bit is set.
> >>>>>
> >>>>> curiouser and curiouser.
> >>>>>
> >>>>> So with functions that do support Role-Based Error Reporting, we have
> >>>>> this:
> >>>>>
> >>>>>
> >>>>> 	With device Functions implementing Role-Based Error Reporting, setting the 
> >>>>> Unsupported Request
> >>>>> 	Reporting Enable bit will not interfere with PC-compatible Configuration 
> >>>>> Space probing, assuming
> >>>>> 	that the severity for UR is left at its default of non-fatal. However, 
> >>>>> setting the Unsupported Request
> >>>>> 	Reporting Enable bit will enable the Function to report UR errors 97 
> >>>>> detected with posted Requests,
> >>>>> 	helping avoid this case for potential silent data corruption.
> >>>>
> >>>> I still don't see what the PC-compatible config space probing has to
> >>>> do with our issue.
> >>>
> >>> I'm not sure but I think it's listed here because it causes a ton of URs
> >>> when device scan probes unimplemented functions.
> >>>
> >>>>> did firmware reconfigure this device to report URs as fatal errors then?
> >>>>
> >>>> No, the Unsupported Request Error Serverity flag is zero.
> >>>
> >>> OK, that's the correct configuration, so how come the box crashes when
> >>> there's a UR then?
> >>
> >> Ping - any update on this?
> > 
> > Not really. All we concluded so far is that _maybe_ the bridge, upon
> > seeing the UR, generates a Master Abort, rendering the whole thing
> > fatal. Otoh the respective root port also has
> > - Received Master Abort set in its Secondary Status register (but
> >   that's also already the case in the log that we have before the UR
> >   occurs, i.e. that doesn't mean all that much),
> > - Received System Error set in its Secondary Status register (and
> >   after the UR the sibling endpoint [UR originating from 83:00.0,
> >   sibling being 83:00.1] also shows Signaled System Error set).
> > 
> 
> Disabling the Memory decode in the command register could also result in a completion timeout on the
> root port issuing a transaction towards the PCI device in question.

Can it really? Such device would violate the PCIE spec, which says:

	If the request is not claimed, then it is handled as an
	Unsupported Request, which is the
	PCI Express equivalent of conventional PCI’s Master Abort termination.




> PCIE completion timeouts can be
> escalated to Fatal AER errors which trigger system firmware to inject NMI's into the host.
> 
> Unsupported requests can also be escalated to be Fatal AER errors (which would again trigger system
> firmware to inject an NMI).

Only if the system is misconfigured. We found out the system in question
is not configured to do this.


> Here is an example AER setup for a PCIE root port. You can see UnsupReq errors are masked and so do
> not trigger errors. CmpltTO ( completion timeout) errors are not masked and the errors are treated
> as Fatal because the corresponding bit in the Uncorrectable Severity register is set.
> 
> Capabilities: [148 v1] Advanced Error Reporting
> UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
> UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
> AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> 
> A root port completion timeout will also result in the master abort bit being set.

How do you figure this one out? The spec I have says master abort is the
equivalent of UR.

> Typically system firmware clears the error in the AER registers after it's processed it. So the
> operating system may not be able to determine what error triggered the NMI in the first place.

At least for debugging, just disable firmware and handle everything in
software.

> >> Do we can chalk this up to hardware bugs on a specific box?
> > 
> > I have to admit that I'm still very uncertain whether to consider all
> > this correct behavior, a firmware flaw, or a hardware bug.
> I believe the correct behaviour is happening but a PCIE completion timeout is occurring instead of a
> unsupported request.
> 
> Malcolm

This guess would be easy to check - just mask out the timeout bit.



> 
> > 
> > Jan
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> >

Jan Beulich June 8, 2015, 9:03 a.m. UTC | #26

>>> On 08.06.15 at 10:09, <malcolm.crossley@citrix.com> wrote:
> On 08/06/15 08:42, Jan Beulich wrote:
>> Not really. All we concluded so far is that _maybe_ the bridge, upon
>> seeing the UR, generates a Master Abort, rendering the whole thing
>> fatal. Otoh the respective root port also has
>> - Received Master Abort set in its Secondary Status register (but
>>   that's also already the case in the log that we have before the UR
>>   occurs, i.e. that doesn't mean all that much),
>> - Received System Error set in its Secondary Status register (and
>>   after the UR the sibling endpoint [UR originating from 83:00.0,
>>   sibling being 83:00.1] also shows Signaled System Error set).
>> 
> 
> Disabling the Memory decode in the command register could also result in a 
> completion timeout on the
> root port issuing a transaction towards the PCI device in question. PCIE 
> completion timeouts can be
> escalated to Fatal AER errors which trigger system firmware to inject NMI's 
> into the host.

And how does all that play with PC compatibility (where writes into
no-where get dropped, and reads from no-where get all ones
returned)? Remember - we#re talking about CPU side accesses
here.

> Here is an example AER setup for a PCIE root port. You can see UnsupReq 
> errors are masked and so do
> not trigger errors. CmpltTO ( completion timeout) errors are not masked and 
> the errors are treated
> as Fatal because the corresponding bit in the Uncorrectable Severity 
> register is set.
> 
> Capabilities: [148 v1] Advanced Error Reporting
> UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- 
> ACSViol-
> UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- 
> UnsupReq+ ACSViol+
> UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- 
> UnsupReq- ACSViol-
> CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
> AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> 
> A root port completion timeout will also result in the master abort bit 
> being set.
> 
> Typically system firmware clears the error in the AER registers after it's 
> processed it. So the
> operating system may not be able to determine what error triggered the NMI 
> in the first place.

Right, but in the case at hand we have an ITP log available, which
increases the hope that we see a reasonably complete picture.

>>> Do we can chalk this up to hardware bugs on a specific box?
>> 
>> I have to admit that I'm still very uncertain whether to consider all
>> this correct behavior, a firmware flaw, or a hardware bug.
> I believe the correct behaviour is happening but a PCIE completion timeout 
> is occurring instead of a
> unsupported request.

Might it be that with the supposedly correct device returning UR
the root port reissues the request to the sibling device, which then
fails it in a more dramatic way (albeit the sibling's Uncorrectable
Error Status Register also has only Unsupported Request Error
Status set)?

Jan

Michael S. Tsirkin June 8, 2015, 9:30 a.m. UTC | #27

On Mon, Jun 08, 2015 at 08:42:57AM +0100, Jan Beulich wrote:
> >>> On 07.06.15 at 08:23, <mst@redhat.com> wrote:
> > On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote:
> >> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
> >> > >>> On 20.04.15 at 15:43, <mst@redhat.com> wrote:
> >> > > On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote:
> >> > >> >>> On 13.04.15 at 14:47, <mst@redhat.com> wrote:
> >> > >> > Can you check device capabilities register, offset 0x4 within
> >> > >> > pci express capability structure?
> >> > >> > Bit 15 is 15 Role-Based Error Reporting.
> >> > >> > Is it set?
> >> > >> > 
> >> > >> > The spec says:
> >> > >> > 
> >> > >> > 	15
> >> > >> > 	On platforms where robust error handling and PC-compatible Configuration 
> >> > >> > Space probing is
> >> > >> > 	required, it is suggested that software or firmware have the Unsupported 
> >> > >> > Request Reporting Enable
> >> > >> > 	bit Set for Role-Based Error Reporting Functions, but clear for 1.0a 
> >> > >> > Functions. Software or
> >> > >> > 	firmware can distinguish the two classes of Functions by examining the 
> >> > >> > Role-Based Error Reporting
> >> > >> > 	bit in the Device Capabilities register.
> >> > >> 
> >> > >> Yes, that bit is set.
> >> > > 
> >> > > curiouser and curiouser.
> >> > > 
> >> > > So with functions that do support Role-Based Error Reporting, we have
> >> > > this:
> >> > > 
> >> > > 
> >> > > 	With device Functions implementing Role-Based Error Reporting, setting the 
> >> > > Unsupported Request
> >> > > 	Reporting Enable bit will not interfere with PC-compatible Configuration 
> >> > > Space probing, assuming
> >> > > 	that the severity for UR is left at its default of non-fatal. However, 
> >> > > setting the Unsupported Request
> >> > > 	Reporting Enable bit will enable the Function to report UR errors 97 
> >> > > detected with posted Requests,
> >> > > 	helping avoid this case for potential silent data corruption.
> >> > 
> >> > I still don't see what the PC-compatible config space probing has to
> >> > do with our issue.
> >> 
> >> I'm not sure but I think it's listed here because it causes a ton of URs
> >> when device scan probes unimplemented functions.
> >> 
> >> > > did firmware reconfigure this device to report URs as fatal errors then?
> >> > 
> >> > No, the Unsupported Request Error Serverity flag is zero.
> >> 
> >> OK, that's the correct configuration, so how come the box crashes when
> >> there's a UR then?
> > 
> > Ping - any update on this?
> 
> Not really. All we concluded so far is that _maybe_ the bridge, upon
> seeing the UR, generates a Master Abort, rendering the whole thing
> fatal.

But Master Abort is the equivalent of the UR, so I think that a
reasonable system would not be configured to trigger a fatal error
in this case - and you previously said it's configured reasonably.

> Otoh the respective root port also has
> - Received Master Abort set in its Secondary Status register (but
>   that's also already the case in the log that we have before the UR
>   occurs, i.e. that doesn't mean all that much),
> - Received System Error set in its Secondary Status register (and
>   after the UR the sibling endpoint [UR originating from 83:00.0,
>   sibling being 83:00.1] also shows Signaled System Error set).

It's another function of the same physical device, correct?

So is this sibling the only function sending SERR?
What happens if you disable SERR# in the command register
of 83:00.1?



> > Do we can chalk this up to hardware bugs on a specific box?
> 
> I have to admit that I'm still very uncertain whether to consider all
> this correct behavior, a firmware flaw, or a hardware bug.
> 
> Jan

Questions:
1.  Does this only happen with a specific endpoint?
    What if you add another endpoint to the same system?
2.  Has a driver initialized this endpoint? What if you don't
    load a driver before sending the transaction resulting in the UR?

Michael S. Tsirkin June 8, 2015, 9:36 a.m. UTC | #28

On Mon, Jun 08, 2015 at 10:03:18AM +0100, Jan Beulich wrote:
> >>> On 08.06.15 at 10:09, <malcolm.crossley@citrix.com> wrote:
> > On 08/06/15 08:42, Jan Beulich wrote:
> >> Not really. All we concluded so far is that _maybe_ the bridge, upon
> >> seeing the UR, generates a Master Abort, rendering the whole thing
> >> fatal. Otoh the respective root port also has
> >> - Received Master Abort set in its Secondary Status register (but
> >>   that's also already the case in the log that we have before the UR
> >>   occurs, i.e. that doesn't mean all that much),
> >> - Received System Error set in its Secondary Status register (and
> >>   after the UR the sibling endpoint [UR originating from 83:00.0,
> >>   sibling being 83:00.1] also shows Signaled System Error set).
> >> 
> > 
> > Disabling the Memory decode in the command register could also result in a 
> > completion timeout on the
> > root port issuing a transaction towards the PCI device in question. PCIE 
> > completion timeouts can be
> > escalated to Fatal AER errors which trigger system firmware to inject NMI's 
> > into the host.
> 
> And how does all that play with PC compatibility (where writes into
> no-where get dropped, and reads from no-where get all ones
> returned)? Remember - we#re talking about CPU side accesses
> here.
> 
> > Here is an example AER setup for a PCIE root port. You can see UnsupReq 
> > errors are masked and so do
> > not trigger errors. CmpltTO ( completion timeout) errors are not masked and 
> > the errors are treated
> > as Fatal because the corresponding bit in the Uncorrectable Severity 
> > register is set.
> > 
> > Capabilities: [148 v1] Advanced Error Reporting
> > UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- 
> > ACSViol-
> > UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- 
> > UnsupReq+ ACSViol+
> > UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- 
> > UnsupReq- ACSViol-
> > CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
> > AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> > 
> > A root port completion timeout will also result in the master abort bit 
> > being set.
> > 
> > Typically system firmware clears the error in the AER registers after it's 
> > processed it. So the
> > operating system may not be able to determine what error triggered the NMI 
> > in the first place.
> 
> Right, but in the case at hand we have an ITP log available, which
> increases the hope that we see a reasonably complete picture.
> 
> >>> Do we can chalk this up to hardware bugs on a specific box?
> >> 
> >> I have to admit that I'm still very uncertain whether to consider all
> >> this correct behavior, a firmware flaw, or a hardware bug.
> > I believe the correct behaviour is happening but a PCIE completion timeout 
> > is occurring instead of a
> > unsupported request.
> 
> Might it be that with the supposedly correct device returning UR
> the root port reissues the request to the sibling device, which then
> fails it in a more dramatic way (albeit the sibling's Uncorrectable
> Error Status Register also has only Unsupported Request Error
> Status set)?
> 
> Jan

Isn't the sibling a function on the same device?
And is the request causing the UR a memory read?
If so doesn't this use address routing?
What does it mean that the request is "to the sibling device" then?
Does the sibling device have a BAR overlapping the address?

Jan Beulich June 8, 2015, 10:38 a.m. UTC | #29

>>> On 08.06.15 at 11:30, <mst@redhat.com> wrote:
> On Mon, Jun 08, 2015 at 08:42:57AM +0100, Jan Beulich wrote:
>> Otoh the respective root port also has
>> - Received Master Abort set in its Secondary Status register (but
>>   that's also already the case in the log that we have before the UR
>>   occurs, i.e. that doesn't mean all that much),
>> - Received System Error set in its Secondary Status register (and
>>   after the UR the sibling endpoint [UR originating from 83:00.0,
>>   sibling being 83:00.1] also shows Signaled System Error set).
> 
> It's another function of the same physical device, correct?

Yes (obviously with their BDF only differing in the function).

> So is this sibling the only function sending SERR?

Yes, albeit I can't tell whether the root port generated SERR on
its own or in response to the endpoint doing so.

> What happens if you disable SERR# in the command register
> of 83:00.1?

Any experiments with that system will be quite difficult, as they're
only accessible by partners or ours. But I'll ask to try this if you
think this can provide useful insight.

>> > Do we can chalk this up to hardware bugs on a specific box?
>> 
>> I have to admit that I'm still very uncertain whether to consider all
>> this correct behavior, a firmware flaw, or a hardware bug.
> 
> Questions:
> 1.  Does this only happen with a specific endpoint?
>     What if you add another endpoint to the same system?

We've got reports of this for two systems (two different vendors)
using a Broadcomm NIC in one case and an Intel one in the other.

> 2.  Has a driver initialized this endpoint? What if you don't
>     load a driver before sending the transaction resulting in the UR?

I'd have to ask for this to be tried too, and getting an answer
may take some time.

Jan

Jan Beulich June 8, 2015, 10:55 a.m. UTC | #30

>>> On 08.06.15 at 11:36, <mst@redhat.com> wrote:
> On Mon, Jun 08, 2015 at 10:03:18AM +0100, Jan Beulich wrote:
>> >>> On 08.06.15 at 10:09, <malcolm.crossley@citrix.com> wrote:
>> > I believe the correct behaviour is happening but a PCIE completion timeout 
>> > is occurring instead of a
>> > unsupported request.
>> 
>> Might it be that with the supposedly correct device returning UR
>> the root port reissues the request to the sibling device, which then
>> fails it in a more dramatic way (albeit the sibling's Uncorrectable
>> Error Status Register also has only Unsupported Request Error
>> Status set)?
> 
> Isn't the sibling a function on the same device?

Yes.

> And is the request causing the UR a memory read?

No, it's a write according to the ITP log.

> If so doesn't this use address routing?
> What does it mean that the request is "to the sibling device" then?

The way I expressed things above may simply be due to my limited
knowledge of PCIe terminology: I simply don't know (and can't find
this spelled out in the spec) what the root port's behavior ought to
be when a transaction comes in with an address that's within one of
its base/limit regions, but none of the endpoints can fulfill the
request. But you asking this made me look more closely at the
memory ranges dumped out to the ITP log: The root port has

0x20: Memory Base              = 0xca40
0x22: Memory Limit             = 0xca40
0x24: Prefetchable Mem Base    = 0xca21
0x26: Prefetchable Mem Limit   = 0xca21

while function 0 has

0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)

and function 1

0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)

Isn't is bogus for all but one of the BARs being outside the root
ports prefetchable memory window (the MSI-X tables being inside
the BAR 4 region in both cases)?

> Does the sibling device have a BAR overlapping the address?

No, its BARs are fully separate.

Jan

Michael S. Tsirkin June 8, 2015, 11:28 a.m. UTC | #31

On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
> >>> On 08.06.15 at 11:36, <mst@redhat.com> wrote:
> > On Mon, Jun 08, 2015 at 10:03:18AM +0100, Jan Beulich wrote:
> >> >>> On 08.06.15 at 10:09, <malcolm.crossley@citrix.com> wrote:
> >> > I believe the correct behaviour is happening but a PCIE completion timeout 
> >> > is occurring instead of a
> >> > unsupported request.
> >> 
> >> Might it be that with the supposedly correct device returning UR
> >> the root port reissues the request to the sibling device, which then
> >> fails it in a more dramatic way (albeit the sibling's Uncorrectable
> >> Error Status Register also has only Unsupported Request Error
> >> Status set)?
> > 
> > Isn't the sibling a function on the same device?
> 
> Yes.
> > And is the request causing the UR a memory read?
> 
> No, it's a write according to the ITP log.

So as far as I know, requesters normally can't
reissue posted requests such as writes.
See 6.2.3.1. Completion Status.



> > If so doesn't this use address routing?
> > What does it mean that the request is "to the sibling device" then?
> 
> The way I expressed things above may simply be due to my limited
> knowledge of PCIe terminology: I simply don't know (and can't find
> this spelled out in the spec) what the root port's behavior ought to
> be when a transaction comes in with an address that's within one of
> its base/limit regions, but none of the endpoints can fulfill the
> request.

I think that's because root port doesn't know this. root port forwards
the transaction downstream. It then gets back a UR.

> But you asking this made me look more closely at the
> memory ranges dumped out to the ITP log: The root port has
> 
> 0x20: Memory Base              = 0xca40
> 0x22: Memory Limit             = 0xca40
> 0x24: Prefetchable Mem Base    = 0xca21
> 0x26: Prefetchable Mem Limit   = 0xca21
> 
> while function 0 has
> 
> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)
> 
> and function 1
> 
> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)

And what is the size of this BAR?

> 0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)
> 
> Isn't is bogus for all but one of the BARs being outside the root
> ports prefetchable memory window (the MSI-X tables being inside
> the BAR 4 region in both cases)?

I think it's an OK configuration in the sense that the spec does not
prohibit it. But the root port will never forward a transaction to these
addresses, so they can't be accessed from outside.

> > Does the sibling device have a BAR overlapping the address?
> 
> No, its BARs are fully separate.
> 
> Jan


Judging from the above, it's actually function 1's BAR 2 that
is accessed? Are you saying disabling memory on function 0
breaks function 2 somehow?

Jan Beulich June 8, 2015, 11:44 a.m. UTC | #32

>>> On 08.06.15 at 13:28, <mst@redhat.com> wrote:
> On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
>> But you asking this made me look more closely at the
>> memory ranges dumped out to the ITP log: The root port has
>> 
>> 0x20: Memory Base              = 0xca40
>> 0x22: Memory Limit             = 0xca40
>> 0x24: Prefetchable Mem Base    = 0xca21
>> 0x26: Prefetchable Mem Limit   = 0xca21
>> 
>> while function 0 has
>> 
>> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, 
> prefetchable)
>> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, 
> prefetchable)
>> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, 
> prefetchable)
>> 
>> and function 1
>> 
>> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, 
> prefetchable)
>> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, 
> prefetchable)
> 
> And what is the size of this BAR?

Sadly I don't seem have a matching pair of kernel and ITP logs, so I
can only guess that it's 64k (or less).

Jan

Jan Beulich June 10, 2015, 7 a.m. UTC | #33

>>> On 08.06.15 at 13:28, <mst@redhat.com> wrote:
> On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
>> while function 0 has
>> 
>> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
>> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
>> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)
>> 
>> and function 1
>> 
>> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
>> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)
>> 0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)
>> 
>> > Does the sibling device have a BAR overlapping the address?
>> 
>> No, its BARs are fully separate.
> 
> Judging from the above, it's actually function 1's BAR 2 that
> is accessed? Are you saying disabling memory on function 0
> breaks function 2 somehow?

Oops, just noticed I didn't reply to this. Not sure how you
come to that conclusion - the ITP log says that the bad write is to
0xca25004c.


Jan

Jan Beulich June 10, 2015, 7:08 a.m. UTC | #34

>>> On 08.06.15 at 11:30, <mst@redhat.com> wrote:
> What happens if you disable SERR# in the command register
> of 83:00.1?

We've just been told that with SERR not enabled in any of the
sibling endpoints the NMI still occurs. Not really surprising with
us now assuming that it's the root port that generates the SERR
in response to the UR coming back from an endpoint. But otoh
in conflict with what we see in the ITP log (where SERR clearly
is enabled on the endpoints, and the information we got says
that it _is_ disabled, not that they had to do anything to disable
it).

> 2.  Has a driver initialized this endpoint? What if you don't
>     load a driver before sending the transaction resulting in the UR?

They now tried at least without loading a driver in Dom0, which
didn't make a difference. Did you mean to also not load any driver
in the guest? Otoh I can't really see what difference this makes,
as the cleanup after the guest inside the hypervisor doesn't
really depend much on whether it actively used any of the MSI-X
entries.

Jan

Michael S. Tsirkin June 10, 2015, 11:43 a.m. UTC | #35

On Wed, Jun 10, 2015 at 08:00:55AM +0100, Jan Beulich wrote:
> >>> On 08.06.15 at 13:28, <mst@redhat.com> wrote:
> > On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
> >> while function 0 has
> >> 
> >> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
> >> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
> >> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)
> >> 
> >> and function 1
> >> 
> >> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
> >> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)
> >> 0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)
> >> 
> >> > Does the sibling device have a BAR overlapping the address?
> >> 
> >> No, its BARs are fully separate.
> > 
> > Judging from the above, it's actually function 1's BAR 2 that
> > is accessed? Are you saying disabling memory on function 0
> > breaks function 2 somehow?
> 
> Oops, just noticed I didn't reply to this. Not sure how you
> come to that conclusion - the ITP log says that the bad write is to
> 0xca25004c.
> 
> 
> Jan

Look at the bridge configuration though - looks like it
will only forward transactions to 0xca21XXXX.
Anything else will be terminated by the bridge itself.

Michael S. Tsirkin June 10, 2015, 11:46 a.m. UTC | #36

On Wed, Jun 10, 2015 at 08:08:37AM +0100, Jan Beulich wrote:
> >>> On 08.06.15 at 11:30, <mst@redhat.com> wrote:
> > What happens if you disable SERR# in the command register
> > of 83:00.1?
> 
> We've just been told that with SERR not enabled in any of the
> sibling endpoints the NMI still occurs. Not really surprising with
> us now assuming that it's the root port that generates the SERR
> in response to the UR coming back from an endpoint. But otoh
> in conflict with what we see in the ITP log (where SERR clearly
> is enabled on the endpoints, and the information we got says
> that it _is_ disabled, not that they had to do anything to disable
> it).
> 
> > 2.  Has a driver initialized this endpoint? What if you don't
> >     load a driver before sending the transaction resulting in the UR?
> 
> They now tried at least without loading a driver in Dom0, which
> didn't make a difference. Did you mean to also not load any driver
> in the guest? Otoh I can't really see what difference this makes,
> as the cleanup after the guest inside the hypervisor doesn't
> really depend much on whether it actively used any of the MSI-X
> entries.
> 
> Jan

I don't really know. The idea would be that device
is not designed for memory to be disabled when it's
active, and starts behaving in broken ways.

Jan Beulich June 10, 2015, 12:06 p.m. UTC | #37

>>> On 10.06.15 at 13:43, <mst@redhat.com> wrote:
> On Wed, Jun 10, 2015 at 08:00:55AM +0100, Jan Beulich wrote:
>> >>> On 08.06.15 at 13:28, <mst@redhat.com> wrote:
>> > On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
>> >> while function 0 has
>> >> 
>> >> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
>> >> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
>> >> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)
>> >> 
>> >> and function 1
>> >> 
>> >> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
>> >> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)
>> >> 0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)
>> >> 
>> >> > Does the sibling device have a BAR overlapping the address?
>> >> 
>> >> No, its BARs are fully separate.
>> > 
>> > Judging from the above, it's actually function 1's BAR 2 that
>> > is accessed? Are you saying disabling memory on function 0
>> > breaks function 2 somehow?
>> 
>> Oops, just noticed I didn't reply to this. Not sure how you
>> come to that conclusion - the ITP log says that the bad write is to
>> 0xca25004c.
> 
> Look at the bridge configuration though - looks like it
> will only forward transactions to 0xca21XXXX.
> Anything else will be terminated by the bridge itself.

Right, that's what I had pointed out before, but then again things
work prior to the guest shutting down (and in the absence of any
guest), even if I can't explain why or how.

Jan

Jan Beulich June 10, 2015, 12:10 p.m. UTC | #38

>>> On 10.06.15 at 13:46, <mst@redhat.com> wrote:
> I don't really know. The idea would be that device
> is not designed for memory to be disabled when it's
> active, and starts behaving in broken ways.

But you recall that we know the origin of the offending write is in
the hypervisor, and we also know that the device isn't active at the
point the problem occurs?

Jan

Michael S. Tsirkin June 10, 2015, 1:35 p.m. UTC | #39

On Wed, Jun 10, 2015 at 01:06:27PM +0100, Jan Beulich wrote:
> >>> On 10.06.15 at 13:43, <mst@redhat.com> wrote:
> > On Wed, Jun 10, 2015 at 08:00:55AM +0100, Jan Beulich wrote:
> >> >>> On 08.06.15 at 13:28, <mst@redhat.com> wrote:
> >> > On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:
> >> >> while function 0 has
> >> >> 
> >> >> 0x10: Base Address Register 0  = 0xca23000c (Memory space, 64-bit access, prefetchable)
> >> >> 0x18: Base Address Register 2  = 0xca24000c (Memory space, 64-bit access, prefetchable)
> >> >> 0x20: Base Address Register 4  = 0xca25000c (Memory space, 64-bit access, prefetchable)
> >> >> 
> >> >> and function 1
> >> >> 
> >> >> 0x10: Base Address Register 0  = 0xca20000c (Memory space, 64-bit access, prefetchable)
> >> >> 0x18: Base Address Register 2  = 0xca21000c (Memory space, 64-bit access, prefetchable)
> >> >> 0x20: Base Address Register 4  = 0xca22000c (Memory space, 64-bit access, prefetchable)
> >> >> 
> >> >> > Does the sibling device have a BAR overlapping the address?
> >> >> 
> >> >> No, its BARs are fully separate.
> >> > 
> >> > Judging from the above, it's actually function 1's BAR 2 that
> >> > is accessed? Are you saying disabling memory on function 0
> >> > breaks function 2 somehow?
> >> 
> >> Oops, just noticed I didn't reply to this. Not sure how you
> >> come to that conclusion - the ITP log says that the bad write is to
> >> 0xca25004c.
> > 
> > Look at the bridge configuration though - looks like it
> > will only forward transactions to 0xca21XXXX.
> > Anything else will be terminated by the bridge itself.
> 
> Right, that's what I had pointed out before, but then again things
> work prior to the guest shutting down (and in the absence of any
> guest), even if I can't explain why or how.
> 
> Jan

I have a wild idea. Maybe there's a chance function 1 sends the
offending write to 0xca25000c, then gets confused and crashes
if that fails?

[XSA-126] xen: limit guest control of PCI command register

Commit Message

Comments

Patch