Multi GPU passthrough via VFIO

Message ID 1391800246.6959.280.camel@bling.home
State New

Commit Message

Alex Williamson Feb. 7, 2014, 7:10 p.m. UTC
On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> Interesting is the diff between 1st and 2nd boot, so if I do the lspci
> prior to the booting. The only difference between 1st start and 2nd
> start are:
> 
> --- 001-lspci.290x.before.1st.log	2014-02-07 01:13:41.498827928 +0100
> +++ 004-lspci.290x.before.2nd.log	2014-02-07 01:16:50.966611282 +0100
> @@ -24,7 +24,7 @@
>  			ClockPM- Surprise- LLActRep- BwNot-
>  		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
>  			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> -		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> +		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>  		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
>  		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> @@ -33,13 +33,13 @@
>  		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
>  			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>  	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> -		Address: 0000000000000000  Data: 0000
> +		Address: 00000000fee00000  Data: 0000
>  	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>  	Capabilities: [150 v2] Advanced Error Reporting
>  		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>  		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>  		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> -		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> +		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>  		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>  		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>  	Capabilities: [270 v1] #19
> 
> After that if I do suspend-to-ram / resume trick I have again lspci
> output from before 1st boot.

The Link Status change after X is stopped seems the most interesting to
me.  The MSI change is probably explained by the MSI save/restore of the
device, but should be harmless since MSI is disabled.  I'm a bit
surprised the Correctable Error Status in the AER capability didn't get
cleared.  I would have thought that a bus reset would have caused the
link to retrain back to the original speed/width as well.  Let's check
that we're actually getting a bus reset: try this in addition to the
previous qemu patch.  It just enables debug logging for the bus reset
function.  Thanks,

Alex

Comments

Maik Broemme Feb. 7, 2014, 8:17 p.m. UTC | #1
Hi Alex,

Alex Williamson <alex.williamson@redhat.com> wrote:
> The Link Status change after X is stopped seems the most interesting to
> me.  The MSI change is probably explained by the MSI save/restore of the
> device, but should be harmless since MSI is disabled.  I'm a bit
> surprised the Correctable Error Status in the AER capability didn't get
> cleared.  I would have thought that a bus reset would have caused the
> link to retrain back to the original speed/width as well.  Let's check
> that we're actually getting a bus reset: try this in addition to the
> previous qemu patch.  It just enables debug logging for the bus reset
> function.  Thanks,
> 

Below are the outputs from 2 boots: VGA, load fglrx and start X.  (On
the 2nd boot X gets killed and an oops happens.)

- 1st boot:

vfio: vfio_pci_hot_reset(0000:01:00.1) multi
vfio: 0000:01:00.1: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: 	0000:01:00.1 group 1
vfio: 0000:01:00.1 hot reset: Success
vfio: vfio_pci_hot_reset(0000:01:00.1) one
vfio: 0000:01:00.1: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: vfio: found another in-use device 0000:01:00.0
vfio: vfio_pci_hot_reset(0000:01:00.0) one
vfio: 0000:01:00.0: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: 	0000:01:00.1 group 1
vfio: vfio: found another in-use device 0000:01:00.1

- 2nd boot:

vfio: vfio_pci_hot_reset(0000:01:00.1) multi
vfio: 0000:01:00.1: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: 	0000:01:00.1 group 1
vfio: 0000:01:00.1 hot reset: Success
vfio: vfio_pci_hot_reset(0000:01:00.1) one
vfio: 0000:01:00.1: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: vfio: found another in-use device 0000:01:00.0
vfio: vfio_pci_hot_reset(0000:01:00.0) one
vfio: 0000:01:00.0: hot reset dependent devices:
vfio: 	0000:01:00.0 group 1
vfio: 	0000:01:00.1 group 1
vfio: vfio: found another in-use device 0000:01:00.1
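
One way to compare the two logs at a glance is to reduce each to one
line per vfio_pci_hot_reset() call.  The following is only a sketch:
the function name summarize_hot_resets and the "not reached" label are
made up for illustration, and it assumes the log format is exactly as
in the output above.

```shell
# Hypothetical helper: reads the "vfio:" debug lines on stdin and prints,
# for each vfio_pci_hot_reset() call, whether "hot reset: Success" was
# logged before the next call started.
summarize_hot_resets() {
    awk '
        /vfio_pci_hot_reset\(/ {
            if (call != "") print call ": " result
            call = $2 " " $3          # e.g. "vfio_pci_hot_reset(0000:01:00.1) multi"
            result = "not reached"
        }
        /hot reset: Success/ { result = "Success" }
        END { if (call != "") print call ": " result }
    '
}

# Example input: the key lines of the three calls in the 1st boot log.
summarize_hot_resets <<'EOF'
vfio: vfio_pci_hot_reset(0000:01:00.1) multi
vfio: 0000:01:00.1 hot reset: Success
vfio: vfio_pci_hot_reset(0000:01:00.1) one
vfio: vfio: found another in-use device 0000:01:00.0
vfio: vfio_pci_hot_reset(0000:01:00.0) one
vfio: vfio: found another in-use device 0000:01:00.1
EOF
```

For either boot's output this reports Success for the "multi" call and
"not reached" for both "one" calls, so the two boots look identical at
this level.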

> Alex
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 8db182f..7fec259 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -2927,6 +2927,10 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *hos
>              host1->slot == host2->slot && host1->function == host2->function);
>  }
>  
> +#undef DPRINTF
> +#define DPRINTF(fmt, ...) \
> +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> +
>  static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
>  {
>      VFIOGroup *group;
> @@ -3104,6 +3108,15 @@ out_single:
>      return ret;
>  }
>  
> +#undef DPRINTF
> +#ifdef DEBUG_VFIO
> +#define DPRINTF(fmt, ...) \
> +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> +    do { } while (0)
> +#endif
> +
>  /*
>   * We want to differentiate hot reset of mulitple in-use devices vs hot reset
>   * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
> 
> 

--Maik
Maik Broemme Feb. 14, 2014, 12:01 a.m. UTC | #2
Hi Alex,

Maik Broemme <mbroemme@parallels.com> wrote:
> Below are the outputs from 2 boots, VGA, load fglrx and start X. (2nd
> time X gets killed and oops happened)

Did you already have a chance to look into it, or is there anything
else I can help with?


--Maik
Alex Williamson Feb. 14, 2014, 12:33 a.m. UTC | #3
On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> Did you already have a chance to look into it, or is there anything
> else I can help with?

According to the log we're doing the bus reset on both the first and 2nd
boot (it's expected that only the "multi" call gets to success).  I'm
surprised then that the link doesn't retrain back to the original width.
You could try forcing the link to retrain.  Look at the root port
upstream from the GPU, lspci -t is handy for this.  Run lspci on the
root port to get the PCI express capability offset, then use setpci to
set the link retrain bit.  For example:

# lspci -tv | grep NVIDIA
           +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
           |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller

(upstream root port is 00:07.0)

# lspci -v -s 7.0 | grep Capabilities
	Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
	Capabilities: [90] Express Root Port (Slot+), MSI 00
	Capabilities: [e0] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Access Control Services
	Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>

(PCI express capability is offset 0x90, Link Control is 0x10 off that)

# setpci -s 7.0 a0.w
0040

(retrain is bit 5, 0x20, OR'd with read value is 0x60)

# setpci -s 7.0 a0.w=60

# lspci... did it work?
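
The arithmetic in the example above can be sketched as two small shell
helpers.  This is only an illustration: the function names
link_control_offset and with_retrain_bit are made up here, the values
90 and 0040 are the ones from the example, and nothing below touches
hardware, since the setpci command is only printed.

```shell
# Hypothetical helpers; the names are illustrative, not part of pciutils.

# Link Control sits 0x10 above the start of the PCI Express capability.
# $1: capability offset in hex, without the 0x prefix (e.g. 90)
link_control_offset() {
    printf '%x\n' $(( 0x$1 + 0x10 ))
}

# Retrain Link is bit 5 (0x20) of Link Control.
# $1: current Link Control value in hex, as printed by setpci (e.g. 0040)
with_retrain_bit() {
    printf '%x\n' $(( 0x$1 | 0x20 ))
}

off=$(link_control_offset 90)
val=$(with_retrain_bit 0040)

# Print, rather than run, the command that would set the retrain bit.
echo "setpci -s 7.0 ${off}.w=${val}"
```

With the values from the example this prints "setpci -s 7.0 a0.w=60".
On other hardware the capability offset and the current register value
have to be read from lspci and setpci first.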

Try doing that after the first boot to see if you can get back to a x16
link.  If that works, we may need to add something in the kernel to do
it automatically around a bus reset.  Thanks,

Alex

Maik Broemme Feb. 14, 2014, 2:51 p.m. UTC | #4
Hi Alex,

Alex Williamson <alex.williamson@redhat.com> wrote:
> According to the log we're doing the bus reset on both the first and 2nd
> boot (it's expected that only the "multi" call gets to success).  I'm
> surprised then that the link doesn't retrain back to the original width.
> You could try forcing the link to retrain.  Look at the root port
> upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> root port to get the PCI express capability offset, then use setpci to
> set the link retrain bit.  For example:
> 
> # lspci -tv | grep NVIDIA
>            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
>            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> 
> (upstream root port is 00:07.0)
> 
> # lspci -v -s 7.0 | grep Capabilities
> 	Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> 	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> 	Capabilities: [90] Express Root Port (Slot+), MSI 00
> 	Capabilities: [e0] Power Management version 3
> 	Capabilities: [100] Advanced Error Reporting
> 	Capabilities: [150] Access Control Services
> 	Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> 
> (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> 
> # setpci -s 7.0 a0.w
> 0040
> 
> (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> 
> # setpci -s 7.0 a0.w=60
> 
> # lspci... did it work?
> 
> Try doing that after the first boot to see if you can get back to a x16
> link.  If that works, we may need to add something in the kernel to do
> it automatically around a bus reset.  Thanks,
> 

Well, this doesn't help either, and it looks like the VFIO reset already
sets the link back to its original width.  For example:

           +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8

Before 1st run:

root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

After power down of VM:

root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

After 2nd start once VFIO did reset:

root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
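
The three checks above can be scripted so that a change in speed or
width is easy to spot.  A minimal sketch, assuming the LnkSta line
format shown above; lnksta_speed_width and link_state are made-up
names, and the sample lines are copied from the root port output
instead of calling lspci.

```shell
# Hypothetical helper: reads "lspci -vvv" output on stdin and prints
# "<speed> <width>" for every LnkSta line it finds.
lnksta_speed_width() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^,]*\), Width \(x[0-9]*\).*/\1 \2/p'
}

# Prints "same" or "changed" for two "<speed> <width>" strings.
link_state() {
    if [ "$1" = "$2" ]; then echo same; else echo changed; fi
}

before=$(echo 'LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-' | lnksta_speed_width)
after=$(echo 'LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+' | lnksta_speed_width)

echo "before: $before, after: $after, $(link_state "$before" "$after")"
```

On a live system the echo lines would be replaced by something like
"lspci -vvv -s 00:02.0 | lnksta_speed_width", run before the first
start, after powering down the VM, and after the VFIO reset.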

The only difference I see on the bus here is ABWMgmt- vs ABWMgmt+, but
that shouldn't be relevant, since it is the same when I unload the fglrx
module before shutting down the VM, which is the only case where I can
run multiple VM reboot cycles.

So the only difference on the bus is the following:

-60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
+60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00

6a (before 02, after 11)
6b (before b1, after b0)

But I cannot write these values using setpci.  My PCI Express capability
is at offset 0x58, so Link Control is at 0x58 + 0x10 = 0x68, which is
already set back to 40:

root@homer:~# lspci -vvv -s 00:02.0 | grep Capa
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [190 v1] Access Control Services
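
Read as one little-endian 16-bit word, the two changed bytes look like
the Link Status register.  That rests on an assumption taken from the
PCI Express capability layout rather than from anything stated in this
thread: Link Status sits at capability offset + 0x12, here
0x58 + 0x12 = 0x6a.  A rough sketch of the decoding, with a made-up
function name decode_lnksta:

```shell
# Hypothetical helper: decodes the speed and width fields of a PCI Express
# Link Status value.
# $1: 16-bit register value in hex, without the 0x prefix (e.g. b102)
decode_lnksta() {
    v=$(( 0x$1 ))
    speed=$(( v & 0xf ))            # bits 3:0, Current Link Speed
    width=$(( (v >> 4) & 0x3f ))    # bits 9:4, Negotiated Link Width
    case $speed in
        1) speed_txt=2.5GT/s ;;
        2) speed_txt=5GT/s ;;
        3) speed_txt=8GT/s ;;
        *) speed_txt="unknown($speed)" ;;
    esac
    echo "Speed $speed_txt, Width x$width"
}

decode_lnksta b102    # bytes 6a=02, 6b=b1
decode_lnksta b011    # bytes 6a=11, 6b=b0
```

The two calls print "Speed 5GT/s, Width x16" and "Speed 2.5GT/s, Width
x1", matching the LnkSta lines above.  The speed and width fields of
Link Status are defined as read-only, which may be why writing them
with setpci has no effect.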


--Maik
Maik Broemme April 14, 2014, 5:03 p.m. UTC | #5
Hi Alex,

Maik Broemme <mbroemme@parallels.com> wrote:
> Hi Alex,
> 
> Alex Williamson <alex.williamson@redhat.com> wrote:
> > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > Hi Alex,
> > > 
> > > Maik Broemme <mbroemme@parallels.com> wrote:
> > > > Hi Alex,
> > > > 
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > Interesting is the diff between 1st and 2nd boot, so if I do the lspci
> > > > > > prior to the booting. The only difference between 1st start and 2nd
> > > > > > start are:
> > > > > > 
> > > > > > --- 001-lspci.290x.before.1st.log	2014-02-07 01:13:41.498827928 +0100
> > > > > > +++ 004-lspci.290x.before.2nd.log	2014-02-07 01:16:50.966611282 +0100
> > > > > > @@ -24,7 +24,7 @@
> > > > > >  			ClockPM- Surprise- LLActRep- BwNot-
> > > > > >  		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > >  			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > -		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > +		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > >  		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > >  		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > >  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > @@ -33,13 +33,13 @@
> > > > > >  		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > >  			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > >  	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > -		Address: 0000000000000000  Data: 0000
> > > > > > +		Address: 00000000fee00000  Data: 0000
> > > > > >  	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > >  	Capabilities: [150 v2] Advanced Error Reporting
> > > > > >  		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > >  		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > >  		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > -		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > +		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > >  		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > >  		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > >  	Capabilities: [270 v1] #19
> > > > > > 
> > > > > > After that if I do suspend-to-ram / resume trick I have again lspci
> > > > > > output from before 1st boot.
> > > > > 
> > > > > The Link Status change after X is stopped seems the most interesting to
> > > > > me.  The MSI change is probably explained by the MSI save/restore of the
> > > > > device, but should be harmless since MSI is disabled.  I'm a bit
> > > > > surprised the Correctable Error Status in the AER capability didn't get
> > > > > cleared.  I would have thought that a bus reset would have caused the
> > > > > link to retrain back to the original speed/width as well.  Let's check
> > > > > that we're actually getting a bus reset, try this in addition to the
> > > > > previous qemu patch.  This just enables debug logging for the bus reset
> > > > > function.  Thanks,
> > > > > 
> > > > 
> > > > Below are the outputs from 2 boots, VGA, load fglrx and start X. (2nd
> > > > time X gets killed and oops happened)
> > > > 
> > > > - 1st boot:
> > > > 
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: 	0000:01:00.1 group 1
> > > > vfio: 0000:01:00.1 hot reset: Success
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: 	0000:01:00.1 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > 
> > > > - 2nd boot:
> > > > 
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: 	0000:01:00.1 group 1
> > > > vfio: 0000:01:00.1 hot reset: Success
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > vfio: 	0000:01:00.0 group 1
> > > > vfio: 	0000:01:00.1 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > 
> > > 
> > > Have you already had a chance to look into it, or is there anything else I
> > > can help with?
> > 
> > According to the log we're doing the bus reset on both the first and 2nd
> > boot (it's expected that only the "multi" call gets to success).  I'm
> > surprised then that the link doesn't retrain back to the original width.
> > You could try forcing the link to retrain.  Look at the root port
> > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > root port to get the PCI express capability offset, then use setpci to
> > set the link retrain bit.  For example:
> > 
> > # lspci -tv | grep NVIDIA
> >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> > 
> > (upstream root port is 00:07.0)
> > 
> > # lspci -v -s 7.0 | grep Capabilities
> > 	Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > 	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > 	Capabilities: [90] Express Root Port (Slot+), MSI 00
> > 	Capabilities: [e0] Power Management version 3
> > 	Capabilities: [100] Advanced Error Reporting
> > 	Capabilities: [150] Access Control Services
> > 	Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > 
> > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > 
> > # setpci -s 7.0 a0.w
> > 0040
> > 
> > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > 
> > # setpci -s 7.0 a0.w=60
> > 
> > # lspci... did it work?
> > 
> > Try doing that after the first boot to see if you can get back to a x16
> > link.  If that works, we may need to add something in the kernel to do
> > it automatically around a bus reset.  Thanks,
> > 
> 
> Well this doesn't help either and it looks like VFIO reset is setting it
> already back to original width. For example:
> 
>            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
>            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8
> 
> Before 1st run:
> 
> root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> After power down of VM:
> 
> root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> After 2nd start once VFIO did reset:
> 
> root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> The only difference on the bus I see here is ABWMgmt- vs ABWMgmt+, but it
> shouldn't be relevant here, as it is the same if I unload the fglrx module
> before shutting down the VM, which is the only case where I can run multiple
> VM reboot cycles.
> 
> So the only difference on bus is the following:
> 
> -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> 
> 6a (before 02, after 11)
> 6b (before b1, after b0)
> 
> But I cannot write these parameters using setpci. My PCI express capability
> is offset 0x58 + 0x10 for link control which is already set back to 40
> 
> root@homer:~# lspci -vvv -s 00:02.0 | grep Capa
> 	Capabilities: [50] Power Management version 3
> 	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
> 	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
> 	Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
> 	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> 	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> 	Capabilities: [190 v1] Access Control Services
> 

Wouldn't a D0 -> D3 -> D0 transition be a possible solution for
devices which don't support FLR? The setpci approach doesn't help me at all.
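
As a rough illustration of the D0 -> D3hot -> D0 idea: the power state
lives in bits 1:0 of the PMCSR register, 4 bytes into the Power Management
capability, and a device only resets its internal state on the way back to
D0 if its No_Soft_Reset bit (PMCSR bit 3) is clear. The Python sketch below
only builds the setpci command strings and does not touch hardware. The
capability offset 0x50 for the GPU and the PMCSR value 0x0000 are
assumptions for illustration (neither appears in this thread), and the
helper names are made up.

```python
# Sketch only: builds setpci command strings, never executes them.
PMCSR_OFFSET = 0x04        # PMCSR sits 4 bytes into the PM capability
POWER_STATE_MASK = 0x0003  # PMCSR bits 1:0: PowerState
NO_SOFT_RESET = 0x0008     # PMCSR bit 3: set means D3hot -> D0 keeps state
PME_STATUS = 0x8000        # PMCSR bit 15: write-1-to-clear, so never write 1
D0, D3HOT = 0x0, 0x3


def pmcsr_with_state(pmcsr, state):
    """Return the PMCSR value with only the PowerState field replaced."""
    keep = pmcsr & 0xFFFF & ~(POWER_STATE_MASK | PME_STATUS)
    return keep | state


def resets_on_d3_exit(pmcsr):
    """True if the device is expected to reset itself on D3hot -> D0."""
    return not (pmcsr & NO_SOFT_RESET)


def d3_cycle_commands(bdf, pm_cap, pmcsr):
    """setpci commands for D0 -> D3hot -> D0 on device `bdf`.

    pm_cap: offset of the PM capability (hypothetical, take it from lspci)
    pmcsr:  PMCSR value read beforehand (hypothetical input)
    """
    reg = pm_cap + PMCSR_OFFSET
    return [
        "setpci -s %s %x.w=%04x" % (bdf, reg, pmcsr_with_state(pmcsr, D3HOT)),
        "setpci -s %s %x.w=%04x" % (bdf, reg, pmcsr_with_state(pmcsr, D0)),
    ]


if __name__ == "__main__":
    # 0x50 and 0x0000 are placeholders, not values taken from the 290X.
    for cmd in d3_cycle_commands("01:00.0", 0x50, 0x0000):
        print(cmd)
```

A real attempt would have to read PMCSR first and wait at least 10 ms after
each transition. Whether this would help the 290X at all is untested here.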

> > Alex
> > 
> > > > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > > > > index 8db182f..7fec259 100644
> > > > > --- a/hw/misc/vfio.c
> > > > > +++ b/hw/misc/vfio.c
> > > > > @@ -2927,6 +2927,10 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *hos
> > > > >              host1->slot == host2->slot && host1->function == host2->function);
> > > > >  }
> > > > >  
> > > > > +#undef DPRINTF
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > +
> > > > >  static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
> > > > >  {
> > > > >      VFIOGroup *group;
> > > > > @@ -3104,6 +3108,15 @@ out_single:
> > > > >      return ret;
> > > > >  }
> > > > >  
> > > > > +#undef DPRINTF
> > > > > +#ifdef DEBUG_VFIO
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > +#else
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { } while (0)
> > > > > +#endif
> > > > > +
> > > > >  /*
> > > > >   * We want to differentiate hot reset of mulitple in-use devices vs hot reset
> > > > >   * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
> > > > > 
> > > > > 
> > > > 
> > > > --Maik
> > > > 
> > > 
> > > --Maik
> > 
> > 
> > 
> 
> --Maik
> 

--Maik
Maik Broemme Jan. 16, 2015, 12:21 p.m. UTC | #6
Hi Alex,

Maik Broemme <mbroemme@parallels.com> wrote:
> Hi Alex,
> 
> Maik Broemme <mbroemme@parallels.com> wrote:
> > Hi Alex,
> > 
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > > Hi Alex,
> > > > 
> > > > Maik Broemme <mbroemme@parallels.com> wrote:
> > > > > Hi Alex,
> > > > > 
> > > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > > Interesting is the diff between 1st and 2nd boot, so if I do the lspci
> > > > > > > prior to the booting. The only difference between 1st start and 2nd
> > > > > > > start are:
> > > > > > > 
> > > > > > > --- 001-lspci.290x.before.1st.log	2014-02-07 01:13:41.498827928 +0100
> > > > > > > +++ 004-lspci.290x.before.2nd.log	2014-02-07 01:16:50.966611282 +0100
> > > > > > > @@ -24,7 +24,7 @@
> > > > > > >  			ClockPM- Surprise- LLActRep- BwNot-
> > > > > > >  		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > > >  			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > -		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > +		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > >  		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > > >  		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > > >  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > > @@ -33,13 +33,13 @@
> > > > > > >  		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > > >  			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > > >  	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > > -		Address: 0000000000000000  Data: 0000
> > > > > > > +		Address: 00000000fee00000  Data: 0000
> > > > > > >  	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > > >  	Capabilities: [150 v2] Advanced Error Reporting
> > > > > > >  		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > >  		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > >  		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > > -		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > > +		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > >  		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > >  		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > > >  	Capabilities: [270 v1] #19
> > > > > > > 
> > > > > > > After that if I do suspend-to-ram / resume trick I have again lspci
> > > > > > > output from before 1st boot.
> > > > > > 
> > > > > > The Link Status change after X is stopped seems the most interesting to
> > > > > > me.  The MSI change is probably explained by the MSI save/restore of the
> > > > > > device, but should be harmless since MSI is disabled.  I'm a bit
> > > > > > surprised the Correctable Error Status in the AER capability didn't get
> > > > > > cleared.  I would have thought that a bus reset would have caused the
> > > > > > link to retrain back to the original speed/width as well.  Let's check
> > > > > > that we're actually getting a bus reset, try this in addition to the
> > > > > > previous qemu patch.  This just enables debug logging for the bus reset
> > > > > > function.  Thanks,
> > > > > > 
> > > > > 
> > > > > Below are the outputs from 2 boots, VGA, load fglrx and start X. (2nd
> > > > > time X gets killed and oops happened)
> > > > > 
> > > > > - 1st boot:
> > > > > 
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: 	0000:01:00.1 group 1
> > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: 	0000:01:00.1 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > 
> > > > > - 2nd boot:
> > > > > 
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: 	0000:01:00.1 group 1
> > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > vfio: 	0000:01:00.0 group 1
> > > > > vfio: 	0000:01:00.1 group 1
> > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > 
> > > > 
> > > > Have you already had a chance to look into it, or is there anything else I
> > > > can help with?
> > > 
> > > According to the log we're doing the bus reset on both the first and 2nd
> > > boot (it's expected that only the "multi" call gets to success).  I'm
> > > surprised then that the link doesn't retrain back to the original width.
> > > You could try forcing the link to retrain.  Look at the root port
> > > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > > root port to get the PCI express capability offset, then use setpci to
> > > set the link retrain bit.  For example:
> > > 
> > > # lspci -tv | grep NVIDIA
> > >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> > >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> > > 
> > > (upstream root port is 00:07.0)
> > > 
> > > # lspci -v -s 7.0 | grep Capabilities
> > > 	Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > > 	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > > 	Capabilities: [90] Express Root Port (Slot+), MSI 00
> > > 	Capabilities: [e0] Power Management version 3
> > > 	Capabilities: [100] Advanced Error Reporting
> > > 	Capabilities: [150] Access Control Services
> > > 	Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > > 
> > > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > > 
> > > # setpci -s 7.0 a0.w
> > > 0040
> > > 
> > > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > > 
> > > # setpci -s 7.0 a0.w=60
> > > 
> > > # lspci... did it work?
> > > 
> > > Try doing that after the first boot to see if you can get back to a x16
> > > link.  If that works, we may need to add something in the kernel to do
> > > it automatically around a bus reset.  Thanks,
> > > 
> > 
> > Well this doesn't help either and it looks like VFIO reset is setting it
> > already back to original width. For example:
> > 
> >            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
> >            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8
> > 
> > Before 1st run:
> > 
> > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After power down of VM:
> > 
> > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After 2nd start once VFIO did reset:
> > 
> > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > The only difference on the bus I see here is ABWMgmt- vs ABWMgmt+, but it
> > shouldn't be relevant here, as it is the same if I unload the fglrx module
> > before shutting down the VM, which is the only case where I can run multiple
> > VM reboot cycles.
> > 
> > So the only difference on bus is the following:
> > 
> > -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> > +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> > 
> > 6a (before 02, after 11)
> > 6b (before b1, after b0)
> > 
> > But I cannot write these parameters using setpci. My PCI express capability
> > is offset 0x58 + 0x10 for link control which is already set back to 40
> > 
> > root@homer:~# lspci -vvv -s 00:02.0 | grep Capa
> > 	Capabilities: [50] Power Management version 3
> > 	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
> > 	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
> > 	Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
> > 	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> > 	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > 	Capabilities: [190 v1] Access Control Services
> > 
> 
> Wouldn't a D0 -> D3 -> D0 transition be a possible solution for
> devices which don't support FLR? The setpci approach doesn't help me at all.
> 

I want to revive this thread, as some things have changed with the latest
slot/bus reset, but it still doesn't work in all cases.

#1 QEMU+OVMF (UEFI):

I've flashed my R9 290X with a UEFI-compatible BIOS, and QEMU+OVMF
(without CSM) boots Windows 8.1 fine. Catalyst 14.12 drivers can be
installed without issues and work fine. However, an attempt to reboot the
VM results in the typical Windows 8.1 "Something went wrong :(" screen. The
suspend/resume trick still works between VM reboots.

#2 QEMU (BIOS):

In this scenario I use secondary GPU passthrough (no VGA as primary
adapter) with Windows 7. Catalyst 14.12 drivers can be installed
without issues and work fine. I was also surprised that rebooting the
VM worked. Windows 7 restarts fine, I see the login screen and there
are no performance issues. But it doesn't always work: sometimes it
works for 3-4 reboots and the next one fails with just a black screen
(but the Windows VM is pingable and ACPI shutdown still works), sometimes
it works for only one reboot. In all cases the suspend/resume trick still
works.

So I would like to narrow down the problem. Is there anything I can try,
Alex, like collecting debug logs from QEMU?

The QEMU version used is 2.2.0, the kernel is 3.18.2.
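
One thing that could be logged between reboots is the root port Link
Status word. With the PCI Express capability at 0x58, Link Status sits at
0x58 + 0x12 = 0x6a, which matches the two bytes (6a/6b) that differed in
the config space dump quoted above. As far as I can tell, the speed and
width fields in that register are read-only, which would explain why they
cannot be written with setpci. The Python sketch below decodes those two
bytes. The function name and the dictionary layout are made up for
illustration, and only the speed codes seen in this thread plus 8GT/s are
mapped.

```python
# Sketch only: decodes a PCIe Link Status word taken from a config dump.
LNKSTA_OFFSET = 0x12  # Link Status, relative to the PCIe capability
SPEEDS = {1: "2.5GT/s", 2: "5GT/s", 3: "8GT/s"}


def decode_lnksta(lo, hi):
    """Decode Link Status from its two little-endian config space bytes."""
    word = (hi << 8) | lo
    return {
        "speed": SPEEDS.get(word & 0xF, "unknown"),  # bits 3:0
        "width": (word >> 4) & 0x3F,                 # bits 9:4
        "training": bool(word & (1 << 11)),          # Train
        "slot_clk": bool(word & (1 << 12)),          # SlotClk
        "dl_active": bool(word & (1 << 13)),         # DLActive
        "bw_mgmt": bool(word & (1 << 14)),           # BWMgmt
        "auto_bw_mgmt": bool(word & (1 << 15)),      # ABWMgmt
    }


if __name__ == "__main__":
    # Bytes 6a/6b from the dump: "02 b1" on one side, "11 b0" on the other.
    for lo, hi in ((0x02, 0xB1), (0x11, 0xB0)):
        info = decode_lnksta(lo, hi)
        print("Speed %s, Width x%d" % (info["speed"], info["width"]))
```

The two byte pairs decode to 5GT/s x16 and 2.5GT/s x1 respectively, the
same two states that lspci reports in LnkSta.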

> > > Alex
> > > 
> > > > > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > > > > > index 8db182f..7fec259 100644
> > > > > > --- a/hw/misc/vfio.c
> > > > > > +++ b/hw/misc/vfio.c
> > > > > > @@ -2927,6 +2927,10 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *hos
> > > > > >              host1->slot == host2->slot && host1->function == host2->function);
> > > > > >  }
> > > > > >  
> > > > > > +#undef DPRINTF
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > > +
> > > > > >  static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
> > > > > >  {
> > > > > >      VFIOGroup *group;
> > > > > > @@ -3104,6 +3108,15 @@ out_single:
> > > > > >      return ret;
> > > > > >  }
> > > > > >  
> > > > > > +#undef DPRINTF
> > > > > > +#ifdef DEBUG_VFIO
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > > +#else
> > > > > > +#define DPRINTF(fmt, ...) \
> > > > > > +    do { } while (0)
> > > > > > +#endif
> > > > > > +
> > > > > >  /*
> > > > > >   * We want to differentiate hot reset of mulitple in-use devices vs hot reset
> > > > > >   * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
> > > > > > 
> > > > > > 
> > > > > 
> > > > > --Maik
> > > > > 
> > > > 
> > > > --Maik
> > > 
> > > 
> > > 
> > 
> > --Maik
> > 
> 
> --Maik
> 

--Maik
Alex Williamson Jan. 19, 2015, 5:43 p.m. UTC | #7
On Fri, 2015-01-16 at 13:21 +0100, Maik Broemme wrote:
> Hi Alex,
> 
> Maik Broemme <mbroemme@parallels.com> wrote:
> > Hi Alex,
> > 
> > Maik Broemme <mbroemme@parallels.com> wrote:
> > > Hi Alex,
> > > 
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > > > Hi Alex,
> > > > > 
> > > > > Maik Broemme <mbroemme@parallels.com> wrote:
> > > > > > Hi Alex,
> > > > > > 
> > > > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > > > Interesting is the diff between 1st and 2nd boot, so if I do the lspci
> > > > > > > > prior to the booting. The only difference between 1st start and 2nd
> > > > > > > > start are:
> > > > > > > > 
> > > > > > > > --- 001-lspci.290x.before.1st.log	2014-02-07 01:13:41.498827928 +0100
> > > > > > > > +++ 004-lspci.290x.before.2nd.log	2014-02-07 01:16:50.966611282 +0100
> > > > > > > > @@ -24,7 +24,7 @@
> > > > > > > >  			ClockPM- Surprise- LLActRep- BwNot-
> > > > > > > >  		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > > > >  			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > > -		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > > +		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > >  		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > > > >  		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > > > >  		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > > > @@ -33,13 +33,13 @@
> > > > > > > >  		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > > > >  			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > > > >  	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > > > -		Address: 0000000000000000  Data: 0000
> > > > > > > > +		Address: 00000000fee00000  Data: 0000
> > > > > > > >  	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > > > >  	Capabilities: [150 v2] Advanced Error Reporting
> > > > > > > >  		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >  		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > > >  		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > > > -		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > > > +		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >  		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > > >  		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > > > >  	Capabilities: [270 v1] #19
> > > > > > > > 
> > > > > > > > After that if I do suspend-to-ram / resume trick I have again lspci
> > > > > > > > output from before 1st boot.
> > > > > > > 
> > > > > > > The Link Status change after X is stopped seems the most interesting to
> > > > > > > me.  The MSI change is probably explained by the MSI save/restore of the
> > > > > > > device, but should be harmless since MSI is disabled.  I'm a bit
> > > > > > > surprised the Correctable Error Status in the AER capability didn't get
> > > > > > > cleared.  I would have thought that a bus reset would have caused the
> > > > > > > link to retrain back to the original speed/width as well.  Let's check
> > > > > > > that we're actually getting a bus reset, try this in addition to the
> > > > > > > previous qemu patch.  This just enables debug logging for the bus reset
> > > > > > > function.  Thanks,
> > > > > > > 
> > > > > > 
> > > > > > Below are the outputs from 2 boots, VGA, load fglrx and start X. (2nd
> > > > > > time X gets killed and oops happened)
> > > > > > 
> > > > > > - 1st boot:
> > > > > > 
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: 	0000:01:00.1 group 1
> > > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: 	0000:01:00.1 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > > 
> > > > > > - 2nd boot:
> > > > > > 
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: 	0000:01:00.1 group 1
> > > > > > vfio: 0000:01:00.1 hot reset: Success
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > > > vfio: 	0000:01:00.0 group 1
> > > > > > vfio: 	0000:01:00.1 group 1
> > > > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > > > 
> > > > > 
> > > > > Have you already had a chance to look into it, or is there anything else I
> > > > > can help with?
> > > > 
> > > > According to the log we're doing the bus reset on both the first and 2nd
> > > > boot (it's expected that only the "multi" call gets to success).  I'm
> > > > surprised then that the link doesn't retrain back to the original width.
> > > > You could try forcing the link to retrain.  Look at the root port
> > > > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > > > root port to get the PCI express capability offset, then use setpci to
> > > > set the link retrain bit.  For example:
> > > > 
> > > > # lspci -tv | grep NVIDIA
> > > >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> > > >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio Controller
> > > > 
> > > > (upstream root port is 00:07.0)
> > > > 
> > > > # lspci -v -s 7.0 | grep Capabilities
> > > > 	Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > > > 	Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > > > 	Capabilities: [90] Express Root Port (Slot+), MSI 00
> > > > 	Capabilities: [e0] Power Management version 3
> > > > 	Capabilities: [100] Advanced Error Reporting
> > > > 	Capabilities: [150] Access Control Services
> > > > 	Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > > > 
> > > > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > > > 
> > > > # setpci -s 7.0 a0.w
> > > > 0040
> > > > 
> > > > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > > > 
> > > > # setpci -s 7.0 a0.w=60
> > > > 
> > > > # lspci... did it work?
> > > > 
> > > > Try doing that after the first boot to see if you can get back to a x16
> > > > link.  If that works, we may need to add something in the kernel to do
> > > > it automatically around a bus reset.  Thanks,
> > > > 
> > > 
> > > Well this doesn't help either and it looks like VFIO reset is setting it
> > > already back to original width. For example:
> > > 
> > >            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii XT [Radeon HD 8970]
> > >            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aac8
> > > 
> > > Before 1st run:
> > > 
> > > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > After power down of VM:
> > > 
> > > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > > 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > > 		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > After 2nd start once VFIO did reset:
> > > 
> > > root@homer:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> > > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > > root@homer:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> > > 		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > 
> > > The only difference on the bus I see here is ABWMgmt- vs ABWMgmt+, but it
> > > shouldn't be relevant here, as it is the same if I unload the fglrx module
> > > before shutting down the VM, which is the only case where I can run multiple
> > > VM reboot cycles.
> > > 
> > > So the only difference on bus is the following:
> > > 
> > > -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> > > +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> > > 
> > > 6a (before 02, after 11)
> > > 6b (before b1, after b0)
> > > 
> > > But I cannot write these parameters using setpci. My PCI express capability
> > > is offset 0x58 + 0x10 for link control which is already set back to 40
> > > 
> > > root@homer:~# lspci -vvv -s 00:02.0 | grep Capa
> > > 	Capabilities: [50] Power Management version 3
> > > 	Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
> > > 	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
> > > 	Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
> > > 	Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> > > 	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > 	Capabilities: [190 v1] Access Control Services
> > > 
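For reference, the two bytes that differ in the dump above (0x6a and 0x6b) form the 16-bit Link Status register, which sits at offset 0x12 in the PCI Express capability (0x58 + 0x12 = 0x6a on this root port). Its current-speed and negotiated-width fields are read-only in the PCIe specification, which is consistent with setpci being unable to change them. The following is a minimal sketch of decoding those bytes; the function and macro names are hypothetical and not taken from QEMU or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Link Status register layout (PCI Express capability offset 0x12). */
#define LNKSTA_SPEED_MASK  0x000f  /* bits 3:0, current link speed code */
#define LNKSTA_WIDTH_MASK  0x03f0  /* bits 9:4, negotiated link width   */
#define LNKSTA_WIDTH_SHIFT 4
#define LNKSTA_ABWMGMT     0x8000  /* bit 15, shown by lspci as ABWMgmt */

/* Config space is little-endian: the byte at 0x6a is the low byte. */
uint16_t lnksta_from_bytes(uint8_t lo, uint8_t hi)
{
    return (uint16_t)(lo | (hi << 8));
}

unsigned lnksta_speed_code(uint16_t lnksta)
{
    return lnksta & LNKSTA_SPEED_MASK;
}

unsigned lnksta_width(uint16_t lnksta)
{
    return (lnksta & LNKSTA_WIDTH_MASK) >> LNKSTA_WIDTH_SHIFT;
}

/* Speed codes defined for PCIe up to Gen3; anything else is left undecoded. */
const char *lnksta_speed_name(unsigned code)
{
    switch (code) {
    case 1:  return "2.5GT/s";
    case 2:  return "5GT/s";
    case 3:  return "8GT/s";
    default: return "unknown";
    }
}

void lnksta_print(const char *label, uint16_t lnksta)
{
    printf("%s: Speed %s, Width x%u, ABWMgmt%c\n", label,
           lnksta_speed_name(lnksta_speed_code(lnksta)),
           lnksta_width(lnksta),
           (lnksta & LNKSTA_ABWMGMT) ? '+' : '-');
}

/* Decode the two byte pairs quoted in the thread (6a/6b = 02 b1 and 11 b0). */
void lnksta_check_thread_values(void)
{
    uint16_t full = lnksta_from_bytes(0x02, 0xb1);
    uint16_t degraded = lnksta_from_bytes(0x11, 0xb0);

    assert(lnksta_speed_code(full) == 2 && lnksta_width(full) == 16);
    assert(lnksta_speed_code(degraded) == 1 && lnksta_width(degraded) == 1);
}
```

Calling lnksta_print("00:02.0", lnksta_from_bytes(0x11, 0xb0)) reports the same speed and width that lspci shows for the degraded link, so the hex difference carries no information beyond the LnkSta lines already quoted.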
> > 
> > Wouldn't it be a possible solution to do a D0 -> D3 -> D0 transition for
> > devices which don't support FLR? The setpci way doesn't help me at all.
> > 
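As background to the D0 -> D3 -> D0 question: a power-state cycle is driven through the PMCSR register at offset 4 of the Power Management capability, and it only acts as a reset when the device leaves the No_Soft_Reset bit clear. The sketch below shows just the register arithmetic, under the assumption that the caller reads and writes PMCSR by some other means; the names are hypothetical. Alex's reply further down notes that on AMD GPUs this kind of PM reset does not actually do anything, so this illustrates the mechanism rather than a fix.

```c
#include <stdint.h>

#define PM_CTRL_OFFSET        4       /* PMCSR offset inside the PM capability */
#define PM_CTRL_STATE_MASK    0x0003  /* bits 1:0, power state                 */
#define PM_CTRL_NO_SOFT_RESET 0x0008  /* bit 3, set = D3hot->D0 keeps state    */
#define PM_STATE_D0           0
#define PM_STATE_D3HOT        3

/* A D3hot->D0 transition resets the function only if No_Soft_Reset is clear. */
int pm_reset_applicable(uint16_t pmcsr)
{
    return (pmcsr & PM_CTRL_NO_SOFT_RESET) == 0;
}

/* Return the PMCSR value to write in order to request the given power state. */
uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned state)
{
    return (uint16_t)((pmcsr & ~PM_CTRL_STATE_MASK) |
                      (state & PM_CTRL_STATE_MASK));
}
```

A cycle would write pmcsr_with_state(v, PM_STATE_D3HOT), wait for the transition delay required by the PCI power management specification, then write pmcsr_with_state(v, PM_STATE_D0) and wait again.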
> 
> I want to renew the thread a bit as with latest slot/bus reset some
> things have changed but it still doesn't work in all cases.
> 
> #1 QEMU+OVMF (UEFI):
> 
> I've flashed my R9 290X with a UEFI-compatible BIOS and QEMU+OVMF
> (without CSM) boots Windows 8.1 fine. Catalyst 14.12 drivers can be
> installed without issues and work fine. However, an attempt to reboot the
> VM results in the typical Windows 8.1 "Something went wrong :(" screen. The
> suspend/resume trick still works between VM reboots.
> 
> #2 QEMU (BIOS):
> 
> In this scenario I use secondary GPU passthrough (no VGA as primary
> adapter) using Windows 7. Catalyst 14.12 drivers can be installed
> without issues and work fine. I was also surprised that an attempt to
> reboot the VM worked. Windows 7 restarts fine, I see the login
> screen and no performance issues. But it doesn't always work: sometimes
> it works for 3-4 reboots and the next one fails with just a black screen
> (but the Windows VM is pingable and ACPI shutdown still works), sometimes it
> works only for one reboot. In all cases the suspend/resume trick still
> works.
> 
> So I would like to narrow down the problem. Is there anything I can try,
> Alex, such as collecting QEMU debug logs?
> 
> Used QEMU version is 2.2.0, kernel is 3.18.2.

There's a small change queued for v3.20 that will exclude PM reset as
an option for AMD GPUs (because it doesn't do anything), but I don't
expect this will change anything for you.  It mostly just enables reset
on release for cards like my HD8570 that report they support PM reset.

Cards like your R9 290X (if I'm remembering correctly) and my R7790
simply don't seem to reset their internal components like they're
supposed to during a bus reset.  I've reached out to AMD developers
regarding this problem; it has theoretically been passed to the
appropriate teams, but I haven't heard of any progress or resolution.

Assignment as a secondary GPU requires driver support and while AMD
seems interested in supporting GPU assignment, I haven't seen any
evidence that they're willing to do anything to make it happen.
Guessing what might be wrong in case #2 is not fun, so it's not a very
interesting case unless AMD wants to make an effort there.  Have you
tried reporting the bug to AMD?  Perhaps you can install a VNC server in
the guest so you can interact and collect data in the failure case.
Case #1 is stuck at the reset problem, and again AMD isn't offering much
help there and I'm out of ideas short of dissecting datasheets for
various root ports to figure out if we can toggle power to the slot.
Thanks,

Alex

Patch

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 8db182f..7fec259 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2927,6 +2927,10 @@  static bool vfio_pci_host_match(PCIHostDeviceAddress *hos
             host1->slot == host2->slot && host1->function == host2->function);
 }
 
+#undef DPRINTF
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+
 static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
 {
     VFIOGroup *group;
@@ -3104,6 +3108,15 @@  out_single:
     return ret;
 }
 
+#undef DPRINTF
+#ifdef DEBUG_VFIO
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
 /*
  * We want to differentiate hot reset of mulitple in-use devices vs hot reset
  * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case