diff mbox

Hard and silent lock up since linux 3.14 with PCIe pass through (vfio)

Message ID 1414606581.27420.266.camel@ul30vt.home
State Not Applicable
Headers show

Commit Message

Alex Williamson Oct. 29, 2014, 6:16 p.m. UTC
On Wed, 2014-10-29 at 18:57 +0100, Andreas Hartmann wrote:
> Alex Williamson schrieb:
> > On Wed, 2014-10-29 at 17:47 +0100, Andreas Hartmann wrote:
> >> Alex Williamson wrote:
> >>> On Sat, 2014-10-25 at 08:03 +0200, Andreas Hartmann wrote:
> >>>>
> >>>> Out of interest:
> >>>> Bjorn's patch disables vc save/restore support - and the machine works
> >>>> fine again. Why is it needed at all if it seems to work perfectly w/o
> >>>> it? What's the additional benefit? Or in other words: What am I missing
> >>>> until today :-) ? What would be better? What could I do more?
> >>>
> >>>
> >>> You're right, in the configuration you have the endpoint device has a
> >>> Virtual Channel capability but the upstream root port does not.  The
> >>> spec is not at all clear about defining the endpoints for enabling
> >>> Virtual Channel in each type of configuration, but I think that if we
> >>> have an upstream port that does not support Virtual Channel, we can skip
> >>> the save/restore.  Please test the patch below.
> >>>
> >>> I'm also still completely confused about whether this is a VC
> >>> save/restore issue or a bus reset issue.  You originally bisected this
> >>> back to the VC save/restore patch, but you also found that a manual,
> >>> setpci-based bus reset triggered a system hang.
> >>
> >> With your additional patch posted here:
> >> http://article.gmane.org/gmane.linux.kernel.pci/36162
> > 
> > Right, a reset via sysfs also triggered it with that patch, but the
> > reset via setpci is independent of any VC save/restore and still hung
> > your box.
> > 
> >>
> >>>  I believe that
> >>> re-ordering the kernel reset mechanisms also triggered this.  Since
> >>> recent versions of QEMU are going to favor a bus reset over PM reset, I
> >>> don't have a lot of confidence that we're actually solving the problem
> >>> for you.  Please make sure to test with a recent QEMU to be sure we'll
> >>> do a bus reset.
> >>
> >> I'm running qemu 2.1.0 (newest is 2.1.2 - but this shouldn't be a
> >> problem) and tested w/ linux 3.17.
> > 
> > Yep, just want to make sure it's QEMU new enough to do a bus reset and
> > kernel with matching support.
> > 
> >>> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> >>> index 7e1304d..6d13d34 100644
> >>> --- a/drivers/pci/vc.c
> >>> +++ b/drivers/pci/vc.c
> >>> @@ -339,6 +339,25 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
> >>>  	return buf ? 0 : len;
> >>>  }
> >>>  
> >>> +/**
> >>> + * pci_vc_needs_save - Determine whether a VC capability needs to be saved
> >>> + * @dev: device
> >>> + * @id: VC capability ID (VC/VC9/MFVC)
> >>> + *
> >>> + * In configurations where we have a VC or MFVC capability, but the upstream
> >>> + * device does not, we assume that VC save (and therefore restore) is not
> >>> + * necessary.  The intention is to only do VC save/restore in configuration
> >>> + * where it's necessary and hopefully avoid reset issues.
> >>> + */
> >>> +static bool pci_vc_needs_save(struct pci_dev *dev, u16 id)
> >>> +{
> >>> +	if (id == PCI_EXT_CAP_ID_VC9 || pci_is_root_bus(dev->bus) ||
> >>> +	    pci_find_ext_capability(dev->bus->self, PCI_EXT_CAP_ID_VC))
> >>> +		return true;
> >>> +
> >>> +	return false;
> >>> +}
> >>> +
> >>>  static struct {
> >>>  	u16 id;
> >>>  	const char *name;
> >>> @@ -362,7 +381,7 @@ int pci_save_vc_state(struct pci_dev *dev)
> >>>  		struct pci_cap_saved_state *save_state;
> >>>  
> >>>  		pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>> -		if (!pos)
> >>> +		if (!posi || !pci_vc_needs_save(dev, vc_caps[i].id))
> >>                         ^
> >> This should be most probably !pos (and not !posi - because !posi does
> >> through a compile error).
> > 
> > Oops, sorry.
> > 
> >>>  			continue;
> >>>  
> >>>  		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> >>> @@ -422,7 +441,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
> >>>  	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >>>  		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>>  
> >>> -		if (!pos)
> >>> +		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
> >>>  			continue;
> >>>  
> >>>  		len = pci_vc_do_save_buffer(dev, pos, NULL, false);
> >>
> >> W/ the above patch, the machine hangs again (w/ qemu and setpci), but w/
> >> Bjorn's patch (and nothing more applied) which disables vc save/restore,
> >> the machine just works fine ... . I especially retested this case to be
> >> really sure. I'm so sorry. But that's how it behaves here :-(
> > 
> > Hmm, the intention was that this should effectively do the same thing as
> > Bjorn's patch.  The Atheros device (03:00.0) reports a VC capability but
> > the root port above it (00:05.0) does not.
> 
> Are you sure, that this patch really works (-> here!) as expected? Would
> it be possible to add some debug output printing to the actual console
> (not to log file) to be sure it really works as expected? Maybe some
> more output to get an idea what's actually going on? Or is it just a
> timing issue?

Sure, here's some added printks (and fixed posi).  You should be able to
run 'dmesg | grep pci_vc_needs_save' after boot and see device
0000:03:00.0.  Hopefully you won't see the pci_save_vc_state() printk as
you assign the device.




--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Hartmann Oct. 29, 2014, 7:43 p.m. UTC | #1
Alex Williamson wrote:
> On Wed, 2014-10-29 at 18:57 +0100, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Wed, 2014-10-29 at 17:47 +0100, Andreas Hartmann wrote:
>>>> Alex Williamson wrote:
>>>>> On Sat, 2014-10-25 at 08:03 +0200, Andreas Hartmann wrote:
>>>>>>
>>>>>> Out of interest:
>>>>>> Bjorn's patch disables vc save/restore support - and the machine works
>>>>>> fine again. Why is it needed at all if it seems to work perfectly w/o
>>>>>> it? What's the additional benefit? Or in other words: What am I missing
>>>>>> until today :-) ? What would be better? What could I do more?
>>>>>
>>>>>
>>>>> You're right, in the configuration you have the endpoint device has a
>>>>> Virtual Channel capability but the upstream root port does not.  The
>>>>> spec is not at all clear about defining the endpoints for enabling
>>>>> Virtual Channel in each type of configuration, but I think that if we
>>>>> have an upstream port that does not support Virtual Channel, we can skip
>>>>> the save/restore.  Please test the patch below.
>>>>>
>>>>> I'm also still completely confused about whether this is a VC
>>>>> save/restore issue or a bus reset issue.  You originally bisected this
>>>>> back to the VC save/restore patch, but you also found that a manual,
>>>>> setpci-based bus reset triggered a system hang.
>>>>
>>>> With your additional patch posted here:
>>>> http://article.gmane.org/gmane.linux.kernel.pci/36162
>>>
>>> Right, a reset via sysfs also triggered it with that patch, but the
>>> reset via setpci is independent of any VC save/restore and still hung
>>> your box.
>>>
>>>>
>>>>>  I believe that
>>>>> re-ordering the kernel reset mechanisms also triggered this.  Since
>>>>> recent versions of QEMU are going to favor a bus reset over PM reset, I
>>>>> don't have a lot of confidence that we're actually solving the problem
>>>>> for you.  Please make sure to test with a recent QEMU to be sure we'll
>>>>> do a bus reset.
>>>>
>>>> I'm running qemu 2.1.0 (newest is 2.1.2 - but this shouldn't be a
>>>> problem) and tested w/ linux 3.17.
>>>
>>> Yep, just want to make sure it's QEMU new enough to do a bus reset and
>>> kernel with matching support.
>>>
>>>>> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
>>>>> index 7e1304d..6d13d34 100644
>>>>> --- a/drivers/pci/vc.c
>>>>> +++ b/drivers/pci/vc.c
>>>>> @@ -339,6 +339,25 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>>>>>  	return buf ? 0 : len;
>>>>>  }
>>>>>  
>>>>> +/**
>>>>> + * pci_vc_needs_save - Determine whether a VC capability needs to be saved
>>>>> + * @dev: device
>>>>> + * @id: VC capability ID (VC/VC9/MFVC)
>>>>> + *
>>>>> + * In configurations where we have a VC or MFVC capability, but the upstream
>>>>> + * device does not, we assume that VC save (and therefore restore) is not
>>>>> + * necessary.  The intention is to only do VC save/restore in configuration
>>>>> + * where it's necessary and hopefully avoid reset issues.
>>>>> + */
>>>>> +static bool pci_vc_needs_save(struct pci_dev *dev, u16 id)
>>>>> +{
>>>>> +	if (id == PCI_EXT_CAP_ID_VC9 || pci_is_root_bus(dev->bus) ||
>>>>> +	    pci_find_ext_capability(dev->bus->self, PCI_EXT_CAP_ID_VC))
>>>>> +		return true;
>>>>> +
>>>>> +	return false;
>>>>> +}
>>>>> +
>>>>>  static struct {
>>>>>  	u16 id;
>>>>>  	const char *name;
>>>>> @@ -362,7 +381,7 @@ int pci_save_vc_state(struct pci_dev *dev)
>>>>>  		struct pci_cap_saved_state *save_state;
>>>>>  
>>>>>  		pos = pci_find_ext_capability(dev, vc_caps[i].id);
>>>>> -		if (!pos)
>>>>> +		if (!posi || !pci_vc_needs_save(dev, vc_caps[i].id))
>>>>                         ^
>>>> This should be most probably !pos (and not !posi - because !posi does
>>>> through a compile error).
>>>
>>> Oops, sorry.
>>>
>>>>>  			continue;
>>>>>  
>>>>>  		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
>>>>> @@ -422,7 +441,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
>>>>>  	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>>>>>  		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
>>>>>  
>>>>> -		if (!pos)
>>>>> +		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
>>>>>  			continue;
>>>>>  
>>>>>  		len = pci_vc_do_save_buffer(dev, pos, NULL, false);
>>>>
>>>> W/ the above patch, the machine hangs again (w/ qemu and setpci), but w/
>>>> Bjorn's patch (and nothing more applied) which disables vc save/restore,
>>>> the machine just works fine ... . I especially retested this case to be
>>>> really sure. I'm so sorry. But that's how it behaves here :-(
>>>
>>> Hmm, the intention was that this should effectively do the same thing as
>>> Bjorn's patch.  The Atheros device (03:00.0) reports a VC capability but
>>> the root port above it (00:05.0) does not.
>>
>> Are you sure, that this patch really works (-> here!) as expected? Would
>> it be possible to add some debug output printing to the actual console
>> (not to log file) to be sure it really works as expected? Maybe some
>> more output to get an idea what's actually going on? Or is it just a
>> timing issue?
> 
> Sure, here's some added printks (and fixed posi).  You should be able to
> run 'dmesg | grep pci_vc_needs_save' after boot and see device
> 0000:03:00.0.  Hopefully you won't see the pci_save_vc_state() printk as
> you assign the device.

[...]

I'm getting the expected output:

[    1.156857] pci_vc_needs_save(0000:03:00.0, 2) returning false
[    1.158866] pci_vc_needs_save(0000:04:00.0, 2) returning false

This is most probably triggered by pci_allocate_vc_save_buffers, true?
Therefore, I never should need pci_save_vc_state and
pci_restore_vc_state. Thus, it should be ok to add "return" at the
beginning of each of these function, true? Then it should work.

I tested it. It worked.

But if I'm removing only one of these returns either in
pci_save_vc_state or pci_restore_vc_state, the machine hangs again.

Therefore, there must be something odd going on in the for loops. Isn't
it possible to add some useful debug code to these loops to see what's
really going on? But the output *must* go to the actual console,
otherwise I can't see it!


int pci_save_vc_state(struct pci_dev *dev)
{
        return 0; // must be set
        int i;

        for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
                int pos, ret;
                struct pci_cap_saved_state *save_state;

                pos = pci_find_ext_capability(dev, vc_caps[i].id);
                if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
                        continue;

                save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
                if (!save_state) {
                        dev_err(&dev->dev, "%s buffer not found in %s\n",
                                vc_caps[i].name, __func__);
                        return -ENOMEM;
                }

                printk("%s doing %s save on %s\n", __func__, vc_caps[i].name, pci_name(dev));
                ret = pci_vc_do_save_buffer(dev, pos, save_state, true);
                if (ret) {
                        dev_err(&dev->dev, "%s save unsuccessful %s\n",
                                vc_caps[i].name, __func__);
                        return ret;
                }
        }

        return 0;
}


void pci_restore_vc_state(struct pci_dev *dev)
{
        return; // must be set
        int i;

        for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
                int pos;
                struct pci_cap_saved_state *save_state;

                pos = pci_find_ext_capability(dev, vc_caps[i].id);
                save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
                if (!save_state || !pos)
                        continue;

                printk("%s doing %s restore on %s\n", __func__, vc_caps[i].name, pci_name(dev));
                pci_vc_do_save_buffer(dev, pos, save_state, false);
        }
}



Thanks,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson Oct. 29, 2014, 8:50 p.m. UTC | #2
On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Wed, 2014-10-29 at 18:57 +0100, Andreas Hartmann wrote:
> >> Alex Williamson wrote:
> >>> On Wed, 2014-10-29 at 17:47 +0100, Andreas Hartmann wrote:
> >>>> Alex Williamson wrote:
> >>>>> On Sat, 2014-10-25 at 08:03 +0200, Andreas Hartmann wrote:
> >>>>>>
> >>>>>> Out of interest:
> >>>>>> Bjorn's patch disables vc save/restore support - and the machine works
> >>>>>> fine again. Why is it needed at all if it seems to work perfectly w/o
> >>>>>> it? What's the additional benefit? Or in other words: What am I missing
> >>>>>> until today :-) ? What would be better? What could I do more?
> >>>>>
> >>>>>
> >>>>> You're right, in the configuration you have the endpoint device has a
> >>>>> Virtual Channel capability but the upstream root port does not.  The
> >>>>> spec is not at all clear about defining the endpoints for enabling
> >>>>> Virtual Channel in each type of configuration, but I think that if we
> >>>>> have an upstream port that does not support Virtual Channel, we can skip
> >>>>> the save/restore.  Please test the patch below.
> >>>>>
> >>>>> I'm also still completely confused about whether this is a VC
> >>>>> save/restore issue or a bus reset issue.  You originally bisected this
> >>>>> back to the VC save/restore patch, but you also found that a manual,
> >>>>> setpci-based bus reset triggered a system hang.
> >>>>
> >>>> With your additional patch posted here:
> >>>> http://article.gmane.org/gmane.linux.kernel.pci/36162
> >>>
> >>> Right, a reset via sysfs also triggered it with that patch, but the
> >>> reset via setpci is independent of any VC save/restore and still hung
> >>> your box.
> >>>
> >>>>
> >>>>>  I believe that
> >>>>> re-ordering the kernel reset mechanisms also triggered this.  Since
> >>>>> recent versions of QEMU are going to favor a bus reset over PM reset, I
> >>>>> don't have a lot of confidence that we're actually solving the problem
> >>>>> for you.  Please make sure to test with a recent QEMU to be sure we'll
> >>>>> do a bus reset.
> >>>>
> >>>> I'm running qemu 2.1.0 (newest is 2.1.2 - but this shouldn't be a
> >>>> problem) and tested w/ linux 3.17.
> >>>
> >>> Yep, just want to make sure it's QEMU new enough to do a bus reset and
> >>> kernel with matching support.
> >>>
> >>>>> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> >>>>> index 7e1304d..6d13d34 100644
> >>>>> --- a/drivers/pci/vc.c
> >>>>> +++ b/drivers/pci/vc.c
> >>>>> @@ -339,6 +339,25 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
> >>>>>  	return buf ? 0 : len;
> >>>>>  }
> >>>>>  
> >>>>> +/**
> >>>>> + * pci_vc_needs_save - Determine whether a VC capability needs to be saved
> >>>>> + * @dev: device
> >>>>> + * @id: VC capability ID (VC/VC9/MFVC)
> >>>>> + *
> >>>>> + * In configurations where we have a VC or MFVC capability, but the upstream
> >>>>> + * device does not, we assume that VC save (and therefore restore) is not
> >>>>> + * necessary.  The intention is to only do VC save/restore in configuration
> >>>>> + * where it's necessary and hopefully avoid reset issues.
> >>>>> + */
> >>>>> +static bool pci_vc_needs_save(struct pci_dev *dev, u16 id)
> >>>>> +{
> >>>>> +	if (id == PCI_EXT_CAP_ID_VC9 || pci_is_root_bus(dev->bus) ||
> >>>>> +	    pci_find_ext_capability(dev->bus->self, PCI_EXT_CAP_ID_VC))
> >>>>> +		return true;
> >>>>> +
> >>>>> +	return false;
> >>>>> +}
> >>>>> +
> >>>>>  static struct {
> >>>>>  	u16 id;
> >>>>>  	const char *name;
> >>>>> @@ -362,7 +381,7 @@ int pci_save_vc_state(struct pci_dev *dev)
> >>>>>  		struct pci_cap_saved_state *save_state;
> >>>>>  
> >>>>>  		pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>>>> -		if (!pos)
> >>>>> +		if (!posi || !pci_vc_needs_save(dev, vc_caps[i].id))
> >>>>                         ^
> >>>> This should be most probably !pos (and not !posi - because !posi does
> >>>> through a compile error).
> >>>
> >>> Oops, sorry.
> >>>
> >>>>>  			continue;
> >>>>>  
> >>>>>  		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> >>>>> @@ -422,7 +441,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
> >>>>>  	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >>>>>  		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>>>>  
> >>>>> -		if (!pos)
> >>>>> +		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
> >>>>>  			continue;
> >>>>>  
> >>>>>  		len = pci_vc_do_save_buffer(dev, pos, NULL, false);
> >>>>
> >>>> W/ the above patch, the machine hangs again (w/ qemu and setpci), but w/
> >>>> Bjorn's patch (and nothing more applied) which disables vc save/restore,
> >>>> the machine just works fine ... . I especially retested this case to be
> >>>> really sure. I'm so sorry. But that's how it behaves here :-(
> >>>
> >>> Hmm, the intention was that this should effectively do the same thing as
> >>> Bjorn's patch.  The Atheros device (03:00.0) reports a VC capability but
> >>> the root port above it (00:05.0) does not.
> >>
> >> Are you sure, that this patch really works (-> here!) as expected? Would
> >> it be possible to add some debug output printing to the actual console
> >> (not to log file) to be sure it really works as expected? Maybe some
> >> more output to get an idea what's actually going on? Or is it just a
> >> timing issue?
> > 
> > Sure, here's some added printks (and fixed posi).  You should be able to
> > run 'dmesg | grep pci_vc_needs_save' after boot and see device
> > 0000:03:00.0.  Hopefully you won't see the pci_save_vc_state() printk as
> > you assign the device.
> 
> [...]
> 
> I'm getting the expected output:
> 
> [    1.156857] pci_vc_needs_save(0000:03:00.0, 2) returning false
> [    1.158866] pci_vc_needs_save(0000:04:00.0, 2) returning false
> 
> This is most probably triggered by pci_allocate_vc_save_buffers, true?

Yes, it will be done at device discovery.

> Therefore, I never should need pci_save_vc_state and
> pci_restore_vc_state. Thus, it should be ok to add "return" at the
> beginning of each of these function, true? Then it should work.
> 
> I tested it. It worked.
> 
> But if I'm removing only one of these returns either in
> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
> 
> Therefore, there must be something odd going on in the for loops. Isn't
> it possible to add some useful debug code to these loops to see what's
> really going on? But the output *must* go to the actual console,
> otherwise I can't see it!
> 
> 
> int pci_save_vc_state(struct pci_dev *dev)
> {
>         return 0; // must be set
>         int i;
> 
>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>                 int pos, ret;
>                 struct pci_cap_saved_state *save_state;
> 
>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
>                 if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
>                         continue;

Take the next logical step, comment out the if here and we'll statically
take the continue.  Does it still fail?  If so, move the continue above
the call to pci_find_ext_capability(), if not...

> 
>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);

If not, add a continue; here.  Unless my pci_vc_needs_save() function is
broken, we shouldn't be getting here anyway.

>                 if (!save_state) {
>                         dev_err(&dev->dev, "%s buffer not found in %s\n",
>                                 vc_caps[i].name, __func__);
>                         return -ENOMEM;
>                 }
> 
>                 printk("%s doing %s save on %s\n", __func__, vc_caps[i].name, pci_name(dev));
>                 ret = pci_vc_do_save_buffer(dev, pos, save_state, true);
>                 if (ret) {
>                         dev_err(&dev->dev, "%s save unsuccessful %s\n",
>                                 vc_caps[i].name, __func__);
>                         return ret;
>                 }
>         }
> 
>         return 0;
> }
> 
> 
> void pci_restore_vc_state(struct pci_dev *dev)
> {
>         return; // must be set
>         int i;
> 
>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>                 int pos;
>                 struct pci_cap_saved_state *save_state;
> 
>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);

This should never find a save_state with the pci_vc_needs_save() patch,
so we should always take the branch below.  Comment out the if (... and
leave the continue, does the behavior change?  If so, add a continue;
line above pci_find_saved_ext_cap(), does it work?  If not, add another
continue above pci_find_ext_capability().

>                 if (!save_state || !pos)
>                         continue;
> 
>                 printk("%s doing %s restore on %s\n", __func__, vc_caps[i].name, pci_name(dev));
>                 pci_vc_do_save_buffer(dev, pos, save_state, false);
>         }
> }

In the "working" case with Bjorn's patch, are you actually trying to use
the device or just testing to see if the system survives reset?  You
might at least want to run lspci -xxxx on it after reset to make sure
it's really there.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Hartmann Oct. 29, 2014, 9:35 p.m. UTC | #3
Alex Williamson wrote:

[...] more tomorrow

> In the "working" case with Bjorn's patch, are you actually trying to use
> the device or just testing to see if the system survives reset? 

The VM starts hostapd using this device. Hostapd didn't show any error
in the logfile using this device. Therefore I think it should have worked:

1414609514.733245: Using interface wlan1 with hwaddr 64:70:02:... and
ssid "..."

1414609514.766353: wlan1: interface state HT_SCAN->ENABLED
1414609514.766380: wlan1: AP-ENABLED


Regards,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Hartmann Oct. 30, 2014, 4:35 p.m. UTC | #4
Alex Williamson wrote:
> On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
[...]
>> Therefore, I never should need pci_save_vc_state and
>> pci_restore_vc_state. Thus, it should be ok to add "return" at the
>> beginning of each of these function, true? Then it should work.
>>
>> I tested it. It worked.
>>
>> But if I'm removing only one of these returns either in
>> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
>>
>> Therefore, there must be something odd going on in the for loops. Isn't
>> it possible to add some useful debug code to these loops to see what's
>> really going on? But the output *must* go to the actual console,
>> otherwise I can't see it!
>>
>>
>> int pci_save_vc_state(struct pci_dev *dev)
>> {
>>         return 0; // must be set
>>         int i;
>>
>>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
		   // continue; -> works
>>                 int pos, ret;
>>                 struct pci_cap_saved_state *save_state;
		   // continue does not work!

--> Most probably the

            struct pci_cap_saved_state *save_state;

    makes the system hang!  ARRAY_SIZE(vc_caps) is 3 and the whole
    function is called 3 times when starting the vm.

>>
>>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
>>                 if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
>>                         continue;
> 
> Take the next logical step, comment out the if here and we'll statically
> take the continue.  Does it still fail?  If so, move the continue above
> the call to pci_find_ext_capability(), if not...
> 
>>
>>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> 
> If not, add a continue; here.  Unless my pci_vc_needs_save() function is
> broken, we shouldn't be getting here anyway.
> 
>>                 if (!save_state) {
>>                         dev_err(&dev->dev, "%s buffer not found in %s\n",
>>                                 vc_caps[i].name, __func__);
>>                         return -ENOMEM;
>>                 }
>>
>>                 printk("%s doing %s save on %s\n", __func__, vc_caps[i].name, pci_name(dev));
>>                 ret = pci_vc_do_save_buffer(dev, pos, save_state, true);
>>                 if (ret) {
>>                         dev_err(&dev->dev, "%s save unsuccessful %s\n",
>>                                 vc_caps[i].name, __func__);
>>                         return ret;
>>                 }
>>         }
>>
>>         return 0;
>> }
>>
>>
>> void pci_restore_vc_state(struct pci_dev *dev)
>> {
>>         return; // must be set
>>         int i;
>>
>>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>>                 int pos;
>>                 struct pci_cap_saved_state *save_state;
>>
>>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
>>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> 
> This should never find a save_state with the pci_vc_needs_save() patch,
> so we should always take the branch below.  Comment out the if (... and
> leave the continue, does the behavior change?  If so, add a continue;
> line above pci_find_saved_ext_cap(), does it work?  If not, add another
> continue above pci_find_ext_capability().
> 
>>                 if (!save_state || !pos)
>>                         continue;
>>
>>                 printk("%s doing %s restore on %s\n", __func__, vc_caps[i].name, pci_name(dev));
>>                 pci_vc_do_save_buffer(dev, pos, save_state, false);
>>         }
>> }
> 
> In the "working" case with Bjorn's patch, are you actually trying to use
> the device or just testing to see if the system survives reset?  You
> might at least want to run lspci -xxxx on it after reset to make sure
> it's really there.  Thanks,
> 
> Alex
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson Oct. 30, 2014, 4:58 p.m. UTC | #5
On Thu, 2014-10-30 at 17:35 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
> [...]
> >> Therefore, I never should need pci_save_vc_state and
> >> pci_restore_vc_state. Thus, it should be ok to add "return" at the
> >> beginning of each of these function, true? Then it should work.
> >>
> >> I tested it. It worked.
> >>
> >> But if I'm removing only one of these returns either in
> >> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
> >>
> >> Therefore, there must be something odd going on in the for loops. Isn't
> >> it possible to add some useful debug code to these loops to see what's
> >> really going on? But the output *must* go to the actual console,
> >> otherwise I can't see it!
> >>
> >>
> >> int pci_save_vc_state(struct pci_dev *dev)
> >> {
> >>         return 0; // must be set
> >>         int i;
> >>
> >>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> 		   // continue; -> works
> >>                 int pos, ret;
> >>                 struct pci_cap_saved_state *save_state;
> 		   // continue does not work!
> 
> --> Most probably the
> 
>             struct pci_cap_saved_state *save_state;
> 
>     makes the system hang!

We've done nothing more than declare variables there, there's no actual
code.  What happens if you increase the delay after bus reset, edit
drivers/pci/pci.c, find the call to ssleep(1) and change the 1 to a 2,
doubling the delay after reset.  It seems like VC save/restore is just a
scapegoat for the platform already being broken by the bus reset.  Also,
if you have any other card to test in this slot, it would be useful
comparison data to know if we're dealing with an endpoint issue or a bus
issue.

>   ARRAY_SIZE(vc_caps) is 3 and the whole
>     function is called 3 times when starting the vm.

Sounds right.  The array is declared right above these functions and has
entries for VC, VC9, and MFVC types.  VFIO will try to reset the device
when it's initially opened and then QEMU does it twice (for some
reason), so that makes 3.  Thanks,

Alex

> >>
> >>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>                 if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
> >>                         continue;
> > 
> > Take the next logical step, comment out the if here and we'll statically
> > take the continue.  Does it still fail?  If so, move the continue above
> > the call to pci_find_ext_capability(), if not...
> > 
> >>
> >>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> > 
> > If not, add a continue; here.  Unless my pci_vc_needs_save() function is
> > broken, we shouldn't be getting here anyway.
> > 
> >>                 if (!save_state) {
> >>                         dev_err(&dev->dev, "%s buffer not found in %s\n",
> >>                                 vc_caps[i].name, __func__);
> >>                         return -ENOMEM;
> >>                 }
> >>
> >>                 printk("%s doing %s save on %s\n", __func__, vc_caps[i].name, pci_name(dev));
> >>                 ret = pci_vc_do_save_buffer(dev, pos, save_state, true);
> >>                 if (ret) {
> >>                         dev_err(&dev->dev, "%s save unsuccessful %s\n",
> >>                                 vc_caps[i].name, __func__);
> >>                         return ret;
> >>                 }
> >>         }
> >>
> >>         return 0;
> >> }
> >>
> >>
> >> void pci_restore_vc_state(struct pci_dev *dev)
> >> {
> >>         return; // must be set
> >>         int i;
> >>
> >>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >>                 int pos;
> >>                 struct pci_cap_saved_state *save_state;
> >>
> >>                 pos = pci_find_ext_capability(dev, vc_caps[i].id);
> >>                 save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
> > 
> > This should never find a save_state with the pci_vc_needs_save() patch,
> > so we should always take the branch below.  Comment out the if (... and
> > leave the continue, does the behavior change?  If so, add a continue;
> > line above pci_find_saved_ext_cap(), does it work?  If not, add another
> > continue above pci_find_ext_capability().
> > 
> >>                 if (!save_state || !pos)
> >>                         continue;
> >>
> >>                 printk("%s doing %s restore on %s\n", __func__, vc_caps[i].name, pci_name(dev));
> >>                 pci_vc_do_save_buffer(dev, pos, save_state, false);
> >>         }
> >> }
> > 
> > In the "working" case with Bjorn's patch, are you actually trying to use
> > the device or just testing to see if the system survives reset?  You
> > might at least want to run lspci -xxxx on it after reset to make sure
> > it's really there.  Thanks,
> > 
> > Alex
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Hartmann Oct. 30, 2014, 7:09 p.m. UTC | #6
Alex Williamson wrote:
> On Thu, 2014-10-30 at 17:35 +0100, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
>> [...]
>>>> Therefore, I never should need pci_save_vc_state and
>>>> pci_restore_vc_state. Thus, it should be ok to add "return" at the
>>>> beginning of each of these function, true? Then it should work.
>>>>
>>>> I tested it. It worked.
>>>>
>>>> But if I'm removing only one of these returns either in
>>>> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
>>>>
>>>> Therefore, there must be something odd going on in the for loops. Isn't
>>>> it possible to add some useful debug code to these loops to see what's
>>>> really going on? But the output *must* go to the actual console,
>>>> otherwise I can't see it!
>>>>
>>>>
>>>> int pci_save_vc_state(struct pci_dev *dev)
>>>> {
>>>>         return 0; // must be set
>>>>         int i;
>>>>
>>>>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
>> 		   // continue; -> works
>>>>                 int pos, ret;
>>>>                 struct pci_cap_saved_state *save_state;
>> 		   // continue does not work!
>>
>> --> Most probably the
>>
>>             struct pci_cap_saved_state *save_state;
>>
>>     makes the system hang!
> 
> We've done nothing more than declare variables there, there's no actual
> code.  What happens if you increase the delay after bus reset, edit
> drivers/pci/pci.c, find the call to ssleep(1) and change the 1 to a 2,
> doubling the delay after reset.

Same behaviour.

>  It seems like VC save/restore is just a
> scapegoat for the platform already being broken by the bus reset.  Also,
> if you have any other card to test in this slot, it would be useful
> comparison data to know if we're dealing with an endpoint issue or a bus
> issue.

I organized an Intel pcie card:

03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
        Subsystem: Intel Corporation Gigabit CT Desktop Adapter
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 17
        Region 0: Memory at fdbc0000 (32-bit, non-prefetchable) [disabled] [size=128K]
        Region 1: Memory at fdb00000 (32-bit, non-prefetchable) [disabled] [size=512K]
        Region 2: I/O ports at cf00 [disabled] [size=32]
        Region 3: Memory at fdbfc000 (32-bit, non-prefetchable) [disabled] [size=16K]
        [virtual] Expansion ROM at fdb80000 [disabled] [size=256K]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable+ DSel=0 DScale=1 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [e0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-cf-8f-57
        Kernel driver in use: vfio-pci


and tested with the same kernel, which hangs w/ atheros card. It just
worked. Not just once, but each of the tests I did. I retested w/
atheros -> hang. Tested again with intel-card -> works. Back to
atheros -> hang.

Seems to be really a problem w/ the atheros card, which is triggered by
new vc save/restore.

Well, but what to do now? I know how to "fix" it. But this means I have
to compile my kernels again on my own if it is >= 3.14.


Thanks,
kind regards,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson Oct. 30, 2014, 7:45 p.m. UTC | #7
On Thu, 2014-10-30 at 20:09 +0100, Andreas Hartmann wrote:
> Alex Williamson wrote:
> > On Thu, 2014-10-30 at 17:35 +0100, Andreas Hartmann wrote:
> >> Alex Williamson wrote:
> >>> On Wed, 2014-10-29 at 20:43 +0100, Andreas Hartmann wrote:
> >> [...]
> >>>> Therefore, I never should need pci_save_vc_state and
> >>>> pci_restore_vc_state. Thus, it should be ok to add "return" at the
> >>>> beginning of each of these function, true? Then it should work.
> >>>>
> >>>> I tested it. It worked.
> >>>>
> >>>> But if I'm removing only one of these returns either in
> >>>> pci_save_vc_state or pci_restore_vc_state, the machine hangs again.
> >>>>
> >>>> Therefore, there must be something odd going on in the for loops. Isn't
> >>>> it possible to add some useful debug code to these loops to see what's
> >>>> really going on? But the output *must* go to the actual console,
> >>>> otherwise I can't see it!
> >>>>
> >>>>
> >>>> int pci_save_vc_state(struct pci_dev *dev)
> >>>> {
> >>>>         return 0; // must be set
> >>>>         int i;
> >>>>
> >>>>         for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
> >> 		   // continue; -> works
> >>>>                 int pos, ret;
> >>>>                 struct pci_cap_saved_state *save_state;
> >> 		   // continue does not work!
> >>
> >> --> Most probably the
> >>
> >>             struct pci_cap_saved_state *save_state;
> >>
> >>     makes the system hang!
> > 
> > We've done nothing more than declare variables there, there's no actual
> > code.  What happens if you increase the delay after bus reset, edit
> > drivers/pci/pci.c, find the call to ssleep(1) and change the 1 to a 2,
> > doubling the delay after reset.
> 
> Same behaviour.
> 
> >  It seems like VC save/restore is just a
> > scapegoat for the platform already being broken by the bus reset.  Also,
> > if you have any other card to test in this slot, it would be useful
> > comparison data to know if we're dealing with an endpoint issue or a bus
> > issue.
> 
> I organized an Intel pcie card:
> 
> 03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
>         Subsystem: Intel Corporation Gigabit CT Desktop Adapter
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 17
>         Region 0: Memory at fdbc0000 (32-bit, non-prefetchable) [disabled] [size=128K]
>         Region 1: Memory at fdb00000 (32-bit, non-prefetchable) [disabled] [size=512K]
>         Region 2: I/O ports at cf00 [disabled] [size=32]
>         Region 3: Memory at fdbfc000 (32-bit, non-prefetchable) [disabled] [size=16K]
>         [virtual] Expansion ROM at fdb80000 [disabled] [size=256K]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable+ DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
>                 Address: 0000000000000000  Data: 0000
>         Capabilities: [e0] Express (v1) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
>                 LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>         Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Capabilities: [100 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
>         Capabilities: [140 v1] Device Serial Number 00-1b-21-ff-ff-cf-8f-57
>         Kernel driver in use: vfio-pci
> 
> 
> and tested with the same kernel, which hangs w/ atheros card. It just
> worked. Not just once, but each of the tests I did. I retested w/
> atheros -> hang. Tested again with intel-card -> works. Back to
> atheros -> hang.

Thanks for the test.

> Seems to be really a problem w/ the atheros card, which is triggered by
> new vc save/restore.

It seems more like the bus reset, not the VC save/restore.  As far a
interacting with hardware is concerned, there's no difference between
the two cases where you found one continue works and the other doesn't.

> Well, but what to do now? I know how to "fix" it. But this means I have
> to compile my kernels again on my own if it is >= 3.14.

Let's not give up hope just yet, I'd like to try another bus reset
mechanism with setpci.  Install the Atheros card and bind it to
pci-stub, then do:

setpci -s 00:05.0 68.w=0010:0010
sleep 0.1
setpci -s 00:05.0 68.w=0000:0010
sleep 1
lspci -xxx -s 3:00.0

This uses the link disable control rather than the secondary bus reset.
Typically the results between the two are the same, but maybe we'll get
lucky.  The BIOS manages to reset the bus with this device installed
somehow, so there must be a mechanism to do it.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Hartmann Oct. 30, 2014, 8:21 p.m. UTC | #8
Alex Williamson wrote:
> On Thu, 2014-10-30 at 20:09 +0100, Andreas Hartmann wrote:
[...]
>> Well, but what to do now? I know how to "fix" it. But this means I have
>> to compile my kernels again on my own if it is >= 3.14.
> 
> Let's not give up hope just yet, I'd like to try another bus reset
> mechanism with setpci.  Install the Atheros card and bind it to
> pci-stub, then do:
> 
> setpci -s 00:05.0 68.w=0010:0010
> sleep 0.1
> setpci -s 00:05.0 68.w=0000:0010
> sleep 1
> lspci -xxx -s 3:00.0

Exactly same -> hang :-(.


Thanks,
Regards,
Andreas
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index 7e1304d..300e126 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -339,6 +339,26 @@  static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 	return buf ? 0 : len;
 }
 
+/**
+ * pci_vc_needs_save - Determine whether a VC capability needs to be saved
+ * @dev: device
+ * @id: VC capability ID (VC/VC9/MFVC)
+ *
+ * In configurations where we have a VC or MFVC capability, but the upstream
+ * device does not, we assume that VC save (and therefore restore) is not
+ * necessary.  The intention is to only do VC save/restore in configuration
+ * where it's necessary and hopefully avoid reset issues.
+ */
+static bool pci_vc_needs_save(struct pci_dev *dev, u16 id)
+{
+	if (id == PCI_EXT_CAP_ID_VC9 || pci_is_root_bus(dev->bus) ||
+	    pci_find_ext_capability(dev->bus->self, PCI_EXT_CAP_ID_VC))
+		return true;
+
+	printk("%s(%s, %x) returning false\n", __func__, pci_name(dev), id);
+	return false;
+}
+
 static struct {
 	u16 id;
 	const char *name;
@@ -362,7 +382,7 @@  int pci_save_vc_state(struct pci_dev *dev)
 		struct pci_cap_saved_state *save_state;
 
 		pos = pci_find_ext_capability(dev, vc_caps[i].id);
-		if (!pos)
+		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
 			continue;
 
 		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
@@ -372,6 +392,7 @@  int pci_save_vc_state(struct pci_dev *dev)
 			return -ENOMEM;
 		}
 
+	I	printk("%s doing %s save on %s\n", __func__, vc_caps[i].name, pci_name(dev));
 		ret = pci_vc_do_save_buffer(dev, pos, save_state, true);
 		if (ret) {
 			dev_err(&dev->dev, "%s save unsuccessful %s\n",
@@ -403,6 +424,7 @@  void pci_restore_vc_state(struct pci_dev *dev)
 		if (!save_state || !pos)
 			continue;
 
+	I	printk("%s doing %s restore on %s\n", __func__, vc_caps[i].name, pci_name(dev));
 		pci_vc_do_save_buffer(dev, pos, save_state, false);
 	}
 }
@@ -422,7 +444,7 @@  void pci_allocate_vc_save_buffers(struct pci_dev *dev)
 	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
 		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
 
-		if (!pos)
+		if (!pos || !pci_vc_needs_save(dev, vc_caps[i].id))
 			continue;
 
 		len = pci_vc_do_save_buffer(dev, pos, NULL, false);