diff mbox series

[RFC,5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

Message ID 20180817044902.31420-6-benh@kernel.crashing.org
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series pci: Rework is_added race fix and address bridge enable races | expand

Commit Message

Benjamin Herrenschmidt Aug. 17, 2018, 4:49 a.m. UTC
This protects enable/disable operations using the state mutex to
avoid races with, for example, concurrent enables on a bridge.

The bus hierarchy is walked first before taking the lock to
avoid lock nesting (though it would be ok if we ensured that
we always nest them bottom-up, it is better to just avoid the
issue alltogether, especially as we might find cases where
we want to take it top-down later).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 drivers/pci/pci.c | 51 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 16 deletions(-)

Comments

Marta Rybczynska Aug. 17, 2018, 8:09 a.m. UTC | #1
----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt benh@kernel.crashing.org wrote:

> This protects enable/disable operations using the state mutex to
> avoid races with, for example, concurrent enables on a bridge.
> 
> The bus hierarchy is walked first before taking the lock to
> avoid lock nesting (though it would be ok if we ensured that
> we always nest them bottom-up, it is better to just avoid the
> issue alltogether, especially as we might find cases where
> we want to take it top-down later).
> 
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>


> 
> static void pci_enable_bridge(struct pci_dev *dev)
> {
> 	struct pci_dev *bridge;
> -	int retval;
> +	int retval, enabled;
> 
> 	bridge = pci_upstream_bridge(dev);
> 	if (bridge)
> 		pci_enable_bridge(bridge);
> 
> -	if (pci_is_enabled(dev)) {
> -		if (!dev->is_busmaster)
> -			pci_set_master(dev);
> +	/* Already enabled ? */
> +	pci_dev_state_lock(dev);
> +	enabled = pci_is_enabled(dev);
> +	if (enabled && !dev->is_busmaster)
> +		pci_set_master(dev);
> +	pci_dev_state_unlock(dev);
> +	if (enabled)
> 		return;
> -	}
> 

This looks complicated too me especially with the double locking. What do you
think about a function doing that check that used an unlocked version of
pcie_set_master?

Like:

        dev_state_lock(dev);
        enabled = pci_is_enabled(dev);
        if (enabled &&  !dev->is_busmaster)
                pci_set_master_unlocked();
        pci_dev_state_unlock(dev);

BTW If I remember correctly the code today can set bus mastering multiple
times without checking if already done.

Marta
Benjamin Herrenschmidt Aug. 17, 2018, 8:30 a.m. UTC | #2
On Fri, 2018-08-17 at 10:09 +0200, Marta Rybczynska wrote:
> 
> ----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt benh@kernel.crashing.org wrote:
> 
> > This protects enable/disable operations using the state mutex to
> > avoid races with, for example, concurrent enables on a bridge.
> > 
> > The bus hierarchy is walked first before taking the lock to
> > avoid lock nesting (though it would be ok if we ensured that
> > we always nest them bottom-up, it is better to just avoid the
> > issue alltogether, especially as we might find cases where
> > we want to take it top-down later).
> > 
> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> 
> > 
> > static void pci_enable_bridge(struct pci_dev *dev)
> > {
> > 	struct pci_dev *bridge;
> > -	int retval;
> > +	int retval, enabled;
> > 
> > 	bridge = pci_upstream_bridge(dev);
> > 	if (bridge)
> > 		pci_enable_bridge(bridge);
> > 
> > -	if (pci_is_enabled(dev)) {
> > -		if (!dev->is_busmaster)
> > -			pci_set_master(dev);
> > +	/* Already enabled ? */
> > +	pci_dev_state_lock(dev);
> > +	enabled = pci_is_enabled(dev);
> > +	if (enabled && !dev->is_busmaster)
> > +		pci_set_master(dev);
> > +	pci_dev_state_unlock(dev);
> > +	if (enabled)
> > 		return;
> > -	}
> > 
> 
> This looks complicated too me especially with the double locking. What do you
> think about a function doing that check that used an unlocked version of
> pcie_set_master?
> 
> Like:
> 
>         dev_state_lock(dev);
>         enabled = pci_is_enabled(dev);
>         if (enabled &&  !dev->is_busmaster)
>                 pci_set_master_unlocked();
>         pci_dev_state_unlock(dev);
> 
> BTW If I remember correctly the code today can set bus mastering multiple
> times without checking if already done.

I don't mind but I tend to dislike all those _unlocked() suffixes, I
suppose my generation is typing adverse enough that we call these
__something instead :)

As for setting multiple times, yes pci_set_master() doesn't check but
look at the "-" hunks of my patch, I'm not changing the existing test
here. Not that it matters much, it's an optimization.

In fact my original version just completely removed the whole lot
and just unconditionally did pci_enable_device() + pci_set_master(),
just ignoring the existing state :-)

But I decided to keep the patch functionally equivalent so I added it
back.

Cheers,
Ben.
Hari Vyas Aug. 17, 2018, 9 a.m. UTC | #3
On Fri, Aug 17, 2018 at 2:00 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2018-08-17 at 10:09 +0200, Marta Rybczynska wrote:
>>
>> ----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt benh@kernel.crashing.org wrote:
>>
>> > This protects enable/disable operations using the state mutex to
>> > avoid races with, for example, concurrent enables on a bridge.
>> >
>> > The bus hierarchy is walked first before taking the lock to
>> > avoid lock nesting (though it would be ok if we ensured that
>> > we always nest them bottom-up, it is better to just avoid the
>> > issue alltogether, especially as we might find cases where
>> > we want to take it top-down later).
>> >
>> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>
>>
>> >
>> > static void pci_enable_bridge(struct pci_dev *dev)
>> > {
>> >     struct pci_dev *bridge;
>> > -   int retval;
>> > +   int retval, enabled;
>> >
>> >     bridge = pci_upstream_bridge(dev);
>> >     if (bridge)
>> >             pci_enable_bridge(bridge);
>> >
>> > -   if (pci_is_enabled(dev)) {
>> > -           if (!dev->is_busmaster)
>> > -                   pci_set_master(dev);
>> > +   /* Already enabled ? */
>> > +   pci_dev_state_lock(dev);
>> > +   enabled = pci_is_enabled(dev);
>> > +   if (enabled && !dev->is_busmaster)
>> > +           pci_set_master(dev);
>> > +   pci_dev_state_unlock(dev);
>> > +   if (enabled)
>> >             return;
>> > -   }
>> >
>>
>> This looks complicated too me especially with the double locking. What do you
>> think about a function doing that check that used an unlocked version of
>> pcie_set_master?
>>
>> Like:
>>
>>         dev_state_lock(dev);
>>         enabled = pci_is_enabled(dev);
>>         if (enabled &&  !dev->is_busmaster)
>>                 pci_set_master_unlocked();
>>         pci_dev_state_unlock(dev);
>>
>> BTW If I remember correctly the code today can set bus mastering multiple
>> times without checking if already done.
>
> I don't mind but I tend to dislike all those _unlocked() suffixes, I
> suppose my generation is typing adverse enough that we call these
> __something instead :)
>
> As for setting multiple times, yes pci_set_master() doesn't check but
> look at the "-" hunks of my patch, I'm not changing the existing test
> here. Not that it matters much, it's an optimization.
>
> In fact my original version just completely removed the whole lot
> and just unconditionally did pci_enable_device() + pci_set_master(),
> just ignoring the existing state :-)
>
> But I decided to keep the patch functionally equivalent so I added it
> back.
>
> Cheers,
> Ben.
>
>
So many mail threads for common issues going so just trying to
summarize concern from my side.
1) HW level
PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
at lower layer so from my
perspective, it is best to handle concern at lower level with locking
mechanism and that is what
I proposed in my upcoming patch. Without that, I guess, we may end up
in issues later with some weird scenario
which may not be known today  and we will again be fine tuning stuff
here and there.
2) SW level(internal data structure):
About is_added,is_busmaster: These all are bit fields so infact I too
suggested to remove those bit fields and
make separate variables in pci_dev structure. This will avoid all
data-overwritten,concurrency issues but ofcourse
at the level of space cost.
Other possibility will be either to use atomic or locking mechanism
for these. My point here is also same, better
avoid fine tuning later stage.
Moving is_added up and down doesn't look like good as we are just
shifting issue up and down.

Please check and decide accordingly. I will hold my to-be-submitted
modify() patch about handling hw level
over-written case with locking around read-write operation.

Regards,
hari
Benjamin Herrenschmidt Aug. 17, 2018, 9:39 a.m. UTC | #4
On Fri, 2018-08-17 at 14:30 +0530, Hari Vyas wrote:
> So many mail threads for common issues going so just trying to
> summarize concern from my side.
> 1) HW level
> PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
> at lower layer so from my
> perspective, it is best to handle concern at lower level with locking
> mechanism and that is what
> I proposed in my upcoming patch. Without that, I guess, we may end up
> in issues later with some weird scenario
> which may not be known today  and we will again be fine tuning stuff
> here and there.

Maybe... at this point I'm not trying to address that specific issue.
That being said, the PCI_COMMAND register should be under control of
the driver at runtime and thus shouldn't be exposed to races like that
except in the usual case of races in configuring bridges. Those races
definitely need to be handled at the higher level.

So I'm rather dubious of adding a whole new layer of "modify" callbacks
to config space accessors for that, especially since they won't do any
good on architectures with lockless config accesses such as ... x86 

> 2) SW level(internal data structure):
> About is_added,is_busmaster: These all are bit fields so infact I too
> suggested to remove those bit fields and
> make separate variables in pci_dev structure. 
> This will avoid all data-overwritten,concurrency issues but ofcourse
> at the level of space cost.

It's unnecessary to do blanket changes without first understanding the
details of what's going on. A lot of these things are never touched
outside of the overall single threaded environment of
discovery/assignment or under driver control, in which case it's
expectd that the driver doesn't call them in racy ways

That said, I'm happy to move some of them under my proposed
dev_state_lock.

For is_added specifically, the field is simply set at the wrong time as
you can see in my previous patch.

> Other possibility will be either to use atomic

Atomic is almost always wrong. It doesn't synchronize anything other
than the field, so either it's a big waste of it gives a false sense of
security for something that's almost always still racy.

I'd rather keep the indication that a bunch of those fields *are*
unprotected and rely on the outer layers being single threaded.

>  or locking mechanism
> for these. My point here is also same, better
> avoid fine tuning later stage.
> Moving is_added up and down doesn't look like good as we are just
> shifting issue up and down.

No, you are wrong. I also originally covered is_added with my new mutex
but in hindsight it's the wrong thing to do, and went back to the
correct fix at this point which is precisely to move it up.

is_added is essentially owned by the part of the PCI probe code that
runs single threaded before the device gets exposed to the device
model.

Pretty much all of the code, including all the corresponding fields
(struct resources etc...) in pci_dev are unprotected and rely on being
accessed in a single threaded manner. is_added is no exception.

It was simply a mistake to set it after device_attach().

Thus moving it back up fixes it *correctly* by making it follow the
same rules as all the related fields.

That said, if we want to start adding protection to all those other
fields, then this is a different matter alltogether. But at this point,
this is the best fix for the problem at hand.

> Please check and decide accordingly. I will hold my to-be-submitted
> modify() patch about handling hw level
> over-written case with locking around read-write operation.

Can you remind us in this thread which specific cases of RMW races of
config space you were trying to address ?

Cheers,
Ben.
Hari Vyas Aug. 17, 2018, 10:10 a.m. UTC | #5
On Fri, Aug 17, 2018 at 3:09 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Fri, 2018-08-17 at 14:30 +0530, Hari Vyas wrote:
>> So many mail threads for common issues going so just trying to
>> summarize concern from my side.
>> 1) HW level
>> PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
>> at lower layer so from my
>> perspective, it is best to handle concern at lower level with locking
>> mechanism and that is what
>> I proposed in my upcoming patch. Without that, I guess, we may end up
>> in issues later with some weird scenario
>> which may not be known today  and we will again be fine tuning stuff
>> here and there.
>
> Maybe... at this point I'm not trying to address that specific issue.
> That being said, the PCI_COMMAND register should be under control of
> the driver at runtime and thus shouldn't be exposed to races like that
> except in the usual case of races in configuring bridges. Those races
> definitely need to be handled at the higher level.
>
> So I'm rather dubious of adding a whole new layer of "modify" callbacks
> to config space accessors for that, especially since they won't do any
> good on architectures with lockless config accesses such as ... x86
>
modify() is optional.  host/controller layers may override it.
It was just a proposal to avoid race issues which happens in SMP environment and
solves other issues. I agree, nothing great anyhow can be done in
lockless config

>> 2) SW level(internal data structure):
>> About is_added,is_busmaster: These all are bit fields so infact I too
>> suggested to remove those bit fields and
>> make separate variables in pci_dev structure.
>> This will avoid all data-overwritten,concurrency issues but ofcourse
>> at the level of space cost.
>
> It's unnecessary to do blanket changes without first understanding the
> details of what's going on. A lot of these things are never touched
> outside of the overall single threaded environment of
> discovery/assignment or under driver control, in which case it's
> expectd that the driver doesn't call them in racy ways
>
> That said, I'm happy to move some of them under my proposed
> dev_state_lock.
>
> For is_added specifically, the field is simply set at the wrong time as
> you can see in my previous patch.
>
Issue needs to be addressed and that is our goal.
Some times simple mistakes need lot of debugging which happened in my case and
my suggestion is to just avoid. SMP related issues are popping up now
so we just need
to be careful.
>> Other possibility will be either to use atomic
>
> Atomic is almost always wrong. It doesn't synchronize anything other
> than the field, so either it's a big waste of it gives a false sense of
> security for something that's almost always still racy.
>
> I'd rather keep the indication that a bunch of those fields *are*
> unprotected and rely on the outer layers being single threaded.
>
>>  or locking mechanism
>> for these. My point here is also same, better
>> avoid fine tuning later stage.
>> Moving is_added up and down doesn't look like good as we are just
>> shifting issue up and down.
>
> No, you are wrong. I also originally covered is_added with my new mutex
> but in hindsight it's the wrong thing to do, and went back to the
> correct fix at this point which is precisely to move it up.
>
> is_added is essentially owned by the part of the PCI probe code that
> runs single threaded before the device gets exposed to the device
> model.
>
> Pretty much all of the code, including all the corresponding fields
> (struct resources etc...) in pci_dev are unprotected and rely on being
> accessed in a single threaded manner. is_added is no exception.
>
> It was simply a mistake to set it after device_attach().
>
> Thus moving it back up fixes it *correctly* by making it follow the
> same rules as all the related fields.
>
> That said, if we want to start adding protection to all those other
> fields, then this is a different matter alltogether. But at this point,
> this is the best fix for the problem at hand.
>
>> Please check and decide accordingly. I will hold my to-be-submitted
>> modify() patch about handling hw level
>> over-written case with locking around read-write operation.
>
> Can you remind us in this thread which specific cases of RMW races of
> config space you were trying to address ?
>
Same pci bridge master, memory bit setting concern only (which my
colleague Srinath figured
out after lot of effort some time back) where only one bit in
PCI_COMMAND  was getting set.
(Bug 200793 - PCI: read-write config operation doesn't look like SMP safe)
My approach is to handle with modify operations at lower level so bits
are not over-written or lost.


As stated earlier, issue should be just resolved in better way. No
issue in going with majority
Regards,
hari
> Cheers,
> Ben.
>
>
Benjamin Herrenschmidt Aug. 17, 2018, 10:24 a.m. UTC | #6
On Fri, 2018-08-17 at 15:40 +0530, Hari Vyas wrote:
> 
> > So I'm rather dubious of adding a whole new layer of "modify" callbacks
> > to config space accessors for that, especially since they won't do any
> > good on architectures with lockless config accesses such as ... x86
> > 
> 
> modify() is optional.  host/controller layers may override it.
> It was just a proposal to avoid race issues which happens in SMP environment and
> solves other issues. I agree, nothing great anyhow can be done in
> lockless config

Right so let's not go down the path of creating low level accessors
providing a semantic that cannot actually be provided by some
architectures :-)

> > > 2) SW level(internal data structure):
> > > About is_added,is_busmaster: These all are bit fields so infact I too
> > > suggested to remove those bit fields and
> > > make separate variables in pci_dev structure.
> > > This will avoid all data-overwritten,concurrency issues but ofcourse
> > > at the level of space cost.
> > 
> > It's unnecessary to do blanket changes without first understanding the
> > details of what's going on. A lot of these things are never touched
> > outside of the overall single threaded environment of
> > discovery/assignment or under driver control, in which case it's
> > expectd that the driver doesn't call them in racy ways
> > 
> > That said, I'm happy to move some of them under my proposed
> > dev_state_lock.
> > 
> > For is_added specifically, the field is simply set at the wrong time as
> > you can see in my previous patch.
> > 
> Issue needs to be addressed and that is our goal.

Yes.

> Some times simple mistakes need lot of debugging which happened in my case and
> my suggestion is to just avoid. SMP related issues are popping up now
> so we just need to be careful.

Oh I agree, and I took me a while to re-debug independently some of the
issues on my side too :-) I should be clearer that I very much do
appreciate your debugging work and finding those problems !

I might disagree on the solution but your effort is very valuable and
I'm sure we can converge on a solution that works for everybody.

 .../...

> > Can you remind us in this thread which specific cases of RMW races of
> > config space you were trying to address ?
> > 
> 
> Same pci bridge master, memory bit setting concern only (which my
> colleague Srinath figured out after lot of effort some time back) where only one bit in
> PCI_COMMAND  was getting set.
> (Bug 200793 - PCI: read-write config operation doesn't look like SMP safe)
> My approach is to handle with modify operations at lower level so bits
> are not over-written or lost.

Right so those are the same old races with pci_set_master() on the
bridge upward chain ?

If yes then this should be addressed by my patches but there are still
some debate about whether to add that new mutex or try to use an
existing one, so let's see what Bjorn has to say.

I think the set-master and enable/disable issues are intimately related
and should be solved together by some higher level locking.

This is a typical case where the "atomic" enable_cnt provided people
with a false sense of safety while the code in practice was still
completely racy.

> As stated earlier, issue should be just resolved in better way. No
> issue in going with majority

Hehe, ok. I think the is_added fix is simply to move it up since it
matches the rest of the code in that area, but let's see what Bjorn has
to say in that regard.

As for the enable/disable/set_master races, Ive wrote my arguments in
favor of a dedicated lock, let's see what others have to respond.

Cheers,
Ben.

> Regards,
> hari
> > Cheers,
> > Ben.
> > 
> >
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 62591c153999..68152de2b5a0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1540,26 +1540,33 @@  static int do_pci_enable_device(struct pci_dev *dev, int bars)
  */
 int pci_reenable_device(struct pci_dev *dev)
 {
+	int rc = 0;
+
+	pci_dev_state_lock(dev);
 	if (pci_is_enabled(dev))
-		return do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
-	return 0;
+		rc = do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
+	pci_dev_state_unlock(dev);
+	return rc;
 }
 EXPORT_SYMBOL(pci_reenable_device);
 
 static void pci_enable_bridge(struct pci_dev *dev)
 {
 	struct pci_dev *bridge;
-	int retval;
+	int retval, enabled;
 
 	bridge = pci_upstream_bridge(dev);
 	if (bridge)
 		pci_enable_bridge(bridge);
 
-	if (pci_is_enabled(dev)) {
-		if (!dev->is_busmaster)
-			pci_set_master(dev);
+	/* Already enabled ? */
+	pci_dev_state_lock(dev);
+	enabled = pci_is_enabled(dev);
+	if (enabled && !dev->is_busmaster)
+		pci_set_master(dev);
+	pci_dev_state_unlock(dev);
+	if (enabled)
 		return;
-	}
 
 	retval = pci_enable_device(dev);
 	if (retval)
@@ -1571,9 +1578,16 @@  static void pci_enable_bridge(struct pci_dev *dev)
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
 	struct pci_dev *bridge;
-	int err;
+	int err = 0;
 	int i, bars = 0;
 
+	/* Handle upstream bridges first to avoid locking issues */
+	bridge = pci_upstream_bridge(dev);
+	if (bridge)
+		pci_enable_bridge(bridge);
+
+	pci_dev_state_lock(dev);
+
 	/*
 	 * Power state could be unknown at this point, either due to a fresh
 	 * boot or a device removal call.  So get the current power state
@@ -1586,12 +1600,9 @@  static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
 	}
 
+	/* Already enabled ? */
 	if (atomic_inc_return(&dev->enable_cnt) > 1)
-		return 0;		/* already enabled */
-
-	bridge = pci_upstream_bridge(dev);
-	if (bridge)
-		pci_enable_bridge(bridge);
+		goto bail;
 
 	/* only skip sriov related */
 	for (i = 0; i <= PCI_ROM_RESOURCE; i++)
@@ -1604,6 +1615,8 @@  static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 	err = do_pci_enable_device(dev, bars);
 	if (err < 0)
 		atomic_dec(&dev->enable_cnt);
+ bail:
+	pci_dev_state_unlock(dev);
 	return err;
 }
 
@@ -1820,12 +1833,16 @@  static void do_pci_disable_device(struct pci_dev *dev)
  * @dev: PCI device to disable
  *
  * NOTE: This function is a backend of PCI power management routines and is
- * not supposed to be called drivers.
+ * not supposed to be called drivers. It will keep enable_cnt and is_busmaster
+ * unmodified so that the resume code knows how to restore the corresponding
+ * command register bits.
  */
 void pci_disable_enabled_device(struct pci_dev *dev)
 {
+	pci_dev_state_lock(dev);
 	if (pci_is_enabled(dev))
 		do_pci_disable_device(dev);
+	pci_dev_state_unlock(dev);
 }
 
 /**
@@ -1849,12 +1866,14 @@  void pci_disable_device(struct pci_dev *dev)
 	dev_WARN_ONCE(&dev->dev, atomic_read(&dev->enable_cnt) <= 0,
 		      "disabling already-disabled device");
 
+	pci_dev_state_lock(dev);
 	if (atomic_dec_return(&dev->enable_cnt) != 0)
-		return;
+		goto bail;
 
 	do_pci_disable_device(dev);
-
 	dev->is_busmaster = 0;
+ bail:
+	pci_dev_state_unlock(dev);
 }
 EXPORT_SYMBOL(pci_disable_device);