Patchwork sata_mv port lockup on hotplug (kernel 2.6.38.2)

login
register
mail settings
Submitter Mark Lord
Date April 13, 2011, 3:10 p.m.
Message ID <4DA5BCF3.5080205@teksavvy.com>
Download mbox | patch
Permalink /patch/91023/
State Not Applicable
Delegated to: David Miller
Headers show

Comments

Mark Lord - April 13, 2011, 3:10 p.m.
On 11-04-12 06:30 AM, Bruce Stenning wrote:
..
> I am currently inserting tracing into 2.6.38.2 to try to work out what is going
> on.  From mv_write_main_irq_mask I can see that the IRQ for each port is still
> enabled, even when ports stop responding.
..
> One thing I noticed was that there is no spinlock around the
> mv_save_cached_regs/mv_edma_cfg in mv_hardreset (unlike mv_port_start and
> mv_port_stop); why is this?

Yeah, I'm suspecting there's a loophole in the logic there somewhere.

I dusted off the 6041 reference card I have here, and played with the cables
for a while.  Managed to get one port to stop responding to hot plug fairly
quickly, though I'm not sure how/why.

Then I added a debug printk() to mv_write_main_irq_mask(),
with no other changes, and that appears to have been enough
to change the race timing so that I could no longer produce
the problem.

Bruce, here's a slightly-ugly patch that should remove all
doubt about races in the irq_mask.  Please apply it, test with it,
and let me know here if the issue goes away.

Thanks

 static void mv_enable_port_irqs(struct ata_port *ap,
Bruce Stenning - April 13, 2011, 4:13 p.m.
> Yeah, I'm suspecting there's a loophole in the logic there somewhere.

>

> I dusted off the 6041 reference card I have here, and played with the cables for a while.  Managed to > get one port to stop responding to hot plug fairly quickly, though I'm not sure how/why.

>

> Then I added a debug printk() to mv_write_main_irq_mask(), with no other changes, and that appears to > have been enough to change the race timing so that I could no longer produce the problem.

>

> Bruce, here's a slightly-ugly patch that should remove all doubt about races in the irq_mask.  Please > apply it, test with it, and let me know here if the issue goes away.

>

> Thanks


Thanks Mark. I was about to try out some tracing in writelfl because I
suspected the irq mask was getting clobbered somewhere along the way, but
I'd been distracted by other work.  I shall try your patch out as soon as
I can, and report back.

Cheers,

Bruce.


Bruce Stenning,
IndigoVision,
b <dot> stenning <at> indigovision <dot> com


Latest News at: http://www.indigovision.com/index.php/en/news.html
Bruce Stenning - April 14, 2011, 4:50 p.m.
PiA+IE9uZSB0aGluZyBJIG5vdGljZWQgd2FzIHRoYXQgdGhlcmUgaXMgbm8gc3BpbmxvY2sgYXJv
dW5kIHRoZQ0KPiA+IG12X3NhdmVfY2FjaGVkX3JlZ3MvbXZfZWRtYV9jZmcgaW4gbXZfaGFyZHJl
c2V0ICh1bmxpa2UgbXZfcG9ydF9zdGFydA0KPiA+IGFuZCBtdl9wb3J0X3N0b3ApOyB3aHkgaXMg
dGhpcz8NCj4NCj4gWWVhaCwgSSdtIHN1c3BlY3RpbmcgdGhlcmUncyBhIGxvb3Bob2xlIGluIHRo
ZSBsb2dpYyB0aGVyZSBzb21ld2hlcmUuDQo+DQo+IEkgZHVzdGVkIG9mZiB0aGUgNjA0MSByZWZl
cmVuY2UgY2FyZCBJIGhhdmUgaGVyZSwgYW5kIHBsYXllZCB3aXRoIHRoZSBjYWJsZXMNCj4gZm9y
IGEgd2hpbGUuICBNYW5hZ2VkIHRvIGdldCBvbmUgcG9ydCB0byBzdG9wIHJlc3BvbmRpbmcgdG8g
aG90IHBsdWcgZmFpcmx5DQo+IHF1aWNrbHksIHRob3VnaCBJJ20gbm90IHN1cmUgaG93L3doeS4N
Cj4NCj4gVGhlbiBJIGFkZGVkIGEgZGVidWcgcHJpbnRrKCkgdG8gbXZfd3JpdGVfbWFpbl9pcnFf
bWFzaygpLCB3aXRoIG5vIG90aGVyDQo+IGNoYW5nZXMsIGFuZCB0aGF0IGFwcGVhcnMgdG8gaGF2
ZSBiZWVuIGVub3VnaCB0byBjaGFuZ2UgdGhlIHJhY2UgdGltaW5nDQo+IHNvIHRoYXQgSSBjb3Vs
ZCBubyBsb25nZXIgcHJvZHVjZSB0aGUgcHJvYmxlbS4NCj4NCj4gQnJ1Y2UsIGhlcmUncyBhIHNs
aWdodGx5LXVnbHkgcGF0Y2ggdGhhdCBzaG91bGQgcmVtb3ZlIGFsbCBkb3VidCBhYm91dCByYWNl
cw0KPiBpbiB0aGUgaXJxX21hc2suICBQbGVhc2UgYXBwbHkgaXQsIHRlc3Qgd2l0aCBpdCwgYW5k
IGxldCBtZSBrbm93IGhlcmUgaWYgdGhlDQo+IGlzc3VlIGdvZXMgYXdheS4NCj4NCj4gVGhhbmtz
DQo+DQo+IFtwYXRjaF0NCg0KDQpIaSBNYXJrLA0KDQpJIGhhdmUgdHJpZWQgdGhlIHBhdGNoIG91
dCBhbmQgSSB3YXMgYWJsZSB0byByZXByb2R1Y2UgdGhlIHBvcnQgbG9ja3VwIHdpdGggaXQuDQoN
CkkgYWxzbyB0cmllZCBvdXQgbXkgdHJhY2luZyBpbiB3cml0ZWxmbCAod2l0aG91dCBhbnkgb3Ro
ZXIgY2hhbmdlcykgYW5kIEkgd2FzDQphYmxlIHRvIGxvY2sgdXAgYSBwb3J0IHdpdGhvdXQgYW55
IGFwcGFyZW50IHVudXN1YWwgY2hhbmdlcyB0byB0aGUgcmVnaXN0ZXINCmNvbnRhaW5pbmcgdGhl
IGlycSBtYXNrcy4gIEFyZSB0aGVyZSBvdGhlciByb3V0ZXMgdG8gZGlzYWJsaW5nIHRoZSBpcnFz
Pw0KDQpJIHNoYWxsIGNvbnRpbnVlIGxvb2tpbmcuDQoNCkNoZWVycywNCg0KQnJ1Y2UuDQoNCg0K
QnJ1Y2UgU3Rlbm5pbmcsDQpJbmRpZ29WaXNpb24sDQpiIDxkb3Q+IHN0ZW5uaW5nIDxhdD4gaW5k
aWdvdmlzaW9uIDxkb3Q+IGNvbQ0KDQoNCkxhdGVzdCBOZXdzIGF0OiBodHRwOi8vd3d3LmluZGln
b3Zpc2lvbi5jb20vaW5kZXgucGhwL2VuL25ld3MuaHRtbA0K
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord - April 15, 2011, 12:28 a.m.
On 11-04-14 12:50 PM, Bruce Stenning wrote:
..
> I have tried the patch out and I was able to reproduce the port lockup with it.

Bummer.

> I also tried out my tracing in writelfl (without any other changes) and I was
> able to lock up a port without any apparent unusual changes to the register
> containing the irq masks.  Are there other routes to disabling the irqs?


Mmm.. presumably you are NOT using MSI interrupts, right?

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruce Stenning - April 15, 2011, 1:21 p.m.
> > I also tried out my tracing in writelfl (without any other changes) and I was

> > able to lock up a port without any apparent unusual changes to the register

> > containing the irq masks.  Are there other routes to disabling the irqs?

>

> Mmm.. presumably you are NOT using MSI interrupts, right?


Hi Mark,

That is correct:

sata_mv 0000:00:04.0: Gen-II 32 slots 4 ports SCSI mode IRQ via INTx

I am building with CONFIG_ARCH_SUPPORTS_MSI=y, but also with CONFIG_PCI_MSI
not set.

With all this discussion of SATA link speed, I ought to say that we limit our
SATA links to 1.5Gbps with the following kernel parameter:

libata.force=1.5g

I had noticed occasional messages similar to the following earlier in the week:

ata4.00: limiting speed to UDMA/100:PIO4

I was surprised to see them, and thought they might be related to the hotplug
issues, but I was able to reproduce the lockups without these messages being
generated.  In some cases (but not all) both sata_down_spd_limit and
ata_down_xfermask_limit are called when backing off the link speed.

Mark, I was intrigued by your comment from the following message:

http://www.spinics.net/lists/linux-ide/msg36922.html
> If it's like their non-AHCI controllers (sata_mv), then the chipset/phy

> could be very particular about the sequence/timing used when changing speeds.


Is it possible that the driver is doing some of the work to change the link
speed (even though it has nowhere to go) and clobbering the link entirely?
I shall take another closer look at the source code and tracing around hotplug
w.r.t. link speed.


Cheers,

Bruce.


Bruce Stenning,
IndigoVision,
b <dot> stenning <at> indigovision <dot> com


Latest News at: http://www.indigovision.com/index.php/en/news.html
Mark Lord - April 15, 2011, 3:36 p.m.
On 11-04-15 09:21 AM, Bruce Stenning wrote:
..
> sata_mv 0000:00:04.0: Gen-II 32 slots 4 ports SCSI mode IRQ via INTx
..
> 
> I had noticed occasional messages similar to the following earlier in the week:
> 
> ata4.00: limiting speed to UDMA/100:PIO4

Ugh.   Tejun:  that's just plain silly.
Why would we EVER EVER EVER fall back to "PIO" for SATA ???


> I am building with CONFIG_ARCH_SUPPORTS_MSI=y, but also with CONFIG_PCI_MSI
> not set.
> 
> With all this discussion of SATA link speed, I ought to say that we limit our
> SATA links to 1.5Gbps with the following kernel parameter:
> 
> libata.force=1.5g

Mmm... I suppose it could be getting stuck at 1.5g,
and not transitioning back to "auto detect" or something.

That part of sata_mv involves a bit of voo-doo,
because I was reluctant to cut/paste the 200 lines or so
of generic code just for the sake of changing five lines.

Perhaps the generic code has changed since this was originally working,
and now no longer always works for sata_mv..

I suppose you have a good reason for using "libata.force=1.5g" there?

Cheers
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruce Stenning - April 18, 2011, 10:30 a.m.
> > libata.force=1.5g

>

> Mmm... I suppose it could be getting stuck at 1.5g,

> and not transitioning back to "auto detect" or something.

>

> That part of sata_mv involves a bit of voo-doo,

> because I was reluctant to cut/paste the 200 lines or so

> of generic code just for the sake of changing five lines.

>

> Perhaps the generic code has changed since this was originally working,

> and now no longer always works for sata_mv..

>

> I suppose you have a good reason for using "libata.force=1.5g" there?


Hi Mark,

I hope you had a good weekend.

The SATA speed was limited because there were concerns over signal integrity at
3Gbps early on in our product development, and even at 1.5Gbps the SATA links
are not a bottleneck in our platform.  The signal integrity issues have been
addressed, and we would like to remove the limit, but this obviously has an
additional test overhead and it is low down on our priorities.

Out of interest I examined the behaviour of the hotplug with the 1.5Gbps limit
removed, but I saw no change in the port lockup behaviour.


Cheers,

Bruce.


Latest News at: http://www.indigovision.com/index.php/en/news.html
Tejun Heo - April 23, 2011, 12:56 a.m.
On Fri, Apr 15, 2011 at 11:36:38AM -0400, Mark Lord wrote:
> On 11-04-15 09:21 AM, Bruce Stenning wrote:
> ..
> > sata_mv 0000:00:04.0: Gen-II 32 slots 4 ports SCSI mode IRQ via INTx
> ..
> > 
> > I had noticed occasional messages similar to the following earlier in the week:
> > 
> > ata4.00: limiting speed to UDMA/100:PIO4
> 
> Ugh.   Tejun:  that's just plain silly.
> Why would we EVER EVER EVER fall back to "PIO" for SATA ???

Nah, it's just printing both DMA and PIO speed settings.  It's limited
to UDMA/100 not PIO4.

Thanks.

Patch

--- old/drivers/ata/sata_mv.c	2011-04-13 11:03:07.442481426 -0400
+++ linux/drivers/ata/sata_mv.c	2011-04-13 11:07:55.224249621 -0400
@@ -1027,15 +1027,19 @@ 
 static void mv_set_main_irq_mask(struct ata_host *host,
 				 u32 disable_bits, u32 enable_bits)
 {
+	static spinlock_t irq_mask_lock = __SPIN_LOCK_UNLOCKED(irq_mask_lock); // FIXME: per-host would be nicer
 	struct mv_host_priv *hpriv = host->private_data;
 	u32 old_mask, new_mask;
+	unsigned long flags;
 
+	spin_lock_irqsave(&irq_mask_lock, flags);
 	old_mask = hpriv->main_irq_mask;
 	new_mask = (old_mask & ~disable_bits) | enable_bits;
 	if (new_mask != old_mask) {
 		hpriv->main_irq_mask = new_mask;
 		mv_write_main_irq_mask(new_mask, hpriv);
 	}
+	spin_unlock_irqrestore(&irq_mask_lock, flags);
 }
 
 static void mv_enable_port_irqs(struct ata_port *ap,