Patchwork 2.6.37-git17 virtual IO boot failure

Submitter Nishanth Aravamudan
Date Jan. 18, 2011, 10:47 p.m.
Message ID <20110118224718.GA19039@us.ibm.com>
Permalink /patch/79369/
State Not Applicable

Comments

Nishanth Aravamudan - Jan. 18, 2011, 10:47 p.m.
On 18.01.2011 [12:31:52 +1100], Anton Blanchard wrote:
> Hi,
> 
> I was testing 2.6.37-git17 on a POWER7 with virtual IO and hit this:
> 
> Trying to unpack rootfs image as initramfs...
> Freeing initrd memory: 7446k freed
> vio 30000000: Warning: IOMMU dma not supported: mask
> 0xffffffffffffffff, table unavailable
> vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> table unavailable
> vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> table unavailable
> vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> table unavailable
> vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> table unavailable
> audit: initializing netlink socket (disabled)
> 
> Haven't had a chance to look closer yet.

After debugging a bit, this would appear to be due to the second hunk of
b3c73856ae47d43d0d181f9de1c1c6c0820c4515.


Milton, Sonny, any thoughts?

Thanks,
Nish
Nishanth Aravamudan - Jan. 19, 2011, 12:48 a.m.
On 18.01.2011 [14:47:18 -0800], Nishanth Aravamudan wrote:
> On 18.01.2011 [12:31:52 +1100], Anton Blanchard wrote:
> > Hi,
> > 
> > I was testing 2.6.37-git17 on a POWER7 with virtual IO and hit this:
> > 
> > Trying to unpack rootfs image as initramfs...
> > Freeing initrd memory: 7446k freed
> > vio 30000000: Warning: IOMMU dma not supported: mask
> > 0xffffffffffffffff, table unavailable
> > vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > audit: initializing netlink socket (disabled)
> > 
> > Haven't had a chance to look closer yet.
> 
> After debugging a bit, this would appear to be due to the second hunk of
> b3c73856ae47d43d0d181f9de1c1c6c0820c4515.
> 
> diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
> index b265405..1b695fd 100644
> --- a/arch/powerpc/kernel/vio.c
> +++ b/arch/powerpc/kernel/vio.c
> @@ -1257,6 +1257,10 @@ struct vio_dev *vio_register_device_node(struct device_node *of_node)
>         viodev->dev.parent = &vio_bus_device.dev;
>         viodev->dev.bus = &vio_bus_type;
>         viodev->dev.release = vio_dev_release;
> +        /* needed to ensure proper operation of coherent allocations
> +         * later, in case driver doesn't set it explicitly */
> +        dma_set_mask(&viodev->dev, DMA_BIT_MASK(64));
> +        dma_set_coherent_mask(&viodev->dev, DMA_BIT_MASK(64));
> 
>         /* register with generic device framework */
>         if (device_register(&viodev->dev)) {
> 
> Milton, Sonny, any thoughts?

A bit more detail after trying a few more kernels on the box that
originally showed the error:

1) This doesn't actually prevent booting, afaict. I think it "just"
disables DMA, which is bad, but not a boot fail, technically.

2) Reverting the above commit definitely prevents those messages.

3) I'm seeing a separate issue with 2.6.37-git17 (that's not present in
2.6.37):

sd 0:4:2:0: [sda] Aborting command: 2A
sd 0:4:2:0: Abort timed out. Resetting bus.

At which point the box locks up :)

So testing fixes is a bit of a challenge right now.

Ben, if you're ok with waiting to see if Milton or Sonny have any ideas,
I'd like to hold off on asking for a revert. In the case they do, I'll
be able to test and send out any proposed fix rapidly.

Thanks,
Nish
Benjamin Herrenschmidt - Jan. 19, 2011, 4:06 a.m.
On Tue, 2011-01-18 at 14:47 -0800, Nishanth Aravamudan wrote:
> On 18.01.2011 [12:31:52 +1100], Anton Blanchard wrote:
> > Hi,
> > 
> > I was testing 2.6.37-git17 on a POWER7 with virtual IO and hit this:
> > 
> > Trying to unpack rootfs image as initramfs...
> > Freeing initrd memory: 7446k freed
> > vio 30000000: Warning: IOMMU dma not supported: mask
> > 0xffffffffffffffff, table unavailable
> > vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > table unavailable
> > audit: initializing netlink socket (disabled)
> > 
> > Haven't had a chance to look closer yet.

Well, this causes messages for vdevices that don't do DMA at all (such
as vterm etc...) and don't have the necessary properties. However, it
didn't -break- anything for me in my tests so far, just spurious
messages. Not sure what's up with Anton's setup. Anton, can you hack the
printk to display the OF path to the device so we see what devices are
complaining? It could be a different issue that prevents booting.

Cheers,
Ben.

Nishanth Aravamudan - Jan. 19, 2011, 4:37 a.m.
On 19.01.2011 [15:06:20 +1100], Benjamin Herrenschmidt wrote:
> On Tue, 2011-01-18 at 14:47 -0800, Nishanth Aravamudan wrote:
> > On 18.01.2011 [12:31:52 +1100], Anton Blanchard wrote:
> > > Hi,
> > > 
> > > I was testing 2.6.37-git17 on a POWER7 with virtual IO and hit this:
> > > 
> > > Trying to unpack rootfs image as initramfs...
> > > Freeing initrd memory: 7446k freed
> > > vio 30000000: Warning: IOMMU dma not supported: mask
> > > 0xffffffffffffffff, table unavailable
> > > vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > > table unavailable
> > > vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > > table unavailable
> > > vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > > table unavailable
> > > vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff,
> > > table unavailable
> > > audit: initializing netlink socket (disabled)
> > > 
> > > Haven't had a chance to look closer yet.
> 
> Well, this causes messages for vdevices that don't do DMA at all (such
> as vterm etc...) and don't have the necessary properties. However, it
> didn't -break- anything for me in my tests so far, just spurious
> messages. Not sure what's up with Anton's setup. Anton, can you hack the
> printk to display the OF path to the device so we see what devices are
> complaining? It could be a different issue that prevents booting.

Is this what you were looking for?

vio 30000000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
vio 30000000: Path: /vdevice/vty@30000000
vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
vio 4000: Path: /vdevice/IBM,sp@4000
vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
vio 4001: Path: /vdevice/rtc@4001
vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
vio 4002: Path: /vdevice/nvram@4002
vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
vio 4004: Path: /vdevice/gscsi@4004

FWIW, I looked at Anton's logs, and I don't think the boot failed, per
se. I think it may have timed out (but not positive on that). I was able
to boot 2.6.37-git17 on the exact same box, albeit it locks up at a
later point (the sd abort I e-mailed about in a follow-up).



Benjamin Herrenschmidt - Jan. 19, 2011, 4:54 a.m.
On Tue, 2011-01-18 at 20:37 -0800, Nishanth Aravamudan wrote:

> Is this what you were looking for?
> 
> vio 30000000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
> vio 30000000: Path: /vdevice/vty@30000000
> vio 4000: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
> vio 4000: Path: /vdevice/IBM,sp@4000
> vio 4001: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
> vio 4001: Path: /vdevice/rtc@4001
> vio 4002: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
> vio 4002: Path: /vdevice/nvram@4002
> vio 4004: Warning: IOMMU dma not supported: mask 0xffffffffffffffff, table unavailable
> vio 4004: Path: /vdevice/gscsi@4004

Ok, so they are all harmless (none of those devices do DMA, apart maybe
from gscsi, I have no idea what it is :-)

> FWIW, I looked at Anton's logs, and I don't think the boot failed, per
> se. I think it may have timed out (but not positive on that). I was able
> to boot 2.6.37-git17 on the exact same box, albeit it locks up at a
> later point (the sd abort I e-mailed about in a follow-up).

I haven't seen your email. I'll dig. Have to run now.

Cheers,
Ben.

Benjamin Herrenschmidt - Jan. 19, 2011, 6:06 a.m.
On Tue, 2011-01-18 at 16:48 -0800, Nishanth Aravamudan wrote:
> 
> Ben, if you're ok with waiting to see if Milton or Sonny have any
> ideas,
> I'd like to hold off on asking for a revert. In the case they do, I'll
> be able to test and send out any proposed fix rapidly. 

I don't believe this specific error is causing the lockup, I think we
only hit a spurious message on devices that don't have DMA capabilities
in the first place. (But I may be wrong, I'll wait for you guys to dig
more or I'll have a look myself tomorrow if I manage to get out of
meetings).

So there's another problem with SCSI tho it -could- also be a DMA issue,
hard to tell at this point.

BTW. I'm not too happy with those defaults set to 64-bit. Probably not
an issue until your other patches go in, but some devices like veth
cannot do 64-bit DMA. I think we should default to 32-bit in the VIO
base code and explicitly enable 64-bit DMA from drivers that support it
(in theory vscsi but I haven't verified the implementation).

Cheers,
Ben.
Nishanth Aravamudan - Jan. 19, 2011, 10:26 p.m.
On 19.01.2011 [17:06:18 +1100], Benjamin Herrenschmidt wrote:
> On Tue, 2011-01-18 at 16:48 -0800, Nishanth Aravamudan wrote:
> > 
> > Ben, if you're ok with waiting to see if Milton or Sonny have any
> > ideas,
> > I'd like to hold off on asking for a revert. In the case they do, I'll
> > be able to test and send out any proposed fix rapidly. 
> 
> I don't believe this specific error is causing the lockup, I think we
> only hit a spurious message on devices that don't have DMA
> capabilities in the first place. (But I may be wrong, I'll wait for
> you guys to dig more or I'll have a look myself tomorrow if I manage
> to get out of meetings).

Yes, this seems accurate. Like I mentioned elsewhere, this box came up
ok even with these messages and seemed ok (up until the disk locked up).

> So there's another problem with SCSI tho it -could- also be a DMA issue,
> hard to tell at this point.

Right, I'm not sure how to determine that. I did see the lockup, though,
with both my patches reverted (the patches for vio, I mean, after
2.6.37)

> BTW. I'm not too happy with those defaults set to 64-bit. Probably not
> an issue until your other patches go in, but some devices like veth
> cannot do 64-bit DMA. I think we should default to 32-bit in the VIO
> base code and explicitly enable 64-bit DMA from drivers that support it
> (in theory vscsi but I haven't verified the implementation).

Ok, so change the bit-mask to 32-bit? Or would it be appropriate to
attempt 64-bit and, if that fails, fall back to 32-bit? That seems to be a
common pattern throughout the DMA bit-setting callers.
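
For reference, a minimal sketch of the "try 64-bit, then fall back to
32-bit" pattern. This is a userspace mock, not kernel code: struct mock_dev,
mock_dma_set_mask() and setup_dma_mask() are hypothetical names invented for
illustration, and the mock simply rejects any mask wider than the device
claims to support. Only the DMA_BIT_MASK() macro mirrors the kernel's
definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the kernel's DMA_BIT_MASK(): n == 64 is special-cased because
 * shifting a 64-bit value by 64 is undefined in C. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical stand-in for struct device: only what the mock needs. */
struct mock_dev {
	uint64_t max_mask;	/* widest mask the mock device accepts */
	uint64_t dma_mask;	/* mask currently in effect */
};

/* Mock of a dma_set_mask()-style call: 0 on success, -1 if too wide. */
static int mock_dma_set_mask(struct mock_dev *dev, uint64_t mask)
{
	assert(dev != NULL);
	if (mask > dev->max_mask)
		return -1;
	dev->dma_mask = mask;
	return 0;
}

/* Try 64-bit first, then 32-bit. Returns the width chosen, or -1. */
static int setup_dma_mask(struct mock_dev *dev)
{
	if (mock_dma_set_mask(dev, DMA_BIT_MASK(64)) == 0)
		return 64;
	if (mock_dma_set_mask(dev, DMA_BIT_MASK(32)) == 0)
		return 32;
	return -1;
}
```

With a mock device whose max_mask is DMA_BIT_MASK(32), setup_dma_mask()
ends up on the 32-bit mask; with DMA_BIT_MASK(64) it keeps the 64-bit one.
Whether the fallback belongs in the VIO base code or in each driver is the
open question above; the sketch takes no position on that.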

Thanks,
Nish
Anton Blanchard - Jan. 29, 2011, 10:22 p.m.
Hi,

> FWIW, I looked at Anton's logs, and I don't think the boot failed, per
> se. I think it may have timed out (but not positive on that). I was
> able to boot 2.6.37-git17 on the exact same box, albeit it locks up
> at a later point (the sd abort I e-mailed about in a follow-up).

This fail bisects down to the VPHN (shared processor affinity) patch.
I've got some fixes on the way.

Anton

Patch

diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index b265405..1b695fd 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1257,6 +1257,10 @@  struct vio_dev *vio_register_device_node(struct device_node *of_node)
        viodev->dev.parent = &vio_bus_device.dev;
        viodev->dev.bus = &vio_bus_type;
        viodev->dev.release = vio_dev_release;
+        /* needed to ensure proper operation of coherent allocations
+         * later, in case driver doesn't set it explicitly */
+        dma_set_mask(&viodev->dev, DMA_BIT_MASK(64));
+        dma_set_coherent_mask(&viodev->dev, DMA_BIT_MASK(64));

        /* register with generic device framework */
        if (device_register(&viodev->dev)) {
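
As a side note on the mask value in the warnings: DMA_BIT_MASK(64) expands
to all ones, which matches the 0xffffffffffffffff printed in the log above.
The following is a small userspace sketch of that arithmetic; the macro
mirrors the kernel's definition, while dma_mask_for_bits() and
print_dma_mask() are hypothetical helpers added only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the kernel's DMA_BIT_MASK(): n == 64 is special-cased because
 * shifting a 64-bit value by 64 is undefined in C. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: the mask for an address width of 1..64 bits. */
static unsigned long long dma_mask_for_bits(int bits)
{
	assert(bits >= 1 && bits <= 64);
	return DMA_BIT_MASK(bits);
}

/* Hypothetical helper: print a mask in hex for a given width. */
static void print_dma_mask(int bits)
{
	printf("%d-bit mask: 0x%llx\n", bits, dma_mask_for_bits(bits));
}
```

Calling print_dma_mask(64) and print_dma_mask(32) shows the two masks
discussed in the thread: the 64-bit default set by the patch and the 32-bit
default Ben suggests for the VIO base code.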