Patchwork Fixing sh4 serial abort

login
register
mail settings
Submitter Rob Landley
Date July 27, 2012, 1:45 p.m.
Message ID <50129B8A.3060102@landley.net>
Download mbox | patch
Permalink /patch/173684/
State New
Headers show

Comments

Rob Landley - July 27, 2012, 1:45 p.m.
If you grab the current aboriginal linux build scripts:

  http://landley.net/hg/aboriginal/archive/tip.tar.bz2

And "./build.sh sh4", then cd to build/system-image-sh4 and
"./run-emulator.sh" you get this:

  sh_serial: unsupported read from 0x18
  Aborted

The bug was triggered by linux kernel commit 73c3d53f38e0a8e6 back
between v3.2 and v3.3, which did this:

             break;
-#if 0
         case 0x18:
             ret = s->fcr;
             break;
-#endif
         case 0x1c:
             ret = s->rx_cnt;
             break;

(It can also just do ret = 0; and that works too. Or comment out the
abort() near the end of the function.)

Doing that, qemu boots the sh4 system image to a shell prompt, and the
result compiles "hello world" natively.

By the way, this board emulation (r2d) only has 64 megs ram. Is there an
easy way to get 256 megs out of it?

Rob
Peter Maydell - July 27, 2012, 2:32 p.m.
On 27 July 2012 14:45, Rob Landley <rob@landley.net> wrote:
> I.E. sci_getreg(port, SCFCR) move to before checking whether or not
> we'll ever possibly use the result. SCFCR is 0x18 and QEMU calls abort()
> on an attempt to read from an unimplemented register.
>
> I can patch the kernel to work around this (and probably will for this
> release), but the _proper_ fix is to get qemu not to abort on a register
> read that works fine if it just returns 0.

The thing this analysis is missing is any examination of the question
"what is the hardware we are modelling documented to do?". If the hardware
is documented to have a readable register here, QEMU should be fixed.
If it is not then the kernel is buggy and should be fixed.

As it happens, the "SH7214 Group, SH7216 Group User’s Manual: Hardware"
(which I think is the right doc for this) says the register is r/w,
so I think your suggested patch is correct.

(Aborting is a little unfriendly but our logging infrastructure
for "guest did something wrong" is not great, unfortunately.)

There are an awful lot of "#if 0"s in that source file...

-- PMM
Rob Landley - July 27, 2012, 5:16 p.m.
On 07/27/2012 09:32 AM, Peter Maydell wrote:
> On 27 July 2012 14:45, Rob Landley <rob@landley.net> wrote:
>> I.E. sci_getreg(port, SCFCR) move to before checking whether or not
>> we'll ever possibly use the result. SCFCR is 0x18 and QEMU calls abort()
>> on an attempt to read from an unimplemented register.
>>
>> I can patch the kernel to work around this (and probably will for this
>> release), but the _proper_ fix is to get qemu not to abort on a register
>> read that works fine if it just returns 0.
> 
> The thing this analysis is missing is any examination of the question
> "what is the hardware we are modelling documented to do?".

Given that 3.3, 3.4, and 3.5 kernels have already shipped with this, I'm
guessing "not immediately crash"?

Then again I can't really criticize sh4 for multiple kernel releases not
working under qemu and nobody noticing when _arm_ currenty has a similar
problem. The arm versatile board's scsi emulation was broken by linux
commit 4d5fc58dbe34 in 3.4 (yanking the versatile's mach/io.h when the
default one breaks PCI), and then before _that_ got reverted (in commit
9b0f7e399238c6) this commit went in:

commit 1bc39ac5dab265b76ce6e20d6c85f900539fd190
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date:   Sat Mar 10 11:32:34 2012 +0000

    ARM: PCI: versatile: fix PCI interrupt setup

    This is at odds with the documentation in the file; it says pin 1 on
    slots 24,25,26,27 map to IRQs 27,28,29,30, but the function will
    always be entered with slot=0 due to the lack of swizzle function.
    Fix this function to behave as the comments say, and use the
    standard PCI swizzle.

    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

So the arm maintainer noticed a place where the code didn't match the
documentation, changed the code to match the docs, and result doesn't
work under qemu's versatile board emulation. But at least 3.5 doesn't
work for a _different_ reason than 3.4 didn't work, so there's that.

I dunno if this is a qemu bug or a kernel bug, but I _do_ know that I'm
far more interested in getting it to work with existing software than
getting it to match the docs.

I hadn't reported this one yet because I still haven't root caused it,
just bisected to find the break. I know reverting the IRQ assignment
line in 3.5 doesn't fix it, which implies the "swizzle" bit is to blame
(which seems ot have something to do with PCI bridges), and thus calling
the default function instead of calling no function breaks qemu's
versatile board emulation.

But that part isn't trivially reverted in 3.5 becauase they "cleaned up"
the code so you now get the default behavior without asking and I
haven't dug deep enough to figure out what's actually going on yet.

> If the hardware
> is documented to have a readable register here, QEMU should be fixed.
> If it is not then the kernel is buggy and should be fixed.

The sh4 Linux architecture is currently maintained by Renesas
developers, and they're just as good a maintainer of the dreamcast-era
stuff as Oracle is of Sun workstations and HP/Compaq was of the Microvax.

They have explicitly told me they aren't interested in anyone who isn't
a Renesas customer, and that since attempting to run sh4 under qemu or
on a dreamcast isn't what they're paid to support, they therefore do not
care about it.

Pointing out to them that they broke older versions of stuff has never
worked for me. You're welcome to try to engage them, but personally I
treat the sh4 changes as read only at this point and carry local patches
to revert the really stupid bits.

> As it happens, the "SH7214 Group, SH7216 Group User’s Manual: Hardware"
> (which I think is the right doc for this) says the register is r/w,
> so I think your suggested patch is correct.

Yay!

> (Aborting is a little unfriendly but our logging infrastructure
> for "guest did something wrong" is not great, unfortunately.)

My first fix was just replacing the abort with assigning the value to 0,
but that left debug output spewing a bit.

> There are an awful lot of "#if 0"s in that source file...

This one was put in by Thiemo Suefer. He didn't get to finish cleaning
up before he died.

I'm aware there's not a huge amount of interest in Linux on sh4, but my
aboriginal linux project is aimed at getting the same basic native
development environment working on as many different architectures as
possible, for things like automated cross-platform regression testing.
(I was thinking "userspace packages", but if I can set up a cron job
doing nightly builds of git maybe some of this crud would be caught
earlier. Pity I have NO free time. Writing this at work over lunch.)

So if qemu emulates it, I'm interested in getting a linux running on it
with a native compiler. Which means I'm perpetually out of my depth, but
oh well...

> -- PMM

Thanks for the review,

Rob
Peter Maydell - July 27, 2012, 5:28 p.m.
On 27 July 2012 18:16, Rob Landley <rob@landley.net> wrote:
> On 07/27/2012 09:32 AM, Peter Maydell wrote:
>> On 27 July 2012 14:45, Rob Landley <rob@landley.net> wrote:
>>> I.E. sci_getreg(port, SCFCR) move to before checking whether or not
>>> we'll ever possibly use the result. SCFCR is 0x18 and QEMU calls abort()
>>> on an attempt to read from an unimplemented register.
>>>
>>> I can patch the kernel to work around this (and probably will for this
>>> release), but the _proper_ fix is to get qemu not to abort on a register
>>> read that works fine if it just returns 0.
>>
>> The thing this analysis is missing is any examination of the question
>> "what is the hardware we are modelling documented to do?".
>
> Given that 3.3, 3.4, and 3.5 kernels have already shipped with this, I'm
> guessing "not immediately crash"?

I said "documented", not "what it happens to do in practice".

> Then again I can't really criticize sh4 for multiple kernel releases not
> working under qemu and nobody noticing when _arm_ currenty has a similar
> problem. The arm versatile board's scsi emulation was broken by linux
> commit 4d5fc58dbe34 in 3.4 (yanking the versatile's mach/io.h when the
> default one breaks PCI), and then before _that_ got reverted (in commit
> 9b0f7e399238c6) this commit went in:
>
> commit 1bc39ac5dab265b76ce6e20d6c85f900539fd190
> Author: Russell King <rmk+kernel@arm.linux.org.uk>
> Date:   Sat Mar 10 11:32:34 2012 +0000
>
>     ARM: PCI: versatile: fix PCI interrupt setup
>
>     This is at odds with the documentation in the file; it says pin 1 on
>     slots 24,25,26,27 map to IRQs 27,28,29,30, but the function will
>     always be entered with slot=0 due to the lack of swizzle function.
>     Fix this function to behave as the comments say, and use the
>     standard PCI swizzle.
>
>     Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
>
> So the arm maintainer noticed a place where the code didn't match the
> documentation, changed the code to match the docs, and result doesn't
> work under qemu's versatile board emulation. But at least 3.5 doesn't
> work for a _different_ reason than 3.4 didn't work, so there's that.

PCI on versatilepb is hopelessly broken -- the only thing the kernel
code does (or did) is work on QEMU. Arnd Bergmann had some kernel
patches which fixed all that so it worked on hardware, but they never
got applied. It looks like Russell has now belatedly noticed a
subset of that wrongness.

> I hadn't reported this one yet because I still haven't root caused it,
> just bisected to find the break. I know reverting the IRQ assignment
> line in 3.5 doesn't fix it, which implies the "swizzle" bit is to blame
> (which seems ot have something to do with PCI bridges), and thus calling
> the default function instead of calling no function breaks qemu's
> versatile board emulation.

1. Find Arnd's kernel patches and see what of them still need to
be applied (http://comments.gmane.org/gmane.linux.ports.arm.kernel/93393)
2. Get kernel working on real hardware
3. Submit qemu patches to fix our versatilepb PCI emulation to
match hardware

If you like you can do 3 first and I'll happily apply those patches
regardless of whether anybody cares to fix the kernel.

I'm afraid I don't have a great deal of time for versatilepb
emulation fixes because it's an incredibly ancient devboard.
vexpress breakage will get more attention from me (though I
don't habitually test new kernels on qemu so I won't necessarily
notice bugs unless my attention is drawn to them.)

-- PMM
Rob Landley - Aug. 1, 2012, 11:50 a.m.
On 07/27/2012 12:28 PM, Peter Maydell wrote:
> On 27 July 2012 18:16, Rob Landley <rob@landley.net> wrote:
>> On 07/27/2012 09:32 AM, Peter Maydell wrote:
>>> The thing this analysis is missing is any examination of the question
>>> "what is the hardware we are modelling documented to do?".
>>
>> Given that 3.3, 3.4, and 3.5 kernels have already shipped with this, I'm
>> guessing "not immediately crash"?
> 
> I said "documented", not "what it happens to do in practice".

I'm trying to make it work in practice.

>> So the arm maintainer noticed a place where the code didn't match the
>> documentation, changed the code to match the docs, and result doesn't
>> work under qemu's versatile board emulation. But at least 3.5 doesn't
>> work for a _different_ reason than 3.4 didn't work, so there's that.
> 
> PCI on versatilepb is hopelessly broken -- the only thing the kernel
> code does (or did) is work on QEMU.

Which is all I used it for.


>> I hadn't reported this one yet because I still haven't root caused it,
>> just bisected to find the break. I know reverting the IRQ assignment
>> line in 3.5 doesn't fix it, which implies the "swizzle" bit is to blame
>> (which seems ot have something to do with PCI bridges), and thus calling
>> the default function instead of calling no function breaks qemu's
>> versatile board emulation.
> 
> 1. Find Arnd's kernel patches and see what of them still need to
> be applied (http://comments.gmane.org/gmane.linux.ports.arm.kernel/93393)
> 2. Get kernel working on real hardware

You seem to be under the impression I have real arm hardware. (Or real
sh4 hardware, or real powerpc hardware, or real mips hardware...)

I've got an armv4t board in a box (a patch to support that was submited
to the list by the vendor, and never merged), and a phone that's
presumably armv7l but it's in use as a phone with the vendor software on
it, and at work I use an cluster of unreleased armv7 SOCs in an "I can't
believe it's not NUMA" configuration. The userspaces I worked out under
qemu run in a chroot under there.

I've got two diferent mips routers in a box somewhere with a dongle that
lets me do a serial console but both require kernel patches to add board
support (neither vendor ever submitted it upstream). I've run the root
filesystems I worked out under qemu in a chroot under them but haven't
gotten them out of the box in over a year. (I think it's in storage,
actually.)

A couple of the game consoles might be powerpc, but I've never run linux
on them. I don't own sh4 or sparc hardware.

> 3. Submit qemu patches to fix our versatilepb PCI emulation to
> match hardware

4) locally patch the kernel back to work with existing qemu releases
since all I want to do is boot an arm userspace to a shell prompt and
run arm code on it, and don't really care about the board emulation
except for the laundry list of features I need out of it such as "a
working persistent clock so make doesn't lose its marbles", "enough
physical memory", "serial console", "working network device", and "three
working block device I can plug filesystem images into".

My goal is to run an emulated linux userspace, and application emulation
is _way_ more complicated and fiddly than system emulation. If real
hardware was as easily accessible for me as qemu, qemu would be
significantly less useful.

> If you like you can do 3 first and I'll happily apply those patches
> regardless of whether anybody cares to fix the kernel.

I just want it to work. Working the same way real hardware does is a
bonus, but if the change isn't visible to userspace (virtio? emulated
scsi? emulated ide?) it's not actually relevant to my use case.

> I'm afraid I don't have a great deal of time for versatilepb
> emulation fixes because it's an incredibly ancient devboard.

I can switch to a newer board, but I want to plug armv4tl, armv5l,
armv6l, and armv7l processors into it. (And eventually armv8 but nothing
supports that yet.) If it's always running armv7 then I can't _prove_
that this userspace would actually run on armv5 and didn't leak armv7
instructions in there.

Last time I looked at newer boards the idea of plugging older processors
into them confused both qemu and the kernel's config stuff.

Looking at current qemu-git, there's no "-M vexpress", there is instead

vexpress-a9          ARM Versatile Express for Cortex-A9
vexpress-a15         ARM Versatile Express for Cortex-A15

I.E. the assumption about what processor you're plugging into this board
is baked into the board _name_, and both of those are armv7 only.

Hmmm, then again it looks like

  http://www.mail-archive.com/qemu-devel@nongnu.org/msg19370.html

Never got merged so the -cpu restrictions for armv4t and armv5l are
currently commented out anyway. (I tested with it at one point, but I
ship system images that people use with their existing qemu versions, so
I can't patch qemu for them....)

> vexpress breakage will get more attention from me (though I
> don't habitually test new kernels on qemu so I won't necessarily
> notice bugs unless my attention is drawn to them.)

I do habitually test newer kernels on qemu. I can draw attention to
them, but not necessarily interest.

> -- PMM

I'll poke at vexpress and see what I can get it to do.

Rob
Peter Maydell - Aug. 1, 2012, 12:01 p.m.
On 1 August 2012 12:50, Rob Landley <rob@landley.net> wrote:
> I can switch to a newer board, but I want to plug armv4tl, armv5l,
> armv6l, and armv7l processors into it. (And eventually armv8 but nothing
> supports that yet.) If it's always running armv7 then I can't _prove_
> that this userspace would actually run on armv5 and didn't leak armv7
> instructions in there.
>
> Last time I looked at newer boards the idea of plugging older processors
> into them confused both qemu and the kernel's config stuff.

Yes, this is because it fundamentally doesn't work and won't work.
ARM board models work with a limited set of CPUs (often just one
CPU), if you try to -cpu something into a board model that doesn't
support it you get to keep both pieces when it breaks.

You can't just plug an A9 CPU into a versatile PB board in hardware,
and it doesn't work in QEMU for about the same reason (the set of
connections between the CPU and the board is completely different).

> Looking at current qemu-git, there's no "-M vexpress", there is instead
>
> vexpress-a9          ARM Versatile Express for Cortex-A9
> vexpress-a15         ARM Versatile Express for Cortex-A15
>
> I.E. the assumption about what processor you're plugging into this board
> is baked into the board _name_, and both of those are armv7 only.

This is because the A9 and A15 versatile express systems are
genuinely different hardware (the memory maps are all different,
for example).

> Hmmm, then again it looks like
>
>   http://www.mail-archive.com/qemu-devel@nongnu.org/msg19370.html
>
> Never got merged so the -cpu restrictions for armv4t and armv5l are
> currently commented out anyway.

I think we subsequently merged a different patch for the equivalent
feature. Certainly translate.c does attempt to check for ARCH(5)
or ARCH(4T) and so on. If we get specific cases wrong I can fix them.

-- PMM

Patch

--- a/drivers/tty/serial/sh-sci.c
+++ b/drivers/tty/serial/sh-sci.c
@@ -1771,18 +1771,25 @@  static void sci_set_termios(struct uart_port
*port, struct ktermios *termios,

 	sci_init_pins(port, termios->c_cflag);

-	if (s->cfg->capabilities & SCIx_HAVE_RTSCTS) {
-		reg = sci_getreg(port, SCFCR);
-		if (reg->size) {
-			unsigned short ctrl;
+	reg = sci_getreg(port, SCFCR);
+	if (reg->size) {
+		unsigned short ctrl = sci_in(port, SCFCR);

-			ctrl = sci_in(port, SCFCR);
+		if (s->cfg->capabilities & SCIx_HAVE_RTSCTS) {

I.E. sci_getreg(port, SCFCR) move to before checking whether or not
we'll ever possibly use the result. SCFCR is 0x18 and QEMU calls abort()
on an attempt to read from an unimplemented register.

I can patch the kernel to work around this (and probably will for this
release), but the _proper_ fix is to get qemu not to abort on a register
read that works fine if it just returns 0.

It turns out the qemu fix (in current git) is just:

--- a/hw/sh_serial.c
+++ b/hw/sh_serial.c
@@ -248,11 +248,9 @@  static uint64_t sh_serial_read(void *opaque,
target_phys
                     s->flags &= ~SH_SERIAL_FLAG_RDF;
             }