Segfault using qemu-system-arm in smc91c111

Message ID	1441549313.24871.218.camel@linuxfoundation.org
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> Message-ID: <1441549313.24871.218.camel@linuxfoundation.org> From: Richard Purdie <richard.purdie@linuxfoundation.org> To: Peter Crosthwaite <crosthwaitepeter@gmail.com> Date: Sun, 06 Sep 2015 15:21:53 +0100 In-Reply-To: <CAPokK=p9GUx7OeJ7xNwbnJkzd7JMLVayhgOCf0XapHnW9M9JLA@mail.gmail.com> References: <1441362357.24871.155.camel@linuxfoundation.org> <CAFEAcA874VoaFLDU+5Ai1DgQ8ahgSrh6Xfkg6VSqHwZ0422PSA@mail.gmail.com> <1441365880.24871.164.camel@linuxfoundation.org> <CAFEAcA8ict1SFxpu4yb_y1Rvp6kTZn3xweMF_7Vw1H0Mc-w3iw@mail.gmail.com> <1441370585.24871.166.camel@linuxfoundation.org> <1441387258.24871.197.camel@linuxfoundation.org> <CAFEAcA9YWnhnc5eL8ST1bHjkBj1_UiVd7ZDET2GjKbV+BQ+igQ@mail.gmail.com> <CAPokK=p9GUx7OeJ7xNwbnJkzd7JMLVayhgOCf0XapHnW9M9JLA@mail.gmail.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Cc: Peter Maydell <peter.maydell@linaro.org>, qemu-devel <qemu-devel@nongnu.org> Subject: Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111 Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Richard Purdie Sept. 6, 2015, 2:21 p.m. UTC

On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> > On 4 September 2015 at 18:20, Richard Purdie
> > <richard.purdie@linuxfoundation.org> wrote:
> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >>> > On 4 September 2015 at 12:24, Richard Purdie
> >>> > <richard.purdie@linuxfoundation.org> wrote:
> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >>> > > this.
> >>> >
> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >>>
> >>> I've now confirmed that it does indeed trigger the assert in
> >>> smc91c111_receive().
> >>
> >> I just tried an experiment where I put:
> >>
> >>     if (s->rx_fifo_len >= NUM_PACKETS)
> >>         return -1;
> >>
> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> problem.
> 
> Does it just stop the crash or does it eliminate the problem
> completely with a fully now-working network?

It stops the crash, the network works great.

> >> I also noticed can_receive() could also have a check on buffer
> >> availability. Would one of these changes be the correct fix here?
> >
> > The interesting question is why smc91c111_allocate_packet() doesn't
> > fail in this situation. We only have NUM_PACKETS worth of storage,
> > shared between the tx and rx buffers, so how could we both have
> > already filled the rx_fifo and have a spare packet for the allocate
> > function to return?
> 
> Maybe this:
> 
>             case 5: /* Release.  */
>                 smc91c111_release_packet(s, s->packet_num);
>                 break;
> 
> The guest is able to free an allocated packet without the accompanying
> pop of tx/rx fifo. This may suggest some sort of guest error?
> 
> The fix depends on the behaviour of the real hardware. If that MMIO op
> is supposed to dequeue the corresponding queue entry then we may need
> to patch that logic to do search the queues and dequeue it. Otherwise
> we need to find out the genuine length of the rx queue, and clamp it
> without something like Richards patch. There are a few other bits and
> pieces that suggest the guest can have independent control of the
> queues and allocated buffers but i'm confused as to how the rx fifo
> length can get up to 10 in any case.

I think I have a handle on what is going on. smc91c111_release_packet()
changes s->allocated() but not rx_fifo. can_receive() only looks at
s->allocated. We can trigger new network packets to arrive from
smc91c111_release_packet() which calls qemu_flush_queued_packets()
*before* we change rx_fifo and this can loop.

The patch below which explicitly orders the qemu_flush_queued_packets()
call resolved the test case I was able to reproduce this problem in.

So there are three ways to fix this, either can_receive() needs to check
both s->allocated() and rx_fifo, or the code is more explicit about when
qemu_flush_queued_packets() is called (as per my patch below), or the
case 4 where smc91c111_release_packet() and then
smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
which also works, albeit with more ugly code.

The problem is much more reproducible with the assert btw, booting a
qemu image with this and hitting the network interface with scp of a few
large files is usually enough.

So which patch would be preferred? :)

Cheers,

Richard

Peter Crosthwaite Sept. 6, 2015, 6:37 p.m. UTC | #1

On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
>> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On 4 September 2015 at 18:20, Richard Purdie
>> > <richard.purdie@linuxfoundation.org> wrote:
>> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> >>> > On 4 September 2015 at 12:24, Richard Purdie
>> >>> > <richard.purdie@linuxfoundation.org> wrote:
>> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> >>> > > overrunning. I can add the asserts but I think it would just confirm
>> >>> > > this.
>> >>> >
>> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
>> >>>
>> >>> I've now confirmed that it does indeed trigger the assert in
>> >>> smc91c111_receive().
>> >>
>> >> I just tried an experiment where I put:
>> >>
>> >>     if (s->rx_fifo_len >= NUM_PACKETS)
>> >>         return -1;
>> >>
>> >> into smc91c111_receive() and my reproducer stops reproducing the
>> >> problem.
>>
>> Does it just stop the crash or does it eliminate the problem
>> completely with a fully now-working network?
>
> It stops the crash, the network works great.
>
>> >> I also noticed can_receive() could also have a check on buffer
>> >> availability. Would one of these changes be the correct fix here?
>> >
>> > The interesting question is why smc91c111_allocate_packet() doesn't
>> > fail in this situation. We only have NUM_PACKETS worth of storage,
>> > shared between the tx and rx buffers, so how could we both have
>> > already filled the rx_fifo and have a spare packet for the allocate
>> > function to return?
>>
>> Maybe this:
>>
>>             case 5: /* Release.  */
>>                 smc91c111_release_packet(s, s->packet_num);
>>                 break;
>>
>> The guest is able to free an allocated packet without the accompanying
>> pop of tx/rx fifo. This may suggest some sort of guest error?
>>
>> The fix depends on the behaviour of the real hardware. If that MMIO op
>> is supposed to dequeue the corresponding queue entry then we may need
>> to patch that logic to do search the queues and dequeue it. Otherwise
>> we need to find out the genuine length of the rx queue, and clamp it
>> without something like Richards patch. There are a few other bits and
>> pieces that suggest the guest can have independent control of the
>> queues and allocated buffers but i'm confused as to how the rx fifo
>> length can get up to 10 in any case.
>
> I think I have a handle on what is going on. smc91c111_release_packet()
> changes s->allocated() but not rx_fifo. can_receive() only looks at
> s->allocated. We can trigger new network packets to arrive from
> smc91c111_release_packet() which calls qemu_flush_queued_packets()
> *before* we change rx_fifo and this can loop.
>
> The patch below which explicitly orders the qemu_flush_queued_packets()
> call resolved the test case I was able to reproduce this problem in.
>
> So there are three ways to fix this, either can_receive() needs to check
> both s->allocated() and rx_fifo,

This is probably the winner for me.

> or the code is more explicit about when
> qemu_flush_queued_packets() is called (as per my patch below), or the
> case 4 where smc91c111_release_packet() and then
> smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> which also works, albeit with more ugly code.
>
> The problem is much more reproducible with the assert btw, booting a
> qemu image with this and hitting the network interface with scp of a few
> large files is usually enough.
>
> So which patch would be preferred? :)
>
> Cheers,
>
> Richard
>
>
>
> Index: qemu-2.4.0/hw/net/smc91c111.c
> ===================================================================
> --- qemu-2.4.0.orig/hw/net/smc91c111.c
> +++ qemu-2.4.0/hw/net/smc91c111.c
> @@ -185,7 +185,6 @@ static void smc91c111_release_packet(smc
>      s->allocated &= ~(1 << packet);
>      if (s->tx_alloc == 0x80)
>          smc91c111_tx_alloc(s);
> -    qemu_flush_queued_packets(qemu_get_queue(s->nic));
>  }
>
>  /* Flush the TX FIFO.  */
> @@ -237,9 +236,11 @@ static void smc91c111_do_tx(smc91c111_st
>              }
>          }
>  #endif
> -        if (s->ctr & CTR_AUTO_RELEASE)
> +        if (s->ctr & CTR_AUTO_RELEASE) {
>              /* Race?  */
>              smc91c111_release_packet(s, packetnum);
> +            qemu_flush_queued_packets(qemu_get_queue(s->nic));
> +        }
>          else if (s->tx_fifo_done_len < NUM_PACKETS)
>              s->tx_fifo_done[s->tx_fifo_done_len++] = packetnum;
>          qemu_send_packet(qemu_get_queue(s->nic), p, len);
> @@ -379,9 +380,11 @@ static void smc91c111_writeb(void *opaqu
>                      smc91c111_release_packet(s, s->rx_fifo[0]);
>                  }
>                  smc91c111_pop_rx_fifo(s);
> +                qemu_flush_queued_packets(qemu_get_queue(s->nic));
>                  break;
>              case 5: /* Release.  */
>                  smc91c111_release_packet(s, s->packet_num);
> +                qemu_flush_queued_packets(qemu_get_queue(s->nic));

This could still cause the same issue here though I think. The guest
can release first (case 5) without a corresponding fifo pop (case 3),
which actually means that the first patch idea might be the winner as
it catches this issue too. What is supposed to happen if the guest
does a case 5 while the rx_fifo is full without a matching case 3, and
then a packet arrives on the wire?

Regards,
Peter

>                  break;
>              case 6: /* Add to TX FIFO.  */
>                  smc91c111_queue_tx(s, s->packet_num);
>

Richard Purdie Sept. 6, 2015, 11:26 p.m. UTC | #2

On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> > On 4 September 2015 at 18:20, Richard Purdie
> >> > <richard.purdie@linuxfoundation.org> wrote:
> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
> >> >>> > <richard.purdie@linuxfoundation.org> wrote:
> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >> >>> > > this.
> >> >>> >
> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >> >>>
> >> >>> I've now confirmed that it does indeed trigger the assert in
> >> >>> smc91c111_receive().
> >> >>
> >> >> I just tried an experiment where I put:
> >> >>
> >> >>     if (s->rx_fifo_len >= NUM_PACKETS)
> >> >>         return -1;
> >> >>
> >> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> >> problem.
> >>
> >> Does it just stop the crash or does it eliminate the problem
> >> completely with a fully now-working network?
> >
> > It stops the crash, the network works great.
> >
> >> >> I also noticed can_receive() could also have a check on buffer
> >> >> availability. Would one of these changes be the correct fix here?
> >> >
> >> > The interesting question is why smc91c111_allocate_packet() doesn't
> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
> >> > shared between the tx and rx buffers, so how could we both have
> >> > already filled the rx_fifo and have a spare packet for the allocate
> >> > function to return?
> >>
> >> Maybe this:
> >>
> >>             case 5: /* Release.  */
> >>                 smc91c111_release_packet(s, s->packet_num);
> >>                 break;
> >>
> >> The guest is able to free an allocated packet without the accompanying
> >> pop of tx/rx fifo. This may suggest some sort of guest error?
> >>
> >> The fix depends on the behaviour of the real hardware. If that MMIO op
> >> is supposed to dequeue the corresponding queue entry then we may need
> >> to patch that logic to do search the queues and dequeue it. Otherwise
> >> we need to find out the genuine length of the rx queue, and clamp it
> >> without something like Richards patch. There are a few other bits and
> >> pieces that suggest the guest can have independent control of the
> >> queues and allocated buffers but i'm confused as to how the rx fifo
> >> length can get up to 10 in any case.
> >
> > I think I have a handle on what is going on. smc91c111_release_packet()
> > changes s->allocated() but not rx_fifo. can_receive() only looks at
> > s->allocated. We can trigger new network packets to arrive from
> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
> > *before* we change rx_fifo and this can loop.
> >
> > The patch below which explicitly orders the qemu_flush_queued_packets()
> > call resolved the test case I was able to reproduce this problem in.
> >
> > So there are three ways to fix this, either can_receive() needs to check
> > both s->allocated() and rx_fifo,
> 
> This is probably the winner for me.
> 
> > or the code is more explicit about when
> > qemu_flush_queued_packets() is called (as per my patch below), or the
> > case 4 where smc91c111_release_packet() and then
> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> > which also works, albeit with more ugly code.

It seems can_receive isn't enough, we'd need to put some checks into
receive itself since once can_receive says "yes", multiple packets can
arrive to _receive without further checks of can_receive. I've either
messed up my previous test or been lucky.

I tested an assert in _recieve() which confirms it can be called when
can_receive() says it isn't ready.

If we return -1 in _receive, the code will stop sending packets and all
works as it should, it recovers just fine. So I think that is looking
like the correct fix. I'd note that it already effectively has half this
check in the allocate_packet call, its just missing the rx_fifo_len one.

Cheers,

Richard

Peter Crosthwaite Sept. 7, 2015, 12:48 a.m. UTC | #3

On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
>> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
>> <richard.purdie@linuxfoundation.org> wrote:
>> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
>> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> >> > On 4 September 2015 at 18:20, Richard Purdie
>> >> > <richard.purdie@linuxfoundation.org> wrote:
>> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
>> >> >>> > <richard.purdie@linuxfoundation.org> wrote:
>> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
>> >> >>> > > this.
>> >> >>> >
>> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
>> >> >>>
>> >> >>> I've now confirmed that it does indeed trigger the assert in
>> >> >>> smc91c111_receive().
>> >> >>
>> >> >> I just tried an experiment where I put:
>> >> >>
>> >> >>     if (s->rx_fifo_len >= NUM_PACKETS)
>> >> >>         return -1;
>> >> >>
>> >> >> into smc91c111_receive() and my reproducer stops reproducing the
>> >> >> problem.
>> >>
>> >> Does it just stop the crash or does it eliminate the problem
>> >> completely with a fully now-working network?
>> >
>> > It stops the crash, the network works great.
>> >
>> >> >> I also noticed can_receive() could also have a check on buffer
>> >> >> availability. Would one of these changes be the correct fix here?
>> >> >
>> >> > The interesting question is why smc91c111_allocate_packet() doesn't
>> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
>> >> > shared between the tx and rx buffers, so how could we both have
>> >> > already filled the rx_fifo and have a spare packet for the allocate
>> >> > function to return?
>> >>
>> >> Maybe this:
>> >>
>> >>             case 5: /* Release.  */
>> >>                 smc91c111_release_packet(s, s->packet_num);
>> >>                 break;
>> >>
>> >> The guest is able to free an allocated packet without the accompanying
>> >> pop of tx/rx fifo. This may suggest some sort of guest error?
>> >>
>> >> The fix depends on the behaviour of the real hardware. If that MMIO op
>> >> is supposed to dequeue the corresponding queue entry then we may need
>> >> to patch that logic to do search the queues and dequeue it. Otherwise
>> >> we need to find out the genuine length of the rx queue, and clamp it
>> >> without something like Richards patch. There are a few other bits and
>> >> pieces that suggest the guest can have independent control of the
>> >> queues and allocated buffers but i'm confused as to how the rx fifo
>> >> length can get up to 10 in any case.
>> >
>> > I think I have a handle on what is going on. smc91c111_release_packet()
>> > changes s->allocated() but not rx_fifo. can_receive() only looks at
>> > s->allocated. We can trigger new network packets to arrive from
>> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
>> > *before* we change rx_fifo and this can loop.
>> >
>> > The patch below which explicitly orders the qemu_flush_queued_packets()
>> > call resolved the test case I was able to reproduce this problem in.
>> >
>> > So there are three ways to fix this, either can_receive() needs to check
>> > both s->allocated() and rx_fifo,
>>
>> This is probably the winner for me.
>>
>> > or the code is more explicit about when
>> > qemu_flush_queued_packets() is called (as per my patch below), or the
>> > case 4 where smc91c111_release_packet() and then
>> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
>> > which also works, albeit with more ugly code.
>
> It seems can_receive isn't enough, we'd need to put some checks into
> receive itself since once can_receive says "yes", multiple packets can
> arrive to _receive without further checks of can_receive.

This doesn't sound right. There are other network controllers that
rely of can_receive catching all cases properly. Is this a regression?
Looking at logs, I see some refactoring of QEMU net framework around
June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
away?

> I've either
> messed up my previous test or been lucky.
>
> I tested an assert in _recieve() which confirms it can be called when
> can_receive() says it isn't ready.
>

A backtrace of this would be handy.

What is your replicator? I have core-image-minimal handy but it has no
scp or sshd so all I can think of to stress network is wget, but that
doesn't replicate.

Regards,
Peter

> If we return -1 in _receive, the code will stop sending packets and all
> works as it should, it recovers just fine. So I think that is looking
> like the correct fix. I'd note that it already effectively has half this
> check in the allocate_packet call, its just missing the rx_fifo_len one.
>
> Cheers,
>
> Richard
>

Richard Purdie Sept. 7, 2015, 7:09 a.m. UTC | #4

On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> > It seems can_receive isn't enough, we'd need to put some checks into
> > receive itself since once can_receive says "yes", multiple packets can
> > arrive to _receive without further checks of can_receive.
> 
> This doesn't sound right. There are other network controllers that
> rely of can_receive catching all cases properly. Is this a regression?
> Looking at logs, I see some refactoring of QEMU net framework around
> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
> away?

We weren't seeing this problem until we upgraded to 2.4.

> 
> > I've either
> > messed up my previous test or been lucky.
> >
> > I tested an assert in _recieve() which confirms it can be called when
> > can_receive() says it isn't ready.
> >
> 
> A backtrace of this would be handy.
> 
> What is your replicator? I have core-image-minimal handy but it has no
> scp or sshd so all I can think of to stress network is wget, but that
> doesn't replicate.

I've been using a core-image-sato and using the "bitbake core-image-sato
-c testimage" which runs a set of tests against the target image. It
invariably crashes on the scp test when I put an assert in receive().

To make it simpler, if I just "runqemu qemuarm" to boot the
core-image-sato, then scp a 5MB file consisting of zeros into the image,
it hits the assert after around 2% transferred.

Cheers,

Richard

Richard Purdie Sept. 7, 2015, 7:18 a.m. UTC | #5

On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> > I tested an assert in _recieve() which confirms it can be called when
> > can_receive() says it isn't ready.
> >
> 
> A backtrace of this would be handy.

This is the trace with my assert against smc91c111_can_receive(nc) being
false when we're in receive(), before we allocate_packet:

#0  0x00007f355f276267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f355f277eca in __GI_abort () at abort.c:89
#2  0x00007f355f26f03d in __assert_fail_base (fmt=0x7f355f3d1028 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7f3562158ed7 "smc91c111_can_receive(nc)", 
    file=file@entry=0x7f3562158dc8 "/media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c", line=line@entry=680, function=function@entry=0x7f35621591b0 <__PRETTY_FUNCTION__.26130> "smc91c111_receive") at assert.c:92
#3  0x00007f355f26f0f2 in __GI___assert_fail (assertion=assertion@entry=0x7f3562158ed7 "smc91c111_can_receive(nc)", 
    file=file@entry=0x7f3562158dc8 "/media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c", line=line@entry=680, function=function@entry=0x7f35621591b0 <__PRETTY_FUNCTION__.26130> "smc91c111_receive")
    at assert.c:101
#4  0x00007f3561fca4d0 in smc91c111_receive (nc=0x7f3563604d10, buf=0x7f353c09d028 "RT", size=1514)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:680
#5  0x00007f356203058b in qemu_deliver_packet (sender=<optimised out>, flags=<optimised out>, data=data@entry=0x7f353c09d028 "RT", 
    size=<optimised out>, opaque=0x7f3563604d10)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/net.c:577
#6  0x00007f3562031eaa in qemu_net_queue_deliver (size=<optimised out>, data=<optimised out>, flags=<optimised out>, 
    sender=<optimised out>, queue=0x7f3563604e70)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/queue.c:157
#7  qemu_net_queue_flush (queue=0x7f3563604e70)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/queue.c:254
#8  0x00007f356202fc7c in qemu_flush_or_purge_queued_packets (nc=0x7f3563604d10, purge=<optimised out>)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/net.c:606
#9  0x00007f3561fcacec in smc91c111_writeb (opaque=0x7f35635178f0, offset=<optimised out>, value=128)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:382
#10 0x00007f3561fcaf84 in smc91c111_writew (opaque=0x7f35635178f0, offset=0, value=128)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:612
#11 0x00007f3561e52cc5 in memory_region_oldmmio_write_accessor (mr=<optimised out>, addr=<optimised out>, value=<optimised out>, 
    size=<optimised out>, shift=<optimised out>, mask=<optimised out>, attrs=...)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:434
#12 0x00007f3561e5227d in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f35572d93f8, size=size@entry=2, 
    access_size_min=<optimised out>, access_size_max=<optimised out>, 
    access=0x7f3561e52c90 <memory_region_oldmmio_write_accessor>, mr=0x7f356351bc80, attrs=...)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#13 0x00007f3561e53d5b in memory_region_dispatch_write (mr=mr@entry=0x7f356351bc80, addr=0, data=128, size=size@entry=2, 
    attrs=attrs@entry=...) at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#14 0x00007f3561e1fc51 in address_space_rw (as=<optimised out>, addr=268500992, attrs=..., buf=buf@entry=0x7f35572d94c0 "\200", 
    len=2, is_write=is_write@entry=true)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2445
#15 0x00007f3561e1fda0 in address_space_write (len=<optimised out>, buf=0x7f35572d94c0 "\200", attrs=..., addr=<optimised out>, 
    as=<optimised out>) at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2521
#16 subpage_write (opaque=<optimised out>, addr=<optimised out>, value=<optimised out>, len=<optimised out>, attrs=...)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2081
#17 0x00007f3561e5227d in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f35572d9568, size=size@entry=2, 
    access_size_min=<optimised out>, access_size_max=<optimised out>, 
    access=0x7f3561e521a0 <memory_region_write_with_attrs_accessor>, mr=0x7f356375e360, attrs=...)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#18 0x00007f3561e53d5b in memory_region_dispatch_write (mr=0x7f356375e360, addr=0, data=128, size=2, attrs=...)
    at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#19 0x00007f3559342734 in ?? ()

the rest of the stack is just ??. So some write into the smc91c111
memory triggers qemu_flush_or_purge_queued_packets() which then keeps
delivering packets as far as I can tell.

Do we need to check can_receive() before triggering the
qemu_flush_or_purge_queued_packets() call? The code is clearly calling
this in places where the driver simply isn't ready.

(qemu_flush_queued_packets() == qemu_flush_or_purge_queued_packets())

Cheers,

Richard

Richard Purdie Sept. 7, 2015, 7:47 a.m. UTC | #6

On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> This doesn't sound right. There are other network controllers that
> rely of can_receive catching all cases properly. Is this a regression?
> Looking at logs, I see some refactoring of QEMU net framework around
> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
> away?

I did find an interesting comment in this commit:

http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9

"""
Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
net queues need to be explicitly flushed after qemu_can_send_packet()
returns false, because the netdev side will disable the polling of fd.
"""

smc91x111 is calling flush functions when it knows can_receive
would/should return false. I believe that is the bug here.

I suspect the driver needs:

* can_receive to actually return the right value
* the locations of the flush calls to be when there is receive space 

This could explain what changed to break this and why moving the flush
calls works in my patch.

Cheers,

Richard

Peter Maydell Sept. 7, 2015, 9:21 a.m. UTC | #7

CCing the net maintainers on this thread seems like it would
be a good idea...

On 7 September 2015 at 08:47, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>> This doesn't sound right. There are other network controllers that
>> rely of can_receive catching all cases properly. Is this a regression?
>> Looking at logs, I see some refactoring of QEMU net framework around
>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>> away?
>
> I did find an interesting comment in this commit:
>
> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>
> """
> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
> net queues need to be explicitly flushed after qemu_can_send_packet()
> returns false, because the netdev side will disable the polling of fd.
> """
>
> smc91x111 is calling flush functions when it knows can_receive
> would/should return false. I believe that is the bug here.
>
> I suspect the driver needs:
>
> * can_receive to actually return the right value
> * the locations of the flush calls to be when there is receive space
>
> This could explain what changed to break this and why moving the flush
> calls works in my patch.

thanks
-- PMM

Peter Crosthwaite Sept. 7, 2015, 6:05 p.m. UTC | #8

On Mon, Sep 7, 2015 at 12:09 AM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
>> <richard.purdie@linuxfoundation.org> wrote:
>> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
>> > It seems can_receive isn't enough, we'd need to put some checks into
>> > receive itself since once can_receive says "yes", multiple packets can
>> > arrive to _receive without further checks of can_receive.
>>
>> This doesn't sound right. There are other network controllers that
>> rely of can_receive catching all cases properly. Is this a regression?
>> Looking at logs, I see some refactoring of QEMU net framework around
>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>> away?
>
> We weren't seeing this problem until we upgraded to 2.4.
>
>>
>> > I've either
>> > messed up my previous test or been lucky.
>> >
>> > I tested an assert in _recieve() which confirms it can be called when
>> > can_receive() says it isn't ready.
>> >
>>
>> A backtrace of this would be handy.
>>
>> What is your replicator? I have core-image-minimal handy but it has no
>> scp or sshd so all I can think of to stress network is wget, but that
>> doesn't replicate.
>
> I've been using a core-image-sato and using the "bitbake core-image-sato
> -c testimage" which runs a set of tests against the target image. It
> invariably crashes on the scp test when I put an assert in receive().
>
> To make it simpler, if I just "runqemu qemuarm" to boot the
> core-image-sato, then scp a 5MB file consisting of zeros into the image,
> it hits the assert after around 2% transferred.
>

I can't replicate. The 5MB scp works for me. Can you bisect qemu?

Regards,
Peter

> Cheers,
>
> Richard
>

Peter Crosthwaite Sept. 7, 2015, 6:12 p.m. UTC | #9

On Mon, Sep 7, 2015 at 2:21 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> CCing the net maintainers on this thread seems like it would
> be a good idea...
>
> On 7 September 2015 at 08:47, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
>> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>>> This doesn't sound right. There are other network controllers that
>>> rely of can_receive catching all cases properly. Is this a regression?
>>> Looking at logs, I see some refactoring of QEMU net framework around
>>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>>> away?
>>
>> I did find an interesting comment in this commit:
>>
>> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>>
>> """
>> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
>> net queues need to be explicitly flushed after qemu_can_send_packet()
>> returns false, because the netdev side will disable the polling of fd.
>> """
>>
>> smc91x111 is calling flush functions when it knows can_receive
>> would/should return false. I believe that is the bug here.
>>
>> I suspect the driver needs:
>>
>> * can_receive to actually return the right value

Agreed, I think we have that figured further up the thread.

>> * the locations of the flush calls to be when there is receive space
>>

So my understanding is the net layer should be able to properly handle
a spurious flush. This avoids every device model having to if
(can_receive) { flush () }, rather than just flush. There are other
enet controllers that do this, e.g. xilinx_axienet and e1000. Has this
API change in the net layer?

Regards,
Peter

>> This could explain what changed to break this and why moving the flush
>> calls works in my patch.
>
> thanks
> -- PMM

Peter Maydell Sept. 7, 2015, 6:42 p.m. UTC | #10

On 6 September 2015 at 19:37, Peter Crosthwaite
<crosthwaitepeter@gmail.com> wrote:
> This could still cause the same issue here though I think. The guest
> can release first (case 5) without a corresponding fifo pop (case 3),
> which actually means that the first patch idea might be the winner as
> it catches this issue too. What is supposed to happen if the guest
> does a case 5 while the rx_fifo is full without a matching case 3, and
> then a packet arrives on the wire?

The data sheet seems to be unclear. Definitely incoming data when
there's no packet available to be allocated means we drop the
incoming packet and signal an rx overrun. The data sheet doesn't
say anything about what happens when the rx fifo overflows, though...

My best guess is that fifo overflow also results in an overrun report,
but you'd probably have to actually write a test case for the real
hardware to find out for sure.

thanks
-- PMM

Jason Wang Sept. 8, 2015, 9:55 a.m. UTC | #11

On 09/07/2015 05:21 PM, Peter Maydell wrote:
> CCing the net maintainers on this thread seems like it would
> be a good idea...
>
> On 7 September 2015 at 08:47, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
>> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>>> This doesn't sound right. There are other network controllers that
>>> rely of can_receive catching all cases properly. Is this a regression?
>>> Looking at logs, I see some refactoring of QEMU net framework around
>>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>>> away?
>> I did find an interesting comment in this commit:
>>
>> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>>
>> """
>> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
>> net queues need to be explicitly flushed after qemu_can_send_packet()
>> returns false, because the netdev side will disable the polling of fd.
>> """
>>
>> smc91x111 is calling flush functions when it knows can_receive
>> would/should return false. I believe that is the bug here.
>>
>> I suspect the driver needs:
>>
>> * can_receive to actually return the right value
>> * the locations of the flush calls to be when there is receive space

Yes, please do this.

>> This could explain what changed to break this and why moving the flush
>> calls works in my patch.
> thanks
> -- PMM

Segfault using qemu-system-arm in smc91c111

Commit Message

Comments

Patch