diff mbox

[ovs-dev] bond: Use correct type for slave's change_seq.

Message ID 1449194352-28765-1-git-send-email-jarno@ovn.org
State Accepted
Headers show

Commit Message

Jarno Rajahalme Dec. 4, 2015, 1:59 a.m. UTC
seq values are 64-bit, and storing them to a 32-bit variable causes
the stored value never to match actual seq value after the seq value
gets big enough.

This is a likely cause of OVS main thread using 100% CPU in a system
using bonds after some runtime.

VMware-BZ: #1564993
Reported-by: Hiram Bayless <hbayless@vmware.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
---
 ofproto/bond.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Joe Stringer Dec. 4, 2015, 2:19 a.m. UTC | #1
On 3 December 2015 at 17:59, Jarno Rajahalme <jarno@ovn.org> wrote:

> seq values are 64-bit, and storing them to a 32-bit variable causes
> the stored value never to match actual seq value after the seq value
> gets big enough.
>
> This is a likely cause of OVS main thread using 100% CPU in a system
> using bonds after some runtime.
>
> VMware-BZ: #1564993
> Reported-by: Hiram Bayless <hbayless@vmware.com>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>

It looks like this bug goes back at least as far as v1.10, have we really
not hit it until now?

Why aren't we getting any feedback from our compilers on this?

It looks like there are several places with this kind of problem:

$ grep -r unsigned.*change_seq
./lib/netdev-windows.c:    unsigned int change_seq;
./lib/ovsdb-idl.c:    unsigned int change_seqno;
./lib/ovsdb-idl.c:    unsigned int max_seqno =
table->change_seqno[OVSDB_IDL_CHANGE_INSERT];
./lib/ovsdb-idl-provider.h:    unsigned int
change_seqno[OVSDB_IDL_CHANGE_MAX];
./lib/ovsdb-idl-provider.h:    unsigned int
change_seqno[OVSDB_IDL_CHANGE_MAX];
./ofproto/bond.c:    unsigned int change_seq;    /* Tracks changes in
'netdev'. */
Jarno Rajahalme Dec. 4, 2015, 2:39 a.m. UTC | #2
> On Dec 3, 2015, at 6:19 PM, Joe Stringer <joe@ovn.org> wrote:
> 
> On 3 December 2015 at 17:59, Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>> wrote:
> seq values are 64-bit, and storing them to a 32-bit variable causes
> the stored value never to match actual seq value after the seq value
> gets big enough.
> 
> This is a likely cause of OVS main thread using 100% CPU in a system
> using bonds after some runtime.
> 
> VMware-BZ: #1564993
> Reported-by: Hiram Bayless <hbayless@vmware.com <mailto:hbayless@vmware.com>>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org <mailto:jarno@ovn.org>>
> 
> It looks like this bug goes back at least as far as v1.10, have we really not hit it until now?
> 
> Why aren't we getting any feedback from our compilers on this?
> 
> It looks like there are several places with this kind of problem:
> 
> $ grep -r unsigned.*change_seq
> ./lib/netdev-windows.c:    unsigned int change_seq;
> ./lib/ovsdb-idl.c:    unsigned int change_seqno;
> ./lib/ovsdb-idl.c:    unsigned int max_seqno = table->change_seqno[OVSDB_IDL_CHANGE_INSERT];
> ./lib/ovsdb-idl-provider.h:    unsigned int change_seqno[OVSDB_IDL_CHANGE_MAX];
> ./lib/ovsdb-idl-provider.h:    unsigned int change_seqno[OVSDB_IDL_CHANGE_MAX];
> ./ofproto/bond.c:    unsigned int change_seq;    /* Tracks changes in 'netdev'. */

It is likely that in those other cases the value is not passed to seq_wait().

Nevertheless, it is curious how the connectivity_seq would grow to such a big value to trigger this bug.

  Jarno
Joe Stringer Dec. 4, 2015, 3:03 a.m. UTC | #3
On 3 December 2015 at 18:39, Jarno Rajahalme <jarno@ovn.org> wrote:

>
> On Dec 3, 2015, at 6:19 PM, Joe Stringer <joe@ovn.org> wrote:
>
> On 3 December 2015 at 17:59, Jarno Rajahalme <jarno@ovn.org> wrote:
>
>> seq values are 64-bit, and storing them to a 32-bit variable causes
>> the stored value never to match actual seq value after the seq value
>> gets big enough.
>>
>> This is a likely cause of OVS main thread using 100% CPU in a system
>> using bonds after some runtime.
>>
>> VMware-BZ: #1564993
>> Reported-by: Hiram Bayless <hbayless@vmware.com>
>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>>
>
> It looks like this bug goes back at least as far as v1.10, have we really
> not hit it until now?
>
> Why aren't we getting any feedback from our compilers on this?
>
> It looks like there are several places with this kind of problem:
>
> $ grep -r unsigned.*change_seq
> ./lib/netdev-windows.c:    unsigned int change_seq;
> ./lib/ovsdb-idl.c:    unsigned int change_seqno;
> ./lib/ovsdb-idl.c:    unsigned int max_seqno =
> table->change_seqno[OVSDB_IDL_CHANGE_INSERT];
> ./lib/ovsdb-idl-provider.h:    unsigned int
> change_seqno[OVSDB_IDL_CHANGE_MAX];
> ./lib/ovsdb-idl-provider.h:    unsigned int
> change_seqno[OVSDB_IDL_CHANGE_MAX];
> ./ofproto/bond.c:    unsigned int change_seq;    /* Tracks changes in
> 'netdev'. */
>
>
> It is likely that in those other cases the value is not passed to
> seq_wait().
>

Looks like you're right. As far as I can tell, the netdev-windows version
doesn't do anything, and the ovsdb-idl instances out-date the existence of
struct seq. The bond one on the other hand does actually use struct seq.
So, other changes shouldn't be necessary.

Acked-by: Joe Stringer <joe@ovn.org>
Ben Pfaff Dec. 4, 2015, 6:22 a.m. UTC | #4
On Thu, Dec 03, 2015 at 06:39:46PM -0800, Jarno Rajahalme wrote:
> Nevertheless, it is curious how the connectivity_seq would grow to
> such a big value to trigger this bug.

seq values are global (see how seq_next works in lib/seq.c) so it
wouldn't have to be connectivity_seq related changes.
Ben Pfaff Dec. 4, 2015, 6:23 a.m. UTC | #5
On Thu, Dec 03, 2015 at 05:59:12PM -0800, Jarno Rajahalme wrote:
> seq values are 64-bit, and storing them to a 32-bit variable causes
> the stored value never to match actual seq value after the seq value
> gets big enough.
> 
> This is a likely cause of OVS main thread using 100% CPU in a system
> using bonds after some runtime.
> 
> VMware-BZ: #1564993
> Reported-by: Hiram Bayless <hbayless@vmware.com>
> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>

Great work!  Thank you.
Ben Pfaff Dec. 4, 2015, 6:57 a.m. UTC | #6
On Thu, Dec 03, 2015 at 10:23:19PM -0800, Ben Pfaff wrote:
> On Thu, Dec 03, 2015 at 05:59:12PM -0800, Jarno Rajahalme wrote:
> > seq values are 64-bit, and storing them to a 32-bit variable causes
> > the stored value never to match actual seq value after the seq value
> > gets big enough.
> > 
> > This is a likely cause of OVS main thread using 100% CPU in a system
> > using bonds after some runtime.
> > 
> > VMware-BZ: #1564993
> > Reported-by: Hiram Bayless <hbayless@vmware.com>
> > Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
> 
> Great work!  Thank you.

Also,
Acked-by: Ben Pfaff <blp@ovn.org>
Jarno Rajahalme Dec. 4, 2015, 7:11 p.m. UTC | #7
> On Dec 3, 2015, at 10:57 PM, Ben Pfaff <blp@ovn.org> wrote:
> 
> On Thu, Dec 03, 2015 at 10:23:19PM -0800, Ben Pfaff wrote:
>> On Thu, Dec 03, 2015 at 05:59:12PM -0800, Jarno Rajahalme wrote:
>>> seq values are 64-bit, and storing them to a 32-bit variable causes
>>> the stored value never to match actual seq value after the seq value
>>> gets big enough.
>>> 
>>> This is a likely cause of OVS main thread using 100% CPU in a system
>>> using bonds after some runtime.
>>> 
>>> VMware-BZ: #1564993
>>> Reported-by: Hiram Bayless <hbayless@vmware.com>
>>> Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
>> 
>> Great work!  Thank you.
> 
> Also,
> Acked-by: Ben Pfaff <blp@ovn.org>

Thanks for the reviews, pushed to master and branches 2.1-2.5. This bug is not present in branch-2.0 or earlier.

  Jarno
diff mbox

Patch

diff --git a/ofproto/bond.c b/ofproto/bond.c
index 1dbf8f1..c2749e5 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -84,7 +84,7 @@  struct bond_slave {
     void *aux;                  /* Client-provided handle for this slave. */
 
     struct netdev *netdev;      /* Network device, owned by the client. */
-    unsigned int change_seq;    /* Tracks changes in 'netdev'. */
+    uint64_t change_seq;        /* Tracks changes in 'netdev'. */
     ofp_port_t  ofp_port;       /* OpenFlow port number. */
     char *name;                 /* Name (a copy of netdev_get_name(netdev)). */