diff mbox

iproute2 flush: handle larger tables and deleted entries

Message ID 4e0db5bc0907130939k48b16256j8f60c786a7e5e44c@mail.gmail.com
State Not Applicable, archived
Delegated to: David Miller
Headers show

Commit Message

Gautam Kachroo July 13, 2009, 4:39 p.m. UTC
use a new netlink socket when sending flush messages to avoid reading
any pending data on the existing netlink socket.

read all of the response from the netlink request -- this response can
be split over multiple recv calls, pretty much one per netlink request
message. ENOENT errors, which correspond to attempts to delete an
already deleted entry, are ignored. Other errors are not ignored.

Signed-off-by: Gautam Kachroo <gk@aristanetworks.com>
---

resending in plain text format

rtnl_send_check calls recv after sending the delete messages. It does
this to catch errors, e.g. user doesn't have permissions to send
netlink messages. It treats *any* response as an error, even if the
message is not an NLMSG_ERROR. However, this doesn't work if the dump
is in the middle of being processed. recv will return the next chunk
of the dump. This was causing the flush operation to bail early, e.g.
when there is a large arp cache.

Ignoring ENOENT lets flush succeed even if entries have been deleted
from underneath the flush.


        int status;
@@ -122,28 +125,51 @@ int rtnl_send_check(struct rtnl_handle *rth,
const char *buf, int len)
                return status;

        /* Check for errors */
-       status = recv(rth->fd, resp, sizeof(resp), MSG_DONTWAIT);
-       if (status < 0) {
-               if (errno == EAGAIN)
-                       return 0;
-               return -1;
-       }
+       for (;;) {
+               status = recv(rth->fd, resp, sizeof(resp), MSG_DONTWAIT);
+               if (status < 0) {
+                       if (errno == EAGAIN)
+                               return 0;
+                       return -1;
+               }

-       for (h = (struct nlmsghdr *)resp; NLMSG_OK(h, status);
-            h = NLMSG_NEXT(h, status)) {
-               if (h->nlmsg_type == NLMSG_ERROR) {
-                       struct nlmsgerr *err = (struct nlmsgerr*)NLMSG_DATA(h);
-                       if (h->nlmsg_len < NLMSG_LENGTH(sizeof(struct
nlmsgerr)))
-                               fprintf(stderr, "ERROR truncated\n");
-                       else
-                               errno = -err->error;
+               for (h = (struct nlmsghdr *)resp; NLMSG_OK(h, status);
+                    h = NLMSG_NEXT(h, status)) {
+                       if (h->nlmsg_type == NLMSG_ERROR) {
+                               struct nlmsgerr *err = (struct
nlmsgerr*)NLMSG_DATA(h);
+                               if (h->nlmsg_len <
NLMSG_LENGTH(sizeof(struct nlmsgerr))) {
+                                       fprintf(stderr, "ERROR truncated\n");
+                                       return -1;
+                               }
+                               else {
+                                       if (err->error != -ENOENT) {
+                                               errno = -err->error;
+                                               return -1;
+                                       }
+                               }
+                       }
                }
-               return -1;
        }

        return 0;
 }

+/* creates a netlink socket and passes it to rtnl_send_check_impl to send the
+   message in buf
+*/
+int rtnl_send_check(struct rtnl_handle *unused, const char *buf, int len)
+{
+       struct rtnl_handle rth = { .fd = -1 };
+       int ret;
+       if (rtnl_open(&rth, 0) < 0) {
+               fprintf(stderr, "Cannot open rtnetlink\n");
+               return -1;
+       }
+       ret = rtnl_send_check_impl(&rth, buf, len);
+       rtnl_close(&rth);
+       return ret;
+}
+
 int rtnl_dump_request(struct rtnl_handle *rth, int type, void *req, int len)
 {
        struct nlmsghdr nlh;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Patrick McHardy July 14, 2009, 9:38 a.m. UTC | #1
Gautam Kachroo wrote:
> use a new netlink socket when sending flush messages to avoid reading
> any pending data on the existing netlink socket.
> 
> read all of the response from the netlink request -- this response can
> be split over multiple recv calls, pretty much one per netlink request
> message. ENOENT errors, which correspond to attempts to delete an
> already deleted entry, are ignored. Other errors are not ignored.

In which case would there be any pending data? From what I can see,
this can only happen when using batching, but in that case the
previous command should continue reading until it has received all
responses (which the netlink functions appear to be doing properly).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gautam Kachroo July 14, 2009, 4:45 p.m. UTC | #2
On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
> Gautam Kachroo wrote:
>> use a new netlink socket when sending flush messages to avoid reading
>> any pending data on the existing netlink socket.
>>
>> read all of the response from the netlink request -- this response can
>> be split over multiple recv calls, pretty much one per netlink request
>> message. ENOENT errors, which correspond to attempts to delete an
>> already deleted entry, are ignored. Other errors are not ignored.
>
> In which case would there be any pending data? From what I can see,
> this can only happen when using batching, but in that case the
> previous command should continue reading until it has received all
> responses (which the netlink functions appear to be doing properly).

What is the "previous command"?
Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
a failure, rtnl_dump_filter will not continue reading.

Here's the situation that I'm referring to:

If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
called from flush_update. The multiple implementations of flush_update
(e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
caller, e.g. print_neigh or print_addrinfo.

print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
However, it returns the error value if the filter function (e.g.
print_neigh) returns an error. In this case, rtnl_dump_filter can
return before it's read all the responses.
The error return from rtnl_dump_filter causes the program to exit.

Note, rtnl_send_check treats any pending data as an error. It could be
changed to only complain if the response contains NLMSG_ERROR
messages, but that delays error detection because it will read all the
responses to the dump request first.

thanks,
-gk
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy July 15, 2009, 3:19 p.m. UTC | #3
Gautam Kachroo wrote:
> On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
>> Gautam Kachroo wrote:
>>> use a new netlink socket when sending flush messages to avoid reading
>>> any pending data on the existing netlink socket.
>>>
>>> read all of the response from the netlink request -- this response can
>>> be split over multiple recv calls, pretty much one per netlink request
>>> message. ENOENT errors, which correspond to attempts to delete an
>>> already deleted entry, are ignored. Other errors are not ignored.
>>
>> In which case would there be any pending data? From what I can see,
>> this can only happen when using batching, but in that case the
>> previous command should continue reading until it has received all
>> responses (which the netlink functions appear to be doing properly).
> 
> What is the "previous command"?

The last command before the one executing when using batching.

> Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
> a failure, rtnl_dump_filter will not continue reading.
> 
> Here's the situation that I'm referring to:
> 
> If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
> called from flush_update. The multiple implementations of flush_update
> (e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
> caller, e.g. print_neigh or print_addrinfo.
> 
> print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
> rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
> However, it returns the error value if the filter function (e.g.
> print_neigh) returns an error. In this case, rtnl_dump_filter can
> return before it's read all the responses.
> The error return from rtnl_dump_filter causes the program to exit.

Yes, and I agree with your patch so far. My question is why you
need another socket.

> use a new netlink socket when sending flush messages to avoid reading
> any pending data on the existing netlink socket.

Under what circumstances would there be pending data when
performing a new iproute operation?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gautam Kachroo July 15, 2009, 5:50 p.m. UTC | #4
On Wed, Jul 15, 2009 at 8:19 AM, Patrick McHardy<kaber@trash.net> wrote:
> Gautam Kachroo wrote:
>> On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
>>> Gautam Kachroo wrote:
>>>> use a new netlink socket when sending flush messages to avoid reading
>>>> any pending data on the existing netlink socket.
>>>>
>>>> read all of the response from the netlink request -- this response can
>>>> be split over multiple recv calls, pretty much one per netlink request
>>>> message. ENOENT errors, which correspond to attempts to delete an
>>>> already deleted entry, are ignored. Other errors are not ignored.
>>>
>>> In which case would there be any pending data? From what I can see,
>>> this can only happen when using batching, but in that case the
>>> previous command should continue reading until it has received all
>>> responses (which the netlink functions appear to be doing properly).
>>
>> What is the "previous command"?
>
> The last command before the one executing when using batching.

This is independent of batching (I assume you're referring to the
-batch option to the ip command).
It happens when running a command like "ip neigh flush to 0.0.0.0/0"
if there are many neighbor entries.

The implementation of flush commands, e.g. ip neigh flush, sends a
dump request, e.g. RTM_GETNEIGH, and then sends requests, e.g.
RTM_DELNEIGH, *while* there can be unread data from the dump request.
There would be unread data if the response to the dump request was
split over multiple calls to recvmsg.

>> Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
>> a failure, rtnl_dump_filter will not continue reading.
>>
>> Here's the situation that I'm referring to:
>>
>> If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
>> called from flush_update. The multiple implementations of flush_update
>> (e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
>> caller, e.g. print_neigh or print_addrinfo.
>>
>> print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
>> rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
>> However, it returns the error value if the filter function (e.g.
>> print_neigh) returns an error. In this case, rtnl_dump_filter can
>> return before it's read all the responses.
>> The error return from rtnl_dump_filter causes the program to exit.
>
> Yes, and I agree with your patch so far. My question is why you
> need another socket.
>
>> use a new netlink socket when sending flush messages to avoid reading
>> any pending data on the existing netlink socket.
>
> Under what circumstances would there be pending data when
> performing a new iproute operation?

As above, it's not that there is pending data when performing a new
iproute operation, it's that there can be pending data while
performing a single iproute operation, namely ip <object> flush.
The benefit of a new socket is that it won't have any data from the
dump request waiting for it.

thanks,
-gk
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
stephen hemminger July 15, 2009, 7:19 p.m. UTC | #5
On Wed, 15 Jul 2009 10:50:57 -0700
Gautam Kachroo <gk@aristanetworks.com> wrote:

> On Wed, Jul 15, 2009 at 8:19 AM, Patrick McHardy<kaber@trash.net> wrote:
> > Gautam Kachroo wrote:
> >> On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
> >>> Gautam Kachroo wrote:
> >>>> use a new netlink socket when sending flush messages to avoid reading
> >>>> any pending data on the existing netlink socket.
> >>>>
> >>>> read all of the response from the netlink request -- this response can
> >>>> be split over multiple recv calls, pretty much one per netlink request
> >>>> message. ENOENT errors, which correspond to attempts to delete an
> >>>> already deleted entry, are ignored. Other errors are not ignored.
> >>>
> >>> In which case would there be any pending data? From what I can see,
> >>> this can only happen when using batching, but in that case the
> >>> previous command should continue reading until it has received all
> >>> responses (which the netlink functions appear to be doing properly).
> >>
> >> What is the "previous command"?
> >
> > The last command before the one executing when using batching.
> 
> This is independent of batching (I assume you're referring to the
> -batch option to the ip command).
> It happens when running a command like "ip neigh flush to 0.0.0.0/0"
> if there are many neighbor entries.
> 
> The implementation of flush commands, e.g. ip neigh flush, sends a
> dump request, e.g. RTM_GETNEIGH, and then sends requests, e.g.
> RTM_DELNEIGH, *while* there can be unread data from the dump request.
> There would be unread data if the response to the dump request was
> split over multiple calls to recvmsg.
> 
> >> Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
> >> a failure, rtnl_dump_filter will not continue reading.
> >>
> >> Here's the situation that I'm referring to:
> >>
> >> If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
> >> called from flush_update. The multiple implementations of flush_update
> >> (e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
> >> caller, e.g. print_neigh or print_addrinfo.
> >>
> >> print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
> >> rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
> >> However, it returns the error value if the filter function (e.g.
> >> print_neigh) returns an error. In this case, rtnl_dump_filter can
> >> return before it's read all the responses.
> >> The error return from rtnl_dump_filter causes the program to exit.
> >
> > Yes, and I agree with your patch so far. My question is why you
> > need another socket.
> >
> >> use a new netlink socket when sending flush messages to avoid reading
> >> any pending data on the existing netlink socket.
> >
> > Under what circumstances would there be pending data when
> > performing a new iproute operation?
> 
> As above, it's not that there is pending data when performing a new
> iproute operation, it's that there can be pending data while
> performing a single iproute operation, namely ip <object> flush.
> The benefit of a new socket is that it won't have any data from the
> dump request waiting for it.

I posted a better fix (using MSG_PEEK).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gautam Kachroo July 15, 2009, 10:04 p.m. UTC | #6
On Wed, Jul 15, 2009 at 12:19 PM, Stephen
Hemminger<shemminger@vyatta.com> wrote:
> On Wed, 15 Jul 2009 10:50:57 -0700
> Gautam Kachroo <gk@aristanetworks.com> wrote:
>
>> On Wed, Jul 15, 2009 at 8:19 AM, Patrick McHardy<kaber@trash.net> wrote:
>> > Gautam Kachroo wrote:
>> >> On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
>> >>> Gautam Kachroo wrote:
>> >>>> use a new netlink socket when sending flush messages to avoid reading
>> >>>> any pending data on the existing netlink socket.
>> >>>>
>> >>>> read all of the response from the netlink request -- this response can
>> >>>> be split over multiple recv calls, pretty much one per netlink request
>> >>>> message. ENOENT errors, which correspond to attempts to delete an
>> >>>> already deleted entry, are ignored. Other errors are not ignored.
>> >>>
>> >>> In which case would there be any pending data? From what I can see,
>> >>> this can only happen when using batching, but in that case the
>> >>> previous command should continue reading until it has received all
>> >>> responses (which the netlink functions appear to be doing properly).
>> >>
>> >> What is the "previous command"?
>> >
>> > The last command before the one executing when using batching.
>>
>> This is independent of batching (I assume you're referring to the
>> -batch option to the ip command).
>> It happens when running a command like "ip neigh flush to 0.0.0.0/0"
>> if there are many neighbor entries.
>>
>> The implementation of flush commands, e.g. ip neigh flush, sends a
>> dump request, e.g. RTM_GETNEIGH, and then sends requests, e.g.
>> RTM_DELNEIGH, *while* there can be unread data from the dump request.
>> There would be unread data if the response to the dump request was
>> split over multiple calls to recvmsg.
>>
>> >> Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
>> >> a failure, rtnl_dump_filter will not continue reading.
>> >>
>> >> Here's the situation that I'm referring to:
>> >>
>> >> If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
>> >> called from flush_update. The multiple implementations of flush_update
>> >> (e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
>> >> caller, e.g. print_neigh or print_addrinfo.
>> >>
>> >> print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
>> >> rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
>> >> However, it returns the error value if the filter function (e.g.
>> >> print_neigh) returns an error. In this case, rtnl_dump_filter can
>> >> return before it's read all the responses.
>> >> The error return from rtnl_dump_filter causes the program to exit.
>> >
>> > Yes, and I agree with your patch so far. My question is why you
>> > need another socket.
>> >
>> >> use a new netlink socket when sending flush messages to avoid reading
>> >> any pending data on the existing netlink socket.
>> >
>> > Under what circumstances would there be pending data when
>> > performing a new iproute operation?
>>
>> As above, it's not that there is pending data when performing a new
>> iproute operation, it's that there can be pending data while
>> performing a single iproute operation, namely ip <object> flush.
>> The benefit of a new socket is that it won't have any data from the
>> dump request waiting for it.
>
> I posted a better fix (using MSG_PEEK).

Where did you post the fix? I didn't see it on netdev or in the iproute2 git...

I had considered using MSG_PEEK in rtnl_send_check, but I don't think
that notices errors with the requests in the "buf" argument of
rtnl_send_check if there is already pending data -- the recv will peek
the next chunk of the dump response. The error response will be
waiting in the queue after the dump response.
Of course, an error, e.g. EPERM, will eventually be noticed, just not
as early...

thanks,
-gk
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gautam Kachroo Aug. 21, 2009, 12:08 a.m. UTC | #7
On Wed, Jul 15, 2009 at 3:04 PM, Gautam Kachroo<gk@aristanetworks.com> wrote:
> On Wed, Jul 15, 2009 at 12:19 PM, Stephen
> Hemminger<shemminger@vyatta.com> wrote:
>> On Wed, 15 Jul 2009 10:50:57 -0700
>> Gautam Kachroo <gk@aristanetworks.com> wrote:
>>
>>> On Wed, Jul 15, 2009 at 8:19 AM, Patrick McHardy<kaber@trash.net> wrote:
>>> > Gautam Kachroo wrote:
>>> >> On Tue, Jul 14, 2009 at 2:38 AM, Patrick McHardy<kaber@trash.net> wrote:
>>> >>> Gautam Kachroo wrote:
>>> >>>> use a new netlink socket when sending flush messages to avoid reading
>>> >>>> any pending data on the existing netlink socket.
>>> >>>>
>>> >>>> read all of the response from the netlink request -- this response can
>>> >>>> be split over multiple recv calls, pretty much one per netlink request
>>> >>>> message. ENOENT errors, which correspond to attempts to delete an
>>> >>>> already deleted entry, are ignored. Other errors are not ignored.
>>> >>>
>>> >>> In which case would there be any pending data? From what I can see,
>>> >>> this can only happen when using batching, but in that case the
>>> >>> previous command should continue reading until it has received all
>>> >>> responses (which the netlink functions appear to be doing properly).
>>> >>
>>> >> What is the "previous command"?
>>> >
>>> > The last command before the one executing when using batching.
>>>
>>> This is independent of batching (I assume you're referring to the
>>> -batch option to the ip command).
>>> It happens when running a command like "ip neigh flush to 0.0.0.0/0"
>>> if there are many neighbor entries.
>>>
>>> The implementation of flush commands, e.g. ip neigh flush, sends a
>>> dump request, e.g. RTM_GETNEIGH, and then sends requests, e.g.
>>> RTM_DELNEIGH, *while* there can be unread data from the dump request.
>>> There would be unread data if the response to the dump request was
>>> split over multiple calls to recvmsg.
>>>
>>> >> Are you referring to rtnl_dump_filter? If rtnl_send_check comes across
>>> >> a failure, rtnl_dump_filter will not continue reading.
>>> >>
>>> >> Here's the situation that I'm referring to:
>>> >>
>>> >> If rtnl_send_check detects an error, it returns -1. rtnl_send_check is
>>> >> called from flush_update. The multiple implementations of flush_update
>>> >> (e.g. in ipneigh.c, ipaddress.c) propagate this return value to their
>>> >> caller, e.g. print_neigh or print_addrinfo.
>>> >>
>>> >> print_neigh, print_addrinfo, etc. are called from rtnl_dump_filter.
>>> >> rtnl_dump_filter sits in a loop calling recvmsg on the netlink socket.
>>> >> However, it returns the error value if the filter function (e.g.
>>> >> print_neigh) returns an error. In this case, rtnl_dump_filter can
>>> >> return before it's read all the responses.
>>> >> The error return from rtnl_dump_filter causes the program to exit.
>>> >
>>> > Yes, and I agree with your patch so far. My question is why you
>>> > need another socket.
>>> >
>>> >> use a new netlink socket when sending flush messages to avoid reading
>>> >> any pending data on the existing netlink socket.
>>> >
>>> > Under what circumstances would there be pending data when
>>> > performing a new iproute operation?
>>>
>>> As above, it's not that there is pending data when performing a new
>>> iproute operation, it's that there can be pending data while
>>> performing a single iproute operation, namely ip <object> flush.
>>> The benefit of a new socket is that it won't have any data from the
>>> dump request waiting for it.
>>
>> I posted a better fix (using MSG_PEEK).
>
> Where did you post the fix? I didn't see it on netdev or in the iproute2 git...

> I had considered using MSG_PEEK in rtnl_send_check, but I don't think
> that notices errors with the requests in the "buf" argument of
> rtnl_send_check if there is already pending data -- the recv will peek
> the next chunk of the dump response. The error response will be
> waiting in the queue after the dump response.
> Of course, an error, e.g. EPERM, will eventually be noticed, just not
> as early...

I saw commit 2d8240f8d95dfdc276dcf447623129fb5ccedcd6.

Using MSG_PEEK will prevent pending data from being removed during the
check for errors, but re-using the same socket means that errors won't
be detected until all the pending data has been read.

rtnl_send_check still treats ENOENT as an error. It seems better for
flush to ignore ENOENT. That way a flush will not be disrupted by an
entry being removed since that's not really an error for a flush
operation.

thanks,
-gk


> thanks,
> -gk
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -111,7 +111,10 @@  int rtnl_send(struct rtnl_handle *rth, const char
*buf, int len)
        return send(rth->fd, buf, len, 0);
 }

-int rtnl_send_check(struct rtnl_handle *rth, const char *buf, int len)
+/* send the message in buf using the socket in rth and check for errors
+   (ignoring ENOENT errors)
+*/
+int rtnl_send_check_impl(struct rtnl_handle *rth, const char *buf, int len)
 {
        struct nlmsghdr *h;