igmp: Allow mininum interval specification for igmp timers.

Message ID	alpine.DEB.2.00.1009221354410.32661@router.home
State	Rejected, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Date: Wed, 22 Sep 2010 13:59:30 -0500 (CDT) From: Christoph Lameter <cl@linux.com> To: linux-rdma@vger.kernel.org, netdev@vger.kernel.org cc: Bob Arendt <rda@rincon.com>, "David S. Miller" <davem@davemloft.net>, David L Stevens <dlstevens@us.ibm.com> Subject: igmp: Allow mininum interval specification for igmp timers. Message-ID: <alpine.DEB.2.00.1009221354410.32661@router.home> User-Agent: Alpine 2.00 (DEB 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: netdev-owner@vger.kernel.org Precedence: bulk

Christoph Lameter (Ampere) Sept. 22, 2010, 6:59 p.m. UTC

IGMP timers sometimes fire too rapidly due to randomization of the
intervalsfrom 0 to max_delay in igmp_start_timer(). For some situations
(like the initial IGMP reports that are not responses to an IGMP query) we
do not want them in too rapid succession otherwise all the initial reports
may be lost due to a race conditions with the reconfiguration of the
routers and switches going on via the link layer (like on Infiniband). If
those are lost then the router will only discover that a new mc group was
joined when the igmp query was sent. General IGMP queries may be sent
rarely on large fabrics resulting in excessively long wait times until
data starts flowing. The application may abort before then concluding that
the network hardware is not operational.

The worst case scenario without the changes will send 3 igmp reports on join:

First		3 jiffies ("immediate" (spec) ~3 ms)
Second		3 jiffies (randomization leads to shortest interval) 3 ms
Third		3 jiffies (randomization leads to shortest interval) 3 ms

Which may result in a total of less than 10ms until the kernel gives up sending
igmp requests.

Change the IGMP layer to allow the specification of minimum and maximum delay.
Calculate the IGMP_Unsolicated_Report interval based on what the interval
before this patch would be on a 100HZ kernel. 3 jiffies at 100 HZ would result
in a mininum ~30 milliseconds spacing between the initial two IGMP reports.
Round it up to 40ms.

This will result in 3 initial unsolicited reports

First	"immediately"	3 jiffies (~ 3ms)
Second	randomized 40ms to 10seconds later
Third	randomized 40ms	to 10seconds later

So a mininum of ~83ms will pass before the unsolicted reports are
given up.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 net/ipv4/igmp.c |   45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller Sept. 24, 2010, 4:38 a.m. UTC | #1

From: Christoph Lameter <cl@linux.com>
Date: Wed, 22 Sep 2010 13:59:30 -0500 (CDT)

> IGMP timers sometimes fire too rapidly due to randomization of the
> intervalsfrom 0 to max_delay in igmp_start_timer().
 ...
> Signed-off-by: Christoph Lameter <cl@linux.com>


This change seems reasonable to me, what do you think David?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Stevens Sept. 27, 2010, 5:41 p.m. UTC | #2

David Miller <davem@davemloft.net> wrote on 09/23/2010 09:38:23 PM:

> 
> > IGMP timers sometimes fire too rapidly due to randomization of the
> > intervalsfrom 0 to max_delay in igmp_start_timer().
>  ...
> > Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> 
> This change seems reasonable to me, what do you think David?

[sorry for the delay -- I was off-line for the last few days]
Dave,
        I don't know if you saw the more extended discussion we
had on this or not, but I think while this would help for IB,
it's not appropriate in general. These can in fact be "0" per
RFC which is worst case for IB if there is a delay for being
able to use the group, and the newer IGMPv3 standard has shortened
the max interval from 10sec in v2 to 1 sec.
        Fundamentally, the problem is that the device needs to
be able to send on the group immediately for IGMP; that it
can't for IB is the problem, and I think it should be solved
in IB by either queueing packets there or delaying there as
needed before doing the joins.
        I don't think tweaking IGMP for this is appropriate at
all, but if done there, it ought to be per-interface so it
doesn't change anything for other network types which don't
have this problem. It should be randomized and not the fixed
delays to prevent storms on a mass start-up, and we also don't
want to be increasing the number of duplicates for other
network types. The default should be 2 reports in randomized
0-10 sec for each for v2, 2 in randomized 0-1 sec for v3.

                                                        +-DLS

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Miller Sept. 27, 2010, 5:54 p.m. UTC | #3

From: David Stevens <dlstevens@us.ibm.com>
Date: Mon, 27 Sep 2010 10:41:20 -0700

>         I don't know if you saw the more extended discussion we
> had on this or not, but I think while this would help for IB,
> it's not appropriate in general. These can in fact be "0" per
> RFC which is worst case for IB if there is a delay for being
> able to use the group, and the newer IGMPv3 standard has shortened
> the max interval from 10sec in v2 to 1 sec.

I did see the extended discussion, and it was interesting :-)

But that mainly focused on the second patch, which I appropriately
marked as needing changes in patchwork.

This patch on the other hand is attacking a different problem,
namely avoiding the worst cases caused by the randomization we
do for the timer.

With bad luck this thing times out way too fast because the total of
all of the randomized intervals can end up being very small, and I
think we should fix that independently of the other issues hit by the
IB folks.

Don't you agree?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Stevens Sept. 27, 2010, 6:16 p.m. UTC | #4

netdev-owner@vger.kernel.org wrote on 09/27/2010 10:54:44 AM:

> From: David Miller <davem@davemloft.net>
> To: David Stevens/Beaverton/IBM@IBMUS
> Cc: cl@linux.com, linux-rdma@vger.kernel.org, 
> netdev@vger.kernel.org, netdev-owner@vger.kernel.org, rda@rincon.com
> Date: 09/27/2010 10:54 AM
> Subject: Re: igmp: Allow mininum interval specification for igmp timers.
> Sent by: netdev-owner@vger.kernel.org
> 
> From: David Stevens <dlstevens@us.ibm.com>
> Date: Mon, 27 Sep 2010 10:41:20 -0700
> 
> >         I don't know if you saw the more extended discussion we
> > had on this or not, but I think while this would help for IB,
> > it's not appropriate in general. These can in fact be "0" per
> > RFC which is worst case for IB if there is a delay for being
> > able to use the group, and the newer IGMPv3 standard has shortened
> > the max interval from 10sec in v2 to 1 sec.
> 
> I did see the extended discussion, and it was interesting :-)
> 
> But that mainly focused on the second patch, which I appropriately
> marked as needing changes in patchwork.

        OK, I'm not sure I've seen them all; haven't caught up on
e-mail yet.

> This patch on the other hand is attacking a different problem,
> namely avoiding the worst cases caused by the randomization we
> do for the timer.

        I think the multiples are to allow for drops and the
randomization is to prevent storms. As far as IGMP is concerned,
it's perfectly fine to send them back-to-back, since drops are
not necessarily time periods of network outage (as with IB) but
rather transient queue overflows where even the short delay of
a "0" timer but still having protocol and packet transmit delay
would be fine.
 
> With bad luck this thing times out way too fast because the total of
> all of the randomized intervals can end up being very small, and I
> think we should fix that independently of the other issues hit by the
> IB folks.
> 
> Don't you agree?

        If you mean enforcing a minimum spacing higher than a "0" timer,
I don't know that it's an issue for other network types. According to
IGMPv3, all of them (3 total) on average would be sent in 1 sec, but
it also isn't fatal to drop all of them. To the extent that 1 sec is
"small," it is intentional.
        I'll try digging out the particular patch and comment. I'm not
sure many of these tweaks would necessarily hurt other network types
but I think the current code also isn't a problem for anything but IB,
and that issue can be fixed within IB.

                                                                +-DLS


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Stevens Sept. 27, 2010, 7:55 p.m. UTC | #5

David Miller <davem@davemloft.net> wrote on 09/27/2010 10:54:44 AM:

> This patch on the other hand is attacking a different problem,
> namely avoiding the worst cases caused by the randomization we
> do for the timer.
> 
> With bad luck this thing times out way too fast because the total of
> all of the randomized intervals can end up being very small, and I
> think we should fix that independently of the other issues hit by the
> IB folks.

        I think I'm caught up on the discussion now. For IGMPv3, we
would send all the reports always in < 2 secs, and the average would
be < 1 sec, so I'm not sure any sort of tweaks we do to enforce a
minimum randomized interval are compatible with IGMPv3 and still
solve IB's problem.
        As I said before, I think per protocol, back-to-back is both
allowed and not a problem, even if both subsequent randomized reports
come out to 0 time. But if we wanted to enforce a minimum interval
of, say, X, then I think the better way to do that is to set the
timer to X + rand(Interval-X) and not a table of fixed intervals
as in the original patch. For v2, X=1 or 2 sec and Interval=10
might work well, but for v3, the entire interval is 1 sec and I
think I saw that the set-up time for the fabric may be on the
order of 1 sec.
        I also don't think that we want those kinds of delays on
Ethernet. A program may join and send lots of traffic in 1 sec,
and if the immediate join is lost, one of the quickly-following
<1 sec duplicate reports will make it recover and work. Delaying
the minimum would guarantee it wouldn't work until that minimum
and drop all that traffic if the immediate report is lost, then.
        Really, of course, I think the solution belongs in IB, but
if we did anything in IGMP, I'd prefer it were a per-interface
tunable that defaults as in the RFC. Since you can change the
interval and # of reports through a querier now, exporting the
default values of (10,2) for v2 and (1,2) for v3 to instead be
per-interface tunables and then bumped as needed for IB would
allow tweaking without running a querier. But a querier that's
using default values would also override that and cause the
problem all over again. Queuing in the driver until the MAC
address is usable solves it generally.

        Also, MLD and IPv6 will have all these same issues, and
working multicasting is even more important there.

                                                                +-DLS


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) Sept. 27, 2010, 8:20 p.m. UTC | #6

On Mon, 27 Sep 2010, David Stevens wrote:

> > With bad luck this thing times out way too fast because the total of
> > all of the randomized intervals can end up being very small, and I
> > think we should fix that independently of the other issues hit by the
> > IB folks.
>
>         I think I'm caught up on the discussion now. For IGMPv3, we
> would send all the reports always in < 2 secs, and the average would
> be < 1 sec, so I'm not sure any sort of tweaks we do to enforce a
> minimum randomized interval are compatible with IGMPv3 and still
> solve IB's problem.

Ok thanks for the effort but so far I do not see you having caught up. I'd
rather avoid responding to the misleading statements you made in other
replies and just respond to where you missed the boat here.

>         As I said before, I think per protocol, back-to-back is both
> allowed and not a problem, even if both subsequent randomized reports
> come out to 0 time. But if we wanted to enforce a minimum interval
> of, say, X, then I think the better way to do that is to set the
> timer to X + rand(Interval-X) and not a table of fixed intervals

The second patch sets the intervals to X .. X + Rand (interval) and not to
a table of fixed intervals as you state here. I have pointed this out
before.

> as in the original patch. For v2, X=1 or 2 sec and Interval=10
> might work well, but for v3, the entire interval is 1 sec and I
> think I saw that the set-up time for the fabric may be on the
> order of 1 sec.

Again there is no knowledge about V2 or V3 without a query and this is
during the period when no querier is known yet. You stated elsewhere that
I can assume V3 by default? So 1 sec?

>         I also don't think that we want those kinds of delays on
> Ethernet. A program may join and send lots of traffic in 1 sec,
> and if the immediate join is lost, one of the quickly-following
> <1 sec duplicate reports will make it recover and work. Delaying
> the minimum would guarantee it wouldn't work until that minimum
> and drop all that traffic if the immediate report is lost, then.

There can be any number of reasons that a short outage could prevent the
packets from going through. A buffer overrun (that you mentioned
elsewhere) usually causes lots of packets to be lost. Buffer overrun
scenarios usually mean that all igmp queries are lost.

>         Really, of course, I think the solution belongs in IB, but
> if we did anything in IGMP, I'd prefer it were a per-interface
> tunable that defaults as in the RFC. Since you can change the
> interval and # of reports through a querier now, exporting the
> default values of (10,2) for v2 and (1,2) for v3 to instead be
> per-interface tunables and then bumped as needed for IB would
> allow tweaking without running a querier. But a querier that's
> using default values would also override that and cause the
> problem all over again. Queuing in the driver until the MAC
> address is usable solves it generally.

There is no solution on the IB layer since there is no notification when
the fabric reconfiguration necessary for an multicast group is complete.

The querier is of not use since (for the gazillionth of times) this is an
unsolicited IGMP report. If there is a querier then the unsolicited igmp
reports would not be used but the timeout indicated by the querier would
be used.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Stevens Sept. 27, 2010, 9:45 p.m. UTC | #7

Christoph Lameter <cl@linux.com> wrote on 09/27/2010 01:20:54 PM:

> 
> >         As I said before, I think per protocol, back-to-back is both
> > allowed and not a problem, even if both subsequent randomized reports
> > come out to 0 time. But if we wanted to enforce a minimum interval
> > of, say, X, then I think the better way to do that is to set the
> > timer to X + rand(Interval-X) and not a table of fixed intervals
> 
> The second patch sets the intervals to X .. X + Rand (interval) and not 
to
> a table of fixed intervals as you state here. I have pointed this out
> before.

        Sorry if I've misunderstood something you're proposing, but what
you describe above would be certainly technically incorrect. There are
really no circumstances for sending a report greater than <Interval>
that is protocol-compliant. You can enforce a minimum greater than 0,
which is a departure from both RFCs, though IGMPv2 uses wishy-washy
language. The intent for both was to explicitly allow 0, IMO.

> 
> > as in the original patch. For v2, X=1 or 2 sec and Interval=10
> > might work well, but for v3, the entire interval is 1 sec and I
> > think I saw that the set-up time for the fabric may be on the
> > order of 1 sec.
> 
> Again there is no knowledge about V2 or V3 without a query and this is
> during the period when no querier is known yet. You stated elsewhere 
that
> I can assume V3 by default? So 1 sec?

        Yes, without a querier or the tunable to force it to IGMPv2,
the default is IGMPv3. It appears there is a bug where IGMPv3 is also
using a 10sec interval (haven't verified that), but a 1 sec interval
as required makes your situation worse, not better. It makes it even
more likely that all the initial reports will occur before your set-up
is done.
 
> There can be any number of reasons that a short outage could prevent the
> packets from going through. A buffer overrun (that you mentioned
> elsewhere) usually causes lots of packets to be lost. Buffer overrun
> scenarios usually mean that all igmp queries are lost.

        You're arguing against protocol compliance. I didn't define
the protocol, I only implemented it. And your view is through the
IB lens, but I don't believe this is an actual problem in any way
for typical networks. If you wrote a standards-track RFC that modifies
IGMP for NBMA networks that require a delay or different parameters
there, I'd have no objection to implementing that. Unilaterally
changing linux's behavior on all network types without cause for
departing from RFC on the most common types is another matter.


> There is no solution on the IB layer since there is no notification when
> the fabric reconfiguration necessary for an multicast group is complete.

        Certainly that's not true; without notification, you can queue for
first use of a new hardware multicast address and send the queue after an
appropriate delay (1 sec? If that covers your set-up time). If you had
positive acknowledgement from the IB network, you'd know exactly when to
do it, but there's no need to change anything for non-IB networks here.

> The querier is of not use since (for the gazillionth of times) this is 
an
> unsolicited IGMP report. If there is a querier then the unsolicited igmp
> reports would not be used but the timeout indicated by the querier would
> be used.

        A querier affects unsolicited reports because it sets both the
query interval and the robustness value. If you want to send 10 reports,
you can cause that by having a querier that sets it to that many. The
initial join would then send 10 reports and the query interval can also
be as low as you like.
        But the linux code is not just for your particular problem or
particular configuration. You can solve your problem by adding a querier,
but I know you're trying to do it without. The mail I was responding to
referred also to the case of a querier present, which is actually the
"normal" case for using full IGMP is. I'm saying that for the non-querier
case, making those per-interface configurable is reasonable because
they *are* querier-changeable, but you can also use a querier to change
it _for_the_unsolicited_reports_, as well as making the querier interval
small enough that you don't have to care at all whether any or all of
the unsolicited reports are lost.

                                                                +-DLS

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter (Ampere) Sept. 28, 2010, 6:42 p.m. UTC | #8

On Mon, 27 Sep 2010, David Stevens wrote:

> > The second patch sets the intervals to X .. X + Rand (interval) and not
> to
> > a table of fixed intervals as you state here. I have pointed this out
> > before.
>
>         Sorry if I've misunderstood something you're proposing, but what
> you describe above would be certainly technically incorrect. There are
> really no circumstances for sending a report greater than <Interval>
> that is protocol-compliant. You can enforce a minimum greater than 0,
> which is a departure from both RFCs, though IGMPv2 uses wishy-washy
> language. The intent for both was to explicitly allow 0, IMO.

There is no igmp interval set by any igmp query yet so this is your usual
unresponsive crappy response to something else that we are not talking
about.

I thought you were talking about the "fixed intervals" that you saw in the
patch. These initial intervals are for unsolicited igmp reports (do I need
to add that statement to every sentence in a thread where we *only*
discuss unsolicited igmp issues?) and those "intervals" are randomized
and not fixed.

> > > as in the original patch. For v2, X=1 or 2 sec and Interval=10
> > > might work well, but for v3, the entire interval is 1 sec and I
> > > think I saw that the set-up time for the fabric may be on the
> > > order of 1 sec.
> >
> > Again there is no knowledge about V2 or V3 without a query and this is
> > during the period when no querier is known yet. You stated elsewhere
> that
> > I can assume V3 by default? So 1 sec?
>
>         Yes, without a querier or the tunable to force it to IGMPv2,
> the default is IGMPv3. It appears there is a bug where IGMPv3 is also
> using a 10sec interval (haven't verified that), but a 1 sec interval

You do not have the linux source code tree available?

from net/ipv4/igmp.c:

#define IGMP_Unsolicited_Report_Interval        (10*HZ)

> as required makes your situation worse, not better. It makes it even
> more likely that all the initial reports will occur before your set-up
> is done.

Right. So can you please give me an approach that considers all these
issues and does not invent problem that do not exist, stays within the
subject discussed and follows the RFCs?

  > > There can be any number of reasons that a short outage could prevent the
> > packets from going through. A buffer overrun (that you mentioned
> > elsewhere) usually causes lots of packets to be lost. Buffer overrun
> > scenarios usually mean that all igmp queries are lost.
>
>         You're arguing against protocol compliance. I didn't define
> the protocol, I only implemented it. And your view is through the
> IB lens, but I don't believe this is an actual problem in any way
> for typical networks. If you wrote a standards-track RFC that modifies
> IGMP for NBMA networks that require a delay or different parameters
> there, I'd have no objection to implementing that. Unilaterally
> changing linux's behavior on all network types without cause for
> departing from RFC on the most common types is another matter.

The RFCs state that the igmp queries have to be repeated at least 3
times. The first patch ensures that a mininum time passes between two igmp
reports (to avoid them getting lost in one go). The second patch doubles
the number of igmp reports and increases the intervals so that we still
have a chance to process the join before the next igmp query is send by
the router (which can be minuates away).

It fixes buggy havior that we see because multicast joins take very long
or fail outright.

 > > There is no solution on the IB layer since there is no notification when
> > the fabric reconfiguration necessary for an multicast group is complete.
>
>         Certainly that's not true; without notification, you can queue for
> first use of a new hardware multicast address and send the queue after an
> appropriate delay (1 sec? If that covers your set-up time). If you had
> positive acknowledgement from the IB network, you'd know exactly when to
> do it, but there's no need to change anything for non-IB networks here.

So you want an arbitrary delay for all new multicast traffic to be
created? I'd rather have a series of staggered attempts so that we can
avoid this setup time.

Also the setup time varies greatly depending on the complexity of the
fabric changes. It can be extremely fast if the multicast group is already
in use by others in the fabric. Adding a delay penalizes everyone
unnecessarily.

Also much of these seems to be contigent on IGMPv3. We are using igmpv2.

> > The querier is of not use since (for the gazillionth of times) this is
> an
> > unsolicited IGMP report. If there is a querier then the unsolicited igmp
> > reports would not be used but the timeout indicated by the querier would
> > be used.
>
>         A querier affects unsolicited reports because it sets both the
> query interval and the robustness value. If you want to send 10 reports,
> you can cause that by having a querier that sets it to that many. The
> initial join would then send 10 reports and the query interval can also
> be as low as you like.

The Linux IGMP subsystem does not support either of those at this point.
When the multicast group is created it has no notion of query intervals
until the first igmp query is received. That is the period of interest
that we are discussing!

>         But the linux code is not just for your particular problem or
> particular configuration. You can solve your problem by adding a querier,
> but I know you're trying to do it without. The mail I was responding to
> referred also to the case of a querier present, which is actually the
> "normal" case for using full IGMP is. I'm saying that for the non-querier
> case, making those per-interface configurable is reasonable because
> they *are* querier-changeable, but you can also use a querier to change
> it _for_the_unsolicited_reports_, as well as making the querier interval
> small enough that you don't have to care at all whether any or all of
> the unsolicited reports are lost.

The network of course has a querier that sents igmp requests in intervals
that could span minutes. We are talking about the period of time after a
join when a multicast group has been created but we have not reached the
time when the router sends an igmp query and where the various bits of
information about the igmp handling can be determined for the multicast
group.

I am not sure that you comprehend the basics of IGMP processing in the
kernel. I see knowledge about IGMP in general but you have a difficult
time to relate that to what the Linux kernel actually does.

Would you please have a look at the source code?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

igmp: Allow mininum interval specification for igmp timers.

Commit Message

Comments

Patch