From patchwork Tue Aug 11 08:07:06 2009
X-Patchwork-Submitter: Timo Teras
X-Patchwork-Id: 31142
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A8126AA.9050309@iki.fi>
Date: Tue, 11 Aug 2009 11:07:06 +0300
From: Timo Teräs
To: netdev@vger.kernel.org
Subject: multicast over nbma gre tunnels

Hi,

I'm trying to figure out the proper way to do multicast forwarding over
NBMA GRE tunnels. Currently the userland opennhrp daemon just listens on
a packet socket and calls sendto() for each target it wants to forward a
packet to. Obviously this is slow.

I started to look at how to do it in the kernel, played with the
multicast forwarding code, and tried something like the patch at the end
of this mail.
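For reference, the current userland approach is essentially the
following. This is only a rough sketch of what was described above, not
opennhrp's actual code; the interface name ("gre1"), the peer list, and
the missing error handling and multicast filtering are illustrative.

/*
 * Listen on a packet socket bound to the NBMA gre interface and re-send
 * each received frame once per peer, using the peer's IPv4 address as
 * the link-layer address of the gre device.
 */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void forward_loop(const struct in_addr *peers, int npeers)
{
	unsigned char buf[2048];
	struct sockaddr_ll ll, from;
	socklen_t fromlen;
	ssize_t len;
	int fd, i;

	fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
	memset(&ll, 0, sizeof(ll));
	ll.sll_family = AF_PACKET;
	ll.sll_protocol = htons(ETH_P_IP);
	ll.sll_ifindex = if_nametoindex("gre1");	/* hypothetical name */
	bind(fd, (struct sockaddr *) &ll, sizeof(ll));

	for (;;) {
		fromlen = sizeof(from);
		len = recvfrom(fd, buf, sizeof(buf), 0,
			       (struct sockaddr *) &from, &fromlen);
		if (len <= 0)
			break;
		if (from.sll_pkttype == PACKET_OUTGOING)
			continue;	/* skip frames we sent ourselves */

		/* One sendto() per NBMA peer - this is the slow part. */
		for (i = 0; i < npeers; i++) {
			ll.sll_halen = 4;
			memcpy(ll.sll_addr, &peers[i].s_addr, 4);
			sendto(fd, buf, len, 0,
			       (struct sockaddr *) &ll, sizeof(ll));
		}
	}
	close(fd);
}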
However, there are several drawbacks to this:
- ABI break: MAXVIFS is changed (it is used in a userland-visible struct)
- it does not forward locally originated link-local multicast traffic
  (that traffic does not traverse the multicast forwarding code)
- there can be only one mrouter application, so I can't have opennhrp
  manage the gre device while some other application manages the
  forwarding between interfaces

It looks to me like the multicast forwarding code would need a rewrite
anyway, using netlink, so that it can be managed by multiple applications
and without hard limits such as MAXVIFS. But then again, I'm wondering
whether the GRE NBMA part should go into ipmr.c (with a "vif" for each
NBMA entity, and link-local traffic hooked into ipmr.c too) or into
ip_gre.c (a specific API for managing the NBMA forwardings within the GRE
code).

The other problem is that each multicast packet could be forwarded to,
say, a hundred nodes via the same physical eth. I wonder if just copying
the skb a hundred times and queuing the copies on the same device causes
problems? Any suggestions on how this could be done in a smart way?

Incidentally, I noticed that with this patch, when the multicast sender
sends to gre1 and the traffic is forwarded from gre1 to multiple gre1
NBMA destinations, the sender application shows large soft-irq usage.
Forwarding from a real interface to multiple NBMA destinations seemed to
have the expected CPU usage level. I suppose I introduced (or there
already exists) some sort of routing loop, so that packets get handled
TTL times instead of once per packet.

Thanks,
Timo

---
diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index 0d45b4e..406ef6f 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -33,7 +33,7 @@
 #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
-#define MAXVIFS		32
+#define MAXVIFS		256
 typedef unsigned long vifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short vifi_t;
 #define ALL_VIFS	((vifi_t)(-1))
@@ -66,6 +66,7 @@ struct vifctl {
 #define VIFF_TUNNEL	0x1	/* IPIP tunnel */
 #define VIFF_SRCRT	0x2	/* NI */
 #define VIFF_REGISTER	0x4	/* register vif */
+#define VIFF_NBMA	0x10
 
 /*
  *	Cache manipulation structures for mrouted and PIMd
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 13e9dd3..43c988b 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -105,6 +105,31 @@ static struct net_protocol pim_protocol;
 
 static struct timer_list ipmr_expire_timer;
 
+static __be32 ipmr_get_skb_nbma(struct sk_buff *skb)
+{
+	union {
+		char addr[MAX_ADDR_LEN];
+		__be32 inaddr;
+	} u;
+
+	if (dev_parse_header(skb, u.addr) != 4)
+		return INADDR_ANY;
+
+	return u.inaddr;
+}
+
+static int ip_mr_match_vif_skb(struct vif_device *vif, struct sk_buff *skb)
+{
+	if (vif->dev != skb->dev)
+		return 0;
+
+	if (vif->flags & VIFF_NBMA)
+		return ipmr_get_skb_nbma(skb) == vif->remote;
+
+	return 1;
+}
+
+
 /* Service routines creating virtual interfaces: DVMRP tunnels and PIMREG */
 
 static void ipmr_del_tunnel(struct net_device *dev, struct vifctl *v)
@@ -470,6 +495,7 @@ static int vif_add(struct net *net, struct vifctl *vifc, int mrtsock)
 			return err;
 		}
 		break;
+	case VIFF_NBMA:
 	case 0:
 		dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
 		if (!dev)
@@ -504,7 +530,7 @@ static int vif_add(struct net *net, struct vifctl *vifc, int mrtsock)
 	v->pkt_in = 0;
 	v->pkt_out = 0;
 	v->link = dev->ifindex;
-	if (v->flags&(VIFF_TUNNEL|VIFF_REGISTER))
+	if (v->flags&(VIFF_TUNNEL|VIFF_REGISTER|VIFF_NBMA))
 		v->link = dev->iflink;
 
 	/* And finish update writing critical data */
@@ -1212,12 +1238,15 @@ static inline int ipmr_forward_finish(struct sk_buff *skb)
 {
 	struct ip_options * opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS_BH(dev_net(skb->dst->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_INC_STATS_BH(dev_net(skb->dev), IPSTATS_MIB_OUTFORWDATAGRAMS);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
 
-	return dst_output(skb);
+	if (skb->dst != NULL)
+		return dst_output(skb);
+	else
+		return dev_queue_xmit(skb);
 }
 
 /*
@@ -1230,7 +1259,8 @@ static void ipmr_queue_xmit(struct sk_buff *skb, struct mfc_cache *c, int vifi)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct vif_device *vif = &net->ipv4.vif_table[vifi];
 	struct net_device *dev;
-	struct rtable *rt;
+	struct net_device *fromdev = skb->dev;
+	struct rtable *rt = NULL;
 	int    encap = 0;
 
 	if (vif->dev == NULL)
@@ -1257,6 +1287,19 @@ static void ipmr_queue_xmit(struct sk_buff *skb, struct mfc_cache *c, int vifi)
 		if (ip_route_output_key(net, &rt, &fl))
 			goto out_free;
 		encap = sizeof(struct iphdr);
+		dev = rt->u.dst.dev;
+	} else if (vif->flags&VIFF_NBMA) {
+		/* Fixme, we should take tunnel source address from the
+		 * tunnel device binding if it exists */
+		struct flowi fl = { .oif = vif->link,
+				    .nl_u = { .ip4_u =
+					      { .daddr = vif->remote,
+						.tos = RT_TOS(iph->tos) } },
+				    .proto = IPPROTO_GRE };
+		if (ip_route_output_key(&init_net, &rt, &fl))
+			goto out_free;
+		encap = LL_RESERVED_SPACE(rt->u.dst.dev);
+		dev = vif->dev;
 	} else {
 		struct flowi fl = { .oif = vif->link,
 				    .nl_u = { .ip4_u =
@@ -1265,34 +1308,39 @@ static void ipmr_queue_xmit(struct sk_buff *skb, struct mfc_cache *c, int vifi)
 				    .proto = IPPROTO_IPIP };
 		if (ip_route_output_key(net, &rt, &fl))
 			goto out_free;
+		dev = rt->u.dst.dev;
 	}
 
-	dev = rt->u.dst.dev;
+	if (!(vif->flags & VIFF_NBMA)) {
+		if (skb->len+encap > dst_mtu(&rt->u.dst) && (ntohs(iph->frag_off) & IP_DF)) {
+			/* Do not fragment multicasts. Alas, IPv4 does not
+			   allow to send ICMP, so that packets will disappear
+			   to blackhole.
+			 */
-	if (skb->len+encap > dst_mtu(&rt->u.dst) && (ntohs(iph->frag_off) & IP_DF)) {
-		/* Do not fragment multicasts. Alas, IPv4 does not
-		   allow to send ICMP, so that packets will disappear
-		   to blackhole.
-		 */
-
-		IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
-		ip_rt_put(rt);
-		goto out_free;
+			IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+			goto out_free_rt;
+		}
 	}
 
 	encap += LL_RESERVED_SPACE(dev) + rt->u.dst.header_len;
 
-	if (skb_cow(skb, encap)) {
-		ip_rt_put(rt);
-		goto out_free;
-	}
+	if (skb_cow(skb, encap))
+		goto out_free_rt;
 
 	vif->pkt_out++;
 	vif->bytes_out += skb->len;
 
 	dst_release(skb->dst);
-	skb->dst = &rt->u.dst;
+	if (vif->flags & VIFF_NBMA) {
+		ip_rt_put(rt);
+		skb->dst = NULL;
+		rt = NULL;
+	} else {
+		skb->dst = &rt->u.dst;
+	}
 	ip_decrease_ttl(ip_hdr(skb));
+	skb->dev = dev;
 
 	/* FIXME: forward and output firewalls used to be called here.
 	 * What do we do with netfilter? -- RR */
@@ -1301,6 +1349,10 @@ static void ipmr_queue_xmit(struct sk_buff *skb, struct mfc_cache *c, int vifi)
 		/* FIXME: extra output firewall step used to be here. --RR */
 		vif->dev->stats.tx_packets++;
 		vif->dev->stats.tx_bytes += skb->len;
+	} else if (vif->flags & VIFF_NBMA) {
+		if (dev_hard_header(skb, dev, ntohs(skb->protocol),
+				    &vif->remote, NULL, 4) < 0)
+			goto out_free_rt;
 	}
 
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
@@ -1316,21 +1368,30 @@ static void ipmr_queue_xmit(struct sk_buff *skb, struct mfc_cache *c, int vifi)
 	 * not mrouter) cannot join to more than one interface - it will
 	 * result in receiving multiple packets.
 	 */
-	NF_HOOK(PF_INET, NF_INET_FORWARD, skb, skb->dev, dev,
+	NF_HOOK(PF_INET, NF_INET_FORWARD, skb, fromdev, dev,
 		ipmr_forward_finish);
 	return;
 
+out_free_rt:
+	if (rt != NULL)
+		ip_rt_put(rt);
 out_free:
 	kfree_skb(skb);
 	return;
 }
 
-static int ipmr_find_vif(struct net_device *dev)
+static int ipmr_find_vif(struct net_device *dev, __be32 nbma_origin)
 {
 	struct net *net = dev_net(dev);
 	int ct;
 	for (ct = net->ipv4.maxvif-1; ct >= 0; ct--) {
-		if (net->ipv4.vif_table[ct].dev == dev)
+		if (net->ipv4.vif_table[ct].dev != dev)
+			continue;
+
+		if (net->ipv4.vif_table[ct].flags & VIFF_NBMA) {
+			if (net->ipv4.vif_table[ct].remote == nbma_origin)
+				break;
+		} else if (nbma_origin == INADDR_ANY)
 			break;
 	}
 	return ct;
@@ -1351,7 +1412,7 @@ static int ip_mr_forward(struct sk_buff *skb, struct mfc_cache *cache, int local
 	/*
 	 * Wrong interface: drop packet and (maybe) send PIM assert.
 	 */
-	if (net->ipv4.vif_table[vif].dev != skb->dev) {
+	if (!ip_mr_match_vif_skb(&net->ipv4.vif_table[vif], skb)) {
 		int true_vifi;
 
 		if (skb->rtable->fl.iif == 0) {
@@ -1370,7 +1431,7 @@ static int ip_mr_forward(struct sk_buff *skb, struct mfc_cache *cache, int local
 		}
 
 		cache->mfc_un.res.wrong_if++;
-		true_vifi = ipmr_find_vif(skb->dev);
+		true_vifi = ipmr_find_vif(skb->dev, ipmr_get_skb_nbma(skb));
 
 		if (true_vifi >= 0 && net->ipv4.mroute_do_assert &&
 		    /* pimsm uses asserts, when switching from RPT to SPT,
@@ -1479,7 +1540,7 @@ int ip_mr_input(struct sk_buff *skb)
 			skb = skb2;
 		}
 
-		vif = ipmr_find_vif(skb->dev);
+		vif = ipmr_find_vif(skb->dev, ipmr_get_skb_nbma(skb));
 		if (vif >= 0) {
 			int err = ipmr_cache_unresolved(net, vif, skb);
 			read_unlock(&mrt_lock);
@@ -1663,7 +1724,7 @@ int ipmr_get_route(struct net *net,
 	}
 
 	dev = skb->dev;
-	if (dev == NULL || (vif = ipmr_find_vif(dev)) < 0) {
+	if (dev == NULL || (vif = ipmr_find_vif(dev, INADDR_ANY)) < 0) {
 		read_unlock(&mrt_lock);
 		return -ENODEV;
 	}
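For completeness, with the patch above applied, a routing daemon would
register one NBMA vif per peer from userland roughly as follows. This is
only a sketch against the patched ABI: the helper name, addresses and vif
index are made up, and error handling is omitted.

/*
 * Register an NBMA vif: vifc_lcl_addr selects the local gre device (the
 * "case VIFF_NBMA" path above falls through to the plain "case 0"
 * lookup), and vifc_rmt_addr carries the NBMA peer address the kernel
 * matches on input and forwards to on output.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/mroute.h>

static int add_nbma_vif(int mrt_fd, vifi_t vifi,
			const char *local, const char *nbma_peer)
{
	struct vifctl vc;

	memset(&vc, 0, sizeof(vc));
	vc.vifc_vifi = vifi;
	vc.vifc_flags = VIFF_NBMA;	/* new flag from the patch */
	vc.vifc_threshold = 1;		/* TTL threshold for forwarding */
	inet_pton(AF_INET, local, &vc.vifc_lcl_addr);
	inet_pton(AF_INET, nbma_peer, &vc.vifc_rmt_addr);

	return setsockopt(mrt_fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
}

/*
 * mrt_fd is the usual mrouted-style control socket:
 *   fd = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
 *   setsockopt(fd, IPPROTO_IP, MRT_INIT, &one, sizeof(one));
 * Even with MAXVIFS raised to 256 by the patch, the vif table stays a
 * hard limit, which is part of the concern raised above.
 */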