Message ID:   1338360073.2760.81.camel@edumazet-glaptop
State:        RFC, archived
Delegated to: David Miller
On Wed, 2012-05-30 at 08:41 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-29 at 12:37 -0700, Andi Kleen wrote:
> > So basically handling syncookie lockless?
> >
> > Makes sense. Syncookies is a bit obsolete these days of course, due
> > to the lack of options. But may be still useful for this.
> >
> > Obviously you'll need to clean up the patch and support IPv6,
> > but the basic idea looks good to me.
>
> Also TCP Fast Open should be a good way to make the SYN flood no more
> effective.

Sounds interesting, but TCP Fast Open is primarily concerned with enabling
data exchange during SYN establishment. I don't see any indication that they
have implemented parallel SYN handling. Implementing parallel SYN handling
should also benefit their work.

After studying this code path, I also see great performance benefit in
optimizing the normal 3WHS on sockets in sk_state == LISTEN. Perhaps we
should split up the code paths for LISTEN vs. ESTABLISHED, as they are very
entangled at the moment AFAICS.

> Yuchung Cheng and Jerry Chu should upstream this code in a very near
> future.

Looking forward to seeing the code, and the fallout discussions, on
transferring data in SYN packets.

> Another way to mitigate SYN scalability issues before the full RCU
> solution I was cooking is to either :
>
> 1) Use a hardware filter (like on Intel NICs) to force all SYN packets
> going to one queue (so that they are all serviced on one CPU)
>
> 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> dependent on src port/address, to get same effect (All SYN packets
> processed by one cpu). Note this only addresses the SYN flood problem, not
> the general 3WHS scalability one, since if a real connection is
> established, the third packet (ACK from client) will have the 'real'
> rxhash and will be processed by another cpu.

I don't like the idea of overloading one CPU with SYN packets, as the
attacker can still cause a DoS on new connections.
My "unlocked" parallel SYN cookie approach should favor established
connections, as they are allowed to run under a BH lock, and thus don't let
new SYN packets in (on this CPU) until the established connection's packet
is finished. Unless I have misunderstood something... I think I have:
established connections have their own/separate struct sock, and thus this
is another slock spinlock, right? (Well, let Eric bash me for this ;-))

[...cut...]
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
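[Editor's note: the "lockless syncookie" idea discussed above works because syncookies keep no per-connection state: the SYNACK's initial sequence number itself encodes a MAC over the connection tuple plus a coarse time counter and an MSS index, and is simply recomputed when the final ACK arrives. The following is a toy sketch of that principle only; the mixing function, constants, and names are illustrative stand-ins, not the kernel's actual SHA-1-based implementation.]

```c
#include <stdint.h>

/* Toy mixing function standing in for the kernel's cryptographic hash. */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t h = a * 2654435761u ^ b * 2246822519u ^ c * 3266489917u;
    h ^= h >> 13;
    h *= 0x5bd1e995u;
    h ^= h >> 15;
    return h;
}

static const uint32_t secret = 0xdeadbeef; /* boot-time random in reality */

/* Encode: the ISN carries a MAC over the 4-tuple and a coarse time
 * counter, with the low 2 bits holding an index into a small table of
 * permitted MSS values.  No request sock is allocated. */
static uint32_t cookie_make(uint32_t saddr, uint32_t daddr,
                            uint32_t ports, uint32_t count, uint32_t mss_idx)
{
    return (mix(saddr ^ secret, daddr, ports ^ count) & ~3u) | (mss_idx & 3u);
}

/* Check: on the final ACK, recompute for the current and previous
 * counter values; a match recovers the MSS index (>= 0 on success). */
static int cookie_check(uint32_t isn, uint32_t saddr, uint32_t daddr,
                        uint32_t ports, uint32_t count)
{
    uint32_t cs[2] = { count, count - 1 };  /* accept one stale window */
    for (int i = 0; i < 2; i++)
        for (uint32_t m = 0; m < 4; m++)
            if (cookie_make(saddr, daddr, ports, cs[i], m) == isn)
                return (int)m;
    return -1;
}
```

Because verification is a pure function of the packet and a global secret, it needs no state and no lock on the listener, which is what makes parallel SYN/ACK processing conceivable in the first place.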
On Wednesday 30 May 2012 08:41:13 Eric Dumazet wrote:
> On Tue, 2012-05-29 at 12:37 -0700, Andi Kleen wrote:
> > So basically handling syncookie lockless?
> >
> > Makes sense. Syncookies is a bit obsolete these days of course, due
> > to the lack of options. But may be still useful for this.
> >
> > Obviously you'll need to clean up the patch and support IPv6,
> > but the basic idea looks good to me.
>
> Also TCP Fast Open should be a good way to make the SYN flood no more
> effective.
>
> Yuchung Cheng and Jerry Chu should upstream this code in a very near
> future.
>
> Another way to mitigate SYN scalability issues before the full RCU
> solution I was cooking is to either :
>
> 1) Use a hardware filter (like on Intel NICs) to force all SYN packets
> going to one queue (so that they are all serviced on one CPU)

We have this option running right now, and it gave slightly higher values.
The upside is that only one core is running at 100% load.

To be able to process more SYNs, an attempt was made to spread them with RPS
to 2 other cores, which gave 60% more SYNs per second. I.e. the SYN filter in
the NIC sending all IRQs to one core gave ~52k SYN pkts/sec; adding RPS and
sending SYNs to two other cores gave ~80k SYN pkts/sec. Adding more cores
than two didn't help that much.

> 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> dependent on src port/address, to get same effect (All SYN packets
> processed by one cpu). Note this only addresses the SYN flood problem, not
> the general 3WHS scalability one, since if a real connection is
> established, the third packet (ACK from client) will have the 'real'
> rxhash and will be processed by another cpu.

Neither the NIC's SYN filter nor this scales that well.
> (Of course, RPS must be enabled to benefit from this)
>
> Untested patch to get the idea :
>
>  include/net/flow_keys.h   |    1 +
>  net/core/dev.c            |    8 ++++++++
>  net/core/flow_dissector.c |    9 +++++++++
>  3 files changed, 18 insertions(+)
>
> diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
> index 80461c1..b5bae21 100644
> --- a/include/net/flow_keys.h
> +++ b/include/net/flow_keys.h
> @@ -10,6 +10,7 @@ struct flow_keys {
>  		__be16 port16[2];
>  	};
>  	u8 ip_proto;
> +	u8 tcpflags;
>  };
>
>  extern bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cd09819..c9c039e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -135,6 +135,7 @@
>  #include <linux/net_tstamp.h>
>  #include <linux/static_key.h>
>  #include <net/flow_keys.h>
> +#include <net/tcp.h>
>
>  #include "net-sysfs.h"
>
> @@ -2614,6 +2615,12 @@ void __skb_get_rxhash(struct sk_buff *skb)
>  		return;
>
>  	if (keys.ports) {
> +		if ((keys.tcpflags & (TCPHDR_SYN | TCPHDR_ACK)) == TCPHDR_SYN) {
> +			hash = jhash_2words((__force u32)keys.dst,
> +					    (__force u32)keys.port16[1],
> +					    hashrnd);
> +			goto end;
> +		}
>  		if ((__force u16)keys.port16[1] < (__force u16)keys.port16[0])
>  			swap(keys.port16[0], keys.port16[1]);
>  		skb->l4_rxhash = 1;
> @@ -2626,6 +2633,7 @@ void __skb_get_rxhash(struct sk_buff *skb)
>  	hash = jhash_3words((__force u32)keys.dst,
>  			    (__force u32)keys.src,
>  			    (__force u32)keys.ports, hashrnd);
> +end:
>  	if (!hash)
>  		hash = 1;
>
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index a225089..cd4aedf 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -137,6 +137,15 @@ ipv6:
>  	ports = skb_header_pointer(skb, nhoff, sizeof(_ports), &_ports);
>  	if (ports)
>  		flow->ports = *ports;
> +	if (ip_proto == IPPROTO_TCP) {
> +		__u8 *tcpflags, _tcpflags;
> +
> +		tcpflags = skb_header_pointer(skb, nhoff + 13,
> +					      sizeof(_tcpflags),
> +					      &_tcpflags);
> +		if (tcpflags)
> +			flow->tcpflags = *tcpflags;
> +	}
>  }
>
>  return true;
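[Editor's note: the effect of the quoted patch can be modeled in userspace. In this sketch the hash functions are toy stand-ins for the kernel's jhash_2words()/jhash_3words() (names and constants are illustrative): a pure SYN is hashed on destination address and port only, so every SYN aimed at one listener yields the same rxhash regardless of the spoofed source.]

```c
#include <stdint.h>

#define TCPHDR_SYN 0x02
#define TCPHDR_ACK 0x10

static const uint32_t hashrnd = 0x12345678; /* random seed in the kernel */

/* Toy stand-ins for jhash_2words()/jhash_3words(). */
static uint32_t mix2(uint32_t a, uint32_t b)
{
    uint32_t h = (a ^ hashrnd) * 2654435761u + b * 2246822519u;
    h ^= h >> 16;
    return h ? h : 1;   /* rxhash 0 means "not yet computed" */
}

static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c)
{
    return mix2(mix2(a, b), c);
}

/* Mirrors the patched __skb_get_rxhash() branch: a pure SYN (SYN set,
 * ACK clear) is hashed on (dst addr, dst port) only; everything else
 * keeps the full tuple hash and so spreads across RPS CPUs. */
static uint32_t rxhash(uint32_t src, uint32_t dst,
                       uint16_t sport, uint16_t dport, uint8_t tcpflags)
{
    if ((tcpflags & (TCPHDR_SYN | TCPHDR_ACK)) == TCPHDR_SYN)
        return mix2(dst, dport);
    return mix3(dst, src, ((uint32_t)sport << 16) | dport);
}
```

This also exhibits the caveat noted in the patch description: the client's final ACK has the ACK bit set, takes the full-tuple branch, and therefore lands on a different CPU than the SYN did.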
On Wed, 2012-05-30 at 09:45 +0200, Jesper Dangaard Brouer wrote:
> Sounds interesting, but TCP Fast Open is primarily concerned with
> enabling data exchange during SYN establishment. I don't see any
> indication that they have implemented parallel SYN handling.

Not at all: TCP Fast Open's main goal is to allow connection establishment
with a single packet (thus removing one RTT). This also removes the whole
idea of having half-sockets (in SYN_RCV state).

Then, allowing DATA in the SYN packet is an extra bonus, only if the whole
request can fit in the packet (it is unlikely for typical HTTP requests).

> Implementing parallel SYN handling should also benefit their work.

Why do you think I am working on this? Hint: I am a Google coworker.

> After studying this code path, I also see great performance benefit in
> optimizing the normal 3WHS on sockets in sk_state == LISTEN.
> Perhaps we should split up the code paths for LISTEN vs. ESTABLISHED, as
> they are very entangled at the moment AFAICS.
>
> > Yuchung Cheng and Jerry Chu should upstream this code in a very near
> > future.
>
> Looking forward to seeing the code, and the fallout discussions, on
> transferring data in SYN packets.

Problem is this code will be delayed if we change net-next code in this
area, because we'll have to rebase and retest everything.

> > Another way to mitigate SYN scalability issues before the full RCU
> > solution I was cooking is to either :
> >
> > 1) Use a hardware filter (like on Intel NICs) to force all SYN packets
> > going to one queue (so that they are all serviced on one CPU)
> >
> > 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> > dependent on src port/address, to get same effect (All SYN packets
> > processed by one cpu). Note this only addresses the SYN flood problem, not
> > the general 3WHS scalability one, since if a real connection is
> > established, the third packet (ACK from client) will have the 'real'
> > rxhash and will be processed by another cpu.
>
> I don't like the idea of overloading one CPU with SYN packets, as the
> attacker can still cause a DoS on new connections.

One CPU can handle more than one million SYN per second, while 32 CPUs
fighting on the socket lock cannot handle 1% of this load.

If Intel chose to implement this hardware filter in their NICs, it's for a
good reason.

> My "unlocked" parallel SYN cookie approach should favor established
> connections, as they are allowed to run under a BH lock, and thus don't
> let new SYN packets in (on this CPU) until the established connection's
> packet is finished. Unless I have misunderstood something... I think I
> have: established connections have their own/separate struct sock, and
> thus this is another slock spinlock, right? (Well, let Eric bash me for
> this ;-))

It seems you forgot I have patches to have full parallelism, not only the
SYNCOOKIE hack.

I am still polishing them; it's a _long_ process, especially if the network
tree changes a lot.

If you believe you can beat me on this, please let me know so that I can
switch to other tasks.
On Wed, 2012-05-30 at 10:03 +0200, Hans Schillstrom wrote:
> We have this option running right now, and it gave slightly higher values.
> The upside is that only one core is running at 100% load.
>
> To be able to process more SYNs, an attempt was made to spread them with
> RPS to 2 other cores, which gave 60% more SYNs per second. I.e. the SYN
> filter in the NIC sending all IRQs to one core gave ~52k SYN pkts/sec;
> adding RPS and sending SYNs to two other cores gave ~80k SYN pkts/sec.
> Adding more cores than two didn't help that much.

When you say 52,000 pkt/s, is that for fully established sockets, or
SYNFLOOD?

19.23 us to handle _one_ SYN message seems pretty wrong to me, if there is
no contention on the listener socket.
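[Editor's note: the 19.23 us figure is simply the reciprocal of Hans's measured single-core rate, as a quick check shows.]

```c
/* CPU-time budget per packet at a given packets-per-second rate:
 * 1e6 / 52000 = 19.23 us per SYN on one core. */
static double us_per_packet(double pkts_per_sec)
{
    return 1e6 / pkts_per_sec;
}
```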
On Wed, 2012-05-30 at 10:15 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 09:45 +0200, Jesper Dangaard Brouer wrote:
> > Sounds interesting, but TCP Fast Open is primarily concerned with
> > enabling data exchange during SYN establishment. I don't see any
> > indication that they have implemented parallel SYN handling.
>
> Not at all: TCP Fast Open's main goal is to allow connection establishment
> with a single packet (thus removing one RTT). This also removes the
> whole idea of having half-sockets (in SYN_RCV state).
>
> Then, allowing DATA in the SYN packet is an extra bonus, only if the
> whole request can fit in the packet (it is unlikely for typical HTTP
> requests).
>
> > Implementing parallel SYN handling should also benefit their work.
>
> Why do you think I am working on this? Hint: I am a Google coworker.

I did know you work for Google, but didn't know you worked actively on
parallel SYN handling. Your previous quote, "eventually in a short time",
indicated to me that I should solve the issue myself first, and then we
would replace my code with your full solution later.

> > After studying this code path, I also see great performance benefit in
> > optimizing the normal 3WHS on sockets in sk_state == LISTEN.
> > Perhaps we should split up the code paths for LISTEN vs. ESTABLISHED, as
> > they are very entangled at the moment AFAICS.
> >
> > > Yuchung Cheng and Jerry Chu should upstream this code in a very near
> > > future.
> >
> > Looking forward to seeing the code, and the fallout discussions, on
> > transferring data in SYN packets.
>
> Problem is this code will be delayed if we change net-next code in this
> area, because we'll have to rebase and retest everything.

Okay, I don't want to delay your work. We can wait with merging my cleanup
patches, and I can take the pain of rebasing them after your work is merged.
Then we will see if my performance patches have become obsolete.

I'm going to post some updated v2 patches, just because I know some people
who are desperate for a quick solution to their DDoS issues, and are willing
to patch their kernels for production.

> > > Another way to mitigate SYN scalability issues before the full RCU
> > > solution I was cooking is to either :
> > >
> > > 1) Use a hardware filter (like on Intel NICs) to force all SYN packets
> > > going to one queue (so that they are all serviced on one CPU)
> > >
> > > 2) Tweak RPS (__skb_get_rxhash()) so that SYN packets rxhash is not
> > > dependent on src port/address, to get same effect (All SYN packets
> > > processed by one cpu). Note this only addresses the SYN flood problem,
> > > not the general 3WHS scalability one, since if a real connection is
> > > established, the third packet (ACK from client) will have the 'real'
> > > rxhash and will be processed by another cpu.
> >
> > I don't like the idea of overloading one CPU with SYN packets, as the
> > attacker can still cause a DoS on new connections.
>
> One CPU can handle more than one million SYN per second, while 32 CPUs
> fighting on the socket lock cannot handle 1% of this load.

I'm not sure one CPU can handle 1 Mpps on this particular path. And Hans
has some other measurements, although I'm assuming he has small CPUs. But
if you are working on the real solution, we don't need to discuss this :-)

> If Intel chose to implement this hardware filter in their NICs, it's for
> a good reason.
>
> > My "unlocked" parallel SYN cookie approach should favor established
> > connections, as they are allowed to run under a BH lock, and thus don't
> > let new SYN packets in (on this CPU) until the established connection's
> > packet is finished. Unless I have misunderstood something... I think I
> > have: established connections have their own/separate struct sock, and
> > thus this is another slock spinlock, right? (Well, let Eric bash me for
> > this ;-))
>
> It seems you forgot I have patches to have full parallelism, not only
> the SYNCOOKIE hack.

I'm so much looking forward to this :-)

> I am still polishing them; it's a _long_ process, especially if the
> network tree changes a lot.
>
> If you believe you can beat me on this, please let me know so that I can
> switch to other tasks.

I don't dare to go into that battle with the network ninja; I surrender.
DaveM, Eric's patches take precedence over mine...

/me Crawling back into my cave, and switching to boring bugzilla cases of
backporting kernel patches instead...
On Wed, 2012-05-30 at 11:24 +0200, Jesper Dangaard Brouer wrote:
> I don't dare to go into that battle with the network ninja; I surrender.
> DaveM, Eric's patches take precedence over mine...
>
> /me Crawling back into my cave, and switching to boring bugzilla cases of
> backporting kernel patches instead...

Hey, I only wanted to say that we were working on the same area and that we
should expect conflicts.

In the long term, we want a scalable listener solution, but I can understand
if some customers want an immediate solution (SYN flood mitigation).
On Wednesday 30 May 2012 10:24:48 Eric Dumazet wrote:
> On Wed, 2012-05-30 at 10:03 +0200, Hans Schillstrom wrote:
> > We have this option running right now, and it gave slightly higher
> > values. The upside is that only one core is running at 100% load.
> >
> > To be able to process more SYNs, an attempt was made to spread them with
> > RPS to 2 other cores, which gave 60% more SYNs per second. I.e. the SYN
> > filter in the NIC sending all IRQs to one core gave ~52k SYN pkts/sec;
> > adding RPS and sending SYNs to two other cores gave ~80k SYN pkts/sec.
> > Adding more cores than two didn't help that much.
>
> When you say 52,000 pkt/s, is that for fully established sockets, or
> SYNFLOOD?

SYN flood with hping3, random source IP, dest port 5060, and there is a
listener on that port. (kernel 3.0.13)

> 19.23 us to handle _one_ SYN message seems pretty wrong to me, if there
> is no contention on the listener socket.

BTW, I also see a strange behavior during SYN flood: the client starts
sending data directly in the ACK, and that first packet is more or less
always retransmitted once. I'll dig into that later, or does anyone have an
idea of the reason?
On 05/30/2012 01:24 AM, Eric Dumazet wrote:
> On Wed, 2012-05-30 at 10:03 +0200, Hans Schillstrom wrote:
>
>> We have this option running right now, and it gave slightly higher
>> values. The upside is that only one core is running at 100% load.
>>
>> To be able to process more SYNs, an attempt was made to spread them with
>> RPS to 2 other cores, which gave 60% more SYNs per second. I.e. the SYN
>> filter in the NIC sending all IRQs to one core gave ~52k SYN pkts/sec;
>> adding RPS and sending SYNs to two other cores gave ~80k SYN pkts/sec.
>> Adding more cores than two didn't help that much.
>
> When you say 52,000 pkt/s, is that for fully established sockets, or
> SYNFLOOD?
>
> 19.23 us to handle _one_ SYN message seems pretty wrong to me, if there
> is no contention on the listener socket.

It may still be high, but a very quick netperf TCP_CC test over loopback on
a W3550 system running a 2.6.38 kernel shows:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t TCP_CC -l 60 -c -C
TCP Connect/Close TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain () port 0 AF_INET
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

16384  87380  1       1      60.00   21515.29 30.68  30.96  57.042  57.557
16384  87380

57 microseconds per "transaction", which in this case is establishing and
tearing down the connection with nothing else (no data packets), makes 19
microseconds for a SYN seem perhaps not all that beyond the realm of
possibility?

rick jones
On Wed, 2012-05-30 at 14:20 -0700, Rick Jones wrote:
> It may still be high, but a very quick netperf TCP_CC test over loopback
> on a W3550 system running a 2.6.38 kernel shows:
>
> raj@tardy:~/netperf2_trunk/src$ ./netperf -t TCP_CC -l 60 -c -C
> TCP Connect/Close TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> localhost.localdomain () port 0 AF_INET
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr
>
> 16384  87380  1       1      60.00   21515.29 30.68  30.96  57.042  57.557
> 16384  87380
>
> 57 microseconds per "transaction", which in this case is establishing and
> tearing down the connection with nothing else (no data packets), makes 19
> microseconds for a SYN seem perhaps not all that beyond the realm of
> possibility?

That's a different story, on the loopback device (without stressing the IP
route cache, by the way).

Your netperf test is full userspace transactions, with 5 frames per
transaction: two socket creations/destructions, process scheduler
activations, and it does not enter syncookie mode.

In case of synflood (syncookies on), we receive a packet and send one from
softirq.

One expensive thing might be the md5 to compute the SYNACK sequence.

I suspect other things:

1) Of course we have to take into account the timer responsible for SYNACK
retransmits of previously queued requests. Its cost depends on the listen
backlog. When this timer runs, the listener socket is locked.

2) IP route cache overflows. In case of SYNFLOOD, we should not store
dst(s) in the route cache but destroy them immediately.
On Thursday 31 May 2012 10:28:37 Eric Dumazet wrote:
> On Wed, 2012-05-30 at 14:20 -0700, Rick Jones wrote:
> > It may still be high, but a very quick netperf TCP_CC test over loopback
> > on a W3550 system running a 2.6.38 kernel shows:
> >
> > raj@tardy:~/netperf2_trunk/src$ ./netperf -t TCP_CC -l 60 -c -C
> > TCP Connect/Close TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > localhost.localdomain () port 0 AF_INET
> > Local /Remote
> > Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> > Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> > bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr
> >
> > 16384  87380  1       1      60.00   21515.29 30.68  30.96  57.042  57.557
> > 16384  87380
> >
> > 57 microseconds per "transaction", which in this case is establishing
> > and tearing down the connection with nothing else (no data packets),
> > makes 19 microseconds for a SYN seem perhaps not all that beyond the
> > realm of possibility?
>
> That's a different story, on the loopback device (without stressing the
> IP route cache, by the way).
>
> Your netperf test is full userspace transactions, with 5 frames per
> transaction: two socket creations/destructions, process scheduler
> activations, and it does not enter syncookie mode.
>
> In case of synflood (syncookies on), we receive a packet and send one
> from softirq.
>
> One expensive thing might be the md5 to compute the SYNACK sequence.
>
> I suspect other things:
>
> 1) Of course we have to take into account the timer responsible for
> SYNACK retransmits of previously queued requests. Its cost depends on the
> listen backlog. When this timer runs, the listener socket is locked.
>
> 2) IP route cache overflows. In case of SYNFLOOD, we should not store
> dst(s) in the route cache but destroy them immediately.

I can see plenty of "IPv4: dst cache overflow"
On Thu, 2012-05-31 at 10:45 +0200, Hans Schillstrom wrote:
> I can see plenty of "IPv4: dst cache overflow"

This is probably the most problematic issue in DDoS attacks.

I have a patch for this problem.

The idea is to not cache dst entries for the following cases:

1) Input dst, if the listener queue is full (syncookies possibly engaged)

2) Output dst of SYNACK messages.
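[Editor's note: the policy described above can be condensed into a small decision function. This is a hypothetical model for illustration only — the struct and function names are not kernel interfaces — showing how dst caching becomes conditional on listener pressure, so a flood of one-shot routes cannot evict legitimate cache entries.]

```c
#include <stdbool.h>

/* Hypothetical condensed model of the caching decision, not kernel code. */
struct listener {
    int qlen;       /* current SYN backlog */
    int max_qlen;   /* backlog limit; syncookies engage when exceeded */
};

enum pkt_kind { PKT_INPUT_SYN, PKT_OUTPUT_SYNACK, PKT_OTHER };

/* Returns true when the dst entry should be inserted into the route
 * cache.  Under SYN flood (queue full), input dsts are used once and
 * destroyed; SYNACK output dsts are never cached. */
static bool should_cache_dst(enum pkt_kind kind, const struct listener *l)
{
    if (kind == PKT_OUTPUT_SYNACK)
        return false;
    if (kind == PKT_INPUT_SYN && l->qlen >= l->max_qlen)
        return false;
    return true;
}
```

The point of the design is that legitimate traffic on an unloaded listener still benefits from the cache, while flood traffic (which by definition never completes the handshake) never pollutes it.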
On Thursday 31 May 2012 16:09:21 Eric Dumazet wrote:
> On Thu, 2012-05-31 at 10:45 +0200, Hans Schillstrom wrote:
> > I can see plenty of "IPv4: dst cache overflow"
>
> This is probably the most problematic issue in DDoS attacks.
>
> I have a patch for this problem.
>
> The idea is to not cache dst entries for the following cases:
>
> 1) Input dst, if the listener queue is full (syncookies possibly engaged)
>
> 2) Output dst of SYNACK messages.

Sounds like a good idea; if you need some testing, just send the patches.
diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
index 80461c1..b5bae21 100644
--- a/include/net/flow_keys.h
+++ b/include/net/flow_keys.h
@@ -10,6 +10,7 @@ struct flow_keys {
 		__be16 port16[2];
 	};
 	u8 ip_proto;
+	u8 tcpflags;
 };

 extern bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow);
diff --git a/net/core/dev.c b/net/core/dev.c
index cd09819..c9c039e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -135,6 +135,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/static_key.h>
 #include <net/flow_keys.h>
+#include <net/tcp.h>

 #include "net-sysfs.h"

@@ -2614,6 +2615,12 @@ void __skb_get_rxhash(struct sk_buff *skb)
 		return;

 	if (keys.ports) {
+		if ((keys.tcpflags & (TCPHDR_SYN | TCPHDR_ACK)) == TCPHDR_SYN) {
+			hash = jhash_2words((__force u32)keys.dst,
+					    (__force u32)keys.port16[1],
+					    hashrnd);
+			goto end;
+		}
 		if ((__force u16)keys.port16[1] < (__force u16)keys.port16[0])
 			swap(keys.port16[0], keys.port16[1]);
 		skb->l4_rxhash = 1;
@@ -2626,6 +2633,7 @@ void __skb_get_rxhash(struct sk_buff *skb)
 	hash = jhash_3words((__force u32)keys.dst,
 			    (__force u32)keys.src,
 			    (__force u32)keys.ports, hashrnd);
+end:
 	if (!hash)
 		hash = 1;

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..cd4aedf 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -137,6 +137,15 @@ ipv6:
 	ports = skb_header_pointer(skb, nhoff, sizeof(_ports), &_ports);
 	if (ports)
 		flow->ports = *ports;
+	if (ip_proto == IPPROTO_TCP) {
+		__u8 *tcpflags, _tcpflags;
+
+		tcpflags = skb_header_pointer(skb, nhoff + 13,
+					      sizeof(_tcpflags),
+					      &_tcpflags);
+		if (tcpflags)
+			flow->tcpflags = *tcpflags;
+	}
 }

 return true;