[conntrack-tools,4/4] conntrackd: introduce RequestResync option

Message ID 149270929676.1751.18425946182083865800.stgit@nfdev2.cica.es
State Changes Requested
Delegated to: Pablo Neira

Commit Message

Arturo Borrero Gonzalez April 20, 2017, 5:28 p.m. UTC
In some environments where both nodes of a cluster share all the conntracks,
after an initial or manual resync, the conntrack information diverges from
node to node.

I have observed that this is not due to synchronization problems, given the
link between the nodes is very stable and stats show no issues.
So, this could be due to every node of the cluster seeing slightly different
traffic and flow updates, perhaps different timeouts being applied to
the conntracks in every node.
A manual resync (using conntrackd -n) resolves these issues immediately.

This new configuration option tells conntrackd to request a resync
with the other node, similar to what could happen manually using
the 'conntrackd -n' command.

For now this option is only valid in NOTRACK sync mode.

Example configuration:

[...]
Sync {
        Mode NOTRACK {
                DisableInternalCache on
                DisableExternalCache on
                RequestResync 30
        }
        TCP {
                IPv4_address 127.0.0.1
                IPv4_Destination_Address 127.0.0.1
                Port 3780
                Interface eth0
                SndSocketBuffer 1249280
                RcvSocketBuffer 1249280
                Checksum on
        }
        Options {
                TCPWindowTracking Off
                ExpectationSync On
        }
}
[...]

Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org>
---
 conntrackd.conf.5     |    9 +++++++++
 include/conntrackd.h  |    1 +
 include/resync.h      |    1 +
 src/read_config_lex.l |    1 +
 src/read_config_yy.y  |    8 +++++++-
 src/resync.c          |   21 +++++++++++++++++++++
 src/run.c             |    3 +++
 7 files changed, 43 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Pablo Neira Ayuso April 25, 2017, 11:37 a.m. UTC | #1
On Thu, Apr 20, 2017 at 07:28:16PM +0200, Arturo Borrero Gonzalez wrote:
> In some environments where both nodes of a cluster share all the conntracks,
> after an initial or manual resync, the conntrack information diverges from
> node to node.
> 
> I have observed that this is not due to synchronization problems, given the
> link between the nodes is very stable and stats show no issues.
> So, this could be due to every node of the cluster seeing slightly different
> traffic and flow updates, perhaps different timeouts being applied to
> the conntracks in every node.
> A manual resync (using conntrackd -n) resolves these issues immediately.
> 
> This new configuration option tells conntrackd to request a resync
> with the other node, similar to what could happen manually using
> the 'conntrackd -n' command.
> 
> For now this option is only valid in NOTRACK sync mode.
> 
> Example configuration:
> 
> [...]
> Sync {
>         Mode NOTRACK {
>                 DisableInternalCache on
>                 DisableExternalCache on
>                 RequestResync 30

This looks very similar to the timer-based approach that is already
there. Did you give it a try?

This approach doesn't solve nicely the case where you have an entry
with a large timeout that got out of sync.
--
Arturo Borrero Gonzalez April 25, 2017, 12:46 p.m. UTC | #2
On 25 April 2017 at 13:37, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Apr 20, 2017 at 07:28:16PM +0200, Arturo Borrero Gonzalez wrote:
>> In some environments where both nodes of a cluster share all the conntracks,
>> after an initial or manual resync, the conntrack information diverges from
>> node to node.
>>
>> I have observed that this is not due to synchronization problems, given the
>> link between the nodes is very stable and stats show no issues.
>> So, this could be due to every node of the cluster seeing slightly different
>> traffic and flow updates, perhaps different timeouts being applied to
>> the conntracks in every node.
>> A manual resync (using conntrackd -n) resolves these issues immediately.
>>
>> This new configuration option tells conntrackd to request a resync
>> with the other node, similar to what could happen manually using
>> the 'conntrackd -n' command.
>>
>> For now this option is only valid in NOTRACK sync mode.
>>
>> Example configuration:
>>
>> [...]
>> Sync {
>>         Mode NOTRACK {
>>                 DisableInternalCache on
>>                 DisableExternalCache on
>>                 RequestResync 30
>
> This looks very similar to the timer-based approach that is already
> there. Did you give it a try?
>

Yes. The timer based approach is... timer based (async).

It doesn't fit in an environment where you need to sync events as soon
as they happen.

> This approach doesn't solve nicely the case where you have an entry
> with a large timeout that got out of sync.

My idea is to be able to automatically force-sync nodes every 2 or 3
minutes (in my case).
Users may choose a different interval of course. What do you have in mind
for your case specifically?
--
Pablo Neira Ayuso April 25, 2017, 1:18 p.m. UTC | #3
On Tue, Apr 25, 2017 at 02:46:52PM +0200, Arturo Borrero Gonzalez wrote:
> On 25 April 2017 at 13:37, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Thu, Apr 20, 2017 at 07:28:16PM +0200, Arturo Borrero Gonzalez wrote:
> >> In some environments where both nodes of a cluster share all the conntracks,
> >> after an initial or manual resync, the conntrack information diverges from
> >> node to node.
> >>
> >> I have observed that this is not due to synchronization problems, given the
> >> link between the nodes is very stable and stats show no issues.
> >> So, this could be due to every node of the cluster seeing slightly different
> >> traffic and flow updates, perhaps different timeouts being applied to
> >> the conntracks in every node.
> >> A manual resync (using conntrackd -n) resolves these issues immediately.
> >>
> >> This new configuration option tells conntrackd to request a resync
> >> with the other node, similar to what could happen manually using
> >> the 'conntrackd -n' command.
> >>
> >> For now this option is only valid in NOTRACK sync mode.
> >>
> >> Example configuration:
> >>
> >> [...]
> >> Sync {
> >>         Mode NOTRACK {
> >>                 DisableInternalCache on
> >>                 DisableExternalCache on
> >>                 RequestResync 30
> >
> > This looks very similar to the timer-based approach that is already
> > there. Did you give it a try?
> >
> 
> Yes. The timer based approach is... timer based (async).
> 
> It doesn't fit in an environment where you need to sync events as soon
> as they happen.

IIRC the timer-based mode works like this:

1) If an event occurs, a sync message is sent.
2) After some time, we send a message to tell the other peer the entry
   is still there.
3) If no message is received, then the entry expires.
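
The three steps above can be modelled with a small simulation. This is only a minimal sketch of the idea, not conntrackd code: the names peer_entry, peer_receive, peer_expire_check and simulate are hypothetical, time is a plain integer counter, and the refresh interval is assumed to be positive.

```c
#include <stdbool.h>

/* Hypothetical model of one replicated entry as seen by the receiving peer. */
struct peer_entry {
	bool present;    /* does the peer currently hold the entry? */
	int  last_heard; /* time (seconds) of the last message about it */
};

/* Steps 1 and 2: an event message or a periodic refresh arrives at 'now'. */
static void peer_receive(struct peer_entry *e, int now)
{
	e->present = true;
	e->last_heard = now;
}

/* Step 3: drop the entry if nothing was heard for 'timeout' seconds. */
static void peer_expire_check(struct peer_entry *e, int now, int timeout)
{
	if (e->present && now - e->last_heard >= timeout)
		e->present = false;
}

/*
 * Run the model from t=0 to t=end, one second per iteration.
 * The sender emits the initial event at t=0 and a refresh every 'refresh'
 * seconds (assumed > 0), but only while t <= sender_alive_until.
 * Returns whether the peer still holds the entry at t=end.
 */
static bool simulate(int refresh, int timeout, int sender_alive_until, int end)
{
	struct peer_entry e = { false, 0 };
	int t;

	for (t = 0; t <= end; t++) {
		if (t <= sender_alive_until && t % refresh == 0)
			peer_receive(&e, t);
		peer_expire_check(&e, t, timeout);
	}
	return e.present;
}
```

For example, simulate(60, 180, 120, 600) models a sender that stops refreshing after 120 seconds: the peer keeps the entry until 180 seconds have passed without a message and then drops it, whereas a sender that keeps refreshing keeps the entry alive on the peer.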

> > This approach doesn't solve nicely the case where you have an entry
> > with a large timeout that got out of sync.
> 
> My idea is to be able to automatically force-sync nodes every 2 or 3
> minutes (in my case).

I see. Just wanted to know why the existing timer-based approach doesn't
fit well for you.
--
Arturo Borrero Gonzalez April 26, 2017, 11:32 a.m. UTC | #4
On 25 April 2017 at 15:18, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>
>> Yes. The timer based approach is... timer based (async).
>>
>> It doesn't fit in an environment where you need to sync events as soon
>> as they happen.
>
> IIRC the timer-based mode works like this:
>
> 1) If an event occurs, a sync message is sent.
> 2) After some time, we send a message to tell the other peer the entry
>    is still there.
> 3) If no message is received, then the entry expires.
>

The ALARM mode requires committing the external cache instead of the
conns being directly injected into the kernel.

I think the new RequestResync method (or whatever other alternative)
provides a good tradeoff between methods and increases the general
usefulness of conntrackd.
--
Pablo Neira Ayuso May 1, 2017, 9:13 a.m. UTC | #5
On Wed, Apr 26, 2017 at 01:32:38PM +0200, Arturo Borrero Gonzalez wrote:
> On 25 April 2017 at 15:18, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >>
> >> Yes. The timer based approach is... timer based (async).
> >>
> >> It doesn't fit in an environment where you need to sync events as soon
> >> as they happen.
> >
> > IIRC the timer-based mode works like this:
> >
> > 1) If an event occurs, a sync message is sent.
> > 2) After some time, we send a message to tell the other peer the entry
> >    is still there.
> > 3) If no message is received, then the entry expires.
> >
> 
> The ALARM mode requires committing the external cache instead of the
> conns being directly injected into the kernel.

You may want to disable the external cache with the alarm mode. The
alarm mode only needs the internal cache though, but that shouldn't be
much of a problem.

With the alarm mode, you will skip spikes in CPU consumption since
resync is expensive.  With a very large table, this results in some
sort of lazy busy polling.

> I think the new RequestResync method (or whatever other alternative)
> provides a good tradeoff between methods and increases general
> usefulness of conntrackd.

I'm trying to help here if I can give something better ;-)

Look, you should at least combine this new RequestResync with
CommitTimeout. Even if you don't explicitly request a commit command,
this sets the timeout for the entries that are pushed into the kernel.

So, if you set:

        RequestResync 30
        CommitTimeout 180

connections we don't get any information from for 180 seconds will
expire.

BTW, how are you measuring this improvement? Is it that you get fewer of
the log error messages that you reported before?

Thanks!
--
Arturo Borrero Gonzalez May 2, 2017, 8:18 a.m. UTC | #6
On 1 May 2017 at 11:13, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>>
>> The ALARM mode requires committing the external cache instead of the
>> conns being directly injected into the kernel.
>
> You may want to disable the external cache with the alarm mode. The
> alarm mode only needs the internal cache though, but that shouldn't be
> much of a problem.
>
> With the alarm mode, you will skip spikes in CPU consumption since
> resync is expensive.  With a very large table, this results in some
> sort of lazy busy polling.
>

I do the equivalent of this RequestResync by hand (i.e. using conntrackd -n) and
it seems to work fine, see below.

>> I think the new RequestResync method (or whatever other alternative)
>> provides a good tradeoff between methods and increases general
>> usefulness of conntrackd.
>
> I'm trying to help here if I can give something better ;-)
>
> Look, you should at least combine this new RequestResync with
> CommitTimeout. Even if you don't explicitly request a commit command,
> this sets the timeout for the entries that are pushed into the kernel.
>
> So, if you set:
>
>         RequestResync 30
>         CommitTimeout 180
>
> connections we don't get any information from for 180 seconds will
> expire.
>

It seems that CommitTimeout can't be combined with
DisableExternalCache, see the evaluate() function.

However a patch to enable this seems easy. I guess we could extend a
bit external_inject_ct_new() to allow reading the commit_timeout
instead of using 0 (similar to what cache_ct_commit_step() does,
right?)

I can add a preceding patch to the series to enable this.

> BTW, how are you measuring this improvement? Is it that you get fewer of
> the log error messages that you reported before?
>

What I detect is that after the initial startup/sync, the number of
conntracks in each node diverges.
After 10 minutes, the conntracks in each node are quite different, e.g.:

aborrero@node1:~ $ sudo conntrack -C
7885

aborrero@node2:~ $ sudo conntrack -C
17813

A manual 'conntrackd -n' seems to solve the problem:

aborrero@node1:~ $ sudo conntrackd -n ; sudo conntrack -C
18583

aborrero@node2:~ $ sudo conntrackd -n ; sudo conntrack -C
18473

I can understand that each node sees different traffic (it is a
multi-master symmetric configuration) but still,
according to my conntrackd setup, I understand that the numbers
shouldn't show such a big divergence.

Then, in this scenario, if node2 fails over to node1, there are 10k
entries missing in node1, connections that will presumably be lost and
dropped by the stateful configuration of nftables.

I currently solve this by means of scripts and cron calls, which is a
bit ugly, given how easy it would be for conntrackd to resync by itself.

You may ask, what kind of traffic does each node see? In my current
setup, node1 sees all the IPv4 traffic and node2 sees all the IPv6
traffic (or the reverse). In case of failover, a single node can see all
the traffic.
--
Pablo Neira Ayuso May 8, 2017, 5:47 p.m. UTC | #7
On Tue, May 02, 2017 at 10:18:55AM +0200, Arturo Borrero Gonzalez wrote:
> On 1 May 2017 at 11:13, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> >>
> >> The ALARM mode requires committing the external cache instead of the
> >> conns being directly injected into the kernel.
> >
> > You may want to disable the external cache with the alarm mode. The
> > alarm mode only needs the internal cache though, but that shouldn't be
> > much of a problem.
> >
> > With the alarm mode, you will skip spikes in CPU consumption since
> > resync is expensive.  With a very large table, this results in some
> > sort of lazy busy polling.
> >
> 
> I do the equivalent of this RequestResync by hand (i.e. using conntrackd -n) and
> it seems to work fine, see below.

OK.

> >> I think the new RequestResync method (or whatever other alternative)
> >> provides a good tradeoff between methods and increases general
> >> usefulness of conntrackd.
> >
> > I'm trying to help here if I can give something better ;-)
> >
> > Look, you should at least combine this new RequestResync with
> > CommitTimeout. Even if you don't explicitly request a commit command,
> > this sets the timeout for the entries that are pushed into the kernel.
> >
> > So, if you set:
> >
> >         RequestResync 30
> >         CommitTimeout 180
> >
> > connections we don't get any information from for 180 seconds will
> > expire.
> >
> 
> It seems that CommitTimeout can't be combined with
> DisableExternalCache, see the evaluate() function.
>
> However a patch to enable this seems easy. I guess we could extend a
> bit external_inject_ct_new() to allow reading the commit_timeout
> instead of using 0 (similar to what cache_ct_commit_step() does,
> right?)
> 
> I can add a preceding patch to the series to enable this.
> 
> > BTW, how are you measuring this improvement? Is it that you get fewer of
> > the log error messages that you reported before?
> >
> 
> What I detect is that after the initial startup/sync, the number of
> conntracks in each node diverges.
> After 10 minutes, the conntracks in each node are quite different, e.g.:
> 
> aborrero@node1:~ $ sudo conntrack -C
> 7885
> 
> aborrero@node2:~ $ sudo conntrack -C
> 17813
> 
> A manual 'conntrackd -n' seems to solve the problem:
> 
> aborrero@node1:~ $ sudo conntrackd -n ; sudo conntrack -C
> 18583
> 
> aborrero@node2:~ $ sudo conntrackd -n ; sudo conntrack -C
> 18473
> 
> I can understand that each node sees different traffic (it is a
> multi-master symmetric configuration) but still,
> according to my conntrackd setup, I understand that the numbers
> shouldn't show such a big divergence.
>
> Then, in this scenario, if node2 fails over to node1, there are 10k
> entries missing in node1, connections that will presumably be lost and
> dropped by the stateful configuration of nftables.
> 
> I currently solve this by means of scripts and cron calls, which is a
> bit ugly, given how easy it would be for conntrackd to resync by itself.
> 
> You may ask, what kind of traffic does each node see? In my current
> setup, node1 sees all the IPv4 traffic and node2 sees all the IPv6
> traffic (or the reverse). In case of failover, a single node can see all
> the traffic.

OK, so there is no asymmetric path at all, as node1 sees IPv4 traffic
coming both in original and reply direction.

This is strange, there is probably a more fundamental bug here. I
would prefer that we not paper over this with a new option.

I'm going to reproduce this in my testbed and get back to you.
--
Patch

diff --git a/conntrackd.conf.5 b/conntrackd.conf.5
index 4a4f2e2..6ac0fb6 100644
--- a/conntrackd.conf.5
+++ b/conntrackd.conf.5
@@ -195,6 +195,15 @@  messages are directly sent through the dedicated link.
 This option is set off by default.
 
 .TP
+.BI "RequestResync <seconds>"
+Request a complete resync from the other node every <seconds> seconds. This
+should help resolve synchronization problems if they happen in your environment.
+
+Example: RequestResync 60
+
+This option is set off by default.
+
+.TP
 .BI "DisableExternalCache <on|off>"
 Same as in \fBFTFW\fP mode.
 
diff --git a/include/conntrackd.h b/include/conntrackd.h
index 27e43db..4cfb373 100644
--- a/include/conntrackd.h
+++ b/include/conntrackd.h
@@ -111,6 +111,7 @@  struct ct_conf {
 	int event_iterations_limit;
 	int systemd;
 	int running_mode;
+	int request_resync;
 	struct {
 		int error_queue_length;
 	} channelc;
diff --git a/include/resync.h b/include/resync.h
index 5986600..75cd7dd 100644
--- a/include/resync.h
+++ b/include/resync.h
@@ -3,5 +3,6 @@ 
 
 void resync_req(void);
 void resync_send(int (*do_cache_to_tx)(void *data1, void *data2));
+void resync_run_init(void);
 
 #endif /*_RESYNC_H_ */
diff --git a/src/read_config_lex.l b/src/read_config_lex.l
index a378269..664b818 100644
--- a/src/read_config_lex.l
+++ b/src/read_config_lex.l
@@ -136,6 +136,7 @@  notrack		[N|n][O|o][T|t][R|r][A|a][C|c][K|k]
 "ExpectMax"			{ return T_HELPER_EXPECT_MAX; }
 "ExpectTimeout"			{ return T_HELPER_EXPECT_TIMEOUT; }
 "Systemd"			{ return T_SYSTEMD; }
+"RequestResync"			{ return T_REQUEST_RESYNC; }
 
 {is_on}			{ return T_ON; }
 {is_off}		{ return T_OFF; }
diff --git a/src/read_config_yy.y b/src/read_config_yy.y
index 2c08d4e..0509bd3 100644
--- a/src/read_config_yy.y
+++ b/src/read_config_yy.y
@@ -81,7 +81,7 @@  enum {
 %token T_OPTIONS T_TCP_WINDOW_TRACKING T_EXPECT_SYNC
 %token T_HELPER T_HELPER_QUEUE_NUM T_HELPER_QUEUE_LEN T_HELPER_POLICY
 %token T_HELPER_EXPECT_TIMEOUT T_HELPER_EXPECT_MAX
-%token T_SYSTEMD
+%token T_SYSTEMD T_REQUEST_RESYNC
 
 %token <string> T_IP T_PATH_VAL
 %token <val> T_NUMBER
@@ -777,6 +777,7 @@  sync_mode_notrack_line: timeout
 		      | purge
 		      | disable_internal_cache
 		      | disable_external_cache
+		      | request_resync
 		      ;
 
 disable_internal_cache: T_DISABLE_INTERNAL_CACHE T_ON
@@ -804,6 +805,11 @@  resend_queue_size: T_RESEND_QUEUE_SIZE T_NUMBER
 	conf.resend_queue_size = $2;
 };
 
+request_resync: T_REQUEST_RESYNC T_NUMBER
+{
+	conf.request_resync = $2;
+};
+
 window_size: T_WINDOWSIZE T_NUMBER
 {
 	conf.window_size = $2;
diff --git a/src/resync.c b/src/resync.c
index dbb2b6f..4310d6b 100644
--- a/src/resync.c
+++ b/src/resync.c
@@ -23,6 +23,9 @@ 
 #include "queue_tx.h"
 #include "resync.h"
 #include "cache.h"
+#include "alarm.h"
+
+static struct alarm_block	resync_run_alarm;
 
 void resync_req(void)
 {
@@ -38,3 +41,21 @@  void resync_send(int (*do_cache_to_tx)(void *data1, void *data2))
 	cache_iterate(STATE(mode)->internal->exp.data,
 		      NULL, do_cache_to_tx);
 }
+
+static void resync_run(struct alarm_block *a, void *data)
+{
+	resync_req();
+	add_alarm(&resync_run_alarm, CONFIG(request_resync), 0);
+}
+
+void resync_run_init(void)
+{
+	if (CONFIG(request_resync) == 0)
+		return;
+
+	dlog(LOG_NOTICE, "setting up automatic resync requests every %d "
+	     "seconds", CONFIG(request_resync));
+
+	init_alarm(&resync_run_alarm, NULL,  resync_run);
+	add_alarm(&resync_run_alarm, CONFIG(request_resync), 0);
+}
diff --git a/src/run.c b/src/run.c
index 1fe6cba..4ff2186 100644
--- a/src/run.c
+++ b/src/run.c
@@ -31,6 +31,7 @@ 
 #include "date.h"
 #include "internal.h"
 #include "systemd.h"
+#include "resync.h"
 
 #include <errno.h>
 #include <signal.h>
@@ -284,6 +285,8 @@  init(void)
 #endif
 	time(&STATE(stats).daemon_start_time);
 
+	resync_run_init();
+
 	dlog(LOG_NOTICE, "initialization completed");
 
 	return 0;
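
The resync.c hunk above relies on a self-rearming timer: the callback issues the request and then schedules itself again. The following stand-alone sketch illustrates that pattern only. It does not use conntrackd's alarm API; mock_alarm, mock_alarm_init, mock_alarm_add, resync_state, periodic_resync and run_for are hypothetical names, the clock is simulated, and the resync request is reduced to a counter.

```c
#include <stddef.h>

/* Hypothetical single-slot timer standing in for a real alarm scheduler. */
struct mock_alarm {
	int armed;      /* 1 while a deadline is pending */
	long deadline;  /* absolute simulated time at which to fire */
	void (*cb)(struct mock_alarm *a, void *data);
	void *data;
};

static void mock_alarm_init(struct mock_alarm *a, void *data,
			    void (*cb)(struct mock_alarm *a, void *data))
{
	a->armed = 0;
	a->deadline = 0;
	a->cb = cb;
	a->data = data;
}

/* Arm the timer to fire 'secs' seconds after 'now'. */
static void mock_alarm_add(struct mock_alarm *a, long now, long secs)
{
	a->deadline = now + secs;
	a->armed = 1;
}

struct resync_state {
	long now;       /* simulated clock, in seconds */
	long interval;  /* seconds between requests */
	int requests;   /* number of resync requests issued so far */
};

/* Callback: count one resync request, then re-arm for the next interval. */
static void periodic_resync(struct mock_alarm *a, void *data)
{
	struct resync_state *s = data;

	s->requests++;
	mock_alarm_add(a, s->now, s->interval);
}

/*
 * Simulate 'duration' seconds and return how many requests were issued.
 * An interval of 0 (or less) means the option is off: the timer is never armed.
 */
static int run_for(long interval, long duration)
{
	struct resync_state s = { 0, interval, 0 };
	struct mock_alarm a;

	if (interval <= 0)
		return 0;

	mock_alarm_init(&a, &s, periodic_resync);
	mock_alarm_add(&a, s.now, interval);

	for (s.now = 1; s.now <= duration; s.now++) {
		if (a.armed && s.now >= a.deadline) {
			a.armed = 0;
			a.cb(&a, a.data);
		}
	}
	return s.requests;
}
```

With this model, run_for(30, 180) issues six requests in three minutes, which gives a rough idea of how often a full resync would be triggered with RequestResync 30.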