diff mbox series

[ovs-dev] github: Temporarily disable SNAT with exhaustion system test.

Message ID 20240301113744.3684763-1-i.maximets@ovn.org
State Accepted
Commit 99c86c6c46632e6adfd699611136ea6b26d1e291
Delegated to: Ilya Maximets
Headers show
Series [ovs-dev] github: Temporarily disable SNAT with exhaustion system test. | expand

Checks

Context Check Description
ovsrobot/apply-robot success apply and check: success
ovsrobot/github-robot-_Build_and_Test success github build: passed
ovsrobot/intel-ovs-compilation success test: success

Commit Message

Ilya Maximets March 1, 2024, 11:37 a.m. UTC
With a new runner update, GitHub Actions had a kernel update.
And it seems like something changed between kernels 6.2 and 6.5
so this test now fails very frequently.

I can reproduce the same issue on RHEL 9, and I can't reproduce
it on Ubuntu 23.04 (kernel 6.2).

The test is creating a NAT with a single address+port pair in
an attempt to simulate an address space exhaustion.  It is
expected that a first connection with wget leaves a conntrack
entry in a TIME_WAIT state and the second wget should fail
as long as this entry remains, because the only available
address+port pair is already taken.

However, for some reason, very frequently (not always!) the
second connection replaces the first conntrack entry with a
new one and connection succeeds.  There is still only one
connection in the conntrack at any single moment in time, so
there is seemingly no issue with the NAT, but the behavior
is unexpected and the test fails.

Disable the test in CI until we figure out how to fix the
kernel (if it is a kernel bug) or the test.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 tests/system-traffic.at | 1 +
 1 file changed, 1 insertion(+)

Comments

Ilya Maximets March 1, 2024, 12:42 p.m. UTC | #1
On 3/1/24 12:37, Ilya Maximets wrote:
> With a new runner update, GitHub Actions had a kernel update.
> And it seems like something changed between kernels 6.2 and 6.5
> so this test now fails very frequently.
> 
> I can reproduce the same issue on RHEL 9, and I can't reproduce
> it on Ubuntu 23.04 (kernel 6.2).
> 
> The test is creating a NAT with a single address+port pair in
> an attempt to simulate an address space exhaustion.  It is
> expected that a first connection with wget leaves a conntrack
> entry in a TIME_WAIT state and the second wget should fail
> as long as this entry remains, because the only available
> address+port pair is already taken.
> 
> However, for some reason, very frequently (not always!) the
> second connection replaces the first conntrack entry with a
> new one and connection succeeds.  There is still only one
> connection in the conntrack at any single moment in time, so
> there is seemingly no issue with the NAT, but the behavior
> is unexpected and the test fails.
> 
> Disable the test in CI until we figure out how to fix the
> kernel (if it is a kernel bug) or the test.

Marcelo pointed out the following change that went into 6.5 kernel:
  https://lore.kernel.org/netdev/20230626064749.75525-7-pablo@netfilter.org/

It looks like this test is not going to work anymore as new connections
are now allowed to evict closing ones on collisions.

We need to re-think the test, i.e. we have to keep the first connection
open while we're trying to make the second one, so it is not evicted.

> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  tests/system-traffic.at | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index 98e494abf..07d09b912 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -6388,6 +6388,7 @@ OVS_TRAFFIC_VSWITCHD_STOP
>  AT_CLEANUP
>  
>  AT_SETUP([conntrack - SNAT with port range with exhaustion])
> +OVS_CHECK_GITHUB_ACTION()
>  CHECK_CONNTRACK()
>  CHECK_CONNTRACK_NAT()
>  OVS_TRAFFIC_VSWITCHD_START()
Simon Horman March 1, 2024, 2:47 p.m. UTC | #2
On Fri, Mar 01, 2024 at 12:37:36PM +0100, Ilya Maximets wrote:
> With a new runner update, GitHub Actions had a kernel update.
> And it seems like something changed between kernels 6.2 and 6.5
> so this test now fails very frequently.
> 
> I can reproduce the same issue on RHEL 9, and I can't reproduce
> it on Ubuntu 23.04 (kernel 6.2).
> 
> The test is creating a NAT with a single address+port pair in
> an attempt to simulate an address space exhaustion.  It is
> expected that a first connection with wget leaves a conntrack
> entry in a TIME_WAIT state and the second wget should fail
> as long as this entry remains, because the only available
> address+port pair is already taken.
> 
> However, for some reason, very frequently (not always!) the
> second connection replaces the first conntrack entry with a
> new one and connection succeeds.  There is still only one
> connection in the conntrack at any single moment in time, so
> there is seemingly no issue with the NAT, but the behavior
> is unexpected and the test fails.
> 
> Disable the test in CI until we figure out how to fix the
> kernel (if it is a kernel bug) or the test.
> 
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

Acked-by: Simon Horman <horms@ovn.org>
Paolo Valerio March 1, 2024, 3:18 p.m. UTC | #3
Ilya Maximets <i.maximets@ovn.org> writes:

> With a new runner update, GitHub Actions had a kernel update.
> And it seems like something changed between kernels 6.2 and 6.5
> so this test now fails very frequently.
>
> I can reproduce the same issue on RHEL 9, and I can't reproduce
> it on Ubuntu 23.04 (kernel 6.2).
>
> The test is creating a NAT with a single address+port pair in
> an attempt to simulate an address space exhaustion.  It is
> expected that a first connection with wget leaves a conntrack
> entry in a TIME_WAIT state and the second wget should fail
> as long as this entry remains, because the only available
> address+port pair is already taken.
>
> However, for some reason, very frequently (not always!) the
> second connection replaces the first conntrack entry with a
> new one and connection succeeds.  There is still only one
> connection in the conntrack at any single moment in time, so
> there is seemingly no issue with the NAT, but the behavior
> is unexpected and the test fails.
>
> Disable the test in CI until we figure out how to fix the
> kernel (if it is a kernel bug) or the test.
>
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---

Acked-by: Paolo Valerio <pvalerio@redhat.com>
Eelco Chaudron March 1, 2024, 5:22 p.m. UTC | #4
On 1 Mar 2024, at 13:42, Ilya Maximets wrote:

> On 3/1/24 12:37, Ilya Maximets wrote:
>> With a new runner update, GitHub Actions had a kernel update.
>> And it seems like something changed between kernels 6.2 and 6.5
>> so this test now fails very frequently.
>>
>> I can reproduce the same issue on RHEL 9, and I can't reproduce
>> it on Ubuntu 23.04 (kernel 6.2).
>>
>> The test is creating a NAT with a single address+port pair in
>> an attempt to simulate an address space exhaustion.  It is
>> expected that a first connection with wget leaves a conntrack
>> entry in a TIME_WAIT state and the second wget should fail
>> as long as this entry remains, because the only available
>> address+port pair is already taken.
>>
>> However, for some reason, very frequently (not always!) the
>> second connection replaces the first conntrack entry with a
>> new one and connection succeeds.  There is still only one
>> connection in the conntrack at any single moment in time, so
>> there is seemingly no issue with the NAT, but the behavior
>> is unexpected and the test fails.
>>
>> Disable the test in CI until we figure out how to fix the
>> kernel (if it is a kernel bug) or the test.
>
> Marcelo pointed out the following change that went into 6.5 kernel:
> https://lore.kernel.org/netdev/20230626064749.75525-7-pablo@netfilter.org/
>
> It looks like this test is not going to work anymore as new connections
> are now allowed to evict closing ones on collisions.
>
> We need to re-think the test, i.e. we have to keep the first connection
> open while we're trying to make the second one, so it is not evicted.

Paolo, can you take a look at fixing the test case? Do you want an FDP issue for this?

>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---

The change looks good for now.

Acked-by: Eelco Chaudron <echaudro@redhat.com>

>>  tests/system-traffic.at | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
>> index 98e494abf..07d09b912 100644
>> --- a/tests/system-traffic.at
>> +++ b/tests/system-traffic.at
>> @@ -6388,6 +6388,7 @@ OVS_TRAFFIC_VSWITCHD_STOP
>>  AT_CLEANUP
>>
>>  AT_SETUP([conntrack - SNAT with port range with exhaustion])
>> +OVS_CHECK_GITHUB_ACTION()
>>  CHECK_CONNTRACK()
>>  CHECK_CONNTRACK_NAT()
>>  OVS_TRAFFIC_VSWITCHD_START()
Ilya Maximets March 1, 2024, 5:26 p.m. UTC | #5
On 3/1/24 16:18, Paolo Valerio wrote:
> Ilya Maximets <i.maximets@ovn.org> writes:
> 
>> With a new runner update, GitHub Actions had a kernel update.
>> And it seems like something changed between kernels 6.2 and 6.5
>> so this test now fails very frequently.
>>
>> I can reproduce the same issue on RHEL 9, and I can't reproduce
>> it on Ubuntu 23.04 (kernel 6.2).
>>
>> The test is creating a NAT with a single address+port pair in
>> an attempt to simulate an address space exhaustion.  It is
>> expected that a first connection with wget leaves a conntrack
>> entry in a TIME_WAIT state and the second wget should fail
>> as long as this entry remains, because the only available
>> address+port pair is already taken.
>>
>> However, for some reason, very frequently (not always!) the
>> second connection replaces the first conntrack entry with a
>> new one and connection succeeds.  There is still only one
>> connection in the conntrack at any single moment in time, so
>> there is seemingly no issue with the NAT, but the behavior
>> is unexpected and the test fails.
>>
>> Disable the test in CI until we figure out how to fix the
>> kernel (if it is a kernel bug) or the test.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
> 
> Acked-by: Paolo Valerio <pvalerio@redhat.com>
> 

Thanks, Simon, Paolo and Eelco!

I updated the commit message with the reference to the kernel change.
With that, applied and backported to 3.3 to stop further CI failures.

Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/tests/system-traffic.at b/tests/system-traffic.at
index 98e494abf..07d09b912 100644
--- a/tests/system-traffic.at
+++ b/tests/system-traffic.at
@@ -6388,6 +6388,7 @@  OVS_TRAFFIC_VSWITCHD_STOP
 AT_CLEANUP
 
 AT_SETUP([conntrack - SNAT with port range with exhaustion])
+OVS_CHECK_GITHUB_ACTION()
 CHECK_CONNTRACK()
 CHECK_CONNTRACK_NAT()
 OVS_TRAFFIC_VSWITCHD_START()