[net-next,V2,09/15] net/mlx5e: CT: Use the same counter for both directions

Message ID 20200923224824.67340-10-saeed@kernel.org
State Accepted
Delegated to: David Miller
Series [net-next,V2,01/15] net/mlx5: Refactor multi chains and prios support

Commit Message

Saeed Mahameed Sept. 23, 2020, 10:48 p.m. UTC
From: Oz Shlomo <ozsh@mellanox.com>

A connection is represented by two 5-tuple entries, one for each direction.
Currently, each direction allocates its own hw counter, which is
inefficient as ct aging is managed per connection.

Share the counter that was allocated for the original direction with the
reverse direction.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 .../ethernet/mellanox/mlx5/core/en/tc_ct.c    | 94 +++++++++++++++++--
 1 file changed, 85 insertions(+), 9 deletions(-)
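
The scheme in the patch below: the first direction to be offloaded
allocates a refcounted counter; when the reverse direction is offloaded,
its reverse tuple is looked up and a reference is taken on the existing
counter, and dropping the last reference frees it. A minimal userspace
model of that get/put pattern (illustrative names, not the driver's API;
a heap allocation stands in for the mlx5_fc hardware counter, and the
driver additionally serializes the lookup with a mutex, visible in the
diff further below):

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct shared_counter {
        atomic_int refcount;
        unsigned long long packets;     /* stands in for the HW counter */
};

/* Share the reverse direction's counter if it exists, else allocate. */
static struct shared_counter *counter_get(struct shared_counter *rev)
{
        struct shared_counter *c;

        if (rev) {
                atomic_fetch_add(&rev->refcount, 1);
                return rev;
        }
        c = calloc(1, sizeof(*c));
        if (!c)
                return NULL;
        atomic_init(&c->refcount, 1);
        return c;
}

/* Dropping the last reference frees the counter. */
static void counter_put(struct shared_counter *c)
{
        if (atomic_fetch_sub(&c->refcount, 1) == 1)
                free(c);
}

int main(void)
{
        struct shared_counter *orig = counter_get(NULL); /* original dir */
        struct shared_counter *repl = counter_get(orig); /* reverse dir  */

        printf("shared: %s\n", orig == repl ? "yes" : "no"); /* -> yes */
        counter_put(repl);
        counter_put(orig);
        return 0;
}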

Comments

Marcelo Ricardo Leitner Nov. 27, 2020, 2:01 p.m. UTC | #1
On Wed, Sep 23, 2020 at 03:48:18PM -0700, saeed@kernel.org wrote:
> From: Oz Shlomo <ozsh@mellanox.com>

Sorry for reviving this one, but seemed better for the context.

> 
> A connection is represented by two 5-tuple entries, one for each direction.
> Currently, each direction allocates its own hw counter, which is
> inefficient as ct aging is managed per connection.
> 
> Share the counter that was allocated for the original direction with the
> reverse direction.

Yes, aging is done per connection, but the stats are not. With this
patch, with a netperf TCP_RR test, I get this (output wrapped for readability):

# grep 172.0.0.4 /proc/net/nf_conntrack
ipv4     2 tcp      6
  src=172.0.0.3 dst=172.0.0.4 sport=34018 dport=33396 packets=3941992 bytes=264113427
  src=172.0.0.4 dst=172.0.0.3 sport=33396 dport=34018 packets=4 bytes=218 [HW_OFFLOAD]
  mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3

while without it (594e31bceb + act_ct patch to enable it posted
yesterday + revert), I get:

# grep 172.0.0.4 /proc/net/nf_conntrack
ipv4     2 tcp      6
  src=172.0.0.3 dst=172.0.0.4 sport=41856 dport=32776 packets=1876763 bytes=125743084
  src=172.0.0.4 dst=172.0.0.3 sport=32776 dport=41856 packets=1876761 bytes=125742951 [HW_OFFLOAD]
  mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3

The same is then visible in 'ovs-appctl dpctl/dump-conntrack -s'.
Summing both directions into one entry like this is at least very misleading.
Seems this change was motivated only by hw resource constraints. That
said, I'm wondering: can this change be reverted somehow?

  Marcelo
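
A way to read the numbers above (an interpretation; the two runs are not
directly comparable, as they carry different packet totals): with the
shared counter, the hardware accumulates both directions into the one
counter, and the combined count ends up reported against the
original-direction entry, while the reply entry retains only the few
packets it counted in software before the flow was offloaded (the
packets=4 above). A toy model of the two reporting schemes, assuming
per-direction hardware counts like those of the unpatched run:

#include <stdio.h>

int main(void)
{
        /* Per-direction HW counts, as in the unpatched run above. */
        unsigned long long orig_hw = 1876763, repl_hw = 1876761;
        unsigned long long repl_sw = 4; /* software-counted before offload */

        /* Per-direction counters (before this patch): each conntrack
         * entry reports its own direction's total.
         */
        printf("per-dir: orig=%llu repl=%llu\n", orig_hw, repl_hw);

        /* Shared counter (this patch): both directions accumulate into
         * one counter, whose total shows up on the original entry.
         */
        printf("shared:  orig=%llu repl=%llu\n", orig_hw + repl_hw, repl_sw);
        return 0;
}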
Saeed Mahameed Dec. 1, 2020, 9:41 p.m. UTC | #2
On Fri, 2020-11-27 at 11:01 -0300, Marcelo Ricardo Leitner wrote:
> On Wed, Sep 23, 2020 at 03:48:18PM -0700, saeed@kernel.org wrote:
> > From: Oz Shlomo <ozsh@mellanox.com>
> 
> Sorry for reviving this one, but seemed better for the context.
> 
> > A connection is represented by two 5-tuple entries, one for each
> > direction.
> > Currently, each direction allocates its own hw counter, which is
> > inefficient as ct aging is managed per connection.
> > 
> > Share the counter that was allocated for the original direction
> > with the
> > reverse direction.
> 
> Yes, aging is done per connection, but the stats are not. With this
> patch, with a netperf TCP_RR test, I get this (output wrapped for
> readability):
> 
> # grep 172.0.0.4 /proc/net/nf_conntrack
> ipv4     2 tcp      6
>   src=172.0.0.3 dst=172.0.0.4 sport=34018 dport=33396 packets=3941992
> bytes=264113427
>   src=172.0.0.4 dst=172.0.0.3 sport=33396 dport=34018 packets=4
> bytes=218 [HW_OFFLOAD]
>   mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
> 
> while without it (594e31bceb + act_ct patch to enable it posted
> yesterday + revert), I get:
> 
> # grep 172.0.0.4 /proc/net/nf_conntrack
> ipv4     2 tcp      6
>   src=172.0.0.3 dst=172.0.0.4 sport=41856 dport=32776 packets=1876763
> bytes=125743084
>   src=172.0.0.4 dst=172.0.0.3 sport=32776 dport=41856 packets=1876761
> bytes=125742951 [HW_OFFLOAD]
>   mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
> 
> The same is then visible in 'ovs-appctl dpctl/dump-conntrack -s'.
> Summing both directions into one entry like this is at least very misleading.
> Seems this change was motivated only by hw resource constraints. That
> said, I'm wondering: can this change be reverted somehow?
> 
>   Marcelo

Hi Marcelo, thanks for the report.
Sorry, I am not familiar with this procfs interface.
Oz, Ariel, Roi, what is your take on this? It seems that we changed the
stats behavior incorrectly.

Thanks,
Saeed.
Oz Shlomo Dec. 7, 2020, 10:20 a.m. UTC | #3
Hi Marcelo,

On 12/1/2020 11:41 PM, Saeed Mahameed wrote:
> On Fri, 2020-11-27 at 11:01 -0300, Marcelo Ricardo Leitner wrote:
>> On Wed, Sep 23, 2020 at 03:48:18PM -0700, saeed@kernel.org wrote:
>>> From: Oz Shlomo <ozsh@mellanox.com>
>>
>> Sorry for reviving this one, but seemed better for the context.
>>
>>> A connection is represented by two 5-tuple entries, one for each
>>> direction.
>>> Currently, each direction allocates its own hw counter, which is
>>> inefficient as ct aging is managed per connection.
>>>
>>> Share the counter that was allocated for the original direction
>>> with the
>>> reverse direction.
>>
>> Yes, aging is done per connection, but the stats are not. With this
>> patch, with a netperf TCP_RR test, I get this (output wrapped for
>> readability):
>>
>> # grep 172.0.0.4 /proc/net/nf_conntrack
>> ipv4     2 tcp      6
>>    src=172.0.0.3 dst=172.0.0.4 sport=34018 dport=33396 packets=3941992
>> bytes=264113427
>>    src=172.0.0.4 dst=172.0.0.3 sport=33396 dport=34018 packets=4
>> bytes=218 [HW_OFFLOAD]
>>    mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
>>
>> while without it (594e31bceb + act_ct patch to enable it posted
>> yesterday + revert), I get:
>>
>> # grep 172.0.0.4 /proc/net/nf_conntrack
>> ipv4     2 tcp      6
>>    src=172.0.0.3 dst=172.0.0.4 sport=41856 dport=32776 packets=1876763
>> bytes=125743084
>>    src=172.0.0.4 dst=172.0.0.3 sport=32776 dport=41856 packets=1876761
>> bytes=125742951 [HW_OFFLOAD]
>>    mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=0 use=3
>>
>> The same is then visible in 'ovs-appctl dpctl/dump-conntrack -s'.
>> Summing both directions into one entry like this is at least very misleading.
>> Seems this change was motivated only by hw resource constraints. That
>> said, I'm wondering: can this change be reverted somehow?
>>
>>    Marcelo
> 
> Hi Marcelo, thanks for the report.
> Sorry, I am not familiar with this procfs interface.
> Oz, Ariel, Roi, what is your take on this? It seems that we changed the
> stats behavior incorrectly.

Indeed we overlooked the CT accounting extension.
We will submit a driver fix.

> 
> Thanks,
> Saeed.
Marcelo Ricardo Leitner Dec. 7, 2020, 7:19 p.m. UTC | #4
On Mon, Dec 07, 2020 at 12:20:54PM +0200, Oz Shlomo wrote:
> On 12/1/2020 11:41 PM, Saeed Mahameed wrote:
> > On Fri, 2020-11-27 at 11:01 -0300, Marcelo Ricardo Leitner wrote:
...
> > > The same is then visible in 'ovs-appctl dpctl/dump-conntrack -s'.
> > > Summing both directions into one entry like this is at least very misleading.
> > > Seems this change was motivated only by hw resource constraints. That
> > > said, I'm wondering: can this change be reverted somehow?
> > > 
> > >    Marcelo
> > 
> > Hi Marcelo, thanks for the report.
> > Sorry, I am not familiar with this procfs interface.
> > Oz, Ariel, Roi, what is your take on this? It seems that we changed the
> > stats behavior incorrectly.
> 
> Indeed we overlooked the CT accounting extension.
> We will submit a driver fix.

Cool. Thanks for confirming it, Oz.

  Marcelo
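
One locking detail worth noting in the patch below: the get path performs
the reverse-tuple lookup and the refcount increment under
shared_counter_lock using refcount_inc_not_zero(), and the del path
removes the tuples from the hashtables under the same lock before
dropping its reference, so a get can never revive a counter whose
refcount has already reached zero. Condensed from the diff (tuple
reversal and error handling elided):

mutex_lock(&ct_priv->shared_counter_lock);
rev_entry = rhashtable_lookup_fast(&ct_priv->ct_tuples_ht, &rev_tuple,
                                   tuples_ht_params);
if (rev_entry &&
    refcount_inc_not_zero(&rev_entry->shared_counter->refcount)) {
        /* The reverse entry is still alive: share its counter. */
        mutex_unlock(&ct_priv->shared_counter_lock);
        return rev_entry->shared_counter;
}
mutex_unlock(&ct_priv->shared_counter_lock);
/* Otherwise fall through and allocate a fresh counter with refcount 1. */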

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
index 86afef459dc6..9a7bd681f8fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c
@@ -51,6 +51,7 @@  struct mlx5_tc_ct_priv {
 	struct mlx5_flow_table *ct_nat;
 	struct mlx5_flow_table *post_ct;
 	struct mutex control_lock; /* guards parallel adds/dels */
+	struct mutex shared_counter_lock;
 	struct mapping_ctx *zone_mapping;
 	struct mapping_ctx *labels_mapping;
 	enum mlx5_flow_namespace_type ns_type;
@@ -117,11 +118,16 @@  struct mlx5_ct_tuple {
 	u16 zone;
 };
 
+struct mlx5_ct_shared_counter {
+	struct mlx5_fc *counter;
+	refcount_t refcount;
+};
+
 struct mlx5_ct_entry {
 	struct rhash_head node;
 	struct rhash_head tuple_node;
 	struct rhash_head tuple_nat_node;
-	struct mlx5_fc *counter;
+	struct mlx5_ct_shared_counter *shared_counter;
 	unsigned long cookie;
 	unsigned long restore_cookie;
 	struct mlx5_ct_tuple tuple;
@@ -385,6 +391,16 @@  mlx5_tc_ct_set_tuple_match(struct mlx5e_priv *priv, struct mlx5_flow_spec *spec,
 	return 0;
 }
 
+static void
+mlx5_tc_ct_shared_counter_put(struct mlx5_tc_ct_priv *ct_priv, struct mlx5_ct_entry *entry)
+{
+	if (!refcount_dec_and_test(&entry->shared_counter->refcount))
+		return;
+
+	mlx5_fc_destroy(ct_priv->esw->dev, entry->shared_counter->counter);
+	kfree(entry->shared_counter);
+}
+
 static void
 mlx5_tc_ct_entry_del_rule(struct mlx5_tc_ct_priv *ct_priv,
 			  struct mlx5_ct_entry *entry,
@@ -409,7 +425,6 @@  mlx5_tc_ct_entry_del_rules(struct mlx5_tc_ct_priv *ct_priv,
 	mlx5_tc_ct_entry_del_rule(ct_priv, entry, true);
 	mlx5_tc_ct_entry_del_rule(ct_priv, entry, false);
 
-	mlx5_fc_destroy(ct_priv->esw->dev, entry->counter);
 }
 
 static struct flow_action_entry *
@@ -683,7 +698,7 @@  mlx5_tc_ct_entry_add_rule(struct mlx5_tc_ct_priv *ct_priv,
 	attr->dest_ft = ct_priv->post_ct;
 	attr->ft = nat ? ct_priv->ct_nat : ct_priv->ct;
 	attr->outer_match_level = MLX5_MATCH_L4;
-	attr->counter = entry->counter;
+	attr->counter = entry->shared_counter->counter;
 	attr->flags |= MLX5_ESW_ATTR_FLAG_NO_IN_PORT;
 
 	mlx5_tc_ct_set_tuple_match(netdev_priv(ct_priv->netdev), spec, flow_rule);
@@ -716,18 +731,73 @@  mlx5_tc_ct_entry_add_rule(struct mlx5_tc_ct_priv *ct_priv,
 	return err;
 }
 
+static struct mlx5_ct_shared_counter *
+mlx5_tc_ct_shared_counter_get(struct mlx5_tc_ct_priv *ct_priv,
+			      struct mlx5_ct_entry *entry)
+{
+	struct mlx5_ct_tuple rev_tuple = entry->tuple;
+	struct mlx5_ct_shared_counter *shared_counter;
+	struct mlx5_eswitch *esw = ct_priv->esw;
+	struct mlx5_ct_entry *rev_entry;
+	__be16 tmp_port;
+
+	/* get the reversed tuple */
+	tmp_port = rev_tuple.port.src;
+	rev_tuple.port.src = rev_tuple.port.dst;
+	rev_tuple.port.dst = tmp_port;
+
+	if (rev_tuple.addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		__be32 tmp_addr = rev_tuple.ip.src_v4;
+
+		rev_tuple.ip.src_v4 = rev_tuple.ip.dst_v4;
+		rev_tuple.ip.dst_v4 = tmp_addr;
+	} else if (rev_tuple.addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		struct in6_addr tmp_addr = rev_tuple.ip.src_v6;
+
+		rev_tuple.ip.src_v6 = rev_tuple.ip.dst_v6;
+		rev_tuple.ip.dst_v6 = tmp_addr;
+	} else {
+		return ERR_PTR(-EOPNOTSUPP);
+	}
+
+	/* Use the same counter as the reverse direction */
+	mutex_lock(&ct_priv->shared_counter_lock);
+	rev_entry = rhashtable_lookup_fast(&ct_priv->ct_tuples_ht, &rev_tuple,
+					   tuples_ht_params);
+	if (rev_entry) {
+		if (refcount_inc_not_zero(&rev_entry->shared_counter->refcount)) {
+			mutex_unlock(&ct_priv->shared_counter_lock);
+			return rev_entry->shared_counter;
+		}
+	}
+	mutex_unlock(&ct_priv->shared_counter_lock);
+
+	shared_counter = kzalloc(sizeof(*shared_counter), GFP_KERNEL);
+	if (!shared_counter)
+		return ERR_PTR(-ENOMEM);
+
+	shared_counter->counter = mlx5_fc_create(esw->dev, true);
+	if (IS_ERR(shared_counter->counter)) {
+		ct_dbg("Failed to create counter for ct entry");
+		kfree(shared_counter);
+		return ERR_PTR(PTR_ERR(shared_counter->counter));
+	}
+
+	refcount_set(&shared_counter->refcount, 1);
+	return shared_counter;
+}
+
 static int
 mlx5_tc_ct_entry_add_rules(struct mlx5_tc_ct_priv *ct_priv,
 			   struct flow_rule *flow_rule,
 			   struct mlx5_ct_entry *entry,
 			   u8 zone_restore_id)
 {
-	struct mlx5_eswitch *esw = ct_priv->esw;
 	int err;
 
-	entry->counter = mlx5_fc_create(esw->dev, true);
-	if (IS_ERR(entry->counter)) {
-		err = PTR_ERR(entry->counter);
+	entry->shared_counter = mlx5_tc_ct_shared_counter_get(ct_priv, entry);
+	if (IS_ERR(entry->shared_counter)) {
+		err = PTR_ERR(entry->shared_counter);
 		ct_dbg("Failed to create counter for ct entry");
 		return err;
 	}
@@ -747,7 +817,7 @@  mlx5_tc_ct_entry_add_rules(struct mlx5_tc_ct_priv *ct_priv,
 err_nat:
 	mlx5_tc_ct_entry_del_rule(ct_priv, entry, false);
 err_orig:
-	mlx5_fc_destroy(esw->dev, entry->counter);
+	mlx5_tc_ct_shared_counter_put(ct_priv, entry);
 	return err;
 }
 
@@ -837,12 +907,16 @@  mlx5_tc_ct_del_ft_entry(struct mlx5_tc_ct_priv *ct_priv,
 			struct mlx5_ct_entry *entry)
 {
 	mlx5_tc_ct_entry_del_rules(ct_priv, entry);
+	mutex_lock(&ct_priv->shared_counter_lock);
 	if (entry->tuple_node.next)
 		rhashtable_remove_fast(&ct_priv->ct_tuples_nat_ht,
 				       &entry->tuple_nat_node,
 				       tuples_nat_ht_params);
 	rhashtable_remove_fast(&ct_priv->ct_tuples_ht, &entry->tuple_node,
 			       tuples_ht_params);
+	mutex_unlock(&ct_priv->shared_counter_lock);
+	mlx5_tc_ct_shared_counter_put(ct_priv, entry);
+
 }
 
 static int
@@ -879,7 +953,7 @@  mlx5_tc_ct_block_flow_offload_stats(struct mlx5_ct_ft *ft,
 	if (!entry)
 		return -ENOENT;
 
-	mlx5_fc_query_cached(entry->counter, &bytes, &packets, &lastuse);
+	mlx5_fc_query_cached(entry->shared_counter->counter, &bytes, &packets, &lastuse);
 	flow_stats_update(&f->stats, bytes, packets, 0, lastuse,
 			  FLOW_ACTION_HW_STATS_DELAYED);
 
@@ -1892,6 +1966,7 @@  mlx5_tc_ct_init(struct mlx5e_priv *priv, struct mlx5_fs_chains *chains,
 
 	idr_init(&ct_priv->fte_ids);
 	mutex_init(&ct_priv->control_lock);
+	mutex_init(&ct_priv->shared_counter_lock);
 	rhashtable_init(&ct_priv->zone_ht, &zone_params);
 	rhashtable_init(&ct_priv->ct_tuples_ht, &tuples_ht_params);
 	rhashtable_init(&ct_priv->ct_tuples_nat_ht, &tuples_nat_ht_params);
@@ -1934,6 +2009,7 @@  mlx5_tc_ct_clean(struct mlx5_tc_ct_priv *ct_priv)
 	rhashtable_destroy(&ct_priv->ct_tuples_nat_ht);
 	rhashtable_destroy(&ct_priv->zone_ht);
 	mutex_destroy(&ct_priv->control_lock);
+	mutex_destroy(&ct_priv->shared_counter_lock);
 	idr_destroy(&ct_priv->fte_ids);
 	kfree(ct_priv);
 }