diff mbox series

[mlx5-next,11/16] net/mlx5: Add VDPA priority to NIC RX namespace

Message ID 20201120230339.651609-12-saeedm@nvidia.com
State New
Headers show
Series mlx5 next updates 2020-11-20 | expand

Commit Message

Saeed Mahameed Nov. 20, 2020, 11:03 p.m. UTC
From: Eli Cohen <eli@mellanox.com>

Add a new namespace type to the NIC RX root namespace to allow for
inserting VDPA rules before regular NIC but after bypass, thus allowing
DPDK to have precedence in packet processing.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c | 10 +++++++++-
 include/linux/mlx5/fs.h                           |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Jakub Kicinski Nov. 22, 2020, 12:01 a.m. UTC | #1
On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:
> From: Eli Cohen <eli@mellanox.com>
> 
> Add a new namespace type to the NIC RX root namespace to allow for
> inserting VDPA rules before regular NIC but after bypass, thus allowing
> DPDK to have precedence in packet processing.

How does DPDK and VDPA relate in this context?
Eli Cohen Nov. 22, 2020, 6:41 a.m. UTC | #2
On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:
> > From: Eli Cohen <eli@mellanox.com>
> > 
> > Add a new namespace type to the NIC RX root namespace to allow for
> > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > DPDK to have precedence in packet processing.
> 
> How does DPDK and VDPA relate in this context?

mlx5 steering is hierarchical and defines precedence amongst namespaces.
Up till now, the VDPA implementation would insert a rule into the
MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
all the incoming traffic.

The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
MLX5_FLOW_NAMESPACE_BYPASS.
Jakub Kicinski Nov. 24, 2020, 5:12 p.m. UTC | #3
On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:
> On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:  
> > > From: Eli Cohen <eli@mellanox.com>
> > > 
> > > Add a new namespace type to the NIC RX root namespace to allow for
> > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > DPDK to have precedence in packet processing.  
> > 
> > How does DPDK and VDPA relate in this context?  
> 
> mlx5 steering is hierarchical and defines precedence amongst namespaces.
> Up till now, the VDPA implementation would insert a rule into the
> MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> all the incoming traffic.
> 
> The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> MLX5_FLOW_NAMESPACE_BYPASS.

Our policy was no DPDK driver bifurcation. There's no asterisk saying
"unless you pretend you need flow filters for RDMA, get them upstream
and then drop the act".

What do you expect me to do?
Jason Gunthorpe Nov. 24, 2020, 6:02 p.m. UTC | #4
On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:
> > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:
> > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:  
> > > > From: Eli Cohen <eli@mellanox.com>
> > > > 
> > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > DPDK to have precedence in packet processing.  
> > > 
> > > How does DPDK and VDPA relate in this context?  
> > 
> > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > Up till now, the VDPA implementation would insert a rule into the
> > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > all the incoming traffic.
> > 
> > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > MLX5_FLOW_NAMESPACE_BYPASS.
> 
> Our policy was no DPDK driver bifurcation. There's no asterisk saying
> "unless you pretend you need flow filters for RDMA, get them upstream
> and then drop the act".

Huh?

mlx5 DPDK is an *RDMA* userspace application. It links to
libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
RDMA raw ethernet QPs. It has been like this for years, it is not some
"act".

It is long standing uABI that accelerators like RDMA/etc get to take
the traffic before netdev. This cannot be reverted. I don't really
understand what you are expecting here?

Jason
Jakub Kicinski Nov. 24, 2020, 6:41 p.m. UTC | #5
On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:    
> > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > 
> > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > DPDK to have precedence in packet processing.    
> > > > 
> > > > How does DPDK and VDPA relate in this context?    
> > > 
> > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > Up till now, the VDPA implementation would insert a rule into the
> > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > all the incoming traffic.
> > > 
> > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > 
> > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > "unless you pretend you need flow filters for RDMA, get them upstream
> > and then drop the act".  
> 
> Huh?
> 
> mlx5 DPDK is an *RDMA* userspace application. 

Forgive me for my naiveté. 

Here I thought the RDMA subsystem is for doing RDMA.

I'm sure if you start doing crypto over ibverbs crypto people will want
to have a look.

> libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> RDMA raw ethernet QPs. 

I'm not saying that's not the case. I'm saying I don't think this was
something that netdev developers signed-off on. And our policy on DPDK
is pretty widely known.

Would you mind pointing us to the introduction of raw Ethernet QPs?

Is there any production use for that without DPDK?

> It has been like this for years, it is not some "act".
> 
> It is long standing uABI that accelerators like RDMA/etc get to take
> the traffic before netdev. This cannot be reverted. I don't really
> understand what you are expecting here?

Same. I don't really know what you expect me to do either. I don't
think I can sign-off on kernel changes needed for DPDK.
Jason Gunthorpe Nov. 24, 2020, 7:44 p.m. UTC | #6
On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:    
> > > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > > 
> > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > > DPDK to have precedence in packet processing.    
> > > > > 
> > > > > How does DPDK and VDPA relate in this context?    
> > > > 
> > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > Up till now, the VDPA implementation would insert a rule into the
> > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > all the incoming traffic.
> > > > 
> > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > > 
> > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > and then drop the act".  
> > 
> > Huh?
> > 
> > mlx5 DPDK is an *RDMA* userspace application. 
> 
> Forgive me for my naiveté. 
> 
> Here I thought the RDMA subsystem is for doing RDMA.

RDMA covers a wide range of accelerated networking these days.. Where
else are you going to put this stuff in the kernel?

> I'm sure if you start doing crypto over ibverbs crypto people will want
> to have a look.

Well, RDMA has crypto transforms for a few years now too. Why would
crypto subsystem people be involved? It isn't using or duplicating
their APIs.

> > libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> > RDMA raw ethernet QPs. 
> 
> I'm not saying that's not the case. I'm saying I don't think this was
> something that netdev developers signed-off on.

Part of the point of the subsystem split was to end the fighting that
started all of it. It was very clear during the whole iWarp and TCP
Offload Engine buisness in the mid 2000's that netdev wanted nothing
to do with the accelerator world.

So why would netdev need sign off on any accelerator stuff?  Do you
want to start co-operating now? I'm willing to talk about how to do
that.

> And our policy on DPDK is pretty widely known.

I honestly have no idea on the netdev DPDK policy, I'm maintaining the
RDMA subsystem not DPDK :)

> Would you mind pointing us to the introduction of raw Ethernet QPs?
> 
> Is there any production use for that without DPDK?

Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
IBA specification cira 2000. When RoCE was defined (around 2010) they
were naturally carried forward to Ethernet. The "flow steering"
concept to make raw ethernet QP useful was added to verbs around 2012
- 2013. It officially made it upstream in commit 436f2ad05a0b
("IB/core: Export ib_create/destroy_flow through uverbs")

If I recall properly the first real application was ultra low latency
ethernet processing for financial applications.

dpdk later adopted the first mlx4 PMD using this libibverbs API around
2015. Interestingly the mlx4 PMD was made through an open source
process with minimal involvment from Mellanox, based on the
pre-existing RDMA work.

Currently there are many projects, and many open source, built on top
of the RDMA raw ethernet QP and RDMA flow steering model. It is now
long established kernel ABI.

> > It has been like this for years, it is not some "act".
> > 
> > It is long standing uABI that accelerators like RDMA/etc get to take
> > the traffic before netdev. This cannot be reverted. I don't really
> > understand what you are expecting here?
> 
> Same. I don't really know what you expect me to do either. I don't
> think I can sign-off on kernel changes needed for DPDK.

This patch is fine tuning the shared logic that splits the traffic to
accelerator subsystems, I don't think netdev should have a veto
here. This needs to be consensus among the various communities and
subsystems that rely on this.

Eli did not explain this well in his commit message. When he said DPDK
he means RDMA which is the owner of the FLOW_NAMESPACE. Each
accelerator subsystem gets hooked into this, so here VPDA is getting
its own hook because re-using the the same hook between two kernel
subsystems is buggy.

Jason
Eli Cohen Nov. 25, 2020, 6:19 a.m. UTC | #7
On Tue, Nov 24, 2020 at 03:44:13PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:    
> > > > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > > > 
> > > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > > > DPDK to have precedence in packet processing.    
> > > > > > 
> > > > > > How does DPDK and VDPA relate in this context?    
> > > > > 
> > > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > > Up till now, the VDPA implementation would insert a rule into the
> > > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > > all the incoming traffic.
> > > > > 
> > > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > > > 
> > > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > > and then drop the act".  
> > > 
> > > Huh?
> > > 
> > > mlx5 DPDK is an *RDMA* userspace application. 
> > 
> > Forgive me for my naiveté. 
> > 
> > Here I thought the RDMA subsystem is for doing RDMA.
> 
> RDMA covers a wide range of accelerated networking these days.. Where
> else are you going to put this stuff in the kernel?
> 
> > I'm sure if you start doing crypto over ibverbs crypto people will want
> > to have a look.
> 
> Well, RDMA has crypto transforms for a few years now too. Why would
> crypto subsystem people be involved? It isn't using or duplicating
> their APIs.
> 
> > > libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> > > RDMA raw ethernet QPs. 
> > 
> > I'm not saying that's not the case. I'm saying I don't think this was
> > something that netdev developers signed-off on.
> 
> Part of the point of the subsystem split was to end the fighting that
> started all of it. It was very clear during the whole iWarp and TCP
> Offload Engine buisness in the mid 2000's that netdev wanted nothing
> to do with the accelerator world.
> 
> So why would netdev need sign off on any accelerator stuff?  Do you
> want to start co-operating now? I'm willing to talk about how to do
> that.
> 
> > And our policy on DPDK is pretty widely known.
> 
> I honestly have no idea on the netdev DPDK policy, I'm maintaining the
> RDMA subsystem not DPDK :)
> 
> > Would you mind pointing us to the introduction of raw Ethernet QPs?
> > 
> > Is there any production use for that without DPDK?
> 
> Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
> IBA specification cira 2000. When RoCE was defined (around 2010) they
> were naturally carried forward to Ethernet. The "flow steering"
> concept to make raw ethernet QP useful was added to verbs around 2012
> - 2013. It officially made it upstream in commit 436f2ad05a0b
> ("IB/core: Export ib_create/destroy_flow through uverbs")
> 
> If I recall properly the first real application was ultra low latency
> ethernet processing for financial applications.
> 
> dpdk later adopted the first mlx4 PMD using this libibverbs API around
> 2015. Interestingly the mlx4 PMD was made through an open source
> process with minimal involvment from Mellanox, based on the
> pre-existing RDMA work.
> 
> Currently there are many projects, and many open source, built on top
> of the RDMA raw ethernet QP and RDMA flow steering model. It is now
> long established kernel ABI.
> 
> > > It has been like this for years, it is not some "act".
> > > 
> > > It is long standing uABI that accelerators like RDMA/etc get to take
> > > the traffic before netdev. This cannot be reverted. I don't really
> > > understand what you are expecting here?
> > 
> > Same. I don't really know what you expect me to do either. I don't
> > think I can sign-off on kernel changes needed for DPDK.
> 
> This patch is fine tuning the shared logic that splits the traffic to
> accelerator subsystems, I don't think netdev should have a veto
> here. This needs to be consensus among the various communities and
> subsystems that rely on this.
> 
> Eli did not explain this well in his commit message. When he said DPDK
> he means RDMA which is the owner of the FLOW_NAMESPACE. Each
> accelerator subsystem gets hooked into this, so here VPDA is getting
> its own hook because re-using the the same hook between two kernel
> subsystems is buggy.

I agree, RDMA should have been used here. DPDK is just one, though
widely used, accelerator using RDMA interfaces to flow steering.

I will push submit another patch with a modified change log.

> 
> Jason
Jakub Kicinski Nov. 25, 2020, 6:54 p.m. UTC | #8
On Tue, 24 Nov 2020 15:44:13 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:  
> > > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:  
> > > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:    
> > > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:    
> > > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:      
> > > > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > > > 
> > > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > > > DPDK to have precedence in packet processing.      
> > > > > > 
> > > > > > How does DPDK and VDPA relate in this context?      
> > > > > 
> > > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > > Up till now, the VDPA implementation would insert a rule into the
> > > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > > all the incoming traffic.
> > > > > 
> > > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > > MLX5_FLOW_NAMESPACE_BYPASS.    
> > > > 
> > > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > > and then drop the act".    
> > > 
> > > Huh?
> > > 
> > > mlx5 DPDK is an *RDMA* userspace application.   
> > 
> > Forgive me for my naiveté. 
> > 
> > Here I thought the RDMA subsystem is for doing RDMA.  
> 
> RDMA covers a wide range of accelerated networking these days.. Where
> else are you going to put this stuff in the kernel?

IDK what else you got in there :) It's probably a case by case answer.

IMHO even using libibverbs is no strong reason for things to fall under
RDMA exclusively. Client drivers of virtio don't get silently funneled
through a separate tree just because they use a certain spec.

> > I'm sure if you start doing crypto over ibverbs crypto people will want
> > to have a look.  
> 
> Well, RDMA has crypto transforms for a few years now too. 

Are you talking about RDMA traffic being encrypted? That's a different
case.

My example was alluding to access to a generic crypto accelerator 
over ibverbs. I hope you'd let crypto people know when merging such 
a thing...

> Why would crypto subsystem people be involved? It isn't using or
> duplicating their APIs.
> 
> > > libibverbs. It runs on the RDMA stack. It uses RDMA flow
> > > filtering and RDMA raw ethernet QPs.   
> > 
> > I'm not saying that's not the case. I'm saying I don't think this
> > was something that netdev developers signed-off on.  
> 
> Part of the point of the subsystem split was to end the fighting that
> started all of it. It was very clear during the whole iWarp and TCP
> Offload Engine buisness in the mid 2000's that netdev wanted nothing
> to do with the accelerator world.

I was in middle school at the time, not sure what exactly went down :)
But I'm going by common sense here. Perhaps there was an agreement I'm
not aware of?

> So why would netdev need sign off on any accelerator stuff?

I'm not sure why you keep saying accelerators!

What is accelerated in raw Ethernet frame access??

> Do you want to start co-operating now? I'm willing to talk about how
> to do that.

IDK how that's even in question. I always try to bump all RDMA-looking
stuff to linux-rdma when it's not CCed there. That's the bare minimum
of cooperation I'd expect from anyone.

> > And our policy on DPDK is pretty widely known.  
> 
> I honestly have no idea on the netdev DPDK policy,
> 
> I'm maintaining the RDMA subsystem not DPDK :)

That's what I thought, but turns out DPDK is your important user.

> > Would you mind pointing us to the introduction of raw Ethernet QPs?
> > 
> > Is there any production use for that without DPDK?  
> 
> Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
> IBA specification cira 2000. When RoCE was defined (around 2010) they
> were naturally carried forward to Ethernet. The "flow steering"
> concept to make raw ethernet QP useful was added to verbs around 2012
> - 2013. It officially made it upstream in commit 436f2ad05a0b
> ("IB/core: Export ib_create/destroy_flow through uverbs")
> 
> If I recall properly the first real application was ultra low latency
> ethernet processing for financial applications.
> 
> dpdk later adopted the first mlx4 PMD using this libibverbs API around
> 2015. Interestingly the mlx4 PMD was made through an open source
> process with minimal involvment from Mellanox, based on the
> pre-existing RDMA work.
> 
> Currently there are many projects, and many open source, built on top
> of the RDMA raw ethernet QP and RDMA flow steering model. It is now
> long established kernel ABI.
> 
> > > It has been like this for years, it is not some "act".
> > > 
> > > It is long standing uABI that accelerators like RDMA/etc get to
> > > take the traffic before netdev. This cannot be reverted. I don't
> > > really understand what you are expecting here?  
> > 
> > Same. I don't really know what you expect me to do either. I don't
> > think I can sign-off on kernel changes needed for DPDK.  
> 
> This patch is fine tuning the shared logic that splits the traffic to
> accelerator subsystems, I don't think netdev should have a veto
> here. This needs to be consensus among the various communities and
> subsystems that rely on this.
> 
> Eli did not explain this well in his commit message. When he said DPDK
> he means RDMA which is the owner of the FLOW_NAMESPACE. Each
> accelerator subsystem gets hooked into this, so here VPDA is getting
> its own hook because re-using the the same hook between two kernel
> subsystems is buggy.

I'm not so sure about this.

The switchdev modeling is supposed to give users control over flow of
traffic in a sane, well defined way, as opposed to magic flow filtering
of the early SR-IOV implementations which every vendor had their own
twist on. 

Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all switching
happens in the NIC? That breaks the switchdev model. We're back to
per-vendor magic.

And why do you need a separate VDPA table in the first place?
Forwarding to a VDPA device has different semantics than forwarding to
any other VF/SF?
Saeed Mahameed Nov. 25, 2020, 7:04 p.m. UTC | #9
On Wed, 2020-11-25 at 08:19 +0200, Eli Cohen wrote:
> On Tue, Nov 24, 2020 at 03:44:13PM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > > > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski
> > > > > > wrote:  
> > > > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed
> > > > > > > wrote:    
> > > > > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > > > > 
> > > > > > > > Add a new namespace type to the NIC RX root namespace
> > > > > > > > to allow for
> > > > > > > > inserting VDPA rules before regular NIC but after
> > > > > > > > bypass, thus allowing
> > > > > > > > DPDK to have precedence in packet processing.    
> > > > > > > 
> > > > > > > How does DPDK and VDPA relate in this context?    
> > > > > > 
> > > > > > mlx5 steering is hierarchical and defines precedence
> > > > > > amongst namespaces.
> > > > > > Up till now, the VDPA implementation would insert a rule
> > > > > > into the
> > > > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK
> > > > > > thus taking
> > > > > > all the incoming traffic.
> > > > > > 
> > > > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > > > > 
> > > > > Our policy was no DPDK driver bifurcation. There's no
> > > > > asterisk saying
> > > > > "unless you pretend you need flow filters for RDMA, get them
> > > > > upstream
> > > > > and then drop the act".  
> > > > 
> > > > Huh?
> > > > 
> > > > mlx5 DPDK is an *RDMA* userspace application. 
> > > 
> > > Forgive me for my naiveté. 
> > > 
> > > Here I thought the RDMA subsystem is for doing RDMA.
> > 
> > RDMA covers a wide range of accelerated networking these days..
> > Where
> > else are you going to put this stuff in the kernel?
> > 
> > > I'm sure if you start doing crypto over ibverbs crypto people
> > > will want
> > > to have a look.
> > 
> > Well, RDMA has crypto transforms for a few years now too. Why would
> > crypto subsystem people be involved? It isn't using or duplicating
> > their APIs.
> > 
> > > > libibverbs. It runs on the RDMA stack. It uses RDMA flow
> > > > filtering and
> > > > RDMA raw ethernet QPs. 
> > > 
> > > I'm not saying that's not the case. I'm saying I don't think this
> > > was
> > > something that netdev developers signed-off on.
> > 
> > Part of the point of the subsystem split was to end the fighting
> > that
> > started all of it. It was very clear during the whole iWarp and TCP
> > Offload Engine buisness in the mid 2000's that netdev wanted
> > nothing
> > to do with the accelerator world.
> > 
> > So why would netdev need sign off on any accelerator stuff?  Do you
> > want to start co-operating now? I'm willing to talk about how to do
> > that.
> > 
> > > And our policy on DPDK is pretty widely known.
> > 
> > I honestly have no idea on the netdev DPDK policy, I'm maintaining
> > the
> > RDMA subsystem not DPDK :)
> > 
> > > Would you mind pointing us to the introduction of raw Ethernet
> > > QPs?
> > > 
> > > Is there any production use for that without DPDK?
> > 
> > Hmm.. It is very old. RAW (InfiniBand) QPs were part of the
> > original
> > IBA specification cira 2000. When RoCE was defined (around 2010)
> > they
> > were naturally carried forward to Ethernet. The "flow steering"
> > concept to make raw ethernet QP useful was added to verbs around
> > 2012
> > - 2013. It officially made it upstream in commit 436f2ad05a0b
> > ("IB/core: Export ib_create/destroy_flow through uverbs")
> > 
> > If I recall properly the first real application was ultra low
> > latency
> > ethernet processing for financial applications.
> > 
> > dpdk later adopted the first mlx4 PMD using this libibverbs API
> > around
> > 2015. Interestingly the mlx4 PMD was made through an open source
> > process with minimal involvment from Mellanox, based on the
> > pre-existing RDMA work.
> > 
> > Currently there are many projects, and many open source, built on
> > top
> > of the RDMA raw ethernet QP and RDMA flow steering model. It is now
> > long established kernel ABI.
> > 
> > > > It has been like this for years, it is not some "act".
> > > > 
> > > > It is long standing uABI that accelerators like RDMA/etc get to
> > > > take
> > > > the traffic before netdev. This cannot be reverted. I don't
> > > > really
> > > > understand what you are expecting here?
> > > 
> > > Same. I don't really know what you expect me to do either. I
> > > don't
> > > think I can sign-off on kernel changes needed for DPDK.
> > 
> > This patch is fine tuning the shared logic that splits the traffic
> > to
> > accelerator subsystems, I don't think netdev should have a veto
> > here. This needs to be consensus among the various communities and
> > subsystems that rely on this.
> > 
> > Eli did not explain this well in his commit message. When he said
> > DPDK
> > he means RDMA which is the owner of the FLOW_NAMESPACE. Each
> > accelerator subsystem gets hooked into this, so here VPDA is
> > getting
> > its own hook because re-using the the same hook between two kernel
> > subsystems is buggy.
> 
> I agree, RDMA should have been used here. DPDK is just one, though
> widely used, accelerator using RDMA interfaces to flow steering.
> 
> I will push submit another patch with a modified change log.

Hi Jakub, Given that this patch is just defining the HW domain of vDPA
to separate it from the RDMA domain, it has no actual functionality,
just the mlx5_core low level hw definition, can i take this patch to
mlx5-next and move on to my next series ? 

Thanks,
Saeed.
Saeed Mahameed Nov. 25, 2020, 7:28 p.m. UTC | #10
On Wed, 2020-11-25 at 10:54 -0800, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 15:44:13 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> > > On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:  

[snip]

> > > > > > 
> > > > It has been like this for years, it is not some "act".
> > > > 
> > > > It is long standing uABI that accelerators like RDMA/etc get to
> > > > take the traffic before netdev. This cannot be reverted. I
> > > > don't
> > > > really understand what you are expecting here?  
> > > 
> > > Same. I don't really know what you expect me to do either. I
> > > don't
> > > think I can sign-off on kernel changes needed for DPDK.  
> > 
> > This patch is fine tuning the shared logic that splits the traffic
> > to
> > accelerator subsystems, I don't think netdev should have a veto
> > here. This needs to be consensus among the various communities and
> > subsystems that rely on this.
> > 
> > Eli did not explain this well in his commit message. When he said
> > DPDK
> > he means RDMA which is the owner of the FLOW_NAMESPACE. Each
> > accelerator subsystem gets hooked into this, so here VPDA is
> > getting
> > its own hook because re-using the the same hook between two kernel
> > subsystems is buggy.
> 
> I'm not so sure about this.
> 
> The switchdev modeling is supposed to give users control over flow of
> traffic in a sane, well defined way, as opposed to magic flow
> filtering
> of the early SR-IOV implementations which every vendor had their own
> twist on. 
> 
> Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all
> switching
> happens in the NIC? That breaks the switchdev model. We're back to
> per-vendor magic.

No this is after switching, nothing can precede switching!
after switching and forwarding to the correct function/vport, 
The HW deumx rdma to rdma and eth(rest) to netdev.

> 
> And why do you need a separate VDPA table in the first place?
> Forwarding to a VDPA device has different semantics than forwarding
> to
> any other VF/SF?

VDPA is yet another "RDMA" Application, similar to raw qp, it is
different than VF/SF.

switching can only forward to PF/VF/SF, it doesn't know or care about
the end point app (netdev/rdma).

Jakub, this is how rdma works and has been working for the past 20
years :), Jason is well aware of the lack of visibility, and i am sure
rdma folks will improve this, they have been improving a lot lately,
take rdma_tool for example.

Bottom line the switching model is not the answer for rdma, another
model is required, rdma by definition is HW oriented from day one, you
can't think of it as an offloaded SW model. ( also in a real switch you
can't define if a port is rdma or eth :) ) 

Anyway you have very valid points that Jason already raised in the
past, but the challenge is more complicated than the challenges we have
in netdev, simply because RDMA is RDMA, where the leading model is the
HW model and the rdma spec and not the SW ..
Jason Gunthorpe Nov. 25, 2020, 9:22 p.m. UTC | #11
On Wed, Nov 25, 2020 at 10:54:22AM -0800, Jakub Kicinski wrote:

> > RDMA covers a wide range of accelerated networking these days.. Where
> > else are you going to put this stuff in the kernel?
> 
> IDK what else you got in there :) It's probably a case by case answer.

Hmm, yes, it seems endless sometimes :(
 
> IMHO even using libibverbs is no strong reason for things to fall under
> RDMA exclusively. Client drivers of virtio don't get silently funneled
> through a separate tree just because they use a certain spec.

I'm not sure I understand this, libibverbs is the user library to
interface with the kernel RDMA subsystem. I don't care what apps
people build on top of it, it doesn't matter to me that netdev and
DPDK have some kind of feud.

> > > I'm sure if you start doing crypto over ibverbs crypto people will want
> > > to have a look.  
> > 
> > Well, RDMA has crypto transforms for a few years now too. 
> 
> Are you talking about RDMA traffic being encrypted? That's a different
> case.

That too, but in general, anything netdev can do can be done via RDMA
in userspace. So all the kTLS and IPSEC xfrm HW offloads mlx5 supports
are all available in userspace too.

> > Part of the point of the subsystem split was to end the fighting that
> > started all of it. It was very clear during the whole iWarp and TCP
> > Offload Engine buisness in the mid 2000's that netdev wanted nothing
> > to do with the accelerator world.
> 
> I was in middle school at the time, not sure what exactly went down :)

Ah, it was quite the thing. Microsoft and Co were heavilly pushing TOE
technology (Microsoft Chimney!) as the next most certain thing and I
recall DaveM&co was completely against it in Linux.

I will admit at the time I was doubtful, but in hindsight this was the
correct choice. netdev would not look like it does today if it had
been shackled by the HW implementations of the day. Instead all this
HW stuff ended up largely in RDMA and some in block with the iSCSI
mania of old. It is quite evident to me the mess being tied to HW has
caused to a SW ecosystem. DRM and RDMA both have a very similiar kind
of suffering due to this.

However - over the last 20 years it has been steadfast that there is
*always* a compelling reason for certain applications to use something
from the accelerator side. It is not for everyone, but the specialized
applications that need it, *really need it*.

For instance, it is the difference between being able to get a COVID
simulation result in a few week vs .. well.. never.

> But I'm going by common sense here. Perhaps there was an agreement
> I'm not aware of?

The resolution to the argument above was to split them in Linux.  Thus
what logically is networking was split up in the kernel between netdev
and the accelerator subsystems (iscsi, rdma, and so on).

The general notion is netdev doesn't have to accomodate anything an
accelerator does. If you choose to run them then you do not get to
complain that your ethtool counters are wrong, your routing tables
and tc don't work, firewalling doesn't work. Etc.

That is all broken by design.

In turn, the accelerators do their own thing, tap the traffic before
it hits netdev and so on. netdev does not care what goes on over there
and is not responsible.

I would say this is the basic unspoken agreement of the last 15 years.

Both have a right to exist in Linux. Both have a right to use the
physical ethernet port.

> > So why would netdev need sign off on any accelerator stuff?
> 
> I'm not sure why you keep saying accelerators!
> 
> What is accelerated in raw Ethernet frame access??

The nature of the traffic is not relavent.

It goes through RDMA, it is accelerator traffic (vs netdev traffic,
which goes to netdev). Even if you want to be pedantic, in the raw
ethernet area there is lots of HW special accelerated stuff going
on. Mellanox has some really neat hard realtime networking technology
that works on raw ethernet packets, for instance.

And of course raw ethernet is a fraction of what RDMA covers. iWarp
and RoCE are much more like you might imagine when you hear the word
accelerator.

> > Do you want to start co-operating now? I'm willing to talk about how
> > to do that.
> 
> IDK how that's even in question. I always try to bump all RDMA-looking
> stuff to linux-rdma when it's not CCed there. That's the bare minimum
> of cooperation I'd expect from anyone.

I mean co-operate in the sense of defining a scheme where the two
worlds are not completely seperated and isolated.

> > > And our policy on DPDK is pretty widely known.  
> > 
> > I honestly have no idea on the netdev DPDK policy,
> > 
> > I'm maintaining the RDMA subsystem not DPDK :)
> 
> That's what I thought, but turns out DPDK is your important user.

Nonsense.

I don't have stats but the majority of people I work with using RDMA
are not using DPDK. DPDK serves two somewhat niche markets of NFV and
certain hyperscalers - RDMA covers the entire scientific computing
community and a big swath of the classic "Big Iron" enterprise stuff,
like databases and storage.

> Now IIUC you're tapping traffic for DPDK/raw QPs _before_ all switching
> happens in the NIC? That breaks the switchdev model. We're back to
> per-vendor magic.

No, as I explained before, the switchdev completely contains the SF/VF
and all applications running on a mlx5_core are trapped by it. This
includes netdev, RDMA and VDPA.

> And why do you need a separate VDPA table in the first place?
> Forwarding to a VDPA device has different semantics than forwarding to
> any other VF/SF?

The VDPA table is not switchdev. Go back to my overly long email about
VDPA, here we are talking about the "selector" that chooses which
subsystem the traffic will go to. The selector is after switchdev but
before netdev, VDPA, RDMA.

Each accelerator subsystem gets a table. RDMA, VDPA, and netdev all
get one. It is some part of the HW to make the selectoring work.

Jason
diff mbox series

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 16091838bfcf..e095c5968e67 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -118,6 +118,10 @@ 
 #define ANCHOR_NUM_PRIOS 1
 #define ANCHOR_MIN_LEVEL (BY_PASS_MIN_LEVEL + 1)
 
+#define VDPA_PRIO_NUM_LEVELS 1
+#define VDPA_NUM_PRIOS 1
+#define VDPA_MIN_LEVEL 1
+
 #define OFFLOADS_MAX_FT 2
 #define OFFLOADS_NUM_PRIOS 2
 #define OFFLOADS_MIN_LEVEL (ANCHOR_MIN_LEVEL + OFFLOADS_NUM_PRIOS)
@@ -147,7 +151,7 @@  static struct init_tree_node {
 	enum mlx5_flow_table_miss_action def_miss_action;
 } root_fs = {
 	.type = FS_TYPE_NAMESPACE,
-	.ar_size = 7,
+	.ar_size = 8,
 	  .children = (struct init_tree_node[]){
 		  ADD_PRIO(0, BY_PASS_MIN_LEVEL, 0, FS_CHAINING_CAPS,
 			   ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
@@ -165,6 +169,10 @@  static struct init_tree_node {
 			   ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
 				  ADD_MULTIPLE_PRIO(ETHTOOL_NUM_PRIOS,
 						    ETHTOOL_PRIO_NUM_LEVELS))),
+		  ADD_PRIO(0, VDPA_MIN_LEVEL, 0, FS_CHAINING_CAPS,
+			   ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
+				  ADD_MULTIPLE_PRIO(VDPA_NUM_PRIOS,
+						    VDPA_PRIO_NUM_LEVELS))),
 		  ADD_PRIO(0, KERNEL_MIN_LEVEL, 0, {},
 			   ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
 				  ADD_MULTIPLE_PRIO(KERNEL_NIC_TC_NUM_PRIOS,
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 35d2cc1646d3..97176d623d74 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -67,6 +67,7 @@  enum mlx5_flow_namespace_type {
 	MLX5_FLOW_NAMESPACE_LAG,
 	MLX5_FLOW_NAMESPACE_OFFLOADS,
 	MLX5_FLOW_NAMESPACE_ETHTOOL,
+	MLX5_FLOW_NAMESPACE_VDPA,
 	MLX5_FLOW_NAMESPACE_KERNEL,
 	MLX5_FLOW_NAMESPACE_LEFTOVERS,
 	MLX5_FLOW_NAMESPACE_ANCHOR,