[net-next,v1,1/3] net: sched: af_packet support for direct ring access

Message ID 20141006000629.32055.2295.stgit@nitbit.x32
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

John Fastabend Oct. 6, 2014, 12:06 a.m.
This patch adds net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be manipulated directly from user space. For
raw packet interfaces this removes any overhead from the kernel
network stack.

Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed, so this code is not run; instead,
the hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().

Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow-on
patch adds support for ixgbe; we expect at least the subset of
drivers implementing ETHTOOL_SRXCLSRLINS to add support later.

The interface is driven over an af_packet socket, which we believe
is the most natural interface to use because it is already used
for raw packet interfaces, which is what we are providing here.
The high level flow for this interface looks like:

	bind(fd, &sockaddr, sizeof(sockaddr));

	/* Get the device type and info */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
		   &optlen);

	/* With device info we can look up descriptor format */

	/* Get the layout of ring space offset, page_sz, cnt */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
		   &info, &optlen);

	/* request some queues from the driver */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* if we let the driver pick our queues, learn which queues
	 * we were given
	 */
	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* And mmap queue pairs to user space */
	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
	     MAP_SHARED, fd, 0);

	/* Now we have some user space queues to read/write to */

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model, because we are working directly with
driver queues, the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO, which
should provide enough details to extrapolate the descriptor formats.
Although this adds some complexity to user space, it removes the
requirement to copy descriptor fields around.
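
As a sketch of what user space polling then looks like, assume an
ixgbe-style write-back descriptor; the struct layout and bit positions
below are illustrative stand-ins for whatever the vendor documentation
specifies, not definitions from this patch:

```c
/* Illustrative descriptor write-back layout (NOT from this patch's uapi;
 * the real layout comes from the vendor datasheet for the bound device). */
#include <stdint.h>

#define DESC_STAT_DD  0x1u	/* descriptor done: NIC finished the DMA */
#define DESC_STAT_EOP 0x2u	/* end of packet */

struct rx_desc_wb {
	uint32_t rss_hash;
	uint32_t hdr_info;
	uint32_t status_error;	/* DD/EOP assumed in the low bits */
	uint16_t length;	/* bytes the NIC DMAed into the buffer */
	uint16_t vlan;
};

/* Busy-poll one slot of the mmapped ring: no syscall, no interrupt.
 * Returns 1 and the packet length if the NIC completed a descriptor. */
static int poll_ring_once(volatile struct rx_desc_wb *ring, uint32_t *head,
			  uint32_t ring_size, uint16_t *len_out)
{
	volatile struct rx_desc_wb *d = &ring[*head];

	if (!(d->status_error & DESC_STAT_DD))
		return 0;			/* nothing new yet */

	*len_out = d->length;
	d->status_error = 0;			/* recycle the slot */
	*head = (*head + 1) & (ring_size - 1);	/* ring_size: power of two */
	return 1;
}
```

An application spins on poll_ring_once() and consumes buffers directly;
the copy into a generic tpacket-style descriptor, and the syscall to
trigger it, both disappear.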

The formats are usually provided by the device vendor's documentation.
If folks want, I can provide a follow-up patch with the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to the formats for the ixgbe and mlx drivers; other driver owners
would need to provide their own formats.

We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interface. With
this interface we can xmit/receive at line rate from a test user space
application on a single core.

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded at dpdk.org, although, as I hope is
clear from the above, DPDK is just our particular test environment;
we expect other libraries could be built on this interface.

Signed-off-by: Danny Zhou <danny.zhou@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h      |   63 ++++++++++++++
 include/uapi/linux/if_packet.h |   42 +++++++++
 net/packet/af_packet.c         |  181 ++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |    1 
 4 files changed, 287 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Florian Westphal Oct. 6, 2014, 12:29 a.m. | #1
John Fastabend <john.fastabend@gmail.com> wrote:
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO which
> should provide enough details to extrapolate the descriptor formats.
> Although this adds some complexity to user space it removes the
> requirement to copy descriptor fields around.

I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.

How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?

Thanks,
Florian
David Miller Oct. 6, 2014, 1:09 a.m. | #2
From: Florian Westphal <fw@strlen.de>
Date: Mon, 6 Oct 2014 02:29:51 +0200

> John Fastabend <john.fastabend@gmail.com> wrote:
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapolate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
> 
> I find it very disappointing that we seem to have to expose such
> hardware specific details to userspace via hw-independent interface.
> 
> How big of a cost are we talking about when you say that it 'removes
> the requirement to copy descriptor fields'?

FWIW, it also avoids the domain switch (which is just a fancy way
to refer to performing the system call), both in and out.
John Fastabend Oct. 6, 2014, 1:12 a.m. | #3
On 10/05/2014 05:29 PM, Florian Westphal wrote:
> John Fastabend <john.fastabend@gmail.com> wrote:
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapolate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>
> I find it very disappointing that we seem to have to expose such
> hardware specific details to userspace via hw-independent interface.
>

Well, it was only for convenience; if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point, though, so if there is
consensus that this is not needed, that is fine.

> How big of a cost are we talking about when you say that it 'removes
> the requirement to copy descriptor fields'?
>

This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly, complete
with the vendor descriptor formats.

> Thanks,
> Florian
>
John Fastabend Oct. 6, 2014, 1:18 a.m. | #4
On 10/05/2014 06:09 PM, David Miller wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 6 Oct 2014 02:29:51 +0200
>
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>> There is one critical difference when running with these interfaces
>>> vs running without them. In the normal case the af_packet module
>>> uses a standard descriptor format exported by the af_packet user
>>> space headers. In this model because we are working directly with
>>> driver queues the descriptor format maps to the descriptor format
>>> used by the device. User space applications can learn device
>>> information from the socket option PACKET_DEV_DESC_INFO which
>>> should provide enough details to extrapolate the descriptor formats.
>>> Although this adds some complexity to user space it removes the
>>> requirement to copy descriptor fields around.
>>
>> I find it very disappointing that we seem to have to expose such
>> hardware specific details to userspace via hw-independent interface.
>>
>> How big of a cost are we talking about when you say that it 'removes
>> the requirement to copy descriptor fields'?
>
> FWIW, it also avoids the domain switch (which is just a fancy way
> to refer to performing the system call), both in and out.
>

Right, my description could have been better and called this out.

Thanks.
Daniel Borkmann Oct. 6, 2014, 9:49 a.m. | #5
Hi John,

On 10/06/2014 03:12 AM, John Fastabend wrote:
> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>> There is one critical difference when running with these interfaces
>>> vs running without them. In the normal case the af_packet module
>>> uses a standard descriptor format exported by the af_packet user
>>> space headers. In this model because we are working directly with
>>> driver queues the descriptor format maps to the descriptor format
>>> used by the device. User space applications can learn device
>>> information from the socket option PACKET_DEV_DESC_INFO which
>>> should provide enough details to extrapolate the descriptor formats.
>>> Although this adds some complexity to user space it removes the
>>> requirement to copy descriptor fields around.
>>
>> I find it very disappointing that we seem to have to expose such
>> hardware specific details to userspace via hw-independent interface.
>
> Well it was only for convenience if it doesn't fit as a socket
> option we can remove it. We can look up the device using the netdev
> name from the bind call. I see your point though so if there is
> consensus that this is not needed that is fine.
>
>> How big of a cost are we talking about when you say that it 'removes
>> the requirement to copy descriptor fields'?
>
> This was likely a poor description. If you want to let user space
> poll on the ring (without using system calls or interrupts) then
> I don't see how you can _not_ expose the ring directly complete with
> the vendor descriptor formats.

But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?

With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.

It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)

Thanks,
Daniel
John Fastabend Oct. 6, 2014, 3:01 p.m. | #6
On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
> Hi John,
> 
> On 10/06/2014 03:12 AM, John Fastabend wrote:
>> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>> There is one critical difference when running with these interfaces
>>>> vs running without them. In the normal case the af_packet module
>>>> uses a standard descriptor format exported by the af_packet user
>>>> space headers. In this model because we are working directly with
>>>> driver queues the descriptor format maps to the descriptor format
>>>> used by the device. User space applications can learn device
>>>> information from the socket option PACKET_DEV_DESC_INFO which
>>>> should provide enough details to extrapolate the descriptor formats.
>>>> Although this adds some complexity to user space it removes the
>>>> requirement to copy descriptor fields around.
>>>
>>> I find it very disappointing that we seem to have to expose such
>>> hardware specific details to userspace via hw-independent interface.
>>
>> Well it was only for convenience if it doesn't fit as a socket
>> option we can remove it. We can look up the device using the netdev
>> name from the bind call. I see your point though so if there is
>> consensus that this is not needed that is fine.
>>
>>> How big of a cost are we talking about when you say that it 'removes
>>> the requirement to copy descriptor fields'?
>>
>> This was likely a poor description. If you want to let user space
>> poll on the ring (without using system calls or interrupts) then
>> I don't see how you can _not_ expose the ring directly complete with
>> the vendor descriptor formats.
> 
> But how big is the concrete performance degradation you're seeing if you
> use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
> that does *not* directly expose hw descriptor formats to user space?

If we don't directly expose the hardware descriptor formats then we
need to somehow kick the driver when we want it to do the copy from
the driver descriptor format to the common descriptor format.

This requires a system call as far as I can tell, which has unwanted
overhead. I can micro-benchmark this if it's helpful. But if we dredge
up Jesper's slides, we are really counting cycles here, so even small
numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.

> 
> With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
> on 40G when run on decent hardware though.
> 
> It would really be great if we have something vendor neutral exposed as
> a stable ABI and could leverage emerging infrastructure we already have
> in the kernel such as eBPF and recent qdisc batching for raw sockets
> instead of reinventing the wheels. (Don't get me wrong, I would love to
> see AF_PACKET improved ...)

I don't think the interface is vendor specific. It does require some
knowledge of the hardware descriptor layout, but it is vendor
neutral from my point of view. I provided the ixgbe patch simply
because I'm most familiar with it and have a NIC here. If someone wants
to send me a Mellanox NIC I can give it a try, although I was hoping to
recruit Or or Amir? The only hardware feature required is flow
classification to queues, which seems to be common across 10Gbps and
40/100Gbps devices. So most of the drivers should be able to support this.

If you're worried driver writers will implement the interface but not
make their descriptor formats easily available, I considered putting the
layout in a header file in the uapi somewhere. Then we could just reject
any implementation that doesn't include the header file needed to use it
from user space.

With regards to leveraging eBPF and qdisc batching, I don't see how this
works with direct DMA and polling, which are needed to give the lowest
overhead between kernel and user space. In this case we want to use the
hardware to do the filtering that would normally be done by eBPF, and
for many use cases the hardware flow classifier is sufficient.

We already added a qdisc bypass option; I see this as taking that path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1/v2 for mmap rings; still using common descriptors,
use af_packet v3 and set QOS_BYPASS. For the absolute lowest overhead,
for specific applications that don't need QOS or eBPF, use this
interface.

Thanks.

> 
> Thanks,
> Daniel

Jesper Dangaard Brouer Oct. 6, 2014, 4:35 p.m. | #7
On Mon, 06 Oct 2014 08:01:52 -0700 John Fastabend <john.r.fastabend@intel.com> wrote:
 
> This requires a system call as far as I can tell. Which has unwanted
> overhead. I can micro-benchmark this if its helpful. But if we dredge
> up Jesper's slides here we are really counting cycles so even small
> numbers count if we want to hit line rate in a user space application
> with 40Gbps hardware.

The micro-benchmarked syscall[2] cost is approx 42 ns [1] (when
disabling CONFIG_AUDITSYSCALL; else it is approx 88 ns), which is
significant compared to the 10G wirespeed smallest-packet-size budget
of 67.2 ns.

See:
 [1] http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
 [2] https://github.com/netoptimizer/network-testing/blob/master/src/syscall_overhead.c
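
The 67.2 ns figure follows from the on-the-wire size of a minimal frame
(64 bytes plus 8 bytes preamble/SFD plus a 12-byte inter-frame gap, i.e.
84 bytes); a quick sketch:

```c
/* Per-packet time budget at a given link speed for a given frame size:
 * frame bytes + 8B preamble/SFD + 12B inter-frame gap on the wire. */
static double wire_budget_ns(double link_bps, double frame_bytes)
{
	double wire_bytes = frame_bytes + 8 + 12;	/* 84B for 64B frames */

	return wire_bytes * 8.0 / link_bps * 1e9;	/* bits / (bits/s) in ns */
}
```

wire_budget_ns(10e9, 64) gives 67.2 ns (~14.88 Mpps); at 40G the budget
shrinks to 16.8 ns, already below the 42 ns syscall cost measured above.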

[...] 
> We already added a qdisc bypass option I see this as taking this path
> further. I believe there is room for a continuum here. For basic cases
> use af_packet v1,v2 for mmap rings but using common descriptors use
> af_packet v3 and set QOS_BYPASS. For absolute lowest overhead and
> specific applications that don't need QOS, eBPF use this interface.

Well, after the qdisc bulking changes, when bulking kicks in the
qdisc path is faster than the qdisc bypass (measured with trafgen).
Stephen Hemminger Oct. 6, 2014, 4:55 p.m. | #8
On Sun, 05 Oct 2014 17:06:31 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> This patch adds a net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
> 
> Typically in an af_packet interface a packet_type handler is
> registered and used to filter traffic to the socket and do other
> things such as fan out traffic to multiple sockets. In this case the
> networking stack is being bypassed so this code is not run. So the
> hardware must push the correct traffic to the queues obtained from
> the ndo callback ndo_split_queue_pairs().
> 
> Fortunately there is already a flow classification interface which
> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
> currently supported by multiple drivers including sfc, mlx4, niu,
> ixgbe, and i40e. Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support the interface, plus
> the driver needs to implement the correct ndo ops. A follow on
> patch adds support for ixgbe but we expect at least the subset of
> drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
> 
> The interface is driven over an af_packet socket which we believe
> is the most natural interface to use. Because it is already used
> for raw packet interfaces which is what we are providing here.
>  The high level flow for this interface looks like:
> 
> 	bind(fd, &sockaddr, sizeof(sockaddr));
> 
> 	/* Get the device type and info */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> 		   &optlen);
> 
> 	/* With device info we can look up descriptor format */
> 
> 	/* Get the layout of ring space offset, page_sz, cnt */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> 		   &info, &optlen);
> 
> 	/* request some queues from the driver */
> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* if we let the driver pick us queues learn which queues
>          * we were given
>          */
> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* And mmap queue pairs to user space */
> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> 	     MAP_SHARED, fd, 0);
> 
> 	/* Now we have some user space queues to read/write to*/
> 
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO which
> should provide enough details to extrapolate the descriptor formats.
> Although this adds some complexity to user space it removes the
> requirement to copy descriptor fields around.
> 
> The formats are usually provided by the device vendor documentation
> If folks want I can provide a follow up patch to provide the formats
> in a .h file in ./include/uapi/linux/ for ease of use. I have access
> to formats for ixgbe and mlx drivers other driver owners would need to
> provide their formats.
> 
> We tested this interface using traffic generators and doing basic
> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
> to DPDK to enable an interface using this socket interface. With
> this interface we can xmit/receive @ line rate from a test user space
> application on a single core.
> 
> Additionally we have a set of DPDK patches to enable DPDK with this
> interface. DPDK can be downloaded @ dpdk.org although as I hope is
> clear from above DPDK is just our particular test environment we
> expect other libraries could be built on this interface.
> 
> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

I like the ability to share a device between kernel and user mode
networking. The model used by DPDK for this is really ugly and
fragile/broken. Your proposal assumes that you fully trust the user
mode networking application, which is not a generally safe assumption.

A device can DMA from/to any arbitrary physical memory.
And it would be hard to use the IOMMU to protect against this, because
the IOMMU doesn't know the difference between the application's queues
and the rest of the queues.

At least with DPDK you can use VFIO, and you are claiming the whole
device, to allow protection against random memory being read/written.


Hannes Frederic Sowa Oct. 6, 2014, 5:03 p.m. | #9
Hi John,

On Mo, 2014-10-06 at 08:01 -0700, John Fastabend wrote:
> On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
> > Hi John,
> > 
> > On 10/06/2014 03:12 AM, John Fastabend wrote:
> >> On 10/05/2014 05:29 PM, Florian Westphal wrote:
> >>> John Fastabend <john.fastabend@gmail.com> wrote:
> >>>> There is one critical difference when running with these interfaces
> >>>> vs running without them. In the normal case the af_packet module
> >>>> uses a standard descriptor format exported by the af_packet user
> >>>> space headers. In this model because we are working directly with
> >>>> driver queues the descriptor format maps to the descriptor format
> >>>> used by the device. User space applications can learn device
> >>>> information from the socket option PACKET_DEV_DESC_INFO which
> >>>> should provide enough details to extrapolate the descriptor formats.
> >>>> Although this adds some complexity to user space it removes the
> >>>> requirement to copy descriptor fields around.
> >>>
> >>> I find it very disappointing that we seem to have to expose such
> >>> hardware specific details to userspace via hw-independent interface.
> >>
> >> Well it was only for convenience if it doesn't fit as a socket
> >> option we can remove it. We can look up the device using the netdev
> >> name from the bind call. I see your point though so if there is
> >> consensus that this is not needed that is fine.
> >>
> >>> How big of a cost are we talking about when you say that it 'removes
> >>> the requirement to copy descriptor fields'?
> >>
> >> This was likely a poor description. If you want to let user space
> >> poll on the ring (without using system calls or interrupts) then
> >> I don't see how you can _not_ expose the ring directly complete with
> >> the vendor descriptor formats.
> > 
> > But how big is the concrete performance degradation you're seeing if you
> > use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
> > that does *not* directly expose hw descriptor formats to user space?
> 
> If we don't directly expose the hardware descriptor formats then we
> need to somehow kick the driver when we want it to do the copy from
> the driver descriptor format to the common descriptor format.
>
> This requires a system call as far as I can tell. Which has unwanted
> overhead. I can micro-benchmark this if its helpful. But if we dredge
> up Jesper's slides here we are really counting cycles so even small
> numbers count if we want to hit line rate in a user space application
> with 40Gbps hardware.

I agree, it seems pretty hard to achieve non-syscall sending on the same
core, as we somehow must transfer control over to the kernel without
doing a syscall.

The only other idea would be to export machine code up to user space,
which you can mmap(MAP_EXEC) from the socket somehow to make this API
truly NIC agnostic without recompiling. This code then would transform
the generic descriptors to the hardware specific ones. Seems also pretty
hairy to do that correctly, maybe.

> > With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
> > on 40G when run on decent hardware though.
> > 
> > It would really be great if we have something vendor neutral exposed as
> > a stable ABI and could leverage emerging infrastructure we already have
> > in the kernel such as eBPF and recent qdisc batching for raw sockets
> > instead of reinventing the wheels. (Don't get me wrong, I would love to
> > see AF_PACKET improved ...)
> 
> I don't think the interface is vendor specific. It does require some
> knowledge of the hardware descriptor layout though. It is though vendor
> neutral from my point of view. I provided the ixgbe patch simple because
> I'm most familiar with it and have a NIC here. If someone wants to send me
> a Mellanox NIC I can give it a try although I was hoping to recruit Or or
> Amir? The only hardware feature required is flow classification to queues
> which seems to be common across 10Gbps and 40/100Gbps devices. So most
> of the drivers should be able to support this.

Does flow classification work at the same level as registering network
addresses? Do I have to bind e.g. a multicast address with ip maddr and
then set up flow director/ntuple to get the packets on the correct
user-space-facing queue, or is it, in the case of the ixgbe card, enough
to just add those addresses via fdir? Have you thought about letting the
kernel/driver handle that? In case one would like to connect their
virtual machines to the network via this interface, maybe we need
central policing and resource constraints for queue management here?

Do other drivers need a separate af_packet-managed way to bind addresses
to the queue? Maybe there are other quirks we might need to add to
actually build support for other network interface cards. It would be
great to at least examine one other driver in this regard.

What other properties of the NIC must be exported? I think we also have
to deal with the MTUs currently configured in the NIC, promisc mode, and
maybe TSO?

> If you're worried driver writers will implement the interface but not make
> their descriptor formats easily available I considered putting the layout
> in a header file in the uapi somewhere. Then we could just reject any
> implementation that doesn't include the header file needed to use it
> from user space.
> 
> With regards to leveraging eBPF and qdisc batching I don't see how this
> works with direct DMA and polling. Needed to give the lowest overhead
> between kernel and user space. In this case we want to use the hardware
> to do the filtering that would normally be done for eBPF and for many
> use cases the hardware flow classifiers is sufficient.

I agree, those features are hard to connect.

> We already added a qdisc bypass option I see this as taking this path
> further. I believe there is room for a continuum here. For basic cases
> use af_packet v1,v2 for mmap rings but using common descriptors use
> >> af_packet v3 and set QOS_BYPASS. For absolute lowest overhead and
> specific applications that don't need QOS, eBPF use this interface.

You can simply write C code instead of eBPF code, yes.

I find the six additional ndo ops a bit worrisome, as we are adding more
and more subsystem-specific ndo ops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.

Patch 2/3 does not yet expose hw ring descriptors in uapi headers, it
seems?

Are there plans to push a user space framework (maybe even into the
kernel), too? Will this be dpdk (alike) in the end?

Bye,
Hannes


John Fastabend Oct. 6, 2014, 8:37 p.m. | #10
On 10/06/2014 10:03 AM, Hannes Frederic Sowa wrote:
> Hi John,
> 
> On Mo, 2014-10-06 at 08:01 -0700, John Fastabend wrote:
>> On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
>>> Hi John,
>>>
>>> On 10/06/2014 03:12 AM, John Fastabend wrote:
>>>> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>>>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>> There is one critical difference when running with these interfaces
>>>>>> vs running without them. In the normal case the af_packet module
>>>>>> uses a standard descriptor format exported by the af_packet user
>>>>>> space headers. In this model because we are working directly with
>>>>>> driver queues the descriptor format maps to the descriptor format
>>>>>> used by the device. User space applications can learn device
>>>>>> information from the socket option PACKET_DEV_DESC_INFO which
>>>>>> should provide enough details to extrapolate the descriptor formats.
>>>>>> Although this adds some complexity to user space it removes the
>>>>>> requirement to copy descriptor fields around.
>>>>>
>>>>> I find it very disappointing that we seem to have to expose such
>>>>> hardware specific details to userspace via hw-independent interface.
>>>>
>>>> Well it was only for convenience if it doesn't fit as a socket
>>>> option we can remove it. We can look up the device using the netdev
>>>> name from the bind call. I see your point though so if there is
>>>> consensus that this is not needed that is fine.
>>>>
>>>>> How big of a cost are we talking about when you say that it 'removes
>>>>> the requirement to copy descriptor fields'?
>>>>
>>>> This was likely a poor description. If you want to let user space
>>>> poll on the ring (without using system calls or interrupts) then
>>>> I don't see how you can _not_ expose the ring directly complete with
>>>> the vendor descriptor formats.
>>>
>>> But how big is the concrete performance degradation you're seeing if you
>>> use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
>>> that does *not* directly expose hw descriptor formats to user space?
>>
>> If we don't directly expose the hardware descriptor formats then we
>> need to somehow kick the driver when we want it to do the copy from
>> the driver descriptor format to the common descriptor format.
>>
>> This requires a system call as far as I can tell. Which has unwanted
>> overhead. I can micro-benchmark this if its helpful. But if we dredge
>> up Jesper's slides here we are really counting cycles so even small
>> numbers count if we want to hit line rate in a user space application
with 40Gbps hardware.
> 
> I agree, it seems pretty hard to achieve non-syscall sending on the same
> core, as we somehow must transfer control over to the kernel without
> doing a syscall.
> 
> The only other idea would be to export machine code up to user space,
> which you can mmap(MAP_EXEC) from the socket somehow to make this API
> truly NIC agnostic without recompiling. This code then would transform
> the generic descriptors to the hardware specific ones. Seems also pretty
> hairy to do that correctly, maybe.

This seems more complicated than a minimal library in user space that
loads descriptor handling code based on device id.

> 
>>> With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
>>> on 40G when run on decent hardware though.
>>>
>>> It would really be great if we have something vendor neutral exposed as
>>> a stable ABI and could leverage emerging infrastructure we already have
>>> in the kernel such as eBPF and recent qdisc batching for raw sockets
>>> instead of reinventing the wheels. (Don't get me wrong, I would love to
>>> see AF_PACKET improved ...)
>>
>> I don't think the interface is vendor specific. It does require some
>> knowledge of the hardware descriptor layout though. It is though vendor
neutral from my point of view. I provided the ixgbe patch simply because
>> I'm most familiar with it and have a NIC here. If someone wants to send me
>> a Mellanox NIC I can give it a try although I was hoping to recruit Or or
>> Amir? The only hardware feature required is flow classification to queues
>> which seems to be common across 10Gbps and 40/100Gbps devices. So most
>> of the drivers should be able to support this.
> 
> Does flow classification work at the same level as registering network
> addresses? Do I have to bind e.g. a multicast address with ip maddr and
> then set up flow director/ntuple to get the packets on the correct user
> space facing queue, or is it, in the case of the ixgbe card, enough to just add
> those addresses via fdir? Have you thought about letting the
> kernel/driver handle that? In case one would like to connect their
> virtual machines via this interface to the network maybe we need central
> policing and resource constraints for queue management here?

Right now it is enough to program the addresses via fdir. This shouldn't
be ixgbe specific; I did take a quick look at the other drivers' use of
fdir and believe it should work the same.

I'm not sure what you mean by kernel/driver handle this? Maybe you mean
from the socket interface? I thought about it briefly but opted for what
I think is more straightforward: using the fdir APIs.

As far as policing and resource constraints go, I think that is a user
space task. And yes, I am working on some user space management
applications, but they are still fairly rough.

> 
> Do other drivers need a separate af-packet managed way to bind addresses
> to the queue? Maybe there are other quirks we might need to add to
> actually build support for other network interface cards. Would be great
> to at least examine one other driver in regard to this.

I _believe_ this interface is sufficient. I think one of the Mellanox
interfaces could be implemented fairly easily.

>         
> What other properties of the NIC must be exported? I think we also have
> to deal with MTUs currently configured in the NIC, promisc mode and
> maybe TSO?

We don't support per queue MTUs in the kernel. So I think this can be
learned using existing interfaces.

> 
> >> If you're worried driver writers will implement the interface but not make
>> their descriptor formats easily available I considered putting the layout
>> in a header file in the uapi somewhere. Then we could just reject any
>> implementation that doesn't include the header file needed to use it
>> from user space.
>>
> >> With regards to leveraging eBPF and qdisc batching I don't see how this
> >> works with direct DMA and polling, which is needed to give the lowest
> >> overhead between kernel and user space. In this case we want to use the
> >> hardware to do the filtering that would normally be done by eBPF, and for
> >> many use cases the hardware flow classifier is sufficient.
> 
> I agree, those features are hard to connect.
> 
> >> We already added a qdisc bypass option; I see this as taking that path
> >> further. I believe there is room for a continuum here. For basic cases
> >> use af_packet v1/v2 for mmap rings, or af_packet v3 with QOS_BYPASS set,
> >> both using common descriptors. For absolute lowest overhead, and for
> >> specific applications that don't need QOS or eBPF, use this interface.
> 
> You can simply write C code instead of eBPF code, yes.
> 
> I find the six additional ndo ops a bit worrisome as we are adding more
> and more subsystem specific ndo ops to this struct. I would like to see
> some unification here, but currently cannot make concrete proposals,
> sorry.

I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.

> 
> Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
> seems?
> 

Nope, I wasn't sure if we wanted to put the ring descriptors in the UAPI
or not. If we do I would likely push that as a 4th patch.

> Are there plans to push a user space framework (maybe even into the
> kernel), too? Will this be dpdk (alike) in the end?

I have patches to enable this interface on DPDK and it gets the
same performance as the other DPDK interfaces.

I've considered creating a minimal library to do basic tx/rx and
descriptor processing, maybe in ./test or ./scripts, to give a usage
example that is easier to get hold of and review without having
to pull in all the other things DPDK does that may or may not be
interesting depending on what you are doing and on what hardware.

> 
> Bye,
> Hannes
> 
> 

John Fastabend Oct. 6, 2014, 8:42 p.m. | #11
On 10/06/2014 09:55 AM, Stephen Hemminger wrote:
> On Sun, 05 Oct 2014 17:06:31 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
>>
>> Typically in an af_packet interface a packet_type handler is
>> registered and used to filter traffic to the socket and do other
>> things such as fan out traffic to multiple sockets. In this case the
>> networking stack is being bypassed so this code is not run. So the
>> hardware must push the correct traffic to the queues obtained from
>> the ndo callback ndo_split_queue_pairs().
>>
>> Fortunately there is already a flow classification interface which
>> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
>> currently supported by multiple drivers including sfc, mlx4, niu,
>> ixgbe, and i40e. Supporting some way to steer traffic to a queue
>> is the _only_ hardware requirement to support the interface, plus
>> the driver needs to implement the correct ndo ops. A follow on
>> patch adds support for ixgbe but we expect at least the subset of
>> drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
>>
>> The interface is driven over an af_packet socket which we believe
>> is the most natural interface to use. Because it is already used
>> for raw packet interfaces which is what we are providing here.
>>   The high level flow for this interface looks like:
>>
>> 	bind(fd, &sockaddr, sizeof(sockaddr));
>>
>> 	/* Get the device type and info */
>> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
>> 		   &optlen);
>>
>> 	/* With device info we can look up descriptor format */
>>
>> 	/* Get the layout of ring space offset, page_sz, cnt */
>> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> 		   &info, &optlen);
>>
>> 	/* request some queues from the driver */
>> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> 		   &qpairs_info, sizeof(qpairs_info));
>>
>> 	/* if we let the driver pick us queues learn which queues
>>           * we were given
>>           */
>> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> 		   &qpairs_info, sizeof(qpairs_info));
>>
>> 	/* And mmap queue pairs to user space */
>> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> 	     MAP_SHARED, fd, 0);
>>
>> 	/* Now we have some user space queues to read/write to*/
>>
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapolate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>>
>> The formats are usually provided by the device vendor documentation
>> If folks want I can provide a follow up patch to provide the formats
>> in a .h file in ./include/uapi/linux/ for ease of use. I have access
>> to formats for ixgbe and mlx drivers other driver owners would need to
>> provide their formats.
>>
>> We tested this interface using traffic generators and doing basic
>> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
>> to DPDK to enable an interface using this socket interface. With
>> this interface we can xmit/receive @ line rate from a test user space
>> application on a single core.
>>
>> Additionally we have a set of DPDK patches to enable DPDK with this
>> interface. DPDK can be downloaded @ dpdk.org although as I hope is
>> clear from above DPDK is just our particular test environment; we
>> expect other libraries could be built on this interface.
>>
>> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>
> I like the ability to share a device between kernel and user mode networking.
> The model used for DPDK for this is really ugly and fragile/broken.
> Your proposal assumes that you fully trust the user mode networking application
> which is not a generally safe assumption.
>
> A device can DMA from/to any arbitrary physical memory.
> And it would be hard to use the IOMMU to protect because the
> IOMMU doesn't know the difference between the application's queue and
> the rest of the queues.
>
> At least with DPDK you can use VFIO, and you are claiming the whole device to
> allow protection against random memory being read/written.
>
>

However, not all platforms support VFIO, and when the application
only wants to handle specific traffic types, a queue maps well to
this.
David Miller Oct. 6, 2014, 9:42 p.m. | #12
From: John Fastabend <john.fastabend@gmail.com>
Date: Sun, 05 Oct 2014 17:06:31 -0700

> This patch adds a net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.

About the facility in general, I am generally in favor, as I expressed
at the networking track in Chicago.

But you missed the mark wrt. describing the descriptors.

I do not want you to give device IDs.

I want the code to be 100% agnostic to device or vendor IDs.

Really "describe" the descriptor.  Not just how large is it (32-bits,
64-bits, etc.), but also: 1) is it little or big endian 2) where is
the length field 3) where is control bit "foo" located, etc.

That's what I want to see in "struct tpacket_dev_info", rather than
device IDs and "versions".
Hannes Frederic Sowa Oct. 6, 2014, 11:26 p.m. | #13
Hi John,

On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
> > I find the six additional ndo ops a bit worrisome as we are adding more
> > and more subsystem specific ndo ops to this struct. I would like to see
> > some unification here, but currently cannot make concrete proposals,
> > sorry.
> 
> I agree it seems like a bit much. One thought was to split the ndo
> ops into categories. Switch ops, MACVLAN ops, basic ops and with this
> userspace queue ops. This sort of goes along with some of the switch
> offload work which is going to add a handful more ops as best I can
> tell.

Thanks for your mail, you answered all of my questions.

Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
Willem (also in Cc) used sysfs files which get mmaped to represent the
tx/rx descriptors. The representation was independent of the device and
IIRC the prototype used a write(fd, "", 1) to signal the kernel it
should proceed with tx. I agree, it would be great to be syscall-free
here.

For the semantics of the descriptors we could also easily generate files
in sysfs. I thought about something like tracepoints already do for
representing the data in the ringbuffer depending on the event:

-- >8 --
# cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format 
name: net_dev_queue
ID: 1006
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:void * skbaddr;	offset:8;	size:8;	signed:0;
	field:unsigned int len;	offset:16;	size:4;	signed:0;
	field:__data_loc char[] name;	offset:20;	size:4;	signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr, REC->len
-- >8 --

Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
endianess would need to be added. Hopefully there is already a user
space parser somewhere in the perf sources. An easier to parse binary
representation could be added easily and maybe even something vDSO alike
if people care about that.

Maybe this open/mmap per queue also kills some of the ndo_ops?

Bye,
Hannes
Willem de Bruijn Oct. 7, 2014, 4:24 a.m. | #14
> Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support the interface,

I would not impose this constraint. There may be legitimate use
cases for taking over all queues of a device. For instance, when
this is a secondary nic that does not carry any control traffic.

> Typically in an af_packet interface a packet_type handler is
> registered and used to filter traffic to the socket and do other
> things such as fan out traffic to multiple sockets. In this case the
> networking stack is being bypassed so this code is not run. So the
> hardware must push the correct traffic to the queues obtained from
> the ndo callback ndo_split_queue_pairs().

Why does the interface work at the level of queue_pairs instead of
individual queues?

>         /* Get the layout of ring space offset, page_sz, cnt */
>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>                    &info, &optlen);
>
>         /* request some queues from the driver */
>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));
>
>         /* if we let the driver pick us queues learn which queues
>          * we were given
>          */
>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));

If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.

>         /* And mmap queue pairs to user space */
>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>              MAP_SHARED, fd, 0);

How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?

For throughput-oriented workloads, the syscall overhead
involved in kicking the nic (on tx, or for increasing the ring
consumer index on rx) can be amortized. And the operation
can perhaps piggy-back on interrupts or other events
(as long as interrupts are not disabled for full userspace
polling). Latency would be harder to satisfy while maintaining
some kernel policy enforcement. An extreme solution
uses an asynchronously busy polling kernel worker thread
(at high cycle cost, so acceptable for few workloads).

When keeping the kernel in the loop, it is possible to do
some basic sanity checking and transparently translate between
vaddr and paddr, even when exposing the hardware descriptors
directly. Though at this point it may be just as cheap to expose
an idealized virtualized descriptor format and copy fields between
that and device descriptors.

One assumption underlying exposing the hardware descriptors
is that they are quite similar between devices. How true is this
in the context of formats that span multiple descriptors?

> + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> + *                              unsigned int qpairs_start_from,
> + *                              unsigned int qpairs_num,
> + *                              struct sock *sk)
> + *     Called to request a set of queues from the driver to be
> + *     handed to the callee for management. After this returns the
> + *     driver will not use the queues.

Are these queues also taken out of ethtool management, or is
this equivalent to removing them from the rss set with
ethtool -X?
John Fastabend Oct. 7, 2014, 4:25 a.m. | #15
On 10/06/2014 02:42 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Sun, 05 Oct 2014 17:06:31 -0700
> 
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
> 
> About the facility in general, I am generally in favor, as I expressed
> at the networking track in Chiacgo.
> 
> But you missed the mark wrt. describing the descriptors.
> 
> I do not want you to give device IDs.
> 
> I want the code to be 100% agnostic to device or vendor IDs.
> 

OK. I'll work on this and have a v2 after net-next opens again.

Thanks!

> Really "describe" the descriptor.  Not just how large is it (32-bits,
> 64-bits, etc.), but also: 1) is it little or big endian 2) where is
> the length field 3) where is control bit "foo" located, etc.
> 
> That's what I want to see in "struct tpacket_dev_info", rather than
> device IDs and "versions".

David Laight Oct. 7, 2014, 9:27 a.m. | #16
From: Willem de Bruijn

...
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly.

The application could change the addresses after they have been
validated, but before they have been read by the device.

> Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.

That is (probably) the only scheme that stops the application
accessing random parts of physical memory.

> One assumption underlying exposing the hardware descriptors
> is that they are quite similar between devices. How true is this
> in the context of formats that span multiple descriptors?

I suspect you'd need to define complete ring entries for 'initial',
'middle', 'final' and 'complete' fragments, together with the
offsets and endianness (and size?) of the address and data fields.

Also whether there is a special 'last entry' in the ring.

Passing checksum offload flags through adds an extra level of complexity.

Rings like the xhci (actually USB, but could contain ethernet data)
require the 'owner' bit be written odd or even in alternating passes.
Actually mapping support for usbnet (especially xhci - usb3) might show
up some deficiencies in the definition.

You also need to know when transmits have completed.
This might be an 'owner' bit being cleared, but could be signalled
in an entirely different way.

	David
Zhou, Danny Oct. 7, 2014, 3:21 p.m. | #17
> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Tuesday, October 07, 2014 12:24 PM
> To: John Fastabend
> Cc: Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network Development; Ronciak, John; Amir
> Vadai; Eric Dumazet; Zhou, Danny
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> > Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support the interface,
> 
> I would not impose this constraint. There may be legitimate use
> cases for taking over all queues of a device. For instance, when
> this is a secondary nic that does not carry any control traffic.

For a secondary NIC that carries only data plane traffic, you can use
UIO or VFIO to map the NIC's entire I/O space to user space. Then a user
space poll-mode driver, like those supported and open-sourced in DPDK
(and those supporting Mellanox/Emulex NICs but not open-sourced), can
drive the NIC as the sole driver in user space.

> > Typically in an af_packet interface a packet_type handler is
> > registered and used to filter traffic to the socket and do other
> > things such as fan out traffic to multiple sockets. In this case the
> > networking stack is being bypassed so this code is not run. So the
> > hardware must push the correct traffic to the queues obtained from
> > the ndo callback ndo_split_queue_pairs().
> 
> Why does the interface work at the level of queue_pairs instead of
> individual queues?

The user mode "slave" driver (I call it a slave driver because it is
only responsible for packet I/O on certain queue pairs) needs to take
over at least one rx queue and one tx queue, for ingress and egress
traffic respectively, although the flow director only applies to
ingress traffic.

> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> 
> If ethtool -U is used to steer traffic to a specific descriptor queue,
> then the setsockopt can pass the exact id of that queue and there
> is no need for a getsockopt follow-up.

Very good point. The interface supports passing "-1" as the queue id
(followed by the number of qpairs needed) via setsockopt, asking
af_packet and the NIC kernel driver to dynamically allocate free and
available qpairs for this socket, so getsockopt() is needed to return
the actually assigned queue pair indexes. Initially we had an
implementation that called getsockopt once, with af_packet treating
qpairs_info as an IN/OUT parameter, but that is semantically wrong, so
we think the above implementation is the most suitable. But I agree
with you: if setsockopt passes a valid queue pair index, there is no
need to call getsockopt.

> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> 
> How will packet data be mapped and how will userspace translate
> from paddr to vaddr? Is the goal to maintain long lived mappings
> and instruct drivers to allocate from this restricted range (to
> avoid per-packet system calls and vma operations)?

Once the qpairs split-off is done, the user space driver, as a slave
driver, re-initializes those queues completely in user space, using
paddrs (in the case of DPDK, vaddrs of the huge pages DPDK uses are
translated to paddrs) to fill in the packet descriptors.

As for the security concern raised in the previous discussion, the
reason we think (correct me if I am wrong) af_packet is most suitable
is that only a user application with root permission is allowed to
split off queue pairs and mmap a small window of PCIe I/O space to user
space, so the concern that a "device can DMA from/to any arbitrary
physical memory" is not that big. All user space device drivers based
on the UIO mechanism share this concern; VFIO adds protection, but it
depends on an IOMMU, which is specific to certain silicon.

> For throughput-oriented workloads, the syscall overhead
> involved in kicking the nic (on tx, or for increasing the ring
> consumer index on rx) can be amortized. And the operation
> can perhaps piggy-back on interrupts or other events
> (as long as interrupts are not disabled for full userspace
> polling). Latency would be harder to satisfy while maintaining
> some kernel policy enforcement. An extreme solution
> uses an asynchronously busy polling kernel worker thread
> (at high cycle cost, so acceptable for few workloads).
> 
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly. Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.
> 
> One assumption underlying exposing the hardware descriptors
> is that they are quite similar between devices. How true is this
> in the context of formats that span multiple descriptors?

Packet descriptor formats vary across vendors. On Intel NICs, the
1G/10G/40G parts have totally different formats; even the same Intel
10G/40G NIC supports at least two different descriptor formats. The
idea behind these patches is to hide descriptor differences from the
kernel: it just maps certain I/O space pages to user space, and the
user space "slave" NIC driver handles them using a different descriptor
struct based on vendor/device ID. But I am open to adding support for a
generic packet descriptor format description, per David M's suggestion.

> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be
> > + *     handed to the callee for management. After this returns the
> > + *     driver will not use the queues.
> 
> Are these queues also taken out of ethtool management, or is
> this equivalent to removing them from the rss set with
> ethtool -X?

As a master driver, the NIC kernel driver still controls the flow
director as the ethtool backend. Generally, not all queues are
initialized and used by the NIC kernel driver, which reports the
actually used rx/tx queue numbers to the stack. Before certain queues
are split off, directing traffic to those unused queues with ethtool
reports an invalid argument. Once certain stack-unaware queues are
allocated to a user space slave driver, ethtool allows directing
packets to them, as the NIC driver maintains a data structure recording
which queues are visible to and used by the kernel, and which are used
by user space.
David Miller Oct. 7, 2014, 3:43 p.m. | #18
From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 09:27:03 +0000

> That is (probably) the only scheme that stops the application
> accessing random parts of physical memory.

I don't know where this claim keeps coming from, it's false.

The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.

There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.

There is protection and control.
Willem de Bruijn Oct. 7, 2014, 3:46 p.m. | #19
>> > Typically in an af_packet interface a packet_type handler is
>> > registered and used to filter traffic to the socket and do other
>> > things such as fan out traffic to multiple sockets. In this case the
>> > networking stack is being bypassed so this code is not run. So the
>> > hardware must push the correct traffic to the queues obtained from
>> > the ndo callback ndo_split_queue_pairs().
>>
>> Why does the interface work at the level of queue_pairs instead of
>> individual queues?
>
> The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
> on certain queue pairs) needs to take over at least one rx queue and one tx queue, for ingress and
> egress traffic respectively, although the flow director only applies to ingress traffic.

That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.

Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.


>>
>> >         /* Get the layout of ring space offset, page_sz, cnt */
>> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> >                    &info, &optlen);
>> >
>> >         /* request some queues from the driver */
>> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>> >
>> >         /* if we let the driver pick us queues learn which queues
>> >          * we were given
>> >          */
>> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>>
>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>> then the setsockopt can pass the exact id of that queue and there
>> is no need for a getsockopt follow-up.
>
> Very good point. It supports passing "-1" as the queue id (followed by the number of qpairs needed) via
> setsockopt to af_packet and the NIC kernel driver, asking the driver to dynamically allocate free and
> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> Initially, we had an implementation that calls getsockopt once and af_packet treats qpairs_info
> as an IN/OUT parameter, but that is semantically wrong, so we think the above implementation is most suitable.
> But I agree with you: if setsockopt can pass the exact id with a valid queue pair index, there is no need
> to call getsockopt.

One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically rollback the configuration change when the
process dies.

>
>>
>> >         /* And mmap queue pairs to user space */
>> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> >              MAP_SHARED, fd, 0);
>>
>> How will packet data be mapped and how will userspace translate
>> from paddr to vaddr? Is the goal to maintain long lived mappings
>> and instruct drivers to allocate from this restricted range (to
>> avoid per-packet system calls and vma operations)?
>>
>
> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
> are translated to paddr) to fill in the packet descriptors.

Ah, userspace is responsible for posting buffers and translation
from vaddr to paddr is straightforward. Yes that makes sense.
John Fastabend Oct. 7, 2014, 3:55 p.m. | #20
On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
>>>> Typically in an af_packet interface a packet_type handler is
>>>> registered and used to filter traffic to the socket and do other
>>>> things such as fan out traffic to multiple sockets. In this case the
>>>> networking stack is being bypassed so this code is not run. So the
>>>> hardware must push the correct traffic to the queues obtained from
>>>> the ndo callback ndo_split_queue_pairs().
>>>
>>> Why does the interface work at the level of queue_pairs instead of
>>> individual queues?
>>
>> The user mode "slave" driver (I call it a slave driver because it is only responsible for packet I/O
>> on certain queue pairs) needs to take over at least one rx queue and one tx queue for ingress and
>> egress traffic respectively, although the flow director only applies to ingress traffic.
> 
> That requirement of co-allocation is absent in existing packet
> rings. Many applications only receive or transmit. For
> receive-only, it would even be possible to map descriptor
> rings read-only, if the kernel remains responsible for posting
> buffers -- but I see below that that is not the case, so that's
> not very relevant here.
> 
> Still, some workloads want asymmetric sets of rx and tx rings.
> For instance, instead of using RSS, a process may want to
> receive on as few rings as possible, load balance across
> workers in software, but still give each worker thread its own
> private transmit ring.
> 

We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.

> 
>>>
>>>>         /* Get the layout of ring space offset, page_sz, cnt */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>>>>                    &info, &optlen);
>>>>
>>>>         /* request some queues from the driver */
>>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>>
>>>>         /* if we let the driver pick us queues learn which queues
>>>>          * we were given
>>>>          */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>
>>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>>> then the setsockopt can pass the exact id of that queue and there
>>> is no need for a getsockopt follow-up.
>>
>> Very good point. It supports passing "-1" as the queue id (followed by the number of qpairs needed) via
>> setsockopt to af_packet and the NIC kernel driver, asking the driver to dynamically allocate free and
>> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
>> Initially, we had an implementation that calls getsockopt once and af_packet treats qpairs_info
>> as an IN/OUT parameter, but that is semantically wrong, so we think the above implementation is most suitable.
>> But I agree with you: if setsockopt can pass the exact id with a valid queue pair index, there is no need
>> to call getsockopt.
> 
> One step further would be to move the entire configuration behind
> the packet socket interface. It's perhaps out of scope of this patch,
> but the difference between using `ethtool -U` and passing the same
> expression through the packet socket is that in the latter case the
> kernel can automatically rollback the configuration change when the
> process dies.
> 

Hmm, might be interesting. I think this is a follow-on path to
investigate after the initial support.

>>
>>>
>>>>         /* And mmap queue pairs to user space */
>>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>>>>              MAP_SHARED, fd, 0);
>>>
>>> How will packet data be mapped and how will userspace translate
>>> from paddr to vaddr? Is the goal to maintain long lived mappings
>>> and instruct drivers to allocate from this restricted range (to
>>> avoid per-packet system calls and vma operations)?
>>>
>>
>> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
>> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
>> are translated to paddr) to fill in the packet descriptors.
> 
> Ah, userspace is responsible for posting buffers and translation
> from vaddr to paddr is straightforward. Yes that makes sense.

David Laight Oct. 7, 2014, 3:59 p.m. | #21
From: David 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 7 Oct 2014 09:27:03 +0000
> 
> > That is (probably) the only scheme that stops the application
> > accessing random parts of physical memory.
> 
> I don't know where this claim keeps coming from, it's false.
> 
> The application has to attach memory to the ring, and then the
> ring can only refer to that memory for the duration of the
> session.
> 
> There is no way that the user can program the address field of the
> descriptors to point at arbitrary physical memory locations.
> 
> There is protection and control.

I got the impression that the application was directly writing the ring
structure that the ethernet mac hardware uses to describe tx and rx buffers.
(ie they are mapped read-write into userspace).
Unless you have a system where you can limit the physical memory
ranges accessible to the mac hardware, I don't see how you can stop
the application putting rogue values into the ring.

Clearly I'm missing something in my quick read of the change.

	David



David Miller Oct. 7, 2014, 4:05 p.m. | #22
From: "Zhou, Danny" <danny.zhou@intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000

> Once qpairs split-off is done, the user space driver, as a slave
> driver, will re-initialize those queues completely in user space by
> using paddr(in the case of DPDK, vaddr of DPDK used huge pages are
> translated to paddr) to fill in the packet descriptors.  As of
> security concern raised in previous discussion, the reason we
> think(BTW, correct me if I am wrong) af_packet is most suitable is
> because only user application with root permission is allowed to
> successfully split-off queue pairs and mmap a small window of PCIe
> I/O space to user space, so concern regarding "device can DMA
> from/to any arbitrary physical memory." is not that big. As all user
> space device drivers based on UIO mechanism has the same concern
> issue, VFIO adds protection but it is based on IOMMU which is
> specific to Intel silicons.

Wait a second.

If there is no memory protection performed I'm not merging this.

I thought the user has to associate a fixed pool of memory to the
queueus, the kernel attaches that memory, and then the user cannot
modify the addresses _AT_ _ALL_.

If the user can modify the addresses in the descriptors and make
the chip crap on random memory, this is a non-starter.

Sorry.
David Miller Oct. 7, 2014, 4:08 p.m. | #23
From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 15:59:35 +0000

> From: David 
>> From: David Laight <David.Laight@ACULAB.COM>
>> Date: Tue, 7 Oct 2014 09:27:03 +0000
>> 
>> > That is (probably) the only scheme that stops the application
>> > accessing random parts of physical memory.
>> 
>> I don't know where this claim keeps coming from, it's false.
>> 
>> The application has to attach memory to the ring, and then the
>> ring can only refer to that memory for the duration of the
>> session.
>> 
>> There is no way that the user can program the address field of the
>> descriptors to point at arbitrary physical memory locations.
>> 
>> There is protection and control.
> 
> I got the impression that the application was directly writing the ring
> structure that the ethernet mac hardware uses to describe tx and rx buffers.
> (ie they are mapped read-write into userspace).
> Unless you have a system where you can limit the physical memory
> ranges accessible to the mac hardware, I don't see how you can stop
> the application putting rogue values into the ring.
> 
> Clearly I'm missing something in my quick read of the change.

No, I think I misunderstood, and apparently the Mellanox driver allows
the user to crap into arbitrary physical memory too.

All of this garbage must get fixed and this feature is a non-starter
until there is control over the memory the rings can point to.
Neil Horman Oct. 7, 2014, 6:59 p.m. | #24
On Tue, Oct 07, 2014 at 01:26:11AM +0200, Hannes Frederic Sowa wrote:
> Hi John,
> 
> On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
> > > I find the six additional ndo ops a bit worrisome as we are adding more
> > > and more subsystem specific ndoops to this struct. I would like to see
> > > some unification here, but currently cannot make concrete proposals,
> > > sorry.
> > 
> > I agree it seems like a bit much. One thought was to split the ndo
> > ops into categories. Switch ops, MACVLAN ops, basic ops and with this
> > userspace queue ops. This sort of goes along with some of the switch
> > offload work which is going to add a handful more ops as best I can
> > tell.
> 
> Thanks for your mail, you answered all of my questions.
> 
> Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
> Willem (also in Cc) used sysfs files which get mmaped to represent the
> tx/rx descriptors. The representation was independent of the device and
> IIRC the prototype used a write(fd, "", 1) to signal the kernel it
> should proceed with tx. I agree, it would be great to be syscall-free
> here.
> 
> For the semantics of the descriptors we could also easily generate files
> in sysfs. I thought about something like tracepoints already do for
> representing the data in the ringbuffer depending on the event:
> 
> -- >8 --
> # cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format 
> name: net_dev_queue
> ID: 1006
> format:
> 	field:unsigned short common_type;       offset:0;       size:2;
> 	signed:0;
> 	field:unsigned char common_flags;       offset:2;       size:1;
> 	signed:0;
> 	field:unsigned char common_preempt_count;       offset:3;      
> 	size:1; signed:0;
> 	field:int common_pid;   offset:4;       size:4; signed:1;
> 
> 	field:void * skbaddr;   offset:8;       size:8; signed:0;
> 	field:unsigned int len; offset:16;      size:4; signed:0;
> 	field:__data_loc char[] name;   offset:20;      size:4;
> 	signed:1;
> 
> print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
> REC->len
> -- >8 --
> 
> Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
> endianess would need to be added. Hopefully there is already a user
> space parser somewhere in the perf sources. An easier to parse binary
> representation could be added easily and maybe even something vDSO alike
> if people care about that.
> 
> Maybe this open/mmap per queue also kills some of the ndo_ops?
> 
> Bye,
> Hannes
> 


John-
	I don't know if it's of use to you here, but I was experimenting a while
ago with af_packet memory mapping, using the protection bits in the page tables
as a doorbell mechanism.  I scrapped the work because the performance bottleneck
for af_packet wasn't in the syscall trap time, but it occurs to me it might
be useful for you here, in that, using this mechanism, if you keep the transmit
ring non-empty, you only incur the cost of a single trap to start the transmit
process.  Let me know if you want to see it.

Neil

John Fastabend Oct. 8, 2014, 5:20 p.m. | #25
On 10/07/2014 11:59 AM, Neil Horman wrote:
> On Tue, Oct 07, 2014 at 01:26:11AM +0200, Hannes Frederic Sowa wrote:
>> Hi John,
>>
>> On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
>>>> I find the six additional ndo ops a bit worrisome as we are adding more
>>>> and more subsystem specific ndoops to this struct. I would like to see
>>>> some unification here, but currently cannot make concrete proposals,
>>>> sorry.
>>>
>>> I agree it seems like a bit much. One thought was to split the ndo
>>> ops into categories. Switch ops, MACVLAN ops, basic ops and with this
>>> userspace queue ops. This sort of goes along with some of the switch
>>> offload work which is going to add a handful more ops as best I can
>>> tell.
>>
>> Thanks for your mail, you answered all of my questions.
>>
>> Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
>> Willem (also in Cc) used sysfs files which get mmaped to represent the
>> tx/rx descriptors. The representation was independent of the device and
>> IIRC the prototype used a write(fd, "", 1) to signal the kernel it
>> should proceed with tx. I agree, it would be great to be syscall-free
>> here.
>>
>> For the semantics of the descriptors we could also easily generate files
>> in sysfs. I thought about something like tracepoints already do for
>> representing the data in the ringbuffer depending on the event:
>>
>> -- >8 --
>> # cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format
>> name: net_dev_queue
>> ID: 1006
>> format:
>> 	field:unsigned short common_type;       offset:0;       size:2;
>> 	signed:0;
>> 	field:unsigned char common_flags;       offset:2;       size:1;
>> 	signed:0;
>> 	field:unsigned char common_preempt_count;       offset:3;
>> 	size:1; signed:0;
>> 	field:int common_pid;   offset:4;       size:4; signed:1;
>>
>> 	field:void * skbaddr;   offset:8;       size:8; signed:0;
>> 	field:unsigned int len; offset:16;      size:4; signed:0;
>> 	field:__data_loc char[] name;   offset:20;      size:4;
>> 	signed:1;
>>
>> print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
>> REC->len
>> -- >8 --
>>
>> Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
>> endianess would need to be added. Hopefully there is already a user
>> space parser somewhere in the perf sources. An easier to parse binary
>> representation could be added easily and maybe even something vDSO alike
>> if people care about that.
>>
>> Maybe this open/mmap per queue also kills some of the ndo_ops?
>>
>> Bye,
>> Hannes
>>
>
>
> John-
> 	I don't know if its of use to you here, but I was experimenting awhile
> ago with af_packet memory mapping, using the protection bits in the page tables
> as a doorbell mechanism.  I scrapped the work as the performance bottleneck for
> af_packet wasn't found in the syscall trap time, but it occurs to me, it might
> be useful for you here, in that, using this mechanism, if you keep the transmit
> ring non-empty, you only encur the cost of a single trap to start the transmit
> process.  Let me know if you want to see it.
>
> Neil
>


Hi Neil,

If you could forward it along I'll take a look. It seems like something
along these lines will be needed.

Thanks,
John
Zhou, Danny Oct. 10, 2014, 3:49 a.m. | #26
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, October 08, 2014 12:06 AM
> To: Zhou, Danny
> Cc: willemb@google.com; john.fastabend@gmail.com; dborkman@redhat.com; fw@strlen.de; gerlitz.or@gmail.com;
> hannes@stressinduktion.org; netdev@vger.kernel.org; Ronciak, John; amirv@mellanox.com; eric.dumazet@gmail.com
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> From: "Zhou, Danny" <danny.zhou@intel.com>
> Date: Tue, 7 Oct 2014 15:21:15 +0000
> 
> > Once qpairs split-off is done, the user space driver, as a slave
> > driver, will re-initialize those queues completely in user space by
> > using paddr(in the case of DPDK, vaddr of DPDK used huge pages are
> > translated to paddr) to fill in the packet descriptors.  As of
> > security concern raised in previous discussion, the reason we
> > think(BTW, correct me if I am wrong) af_packet is most suitable is
> > because only user application with root permission is allowed to
> > successfully split-off queue pairs and mmap a small window of PCIe
> > I/O space to user space, so concern regarding "device can DMA
> > from/to any arbitrary physical memory." is not that big. As all user
> > space device drivers based on UIO mechanism has the same concern
> > issue, VFIO adds protection but it is based on IOMMU which is
> > specific to Intel silicons.
> 
> Wait a second.
> 
> If there is no memory protection performed I'm not merging this.
> 
> I thought the user has to associate a fixed pool of memory to the
> queueus, the kernel attaches that memory, and then the user cannot
> modify the addresses _AT_ _ALL_.
> 
> If the user can modify the addresses in the descriptors and make
> the chip crap on random memory, this is a non-starter.
> 
> Sorry.

Fair enough. We will work to add memory protection in future versions.
Several options are under investigation.



Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f5d293..dae96dc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,8 @@ 
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 
+#include <linux/if_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -997,6 +999,47 @@  typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Callback to use for xmit over the accelerated station. This
  *	is used in place of ndo_start_xmit on accelerated net
  *	devices.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int qpairs_start_from,
+ *				 unsigned int qpairs_num,
+ *				 struct sock *sk)
+ *	Called to request a set of queues from the driver to be
+ *	handed to the callee for management. After this returns the
+ *	driver will not use the queues. The call should return zero
+ *	on success otherwise an appropriate error code can be used.
+ *	If qpairs_start_from is less than zero driver can start at
+ *	any available slot.
+ *
+ * int (*ndo_get_queue_pairs)(struct net_device *dev,
+ *			      unsigned int *qpairs_start_from,
+ *			      unsigned int *qpairs_num,
+ *			      struct sock *sk);
+ *	Called to get the queues assigned by the driver to this sock
+ *	when ndo_split_queue_pairs does not specify a start_from and
+ *	qpairs_num field. Returns zero on success.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ *				  struct sock *sk)
+ *	Called to return a set of queues identified by sock to
+ *	the driver. The socket must have previously requested
+ *	the queues via ndo_split_queue_pairs for this action to
+ *	be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ *				struct tpacket_dev_qpair_map_region_info *info)
+ *	Called to return mapping of queue memory region
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ *				    struct tpacket_dev_info *dev_info)
+ *	Called to get device specific information. This should
+ *	uniquely identify the hardware so that descriptor formats
+ *	can be learned by the stack/user space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into
+ *	mmap region.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1146,6 +1189,26 @@  struct net_device_ops {
 							struct net_device *dev,
 							void *priv);
 	int			(*ndo_get_lock_subclass)(struct net_device *dev);
+	int			(*ndo_split_queue_pairs)(struct net_device *dev,
+					 unsigned int qpairs_start_from,
+					 unsigned int qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_get_queue_pairs)(struct net_device *dev,
+					 unsigned int *qpairs_start_from,
+					 unsigned int *qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_return_queue_pairs)(
+					 struct net_device *dev,
+					 struct sock *sk);
+	int			(*ndo_get_device_qpair_map_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_qpair_map_region_info *info);
+	int			(*ndo_get_device_desc_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_info *dev_info);
+	int			(*ndo_direct_qpair_page_map)
+					(struct vm_area_struct *vma,
+					 struct net_device *dev);
 };
 
 /**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..fa94b65 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,10 @@  struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_RXTX_QPAIRS_SPLIT	21
+#define PACKET_RXTX_QPAIRS_RETURN	22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
+#define PACKET_DEV_DESC_INFO		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -64,6 +68,44 @@  struct sockaddr_ll {
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
+#define MAX_MAP_MEMORY_REGIONS		64
+
+struct tpacket_dev_qpair_map_region_info {
+	unsigned int tp_dev_bar_sz;		/* size of BAR */
+	unsigned int tp_dev_sysm_sz;		/* size of system memory */
+	/* number of contiguous memory regions on BAR mapped to user space */
+	unsigned int tp_num_map_regions;
+	/* number of contiguous memory regions in system memory mapped to user space */
+	unsigned int tp_num_sysm_map_regions;
+	struct map_page_region {
+		unsigned page_offset;		/* offset to start of region */
+		unsigned page_sz;		/* size of page */
+		unsigned page_cnt;		/* number of pages */
+	} regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+#define PACKET_QPAIRS_START_ANY		-1
+
+struct tpacket_dev_qpairs_info {
+	unsigned int tp_qpairs_start_from;	/* qpairs index to start from */
+	unsigned int tp_qpairs_num;		/* number of qpairs */
+};
+
+struct tpacket_dev_info {
+	__u16		tp_device_id;
+	__u16		tp_vendor_id;
+	__u16		tp_subsystem_device_id;
+	__u16		tp_subsystem_vendor_id;
+	__u32		tp_numa_node;
+	__u32		tp_revision_id;
+	__u32		tp_num_total_qpairs;
+	__u32		tp_num_inuse_qpairs;
+	unsigned int	tp_rxdesc_size;	/* rx desc size in bytes */
+	__u16		tp_rxdesc_ver;
+	unsigned int	tp_txdesc_size;	/* tx desc size in bytes */
+	__u16		tp_txdesc_ver;
+};
+
 struct tpacket_stats {
 	unsigned int	tp_packets;
 	unsigned int	tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 87d20f4..19b43ee 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2611,6 +2611,14 @@  static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	if (po->tp_owns_queue_pairs) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (dev)
+			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -3403,6 +3411,70 @@  packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct tpacket_dev_qpairs_info qpairs;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs))
+			return -EINVAL;
+		if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+			return -EFAULT;
+
+		/* Only allow one set of queues to be owned by userspace */
+		if (po->tp_owns_queue_pairs)
+			return -EBUSY;
+
+		/* This call only works after a bind call, which does a dev_hold
+		 * operation, so we do not need to increment the dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  ops->ndo_split_queue_pairs(dev,
+						  qpairs.tp_qpairs_start_from,
+						  qpairs.tp_qpairs_num, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = true;
+
+		return err;
+	}
+
+	case PACKET_RXTX_QPAIRS_RETURN:
+	{
+		struct tpacket_dev_qpairs_info qpairs_info;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call, which does a dev_hold
+		 * operation, so we do not need to increment the dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = false;
+
+		return err;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3498,6 +3570,99 @@  static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_qpairs_info qpairs_info;
+		int err;
+
+		if (len != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		/* This call only works after a successful queue pairs split-off
+		 * operation via setsockopt()
+		 */
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call, which does a dev_hold
+		 * operation, so we do not need to increment the dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err = dev->netdev_ops->ndo_get_queue_pairs(dev,
+					&qpairs_info.tp_qpairs_start_from,
+					&qpairs_info.tp_qpairs_num, sk);
+
+		lv = sizeof(qpairs_info);
+		data = &qpairs_info;
+		break;
+	}
+	case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+	{
+		struct tpacket_dev_qpair_map_region_info info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call, which does a dev_hold
+		 * operation, so we do not need to increment the dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		if (!ops->ndo_get_device_qpair_map_region_info)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DEV_DESC_INFO:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_info info;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call, which does a dev_hold
+		 * operation, so we do not need to increment the dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_device_desc_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_info);
+		data = &info;
+		break;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3904,6 +4069,21 @@  static int packet_mmap(struct file *file, struct socket *sock,
 
 	mutex_lock(&po->pg_vec_lock);
 
+	if (po->tp_owns_queue_pairs) {
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		err = ops->ndo_direct_qpair_page_map(vma, dev);
+		if (err)
+			goto out;
+		goto done;
+	}
+
 	expected_size = 0;
 	for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
 		if (rb->pg_vec) {
@@ -3941,6 +4121,7 @@  static int packet_mmap(struct file *file, struct socket *sock,
 		}
 	}
 
+done:
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55cadbc 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -113,6 +113,7 @@  struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_owns_queue_pairs:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);