diff mbox series

[ovs-dev,1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user

Message ID 20200915104535.143393-1-ameerm@nvidia.com
State Rejected
Headers show
Series [ovs-dev,1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user | expand

Commit Message

Ameer Mahagneh Sept. 15, 2020, 10:45 a.m. UTC
For security reasons only root or privileged user can allocate Interconnect
Context Memory (ICM). Add this capability for vendors that require ICM
allocation when applying DPDK rte flows.

Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
Acked-by: Eli Britstein <elibr@nvidia.com>
---
 lib/daemon-unix.c | 1 +
 1 file changed, 1 insertion(+)

Comments

David Marchand Sept. 16, 2020, 4:23 p.m. UTC | #1
On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
>
> For security reasons only root or privileged user can allocate Interconnect
> Context Memory (ICM). Add this capability for vendors that require ICM
> allocation when applying DPDK rte flows.
>
> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
> Acked-by: Eli Britstein <elibr@nvidia.com>
> ---
>  lib/daemon-unix.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
> index ae59ecf2c..d32a60657 100644
> --- a/lib/daemon-unix.c
> +++ b/lib/daemon-unix.c
> @@ -820,6 +820,7 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
>              if (access_datapath && !ret) {
>                  ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
>                        || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
> +                      || capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO)
>                        || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
>              }
>          } else {

This patch seems incomplete: the manual is not updated and I would
expect some changes in the selinux policy files.
Aaron Conole Sept. 16, 2020, 8:05 p.m. UTC | #2
David Marchand <david.marchand@redhat.com> writes:

> On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
>>
>> For security reasons only root or privileged user can allocate Interconnect
>> Context Memory (ICM). Add this capability for vendors that require ICM
>> allocation when applying DPDK rte flows.
>>
>> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
>> Acked-by: Eli Britstein <elibr@nvidia.com>
>> ---

Why is this needed?  SYS_RAWIO is extremely privileged and means that
there is no point even in dropping privs or changing UID - the process
with these caps is allowed to alter anything, map /dev/mem and
/dev/kmem, etc.

Is there really no other way of doing this?  This feels somewhat like a
security regression rather than an improvement.  NOTE that we cannot
even use an LSM to protect against this - sys_rawio is able to perform
operations that can subvert LSMs.

>>  lib/daemon-unix.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
>> index ae59ecf2c..d32a60657 100644
>> --- a/lib/daemon-unix.c
>> +++ b/lib/daemon-unix.c
>> @@ -820,6 +820,7 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
>>              if (access_datapath && !ret) {
>>                  ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
>>                        || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
>> +                      || capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO)
>>                        || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
>>              }
>>          } else {
>
> This patch seems incomplete: the manual is not updated and I would
> expect some changes in the selinux policy files.
David Marchand March 11, 2021, 8:44 p.m. UTC | #3
On Wed, Sep 16, 2020 at 10:06 PM Aaron Conole <aconole@redhat.com> wrote:
>
> David Marchand <david.marchand@redhat.com> writes:
>
> > On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
> >>
> >> For security reasons only root or privileged user can allocate Interconnect
> >> Context Memory (ICM). Add this capability for vendors that require ICM
> >> allocation when applying DPDK rte flows.
> >>
> >> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
> >> Acked-by: Eli Britstein <elibr@nvidia.com>
> >> ---
>
> Why is this needed?  SYS_RAWIO is extremely privileged and means that
> there is no point even in dropping privs or changing UID - the process
> with these caps is allowed to alter anything, map /dev/mem and
> /dev/kmem, etc.
>
> Is there really no other way of doing this?  This feels somewhat like a
> security regression rather than an improvement.  NOTE that we cannot
> even use an LSM to protect against this - sys_rawio is able to perform
> operations that can subvert LSMs.

I had forgotten about this patch... I was expecting someone from
Nvidia to reply but I see nothing on the ml.

I do not have the full story, but I hit an issue just yesterday and
spent today figuring this out.

For me, the impact is simple: without this capability, full
hw-offloads with mlx5 devices are unavailable with ovs running as non
root.
The logs are not helping btw, example:
2021-03-11T17:48:01.407Z|00062|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
rte_flow creation failed: 1 ((null)).
2021-03-11T17:48:01.407Z|00063|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
Failed flow:   flow create 2 ingress priority 0 group 0 transfer
pattern eth src is 0c:42:a1:00:a8:7c dst is 6a:20:8f:82:52:49 type is
0x0800 / ipv4 / end actions count / port_id original 0 id 5 / end
And OVS automatically falls back to partial offloading.

Can nvidia people explain the need for this capability and if other
options have been considered?


Thanks.
Ilya Maximets March 11, 2021, 9:07 p.m. UTC | #4
On 3/11/21 9:44 PM, David Marchand wrote:
> On Wed, Sep 16, 2020 at 10:06 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> David Marchand <david.marchand@redhat.com> writes:
>>
>>> On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
>>>>
>>>> For security reasons only root or privileged user can allocate Interconnect
>>>> Context Memory (ICM). Add this capability for vendors that require ICM
>>>> allocation when applying DPDK rte flows.
>>>>
>>>> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
>>>> Acked-by: Eli Britstein <elibr@nvidia.com>
>>>> ---
>>
>> Why is this needed?  SYS_RAWIO is extremely privileged and means that
>> there is no point even in dropping privs or changing UID - the process
>> with these caps is allowed to alter anything, map /dev/mem and
>> /dev/kmem, etc.
>>
>> Is there really no other way of doing this?  This feels somewhat like a
>> security regression rather than an improvement.  NOTE that we cannot
>> even use an LSM to protect against this - sys_rawio is able to perform
>> operations that can subvert LSMs.
> 
> I had forgotten about this patch... I was expecting someone from
> Nvidia to reply but I see nothing on the ml.
> 
> I do not have the full story, but I hit an issue just yesterday and
> spent today figuring this out.
> 
> For me, the impact is simple: without this capability, full
> hw-offloads with mlx5 devices are unavailable with ovs running as non
> root.
> The logs are not helping btw, example:
> 2021-03-11T17:48:01.407Z|00062|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
> rte_flow creation failed: 1 ((null)).

At least, I think mlx driver should provide better error for this case
instead of 'null', e.g. EPERM with some meaningful message, so users
will know that they need to bump their privilege level.

> 2021-03-11T17:48:01.407Z|00063|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
> Failed flow:   flow create 2 ingress priority 0 group 0 transfer
> pattern eth src is 0c:42:a1:00:a8:7c dst is 6a:20:8f:82:52:49 type is
> 0x0800 / ipv4 / end actions count / port_id original 0 id 5 / end
> And OVS automatically falls back to partial offloading.
> 
> Can nvidia people explain the need for this capability and if other
> options have been considered?
> 
> 
> Thanks.
>
Maxime Coquelin March 15, 2021, 8:53 a.m. UTC | #5
On 3/11/21 10:07 PM, Ilya Maximets wrote:
> On 3/11/21 9:44 PM, David Marchand wrote:
>> On Wed, Sep 16, 2020 at 10:06 PM Aaron Conole <aconole@redhat.com> wrote:
>>>
>>> David Marchand <david.marchand@redhat.com> writes:
>>>
>>>> On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
>>>>>
>>>>> For security reasons only root or privileged user can allocate Interconnect
>>>>> Context Memory (ICM). Add this capability for vendors that require ICM
>>>>> allocation when applying DPDK rte flows.
>>>>>
>>>>> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
>>>>> Acked-by: Eli Britstein <elibr@nvidia.com>
>>>>> ---
>>>
>>> Why is this needed?  SYS_RAWIO is extremely privileged and means that
>>> there is no point even in dropping privs or changing UID - the process
>>> with these caps is allowed to alter anything, map /dev/mem and
>>> /dev/kmem, etc.
>>>
>>> Is there really no other way of doing this?  This feels somewhat like a
>>> security regression rather than an improvement.  NOTE that we cannot
>>> even use an LSM to protect against this - sys_rawio is able to perform
>>> operations that can subvert LSMs.
>>
>> I had forgotten about this patch... I was expecting someone from
>> Nvidia to reply but I see nothing on the ml.
>>
>> I do not have the full story, but I hit an issue just yesterday and
>> spent today figuring this out.
>>
>> For me, the impact is simple: without this capability, full
>> hw-offloads with mlx5 devices are unavailable with ovs running as non
>> root.
>> The logs are not helping btw, example:
>> 2021-03-11T17:48:01.407Z|00062|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
>> rte_flow creation failed: 1 ((null)).
> 
> At least, I think mlx driver should provide better error for this case
> instead of 'null', e.g. EPERM with some meaningful message, so users
> will know that they need to bump their privilege level.

+1

Please note that on my side with ConnectX-6 Dx, without this capability,
flows are marked as partially offloaded but no packet is received on the
VFs. Adding the proper capability fixes it (except that sometimes, for
unknown reasons, half the flow are partially offloaded).

Regards,
Maxime

>> 2021-03-11T17:48:01.407Z|00063|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
>> Failed flow:   flow create 2 ingress priority 0 group 0 transfer
>> pattern eth src is 0c:42:a1:00:a8:7c dst is 6a:20:8f:82:52:49 type is
>> 0x0800 / ipv4 / end actions count / port_id original 0 id 5 / end
>> And OVS automatically falls back to partial offloading.
>>
>> Can nvidia people explain the need for this capability and if other
>> options have been considered?
>>
>>
>> Thanks.
>>
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Gaetan Rivet March 19, 2021, 4:59 p.m. UTC | #6
On Mon, Mar 15, 2021, at 08:53, Maxime Coquelin wrote: 
> 
> 
> On 3/11/21 10:07 PM, Ilya Maximets wrote:
> > On 3/11/21 9:44 PM, David Marchand wrote:
> >> On Wed, Sep 16, 2020 at 10:06 PM Aaron Conole <aconole@redhat.com> wrote:
> >>>
> >>> David Marchand <david.marchand@redhat.com> writes:
> >>>
> >>>> On Tue, Sep 15, 2020 at 12:52 PM Ameer Mahagneh <ameerm@nvidia.com> wrote:
> >>>>>
> >>>>> For security reasons only root or privileged user can allocate Interconnect
> >>>>> Context Memory (ICM). Add this capability for vendors that require ICM
> >>>>> allocation when applying DPDK rte flows.
> >>>>>
> >>>>> Signed-off-by: Ameer Mahagneh <ameerm@nvidia.com>
> >>>>> Acked-by: Eli Britstein <elibr@nvidia.com>
> >>>>> ---
> >>>
> >>> Why is this needed?  SYS_RAWIO is extremely privileged and means that
> >>> there is no point even in dropping privs or changing UID - the process
> >>> with these caps is allowed to alter anything, map /dev/mem and
> >>> /dev/kmem, etc.
> >>>
> >>> Is there really no other way of doing this?  This feels somewhat like a
> >>> security regression rather than an improvement.  NOTE that we cannot
> >>> even use an LSM to protect against this - sys_rawio is able to perform
> >>> operations that can subvert LSMs.
> >>
> >> I had forgotten about this patch... I was expecting someone from
> >> Nvidia to reply but I see nothing on the ml.
> >>
> >> I do not have the full story, but I hit an issue just yesterday and
> >> spent today figuring this out.
> >>
> >> For me, the impact is simple: without this capability, full
> >> hw-offloads with mlx5 devices are unavailable with ovs running as non
> >> root.
> >> The logs are not helping btw, example:
> >> 2021-03-11T17:48:01.407Z|00062|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
> >> rte_flow creation failed: 1 ((null)).
> > 
> > At least, I think mlx driver should provide better error for this case
> > instead of 'null', e.g. EPERM with some meaningful message, so users
> > will know that they need to bump their privilege level.
> 
> +1
> 
> Please note that on my side with ConnectX-6 Dx, without this capability,
> flows are marked as partially offloaded but no packet is received on the
> VFs. Adding the proper capability fixes it (except that sometimes, for
> unknown reasons, half the flow are partially offloaded).
> 
> Regards,
> Maxime
> 
> >> 2021-03-11T17:48:01.407Z|00063|netdev_offload_dpdk(dp_netdev_flow_5)|WARN|dpdk0:
> >> Failed flow:   flow create 2 ingress priority 0 group 0 transfer
> >> pattern eth src is 0c:42:a1:00:a8:7c dst is 6a:20:8f:82:52:49 type is
> >> 0x0800 / ipv4 / end actions count / port_id original 0 id 5 / end
> >> And OVS automatically falls back to partial offloading.
> >>
> >> Can nvidia people explain the need for this capability and if other
> >> options have been considered?
> >>
> >>
> >> Thanks.
> >>
>

Hello everyone,

We should have addressed this issue earlier, sorry about that.

Our rte_flow implementation uses ICM mappings to program our hardware,
which requires super privileged access. We are looking into ways to avoid it.

In the meantime, we failed to properly communicate this need in the rte_flow API.
We will improve the documentation and the error path in DPDK.

I can also update OVS documentation if anyone thinks it could help, but it is vendor-specific.
I would expect it to be more relevant at the DPDK level.

Best regards,
David Marchand March 22, 2021, 2:21 p.m. UTC | #7
Hello Gaëtan,

On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com> wrote:
> Our rte_flow implementation uses ICM mappings to program our hardware,
> which requires super privileged access. We are looking into ways to avoid it.

Ok, thanks for looking into it.

>
> In the meantime, we failed to properly communicate this need in the rte_flow API.
> We will improve the documentation and the error path in DPDK.

Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
rte_flow creation failed: 1 ((null)).
2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
Failed flow:   flow create 3 ingress priority 0 group 0 transfer
pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end

First log is useless.
This is more bugfixing than enhancement.
Though logs do not need to tell the full story, they can point at the
mlx5 pmd documentation where the full explanation is.

>
> I can also update OVS documentation if anyone thinks it could help, but it is vendor-specific.
> I would expect it to be more relevant at the DPDK level.

Yes, OVS should be device agnostic.
DPDK is enough for me.
Eelco Chaudron Jan. 23, 2023, 9:44 a.m. UTC | #8
On 22 Mar 2021, at 15:21, David Marchand wrote:

> Hello Gaëtan,
>
> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com> wrote:
>> Our rte_flow implementation uses ICM mappings to program our hardware,
>> which requires super privileged access. We are looking into ways to avoid it.
>
> Ok, thanks for looking into it.

Was any progress made on this? Or was the conclusion that this is the only way?

>>
>> In the meantime, we failed to properly communicate this need in the rte_flow API.
>> We will improve the documentation and the error path in DPDK.
>
> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
> rte_flow creation failed: 1 ((null)).
> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
> Failed flow:   flow create 3 ingress priority 0 group 0 transfer
> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>
> First log is useless.
> This is more bugfixing than enhancement.
> Though logs do not need to tell the full story, they can point at the
> mlx5 pmd documentation where the full explanation is.


I was running into this issue also and spent a decent amount of time trying to figure out what was going on.

I did not have HW offload enabled yet, but just the basic VF/port representer configuration and no error messages or packets were arriving.

It Would be good to get some logging indicating the configuration/system was not valid. All I got was silent packet drops, but counters were incremented :(

>>
>> I can also update OVS documentation if anyone thinks it could help, but it is vendor-specific.
>> I would expect it to be more relevant at the DPDK level.
>
> Yes, OVS should be device agnostic.
> DPDK is enough for me.
>
>
> -- 
> David Marchand
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Gaetan Rivet Feb. 9, 2023, 9:41 a.m. UTC | #9
>-----Original Message-----
>From: dev <ovs-dev-bounces@openvswitch.org <mailto:ovs-dev-bounces@openvswitch.org>> on behalf of Eelco Chaudron <echaudro@redhat.com <mailto:echaudro@redhat.com>>
>Date: Monday 23 January 2023 at 10:44
>To: Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>>
>Cc: ovs dev <dev@openvswitch.org <mailto:dev@openvswitch.org>>, Eli Britstein <elibr@nvidia.com <mailto:elibr@nvidia.com>>, Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>>, Maxime Coquelin <maxime.coquelin@redhat.com <mailto:maxime.coquelin@redhat.com>>, Jason Gunthorpe <jgg@nvidia.com <mailto:jgg@nvidia.com>>, Majd Dibbiny <majd@nvidia.com <mailto:majd@nvidia.com>>, David Marchand <david.marchand@redhat.com <mailto:david.marchand@redhat.com>>
>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user
>
>On 22 Mar 2021, at 15:21, David Marchand wrote:
>
>
>> Hello Gaëtan,
>>
>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>> wrote:
>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>> which requires super privileged access. We are looking into ways to avoid it.
>>
>> Ok, thanks for looking into it.
>
>
>Was any progress made on this? Or was the conclusion that this is the only way?
>
>

Hello Eelco,

I'll start by clarifying something: this issue is two-fold, even though there is a single RC.
The previous email thread is about using rte_flow API with the mlx5 PMD,
while the bug you described is about lack of TX on ports with no offloads inserted by the user.

On the matter of the above rte_flow requirements: it was discussed, the conclusion
was that nothing could be done with the current offload architecture.
A patch was written to improve the logs and the documentation, but I see that it didn't
make it to upstream DPDK. I have brought it to the attention of the DPDK team and
it will be submitted.

>>>
>>> In the meantime, we failed to properly communicate this need in the rte_flow API.
>>> We will improve the documentation and the error path in DPDK.
>>
>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>> rte_flow creation failed: 1 ((null)).
>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>
>> First log is useless.
>> This is more bugfixing than enhancement.
>> Though logs do not need to tell the full story, they can point at the
>> mlx5 pmd documentation where the full explanation is.
>
>
>
>
>I was running into this issue also and spent a decent amount of time trying to figure out what was going on.
>I did not have HW offload enabled yet, but just the basic VF/port representer configuration and no error messages or packets were arriving.
>It Would be good to get some logging indicating the configuration/system was not valid. All I got was silent packet drops, but counters were incremented :(

I was able to reproduce this issue. The DPDK team was adamant that it should have resulted in an error log,
but none can be seen. The absence of log is a bug, however the behavior itself is intended.

To give some more details, port probe should work without SYS_RAWIO, but start should fail.
On start, the PMD installs hardware rules, using the same underlying API as rte_flow. It makes SYS_RAWIO
a requirement for even basic port functions. This will remain the case in the future.

Similarly, this issue has been brought to the attention of the DPDK team and they
will make sure the user has an error log in that case, as well as clarifying the mlx5 doc.

Thanks for reporting this!

Gaetan
Aaron Conole Feb. 9, 2023, 2:32 p.m. UTC | #10
Gaetan Rivet via dev <ovs-dev@openvswitch.org> writes:

>>-----Original Message-----
>>From: dev <ovs-dev-bounces@openvswitch.org
>> <mailto:ovs-dev-bounces@openvswitch.org>> on behalf of Eelco
>> Chaudron <echaudro@redhat.com <mailto:echaudro@redhat.com>>
>>Date: Monday 23 January 2023 at 10:44
>>To: Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>>
>>Cc: ovs dev <dev@openvswitch.org <mailto:dev@openvswitch.org>>, Eli
>> Britstein <elibr@nvidia.com <mailto:elibr@nvidia.com>>, Ilya
>> Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>>, Maxime
>> Coquelin <maxime.coquelin@redhat.com
>> <mailto:maxime.coquelin@redhat.com>>, Jason Gunthorpe
>> <jgg@nvidia.com <mailto:jgg@nvidia.com>>, Majd Dibbiny
>> <majd@nvidia.com <mailto:majd@nvidia.com>>, David Marchand
>> <david.marchand@redhat.com <mailto:david.marchand@redhat.com>>
>>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user
>>
>>On 22 Mar 2021, at 15:21, David Marchand wrote:
>>
>>
>>> Hello Gaëtan,
>>>
>>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>> wrote:
>>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>>> which requires super privileged access. We are looking into ways to avoid it.
>>>
>>> Ok, thanks for looking into it.
>>
>>
>>Was any progress made on this? Or was the conclusion that this is the only way?
>>
>>
>
> Hello Eelco,
>
> I'll start by clarifying something: this issue is two-fold, even though there is a single RC.
> The previous email thread is about using rte_flow API with the mlx5 PMD,
> while the bug you described is about lack of TX on ports with no offloads inserted by the user.
>
> On the matter of the above rte_flow requirements: it was discussed, the conclusion
> was that nothing could be done with the current offload architecture.
> A patch was written to improve the logs and the documentation, but I see that it didn't
> make it to upstream DPDK. I have brought it to the attention of the DPDK team and
> it will be submitted.
>
>>>>
>>>> In the meantime, we failed to properly communicate this need in the rte_flow API.
>>>> We will improve the documentation and the error path in DPDK.
>>>
>>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> rte_flow creation failed: 1 ((null)).
>>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>>
>>> First log is useless.
>>> This is more bugfixing than enhancement.
>>> Though logs do not need to tell the full story, they can point at the
>>> mlx5 pmd documentation where the full explanation is.
>>
>>
>>
>>
>>I was running into this issue also and spent a decent amount of time
>> trying to figure out what was going on.
>>I did not have HW offload enabled yet, but just the basic VF/port
>> representer configuration and no error messages or packets were
>> arriving.
>>It Would be good to get some logging indicating the
>> configuration/system was not valid. All I got was silent packet
>> drops, but counters were incremented :(
>
> I was able to reproduce this issue. The DPDK team was adamant that it
> should have resulted in an error log,
> but none can be seen. The absence of log is a bug, however the behavior itself is intended.
>
> To give some more details, port probe should work without SYS_RAWIO, but start should fail.
> On start, the PMD installs hardware rules, using the same underlying API as rte_flow. It makes SYS_RAWIO
> a requirement for even basic port functions. This will remain the case in the future.
>
> Similarly, this issue has been brought to the attention of the DPDK team and they
> will make sure the user has an error log in that case, as well as clarifying the mlx5 doc.
>
> Thanks for reporting this!
>

I guess we would need something like the following (untested, but
general idea) patch, but I can see having some debate about a proper
solution.  It should also be noted that some vfio support would also
need iopl() access (at least from a quick glance at the DPDK side), but
I'm not sure what use case that would enable.

In this change, I don't really put much documentation around the
hw-access knob - there may be additional selinux changes (for example to
add support for memory_device_t access), but I don't currently have a
system ready to go for testing this.

There is a bit of an ABI break around daemonize_start() in here, so
maybe it would be better to introduce a proper ABI version signature for
daemonize_start() even though historically we never tried to have a
consistent ABI in libopenvswitch, so just "thinking out loud," at the
moment.

---
diff --git a/NEWS b/NEWS
index fe6055a270..13d2522679 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@ Post-v3.1.0
      * OVS now collects per-interface upcall statistics that can be obtained
        via 'ovs-appctl dpctl/show -s' or the interface's statistics column
        in OVSDB.  Available with upstream kernel 6.2+.
+   - DPDK:
+     * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
+       with the --hw-access command line option.  This allows the process
+       extra privileges when mapping physical interconnect memory.
 
 
 v3.1.0 - xx xxx xxxx
diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
index 1a7ba427d7..8b895a48de 100644
--- a/lib/daemon-unix.c
+++ b/lib/daemon-unix.c
@@ -88,7 +88,8 @@ static bool switch_user = false;
 static uid_t uid;
 static gid_t gid;
 static char *user = NULL;
-static void daemon_become_new_user__(bool access_datapath);
+static void daemon_become_new_user__(bool access_datapath,
+                                     bool access_hardware_ports);
 
 static void check_already_running(void);
 static int lock_pidfile(FILE *, int command);
@@ -443,13 +444,13 @@ monitor_daemon(pid_t daemon_pid)
  * daemonize_complete()) or that it failed to start up (by exiting with a
  * nonzero exit code). */
 void
-daemonize_start(bool access_datapath)
+daemonize_start(bool access_datapath, bool access_hardware_ports)
 {
     assert_single_threaded();
     daemonize_fd = -1;
 
     if (switch_user) {
-        daemon_become_new_user__(access_datapath);
+        daemon_become_new_user__(access_datapath, access_hardware_ports);
         switch_user = false;
     }
 
@@ -807,7 +808,8 @@ daemon_become_new_user_unix(void)
 /* Linux specific implementation of daemon_become_new_user()
  * using libcap-ng.   */
 static void
-daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
+daemon_become_new_user_linux(bool access_datapath OVS_UNUSED,
+                             bool access_hardware_ports OVS_UNUSED)
 {
 #if defined __linux__ &&  HAVE_LIBCAPNG
     int ret;
@@ -826,7 +828,17 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
             if (access_datapath && !ret) {
                 ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
                       || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
-                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
+                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST)
+#ifdef DPDK_NETDEV
+                      || (access_hardware_ports &&
+                          capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO))
+#else
+                    ;
+                if (access_hardware_ports) {
+                    VLOG_WARN("hw port access requested, but no userspace ioport support.  Dropping.");
+                }
+#endif
+                    ;
             }
         } else {
             ret = -1;
@@ -854,7 +866,7 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
 }
 
 static void
-daemon_become_new_user__(bool access_datapath)
+daemon_become_new_user__(bool access_datapath, bool access_hardware_ports)
 {
     /* If vlog file has been created, change its owner to the non-root user
      * as specifed by the --user option.  */
@@ -862,7 +874,8 @@ daemon_become_new_user__(bool access_datapath)
 
     if (LINUX) {
         if (LIBCAPNG) {
-            daemon_become_new_user_linux(access_datapath);
+            daemon_become_new_user_linux(access_datapath,
+                                         access_hardware_ports);
         } else {
             VLOG_FATAL("%s: fail to downgrade user using libcap-ng. "
                        "(libcap-ng is not configured at compile time), "
@@ -877,11 +890,11 @@ daemon_become_new_user__(bool access_datapath)
  * However, there in case the user switch needs to be done
  * before daemonize_start(), the following API can be used.  */
 void
-daemon_become_new_user(bool access_datapath)
+daemon_become_new_user(bool access_datapath, bool access_hardware_ports)
 {
     assert_single_threaded();
     if (switch_user) {
-        daemon_become_new_user__(access_datapath);
+        daemon_become_new_user__(access_datapath, access_hardware_ports);
         /* daemonize_start() should not switch user again. */
         switch_user = false;
     }
diff --git a/lib/daemon.c b/lib/daemon.c
index 3249c5ab4b..1e1c019eb1 100644
--- a/lib/daemon.c
+++ b/lib/daemon.c
@@ -48,7 +48,7 @@ get_detach(void)
 void
 daemonize(void)
 {
-    daemonize_start(false);
+    daemonize_start(false, false);
     daemonize_complete();
 }
 
diff --git a/lib/daemon.h b/lib/daemon.h
index 0941574963..42372d1463 100644
--- a/lib/daemon.h
+++ b/lib/daemon.h
@@ -167,10 +167,10 @@ void set_detach(void);
 bool get_detach(void);
 void daemon_save_fd(int fd);
 void daemonize(void);
-void daemonize_start(bool access_datapath);
+void daemonize_start(bool access_datapath, bool access_hardware_ports);
 void daemonize_complete(void);
 void daemon_set_new_user(const char * user_spec);
-void daemon_become_new_user(bool access_datapath);
+void daemon_become_new_user(bool access_datapath, bool access_hardware_ports);
 void daemon_usage(void);
 void daemon_disable_self_confinement(void);
 bool daemon_should_self_confine(void);
diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
index 9569265fcb..e39e3c984c 100644
--- a/vswitchd/ovs-vswitchd.8.in
+++ b/vswitchd/ovs-vswitchd.8.in
@@ -81,6 +81,12 @@ unavailable or unsuccessful.
 .SS "DPDK Options"
 For details on initializing \fBovs\-vswitchd\fR to use DPDK ports,
 refer to the documentation or \fBovs\-vswitchd.conf.db\fR(5).
+.SS "DPDK HW Access Options"
+.IP "\fB\-\-hw\-access\fR"
+Tells \fBovs\-vswitchd\fR to retain the \fBCAP_SYS_RAWIO\fR capability,
+to allow userspace drivers access to raw hardware memory.  This will
+also allow the \fBovs\-vswitchd\fR daemon to call \fBiopl()\fR and
+\fBioperm()\fR functions to set port access.
 .SS "Daemon Options"
 .ds DD \
 \fBovs\-vswitchd\fR detaches only after it has connected to the \
diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
index 407bfc60eb..ff1527fd5b 100644
--- a/vswitchd/ovs-vswitchd.c
+++ b/vswitchd/ovs-vswitchd.c
@@ -60,6 +60,9 @@ VLOG_DEFINE_THIS_MODULE(vswitchd);
  * the kernel from paging any of its memory to disk. */
 static bool want_mlockall;
 
+/* --hw-access: If set, retains CAP_SYS_RAWIO privileges.  */
+static bool hw_access;
+
 static unixctl_cb_func ovs_vswitchd_exit;
 
 static char *parse_options(int argc, char *argv[], char **unixctl_path);
@@ -169,6 +172,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         OPT_DPDK,
         SSL_OPTION_ENUMS,
         OPT_DUMMY_NUMA,
+        OPT_HW_ACCESS,
     };
     static const struct option long_options[] = {
         {"help",        no_argument, NULL, 'h'},
@@ -185,6 +189,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
         {"disable-system-route", no_argument, NULL, OPT_DISABLE_SYSTEM_ROUTE},
         {"dpdk", optional_argument, NULL, OPT_DPDK},
         {"dummy-numa", required_argument, NULL, OPT_DUMMY_NUMA},
+        {"hw-access", no_argument, NULL, OPT_HW_ACCESS},
         {NULL, 0, NULL, 0},
     };
     char *short_options = ovs_cmdl_long_options_to_short_options(long_options);
@@ -249,6 +254,10 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
             ovs_numa_set_dummy(optarg);
             break;
 
+        case OPT_HW_ACCESS:
+            hw_access = true;
+            break;
+
         default:
             abort();
         }
---
Aaron Conole Feb. 9, 2023, 2:48 p.m. UTC | #11
Aaron Conole <aconole@redhat.com> writes:

> Gaetan Rivet via dev <ovs-dev@openvswitch.org> writes:
>
>>>-----Original Message-----
>>>From: dev <ovs-dev-bounces@openvswitch.org
>>> <mailto:ovs-dev-bounces@openvswitch.org>> on behalf of Eelco
>>> Chaudron <echaudro@redhat.com <mailto:echaudro@redhat.com>>
>>>Date: Monday 23 January 2023 at 10:44
>>>To: Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>>
>>>Cc: ovs dev <dev@openvswitch.org <mailto:dev@openvswitch.org>>, Eli
>>> Britstein <elibr@nvidia.com <mailto:elibr@nvidia.com>>, Ilya
>>> Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>>, Maxime
>>> Coquelin <maxime.coquelin@redhat.com
>>> <mailto:maxime.coquelin@redhat.com>>, Jason Gunthorpe
>>> <jgg@nvidia.com <mailto:jgg@nvidia.com>>, Majd Dibbiny
>>> <majd@nvidia.com <mailto:majd@nvidia.com>>, David Marchand
>>> <david.marchand@redhat.com <mailto:david.marchand@redhat.com>>
>>>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user
>>>
>>>On 22 Mar 2021, at 15:21, David Marchand wrote:
>>>
>>>
>>>> Hello Gaëtan,
>>>>
>>>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>> wrote:
>>>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>>>> which requires super privileged access. We are looking into ways to avoid it.
>>>>
>>>> Ok, thanks for looking into it.
>>>
>>>
>>>Was any progress made on this? Or was the conclusion that this is the only way?
>>>
>>>
>>
>> Hello Eelco,
>>
>> I'll start by clarifying something: this issue is two-fold, even though there is a single RC.
>> The previous email thread is about using rte_flow API with the mlx5 PMD,
>> while the bug you described is about lack of TX on ports with no offloads inserted by the user.
>>
>> On the matter of the above rte_flow requirements: it was discussed, the conclusion
>> was that nothing could be done with the current offload architecture.
>> A patch was written to improve the logs and the documentation, but I see that it didn't
>> make it to upstream DPDK. I have brought it to the attention of the DPDK team and
>> it will be submitted.
>>
>>>>>
>>>>> In the meantime, we failed to properly communicate this need in the rte_flow API.
>>>>> We will improve the documentation and the error path in DPDK.
>>>>
>>>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>>>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>>> rte_flow creation failed: 1 ((null)).
>>>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>>>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>>>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>>>
>>>> First log is useless.
>>>> This is more bugfixing than enhancement.
>>>> Though logs do not need to tell the full story, they can point at the
>>>> mlx5 pmd documentation where the full explanation is.
>>>
>>>
>>>
>>>
>>>I was running into this issue also and spent a decent amount of time
>>> trying to figure out what was going on.
>>>I did not have HW offload enabled yet, but just the basic VF/port
>>> representer configuration and no error messages or packets were
>>> arriving.
>>>It Would be good to get some logging indicating the
>>> configuration/system was not valid. All I got was silent packet
>>> drops, but counters were incremented :(
>>
>> I was able to reproduce this issue. The DPDK team was adamant that it
>> should have resulted in an error log,
>> but none can be seen. The absence of log is a bug, however the behavior itself is intended.
>>
>> To give some more details, port probe should work without SYS_RAWIO, but start should fail.
>> On start, the PMD installs hardware rules, using the same underlying
>> API as rte_flow. It makes SYS_RAWIO
>> a requirement for even basic port functions. This will remain the case in the future.
>>
>> Similarly, this issue has been brought to the attention of the DPDK team and they
>> will make sure the user has an error log in that case, as well as clarifying the mlx5 doc.
>>
>> Thanks for reporting this!
>>
>
> I guess we would need something like the following (untested, but
> general idea) patch, but I can see having some debate about a proper
> solution.  It should also be noted that some vfio support would also
> need iopl() access (at least from a quick glance at the DPDK side), but
> I'm not sure what use case that would enable.
>
> In this change, I don't really put much documentation around the
> hw-access knob - there may be additional selinux changes (for example to
> add support for memory_device_t access), but I don't currently have a
> system ready to go for testing this.
>
> There is a bit of an ABI break around daemonize_start() in here, so
> maybe it would be better to introduce a proper ABI version signature for
> daemonize_start() even though historically we never tried to have a
> consistent ABI in libopenvswitch, so just "thinking out loud," at the
> moment.
>
> ---
> diff --git a/NEWS b/NEWS
> index fe6055a270..13d2522679 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,6 +4,10 @@ Post-v3.1.0
>       * OVS now collects per-interface upcall statistics that can be obtained
>         via 'ovs-appctl dpctl/show -s' or the interface's statistics column
>         in OVSDB.  Available with upstream kernel 6.2+.
> +   - DPDK:
> +     * ovs-vswitchd will keep the CAP_SYS_RAWIO capability when started
> +       with the --hw-access command line option.  This allows the process
> +       extra privileges when mapping physical interconnect memory.
>  
>  
>  v3.1.0 - xx xxx xxxx
> diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
> index 1a7ba427d7..8b895a48de 100644
> --- a/lib/daemon-unix.c
> +++ b/lib/daemon-unix.c
> @@ -88,7 +88,8 @@ static bool switch_user = false;
>  static uid_t uid;
>  static gid_t gid;
>  static char *user = NULL;
> -static void daemon_become_new_user__(bool access_datapath);
> +static void daemon_become_new_user__(bool access_datapath,
> +                                     bool access_hardware_ports);
>  
>  static void check_already_running(void);
>  static int lock_pidfile(FILE *, int command);
> @@ -443,13 +444,13 @@ monitor_daemon(pid_t daemon_pid)
>   * daemonize_complete()) or that it failed to start up (by exiting with a
>   * nonzero exit code). */
>  void
> -daemonize_start(bool access_datapath)
> +daemonize_start(bool access_datapath, bool access_hardware_ports)
>  {
>      assert_single_threaded();
>      daemonize_fd = -1;
>  
>      if (switch_user) {
> -        daemon_become_new_user__(access_datapath);
> +        daemon_become_new_user__(access_datapath, access_hardware_ports);
>          switch_user = false;
>      }
>  
> @@ -807,7 +808,8 @@ daemon_become_new_user_unix(void)
>  /* Linux specific implementation of daemon_become_new_user()
>   * using libcap-ng.   */
>  static void
> -daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
> +daemon_become_new_user_linux(bool access_datapath OVS_UNUSED,
> +                             bool access_hardware_ports OVS_UNUSED)
>  {
>  #if defined __linux__ &&  HAVE_LIBCAPNG
>      int ret;
> @@ -826,7 +828,17 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
>              if (access_datapath && !ret) {
>                  ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
>                        || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
> -                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
> +                      || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST)
> +#ifdef DPDK_NETDEV
> +                      || (access_hardware_ports &&
> +                          capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO))
> +#else
> +                    ;
> +                if (access_hardware_ports) {
> +                    VLOG_WARN("hw port access requested, but no userspace ioport support.  Dropping.");
> +                }
> +#endif
> +                    ;
>              }
>          } else {
>              ret = -1;
> @@ -854,7 +866,7 @@ daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
>  }
>  
>  static void
> -daemon_become_new_user__(bool access_datapath)
> +daemon_become_new_user__(bool access_datapath, bool access_hardware_ports)
>  {
>      /* If vlog file has been created, change its owner to the non-root user
>       * as specifed by the --user option.  */
> @@ -862,7 +874,8 @@ daemon_become_new_user__(bool access_datapath)
>  
>      if (LINUX) {
>          if (LIBCAPNG) {
> -            daemon_become_new_user_linux(access_datapath);
> +            daemon_become_new_user_linux(access_datapath,
> +                                         access_hardware_ports);
>          } else {
>              VLOG_FATAL("%s: fail to downgrade user using libcap-ng. "
>                         "(libcap-ng is not configured at compile time), "
> @@ -877,11 +890,11 @@ daemon_become_new_user__(bool access_datapath)
>   * However, there in case the user switch needs to be done
>   * before daemonize_start(), the following API can be used.  */
>  void
> -daemon_become_new_user(bool access_datapath)
> +daemon_become_new_user(bool access_datapath, bool access_hardware_ports)
>  {
>      assert_single_threaded();
>      if (switch_user) {
> -        daemon_become_new_user__(access_datapath);
> +        daemon_become_new_user__(access_datapath, access_hardware_ports);
>          /* daemonize_start() should not switch user again. */
>          switch_user = false;
>      }
> diff --git a/lib/daemon.c b/lib/daemon.c
> index 3249c5ab4b..1e1c019eb1 100644
> --- a/lib/daemon.c
> +++ b/lib/daemon.c
> @@ -48,7 +48,7 @@ get_detach(void)
>  void
>  daemonize(void)
>  {
> -    daemonize_start(false);
> +    daemonize_start(false, false);
>      daemonize_complete();
>  }
>  
> diff --git a/lib/daemon.h b/lib/daemon.h
> index 0941574963..42372d1463 100644
> --- a/lib/daemon.h
> +++ b/lib/daemon.h
> @@ -167,10 +167,10 @@ void set_detach(void);
>  bool get_detach(void);
>  void daemon_save_fd(int fd);
>  void daemonize(void);
> -void daemonize_start(bool access_datapath);
> +void daemonize_start(bool access_datapath, bool access_hardware_ports);
>  void daemonize_complete(void);
>  void daemon_set_new_user(const char * user_spec);
> -void daemon_become_new_user(bool access_datapath);
> +void daemon_become_new_user(bool access_datapath, bool access_hardware_ports);
>  void daemon_usage(void);
>  void daemon_disable_self_confinement(void);
>  bool daemon_should_self_confine(void);
> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> index 9569265fcb..e39e3c984c 100644
> --- a/vswitchd/ovs-vswitchd.8.in
> +++ b/vswitchd/ovs-vswitchd.8.in
> @@ -81,6 +81,12 @@ unavailable or unsuccessful.
>  .SS "DPDK Options"
>  For details on initializing \fBovs\-vswitchd\fR to use DPDK ports,
>  refer to the documentation or \fBovs\-vswitchd.conf.db\fR(5).
> +.SS "DPDK HW Access Options"
> +.IP "\fB\-\-hw\-access\fR"
> +Tells \fBovs\-vswitchd\fR to retain the \fBCAP_SYS_RAWIO\fR capability,
> +to allow userspace drivers access to raw hardware memory.  This will
> +also allow the \fBovs\-vswitchd\fR daemon to call \fBiopl()\fR and
> +\fBioperm()\fR functions to set port access.
>  .SS "Daemon Options"
>  .ds DD \
>  \fBovs\-vswitchd\fR detaches only after it has connected to the \
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 407bfc60eb..ff1527fd5b 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -60,6 +60,9 @@ VLOG_DEFINE_THIS_MODULE(vswitchd);
>   * the kernel from paging any of its memory to disk. */
>  static bool want_mlockall;
>  
> +/* --hw-access: If set, retains CAP_SYS_RAWIO privileges.  */
> +static bool hw_access;
> +
>  static unixctl_cb_func ovs_vswitchd_exit;
>  
>  static char *parse_options(int argc, char *argv[], char **unixctl_path);
> @@ -169,6 +172,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>          OPT_DPDK,
>          SSL_OPTION_ENUMS,
>          OPT_DUMMY_NUMA,
> +        OPT_HW_ACCESS,
>      };
>      static const struct option long_options[] = {
>          {"help",        no_argument, NULL, 'h'},
> @@ -185,6 +189,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>          {"disable-system-route", no_argument, NULL, OPT_DISABLE_SYSTEM_ROUTE},
>          {"dpdk", optional_argument, NULL, OPT_DPDK},
>          {"dummy-numa", required_argument, NULL, OPT_DUMMY_NUMA},
> +        {"hw-access", no_argument, NULL, OPT_HW_ACCESS},
>          {NULL, 0, NULL, 0},
>      };
>      char *short_options = ovs_cmdl_long_options_to_short_options(long_options);
> @@ -249,6 +254,10 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>              ovs_numa_set_dummy(optarg);
>              break;
>  
> +        case OPT_HW_ACCESS:
> +            hw_access = true;
> +            break;
> +
>          default:
>              abort();
>          }
> ---

Ugh, I completely forgot to update the daemonize_start() call in main,
so - yeah untested is right :)

It is also important to note that I went with a command like argument
because we can't use a database knob for this.  Using the database will
require that we delay dropping privileges until after a connection to
the DB has happened and that we have read the DB.

Maybe there's another approach that gives some control to the user.
Eelco Chaudron Feb. 9, 2023, 4 p.m. UTC | #12
On 9 Feb 2023, at 10:41, Gaetan Rivet wrote:

>> -----Original Message-----
>> From: dev <ovs-dev-bounces@openvswitch.org <mailto:ovs-dev-bounces@openvswitch.org>> on behalf of Eelco Chaudron <echaudro@redhat.com <mailto:echaudro@redhat.com>>
>> Date: Monday 23 January 2023 at 10:44
>> To: Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>>
>> Cc: ovs dev <dev@openvswitch.org <mailto:dev@openvswitch.org>> , Eli Britstein <elibr@nvidia.com <mailto:elibr@nvidia.com>> , Ilya Maximets <i.maximets@ovn.org <mailto:i.maximets@ovn.org>> , Maxime Coquelin <maxime.coquelin@redhat.com <mailto:maxime.coquelin@redhat.com>> , Jason Gunthorpe <jgg@nvidia.com <mailto:jgg@nvidia.com>> , Majd Dibbiny <majd@nvidia.com <mailto:majd@nvidia.com>> , David Marchand <david.marchand@redhat.com <mailto:david.marchand@redhat.com>>
>> Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user
>>
>> On 22 Mar 2021, at 15:21, David Marchand wrote:
>>
>>
>>> Hello Gaëtan,
>>>
>>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaetanr@nvidia.com <mailto:gaetanr@nvidia.com>> wrote:
>>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>>> which requires super privileged access. We are looking into ways to avoid it.
>>>
>>> Ok, thanks for looking into it.
>>
>>
>> Was any progress made on this? Or was the conclusion that this is the only way?
>>
>>
>
> Hello Eelco,
>
> I'll start by clarifying something: this issue is two-fold, even though there is a single RC.
> The previous email thread is about using rte_flow API with the mlx5 PMD,
> while the bug you described is about lack of TX on ports with no offloads inserted by the user.
>
> On the matter of the above rte_flow requirements: it was discussed, the conclusion
> was that nothing could be done with the current offload architecture.
> A patch was written to improve the logs and the documentation, but I see that it didn't
> make it to upstream DPDK. I have brought it to the attention of the DPDK team and
> it will be submitted.
>
>>>>
>>>> In the meantime, we failed to properly communicate this need in the rte_flow API.
>>>> We will improve the documentation and the error path in DPDK.
>>>
>>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>>> 2021-03-22T14:12:40.274Z|00001|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> rte_flow creation failed: 1 ((null)).
>>> 2021-03-22T14:12:40.274Z|00002|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>>
>>> First log is useless.
>>> This is more bugfixing than enhancement.
>>> Though logs do not need to tell the full story, they can point at the
>>> mlx5 pmd documentation where the full explanation is.
>>
>>
>>
>>
>> I was running into this issue also and spent a decent amount of time trying to figure out what was going on.
>> I did not have HW offload enabled yet, but just the basic VF/port representer configuration and no error messages or packets were arriving.
>> It Would be good to get some logging indicating the configuration/system was not valid. All I got was silent packet drops, but counters were incremented :(
>
> I was able to reproduce this issue. The DPDK team was adamant that it should have resulted in an error log,
> but none can be seen. The absence of log is a bug, however the behavior itself is intended.
>
> To give some more details, port probe should work without SYS_RAWIO, but start should fail.
> On start, the PMD installs hardware rules, using the same underlying API as rte_flow. It makes SYS_RAWIO
> a requirement for even basic port functions. This will remain the case in the future.
>
> Similarly, this issue has been brought to the attention of the DPDK team and they
> will make sure the user has an error log in that case, as well as clarifying the mlx5 doc.

The odd thing is that it seemed to work fine with our OVS2.17 package, linked with an older version of DPDK. Did something change in your driver (rdma-core) that made this no longer work?

//Eelco
diff mbox series

Patch

diff --git a/lib/daemon-unix.c b/lib/daemon-unix.c
index ae59ecf2c..d32a60657 100644
--- a/lib/daemon-unix.c
+++ b/lib/daemon-unix.c
@@ -820,6 +820,7 @@  daemon_become_new_user_linux(bool access_datapath OVS_UNUSED)
             if (access_datapath && !ret) {
                 ret = capng_update(CAPNG_ADD, cap_sets, CAP_NET_ADMIN)
                       || capng_update(CAPNG_ADD, cap_sets, CAP_NET_RAW)
+                      || capng_update(CAPNG_ADD, cap_sets, CAP_SYS_RAWIO)
                       || capng_update(CAPNG_ADD, cap_sets, CAP_NET_BROADCAST);
             }
         } else {