mbox series

[ovs-dev,v12,00/11] Add offload support for sFlow

Message ID 20210127062344.194230-1-cmi@nvidia.com
Headers show
Series Add offload support for sFlow | expand

Message

Chris Mi Jan. 27, 2021, 6:23 a.m. UTC
This patch set adds offload support for sFlow.

Psample is a genetlink channel for packet sampling. TC action act_sample
uses psample to send sampled packets to userspace.

When offloading sample action to TC, userspace creates a unique ID to
map sFlow action and tunnel info and passes this ID to kernel instead
of the sFlow info. psample will send this ID and sampled packet to
userspace. Using the ID, userspace can recover the sFlow info and send
sampled packet to the right sFlow monitoring host.

v2-v1:
- Fix robot errors.
v3-v2:
- Remove Gerrit Change-Id.
- Add patch #9 to fix older kernels build issue.
- Add travis test result.
v4-v3:
- Fix offload issue when sampling rate is 1.
v5-v4:
- Move polling thread from ofproto to netdev-offload-tc.
v6-v5:
- Rebase.
- Add GitHub Actions test result.
v7-v6:
- Remove Gerrit Change-Id.
- Fix "ERROR: Inappropriate spacing around cast"
v8-v7
- Address Eelco Chaudron's comment for patch #11.
v9-v8
- Remove sflow_len from struct dpif_sflow_attr.
- Log a debug message for other userspace actions.
v10-v9
- Address Eelco Chaudron's comments on v9.
v11-v10
- Fix a bracing error.
v12-v11
- Add duplicate sample group id check.

Chris Mi (11):
  compat: Add psample and tc sample action defines for older kernels
  ovs-kmod-ctl: Load kernel module psample
  dpif: Introduce register sFlow upcall callback API
  ofproto: Add upcall callback to process sFlow packet
  netdev-offload: Introduce register sFlow upcall callback API
  netdev-offload-tc: Implement register sFlow upcall callback API
  dpif-netlink: Implement register sFlow upcall callback API
  netdev-offload-tc: Introduce group ID management API
  netdev-offload-tc: Create psample netlink socket
  netdev-offload-tc: Add psample receive handler
  netdev-offload-tc: Add offload support for sFlow

 include/linux/automake.mk        |   4 +-
 include/linux/psample.h          |  58 +++
 include/linux/tc_act/tc_sample.h |  25 ++
 lib/dpif-netdev.c                |   1 +
 lib/dpif-netlink.c               |  27 ++
 lib/dpif-netlink.h               |   4 +
 lib/dpif-provider.h              |  10 +
 lib/dpif.c                       |   8 +
 lib/dpif.h                       |  23 ++
 lib/netdev-offload-provider.h    |   3 +
 lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
 lib/netdev-offload.c             |  30 ++
 lib/netdev-offload.h             |   4 +
 lib/tc.c                         |  61 ++-
 lib/tc.h                         |  16 +-
 ofproto/ofproto-dpif-upcall.c    |  42 ++
 utilities/ovs-kmod-ctl.in        |  14 +
 17 files changed, 973 insertions(+), 16 deletions(-)
 create mode 100644 include/linux/psample.h
 create mode 100644 include/linux/tc_act/tc_sample.h

Comments

Roi Dayan Feb. 23, 2021, 9:08 a.m. UTC | #1
On 2021-01-27 8:23 AM, Chris Mi wrote:
> This patch set adds offload support for sFlow.
> 
> Psample is a genetlink channel for packet sampling. TC action act_sample
> uses psample to send sampled packets to userspace.
> 
> When offloading sample action to TC, userspace creates a unique ID to
> map sFlow action and tunnel info and passes this ID to kernel instead
> of the sFlow info. psample will send this ID and sampled packet to
> userspace. Using the ID, userspace can recover the sFlow info and send
> sampled packet to the right sFlow monitoring host.
> 
> v2-v1:
> - Fix robot errors.
> v3-v2:
> - Remove Gerrit Change-Id.
> - Add patch #9 to fix older kernels build issue.
> - Add travis test result.
> v4-v3:
> - Fix offload issue when sampling rate is 1.
> v5-v4:
> - Move polling thread from ofproto to netdev-offload-tc.
> v6-v5:
> - Rebase.
> - Add GitHub Actions test result.
> v7-v6:
> - Remove Gerrit Change-Id.
> - Fix "ERROR: Inappropriate spacing around cast"
> v8-v7
> - Address Eelco Chaudron's comment for patch #11.
> v9-v8
> - Remove sflow_len from struct dpif_sflow_attr.
> - Log a debug message for other userspace actions.
> v10-v9
> - Address Eelco Chaudron's comments on v9.
> v11-v10
> - Fix a bracing error.
> v12-v11
> - Add duplicate sample group id check.
> 
> Chris Mi (11):
>    compat: Add psample and tc sample action defines for older kernels
>    ovs-kmod-ctl: Load kernel module psample
>    dpif: Introduce register sFlow upcall callback API
>    ofproto: Add upcall callback to process sFlow packet
>    netdev-offload: Introduce register sFlow upcall callback API
>    netdev-offload-tc: Implement register sFlow upcall callback API
>    dpif-netlink: Implement register sFlow upcall callback API
>    netdev-offload-tc: Introduce group ID management API
>    netdev-offload-tc: Create psample netlink socket
>    netdev-offload-tc: Add psample receive handler
>    netdev-offload-tc: Add offload support for sFlow
> 
>   include/linux/automake.mk        |   4 +-
>   include/linux/psample.h          |  58 +++
>   include/linux/tc_act/tc_sample.h |  25 ++
>   lib/dpif-netdev.c                |   1 +
>   lib/dpif-netlink.c               |  27 ++
>   lib/dpif-netlink.h               |   4 +
>   lib/dpif-provider.h              |  10 +
>   lib/dpif.c                       |   8 +
>   lib/dpif.h                       |  23 ++
>   lib/netdev-offload-provider.h    |   3 +
>   lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>   lib/netdev-offload.c             |  30 ++
>   lib/netdev-offload.h             |   4 +
>   lib/tc.c                         |  61 ++-
>   lib/tc.h                         |  16 +-
>   ofproto/ofproto-dpif-upcall.c    |  42 ++
>   utilities/ovs-kmod-ctl.in        |  14 +
>   17 files changed, 973 insertions(+), 16 deletions(-)
>   create mode 100644 include/linux/psample.h
>   create mode 100644 include/linux/tc_act/tc_sample.h
> 


Hi Simon, Ilya,

Can you help review for this series?
do you have any comments you want us to handle?

Thanks,
Roi
Eelco Chaudron Feb. 23, 2021, 11:11 a.m. UTC | #2
On 23 Feb 2021, at 10:08, Roi Dayan wrote:

> On 2021-01-27 8:23 AM, Chris Mi wrote:
>> This patch set adds offload support for sFlow.
>>
>> Psample is a genetlink channel for packet sampling. TC action 
>> act_sample
>> uses psample to send sampled packets to userspace.
>>
>> When offloading sample action to TC, userspace creates a unique ID to
>> map sFlow action and tunnel info and passes this ID to kernel instead
>> of the sFlow info. psample will send this ID and sampled packet to
>> userspace. Using the ID, userspace can recover the sFlow info and 
>> send
>> sampled packet to the right sFlow monitoring host.
>>
>> v2-v1:
>> - Fix robot errors.
>> v3-v2:
>> - Remove Gerrit Change-Id.
>> - Add patch #9 to fix older kernels build issue.
>> - Add travis test result.
>> v4-v3:
>> - Fix offload issue when sampling rate is 1.
>> v5-v4:
>> - Move polling thread from ofproto to netdev-offload-tc.
>> v6-v5:
>> - Rebase.
>> - Add GitHub Actions test result.
>> v7-v6:
>> - Remove Gerrit Change-Id.
>> - Fix "ERROR: Inappropriate spacing around cast"
>> v8-v7
>> - Address Eelco Chaudron's comment for patch #11.
>> v9-v8
>> - Remove sflow_len from struct dpif_sflow_attr.
>> - Log a debug message for other userspace actions.
>> v10-v9
>> - Address Eelco Chaudron's comments on v9.
>> v11-v10
>> - Fix a bracing error.
>> v12-v11
>> - Add duplicate sample group id check.
>>
>> Chris Mi (11):
>>    compat: Add psample and tc sample action defines for older kernels
>>    ovs-kmod-ctl: Load kernel module psample
>>    dpif: Introduce register sFlow upcall callback API
>>    ofproto: Add upcall callback to process sFlow packet
>>    netdev-offload: Introduce register sFlow upcall callback API
>>    netdev-offload-tc: Implement register sFlow upcall callback API
>>    dpif-netlink: Implement register sFlow upcall callback API
>>    netdev-offload-tc: Introduce group ID management API
>>    netdev-offload-tc: Create psample netlink socket
>>    netdev-offload-tc: Add psample receive handler
>>    netdev-offload-tc: Add offload support for sFlow
>>
>>   include/linux/automake.mk        |   4 +-
>>   include/linux/psample.h          |  58 +++
>>   include/linux/tc_act/tc_sample.h |  25 ++
>>   lib/dpif-netdev.c                |   1 +
>>   lib/dpif-netlink.c               |  27 ++
>>   lib/dpif-netlink.h               |   4 +
>>   lib/dpif-provider.h              |  10 +
>>   lib/dpif.c                       |   8 +
>>   lib/dpif.h                       |  23 ++
>>   lib/netdev-offload-provider.h    |   3 +
>>   lib/netdev-offload-tc.c          | 659 
>> ++++++++++++++++++++++++++++++-
>>   lib/netdev-offload.c             |  30 ++
>>   lib/netdev-offload.h             |   4 +
>>   lib/tc.c                         |  61 ++-
>>   lib/tc.h                         |  16 +-
>>   ofproto/ofproto-dpif-upcall.c    |  42 ++
>>   utilities/ovs-kmod-ctl.in        |  14 +
>>   17 files changed, 973 insertions(+), 16 deletions(-)
>>   create mode 100644 include/linux/psample.h
>>   create mode 100644 include/linux/tc_act/tc_sample.h
>>
>
>
> Hi Simon, Ilya,
>
> Can you help review for this series?
> do you have any comments you want us to handle?

See my comments on v11. We need some more inputs on those so I could 
review v12, or we need a v13.

//Eelco
Eelco Chaudron Feb. 23, 2021, 11:15 a.m. UTC | #3
On 23 Feb 2021, at 12:11, Eelco Chaudron wrote:

> On 23 Feb 2021, at 10:08, Roi Dayan wrote:
>
>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>> This patch set adds offload support for sFlow.
>>>
>>> Psample is a genetlink channel for packet sampling. TC action 
>>> act_sample
>>> uses psample to send sampled packets to userspace.
>>>
>>> When offloading sample action to TC, userspace creates a unique ID 
>>> to
>>> map sFlow action and tunnel info and passes this ID to kernel 
>>> instead
>>> of the sFlow info. psample will send this ID and sampled packet to
>>> userspace. Using the ID, userspace can recover the sFlow info and 
>>> send
>>> sampled packet to the right sFlow monitoring host.
>>>
>>> v2-v1:
>>> - Fix robot errors.
>>> v3-v2:
>>> - Remove Gerrit Change-Id.
>>> - Add patch #9 to fix older kernels build issue.
>>> - Add travis test result.
>>> v4-v3:
>>> - Fix offload issue when sampling rate is 1.
>>> v5-v4:
>>> - Move polling thread from ofproto to netdev-offload-tc.
>>> v6-v5:
>>> - Rebase.
>>> - Add GitHub Actions test result.
>>> v7-v6:
>>> - Remove Gerrit Change-Id.
>>> - Fix "ERROR: Inappropriate spacing around cast"
>>> v8-v7
>>> - Address Eelco Chaudron's comment for patch #11.
>>> v9-v8
>>> - Remove sflow_len from struct dpif_sflow_attr.
>>> - Log a debug message for other userspace actions.
>>> v10-v9
>>> - Address Eelco Chaudron's comments on v9.
>>> v11-v10
>>> - Fix a bracing error.
>>> v12-v11
>>> - Add duplicate sample group id check.
>>>
>>> Chris Mi (11):
>>>    compat: Add psample and tc sample action defines for older 
>>> kernels
>>>    ovs-kmod-ctl: Load kernel module psample
>>>    dpif: Introduce register sFlow upcall callback API
>>>    ofproto: Add upcall callback to process sFlow packet
>>>    netdev-offload: Introduce register sFlow upcall callback API
>>>    netdev-offload-tc: Implement register sFlow upcall callback API
>>>    dpif-netlink: Implement register sFlow upcall callback API
>>>    netdev-offload-tc: Introduce group ID management API
>>>    netdev-offload-tc: Create psample netlink socket
>>>    netdev-offload-tc: Add psample receive handler
>>>    netdev-offload-tc: Add offload support for sFlow
>>>
>>>   include/linux/automake.mk        |   4 +-
>>>   include/linux/psample.h          |  58 +++
>>>   include/linux/tc_act/tc_sample.h |  25 ++
>>>   lib/dpif-netdev.c                |   1 +
>>>   lib/dpif-netlink.c               |  27 ++
>>>   lib/dpif-netlink.h               |   4 +
>>>   lib/dpif-provider.h              |  10 +
>>>   lib/dpif.c                       |   8 +
>>>   lib/dpif.h                       |  23 ++
>>>   lib/netdev-offload-provider.h    |   3 +
>>>   lib/netdev-offload-tc.c          | 659 
>>> ++++++++++++++++++++++++++++++-
>>>   lib/netdev-offload.c             |  30 ++
>>>   lib/netdev-offload.h             |   4 +
>>>   lib/tc.c                         |  61 ++-
>>>   lib/tc.h                         |  16 +-
>>>   ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>   utilities/ovs-kmod-ctl.in        |  14 +
>>>   17 files changed, 973 insertions(+), 16 deletions(-)
>>>   create mode 100644 include/linux/psample.h
>>>   create mode 100644 include/linux/tc_act/tc_sample.h
>>>
>>
>>
>> Hi Simon, Ilya,
>>
>> Can you help review for this series?
>> do you have any comments you want us to handle?
>
> See my comments on v11. We need some more inputs on those so I could 
> review v12, or we need a v13.

Oops, my comments on the v9 patches, to be exact:

   [PATCH v9 08/11] netdev-offload-tc: Introduce group ID management API
   [PATCH v9 10/11] netdev-offload-tc: Add psample receive handler
Chris Mi March 1, 2021, 8:30 a.m. UTC | #4
Hi Simon, Ilya,

Could I know what should we do to make progress for this patch set?
It has been posted in the community for a long time 😁

Thanks,
Chris

On 2/23/2021 5:08 PM, Roi Dayan wrote:
>
>
> On 2021-01-27 8:23 AM, Chris Mi wrote:
>> This patch set adds offload support for sFlow.
>>
>> Psample is a genetlink channel for packet sampling. TC action act_sample
>> uses psample to send sampled packets to userspace.
>>
>> When offloading sample action to TC, userspace creates a unique ID to
>> map sFlow action and tunnel info and passes this ID to kernel instead
>> of the sFlow info. psample will send this ID and sampled packet to
>> userspace. Using the ID, userspace can recover the sFlow info and send
>> sampled packet to the right sFlow monitoring host.
>>
>> v2-v1:
>> - Fix robot errors.
>> v3-v2:
>> - Remove Gerrit Change-Id.
>> - Add patch #9 to fix older kernels build issue.
>> - Add travis test result.
>> v4-v3:
>> - Fix offload issue when sampling rate is 1.
>> v5-v4:
>> - Move polling thread from ofproto to netdev-offload-tc.
>> v6-v5:
>> - Rebase.
>> - Add GitHub Actions test result.
>> v7-v6:
>> - Remove Gerrit Change-Id.
>> - Fix "ERROR: Inappropriate spacing around cast"
>> v8-v7
>> - Address Eelco Chaudron's comment for patch #11.
>> v9-v8
>> - Remove sflow_len from struct dpif_sflow_attr.
>> - Log a debug message for other userspace actions.
>> v10-v9
>> - Address Eelco Chaudron's comments on v9.
>> v11-v10
>> - Fix a bracing error.
>> v12-v11
>> - Add duplicate sample group id check.
>>
>> Chris Mi (11):
>>    compat: Add psample and tc sample action defines for older kernels
>>    ovs-kmod-ctl: Load kernel module psample
>>    dpif: Introduce register sFlow upcall callback API
>>    ofproto: Add upcall callback to process sFlow packet
>>    netdev-offload: Introduce register sFlow upcall callback API
>>    netdev-offload-tc: Implement register sFlow upcall callback API
>>    dpif-netlink: Implement register sFlow upcall callback API
>>    netdev-offload-tc: Introduce group ID management API
>>    netdev-offload-tc: Create psample netlink socket
>>    netdev-offload-tc: Add psample receive handler
>>    netdev-offload-tc: Add offload support for sFlow
>>
>>   include/linux/automake.mk        |   4 +-
>>   include/linux/psample.h          |  58 +++
>>   include/linux/tc_act/tc_sample.h |  25 ++
>>   lib/dpif-netdev.c                |   1 +
>>   lib/dpif-netlink.c               |  27 ++
>>   lib/dpif-netlink.h               |   4 +
>>   lib/dpif-provider.h              |  10 +
>>   lib/dpif.c                       |   8 +
>>   lib/dpif.h                       |  23 ++
>>   lib/netdev-offload-provider.h    |   3 +
>>   lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>   lib/netdev-offload.c             |  30 ++
>>   lib/netdev-offload.h             |   4 +
>>   lib/tc.c                         |  61 ++-
>>   lib/tc.h                         |  16 +-
>>   ofproto/ofproto-dpif-upcall.c    |  42 ++
>>   utilities/ovs-kmod-ctl.in        |  14 +
>>   17 files changed, 973 insertions(+), 16 deletions(-)
>>   create mode 100644 include/linux/psample.h
>>   create mode 100644 include/linux/tc_act/tc_sample.h
>>
>
>
> Hi Simon, Ilya,
>
> Can you help review for this series?
> do you have any comments you want us to handle?
>
> Thanks,
> Roi
Ilya Maximets March 1, 2021, 12:48 p.m. UTC | #5
On 3/1/21 9:30 AM, Chris Mi wrote:
> Hi Simon, Ilya,
> 
> Could I know what should we do to make progress for this patch set?
> It has been posted in the community for a long time 😁

In general, the way to get your patches reviewed is to review other
patches.  It's simply because we still have a huge review backlog
(214 patches right now in patchwork and most of them needs review)
and bugfixes usually has a bit higher priority.  By reviewing other
patches you're reducing amount of work for maintainers so they can
get to your patches faster.

For the series and comments from Eelco:
I didn't read the patches carefully, only a quick glance, but I still
do not understand why we need a separate thread to poll psample events.
Why can't we just allow usual handler threads to do that?  From the
architecture perspective it's not a good thing to call ofproto code
from the offload-provider.  This introduces lots of complications
and might cause lots of issues in the future.

I'd say that design should look like this:

  handler thread ->
    dpif_recv() ->
      dpif_netlink_recv() ->
        netdev_offload_recv() ->
          netdev_offload_tc_recv() ->
            nl_sock_recv()

Last three calls should be implemented.

The better version of this will be to throw away dpif part from
above call chain and make it:

  handler thread ->
    netdev_offload_recv() ->
      netdev_offload_tc_recv() ->
        nl_sock_recv()

This way we could avoid touching dpif-netdev and still have psample
offloading for the case where netdev-offload-tc is used from the
userspace datapath.

Above architecture also implies implementation of:
 - netdev_offload_recv_wait()
 - netdev_offload_recv_purge()
 - and the netdev_offload_tc_* counterparts.

Best regards, Ilya Maximets.

> 
> Thanks,
> Chris
> 
> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>
>>
>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>> This patch set adds offload support for sFlow.
>>>
>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>> uses psample to send sampled packets to userspace.
>>>
>>> When offloading sample action to TC, userspace creates a unique ID to
>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>> of the sFlow info. psample will send this ID and sampled packet to
>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>> sampled packet to the right sFlow monitoring host.
>>>
>>> v2-v1:
>>> - Fix robot errors.
>>> v3-v2:
>>> - Remove Gerrit Change-Id.
>>> - Add patch #9 to fix older kernels build issue.
>>> - Add travis test result.
>>> v4-v3:
>>> - Fix offload issue when sampling rate is 1.
>>> v5-v4:
>>> - Move polling thread from ofproto to netdev-offload-tc.
>>> v6-v5:
>>> - Rebase.
>>> - Add GitHub Actions test result.
>>> v7-v6:
>>> - Remove Gerrit Change-Id.
>>> - Fix "ERROR: Inappropriate spacing around cast"
>>> v8-v7
>>> - Address Eelco Chaudron's comment for patch #11.
>>> v9-v8
>>> - Remove sflow_len from struct dpif_sflow_attr.
>>> - Log a debug message for other userspace actions.
>>> v10-v9
>>> - Address Eelco Chaudron's comments on v9.
>>> v11-v10
>>> - Fix a bracing error.
>>> v12-v11
>>> - Add duplicate sample group id check.
>>>
>>> Chris Mi (11):
>>>    compat: Add psample and tc sample action defines for older kernels
>>>    ovs-kmod-ctl: Load kernel module psample
>>>    dpif: Introduce register sFlow upcall callback API
>>>    ofproto: Add upcall callback to process sFlow packet
>>>    netdev-offload: Introduce register sFlow upcall callback API
>>>    netdev-offload-tc: Implement register sFlow upcall callback API
>>>    dpif-netlink: Implement register sFlow upcall callback API
>>>    netdev-offload-tc: Introduce group ID management API
>>>    netdev-offload-tc: Create psample netlink socket
>>>    netdev-offload-tc: Add psample receive handler
>>>    netdev-offload-tc: Add offload support for sFlow
>>>
>>>   include/linux/automake.mk        |   4 +-
>>>   include/linux/psample.h          |  58 +++
>>>   include/linux/tc_act/tc_sample.h |  25 ++
>>>   lib/dpif-netdev.c                |   1 +
>>>   lib/dpif-netlink.c               |  27 ++
>>>   lib/dpif-netlink.h               |   4 +
>>>   lib/dpif-provider.h              |  10 +
>>>   lib/dpif.c                       |   8 +
>>>   lib/dpif.h                       |  23 ++
>>>   lib/netdev-offload-provider.h    |   3 +
>>>   lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>   lib/netdev-offload.c             |  30 ++
>>>   lib/netdev-offload.h             |   4 +
>>>   lib/tc.c                         |  61 ++-
>>>   lib/tc.h                         |  16 +-
>>>   ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>   utilities/ovs-kmod-ctl.in        |  14 +
>>>   17 files changed, 973 insertions(+), 16 deletions(-)
>>>   create mode 100644 include/linux/psample.h
>>>   create mode 100644 include/linux/tc_act/tc_sample.h
>>>
>>
>>
>> Hi Simon, Ilya,
>>
>> Can you help review for this series?
>> do you have any comments you want us to handle?
>>
>> Thanks,
>> Roi
>
Chris Mi March 1, 2021, 1:25 p.m. UTC | #6
On 3/1/2021 8:48 PM, Ilya Maximets wrote:
> On 3/1/21 9:30 AM, Chris Mi wrote:
>> Hi Simon, Ilya,
>>
>> Could I know what should we do to make progress for this patch set?
>> It has been posted in the community for a long time 😁
> In general, the way to get your patches reviewed is to review other
> patches.  It's simply because we still have a huge review backlog
> (214 patches right now in patchwork and most of them needs review)
> and bugfixes usually has a bit higher priority.  By reviewing other
> patches you're reducing amount of work for maintainers so they can
> get to your patches faster.
>
> For the series and comments from Eelco:
> I didn't read the patches carefully, only a quick glance, but I still
> do not understand why we need a separate thread to poll psample events.
> Why can't we just allow usual handler threads to do that?  From the
> architecture perspective it's not a good thing to call ofproto code
> from the offload-provider.  This introduces lots of complications
> and might cause lots of issues in the future.
>
> I'd say that design should look like this:
>
>    handler thread ->
>      dpif_recv() ->
>        dpif_netlink_recv() ->
>          netdev_offload_recv() ->
>            netdev_offload_tc_recv() ->
>              nl_sock_recv()
>
> Last three calls should be implemented.
>
> The better version of this will be to throw away dpif part from
> above call chain and make it:
>
>    handler thread ->
>      netdev_offload_recv() ->
>        netdev_offload_tc_recv() ->
>          nl_sock_recv()
>
> This way we could avoid touching dpif-netdev and still have psample
> offloading for the case where netdev-offload-tc is used from the
> userspace datapath.
>
> Above architecture also implies implementation of:
>   - netdev_offload_recv_wait()
>   - netdev_offload_recv_purge()
>   - and the netdev_offload_tc_* counterparts.
Thanks for your above suggestion, Ilya. I'll check what need to be improved.

Regards,
Chris
> Best regards, Ilya Maximets.
>
>> Thanks,
>> Chris
>>
>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>
>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>> This patch set adds offload support for sFlow.
>>>>
>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>> uses psample to send sampled packets to userspace.
>>>>
>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>> sampled packet to the right sFlow monitoring host.
>>>>
>>>> v2-v1:
>>>> - Fix robot errors.
>>>> v3-v2:
>>>> - Remove Gerrit Change-Id.
>>>> - Add patch #9 to fix older kernels build issue.
>>>> - Add travis test result.
>>>> v4-v3:
>>>> - Fix offload issue when sampling rate is 1.
>>>> v5-v4:
>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>> v6-v5:
>>>> - Rebase.
>>>> - Add GitHub Actions test result.
>>>> v7-v6:
>>>> - Remove Gerrit Change-Id.
>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>> v8-v7
>>>> - Address Eelco Chaudron's comment for patch #11.
>>>> v9-v8
>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>> - Log a debug message for other userspace actions.
>>>> v10-v9
>>>> - Address Eelco Chaudron's comments on v9.
>>>> v11-v10
>>>> - Fix a bracing error.
>>>> v12-v11
>>>> - Add duplicate sample group id check.
>>>>
>>>> Chris Mi (11):
>>>>     compat: Add psample and tc sample action defines for older kernels
>>>>     ovs-kmod-ctl: Load kernel module psample
>>>>     dpif: Introduce register sFlow upcall callback API
>>>>     ofproto: Add upcall callback to process sFlow packet
>>>>     netdev-offload: Introduce register sFlow upcall callback API
>>>>     netdev-offload-tc: Implement register sFlow upcall callback API
>>>>     dpif-netlink: Implement register sFlow upcall callback API
>>>>     netdev-offload-tc: Introduce group ID management API
>>>>     netdev-offload-tc: Create psample netlink socket
>>>>     netdev-offload-tc: Add psample receive handler
>>>>     netdev-offload-tc: Add offload support for sFlow
>>>>
>>>>    include/linux/automake.mk        |   4 +-
>>>>    include/linux/psample.h          |  58 +++
>>>>    include/linux/tc_act/tc_sample.h |  25 ++
>>>>    lib/dpif-netdev.c                |   1 +
>>>>    lib/dpif-netlink.c               |  27 ++
>>>>    lib/dpif-netlink.h               |   4 +
>>>>    lib/dpif-provider.h              |  10 +
>>>>    lib/dpif.c                       |   8 +
>>>>    lib/dpif.h                       |  23 ++
>>>>    lib/netdev-offload-provider.h    |   3 +
>>>>    lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>    lib/netdev-offload.c             |  30 ++
>>>>    lib/netdev-offload.h             |   4 +
>>>>    lib/tc.c                         |  61 ++-
>>>>    lib/tc.h                         |  16 +-
>>>>    ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>    utilities/ovs-kmod-ctl.in        |  14 +
>>>>    17 files changed, 973 insertions(+), 16 deletions(-)
>>>>    create mode 100644 include/linux/psample.h
>>>>    create mode 100644 include/linux/tc_act/tc_sample.h
>>>>
>>>
>>> Hi Simon, Ilya,
>>>
>>> Can you help review for this series?
>>> do you have any comments you want us to handle?
>>>
>>> Thanks,
>>> Roi
Chris Mi March 5, 2021, 3:27 a.m. UTC | #7
Hi Ilya,

I think about your suggestion recently. But I'm still not very clear 
about the design.
Please see my reply below:

On 3/1/2021 8:48 PM, Ilya Maximets wrote:
> On 3/1/21 9:30 AM, Chris Mi wrote:
>> Hi Simon, Ilya,
>>
>> Could I know what should we do to make progress for this patch set?
>> It has been posted in the community for a long time 😁
> In general, the way to get your patches reviewed is to review other
> patches.  It's simply because we still have a huge review backlog
> (214 patches right now in patchwork and most of them needs review)
> and bugfixes usually has a bit higher priority.  By reviewing other
> patches you're reducing amount of work for maintainers so they can
> get to your patches faster.
OK, I see.
>
> For the series and comments from Eelco:
> I didn't read the patches carefully, only a quick glance, but I still
> do not understand why we need a separate thread to poll psample events.
> Why can't we just allow usual handler threads to do that?
I'm not sure if you are aware of that the psample netlink is different 
from the ovs
netlink. Without offload, kernel sends missed packet and sFlow packet to 
userspace
using the same netlink 'ovs_packet_family'. So they can use the same 
handler thread.
But in order to offload sFlow action, we use psample kernel module to 
send sampled
packets from kernel to userspace. The format for ovs netlink message and 
psample
netlink messages are different.
>    From the
> architecture perspective it's not a good thing to call ofproto code
> from the offload-provider.  This introduces lots of complications
> and might cause lots of issues in the future.
>
> I'd say that design should look like this:
>
>    handler thread ->
>      dpif_recv() ->
>        dpif_netlink_recv() ->
>          netdev_offload_recv() ->
>            netdev_offload_tc_recv() ->
>              nl_sock_recv()
In order to use the handler thread, I'm not sure if we should add 
psample socket
fd to every handler's epoll_fd. If we should do that, we should consider 
how to
differentiate if the event comes from ovs or psample netlink. Maybe we 
should
allocate a special port number for psample and assign it to event.data.u32.
Anyway, that's the details. If this is the right direction, I'll think 
about it.
>
> Last three calls should be implemented.
>
> The better version of this will be to throw away dpif part from
> above call chain and make it:
>
>    handler thread ->
>      netdev_offload_recv() ->
>        netdev_offload_tc_recv() ->
>          nl_sock_recv()
If we throw away dpif part, maybe we have to write some duplicate epoll code
for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
psample socket fd to every handler's epoll_fd. So we have to change the dpif
somehow.

I don't know if I understand your suggestion correctly. Or I missed anything
So if you have time, could you please elaborate?

Thanks,
Chris
>
> This way we could avoid touching dpif-netdev and still have psample
> offloading for the case where netdev-offload-tc is used from the
> userspace datapath.
>
> Above architecture also implies implementation of:
>   - netdev_offload_recv_wait()
>   - netdev_offload_recv_purge()
>   - and the netdev_offload_tc_* counterparts.
>
> Best regards, Ilya Maximets.
>
>> Thanks,
>> Chris
>>
>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>
>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>> This patch set adds offload support for sFlow.
>>>>
>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>> uses psample to send sampled packets to userspace.
>>>>
>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>> sampled packet to the right sFlow monitoring host.
>>>>
>>>> v2-v1:
>>>> - Fix robot errors.
>>>> v3-v2:
>>>> - Remove Gerrit Change-Id.
>>>> - Add patch #9 to fix older kernels build issue.
>>>> - Add travis test result.
>>>> v4-v3:
>>>> - Fix offload issue when sampling rate is 1.
>>>> v5-v4:
>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>> v6-v5:
>>>> - Rebase.
>>>> - Add GitHub Actions test result.
>>>> v7-v6:
>>>> - Remove Gerrit Change-Id.
>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>> v8-v7
>>>> - Address Eelco Chaudron's comment for patch #11.
>>>> v9-v8
>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>> - Log a debug message for other userspace actions.
>>>> v10-v9
>>>> - Address Eelco Chaudron's comments on v9.
>>>> v11-v10
>>>> - Fix a bracing error.
>>>> v12-v11
>>>> - Add duplicate sample group id check.
>>>>
>>>> Chris Mi (11):
>>>>     compat: Add psample and tc sample action defines for older kernels
>>>>     ovs-kmod-ctl: Load kernel module psample
>>>>     dpif: Introduce register sFlow upcall callback API
>>>>     ofproto: Add upcall callback to process sFlow packet
>>>>     netdev-offload: Introduce register sFlow upcall callback API
>>>>     netdev-offload-tc: Implement register sFlow upcall callback API
>>>>     dpif-netlink: Implement register sFlow upcall callback API
>>>>     netdev-offload-tc: Introduce group ID management API
>>>>     netdev-offload-tc: Create psample netlink socket
>>>>     netdev-offload-tc: Add psample receive handler
>>>>     netdev-offload-tc: Add offload support for sFlow
>>>>
>>>>    include/linux/automake.mk        |   4 +-
>>>>    include/linux/psample.h          |  58 +++
>>>>    include/linux/tc_act/tc_sample.h |  25 ++
>>>>    lib/dpif-netdev.c                |   1 +
>>>>    lib/dpif-netlink.c               |  27 ++
>>>>    lib/dpif-netlink.h               |   4 +
>>>>    lib/dpif-provider.h              |  10 +
>>>>    lib/dpif.c                       |   8 +
>>>>    lib/dpif.h                       |  23 ++
>>>>    lib/netdev-offload-provider.h    |   3 +
>>>>    lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>    lib/netdev-offload.c             |  30 ++
>>>>    lib/netdev-offload.h             |   4 +
>>>>    lib/tc.c                         |  61 ++-
>>>>    lib/tc.h                         |  16 +-
>>>>    ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>    utilities/ovs-kmod-ctl.in        |  14 +
>>>>    17 files changed, 973 insertions(+), 16 deletions(-)
>>>>    create mode 100644 include/linux/psample.h
>>>>    create mode 100644 include/linux/tc_act/tc_sample.h
>>>>
>>>
>>> Hi Simon, Ilya,
>>>
>>> Can you help review for this series?
>>> do you have any comments you want us to handle?
>>>
>>> Thanks,
>>> Roi
Ilya Maximets March 23, 2021, 2:24 p.m. UTC | #8
On 3/5/21 4:27 AM, Chris Mi wrote:
> Hi Ilya,
> 
> I think about your suggestion recently. But I'm still not very clear about the design.
> Please see my reply below:
> 
> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>> Hi Simon, Ilya,
>>>
>>> Could I know what should we do to make progress for this patch set?
>>> It has been posted in the community for a long time 😁
>> In general, the way to get your patches reviewed is to review other
>> patches.  It's simply because we still have a huge review backlog
>> (214 patches right now in patchwork and most of them needs review)
>> and bugfixes usually has a bit higher priority.  By reviewing other
>> patches you're reducing amount of work for maintainers so they can
>> get to your patches faster.
> OK, I see.
>>
>> For the series and comments from Eelco:
>> I didn't read the patches carefully, only a quick glance, but I still
>> do not understand why we need a separate thread to poll psample events.
>> Why can't we just allow usual handler threads to do that?
> I'm not sure if you are aware of that the psample netlink is different from the ovs
> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
> But in order to offload sFlow action, we use psample kernel module to send sampled
> packets from kernel to userspace. The format for ovs netlink message and psample
> netlink messages are different.

Hi.  Sorry for late reply.

Yes, I understand that message format is different, but it doesn't really
matter.  All the message-specific handling and parsing should happen
inside the netdev_offload_tc_recv().  This function should have a prototype
similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
struct dpif_upcall from the caller and fill it with data.  Maybe other
type of a generic data structure if it's really not possible to construct
struct dpif_upcall.


>>    From the
>> architecture perspective it's not a good thing to call ofproto code
>> from the offload-provider.  This introduces lots of complications
>> and might cause lots of issues in the future.
>>
>> I'd say that design should look like this:
>>
>>    handler thread ->
>>      dpif_recv() ->
>>        dpif_netlink_recv() ->
>>          netdev_offload_recv() ->
>>            netdev_offload_tc_recv() ->
>>              nl_sock_recv()
> In order to use the handler thread, I'm not sure if we should add psample socket
> fd to every handler's epoll_fd. If we should do that, we should consider how to
> differentiate if the event comes from ovs or psample netlink. Maybe we should
> allocate a special port number for psample and assign it to event.data.u32.
> Anyway, that's the details. If this is the right direction, I'll think about it.
>>
>> Last three calls should be implemented.
>>
>> The better version of this will be to throw away dpif part from
>> above call chain and make it:
>>
>>    handler thread ->
>>      netdev_offload_recv() ->
>>        netdev_offload_tc_recv() ->
>>          nl_sock_recv()
> If we throw away dpif part, maybe we have to write some duplicate epoll code
> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
> psample socket fd to every handler's epoll_fd. So we have to change the dpif
> somehow.

There is no need to implement any epoll_fd logic for psample.
You may only use handler with handler_id == 0.  Only this handler will receive
psample upcalls.  netdev_offload_recv_wait() should be implemented similar
to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
There is no need to block and you're never blocking anywhere because your're
calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
actually receiving a message.

So, there should be several call chains:

1. Init.

   open_dpif_backer/type_run() ->
     netdev_offload_recv_set() ->
       netdev_offload_tc_recv_set(enabled) ->
         if (enable)
           psample_sock = nl_sock_create(...);
         else
           close(psample_sock);

2. Wait.

   udpif_upcall_handler() ->
     netdev_offload_recv_wait() ->
       netdev_offload_tc_recv_wait() ->
         if (handler_id == 0)
           nl_sock_wait(psample_sock, POLLIN);

3. Receive.

   udpif_upcall_handler() ->
     recv_upcalls() ->
       netdev_offload_recv() ->
       | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
       |   if (handler_id == 0)
       |     nl_sock_recv(psample_socket, ..., wait = false);
       |     *upcall = <received data>;
       upcall_receive()
       process_upcall() ->
         dpif_sflow_received()

4. Deinit.

   close_dpif_backer() ->
     netdev_offload_recv_set(enable = false);


Does that look more clear?

> 
> I don't know if I understand your suggestion correctly. Or I missed anything
> So if you have time, could you please elaborate?
> 
> Thanks,
> Chris
>>
>> This way we could avoid touching dpif-netdev and still have psample
>> offloading for the case where netdev-offload-tc is used from the
>> userspace datapath.
>>
>> Above architecture also implies implementation of:
>>   - netdev_offload_recv_wait()
>>   - netdev_offload_recv_purge()
>>   - and the netdev_offload_tc_* counterparts.
>>
>> Best regards, Ilya Maximets.
>>
>>> Thanks,
>>> Chris
>>>
>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>
>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>> This patch set adds offload support for sFlow.
>>>>>
>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>> uses psample to send sampled packets to userspace.
>>>>>
>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>> sampled packet to the right sFlow monitoring host.
>>>>>
>>>>> v2-v1:
>>>>> - Fix robot errors.
>>>>> v3-v2:
>>>>> - Remove Gerrit Change-Id.
>>>>> - Add patch #9 to fix older kernels build issue.
>>>>> - Add travis test result.
>>>>> v4-v3:
>>>>> - Fix offload issue when sampling rate is 1.
>>>>> v5-v4:
>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>> v6-v5:
>>>>> - Rebase.
>>>>> - Add GitHub Actions test result.
>>>>> v7-v6:
>>>>> - Remove Gerrit Change-Id.
>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>> v8-v7
>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>> v9-v8
>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>> - Log a debug message for other userspace actions.
>>>>> v10-v9
>>>>> - Address Eelco Chaudron's comments on v9.
>>>>> v11-v10
>>>>> - Fix a bracing error.
>>>>> v12-v11
>>>>> - Add duplicate sample group id check.
>>>>>
>>>>> Chris Mi (11):
>>>>>     compat: Add psample and tc sample action defines for older kernels
>>>>>     ovs-kmod-ctl: Load kernel module psample
>>>>>     dpif: Introduce register sFlow upcall callback API
>>>>>     ofproto: Add upcall callback to process sFlow packet
>>>>>     netdev-offload: Introduce register sFlow upcall callback API
>>>>>     netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>     dpif-netlink: Implement register sFlow upcall callback API
>>>>>     netdev-offload-tc: Introduce group ID management API
>>>>>     netdev-offload-tc: Create psample netlink socket
>>>>>     netdev-offload-tc: Add psample receive handler
>>>>>     netdev-offload-tc: Add offload support for sFlow
>>>>>
>>>>>    include/linux/automake.mk        |   4 +-
>>>>>    include/linux/psample.h          |  58 +++
>>>>>    include/linux/tc_act/tc_sample.h |  25 ++
>>>>>    lib/dpif-netdev.c                |   1 +
>>>>>    lib/dpif-netlink.c               |  27 ++
>>>>>    lib/dpif-netlink.h               |   4 +
>>>>>    lib/dpif-provider.h              |  10 +
>>>>>    lib/dpif.c                       |   8 +
>>>>>    lib/dpif.h                       |  23 ++
>>>>>    lib/netdev-offload-provider.h    |   3 +
>>>>>    lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>    lib/netdev-offload.c             |  30 ++
>>>>>    lib/netdev-offload.h             |   4 +
>>>>>    lib/tc.c                         |  61 ++-
>>>>>    lib/tc.h                         |  16 +-
>>>>>    ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>    utilities/ovs-kmod-ctl.in        |  14 +
>>>>>    17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>    create mode 100644 include/linux/psample.h
>>>>>    create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>
>>>>
>>>> Hi Simon, Ilya,
>>>>
>>>> Can you help review for this series?
>>>> do you have any comments you want us to handle?
>>>>
>>>> Thanks,
>>>> Roi
>
Chris Mi March 24, 2021, 9:17 a.m. UTC | #9
On 3/23/2021 10:24 PM, Ilya Maximets wrote:
> On 3/5/21 4:27 AM, Chris Mi wrote:
>> Hi Ilya,
>>
>> I think about your suggestion recently. But I'm still not very clear about the design.
>> Please see my reply below:
>>
>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>> Hi Simon, Ilya,
>>>>
>>>> Could I know what should we do to make progress for this patch set?
>>>> It has been posted in the community for a long time 😁
>>> In general, the way to get your patches reviewed is to review other
>>> patches.  It's simply because we still have a huge review backlog
>>> (214 patches right now in patchwork and most of them needs review)
>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>> patches you're reducing amount of work for maintainers so they can
>>> get to your patches faster.
>> OK, I see.
>>> For the series and comments from Eelco:
>>> I didn't read the patches carefully, only a quick glance, but I still
>>> do not understand why we need a separate thread to poll psample events.
>>> Why can't we just allow usual handler threads to do that?
>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>> But in order to offload sFlow action, we use psample kernel module to send sampled
>> packets from kernel to userspace. The format for ovs netlink message and psample
>> netlink messages are different.
> Hi.  Sorry for late reply.
>
> Yes, I understand that message format is different, but it doesn't really
> matter.  All the message-specific handling and parsing should happen
> inside the netdev_offload_tc_recv().  This function should have a prototype
> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
> struct dpif_upcall from the caller and fill it with data.  Maybe other
> type of a generic data structure if it's really not possible to construct
> struct dpif_upcall.
>
>
>>>     From the
>>> architecture perspective it's not a good thing to call ofproto code
>>> from the offload-provider.  This introduces lots of complications
>>> and might cause lots of issues in the future.
>>>
>>> I'd say that design should look like this:
>>>
>>>     handler thread ->
>>>       dpif_recv() ->
>>>         dpif_netlink_recv() ->
>>>           netdev_offload_recv() ->
>>>             netdev_offload_tc_recv() ->
>>>               nl_sock_recv()
>> In order to use the handler thread, I'm not sure if we should add psample socket
>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>> allocate a special port number for psample and assign it to event.data.u32.
>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>> Last three calls should be implemented.
>>>
>>> The better version of this will be to throw away dpif part from
>>> above call chain and make it:
>>>
>>>     handler thread ->
>>>       netdev_offload_recv() ->
>>>         netdev_offload_tc_recv() ->
>>>           nl_sock_recv()
>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>> somehow.
> There is no need to implement any epoll_fd logic for psample.
> You may only use handler with handler_id == 0.  Only this handler will receive
> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
> There is no need to block and you're never blocking anywhere because your're
> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
> actually receiving a message.
Hi Ilya,

With this new design, the code will be changed greatly. I'm not 
unwilling to change it.
The effort is not small. So before I start, I want to make sure that it 
is feasible.

Yes, we won't block in nl_sock_recv(), but the handler thread will be 
blocked in poll_block().
If any of the vport netlink socket is readable , it will be waken up by 
kernel. If we don't use
epoll_ctl to add the psample netlink socket to the handler's epoll fd, 
we can't receive the
psample packet as soon as possible. That means we can only receive the 
sampled packet
when there is a miss upcall.

I added some debug message in ovs epoll code and got above conclusion. 
But I'm not
the expert of the epoll API, so I'm not sure if I missed anything.

Thanks,
Chris
>
> So, there should be several call chains:
>
> 1. Init.
>
>     open_dpif_backer/type_run() ->
>       netdev_offload_recv_set() ->
>         netdev_offload_tc_recv_set(enabled) ->
>           if (enable)
>             psample_sock = nl_sock_create(...);
>           else
>             close(psample_sock);
>
> 2. Wait.
>
>     udpif_upcall_handler() ->
>       netdev_offload_recv_wait() ->
>         netdev_offload_tc_recv_wait() ->
>           if (handler_id == 0)
>             nl_sock_wait(psample_sock, POLLIN);
>
> 3. Receive.
>
>     udpif_upcall_handler() ->
>       recv_upcalls() ->
>         netdev_offload_recv() ->
>         | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>         |   if (handler_id == 0)
>         |     nl_sock_recv(psample_socket, ..., wait = false);
>         |     *upcall = <received data>;
>         upcall_receive()
>         process_upcall() ->
>           dpif_sflow_received()
>
> 4. Deinit.
>
>     close_dpif_backer() ->
>       netdev_offload_recv_set(enable = false);
>
>
> Does that look more clear?
>
>> I don't know if I understand your suggestion correctly. Or I missed anything
>> So if you have time, could you please elaborate?
>>
>> Thanks,
>> Chris
>>> This way we could avoid touching dpif-netdev and still have psample
>>> offloading for the case where netdev-offload-tc is used from the
>>> userspace datapath.
>>>
>>> Above architecture also implies implementation of:
>>>    - netdev_offload_recv_wait()
>>>    - netdev_offload_recv_purge()
>>>    - and the netdev_offload_tc_* counterparts.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>> This patch set adds offload support for sFlow.
>>>>>>
>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>> uses psample to send sampled packets to userspace.
>>>>>>
>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>
>>>>>> v2-v1:
>>>>>> - Fix robot errors.
>>>>>> v3-v2:
>>>>>> - Remove Gerrit Change-Id.
>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>> - Add travis test result.
>>>>>> v4-v3:
>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>> v5-v4:
>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>> v6-v5:
>>>>>> - Rebase.
>>>>>> - Add GitHub Actions test result.
>>>>>> v7-v6:
>>>>>> - Remove Gerrit Change-Id.
>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>> v8-v7
>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>> v9-v8
>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>> - Log a debug message for other userspace actions.
>>>>>> v10-v9
>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>> v11-v10
>>>>>> - Fix a bracing error.
>>>>>> v12-v11
>>>>>> - Add duplicate sample group id check.
>>>>>>
>>>>>> Chris Mi (11):
>>>>>>      compat: Add psample and tc sample action defines for older kernels
>>>>>>      ovs-kmod-ctl: Load kernel module psample
>>>>>>      dpif: Introduce register sFlow upcall callback API
>>>>>>      ofproto: Add upcall callback to process sFlow packet
>>>>>>      netdev-offload: Introduce register sFlow upcall callback API
>>>>>>      netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>      dpif-netlink: Implement register sFlow upcall callback API
>>>>>>      netdev-offload-tc: Introduce group ID management API
>>>>>>      netdev-offload-tc: Create psample netlink socket
>>>>>>      netdev-offload-tc: Add psample receive handler
>>>>>>      netdev-offload-tc: Add offload support for sFlow
>>>>>>
>>>>>>     include/linux/automake.mk        |   4 +-
>>>>>>     include/linux/psample.h          |  58 +++
>>>>>>     include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>     lib/dpif-netdev.c                |   1 +
>>>>>>     lib/dpif-netlink.c               |  27 ++
>>>>>>     lib/dpif-netlink.h               |   4 +
>>>>>>     lib/dpif-provider.h              |  10 +
>>>>>>     lib/dpif.c                       |   8 +
>>>>>>     lib/dpif.h                       |  23 ++
>>>>>>     lib/netdev-offload-provider.h    |   3 +
>>>>>>     lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>     lib/netdev-offload.c             |  30 ++
>>>>>>     lib/netdev-offload.h             |   4 +
>>>>>>     lib/tc.c                         |  61 ++-
>>>>>>     lib/tc.h                         |  16 +-
>>>>>>     ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>     utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>     17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>     create mode 100644 include/linux/psample.h
>>>>>>     create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>
>>>>> Hi Simon, Ilya,
>>>>>
>>>>> Can you help review for this series?
>>>>> do you have any comments you want us to handle?
>>>>>
>>>>> Thanks,
>>>>> Roi
Ilya Maximets March 24, 2021, 12:14 p.m. UTC | #10
On 3/24/21 10:17 AM, Chris Mi wrote:
> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>> Hi Ilya,
>>>
>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>> Please see my reply below:
>>>
>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>> Hi Simon, Ilya,
>>>>>
>>>>> Could I know what should we do to make progress for this patch set?
>>>>> It has been posted in the community for a long time 😁
>>>> In general, the way to get your patches reviewed is to review other
>>>> patches.  It's simply because we still have a huge review backlog
>>>> (214 patches right now in patchwork and most of them needs review)
>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>> patches you're reducing amount of work for maintainers so they can
>>>> get to your patches faster.
>>> OK, I see.
>>>> For the series and comments from Eelco:
>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>> do not understand why we need a separate thread to poll psample events.
>>>> Why can't we just allow usual handler threads to do that?
>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>> netlink messages are different.
>> Hi.  Sorry for late reply.
>>
>> Yes, I understand that message format is different, but it doesn't really
>> matter.  All the message-specific handling and parsing should happen
>> inside the netdev_offload_tc_recv().  This function should have a prototype
>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>> struct dpif_upcall from the caller and fill it with data.  Maybe other
>> type of a generic data structure if it's really not possible to construct
>> struct dpif_upcall.
>>
>>
>>>>     From the
>>>> architecture perspective it's not a good thing to call ofproto code
>>>> from the offload-provider.  This introduces lots of complications
>>>> and might cause lots of issues in the future.
>>>>
>>>> I'd say that design should look like this:
>>>>
>>>>     handler thread ->
>>>>       dpif_recv() ->
>>>>         dpif_netlink_recv() ->
>>>>           netdev_offload_recv() ->
>>>>             netdev_offload_tc_recv() ->
>>>>               nl_sock_recv()
>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>> allocate a special port number for psample and assign it to event.data.u32.
>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>> Last three calls should be implemented.
>>>>
>>>> The better version of this will be to throw away dpif part from
>>>> above call chain and make it:
>>>>
>>>>     handler thread ->
>>>>       netdev_offload_recv() ->
>>>>         netdev_offload_tc_recv() ->
>>>>           nl_sock_recv()
>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>> somehow.
>> There is no need to implement any epoll_fd logic for psample.
>> You may only use handler with handler_id == 0.  Only this handler will receive
>> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>> There is no need to block and you're never blocking anywhere because your're
>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>> actually receiving a message.
> Hi Ilya,
> 
> With this new design, the code will be changed greatly. I'm not unwilling to change it.
> The effort is not small. So before I start, I want to make sure that it is feasible.
> 
> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
> psample packet as soon as possible. That means we can only receive the sampled packet
> when there is a miss upcall.
> 
> I added some debug message in ovs epoll code and got above conclusion. But I'm not
> the expert of the epoll API, so I'm not sure if I missed anything.

Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
of the file descriptors added to epoll.  epoll_fd is added to a thread's poll
loop by
  poll_fd_wait(handler->epoll_fd, POLLIN);

If you will call nl_sock_wait() on psample socket, this socket will be added
to the same poll loop with:
  nl_sock_wait(psample_sock, POLLIN) ->
    poll_fd_wait(psample_sock->fd, POLLIN);

This way psample socket will become an *additional* source for waking up
for the thread that called nl_sock_wait().  So, this handler will be waken
up if POLLIN happened on a psample socket even if there are no miss upcalls.

The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
threads knows nothing about it.  poll_loop() knows nothing about this epoll
stuff too, it just adds epoll_fd itself to the list of fds for the usual
poll() because of the poll_fd_wait() call.  psample socket will end up in
the same usual poll().

> 
> Thanks,
> Chris
>>
>> So, there should be several call chains:
>>
>> 1. Init.
>>
>>     open_dpif_backer/type_run() ->
>>       netdev_offload_recv_set() ->
>>         netdev_offload_tc_recv_set(enabled) ->
>>           if (enable)
>>             psample_sock = nl_sock_create(...);
>>           else
>>             close(psample_sock);
>>
>> 2. Wait.
>>
>>     udpif_upcall_handler() ->
>>       netdev_offload_recv_wait() ->
>>         netdev_offload_tc_recv_wait() ->
>>           if (handler_id == 0)
>>             nl_sock_wait(psample_sock, POLLIN);
>>
>> 3. Receive.
>>
>>     udpif_upcall_handler() ->
>>       recv_upcalls() ->
>>         netdev_offload_recv() ->
>>         | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>         |   if (handler_id == 0)
>>         |     nl_sock_recv(psample_socket, ..., wait = false);
>>         |     *upcall = <received data>;
>>         upcall_receive()
>>         process_upcall() ->
>>           dpif_sflow_received()
>>
>> 4. Deinit.
>>
>>     close_dpif_backer() ->
>>       netdev_offload_recv_set(enable = false);
>>
>>
>> Does that look more clear?
>>
>>> I don't know if I understand your suggestion correctly. Or I missed anything
>>> So if you have time, could you please elaborate?
>>>
>>> Thanks,
>>> Chris
>>>> This way we could avoid touching dpif-netdev and still have psample
>>>> offloading for the case where netdev-offload-tc is used from the
>>>> userspace datapath.
>>>>
>>>> Above architecture also implies implementation of:
>>>>    - netdev_offload_recv_wait()
>>>>    - netdev_offload_recv_purge()
>>>>    - and the netdev_offload_tc_* counterparts.
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>
>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>
>>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>
>>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>
>>>>>>> v2-v1:
>>>>>>> - Fix robot errors.
>>>>>>> v3-v2:
>>>>>>> - Remove Gerrit Change-Id.
>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>> - Add travis test result.
>>>>>>> v4-v3:
>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>> v5-v4:
>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>> v6-v5:
>>>>>>> - Rebase.
>>>>>>> - Add GitHub Actions test result.
>>>>>>> v7-v6:
>>>>>>> - Remove Gerrit Change-Id.
>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>> v8-v7
>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>> v9-v8
>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>> - Log a debug message for other userspace actions.
>>>>>>> v10-v9
>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>> v11-v10
>>>>>>> - Fix a bracing error.
>>>>>>> v12-v11
>>>>>>> - Add duplicate sample group id check.
>>>>>>>
>>>>>>> Chris Mi (11):
>>>>>>>      compat: Add psample and tc sample action defines for older kernels
>>>>>>>      ovs-kmod-ctl: Load kernel module psample
>>>>>>>      dpif: Introduce register sFlow upcall callback API
>>>>>>>      ofproto: Add upcall callback to process sFlow packet
>>>>>>>      netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>      netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>      dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>      netdev-offload-tc: Introduce group ID management API
>>>>>>>      netdev-offload-tc: Create psample netlink socket
>>>>>>>      netdev-offload-tc: Add psample receive handler
>>>>>>>      netdev-offload-tc: Add offload support for sFlow
>>>>>>>
>>>>>>>     include/linux/automake.mk        |   4 +-
>>>>>>>     include/linux/psample.h          |  58 +++
>>>>>>>     include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>     lib/dpif-netdev.c                |   1 +
>>>>>>>     lib/dpif-netlink.c               |  27 ++
>>>>>>>     lib/dpif-netlink.h               |   4 +
>>>>>>>     lib/dpif-provider.h              |  10 +
>>>>>>>     lib/dpif.c                       |   8 +
>>>>>>>     lib/dpif.h                       |  23 ++
>>>>>>>     lib/netdev-offload-provider.h    |   3 +
>>>>>>>     lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>>     lib/netdev-offload.c             |  30 ++
>>>>>>>     lib/netdev-offload.h             |   4 +
>>>>>>>     lib/tc.c                         |  61 ++-
>>>>>>>     lib/tc.h                         |  16 +-
>>>>>>>     ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>     utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>     17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>     create mode 100644 include/linux/psample.h
>>>>>>>     create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Can you help review for this series?
>>>>>> do you have any comments you want us to handle?
>>>>>>
>>>>>> Thanks,
>>>>>> Roi
>
Chris Mi March 24, 2021, 1:03 p.m. UTC | #11
On 3/24/2021 8:14 PM, Ilya Maximets wrote:
> On 3/24/21 10:17 AM, Chris Mi wrote:
>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>> Hi Ilya,
>>>>
>>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>>> Please see my reply below:
>>>>
>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Could I know what should we do to make progress for this patch set?
>>>>>> It has been posted in the community for a long time 😁
>>>>> In general, the way to get your patches reviewed is to review other
>>>>> patches.  It's simply because we still have a huge review backlog
>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>>> patches you're reducing amount of work for maintainers so they can
>>>>> get to your patches faster.
>>>> OK, I see.
>>>>> For the series and comments from Eelco:
>>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>>> do not understand why we need a separate thread to poll psample events.
>>>>> Why can't we just allow usual handler threads to do that?
>>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>>> netlink messages are different.
>>> Hi.  Sorry for late reply.
>>>
>>> Yes, I understand that message format is different, but it doesn't really
>>> matter.  All the message-specific handling and parsing should happen
>>> inside the netdev_offload_tc_recv().  This function should have a prototype
>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>>> struct dpif_upcall from the caller and fill it with data.  Maybe other
>>> type of a generic data structure if it's really not possible to construct
>>> struct dpif_upcall.
>>>
>>>
>>>>>      From the
>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>> from the offload-provider.  This introduces lots of complications
>>>>> and might cause lots of issues in the future.
>>>>>
>>>>> I'd say that design should look like this:
>>>>>
>>>>>      handler thread ->
>>>>>        dpif_recv() ->
>>>>>          dpif_netlink_recv() ->
>>>>>            netdev_offload_recv() ->
>>>>>              netdev_offload_tc_recv() ->
>>>>>                nl_sock_recv()
>>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>>> allocate a special port number for psample and assign it to event.data.u32.
>>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>>> Last three calls should be implemented.
>>>>>
>>>>> The better version of this will be to throw away dpif part from
>>>>> above call chain and make it:
>>>>>
>>>>>      handler thread ->
>>>>>        netdev_offload_recv() ->
>>>>>          netdev_offload_tc_recv() ->
>>>>>            nl_sock_recv()
>>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>>> somehow.
>>> There is no need to implement any epoll_fd logic for psample.
>>> You may only use handler with handler_id == 0.  Only this handler will receive
>>> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>> There is no need to block and you're never blocking anywhere because your're
>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>>> actually receiving a message.
>> Hi Ilya,
>>
>> With this new design, the code will be changed greatly. I'm not unwilling to change it.
>> The effort is not small. So before I start, I want to make sure that it is feasible.
>>
>> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
>> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
>> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
>> psample packet as soon as possible. That means we can only receive the sampled packet
>> when there is a miss upcall.
>>
>> I added some debug message in ovs epoll code and got above conclusion. But I'm not
>> the expert of the epoll API, so I'm not sure if I missed anything.
> Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
> of the file descriptors added to epoll.  epoll_fd is added to a thread's poll
> loop by
>    poll_fd_wait(handler->epoll_fd, POLLIN);
>
> If you will call nl_sock_wait() on psample socket, this socket will be added
> to the same poll loop with:
>    nl_sock_wait(psample_sock, POLLIN) ->
>      poll_fd_wait(psample_sock->fd, POLLIN);
>
> This way psample socket will become an *additional* source for waking up
> for the thread that called nl_sock_wait().  So, this handler will be waken
> up if POLLIN happened on a psample socket even if there are no miss upcalls.
>
> The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
> threads knows nothing about it.  poll_loop() knows nothing about this epoll
> stuff too, it just adds epoll_fd itself to the list of fds for the usual
> poll() because of the poll_fd_wait() call.  psample socket will end up in
> the same usual poll().
Thank a lot for your detailed explanation, Ilya. Now I'm clear of it. I 
will work on it.

-Chris
>> Thanks,
>> Chris
>>> So, there should be several call chains:
>>>
>>> 1. Init.
>>>
>>>      open_dpif_backer/type_run() ->
>>>        netdev_offload_recv_set() ->
>>>          netdev_offload_tc_recv_set(enabled) ->
>>>            if (enable)
>>>              psample_sock = nl_sock_create(...);
>>>            else
>>>              close(psample_sock);
>>>
>>> 2. Wait.
>>>
>>>      udpif_upcall_handler() ->
>>>        netdev_offload_recv_wait() ->
>>>          netdev_offload_tc_recv_wait() ->
>>>            if (handler_id == 0)
>>>              nl_sock_wait(psample_sock, POLLIN);
>>>
>>> 3. Receive.
>>>
>>>      udpif_upcall_handler() ->
>>>        recv_upcalls() ->
>>>          netdev_offload_recv() ->
>>>          | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>>          |   if (handler_id == 0)
>>>          |     nl_sock_recv(psample_socket, ..., wait = false);
>>>          |     *upcall = <received data>;
>>>          upcall_receive()
>>>          process_upcall() ->
>>>            dpif_sflow_received()
>>>
>>> 4. Deinit.
>>>
>>>      close_dpif_backer() ->
>>>        netdev_offload_recv_set(enable = false);
>>>
>>>
>>> Does that look more clear?
>>>
>>>> I don't know if I understand your suggestion correctly. Or I missed anything
>>>> So if you have time, could you please elaborate?
>>>>
>>>> Thanks,
>>>> Chris
>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>> userspace datapath.
>>>>>
>>>>> Above architecture also implies implementation of:
>>>>>     - netdev_offload_recv_wait()
>>>>>     - netdev_offload_recv_purge()
>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>
>>>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>
>>>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>
>>>>>>>> v2-v1:
>>>>>>>> - Fix robot errors.
>>>>>>>> v3-v2:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>> - Add travis test result.
>>>>>>>> v4-v3:
>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>> v5-v4:
>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>> v6-v5:
>>>>>>>> - Rebase.
>>>>>>>> - Add GitHub Actions test result.
>>>>>>>> v7-v6:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>> v8-v7
>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>> v9-v8
>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>> v10-v9
>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>> v11-v10
>>>>>>>> - Fix a bracing error.
>>>>>>>> v12-v11
>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>
>>>>>>>> Chris Mi (11):
>>>>>>>>       compat: Add psample and tc sample action defines for older kernels
>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>       netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>>       dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>
>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>      lib/dpif-netdev.c                |   1 +
>>>>>>>>      lib/dpif-netlink.c               |  27 ++
>>>>>>>>      lib/dpif-netlink.h               |   4 +
>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>      lib/dpif.c                       |   8 +
>>>>>>>>      lib/dpif.h                       |  23 ++
>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>      lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>      lib/tc.c                         |  61 ++-
>>>>>>>>      lib/tc.h                         |  16 +-
>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Can you help review for this series?
>>>>>>> do you have any comments you want us to handle?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Roi
Eelco Chaudron March 29, 2021, 9:46 a.m. UTC | #12
On 24 Mar 2021, at 14:03, Chris Mi wrote:

> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> I think about your suggestion recently. But I'm still not very 
>>>>> clear about the design.
>>>>> Please see my reply below:
>>>>>
>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Could I know what should we do to make progress for this patch 
>>>>>>> set?
>>>>>>> It has been posted in the community for a long time 😁
>>>>>> In general, the way to get your patches reviewed is to review 
>>>>>> other
>>>>>> patches.  It's simply because we still have a huge review 
>>>>>> backlog
>>>>>> (214 patches right now in patchwork and most of them needs 
>>>>>> review)
>>>>>> and bugfixes usually has a bit higher priority.  By reviewing 
>>>>>> other
>>>>>> patches you're reducing amount of work for maintainers so they 
>>>>>> can
>>>>>> get to your patches faster.
>>>>> OK, I see.
>>>>>> For the series and comments from Eelco:
>>>>>> I didn't read the patches carefully, only a quick glance, but I 
>>>>>> still
>>>>>> do not understand why we need a separate thread to poll psample 
>>>>>> events.
>>>>>> Why can't we just allow usual handler threads to do that?
>>>>> I'm not sure if you are aware of that the psample netlink is 
>>>>> different from the ovs
>>>>> netlink. Without offload, kernel sends missed packet and sFlow 
>>>>> packet to userspace
>>>>> using the same netlink 'ovs_packet_family'. So they can use the 
>>>>> same handler thread.
>>>>> But in order to offload sFlow action, we use psample kernel module 
>>>>> to send sampled
>>>>> packets from kernel to userspace. The format for ovs netlink 
>>>>> message and psample
>>>>> netlink messages are different.
>>>> Hi.  Sorry for late reply.
>>>>
>>>> Yes, I understand that message format is different, but it doesn't 
>>>> really
>>>> matter.  All the message-specific handling and parsing should 
>>>> happen
>>>> inside the netdev_offload_tc_recv().  This function should have a 
>>>> prototype
>>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to 
>>>> the
>>>> struct dpif_upcall from the caller and fill it with data.  Maybe 
>>>> other
>>>> type of a generic data structure if it's really not possible to 
>>>> construct
>>>> struct dpif_upcall.
>>>>
>>>>
>>>>>>      From the
>>>>>> architecture perspective it's not a good thing to call ofproto 
>>>>>> code
>>>>>> from the offload-provider.  This introduces lots of 
>>>>>> complications
>>>>>> and might cause lots of issues in the future.
>>>>>>
>>>>>> I'd say that design should look like this:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        dpif_recv() ->
>>>>>>          dpif_netlink_recv() ->
>>>>>>            netdev_offload_recv() ->
>>>>>>              netdev_offload_tc_recv() ->
>>>>>>                nl_sock_recv()
>>>>> In order to use the handler thread, I'm not sure if we should add 
>>>>> psample socket
>>>>> fd to every handler's epoll_fd. If we should do that, we should 
>>>>> consider how to
>>>>> differentiate if the event comes from ovs or psample netlink. 
>>>>> Maybe we should
>>>>> allocate a special port number for psample and assign it to 
>>>>> event.data.u32.
>>>>> Anyway, that's the details. If this is the right direction, I'll 
>>>>> think about it.
>>>>>> Last three calls should be implemented.
>>>>>>
>>>>>> The better version of this will be to throw away dpif part from
>>>>>> above call chain and make it:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        netdev_offload_recv() ->
>>>>>>          netdev_offload_tc_recv() ->
>>>>>>            nl_sock_recv()
>>>>> If we throw away dpif part, maybe we have to write some duplicate 
>>>>> epoll code
>>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we 
>>>>> have to add
>>>>> psample socket fd to every handler's epoll_fd. So we have to 
>>>>> change the dpif
>>>>> somehow.
>>>> There is no need to implement any epoll_fd logic for psample.
>>>> You may only use handler with handler_id == 0.  Only this handler 
>>>> will receive
>>>> psample upcalls.  netdev_offload_recv_wait() should be implemented 
>>>> similar
>>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. 
>>>> it will
>>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>>> There is no need to block and you're never blocking anywhere 
>>>> because your're
>>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT 
>>>> while
>>>> actually receiving a message.
>>> Hi Ilya,
>>>
>>> With this new design, the code will be changed greatly. I'm not 
>>> unwilling to change it.
>>> The effort is not small. So before I start, I want to make sure that 
>>> it is feasible.
>>>
>>> Yes, we won't block in nl_sock_recv(), but the handler thread will 
>>> be blocked in poll_block().
>>> If any of the vport netlink socket is readable , it will be waken up 
>>> by kernel. If we don't use
>>> epoll_ctl to add the psample netlink socket to the handler's epoll 
>>> fd, we can't receive the
>>> psample packet as soon as possible. That means we can only receive 
>>> the sampled packet
>>> when there is a miss upcall.
>>>
>>> I added some debug message in ovs epoll code and got above 
>>> conclusion. But I'm not
>>> the expert of the epoll API, so I'm not sure if I missed anything.
>> Handler thread wakes up on POLLIN on epoll_fd itself, not on the 
>> event on one
>> of the file descriptors added to epoll.  epoll_fd is added to a 
>> thread's poll
>> loop by
>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>
>> If you will call nl_sock_wait() on psample socket, this socket will 
>> be added
>> to the same poll loop with:
>>    nl_sock_wait(psample_sock, POLLIN) ->
>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>
>> This way psample socket will become an *additional* source for waking 
>> up
>> for the thread that called nl_sock_wait().  So, this handler will be 
>> waken
>> up if POLLIN happened on a psample socket even if there are no miss 
>> upcalls.
>>
>> The whole epoll infrastructure is local to lib/dpif-netlink.c and 
>> handler
>> threads knows nothing about it.  poll_loop() knows nothing about this 
>> epoll
>> stuff too, it just adds epoll_fd itself to the list of fds for the 
>> usual
>> poll() because of the poll_fd_wait() call.  psample socket will end 
>> up in
>> the same usual poll().
> Thank a lot for your detailed explanation, Ilya. Now I'm clear of it. 
> I will work on it.

Guys did you also look at my other objection on the
“[PATCH v9 08/11] netdev-offload-tc: Introduce group ID management 
API” as I still feel this some rework. Ilya your comments on this 
would help set some directions for Chris.

Cheers,

Eelco


>>> Thanks,
>>> Chris
>>>> So, there should be several call chains:
>>>>
>>>> 1. Init.
>>>>
>>>>      open_dpif_backer/type_run() ->
>>>>        netdev_offload_recv_set() ->
>>>>          netdev_offload_tc_recv_set(enabled) ->
>>>>            if (enable)
>>>>              psample_sock = nl_sock_create(...);
>>>>            else
>>>>              close(psample_sock);
>>>>
>>>> 2. Wait.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        netdev_offload_recv_wait() ->
>>>>          netdev_offload_tc_recv_wait() ->
>>>>            if (handler_id == 0)
>>>>              nl_sock_wait(psample_sock, POLLIN);
>>>>
>>>> 3. Receive.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        recv_upcalls() ->
>>>>          netdev_offload_recv() ->
>>>>          | netdev_offload_tc_recv(..., struct dpif_upcall 
>>>> *upcall) ->
>>>>          |   if (handler_id == 0)
>>>>          |     nl_sock_recv(psample_socket, ..., wait = 
>>>> false);
>>>>          |     *upcall = <received data>;
>>>>          upcall_receive()
>>>>          process_upcall() ->
>>>>            dpif_sflow_received()
>>>>
>>>> 4. Deinit.
>>>>
>>>>      close_dpif_backer() ->
>>>>        netdev_offload_recv_set(enable = false);
>>>>
>>>>
>>>> Does that look more clear?
>>>>
>>>>> I don't know if I understand your suggestion correctly. Or I 
>>>>> missed anything
>>>>> So if you have time, could you please elaborate?
>>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>> This way we could avoid touching dpif-netdev and still have 
>>>>>> psample
>>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>>> userspace datapath.
>>>>>>
>>>>>> Above architecture also implies implementation of:
>>>>>>     - netdev_offload_recv_wait()
>>>>>>     - netdev_offload_recv_purge()
>>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>>
>>>>>>>>> Psample is a genetlink channel for packet sampling. TC action 
>>>>>>>>> act_sample
>>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>>
>>>>>>>>> When offloading sample action to TC, userspace creates a 
>>>>>>>>> unique ID to
>>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel 
>>>>>>>>> instead
>>>>>>>>> of the sFlow info. psample will send this ID and sampled 
>>>>>>>>> packet to
>>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info 
>>>>>>>>> and send
>>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>>
>>>>>>>>> v2-v1:
>>>>>>>>> - Fix robot errors.
>>>>>>>>> v3-v2:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>>> - Add travis test result.
>>>>>>>>> v4-v3:
>>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>>> v5-v4:
>>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>>> v6-v5:
>>>>>>>>> - Rebase.
>>>>>>>>> - Add GitHub Actions test result.
>>>>>>>>> v7-v6:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>>> v8-v7
>>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>>> v9-v8
>>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>>> v10-v9
>>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>>> v11-v10
>>>>>>>>> - Fix a bracing error.
>>>>>>>>> v12-v11
>>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>>
>>>>>>>>> Chris Mi (11):
>>>>>>>>>       compat: Add psample and tc sample action defines for 
>>>>>>>>> older kernels
>>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>>       netdev-offload: Introduce register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       dpif-netlink: Implement register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>>
>>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>>      lib/dpif-netdev.c                |   
>>>>>>>>> 1 +
>>>>>>>>>      lib/dpif-netlink.c               |  27 
>>>>>>>>> ++
>>>>>>>>>      lib/dpif-netlink.h               |   4 
>>>>>>>>> +
>>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>>      lib/dpif.c                       
>>>>>>>>> |   8 +
>>>>>>>>>      lib/dpif.h                       
>>>>>>>>> |  23 ++
>>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>>      lib/netdev-offload-tc.c          | 659 
>>>>>>>>> ++++++++++++++++++++++++++++++-
>>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>>      
>>>>>>>>> lib/tc.c                         |  
>>>>>>>>> 61 ++-
>>>>>>>>>      
>>>>>>>>> lib/tc.h                         |  
>>>>>>>>> 16 +-
>>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>>
>>>>>>>> Hi Simon, Ilya,
>>>>>>>>
>>>>>>>> Can you help review for this series?
>>>>>>>> do you have any comments you want us to handle?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Roi
Chris Mi April 27, 2021, 1:23 a.m. UTC | #13
On 3/24/2021 8:14 PM, Ilya Maximets wrote:
> On 3/24/21 10:17 AM, Chris Mi wrote:
>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>> Hi Ilya,
>>>>
>>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>>> Please see my reply below:
>>>>
>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Could I know what should we do to make progress for this patch set?
>>>>>> It has been posted in the community for a long time 😁
>>>>> In general, the way to get your patches reviewed is to review other
>>>>> patches.  It's simply because we still have a huge review backlog
>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>>> patches you're reducing amount of work for maintainers so they can
>>>>> get to your patches faster.
>>>> OK, I see.
>>>>> For the series and comments from Eelco:
>>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>>> do not understand why we need a separate thread to poll psample events.
>>>>> Why can't we just allow usual handler threads to do that?
>>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>>> netlink messages are different.
>>> Hi.  Sorry for late reply.
>>>
>>> Yes, I understand that message format is different, but it doesn't really
>>> matter.  All the message-specific handling and parsing should happen
>>> inside the netdev_offload_tc_recv().  This function should have a prototype
>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>>> struct dpif_upcall from the caller and fill it with data.  Maybe other
>>> type of a generic data structure if it's really not possible to construct
>>> struct dpif_upcall.
>>>
>>>
>>>>>      From the
>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>> from the offload-provider.  This introduces lots of complications
>>>>> and might cause lots of issues in the future.
>>>>>
>>>>> I'd say that design should look like this:
>>>>>
>>>>>      handler thread ->
>>>>>        dpif_recv() ->
>>>>>          dpif_netlink_recv() ->
>>>>>            netdev_offload_recv() ->
>>>>>              netdev_offload_tc_recv() ->
>>>>>                nl_sock_recv()
>>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>>> allocate a special port number for psample and assign it to event.data.u32.
>>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>>> Last three calls should be implemented.
>>>>>
>>>>> The better version of this will be to throw away dpif part from
>>>>> above call chain and make it:
>>>>>
>>>>>      handler thread ->
>>>>>        netdev_offload_recv() ->
>>>>>          netdev_offload_tc_recv() ->
>>>>>            nl_sock_recv()
>>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>>> somehow.
>>> There is no need to implement any epoll_fd logic for psample.
>>> You may only use handler with handler_id == 0.  Only this handler will receive
>>> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>> There is no need to block and you're never blocking anywhere because your're
>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>>> actually receiving a message.
>> Hi Ilya,
>>
>> With this new design, the code will be changed greatly. I'm not unwilling to change it.
>> The effort is not small. So before I start, I want to make sure that it is feasible.
>>
>> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
>> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
>> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
>> psample packet as soon as possible. That means we can only receive the sampled packet
>> when there is a miss upcall.
>>
>> I added some debug message in ovs epoll code and got above conclusion. But I'm not
>> the expert of the epoll API, so I'm not sure if I missed anything.
> Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
> of the file descriptors added to epoll.  epoll_fd is added to a thread's poll
> loop by
>    poll_fd_wait(handler->epoll_fd, POLLIN);
>
> If you will call nl_sock_wait() on psample socket, this socket will be added
> to the same poll loop with:
>    nl_sock_wait(psample_sock, POLLIN) ->
>      poll_fd_wait(psample_sock->fd, POLLIN);
>
> This way psample socket will become an *additional* source for waking up
> for the thread that called nl_sock_wait().  So, this handler will be waken
> up if POLLIN happened on a psample socket even if there are no miss upcalls.
>
> The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
> threads knows nothing about it.  poll_loop() knows nothing about this epoll
> stuff too, it just adds epoll_fd itself to the list of fds for the usual
> poll() because of the poll_fd_wait() call.  psample socket will end up in
> the same usual poll().
Hi Ilya,

The code according to your suggestion is ready. But during the internal 
code review,
Eli Britstein thought flow_api is netdev based, but the psample/sFlow 
offload is
not netdev based, but dpif based. Maybe we should introduce 
dpif-offload-provider
and have a dpf_offload_api, so it won't be related to a specific netdev, 
but to a dpif type.

Could I know what's your opinion about it?

Thanks,
Chris
>
>> Thanks,
>> Chris
>>> So, there should be several call chains:
>>>
>>> 1. Init.
>>>
>>>      open_dpif_backer/type_run() ->
>>>        netdev_offload_recv_set() ->
>>>          netdev_offload_tc_recv_set(enabled) ->
>>>            if (enable)
>>>              psample_sock = nl_sock_create(...);
>>>            else
>>>              close(psample_sock);
>>>
>>> 2. Wait.
>>>
>>>      udpif_upcall_handler() ->
>>>        netdev_offload_recv_wait() ->
>>>          netdev_offload_tc_recv_wait() ->
>>>            if (handler_id == 0)
>>>              nl_sock_wait(psample_sock, POLLIN);
>>>
>>> 3. Receive.
>>>
>>>      udpif_upcall_handler() ->
>>>        recv_upcalls() ->
>>>          netdev_offload_recv() ->
>>>          | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>>          |   if (handler_id == 0)
>>>          |     nl_sock_recv(psample_socket, ..., wait = false);
>>>          |     *upcall = <received data>;
>>>          upcall_receive()
>>>          process_upcall() ->
>>>            dpif_sflow_received()
>>>
>>> 4. Deinit.
>>>
>>>      close_dpif_backer() ->
>>>        netdev_offload_recv_set(enable = false);
>>>
>>>
>>> Does that look more clear?
>>>
>>>> I don't know if I understand your suggestion correctly. Or I missed anything
>>>> So if you have time, could you please elaborate?
>>>>
>>>> Thanks,
>>>> Chris
>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>> userspace datapath.
>>>>>
>>>>> Above architecture also implies implementation of:
>>>>>     - netdev_offload_recv_wait()
>>>>>     - netdev_offload_recv_purge()
>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>
>>>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>
>>>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>
>>>>>>>> v2-v1:
>>>>>>>> - Fix robot errors.
>>>>>>>> v3-v2:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>> - Add travis test result.
>>>>>>>> v4-v3:
>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>> v5-v4:
>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>> v6-v5:
>>>>>>>> - Rebase.
>>>>>>>> - Add GitHub Actions test result.
>>>>>>>> v7-v6:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>> v8-v7
>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>> v9-v8
>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>> v10-v9
>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>> v11-v10
>>>>>>>> - Fix a bracing error.
>>>>>>>> v12-v11
>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>
>>>>>>>> Chris Mi (11):
>>>>>>>>       compat: Add psample and tc sample action defines for older kernels
>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>       netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>>       dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>
>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>      lib/dpif-netdev.c                |   1 +
>>>>>>>>      lib/dpif-netlink.c               |  27 ++
>>>>>>>>      lib/dpif-netlink.h               |   4 +
>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>      lib/dpif.c                       |   8 +
>>>>>>>>      lib/dpif.h                       |  23 ++
>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>      lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>      lib/tc.c                         |  61 ++-
>>>>>>>>      lib/tc.h                         |  16 +-
>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Can you help review for this series?
>>>>>>> do you have any comments you want us to handle?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Roi
Chris Mi May 11, 2021, 8:11 a.m. UTC | #14
On 4/27/2021 9:23 AM, Chris Mi wrote:
> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> I think about your suggestion recently. But I'm still not very 
>>>>> clear about the design.
>>>>> Please see my reply below:
>>>>>
>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Could I know what should we do to make progress for this patch set?
>>>>>>> It has been posted in the community for a long time 😁
>>>>>> In general, the way to get your patches reviewed is to review other
>>>>>> patches.  It's simply because we still have a huge review backlog
>>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>>>> patches you're reducing amount of work for maintainers so they can
>>>>>> get to your patches faster.
>>>>> OK, I see.
>>>>>> For the series and comments from Eelco:
>>>>>> I didn't read the patches carefully, only a quick glance, but I 
>>>>>> still
>>>>>> do not understand why we need a separate thread to poll psample 
>>>>>> events.
>>>>>> Why can't we just allow usual handler threads to do that?
>>>>> I'm not sure if you are aware of that the psample netlink is 
>>>>> different from the ovs
>>>>> netlink. Without offload, kernel sends missed packet and sFlow 
>>>>> packet to userspace
>>>>> using the same netlink 'ovs_packet_family'. So they can use the 
>>>>> same handler thread.
>>>>> But in order to offload sFlow action, we use psample kernel module 
>>>>> to send sampled
>>>>> packets from kernel to userspace. The format for ovs netlink 
>>>>> message and psample
>>>>> netlink messages are different.
>>>> Hi.  Sorry for late reply.
>>>>
>>>> Yes, I understand that message format is different, but it doesn't 
>>>> really
>>>> matter.  All the message-specific handling and parsing should happen
>>>> inside the netdev_offload_tc_recv().  This function should have a 
>>>> prototype
>>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to 
>>>> the
>>>> struct dpif_upcall from the caller and fill it with data. Maybe other
>>>> type of a generic data structure if it's really not possible to 
>>>> construct
>>>> struct dpif_upcall.
>>>>
>>>>
>>>>>>      From the
>>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>>> from the offload-provider.  This introduces lots of complications
>>>>>> and might cause lots of issues in the future.
>>>>>>
>>>>>> I'd say that design should look like this:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        dpif_recv() ->
>>>>>>          dpif_netlink_recv() ->
>>>>>>            netdev_offload_recv() ->
>>>>>>              netdev_offload_tc_recv() ->
>>>>>>                nl_sock_recv()
>>>>> In order to use the handler thread, I'm not sure if we should add 
>>>>> psample socket
>>>>> fd to every handler's epoll_fd. If we should do that, we should 
>>>>> consider how to
>>>>> differentiate if the event comes from ovs or psample netlink. 
>>>>> Maybe we should
>>>>> allocate a special port number for psample and assign it to 
>>>>> event.data.u32.
>>>>> Anyway, that's the details. If this is the right direction, I'll 
>>>>> think about it.
>>>>>> Last three calls should be implemented.
>>>>>>
>>>>>> The better version of this will be to throw away dpif part from
>>>>>> above call chain and make it:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        netdev_offload_recv() ->
>>>>>>          netdev_offload_tc_recv() ->
>>>>>>            nl_sock_recv()
>>>>> If we throw away dpif part, maybe we have to write some duplicate 
>>>>> epoll code
>>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we 
>>>>> have to add
>>>>> psample socket fd to every handler's epoll_fd. So we have to 
>>>>> change the dpif
>>>>> somehow.
>>>> There is no need to implement any epoll_fd logic for psample.
>>>> You may only use handler with handler_id == 0.  Only this handler 
>>>> will receive
>>>> psample upcalls.  netdev_offload_recv_wait() should be implemented 
>>>> similar
>>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. 
>>>> it will
>>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>>> There is no need to block and you're never blocking anywhere 
>>>> because your're
>>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT 
>>>> while
>>>> actually receiving a message.
>>> Hi Ilya,
>>>
>>> With this new design, the code will be changed greatly. I'm not 
>>> unwilling to change it.
>>> The effort is not small. So before I start, I want to make sure that 
>>> it is feasible.
>>>
>>> Yes, we won't block in nl_sock_recv(), but the handler thread will 
>>> be blocked in poll_block().
>>> If any of the vport netlink socket is readable , it will be waken up 
>>> by kernel. If we don't use
>>> epoll_ctl to add the psample netlink socket to the handler's epoll 
>>> fd, we can't receive the
>>> psample packet as soon as possible. That means we can only receive 
>>> the sampled packet
>>> when there is a miss upcall.
>>>
>>> I added some debug message in ovs epoll code and got above 
>>> conclusion. But I'm not
>>> the expert of the epoll API, so I'm not sure if I missed anything.
>> Handler thread wakes up on POLLIN on epoll_fd itself, not on the 
>> event on one
>> of the file descriptors added to epoll.  epoll_fd is added to a 
>> thread's poll
>> loop by
>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>
>> If you will call nl_sock_wait() on psample socket, this socket will 
>> be added
>> to the same poll loop with:
>>    nl_sock_wait(psample_sock, POLLIN) ->
>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>
>> This way psample socket will become an *additional* source for waking up
>> for the thread that called nl_sock_wait().  So, this handler will be 
>> waken
>> up if POLLIN happened on a psample socket even if there are no miss 
>> upcalls.
>>
>> The whole epoll infrastructure is local to lib/dpif-netlink.c and 
>> handler
>> threads knows nothing about it.  poll_loop() knows nothing about this 
>> epoll
>> stuff too, it just adds epoll_fd itself to the list of fds for the usual
>> poll() because of the poll_fd_wait() call.  psample socket will end 
>> up in
>> the same usual poll().
> Hi Ilya,
>
> The code according to your suggestion is ready. But during the 
> internal code review,
> Eli Britstein thought flow_api is netdev based, but the psample/sFlow 
> offload is
> not netdev based, but dpif based. Maybe we should introduce 
> dpif-offload-provider
> and have a dpf_offload_api, so it won't be related to a specific 
> netdev, but to a dpif type.
>
> Could I know what's your opinion about it?
Hi Ilya,

Any comment about it?

Thanks,
Chris
>
> Thanks,
> Chris
>>
>>> Thanks,
>>> Chris
>>>> So, there should be several call chains:
>>>>
>>>> 1. Init.
>>>>
>>>>      open_dpif_backer/type_run() ->
>>>>        netdev_offload_recv_set() ->
>>>>          netdev_offload_tc_recv_set(enabled) ->
>>>>            if (enable)
>>>>              psample_sock = nl_sock_create(...);
>>>>            else
>>>>              close(psample_sock);
>>>>
>>>> 2. Wait.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        netdev_offload_recv_wait() ->
>>>>          netdev_offload_tc_recv_wait() ->
>>>>            if (handler_id == 0)
>>>>              nl_sock_wait(psample_sock, POLLIN);
>>>>
>>>> 3. Receive.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        recv_upcalls() ->
>>>>          netdev_offload_recv() ->
>>>>          | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>>>          |   if (handler_id == 0)
>>>>          |     nl_sock_recv(psample_socket, ..., wait = false);
>>>>          |     *upcall = <received data>;
>>>>          upcall_receive()
>>>>          process_upcall() ->
>>>>            dpif_sflow_received()
>>>>
>>>> 4. Deinit.
>>>>
>>>>      close_dpif_backer() ->
>>>>        netdev_offload_recv_set(enable = false);
>>>>
>>>>
>>>> Does that look more clear?
>>>>
>>>>> I don't know if I understand your suggestion correctly. Or I 
>>>>> missed anything
>>>>> So if you have time, could you please elaborate?
>>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>>> userspace datapath.
>>>>>>
>>>>>> Above architecture also implies implementation of:
>>>>>>     - netdev_offload_recv_wait()
>>>>>>     - netdev_offload_recv_purge()
>>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>>
>>>>>>>>> Psample is a genetlink channel for packet sampling. TC action 
>>>>>>>>> act_sample
>>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>>
>>>>>>>>> When offloading sample action to TC, userspace creates a 
>>>>>>>>> unique ID to
>>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel 
>>>>>>>>> instead
>>>>>>>>> of the sFlow info. psample will send this ID and sampled 
>>>>>>>>> packet to
>>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info 
>>>>>>>>> and send
>>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>>
>>>>>>>>> v2-v1:
>>>>>>>>> - Fix robot errors.
>>>>>>>>> v3-v2:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>>> - Add travis test result.
>>>>>>>>> v4-v3:
>>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>>> v5-v4:
>>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>>> v6-v5:
>>>>>>>>> - Rebase.
>>>>>>>>> - Add GitHub Actions test result.
>>>>>>>>> v7-v6:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>>> v8-v7
>>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>>> v9-v8
>>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>>> v10-v9
>>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>>> v11-v10
>>>>>>>>> - Fix a bracing error.
>>>>>>>>> v12-v11
>>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>>
>>>>>>>>> Chris Mi (11):
>>>>>>>>>       compat: Add psample and tc sample action defines for 
>>>>>>>>> older kernels
>>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>>       netdev-offload: Introduce register sFlow upcall callback 
>>>>>>>>> API
>>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>>
>>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>>      lib/dpif-netdev.c                |   1 +
>>>>>>>>>      lib/dpif-netlink.c               |  27 ++
>>>>>>>>>      lib/dpif-netlink.h               |   4 +
>>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>>      lib/dpif.c                       |   8 +
>>>>>>>>>      lib/dpif.h                       |  23 ++
>>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>>      lib/netdev-offload-tc.c          | 659 
>>>>>>>>> ++++++++++++++++++++++++++++++-
>>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>>      lib/tc.c                         |  61 ++-
>>>>>>>>>      lib/tc.h                         |  16 +-
>>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>>
>>>>>>>> Hi Simon, Ilya,
>>>>>>>>
>>>>>>>> Can you help review for this series?
>>>>>>>> do you have any comments you want us to handle?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Roi
>
Chris Mi May 18, 2021, 8:06 a.m. UTC | #15
Hi Ilya,

Could you please take a look if you have time?

Thanks,
Chris

On 5/11/2021 4:11 PM, Chris Mi wrote:
> On 4/27/2021 9:23 AM, Chris Mi wrote:
>> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>>> Hi Ilya,
>>>>>>
>>>>>> I think about your suggestion recently. But I'm still not very 
>>>>>> clear about the design.
>>>>>> Please see my reply below:
>>>>>>
>>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>>> Hi Simon, Ilya,
>>>>>>>>
>>>>>>>> Could I know what should we do to make progress for this patch 
>>>>>>>> set?
>>>>>>>> It has been posted in the community for a long time 😁
>>>>>>> In general, the way to get your patches reviewed is to review other
>>>>>>> patches.  It's simply because we still have a huge review backlog
>>>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>>>>> patches you're reducing amount of work for maintainers so they can
>>>>>>> get to your patches faster.
>>>>>> OK, I see.
>>>>>>> For the series and comments from Eelco:
>>>>>>> I didn't read the patches carefully, only a quick glance, but I 
>>>>>>> still
>>>>>>> do not understand why we need a separate thread to poll psample 
>>>>>>> events.
>>>>>>> Why can't we just allow usual handler threads to do that?
>>>>>> I'm not sure if you are aware of that the psample netlink is 
>>>>>> different from the ovs
>>>>>> netlink. Without offload, kernel sends missed packet and sFlow 
>>>>>> packet to userspace
>>>>>> using the same netlink 'ovs_packet_family'. So they can use the 
>>>>>> same handler thread.
>>>>>> But in order to offload sFlow action, we use psample kernel 
>>>>>> module to send sampled
>>>>>> packets from kernel to userspace. The format for ovs netlink 
>>>>>> message and psample
>>>>>> netlink messages are different.
>>>>> Hi.  Sorry for late reply.
>>>>>
>>>>> Yes, I understand that message format is different, but it doesn't 
>>>>> really
>>>>> matter.  All the message-specific handling and parsing should happen
>>>>> inside the netdev_offload_tc_recv().  This function should have a 
>>>>> prototype
>>>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer 
>>>>> to the
>>>>> struct dpif_upcall from the caller and fill it with data. Maybe other
>>>>> type of a generic data structure if it's really not possible to 
>>>>> construct
>>>>> struct dpif_upcall.
>>>>>
>>>>>
>>>>>>>      From the
>>>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>>>> from the offload-provider.  This introduces lots of complications
>>>>>>> and might cause lots of issues in the future.
>>>>>>>
>>>>>>> I'd say that design should look like this:
>>>>>>>
>>>>>>>      handler thread ->
>>>>>>>        dpif_recv() ->
>>>>>>>          dpif_netlink_recv() ->
>>>>>>>            netdev_offload_recv() ->
>>>>>>>              netdev_offload_tc_recv() ->
>>>>>>>                nl_sock_recv()
>>>>>> In order to use the handler thread, I'm not sure if we should add 
>>>>>> psample socket
>>>>>> fd to every handler's epoll_fd. If we should do that, we should 
>>>>>> consider how to
>>>>>> differentiate if the event comes from ovs or psample netlink. 
>>>>>> Maybe we should
>>>>>> allocate a special port number for psample and assign it to 
>>>>>> event.data.u32.
>>>>>> Anyway, that's the details. If this is the right direction, I'll 
>>>>>> think about it.
>>>>>>> Last three calls should be implemented.
>>>>>>>
>>>>>>> The better version of this will be to throw away dpif part from
>>>>>>> above call chain and make it:
>>>>>>>
>>>>>>>      handler thread ->
>>>>>>>        netdev_offload_recv() ->
>>>>>>>          netdev_offload_tc_recv() ->
>>>>>>>            nl_sock_recv()
>>>>>> If we throw away dpif part, maybe we have to write some duplicate 
>>>>>> epoll code
>>>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we 
>>>>>> have to add
>>>>>> psample socket fd to every handler's epoll_fd. So we have to 
>>>>>> change the dpif
>>>>>> somehow.
>>>>> There is no need to implement any epoll_fd logic for psample.
>>>>> You may only use handler with handler_id == 0.  Only this handler 
>>>>> will receive
>>>>> psample upcalls.  netdev_offload_recv_wait() should be implemented 
>>>>> similar
>>>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. 
>>>>> it will
>>>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>>>> There is no need to block and you're never blocking anywhere 
>>>>> because your're
>>>>> calling nl_sock_recv(..., wait = false) which will use 
>>>>> MSG_DONTWAIT while
>>>>> actually receiving a message.
>>>> Hi Ilya,
>>>>
>>>> With this new design, the code will be changed greatly. I'm not 
>>>> unwilling to change it.
>>>> The effort is not small. So before I start, I want to make sure 
>>>> that it is feasible.
>>>>
>>>> Yes, we won't block in nl_sock_recv(), but the handler thread will 
>>>> be blocked in poll_block().
>>>> If any of the vport netlink socket is readable , it will be waken 
>>>> up by kernel. If we don't use
>>>> epoll_ctl to add the psample netlink socket to the handler's epoll 
>>>> fd, we can't receive the
>>>> psample packet as soon as possible. That means we can only receive 
>>>> the sampled packet
>>>> when there is a miss upcall.
>>>>
>>>> I added some debug message in ovs epoll code and got above 
>>>> conclusion. But I'm not
>>>> the expert of the epoll API, so I'm not sure if I missed anything.
>>> Handler thread wakes up on POLLIN on epoll_fd itself, not on the 
>>> event on one
>>> of the file descriptors added to epoll.  epoll_fd is added to a 
>>> thread's poll
>>> loop by
>>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>>
>>> If you will call nl_sock_wait() on psample socket, this socket will 
>>> be added
>>> to the same poll loop with:
>>>    nl_sock_wait(psample_sock, POLLIN) ->
>>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>>
>>> This way psample socket will become an *additional* source for 
>>> waking up
>>> for the thread that called nl_sock_wait().  So, this handler will be 
>>> waken
>>> up if POLLIN happened on a psample socket even if there are no miss 
>>> upcalls.
>>>
>>> The whole epoll infrastructure is local to lib/dpif-netlink.c and 
>>> handler
>>> threads knows nothing about it.  poll_loop() knows nothing about 
>>> this epoll
>>> stuff too, it just adds epoll_fd itself to the list of fds for the 
>>> usual
>>> poll() because of the poll_fd_wait() call.  psample socket will end 
>>> up in
>>> the same usual poll().
>> Hi Ilya,
>>
>> The code according to your suggestion is ready. But during the 
>> internal code review,
>> Eli Britstein thought flow_api is netdev based, but the psample/sFlow 
>> offload is
>> not netdev based, but dpif based. Maybe we should introduce 
>> dpif-offload-provider
>> and have a dpf_offload_api, so it won't be related to a specific 
>> netdev, but to a dpif type.
>>
>> Could I know what's your opinion about it?
> Hi Ilya,
>
> Any comment about it?
>
> Thanks,
> Chris
>>
>> Thanks,
>> Chris
>>>
>>>> Thanks,
>>>> Chris
>>>>> So, there should be several call chains:
>>>>>
>>>>> 1. Init.
>>>>>
>>>>>      open_dpif_backer/type_run() ->
>>>>>        netdev_offload_recv_set() ->
>>>>>          netdev_offload_tc_recv_set(enabled) ->
>>>>>            if (enable)
>>>>>              psample_sock = nl_sock_create(...);
>>>>>            else
>>>>>              close(psample_sock);
>>>>>
>>>>> 2. Wait.
>>>>>
>>>>>      udpif_upcall_handler() ->
>>>>>        netdev_offload_recv_wait() ->
>>>>>          netdev_offload_tc_recv_wait() ->
>>>>>            if (handler_id == 0)
>>>>>              nl_sock_wait(psample_sock, POLLIN);
>>>>>
>>>>> 3. Receive.
>>>>>
>>>>>      udpif_upcall_handler() ->
>>>>>        recv_upcalls() ->
>>>>>          netdev_offload_recv() ->
>>>>>          | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>>>>          |   if (handler_id == 0)
>>>>>          |     nl_sock_recv(psample_socket, ..., wait = false);
>>>>>          |     *upcall = <received data>;
>>>>>          upcall_receive()
>>>>>          process_upcall() ->
>>>>>            dpif_sflow_received()
>>>>>
>>>>> 4. Deinit.
>>>>>
>>>>>      close_dpif_backer() ->
>>>>>        netdev_offload_recv_set(enable = false);
>>>>>
>>>>>
>>>>> Does that look more clear?
>>>>>
>>>>>> I don't know if I understand your suggestion correctly. Or I 
>>>>>> missed anything
>>>>>> So if you have time, could you please elaborate?
>>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>>>> userspace datapath.
>>>>>>>
>>>>>>> Above architecture also implies implementation of:
>>>>>>>     - netdev_offload_recv_wait()
>>>>>>>     - netdev_offload_recv_purge()
>>>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>>>
>>>>>>> Best regards, Ilya Maximets.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>>>
>>>>>>>>>> Psample is a genetlink channel for packet sampling. TC action 
>>>>>>>>>> act_sample
>>>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>>>
>>>>>>>>>> When offloading sample action to TC, userspace creates a 
>>>>>>>>>> unique ID to
>>>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel 
>>>>>>>>>> instead
>>>>>>>>>> of the sFlow info. psample will send this ID and sampled 
>>>>>>>>>> packet to
>>>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info 
>>>>>>>>>> and send
>>>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>>>
>>>>>>>>>> v2-v1:
>>>>>>>>>> - Fix robot errors.
>>>>>>>>>> v3-v2:
>>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>>>> - Add travis test result.
>>>>>>>>>> v4-v3:
>>>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>>>> v5-v4:
>>>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>>>> v6-v5:
>>>>>>>>>> - Rebase.
>>>>>>>>>> - Add GitHub Actions test result.
>>>>>>>>>> v7-v6:
>>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>>>> v8-v7
>>>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>>>> v9-v8
>>>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>>>> v10-v9
>>>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>>>> v11-v10
>>>>>>>>>> - Fix a bracing error.
>>>>>>>>>> v12-v11
>>>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>>>
>>>>>>>>>> Chris Mi (11):
>>>>>>>>>>       compat: Add psample and tc sample action defines for 
>>>>>>>>>> older kernels
>>>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>>>       netdev-offload: Introduce register sFlow upcall 
>>>>>>>>>> callback API
>>>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall 
>>>>>>>>>> callback API
>>>>>>>>>>       dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>>>
>>>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>>>      lib/dpif-netdev.c                |   1 +
>>>>>>>>>>      lib/dpif-netlink.c               |  27 ++
>>>>>>>>>>      lib/dpif-netlink.h               |   4 +
>>>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>>>      lib/dpif.c                       |   8 +
>>>>>>>>>>      lib/dpif.h                       |  23 ++
>>>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>>>      lib/netdev-offload-tc.c          | 659 
>>>>>>>>>> ++++++++++++++++++++++++++++++-
>>>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>>>      lib/tc.c                         |  61 ++-
>>>>>>>>>>      lib/tc.h                         |  16 +-
>>>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>>>
>>>>>>>>> Hi Simon, Ilya,
>>>>>>>>>
>>>>>>>>> Can you help review for this series?
>>>>>>>>> do you have any comments you want us to handle?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Roi
>>
>
Ilya Maximets May 18, 2021, 12:22 p.m. UTC | #16
On 4/27/21 3:23 AM, Chris Mi wrote:
> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>>>> Please see my reply below:
>>>>>
>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Could I know what should we do to make progress for this patch set?
>>>>>>> It has been posted in the community for a long time 😁
>>>>>> In general, the way to get your patches reviewed is to review other
>>>>>> patches.  It's simply because we still have a huge review backlog
>>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>>>> patches you're reducing amount of work for maintainers so they can
>>>>>> get to your patches faster.
>>>>> OK, I see.
>>>>>> For the series and comments from Eelco:
>>>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>>>> do not understand why we need a separate thread to poll psample events.
>>>>>> Why can't we just allow usual handler threads to do that?
>>>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>>>> netlink messages are different.
>>>> Hi.  Sorry for late reply.
>>>>
>>>> Yes, I understand that message format is different, but it doesn't really
>>>> matter.  All the message-specific handling and parsing should happen
>>>> inside the netdev_offload_tc_recv().  This function should have a prototype
>>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>>>> struct dpif_upcall from the caller and fill it with data.  Maybe other
>>>> type of a generic data structure if it's really not possible to construct
>>>> struct dpif_upcall.
>>>>
>>>>
>>>>>>      From the
>>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>>> from the offload-provider.  This introduces lots of complications
>>>>>> and might cause lots of issues in the future.
>>>>>>
>>>>>> I'd say that design should look like this:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        dpif_recv() ->
>>>>>>          dpif_netlink_recv() ->
>>>>>>            netdev_offload_recv() ->
>>>>>>              netdev_offload_tc_recv() ->
>>>>>>                nl_sock_recv()
>>>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>>>> allocate a special port number for psample and assign it to event.data.u32.
>>>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>>>> Last three calls should be implemented.
>>>>>>
>>>>>> The better version of this will be to throw away dpif part from
>>>>>> above call chain and make it:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        netdev_offload_recv() ->
>>>>>>          netdev_offload_tc_recv() ->
>>>>>>            nl_sock_recv()
>>>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>>>> somehow.
>>>> There is no need to implement any epoll_fd logic for psample.
>>>> You may only use handler with handler_id == 0.  Only this handler will receive
>>>> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
>>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>>> There is no need to block and you're never blocking anywhere because your're
>>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>>>> actually receiving a message.
>>> Hi Ilya,
>>>
>>> With this new design, the code will be changed greatly. I'm not unwilling to change it.
>>> The effort is not small. So before I start, I want to make sure that it is feasible.
>>>
>>> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
>>> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
>>> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
>>> psample packet as soon as possible. That means we can only receive the sampled packet
>>> when there is a miss upcall.
>>>
>>> I added some debug message in ovs epoll code and got above conclusion. But I'm not
>>> the expert of the epoll API, so I'm not sure if I missed anything.
>> Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
>> of the file descriptors added to epoll.  epoll_fd is added to a thread's poll
>> loop by
>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>
>> If you will call nl_sock_wait() on psample socket, this socket will be added
>> to the same poll loop with:
>>    nl_sock_wait(psample_sock, POLLIN) ->
>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>
>> This way psample socket will become an *additional* source for waking up
>> for the thread that called nl_sock_wait().  So, this handler will be waken
>> up if POLLIN happened on a psample socket even if there are no miss upcalls.
>>
>> The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
>> threads knows nothing about it.  poll_loop() knows nothing about this epoll
>> stuff too, it just adds epoll_fd itself to the list of fds for the usual
>> poll() because of the poll_fd_wait() call.  psample socket will end up in
>> the same usual poll().
> Hi Ilya,
> 
> The code according to your suggestion is ready. But during the internal code review,
> Eli Britstein thought flow_api is netdev based, but the psample/sFlow offload is
> not netdev based, but dpif based. Maybe we should introduce dpif-offload-provider
> and have a dpf_offload_api, so it won't be related to a specific netdev, but to a dpif type.
> 
> Could I know what's your opinion about it?

This might be not bad idea in general.  The problem is that if we'll introduce
this kind of API, we'll need to remove the usage of netdev-offload API from
all the layers higher than dpif implementation including dpif.c and make all
offloading go through dpif-offload-provider.

In general, I think, it might be good to have a separate dpif type, e.g.
dpif-offload that will use netdev-offload APIs internally and provide offloading
service.  In this case, dpif_operate(), depending on offload configuration, will
try to install flows to dpif-offload first and fallback to install flows to the
main dpif implementation if offload failed or if it was partial (i.e.
classification offloading).  This way we will delete all the offloading related
code from dpif-netdev and dpif-netlink and will have a unified offloading
architecture.  It's not easy to implement and there will be a lot of issues that
we'll need to figure out how to fix (e.g. dpif-netdev performs upcalls by itself,
so it will have to install resulted flows by calling a dpif_flow_put/operate and
not install flows directly to its local flow tables).  Anyway, this looks like a
long run and will require rework of the whole offloading architecture, so might
be a separate thing to work on.  Thoughts?

Best regards, Ilya Maximets.
Simon Horman May 19, 2021, 9:47 a.m. UTC | #17
On Tue, May 18, 2021 at 02:22:26PM +0200, Ilya Maximets wrote:
> On 4/27/21 3:23 AM, Chris Mi wrote:

...

> > Hi Ilya,
> > 
> > The code according to your suggestion is ready. But during the internal
> > code review, Eli Britstein thought flow_api is netdev based, but the
> > psample/sFlow offload is not netdev based, but dpif based. Maybe we
> > should introduce dpif-offload-provider and have a dpf_offload_api, so
> > it won't be related to a specific netdev, but to a dpif type.
> > 
> > Could I know what's your opinion about it?
> 
> This might be not bad idea in general.  The problem is that if we'll
> introduce this kind of API, we'll need to remove the usage of
> netdev-offload API from all the layers higher than dpif implementation
> including dpif.c and make all offloading go through
> dpif-offload-provider.
> 
> In general, I think, it might be good to have a separate dpif type, e.g.
> dpif-offload that will use netdev-offload APIs internally and provide
> offloading service.  In this case, dpif_operate(), depending on offload
> configuration, will try to install flows to dpif-offload first and
> fallback to install flows to the main dpif implementation if offload
> failed or if it was partial (i.e.  classification offloading).  This way
> we will delete all the offloading related code from dpif-netdev and
> dpif-netlink and will have a unified offloading architecture.  It's not
> easy to implement and there will be a lot of issues that we'll need to
> figure out how to fix (e.g. dpif-netdev performs upcalls by itself, so it
> will have to install resulted flows by calling a dpif_flow_put/operate
> and not install flows directly to its local flow tables).  Anyway, this
> looks like a long run and will require rework of the whole offloading
> architecture, so might be a separate thing to work on.  Thoughts?

Hi Ilya, Hi Chris,

I think that this approach makes sense in terms of an evolution of
the internal offload APIs/architecture. And might lean itself to more
advanced options in future - such as layered dpif implementations for
some purpose.

I do also there will be quite a lot of implementation challenges
to overcome.

Kind regards,
Simon
Eli Britstein May 25, 2021, 11:19 a.m. UTC | #18
On 5/19/2021 12:47 PM, Simon Horman wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, May 18, 2021 at 02:22:26PM +0200, Ilya Maximets wrote:
>> On 4/27/21 3:23 AM, Chris Mi wrote:
> ...
>
>>> Hi Ilya,
>>>
>>> The code according to your suggestion is ready. But during the internal
>>> code review, Eli Britstein thought flow_api is netdev based, but the
>>> psample/sFlow offload is not netdev based, but dpif based. Maybe we
>>> should introduce dpif-offload-provider and have a dpf_offload_api, so
>>> it won't be related to a specific netdev, but to a dpif type.
>>>
>>> Could I know what's your opinion about it?
>> This might be not bad idea in general.  The problem is that if we'll
>> introduce this kind of API, we'll need to remove the usage of
>> netdev-offload API from all the layers higher than dpif implementation
>> including dpif.c and make all offloading go through
>> dpif-offload-provider.
>>
>> In general, I think, it might be good to have a separate dpif type, e.g.
>> dpif-offload that will use netdev-offload APIs internally and provide
>> offloading service.  In this case, dpif_operate(), depending on offload
>> configuration, will try to install flows to dpif-offload first and
>> fallback to install flows to the main dpif implementation if offload
>> failed or if it was partial (i.e.  classification offloading).  This way
>> we will delete all the offloading related code from dpif-netdev and
>> dpif-netlink and will have a unified offloading architecture.  It's not
>> easy to implement and there will be a lot of issues that we'll need to
>> figure out how to fix (e.g. dpif-netdev performs upcalls by itself, so it
>> will have to install resulted flows by calling a dpif_flow_put/operate
>> and not install flows directly to its local flow tables).  Anyway, this
>> looks like a long run and will require rework of the whole offloading
>> architecture, so might be a separate thing to work on.  Thoughts?
> Hi Ilya, Hi Chris,
>
> I think that this approach makes sense in terms of an evolution of
> the internal offload APIs/architecture. And might lean itself to more
> advanced options in future - such as layered dpif implementations for
> some purpose.
>
> I do also there will be quite a lot of implementation challenges
> to overcome.

Hi Ilya, Simon,

This approach was initially thought of following the suggested design to 
have:

      handler thread ->
        netdev_offload_recv() ->
          netdev_offload_tc_recv() ->
            nl_sock_recv()

The ofproto layer does not have any 'netdev' to pass, yet calling a new 
netdev based API (in flow_api) netdev_offload_recv.

The initial approach was to pass const char *dpif_type, and in 
netdev-offload-tc.c just take the first netdev of this dpif type, and 
work on it.

However, this didn't fit right, so we thought to have this new layer, 
but not as this layered design, but only to serve this case, at least at 
first.

I also agree with Simon that there will be a lot of challenges.

How about we will have phases towards such a complete design, also to 
better cope with those challenges?

At the first phase we will just introduce such a layer (struct 
offload_api in struct dpif for example), with a dpif-offload-provider.h 
to declare it.

There will be functions like dpif_offload_recv(), to be used for this 
sFlow support.

At later phase, the further architecture refactoring (layering) can be done.

What do you think?

Thanks,

Eli

>
> Kind regards,
> Simon
>
>
Chris Mi June 8, 2021, 8:08 a.m. UTC | #19
On 5/25/2021 7:19 PM, Eli Britstein wrote:
>
> On 5/19/2021 12:47 PM, Simon Horman wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Tue, May 18, 2021 at 02:22:26PM +0200, Ilya Maximets wrote:
>>> On 4/27/21 3:23 AM, Chris Mi wrote:
>> ...
>>
>>>> Hi Ilya,
>>>>
>>>> The code according to your suggestion is ready. But during the 
>>>> internal
>>>> code review, Eli Britstein thought flow_api is netdev based, but the
>>>> psample/sFlow offload is not netdev based, but dpif based. Maybe we
>>>> should introduce dpif-offload-provider and have a dpf_offload_api, so
>>>> it won't be related to a specific netdev, but to a dpif type.
>>>>
>>>> Could I know what's your opinion about it?
>>> This might be not bad idea in general.  The problem is that if we'll
>>> introduce this kind of API, we'll need to remove the usage of
>>> netdev-offload API from all the layers higher than dpif implementation
>>> including dpif.c and make all offloading go through
>>> dpif-offload-provider.
>>>
>>> In general, I think, it might be good to have a separate dpif type, 
>>> e.g.
>>> dpif-offload that will use netdev-offload APIs internally and provide
>>> offloading service.  In this case, dpif_operate(), depending on offload
>>> configuration, will try to install flows to dpif-offload first and
>>> fallback to install flows to the main dpif implementation if offload
>>> failed or if it was partial (i.e.  classification offloading).  This 
>>> way
>>> we will delete all the offloading related code from dpif-netdev and
>>> dpif-netlink and will have a unified offloading architecture. It's not
>>> easy to implement and there will be a lot of issues that we'll need to
>>> figure out how to fix (e.g. dpif-netdev performs upcalls by itself, 
>>> so it
>>> will have to install resulted flows by calling a dpif_flow_put/operate
>>> and not install flows directly to its local flow tables). Anyway, this
>>> looks like a long run and will require rework of the whole offloading
>>> architecture, so might be a separate thing to work on. Thoughts?
>> Hi Ilya, Hi Chris,
>>
>> I think that this approach makes sense in terms of an evolution of
>> the internal offload APIs/architecture. And might lean itself to more
>> advanced options in future - such as layered dpif implementations for
>> some purpose.
>>
>> I do also there will be quite a lot of implementation challenges
>> to overcome.
>
> Hi Ilya, Simon,
>
> This approach was initially thought of following the suggested design 
> to have:
>
>      handler thread ->
>        netdev_offload_recv() ->
>          netdev_offload_tc_recv() ->
>            nl_sock_recv()
>
> The ofproto layer does not have any 'netdev' to pass, yet calling a 
> new netdev based API (in flow_api) netdev_offload_recv.
>
> The initial approach was to pass const char *dpif_type, and in 
> netdev-offload-tc.c just take the first netdev of this dpif type, and 
> work on it.
>
> However, this didn't fit right, so we thought to have this new layer, 
> but not as this layered design, but only to serve this case, at least 
> at first.
>
> I also agree with Simon that there will be a lot of challenges.
>
> How about we will have phases towards such a complete design, also to 
> better cope with those challenges?
>
> At the first phase we will just introduce such a layer (struct 
> offload_api in struct dpif for example), with a 
> dpif-offload-provider.h to declare it.
>
> There will be functions like dpif_offload_recv(), to be used for this 
> sFlow support.
>
> At later phase, the further architecture refactoring (layering) can be 
> done.
>
> What do you think?
Hi Ilya,

Any comments about Eli's suggestion?

Thanks,
Chris