[ovs-dev,RFC,0/3] Add ovn drop debugging

Message ID: 20220425111724.2981776-1-amorenoz@redhat.com

Adrian Moreno April 25, 2022, 11:17 a.m. UTC
Very often when troubleshooting networking issues in an OVN cluster one
would like to know if any packet (or a specific one) is being dropped by
OVN.

Currently, this cannot be known because of two main reasons:

1 - Implicit drops: Some tables do not have a default action
(priority=0, match=1). In this case, a packet that does not match any
rule will be silently dropped.

2 - Even on explicit drops, we only know a packet was dropped. We lack
information about that packet.

In order to improve this, this RFC proposes a two-fold solution:

- First, create a debug-mode option that makes northd add a default
  "drop;" action on those tables that currently lack one.

- Secondly, allow sampling of all drops. By introducing a new OVN
  action: "sample" (equivalent to OVS's), OVN can make OVS sample the
  packets as they are dropped and insert the first 32 bits of the
  Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
  ObservationPointId. That way a collector can see the packet's
  header information as well as what Logical Flow dropped it.
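
To make this concrete, here is a purely illustrative sketch of what a
generated default-drop logical flow and its OpenFlow translation could
look like. The stage, UUID, table numbers and collector_set_id below are
invented, and the exact action syntax is still up for discussion:

  # Hypothetical SB logical flow added by northd with drop debugging on:
  _uuid    : 12345678-abcd-4ef0-9876-0123456789ab
  pipeline : ingress
  priority : 0
  match    : "1"
  actions  : "sample(probability=65535, collector_set_id=100,
                     obs_point_id=$cookie); drop;"

  # Corresponding OpenFlow flow installed by ovn-controller. The cookie is
  # the first 32 bits of the lflow UUID and is reused as obs_point_id; the
  # absence of an output action is what actually drops the packet:
  cookie=0x12345678, table=27, priority=0, metadata=0x5
      actions=sample(probability=65535,collector_set_id=100,
                     obs_point_id=0x12345678)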


This RFC has some limitations I'd like some specific
feedback/guidance on:

* Per-datapath flows
Even if a Logical Flow is created with "match=1", the controller will
insert the datapath metadata match. This might be good enough for most
cases, but it could miss packets if there's a bug in OVN. A possible
approach could be to propagate the "drop-debug" configuration to the
SB and make the controller insert the default drops directly; but
without a Logical Flow, how would we trace it back?
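
For illustration, the difference boils down to something like this
(flow contents invented):

  # Today: one OpenFlow drop/sample flow per datapath, because of the
  # metadata match that ovn-controller inserts:
  cookie=0x12345678, table=27, priority=0, metadata=0x1 actions=sample(...)
  cookie=0x12345678, table=27, priority=0, metadata=0x2 actions=sample(...)

  # A controller-generated catch-all would also cover packets carrying an
  # unexpected metadata value, but it has no logical flow (and therefore
  # no cookie) to report in the sample:
  table=27, priority=0 actions=sample(...)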

Another approach (suggested by Dumitru) could be to have OVN detect that
an lflow actually applies to all datapaths and remove the metadata
match, which would also reduce the number of OpenFlow flows.

* Use of ObservationPointID
In this RFC, I just used the ObservationPointID (IPFIX element 138)
because it's already supported in OVS's NXAST_SAMPLE. This allows us
to encode 32 bits, which is enough for the cookie. If we wanted to
encode more information we'd have to look for another IPFIX element.
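
As a purely illustrative example of how a collector-side value would be
traced back (UUID invented):

  # The collector reports observationPointId 305419896:
  printf '%x\n' 305419896        # -> 12345678

  # That value is also the cookie of the OpenFlow flow that sampled the
  # packet...
  ovs-ofctl dump-flows br-int | grep 'cookie=0x12345678'

  # ...and the first 32 bits of the Logical_Flow UUID in the southbound DB:
  ovn-sbctl list Logical_Flow | grep -A8 '12345678-'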

Adrian Moreno (3):
  actions: add sample action
  northd: add drop_debugging option
  debug: add sampling of drop actions

 include/ovn/actions.h |  11 ++++
 lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
 northd/automake.mk    |   2 +
 northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
 northd/debug.h        |  41 ++++++++++++++
 northd/northd.c       | 125 ++++++++++++++++++++++++++++--------------
 ovn-nb.xml            |  29 ++++++++++
 tests/ovn.at          |  10 +++-
 tests/test-ovn.c      |   2 +
 utilities/ovn-trace.c |   3 +
 10 files changed, 390 insertions(+), 42 deletions(-)
 create mode 100644 northd/debug.c
 create mode 100644 northd/debug.h

Comments

Mark Michelson May 9, 2022, 8:41 p.m. UTC | #1
Hi Adrian,

Since this is an RFC series, I haven't taken a deep look at the code, so 
I apologize if my comments below are not correct.

On 4/25/22 07:17, Adrian Moreno wrote:
> Very often when troubleshooting networking issues in an OVN cluster one
> would like to know if any packet (or a specific one) is being dropped by
> OVN.
> 
> Currently, this cannot be known because of two main reasons:
> 
> 1 - Implicit drops: Some tables do not have a default action
> (priority=0, match=1). In this case, a packet that does not match any
> rule will be silently dropped.
> 
> 2 - Even on explicit drops, we only know a packet was dropped. We lack
> information about that packet.
> 
> In order to improve this, this RFC proposes a two-fold solution:
> 
> - First, create a debug-mode option that makes northd add a default
>    "drop;" action on those tables that currently lack one.
> 
> - Secondly, allow sampling of all drops. By introducing a new OVN
>    action: "sample" (equivalent to OVS's), OVN can make OVS sample the
>    packets as they are dropped and insert the first 32 bits of the
>    Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
>    ObservationPointId. That way a collector can see the packet's
>    header information as well as what Logical Flow dropped it.
> 
> 
> This RFC has some limitations I'd like some specific
> feedback/guidance on:
> 
> * Per-datapath flows
> Even if a Logical Flow is created with "match=1", the controller will
> insert the datapath metadata match. This might be good enough for most
> cases but could miss packets if there's a bug in OVN. A possible
> approach could be to propagate the "drop-debug" configuration to the
> SB and make the controller insert the default drops but without a
> Logical Flow, how would we trace it back?

I think the most important data points to track are:
* The packet contents
* The logical datapath where the drop occurred (i.e. metadata)
* (possibly) register values at the time of the drop

Since the logical datapath is important, I think we're going to end up 
with an OF flow per datapath.

The question still remains whether we need to add explicit drop and 
sample logical flows to the southbound database.

I think having logical flows for the explicit drops and samples is a 
good idea for the time being. I think it adds clarity to what OVN is 
doing. The only reason we should move the logic down to ovn-controller 
is if our testing shows significant performance degradation. And I do 
mean *significant*. Since this is something that is only intended to be 
turned on conditionally, it doesn't need to be lightning fast. However, 
if turning on drop debugging causes everything to grind to a halt, then 
we probably should look into optimizing the process.

> 
> Another approach (suggested by Dumitru) could be to have OVN detect that
> a lflow actually applies to all datapaths and remove the metadata
> match which would also remove the number of Openflow flows.

I guess my question here is how we will know on which logical datapath 
the packet was being processed when the drop occurred. If you can set 
the observation point ID to the value of the OXM_METADATA register, then 
this suggestion would work fine. Otherwise, I don't know how you'd 
reduce this to a single flow and still get the necessary information in 
the sample.

> 
> * Use of ObservationPointID
> In this RFC, I just used the ObservationPointID (IPFIX element 138)
> because it's already supported in the OVS's NXAST_SAMPLE. This allows us
> to encode 32bits which is good enough for the cookie. If we wanted to
> encode more information we'd have to look for another IPFIX element.

I had mentioned above that having the data from the registers as part of 
the sample is good for debugging purposes, but I understand this may be 
a big ask. Right now you're putting the logical flow UUID here, and 
that's fine. It's easy to trace that back to a particular datapath. The 
alternative would be to use the UUID of the datapath directly instead.

> 
> Adrian Moreno (3):
>    actions: add sample action
>    northd: add drop_debugging option
>    debug: add sampling of drop actions
> 
>   include/ovn/actions.h |  11 ++++
>   lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
>   northd/automake.mk    |   2 +
>   northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
>   northd/debug.h        |  41 ++++++++++++++
>   northd/northd.c       | 125 ++++++++++++++++++++++++++++--------------
>   ovn-nb.xml            |  29 ++++++++++
>   tests/ovn.at          |  10 +++-
>   tests/test-ovn.c      |   2 +
>   utilities/ovn-trace.c |   3 +
>   10 files changed, 390 insertions(+), 42 deletions(-)
>   create mode 100644 northd/debug.c
>   create mode 100644 northd/debug.h
>
Adrian Moreno May 10, 2022, 6:23 a.m. UTC | #2
On 5/9/22 22:41, Mark Michelson wrote:
> Hi Adrian,
> 
> Since this is an RFC series, I haven't taken a deep look at the code, so I 
> apologize if my comments below are not correct.
> 

Thank you very much for the feedback Mark.

> On 4/25/22 07:17, Adrian Moreno wrote:
>> Very often when troubleshooting networking issues in an OVN cluster one
>> would like to know if any packet (or a specific one) is being dropped by
>> OVN.
>>
>> Currently, this cannot be known because of two main reasons:
>>
>> 1 - Implicit drops: Some tables do not have a default action
>> (priority=0, match=1). In this case, a packet that does not match any
>> rule will be silently dropped.
>>
>> 2 - Even on explicit drops, we only know a packet was dropped. We lack
>> information about that packet.
>>
>> In order to improve this, this RFC proposes a two-fold solution:
>>
>> - First, create a debug-mode option that makes northd add a default
>>    "drop;" action on those tables that currently lack one.
>>
>> - Secondly, allow sampling of all drops. By introducing a new OVN
>>    action: "sample" (equivalent to OVS's), OVN can make OVS sample the
>>    packets as they are dropped and insert the first 32 bits of the
>>    Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
>>    ObservationPointId. That way a collector can see the packet's
>>    header information as well as what Logical Flow dropped it.
>>
>>
>> This RFC has some limitations I'd like some specific
>> feedback/guidance on:
>>
>> * Per-datapath flows
>> Even if a Logical Flow is created with "match=1", the controller will
>> insert the datapath metadata match. This might be good enough for most
>> cases but could miss packets if there's a bug in OVN. A possible
>> approach could be to propagate the "drop-debug" configuration to the
>> SB and make the controller insert the default drops but without a
>> Logical Flow, how would we trace it back?
> 
> I think the most important data points to track are:
> * The packet contents
> * The logical datapath where the drop occurred (i.e. metadata)

See below thoughts about how to do this.

> * (possibly) register values at the time of the drop

These metadata fields are not in the datapath; they only exist at the
OpenFlow classification level, right? So we would have to make the OpenFlow
"sample()" action dynamic and insert arbitrary fields into it.


> Since the logical datapath is important, I think we're going to end up with an 
> OF flow per datapath.
> 
> The question still remains whether we need to add explicit drop and sample 
> logical flows to the southbound database.
> 
> I think having logical flows for the explicit drops and samples is a good idea 
> for the time being. I think it adds clarity to what OVN is doing. The only 
> reason we should move the logic down to ovn-controller is if our testing shows 
> significant performance degradation. And I do mean *significant*. Since this is 
> something that is only intended to be turned on conditionally, it doesn't need 
> to be lightning fast. However, if turning on drop debugging causes everything to 
> grind to a halt, then we probably should look into optimizing the process.
> 
>>
>> Another approach (suggested by Dumitru) could be to have OVN detect that
>> a lflow actually applies to all datapaths and remove the metadata
>> match which would also remove the number of Openflow flows.
> 
> I guess my question here is how we will know on which logical datapath the 
> packet was being processed when the drop occurred. If you can set the 
> observation point ID to the value of the OXM_METADATA register, then this 
> suggestion would work fine. Otherwise, I don't know how you'd reduce this to a 
> single flow and still get the necessary information in the sample.
> 

Right, we would lose that information since we cannot set the observation
point ID dynamically (i.e., it's hardcoded in the OpenFlow flow), so if we
optimize this down to a single OpenFlow flow, we can only have a single
observation point ID.

In fact, will we have the same problem with the current implementation? Right
now northd will create an OVS action such as "sample(obs_point_id=$cookie, ....)"
and the controller will replace it with the first 32 bits of the lflow UUID.
But the lflow is the same for all datapaths, right?

If so, I see these ways forward:

A) We modify the "sample" OpenFlow flow to allow the controller to add more 
fields to the sample. For instance, OVS currently supports an arbitrary string 
field "virtual_obs_id" but only per IPFIX target.

B) We use the observation domain ID.

This has some complexity: my current approach was to have the users specify the
obs_domain_id to use for ovn-drop-debugging so they can make sure it is not the
same as the one used for other kinds of sampling (e.g., per-flow sampling of
ACLs, bridge-level sampling, etc.). This is because obs_point_ids should be
unique per obs_domain_id, so we can use this field to separate "applications" at
the collector level. My questions would be:
What's the maximum number of datapaths? Is it less than 2^32, so we can reserve
a domain_id for each and still leave space for other applications?
Is the number of datapaths per controller more limited in theory (I'd guess it
is in practice)?

C) Have control over the IPFIX table in OVS.

We're currently relying on the users to create rows in the IPFIX and
Flow_Sample_Collector_Set tables on every bridge they want to sample. If
ovn-controller takes control over those tables, it could add a row (with a
different virtual_obs_id) for each datapath. But the problem is the same as
in B): how to co-exist with other user-configured IPFIX sampling applications.
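
For context, the per-bridge configuration we currently expect users to create
by hand is roughly the following (bridge name, collector address and set id
are placeholders):

  ovs-vsctl -- --id=@br get Bridge br-int \
            -- --id=@i create IPFIX targets=\"192.0.2.10:4739\" \
            -- create Flow_Sample_Collector_Set id=100 bridge=@br ipfix=@i

  # The id above has to match the collector_set_id used in the sample() action.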

>>
>> * Use of ObservationPointID
>> In this RFC, I just used the ObservationPointID (IPFIX element 138)
>> because it's already supported in the OVS's NXAST_SAMPLE. This allows us
>> to encode 32bits which is good enough for the cookie. If we wanted to
>> encode more information we'd have to look for another IPFIX element.
> 
> I had mentioned above that having the data from the registers as part of the 
> sample is good for debugging purposes, but I understand this may be a big ask. 
> Right now you're putting the logical flow UUID here, and that's fine. It's easy 
> to trace that back to a particular datapath. The alternative would be to use the 
> UUID of the datapath directly instead.
> 

See above.

>>
>> Adrian Moreno (3):
>>    actions: add sample action
>>    northd: add drop_debugging option
>>    debug: add sampling of drop actions
>>
>>   include/ovn/actions.h |  11 ++++
>>   lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
>>   northd/automake.mk    |   2 +
>>   northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
>>   northd/debug.h        |  41 ++++++++++++++
>>   northd/northd.c       | 125 ++++++++++++++++++++++++++++--------------
>>   ovn-nb.xml            |  29 ++++++++++
>>   tests/ovn.at          |  10 +++-
>>   tests/test-ovn.c      |   2 +
>>   utilities/ovn-trace.c |   3 +
>>   10 files changed, 390 insertions(+), 42 deletions(-)
>>   create mode 100644 northd/debug.c
>>   create mode 100644 northd/debug.h
>>
>
Mark Michelson May 10, 2022, 1:24 p.m. UTC | #3
On 5/10/22 02:23, Adrian Moreno wrote:
> 
> 
> On 5/9/22 22:41, Mark Michelson wrote:
>> Hi Adrian,
>>
>> Since this is an RFC series, I haven't taken a deep look at the code, 
>> so I apologize if my comments below are not correct.
>>
> 
> Thank you very much for the feedback Mark.
> 
>> On 4/25/22 07:17, Adrian Moreno wrote:
>>> Very often when troubleshooting networking issues in an OVN cluster one
>>> would like to know if any packet (or a specific one) is being dropped by
>>> OVN.
>>>
>>> Currently, this cannot be known because of two main reasons:
>>>
>>> 1 - Implicit drops: Some tables do not have a default action
>>> (priority=0, match=1). In this case, a packet that does not match any
>>> rule will be silently dropped.
>>>
>>> 2 - Even on explicit drops, we only know a packet was dropped. We lack
>>> information about that packet.
>>>
>>> In order to improve this, this RFC proposes a two-fold solution:
>>>
>>> - First, create a debug-mode option that makes northd add a default
>>>    "drop;" action on those tables that currently lack one.
>>>
>>> - Secondly, allow sampling of all drops. By introducing a new OVN
>>>    action: "sample" (equivalent to OVS's), OVN can make OVS sample the
>>>    packets as they are dropped and insert the first 32 bits of the
>>>    Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
>>>    ObservationPointId. That way a collector can see the packet's
>>>    header information as well as what Logical Flow dropped it.
>>>
>>>
>>> This RFC has some limitations I'd like some specific
>>> feedback/guidance on:
>>>
>>> * Per-datapath flows
>>> Even if a Logical Flow is created with "match=1", the controller will
>>> insert the datapath metadata match. This might be good enough for most
>>> cases but could miss packets if there's a bug in OVN. A possible
>>> approach could be to propagate the "drop-debug" configuration to the
>>> SB and make the controller insert the default drops but without a
>>> Logical Flow, how would we trace it back?
>>
>> I think the most important data points to track are:
>> * The packet contents
>> * The logical datapath where the drop occurred (i.e. metadata)
> 
> See below thoughts about how to do this.
> 
>> * (possibly) register values at the time of the drop
> 
> This metadata fields are not in the datapath, they only exist at the 
> openflow classification level, rigth? So we would have to make OpenFlow 
> "sample()" dynamic and insert arbitrary fields in it.

Having the register values is mostly wishful thinking on my part :)

It's not something that is 100% required; it's just something I thought 
could be useful when debugging a dropped packet.

> 
> 
>> Since the logical datapath is important, I think we're going to end up 
>> with an OF flow per datapath.
>>
>> The question still remains whether we need to add explicit drop and 
>> sample logical flows to the southbound database.
>>
>> I think having logical flows for the explicit drops and samples is a 
>> good idea for the time being. I think it adds clarity to what OVN is 
>> doing. The only reason we should move the logic down to ovn-controller 
>> is if our testing shows significant performance degradation. And I do 
>> mean *significant*. Since this is something that is only intended to 
>> be turned on conditionally, it doesn't need to be lightning fast. 
>> However, if turning on drop debugging causes everything to grind to a 
>> halt, then we probably should look into optimizing the process.
>>
>>>
>>> Another approach (suggested by Dumitru) could be to have OVN detect that
>>> a lflow actually applies to all datapaths and remove the metadata
>>> match which would also remove the number of Openflow flows.
>>
>> I guess my question here is how we will know on which logical datapath 
>> the packet was being processed when the drop occurred. If you can set 
>> the observation point ID to the value of the OXM_METADATA register, 
>> then this suggestion would work fine. Otherwise, I don't know how 
>> you'd reduce this to a single flow and still get the necessary 
>> information in the sample.
>>
> 
> Right, we would loose that information since we cannot set the 
> observation point ID dynamically (i.e: it's hardcoded in the of flow) so 
> if we optimize out and only add a single of flow, we can only have a 
> single observation point ID.
> 
> In fact, will we have the same problem with the current implementation? 
> Right now northd will create a ovs action as 
> "sample(obs_point_id=$cookie, ....)" and the controller will replace it 
> with the first 32bits of the lflow UUID. But the lflow is the same for 
> all datapaths, right?

That's a good question. If logical datapath groups are disabled, then 
you should definitely have a logical flow per datapath, even if the 
matches and actions are the same across all datapaths. With logical 
datapath groups enabled (the default), those similar flows are factored 
into a single logical flow. I think in that scenario the lflow UUID does 
not work to identify the logical datapath.

Some good alternatives to using the lflow UUID as a datapath identifier 
would be

* The value of the OXM_METADATA register.
* The southbound datapath_binding UUID.

In either case, you will end up with an OF flow per datapath. Again I 
want to stress that this is fine, especially since this likely will only 
be enabled while actively debugging.

> 
> If so, I see these ways forward:
> 
> A) We modify the "sample" OpenFlow flow to allow the controller to add 
> more fields to the sample. For instance, OVS currently supports an 
> arbitrary string field "virtual_obs_id" but only per IPFIX target.
> 
> B) We use the observation domain ID.
> 
> This has some complexity: My current approach was to have the users 
> specify the obs_domain_ID to use for ovn-drop-debugging so they can make 
> sure it is not the same as the one used for other kind of sampling (e.g: 
> per flow sampling of ACLs, bridge-level sampling, etc). This is because 
> obs_point_ids should be unique per obs_domain_id, so we can use this 
> field to separate "applications" at the collector level. My questions 
> would be:
> What's the maximum number of datapaths? Is it less than 2^32 so we can 
> reserve a domain_id for each and still leave space for other applications?
> Is the number of datapaths per controller more limited in theory (I'd 
> guess in practice it is)?
> 
> C) Have control over the IPFIX table in OVS.
> 
> We're currently relying on the users to create rows in the IPFIX and 
> Flow_Sample_Collector_Set Tables on every bridge they want to sample. If 
> ovn-controller takes control over those tables it could add a row (with 
> a different virtual_obs_point_id) for each datapath. But the problem is 
> the same as in B): How to co-exist with other user-configured IPFIX 
> sampling applications.
> 

This is where my lack of knowledge of IPFIX hinders my ability to give 
good direction. However, based on what I was saying above, would the 
idea of using an alternative identifier than the logical datapath UUID 
do the job without having to take any of the above measures?

>>>
>>> * Use of ObservationPointID
>>> In this RFC, I just used the ObservationPointID (IPFIX element 138)
>>> because it's already supported in the OVS's NXAST_SAMPLE. This allows us
>>> to encode 32bits which is good enough for the cookie. If we wanted to
>>> encode more information we'd have to look for another IPFIX element.
>>
>> I had mentioned above that having the data from the registers as part 
>> of the sample is good for debugging purposes, but I understand this 
>> may be a big ask. Right now you're putting the logical flow UUID here, 
>> and that's fine. It's easy to trace that back to a particular 
>> datapath. The alternative would be to use the UUID of the datapath 
>> directly instead.
>>
> 
> See above.
> 
>>>
>>> Adrian Moreno (3):
>>>    actions: add sample action
>>>    northd: add drop_debugging option
>>>    debug: add sampling of drop actions
>>>
>>>   include/ovn/actions.h |  11 ++++
>>>   lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
>>>   northd/automake.mk    |   2 +
>>>   northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
>>>   northd/debug.h        |  41 ++++++++++++++
>>>   northd/northd.c       | 125 ++++++++++++++++++++++++++++--------------
>>>   ovn-nb.xml            |  29 ++++++++++
>>>   tests/ovn.at          |  10 +++-
>>>   tests/test-ovn.c      |   2 +
>>>   utilities/ovn-trace.c |   3 +
>>>   10 files changed, 390 insertions(+), 42 deletions(-)
>>>   create mode 100644 northd/debug.c
>>>   create mode 100644 northd/debug.h
>>>
>>
>
Adrian Moreno May 10, 2022, 1:58 p.m. UTC | #4
On 5/10/22 15:24, Mark Michelson wrote:
> On 5/10/22 02:23, Adrian Moreno wrote:
>>
>>
>> On 5/9/22 22:41, Mark Michelson wrote:
>>> Hi Adrian,
>>>
>>> Since this is an RFC series, I haven't taken a deep look at the code, so I 
>>> apologize if my comments below are not correct.
>>>
>>
>> Thank you very much for the feedback Mark.
>>
>>> On 4/25/22 07:17, Adrian Moreno wrote:
>>>> Very often when troubleshooting networking issues in an OVN cluster one
>>>> would like to know if any packet (or a specific one) is being dropped by
>>>> OVN.
>>>>
>>>> Currently, this cannot be known because of two main reasons:
>>>>
>>>> 1 - Implicit drops: Some tables do not have a default action
>>>> (priority=0, match=1). In this case, a packet that does not match any
>>>> rule will be silently dropped.
>>>>
>>>> 2 - Even on explicit drops, we only know a packet was dropped. We lack
>>>> information about that packet.
>>>>
>>>> In order to improve this, this RFC proposes a two-fold solution:
>>>>
>>>> - First, create a debug-mode option that makes northd add a default
>>>>    "drop;" action on those tables that currently lack one.
>>>>
>>>> - Secondly, allow sampling of all drops. By introducing a new OVN
>>>>    action: "sample" (equivalent to OVS's), OVN can make OVS sample the
>>>>    packets as they are dropped and insert the first 32 bits of the
>>>>    Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
>>>>    ObservationPointId. That way a collector can see the packet's
>>>>    header information as well as what Logical Flow dropped it.
>>>>
>>>>
>>>> This RFC has some limitations I'd like some specific
>>>> feedback/guidance on:
>>>>
>>>> * Per-datapath flows
>>>> Even if a Logical Flow is created with "match=1", the controller will
>>>> insert the datapath metadata match. This might be good enough for most
>>>> cases but could miss packets if there's a bug in OVN. A possible
>>>> approach could be to propagate the "drop-debug" configuration to the
>>>> SB and make the controller insert the default drops but without a
>>>> Logical Flow, how would we trace it back?
>>>
>>> I think the most important data points to track are:
>>> * The packet contents
>>> * The logical datapath where the drop occurred (i.e. metadata)
>>
>> See below thoughts about how to do this.
>>
>>> * (possibly) register values at the time of the drop
>>
>> This metadata fields are not in the datapath, they only exist at the openflow 
>> classification level, rigth? So we would have to make OpenFlow "sample()" 
>> dynamic and insert arbitrary fields in it.
> 
> Having the register values is mostly wishful thinking on my part :)
> 
> It's not something that is 100% required; it's just something I thought could be 
> useful when debugging a dropped packet.
> 

OK.

>>
>>
>>> Since the logical datapath is important, I think we're going to end up with 
>>> an OF flow per datapath.
>>>
>>> The question still remains whether we need to add explicit drop and sample 
>>> logical flows to the southbound database.
>>>
>>> I think having logical flows for the explicit drops and samples is a good 
>>> idea for the time being. I think it adds clarity to what OVN is doing. The 
>>> only reason we should move the logic down to ovn-controller is if our testing 
>>> shows significant performance degradation. And I do mean *significant*. Since 
>>> this is something that is only intended to be turned on conditionally, it 
>>> doesn't need to be lightning fast. However, if turning on drop debugging 
>>> causes everything to grind to a halt, then we probably should look into 
>>> optimizing the process.
>>>
>>>>
>>>> Another approach (suggested by Dumitru) could be to have OVN detect that
>>>> a lflow actually applies to all datapaths and remove the metadata
>>>> match which would also remove the number of Openflow flows.
>>>
>>> I guess my question here is how we will know on which logical datapath the 
>>> packet was being processed when the drop occurred. If you can set the 
>>> observation point ID to the value of the OXM_METADATA register, then this 
>>> suggestion would work fine. Otherwise, I don't know how you'd reduce this to 
>>> a single flow and still get the necessary information in the sample.
>>>
>>
>> Right, we would loose that information since we cannot set the observation 
>> point ID dynamically (i.e: it's hardcoded in the of flow) so if we optimize 
>> out and only add a single of flow, we can only have a single observation point 
>> ID.
>>
>> In fact, will we have the same problem with the current implementation? Right 
>> now northd will create a ovs action as "sample(obs_point_id=$cookie, ....)" 
>> and the controller will replace it with the first 32bits of the lflow UUID. 
>> But the lflow is the same for all datapaths, right?
> 
> That's a good question. If logical datapath groups are disabled, then you should 
> definitely have a logical flow per datapath, even if the matches and actions are 
> the same across all datapaths. With logical datapath groups enabled (the 
> default), those similar flows are factored into a single logical flow. I think 
> in that scenario the lflow UUID does not work to identify the logical datapath.
> 
> Some good alternatives to using the lflow UUID as a datapath identifier would be
> 
> * The value of the OXM_METADATA register.
> * The southbound datapath_binding UUID.
> 

If we use either of these identifiers, we would know which datapath dropped the
packet but not which lflow in its pipeline did it, right?

> In either case, you will end up with an OF flow per datapath. Again I want to 
> stress that this is fine, especially since this likely will only be enabled 
> while actively debugging.
> 
>>
>> If so, I see these ways forward:
>>
>> A) We modify the "sample" OpenFlow flow to allow the controller to add more 
>> fields to the sample. For instance, OVS currently supports an arbitrary string 
>> field "virtual_obs_id" but only per IPFIX target.
>>
>> B) We use the observation domain ID.
>>
>> This has some complexity: My current approach was to have the users specify 
>> the obs_domain_ID to use for ovn-drop-debugging so they can make sure it is 
>> not the same as the one used for other kind of sampling (e.g: per flow 
>> sampling of ACLs, bridge-level sampling, etc). This is because obs_point_ids 
>> should be unique per obs_domain_id, so we can use this field to separate 
>> "applications" at the collector level. My questions would be:
>> What's the maximum number of datapaths? Is it less than 2^32 so we can reserve 
>> a domain_id for each and still leave space for other applications?
>> Is the number of datapaths per controller more limited in theory (I'd guess in 
>> practice it is)?
>>
>> C) Have control over the IPFIX table in OVS.
>>
>> We're currently relying on the users to create rows in the IPFIX and 
>> Flow_Sample_Collector_Set Tables on every bridge they want to sample. If 
>> ovn-controller takes control over those tables it could add a row (with a 
>> different virtual_obs_point_id) for each datapath. But the problem is the same 
>> as in B): How to co-exist with other user-configured IPFIX sampling applications.
>>
> 
> This is where my lack of knowledge of IPFIX hinders my ability to give good 
> direction. However, based on what I was saying above, would the idea of using an 
> alternative identifier than the logical datapath UUID do the job without having 
> to take any of the above measures?
> 
>>>>
>>>> * Use of ObservationPointID
>>>> In this RFC, I just used the ObservationPointID (IPFIX element 138)
>>>> because it's already supported in the OVS's NXAST_SAMPLE. This allows us
>>>> to encode 32bits which is good enough for the cookie. If we wanted to
>>>> encode more information we'd have to look for another IPFIX element.
>>>
>>> I had mentioned above that having the data from the registers as part of the 
>>> sample is good for debugging purposes, but I understand this may be a big 
>>> ask. Right now you're putting the logical flow UUID here, and that's fine. 
>>> It's easy to trace that back to a particular datapath. The alternative would 
>>> be to use the UUID of the datapath directly instead.
>>>
>>
>> See above.
>>
>>>>
>>>> Adrian Moreno (3):
>>>>    actions: add sample action
>>>>    northd: add drop_debugging option
>>>>    debug: add sampling of drop actions
>>>>
>>>>   include/ovn/actions.h |  11 ++++
>>>>   lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
>>>>   northd/automake.mk    |   2 +
>>>>   northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
>>>>   northd/debug.h        |  41 ++++++++++++++
>>>>   northd/northd.c       | 125 ++++++++++++++++++++++++++++--------------
>>>>   ovn-nb.xml            |  29 ++++++++++
>>>>   tests/ovn.at          |  10 +++-
>>>>   tests/test-ovn.c      |   2 +
>>>>   utilities/ovn-trace.c |   3 +
>>>>   10 files changed, 390 insertions(+), 42 deletions(-)
>>>>   create mode 100644 northd/debug.c
>>>>   create mode 100644 northd/debug.h
>>>>
>>>
>>
>
Mark Michelson May 10, 2022, 2:42 p.m. UTC | #5
On 5/10/22 09:58, Adrian Moreno wrote:
> 
> 
> On 5/10/22 15:24, Mark Michelson wrote:
>> On 5/10/22 02:23, Adrian Moreno wrote:
>>>
>>>
>>> On 5/9/22 22:41, Mark Michelson wrote:
>>>> Hi Adrian,
>>>>
>>>> Since this is an RFC series, I haven't taken a deep look at the 
>>>> code, so I apologize if my comments below are not correct.
>>>>
>>>
>>> Thank you very much for the feedback Mark.
>>>
>>>> On 4/25/22 07:17, Adrian Moreno wrote:
>>>>> Very often when troubleshooting networking issues in an OVN cluster 
>>>>> one
>>>>> would like to know if any packet (or a specific one) is being 
>>>>> dropped by
>>>>> OVN.
>>>>>
>>>>> Currently, this cannot be known because of two main reasons:
>>>>>
>>>>> 1 - Implicit drops: Some tables do not have a default action
>>>>> (priority=0, match=1). In this case, a packet that does not match any
>>>>> rule will be silently dropped.
>>>>>
>>>>> 2 - Even on explicit drops, we only know a packet was dropped. We lack
>>>>> information about that packet.
>>>>>
>>>>> In order to improve this, this RFC proposes a two-fold solution:
>>>>>
>>>>> - First, create a debug-mode option that makes northd add a default
>>>>>    "drop;" action on those tables that currently lack one.
>>>>>
>>>>> - Secondly, allow sampling of all drops. By introducing a new OVN
>>>>>    action: "sample" (equivalent to OVS's), OVN can make OVS sample the
>>>>>    packets as they are dropped and insert the first 32 bits of the
>>>>>    Logical Flow's UUID (a.k.a cookie) into the IPFIX sample's
>>>>>    ObservationPointId. That way a collector can see the packet's
>>>>>    header information as well as what Logical Flow dropped it.
>>>>>
>>>>>
>>>>> This RFC has some limitations I'd like some specific
>>>>> feedback/guidance on:
>>>>>
>>>>> * Per-datapath flows
>>>>> Even if a Logical Flow is created with "match=1", the controller will
>>>>> insert the datapath metadata match. This might be good enough for most
>>>>> cases but could miss packets if there's a bug in OVN. A possible
>>>>> approach could be to propagate the "drop-debug" configuration to the
>>>>> SB and make the controller insert the default drops but without a
>>>>> Logical Flow, how would we trace it back?
>>>>
>>>> I think the most important data points to track are:
>>>> * The packet contents
>>>> * The logical datapath where the drop occurred (i.e. metadata)
>>>
>>> See below thoughts about how to do this.
>>>
>>>> * (possibly) register values at the time of the drop
>>>
>>> This metadata fields are not in the datapath, they only exist at the 
>>> openflow classification level, rigth? So we would have to make 
>>> OpenFlow "sample()" dynamic and insert arbitrary fields in it.
>>
>> Having the register values is mostly wishful thinking on my part :)
>>
>> It's not something that is 100% required; it's just something I 
>> thought could be useful when debugging a dropped packet.
>>
> 
> OK.
> 
>>>
>>>
>>>> Since the logical datapath is important, I think we're going to end 
>>>> up with an OF flow per datapath.
>>>>
>>>> The question still remains whether we need to add explicit drop and 
>>>> sample logical flows to the southbound database.
>>>>
>>>> I think having logical flows for the explicit drops and samples is a 
>>>> good idea for the time being. I think it adds clarity to what OVN is 
>>>> doing. The only reason we should move the logic down to 
>>>> ovn-controller is if our testing shows significant performance 
>>>> degradation. And I do mean *significant*. Since this is something 
>>>> that is only intended to be turned on conditionally, it doesn't need 
>>>> to be lightning fast. However, if turning on drop debugging causes 
>>>> everything to grind to a halt, then we probably should look into 
>>>> optimizing the process.
>>>>
>>>>>
>>>>> Another approach (suggested by Dumitru) could be to have OVN detect 
>>>>> that
>>>>> a lflow actually applies to all datapaths and remove the metadata
>>>>> match which would also remove the number of Openflow flows.
>>>>
>>>> I guess my question here is how we will know on which logical 
>>>> datapath the packet was being processed when the drop occurred. If 
>>>> you can set the observation point ID to the value of the 
>>>> OXM_METADATA register, then this suggestion would work fine. 
>>>> Otherwise, I don't know how you'd reduce this to a single flow and 
>>>> still get the necessary information in the sample.
>>>>
>>>
>>> Right, we would loose that information since we cannot set the 
>>> observation point ID dynamically (i.e: it's hardcoded in the of flow) 
>>> so if we optimize out and only add a single of flow, we can only have 
>>> a single observation point ID.
>>>
>>> In fact, will we have the same problem with the current 
>>> implementation? Right now northd will create a ovs action as 
>>> "sample(obs_point_id=$cookie, ....)" and the controller will replace 
>>> it with the first 32bits of the lflow UUID. But the lflow is the same 
>>> for all datapaths, right?
>>
>> That's a good question. If logical datapath groups are disabled, then 
>> you should definitely have a logical flow per datapath, even if the 
>> matches and actions are the same across all datapaths. With logical 
>> datapath groups enabled (the default), those similar flows are 
>> factored into a single logical flow. I think in that scenario the 
>> lflow UUID does not work to identify the logical datapath.
>>
>> Some good alternatives to using the lflow UUID as a datapath 
>> identifier would be
>>
>> * The value of the OXM_METADATA register.
>> * The southbound datapath_binding UUID.
>>
> 
> If we use any of these identifiers, we would know what datapath dropped 
> the packet but not what lflow in it's pipeline did it, Right?

OK, I see your point here now. For some reason I was thinking that the 
sample() would capture the current OF flow where the sample was sent, 
but that's not how it works.

As was discussed before, logical flows can be coalesced if datapath 
groups are enabled. So a logical flow UUID is ambiguous when trying to 
determine which logical datapath dropped the packet.

One idea would be to implicitly disable logical datapath groups when 
drop debugging is enabled. This way, logical flow UUIDs are not 
ambiguous. One downside is that the southbound database will become 
larger, likely resulting in OVSDB and ovn-controller incurring more CPU 
and taking longer to process the data. Since this is an active debugging 
scenario, that may not be a problem. The other potential downside would 
be if removing logical datapath groups causes the problem to present 
itself differently. That could turn a simple issue into a heisenbug.

Another approach would be to encode data directly into the
ObservationPointID. From what I can tell, the key things you need in order
to know where a packet was dropped are

* Logical datapath
* OF table
* OF priority

Logical datapaths are 24 bits. I'm not sure what the maximum size for OF
table IDs is, but with OVN, 8 bits is plenty. Priorities are 16 bits.
Therefore, 48 bits would be necessary to encode all of that. Since the
ObservationPointID is only 32 bits, that makes this idea not work as-is.
You'd have to include another field, potentially.

> 
>> In either case, you will end up with an OF flow per datapath. Again I 
>> want to stress that this is fine, especially since this likely will 
>> only be enabled while actively debugging.
>>
>>>
>>> If so, I see these ways forward:
>>>
>>> A) We modify the "sample" OpenFlow flow to allow the controller to 
>>> add more fields to the sample. For instance, OVS currently supports 
>>> an arbitrary string field "virtual_obs_id" but only per IPFIX target.
>>>
>>> B) We use the observation domain ID.
>>>
>>> This has some complexity: My current approach was to have the users 
>>> specify the obs_domain_ID to use for ovn-drop-debugging so they can 
>>> make sure it is not the same as the one used for other kind of 
>>> sampling (e.g: per flow sampling of ACLs, bridge-level sampling, 
>>> etc). This is because obs_point_ids should be unique per 
>>> obs_domain_id, so we can use this field to separate "applications" at 
>>> the collector level. My questions would be:
>>> What's the maximum number of datapaths? Is it less than 2^32 so we 
>>> can reserve a domain_id for each and still leave space for other 
>>> applications?
>>> Is the number of datapaths per controller more limited in theory (I'd 
>>> guess in practice it is)?
>>>
>>> C) Have control over the IPFIX table in OVS.
>>>
>>> We're currently relying on the users to create rows in the IPFIX and 
>>> Flow_Sample_Collector_Set Tables on every bridge they want to sample. 
>>> If ovn-controller takes control over those tables it could add a row 
>>> (with a different virtual_obs_point_id) for each datapath. But the 
>>> problem is the same as in B): How to co-exist with other 
>>> user-configured IPFIX sampling applications.
>>>
>>
>> This is where my lack of knowledge of IPFIX hinders my ability to give 
>> good direction. However, based on what I was saying above, would the 
>> idea of using an alternative identifier than the logical datapath UUID 
>> do the job without having to take any of the above measures?
>>
>>>>>
>>>>> * Use of ObservationPointID
>>>>> In this RFC, I just used the ObservationPointID (IPFIX element 138)
>>>>> because it's already supported in the OVS's NXAST_SAMPLE. This 
>>>>> allows us
>>>>> to encode 32bits which is good enough for the cookie. If we wanted to
>>>>> encode more information we'd have to look for another IPFIX element.
>>>>
>>>> I had mentioned above that having the data from the registers as 
>>>> part of the sample is good for debugging purposes, but I understand 
>>>> this may be a big ask. Right now you're putting the logical flow 
>>>> UUID here, and that's fine. It's easy to trace that back to a 
>>>> particular datapath. The alternative would be to use the UUID of the 
>>>> datapath directly instead.
>>>>
>>>
>>> See above.
>>>
>>>>>
>>>>> Adrian Moreno (3):
>>>>>    actions: add sample action
>>>>>    northd: add drop_debugging option
>>>>>    debug: add sampling of drop actions
>>>>>
>>>>>   include/ovn/actions.h |  11 ++++
>>>>>   lib/actions.c         | 111 +++++++++++++++++++++++++++++++++++++
>>>>>   northd/automake.mk    |   2 +
>>>>>   northd/debug.c        |  98 +++++++++++++++++++++++++++++++++
>>>>>   northd/debug.h        |  41 ++++++++++++++
>>>>>   northd/northd.c       | 125 
>>>>> ++++++++++++++++++++++++++++--------------
>>>>>   ovn-nb.xml            |  29 ++++++++++
>>>>>   tests/ovn.at          |  10 +++-
>>>>>   tests/test-ovn.c      |   2 +
>>>>>   utilities/ovn-trace.c |   3 +
>>>>>   10 files changed, 390 insertions(+), 42 deletions(-)
>>>>>   create mode 100644 northd/debug.c
>>>>>   create mode 100644 northd/debug.h
>>>>>
>>>>
>>>
>>
>
Adrian Moreno May 10, 2022, 4:27 p.m. UTC | #6
[...]

>>> Some good alternatives to using the lflow UUID as a datapath identifier would be
>>>
>>> * The value of the OXM_METADATA register.
>>> * The southbound datapath_binding UUID.
>>>
>>
>> If we use any of these identifiers, we would know what datapath dropped the 
>> packet but not what lflow in it's pipeline did it, Right?
> 
> OK, I see your point here now. For some reason I was thinking that the sample() 
> would capture the current OF flow where the sample was sent, but that's not how 
> it works.
> 
> As was discussed before, logical flows can be coalesced if datapath groups are 
> enabled. So a logical flow UUID is ambiguous when trying to determine which 
> logical datapath dropped the packet.
> 
> One idea would be to implicitly disable logical datapath groups when drop 
> debugging is enabled. This way, logical flow UUIDs are not ambiguous. One 
> downside is that the southbound database will become larger, likely resulting in 
> OVSDB and ovn-controller incurring more CPU and taking longer to process the 
> data. Since this is an active debugging scenario, that may not be a problem. The 
> other potential downside would be if removing logical datapath groups causes the 
> problem to present itself differently. That could turn a simple issue into a 
> heisenbug.
> 
> Another approach would be to encode data directly into the ObservationPointID. 
>  From what I can tell, the key things you need to know where a packet was 
> dropped are
> 
> * Logical datapath
> * OF table
> * OF priority
> 
> Logical datapaths are 24 bits. I'm not sure what the maximum size for OF tables 
> are, but with OVN, 8 bits is plenty. Priorities are 16 bits. Therefore, 48 bits 
> would be necessary to encode all of that. Since the ObservationPointID is only 
> 32 bits, that makes this idea not work as-is. You'd have to include another 
> field, potentially.


If the logical datapaths are only 24 bits I think we can squeeze the information 
in. What we can specify in the sample action is:
ObservationPointID: 32bit
ObservationDomainID: 32bit

Let's say for each IPFIX "application" that OVN might want to implement in the 
future (drop-debugging is just one of them) we define a "base_obs_domain_id" where:

	0 < base_obs_domain_id < 256

and we make the controller insert:

	obs_domain_id = (base_obs_domain_id << 24 | logical_datapath)

In addition, for ovn-debug we use:

	obs_point_id = lflow_uuid

This would allow for 254 different non-overlapping uses of IPFIX by OVN plus 16Mi
observation_domain_ids for external users to use (base_domain_id = 0).

For drop debugging we'll always know the logical flow that caused the drop and 
the logical datapath it belongs to.
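
A minimal sketch of that packing, assuming the 24-bit datapath identifier is
available at action-encoding time (the names are illustrative, not the actual
patch):

  #include <stdint.h>

  /* Combine an OVN "application" id (1-255) with the logical datapath
   * identifier (<= 24 bits) into a single observation domain id. */
  static inline uint32_t
  ovn_obs_domain_id(uint8_t base_obs_domain_id, uint32_t datapath_key)
  {
      return ((uint32_t) base_obs_domain_id << 24) | (datapath_key & 0xffffff);
  }

  /* obs_point_id stays the first 32 bits of the lflow UUID, i.e. the same
   * value already used as the OpenFlow cookie. */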

Would this work?
Where can I find those unique 24 bits that identify the datapath? Are they
available at action-encoding time (i.e., through struct ovnact_encode_params)?

Thanks for the patience and guidance.
Mark Michelson May 11, 2022, 6:45 p.m. UTC | #7
On 5/10/22 12:27, Adrian Moreno wrote:
> [...]
> 
>>>> Some good alternatives to using the lflow UUID as a datapath 
>>>> identifier would be
>>>>
>>>> * The value of the OXM_METADATA register.
>>>> * The southbound datapath_binding UUID.
>>>>
>>>
>>> If we use any of these identifiers, we would know what datapath 
>>> dropped the packet but not what lflow in it's pipeline did it, Right?
>>
>> OK, I see your point here now. For some reason I was thinking that the 
>> sample() would capture the current OF flow where the sample was sent, 
>> but that's not how it works.
>>
>> As was discussed before, logical flows can be coalesced if datapath 
>> groups are enabled. So a logical flow UUID is ambiguous when trying to 
>> determine which logical datapath dropped the packet.
>>
>> One idea would be to implicitly disable logical datapath groups when 
>> drop debugging is enabled. This way, logical flow UUIDs are not 
>> ambiguous. One downside is that the southbound database will become 
>> larger, likely resulting in OVSDB and ovn-controller incurring more 
>> CPU and taking longer to process the data. Since this is an active 
>> debugging scenario, that may not be a problem. The other potential 
>> downside would be if removing logical datapath groups causes the 
>> problem to present itself differently. That could turn a simple issue 
>> into a heisenbug.
>>
>> Another approach would be to encode data directly into the 
>> ObservationPointID.  From what I can tell, the key things you need to 
>> know where a packet was dropped are
>>
>> * Logical datapath
>> * OF table
>> * OF priority
>>
>> Logical datapaths are 24 bits. I'm not sure what the maximum size for 
>> OF tables are, but with OVN, 8 bits is plenty. Priorities are 16 bits. 
>> Therefore, 48 bits would be necessary to encode all of that. Since the 
>> ObservationPointID is only 32 bits, that makes this idea not work 
>> as-is. You'd have to include another field, potentially.
> 
> 
> If the logical datapaths are only 24 bits I think we can squeeze the 
> information in. What we can specify in the sample action is:
> ObservationPointID: 32bit
> ObservationDomainID: 32bit
> 
> Let's say for each IPFIX "application" that OVN might want to implement 
> in the future (drop-debugging is just one of them) we define a 
> "base_obs_domain_id" where:
> 
>      0 < base_obs_domain_id < 256
> 
> and we make the controller insert:
> 
>      obs_domain_id = (base_obs_domain_id << 24 | logical_datapath)
> 
> In addition, for ovn-debug we use:
> 
>      obs_point_id = lflow_uuid
> 
> This would allow for 254 different non-overlaping uses of IPFIX by OVN 
> plus 16Mi observation_domain_ids for external users to use 
> (base_domain_id = 0).
> 
> For drop debugging we'll always know the logical flow that caused the 
> drop and the logical datapath it belongs to.
> 
> Would this work?
> Where can I find those unique 24bits that identify the datapath? Are 
> they available when on action encoding (i.e: through struct 
> ovnact_encode_params)?

That sounds like it should work. The 24 bits that identify the datapath 
are the "tunnel_key" column on the datapath_binding in the southbound 
database. This is the same value we put in the OXM_METADATA register, 
and it's the same value we encode in the GENEVE metadata on tunnels.
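
For reference, that value can be read straight from the southbound database,
e.g. (the UUID is a placeholder):

  # tunnel_key is a small integer, unique per datapath and <= 24 bits:
  ovn-sbctl get Datapath_Binding <datapath-uuid> tunnel_key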

> 
> Thanks for the patience and guidance.
Numan Siddique June 1, 2022, 3:11 p.m. UTC | #8
On Wed, May 11, 2022 at 2:45 PM Mark Michelson <mmichels@redhat.com> wrote:
>
> On 5/10/22 12:27, Adrian Moreno wrote:
> > [...]
> >
> >>>> Some good alternatives to using the lflow UUID as a datapath
> >>>> identifier would be
> >>>>
> >>>> * The value of the OXM_METADATA register.
> >>>> * The southbound datapath_binding UUID.
> >>>>
> >>>
> >>> If we use any of these identifiers, we would know what datapath
> >>> dropped the packet but not what lflow in it's pipeline did it, Right?
> >>
> >> OK, I see your point here now. For some reason I was thinking that the
> >> sample() would capture the current OF flow where the sample was sent,
> >> but that's not how it works.
> >>
> >> As was discussed before, logical flows can be coalesced if datapath
> >> groups are enabled. So a logical flow UUID is ambiguous when trying to
> >> determine which logical datapath dropped the packet.
> >>
> >> One idea would be to implicitly disable logical datapath groups when
> >> drop debugging is enabled. This way, logical flow UUIDs are not
> >> ambiguous. One downside is that the southbound database will become
> >> larger, likely resulting in OVSDB and ovn-controller incurring more
> >> CPU and taking longer to process the data. Since this is an active
> >> debugging scenario, that may not be a problem. The other potential
> >> downside would be if removing logical datapath groups causes the
> >> problem to present itself differently. That could turn a simple issue
> >> into a heisenbug.
> >>
> >> Another approach would be to encode data directly into the
> >> ObservationPointID.  From what I can tell, the key things you need to
> >> know where a packet was dropped are
> >>
> >> * Logical datapath
> >> * OF table
> >> * OF priority
> >>
> >> Logical datapaths are 24 bits. I'm not sure what the maximum size for
> >> OF tables are, but with OVN, 8 bits is plenty. Priorities are 16 bits.
> >> Therefore, 48 bits would be necessary to encode all of that. Since the
> >> ObservationPointID is only 32 bits, that makes this idea not work
> >> as-is. You'd have to include another field, potentially.
> >
> >
> > If the logical datapaths are only 24 bits I think we can squeeze the
> > information in. What we can specify in the sample action is:
> > ObservationPointID: 32bit
> > ObservationDomainID: 32bit
> >
> > Let's say for each IPFIX "application" that OVN might want to implement
> > in the future (drop-debugging is just one of them) we define a
> > "base_obs_domain_id" where:
> >
> >      0 < base_obs_domain_id < 256
> >
> > and we make the controller insert:
> >
> >      obs_domain_id = (base_obs_domain_id << 24 | logical_datapath)
> >
> > In addition, for ovn-debug we use:
> >
> >      obs_point_id = lflow_uuid
> >
> > This would allow for 254 different non-overlaping uses of IPFIX by OVN
> > plus 16Mi observation_domain_ids for external users to use
> > (base_domain_id = 0).
> >
> > For drop debugging we'll always know the logical flow that caused the
> > drop and the logical datapath it belongs to.
> >
> > Would this work?
> > Where can I find those unique 24bits that identify the datapath? Are
> > they available when on action encoding (i.e: through struct
> > ovnact_encode_params)?
>
> That sounds like it should work. The 24 bits that identify the datapath
> are the "tunnel_key" column on the datapath_binding in the southbound
> database. This is the same value we put in the OXM_METADATA register,
> and it's the same value we encode in the GENEVE metadata on tunnels.
>
> >
> > Thanks for the patience and guidance.


I'm sorry for being late in taking a look at this discussion.

Just checking: is this series going to be respun as a formal
patch series with the inputs provided by Mark, or is further
discussion required?


I don't know much about IPFIX. Is it possible to include the metadata
(and perhaps other OVS registers) in the IPFIX template record?
(Sorry if it's a stupid question.)

Thanks
Numan

>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
Adrian Moreno June 1, 2022, 4:26 p.m. UTC | #9
On 6/1/22 17:11, Numan Siddique wrote:
> On Wed, May 11, 2022 at 2:45 PM Mark Michelson <mmichels@redhat.com> wrote:
>>
>> On 5/10/22 12:27, Adrian Moreno wrote:
>>> [...]
>>>
>>>>>> Some good alternatives to using the lflow UUID as a datapath
>>>>>> identifier would be
>>>>>>
>>>>>> * The value of the OXM_METADATA register.
>>>>>> * The southbound datapath_binding UUID.
>>>>>>
>>>>>
>>>>> If we use any of these identifiers, we would know what datapath
>>>>> dropped the packet but not what lflow in it's pipeline did it, Right?
>>>>
>>>> OK, I see your point here now. For some reason I was thinking that the
>>>> sample() would capture the current OF flow where the sample was sent,
>>>> but that's not how it works.
>>>>
>>>> As was discussed before, logical flows can be coalesced if datapath
>>>> groups are enabled. So a logical flow UUID is ambiguous when trying to
>>>> determine which logical datapath dropped the packet.
>>>>
>>>> One idea would be to implicitly disable logical datapath groups when
>>>> drop debugging is enabled. This way, logical flow UUIDs are not
>>>> ambiguous. One downside is that the southbound database will become
>>>> larger, likely resulting in OVSDB and ovn-controller incurring more
>>>> CPU and taking longer to process the data. Since this is an active
>>>> debugging scenario, that may not be a problem. The other potential
>>>> downside would be if removing logical datapath groups causes the
>>>> problem to present itself differently. That could turn a simple issue
>>>> into a heisenbug.
>>>>
>>>> Another approach would be to encode data directly into the
>>>> ObservationPointID.  From what I can tell, the key things you need to
>>>> know where a packet was dropped are
>>>>
>>>> * Logical datapath
>>>> * OF table
>>>> * OF priority
>>>>
>>>> Logical datapaths are 24 bits. I'm not sure what the maximum size for
>>>> OF tables are, but with OVN, 8 bits is plenty. Priorities are 16 bits.
>>>> Therefore, 48 bits would be necessary to encode all of that. Since the
>>>> ObservationPointID is only 32 bits, that makes this idea not work
>>>> as-is. You'd have to include another field, potentially.
>>>
>>>
>>> If the logical datapaths are only 24 bits I think we can squeeze the
>>> information in. What we can specify in the sample action is:
>>> ObservationPointID: 32bit
>>> ObservationDomainID: 32bit
>>>
>>> Let's say for each IPFIX "application" that OVN might want to implement
>>> in the future (drop-debugging is just one of them) we define a
>>> "base_obs_domain_id" where:
>>>
>>>       0 < base_obs_domain_id < 256
>>>
>>> and we make the controller insert:
>>>
>>>       obs_domain_id = (base_obs_domain_id << 24 | logical_datapath)
>>>
>>> In addition, for ovn-debug we use:
>>>
>>>       obs_point_id = lflow_uuid
>>>
>>> This would allow for 254 different non-overlaping uses of IPFIX by OVN
>>> plus 16Mi observation_domain_ids for external users to use
>>> (base_domain_id = 0).
>>>
>>> For drop debugging we'll always know the logical flow that caused the
>>> drop and the logical datapath it belongs to.
>>>
>>> Would this work?
>>> Where can I find those unique 24bits that identify the datapath? Are
>>> they available when on action encoding (i.e: through struct
>>> ovnact_encode_params)?
>>
>> That sounds like it should work. The 24 bits that identify the datapath
>> are the "tunnel_key" column on the datapath_binding in the southbound
>> database. This is the same value we put in the OXM_METADATA register,
>> and it's the same value we encode in the GENEVE metadata on tunnels.
>>
>>>
>>> Thanks for the patience and guidance.
> 
> 
> I'm sorry for being late in taking a  look at this discussion.
> 
> Just checking if this series is going to be respinned as a formal
> patch series with the inputs provided by Mark ?
> Or if further discussion is required ?
> 

Yes, I'll respin this as a proper patch series.

> 
> I don't know much about IPFIX.  Is it possible to define the metadata
> (and perhaps other ovs registers) into the ipfix template record ?
> (Sorry if its a stupid question).

Unfortunately, it's not possible. We can only "hardcode" values into the action
itself, using the fields described in ovs-actions(7):
https://www.mankier.com/7/ovs-actions#Other_Actions-The_sample_action
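
For reference, the relevant knobs on the action itself are roughly the
following (values are placeholders):

  sample(probability=65535,        # sampling probability, out of 65535
         collector_set_id=100,     # must match a Flow_Sample_Collector_Set row
         obs_domain_id=...,        # 32 bits
         obs_point_id=...)         # 32 bits; everything else in the record
                                   # comes from the IPFIX row, not the action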

> 
> Thanks
> Numan
> 
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>