Message ID | 20190906000403.3701-1-pablo@netfilter.org |
---|---|
Headers | show |
Series | flow_offload: update mangle action representation | expand |
On 06/09/2019 01:03, Pablo Neira Ayuso wrote: > This patch updates the mangle action representation: > > Patch 1) Undo bitwise NOT operation on the mangle mask (coming from tc > pedit userspace). > > Patch 2) mangle value &= mask from the front-end side. > > Patch 3) adjust offset, length and coalesce consecutive pedit keys into > one single action. > > Patch 4) add support for payload mangling for netfilter. > > After this patchset: > > * Offset to payload does not need to be on the 32-bits boundaries anymore. > This patchset adds front-end code to adjust the offset and length coming > from the tc pedit representation, so drivers get an exact header field > offset and length. > > * This new front-end code coalesces consecutive pedit actions into one > single action, so drivers can mangle IPv6 and ethernet address fields > in one go, instead once for each 32-bit word. > > On the driver side, diffstat -t shows that drivers code to deal with > payload mangling gets simplified: > > INSERTED,DELETED,MODIFIED,FILENAME > 46,116,0,drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c (-70 LOC) > 12,28,0,drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h (-16 LOC) > 26,54,0,drivers/net/ethernet/mellanox/mlx5/core/en_tc.c (-27 LOC) > 89,111,0,drivers/net/ethernet/netronome/nfp/flower/action.c (-17 LOC) > > While, on the front-end side the balance is the following: > > 123,22,0,net/sched/cls_api.c (+101 LOC) > > Changes since v2: > > * Fix is_action_keys_supported() breakage in mlx5 reported by Vlad Buslov. Still NAK for the same reasons as three versions ago (when it was called "netfilter: payload mangling offload support"), you've never managed to explain why this extra API complexity is useful. (Reducing LOC does not mean you've reduced complexity.) As Jakub said, 'We suffered through enough haphazard "updates"'. Please can you fix the problems your previous API changes caused (I still haven't had an answer about the flow block changes since sending you my driver code two weeks ago) before trying to ram new ones through. -Ed
On Fri, Sep 06, 2019 at 11:02:18AM +0100, Edward Cree wrote: > On 06/09/2019 01:03, Pablo Neira Ayuso wrote: > > This patch updates the mangle action representation: > > > > Patch 1) Undo bitwise NOT operation on the mangle mask (coming from tc > > pedit userspace). > > > > Patch 2) mangle value &= mask from the front-end side. > > > > Patch 3) adjust offset, length and coalesce consecutive pedit keys into > > one single action. > > > > Patch 4) add support for payload mangling for netfilter. > > > > After this patchset: > > > > * Offset to payload does not need to be on the 32-bits boundaries anymore. > > This patchset adds front-end code to adjust the offset and length coming > > from the tc pedit representation, so drivers get an exact header field > > offset and length. > > > > * This new front-end code coalesces consecutive pedit actions into one > > single action, so drivers can mangle IPv6 and ethernet address fields > > in one go, instead once for each 32-bit word. > > > > On the driver side, diffstat -t shows that drivers code to deal with > > payload mangling gets simplified: > > > > INSERTED,DELETED,MODIFIED,FILENAME > > 46,116,0,drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c (-70 LOC) > > 12,28,0,drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.h (-16 LOC) > > 26,54,0,drivers/net/ethernet/mellanox/mlx5/core/en_tc.c (-27 LOC) > > 89,111,0,drivers/net/ethernet/netronome/nfp/flower/action.c (-17 LOC) > > > > While, on the front-end side the balance is the following: > > > > 123,22,0,net/sched/cls_api.c (+101 LOC) > > > > Changes since v2: > > > > * Fix is_action_keys_supported() breakage in mlx5 reported by Vlad Buslov. > > Still NAK for the same reasons as three versions ago (when it was called > "netfilter: payload mangling offload support"), you've never managed to > explain why this extra API complexity is useful. (Reducing LOC does not > mean you've reduced complexity.) Oh well... Patch 1) Mask is inverted for no reason, just because tc pedit needs this in this way. All drivers reverse this mask. Patch 2) All drivers mask out meaningless fields in the value field. Patch 3) Without this patchset, offsets are on the 32-bit boundary. Drivers need to play with the 32-bit mask to infer what field they are supposed to mangle... eg. with 32-bit offset alignment, checking if the use want to alter the ttl/hop_limit need for helper structures to check the 32-bit mask. Mangling a IPv6 address comes with one single action...
On 06/09/2019 11:56, Pablo Neira Ayuso wrote: > On Fri, Sep 06, 2019 at 11:02:18AM +0100, Edward Cree wrote: >> Still NAK for the same reasons as three versions ago (when it was called >> "netfilter: payload mangling offload support"), you've never managed to >> explain why this extra API complexity is useful. (Reducing LOC does not >> mean you've reduced complexity.) > Oh well... > > Patch 1) Mask is inverted for no reason, just because tc pedit needs > this in this way. All drivers reverse this mask. > > Patch 2) All drivers mask out meaningless fields in the value field. To be clear: I have no issue with these two patches; they look fine to me. (Though I'd like to see some comments on struct flow_action_entry specifying the semantics of these fields, especially if they're going to differ from the corresponding fields in struct tc_pedit_key.) > Patch 3) Without this patchset, offsets are on the 32-bit boundary. > Drivers need to play with the 32-bit mask to infer what field they are > supposed to mangle... eg. with 32-bit offset alignment, checking if > the use want to alter the ttl/hop_limit need for helper structures to > check the 32-bit mask. Mangling a IPv6 address comes with one single > action... Drivers are still going to need to handle multiple pedit actions, in case the original rule wanted to mangle two non-consecutive fields. And you can't just coalesce all consecutive mangles, because if you mangle two consecutive fields (e.g. UDP sport and dport) the driver still needs to disentangle that if it works on a 'fields' (rather than 'u32s') level. So either have the core convert things into named protocol fields (i.e. "set src IPv6 to 1234::5 and add 1 to UDP sport"), or leave the current sequence-of-u32-mangles as it is. This in-between "we'll coalesce things together despite not knowing what fields they are" is neither fish nor fowl. -Ed
On Fri, Sep 06, 2019 at 01:55:29PM +0100, Edward Cree wrote: > On 06/09/2019 11:56, Pablo Neira Ayuso wrote: > > On Fri, Sep 06, 2019 at 11:02:18AM +0100, Edward Cree wrote: > >> Still NAK for the same reasons as three versions ago (when it was called > >> "netfilter: payload mangling offload support"), you've never managed to > >> explain why this extra API complexity is useful. (Reducing LOC does not > >> mean you've reduced complexity.) > > Oh well... > > > > Patch 1) Mask is inverted for no reason, just because tc pedit needs > > this in this way. All drivers reverse this mask. > > > > Patch 2) All drivers mask out meaningless fields in the value field. > > To be clear: I have no issue with these two patches; they look fine to me. > (Though I'd like to see some comments on struct flow_action_entry specifying > the semantics of these fields, especially if they're going to differ from > the corresponding fields in struct tc_pedit_key.) OK, I can document this semantics, I need just _time_ to write that documentation. I was expecting this patch description is enough by now until I can get to finish that documentation. > > Patch 3) Without this patchset, offsets are on the 32-bit boundary. > > Drivers need to play with the 32-bit mask to infer what field they are > > supposed to mangle... eg. with 32-bit offset alignment, checking if > > the use want to alter the ttl/hop_limit need for helper structures to > > check the 32-bit mask. Mangling a IPv6 address comes with one single > > action... > > Drivers are still going to need to handle multiple pedit actions, in > case the original rule wanted to mangle two non-consecutive fields. > And you can't just coalesce all consecutive mangles, because if you > mangle two consecutive fields (e.g. UDP sport and dport) the driver > still needs to disentangle that if it works on a 'fields' (rather > than 'u32s') level. This infrastructure is _not_ coalescing two consecutive field, e.g. UDP sport and dport is _not_ coalesced. The coalesce routine does _not_ handle multiple tc pedit ex actions. I'll clarify this in the cover letter in the next patchset round. > So either have the core convert things into named protocol fields > (i.e. "set src IPv6 to 1234::5 and add 1 to UDP sport"), or leave > the current sequence-of-u32-mangles as it is. This in-between "we'll > coalesce things together despite not knowing what fields they are" is > neither fish nor fowl. The model you propose would still need this code for tc pedit to adjust offset/length and coalesce u32 fields.
On 06/09/2019 14:14, Pablo Neira Ayuso wrote: > OK, I can document this semantics, I need just _time_ to write that > documentation. I was expecting this patch description is enough by now > until I can get to finish that documentation. I think for two structs with apparently the same contents but different semantics (one has the mask bitwise complemented) it's best to hold up the code change until the comment is ready to come with it, because until then it's a dangerously confusing situation. >> And you can't just coalesce all consecutive mangles, because if you >> mangle two consecutive fields (e.g. UDP sport and dport) the driver >> still needs to disentangle that if it works on a 'fields' (rather >> than 'u32s') level. > This infrastructure is _not_ coalescing two consecutive field, e.g. > UDP sport and dport is _not_ coalesced. The coalesce routine does > _not_ handle multiple tc pedit ex actions. So an IPv6 address mangle only comes as a single action if it's from netfilter, not if it's coming from TC pedit. Therefore drivers still need to handle an IPv6 or MAC address mangle coming in multiple actions, therefore your driver simplifications are invalid. No? > The model you propose would still need this code for tc pedit to > adjust offset/length and coalesce u32 fields. Yes, but we don't add code/features to the kernel based on what we *could* use it for later; every submission has to be self-contained in providing something of demonstrable value. So either implement "the model I propose" (which to be clear I'm *not* proposing, I want the u32 pedit left as it is; it's just that it's a better model than what you're implementing here), or leave well alone. -Ed
On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote: > On 06/09/2019 14:14, Pablo Neira Ayuso wrote: > > OK, I can document this semantics, I need just _time_ to write that > > documentation. I was expecting this patch description is enough by now > > until I can get to finish that documentation. > > I think for two structs with apparently the same contents but different > semantics (one has the mask bitwise complemented) it's best to hold up > the code change until the comment is ready to come with it, because > until then it's a dangerously confusing situation. The idea is that flow rule API != tc front-end anymore. So the flow rule API can evolve regardless the front-end requirements. Before this update there was no other choice than using the tc pedit layout since it was exposed to the drivers, this is not the case anymore. > >> And you can't just coalesce all consecutive mangles, because if you > >> mangle two consecutive fields (e.g. UDP sport and dport) the driver > >> still needs to disentangle that if it works on a 'fields' (rather > >> than 'u32s') level. > > > > This infrastructure is _not_ coalescing two consecutive field, e.g. > > UDP sport and dport is _not_ coalesced. The coalesce routine does > > _not_ handle multiple tc pedit ex actions. > > So an IPv6 address mangle only comes as a single action if it's from > netfilter, not if it's coming from TC pedit. Driver gets one single action from tc/netfilter (and potentially ethtool if it would support for this), it comes as a single action from both subsystems: front-end -----> flow_rule API -----> drivers Front-end need to translate their representation to this FLOW_ACTION_MANGLE layout. By front-end, I refer to ethtool/netfilter/tc. > Therefore drivers still need to handle an IPv6 or MAC address > mangle coming in multiple actions, therefore your driver > simplifications are invalid. No? No. IPv6 and MAC come as a single action. All subsystems use the same flow_rule API, they use the same layout. > > The model you propose would still need this code for tc pedit to > > adjust offset/length and coalesce u32 fields. > > Yes, but we don't add code/features to the kernel based on what we > *could* use it for later This is already useful: Look at the cxgb driver in particular, it's way easier to read. Other existing drivers do not need to do things like: case round_down(offsetof(struct iphdr, tos), 4): and the play with masks to infer if the user wants to mangle the TOS field, they can just do: case offsetof(struct iphdr, tos): which is way more natural representation.
On 06/09/2019 15:50, Pablo Neira Ayuso wrote: > On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote: >> On 06/09/2019 14:14, Pablo Neira Ayuso wrote: >>> OK, I can document this semantics, I need just _time_ to write that >>> documentation. I was expecting this patch description is enough by now >>> until I can get to finish that documentation. >> I think for two structs with apparently the same contents but different >> semantics (one has the mask bitwise complemented) it's best to hold up >> the code change until the comment is ready to come with it, because >> until then it's a dangerously confusing situation. > The idea is that flow rule API != tc front-end anymore. So the flow > rule API can evolve regardless the front-end requirements. Before this > update there was no other choice than using the tc pedit layout since > it was exposed to the drivers, this is not the case anymore. I'm not saying that they have to be the same. I'm saying that they're _almost_ the same, and having things that are _almost_ the same but subtly different is a recipe for misunderstandings and bugs, and so must must must have very visible comments in the code. >> So an IPv6 address mangle only comes as a single action if it's from >> netfilter, not if it's coming from TC pedit. > Driver gets one single action from tc/netfilter (and potentially > ethtool if it would support for this), it comes as a single action > from both subsystems: > > front-end -----> flow_rule API -----> drivers > > Front-end need to translate their representation to this > FLOW_ACTION_MANGLE layout. > > By front-end, I refer to ethtool/netfilter/tc. In your patch, flow_action_mangle() looks only at the offset and type (add vs. set) of each pedit, coalesces them if compatible (so, unless I'm misreading the code, it _will_ coalesce adjacent pedits to contiguous fields like UDP sport & dport, unlike what you said), because it doesn't know whether two contiguous pedits are part of the same field or not (it doesn't have any knowledge of 'fields' at all). And for you to change that, while still coalescing IPv6 pedits, you would need flow_action_mangle() to know what fields each pedit is in. It is not possible for code that doesn't know about fields to both: (a) not coalesce edits of contiguous fields, and (b) coalesce edits of larger-than-u32 fields >> Yes, but we don't add code/features to the kernel based on what we >> *could* use it for later > This is already useful: Look at the cxgb driver in particular, it's > way easier to read. Have you tested it? Again, I might be misreading, but it looks like your flow_action_mangle() *will* coalesce sport & dport, which it appears will break that cxgb code. > Other existing drivers do not need to do things like: > > case round_down(offsetof(struct iphdr, tos), 4): > > and the play with masks to infer if the user wants to mangle the TOS > field, they can just do: > > case offsetof(struct iphdr, tos): > > which is way more natural representation. Proper thing to do is have helper functions available to drivers to test the pedit, and not just switch on the offset. Why do I say that? Well, consider a pedit on UDP dport, with mask 0x00ff (network endian). Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset calculation (ffs in flow_action_mangle_entry()) will turn that into offset 3 mask 0xff. Now driver does switch(offset) { /* 3 */ case offsetof(struct udphdr, dest): /* 2 */ /* Whoops, we never get here! */ } Do you see the problem? -Ed
Hi Edward, On Fri, Sep 06, 2019 at 04:13:17PM +0100, Edward Cree wrote: > On 06/09/2019 15:50, Pablo Neira Ayuso wrote: > > On Fri, Sep 06, 2019 at 02:37:16PM +0100, Edward Cree wrote: [...] > >> So an IPv6 address mangle only comes as a single action if it's from > >> netfilter, not if it's coming from TC pedit. > > Driver gets one single action from tc/netfilter (and potentially > > ethtool if it would support for this), it comes as a single action > > from both subsystems: > > > > front-end -----> flow_rule API -----> drivers > > > > Front-end need to translate their representation to this > > FLOW_ACTION_MANGLE layout. > > > > By front-end, I refer to ethtool/netfilter/tc. > > In your patch, flow_action_mangle() looks only at the offset and type > (add vs. set) of each pedit, coalesces them if compatible (so, unless > I'm misreading the code, it _will_ coalesce adjacent pedits to > contiguous fields like UDP sport & dport, unlike what you said), > because it doesn't know whether two contiguous pedits are part of the > same field or not (it doesn't have any knowledge of 'fields' at all). In tc pedit ex, those are _indeed_ two separated actions: * One to mangle UDP sport. * One to mangle UDP dport. They are _not_ one as you describe above. The coalesce routine I made does _not_ coalesce fields in two different actions. > And for you to change that, while still coalescing IPv6 pedits, you > would need flow_action_mangle() to know what fields each pedit is in. In the particular case of IPv6 address, 'tc pedit ex' generates one single action with 4 nkeys. So this is: * One action to mangle four 32-bits words (the number of words in tc pedit is stored in the nkeys field). The coalesce routine I made in this case merges the four 32-bits words into one single action. [...] > >> Yes, but we don't add code/features to the kernel based on what we > >> *could* use it for later > > This is already useful: Look at the cxgb driver in particular, it's > > way easier to read. > > Have you tested it? Again, I might be misreading, but it looks like > your flow_action_mangle() *will* coalesce sport & dport, which it > appears will break that cxgb code. Because sport and dport are not place in the same action by tc pedit ex, _that cannot happen_. > > Other existing drivers do not need to do things like: > > > > case round_down(offsetof(struct iphdr, tos), 4): > > > > and the play with masks to infer if the user wants to mangle the TOS > > field, they can just do: > > > > case offsetof(struct iphdr, tos): > > > > which is way more natural representation. > > Proper thing to do is have helper functions available to drivers to test > the pedit, and not just switch on the offset. Why do I say that? > > Well, consider a pedit on UDP dport, with mask 0x00ff (network endian). > Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset > calculation (ffs in flow_action_mangle_entry()) will turn that into > offset 3 mask 0xff. Now driver does > switch(offset) { /* 3 */ > case offsetof(struct udphdr, dest): /* 2 */ > /* Whoops, we never get here! */ > } > > Do you see the problem? This scenario you describe cannot _work_ right now, with the existing code. Without my patchset, this scenario you describe does _not_ work, The drivers in the tree need a mask of 0xffff to infer that this is UDP dport. The 'tc pedit ex' infrastructure does not allow for the scenario that you describe above. No driver in the tree allow for what you describe already.
On 06/09/2019 16:58, Pablo Neira Ayuso wrote: > In tc pedit ex, those are _indeed_ two separated actions: I read the code again and I get it now, there's double iteration already over tcf_exts_for_each_action and tcf_pedit_nkeys, and it's only within the latter that you coalesce. However, have you considered that iproute2 (i.e. tc tool) isn't guaranteed to be the only userland consumer of the TC uAPI? For all we know there could be another user out there producing things like a single pedit action with two keys, same offset but different masks, to mangle sport & dport separately, which your code now _would_ coalesce into a single mangle. I don't know if that would lead to any problems, but I want to be sure you've thought about it ;) >> Proper thing to do is have helper functions available to drivers to test >> the pedit, and not just switch on the offset. Why do I say that? >> >> Well, consider a pedit on UDP dport, with mask 0x00ff (network endian). >> Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset >> calculation (ffs in flow_action_mangle_entry()) will turn that into >> offset 3 mask 0xff. Now driver does >> switch(offset) { /* 3 */ >> case offsetof(struct udphdr, dest): /* 2 */ >> /* Whoops, we never get here! */ >> } >> >> Do you see the problem? > This scenario you describe cannot _work_ right now, with the existing > code. Without my patchset, this scenario you describe does _not_ work, > > The drivers in the tree need a mask of 0xffff to infer that this is > UDP dport. > > The 'tc pedit ex' infrastructure does not allow for the scenario that > you describe above. > > No driver in the tree allow for what you describe already. Looks to me like existing nfp_fl_set_tport() handles just fine any arbitrary mask across the u32 that contains UDP sport & dport. And the uAPI we have to maintain is the uAPI we expose, not the subset that iproute2 uses. I could write a patched tc tool *today* that does a pedit of 'UDP header, offset 0, mask 0xff0000ff' and the nfp driver would accept that fine (no idea what the fw / chip would do with it, but presumably it works or Netronome folks would have put checks in), whereas with your patch it'll complain "invalid pedit L4 action" because the mask isn't all-1s. And if I made it produce my example from above, mask 0x000000ff, you'd calculate an offset of 3 and hit the other error, "unsupported section of L4 header", which again would have worked before.
On Fri, Sep 06, 2019 at 05:49:07PM +0100, Edward Cree wrote: > On 06/09/2019 16:58, Pablo Neira Ayuso wrote: > > In tc pedit ex, those are _indeed_ two separated actions: > > I read the code again and I get it now, there's double iteration > already over tcf_exts_for_each_action and tcf_pedit_nkeys, and > it's only within the latter that you coalesce. Exactly. > However, have you considered that iproute2 (i.e. tc tool) isn't > guaranteed to be the only userland consumer of the TC uAPI? > For all we know there could be another user out there producing things like > a single pedit action with two keys, same offset but different > masks, to mangle sport & dport separately, which your code now > _would_ coalesce into a single mangle. I don't know if that would > lead to any problems, but I want to be sure you've thought about it ;) tc pedit only supports for the "extended mode". So the hardware offloads only support for a subset of the tc pedit uAPI already. Userland may decide not to use the "extended mode", however, it will not work for hardware offloads. > >> Proper thing to do is have helper functions available to drivers to test > >> the pedit, and not just switch on the offset. Why do I say that? > >> > >> Well, consider a pedit on UDP dport, with mask 0x00ff (network endian). > >> Now as a u32 pedit that's 0x000000ff offset 0, so field-blind offset > >> calculation (ffs in flow_action_mangle_entry()) will turn that into > >> offset 3 mask 0xff. Now driver does > >> switch(offset) { /* 3 */ > >> case offsetof(struct udphdr, dest): /* 2 */ > >> /* Whoops, we never get here! */ > >> } > >> > >> Do you see the problem? > > > > This scenario you describe cannot _work_ right now, with the existing > > code. Without my patchset, this scenario you describe does _not_ work, > > > > The drivers in the tree need a mask of 0xffff to infer that this is > > UDP dport. > > > > The 'tc pedit ex' infrastructure does not allow for the scenario that > > you describe above. > > > > No driver in the tree allow for what you describe already. > > Looks to me like existing nfp_fl_set_tport() handles just fine any > arbitrary mask across the u32 that contains UDP sport & dport. > And the uAPI we have to maintain is the uAPI we expose, not the subset > that iproute2 uses. I could write a patched tc tool *today* that does > a pedit of 'UDP header, offset 0, mask 0xff0000ff' and the nfp driver > would accept that fine (no idea what the fw / chip would do with it, > but presumably it works or Netronome folks would have put checks in), > whereas with your patch it'll complain "invalid pedit L4 action" > because the mask isn't all-1s. 'UDP header, offset 0, mask 0xff0000ff': Mangle one byte of the UDP sport, and only one mangle of the dport via uAPI. I get your point: If you think that supporting for this is reasonable usecase, I'll fix this patchset and send a v4 so the nfp still works for this. > And if I made it produce my example from above, mask 0x000000ff, you'd > calculate an offset of 3 and hit the other error, "unsupported section > of L4 header", which again would have worked before. 'mask 0x000000ff' mangle only one byte of a UDP port. I'm sorry for this, I assumed in this case that the reasonable (sane) uAPI subset in this case to be supported for the hardware offloads is what tc tool via pedit ex generates. I'll restore the nfp driver so it works for these scenarios via uAPI that you describe.