Message ID | 20210903122148.826196-6-mark.d.gray@redhat.com |
---|---|
State | Superseded |
Headers | show |
Series | northd: Split northd and northd incremental processing framework | expand |
Context | Check | Description |
---|---|---|
ovsrobot/apply-robot | success | apply and check: success |
ovsrobot/github-robot-_Build_and_Test | success | github build: passed |
ovsrobot/github-robot-_ovn-kubernetes | success | github build: passed |
Hi Mark, Of all the patches in this series, this is the one I am least sure about. It seems simple enough, but it's not clear to me 1) Why would the number of nat entries on an ovn_datapath differ from what's in the database? Since this series doesn't introduce any "real" incremental processing, we should be recreating the ovn_datapath on every iteration. And on each iteration, we should have a fresh view of the database, so why would the values differ? 2) Why is this only an issue in destroy_nat_entries()? There are several other places in the code where we iterate over the number of nat entries using od->nbr->n_nat. Why are those not affected? 3) How does this patch address that issue? The patch just sets the ovn_datapath's n_nat_entries to be the same as the database's count, which according to the commit message, is not always accurate. Can you provide some clarity here? Maybe tell of a situation where od->nbr->n_nat is incorrect? On 9/3/21 8:21 AM, Mark Gray wrote: > destroy_nat_entries() iterates over nat_entries using the > number of NAT entries from the NB database. When using I-P, > this behaviour is incorrect as the number of NAT entries in > the NBDB may differ from that stored in 'struct ovn_datapath' > causing unexpected behaviour. > > Signed-off-by: Mark Gray <mark.d.gray@redhat.com> > --- > northd/northd.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/northd/northd.c b/northd/northd.c > index 724baa3994d5..829c4479f14b 100644 > --- a/northd/northd.c > +++ b/northd/northd.c > @@ -605,6 +605,8 @@ struct ovn_datapath { > > /* NAT entries configured on the router. */ > struct ovn_nat *nat_entries; > + size_t n_nat_entries; > + > bool has_distributed_nat; > > /* Set of nat external ips on the router. */ > @@ -789,6 +791,7 @@ init_nat_entries(struct ovn_datapath *od) > od->has_distributed_nat = true; > } > } > + od->n_nat_entries = od->nbr->n_nat; > } > > static void > @@ -802,7 +805,7 @@ destroy_nat_entries(struct ovn_datapath *od) > destroy_lport_addresses(&od->dnat_force_snat_addrs); > destroy_lport_addresses(&od->lb_force_snat_addrs); > > - for (size_t i = 0; i < od->nbr->n_nat; i++) { > + for (size_t i = 0; i < od->n_nat_entries; i++) { > destroy_lport_addresses(&od->nat_entries[i].ext_addrs); > } > } >
On 13/09/2021 21:41, Mark Michelson wrote: > Hi Mark, > > Of all the patches in this series, this is the one I am least sure about. Thanks for looking over these. > > It seems simple enough, but it's not clear to me > > 1) Why would the number of nat entries on an ovn_datapath differ from > what's in the database? Since this series doesn't introduce any "real" > incremental processing, we should be recreating the ovn_datapath on > every iteration. And on each iteration, we should have a fresh view of > the database, so why would the values differ? > 2) Why is this only an issue in destroy_nat_entries()? There are several > other places in the code where we iterate over the number of nat entries > using od->nbr->n_nat. Why are those not affected? > 3) How does this patch address that issue? The patch just sets the > ovn_datapath's n_nat_entries to be the same as the database's count, > which according to the commit message, is not always accurate. > > Can you provide some clarity here? Maybe tell of a situation where > od->nbr->n_nat is incorrect? You are absolutely correct as the details in the commit message are not sufficient to understand. In fact, I had a hard time remembering the reason myself! IIRC, on each iteration, we need to call en_runtime_run(). This needs to initialize the runtime data .. but .. it also needs to destroy the runtime data from the previous run. This may not be the best way to do it, but the I-P framework doesn't have a convenient way to clear data between runs. Consider an example in which we have completed iteration x, and started iteration x+1. If we add a NAT entry between iteration x and iteration x+1, od->nbr->n_nat will contain the updated value of n_nat as the northbound record will have been updated. However, as we have not yet (re)initialized the nat entries in od, od->n_nat_entries will be equal to the previous value. These two values could then be different which makes this loop invalid. Does this make sense? If so, shall I update the commit message? How would you like me to proceed? > > On 9/3/21 8:21 AM, Mark Gray wrote: >> destroy_nat_entries() iterates over nat_entries using the >> number of NAT entries from the NB database. When using I-P, >> this behaviour is incorrect as the number of NAT entries in >> the NBDB may differ from that stored in 'struct ovn_datapath' >> causing unexpected behaviour. >> >> Signed-off-by: Mark Gray <mark.d.gray@redhat.com> >> --- >> northd/northd.c | 5 ++++- >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/northd/northd.c b/northd/northd.c >> index 724baa3994d5..829c4479f14b 100644 >> --- a/northd/northd.c >> +++ b/northd/northd.c >> @@ -605,6 +605,8 @@ struct ovn_datapath { >> >> /* NAT entries configured on the router. */ >> struct ovn_nat *nat_entries; >> + size_t n_nat_entries; >> + >> bool has_distributed_nat; >> >> /* Set of nat external ips on the router. */ >> @@ -789,6 +791,7 @@ init_nat_entries(struct ovn_datapath *od) >> od->has_distributed_nat = true; >> } >> } >> + od->n_nat_entries = od->nbr->n_nat; >> } >> >> static void >> @@ -802,7 +805,7 @@ destroy_nat_entries(struct ovn_datapath *od) >> destroy_lport_addresses(&od->dnat_force_snat_addrs); >> destroy_lport_addresses(&od->lb_force_snat_addrs); >> >> - for (size_t i = 0; i < od->nbr->n_nat; i++) { >> + for (size_t i = 0; i < od->n_nat_entries; i++) { >> destroy_lport_addresses(&od->nat_entries[i].ext_addrs); >> } >> } >> >
On 9/14/21 1:27 PM, Mark Gray wrote: > On 13/09/2021 21:41, Mark Michelson wrote: >> Hi Mark, >> >> Of all the patches in this series, this is the one I am least sure about. > Thanks for looking over these. > >> >> It seems simple enough, but it's not clear to me >> >> 1) Why would the number of nat entries on an ovn_datapath differ from >> what's in the database? Since this series doesn't introduce any "real" >> incremental processing, we should be recreating the ovn_datapath on >> every iteration. And on each iteration, we should have a fresh view of >> the database, so why would the values differ? >> 2) Why is this only an issue in destroy_nat_entries()? There are several >> other places in the code where we iterate over the number of nat entries >> using od->nbr->n_nat. Why are those not affected? >> 3) How does this patch address that issue? The patch just sets the >> ovn_datapath's n_nat_entries to be the same as the database's count, >> which according to the commit message, is not always accurate. >> >> Can you provide some clarity here? Maybe tell of a situation where >> od->nbr->n_nat is incorrect? > > You are absolutely correct as the details in the commit message are not > sufficient to understand. In fact, I had a hard time remembering the > reason myself! > > IIRC, on each iteration, we need to call en_runtime_run(). This needs to > initialize the runtime data .. but .. it also needs to destroy the > runtime data from the previous run. This may not be the best way to do > it, but the I-P framework doesn't have a convenient way to clear data > between runs. > > Consider an example in which we have completed iteration x, and started > iteration x+1. If we add a NAT entry between iteration x and iteration > x+1, od->nbr->n_nat will contain the updated value of n_nat as the > northbound record will have been updated. However, as we have not yet > (re)initialized the nat entries in od, od->n_nat_entries will be equal > to the previous value. These two values could then be different which > makes this loop invalid. Does this make sense? If so, shall I update the > commit message? How would you like me to proceed? OK, this makes a lot more sense now. It makes sense that the issue specifically would be seen when destroying datapaths and not at any other point. Here is my suggestion for how to proceed: 1) Update the commit message to give an explanation similar to what you've given here. 2) Correct the finding that I just found right now and that I mention down below. > >> >> On 9/3/21 8:21 AM, Mark Gray wrote: >>> destroy_nat_entries() iterates over nat_entries using the >>> number of NAT entries from the NB database. When using I-P, >>> this behaviour is incorrect as the number of NAT entries in >>> the NBDB may differ from that stored in 'struct ovn_datapath' >>> causing unexpected behaviour. >>> >>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com> >>> --- >>> northd/northd.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/northd/northd.c b/northd/northd.c >>> index 724baa3994d5..829c4479f14b 100644 >>> --- a/northd/northd.c >>> +++ b/northd/northd.c >>> @@ -605,6 +605,8 @@ struct ovn_datapath { >>> >>> /* NAT entries configured on the router. */ >>> struct ovn_nat *nat_entries; >>> + size_t n_nat_entries; >>> + n_nat_entries is declared and never used in this function. >>> bool has_distributed_nat; >>> >>> /* Set of nat external ips on the router. */ >>> @@ -789,6 +791,7 @@ init_nat_entries(struct ovn_datapath *od) >>> od->has_distributed_nat = true; >>> } >>> } >>> + od->n_nat_entries = od->nbr->n_nat; >>> } >>> >>> static void >>> @@ -802,7 +805,7 @@ destroy_nat_entries(struct ovn_datapath *od) >>> destroy_lport_addresses(&od->dnat_force_snat_addrs); >>> destroy_lport_addresses(&od->lb_force_snat_addrs); >>> >>> - for (size_t i = 0; i < od->nbr->n_nat; i++) { >>> + for (size_t i = 0; i < od->n_nat_entries; i++) { >>> destroy_lport_addresses(&od->nat_entries[i].ext_addrs); >>> } >>> } >>> >> >
On 15/09/2021 18:58, Mark Michelson wrote: > On 9/14/21 1:27 PM, Mark Gray wrote: >> On 13/09/2021 21:41, Mark Michelson wrote: >>> Hi Mark, >>> >>> Of all the patches in this series, this is the one I am least sure about. >> Thanks for looking over these. >> >>> >>> It seems simple enough, but it's not clear to me >>> >>> 1) Why would the number of nat entries on an ovn_datapath differ from >>> what's in the database? Since this series doesn't introduce any "real" >>> incremental processing, we should be recreating the ovn_datapath on >>> every iteration. And on each iteration, we should have a fresh view of >>> the database, so why would the values differ? >>> 2) Why is this only an issue in destroy_nat_entries()? There are several >>> other places in the code where we iterate over the number of nat entries >>> using od->nbr->n_nat. Why are those not affected? >>> 3) How does this patch address that issue? The patch just sets the >>> ovn_datapath's n_nat_entries to be the same as the database's count, >>> which according to the commit message, is not always accurate. >>> >>> Can you provide some clarity here? Maybe tell of a situation where >>> od->nbr->n_nat is incorrect? >> >> You are absolutely correct as the details in the commit message are not >> sufficient to understand. In fact, I had a hard time remembering the >> reason myself! >> >> IIRC, on each iteration, we need to call en_runtime_run(). This needs to >> initialize the runtime data .. but .. it also needs to destroy the >> runtime data from the previous run. This may not be the best way to do >> it, but the I-P framework doesn't have a convenient way to clear data >> between runs. >> >> Consider an example in which we have completed iteration x, and started >> iteration x+1. If we add a NAT entry between iteration x and iteration >> x+1, od->nbr->n_nat will contain the updated value of n_nat as the >> northbound record will have been updated. However, as we have not yet >> (re)initialized the nat entries in od, od->n_nat_entries will be equal >> to the previous value. These two values could then be different which >> makes this loop invalid. Does this make sense? If so, shall I update the >> commit message? How would you like me to proceed? > > OK, this makes a lot more sense now. It makes sense that the issue > specifically would be seen when destroying datapaths and not at any > other point. > > Here is my suggestion for how to proceed: > 1) Update the commit message to give an explanation similar to what > you've given here. I posted a new series and added the comment. > 2) Correct the finding that I just found right now and that I mention > down below. I forgot to do this, I will fix in v2. Also, I am surprised I didn't see this as I though I compiled with --enable-Werror. > >> >>> >>> On 9/3/21 8:21 AM, Mark Gray wrote: >>>> destroy_nat_entries() iterates over nat_entries using the >>>> number of NAT entries from the NB database. When using I-P, >>>> this behaviour is incorrect as the number of NAT entries in >>>> the NBDB may differ from that stored in 'struct ovn_datapath' >>>> causing unexpected behaviour. >>>> >>>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com> >>>> --- >>>> northd/northd.c | 5 ++++- >>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/northd/northd.c b/northd/northd.c >>>> index 724baa3994d5..829c4479f14b 100644 >>>> --- a/northd/northd.c >>>> +++ b/northd/northd.c >>>> @@ -605,6 +605,8 @@ struct ovn_datapath { >>>> >>>> /* NAT entries configured on the router. */ >>>> struct ovn_nat *nat_entries; >>>> + size_t n_nat_entries; >>>> + > > n_nat_entries is declared and never used in this function. > >>>> bool has_distributed_nat; >>>> >>>> /* Set of nat external ips on the router. */ >>>> @@ -789,6 +791,7 @@ init_nat_entries(struct ovn_datapath *od) >>>> od->has_distributed_nat = true; >>>> } >>>> } >>>> + od->n_nat_entries = od->nbr->n_nat; >>>> } >>>> >>>> static void >>>> @@ -802,7 +805,7 @@ destroy_nat_entries(struct ovn_datapath *od) >>>> destroy_lport_addresses(&od->dnat_force_snat_addrs); >>>> destroy_lport_addresses(&od->lb_force_snat_addrs); >>>> >>>> - for (size_t i = 0; i < od->nbr->n_nat; i++) { >>>> + for (size_t i = 0; i < od->n_nat_entries; i++) { >>>> destroy_lport_addresses(&od->nat_entries[i].ext_addrs); >>>> } >>>> } >>>> >>> >> >
diff --git a/northd/northd.c b/northd/northd.c index 724baa3994d5..829c4479f14b 100644 --- a/northd/northd.c +++ b/northd/northd.c @@ -605,6 +605,8 @@ struct ovn_datapath { /* NAT entries configured on the router. */ struct ovn_nat *nat_entries; + size_t n_nat_entries; + bool has_distributed_nat; /* Set of nat external ips on the router. */ @@ -789,6 +791,7 @@ init_nat_entries(struct ovn_datapath *od) od->has_distributed_nat = true; } } + od->n_nat_entries = od->nbr->n_nat; } static void @@ -802,7 +805,7 @@ destroy_nat_entries(struct ovn_datapath *od) destroy_lport_addresses(&od->dnat_force_snat_addrs); destroy_lport_addresses(&od->lb_force_snat_addrs); - for (size_t i = 0; i < od->nbr->n_nat; i++) { + for (size_t i = 0; i < od->n_nat_entries; i++) { destroy_lport_addresses(&od->nat_entries[i].ext_addrs); } }
destroy_nat_entries() iterates over nat_entries using the number of NAT entries from the NB database. When using I-P, this behaviour is incorrect as the number of NAT entries in the NBDB may differ from that stored in 'struct ovn_datapath' causing unexpected behaviour. Signed-off-by: Mark Gray <mark.d.gray@redhat.com> --- northd/northd.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)