diff mbox series

[ovs-dev,RFC,ovn] Add VXLAN support for non-VTEP datapath bindings

Message ID 20200320050711.247351-1-ihrachys@redhat.com
State Superseded, archived
Headers show
Series [ovs-dev,RFC,ovn] Add VXLAN support for non-VTEP datapath bindings | expand

Commit Message

Ihar Hrachyshka March 20, 2020, 5:07 a.m. UTC
Hello,

this patch is not ready, sending it to collect initial feedback on the
path taken. Let me know.

Because the VXLAN VNI does not have enough space to carry all three of
datapath ID, ingress port, and egress port, the implementation omits the
ingress port and splits the 24 bits of the VNI into two 12-bit chunks:
one for the datapath and one for the egress port.
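
For illustration only, here is a minimal sketch of what such a 12/12-bit
packing could look like (the helper names, and the choice of which half
carries the datapath, are hypothetical and not taken from the patch):

    #include <stdint.h>

    #define OVN_VXLAN_PORT_BITS  12
    #define OVN_VXLAN_KEY_MASK   0xfff

    /* Pack the datapath key into the upper 12 bits of the 24-bit VNI and
     * the egress port key into the lower 12 bits. */
    static inline uint32_t
    vxlan_vni_pack(uint32_t dp_key, uint32_t port_key)
    {
        return ((dp_key & OVN_VXLAN_KEY_MASK) << OVN_VXLAN_PORT_BITS)
               | (port_key & OVN_VXLAN_KEY_MASK);
    }

    /* Recover both keys from a received VNI on the destination chassis. */
    static inline void
    vxlan_vni_unpack(uint32_t vni, uint32_t *dp_key, uint32_t *port_key)
    {
        *dp_key = (vni >> OVN_VXLAN_PORT_BITS) & OVN_VXLAN_KEY_MASK;
        *port_key = vni & OVN_VXLAN_KEY_MASK;
    }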

Limitations: because the ingress port is not passed, ACLs that rely on
it won't work over VXLAN, and the number of networks and of ports per
network is reduced (max 4096 for each).

Renamed MLF_RCV_FROM_VXLAN_BIT to MLF_RCV_FROM_VTEP_BIT to reflect
the new use case.

TODO:
* limit maximum number of networks / ports per network for vxlan
  datapaths.
* forbid acls matching against ingress port for vxlan datapaths.
* update test suite.
* update documentation.

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
---
 controller/physical.c        | 81 ++++++++++++++++++++++--------------
 include/ovn/logical-fields.h | 10 ++---
 2 files changed, 55 insertions(+), 36 deletions(-)

Comments

Ben Pfaff March 20, 2020, 3:38 p.m. UTC | #1
On Fri, Mar 20, 2020 at 01:07:11AM -0400, Ihar Hrachyshka wrote:
> this patch is not ready, sending it to collect initial feedback on the
> path taken. Let me know.
> 
> Because of limited space in VXLAN VNI to pass over all three of -
> datapath id, ingress port, egress port - the implementation ignores
> ingress; and splits the remaining 24 bits of VNI into two chunks, 12
> bits each - one for datapath and one for egress port.
> 
> Limitations: because ingress port is not passed, ACLs that rely on it
> won't work with VXLAN; reduced number of networks and ports per
> network (max 4096 for both).
> 
> Renamed MLF_RCV_FROM_VXLAN_BIT into MLF_RCV_FROM_VTEP_BIT to reflect
> the new use case.
> 
> TODO:
> * limit maximum number of networks / ports per network for vxlan
>   datapaths.
> * forbid acls matching against ingress port for vxlan datapaths.
> * update test suite.
> * update documentation.

We have a bit of a terminology problem here, which is going to add
confusion to the code and the documentation.  This is that VTEP, a VXLAN
Tunnel End-Point, has a generic definition (a node that hosts VXLAN
tunnels) that is different from how we use it in OVN (a physical switch
that connects to an OVN deployment through a simple OVSDB schema).  

This problem existed before, but before this change it was not a serious
problem because OVN only used VXLAN tunnels for the latter kind of "VTEP".
This change is going to make it worse.

We can't change generic terminology.  I suggest that we should choose a
new name for what we currently call a "VTEP" in OVN parlance, and then
change all references to VTEP to the new name.

Thanks,

Ben.
Ihar Hrachyshka March 20, 2020, 7:35 p.m. UTC | #2
On Fri, Mar 20, 2020 at 11:38 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Fri, Mar 20, 2020 at 01:07:11AM -0400, Ihar Hrachyshka wrote:
> > this patch is not ready, sending it to collect initial feedback on the
> > path taken. Let me know.
> >
> > Because of limited space in VXLAN VNI to pass over all three of -
> > datapath id, ingress port, egress port - the implementation ignores
> > ingress; and splits the remaining 24 bits of VNI into two chunks, 12
> > bits each - one for datapath and one for egress port.
> >
> > Limitations: because ingress port is not passed, ACLs that rely on it
> > won't work with VXLAN; reduced number of networks and ports per
> > network (max 4096 for both).
> >
> > Renamed MLF_RCV_FROM_VXLAN_BIT into MLF_RCV_FROM_VTEP_BIT to reflect
> > the new use case.
> >
> > TODO:
> > * limit maximum number of networks / ports per network for vxlan
> >   datapaths.
> > * forbid acls matching against ingress port for vxlan datapaths.
> > * update test suite.
> > * update documentation.
>
> We have a bit of a terminology problem here, which is going to add
> confusion to the code and the documentation.  This is that VTEP, a VXLAN
> Tunnel End-Point, has a generic definition (a node that hosts VXLAN
> tunnels) that is different from how we use it in OVN (a physical switch
> that connects to an OVN deployment through a simple OVSDB schema).
>
> This problem existed before, but before this change it was not a serious
> problem because OVN only used VXLAN tunnels for the latter kind of "VTEP".
> This change is going to make it worse.
>
> We can't change generic terminology.  I suggest that we should choose a
> new name for what we currently call a "VTEP" in OVN parlance, and then
> change all references to VTEP to the new name.

Thanks for the comment. I totally agree that VTEP terminology in OVN
is a bit confusing.

What would be good alternatives? "Alien switch"? "Foreign switch"?
"Remote switch"? (I saw "remote" being used for chassis - not sure how
they relate.) Perhaps someone with better taste - and knowledge - of
English could lend a hand here.

Assuming we pick a term to use to describe these out-of-cluster
switches, we should consider the impact of the rename. Renaming
internal symbols / functions is trivial. But "vtep" is used in OVN
schema (for example, for port binding 'type' attribute). Do we want to
rename those too? If so, what considerations should we apply when
doing it? Any guidance as to maintaining backwards compatibility?

Also, is such a rename something that should happen at the same moment
when we add support for VXLAN for in-cluster communication? Or should
it be a separate work item? (If so, do we expect it to land before or
after the core VXLAN implementation lands?)

Thanks again for the comment, let me know if there are other concerns here,
Ihar

>
> Thanks,
>
> Ben.
>
Ben Pfaff March 20, 2020, 9:06 p.m. UTC | #3
On Fri, Mar 20, 2020 at 03:35:53PM -0400, Ihar Hrachyshka wrote:
> On Fri, Mar 20, 2020 at 11:38 AM Ben Pfaff <blp@ovn.org> wrote:
> >
> > We have a bit of a terminology problem here, which is going to add
> > confusion to the code and the documentation.  This is that VTEP, a VXLAN
> > Tunnel End-Point, has a generic definition (a node that hosts VXLAN
> > tunnels) that is different from how we use it in OVN (a physical switch
> > that connects to an OVN deployment through a simple OVSDB schema).
> >
> > This problem existed before, but before this change it was not a serious
> > problem because OVN only used VXLAN tunnels for the latter kind of "VTEP".
> > This change is going to make it worse.
> >
> > We can't change generic terminology.  I suggest that we should choose a
> > new name for what we currently call a "VTEP" in OVN parlance, and then
> > change all references to VTEP to the new name.
> 
> Thanks for the comment. I totally agree that VTEP terminology in OVN
> is a bit confusing.
> 
> What would be good alternatives? "Alien switch"? "Foreign switch"?
> "Remote switch"? (I saw "remote" being used for chassis - not sure how
> they relate.) Perhaps someone with better taste - and knowledge - of
> English could lend a hand here.

Naming is hard.  Here, I think that it's more important to come up with
a distinct name than a good name, although I do hope to come up with one
that we like.

For me, "alien switch" and "foreign switch" seem distinct enough, but
not very descriptive.  I think that "remote switch" isn't distinct
enough because "remote" ends up being used in random places in the
documentation anyway.

Back at Nicira, sometimes we would talk about this kind of thing as an
"on-ramp" switch, meaning that it was good enough to get packets into
the system.  (I guess that in British/international English, this would
be a "slip road" switch.)  If people like it, we could adopt the term
"ramp switch".  I think that this would be a distinct name, since I've
never heard the term before, and it has a decent metaphor behind it.

What do you think?

> Assuming we pick a term to use to describe these out-of-cluster
> switches, we should consider the impact of the rename. Renaming
> internal symbols / functions is trivial. But "vtep" is used in OVN
> schema (for example, for port binding 'type' attribute). Do we want to
> rename those too? If so, what considerations should we apply when
> doing it? Any guidance as to maintaining backwards compatibility?
> 
> Also, is such a rename something that should happen at the same moment
> when we add support for VXLAN for in-cluster communication? Or should
> it be a separate work item? (If so, do we expect it to land before or
> after the core VXLAN implementation lands?)

We can't (or at any rate should not) change the terms in the schema, but
we can change other places and point out to people in a few places that
a "ramp switch" is sometimes, confusingly, called a "vtep".
Ihar Hrachyshka March 23, 2020, 10:39 p.m. UTC | #4
First, some questions as to implementation (or feasibility) of several
todo items in my list for the patch.

1) I initially thought that, because VXLAN would have limited space
for both networks and ports in its VNI, the encap type would not be
able to support as many of both as Geneve / STT, and so we would need
to enforce the limit programmatically somehow. But in OVN context, is
it even doable? North DB resources may be created before any chassis
are registered; once a VXLAN-only chassis joins, it's too late to
forbid the excess resources from existing (though it may be a
good time to detect this condition and perhaps fail to register the
chassis / configure flow tables). How do we want to handle this case?
Do we fail to start VXLAN configured ovn-controller when too many
networks / ports per network created? Do we forbid creating too many
resources when a chassis is registered that is VXLAN only? Both? Or do
we leave it up to the deployment / CMS to control the chassis / north
DB configuration?

2) Similar to the issue above, I originally planned to forbid using
ACLs relying on ingress port when a VXLAN chassis is involved (because
the VNI won't carry the information). I believe the approach should be
similar to how we choose to handle the issue with the maximum number
of resources, described above.

I am new to OVN so maybe there are existing examples for such
situations already that I could get inspiration from. Let me know what
you think.

More comments inline below.

On Fri, Mar 20, 2020 at 5:06 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Fri, Mar 20, 2020 at 03:35:53PM -0400, Ihar Hrachyshka wrote:
> > On Fri, Mar 20, 2020 at 11:38 AM Ben Pfaff <blp@ovn.org> wrote:
> > >
> > > We have a bit of a terminology problem here, which is going to add
> > > confusion to the code and the documentation.  This is that VTEP, a VXLAN
> > > Tunnel End-Point, has a generic definition (a node that hosts VXLAN
> > > tunnels) that is different from how we use it in OVN (a physical switch
> > > that connects to an OVN deployment through a simple OVSDB schema).
> > >
> > > This problem existed before, but before this change it was not a serious
> > > problem because OVN only used VXLAN tunnels for the latter kind of "VTEP".
> > > This change is going to make it worse.
> > >
> > > We can't change generic terminology.  I suggest that we should choose a
> > > new name for what we currently call a "VTEP" in OVN parlance, and then
> > > change all references to VTEP to the new name.
> >
> > Thanks for the comment. I totally agree that VTEP terminology in OVN
> > is a bit confusing.
> >
> > What would be good alternatives? "Alien switch"? "Foreign switch"?
> > "Remote switch"? (I saw "remote" being used for chassis - not sure how
> > they relate.) Perhaps someone with better taste - and knowledge - of
> > English could lend a hand here.
>
> Naming is hard.  Here, I think that it's more important to come with a
> distinct name than a good name, although I do hope to come up with one
> that we like.
>
> For me, "alien switch" and "foreign switch" seem distinct enough, but
> not very descriptive.  I think that "remote switch" isn't distinct
> enough because "remote" ends up being used in random places in the
> documentation anyway.
>
> Back at Nicira, sometimes we would talk about this kind of thing as an
> "on-ramp" switch, meaning that it was good enough to get packets into
> the system.  (I guess that in British/international English, this would
> be a "slip road" switch.)  If people like it, we could adopt the term
> "ramp switch".  I think that this would be a distinct name, since I've
> never heard the term before, and it has a decent metaphor behind it.
>
> What do you think?
>

I am not a native speaker, so it's hard for me to tell how this sounds
to a native, but from my perspective it's 1) unique and 2) has a
metaphor behind it that makes sense. Unless other people have strong
opinions or a better alternative, I am OK with following the "ramp
switch" terminology. I like it.

> > Assuming we pick a term to use to describe these out-of-cluster
> > switches, we should consider the impact of the rename. Renaming
> > internal symbols / functions is trivial. But "vtep" is used in OVN
> > schema (for example, for port binding 'type' attribute). Do we want to
> > rename those too? If so, what considerations should we apply when
> > doing it? Any guidance as to maintaining backwards compatibility?
> >
> > Also, is such a rename something that should happen at the same moment
> > when we add support for VXLAN for in-cluster communication? Or should
> > it be a separate work item? (If so, do we expect it to land before or
> > after the core VXLAN implementation lands?)
>
> We can't (or at any rate should not) change the terms in the schema, but
> we can change other places and point out to people in a few places that
> a "ramp switch" is sometimes, confusingly, called a "vtep".
>

Gotcha. Any preferences as to whether to consider it a preparatory
work item; a follow-up; or a part of the VXLAN implementation? (I lean
towards handling the ramp term introduction as an independent
preparatory step.)

Thanks,
Ihar
Ben Pfaff March 23, 2020, 11:47 p.m. UTC | #5
On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> First, some questions as to implementation (or feasibility) of several
> todo items in my list for the patch.
> 
> 1) I initially thought that, because VXLAN would have limited space
> for both networks and ports in its VNI, the encap type would not be
> able to support as many of both as Geneve / STT, and so we would need
> to enforce the limit programmatically somehow. But in OVN context, is
> it even doable? North DB resources may be created before any chassis
> are registered; once a chassis that is VXLAN only joins, it's too late
> to forbid the spilling resources from existence (though it may be a
> good time to detect this condition and perhaps fail to register the
> chassis / configure flow tables). How do we want to handle this case?
> Do we fail to start VXLAN configured ovn-controller when too many
> networks / ports per network created? Do we forbid creating too many
> resources when a chassis is registered that is VXLAN only? Both? Or do
> we leave it up to the deployment / CMS to control the chassis / north
> DB configuration?
>
> 2) Similar to the issue above, I originally planned to forbid using
> ACLs relying on ingress port when a VXLAN chassis is involved (because
> the VNI won't carry the information). I believe the approach should be
> similar to how we choose to handle the issue with the maximum number
> of resources, described above.
> 
> I am new to OVN so maybe there are existing examples for such
> situations already that I could get inspiration from. Let me know what
> you think.

I don't have good solutions for the above resource limit problems.  We
designed OVN so that this kind of resource limit wouldn't be a problem
in practice, so we didn't think through what would happen if the limits
suddenly became more stringent.

I think that it falls upon the CMS by default.

> > > Assuming we pick a term to use to describe these out-of-cluster
> > > switches, we should consider the impact of the rename. Renaming
> > > internal symbols / functions is trivial. But "vtep" is used in OVN
> > > schema (for example, for port binding 'type' attribute). Do we want to
> > > rename those too? If so, what considerations should we apply when
> > > doing it? Any guidance as to maintaining backwards compatibility?
> > >
> > > Also, is such a rename something that should happen at the same moment
> > > when we add support for VXLAN for in-cluster communication? Or should
> > > it be a separate work item? (If so, do we expect it to land before or
> > > after the core VXLAN implementation lands?)
> >
> > We can't (or at any rate should not) change the terms in the schema, but
> > we can change other places and point out to people in a few places that
> > a "ramp switch" is sometimes, confusingly, called a "vtep".
> 
> Gotcha. Any preferences as to whether to consider it a preparatory
> work item; a follow-up; or a part of the VXLAN implementation? (I lean
> towards handling the ramp term introduction as an independent
> preparatory step.)

I sent out a patch for people to look at.
Numan Siddique March 24, 2020, 8:07 a.m. UTC | #6
On Tue, Mar 24, 2020 at 5:17 AM Ben Pfaff <blp@ovn.org> wrote:
>
> On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > First, some questions as to implementation (or feasibility) of several
> > todo items in my list for the patch.
> >
> > 1) I initially thought that, because VXLAN would have limited space
> > for both networks and ports in its VNI, the encap type would not be
> > able to support as many of both as Geneve / STT, and so we would need
> > to enforce the limit programmatically somehow. But in OVN context, is
> > it even doable? North DB resources may be created before any chassis
> > are registered; once a chassis that is VXLAN only joins, it's too late
> > to forbid the spilling resources from existence (though it may be a
> > good time to detect this condition and perhaps fail to register the
> > chassis / configure flow tables). How do we want to handle this case?
> > Do we fail to start VXLAN configured ovn-controller when too many
> > networks / ports per network created? Do we forbid creating too many
> > resources when a chassis is registered that is VXLAN only? Both? Or do
> > we leave it up to the deployment / CMS to control the chassis / north
> > DB configuration?
> >
> > 2) Similar to the issue above, I originally planned to forbid using
> > ACLs relying on ingress port when a VXLAN chassis is involved (because
> > the VNI won't carry the information). I believe the approach should be
> > similar to how we choose to handle the issue with the maximum number
> > of resources, described above.
> >
> > I am new to OVN so maybe there are existing examples for such
> > situations already that I could get inspiration from. Let me know what
> > you think.
>
> I don't have good solutions for the above resource limit problems.  We
> designed OVN so that this kind of resource limit wouldn't be a problem
> in practice, so we didn't think through what would happen if the limits
> suddenly became more stringent.
>
> I think that it falls upon the CMS by default.

I agree. I think it should be handled by the CMS.

Thanks
Numan

>
> > > > Assuming we pick a term to use to describe these out-of-cluster
> > > > switches, we should consider the impact of the rename. Renaming
> > > > internal symbols / functions is trivial. But "vtep" is used in OVN
> > > > schema (for example, for port binding 'type' attribute). Do we want to
> > > > rename those too? If so, what considerations should we apply when
> > > > doing it? Any guidance as to maintaining backwards compatibility?
> > > >
> > > > Also, is such a rename something that should happen at the same moment
> > > > when we add support for VXLAN for in-cluster communication? Or should
> > > > it be a separate work item? (If so, do we expect it to land before or
> > > > after the core VXLAN implementation lands?)
> > >
> > > We can't (or at any rate should not) change the terms in the schema, but
> > > we can change other places and point out to people in a few places that
> > > a "ramp switch" is sometimes, confusingly, called a "vtep".
> >
> > Gotcha. Any preferences as to whether to consider it a preparatory
> > work item; a follow-up; or a part of the VXLAN implementation? (I lean
> > towards handling the ramp term introduction as an independent
> > preparatory step.)
>
> I sent out a patch for people to look at.
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Numan Siddique March 24, 2020, 1:17 p.m. UTC | #7
(Sorry, top posting it)

Hi Ihar,

I tested out your patch. All the tests pass.

I then tested out using the ovn-fake-multinode [1].

After the deployment, I changed the encap to vxlan. East/West traffic
worked fine without any issues. But I was not able to ping the gateway
router port IP from outside (172.16.0.100). The moment I changed the
encap back to geneve, the ping works. I didn't look into the issue, so
I'm not sure what the reason is. You may want to try it out once with
ovn-fake-multinode.

Thanks
Numan

[1] - https://github.com/ovn-org/ovn-fake-multinode/




On Tue, Mar 24, 2020 at 1:37 PM Numan Siddique <numans@ovn.org> wrote:
>
> On Tue, Mar 24, 2020 at 5:17 AM Ben Pfaff <blp@ovn.org> wrote:
> >
> > On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > > First, some questions as to implementation (or feasibility) of several
> > > todo items in my list for the patch.
> > >
> > > 1) I initially thought that, because VXLAN would have limited space
> > > for both networks and ports in its VNI, the encap type would not be
> > > able to support as many of both as Geneve / STT, and so we would need
> > > to enforce the limit programmatically somehow. But in OVN context, is
> > > it even doable? North DB resources may be created before any chassis
> > > are registered; once a chassis that is VXLAN only joins, it's too late
> > > to forbid the spilling resources from existence (though it may be a
> > > good time to detect this condition and perhaps fail to register the
> > > chassis / configure flow tables). How do we want to handle this case?
> > > Do we fail to start VXLAN configured ovn-controller when too many
> > > networks / ports per network created? Do we forbid creating too many
> > > resources when a chassis is registered that is VXLAN only? Both? Or do
> > > we leave it up to the deployment / CMS to control the chassis / north
> > > DB configuration?
> > >
> > > 2) Similar to the issue above, I originally planned to forbid using
> > > ACLs relying on ingress port when a VXLAN chassis is involved (because
> > > the VNI won't carry the information). I believe the approach should be
> > > similar to how we choose to handle the issue with the maximum number
> > > of resources, described above.
> > >
> > > I am new to OVN so maybe there are existing examples for such
> > > situations already that I could get inspiration from. Let me know what
> > > you think.
> >
> > I don't have good solutions for the above resource limit problems.  We
> > designed OVN so that this kind of resource limit wouldn't be a problem
> > in practice, so we didn't think through what would happen if the limits
> > suddenly became more stringent.
> >
> > I think that it falls upon the CMS by default.
>
> I agree. I think It should be handled by CMS.
>
> Thanks
> Numan
>
> >
> > > > > Assuming we pick a term to use to describe these out-of-cluster
> > > > > switches, we should consider the impact of the rename. Renaming
> > > > > internal symbols / functions is trivial. But "vtep" is used in OVN
> > > > > schema (for example, for port binding 'type' attribute). Do we want to
> > > > > rename those too? If so, what considerations should we apply when
> > > > > doing it? Any guidance as to maintaining backwards compatibility?
> > > > >
> > > > > Also, is such a rename something that should happen at the same moment
> > > > > when we add support for VXLAN for in-cluster communication? Or should
> > > > > it be a separate work item? (If so, do we expect it to land before or
> > > > > after the core VXLAN implementation lands?)
> > > >
> > > > We can't (or at any rate should not) change the terms in the schema, but
> > > > we can change other places and point out to people in a few places that
> > > > a "ramp switch" is sometimes, confusingly, called a "vtep".
> > >
> > > Gotcha. Any preferences as to whether to consider it a preparatory
> > > work item; a follow-up; or a part of the VXLAN implementation? (I lean
> > > towards handling the ramp term introduction as an independent
> > > preparatory step.)
> >
> > I sent out a patch for people to look at.
> > _______________________________________________
> > dev mailing list
> > dev@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
Ihar Hrachyshka March 24, 2020, 3:28 p.m. UTC | #8
On Tue, Mar 24, 2020 at 9:18 AM Numan Siddique <numans@ovn.org> wrote:
>
> (Sorry, top posting it)
>
> Hi Ihar,
>
> I tested out your patch. All the tests pass.
>
> I then tested out using the ovn-fake-multinode [1].
>
> After the deployment, I changed the encap to vxlan. East/West traffic
> worked fine
> without any issues.. But I was not able to ping to the gateway router port IP
> from outside (172.16.0.100). The moment I changed the encap back to
> geneve, the ping
> works. I didn't look into the issue. So not sure what's the reason.
> You may want to try out
> once with ovn-fake-multinode
>

Thanks a lot Numan. I used fake-multinode but, as you said, E-W works
and I hadn't considered the gateway case (not an excuse for missing it,
just an observation). I will take a look. I guess this suggests my
patch will have to expand the test suite a bit more than what I have
in my (local) version of the patch, which also tests E-W only.

Thanks again,
Ihar
Numan Siddique March 24, 2020, 3:36 p.m. UTC | #9
On Tue, Mar 24, 2020 at 8:58 PM Ihar Hrachyshka <ihrachys@redhat.com> wrote:
>
> On Tue, Mar 24, 2020 at 9:18 AM Numan Siddique <numans@ovn.org> wrote:
> >
> > (Sorry, top posting it)
> >
> > Hi Ihar,
> >
> > I tested out your patch. All the tests pass.
> >
> > I then tested out using the ovn-fake-multinode [1].
> >
> > After the deployment, I changed the encap to vxlan. East/West traffic
> > worked fine
> > without any issues.. But I was not able to ping to the gateway router port IP
> > from outside (172.16.0.100). The moment I changed the encap back to
> > geneve, the ping
> > works. I didn't look into the issue. So not sure what's the reason.
> > You may want to try out
> > once with ovn-fake-multinode
> >
>
> Thanks a lot Numan. I used fake-multinode but as you said, E-W works
> and I haven't considered gateway case (not an excuse for missing it,
> just an observation). I will take a look. I guess this suggests my
> patch will have to expand the test suite a bit more than what I have
> in my (local) version of the patch where I also test E-W only.

No worries. When you submit the formal patch, please also add the
necessary documentation to ovn-architecture.

Thanks
Numan

>
> Thanks again,
> Ihar
>
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Ihar Hrachyshka March 25, 2020, 5:03 p.m. UTC | #10
On Mon, Mar 23, 2020 at 7:47 PM Ben Pfaff <blp@ovn.org> wrote:
>
> On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > First, some questions as to implementation (or feasibility) of several
> > todo items in my list for the patch.
> >
> > 1) I initially thought that, because VXLAN would have limited space
> > for both networks and ports in its VNI, the encap type would not be
> > able to support as many of both as Geneve / STT, and so we would need
> > to enforce the limit programmatically somehow. But in OVN context, is
> > it even doable? North DB resources may be created before any chassis
> > are registered; once a chassis that is VXLAN only joins, it's too late
> > to forbid the spilling resources from existence (though it may be a
> > good time to detect this condition and perhaps fail to register the
> > chassis / configure flow tables). How do we want to handle this case?
> > Do we fail to start VXLAN configured ovn-controller when too many
> > networks / ports per network created? Do we forbid creating too many
> > resources when a chassis is registered that is VXLAN only? Both? Or do
> > we leave it up to the deployment / CMS to control the chassis / north
> > DB configuration?
> >
> > 2) Similar to the issue above, I originally planned to forbid using
> > ACLs relying on ingress port when a VXLAN chassis is involved (because
> > the VNI won't carry the information). I believe the approach should be
> > similar to how we choose to handle the issue with the maximum number
> > of resources, described above.
> >
> > I am new to OVN so maybe there are existing examples for such
> > situations already that I could get inspiration from. Let me know what
> > you think.
>
> I don't have good solutions for the above resource limit problems.  We
> designed OVN so that this kind of resource limit wouldn't be a problem
> in practice, so we didn't think through what would happen if the limits
> suddenly became more stringent.
>
> I think that it falls upon the CMS by default.
>

For ACLs, I think it's fair to put the burden on the CMS (just because
it should be easy for them to follow the simple rule: "Don't use
ingress-matching ACLs in your OVN driver.")

While having a guard against overflowing the resource limits in the CMS
may be helpful (for example, to give immediate failure feedback to the
CMS user, as opposed to an async notification about the conversion of a
CMS resource to an OVSDB primitive), I believe OVN should handle the
case too. The risk of not doing so is that once the limits are reached,
we start sending traffic that belongs to one network into another,
because the lower 12 bits of their datapath IDs are the same.

While the CMS could guard against that, it may be less aware of chassis
configuration than OVN. A crude way to resolve this in the CMS would be
a global configuration option, set by the deployment tool that
configures OVN, indicating whether any VXLAN-capable chassis are
deployed in the cluster. A more proper way to solve it would be to make
the CMS aware of chassis configuration by maintaining a cache of
Chassis table records and checking their encap types on each network /
port creation.

The same could be done by OVN itself, and arguably OVN is the owner of
the data source (encap records) and is in a better position to control
it:

1. on network creation, if VXLAN is enabled on any chassis, count
networks; if result >= limit, fail; same for ports per network;
2. on ovn-controller start, if VXLAN is enabled for the chassis,
calculate networks / ports per network; if result >= limit, fail to
start the service.

Note that in the most common scenario, all chassis have the same
encapsulation types registered; there are multiple ovn-controller
nodes; and resources are created after all chassis are registered in
the database. So point (2) above handles a corner case that probably
won't ever happen in real life; (1) is the hot path.
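
To make point (1) concrete, here is a rough sketch of what such a guard
could look like (all names and the exact limit handling are hypothetical,
not existing OVN code):

    #include <stdbool.h>
    #include <stddef.h>

    #define VXLAN_MAX_DATAPATHS 4096   /* 2^12 datapath keys in the VNI */

    /* Hypothetical check run on logical switch creation: refuse to create
     * a new datapath when a VXLAN-enabled chassis exists and the 12-bit
     * key space would overflow.  An analogous check would apply to ports
     * per datapath. */
    static bool
    allow_new_datapath(bool vxlan_chassis_present, size_t n_datapaths)
    {
        if (!vxlan_chassis_present) {
            return true;
        }
        return n_datapaths < VXLAN_MAX_DATAPATHS;
    }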

Any specific objections to having this kind of guard in OVN itself?
This may be in addition to CMS-side guards (to avoid even trying to
create CMS resources that are known to fail to sync to OVN).

(A similar approach may be extended to the allowed ACLs, though it's
not as pressing because there are no known CMSes that rely on
unsupported ACLs.)

Cheers,
Ihar

> > > > Assuming we pick a term to use to describe these out-of-cluster
> > > > switches, we should consider the impact of the rename. Renaming
> > > > internal symbols / functions is trivial. But "vtep" is used in OVN
> > > > schema (for example, for port binding 'type' attribute). Do we want to
> > > > rename those too? If so, what considerations should we apply when
> > > > doing it? Any guidance as to maintaining backwards compatibility?
> > > >
> > > > Also, is such a rename something that should happen at the same moment
> > > > when we add support for VXLAN for in-cluster communication? Or should
> > > > it be a separate work item? (If so, do we expect it to land before or
> > > > after the core VXLAN implementation lands?)
> > >
> > > We can't (or at any rate should not) change the terms in the schema, but
> > > we can change other places and point out to people in a few places that
> > > a "ramp switch" is sometimes, confusingly, called a "vtep".
> >
> > Gotcha. Any preferences as to whether to consider it a preparatory
> > work item; a follow-up; or a part of the VXLAN implementation? (I lean
> > towards handling the ramp term introduction as an independent
> > preparatory step.)
>
> I sent out a patch for people to look at.
>
Ihar Hrachyshka March 26, 2020, 12:58 a.m. UTC | #11
On Tue, Mar 24, 2020 at 9:18 AM Numan Siddique <numans@ovn.org> wrote:
>
> (Sorry, top posting it)
>
> Hi Ihar,
>
> I tested out your patch. All the tests pass.
>
> I then tested out using the ovn-fake-multinode [1].
>
> After the deployment, I changed the encap to vxlan. East/West traffic
> worked fine
> without any issues.. But I was not able to ping to the gateway router port IP
> from outside (172.16.0.100). The moment I changed the encap back to
> geneve, the ping
> works. I didn't look into the issue. So not sure what's the reason.
> You may want to try out
> once with ovn-fake-multinode
>
> Thanks
> Numan
>
> [1] - https://github.com/ovn-org/ovn-fake-multinode/
>

Hi Numan,

I've tried to find the issue you experienced with N-S traffic, and I
failed to reproduce it with ovn-fake-multinode or with a test case as
part of the suite (I retrofitted a N-S test case that uses geneve for
this purpose, leaving everything as is but changing the encap type to
vxlan - it passes as expected). For fake-multinode, here is what I do:

[vagrant@ovnhostvm ~]$ sudo ip netns exec ovnfake-ext ping -c 1 172.16.0.110
PING 172.16.0.110 (172.16.0.110) 56(84) bytes of data.
64 bytes from 172.16.0.110: icmp_seq=1 ttl=64 time=0.070 ms

--- 172.16.0.110 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.070/0.070/0.070/0.000 ms

Is there something I am missing? (That may very well be the case;
perhaps I'm validating outside connectivity to the router incorrectly.)

Let me know.
Ihar
Ihar Hrachyshka March 27, 2020, 6:14 p.m. UTC | #12
On Wed, Mar 25, 2020 at 1:03 PM Ihar Hrachyshka <ihrachys@redhat.com> wrote:
>
> On Mon, Mar 23, 2020 at 7:47 PM Ben Pfaff <blp@ovn.org> wrote:
> >
> > On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > > First, some questions as to implementation (or feasibility) of several
> > > todo items in my list for the patch.
> > >
> > > 1) I initially thought that, because VXLAN would have limited space
> > > for both networks and ports in its VNI, the encap type would not be
> > > able to support as many of both as Geneve / STT, and so we would need
> > > to enforce the limit programmatically somehow. But in OVN context, is
> > > it even doable? North DB resources may be created before any chassis
> > > are registered; once a chassis that is VXLAN only joins, it's too late
> > > to forbid the spilling resources from existence (though it may be a
> > > good time to detect this condition and perhaps fail to register the
> > > chassis / configure flow tables). How do we want to handle this case?
> > > Do we fail to start VXLAN configured ovn-controller when too many
> > > networks / ports per network created? Do we forbid creating too many
> > > resources when a chassis is registered that is VXLAN only? Both? Or do
> > > we leave it up to the deployment / CMS to control the chassis / north
> > > DB configuration?
> > >
> > > 2) Similar to the issue above, I originally planned to forbid using
> > > ACLs relying on ingress port when a VXLAN chassis is involved (because
> > > the VNI won't carry the information). I believe the approach should be
> > > similar to how we choose to handle the issue with the maximum number
> > > of resources, described above.
> > >
> > > I am new to OVN so maybe there are existing examples for such
> > > situations already that I could get inspiration from. Let me know what
> > > you think.
> >
> > I don't have good solutions for the above resource limit problems.  We
> > designed OVN so that this kind of resource limit wouldn't be a problem
> > in practice, so we didn't think through what would happen if the limits
> > suddenly became more stringent.
> >
> > I think that it falls upon the CMS by default.
> >
>
> For ACLs, I think it's fair to put the burden on CMS (just because it
> should be easy for them to follow the simple rule: "Don't use ingress
> matching ACLs in your OVN driver.")
>
> While having a guard against overflowing resource number limits in CMS
> may be helpful (for example, for immediate failure mode feedback to
> CMS user - compare to async notification about a CMS resource to OVSDB
> primitive conversion),
>
> I believe OVN should handle the case too. The risk of not doing it is
> - the limits are reached, and we start to send traffic that belongs to
> one network to another, because their lower 12 bits of datapath ID are
> the same.
>
> While CMS could guard against that, it may be less aware about chassis
> configuration than OVN. A dumb way to resolve this in CMS would be
> having a global configuration option set by deployment tool that
> configures OVN and that would know whether any VXLAN capable chassis
> are deployed in the cluster. A more proper way to solve it would be to
> make CMS aware of chassis configuration by maintaining a cache of
> Chassis table records and checking their encap types on each network /
> port created.
>
> The same could be done by OVN itself, and arguably OVN is the owner of
> the data source (encap records) and is in a better position to control
> it:
>
> 1. on network creation, if VXLAN is enabled on any chassis, count
> networks; if result >= limit, fail; same for ports per network;
> 2. on ovn-controller start, if VXLAN is enabled for the chassis,
> calculate networks / ports per network; if result >= limit, fail to
> start the service.
>
> Note that in most common scenario, all chassis have the same
> encapsulation types registered; there are multiple ovn-controller
> nodes; and resources are created after all chassis are registered in
> the database. So point (2) above is to handle a corner case that
> probably won't ever happen in real life. (1) is a hot path.
>
> Any specific objections to having this kind of guards in OVN itself?
> This may be in addition to CMS side guards (to avoid even trying to
> create CMS resources that are known to fail to sync to OVN).
>
> (A similar approach may be extended to ACLs allowed though it's not as
> pressing because there are no known CMS that rely on unsupported
> ACLs.)
>

The more I think about the issue, the more important it seems that OVN
be aware of the VXLAN limitations and guard against overflowing the
number of resources. Here is why.

While the CMS could relatively easily control the overall number of
resources in the database - it should be aware of its own resource
records - it does not, in the general case, control the tunnel keys
selected for datapaths. Meaning, OVN allocates the IDs on
Datapath_Binding creation. OVN selects datapath IDs sequentially,
starting from 1 up to the maximum value for the 24-bit ID, then wraps
back to the start. A problem with this approach may occur when, after a
significant number of networks have been created and then deleted, the
"next tunnel ID" counter moves past the "edge" of the 12-bit space
available for unique VXLAN datapath identifiers. Then, once a new
logical switch creation request is submitted, OVN may allocate an ID
whose lower 12 bits are the same as those of another existing switch
(the final 24-bit datapath ID would be unique, but that won't translate
into a unique ID passed to a remote hypervisor through the VXLAN VNI
due to the proposed 12/12-bit split scheme).

This is probably a bit convoluted, so to give an example, consider a
network A with datapath ID = 0b000000000000000000000001. When VXLAN is
enabled, we truncate the datapath ID to 12 bits before setting it in
the outgoing packet metadata. Then network B is created with datapath
ID = 0b000000000001000000000001. (Note the two bits set.) This unique
datapath ID will map to the same 12-bit value when it is set for the
outgoing packet, making traffic from one network flow into the other.
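
To make the clash concrete, here is a tiny self-contained check of the
two example datapath IDs above (the 0xfff mask just reflects the
proposed 12-bit truncation; it is not a symbol from the patch):

    #include <assert.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t dp_a = 0x000001;   /* network A, 24-bit tunnel key 1    */
        uint32_t dp_b = 0x001001;   /* network B, 24-bit tunnel key 4097 */

        /* Both 24-bit keys are unique, but truncated to the 12 bits that
         * fit into the proposed VNI layout they become identical. */
        assert((dp_a & 0xfff) == (dp_b & 0xfff));
        return 0;
    }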

Note that in this example, the number of switches in the database is
below the maximum number allowed for VXLAN (2^12). The only way the CMS
could guard against this scenario is to monitor all tunnel keys
allocated to all datapaths and explicitly request tunnel keys when
creating new switches, in a way that would not produce a 12-bit clash.
(There is already the `requested-tnl-key` option for this.)

It is not a good idea to offload tunnel key management onto the CMS (or
at least it's not a good idea to assume that every CMS implements this
correctly, considering that the risk of not doing so has serious tenant
privacy and connectivity implications). My belief is that OVN should
detect that VXLAN is enabled in the cluster, in which case the datapath
ID range allocated to new switches would shrink from 2^24 to 2^12 (the
bit width is halved). This would involve additional database server
work; specifically, ovsdb-server would need to, on switch and port
creation, detect VXLAN mode by fetching (probably subscribing to and
caching) all chassis encaps and checking whether any have VXLAN
enabled, and if so, adjust the maximum allowed value for datapath IDs
to 2^12.
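
As a sketch of what that adjustment could conceptually look like
(hypothetical names and placement; the actual allocation code and where
it lives may differ):

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_DP_TUNNEL_KEY        ((1u << 24) - 1)
    #define MAX_DP_VXLAN_TUNNEL_KEY  ((1u << 12) - 1)

    /* Upper bound for newly allocated datapath tunnel keys, depending on
     * whether any chassis in the cluster registers a vxlan encap. */
    static inline uint32_t
    max_dp_tunnel_key(bool vxlan_in_use)
    {
        return vxlan_in_use ? MAX_DP_VXLAN_TUNNEL_KEY : MAX_DP_TUNNEL_KEY;
    }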

Another issue that I initially hadn't considered, related to the
available space for port IDs: the proposed solution assumed 2^12 port
IDs available per network, but I missed that OVN allocates a separate
sub-range for multicast groups that occupies half of the total range
for port IDs. (The reserved multicast space is IDs 32768 through
65535.) Perhaps having 2^11 unique port IDs is still OK, but since we
have already reduced the available limits pretty significantly, this is
something to keep in mind.

Let me know what you think.
Ihar
Frank Wang(王培辉) April 26, 2020, 2:52 a.m. UTC | #13
Hi 

	I have a question about this: why not let packets received from a
VXLAN tunnel get resubmitted to the ingress pipeline again to determine
the egress port?
In that case there would be no need to carry the egress port metadata in
the tunnel. I'm not sure what would be lost in this situation.
I am also new to OVN, please correct me if I'm missing something. Thanks.


-----Original Message-----
From: dev [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Ihar Hrachyshka
Sent: March 28, 2020 2:14
To: Ben Pfaff <blp@ovn.org>
Cc: ovs dev <dev@openvswitch.org>
Subject: [Sent on behalf of openvswitch.org] Re: [ovs-dev] [PATCH RFC ovn] Add VXLAN support
for non-VTEP datapath bindings

On Wed, Mar 25, 2020 at 1:03 PM Ihar Hrachyshka <ihrachys@redhat.com> wrote:
>
> On Mon, Mar 23, 2020 at 7:47 PM Ben Pfaff <blp@ovn.org> wrote:
> >
> > On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > > First, some questions as to implementation (or feasibility) of 
> > > several todo items in my list for the patch.
> > >
> > > 1) I initially thought that, because VXLAN would have limited 
> > > space for both networks and ports in its VNI, the encap type would 
> > > not be able to support as many of both as Geneve / STT, and so we 
> > > would need to enforce the limit programmatically somehow. But in 
> > > OVN context, is it even doable? North DB resources may be created 
> > > before any chassis are registered; once a chassis that is VXLAN 
> > > only joins, it's too late to forbid the spilling resources from 
> > > existence (though it may be a good time to detect this condition 
> > > and perhaps fail to register the chassis / configure flow tables). How
do we want to handle this case?
> > > Do we fail to start VXLAN configured ovn-controller when too many 
> > > networks / ports per network created? Do we forbid creating too 
> > > many resources when a chassis is registered that is VXLAN only? 
> > > Both? Or do we leave it up to the deployment / CMS to control the 
> > > chassis / north DB configuration?
> > >
> > > 2) Similar to the issue above, I originally planned to forbid 
> > > using ACLs relying on ingress port when a VXLAN chassis is 
> > > involved (because the VNI won't carry the information). I believe 
> > > the approach should be similar to how we choose to handle the 
> > > issue with the maximum number of resources, described above.
> > >
> > > I am new to OVN so maybe there are existing examples for such 
> > > situations already that I could get inspiration from. Let me know 
> > > what you think.
> >
> > I don't have good solutions for the above resource limit problems.  
> > We designed OVN so that this kind of resource limit wouldn't be a 
> > problem in practice, so we didn't think through what would happen if 
> > the limits suddenly became more stringent.
> >
> > I think that it falls upon the CMS by default.
> >
>
> For ACLs, I think it's fair to put the burden on CMS (just because it 
> should be easy for them to follow the simple rule: "Don't use ingress 
> matching ACLs in your OVN driver.")
>
> While having a guard against overflowing resource number limits in CMS 
> may be helpful (for example, for immediate failure mode feedback to 
> CMS user - compare to async notification about a CMS resource to OVSDB 
> primitive conversion),
>
> I believe OVN should handle the case too. The risk of not doing it is
> - the limits are reached, and we start to send traffic that belongs to 
> one network to another, because their lower 12 bits of datapath ID are 
> the same.
>
> While CMS could guard against that, it may be less aware about chassis 
> configuration than OVN. A dumb way to resolve this in CMS would be 
> having a global configuration option set by deployment tool that 
> configures OVN and that would know whether any VXLAN capable chassis 
> are deployed in the cluster. A more proper way to solve it would be to 
> make CMS aware of chassis configuration by maintaining a cache of 
> Chassis table records and checking their encap types on each network / 
> port created.
>
> The same could be done by OVN itself, and arguably OVN is the owner of 
> the data source (encap records) and is in a better position to control
> it:
>
> 1. on network creation, if VXLAN is enabled on any chassis, count 
> networks; if result >= limit, fail; same for ports per network; 2. on 
> ovn-controller start, if VXLAN is enabled for the chassis, calculate 
> networks / ports per network; if result >= limit, fail to start the 
> service.
>
> Note that in most common scenario, all chassis have the same 
> encapsulation types registered; there are multiple ovn-controller 
> nodes; and resources are created after all chassis are registered in 
> the database. So point (2) above is to handle a corner case that 
> probably won't ever happen in real life. (1) is a hot path.
>
> Any specific objections to having this kind of guards in OVN itself?
> This may be in addition to CMS side guards (to avoid even trying to 
> create CMS resources that are known to fail to sync to OVN).
>
> (A similar approach may be extended to ACLs allowed though it's not as 
> pressing because there are no known CMS that rely on unsupported
> ACLs.)
>

The more I think about the issue the more important it looks that OVN is
aware of VXLAN limitations and guards against overflowing the number of
resources. Here is why.

While CMS could relatively easy control the overall number of resources in
database - it should be aware of its own resource records
- it does not, in general case, control tunnel keys selected for datapaths.
Meaning, OVN allocates the IDs on Datapath_Binding creation. OVN selects
datapath IDs sequentially, starting from 1 up to max value for the 24-bit
ID, then wraps to the start. A problem with this approach may occur when
after a significant number of networks were created and then deleted, the
"next tunnel ID" counter moves to the "edge" of 12-bit space available for
unique VXLAN datapath identifiers. Then once a new logical switch creation
request is submitted, OVN may allocate an ID that would have the same lower
12-bits of the new datapath ID as another existing switch (the final 24-bit
datapath ID would be unique but that won't translate into a unique ID passed
to a remote hypervisor through VXLAN VNI due to the proposed 12/12-bit split
scheme).

This is probably a bit convoluted, so to give an example, consider there is
a network A with datapath ID = 0b000000000000000000000001.
When VXLAN is enabled, we truncate the datapath ID to 12-bits before setting
it to outgoing packet metadata. Then network B is created with datapath ID =
0b000000000001000000000001. (Note two bits set.) This unique datapath ID
will map to the same 12-bit value when setting it for the outgoing packet,
making traffic from one network to flow to another network.

Note that in this example, the number of switches in the database is below
the maximum number allowed for VXLAN (2^12). The only way CMS could guard
against this scenario is monitoring all tunnel keys allocated to all
datapaths and explicitly requesting tunnel keys when creating new switches,
doing it in a way that would not produce a 12-bit clash. (There is already
the `requested-tnl-key` option for
this.)

It is not a good idea to offload tunnel key management onto CMS (or at least
it's not a good idea to assume that all CMS implement this correctly,
considering that the risk of not doing so has serious tenant privacy and
connectivity implications). My belief is OVN should detect VXLAN enabled in
cluster, in which case datapath ID range to allocate to new switches would
be halved. (2^24 -> 2^12) This would involve additional database server
work; specifically, ovsdb-server would need to, on switch and port creation,
detect VXLAN mode by fetching (probably subscribing and caching) all chassis
encaps and checking if any have VXLAN enabled, and if so, adjust the maximum
allowed value for datapath IDs to 2^12.

Another issue that I initially haven't considered that is related to the
available space for port IDs is that I assumed 2^12 port IDs available per
network in the proposed solution; but I missed that OVN allocates separate
sub-range for multicast groups that occupies half of the total range for
port IDs. (The reserved multicast space is IDs
32768 through 65535.) Perhaps having 2^11 for unique port IDs is still ok
but since we already reduced the available limits pretty significantly, this
is something to keep in mind.

Let me know what you think.
Ihar
Ihar Hrachyshka May 4, 2020, 6:07 p.m. UTC | #14
On Sat, Apr 25, 2020 at 10:55 PM Frank Wang(王培辉) <wangpeihui@inspur.com> wrote:
>
> Hi
>
>         I have a question about this, why don't let packets received from a
> VXLAN tunnel which get resubmitted to ingress pipeline again to determine
> the egress port?
> In this case, there is no need to carry egress port metadata in tunnel, I'm
> not sure what's lost in this situation
> I am also new to OVN, please correct me if I miss something, Thanks.
>

New to OVN? Me too. :)

AFAIU, if a packet arrives at a chassis through a tunnel, the chassis
doesn't have enough data to calculate which port it came from - which
is required if you want to run the ingress pipeline. If you attach the
destination port that was already calculated by the originating chassis
to the payload, then you don't need the ingress port, because it is
used only in the ingress pipeline, which is skipped when processing
packets arriving from tunnels.

If you have ideas on how to determine the incoming port on the
destination side without attaching it when sending the packet through
the tunnel, that would of course solve a lot of problems! Let me know.

Ihar

>
> -----Original Message-----
> From: dev [mailto:ovs-dev-bounces@openvswitch.org] On Behalf Of Ihar Hrachyshka
> Sent: March 28, 2020 2:14
> To: Ben Pfaff <blp@ovn.org>
> Cc: ovs dev <dev@openvswitch.org>
> Subject: [Sent on behalf of openvswitch.org] Re: [ovs-dev] [PATCH RFC ovn] Add VXLAN support
> for non-VTEP datapath bindings
>
> On Wed, Mar 25, 2020 at 1:03 PM Ihar Hrachyshka <ihrachys@redhat.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 7:47 PM Ben Pfaff <blp@ovn.org> wrote:
> > >
> > > On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > > > First, some questions as to implementation (or feasibility) of
> > > > several todo items in my list for the patch.
> > > >
> > > > 1) I initially thought that, because VXLAN would have limited
> > > > space for both networks and ports in its VNI, the encap type would
> > > > not be able to support as many of both as Geneve / STT, and so we
> > > > would need to enforce the limit programmatically somehow. But in
> > > > OVN context, is it even doable? North DB resources may be created
> > > > before any chassis are registered; once a chassis that is VXLAN
> > > > only joins, it's too late to forbid the spilling resources from
> > > > existence (though it may be a good time to detect this condition
> > > > and perhaps fail to register the chassis / configure flow tables). How
> do we want to handle this case?
> > > > Do we fail to start VXLAN configured ovn-controller when too many
> > > > networks / ports per network created? Do we forbid creating too
> > > > many resources when a chassis is registered that is VXLAN only?
> > > > Both? Or do we leave it up to the deployment / CMS to control the
> > > > chassis / north DB configuration?
> > > >
> > > > 2) Similar to the issue above, I originally planned to forbid
> > > > using ACLs relying on ingress port when a VXLAN chassis is
> > > > involved (because the VNI won't carry the information). I believe
> > > > the approach should be similar to how we choose to handle the
> > > > issue with the maximum number of resources, described above.
> > > >
> > > > I am new to OVN so maybe there are existing examples for such
> > > > situations already that I could get inspiration from. Let me know
> > > > what you think.
> > >
> > > I don't have good solutions for the above resource limit problems.
> > > We designed OVN so that this kind of resource limit wouldn't be a
> > > problem in practice, so we didn't think through what would happen if
> > > the limits suddenly became more stringent.
> > >
> > > I think that it falls upon the CMS by default.
> > >
> >
> > For ACLs, I think it's fair to put the burden on CMS (just because it
> > should be easy for them to follow the simple rule: "Don't use ingress
> > matching ACLs in your OVN driver.")
> >
> > While having a guard against overflowing resource number limits in CMS
> > may be helpful (for example, for immediate failure mode feedback to
> > CMS user - compare to async notification about a CMS resource to OVSDB
> > primitive conversion),
> >
> > I believe OVN should handle the case too. The risk of not doing it is
> > - the limits are reached, and we start to send traffic that belongs to
> > one network to another, because their lower 12 bits of datapath ID are
> > the same.
> >
> > While CMS could guard against that, it may be less aware about chassis
> > configuration than OVN. A dumb way to resolve this in CMS would be
> > having a global configuration option set by deployment tool that
> > configures OVN and that would know whether any VXLAN capable chassis
> > are deployed in the cluster. A more proper way to solve it would be to
> > make CMS aware of chassis configuration by maintaining a cache of
> > Chassis table records and checking their encap types on each network /
> > port created.
> >
> > The same could be done by OVN itself, and arguably OVN is the owner of
> > the data source (encap records) and is in a better position to control
> > it:
> >
> > 1. on network creation, if VXLAN is enabled on any chassis, count
> > networks; if result >= limit, fail; same for ports per network; 2. on
> > ovn-controller start, if VXLAN is enabled for the chassis, calculate
> > networks / ports per network; if result >= limit, fail to start the
> > service.
> >
> > Note that in the most common scenario, all chassis have the same
> > encapsulation types registered, there are multiple ovn-controller
> > nodes, and resources are created after all chassis are registered in
> > the database. So point (2) above handles a corner case that probably
> > won't ever happen in real life, while point (1) is on a hot path.
> >
> > Any specific objections to having this kind of guard in OVN itself?
> > It could live alongside CMS-side guards (which avoid even trying to
> > create CMS resources that are known to fail to sync to OVN).
> >
> > (A similar approach could be extended to the ACLs that are allowed,
> > though that is less pressing because no known CMS relies on the
> > unsupported ACLs.)
> >
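A very rough sketch of the ACL guard alluded to above: reject an ACL
whose match references the ingress port whenever a VXLAN chassis exists.
Matching on the raw string with strstr() is purely illustrative (a real
check would inspect the parsed match expression), and the function and
parameter names are invented for this example.

    #include <stdbool.h>
    #include <string.h>

    /* Illustrative only: substring matching on "inport" can obviously
     * false-positive; a real implementation would walk the parsed
     * expression tree instead of the match string. */
    static bool
    acl_match_allowed(const char *acl_match, bool any_chassis_has_vxlan)
    {
        if (!any_chassis_has_vxlan) {
            return true;
        }
        return strstr(acl_match, "inport") == NULL;
    }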
>
> The more I think about the issue, the more important it looks for OVN
> itself to be aware of the VXLAN limitations and to guard against
> exceeding the resource limits. Here is why.
>
> While the CMS can relatively easily control the overall number of
> resources in the database - it should be aware of its own resource
> records - it does not, in the general case, control the tunnel keys
> selected for datapaths. OVN allocates those IDs on Datapath_Binding
> creation, sequentially, starting from 1 up to the maximum 24-bit value,
> then wraps back to the start. A problem with this approach can occur
> when, after a significant number of networks have been created and then
> deleted, the "next tunnel ID" counter moves past the "edge" of the
> 12-bit space available for unique VXLAN datapath identifiers. Once the
> next logical switch creation request is submitted, OVN may then
> allocate an ID whose lower 12 bits match those of another existing
> switch (the full 24-bit datapath ID is still unique, but that
> uniqueness does not survive the proposed 12/12-bit split when the ID is
> passed to a remote hypervisor in the VXLAN VNI).
>
> This is probably a bit convoluted, so to give an example: consider a
> network A with datapath ID = 0b000000000000000000000001. When VXLAN is
> enabled, we truncate the datapath ID to 12 bits before setting it in
> the outgoing packet's metadata. Then network B is created with datapath
> ID = 0b000000000001000000000001. (Note the two bits set.) This unique
> datapath ID maps to the same 12-bit value when set on an outgoing
> packet, so traffic that belongs to one network flows into the other.
>
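A small, self-contained sketch of the collision above and of the
proposed 12/12-bit VNI split (low 12 bits: datapath key, high 12 bits:
egress port key). This is purely for illustration; the patch itself
performs the equivalent moves on MFF_TUN_ID with put_load()/put_move().

    #include <assert.h>
    #include <stdint.h>

    /* Compose a 24-bit VNI from 12-bit datapath and port keys. */
    static uint32_t
    compose_vni(uint32_t dp_key, uint32_t port_key)
    {
        return (dp_key & 0xfff) | ((port_key & 0xfff) << 12);
    }

    int
    main(void)
    {
        /* The two datapath IDs from the example above. */
        uint32_t dp_a = 0x000001;   /* 0b000000000000000000000001 */
        uint32_t dp_b = 0x001001;   /* 0b000000000001000000000001 */

        assert(dp_a != dp_b);                      /* Unique 24-bit keys...     */
        assert((dp_a & 0xfff) == (dp_b & 0xfff));  /* ...same lower 12 bits.    */

        /* Receive side: the low 12 bits become MFF_LOG_DATAPATH and the
         * high 12 bits become MFF_LOG_OUTPORT. */
        uint32_t vni = compose_vni(dp_b, 0x456);
        assert((vni & 0xfff) == (dp_b & 0xfff));
        assert(((vni >> 12) & 0xfff) == 0x456);
        return 0;
    }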
> Note that in this example, the number of switches in the database is
> below the maximum allowed for VXLAN (2^12). The only way the CMS could
> guard against this scenario is to monitor all tunnel keys allocated to
> all datapaths and explicitly request tunnel keys when creating new
> switches, in a way that never produces a 12-bit clash. (There is
> already the `requested-tnl-key` option for this.)
>
> It is not a good idea to offload tunnel key management onto the CMS (or
> at least it's not a good idea to assume that every CMS implements it
> correctly, considering that getting it wrong has serious tenant privacy
> and connectivity implications). My belief is that OVN should detect
> when VXLAN is enabled in the cluster and, in that case, shrink the
> datapath ID range allocated to new switches from 2^24 to 2^12 values.
> This would involve additional database server work; specifically,
> ovsdb-server would need to, on switch and port creation, detect VXLAN
> mode by fetching (probably subscribing to and caching) all chassis
> encaps and checking whether any of them have VXLAN enabled, and if so,
> limit the maximum allowed datapath ID to 2^12.
>
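A minimal sketch of that idea, with invented names; how VXLAN presence
would actually be detected and where the allocator lives are assumptions
here, not existing OVN code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Maximum datapath tunnel key, depending on whether any chassis in
     * the cluster registered a VXLAN encap. */
    #define DP_KEY_MAX_FULL  ((1u << 24) - 1)   /* Geneve/STT: 24 VNI bits. */
    #define DP_KEY_MAX_VXLAN ((1u << 12) - 1)   /* VXLAN split: 12 bits.    */

    static uint32_t
    max_dp_tunnel_key(bool any_chassis_has_vxlan)
    {
        return any_chassis_has_vxlan ? DP_KEY_MAX_VXLAN : DP_KEY_MAX_FULL;
    }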
> Another issue I initially hadn't considered, related to the available
> space for port IDs: the proposed solution assumes 2^12 port IDs are
> available per network, but I missed that OVN allocates a separate
> sub-range for multicast groups that occupies half of the total port ID
> range. (The reserved multicast space is IDs 32768 through 65535.)
> Perhaps having 2^11 unique port IDs is still OK, but since we have
> already reduced the available limits quite significantly, this is
> something to keep in mind.
>
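One way to read that arithmetic, assuming the existing convention (the
top half of the port key space reserved for multicast groups) were
mirrored into the 12-bit VNI slot. Purely illustrative; the patch does
not implement this.

    #include <stdio.h>

    int
    main(void)
    {
        unsigned port_bits = 12;           /* VNI bits left for the port key. */
        unsigned total = 1u << port_bits;  /* 4096 values.                    */
        unsigned mc_reserved = total / 2;  /* Top half kept for MC groups.    */
        unsigned unicast = total - mc_reserved;  /* 2048 = 2^11 port IDs.     */
        printf("regular port IDs per datapath: %u\n", unicast);
        return 0;
    }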
> Let me know what you think.
> Ihar
>

Patch

diff --git a/controller/physical.c b/controller/physical.c
index 144aeb7bd..28f639480 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -180,7 +180,8 @@  static void
 put_encapsulation(enum mf_field_id mff_ovn_geneve,
                   const struct chassis_tunnel *tun,
                   const struct sbrec_datapath_binding *datapath,
-                  uint16_t outport, struct ofpbuf *ofpacts)
+                  uint16_t outport, bool is_vtep,
+                  struct ofpbuf *ofpacts)
 {
     if (tun->type == GENEVE) {
         put_load(datapath->tunnel_key, MFF_TUN_ID, 0, 24, ofpacts);
@@ -191,7 +192,10 @@  put_encapsulation(enum mf_field_id mff_ovn_geneve,
                  MFF_TUN_ID, 0, 64, ofpacts);
         put_move(MFF_LOG_INPORT, 0, MFF_TUN_ID, 40, 15, ofpacts);
     } else if (tun->type == VXLAN) {
-        put_load(datapath->tunnel_key, MFF_TUN_ID, 0, 24, ofpacts);
+        uint64_t vni = (is_vtep?
+                        datapath->tunnel_key :
+                        datapath->tunnel_key | ((uint64_t) outport << 12));
+        put_load(vni, MFF_TUN_ID, 0, 24, ofpacts);
     } else {
         OVS_NOT_REACHED();
     }
@@ -323,8 +327,9 @@  put_remote_port_redirect_overlay(const struct
         if (!rem_tun) {
             return;
         }
-        put_encapsulation(mff_ovn_geneve, tun, binding->datapath,
-                          port_key, ofpacts_p);
+        put_encapsulation(mff_ovn_geneve, tun, binding->datapath, port_key,
+                          !strcmp(binding->type, "vtep"),
+                          ofpacts_p);
         /* Output to tunnel. */
         ofpact_put_OUTPUT(ofpacts_p)->port = rem_tun->ofport;
     } else {
@@ -360,8 +365,9 @@  put_remote_port_redirect_overlay(const struct
             return;
         }
 
-        put_encapsulation(mff_ovn_geneve, tun, binding->datapath,
-                          port_key, ofpacts_p);
+        put_encapsulation(mff_ovn_geneve, tun, binding->datapath, port_key,
+                          !strcmp(binding->type, "vtep"),
+                          ofpacts_p);
 
         /* Output to tunnels with active/backup */
         struct ofpact_bundle *bundle = ofpact_put_BUNDLE(ofpacts_p);
@@ -1364,7 +1370,7 @@  consider_mc_group(enum mf_field_id mff_ovn_geneve,
 
             if (!prev || tun->type != prev->type) {
                 put_encapsulation(mff_ovn_geneve, tun, mc->datapath,
-                                  mc->tunnel_key, &remote_ofpacts);
+                                  mc->tunnel_key, true, &remote_ofpacts);
                 prev = tun;
             }
             ofpact_put_OUTPUT(&remote_ofpacts)->port = tun->ofport;
@@ -1609,11 +1615,12 @@  physical_run(struct physical_ctx *p_ctx,
      * Process packets that arrive from a remote hypervisor (by matching
      * on tunnel in_port). */
 
-    /* Add flows for Geneve and STT encapsulations.  These
-     * encapsulations have metadata about the ingress and egress logical
-     * ports.  We set MFF_LOG_DATAPATH, MFF_LOG_INPORT, and
-     * MFF_LOG_OUTPORT from the tunnel key data, then resubmit to table
-     * 33 to handle packets to the local hypervisor. */
+    /* Add flows for Geneve, STT and non-VTEP VXLAN encapsulations.  Geneve and
+     * STT encapsulations have metadata about the ingress and egress logical
+     * ports.  Non-VTEP VXLAN encapsulations have metadata about the egress
+     * logical port only. We set MFF_LOG_DATAPATH, MFF_LOG_INPORT, and
+     * MFF_LOG_OUTPORT from the tunnel key data where possible, then resubmit
+     * to table 33 to handle packets to the local hypervisor. */
     HMAP_FOR_EACH (tun, hmap_node, &tunnels) {
         struct match match = MATCH_CATCHALL_INITIALIZER;
         match_set_in_port(&match, tun->ofport);
@@ -1642,11 +1649,7 @@  physical_run(struct physical_ctx *p_ctx,
                         &ofpacts, hc_uuid);
     }
 
-    /* Add flows for VXLAN encapsulations.  Due to the limited amount of
-     * metadata, we only support VXLAN for connections to gateways.  The
-     * VNI is used to populate MFF_LOG_DATAPATH.  The gateway's logical
-     * port is set to MFF_LOG_INPORT.  Then the packet is resubmitted to
-     * table 16 to determine the logical egress port. */
+    /* Handle VXLAN encapsulations. */
     HMAP_FOR_EACH (tun, hmap_node, &tunnels) {
         if (tun->type != VXLAN) {
             continue;
@@ -1662,20 +1665,36 @@  physical_run(struct physical_ctx *p_ctx,
                 continue;
             }
 
-            match_set_in_port(&match, tun->ofport);
-            match_set_tun_id(&match, htonll(binding->datapath->tunnel_key));
-
-            ofpbuf_clear(&ofpacts);
-            put_move(MFF_TUN_ID, 0,  MFF_LOG_DATAPATH, 0, 24, &ofpacts);
-            put_load(binding->tunnel_key, MFF_LOG_INPORT, 0, 15, &ofpacts);
-            /* For packets received from a vxlan tunnel, set a flag to that
-             * effect. */
-            put_load(1, MFF_LOG_FLAGS, MLF_RCV_FROM_VXLAN_BIT, 1, &ofpacts);
-            put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, &ofpacts);
-
-            ofctrl_add_flow(flow_table, OFTABLE_PHY_TO_LOG, 100,
-                            binding->header_.uuid.parts[0],
-                            &match, &ofpacts, hc_uuid);
+            if (!strcmp(binding->type, "vtep")) {
+                /* Add flows for VTEP encapsulations.  The VNI is used to
+                 * populate MFF_LOG_DATAPATH.  The gateway's logical port is
+                 * set to MFF_LOG_INPORT.  Then the packet is resubmitted to
+                 * table 8 to determine the logical egress port. */
+                match_set_in_port(&match, tun->ofport);
+                match_set_tun_id(&match,
+                                 htonll(binding->datapath->tunnel_key));
+
+                ofpbuf_clear(&ofpacts);
+                put_move(MFF_TUN_ID, 0,  MFF_LOG_DATAPATH, 0, 24, &ofpacts);
+                put_load(binding->tunnel_key, MFF_LOG_INPORT, 0, 15, &ofpacts);
+                /* For packets received from a VTEP tunnel, set a flag to that
+                 * effect. */
+                put_load(1, MFF_LOG_FLAGS, MLF_RCV_FROM_VTEP_BIT, 1, &ofpacts);
+                put_resubmit(OFTABLE_LOG_INGRESS_PIPELINE, &ofpacts);
+
+                ofctrl_add_flow(flow_table, OFTABLE_PHY_TO_LOG, 100,
+                                binding->header_.uuid.parts[0],
+                                &match, &ofpacts, hc_uuid);
+            } else {
+                /* Add flows for non-VTEP tunnels. Split VNI into two 12-bit
+                 * sections and use them for datapath and outport IDs. */
+                put_move(MFF_TUN_ID, 12, MFF_LOG_OUTPORT,  0, 12, &ofpacts);
+                put_move(MFF_TUN_ID, 0, MFF_LOG_DATAPATH, 0, 12, &ofpacts);
+
+                put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
+                ofctrl_add_flow(flow_table, OFTABLE_PHY_TO_LOG, 100, 0,
+                                &match, &ofpacts, hc_uuid);
+            }
         }
     }
 
diff --git a/include/ovn/logical-fields.h b/include/ovn/logical-fields.h
index c7bd2dba9..c5a25b07d 100644
--- a/include/ovn/logical-fields.h
+++ b/include/ovn/logical-fields.h
@@ -51,7 +51,7 @@  void ovn_init_symtab(struct shash *symtab);
 /* MFF_LOG_FLAGS_REG bit assignments */
 enum mff_log_flags_bits {
     MLF_ALLOW_LOOPBACK_BIT = 0,
-    MLF_RCV_FROM_VXLAN_BIT = 1,
+    MLF_RCV_FROM_VTEP_BIT = 1,
     MLF_FORCE_SNAT_FOR_DNAT_BIT = 2,
     MLF_FORCE_SNAT_FOR_LB_BIT = 3,
     MLF_LOCAL_ONLY_BIT = 4,
@@ -64,11 +64,11 @@  enum mff_log_flags {
     /* Allow outputting back to inport. */
     MLF_ALLOW_LOOPBACK = (1 << MLF_ALLOW_LOOPBACK_BIT),
 
-    /* Indicate that a packet was received from a VXLAN tunnel to
+    /* Indicate that a packet was received from a VTEP tunnel to
      * compensate for the lack of egress port information available in
-     * VXLAN encapsulation.  Egress port information is available for
-     * Geneve and STT tunnel types. */
-    MLF_RCV_FROM_VXLAN = (1 << MLF_RCV_FROM_VXLAN_BIT),
+     * VTEP encapsulation.  Egress port information is available for
+     * Geneve, STT and non-VTEP VXLAN tunnel types. */
+    MLF_RCV_FROM_VXLAN = (1 << MLF_RCV_FROM_VTEP_BIT),
 
     /* Indicate that a packet needs a force SNAT in the gateway router when
      * DNAT has taken place. */