
[ovs-dev,3/3] ovn: Update TODO, ovn-northd flow table design, ovn-architecture for L3.

Message ID 1442509883-3992-3-git-send-email-blp@nicira.com
State Superseded

Commit Message

Ben Pfaff Sept. 17, 2015, 5:11 p.m. UTC
This is a proposed plan for logical L3 in OVN.  It is not entirely
complete but it includes many important details and I believe that it moves
planning forward.

Signed-off-by: Ben Pfaff <blp@nicira.com>
---
 ovn/TODO                    | 264 +++++++++++++++++++++++++++++++++-
 ovn/northd/ovn-northd.8.xml | 342 +++++++++++++++++++++++++++++++++++++++++++-
 ovn/ovn-architecture.7.xml  |   2 +-
 ovn/ovn-sb.xml              | 109 ++++++++++++--
 4 files changed, 697 insertions(+), 20 deletions(-)

Comments

Russell Bryant Sept. 21, 2015, 8:48 p.m. UTC | #1
On 09/17/2015 01:11 PM, Ben Pfaff wrote:
> This is a proposed plan for logical L3 in OVN.  It is not entirely
> complete but it includes many important details and I believe that it moves
> planning forward.
> 
> Signed-off-by: Ben Pfaff <blp@nicira.com>
> ---
>  ovn/TODO                    | 264 +++++++++++++++++++++++++++++++++-
>  ovn/northd/ovn-northd.8.xml | 342 +++++++++++++++++++++++++++++++++++++++++++-
>  ovn/ovn-architecture.7.xml  |   2 +-
>  ovn/ovn-sb.xml              | 109 ++++++++++++--
>  4 files changed, 697 insertions(+), 20 deletions(-)
> 
> diff --git a/ovn/TODO b/ovn/TODO
> index 6f625ce..a0f5385 100644
> --- a/ovn/TODO
> +++ b/ovn/TODO
> @@ -1,3 +1,265 @@
> +-*- outline -*-
> +
> +* L3 support
> +
> +** OVN_Northbound schema
> +
> +*** Needs to support interconnected routers
> +
> +It should be possible to connect one router to another, e.g. to
> +represent a provider/tenant router relationship.  This requires
> +an OVN_Northbound schema change.

I'm curious about the use case here.

I'd like to be able to put a router between a Neutron "provider network"
(existing physical network) and a regular tenant network (OVN managed
virtual network).  Is that the kind of thing you're talking about?

Here's how I'd model that example with 2 tenants with 3 VMs each on
their own tenant networks (I think).  I was thinking it could be a
special Logical Switch to hook a localnet port to a logical router,
similar to what we do with regular vif logical ports.


Logical Switch LS1  (for tenant A)
  Logical Port LP1
  Logical Port LP2
  Logical Port LP3
  router = LR1

Logical Switch LS2  (for tenant A)
  Logical Port LP4, type=localnet, network-name=mynetwork
  router = LR1

Logical Router LR1  (for tenant A)
  Logical Router Port LRP1, network=LS1
  Logical Router Port LRP2, network=LS2


Logical Switch LS3  (for tenant B)
  Logical Port LP5
  Logical Port LP6
  Logical Port LP7
  router = LR2

Logical Switch LS4  (for tenant B)
  Logical Port LP8, type=localnet, network-name=mynetwork
  router = LR2

Logical Router LR2 (for tenant B)
  Logical Router Port LRP3, network=LS3
  Logical Router Port LRP4, network=LS4


Does that sort of configuration seem sane?  If not, how would we
accomplish the end goal?
Justin Pettit Sept. 22, 2015, 1:06 a.m. UTC | #2
> On Sep 17, 2015, at 10:11 AM, Ben Pfaff <blp@nicira.com> wrote:


Thanks for writing this up!  I'm still digesting it, but here are some initial comments:

> +        <p>
> +          L3 admission control: A priority-220 flow drops packets that match
> +          any of the following:
> +        </p>
> +
> +        <ul>
> +          <li>
> +            <code>ip.src[28..31] == 0xe</code> (multicast source)
> +          </li>
> +          <li>
> +            <code>ip.src == 255.255.255.255</code> (broadcast source)
> +          </li>
> +          <li>
> +            <code>ip.src == 127.0.0.0/8 || ip.dst == 127.0.0.0/8</code>
> +            (localhost source or destination)
> +          </li>
> +          <li>
> +            <code>ip.src == 0.0.0.0/8 || ip.dst == 0.0.0.0/8</code> (zero
> +            network source or destination)
> +          </li>

I assume these should all be "ip4.src".
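To make the first four conditions concrete, here is a rough sketch in Python rather than OVN match syntax. The function name is hypothetical, and it assumes that bits 28..31 equal to 0xe means the 224.0.0.0/4 range. It does not cover the two router-specific conditions that follow.

```python
import ipaddress

# Hypothetical helper, not part of OVN: mirrors the first four drop
# conditions quoted above, for IPv4 addresses only.
MULTICAST = ipaddress.ip_network("224.0.0.0/4")   # ip4.src[28..31] == 0xe
LOOPBACK = ipaddress.ip_network("127.0.0.0/8")
ZERO_NET = ipaddress.ip_network("0.0.0.0/8")
BROADCAST = ipaddress.ip_address("255.255.255.255")


def l3_admission_drops(src, dst):
    """Return True if the quoted priority-220 flow would drop the packet."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return (s in MULTICAST                       # multicast source
            or s == BROADCAST                    # broadcast source
            or s in LOOPBACK or d in LOOPBACK    # localhost src or dst
            or s in ZERO_NET or d in ZERO_NET)   # zero network src or dst
```

Note that a multicast destination is not dropped by this flow; that case is handled by the separate priority-190 flow further down.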

> +          <li>
> +            <code>ip.src</code> is any IP address owned by the router.
> +          </li>
> +          <li>
> +            <code>ip.src</code> is the broadcast address of any IP network
> +            known to the router.

These will be IPv4 or IPv6 specific, but may be good enough for illustrative purposes here.

> +        <p>
> +          ICMP echo reply.  These flows reply to ICMP echo requests received
> +          for the router's IP address.  Let <var>A</var> be an IP address owned
> +          by the router or the broadcast address for one of these IP addresses'
> +          networks.  Then, for each <var>A</var>, a priority-210 flow matches
> +          on <code>ip.dst == <var>A</var></code> and <code>icmp4.type == 8
> +          &amp;&amp; icmp4.code == 0</code> (ICMP echo request).  These flows
> +          use the following actions where, if <var>A</var> is unicast, then
> +          <var>S</var> is <var>A</var>, and if <var>A</var> is broadcast,
> +          <var>S</var> is the router's IP address in <var>A</var>'s network:
> +        </p>
> +
> +        <pre>
> +ip4.dst = ip4.src;
> +ip4.src = <var>S</var>;
> +ip4.ttl = 255;
> +icmp4.type = 0;
> +reg0 = ip4.dst;

Later on, it becomes clear why reg0 is being used, but it would be nice to be explicit about it earlier.

Maybe it's obvious to others, but my first thought was that this was the original "ip.dst", but now that I look at it more carefully, my guess is that it is the original "ip.src".  It might be good to add a comment to clarify.

Finally, should it always be the original source IP?  If we have multiple routers in between, shouldn't it be the gateway's address?  (This question holds for all the places where reg0 is being set.)
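Here is a small sketch of why I read reg0 as the original source. It assumes the actions run strictly in order, each seeing the results of the earlier ones (my assumption, not something the patch states), and models the packet as a plain Python dict. The function name and sample addresses are made up for illustration.

```python
def icmp_echo_reply(pkt, s):
    """Apply the quoted actions in order to a dict of packet fields."""
    pkt = dict(pkt)                    # work on a copy of the request
    pkt["ip4.dst"] = pkt["ip4.src"]    # ip4.dst = ip4.src;
    pkt["ip4.src"] = s                 # ip4.src = S;
    pkt["ip4.ttl"] = 255               # ip4.ttl = 255;
    pkt["icmp4.type"] = 0              # icmp4.type = 0;
    pkt["reg0"] = pkt["ip4.dst"]       # reg0 = ip4.dst; (dst was already rewritten)
    return pkt


request = {"ip4.src": "10.0.0.5", "ip4.dst": "10.0.0.1",
           "ip4.ttl": 64, "icmp4.type": 8}
reply = icmp_echo_reply(request, "10.0.0.1")
```

Under that assumption reply["reg0"] ends up holding the requester's address, not the router's.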

> +next;
> +</pre>

Is it possible in these examples to have "<pre>" and "</pre>"  line up?

> +      <li>
> +        <p>
> +          TCP reset.  These flows generate TCP reset messages in reply to TCP
> +          datagrams directed to the router's IP address.  The logical router
> +          doesn't accept any TCP traffic so it always generates such a reply.
> +        </p>

Do you want to add the comment about not matching on IP fragments with nonzero offset like you did for UDP and protocol unreachable?  (TCP shouldn't generate IP fragments, but it can happen.)

> +        <p>
> +          Protocol unreachable.  These flows generate ICMP protocol unreachable
> +          messages in reply to packets directed to the router's IP address on
> +          IP protocols other than UDP, TCP, and ICMP.
> +        </p>

I think for all of these error generators, we should consider some sort of rate-limiting.  Obviously, this is a little complicated if we want to do it in ovs-vswitchd--especially in the fast path.

> +      <li>
> +        Ethernet local broadcast.  A priority-190 flow with match <code>eth.dst
> +        == ff:ff:ff:ff:ff:ff</code> drops traffic destined to the local
> +        Ethernet broadcast address.  By definition this traffic should not be
> +        forwarded.
> +      </li>
> +
> +      <li>
> +        Drop IP multicast.  A priority-190 flow with match <code>ip.dst[28..31]
> +        == 0xe</code> drops IP multicast traffic.
> +      </li>

We may want to drop traffic sent to an IP broadcast address to prevent things like Smurf attacks.

> +        <p>
> +          Unknown MAC bindings.  For each non-gateway route to IPv4 network
> +          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
> +          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
> +          a logical flow with match <code>ip.dst ==
> +          <var>N</var>/<var>M</var></code>, whose priority is the number of
> +          1-bits in <var>M</var>, has the following actions:
> +        </p>
> +
> +        <pre>
> +ratelimit;

I don't think "ratelimit" is defined anywhere.

> +arp {
> +    eth.dst = ff:ff:ff:ff:ff:ff;
> +    eth.src = <var>E</var>;
> +    arp.sha = <var>E</var>;
> +    arp.tha = 00:00:00:00:00:00;
> +    arp.spa = <var>A</var>;
> +    arp.tpa = ip.dst;
> +    outport = <var>P</var>;
> +    output;

Should you set the ARP opcode?

> +        <dt><code>ip4.ttl--;</code></dt>
> +        <dd>
> +          <p>
> +            Decrements the IPv4 TTL.  If this would make the TTL zero or
> +            negative, then processing of the packet halts; no further actions
> +            are processed.  (To properly handle such cases, a higher-priority
> +            flow should match on <code>ip.ttl &lt; 2</code>.)
> +          </p>
> +
> +          <p><b>Prerequisite:</b> <code>ip4</code></p>

Is IPv6 that different?

>         </dd>
> 
> -        <dt><code>learn</code></dt>
> +        <dt><code>arp { <var>action</var>; </code>...<code> };</code></dt>
> +        <dd>
> +          <p>
> +            Temporarily replaces the IPv4 packet being processed by an ARP
> +            packet and executes each nested <var>action</var> on the ARP
> +            packet.  Actions following the <var>arp</var> action, if any, apply
> +            to the original, unmodified packet.
> +          </p>

So what would we normally do with the original packet?  Do we just drop it?  That seems kind of unfortunate.

> 
> -        <dt><code>dec_ttl { <var>action</var>, </code>...<code> } { <var>action</var>; </code>...<code>};</code></dt>
> +        <dt><code>icmp4 { <var>action</var>; </code>...<code> };</code></dt>
>         <dd>
> -          decrement TTL; execute first set of actions if
> -          successful, second set if TTL decrement fails
> +          <p>
> +            Temporarily replaces the IPv4 packet being processed by an ICMPv4
> +            packet and executes each nested <var>action</var> on the ARP

Do you mean IPv4 instead of ARP?

> +            packet.  Actions following the <var>icmp4</var> action, if any,
> +            apply to the original, unmodified packet.
> +          </p>
> +
> +          <p>
> +            The ICMPv4 packet that this action operates on is initialized based
> +            on the IPv4 packet being processed, as follows.  Ethernet and IPv4
> +            fields not listed here are not changed:
> +          </p>
> +
> +          <ul>
> +            <li><code>ip.proto = 1</code> (ICMPv4)</li>
> +            <li><code>ip.frag = 0</code> (not a fragment)</li>
> +            <li><code>icmp4.type = 3</code> (destination unreachable)</li>
> +            <li><code>icmp4.code = 1</code> (host unreachable)</li>

I assume this is just an example since we support other types and codes, so may want to mention that in the description.

I'm going to spend some more time thinking about it, and then I'll follow up with some additional feedback.

Thanks again.

--Justin
Ben Pfaff Sept. 29, 2015, 9:34 p.m. UTC | #3
On Mon, Sep 21, 2015 at 04:48:04PM -0400, Russell Bryant wrote:
> On 09/17/2015 01:11 PM, Ben Pfaff wrote:
> > This is a proposed plan for logical L3 in OVN.  It is not entirely
> > complete but it includes many important details and I believe that it moves
> > planning forward.
> > 
> > Signed-off-by: Ben Pfaff <blp@nicira.com>
> > ---
> >  ovn/TODO                    | 264 +++++++++++++++++++++++++++++++++-
> >  ovn/northd/ovn-northd.8.xml | 342 +++++++++++++++++++++++++++++++++++++++++++-
> >  ovn/ovn-architecture.7.xml  |   2 +-
> >  ovn/ovn-sb.xml              | 109 ++++++++++++--
> >  4 files changed, 697 insertions(+), 20 deletions(-)
> > 
> > diff --git a/ovn/TODO b/ovn/TODO
> > index 6f625ce..a0f5385 100644
> > --- a/ovn/TODO
> > +++ b/ovn/TODO
> > @@ -1,3 +1,265 @@
> > +-*- outline -*-
> > +
> > +* L3 support
> > +
> > +** OVN_Northbound schema
> > +
> > +*** Needs to support interconnected routers
> > +
> > +It should be possible to connect one router to another, e.g. to
> > +represent a provider/tenant router relationship.  This requires
> > +an OVN_Northbound schema change.
> 
> I'm curious about the use case here.

I think that it's always possible to "cross-product" a topology of
routers into a single router, so it's not strictly necessary since
whatever runs above OVS (such as Neutron) could do the cross-producting,
or we could put it into ovn-northd.  That said, there are a few use
cases.

One is to make it easier for users to reproduce in a logical network the
structure of some existing physical network that includes multiple
routers.  That's essentially for convenience.

Another is for multitenant environments where each tenant might control
its own set of logical switches and logical routers, and then a
higher-level router connects the tenants' routers.

> I'd like to be able to put a router between a Neutron "provider network"
> (existing physical network) and a regular tenant network (OVN managed
> virtual network).  Is that the kind of thing you're talking about?

When I said "provider" and "tenant" above, I was speaking in terms of
the plain meanings of those words.  I might have mis-stepped into
Neutron terminology that has very specific meanings, like the Neutron
"provider networks" that we've implemented in OVN through localnet
logical ports.  So there might be some impedance mismatch.

But what you mention might be a way to apply router topologies.  Let me
see...

> Here's how I'd model that example with 2 tenants with 3 VMs each on
> their own tenant networks (I think).  I was thinking it could be a
> special Logical Switch to hook a localnet port to a logical router,
> similar to what we do with regular vif logical ports.
> 
> 
> Logical Switch LS1  (for tenant A)
>   Logical Port LP1
>   Logical Port LP2
>   Logical Port LP3
>   router = LR1
> 
> Logical Switch LS2  (for tenant A)
>   Logical Port LP4, type=localnet, network-name=mynetwork
>   router = LR1
> 
> Logical Router LR1  (for tenant A)
>   Logical Router Port LRP1, network=LS1
>   Logical Router Port LRP2, network=LS2
> 
> 
> Logical Switch LS3  (for tenant B)
>   Logical Port LP5
>   Logical Port LP6
>   Logical Port LP7
>   router = LR2
> 
> Logical Switch LS4  (for tenant B)
>   Logical Port LP8, type=localnet, network-name=mynetwork
>   router = LR2
> 
> Logical Router LR2 (for tenant B)
>   Logical Router Port LRP3, network=LS3
>   Logical Router Port LRP4, network=LS4
> 
> 
> Does that sort of configuration seem sane?  If not, how would we
> accomplish the end goal?

I think that's one way to accomplish it.  I had something more like this
in mind (mostly copied from yours).  It's somewhat degenerate in that
each tenant only has one router and one switch of its own so that it
could be modeled without LR1 and LR2, but I think it still makes the
point:

    Logical Switch LS1  (for tenant A)
      Logical Port LP1
      Logical Port LP2
      Logical Port LP3
      router = LR1

    Logical Router LR1  (for tenant A)
      Logical Router Port LRP1 connects to LS1
      Logical Router Port LRP2 connects to LR3

    Logical Switch LS2  (for tenant B)
      Logical Port LP5
      Logical Port LP6
      Logical Port LP7
      router = LR2

    Logical Router LR2 (for tenant B)
      Logical Router Port LRP3 connects to LS2
      Logical Router Port LRP4 connects to LR3

    Logical Router LR3 (connects tenants together)
      Logical Router Port LRP5 connects to LR1
      Logical Router Port LRP6 connects to LR2
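
As a sanity check on this layout, here is a rough Python sketch that treats the connections above as an undirected graph and finds the hop sequence between the two tenant switches. The link list is only an illustration of the topology, not the OVN_Northbound schema, and the function name is hypothetical.

```python
from collections import deque

# Links taken from the layout above (switch-router and router-router).
LINKS = [("LS1", "LR1"), ("LR1", "LR3"), ("LS2", "LR2"), ("LR2", "LR3")]


def path(src, dst, links=LINKS):
    """Breadth-first search; returns the list of datapaths from src to dst."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            hops = []
            while node is not None:
                hops.append(node)
                node = prev[node]
            return hops[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # no route between the two datapaths
```

Traffic from tenant A's switch to tenant B's switch has to cross LR1, LR3, and LR2 in that order, which is the router-to-router hop that the schema change would need to express.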

Patch

diff --git a/ovn/TODO b/ovn/TODO
index 6f625ce..a0f5385 100644
--- a/ovn/TODO
+++ b/ovn/TODO
@@ -1,3 +1,265 @@ 
+-*- outline -*-
+
+* L3 support
+
+** OVN_Northbound schema
+
+*** Needs to support interconnected routers
+
+It should be possible to connect one router to another, e.g. to
+represent a provider/tenant router relationship.  This requires
+an OVN_Northbound schema change.
+
+*** Needs to support extra routes
+
+Currently a router port has a single route associated with it, but
+presumably we should support multiple routes.  For connections from
+one router to another, this doesn't seem to matter (just put more than
+one connection between them), but for connections between a router and
+a switch it might matter because a switch has only one router port.
+
+** OVN_SB schema
+
+*** Logical datapath interconnection
+
+There needs to be a way in the OVN_Southbound database to express
+connections between logical datapaths, so that packets can pass from a
+logical switch to its logical router (and vice versa) and from one
+logical router to another.
+
+One way to do this would be to introduce logical patch ports, closely
+modeled on the "physical" patch ports that OVS has had for ages.  Each
+logical patch port would consist of two rows in the Port_Binding table
+(one in each logical datapath), with type "patch" and an option "peer"
+that names the other logical port in the pair.
+
+If we do it this way then we'll need to figure out one odd special
+case.  Currently the ACL table documents that the logical router port
+is always named "ROUTER".  This can't be done directly with this patch
+port technique, because every row in the Logical_Port table must have
+a unique name.  This probably means that we should change the
+convention for the ACL table so that the logical router port name is
+unique; for example, we could change the Logical_Router_Port table to
+require the 'name' column to be unique, and then use that name in the
+ACL table.
+
+*** Allow output to ingress port
+
+Sometimes when a packet ingresses into a router, it has to egress the
+same port.  One example is a "one-armed" router that has multiple
+routes on a single port (or in which a host is (mis)configured to send
+every IP packet to the router, e.g. due to a bad netmask).  Another is
+when a router needs to send an ICMP reply to an ingressing packet.
+
+To some degree this problem is layered, because there are two
+different notions of "ingress port".  The first is the OpenFlow
+ingress port, essentially a physical port identifier.  This is
+implemented as part of ovs-vswitchd's OpenFlow implementation.  It
+prevents a reply from being sent across the tunnel on which it
+arrived.  It is questionable whether this OpenFlow feature is useful
+to OVN.  (OVN already has to override it to allow a packet from one
+nested container to be forwarded to a different nested container.)
+OVS makes it possible to disable this feature of OpenFlow by setting
+the OpenFlow input port field to 0.  (If one does this too early, of
+course, it means that there's no way to actually match on the input
+port in the OpenFlow flow tables, but one can work around that by
+instead setting the input port just before the output action, possibly
+wrapping these actions in push/pop pairs to preserve the input port
+for later.)
+
+The second is the OVN logical ingress port, which is implemented in
+ovn-controller as part of the logical abstraction, using an OVS
+register.  Dropping packets directed to the logical ingress port is
+implemented through an OpenFlow table not directly visible to the
+logical flow table.  Currently this behavior can't be disabled, but
+various ways to ensure it could be implemented, e.g. the same as for
+OpenFlow by allowing the logical inport to be zeroed, or by
+introducing a new action that ignores the inport.
+
+** ovn-northd
+
+*** What flows should it generate?
+
+See description in ovn-northd(8).
+
+** New OVN logical actions
+
+*** arp
+
+Generates an ARP packet based on the current IPv4 packet and allows it
+to be processed as part of the current pipeline (and then pop back to
+processing the original IPv4 packet).
+
+TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
+one per second for a given target.  We might need to do this too.
+
+*** icmp4 { action... }
+
+Generates an ICMPv4 packet based on the current IPv4 packet and
+processes it according to each nested action (and then pops back to
+processing the original IPv4 packet).  The intended use case is for
+generating "time exceeded" and "destination unreachable" errors.
+
+ovn-sb.xml includes a tentative specification for this action.
+
+Tentatively, the icmp4 action sets a default icmp_type and icmp_code
+and lets the nested actions override it.  This means that we'd have to
+make icmp_type and icmp_code writable.  Because changing icmp_type and
+icmp_code can change the interpretation of the rest of the data in the
+ICMP packet, we would want to think this through carefully.  If it
+seems like a bad idea then we could instead make the type and code a
+parameter to the action: icmp4(type, code) { action... }
+
+It is worth considering what should be considered the ingress port for
+the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
+to go back out the ingress port.  Maybe the icmp4 action, therefore,
+should clear the inport, so that output to the original inport won't
+be discarded.
+
+*** tcp_reset
+
+Transforms the current TCP packet into a RST reply.
+
+ovn-sb.xml includes a tentative specification for this action.
+
+*** Other actions for IPv6.
+
+IPv6 will probably need an action or actions for ND that is similar to
+the "arp" action, and an action for generating
+
+*** Other actions.
+
+Possibly we'll need to implement "field1 = field2;" for copying
+between fields and "field1 <-> field2;" for swapping fields.
+
+*** ovn-controller translation to OpenFlow
+
+The following two translation strategies come to mind.  Some of the
+new actions we might want to implement one way, some of them the
+other, depending on the details.
+
+*** Implementation strategies
+
+One way to do this is to define new actions as Open vSwitch extensions
+to OpenFlow, emit those actions in ovn-controller, and implement them
+in ovs-vswitchd (possibly pushing the implementations into the Linux
+and DPDK datapaths as well).  This is the only acceptable way for
+actions that need high performance.  None of these actions obviously
+need high performance, but it might be necessary to have fairness in
+handling e.g. a flood of incoming packets that require these actions.
+The main disadvantage of this approach is that it ties ovs-vswitchd
+(and the Linux kernel module) to supporting these actions essentially
+forever, which means that we'd want to make sure that they are
+general-purpose, well designed, maintainable, and supportable.
+
+The other way to do this is to send the packets across an OpenFlow
+channel to ovn-controller and have ovn-controller process them.  This
+is acceptable for actions that don't need high performance, and it
+means that we don't add anything permanently to ovs-vswitchd or the
+kernel (so we can be more casual about the design).  The big
+disadvantage is that it becomes necessary to add a way to resume the
+OpenFlow pipeline when it is interrupted in the middle by sending a
+packet to the controller.  This is not as simple as doing a new flow
+table lookup and resuming from that point.  Instead, it is equivalent
+to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
+Much of this logic can be translated into OpenFlow actions (e.g. the
+call stack and data stack), but some of it is entirely outside
+OpenFlow (e.g. the state of mirrors).  To implement it properly, it
+seems that we'll have to introduce a new Open vSwitch extension to
+OpenFlow, a "send-to-controller" action that causes extra data to be
+sent to the controller, where the extra data packages up the state
+necessary to resume the pipeline.  Maybe the bits of the state that
+can be represented in OpenFlow can be embedded in this extra data in a
+controller-readable form, but other bits we might want to be opaque.
+It's also likely that we'll want to change and extend the form of this
+opaque data over time, so this should be allowed for, e.g. by
+including a nonce in the extra data that is newly generated every time
+ovs-vswitchd starts.
+
+*** OpenFlow action definitions
+
+Define OpenFlow wire structures for each new OpenFlow action and
+implement them in lib/ofp-actions.[ch].
+
+*** OVS implementation
+
+Add code for action translation.  Possibly add datapath code for
+action implementation.  However, none of these new actions should
+require high-bandwidth processing so we could at least start with them
+implemented in userspace only.  (ARP field modification is already
+userspace-only and no one has complained yet.)
+
+** IPv6
+
+*** ND versus ARP
+
+*** IPv6 routing
+
+*** ICMPv6
+
+** IP to MAC binding
+
+Somehow it has to be possible for an L3 logical router to map from an
+IP address to an Ethernet address.  This can happen statically or
+dynamically.  Probably both cases need to be supported eventually.
+
+*** Static IP to MAC binding
+
+Commonly, for a VM, the binding of an IP address to a MAC is known
+statically.  The Logical_Port table in the OVN_Northbound schema can
+be revised to make these bindings known.  Then ovn-northd can
+integrate the bindings into the logical router flow table.
+(ovn-northd can also integrate them into the logical switch flow table
+to terminate ARP requests from VIFs.)
+
+*** Dynamic IP to MAC bindings
+
+Some bindings from IP address to MAC will undoubtedly need to be
+discovered dynamically through ARP requests.  It's straightforward
+enough for a logical L3 router to generate ARP requests and forward
+them to the appropriate switch.
+
+It's more difficult to figure out where the reply should be processed
+and stored.  It might seem at first that a first-cut implementation
+could just keep track of the binding on the hypervisor that needs to
+know, but that can't happen easily because the VM that sends the reply
+might not be on the same HV as the VM that needs the answer (that is,
+the VM that sent the packet that needs the binding to be resolved) and
+there isn't an easy way for it to know which HV needs the answer.
+
+Thus, the HV that processes the ARP reply (which is unknown when the
+ARP is sent) has to tell all the HVs the binding.  The most obvious
+place for this is the OVN_Southbound database.
+
+Details need to be worked out, including:
+
+**** OVN_Southbound schema changes.
+
+Possibly bindings could be added to the Port_Binding table by adding
+or modifying columns.  Another possibility is that another table
+should be added.
+
+**** Logical_Flow representation
+
+It would be really nice to maintain the general-purpose nature of
+logical flows, but these bindings might have to include some
+hard-coded special cases, especially when it comes to the relationship
+with populating the bindings into the OVN_Southbound table.
+
+**** Tracking queries
+
+It's probably best to only record in the database responses to queries
+actually issued by an L3 logical router, so somehow they have to be
+tracked, probably by putting a tentative binding without a MAC address
+into the database.
+
+**** Renewal and expiration.
+
+Something needs to make sure that bindings remain valid and expire
+those that become stale.
+
+*** MTU handling (fragmentation on output)
+
 * ovn-controller
 
 ** ovn-controller parameters and configuration.
@@ -100,4 +362,4 @@ 
 
    Both ovn-controller and ovn-contorller-vtep should use BFD to
    monitor the tunnel liveness.  Both ovs-vswitchd schema and
-   VTEP schema supports BFD.
\ No newline at end of file
+   VTEP schema supports BFD.
diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
index 1655958..9d35d9f 100644
--- a/ovn/northd/ovn-northd.8.xml
+++ b/ovn/northd/ovn-northd.8.xml
@@ -106,10 +106,12 @@ 
       One of the main purposes of <code>ovn-northd</code> is to populate the
       <code>Logical_Flow</code> table in the <code>OVN_Southbound</code>
       database.  This section describes how <code>ovn-northd</code> does this
-      for logical datapaths.
+      for switch and router logical datapaths.
     </p>
 
-    <h2>Ingress Table 0: Admission Control and Ingress Port Security</h2>
+    <h2>Logical Switch Datapaths</h2>
+
+    <h3>Ingress Table 0: Admission Control and Ingress Port Security</h3>
 
     <p>
       Ingress table 0 contains these logical flows:
@@ -137,7 +139,7 @@ 
       be dropped.
     </p>
 
-    <h2>Ingress table 1: <code>from-lport</code> ACLs</h2>
+    <h3>Ingress table 1: <code>from-lport</code> ACLs</h3>
 
     <p>
       Logical flows in this table closely reproduce those in the
@@ -154,7 +156,7 @@ 
       <code>next;</code>, so that ACLs allow packets by default.
     </p>
 
-    <h2>Ingress Table 2: Destination Lookup</h2>
+    <h3>Ingress Table 2: Destination Lookup</h3>
 
     <p>
       This table implements switching behavior.  It contains these logical
@@ -185,13 +187,13 @@ 
       </li>
     </ul>
 
-    <h2>Egress Table 0: <code>to-lport</code> ACLs</h2>
+    <h3>Egress Table 0: <code>to-lport</code> ACLs</h3>
 
     <p>
       This is similar to ingress table 1 except for <code>to-lport</code> ACLs.
     </p>
 
-    <h2>Egress Table 1: Egress Port Security</h2>
+    <h3>Egress Table 1: Egress Port Security</h3>
 
     <p>
       This is similar to the ingress port security logic in ingress table 0,
@@ -206,4 +208,332 @@ 
       disabled logical <code>outport</code> overrides the priority-100 flow
       with a <code>drop;</code> action.
     </p>
+
+    <h2>Logical Router Datapaths</h2>
+
+    <h3>Ingress Table 0: L2 Admission Control</h3>
+
+    <p>
+      This table drops packets that the router shouldn't see at all based on
+      their Ethernet headers.  It contains the following flows, all with
+      priority 100:
+    </p>
+
+    <ul>
+      <li>
+        One flow that matches on <code>eth.dst[40] == 1</code> with action
+        <code>next;</code>.
+      </li>
+
+      <li>
+        For each router port <var>P</var> with Ethernet address <var>E</var>, a
+        flow that matches <code>inport == <var>P</var> &amp;&amp; eth.dst ==
+        <var>E</var></code>, with action <code>next;</code>.
+      </li>
+    </ul>
+
+    <p>
+      Other packets are implicitly dropped.
+    </p>
+
+    <h3>Ingress Table 1: IP Routing</h3>
+
+    <p>
+      This table is the core of the logical router datapath functionality.  It
+      contains the following flows to implement very basic IP host
+      functionality:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          L3 admission control: A priority-220 flow drops packets that match
+          any of the following:
+        </p>
+
+        <ul>
+          <li>
+            <code>ip.src[28..31] == 0xe</code> (multicast source)
+          </li>
+          <li>
+            <code>ip.src == 255.255.255.255</code> (broadcast source)
+          </li>
+          <li>
+            <code>ip.src == 127.0.0.0/8 || ip.dst == 127.0.0.0/8</code>
+            (localhost source or destination)
+          </li>
+          <li>
+            <code>ip.src == 0.0.0.0/8 || ip.dst == 0.0.0.0/8</code> (zero
+            network source or destination)
+          </li>
+          <li>
+            <code>ip.src</code> is any IP address owned by the router.
+          </li>
+          <li>
+            <code>ip.src</code> is the broadcast address of any IP network
+            known to the router.
+          </li>
+        </ul>
+      </li>
+
+      <li>
+        <p>
+          ICMP echo reply.  These flows reply to ICMP echo requests received
+          for the router's IP address.  Let <var>A</var> be an IP address owned
+          by the router or the broadcast address for one of these IP addresses'
+          networks.  Then, for each <var>A</var>, a priority-210 flow matches
+          on <code>ip.dst == <var>A</var></code> and <code>icmp4.type == 8
+          &amp;&amp; icmp4.code == 0</code> (ICMP echo request).  These flows
+          use the following actions where, if <var>A</var> is unicast, then
+          <var>S</var> is <var>A</var>, and if <var>A</var> is broadcast,
+          <var>S</var> is the router's IP address in <var>A</var>'s network:
+        </p>
+
+        <pre>
+ip4.dst = ip4.src;
+ip4.src = <var>S</var>;
+ip4.ttl = 255;
+icmp4.type = 0;
+reg0 = ip4.dst;
+next;
+</pre>
+
+        <p>
+          Similar flows match on <code>ip.dst == 255.255.255.255</code> and
+          each individual <code>inport</code>, and use the same actions in
+          which <var>S</var> is a function of <code>inport</code>.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          ARP reply.  These flows reply to ARP requests for the router's own IP
+          address.  For each router port <var>P</var> that owns IP address
+          <var>A</var> and Ethernet address <var>E</var>, a priority-210 flow
+          matches <code>inport == <var>P</var> &amp;&amp; arp.tpa ==
+          <var>A</var> &amp;&amp; arp.op == 1</code> (ARP request) with the
+          following actions:
+        </p>
+
+        <pre>
+eth.dst = eth.src;
+eth.src = <var>E</var>;
+arp.op = 2; // ARP reply
+arp.tha = arp.sha;
+arp.sha = <var>E</var>;
+arp.tpa = arp.spa;
+arp.spa = <var>A</var>;
+outport = <var>P</var>;
+inport = 0; // allow sending out inport
+output;
+</pre>
+      </li>
+
+      <li>
+        <p>
+          UDP port unreachable.  These flows generate ICMP port unreachable
+          messages in reply to UDP datagrams directed to the router's IP
+          address.  The logical router doesn't accept any UDP traffic so it
+          always generates such a reply.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          TCP reset.  These flows generate TCP reset messages in reply to TCP
+          segments directed to the router's IP address.  The logical router
+          doesn't accept any TCP traffic so it always generates such a reply.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Protocol unreachable.  These flows generate ICMP protocol unreachable
+          messages in reply to packets directed to the router's IP address on
+          IP protocols other than UDP, TCP, and ICMP.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        Drop other IP traffic to this router.  These flows drop any other
+        traffic destined to an IP address of this router that is not already
+        handled by one of the flows above.  For each IP address <var>A</var>
+        owned by the router, a priority-200 flow matches <code>ip.dst ==
+        <var>A</var></code> and drops the traffic.
+      </li>
+    </ul>
+
+    <p>
+      The flows above handle all of the traffic that might be directed to the
+      router itself.  The following flows (with lower priorities) handle the
+      remaining traffic, potentially for forwarding:
+    </p>
+
+    <ul>
+      <li>
+        Ethernet local broadcast.  A priority-190 flow with match <code>eth.dst
+        == ff:ff:ff:ff:ff:ff</code> drops traffic destined to the local
+        Ethernet broadcast address.  By definition this traffic should not be
+        forwarded.
+      </li>
+
+      <li>
+        Drop IP multicast.  A priority-190 flow with match <code>ip.dst[28..31]
+        == 0xe</code> drops IP multicast traffic.
+      </li>
+
+      <li>
+        <p>
+          TTL check.  For each router port <var>P</var>, whose IP address is
+          <var>A</var>, a priority-180 flow with match <code>inport ==
+          <var>P</var> &amp;&amp; ip.ttl &lt; 2 &amp;&amp;
+          !ip.later_frag</code> matches packets whose TTL has expired, with the
+          following actions to send an ICMP time exceeded reply:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 11; // Time exceeded
+    icmp4.code = 0;  // TTL exceeded in transit
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip4.ttl = 255;
+    reg0 = ip4.dst;
+    next;
+};
+</pre>
+      </li>
+
+      <li>
+        <p>
+          Routing table.  For each route to IPv4 network <var>N</var> with
+          netmask <var>M</var>, a logical flow with match <code>ip.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+ip4.ttl--;
+reg0 = <var>G</var>;
+next;
+</pre>
+
+        <p>
+          If the route has a gateway, <var>G</var> is the gateway IP address,
+          otherwise it is <code>ip.dst</code>.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Destination unreachable.  For each router port <var>P</var>, which
+          owns IP address <var>A</var>, a priority-0 logical flow with match
+          <code>inport == <var>P</var> &amp;&amp; !ip.later_frag</code> has
+          the following actions:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 3; // Destination unreachable
+    icmp4.code = 0; // Network unreachable
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip4.ttl = 255;
+    reg0 = ip4.dst;
+    next;
+};
+</pre>
+
+        <p>
+          These flows are omitted if the logical router has a default route,
+          that is, a route with netmask 0.0.0.0.
+        </p>
+      </li>
+    </ul>
+
+    <h3>Ingress Table 2: ARP Resolution</h3>
+
+    <p>
+      Any packet that reaches this table is an IP packet whose next-hop IP
+      address is in <code>reg0</code>.  (<code>ip.dst</code> is the final
+      destination.)  This table resolves the IP address in <code>reg0</code>
+      into an Ethernet address in <code>eth.dst</code>, using the following
+      flows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Known MAC bindings.  For each IP address <var>A</var> whose host is
+          known to have Ethernet address <var>E</var> and reside on router port
+          <var>P</var>, a priority-200 flow with match <code>reg0 ==
+          <var>A</var></code> has the following actions:
+        </p>
+
+        <pre>
+eth.dst = <var>E</var>;
+outport = <var>P</var>;
+output;
+</pre>
+      </li>
+
+      <li>
+        <p>
+          Unknown MAC bindings.  For each non-gateway route to IPv4 network
+          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
+          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
+          a logical flow with match <code>ip.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+ratelimit;
+arp {
+    eth.dst = ff:ff:ff:ff:ff:ff;
+    eth.src = <var>E</var>;
+    arp.sha = <var>E</var>;
+    arp.tha = 00:00:00:00:00:00;
+    arp.spa = <var>A</var>;
+    arp.tpa = ip.dst;
+    outport = <var>P</var>;
+    output;
+};
+</pre>
+
+        <p>
+          TBD: How to install MAC bindings when an ARP response comes back.
+          (Implement a "learn" action?)
+        </p>
+      </li>
+    </ul>
+
+    <h3>Egress Table 0: Delivery</h3>
+
+    <p>
+      Packets that reach this table are ready for delivery.  It contains a
+      single priority-0 logical flow that matches all packets with actions
+      <code>output;</code>.
+    </p>
+
 </manpage>
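
The routing-table priorities described for Ingress Table 1 can be sanity-checked with a small model.  The Python sketch below is illustrative only and is not ovn-northd code: route_flow and lookup are hypothetical helper names, and the match and action strings simply mirror the text above.  It is meant to show that using the number of 1-bits in the netmask as the flow priority yields longest-prefix matching.

```python
import ipaddress


def route_flow(network, gateway=None):
    """Build a (priority, match, actions) tuple for one IPv4 route.

    Hypothetical helper: the strings mirror the flow description in the
    manpage text; they are not produced by any real OVN API.
    """
    net = ipaddress.ip_network(network)
    # Priority is the number of 1-bits in the netmask (the prefix length).
    priority = bin(int(net.netmask)).count("1")
    # Without a gateway, the next hop is the packet's own destination.
    next_hop = gateway if gateway is not None else "ip.dst"
    match = f"ip.dst == {net.network_address}/{net.netmask}"
    actions = f"ip4.ttl--; reg0 = {next_hop}; next;"
    return priority, match, actions


def lookup(routes, dst):
    """Return the next hop picked by the highest-priority matching route.

    routes is a list of (network, gateway_or_None) pairs.  Returns None
    when no route matches.
    """
    addr = ipaddress.ip_address(dst)
    best = None
    for network, gateway in routes:
        if addr in ipaddress.ip_network(network):
            priority = route_flow(network, gateway)[0]
            if best is None or priority > best[0]:
                best = (priority, gateway if gateway is not None else dst)
    return None if best is None else best[1]
```

In this model a default route (0.0.0.0/0) gets priority 0, the same priority as the destination-unreachable flows, which is consistent with omitting those flows when a default route exists.  All route priorities fall in the range 0 to 32, below the priority-180 TTL check.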
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index 47dfc2a..a7ff674 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -596,7 +596,7 @@ 
     </li>
   </ol>
 
-  <h2>Life Cycle of a Packet</h2>
+  <h2>Architectural Life Cycle of a Packet</h2>
 
   <p>
     This section describes how a packet travels from one virtual machine or
diff --git a/ovn/ovn-sb.xml b/ovn/ovn-sb.xml
index c1932ad..0aaf7ca 100644
--- a/ovn/ovn-sb.xml
+++ b/ovn/ovn-sb.xml
@@ -240,12 +240,12 @@ 
       The default action when no flow matches is to drop packets.
     </p>
 
-    <p><em>Logical Life Cycle of a Packet</em></p>
+    <p><em>Architectural Logical Life Cycle of a Packet</em></p>
 
     <p>
       This following description focuses on the life cycle of a packet through
       a logical datapath, ignoring physical details of the implementation.
-      Please refer to <em>Life Cycle of a Packet</em> in
+      Please refer to <em>Architectural Life Cycle of a Packet</em> in
       <code>ovn-architecture</code>(7) for the physical information.
     </p>
 
@@ -810,24 +810,109 @@ 
       <dl>
         <dt><code><var>field1</var> = <var>field2</var>;</code></dt>
         <dd>
-          Extends the assignment action to allow copying between fields.
+          <p>
+            Extends the assignment action to allow copying between fields.
+          </p>
+
+          <p>
+            An assignment adds prerequisites from the source and the
+            destination fields.
+          </p>
+        </dd>
+
+        <dt><code>ip4.ttl--;</code></dt>
+        <dd>
+          <p>
+            Decrements the IPv4 TTL.  If this would make the TTL zero or
+            negative, then processing of the packet halts; no further actions
+            are processed.  (To properly handle such cases, a higher-priority
+            flow should match on <code>ip.ttl &lt; 2</code>.)
+          </p>
+
+          <p><b>Prerequisite:</b> <code>ip4</code></p>
         </dd>
 
-        <dt><code>learn</code></dt>
+        <dt><code>arp { <var>action</var>; </code>...<code> };</code></dt>
+        <dd>
+          <p>
+            Temporarily replaces the IPv4 packet being processed by an ARP
+            packet and executes each nested <var>action</var> on the ARP
+            packet.  Actions following the <code>arp</code> action, if any, apply
+            to the original, unmodified packet.
+          </p>
 
-        <dt><code>conntrack</code></dt>
+          <p>
+            The ARP packet that this action operates on is initialized based on
+            the IPv4 packet being processed, as follows:
+          </p>
+
+          <ul>
+            <li><code>eth.src</code> unchanged</li>
+            <li><code>eth.dst</code> unchanged</li>
+            <li><code>eth.type = 0x0806</code></li>
+            <li><code>arp.op = 1</code> (ARP request)</li>
+            <li><code>arp.sha</code> copied from <code>eth.src</code></li>
+            <li><code>arp.spa</code> copied from <code>ip4.src</code></li>
+            <li><code>arp.tha = 00:00:00:00:00:00</code></li>
+            <li><code>arp.tpa</code> copied from <code>ip4.dst</code></li>
+          </ul>
+
+          <p><b>Prerequisite:</b> <code>ip4</code></p>
+        </dd>
 
-        <dt><code>dec_ttl { <var>action</var>, </code>...<code> } { <var>action</var>; </code>...<code>};</code></dt>
+        <dt><code>icmp4 { <var>action</var>; </code>...<code> };</code></dt>
         <dd>
-          decrement TTL; execute first set of actions if
-          successful, second set if TTL decrement fails
+          <p>
+            Temporarily replaces the IPv4 packet being processed by an ICMPv4
+            packet and executes each nested <var>action</var> on the ICMPv4
+            packet.  Actions following the <code>icmp4</code> action, if any,
+            apply to the original, unmodified packet.
+          </p>
+
+          <p>
+            The ICMPv4 packet that this action operates on is initialized based
+            on the IPv4 packet being processed, as follows.  Ethernet and IPv4
+            fields not listed here are not changed:
+          </p>
+
+          <ul>
+            <li><code>ip.proto = 1</code> (ICMPv4)</li>
+            <li><code>ip.frag = 0</code> (not a fragment)</li>
+            <li><code>icmp4.type = 3</code> (destination unreachable)</li>
+            <li><code>icmp4.code = 1</code> (host unreachable)</li>
+          </ul>
+
+          <p>
+            XXX need to explain exactly how the ICMP packet is constructed
+          </p>
+
+          <p><b>Prerequisite:</b> <code>ip4</code></p>
         </dd>
 
-        <dt><code>icmp_reply { <var>action</var>, </code>...<code> };</code></dt>
-        <dd>generate ICMP reply from packet, execute <var>action</var>s</dd>
+        <dt><code>tcp_reset;</code></dt>
+        <dd>
+          <p>
+            This action transforms the current TCP packet according to the
+            following pseudocode:
+          </p>
+
+          <pre>
+if (tcp.ack) {
+        tcp.seq = tcp.ack;
+} else {
+        tcp.ack = tcp.seq + length(tcp.payload);
+        tcp.seq = 0;
+}
+tcp.flags = RST;
+</pre>
 
-        <dt><code>arp { <var>action</var>, </code>...<code> }</code></dt>
-        <dd>generate ARP from packet, execute <var>action</var>s</dd>
+          <p>
+            Then, the action drops all TCP options and payload data, and
+            updates the TCP checksum.
+          </p>
+
+          <p><b>Prerequisite:</b> <code>tcp</code></p>
+        </dd>
       </dl>
     </column>