[ovs-dev,06/23] ovn: Update TODO, ovn-northd flow table design, ovn-architecture for L3.

Message ID 1444450544-11845-7-git-send-email-blp@nicira.com
State Accepted
Commit Message

Ben Pfaff Oct. 10, 2015, 4:15 a.m. UTC
This is a proposed plan for logical L3 in OVN.  It is not entirely
complete but it includes many important details and I believe that it moves
planning forward.

Signed-off-by: Ben Pfaff <blp@nicira.com>
---
 ovn/TODO                    | 269 +++++++++++++++++++++++++++++++
 ovn/northd/ovn-northd.8.xml | 376 +++++++++++++++++++++++++++++++++++++++++++-
 ovn/ovn-architecture.7.xml  |   2 +-
 ovn/ovn-sb.xml              | 102 ++++++++++--
 4 files changed, 731 insertions(+), 18 deletions(-)

Comments

Justin Pettit Oct. 16, 2015, 9:46 p.m. UTC | #1
> On Oct 9, 2015, at 9:15 PM, Ben Pfaff <blp@nicira.com> wrote:
> 
> +*** Allow output to ingress port
> +
> +Sometimes when a packet ingresses into a router, it has to egress the
> +same port.  One example is a "one-armed" router that has multiple
> +routes on a single port (or in which a host is (mis)configured to send
> +every IP packet to the router, e.g. due to a bad netmask).  Another is
> +when a router needs to send an ICMP reply to a ingressing packet.

s/a/an/

> +        <p>
> +          ICMP echo reply.  These flows reply to ICMP echo requests received
> +          for the router's IP address.  Let <var>A</var> be an IP address or
> +          broadcast address owned by a router port.  Then, for each
> +          <var>A</var>, a priority-210 flow matches on <code>ip4.dst ==
> +          <var>A</var></code> and <code>icmp4.type == 8 &amp;&amp; icmp4.code
> +          == 0</code> (ICMP echo request).  These flows use the following
> +          actions where, if <var>A</var> is unicast, then <var>S</var> is
> +          <var>A</var>, and if <var>A</var> is broadcast, <var>S</var> is the
> +          router's IP address in <var>A</var>'s network:
> +        </p>

I don't believe this is actually implemented in patch 23.  It might be nice to put a bolded "future" or something in the descriptions of things that aren't yet implemented.

> +        <p>
> +          UDP port unreachable.  These flows generate ICMP port unreachable
> +          messages in reply to UDP datagrams directed to the router's IP
> +          address.  The logical router doesn't accept any UDP traffic so it
> +          always generates such a reply.
> +        </p>
> ...
> +          TCP reset.  These flows generate TCP reset messages in reply to TCP
> +          datagrams directed to the router's IP address.  The logical router
> +          doesn't accept any TCP traffic so it always generates such a reply.
> +        </p>
> ...
> +        <p>
> +          Protocol unreachable.  These flows generate ICMP protocol unreachable
> +          messages in reply to packets directed to the router's IP address on
> +          IP protocols other than UDP, TCP, and ICMP.
> +        </p>

Did you want to specify a priority for these flows?  The ping and ARP processing have priority-210 and the drop everything else shows priority-200, so it may be worth throwing in there.

> +        Drop IP multicast.  A priority-190 flow with match
> +        <code>ip4.dst[28..31] == 0xe</code> drops IP multicast traffic.

Do you want to use "ip.mcast"?

> +          ICMP time exceeded.  For each router port <var>P</var>, whose IP
> +          address is <var>A</var>, a priority-180 flow with match <code>inport
> +          == <var>P</var> &amp;&amp; ip4.ttl == {0, 1} &amp;&amp;
> +          !ip.later_frag</code> matches packets whose TTL has expired, with the
> +          following actions to send an ICMP time exceeded reply:

The "ICMP time exceeded" and "TTL discard" flows match on an exceeded TTL differently.  It might be nice to use the same match in both. 

> +        TTL discard.  A priority-170 flow with match <code>ip4.ttl &lt;
> +        2</code> and actions <code>drop;</code> drops other packets whose TTL
> +        has expired, that should not receive a ICMP error reply.

It might be nice to explain that this handles fragments.

> +    <h3>Ingress Table 2: IP Routing</h3>

Is there anything in the Table 1 description that indicates the common case of moving to this table?

> +          Unknown MAC bindings.  For each non-gateway route to IPv4 network
> +          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
> +          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
> +          a logical flow with match <code>ip4.dst ==
> +          <var>N</var>/<var>M</var></code>, whose priority is the number of
> +          1-bits in <var>M</var>, has the following actions:
> ...
> +arp {
> +    eth.dst = ff:ff:ff:ff:ff:ff;
> +    eth.src = <var>E</var>;

I think you may want to set "eth.type".

> diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
> index 47dfc2a..a7ff674 100644
> --- a/ovn/ovn-architecture.7.xml
> +++ b/ovn/ovn-architecture.7.xml
> @@ -596,7 +596,7 @@
>     </li>
>   </ol>
> 
> -  <h2>Life Cycle of a Packet</h2>
> +  <h2>Architectural Life Cycle of a Packet</h2>

There's the very similar sounding "Architectural Logical Life Cycle of a Packet" in ovn-sb, which I think could be confusing.  What do you think about throwing a "Physical" in there?

>       This following description focuses on the life cycle of a packet through
>       a logical datapath, ignoring physical details of the implementation.
> -      Please refer to <em>Life Cycle of a Packet</em> in
> +      Please refer to <em>Architectural Life Cycle of a Packet</em> in

If you change it, this would need to be updated, too, of course.

> +        <dt><code>ip.ttl--;</code></dt>
> +        <dd>
> +          <p>
> +            Decrements the IPv4 or IPv6 TTL.  If this would make the TTL zero
> +            or negative, then processing of the packet halts; no further
> +            actions are processed.  (To properly handle such cases, a
> +            higher-priority flow should match on <code>ip.ttl &lt; 2</code>.)
> +          </p>

This is put in the "to be implemented later" category of actions, but I believe "ip.ttl--" got implemented a few patches ago.

> +        <dt><code>icmp4 { <var>action</var>; </code>...<code> };</code></dt>
> ...
> +          <p>
> +            XXX need to explain exactly how the ICMP packet is constructed
> +          </p>

I don't understand what this means.

Acked-by: Justin Pettit <jpettit@nicira.com>

--Justin

Ben Pfaff Oct. 17, 2015, 3:15 a.m. UTC | #2
On Fri, Oct 16, 2015 at 02:46:55PM -0700, Justin Pettit wrote:
> 
> > On Oct 9, 2015, at 9:15 PM, Ben Pfaff <blp@nicira.com> wrote:
> > 
> > +*** Allow output to ingress port
> > +
> > +Sometimes when a packet ingresses into a router, it has to egress the
> > +same port.  One example is a "one-armed" router that has multiple
> > +routes on a single port (or in which a host is (mis)configured to send
> > +every IP packet to the router, e.g. due to a bad netmask).  Another is
> > +when a router needs to send an ICMP reply to a ingressing packet.
> 
> s/a/an/

Fixed, thanks.

> > +        <p>
> > +          ICMP echo reply.  These flows reply to ICMP echo requests received
> > +          for the router's IP address.  Let <var>A</var> be an IP address or
> > +          broadcast address owned by a router port.  Then, for each
> > +          <var>A</var>, a priority-210 flow matches on <code>ip4.dst ==
> > +          <var>A</var></code> and <code>icmp4.type == 8 &amp;&amp; icmp4.code
> > +          == 0</code> (ICMP echo request).  These flows use the following
> > +          actions where, if <var>A</var> is unicast, then <var>S</var> is
> > +          <var>A</var>, and if <var>A</var> is broadcast, <var>S</var> is the
> > +          router's IP address in <var>A</var>'s network:
> > +        </p>
> 
> I don't believe this is actually implemented in patch 23.  It might be
> nice to put a bolded "future" or something in the descriptions of
> things that aren't yet implemented.

OK, I added "Not yet implemented." to a bunch of items.

> > +        <p>
> > +          UDP port unreachable.  These flows generate ICMP port unreachable
> > +          messages in reply to UDP datagrams directed to the router's IP
> > +          address.  The logical router doesn't accept any UDP traffic so it
> > +          always generates such a reply.
> > +        </p>
> > ...
> > +          TCP reset.  These flows generate TCP reset messages in reply to TCP
> > +          datagrams directed to the router's IP address.  The logical router
> > +          doesn't accept any TCP traffic so it always generates such a reply.
> > +        </p>
> > ...
> > +        <p>
> > +          Protocol unreachable.  These flows generate ICMP protocol unreachable
> > +          messages in reply to packets directed to the router's IP address on
> > +          IP protocols other than UDP, TCP, and ICMP.
> > +        </p>
> 
> Did you want to specify a priority for these flows?  The ping and ARP
> processing have priority-210 and the drop everything else shows
> priority-200, so it may be worth throwing in there.

I added a priority.

> > +        Drop IP multicast.  A priority-190 flow with match
> > +        <code>ip4.dst[28..31] == 0xe</code> drops IP multicast traffic.
> 
> Do you want to use "ip.mcast"?

Yes, thanks.
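
For illustration only (not part of the patch): the bit-slice match quoted above, ip4.dst[28..31] == 0xe, tests the top four bits of the 32-bit destination address, which is the same as membership in 224.0.0.0/4. The sketch below checks that with Python's standard ipaddress module. The function names are made up for this example and assume OVN's convention that bit 0 is the least significant bit.

```python
import ipaddress

# 224.0.0.0/4 is the IPv4 multicast block.
MCAST_NET = ipaddress.ip_network("224.0.0.0/4")


def top_nibble_is_0xe(addr: str) -> bool:
    """Mimic the match ip4.dst[28..31] == 0xe on a dotted-quad address."""
    value = int(ipaddress.IPv4Address(addr))
    return (value >> 28) & 0xF == 0xE


def in_multicast_block(addr: str) -> bool:
    """Check membership in 224.0.0.0/4 directly."""
    return ipaddress.IPv4Address(addr) in MCAST_NET


if __name__ == "__main__":
    for sample in ("224.0.0.1", "239.255.255.255", "223.255.255.255",
                   "240.0.0.0", "10.0.0.1"):
        print(sample, top_nibble_is_0xe(sample), in_multicast_block(sample))
```

Both predicates agree on every sample, so a symbolic multicast match and the explicit bit-slice should select the same packets, provided the symbol expands to that bit-slice.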

> > +          ICMP time exceeded.  For each router port <var>P</var>, whose IP
> > +          address is <var>A</var>, a priority-180 flow with match <code>inport
> > +          == <var>P</var> &amp;&amp; ip4.ttl == {0, 1} &amp;&amp;
> > +          !ip.later_frag</code> matches packets whose TTL has expired, with the
> > +          following actions to send an ICMP time exceeded reply:
> 
> The "ICMP time exceeded" and "TTL discard" flows match on an exceeded
> TTL differently.  It might be nice to use the same match in both.

I changed them both to say "ip4.ttl == {0, 1}".

> > +        TTL discard.  A priority-170 flow with match <code>ip4.ttl &lt;
> > +        2</code> and actions <code>drop;</code> drops other packets whose TTL
> > +        has expired, that should not receive a ICMP error reply.
> 
> It might be nice to explain that this handles fragments.

Done.
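
A small sketch of how the two TTL flows discussed above fit together, for illustration only. It assumes ip4.ttl is an unsigned 8-bit field, so "ip4.ttl < 2" and "ip4.ttl == {0, 1}" select the same packets, and it models only the TTL and fragment parts of the matches (the per-port "inport == P" term is left out). The function names and result strings are hypothetical.

```python
def ttl_expired_lt(ttl: int) -> bool:
    """The original 'TTL discard' wording: ip4.ttl < 2."""
    return ttl < 2


def ttl_expired_set(ttl: int) -> bool:
    """The 'ICMP time exceeded' wording: ip4.ttl == {0, 1}."""
    return ttl in {0, 1}


def classify(ttl: int, later_frag: bool) -> str:
    """Pick the outcome the higher-priority matching flow would give."""
    # Priority 180: expired TTL on a packet that is not a later fragment.
    if ttl_expired_set(ttl) and not later_frag:
        return "icmp4-time-exceeded"
    # Priority 170: remaining expired packets, i.e. later fragments.
    if ttl_expired_set(ttl):
        return "drop"
    # Neither TTL flow matches; lower-priority flows handle the packet.
    return "continue"


if __name__ == "__main__":
    # The two spellings agree for every possible 8-bit TTL value.
    print(all(ttl_expired_lt(t) == ttl_expired_set(t) for t in range(256)))
    print(classify(1, later_frag=False))
    print(classify(1, later_frag=True))
    print(classify(64, later_frag=False))
```

The point of the second flow is visible in classify(): only later fragments with an expired TTL reach the priority-170 drop, because everything else with an expired TTL is already caught at priority 180.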

> > +    <h3>Ingress Table 2: IP Routing</h3>
> 
> Is there anything in the Table 1 description that indicates the common
> case of moving to this table?

No, this was missing.  I added a description of a priority-0 flow.

> > +          Unknown MAC bindings.  For each non-gateway route to IPv4 network
> > +          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
> > +          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
> > +          a logical flow with match <code>ip4.dst ==
> > +          <var>N</var>/<var>M</var></code>, whose priority is the number of
> > +          1-bits in <var>M</var>, has the following actions:
> > ...
> > +arp {
> > +    eth.dst = ff:ff:ff:ff:ff:ff;
> > +    eth.src = <var>E</var>;
> 
> I think you may want to set "eth.type".

No, eth.type isn't modifiable.  arp { ... } always uses ARP Ethertype.
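
On the quoted flow description itself: setting each flow's priority to the number of 1-bits in M means that, among all routes matching a destination, the highest-priority flow is the one with the longest prefix. The sketch below illustrates that with a made-up route list. The helper names, the tuple layout, and the port names are assumptions for this example, not anything ovn-northd defines.

```python
import ipaddress


def route_priority(netmask: str) -> int:
    """Priority = number of 1-bits in the netmask (the prefix length)."""
    return bin(int(ipaddress.IPv4Address(netmask))).count("1")


def pick_route(dst: str, routes):
    """Return the port of the highest-priority matching route, or None.

    routes is a list of (network, netmask, port) tuples.
    """
    dst_ip = int(ipaddress.IPv4Address(dst))
    best = None  # (priority, port) of the best match so far
    for network, netmask, port in routes:
        mask = int(ipaddress.IPv4Address(netmask))
        net = int(ipaddress.IPv4Address(network))
        if dst_ip & mask == net & mask:
            prio = route_priority(netmask)
            if best is None or prio > best[0]:
                best = (prio, port)
    return None if best is None else best[1]


if __name__ == "__main__":
    routes = [
        ("10.0.0.0", "255.0.0.0", "port-a"),    # priority 8
        ("10.1.0.0", "255.255.0.0", "port-b"),  # priority 16
    ]
    print(pick_route("10.1.2.3", routes))     # more specific route wins
    print(pick_route("10.2.0.1", routes))     # only the /8 matches
    print(pick_route("192.168.0.1", routes))  # no route matches
```

This works because contiguous netmasks with more 1-bits are strictly more specific, so no separate longest-prefix-match step is needed in the logical flow table.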

> > diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
> > index 47dfc2a..a7ff674 100644
> > --- a/ovn/ovn-architecture.7.xml
> > +++ b/ovn/ovn-architecture.7.xml
> > @@ -596,7 +596,7 @@
> >     </li>
> >   </ol>
> > 
> > -  <h2>Life Cycle of a Packet</h2>
> > +  <h2>Architectural Life Cycle of a Packet</h2>
> 
> There's the very similar sounding "Architectural Logical Life Cycle of
> a Packet" in ovn-sb, which I think could be confusing.  What do you
> think about throwing a "Physical" in there?

Changed, thanks.

> >       This following description focuses on the life cycle of a packet through
> >       a logical datapath, ignoring physical details of the implementation.
> > -      Please refer to <em>Life Cycle of a Packet</em> in
> > +      Please refer to <em>Architectural Life Cycle of a Packet</em> in
> 
> If you change it, this would need to be updated, too, of course.

Done.

> > +        <dt><code>ip.ttl--;</code></dt>
> > +        <dd>
> > +          <p>
> > +            Decrements the IPv4 or IPv6 TTL.  If this would make the TTL zero
> > +            or negative, then processing of the packet halts; no further
> > +            actions are processed.  (To properly handle such cases, a
> > +            higher-priority flow should match on <code>ip.ttl &lt; 2</code>.)
> > +          </p>
> 
> > This is put in the "to be implemented later" category of actions, but I
> believe "ip.ttl--" got implemented a few patches ago.

You're right.  I mistakenly omitted the documentation.  I broke the
documentation of this action into a separate commit and pushed it.

> > +        <dt><code>icmp4 { <var>action</var>; </code>...<code> };</code></dt>
> > ...
> > +          <p>
> > +            XXX need to explain exactly how the ICMP packet is constructed
> > +          </p>
> 
> I don't understand what this means.

I changed it to say "Details TBD."

> Acked-by: Justin Pettit <jpettit@nicira.com>

Thanks for the detailed review.  I applied this to master.

Patch

diff --git a/ovn/TODO b/ovn/TODO
index a48251f..b29df75 100644
--- a/ovn/TODO
+++ b/ovn/TODO
@@ -1,3 +1,272 @@ 
+-*- outline -*-
+
+* L3 support
+
+** OVN_Northbound schema
+
+*** Needs to support interconnected routers
+
+It should be possible to connect one router to another, e.g. to
+represent a provider/tenant router relationship.  This requires
+an OVN_Northbound schema change.
+
+*** Needs to support extra routes
+
+Currently a router port has a single route associated with it, but
+presumably we should support multiple routes.  For connections from
+one router to another, this doesn't seem to matter (just put more than
+one connection between them), but for connections between a router and
+a switch it might matter because a switch has only one router port.
+
+** OVN_SB schema
+
+*** Logical datapath interconnection
+
+There needs to be a way in the OVN_Southbound database to express
+connections between logical datapaths, so that packets can pass from a
+logical switch to its logical router (and vice versa) and from one
+logical router to another.
+
+One way to do this would be to introduce logical patch ports, closely
+modeled on the "physical" patch ports that OVS has had for ages.  Each
+logical patch port would consist of two rows in the Port_Binding table
+(one in each logical datapath), with type "patch" and an option "peer"
+that names the other logical port in the pair.
+
+If we do it this way then we'll need to figure out one odd special
+case.  Currently the ACL table documents that the logical router port
+is always named "ROUTER".  This can't be done directly with this patch
+port technique, because every row in the Logical_Port table must have
+a unique name.  This probably means that we should change the
+convention for the ACL table so that the logical router port name is
+unique; for example, we could change the Logical_Router_Port table to
+require the 'name' column to be unique, and then use that name in the
+ACL table.
+
+*** Allow output to ingress port
+
+Sometimes when a packet ingresses into a router, it has to egress the
+same port.  One example is a "one-armed" router that has multiple
+routes on a single port (or in which a host is (mis)configured to send
+every IP packet to the router, e.g. due to a bad netmask).  Another is
+when a router needs to send an ICMP reply to a ingressing packet.
+
+To some degree this problem is layered, because there are two
+different notions of "ingress port".  The first is the OpenFlow
+ingress port, essentially a physical port identifier.  This is
+implemented as part of ovs-vswitchd's OpenFlow implementation.  It
+prevents a reply from being sent across the tunnel on which it
+arrived.  It is questionable whether this OpenFlow feature is useful
+to OVN.  (OVN already has to override it to allow a packet from one
+nested container to be forwarded to a different nested container.)
+OVS makes it possible to disable this feature of OpenFlow by setting
+the OpenFlow input port field to 0.  (If one does this too early, of
+course, it means that there's no way to actually match on the input
+port in the OpenFlow flow tables, but one can work around that by
+instead setting the input port just before the output action, possibly
+wrapping these actions in push/pop pairs to preserve the input port
+for later.)
+
+The second is the OVN logical ingress port, which is implemented in
+ovn-controller as part of the logical abstraction, using an OVS
+register.  Dropping packets directed to the logical ingress port is
+implemented through an OpenFlow table not directly visible to the
+logical flow table.  Currently this behavior can't be disabled, but
+various ways to ensure it could be implemented, e.g. the same as for
+OpenFlow by allowing the logical inport to be zeroed, or by
+introducing a new action that ignores the inport.
+
+** ovn-northd
+
+*** What flows should it generate?
+
+See description in ovn-northd(8).
+
+** New OVN logical actions
+
+*** arp
+
+Generates an ARP packet based on the current IPv4 packet and allows it
+to be processed as part of the current pipeline (and then pop back to
+processing the original IPv4 packet).
+
+TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
+one per second for a given target.  We might need to do this too.
+
+We probably need to buffer the packet that generated the ARP.  I don't
+know where to do that.
+
+*** icmp4 { action... }
+
+Generates an ICMPv4 packet based on the current IPv4 packet and
+processes it according to each nested action (and then pops back to
+processing the original IPv4 packet).  The intended use case is for
+generating "time exceeded" and "destination unreachable" errors.
+
+ovn-sb.xml includes a tentative specification for this action.
+
+Tentatively, the icmp4 action sets a default icmp_type and icmp_code
+and lets the nested actions override it.  This means that we'd have to
+make icmp_type and icmp_code writable.  Because changing icmp_type and
+icmp_code can change the interpretation of the rest of the data in the
+ICMP packet, we would want to think this through carefully.  If it
+seems like a bad idea then we could instead make the type and code a
+parameter to the action: icmp4(type, code) { action... }
+
+It is worth considering what should be considered the ingress port for
+the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
+to go back out the ingress port.  Maybe the icmp4 action, therefore,
+should clear the inport, so that output to the original inport won't
+be discarded.
+
+*** tcp_reset
+
+Transforms the current TCP packet into a RST reply.
+
+ovn-sb.xml includes a tentative specification for this action.
+
+*** Other actions for IPv6.
+
+IPv6 will probably need an action or actions for ND that is similar to
+the "arp" action, and an action for generating
+
+*** ovn-controller translation to OpenFlow
+
+The following two translation strategies come to mind.  Some of the
+new actions we might want to implement one way, some of them the
+other, depending on the details.
+
+*** Implementation strategies
+
+One way to do this is to define new actions as Open vSwitch extensions
+to OpenFlow, emit those actions in ovn-controller, and implement them
+in ovs-vswitchd (possibly pushing the implementations into the Linux
+and DPDK datapaths as well).  This is the only acceptable way for
+actions that need high performance.  None of these actions obviously
+need high performance, but it might be necessary to have fairness in
+handling e.g. a flood of incoming packets that require these actions.
+The main disadvantage of this approach is that it ties ovs-vswitchd
+(and the Linux kernel module) to supporting these actions essentially
+forever, which means that we'd want to make sure that they are
+general-purpose, well designed, maintainable, and supportable.
+
+The other way to do this is to send the packets across an OpenFlow
+channel to ovn-controller and have ovn-controller process them.  This
+is acceptable for actions that don't need high performance, and it
+means that we don't add anything permanently to ovs-vswitchd or the
+kernel (so we can be more casual about the design).  The big
+disadvantage is that it becomes necessary to add a way to resume the
+OpenFlow pipeline when it is interrupted in the middle by sending a
+packet to the controller.  This is not as simple as doing a new flow
+table lookup and resuming from that point.  Instead, it is equivalent
+to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
+Much of this logic can be translated into OpenFlow actions (e.g. the
+call stack and data stack), but some of it is entirely outside
+OpenFlow (e.g. the state of mirrors).  To implement it properly, it
+seems that we'll have to introduce a new Open vSwitch extension to
+OpenFlow, a "send-to-controller" action that causes extra data to be
+sent to the controller, where the extra data packages up the state
+necessary to resume the pipeline.  Maybe the bits of the state that
+can be represented in OpenFlow can be embedded in this extra data in a
+controller-readable form, but other bits we might want to be opaque.
+It's also likely that we'll want to change and extend the form of this
+opaque data over time, so this should be allowed for, e.g. by
+including a nonce in the extra data that is newly generated every time
+ovs-vswitchd starts.
+
+*** OpenFlow action definitions
+
+Define OpenFlow wire structures for each new OpenFlow action and
+implement them in lib/ofp-actions.[ch].
+
+*** OVS implementation
+
+Add code for action translation.  Possibly add datapath code for
+action implementation.  However, none of these new actions should
+require high-bandwidth processing so we could at least start with them
+implemented in userspace only.  (ARP field modification is already
+userspace-only and no one has complained yet.)
+
+** IPv6
+
+*** ND versus ARP
+
+*** IPv6 routing
+
+*** ICMPv6
+
+** IP to MAC binding
+
+Somehow it has to be possible for an L3 logical router to map from an
+IP address to an Ethernet address.  This can happen statically or
+dynamically.  Probably both cases need to be supported eventually.
+
+*** Static IP to MAC binding
+
+Commonly, for a VM, the binding of an IP address to a MAC is known
+statically.  The Logical_Port table in the OVN_Northbound schema can
+be revised to make these bindings known.  Then ovn-northd can
+integrate the bindings into the logical router flow table.
+(ovn-northd can also integrate them into the logical switch flow table
+to terminate ARP requests from VIFs.)
+
+*** Dynamic IP to MAC bindings
+
+Some bindings from IP address to MAC will undoubtedly need to be
+discovered dynamically through ARP requests.  It's straightforward
+enough for a logical L3 router to generate ARP requests and forward
+them to the appropriate switch.
+
+It's more difficult to figure out where the reply should be processed
+and stored.  It might seem at first that a first-cut implementation
+could just keep track of the binding on the hypervisor that needs to
+know, but that can't happen easily because the VM that sends the reply
+might not be on the same HV as the VM that needs the answer (that is,
+the VM that sent the packet that needs the binding to be resolved) and
+there isn't an easy way for it to know which HV needs the answer.
+
+Thus, the HV that processes the ARP reply (which is unknown when the
+ARP is sent) has to tell all the HVs the binding.  The most obvious
+place for this is in the OVN_Southbound database.
+
+Details need to be worked out, including:
+
+**** OVN_Southbound schema changes.
+
+Possibly bindings could be added to the Port_Binding table by adding
+or modifying columns.  Another possibility is that another table
+should be added.
+
+**** Logical_Flow representation
+
+It would be really nice to maintain the general-purpose nature of
+logical flows, but these bindings might have to include some
+hard-coded special cases, especially when it comes to the relationship
+with populating the bindings into the OVN_Southbound table.
+
+**** Tracking queries
+
+It's probably best to only record in the database responses to queries
+actually issued by an L3 logical router, so somehow they have to be
+tracked, probably by putting a tentative binding without a MAC address
+into the database.
+
+**** Renewal and expiration.
+
+Something needs to make sure that bindings remain valid and expire
+those that become stale.
+
+*** MTU handling (fragmentation on output)
+
+** Ratelimiting.
+
+*** ARP.
+
+*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...
+
+As a point of comparison, Linux doesn't ratelimit TCP resets but I
+think it does everything else.
+
 * ovn-controller
 
 ** ovn-controller parameters and configuration.
diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
index 3c5d362..3731d56 100644
--- a/ovn/northd/ovn-northd.8.xml
+++ b/ovn/northd/ovn-northd.8.xml
@@ -106,10 +106,12 @@ 
       One of the main purposes of <code>ovn-northd</code> is to populate the
       <code>Logical_Flow</code> table in the <code>OVN_Southbound</code>
       database.  This section describes how <code>ovn-northd</code> does this
-      for logical datapaths.
+      for switch and router logical datapaths.
     </p>
 
-    <h2>Ingress Table 0: Admission Control and Ingress Port Security</h2>
+    <h2>Logical Switch Datapaths</h2>
+
+    <h3>Ingress Table 0: Admission Control and Ingress Port Security</h3>
 
     <p>
       Ingress table 0 contains these logical flows:
@@ -137,7 +139,7 @@ 
       be dropped.
     </p>
 
-    <h2>Ingress table 1: <code>from-lport</code> ACLs</h2>
+    <h3>Ingress table 1: <code>from-lport</code> ACLs</h3>
 
     <p>
       Logical flows in this table closely reproduce those in the
@@ -154,7 +156,7 @@ 
       <code>next;</code>, so that ACLs allow packets by default.
     </p>
 
-    <h2>Ingress Table 2: Destination Lookup</h2>
+    <h3>Ingress Table 2: Destination Lookup</h3>
 
     <p>
       This table implements switching behavior.  It contains these logical
@@ -185,13 +187,13 @@ 
       </li>
     </ul>
 
-    <h2>Egress Table 0: <code>to-lport</code> ACLs</h2>
+    <h3>Egress Table 0: <code>to-lport</code> ACLs</h3>
 
     <p>
       This is similar to ingress table 1 except for <code>to-lport</code> ACLs.
     </p>
 
-    <h2>Egress Table 1: Egress Port Security</h2>
+    <h3>Egress Table 1: Egress Port Security</h3>
 
     <p>
       This is similar to the ingress port security logic in ingress table 0,
@@ -206,4 +208,366 @@ 
       disabled logical <code>outport</code> overrides the priority-100 flow
       with a <code>drop;</code> action.
     </p>
+
+    <h2>Logical Router Datapaths</h2>
+
+    <h3>Ingress Table 0: L2 Admission Control</h3>
+
+    <p>
+      This table drops packets that the router shouldn't see at all based on
+      their Ethernet headers.  It contains the following flows:
+    </p>
+
+    <ul>
+      <li>
+        Priority-100 flows to drop packets with VLAN tags or multicast Ethernet
+        source addresses.
+      </li>
+
+      <li>
+        For each enabled router port <var>P</var> with Ethernet address
+        <var>E</var>, a priority-50 flow that matches <code>inport ==
+        <var>P</var> &amp;&amp; (eth.dst[40] || eth.dst ==
+        <var>E</var>)</code>, with action <code>next;</code>.
+      </li>
+    </ul>
+
+    <p>
+      Other packets are implicitly dropped.
+    </p>
+
+    <h3>Ingress Table 1: IP Input</h3>
+
+    <p>
+      This table is the core of the logical router datapath functionality.  It
+      contains the following flows to implement very basic IP host
+      functionality.
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          L3 admission control: A priority-220 flow drops packets that match
+          any of the following:
+        </p>
+
+        <ul>
+          <li>
+            <code>ip4.src[28..31] == 0xe</code> (multicast source)
+          </li>
+          <li>
+            <code>ip4.src == 255.255.255.255</code> (broadcast source)
+          </li>
+          <li>
+            <code>ip4.src == 127.0.0.0/8 || ip4.dst == 127.0.0.0/8</code>
+            (localhost source or destination)
+          </li>
+          <li>
+            <code>ip4.src == 0.0.0.0/8 || ip4.dst == 0.0.0.0/8</code> (zero
+            network source or destination)
+          </li>
+          <li>
+            <code>ip4.src</code> is any IP address owned by the router.
+          </li>
+          <li>
+            <code>ip4.src</code> is the broadcast address of any IP network
+            known to the router.
+          </li>
+        </ul>
+      </li>
+
+      <li>
+        <p>
+          ICMP echo reply.  These flows reply to ICMP echo requests received
+          for the router's IP address.  Let <var>A</var> be an IP address or
+          broadcast address owned by a router port.  Then, for each
+          <var>A</var>, a priority-210 flow matches on <code>ip4.dst ==
+          <var>A</var></code> and <code>icmp4.type == 8 &amp;&amp; icmp4.code
+          == 0</code> (ICMP echo request).  These flows use the following
+          actions where, if <var>A</var> is unicast, then <var>S</var> is
+          <var>A</var>, and if <var>A</var> is broadcast, <var>S</var> is the
+          router's IP address in <var>A</var>'s network:
+        </p>
+
+        <pre>
+ip4.dst = ip4.src;
+ip4.src = <var>S</var>;
+ip4.ttl = 255;
+icmp4.type = 0;
+next;
+        </pre>
+
+        <p>
+          Similar flows match on <code>ip4.dst == 255.255.255.255</code> and
+          each individual <code>inport</code>, and use the same actions in
+          which <var>S</var> is a function of <code>inport</code>.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          ARP reply.  These flows reply to ARP requests for the router's own IP
+          address.  For each router port <var>P</var> that owns IP address
+          <var>A</var> and Ethernet address <var>E</var>, a priority-210 flow
+          matches <code>inport == <var>P</var> &amp;&amp; arp.tpa ==
+          <var>A</var> &amp;&amp; arp.op == 1</code> (ARP request) with the
+          following actions:
+        </p>
+
+        <pre>
+eth.dst = eth.src;
+eth.src = <var>E</var>;
+arp.op = 2; /* ARP reply. */
+arp.tha = arp.sha;
+arp.sha = <var>E</var>;
+arp.tpa = arp.spa;
+arp.spa = <var>A</var>;
+outport = <var>P</var>;
+inport = 0; /* Allow sending out inport. */
+output;
+        </pre>
+      </li>
+
+      <li>
+        <p>
+          UDP port unreachable.  These flows generate ICMP port unreachable
+          messages in reply to UDP datagrams directed to the router's IP
+          address.  The logical router doesn't accept any UDP traffic so it
+          always generates such a reply.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          TCP reset.  These flows generate TCP reset messages in reply to TCP
+          datagrams directed to the router's IP address.  The logical router
+          doesn't accept any TCP traffic so it always generates such a reply.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Protocol unreachable.  These flows generate ICMP protocol unreachable
+          messages in reply to packets directed to the router's IP address on
+          IP protocols other than UDP, TCP, and ICMP.
+        </p>
+
+        <p>
+          These flows should not match IP fragments with nonzero offset.
+        </p>
+
+        <p>
+          Details TBD.
+        </p>
+      </li>
+
+      <li>
+        Drop other IP traffic to this router.  These flows drop any other
+        traffic destined to an IP address of this router that is not already
+        handled by one of the flows above.  For each IP address <var>A</var>
+        owned by the router, a priority-200 flow matches <code>ip4.dst ==
+        <var>A</var></code> and drops the traffic.
+      </li>
+    </ul>
+
+    <p>
+      The flows above handle all of the traffic that might be directed to the
+      router itself.  The following flows (with lower priorities) handle the
+      remaining traffic, potentially for forwarding:
+    </p>
+
+    <ul>
+      <li>
+        Drop Ethernet local broadcast.  A priority-190 flow with match
+        <code>eth.bcast</code> drops traffic destined to the local Ethernet
+        broadcast address.  By definition this traffic should not be forwarded.
+      </li>
+
+      <li>
+        Drop IP multicast.  A priority-190 flow with match
+        <code>ip4.dst[28..31] == 0xe</code> drops IP multicast traffic.
+      </li>
+
+      <li>
+        <p>
+          ICMP time exceeded.  For each router port <var>P</var>, whose IP
+          address is <var>A</var>, a priority-180 flow with match <code>inport
+          == <var>P</var> &amp;&amp; ip.ttl == {0, 1} &amp;&amp;
+          !ip.later_frag</code> matches packets whose TTL has expired, with the
+          following actions to send an ICMP time exceeded reply:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 11; /* Time exceeded. */
+    icmp4.code = 0;  /* TTL exceeded in transit. */
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip.ttl = 255;
+    next;
+};
+        </pre>
+      </li>
+
+      <li>
+        TTL discard.  A priority-170 flow with match <code>ip.ttl &lt;
+        2</code> and actions <code>drop;</code> drops other packets whose TTL
+        has expired and that should not receive an ICMP error reply.
+      </li>
+    </ul>
+
+    <h3>Ingress Table 2: IP Routing</h3>
+
+    <p>
+      A packet that arrives at this table is an IP packet that should be routed
+      to the address in <code>ip4.dst</code>.  This table implements IP
+      routing, setting <code>reg0</code> to the next-hop IP address (leaving
+      <code>ip4.dst</code>, the packet's final destination, unchanged) and
+      advances to the next table for ARP resolution.
+    </p>
+
+    <p>
+      This table contains the following logical flows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Routing table.  For each route to IPv4 network <var>N</var> with
+          netmask <var>M</var>, a logical flow with match <code>ip4.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+ip.ttl--;
+reg0 = <var>G</var>;
+next;
+        </pre>
+
+        <p>
+          (Ingress table 1 already verified that <code>ip.ttl--;</code> will
+          not yield a TTL exceeded error.)
+        </p>
+
+        <p>
+          If the route has a gateway, <var>G</var> is the gateway IP address,
+          otherwise it is <code>ip4.dst</code>.
+        </p>
+      </li>
+
+      <li>
+        <p>
+          Destination unreachable.  For each router port <var>P</var>, which
+          owns IP address <var>A</var>, a priority-0 logical flow with match
+          <code>inport == <var>P</var> &amp;&amp; !ip.later_frag &amp;&amp;
+          !icmp</code> has the following actions:
+        </p>
+
+        <pre>
+icmp4 {
+    icmp4.type = 3; /* Destination unreachable. */
+    icmp4.code = 0; /* Network unreachable. */
+    ip4.dst = ip4.src;
+    ip4.src = <var>A</var>;
+    ip.ttl = 255;
+    next(2);
+};
+        </pre>
+
+        <p>
+          (The <code>!icmp</code> check prevents recursion if the destination
+          unreachable message itself cannot be routed.)
+        </p>
+
+        <p>
+          These flows are omitted if the logical router has a default route,
+          that is, a route with netmask 0.0.0.0.
+        </p>
+      </li>
+    </ul>
+
+    <h3>Ingress Table 3: ARP Resolution</h3>
+
+    <p>
+      Any packet that reaches this table is an IP packet whose next-hop IP
+      address is in <code>reg0</code>.  (<code>ip4.dst</code> is the final
+      destination.)  This table resolves the IP address in <code>reg0</code>
+      into an output port in <code>outport</code> and an Ethernet address in
+      <code>eth.dst</code>, using the following flows:
+    </p>
+
+    <ul>
+      <li>
+        <p>
+          Known MAC bindings.  For each IP address <var>A</var> whose host is
+          known to have Ethernet address <var>HE</var> and reside on router
+          port <var>P</var> with Ethernet address <var>PE</var>, a priority-200
+          flow with match <code>reg0 == <var>A</var></code> has the following
+          actions:
+        </p>
+
+        <pre>
+eth.src = <var>PE</var>;
+eth.dst = <var>HE</var>;
+outport = <var>P</var>;
+output;
+        </pre>
+      </li>
+
+      <li>
+        <p>
+          Unknown MAC bindings.  For each non-gateway route to IPv4 network
+          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
+          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
+          a logical flow with match <code>ip4.dst ==
+          <var>N</var>/<var>M</var></code>, whose priority is the number of
+          1-bits in <var>M</var>, has the following actions:
+        </p>
+
+        <pre>
+arp {
+    eth.dst = ff:ff:ff:ff:ff:ff;
+    eth.src = <var>E</var>;
+    arp.sha = <var>E</var>;
+    arp.tha = 00:00:00:00:00:00;
+    arp.spa = <var>A</var>;
+    arp.tpa = ip4.dst;
+    arp.op = 1;  /* ARP request. */
+    outport = <var>P</var>;
+    output;
+};
+        </pre>
+
+        <p>
+          TBD: How to install MAC bindings when an ARP response comes back.
+          (Implement a "learn" action?)
+        </p>
+      </li>
+    </ul>
+
+    <h3>Egress Table 0: Delivery</h3>
+
+    <p>
+      Packets that reach this table are ready for delivery.  It contains
+      priority-100 logical flows that match packets on each enabled logical
+      router port, with action <code>output;</code>.
+    </p>
+
 </manpage>
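
The "Routing table" flows in ingress table 2 above get longest-prefix matching for free by using the number of 1-bits in the netmask as the flow priority. The following is a minimal Python sketch of that idea, not code from ovn-northd. The names Route, flow_priority, and next_hop are hypothetical, and the example routes are made up for illustration.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    """Hypothetical stand-in for one logical router route."""
    network: str            # IPv4 network N/M, e.g. "10.0.0.0/8"
    gateway: Optional[str]  # gateway IP, or None for a non-gateway route


def flow_priority(route: Route) -> int:
    """Priority of the route's logical flow: the number of 1-bits in M."""
    mask = int(ipaddress.ip_network(route.network).netmask)
    return bin(mask).count("1")


def next_hop(routes: List[Route], dst: str) -> Optional[str]:
    """Return the address that would be stored in reg0 for ip4.dst == dst.

    Returns None when no route matches, which is the case handled by the
    priority-0 "destination unreachable" flows.
    """
    addr = ipaddress.ip_address(dst)
    matching = [r for r in routes if addr in ipaddress.ip_network(r.network)]
    if not matching:
        return None
    # The highest-priority matching flow wins, i.e. the longest prefix.
    best = max(matching, key=flow_priority)
    # G is the gateway if the route has one, otherwise ip4.dst itself.
    return best.gateway if best.gateway is not None else dst


# Made-up example routes.
routes = [
    Route("0.0.0.0/0", "192.168.1.254"),
    Route("10.0.0.0/8", "192.168.1.1"),
    Route("10.1.2.0/24", None),
]

for dst in ("10.1.2.3", "10.9.9.9", "8.8.8.8"):
    print(dst, "->", next_hop(routes, dst))
```

In this model a default route (netmask 0.0.0.0) gets priority 0, the same priority as the "destination unreachable" flows, and it matches every destination, so the unreachable case can never be reached once such a route exists.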
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index 47dfc2a..a7ff674 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -596,7 +596,7 @@ 
     </li>
   </ol>
 
-  <h2>Life Cycle of a Packet</h2>
+  <h2>Architectural Life Cycle of a Packet</h2>
 
   <p>
     This section describes how a packet travels from one virtual machine or
diff --git a/ovn/ovn-sb.xml b/ovn/ovn-sb.xml
index 8c457d4..3cc96d4 100644
--- a/ovn/ovn-sb.xml
+++ b/ovn/ovn-sb.xml
@@ -240,12 +240,12 @@ 
       The default action when no flow matches is to drop packets.
     </p>
 
-    <p><em>Logical Life Cycle of a Packet</em></p>
+    <p><em>Architectural Logical Life Cycle of a Packet</em></p>
 
     <p>
       This following description focuses on the life cycle of a packet through
       a logical datapath, ignoring physical details of the implementation.
-      Please refer to <em>Life Cycle of a Packet</em> in
+      Please refer to <em>Architectural Life Cycle of a Packet</em> in
       <code>ovn-architecture</code>(7) for the physical information.
     </p>
 
@@ -847,21 +847,101 @@ 
       </p>
 
       <dl>
-        <dt><code>learn</code></dt>
+        <dt><code>ip.ttl--;</code></dt>
+        <dd>
+          <p>
+            Decrements the IPv4 or IPv6 TTL.  If this would make the TTL zero
+            or negative, then processing of the packet halts; no further
+            actions are processed.  (To properly handle such cases, a
+            higher-priority flow should match on <code>ip.ttl &lt; 2</code>.)
+          </p>
 
-        <dt><code>conntrack</code></dt>
+          <p><b>Prerequisite:</b> <code>ip</code></p>
+        </dd>
 
-        <dt><code>dec_ttl { <var>action</var>, </code>...<code> } { <var>action</var>; </code>...<code>};</code></dt>
+        <dt><code>arp { <var>action</var>; </code>...<code> };</code></dt>
         <dd>
-          decrement TTL; execute first set of actions if
-          successful, second set if TTL decrement fails
+          <p>
+            Temporarily replaces the IPv4 packet being processed by an ARP
+            packet and executes each nested <var>action</var> on the ARP
+            packet.  Actions following the <code>arp</code> action, if any, apply
+            to the original, unmodified packet.
+          </p>
+
+          <p>
+            The ARP packet that this action operates on is initialized based on
+            the IPv4 packet being processed, as follows.  These are default
+            values that the nested actions will probably want to change:
+          </p>
+
+          <ul>
+            <li><code>eth.src</code> unchanged</li>
+            <li><code>eth.dst</code> unchanged</li>
+            <li><code>eth.type = 0x0806</code></li>
+            <li><code>arp.op = 1</code> (ARP request)</li>
+            <li><code>arp.sha</code> copied from <code>eth.src</code></li>
+            <li><code>arp.spa</code> copied from <code>ip4.src</code></li>
+            <li><code>arp.tha = 00:00:00:00:00:00</code></li>
+            <li><code>arp.tpa</code> copied from <code>ip4.dst</code></li>
+          </ul>
+
+          <p><b>Prerequisite:</b> <code>ip4</code></p>
         </dd>
 
-        <dt><code>icmp_reply { <var>action</var>, </code>...<code> };</code></dt>
-        <dd>generate ICMP reply from packet, execute <var>action</var>s</dd>
+        <dt><code>icmp4 { <var>action</var>; </code>...<code> };</code></dt>
+        <dd>
+          <p>
+            Temporarily replaces the IPv4 packet being processed by an ICMPv4
+            packet and executes each nested <var>action</var> on the ICMPv4
+            packet.  Actions following the <code>icmp4</code> action, if any,
+            apply to the original, unmodified packet.
+          </p>
+
+          <p>
+            The ICMPv4 packet that this action operates on is initialized based
+            on the IPv4 packet being processed, as follows.  These are default
+            values that the nested actions will probably want to change.
+            Ethernet and IPv4 fields not listed here are not changed:
+          </p>
 
-        <dt><code>arp { <var>action</var>, </code>...<code> }</code></dt>
-        <dd>generate ARP from packet, execute <var>action</var>s</dd>
+          <ul>
+            <li><code>ip.proto = 1</code> (ICMPv4)</li>
+            <li><code>ip.frag = 0</code> (not a fragment)</li>
+            <li><code>icmp4.type = 3</code> (destination unreachable)</li>
+            <li><code>icmp4.code = 1</code> (host unreachable)</li>
+          </ul>
+
+          <p>
+            XXX need to explain exactly how the ICMP packet is constructed
+          </p>
+
+          <p><b>Prerequisite:</b> <code>ip4</code></p>
+        </dd>
+
+        <dt><code>tcp_reset;</code></dt>
+        <dd>
+          <p>
+            This action transforms the current TCP packet according to the
+            following pseudocode:
+          </p>
+
+          <pre>
+if (tcp.ack) {
+        tcp.seq = tcp.ack;
+} else {
+        tcp.ack = tcp.seq + length(tcp.payload);
+        tcp.seq = 0;
+}
+tcp.flags = RST;
+</pre>
+
+          <p>
+            Then, the action drops all TCP options and payload data, and
+            updates the TCP checksum.
+          </p>
+
+          <p><b>Prerequisite:</b> <code>tcp</code></p>
+        </dd>
       </dl>
     </column>