[2/3] NETFILTER module xt_hmark, new target for HASH based fwmark

Message ID 1326448338-13416-3-git-send-email-hans.schillstrom@ericsson.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Hans Schillstrom Jan. 13, 2012, 9:52 a.m. UTC
The target allows you to create rules in the "raw" and "mangle" tables
that set the netfilter mark (nfmark) to a value within a given range.
First a 32-bit hash value is generated, then it is reduced modulo <limit>,
and finally an offset is added before the result is written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behavior.

man page
   HMARK
       This module does the same as MARK, i.e. it sets an fwmark,
       but the mark is based on a hash value. The hash is based on
       saddr, daddr, sport, dport and proto. The same mark is produced
       independent of direction if no masks are set or if the same masks are
       used for src and dest. The hash can be reduced by a modulus, and finally
       an offset can be added, i.e. the final mark will be within a range.
       For ICMP errors the hash is calculated on the original message.
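
A minimal sketch of how a direction-independent hash can be built, assuming the address pair and the port pair are each put in a fixed order before mixing. The names mix32() and sym_hash() are hypothetical and the mixing step is a toy, not the hash used by the patch:

```c
#include <stdint.h>

/* Toy mixing function (assumption); it stands in for the real hash. */
static uint32_t mix32(uint32_t a, uint32_t b, uint32_t c)
{
	return ((a * 0x9e3779b1u) ^ (b * 0x85ebca6bu)) ^ c;
}

/*
 * Order each pair (low, high) before hashing, so swapping source and
 * destination feeds exactly the same values into mix32().
 */
static uint32_t sym_hash(uint32_t saddr, uint32_t daddr,
			 uint16_t sport, uint16_t dport, uint8_t proto)
{
	uint32_t a_lo = saddr < daddr ? saddr : daddr;
	uint32_t a_hi = saddr < daddr ? daddr : saddr;
	uint16_t p_lo = sport < dport ? sport : dport;
	uint16_t p_hi = sport < dport ? dport : sport;
	uint32_t ports = ((uint32_t)p_hi << 16) | p_lo;

	return mix32(a_lo, a_hi, ports ^ proto);
}
```

With this ordering, a packet from A:1000 to B:80 and its reply from B:80 to A:1000 hash to the same value. If different masks were applied to source and destination, the ordered inputs would no longer match, which is why the text requires equal masks.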
       Note:
        - None of the parameters affect the packet itself,
          only the calculated hash value.
       Fragmentation:
        - If a packet is fragmented, NONE of the fragments will include
          ports in the hash calculation.
          When fragments arrive on different machines, HMARK will produce
          the same fwmark for all fragments independent of machine, so they
          can be routed to the same destination.
        - If nf_defrag_ipv{4,6} is loaded, the packets will be defragmented
          before reaching HMARK, i.e. in that case ports (if any) will be used.
          (ICMP Time Exceeded will be sent if fragments are lost.)

       Parameters: For all masks the default is all ones; to disable a field,
                   use mask 0. For IPv6, only the last 32 bits are
                   included in the hash.

       --hmark-smask length (0-32 or 0-128)
              The value to AND the source address with (saddr & value).

       --hmark-dmask length (0-32 or 0-128)
              The value to AND the dest. address with (daddr & value).

       --hmark-sp-mask value
              A 16 bit value to AND the src port with (sport & value).

       --hmark-dp-mask value
              A 16 bit value to AND the dest port with (dport & value).

       --hmark-sp-set value
              A 16 bit value to OR the src port with (sport | value).

       --hmark-dp-set value
              A 16 bit value to OR the dest port with (dport | value).

       --hmark-spi-mask value
              Value to AND the SPI field with (spi & value); valid for proto ESP or AH.

       --hmark-spi-set value
              Value to OR the SPI field with (spi | value); valid for proto ESP or AH.

       --hmark-proto-mask value
              A 16 bit value to AND the L4 proto field with (proto & value).

       --hmark-rnd value
              A 32 bit initial value for the hash calculation; the default is 0xc175a3b8.

       --hmark-dnat (only IPv4)
              Replace src addr/port with original dst addr/port before calculating the hash.

       --hmark-snat (only IPv4)
              Replace dst addr/port with original src addr/port before calculating the hash.
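
The port options above can be sketched as follows. The text does not say in which order the AND and the OR are applied; this sketch assumes the mask is applied first and the set value second. apply_port_opts() is a hypothetical helper, not a function from the patch:

```c
#include <stdint.h>

/*
 * Combine --hmark-sp-mask / --hmark-sp-set (or the dp variants) for one
 * port value. Assumed order: AND with the mask, then OR with the set bits.
 * Only the value fed to the hash changes; the packet is not modified.
 */
static uint16_t apply_port_opts(uint16_t port, uint16_t mask, uint16_t set)
{
	return (uint16_t)((port & mask) | set);
}
```

A mask of 0xffff with a set value of 0 leaves the port untouched, and a mask of 0 removes the port from the hash input entirely, matching the "use mask 0 to disable a field" rule.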

       Final processing of the mark in order of execution.

       --hmark-mod value (must be > 0)
              The easiest way to describe this is:  hash = hash mod <value>

       --hmark-offs value (must be > 0)
              The easiest way to describe this is:  hash = hash + <value>
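
The two final steps can be sketched as below, assuming they run in the order listed (modulus, then offset). final_mark() is a hypothetical helper; rejecting a zero modulus with -1 is a choice made for this sketch only:

```c
#include <stdint.h>

/*
 * mark = (hash mod <mod>) + <offs>, so the result lies in
 * [offs, offs + mod - 1] as long as the sum does not wrap.
 * Returns 0 on success, -1 if mod is 0 (division by zero otherwise).
 */
static int final_mark(uint32_t hash, uint32_t mod, uint32_t offs,
		      uint32_t *mark)
{
	if (mod == 0)
		return -1;
	*mark = hash % mod + offs;
	return 0;
}
```

For example, --hmark-mod 10 --hmark-offs 100 maps every hash into the range 100 to 109.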

       Examples:

       Default rule handles all TCP, UDP, SCTP, ESP & AH

Rev 7
      IPv6 descending into icmp error hdr didn't work as expected
      with ipv6_find_hdr(). Now it works as expected.

Rev 6
      Compile options with or without conntrack fixed.
      __ipv6_find_hdr() replaced by ipv6_find_hdr()

Rev 5
      IPv6 rewritten uses __ipv6_find_hdr() (P. Mc Hardy)
      Full mask and address used for IPv6 smask and dmask (J.Engelhart)
      Changes due to comments by Pablo Neira Ayuso  and Eric Dumazet
      i.e uses of skb_header_pointer() and Null check of info->hmod
      Man page changes

Rev 4
      different targets for IPv4 and IPv6
      Changes based on review by Pablo.

Rev 3
      Support added to SCTP for IPv6
Rev 2
      IPv6 header scan changed to follow RFC 2460
      IPv4 icmp echo fragmented does now use proto as ipv6
      IPv6 pskb_may_pull() check is done every time in the header loop.
      IPv4 nat support added.
      default added in IPv6 loop and null check of hp

Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
---
 include/linux/netfilter/xt_hmark.h |   62 +++++++
 net/netfilter/Kconfig              |   17 ++
 net/netfilter/Makefile             |    1 +
 net/netfilter/xt_hmark.c           |  337 ++++++++++++++++++++++++++++++++++++
 4 files changed, 417 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_hmark.h
 create mode 100644 net/netfilter/xt_hmark.c

Comments

Pablo Neira Ayuso Jan. 22, 2012, 9:44 p.m. UTC | #1
Hi Hans,

This starts looking good :-). Some comments on your patch:

On Fri, Jan 13, 2012 at 10:52:17AM +0100, Hans Schillstrom wrote:
>        Fragmentation:
>         - If a packet is fragmented, NONE of the fragments will include
>           ports in the hash calculation.
>           When fragments arrive on different machines, HMARK will produce
>           the same fwmark for all fragments independent of machine, so they
>           can be routed to the same destination.

I've got one scenario that may break this assumption:

1) Your traffic follows a path over routers A and B to reach your
   firewall F, and that path requires no fragmentation at all.
2) The path to router B breaks while there are established flows
   with firewall F.
3) Router A decides to forward packets to router C, which fragments
   them because it uses a smaller MTU than router A.
4) The packets arrive at firewall F, the hash is now calculated from the
   addresses only, not the ports, and your load sharing becomes inconsistent.

This rarely happens, but when it does, load sharing breaks.

To fix this, I think HMARK should require you to specify the
hashing strategy. If you want to support fragments, use only
addresses. If you're sure you will not get fragments, use layer 3 and
layer 4 information.
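
A rough sketch of the explicit strategy idea, assuming a toy hash in place of the real one. The enum hash_strategy, toy_hash() and flow_hash() are hypothetical names, not part of the patch:

```c
#include <stdint.h>

/* Strategy chosen by the administrator, not inferred per packet. */
enum hash_strategy {
	HASH_L3_ONLY,	/* addresses and protocol: safe with fragments */
	HASH_L3_L4,	/* addresses, protocol and ports: no fragments expected */
};

/* Toy hash (assumption); any deterministic 32 bit hash would do. */
static uint32_t toy_hash(uint32_t a, uint32_t b, uint32_t c)
{
	return ((a * 0x9e3779b1u) ^ (b * 0x85ebca6bu)) ^ c;
}

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport, uint8_t proto,
			  enum hash_strategy strategy)
{
	uint32_t ports = 0;

	/* Ports contribute only when the rule explicitly asks for them. */
	if (strategy == HASH_L3_L4)
		ports = ((uint32_t)sport << 16) | dport;

	return toy_hash(saddr, daddr, ports ^ proto);
}
```

With HASH_L3_ONLY the result does not depend on the ports, so a flow keeps its mark even if a path change makes it start arriving as fragments. With HASH_L3_L4 the same flow gets a different value than the address-only hash, which is the inconsistency in step 4 when the choice is made per packet.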

>         - If nf_defrag_ipv{4,6} is loaded the packets will be defragmented
>           before reaching HMARK, i.e. in that case ports (if any) will be used.
>           (ICMP Time exceeded will be sent if fragments are lost)
> 
> diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
> new file mode 100644
> index 0000000..366ecce
> --- /dev/null
> +++ b/include/linux/netfilter/xt_hmark.h
> @@ -0,0 +1,62 @@
> +#ifndef XT_HMARK_H_
> +#define XT_HMARK_H_
> +
> +#include <linux/types.h>
> +
> +/*
> + * Flags must not start at 0, since it's used as none.
> + */
> +enum {
> +	XT_HMARK_SADR_AND = 1,	/* SNAT & DNAT are used by the kernel module */
> +	XT_HMARK_DADR_AND,
> +	XT_HMARK_SPI_AND,
> +	XT_HMARK_SPI_OR,
> +	XT_HMARK_SPORT_AND,
> +	XT_HMARK_DPORT_AND,
> +	XT_HMARK_SPORT_OR,
> +	XT_HMARK_DPORT_OR,
> +	XT_HMARK_PROTO_AND,
> +	XT_HMARK_RND,
> +	XT_HMARK_MODULUS,
> +	XT_HMARK_OFFSET,
> +	XT_HMARK_USE_SNAT,
> +	XT_HMARK_USE_DNAT,
> +	XT_F_HMARK_USE_SNAT = 1 << XT_HMARK_USE_SNAT,
> +	XT_F_HMARK_USE_DNAT = 1 << XT_HMARK_USE_DNAT,
> +	XT_F_HMARK_SADR_AND = 1 << XT_HMARK_SADR_AND,
> +	XT_F_HMARK_DADR_AND = 1 << XT_HMARK_DADR_AND,
> +	XT_F_HMARK_SPI_AND = 1 << XT_HMARK_SPI_AND,
> +	XT_F_HMARK_SPI_OR = 1 << XT_HMARK_SPI_OR,
> +	XT_F_HMARK_SPORT_AND = 1 << XT_HMARK_SPORT_AND,
> +	XT_F_HMARK_DPORT_AND = 1 << XT_HMARK_DPORT_AND,
> +	XT_F_HMARK_SPORT_OR = 1 << XT_HMARK_SPORT_OR,
> +	XT_F_HMARK_DPORT_OR = 1 << XT_HMARK_DPORT_OR,
> +	XT_F_HMARK_PROTO_AND = 1 << XT_HMARK_PROTO_AND,
> +	XT_F_HMARK_RND = 1 << XT_HMARK_RND,
> +	XT_F_HMARK_MODULUS = 1 << XT_HMARK_MODULUS,
> +	XT_F_HMARK_OFFSET = 1 << XT_HMARK_OFFSET,
> +};
> +
> +union hports {
> +	struct {
> +		__u16	src;
> +		__u16	dst;
> +	} p16;
> +	__u32	v32;
> +};
> +
> +struct xt_hmark_info {
> +	union nf_inet_addr	smask;		/* Source address mask */
> +	union nf_inet_addr	dmask;		/* Dest address mask */
> +	union hports		pmask;
> +	union hports		pset;
> +	__u32			spimask;
> +	__u32			spiset;
> +	__u16			flags;		/* Print out only */
> +	__u16			prmask;		/* L4 Proto mask */
> +	__u32			hashrnd;
> +	__u32			hmod;		/* Modulus */
> +	__u32			hoffs;		/* Offset */
> +};
> +
> +#endif /* XT_HMARK_H_ */
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index f8ac4ef..dfe84e1 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -488,6 +488,23 @@ config NETFILTER_XT_TARGET_HL
>  	since you can easily create immortal packets that loop
>  	forever on the network.
>  
> +config NETFILTER_XT_TARGET_HMARK
> +	tristate '"HMARK" target support'
> +	depends on NETFILTER_ADVANCED
> +	---help---
> +	This option adds the "HMARK" target.
> +
> +	The target allows you to create rules in the "raw" and "mangle" tables
> +	which alter the netfilter mark (nfmark) field within a given range.
> +	First a 32 bit hash value is generated then modulus by <limit> and
> +	finally an offset is added before it's written to nfmark.
> +
> +	Prior to routing, the nfmark can influence the routing method (see
> +	"Use netfilter MARK value as routing key") and can also be used by
> +	other subsystems to change their behavior.
> +
> +	The mark match can also be used to match nfmark produced by this module.
> +
>  config NETFILTER_XT_TARGET_IDLETIMER
>  	tristate  "IDLETIMER target support"
>  	depends on NETFILTER_ADVANCED
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 40f4c3d..21bc5e8 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -57,6 +57,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_DSCP) += xt_DSCP.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_HL) += xt_HL.o
> +obj-$(CONFIG_NETFILTER_XT_TARGET_HMARK) += xt_hmark.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_LED) += xt_LED.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_NFLOG) += xt_NFLOG.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE) += xt_NFQUEUE.o
> diff --git a/net/netfilter/xt_hmark.c b/net/netfilter/xt_hmark.c
> new file mode 100644
> index 0000000..fd73d15
> --- /dev/null
> +++ b/net/netfilter/xt_hmark.c
> @@ -0,0 +1,337 @@
> +/*
> + * xt_hmark - Netfilter module to set mark as hash value
> + *
> + * (C) 2011 Hans Schillstrom <hans.schillstrom@ericsson.com>
> + *
> + *Description:
> + *	This module calculates a hash value that can be modified by modulus
> + *	and an offset, i.e. it is possible to produce a skb->mark within a range.
> + *	The hash value is based on a direction independent five tuple:
> + *	src & dst addr src & dst ports and protocol.
> + *	However src & dst port can be masked and are not used for fragmented
> + *	packets, ESP and AH don't have ports so SPI will be used instead.
> + *	AH will not use ports even if it might be possible.
> + *	Tunnels - only the outer saddr and daddr will beused,
> + *
> + *	For ICMP error messages the hash mark values will be calculated on
> + *	the source packet i.e. the packet caused the error (If sufficient
> + *	amount of data exists).
> + *
> + *Note:	None of the fragments will include ports/spi in the calculation of
> + *	the hash value. (i.e. all frags must the same hash value.)
> + *
> + *	This program is free software; you can redistribute it and/or modify
> + *	it under the terms of the GNU General Public License version 2 as
> + *	published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <net/ip.h>
> +#include <linux/icmp.h>
> +
> +#include <linux/netfilter/xt_hmark.h>
> +#include <linux/netfilter/x_tables.h>
> +#if defined(CONFIG_NF_NAT)
> +#include <net/netfilter/nf_nat.h>
> +#endif
> +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> +#	define WITH_IPV6 1
> +#include <net/ipv6.h>
> +#include <linux/netfilter_ipv6/ip6_tables.h>
> +#endif
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Hans Schillstrom <hans.schillstrom@ericsson.com>");
> +MODULE_DESCRIPTION("Xtables: packet range mark operations by hash value");
> +MODULE_ALIAS("ipt_HMARK");
> +MODULE_ALIAS("ip6t_HMARK");
> +
> +/*
> + * ICMP, get inner header so calc can be made on the source message
> + *
> + * iphsz: ip header size in bytes
> + * nhoff: network header offset
> + * return; updated nhoff if an icmp error
> + */

Please, remove these comments:

1) These functions are static, they provide no API for other users.
2) By reading the code, you can notice what it does. It's redundant.

> +static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
> +{
> +	const struct icmphdr *icmph;
> +	struct icmphdr _ih;
> +
> +	/* Not enough header? */
> +	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
> +	if (icmph == NULL)
> +		return nhoff;
> +
> +	if (icmph->type > NR_ICMP_TYPES)
> +		return nhoff;
> +
> +	/* Error message? */
> +	if (icmph->type != ICMP_DEST_UNREACH &&
> +	    icmph->type != ICMP_SOURCE_QUENCH &&
> +	    icmph->type != ICMP_TIME_EXCEEDED &&
> +	    icmph->type != ICMP_PARAMETERPROB &&
> +	    icmph->type != ICMP_REDIRECT)
> +		return nhoff;
> +
> +	return nhoff + iphsz + sizeof(_ih);
> +}
> +
> +#ifdef WITH_IPV6
> +/* Dummy header used for size calculation of an error header */
> +struct _icmpv6_errh {
> +	__u8		icmp6_type;
> +	__u8		icmp6_code;
> +	__u16		icmp6_cksum;
> +	__u32		icmp6_nu;
> +};

Interesting, by quick search, I don't find this structure defined
elsewhere, why?

> +/*
> + * Get ipv6 header offset if icmp type < 128 i.e. an error.
> + * @param: offset  input: where ICMPv6 header starts
> + *                output: where ipv6 header starts / unchanged.
> + *
> + * Returns true if it's an icmp error
> + *              and updates *offset to where ipv6 header starts
> + */
> +static int get_inner6_hdr(struct sk_buff *skb, int *offset)
> +{
> +	struct icmp6hdr *icmp6h, _ih6;
> +
> +	icmp6h = skb_header_pointer(skb, *offset, sizeof(_ih6), &_ih6);
> +	if (icmp6h == NULL)
> +		return 0;
> +
> +	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
> +		*offset +=  sizeof(struct _icmpv6_errh);
> +		return 1;
> +	}
> +	return 0;
> +}
> +/*
> + * Calculate hash based fw-mark, on the five tuple if possible.
> + * special cases :
> + *  - Fragments do not use ports not even on the first fragment,
> + *    nf_defrag_ipv6.ko don't defrag for us like it do in ipv4.
> + *    This might be changed in the future.
> + *  - On ICMP errors the inner header will be used.
> + *  - Tunnels no ports
> + *  - ESP & AH uses SPI
> + * @returns XT_CONTINUE
> + */
> +__u32 hmark_v6(struct sk_buff *skb, const struct xt_action_param *par)
> +{
> +	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
> +	struct ipv6hdr *ip6, _ip6;
> +	int poff, flag = IP6T_FH_F_AUTH; /* Ports offset, find_hdr flags */
> +	u32 addr1, addr2, hash, nhoffs=0;
> +	u8 nexthdr;
> +	union hports uports = { .v32 = 0 };
> +	unsigned short fragoff = 0;
> +
> +	if (!info->hmod)
> +		return XT_CONTINUE;

Why this? Check in user space that libxt_HMARK never sends a zero
modulus to the kernel, and check it again in checkentry().
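
A user-space sketch of that kind of one-time validation, assuming a reduced stand-in for struct xt_hmark_info. struct hmark_cfg and hmark_check() are hypothetical names; the real checkentry() hook has a different signature:

```c
#include <errno.h>
#include <stdint.h>

/* Reduced, hypothetical stand-in for struct xt_hmark_info. */
struct hmark_cfg {
	uint32_t hmod;	/* modulus, must be > 0 */
	uint32_t hoffs;	/* offset */
};

/*
 * Reject the rule when it is loaded, so the per-packet path never has
 * to test for a zero modulus.
 */
static int hmark_check(const struct hmark_cfg *cfg)
{
	if (cfg->hmod == 0)
		return -EINVAL;
	return 0;
}
```

The point of the design is that an invalid rule fails once at insertion time instead of being silently skipped for every packet.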
Hans Schillstrom Jan. 22, 2012, 11:20 p.m. UTC | #2
Hello Pablo

On Sunday, January 22, 2012 22:44:34 Pablo Neira Ayuso wrote:
> Hi Hans,
> 
> This starts looking good :-). Some comments on your patch:

Thanks

> 
> On Fri, Jan 13, 2012 at 10:52:17AM +0100, Hans Schillstrom wrote:
> >        Fragmentation:
> >         - If a packet is fragmented, NONE of the fragments will include
> >           ports in the hash calculation.
> >           When fragments arrive on different machines, HMARK will produce
> >           the same fwmark for all fragments independent of machine, so they
> >           can be routed to the same destination.
> 

The text should clarify that this holds for the fragments, not for the "flow".

> I've got one scenario that may break this assumption:
> 
> 1) Your traffic follows a path over routers A and B to reach your
>    firewall F, and that path requires no fragmentation at all.
> 2) The path to router B breaks while there are established flows
>    with firewall F.
> 3) Router A decides to forward packets to router C, which fragments
>    them because it uses a smaller MTU than router A.
> 4) The packets arrive at firewall F, the hash is now calculated from the
>    addresses only, not the ports, and your load sharing becomes inconsistent.
> 
> This rarely happens, but when it does, load sharing breaks.
> 
> To fix this, I think HMARK should require you to specify the
> hashing strategy. If you want to support fragments, use only
> addresses. If you're sure you will not get fragments, use layer 3 and
> layer 4 information.

I know, but if you use conntrack, fragments will not be seen by HMARK
(except for IPv6, until Patrick has fixed the IPv6 defrag).

We handle this by not having stateful firewalls when connected to external routers.
Fragments take an extra turn through a container with conntrack, and there
HMARK works as it does on the unfragmented packets in the flow.

> 
> > +/*
> > + * ICMP, get inner header so calc can be made on the source message
> > + *
> > + * iphsz: ip header size in bytes
> > + * nhoff: network header offset
> > + * return; updated nhoff if an icmp error
> > + */
> 
> Please, remove these comments:

No problems

> 
> 1) These functions are static, they provide no API for other users.
> 2) By reading the code, you can notice what it does. It's redundant.
> 
> > +static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
> > +{
> > +	const struct icmphdr *icmph;
> > +	struct icmphdr _ih;
> > +
> > +	/* Not enough header? */
> > +	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
> > +	if (icmph == NULL)
> > +		return nhoff;
> > +
> > +	if (icmph->type > NR_ICMP_TYPES)
> > +		return nhoff;
> > +
> > +	/* Error message? */
> > +	if (icmph->type != ICMP_DEST_UNREACH &&
> > +	    icmph->type != ICMP_SOURCE_QUENCH &&
> > +	    icmph->type != ICMP_TIME_EXCEEDED &&
> > +	    icmph->type != ICMP_PARAMETERPROB &&
> > +	    icmph->type != ICMP_REDIRECT)
> > +		return nhoff;
> > +
> > +	return nhoff + iphsz + sizeof(_ih);
> > +}
> > +
> > +#ifdef WITH_IPV6
> > +/* Dummy header used for size calculation of an error header */
> > +struct _icmpv6_errh {
> > +	__u8		icmp6_type;
> > +	__u8		icmp6_code;
> > +	__u16		icmp6_cksum;
> > +	__u32		icmp6_nu;
> > +};
> 
> Interesting, by quick search, I don't find this structure defined
> elsewhere, why?
> 
I have no idea ...
the closest is "struct icmp6hdr" but it contains everything

> > +/*
> > + * Get ipv6 header offset if icmp type < 128 i.e. an error.
> > + * @param: offset  input: where ICMPv6 header starts
> > + *                output: where ipv6 header starts / unchanged.
> > + *
> > + * Returns true if it's an icmp error
> > + *              and updates *offset to where ipv6 header starts
> > + */
> > +static int get_inner6_hdr(struct sk_buff *skb, int *offset)
> > +{
> > +	struct icmp6hdr *icmp6h, _ih6;
> > +
> > +	icmp6h = skb_header_pointer(skb, *offset, sizeof(_ih6), &_ih6);
> > +	if (icmp6h == NULL)
> > +		return 0;
> > +
> > +	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
> > +		*offset +=  sizeof(struct _icmpv6_errh);
> > +		return 1;
> > +	}
> > +	return 0;
> > +}
> > +/*
> > + * Calculate hash based fw-mark, on the five tuple if possible.
> > + * special cases :
> > + *  - Fragments do not use ports not even on the first fragment,
> > + *    nf_defrag_ipv6.ko don't defrag for us like it do in ipv4.
> > + *    This might be changed in the future.
> > + *  - On ICMP errors the inner header will be used.
> > + *  - Tunnels no ports
> > + *  - ESP & AH uses SPI
> > + * @returns XT_CONTINUE
> > + */
> > +__u32 hmark_v6(struct sk_buff *skb, const struct xt_action_param *par)
> > +{
> > +	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
> > +	struct ipv6hdr *ip6, _ip6;
> > +	int poff, flag = IP6T_FH_F_AUTH; /* Ports offset, find_hdr flags */
> > +	u32 addr1, addr2, hash, nhoffs=0;
> > +	u8 nexthdr;
> > +	union hports uports = { .v32 = 0 };
> > +	unsigned short fragoff = 0;
> > +
> > +	if (!info->hmod)
> > +		return XT_CONTINUE;
> 
> why this? check in user-space that libxt_HMARK does not send this to
> kernel-space and check it again in checkentry().

Well, better safe than ... divide by zero

OK, it is very, very unlikely that it becomes zero,
so if you want I can remove that check.

Thanks
/Hans
Pablo Neira Ayuso Jan. 23, 2012, 9:12 a.m. UTC | #3
On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> The text should clarify that this is valid for the fragments not the "flow"
> 
> > I've got one scenario that may break with this assumption:
> > 
> > 1) your traffic follows one path over router A and B to reach your
> >    firewall F which requires no fragmentation at all.
> > 2) path to router B becomes broken while there are established flows
> >    with firewall F.
> > 3) router A decides to forward packets to router C, which fragment
> >    packets because it is using smaller MTU than router A.
> > 4) packets arrive to firewall F, then hashing is calculated based on
> >    addresses, not ports, and you load-sharing becomes inconsistent.
> > 
> > This can rarely happen, but it does, it would break.
> > 
> > To fix this, I think that HMARK requires that you have to specify the
> > hashing strategy. If you want to support fragments, use only
> > addresses. If you're sure you will not get fragments, use layer 3 and
> > layer 4 information.
> 
> I know but if you use conntrack, fragments will not be seen by HMARK
> (except for IPv6 until Patric has fix the IPv6 defrag)

Please, read the scenario, I'm not talking about conntrack this time.

> We handle this by not having stateful FW:s when connected to external routers.
> Fragments will take an extra turn to a container with conntrack and there
> HMARK works as on the unfragmented packets in the flow.

Yes, I got the idea. Indeed HMARK can be very useful in other situations,
like cluster-based OSPF setups with stateful firewalls following a
similar approach.

However, you didn't reply to my scenario. What I'm saying is that,
even with conntrack disabled, HMARK is not consistent if you start
receiving fragments at some point.

[...]
> > > +/*
> > > + * ICMP, get inner header so calc can be made on the source message
> > > + *
> > > + * iphsz: ip header size in bytes
> > > + * nhoff: network header offset
> > > + * return; updated nhoff if an icmp error
> > > + */
> > 
> > Please, remove these comments:
> 
> No problems

Thanks.

> > > +struct _icmpv6_errh {
> > > +	__u8		icmp6_type;
> > > +	__u8		icmp6_code;
> > > +	__u16		icmp6_cksum;
> > > +	__u32		icmp6_nu;
> > > +};
> > 
> > Interesting, by quick search, I don't find this structure defined
> > elsewhere, why?
> > 
> I have no idea ...
> the closest is "struct icmp6hdr" but it contains everything

Have a look at offsetof: you can use the existing structure but tell
skb_header_pointer to copy only the part you're interested in. Add a
comment saying that you're only copying part of the header to warn others
(in this case, the comment becomes useful since it clarifies something
that you may not notice at first glance by looking at the code).
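
A minimal userspace sketch of the offsetof idea above. The struct and
function names below are hypothetical (this is not the kernel's
struct icmp6hdr, and memcpy stands in for skb_header_pointer); it only
shows how offsetof gives the length of the fixed leading part, so that no
separate dummy struct is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout, for illustration only: a fixed 8-byte
 * prefix followed by message-specific data. */
struct demo_icmp6 {
	uint8_t  type;
	uint8_t  code;
	uint16_t cksum;
	uint32_t unused;
	uint8_t  body[32];	/* not needed to classify the message */
};

/* Length of the part we actually want to look at. */
#define DEMO_ERRH_LEN offsetof(struct demo_icmp6, body)

static_assert(DEMO_ERRH_LEN == 8, "fixed prefix is expected to be 8 bytes");

/*
 * NOTE: only the leading DEMO_ERRH_LEN bytes of *hdr are filled in;
 * hdr->body is deliberately left untouched.
 * Returns 1 when the type field marks an ICMPv6 error (1..127),
 * 0 otherwise or when the packet is too short.
 */
static int demo_is_icmp6_error(const uint8_t *pkt, size_t len,
			       struct demo_icmp6 *hdr)
{
	if (len < DEMO_ERRH_LEN)
		return 0;
	memcpy(hdr, pkt, DEMO_ERRH_LEN);
	return hdr->type != 0 && hdr->type < 128;
}
```

The comment above the function is the kind of note meant here: it warns
the reader that the struct is only partially valid after the copy.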

[...]
> > > +	if (!info->hmod)
> > > +		return XT_CONTINUE;
> > 
> > why this? check in user-space that libxt_HMARK does not send this to
> > kernel-space and check it again in checkentry().
> 
> Well, better safe than ... divide by zero
> 
> OK, it very very unlikely that it becomes zero
> so if you want I can remove that check.

*Extremely unlikely*, I'd say :-). If you double check that hmod is
non-zero in user-space and checkentry(), we will not hit that branch
ever. Moreover, that branch is in the hot path while the others are
only configure-time paths.
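
A rough sketch of that split, in plain userspace C. The struct and
function names are made up for illustration and are not the ones in the
patch; the point is only that the zero check lives in the configure-time
path, so the per-packet path can divide without a branch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct xt_hmark_info. */
struct demo_hmark_info {
	uint32_t hmod;	/* modulus, must be non-zero */
	uint32_t hoffs;	/* offset added after the modulus */
};

/* Configure-time path: runs once per rule, so validation is cheap here. */
static int demo_checkentry(const struct demo_hmark_info *info)
{
	if (info->hmod == 0)
		return -EINVAL;
	return 0;
}

/* Hot path: runs per packet and relies on demo_checkentry() having
 * rejected hmod == 0 when the rule was loaded. */
static uint32_t demo_mark(const struct demo_hmark_info *info, uint32_t hash)
{
	assert(info->hmod != 0);	/* debug aid only, gone with NDEBUG */
	return hash % info->hmod + info->hoffs;
}
```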
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans Schillstrom Jan. 23, 2012, 9:49 a.m. UTC | #4
On Monday 23 January 2012 10:12:41 Pablo Neira Ayuso wrote:
> On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> > The text should clarify that this is valid for the fragments not the "flow"
> > 
> > > I've got one scenario that may break with this assumption:
> > > 
> > > 1) your traffic follows one path over router A and B to reach your
> > >    firewall F which requires no fragmentation at all.

I missed the last part here  "requires no fragmentation at all"

> > > 2) path to router B becomes broken while there are established flows
> > >    with firewall F.
> > > 3) router A decides to forward packets to router C, which fragment
> > >    packets because it is using smaller MTU than router A.
> > > 4) packets arrive to firewall F, then hashing is calculated based on
> > >    addresses, not ports, and you load-sharing becomes inconsistent.
> > > 
> > > This can rarely happen, but it does, it would break.
> > > 
> > > To fix this, I think that HMARK requires that you have to specify the
> > > hashing strategy. If you want to support fragments, use only
> > > addresses. If you're sure you will not get fragments, use layer 3 and
> > > layer 4 information.

This can be accomplished by setting --hmark-sp-mask and --hmark-dp-mask to zero.
Then the ports are not used in the hash calc.

> > 
> > I know but if you use conntrack, fragments will not be seen by HMARK
> > (except for IPv6 until Patric has fix the IPv6 defrag)
> 
> Please, read the scenario, I'm not talking about conntrack this time.
> 
> > We handle this by not having stateful FW:s when connected to external routers.
> > Fragments will take an extra turn to a container with conntrack and there
> > HMARK works as on the unfragmented packets in the flow.
> 
> Yes, I got the idea. Indeed HMARK can be very useful in other situations,
> like cluster-based OSPF setups with stateful firewalls following a
> similar approach.
> 
> However, you don't reply to my scenario. What I'm telling is that,
> even with conntrack disabled, HMARK is not consistent if you start
> receiving fragments at some point.
 
Yes, after reading once again I see what you mean.
I think that masking src & dst ports will be sufficient, 
i.e. no new param will be needed for hashing strategy.

> [...]
> > > > +/*
> > > > + * ICMP, get inner header so calc can be made on the source message
> > > > + *
> > > > + * iphsz: ip header size in bytes
> > > > + * nhoff: network header offset
> > > > + * return; updated nhoff if an icmp error
> > > > + */
> > > 
> > > Please, remove these comments:
> > 
> > No problems
> 
> Thanks.
> 
> > > > +struct _icmpv6_errh {
> > > > +	__u8		icmp6_type;
> > > > +	__u8		icmp6_code;
> > > > +	__u16		icmp6_cksum;
> > > > +	__u32		icmp6_nu;
> > > > +};
> > > 
> > > Interesting, by quick search, I don't find this structure defined
> > > elsewhere, why?
> > > 
> > I have no idea ...
> > the closest is "struct icmp6hdr" but it contains everything
> 
> have a look at offsetof, you can use the existing structure but tell
> skb_copy_header to copy only the part you're interested. Add a comment
> telling what you're only copying part of the header to warn others (in
> this case, the comment becomes useful since it clarifies something
> that you may not notice at a first glance by looking at the code).
> 

OK I'll do that.

> [...]
> > > > +	if (!info->hmod)
> > > > +		return XT_CONTINUE;
> > > 
> > > why this? check in user-space that libxt_HMARK does not send this to
> > > kernel-space and check it again in checkentry().
> > 
> > Well, better safe than ... divide by zero
> > 
> > OK, it very very unlikely that it becomes zero
> > so if you want I can remove that check.
> 
> *Extremely unlikely*, I'd say :-). If you double check that hmod is
> non-zero in user-space and checkentry(), we will not hit that branch
> ever. Moreover, that branch is in the hot path while the others are
> only configure-time paths.

Got the message :-)
I'll remove it.
Pablo Neira Ayuso Jan. 23, 2012, 5:01 p.m. UTC | #5
Hi Hans,

On Mon, Jan 23, 2012 at 10:49:16AM +0100, Hans Schillstrom wrote:
> On Monday 23 January 2012 10:12:41 Pablo Neira Ayuso wrote:
> > On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> > > The text should clarify that this is valid for the fragments not the "flow"
> > > 
> > > > I've got one scenario that may break with this assumption:
> > > > 
> > > > 1) your traffic follows one path over router A and B to reach your
> > > >    firewall F which requires no fragmentation at all.
> 
> I missed the last part here  "requires no fragmentation at all"
> 
> > > > 2) path to router B becomes broken while there are established flows
> > > >    with firewall F.
> > > > 3) router A decides to forward packets to router C, which fragment
> > > >    packets because it is using smaller MTU than router A.
> > > > 4) packets arrive to firewall F, then hashing is calculated based on
> > > >    addresses, not ports, and you load-sharing becomes inconsistent.
> > > > 
> > > > This can rarely happen, but it does, it would break.
> > > > 
> > > > To fix this, I think that HMARK requires that you have to specify the
> > > > hashing strategy. If you want to support fragments, use only
> > > > addresses. If you're sure you will not get fragments, use layer 3 and
> > > > layer 4 information.
> 
> This can be acomplished by setting --hmark-sp-mask and --hmark-dp-mask to Zero
> Then you don't use port in the hash calc.

OK, it would be great if we can provide a cleaner interface. The
current behaviour uses layer3-layer4 tuple hashing plus defaulting to
layer3 in case of fragments.

I'd prefer explicit configuration options:

--hashmark-method layer3
        use only address for hashing, this is fragment safe.

--hashmark-method layer3-layer4
        use addresses and ports for hashing, fragments not supported
        unless defrag is enabled.

Still, if you want to support the current behaviour, it should be
something like:

--hashmark-method layer3-layer4-fragments
        use addresses and ports for hashing, for fragments default to
        layer3 hashing. Document scenario in which hash consistency
        may break.

The behaviour of the target has to be specified by the configurations.
Defaulting to internal assumptions seems obscure to me.
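
To make the first two methods concrete, here is a small userspace sketch.
Everything in it is an assumption for illustration: the enum, the packet
struct and the toy mixing function are not taken from the patch, and the
defrag case is left out. It shows why layer3 is fragment safe (the hash
never looks at ports) and why layer3-layer4 has to give up on fragments.

```c
#include <assert.h>
#include <stdint.h>

enum demo_method { DEMO_L3, DEMO_L3_L4 };

struct demo_pkt {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	int is_fragment;	/* ports are not reliably present in fragments */
};

/* Toy mixing step, an assumption for illustration; not the kernel's hash. */
static uint32_t demo_mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	return h ^ (h >> 15);
}

/*
 * Returns 1 and stores the hash in *out, or returns 0 when the chosen
 * method cannot hash this packet (layer3-layer4 on a fragment).
 */
static int demo_hash(enum demo_method m, const struct demo_pkt *p,
		     uint32_t *out)
{
	uint32_t h;

	assert(p && out);
	if (m == DEMO_L3_L4 && p->is_fragment)
		return 0;
	h = demo_mix(demo_mix(0, p->saddr), p->daddr);
	if (m == DEMO_L3_L4)
		h = demo_mix(h, ((uint32_t)p->sport << 16) | p->dport);
	*out = h;
	return 1;
}
```

With DEMO_L3 a fragment and an unfragmented packet of the same flow get
the same hash; with DEMO_L3_L4 the fragment gets no hash at all, which is
the behaviour the configuration would then state explicitly.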
Hans Schillstrom Jan. 24, 2012, 5:56 p.m. UTC | #6
On Monday 23 January 2012 18:01:50 Pablo Neira Ayuso wrote:
> Hi Hans,
> 
> On Mon, Jan 23, 2012 at 10:49:16AM +0100, Hans Schillstrom wrote:
> > On Monday 23 January 2012 10:12:41 Pablo Neira Ayuso wrote:
> > > On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> > > > The text should clarify that this is valid for the fragments not the "flow"
> > > > 
> > > > > I've got one scenario that may break with this assumption:
> > > > > 
> > > > > 1) your traffic follows one path over router A and B to reach your
> > > > >    firewall F which requires no fragmentation at all.
> > 
> > I missed the last part here  "requires no fragmentation at all"
> > 
> > > > > 2) path to router B becomes broken while there are established flows
> > > > >    with firewall F.
> > > > > 3) router A decides to forward packets to router C, which fragment
> > > > >    packets because it is using smaller MTU than router A.
> > > > > 4) packets arrive to firewall F, then hashing is calculated based on
> > > > >    addresses, not ports, and you load-sharing becomes inconsistent.
> > > > > 
> > > > > This can rarely happen, but it does, it would break.
> > > > > 
> > > > > To fix this, I think that HMARK requires that you have to specify the
> > > > > hashing strategy. If you want to support fragments, use only
> > > > > addresses. If you're sure you will not get fragments, use layer 3 and
> > > > > layer 4 information.
> > 
> > This can be acomplished by setting --hmark-sp-mask and --hmark-dp-mask to Zero
> > Then you don't use port in the hash calc.
> 
> OK, it would be great if we can provide a cleaner interface. The
> current behaviour uses layer3-layer4 tuple hashing plus defaulting to
> layer3 in case of fragments.
> 
> I'd prefer explicit configuration options:
> 
> --hashmark-method layer3
>         use only address for hashing, this is fragment safe.
> 
> --hashmark-method layer3-layer4
>         use addresses and ports for hashing, fragments not supported
>         unless defrag is enabled.
> 
> Still, if you want to support the current behaviour, it should be
> something like:
> 
> --hashmark-method layer3-layer4-fragments
>         use addresses and ports for hashing, for fragments default to
>         layer3 hashing. Document scenario in which hash consistency
>         may break.
> 
> The behaviour of the target has to be specified by the configurations.
> Defaulting to internal assumptions seems obscure to me.
> 
OK, this is reasonable, and it makes the fragment problem visible.

I'll make the changes today and have a test run for a couple of days.
Or should I wait?

Thanks
Hans 
Pablo Neira Ayuso Jan. 24, 2012, 6:15 p.m. UTC | #7
On Tue, Jan 24, 2012 at 06:56:10PM +0100, Hans Schillstrom wrote:
> On Monday 23 January 2012 18:01:50 Pablo Neira Ayuso wrote:
> > Hi Hans,
> > 
> > On Mon, Jan 23, 2012 at 10:49:16AM +0100, Hans Schillstrom wrote:
> > > On Monday 23 January 2012 10:12:41 Pablo Neira Ayuso wrote:
> > > > On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> > > > > The text should clarify that this is valid for the fragments not the "flow"
> > > > > 
> > > > > > I've got one scenario that may break with this assumption:
> > > > > > 
> > > > > > 1) your traffic follows one path over router A and B to reach your
> > > > > >    firewall F which requires no fragmentation at all.
> > > 
> > > I missed the last part here  "requires no fragmentation at all"
> > > 
> > > > > > 2) path to router B becomes broken while there are established flows
> > > > > >    with firewall F.
> > > > > > 3) router A decides to forward packets to router C, which fragment
> > > > > >    packets because it is using smaller MTU than router A.
> > > > > > 4) packets arrive to firewall F, then hashing is calculated based on
> > > > > >    addresses, not ports, and you load-sharing becomes inconsistent.
> > > > > > 
> > > > > > This can rarely happen, but it does, it would break.
> > > > > > 
> > > > > > To fix this, I think that HMARK requires that you have to specify the
> > > > > > hashing strategy. If you want to support fragments, use only
> > > > > > addresses. If you're sure you will not get fragments, use layer 3 and
> > > > > > layer 4 information.
> > > 
> > > This can be acomplished by setting --hmark-sp-mask and --hmark-dp-mask to Zero
> > > Then you don't use port in the hash calc.
> > 
> > OK, it would be great if we can provide a cleaner interface. The
> > current behaviour uses layer3-layer4 tuple hashing plus defaulting to
> > layer3 in case of fragments.
> > 
> > I'd prefer explicit configuration options:
> > 
> > --hashmark-method layer3
> >         use only address for hashing, this is fragment safe.
> > 
> > --hashmark-method layer3-layer4
> >         use addresses and ports for hashing, fragments not supported
> >         unless defrag is enabled.
> > 
> > Still, if you want to support the current behaviour, it should be
> > something like:
> > 
> > --hashmark-method layer3-layer4-fragments
> >         use addresses and ports for hashing, for fragments default to
> >         layer3 hashing. Document scenario in which hash consistency
> >         may break.
> > 
> > The behaviour of the target has to be specified by the configurations.
> > Defaulting to internal assumptions seems obscure to me.
> > 
> OK this is resonable, and it makes the fragment problem visible.
> 
> I'll make the changes to day and have a test run for a couple of days.

Fine, thanks Hans.
Hans Schillstrom Jan. 25, 2012, 10:14 a.m. UTC | #8
On Tuesday 24 January 2012 19:15:40 Pablo Neira Ayuso wrote:
> On Tue, Jan 24, 2012 at 06:56:10PM +0100, Hans Schillstrom wrote:
> > On Monday 23 January 2012 18:01:50 Pablo Neira Ayuso wrote:
> > > Hi Hans,
> > > 
> > > On Mon, Jan 23, 2012 at 10:49:16AM +0100, Hans Schillstrom wrote:
> > > > On Monday 23 January 2012 10:12:41 Pablo Neira Ayuso wrote:
> > > > > On Mon, Jan 23, 2012 at 12:20:15AM +0100, Hans Schillstrom wrote:
> > > > > > The text should clarify that this is valid for the fragments not the "flow"
> > > > > > 
> > > > > > > I've got one scenario that may break with this assumption:
> > > > > > > 
> > > > > > > 1) your traffic follows one path over router A and B to reach your
> > > > > > >    firewall F which requires no fragmentation at all.
> > > > 
> > > > I missed the last part here  "requires no fragmentation at all"
> > > > 
> > > > > > > 2) path to router B becomes broken while there are established flows
> > > > > > >    with firewall F.
> > > > > > > 3) router A decides to forward packets to router C, which fragment
> > > > > > >    packets because it is using smaller MTU than router A.
> > > > > > > 4) packets arrive to firewall F, then hashing is calculated based on
> > > > > > >    addresses, not ports, and you load-sharing becomes inconsistent.
> > > > > > > 
> > > > > > > This can rarely happen, but it does, it would break.
> > > > > > > 
> > > > > > > To fix this, I think that HMARK requires that you have to specify the
> > > > > > > hashing strategy. If you want to support fragments, use only
> > > > > > > addresses. If you're sure you will not get fragments, use layer 3 and
> > > > > > > layer 4 information.
> > > > 
> > > > This can be acomplished by setting --hmark-sp-mask and --hmark-dp-mask to Zero
> > > > Then you don't use port in the hash calc.
> > > 
> > > OK, it would be great if we can provide a cleaner interface. The
> > > current behaviour uses layer3-layer4 tuple hashing plus defaulting to
> > > layer3 in case of fragments.
> > > 
> > > I'd prefer explicit configuration options:
> > > 
> > > --hashmark-method layer3
> > >         use only address for hashing, this is fragment safe.
> > > 
> > > --hashmark-method layer3-layer4
> > >         use addresses and ports for hashing, fragments not supported
> > >         unless defrag is enabled.
> > > 
> > > Still, if you want to support the current behaviour, it should be
> > > something like:
> > > 

We skip this option;
fragments can be caught by iptables rules and fed to an HMARK rule
with the --hmark-method L3 option.
It's clearer.

> > > --hashmark-method layer3-layer4-fragments
> > >         use addresses and ports for hashing, for fragments default to
> > >         layer3 hashing. Document scenario in which hash consistency
> > >         may break.
> > > 
> > > The behaviour of the target has to be specified by the configurations.
> > > Defaulting to internal assumptions seems obscure to me.
> > > 
> > OK this is resonable, and it makes the fragment problem visible.
> > 
> > I'll make the changes to day and have a test run for a couple of days.
> 
> Fine, thanks Hans.

Here is help text and man page just to clarify the changes:
Is this clear enough ?

HMARK target options, i.e. modify hash calculation by:
  --hmark-method <method>            Overall L3/L4 and fragment behavior
                 L3                  Fragment safe, do not use ports or protocol
                                     i.e  Fragments don't need special care.

                 L3-4 (Default)      Fragment unsafe, use ports and protocol
                                     if defrag is off in conntrack
                                        no hmark produced on any part of fragments.
  Limit/modify the calculated hash mark by:
  --hmark-mod value                  nfmark modulus value
  --hmark-offs value                 Last action add value to nfmark

 Fine tuning of what will be included in hash calculation
  --hmark-smask length               Source address mask length
  --hmark-dmask length               Dest address mask length
  --hmark-sp-mask value              Mask src port with value
  --hmark-dp-mask value              Mask dst port with value
  --hmark-spi-mask value             For esp and ah AND spi with value
  --hmark-sp-set value               OR src port with value
  --hmark-dp-set value               OR dst port with value
  --hmark-spi-set value              For esp and ah OR spi with value
  --hmark-proto-mask value           Mask Protocol with value
  --hmark-rnd                        Initial Random value to hash calc.
 For NAT in IPv4 the original address can be used in the return path.
 Make sure to qualify the statement in a proper way when using nat flags
  --hmark-dnat                       Replace src addr with original dst addr
  --hmark-snat                       Replace dst addr with original src addr
 In many cases hmark can be omitted i.e. --smask can be used


MAN PAGE

   HMARK
       This module does the same as MARK, i.e. sets an fwmark, but the mark is based on a hash value.
       The hash is based on saddr, daddr, sport, dport and proto. The same mark will be produced independent
       of direction if no masks are set or the same masks are used for src and dest.
       The hash mark can be adjusted by a modulus and finally an offset can be added, i.e. the final mark will
       be within a range.  ICMP errors use the original message for the hash calculation, not the ICMP message itself.

       Note: None of the parameters affect the packet itself, only the calculated hash value.
             IPv4 packets will be defragmented before they reach hmark if nf_defrag_ipv4 is loaded.
             IPv6 nf_defrag is not implemented this way, hence fragmented IPv6 packets will reach hmark.
             The default behavior is to completely ignore any fragment that reaches hmark.
             --hmark-method L3 is fragment safe since neither ports nor the L4 protocol are used.
             

       Parameters: Short hand methods

       --hmark-method L3
              Do not use proto, ports or spi, only Layer 3 addresses; the mask length of L3 addresses can still be used.
              Whether a packet is a fragment does not matter in this case, since only L3 addresses are used in the hash calculation.

       --hmark-method L3-4
              Default method. Include L4 in the hash calculation, i.e. all masks below are valid.
              Fragments will be ignored (i.e. no hash value is produced).

       For all masks the default is all ones; to disable a field use mask 0.

       --hmark-smask length
              The length of the mask to AND the source address with (saddr & value).

       --hmark-dmask length
              The length of the mask to AND the dest. address with (daddr & value).

       --hmark-sp-mask value
              A 16 bit value to AND the src port with (sport & value).

       --hmark-dp-mask value
              A 16 bit value to AND the dest port with (dport & value).

       --hmark-sp-set value
              A 16 bit value to OR the src port with (sport | value).

       --hmark-dp-set value
              A 16 bit value to OR the dest port with (dport | value).

       --hmark-spi-mask value
              Value to AND the spi field with (spi & value) valid for proto esp or ah.

       --hmark-spi-set value
              Value to OR the spi field with (spi | value) valid for proto esp or ah.

       --hmark-proto-mask value
              An 8 bit value to AND the L4 proto field with (proto & value).

       --hmark-rnd value
              A 32 bit initial value for hash calc, default is 0xc175a3b8.

       Final processing of the mark in order of execution.

       --hmark-mod value (must be > 0)
              The easiest way to describe this is:  hash = hash mod <value>

       --hmark-offs value
              The easiest way to describe this is:  hash = hash + <value>

       Examples:

       Default rule handles all TCP, UDP, SCTP, ESP & AH

              iptables -t mangle -A PREROUTING -j HMARK --hmark-offs 10000 --hmark-mod 10

       Handle SCTP and hash dest port only and produce a nfmark between 100-119.

              iptables -t mangle -A PREROUTING -p SCTP -j HMARK --smask 0 --dmask 0 \
               --sp-mask 0 --offs 100 --mod 20

       No defragmentation by conntrack: non-fragments will have fwmark 100-119,
       fragments will have fwmark 120-139 (based on saddr and daddr only)

              iptables -t mangle -A PREROUTING -j HMARK --method L3-4 --mod 20 --offs 100
              iptables -t mangle -A PREROUTING -m mark --mark 0 -j HMARK --method L3 --mod 20 --offs 120

       Fragment-safe, Layer 3 only, keeping a class C network together

              iptables -t mangle -A PREROUTING -j HMARK --method L3 --smask 24 --mod 20 --offs 100
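
For readers who want the arithmetic behind "same mark independent of
direction" and "within a range", here is a small userspace sketch. It is
only one possible way to get a direction-independent hash (ordering the
endpoints before mixing is my assumption, as is the toy mixing function);
the final step follows the man page text: hash mod <value>, then add the
offset.

```c
#include <assert.h>
#include <stdint.h>

/* Toy mixing step, an assumption for illustration; not the kernel's hash. */
static uint32_t demo_mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;
	return h ^ (h >> 15);
}

/* Order addresses and ports so that swapping src and dst changes nothing. */
static uint32_t demo_sym_hash(uint32_t saddr, uint32_t daddr,
			      uint16_t sport, uint16_t dport)
{
	uint32_t lo_a = saddr < daddr ? saddr : daddr;
	uint32_t hi_a = saddr < daddr ? daddr : saddr;
	uint16_t lo_p = sport < dport ? sport : dport;
	uint16_t hi_p = sport < dport ? dport : sport;
	uint32_t h = demo_mix(demo_mix(0, lo_a), hi_a);

	return demo_mix(h, ((uint32_t)lo_p << 16) | hi_p);
}

/* Final processing: modulus first, then offset, so the result lies in
 * [offs, offs + mod - 1]. */
static uint32_t demo_fwmark(uint32_t hash, uint32_t mod, uint32_t offs)
{
	assert(mod > 0);
	return hash % mod + offs;
}
```

With a modulus of 20 and an offset of 100, as in the SCTP example, every
result falls in 100-119, and a packet and its reply get the same mark.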
Pablo Neira Ayuso Jan. 25, 2012, 11:49 a.m. UTC | #9
On Wed, Jan 25, 2012 at 11:14:33AM +0100, Hans Schillstrom wrote:
> Here is help text and man page just to clarify the changes:
> Is this clear enough ?
> 
> HMARK target options, i.e. modify hash calculation by:
>   --hmark-method <method>            Overall L3/L4 and fragment behavior
>                  L3                  Fragment safe, do not use ports or protocol
>                                      i.e  Fragments don't need special care.
> 
>                  L3-4 (Default)      Fragment unsafe, use ports and protocol
>                                      if defrag is off in conntrack
>                                         no hmark produced on any part of fragments.

This is fine.

>   Limit/modify the calculated hash mark by:
>   --hmark-mod value                  nfmark modulus value
>   --hmark-offs value                 Last action add value to nfmark
            ^^^^
no need to be cryptic here, just say offset.

>  Fine tuning of what will be included in hash calculation
>   --hmark-smask length               Source address mask length
            ^^^^^

I'd say hmark-src-mask to keep it consistent with the options in
iptables.

>   --hmark-dmask length               Dest address mask length

hmark-dst-mask

>   --hmark-sp-mask value              Mask src port with value

hmark-sport-mask

>   --hmark-dp-mask value              Mask dst port with value

hmark-dport-mask

>   --hmark-spi-mask value             For esp and ah AND spi with value

hmark-ah-spi-mask

>   --hmark-sp-set value               OR src port with value

hmark-sport-or

>   --hmark-dp-set value               OR dst port with value

hmark-dport-or

>   --hmark-spi-set value              For esp and ah OR spi with value

Can these three be useful? Providing lots of options is fine, but they
may confuse users. What do we gain from this?

In other words, is it possible to deploy consistent hashing with some
sane configuration using these options?

>   --hmark-proto-mask value           Mask Protocol with value
                                       ^^^^^^^^^^^ ^^^ ^^^ ^^^^
useful?

>   --hmark-rnd                        Initial Random value to hash cacl.
>  For NAT in IPv4 the original address can be used in the return path.

We'll have IPv6 NAT soon. Please, make sure we can extend HMARK to
support it.

>  Make sure to qualify the statement in a proper way when using nat flags

this description is fine. I'd propose to change the option names
below:

>   --hmark-dnat                       Replace src addr with original dst addr
>   --hmark-snat                       Replace dst addr with original src addr

better:

--hmark-ct-orig-src
--hmark-ct-orig-dst

>  In many cases hmark can be omitted i.e. --smask can be used

Thanks again.
Hans Schillstrom Jan. 25, 2012, 12:28 p.m. UTC | #10
On Wednesday 25 January 2012 12:49:32 Pablo Neira Ayuso wrote:
> On Wed, Jan 25, 2012 at 11:14:33AM +0100, Hans Schillstrom wrote:
> > Here is help text and man page just to clarify the changes:
> > Is this clear enough ?
> > 
> > HMARK target options, i.e. modify hash calculation by:
> >   --hmark-method <method>            Overall L3/L4 and fragment behavior
> >                  L3                  Fragment safe, do not use ports or protocol
> >                                      i.e  Fragments don't need special care.
> > 
> >                  L3-4 (Default)      Fragment unsafe, use ports and protocol
> >                                      if defrag is off in conntrack
> >                                         no hmark produced on any part of fragments.
> 
> This is fine.
> 
> >   Limit/modify the calculated hash mark by:
> >   --hmark-mod value                  nfmark modulus value
> >   --hmark-offs value                 Last action add value to nfmark
>             ^^^^
> no need to be cryptic here, just say offset.

OK
> 
> >  Fine tuning of what will be included in hash calculation
> >   --hmark-smask length               Source address mask length
>             ^^^^^

OK

> 
> I'd say hmark-src-mask to keep it consistent with the options in
> iptables.
> 
> >   --hmark-dmask length               Dest address mask length
> 
> hmark-dst-mask
OK
> 
> >   --hmark-sp-mask value              Mask src port with value
> 
> hmark-sport-mask
OK
> 
> >   --hmark-dp-mask value              Mask dst port with value
> 
> hmark-dport-mask
OK
> 
> >   --hmark-spi-mask value             For esp and ah AND spi with value
> 
> hmark-ah-spi-mask
No, it is for esp as well so I think spi is enough

> 
> >   --hmark-sp-set value               OR src port with value
> 
> hmark-sport-or
> 
> >   --hmark-dp-set value               OR dst port with value
> 
> hmark-dport-or
> 
> >   --hmark-spi-set value              For esp and ah OR spi with value
> 
> These three can be useful? Providing lots of options is fine, but they
> may confuse users. What do we gain from this?
> 
> In other words, is it possible to deploy consistent hashing with some
> sane configuration using these options?

E.g. if you want stickiness between ports, say 80 and 443:
iptables  -p tcp --dport 443 -j HMARK --sport-mask 0 --dport-set 80 ....
iptables  ...  -j HMARK --sport-mask 0 ....

Whether that is useful or not can be discussed.
From my point of view it's not a "MUST".
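
A minimal sketch of the port step discussed here, assuming the AND-then-OR order used in the posted patch (`(uports.v32 & info->pmask.v32) | info->pset.v32`). The helper name `hmark_port_fold` is hypothetical and the values are host-order numbers chosen for illustration. It suggests that the OR value only makes two ports hash alike when the mask first clears the bits in which they differ; whether the userspace option clears the mask implicitly is not stated in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: apply a port mask and
 * then OR in a "set" value, mirroring (value & mask) | set.
 */
static uint16_t hmark_port_fold(uint16_t port, uint16_t mask, uint16_t set)
{
	return (uint16_t)((port & mask) | set);
}
```

With a mask of 0 and a set value of 80, both port 443 and port 80 fold to 80, so they contribute the same value to the hash. With a mask of 0xffff, port 443 folds to 443 | 80 = 507 and the two ports stay distinct.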

> 
> >   --hmark-proto-mask value           Mask Protocol with value
>                                        ^^^^^^^^^^^ ^^^ ^^^ ^^^^
> useful?
Yes, stickiness between protocols (in most cases --sport-mask needs to be zero),
e.g. SIP uses both TCP and UDP port 5060.

> 
> >   --hmark-rnd                        Initial Random value to hash calc.
> >  For NAT in IPv4 the original address can be used in the return path.
> 
> We'll have IPv6 NAT soon. Please, make sure we can extend HMARK to
> support IPv6 support.

Sure, already tested.

> 
> >  Make sure to qualify the statement in a proper way when using nat flags
> 
> this description is fine. I'd propose to change the option names
> below:
> 
> >   --hmark-dnat                       Replace src addr with original dst addr
> >   --hmark-snat                       Replace dst addr with original src addr
> 
> better:
> 
> --hmark-ct-orig-src
> --hmark-ct-orig-dst

I agree, thanks

> 
> >  In many cases hmark can be omitted i.e. --smask can be used
> 
> Thanks again.
>

Patch

diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
new file mode 100644
index 0000000..366ecce
--- /dev/null
+++ b/include/linux/netfilter/xt_hmark.h
@@ -0,0 +1,62 @@ 
+#ifndef XT_HMARK_H_
+#define XT_HMARK_H_
+
+#include <linux/types.h>
+
+/*
+ * Flags must not start at 0, since 0 is used to mean "none".
+ */
+enum {
+	XT_HMARK_SADR_AND = 1,	/* SNAT & DNAT are used by the kernel module */
+	XT_HMARK_DADR_AND,
+	XT_HMARK_SPI_AND,
+	XT_HMARK_SPI_OR,
+	XT_HMARK_SPORT_AND,
+	XT_HMARK_DPORT_AND,
+	XT_HMARK_SPORT_OR,
+	XT_HMARK_DPORT_OR,
+	XT_HMARK_PROTO_AND,
+	XT_HMARK_RND,
+	XT_HMARK_MODULUS,
+	XT_HMARK_OFFSET,
+	XT_HMARK_USE_SNAT,
+	XT_HMARK_USE_DNAT,
+	XT_F_HMARK_USE_SNAT = 1 << XT_HMARK_USE_SNAT,
+	XT_F_HMARK_USE_DNAT = 1 << XT_HMARK_USE_DNAT,
+	XT_F_HMARK_SADR_AND = 1 << XT_HMARK_SADR_AND,
+	XT_F_HMARK_DADR_AND = 1 << XT_HMARK_DADR_AND,
+	XT_F_HMARK_SPI_AND = 1 << XT_HMARK_SPI_AND,
+	XT_F_HMARK_SPI_OR = 1 << XT_HMARK_SPI_OR,
+	XT_F_HMARK_SPORT_AND = 1 << XT_HMARK_SPORT_AND,
+	XT_F_HMARK_DPORT_AND = 1 << XT_HMARK_DPORT_AND,
+	XT_F_HMARK_SPORT_OR = 1 << XT_HMARK_SPORT_OR,
+	XT_F_HMARK_DPORT_OR = 1 << XT_HMARK_DPORT_OR,
+	XT_F_HMARK_PROTO_AND = 1 << XT_HMARK_PROTO_AND,
+	XT_F_HMARK_RND = 1 << XT_HMARK_RND,
+	XT_F_HMARK_MODULUS = 1 << XT_HMARK_MODULUS,
+	XT_F_HMARK_OFFSET = 1 << XT_HMARK_OFFSET,
+};
+
+union hports {
+	struct {
+		__u16	src;
+		__u16	dst;
+	} p16;
+	__u32	v32;
+};
+
+struct xt_hmark_info {
+	union nf_inet_addr	smask;		/* Source address mask */
+	union nf_inet_addr	dmask;		/* Dest address mask */
+	union hports		pmask;
+	union hports		pset;
+	__u32			spimask;
+	__u32			spiset;
+	__u16			flags;		/* Print out only */
+	__u16			prmask;		/* L4 Proto mask */
+	__u32			hashrnd;
+	__u32			hmod;		/* Modulus */
+	__u32			hoffs;		/* Offset */
+};
+
+#endif /* XT_HMARK_H_ */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index f8ac4ef..dfe84e1 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -488,6 +488,23 @@  config NETFILTER_XT_TARGET_HL
 	since you can easily create immortal packets that loop
 	forever on the network.
 
+config NETFILTER_XT_TARGET_HMARK
+	tristate '"HMARK" target support'
+	depends on NETFILTER_ADVANCED
+	---help---
+	This option adds the "HMARK" target.
+
+	The target allows you to create rules in the "raw" and "mangle" tables
+	which alter the netfilter mark (nfmark) field within a given range.
+	First a 32 bit hash value is generated, then reduced modulo <limit>, and
+	finally an offset is added before it's written to nfmark.
+
+	Prior to routing, the nfmark can influence the routing method (see
+	"Use netfilter MARK value as routing key") and can also be used by
+	other subsystems to change their behavior.
+
+	The mark match can also be used to match nfmark produced by this module.
+
 config NETFILTER_XT_TARGET_IDLETIMER
 	tristate  "IDLETIMER target support"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 40f4c3d..21bc5e8 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -57,6 +57,7 @@  obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_DSCP) += xt_DSCP.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_HL) += xt_HL.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_HMARK) += xt_hmark.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_LED) += xt_LED.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_NFLOG) += xt_NFLOG.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE) += xt_NFQUEUE.o
diff --git a/net/netfilter/xt_hmark.c b/net/netfilter/xt_hmark.c
new file mode 100644
index 0000000..fd73d15
--- /dev/null
+++ b/net/netfilter/xt_hmark.c
@@ -0,0 +1,337 @@ 
+/*
+ * xt_hmark - Netfilter module to set mark as hash value
+ *
+ * (C) 2011 Hans Schillstrom <hans.schillstrom@ericsson.com>
+ *
+ *Description:
+ *	This module calculates a hash value that can be modified by modulus
+ *	and an offset, i.e. it is possible to produce a skb->mark within a range.
+ *	The hash value is based on a direction independent five tuple:
+ *	src & dst addr src & dst ports and protocol.
+ *	However src & dst port can be masked and are not used for fragmented
+ *	packets, ESP and AH don't have ports so SPI will be used instead.
+ *	AH will not use ports even if it might be possible.
+ *	Tunnels - only the outer saddr and daddr will be used.
+ *
+ *	For ICMP error messages the hash mark values will be calculated on
+ *	the source packet, i.e. the packet that caused the error (if a
+ *	sufficient amount of data exists).
+ *
+ *Note:	None of the fragments will include ports/spi in the calculation of
+ *	the hash value (i.e. all fragments must get the same hash value).
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of the GNU General Public License version 2 as
+ *	published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <net/ip.h>
+#include <linux/icmp.h>
+
+#include <linux/netfilter/xt_hmark.h>
+#include <linux/netfilter/x_tables.h>
+#if defined(CONFIG_NF_NAT)
+#include <net/netfilter/nf_nat.h>
+#endif
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+#	define WITH_IPV6 1
+#include <net/ipv6.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
+#endif
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Hans Schillstrom <hans.schillstrom@ericsson.com>");
+MODULE_DESCRIPTION("Xtables: packet range mark operations by hash value");
+MODULE_ALIAS("ipt_HMARK");
+MODULE_ALIAS("ip6t_HMARK");
+
+/*
+ * ICMP, get inner header so calc can be made on the source message
+ *
+ * iphsz: ip header size in bytes
+ * nhoff: network header offset
+ * return: updated nhoff if an icmp error, otherwise nhoff unchanged
+ */
+static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
+{
+	const struct icmphdr *icmph;
+	struct icmphdr _ih;
+
+	/* Not enough header? */
+	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
+	if (icmph == NULL)
+		return nhoff;
+
+	if (icmph->type > NR_ICMP_TYPES)
+		return nhoff;
+
+	/* Error message? */
+	if (icmph->type != ICMP_DEST_UNREACH &&
+	    icmph->type != ICMP_SOURCE_QUENCH &&
+	    icmph->type != ICMP_TIME_EXCEEDED &&
+	    icmph->type != ICMP_PARAMETERPROB &&
+	    icmph->type != ICMP_REDIRECT)
+		return nhoff;
+
+	return nhoff + iphsz + sizeof(_ih);
+}
+
+#ifdef WITH_IPV6
+/* Dummy header used for size calculation of an error header */
+struct _icmpv6_errh {
+	__u8		icmp6_type;
+	__u8		icmp6_code;
+	__u16		icmp6_cksum;
+	__u32		icmp6_nu;
+};
+/*
+ * Get ipv6 header offset if icmp type < 128 i.e. an error.
+ * @param: offset  input: where ICMPv6 header starts
+ *                output: where ipv6 header starts / unchanged.
+ *
+ * Returns true if it's an icmp error
+ *              and updates *offset to where ipv6 header starts
+ */
+static int get_inner6_hdr(struct sk_buff *skb, int *offset)
+{
+	struct icmp6hdr *icmp6h, _ih6;
+
+	icmp6h = skb_header_pointer(skb, *offset, sizeof(_ih6), &_ih6);
+	if (icmp6h == NULL)
+		return 0;
+
+	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
+		*offset +=  sizeof(struct _icmpv6_errh);
+		return 1;
+	}
+	return 0;
+}
+/*
+ * Calculate hash based fw-mark, on the five tuple if possible.
+ * special cases :
+ *  - Fragments do not use ports, not even on the first fragment;
+ *    nf_defrag_ipv6.ko doesn't defragment for us like it does in IPv4.
+ *    This might be changed in the future.
+ *  - On ICMP errors the inner header will be used.
+ *  - Tunnels: no ports
+ *  - ESP & AH use SPI
+ * @returns XT_CONTINUE
+ */
+__u32 hmark_v6(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
+	struct ipv6hdr *ip6, _ip6;
+	int poff, flag = IP6T_FH_F_AUTH; /* Ports offset, find_hdr flags */
+	u32 addr1, addr2, hash, nhoffs = 0;
+	int nexthdr;	/* int: ipv6_find_hdr() returns < 0 on error */
+	union hports uports = { .v32 = 0 };
+	unsigned short fragoff = 0;
+
+	if (!info->hmod)
+		return XT_CONTINUE;
+
+	ip6 = (struct ipv6hdr *) (skb->data + skb_network_offset(skb));
+
+	/* Try to get transport header */
+	nexthdr = ipv6_find_hdr(skb, &nhoffs, -1, &fragoff, &flag);
+	if (nexthdr < 0)
+		return XT_CONTINUE;
+	/* don't check for icmp on fragments */
+	if ((flag & IP6T_FH_F_FRAG) || (nexthdr != IPPROTO_ICMPV6))
+		goto noicmp;
+	/* ICMP: if an error then move ptr to inner header */
+	if (get_inner6_hdr(skb, &nhoffs)) {
+		/* Get IPv6 header ptr just to get the saddr & daddr later */
+		ip6 = skb_header_pointer(skb, nhoffs, sizeof(_ip6), &_ip6);
+		if (!ip6)
+			return XT_CONTINUE;
+		/* Treat AH as ESP */
+		flag = IP6T_FH_F_AUTH;
+		nexthdr = ipv6_find_hdr(skb, &nhoffs, -1, &fragoff, &flag);
+		if (nexthdr < 0)
+			return XT_CONTINUE;
+	}
+noicmp:
+	/* Mask off the addresses and XOR each one into a u32 */
+	addr1 = (__force u32)
+		(ip6->saddr.s6_addr32[0] & info->smask.in6.s6_addr32[0]) ^
+		(ip6->saddr.s6_addr32[1] & info->smask.in6.s6_addr32[1]) ^
+		(ip6->saddr.s6_addr32[2] & info->smask.in6.s6_addr32[2]) ^
+		(ip6->saddr.s6_addr32[3] & info->smask.in6.s6_addr32[3]);
+	addr2 = (__force u32)
+		(ip6->daddr.s6_addr32[0] & info->dmask.in6.s6_addr32[0]) ^
+		(ip6->daddr.s6_addr32[1] & info->dmask.in6.s6_addr32[1]) ^
+		(ip6->daddr.s6_addr32[2] & info->dmask.in6.s6_addr32[2]) ^
+		(ip6->daddr.s6_addr32[3] & info->dmask.in6.s6_addr32[3]);
+
+	/* Is next header valid for port or SPI calculation ? */
+	poff = proto_ports_offset(nexthdr);
+	if ((flag & IP6T_FH_F_FRAG) || poff < 0)
+		goto no6ports;
+	nhoffs += poff;
+	/* Since uports is modified, skb_header_pointer() can't be used */
+	if (!pskb_may_pull(skb, nhoffs + 4))
+		goto no6ports;
+	uports.v32 = * (__force u32 *) (skb->data + nhoffs);
+
+	if ((nexthdr == IPPROTO_ESP) || (nexthdr == IPPROTO_AH)) {
+		uports.v32 = (uports.v32 & info->spimask) | info->spiset;
+	} else {
+		uports.v32 = (uports.v32 & info->pmask.v32) | info->pset.v32;
+		/* get a consistent hash (same value on both flow directions) */
+		if (uports.p16.dst < uports.p16.src)
+			swap(uports.p16.dst, uports.p16.src);
+	}
+
+no6ports:
+	nexthdr &= info->prmask;
+	/* get a consistent hash (same value on both flow directions) */
+	if (addr2 < addr1)
+		swap(addr1, addr2);
+
+	hash = jhash_3words(addr1, addr2, uports.v32, info->hashrnd) ^ nexthdr;
+	skb->mark = (hash % info->hmod) + info->hoffs;
+	return XT_CONTINUE;
+}
+#endif
+/*
+ * Calculate hash based fw-mark, on the five tuple if possible.
+ * special cases :
+ *  - Fragments do not use ports, not even on the first fragment,
+ *    unless nf_defrag_xx.ko is used.
+ *  - On ICMP errors the inner header will be used.
+ *  - Tunnels: no ports
+ *  - ESP & AH use SPI
+ * @returns XT_CONTINUE
+ */
+unsigned int hmark_v4(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
+	int nhoff, poff, frag = 0;
+	struct iphdr *ip, _ip;
+	u8 ip_proto;
+	u32 addr1, addr2, hash;
+	u16 snatport = 0, dnatport = 0;
+#if defined(CONFIG_NF_NAT)
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+#endif
+	union hports uports;
+
+	if (!info->hmod)
+		return XT_CONTINUE;
+
+	nhoff = skb_network_offset(skb);
+	uports.v32 = 0;
+
+	ip = (struct iphdr *) (skb->data + nhoff);
+	if (ip->protocol == IPPROTO_ICMP) {
+		/* calc hash on inner header if an icmp error */
+		nhoff = get_inner_hdr(skb, ip->ihl * 4, nhoff);
+		ip = skb_header_pointer(skb, nhoff, sizeof(_ip), &_ip);
+		if (!ip)
+			return XT_CONTINUE;
+	}
+
+	ip_proto = ip->protocol;
+	if (ip->frag_off & htons(IP_MF | IP_OFFSET))
+		frag = 1;
+
+	addr1 = (__force u32) ip->saddr & info->smask.ip;
+	addr2 = (__force u32) ip->daddr & info->dmask.ip;
+
+#if defined(CONFIG_NF_NAT)
+	if (ct && test_bit(IP_CT_IS_REPLY, &ct->status)) {
+		struct nf_conntrack_tuple *otuple;
+
+		otuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+		/*
+		 * On the "return flow", to get the original address
+		 */
+		if ((ct->status & IPS_DST_NAT) &&
+			(info->flags & XT_F_HMARK_USE_DNAT)) {
+			addr1 = (__force u32) otuple->dst.u3.in.s_addr;
+			dnatport = otuple->dst.u.udp.port;
+		}
+		if ((ct->status & IPS_SRC_NAT) &&
+			(info->flags & XT_F_HMARK_USE_SNAT)) {
+			addr2 = (__force u32) otuple->src.u3.in.s_addr;
+			snatport = otuple->src.u.udp.port;
+		}
+	}
+#endif
+	/* Check if ports can be used in hash calculation. */
+	poff = proto_ports_offset(ip_proto);
+	if (frag || poff < 0)
+		goto noports;
+
+	nhoff += (ip->ihl * 4) + poff;
+	if (!pskb_may_pull(skb, nhoff + 4))
+		goto noports;
+
+	uports.v32 = * (__force u32 *) (skb->data + nhoff);
+	if (ip_proto == IPPROTO_ESP || ip_proto == IPPROTO_AH) {
+		uports.v32 = (uports.v32 & info->spimask) | info->spiset;
+	} else {
+		if (snatport)	/* Replace nat'ed port(s) */
+			uports.p16.dst = snatport;
+		if (dnatport)
+			uports.p16.src = dnatport;
+		uports.v32 = (uports.v32 & info->pmask.v32) |
+				info->pset.v32;
+		/* get a consistent hash (same value on both flow directions) */
+		if (uports.p16.dst < uports.p16.src)
+			swap(uports.p16.src, uports.p16.dst);
+	}
+
+noports:
+	ip_proto &= info->prmask;
+	/* get a consistent hash (same value on both flow directions) */
+	if (addr2 < addr1)
+		swap(addr1, addr2);
+
+	hash = jhash_3words(addr1, addr2, uports.v32, info->hashrnd) ^ ip_proto;
+	skb->mark = (hash % info->hmod) + info->hoffs;
+	return XT_CONTINUE;
+}
+
+static struct xt_target hmark_tg_reg[] __read_mostly = {
+	{
+		.name           = "HMARK",
+		.revision       = 0,
+		.family         = NFPROTO_IPV4,
+		.target         = hmark_v4,
+		.targetsize     = sizeof(struct xt_hmark_info),
+		.me             = THIS_MODULE,
+	},
+#ifdef WITH_IPV6
+	{
+		.name           = "HMARK",
+		.revision       = 0,
+		.family         = NFPROTO_IPV6,
+		.target         = hmark_v6,
+		.targetsize     = sizeof(struct xt_hmark_info),
+		.me             = THIS_MODULE,
+	},
+#endif
+};
+
+static int __init hmark_mt_init(void)
+{
+	int ret;
+
+	ret = xt_register_targets(hmark_tg_reg, ARRAY_SIZE(hmark_tg_reg));
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+static void __exit hmark_mt_exit(void)
+{
+	xt_unregister_targets(hmark_tg_reg, ARRAY_SIZE(hmark_tg_reg));
+}
+
+module_init(hmark_mt_init);
+module_exit(hmark_mt_exit);
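
The mark computation at the end of hmark_v4() and hmark_v6() can be sketched in userspace as follows. This is an illustration under stated assumptions, not the module's code: `mix3` is a stand-in mixer and is NOT the kernel's jhash_3words(), `hmark_sketch` is a hypothetical name, and addresses and ports are plain host-order integers. It shows the three steps the patch describes: order the endpoints so that both flow directions hash alike, hash, then reduce by the modulus and add the offset.

```c
#include <stdint.h>

/* Stand-in 32-bit mixer (assumption): NOT the kernel's jhash_3words(). */
static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c, uint32_t seed)
{
	const uint32_t w[3] = { a, b, c };
	uint32_t h = seed;

	for (int i = 0; i < 3; i++) {
		h ^= w[i];
		h *= 0x9e3779b1u;	/* unsigned multiply wraps mod 2^32 */
		h ^= h >> 15;
	}
	return h;
}

/*
 * Hypothetical sketch of the mark calculation: returns a value in
 * [offset, offset + mod). Returns 0 when mod is 0; the patch leaves
 * the mark untouched in that case.
 */
static uint32_t hmark_sketch(uint32_t saddr, uint32_t daddr,
			     uint16_t sport, uint16_t dport,
			     uint8_t proto, uint32_t seed,
			     uint32_t mod, uint32_t offset)
{
	uint32_t ports;

	if (mod == 0)
		return 0;

	/* Order addresses and ports independently, as the patch does. */
	if (daddr < saddr) {
		uint32_t t = saddr; saddr = daddr; daddr = t;
	}
	if (dport < sport) {
		uint16_t t = sport; sport = dport; dport = t;
	}
	ports = ((uint32_t)sport << 16) | dport;

	return ((mix3(saddr, daddr, ports, seed) ^ proto) % mod) + offset;
}
```

Calling `hmark_sketch()` for a flow and for its reverse direction (addresses and ports swapped) gives the same mark, and with `mod = 10`, `offset = 100` every mark falls in 100..109.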