diff mbox

[nf-next] netfilter: nft_queue: add _SREG_FROM and _SRGE_TO to select the queue numbers

Message ID 1473602728-33502-2-git-send-email-zlpnobody@163.com
State Changes Requested
Delegated to: Pablo Neira
Headers show

Commit Message

Liping Zhang Sept. 11, 2016, 2:05 p.m. UTC
From: Liping Zhang <liping.zhang@spreadtrum.com>

Currently, the user can specify the queue numbers by _QUEUE_NUM and
_QUEUE_TOTAL attributes, this is enough in most situations.

But acctually, it is not very flexible, for example:
  tcp dport 80 mapped to queue0
  tcp dport 81 mapped to queue1
  tcp dport 82 mapped to queue2
In order to do this thing, we must add 3 nft rules, and more
mapping means more rules ...

So similer to nft_nat, take two registers to select the queue numbers,
then we can add one simple rule to mapping queues, maybe like this:
  queue num tcp dport map { 80:0, 81:1, 82:2 ... }

Another drawback is that queue range 0-65535 is not available, because
queue_total is u16 type, range 0-65535 means queue_total is 65536, then
it is overflowed to 0. Now there's no such restriction when we use
_SREG_FROM and _SREG_TO.

For compatibility, _SREG_FROM and _QUEUE_NUM can be specified at the same
time, but we will be preferred to use _SREG_FROM attr.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
---
 include/uapi/linux/netfilter/nf_tables.h |  4 ++
 net/netfilter/nft_queue.c                | 99 +++++++++++++++++++++++++-------
 2 files changed, 82 insertions(+), 21 deletions(-)

Comments

Florian Westphal Sept. 11, 2016, 9:12 p.m. UTC | #1
Liping Zhang <zlpnobody@163.com> wrote:
> From: Liping Zhang <liping.zhang@spreadtrum.com>
> 
> Currently, the user can specify the queue numbers by _QUEUE_NUM and
> _QUEUE_TOTAL attributes, this is enough in most situations.
> 
> But acctually, it is not very flexible, for example:
>   tcp dport 80 mapped to queue0
>   tcp dport 81 mapped to queue1
>   tcp dport 82 mapped to queue2
> In order to do this thing, we must add 3 nft rules, and more
> mapping means more rules ...
> 
> So similer to nft_nat, take two registers to select the queue numbers,
> then we can add one simple rule to mapping queues, maybe like this:
>   queue num tcp dport map { 80:0, 81:1, 82:2 ... }

I like this.

My first thought was that it would be better to just support one single
sreg (the queue number) and eventually externalize the hashing/queue
selection:

queue num jhash ip saddr . ip daddr mod ...

Problem is that with plain jhash we won't get a symmetric hash
for origin and reply, so for this we would need a new expression/hash
mode.

We would also need another expression to allow distribution
starting with a queue other than 0.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liping Zhang Sept. 12, 2016, 2:19 a.m. UTC | #2
2016-09-12 5:12 GMT+08:00 Florian Westphal <fw@strlen.de>:
> Liping Zhang <zlpnobody@163.com> wrote:
>> So similer to nft_nat, take two registers to select the queue numbers,
>> then we can add one simple rule to mapping queues, maybe like this:
>>   queue num tcp dport map { 80:0, 81:1, 82:2 ... }
>
> I like this.
>
> My first thought was that it would be better to just support one single
> sreg (the queue number) and eventually externalize the hashing/queue
> selection:
>
> queue num jhash ip saddr . ip daddr mod ...

Sounds good.

At first, my another intention is use _SREG_FROM and _SREG_TO to replace
_QUEUE_NUM and _QUEUE_TOTAL, there's no restriction to use range 0-65535:

[ immediate reg 1 0x00000000 ]
[ immediate reg 2 0x0000ffff ]
[ queue num 0 reg_from 1 reg_to 2 ]

But I think your "queue num jhash ip saddr . ip daddr mod ..." is more
flexible and
there's no restriction to use range 0-65535 too.

I agree with you, one sreg seems enough. I will send V2 later.

> Problem is that with plain jhash we won't get a symmetric hash
> for origin and reply, so for this we would need a new expression/hash
> mode.
>
> We would also need another expression to allow distribution
> starting with a queue other than 0.

I think Laura is developing this option, see
https://patchwork.ozlabs.org/patch/666334/.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
nevola Sept. 12, 2016, 10:53 a.m. UTC | #3
On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> Liping Zhang <zlpnobody@163.com> wrote:
> > From: Liping Zhang <liping.zhang@spreadtrum.com>
> > 
> > Currently, the user can specify the queue numbers by _QUEUE_NUM and
> > _QUEUE_TOTAL attributes, this is enough in most situations.
> > 
> > But acctually, it is not very flexible, for example:
> >   tcp dport 80 mapped to queue0
> >   tcp dport 81 mapped to queue1
> >   tcp dport 82 mapped to queue2
> > In order to do this thing, we must add 3 nft rules, and more
> > mapping means more rules ...
> > 
> > So similer to nft_nat, take two registers to select the queue numbers,
> > then we can add one simple rule to mapping queues, maybe like this:
> >   queue num tcp dport map { 80:0, 81:1, 82:2 ... }
> 
> I like this.
> 
> My first thought was that it would be better to just support one single
> sreg (the queue number) and eventually externalize the hashing/queue
> selection:
> 
> queue num jhash ip saddr . ip daddr mod ...
> 
> Problem is that with plain jhash we won't get a symmetric hash
> for origin and reply, so for this we would need a new expression/hash
> mode.
> 
> We would also need another expression to allow distribution
> starting with a queue other than 0.

For such feature, I've already sent a patch "netfilter: nft_hash: Add
hash offset value" in order to set an initial value for the hash
expression. I think it'll be available in the tree soon.

> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Sept. 12, 2016, 12:09 p.m. UTC | #4
On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> Liping Zhang <zlpnobody@163.com> wrote:
> > From: Liping Zhang <liping.zhang@spreadtrum.com>
> > 
> > Currently, the user can specify the queue numbers by _QUEUE_NUM and
> > _QUEUE_TOTAL attributes, this is enough in most situations.
> > 
> > But acctually, it is not very flexible, for example:
> >   tcp dport 80 mapped to queue0
> >   tcp dport 81 mapped to queue1
> >   tcp dport 82 mapped to queue2
> > In order to do this thing, we must add 3 nft rules, and more
> > mapping means more rules ...
> > 
> > So similer to nft_nat, take two registers to select the queue numbers,
> > then we can add one simple rule to mapping queues, maybe like this:
> >   queue num tcp dport map { 80:0, 81:1, 82:2 ... }
> 
> I like this.
> 
> My first thought was that it would be better to just support one single
> sreg (the queue number) and eventually externalize the hashing/queue
> selection:
> 
> queue num jhash ip saddr . ip daddr mod ...
> 
> Problem is that with plain jhash we won't get a symmetric hash
> for origin and reply, so for this we would need a new expression/hash
> mode.

Are you think of xor hashing to provide the symmetry? Downside is that
bad tuple selection may result in poor distribution, but this is
something we can document.

Still it would be possible to use jhash at the cost of having two
rules.

It would be great to get a new NFTA_HASH_MODE attribute and something
like:

enum nft_hash_types {
        NFT_HASH_JENKINS        = 0,
};

So we have room to make other hash algos fit into this expression.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Sept. 12, 2016, 12:22 p.m. UTC | #5
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> > My first thought was that it would be better to just support one single
> > sreg (the queue number) and eventually externalize the hashing/queue
> > selection:
> > 
> > queue num jhash ip saddr . ip daddr mod ...
> > 
> > Problem is that with plain jhash we won't get a symmetric hash
> > for origin and reply, so for this we would need a new expression/hash
> > mode.
> 
> Are you think of xor hashing to provide the symmetry? Downside is that
> bad tuple selection may result in poor distribution, but this is
> something we can document.

No, I was thinking of a new hash mode to do this, e.g. just do same
what current nfqueue selection does: hash lower address first.

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Sept. 12, 2016, 12:28 p.m. UTC | #6
Laura Garcia <nevola@gmail.com> wrote:
> On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> > Liping Zhang <zlpnobody@163.com> wrote:
> > > From: Liping Zhang <liping.zhang@spreadtrum.com>
> > > 
> > > Currently, the user can specify the queue numbers by _QUEUE_NUM and
> > > _QUEUE_TOTAL attributes, this is enough in most situations.
> > > 
> > > But acctually, it is not very flexible, for example:
> > >   tcp dport 80 mapped to queue0
> > >   tcp dport 81 mapped to queue1
> > >   tcp dport 82 mapped to queue2
> > > In order to do this thing, we must add 3 nft rules, and more
> > > mapping means more rules ...
> > > 
> > > So similer to nft_nat, take two registers to select the queue numbers,
> > > then we can add one simple rule to mapping queues, maybe like this:
> > >   queue num tcp dport map { 80:0, 81:1, 82:2 ... }
> > 
> > I like this.
> > 
> > My first thought was that it would be better to just support one single
> > sreg (the queue number) and eventually externalize the hashing/queue
> > selection:
> > 
> > queue num jhash ip saddr . ip daddr mod ...
> > 
> > Problem is that with plain jhash we won't get a symmetric hash
> > for origin and reply, so for this we would need a new expression/hash
> > mode.
> > 
> > We would also need another expression to allow distribution
> > starting with a queue other than 0.
> 
> For such feature, I've already sent a patch "netfilter: nft_hash: Add
> hash offset value" in order to set an initial value for the hash
> expression. I think it'll be available in the tree soon.

Yes, but that means one would always need to use hash.

Maybe thats indeed sufficient since it already provides a modulo
functionality.

I was thinking

nfqueue num meta cpu ...
nfqueue num meta hash ... [ i.e. use skb->hash, not supported right now ]
nfqueue num meta mark
nfqueue num ct mark

etc.  so we would have no control over start qnum (something like
'nfqueue num (meta cpu + 10000)' ...

Admittingly, I do not see a big drawback if we'd always have to
feed it to jhash first.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Sept. 12, 2016, 12:32 p.m. UTC | #7
On Mon, Sep 12, 2016 at 02:22:57PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> > > My first thought was that it would be better to just support one single
> > > sreg (the queue number) and eventually externalize the hashing/queue
> > > selection:
> > > 
> > > queue num jhash ip saddr . ip daddr mod ...
> > > 
> > > Problem is that with plain jhash we won't get a symmetric hash
> > > for origin and reply, so for this we would need a new expression/hash
> > > mode.
> > 
> > Are you think of xor hashing to provide the symmetry? Downside is that
> > bad tuple selection may result in poor distribution, but this is
> > something we can document.
> 
> No, I was thinking of a new hash mode to do this, e.g. just do same
> what current nfqueue selection does: hash lower address first.

Currently we have one single register pointing to the entire tuple
concatenation that we hash, we would need to support multiple
registers as input, check that they are consecutive. Then, the logic
to compare the data. And a way to express this in syntax.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Sept. 12, 2016, 12:39 p.m. UTC | #8
On Mon, Sep 12, 2016 at 02:28:49PM +0200, Florian Westphal wrote:
> Laura Garcia <nevola@gmail.com> wrote:
> > On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> > > Liping Zhang <zlpnobody@163.com> wrote:
> > > > From: Liping Zhang <liping.zhang@spreadtrum.com>
> > > > 
> > > > Currently, the user can specify the queue numbers by _QUEUE_NUM and
> > > > _QUEUE_TOTAL attributes, this is enough in most situations.
> > > > 
> > > > But acctually, it is not very flexible, for example:
> > > >   tcp dport 80 mapped to queue0
> > > >   tcp dport 81 mapped to queue1
> > > >   tcp dport 82 mapped to queue2
> > > > In order to do this thing, we must add 3 nft rules, and more
> > > > mapping means more rules ...
> > > > 
> > > > So similer to nft_nat, take two registers to select the queue numbers,
> > > > then we can add one simple rule to mapping queues, maybe like this:
> > > >   queue num tcp dport map { 80:0, 81:1, 82:2 ... }
> > > 
> > > I like this.
> > > 
> > > My first thought was that it would be better to just support one single
> > > sreg (the queue number) and eventually externalize the hashing/queue
> > > selection:
> > > 
> > > queue num jhash ip saddr . ip daddr mod ...
> > > 
> > > Problem is that with plain jhash we won't get a symmetric hash
> > > for origin and reply, so for this we would need a new expression/hash
> > > mode.
> > > 
> > > We would also need another expression to allow distribution
> > > starting with a queue other than 0.
> > 
> > For such feature, I've already sent a patch "netfilter: nft_hash: Add
> > hash offset value" in order to set an initial value for the hash
> > expression. I think it'll be available in the tree soon.
> 
> Yes, but that means one would always need to use hash.
> 
> Maybe thats indeed sufficient since it already provides a modulo
> functionality.
> 
> I was thinking
> 
> nfqueue num meta cpu ...
> nfqueue num meta hash ... [ i.e. use skb->hash, not supported right now ]
> nfqueue num meta mark
> nfqueue num ct mark
> 
> etc.  so we would have no control over start qnum (something like
> 'nfqueue num (meta cpu + 10000)' ...
>
> Admittingly, I do not see a big drawback if we'd always have to
> feed it to jhash first.

I think it would be easy to add this offset support to meta and ct, so
we adding this value to whatever we fetch. I think a generic
arithmetic expression that allows us to add two values from two
registers is too much.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Sept. 12, 2016, 1:18 p.m. UTC | #9
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Sep 12, 2016 at 02:22:57PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > On Sun, Sep 11, 2016 at 11:12:26PM +0200, Florian Westphal wrote:
> > > > My first thought was that it would be better to just support one single
> > > > sreg (the queue number) and eventually externalize the hashing/queue
> > > > selection:
> > > > 
> > > > queue num jhash ip saddr . ip daddr mod ...
> > > > 
> > > > Problem is that with plain jhash we won't get a symmetric hash
> > > > for origin and reply, so for this we would need a new expression/hash
> > > > mode.
> > > 
> > > Are you think of xor hashing to provide the symmetry? Downside is that
> > > bad tuple selection may result in poor distribution, but this is
> > > something we can document.
> > 
> > No, I was thinking of a new hash mode to do this, e.g. just do same
> > what current nfqueue selection does: hash lower address first.
> 
> Currently we have one single register pointing to the entire tuple
> concatenation that we hash, we would need to support multiple
> registers as input, check that they are consecutive. Then, the logic
> to compare the data. And a way to express this in syntax.

Ugh.  Ok, lets ignore this for now.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Sept. 13, 2016, 9:19 a.m. UTC | #10
Hi Liping,

A bit more comments on top of Florian's suggestion to use one single
_SREG.

On Sun, Sep 11, 2016 at 10:05:28PM +0800, Liping Zhang wrote:
> diff --git a/net/netfilter/nft_queue.c b/net/netfilter/nft_queue.c
> index d16d599..6557118 100644
> --- a/net/netfilter/nft_queue.c
> +++ b/net/netfilter/nft_queue.c
> @@ -22,9 +22,11 @@
>  static u32 jhash_initval __read_mostly;
>  
>  struct nft_queue {
> -	u16	queuenum;
> -	u16	queues_total;
> -	u16	flags;
> +	enum nft_registers	sreg_from:8;
> +	enum nft_registers	sreg_to:8;
> +	u16			queuenum;
> +	u16			queues_total;
> +	u16			flags;
>  };
>  
>  static void nft_queue_eval(const struct nft_expr *expr,
> @@ -32,17 +34,30 @@ static void nft_queue_eval(const struct nft_expr *expr,
>  			   const struct nft_pktinfo *pkt)
>  {
>  	struct nft_queue *priv = nft_expr_priv(expr);
> -	u32 queue = priv->queuenum;
> +	u32 queue, queues_total, queue_end;
>  	u32 ret;
>  
> -	if (priv->queues_total > 1) {
> +	if (priv->sreg_from) {
> +		queue = (u16)regs->data[priv->sreg_from];
> +		queue_end = (u16)regs->data[priv->sreg_to];
> +
> +		if (queue_end > queue)
> +			queues_total = queue_end - queue + 1;
> +		else
> +			queues_total = 1;
> +	} else {
> +		queue = priv->queuenum;
> +		queues_total = priv->queues_total;
> +	}

I suggest you use .select_ops to use a specific _eval function if
_SREG is set. This would disentangle this code a bit.

> +
> +	if (queues_total > 1) {
>  		if (priv->flags & NFT_QUEUE_FLAG_CPU_FANOUT) {
>  			int cpu = smp_processor_id();
>  
> -			queue = priv->queuenum + cpu % priv->queues_total;
> +			queue += cpu % queues_total;
>  		} else {
>  			queue = nfqueue_hash(pkt->skb, queue,
> -					     priv->queues_total, pkt->pf,
> +					     queues_total, pkt->pf,
>  					     jhash_initval);
>  		}
>  	}
> @@ -58,6 +73,8 @@ static const struct nla_policy nft_queue_policy[NFTA_QUEUE_MAX + 1] = {
>  	[NFTA_QUEUE_NUM]	= { .type = NLA_U16 },
>  	[NFTA_QUEUE_TOTAL]	= { .type = NLA_U16 },
>  	[NFTA_QUEUE_FLAGS]	= { .type = NLA_U16 },
> +	[NFTA_QUEUE_SREG_FROM]	= { .type = NLA_U32 },
> +	[NFTA_QUEUE_SREG_TO]	= { .type = NLA_U32 },
>  };
>  
>  static int nft_queue_init(const struct nft_ctx *ctx,
> @@ -66,30 +83,58 @@ static int nft_queue_init(const struct nft_ctx *ctx,
>  {
>  	struct nft_queue *priv = nft_expr_priv(expr);
>  	u32 maxid;
> +	int err;
>  
> -	if (tb[NFTA_QUEUE_NUM] == NULL)
> +	if (!tb[NFTA_QUEUE_NUM] && !tb[NFTA_QUEUE_SREG_FROM])
>  		return -EINVAL;
>  
>  	init_hashrandom(&jhash_initval);
> -	priv->queuenum = ntohs(nla_get_be16(tb[NFTA_QUEUE_NUM]));
>  
> -	if (tb[NFTA_QUEUE_TOTAL] != NULL)
> -		priv->queues_total = ntohs(nla_get_be16(tb[NFTA_QUEUE_TOTAL]));
> -	else
> -		priv->queues_total = 1;
> +	/* nftables has no idea whether the kernel supports _SREG_FROM or not,
> +	 * so for compatibility reason, it may specify the _SREG_FROM and
> +	 * _QUEUE_NUM attributes at the same time. We prefer to use _SREG_FROM,
> +	 * it is more flexible and has less restriction, for example, queue
> +	 * range 0-65535 is ok when use _SREG_FROM and _SREG_TO.
> +	 */

No need for this large comment, look.

From userspace, we have to modify the nft parser to take an expression
as 'num'. Then, from the netlink_linearize path, we can check the
expression type:

* If it's a value expression, we can simply use NFTA_QUEUE_NUM.
* If it's a range expression, we have to set up NFTA_QUEUE_NUM and
  NFTA_QUEUE_TOTAL.
* If it's a mapping, then we only use _SREG.

Old kernels will bail out if the mapping is used, as this is not
supported, which is what we expect. This should be fine by now, until
I finish the patchset to provide a VM description that userspace can
use to generate bytecode depending on the features available in the
kernel.

> +	if (tb[NFTA_QUEUE_SREG_FROM]) {
> +		priv->sreg_from = nft_parse_register(tb[NFTA_QUEUE_SREG_FROM]);
> +		err = nft_validate_register_load(priv->sreg_from, sizeof(u16));
> +		if (err < 0)
> +			return err;
> +
> +		if (tb[NFTA_QUEUE_SREG_TO]) {
> +			priv->sreg_to =
> +				nft_parse_register(tb[NFTA_QUEUE_SREG_TO]);
> +			err = nft_validate_register_load(priv->sreg_to,
> +							 sizeof(u16));
> +			if (err < 0)
> +				return err;
> +		} else {
> +			priv->sreg_to = priv->sreg_from;
> +		}
> +	} else if (tb[NFTA_QUEUE_NUM]) {
> +		priv->queuenum = ntohs(nla_get_be16(tb[NFTA_QUEUE_NUM]));
>  
> -	if (priv->queues_total == 0)
> -		return -EINVAL;
> +		if (tb[NFTA_QUEUE_TOTAL])
> +			priv->queues_total =
> +				ntohs(nla_get_be16(tb[NFTA_QUEUE_TOTAL]));
> +		else
> +			priv->queues_total = 1;
>  
> -	maxid = priv->queues_total - 1 + priv->queuenum;
> -	if (maxid > U16_MAX)
> -		return -ERANGE;
> +		if (priv->queues_total == 0)
> +			return -EINVAL;
> +
> +		maxid = priv->queues_total - 1 + priv->queuenum;
> +		if (maxid > U16_MAX)
> +			return -ERANGE;
> +	}
>  
>  	if (tb[NFTA_QUEUE_FLAGS] != NULL) {
>  		priv->flags = ntohs(nla_get_be16(tb[NFTA_QUEUE_FLAGS]));
>  		if (priv->flags & ~NFT_QUEUE_FLAG_MASK)
>  			return -EINVAL;
>  	}
> +
>  	return 0;
>  }
>  
> @@ -97,9 +142,21 @@ static int nft_queue_dump(struct sk_buff *skb, const struct nft_expr *expr)
>  {
>  	const struct nft_queue *priv = nft_expr_priv(expr);
>  
> -	if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
> -	    nla_put_be16(skb, NFTA_QUEUE_TOTAL, htons(priv->queues_total)) ||
> -	    nla_put_be16(skb, NFTA_QUEUE_FLAGS, htons(priv->flags)))
> +	if (priv->sreg_from) {
> +		if (nft_dump_register(skb, NFTA_QUEUE_SREG_FROM,
> +				      priv->sreg_from))
> +			goto nla_put_failure;
> +		if (nft_dump_register(skb, NFTA_QUEUE_SREG_TO,
> +				      priv->sreg_to))
> +			goto nla_put_failure;
> +	} else {
> +		if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
> +		    nla_put_be16(skb, NFTA_QUEUE_TOTAL,
> +				 htons(priv->queues_total)))
> +			goto nla_put_failure;
> +	}

Looking at this, the code will look better if we should use
.select_ops indeed.

Thanks for resolving existing nft_queue limitations!
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liping Zhang Sept. 13, 2016, 12:19 p.m. UTC | #11
2016-09-13 17:19 GMT+08:00 Pablo Neira Ayuso <pablo@netfilter.org>:
> Hi Liping,
>
> A bit more comments on top of Florian's suggestion to use one single
> _SREG.
>
> On Sun, Sep 11, 2016 at 10:05:28PM +0800, Liping Zhang wrote:
>> diff --git a/net/netfilter/nft_queue.c b/net/netfilter/nft_queue.c
>>  static void nft_queue_eval(const struct nft_expr *expr,
>> @@ -32,17 +34,30 @@ static void nft_queue_eval(const struct nft_expr *expr,
>>                          const struct nft_pktinfo *pkt)
>>  {
>>       struct nft_queue *priv = nft_expr_priv(expr);
>> -     u32 queue = priv->queuenum;
>> +     u32 queue, queues_total, queue_end;
>>       u32 ret;
>>
>> -     if (priv->queues_total > 1) {
>> +     if (priv->sreg_from) {
>> +             queue = (u16)regs->data[priv->sreg_from];
>> +             queue_end = (u16)regs->data[priv->sreg_to];
>> +
>> +             if (queue_end > queue)
>> +                     queues_total = queue_end - queue + 1;
>> +             else
>> +                     queues_total = 1;
>> +     } else {
>> +             queue = priv->queuenum;
>> +             queues_total = priv->queues_total;
>> +     }
>
> I suggest you use .select_ops to use a specific _eval function if
> _SREG is set. This would disentangle this code a bit.
>
>>
>> +     /* nftables has no idea whether the kernel supports _SREG_FROM or not,
>> +      * so for compatibility reason, it may specify the _SREG_FROM and
>> +      * _QUEUE_NUM attributes at the same time. We prefer to use _SREG_FROM,
>> +      * it is more flexible and has less restriction, for example, queue
>> +      * range 0-65535 is ok when use _SREG_FROM and _SREG_TO.
>> +      */
>
> No need for this large comment, look.
>
> From userspace, we have to modify the nft parser to take an expression
> as 'num'. Then, from the netlink_linearize path, we can check the
> expression type:
>
> * If it's a value expression, we can simply use NFTA_QUEUE_NUM.
> * If it's a range expression, we have to set up NFTA_QUEUE_NUM and
>   NFTA_QUEUE_TOTAL.
> * If it's a mapping, then we only use _SREG.
>
> Old kernels will bail out if the mapping is used, as this is not
> supported, which is what we expect. This should be fine by now, until
> I finish the patchset to provide a VM description that userspace can
> use to generate bytecode depending on the features available in the
> kernel.
>
>>
>> @@ -97,9 +142,21 @@ static int nft_queue_dump(struct sk_buff *skb, const struct nft_expr *expr)
>>  {
>>       const struct nft_queue *priv = nft_expr_priv(expr);
>>
>> -     if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
>> -         nla_put_be16(skb, NFTA_QUEUE_TOTAL, htons(priv->queues_total)) ||
>> -         nla_put_be16(skb, NFTA_QUEUE_FLAGS, htons(priv->flags)))
>> +     if (priv->sreg_from) {
>> +             if (nft_dump_register(skb, NFTA_QUEUE_SREG_FROM,
>> +                                   priv->sreg_from))
>> +                     goto nla_put_failure;
>> +             if (nft_dump_register(skb, NFTA_QUEUE_SREG_TO,
>> +                                   priv->sreg_to))
>> +                     goto nla_put_failure;
>> +     } else {
>> +             if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
>> +                 nla_put_be16(skb, NFTA_QUEUE_TOTAL,
>> +                              htons(priv->queues_total)))
>> +                     goto nla_put_failure;
>> +     }
>
> Looking at this, the code will look better if we should use
> .select_ops indeed.
>
> Thanks for resolving existing nft_queue limitations!

I will re-spin my patch based on your and Florian's suggestions.
Thanks Pablo.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 24161e2..5b4e373 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -886,12 +886,16 @@  enum nft_log_attributes {
  * @NFTA_QUEUE_NUM: netlink queue to send messages to (NLA_U16)
  * @NFTA_QUEUE_TOTAL: number of queues to load balance packets on (NLA_U16)
  * @NFTA_QUEUE_FLAGS: various flags (NLA_U16)
+ * @NFTA_QUEUE_SREG_FROM: source register of queue number start (NLA_U32: nft_registers)
+ * @NFTA_QUEUE_SREG_TO: source register of queue number end (NLA_U32: nft_registers)
  */
 enum nft_queue_attributes {
 	NFTA_QUEUE_UNSPEC,
 	NFTA_QUEUE_NUM,
 	NFTA_QUEUE_TOTAL,
 	NFTA_QUEUE_FLAGS,
+	NFTA_QUEUE_SREG_FROM,
+	NFTA_QUEUE_SREG_TO,
 	__NFTA_QUEUE_MAX
 };
 #define NFTA_QUEUE_MAX		(__NFTA_QUEUE_MAX - 1)
diff --git a/net/netfilter/nft_queue.c b/net/netfilter/nft_queue.c
index d16d599..6557118 100644
--- a/net/netfilter/nft_queue.c
+++ b/net/netfilter/nft_queue.c
@@ -22,9 +22,11 @@ 
 static u32 jhash_initval __read_mostly;
 
 struct nft_queue {
-	u16	queuenum;
-	u16	queues_total;
-	u16	flags;
+	enum nft_registers	sreg_from:8;
+	enum nft_registers	sreg_to:8;
+	u16			queuenum;
+	u16			queues_total;
+	u16			flags;
 };
 
 static void nft_queue_eval(const struct nft_expr *expr,
@@ -32,17 +34,30 @@  static void nft_queue_eval(const struct nft_expr *expr,
 			   const struct nft_pktinfo *pkt)
 {
 	struct nft_queue *priv = nft_expr_priv(expr);
-	u32 queue = priv->queuenum;
+	u32 queue, queues_total, queue_end;
 	u32 ret;
 
-	if (priv->queues_total > 1) {
+	if (priv->sreg_from) {
+		queue = (u16)regs->data[priv->sreg_from];
+		queue_end = (u16)regs->data[priv->sreg_to];
+
+		if (queue_end > queue)
+			queues_total = queue_end - queue + 1;
+		else
+			queues_total = 1;
+	} else {
+		queue = priv->queuenum;
+		queues_total = priv->queues_total;
+	}
+
+	if (queues_total > 1) {
 		if (priv->flags & NFT_QUEUE_FLAG_CPU_FANOUT) {
 			int cpu = smp_processor_id();
 
-			queue = priv->queuenum + cpu % priv->queues_total;
+			queue += cpu % queues_total;
 		} else {
 			queue = nfqueue_hash(pkt->skb, queue,
-					     priv->queues_total, pkt->pf,
+					     queues_total, pkt->pf,
 					     jhash_initval);
 		}
 	}
@@ -58,6 +73,8 @@  static const struct nla_policy nft_queue_policy[NFTA_QUEUE_MAX + 1] = {
 	[NFTA_QUEUE_NUM]	= { .type = NLA_U16 },
 	[NFTA_QUEUE_TOTAL]	= { .type = NLA_U16 },
 	[NFTA_QUEUE_FLAGS]	= { .type = NLA_U16 },
+	[NFTA_QUEUE_SREG_FROM]	= { .type = NLA_U32 },
+	[NFTA_QUEUE_SREG_TO]	= { .type = NLA_U32 },
 };
 
 static int nft_queue_init(const struct nft_ctx *ctx,
@@ -66,30 +83,58 @@  static int nft_queue_init(const struct nft_ctx *ctx,
 {
 	struct nft_queue *priv = nft_expr_priv(expr);
 	u32 maxid;
+	int err;
 
-	if (tb[NFTA_QUEUE_NUM] == NULL)
+	if (!tb[NFTA_QUEUE_NUM] && !tb[NFTA_QUEUE_SREG_FROM])
 		return -EINVAL;
 
 	init_hashrandom(&jhash_initval);
-	priv->queuenum = ntohs(nla_get_be16(tb[NFTA_QUEUE_NUM]));
 
-	if (tb[NFTA_QUEUE_TOTAL] != NULL)
-		priv->queues_total = ntohs(nla_get_be16(tb[NFTA_QUEUE_TOTAL]));
-	else
-		priv->queues_total = 1;
+	/* nftables has no idea whether the kernel supports _SREG_FROM or not,
+	 * so for compatibility reason, it may specify the _SREG_FROM and
+	 * _QUEUE_NUM attributes at the same time. We prefer to use _SREG_FROM,
+	 * it is more flexible and has less restriction, for example, queue
+	 * range 0-65535 is ok when use _SREG_FROM and _SREG_TO.
+	 */
+	if (tb[NFTA_QUEUE_SREG_FROM]) {
+		priv->sreg_from = nft_parse_register(tb[NFTA_QUEUE_SREG_FROM]);
+		err = nft_validate_register_load(priv->sreg_from, sizeof(u16));
+		if (err < 0)
+			return err;
+
+		if (tb[NFTA_QUEUE_SREG_TO]) {
+			priv->sreg_to =
+				nft_parse_register(tb[NFTA_QUEUE_SREG_TO]);
+			err = nft_validate_register_load(priv->sreg_to,
+							 sizeof(u16));
+			if (err < 0)
+				return err;
+		} else {
+			priv->sreg_to = priv->sreg_from;
+		}
+	} else if (tb[NFTA_QUEUE_NUM]) {
+		priv->queuenum = ntohs(nla_get_be16(tb[NFTA_QUEUE_NUM]));
 
-	if (priv->queues_total == 0)
-		return -EINVAL;
+		if (tb[NFTA_QUEUE_TOTAL])
+			priv->queues_total =
+				ntohs(nla_get_be16(tb[NFTA_QUEUE_TOTAL]));
+		else
+			priv->queues_total = 1;
 
-	maxid = priv->queues_total - 1 + priv->queuenum;
-	if (maxid > U16_MAX)
-		return -ERANGE;
+		if (priv->queues_total == 0)
+			return -EINVAL;
+
+		maxid = priv->queues_total - 1 + priv->queuenum;
+		if (maxid > U16_MAX)
+			return -ERANGE;
+	}
 
 	if (tb[NFTA_QUEUE_FLAGS] != NULL) {
 		priv->flags = ntohs(nla_get_be16(tb[NFTA_QUEUE_FLAGS]));
 		if (priv->flags & ~NFT_QUEUE_FLAG_MASK)
 			return -EINVAL;
 	}
+
 	return 0;
 }
 
@@ -97,9 +142,21 @@  static int nft_queue_dump(struct sk_buff *skb, const struct nft_expr *expr)
 {
 	const struct nft_queue *priv = nft_expr_priv(expr);
 
-	if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
-	    nla_put_be16(skb, NFTA_QUEUE_TOTAL, htons(priv->queues_total)) ||
-	    nla_put_be16(skb, NFTA_QUEUE_FLAGS, htons(priv->flags)))
+	if (priv->sreg_from) {
+		if (nft_dump_register(skb, NFTA_QUEUE_SREG_FROM,
+				      priv->sreg_from))
+			goto nla_put_failure;
+		if (nft_dump_register(skb, NFTA_QUEUE_SREG_TO,
+				      priv->sreg_to))
+			goto nla_put_failure;
+	} else {
+		if (nla_put_be16(skb, NFTA_QUEUE_NUM, htons(priv->queuenum)) ||
+		    nla_put_be16(skb, NFTA_QUEUE_TOTAL,
+				 htons(priv->queues_total)))
+			goto nla_put_failure;
+	}
+
+	if (nla_put_be16(skb, NFTA_QUEUE_FLAGS, htons(priv->flags)))
 		goto nla_put_failure;
 
 	return 0;