diff mbox

[net,v6,1/2] net: sched: tbf: fix the calculation of max_size

Message ID 1386313205-87660-2-git-send-email-yangyingliang@huawei.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Yang Yingliang Dec. 6, 2013, 7 a.m. UTC
Current max_size is caluated from rate table. Now, the rate table
has been replaced and it's wrong to caculate max_size based on this
rate table. It can lead wrong calculation of max_size.

The burst in kernel may be lower than user asked, because burst may gets
some loss when transform it to buffer(E.g. "burst 40kb rate 30mbit/s")
and it seems we cannot avoid this loss. Burst's value(max_size) based on
rate table may be equal user asked. If a packet's length is max_size, this
packet will be stalled in tbf_dequeue() because its length is above the
burst in kernel so that it cannot get enough tokens. The max_size guards
against enqueuing packet sizes above q->buffer "time" in tbf_enqueue().

To make consistent with the calculation of tokens, this patch add a helper
psched_ns_t2l() to calculate burst(max_size) directly to fix this problem.

After this fix, we can support to using 64bit rates to calculate burst as well.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
---
 net/sched/sch_tbf.c | 104 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 61 insertions(+), 43 deletions(-)

Comments

David Laight Dec. 6, 2013, 10:56 a.m. UTC | #1
> From: Yang Yingliang
...
> 
> +/* Time to Length, convert time in ns to length in bytes
> + * to determinate how many bytes can be sent in given time.
> + */
> +static u64 psched_ns_t2l(const struct psched_ratecfg *r,
> +			 u64 time_in_ns)
> +{
> +	/* The formula is :
> +	 * len = (time_in_ns * r->rate_bytes_ps) / NSEC_PER_SEC
> +	 */
> +	u64 len = time_in_ns * r->rate_bytes_ps;
> +
> +	do_div(len, NSEC_PER_SEC);

You are multiplying two values then dividing by 10**9
I'd guess that the intermediate value might exceed 2**64.

> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
> +		len = (len / 53) * 48;

You probably want to do the multiply first.
But why not scale rate_bytes_ps instead.

> +	if (len > r->overhead)
> +		len -= r->overhead;
> +	else
> +		len = 0;
> +
> +	return len;
> +}

Personally I'd work out how much time you have to send each byte.
So if you want a rate of 1MB/sec you have a 'time cost' per byte of 1000ns.
The cost of sending a packet is simply the length multiplied by this cost.
To work out whether a packet can be sent you have a credit variable that
tracks current time.
If 'credit > now' the packet can't be sent, queue and schedule a wakeup.
if 'credit + backlog < now' set credit = now - backlog.
if 'credit <= now' send the packet and add the packet's 'cost' to 'credit'.

In the non-ratelimited case this is almost no work.

You'd probably need to work in 1/1024ns time units and/or blocks of 16 bytes.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 9, 2013, 2:26 a.m. UTC | #2
On 2013/12/6 18:56, David Laight wrote:
>> From: Yang Yingliang
> ...
>>
>> +/* Time to Length, convert time in ns to length in bytes
>> + * to determinate how many bytes can be sent in given time.
>> + */
>> +static u64 psched_ns_t2l(const struct psched_ratecfg *r,
>> +			 u64 time_in_ns)
>> +{
>> +	/* The formula is :
>> +	 * len = (time_in_ns * r->rate_bytes_ps) / NSEC_PER_SEC
>> +	 */
>> +	u64 len = time_in_ns * r->rate_bytes_ps;
>> +
>> +	do_div(len, NSEC_PER_SEC);
> 
> You are multiplying two values then dividing by 10**9
> I'd guess that the intermediate value might exceed 2**64.

I thought the max value of len is burst which is a type of u32 sent
userland, so the max value of (time_in_ns * r->rate_bytes_ps)
should be 64K*(10**9).
Hmm, maybe I should do something in this helper to avoid overflow.

> 
>> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
>> +		len = (len / 53) * 48;
> 
> You probably want to do the multiply first.
> But why not scale rate_bytes_ps instead.
> 
>> +	if (len > r->overhead)
>> +		len -= r->overhead;
>> +	else
>> +		len = 0;
>> +
>> +	return len;
>> +}
> 
> Personally I'd work out how much time you have to send each byte.
> So if you want a rate of 1MB/sec you have a 'time cost' per byte of 1000ns.
> The cost of sending a packet is simply the length multiplied by this cost.
> To work out whether a packet can be sent you have a credit variable that
> tracks current time.
> If 'credit > now' the packet can't be sent, queue and schedule a wakeup.
> if 'credit + backlog < now' set credit = now - backlog.
> if 'credit <= now' send the packet and add the packet's 'cost' to 'credit'.
> 
> In the non-ratelimited case this is almost no work.
> 
> You'd probably need to work in 1/1024ns time units and/or blocks of 16 bytes.
> 
> 	David
> 
> 
> 
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 9, 2013, 3:26 a.m. UTC | #3
On 2013/12/6 18:56, David Laight wrote:
>> From: Yang Yingliang
> ...
>>
>> +/* Time to Length, convert time in ns to length in bytes
>> + * to determinate how many bytes can be sent in given time.
>> + */
>> +static u64 psched_ns_t2l(const struct psched_ratecfg *r,
>> +			 u64 time_in_ns)
>> +{
>> +	/* The formula is :
>> +	 * len = (time_in_ns * r->rate_bytes_ps) / NSEC_PER_SEC
>> +	 */
>> +	u64 len = time_in_ns * r->rate_bytes_ps;
>> +
>> +	do_div(len, NSEC_PER_SEC);
> 
> You are multiplying two values then dividing by 10**9
> I'd guess that the intermediate value might exceed 2**64.
> 
>> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
>> +		len = (len / 53) * 48;
> 
> You probably want to do the multiply first.
> But why not scale rate_bytes_ps instead.
> 

When the linklayer is ATM, the formula to calculate tokens is:

((u64)(DIV_ROUND_UP(len,48)*53) * r->mult) >> r->shift

So, I scale len here.

Regards,
Yang


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Dec. 9, 2013, 10:07 a.m. UTC | #4
> From: Yang Yingliang
> On 2013/12/6 18:56, David Laight wrote:
> >> From: Yang Yingliang
> > ...
> >
> > You are multiplying two values then dividing by 10**9
> > I'd guess that the intermediate value might exceed 2**64.
> >
> >> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
> >> +		len = (len / 53) * 48;
> >
> > You probably want to do the multiply first.
> > But why not scale rate_bytes_ps instead.
> >
> 
> When the linklayer is ATM, the formula to calculate tokens is:
> 
> ((u64)(DIV_ROUND_UP(len,48)*53) * r->mult) >> r->shift
> 
> So, I scale len here.

Seems to me that there are a lot of unnecessary multiplies and
divides going on here.
Looks to me like the code was originally written to require one
multiply and one shift for each packet.

In any case the latter code is allowing for more of the ATM cell
overhead. I'm not at all sure the intent is to remove that when
setting up the constants.

OTOH should this code be worrying about the packet overheads at all?
Does it add in the ethernet pre-emable, CRC and inter packet gap?

I'd guess that the most the ATM code should do it round the length
up to a multiple of 48 (user payload in a cell).

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 9, 2013, 12:21 p.m. UTC | #5
On 2013/12/9 18:07, David Laight wrote:
>> From: Yang Yingliang
>> On 2013/12/6 18:56, David Laight wrote:
>>>> From: Yang Yingliang
>>> ...
>>>
>>> You are multiplying two values then dividing by 10**9
>>> I'd guess that the intermediate value might exceed 2**64.
>>>
>>>> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
>>>> +		len = (len / 53) * 48;
>>>
>>> You probably want to do the multiply first.
>>> But why not scale rate_bytes_ps instead.
>>>
>>
>> When the linklayer is ATM, the formula to calculate tokens is:
>>
>> ((u64)(DIV_ROUND_UP(len,48)*53) * r->mult) >> r->shift
>>
>> So, I scale len here.
> 
> Seems to me that there are a lot of unnecessary multiplies and
> divides going on here.
> Looks to me like the code was originally written to require one
> multiply and one shift for each packet.

The way to calc tokens in psched_l2t_ns() means:
we got a payload which length is 'len'. To get the actual size we
need send, round len up to a multiple of 53. After getting the
size, we do multiply and shift to calculate tokens that we need.

> 
> In any case the latter code is allowing for more of the ATM cell
> overhead. I'm not at all sure the intent is to remove that when
> setting up the constants.
> 
> OTOH should this code be worrying about the packet overheads at all?
> Does it add in the ethernet pre-emable, CRC and inter packet gap?

This overhead is sent from userspace by tc.

> 
> I'd guess that the most the ATM code should do it round the length
> up to a multiple of 48 (user payload in a cell).

53 is ATM cell size include header, sending a header needs tokens as
well. I'd guess round the length up to a multiple of 53 in psched_l2t_ns()
is necessary.


Regards,
Yang


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight Dec. 9, 2013, 1:11 p.m. UTC | #6
> From: Yang Yingliang [mailto:yangyingliang@huawei.com]
> On 2013/12/9 18:07, David Laight wrote:
> >> From: Yang Yingliang
> >> On 2013/12/6 18:56, David Laight wrote:
> >>>> From: Yang Yingliang
> >>> ...
> >>>
> >>> You are multiplying two values then dividing by 10**9
> >>> I'd guess that the intermediate value might exceed 2**64.
> >>>
> >>>> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
> >>>> +		len = (len / 53) * 48;
> >>>
> >>> You probably want to do the multiply first.
> >>> But why not scale rate_bytes_ps instead.
> >>>
> >>
> >> When the linklayer is ATM, the formula to calculate tokens is:
> >>
> >> ((u64)(DIV_ROUND_UP(len,48)*53) * r->mult) >> r->shift
> >>
> >> So, I scale len here.
> >
> > Seems to me that there are a lot of unnecessary multiplies and
> > divides going on here.
> > Looks to me like the code was originally written to require one
> > multiply and one shift for each packet.
> 
> The way to calc tokens in psched_l2t_ns() means:
> we got a payload which length is 'len'. To get the actual size we
> need send, round len up to a multiple of 53. After getting the
> size, we do multiply and shift to calculate tokens that we need.
> 
> >
> > In any case the latter code is allowing for more of the ATM cell
> > overhead. I'm not at all sure the intent is to remove that when
> > setting up the constants.
> >
> > OTOH should this code be worrying about the packet overheads at all?
> > Does it add in the ethernet pre-emable, CRC and inter packet gap?
> 
> This overhead is sent from userspace by tc.
> 
> >
> > I'd guess that the most the ATM code should do it round the length
> > up to a multiple of 48 (user payload in a cell).
> 
> 53 is ATM cell size include header, sending a header needs tokens as
> well. I'd guess round the length up to a multiple of 53 in psched_l2t_ns()
> is necessary.

The point I'm trying to make is that in one place you multiply by 53/48
and in the other you multiply be 48/53.
These two operations almost certainly cancel each other out.
Or would modulo small rounding errors if you did the multiplies first.
So why do either of them?

Whether you should be rounding up ATM data to a whole number of cells
is a different question entirely.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yang Yingliang Dec. 10, 2013, 2:04 a.m. UTC | #7
On 2013/12/9 21:11, David Laight wrote:
>> From: Yang Yingliang [mailto:yangyingliang@huawei.com]
>> On 2013/12/9 18:07, David Laight wrote:
>>>> From: Yang Yingliang
>>>> On 2013/12/6 18:56, David Laight wrote:
>>>>>> From: Yang Yingliang
>>>>> ...
>>>>>
>>>>> You are multiplying two values then dividing by 10**9
>>>>> I'd guess that the intermediate value might exceed 2**64.
>>>>>
>>>>>> +	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
>>>>>> +		len = (len / 53) * 48;
>>>>>
>>>>> You probably want to do the multiply first.
>>>>> But why not scale rate_bytes_ps instead.
>>>>>
>>>>
>>>> When the linklayer is ATM, the formula to calculate tokens is:
>>>>
>>>> ((u64)(DIV_ROUND_UP(len,48)*53) * r->mult) >> r->shift
>>>>
>>>> So, I scale len here.
>>>
>>> Seems to me that there are a lot of unnecessary multiplies and
>>> divides going on here.
>>> Looks to me like the code was originally written to require one
>>> multiply and one shift for each packet.
>>
>> The way to calc tokens in psched_l2t_ns() means:
>> we got a payload which length is 'len'. To get the actual size we
>> need send, round len up to a multiple of 53. After getting the
>> size, we do multiply and shift to calculate tokens that we need.
>>
>>>
>>> In any case the latter code is allowing for more of the ATM cell
>>> overhead. I'm not at all sure the intent is to remove that when
>>> setting up the constants.
>>>
>>> OTOH should this code be worrying about the packet overheads at all?
>>> Does it add in the ethernet pre-emable, CRC and inter packet gap?
>>
>> This overhead is sent from userspace by tc.
>>
>>>
>>> I'd guess that the most the ATM code should do it round the length
>>> up to a multiple of 48 (user payload in a cell).
>>
>> 53 is ATM cell size include header, sending a header needs tokens as
>> well. I'd guess round the length up to a multiple of 53 in psched_l2t_ns()
>> is necessary.
> 
> The point I'm trying to make is that in one place you multiply by 53/48
> and in the other you multiply be 48/53.
> These two operations almost certainly cancel each other out.
> Or would modulo small rounding errors if you did the multiplies first.
> So why do either of them?

Current kernel does multiplication by 53/48 to calculate tokens in
psched_l2t_ns(). It expands a packet length. So I need reduce the
length which is used as max_size in psched_ns_t2l(), or it will be
too big to get enough tokens.

len = (len / 53) * 48;
Here I do division first, because the length to calculate tokens is
53bytes at least. If len < 53, means that no packet can pass, we
should set len to 0.

Regards,
Yang

> 
> Whether you should be rounding up ATM data to a whole number of cells
> is a different question entirely.
> 
> 	David
> 
> 
> 
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index a609005..dd731f5 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -118,6 +118,30 @@  struct tbf_sched_data {
 };
 
 
+/* Time to Length, convert time in ns to length in bytes
+ * to determinate how many bytes can be sent in given time.
+ */
+static u64 psched_ns_t2l(const struct psched_ratecfg *r,
+			 u64 time_in_ns)
+{
+	/* The formula is :
+	 * len = (time_in_ns * r->rate_bytes_ps) / NSEC_PER_SEC
+	 */
+	u64 len = time_in_ns * r->rate_bytes_ps;
+
+	do_div(len, NSEC_PER_SEC);
+
+	if (unlikely(r->linklayer == TC_LINKLAYER_ATM))
+		len = (len / 53) * 48;
+
+	if (len > r->overhead)
+		len -= r->overhead;
+	else
+		len = 0;
+
+	return len;
+}
+
 /*
  * Return length of individual segments of a gso packet,
  * including all headers (MAC, IP, TCP/UDP)
@@ -289,10 +313,8 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 	struct tbf_sched_data *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_TBF_MAX + 1];
 	struct tc_tbf_qopt *qopt;
-	struct qdisc_rate_table *rtab = NULL;
-	struct qdisc_rate_table *ptab = NULL;
 	struct Qdisc *child = NULL;
-	int max_size, n;
+	u64 max_size;
 	u64 rate64 = 0, prate64 = 0;
 
 	err = nla_parse_nested(tb, TCA_TBF_MAX, opt, tbf_policy);
@@ -304,38 +326,13 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 		goto done;
 
 	qopt = nla_data(tb[TCA_TBF_PARMS]);
-	rtab = qdisc_get_rtab(&qopt->rate, tb[TCA_TBF_RTAB]);
-	if (rtab == NULL)
-		goto done;
-
-	if (qopt->peakrate.rate) {
-		if (qopt->peakrate.rate > qopt->rate.rate)
-			ptab = qdisc_get_rtab(&qopt->peakrate, tb[TCA_TBF_PTAB]);
-		if (ptab == NULL)
-			goto done;
-	}
+	if (qopt->rate.linklayer == TC_LINKLAYER_UNAWARE)
+		qdisc_put_rtab(qdisc_get_rtab(&qopt->rate,
+					      tb[TCA_TBF_RTAB]));
 
-	for (n = 0; n < 256; n++)
-		if (rtab->data[n] > qopt->buffer)
-			break;
-	max_size = (n << qopt->rate.cell_log) - 1;
-	if (ptab) {
-		int size;
-
-		for (n = 0; n < 256; n++)
-			if (ptab->data[n] > qopt->mtu)
-				break;
-		size = (n << qopt->peakrate.cell_log) - 1;
-		if (size < max_size)
-			max_size = size;
-	}
-	if (max_size < 0)
-		goto done;
-
-	if (max_size < psched_mtu(qdisc_dev(sch)))
-		pr_warn_ratelimited("sch_tbf: burst %u is lower than device %s mtu (%u) !\n",
-				    max_size, qdisc_dev(sch)->name,
-				    psched_mtu(qdisc_dev(sch)));
+	if (qopt->peakrate.linklayer == TC_LINKLAYER_UNAWARE)
+			qdisc_put_rtab(qdisc_get_rtab(&qopt->peakrate,
+						      tb[TCA_TBF_PTAB]));
 
 	if (q->qdisc != &noop_qdisc) {
 		err = fifo_set_limit(q->qdisc, qopt->limit);
@@ -357,30 +354,51 @@  static int tbf_change(struct Qdisc *sch, struct nlattr *opt)
 	}
 	q->limit = qopt->limit;
 	q->mtu = PSCHED_TICKS2NS(qopt->mtu);
-	q->max_size = max_size;
 	q->buffer = PSCHED_TICKS2NS(qopt->buffer);
 	q->tokens = q->buffer;
 	q->ptokens = q->mtu;
 
 	if (tb[TCA_TBF_RATE64])
 		rate64 = nla_get_u64(tb[TCA_TBF_RATE64]);
-	psched_ratecfg_precompute(&q->rate, &rtab->rate, rate64);
-	if (ptab) {
+	psched_ratecfg_precompute(&q->rate, &qopt->rate, rate64);
+	if (!q->rate.rate_bytes_ps)
+		goto unlock_done;
+
+	max_size = min_t(u64, psched_ns_t2l(&q->rate, q->buffer), ~0U);
+
+	if (qopt->peakrate.rate) {
 		if (tb[TCA_TBF_PRATE64])
 			prate64 = nla_get_u64(tb[TCA_TBF_PRATE64]);
-		psched_ratecfg_precompute(&q->peak, &ptab->rate, prate64);
+		psched_ratecfg_precompute(&q->peak, &qopt->peakrate, prate64);
+		if (q->peak.rate_bytes_ps <= q->rate.rate_bytes_ps) {
+			pr_warn_ratelimited("sch_tbf: peakrate %llu is lower than or equals to rate %llu !\n",
+					    q->peak.rate_bytes_ps, q->rate.rate_bytes_ps);
+			goto unlock_done;
+		}
+
+		max_size = min_t(u64, max_size, psched_ns_t2l(&q->peak, q->mtu));
 		q->peak_present = true;
 	} else {
 		q->peak_present = false;
 	}
 
+	if (max_size < psched_mtu(qdisc_dev(sch)))
+		pr_warn_ratelimited("sch_tbf: burst %llu is lower than device %s mtu (%u) !\n",
+				    max_size, qdisc_dev(sch)->name,
+				    psched_mtu(qdisc_dev(sch)));
+
+	if (!max_size)
+		goto unlock_done;
+
+	q->max_size = max_size;
+
 	sch_tree_unlock(sch);
-	err = 0;
+	return 0;
+
+unlock_done:
+	sch_tree_unlock(sch);
+	err = -EINVAL;
 done:
-	if (rtab)
-		qdisc_put_rtab(rtab);
-	if (ptab)
-		qdisc_put_rtab(ptab);
 	return err;
 }