From patchwork Thu Feb 26 23:01:31 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michal Soltys <soltys@ziu.info>
X-Patchwork-Id: 23804
X-Patchwork-Delegate: davem@davemloft.net
Return-Path: <netdev-owner@vger.kernel.org>
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.176.167])
	by ozlabs.org (Postfix) with ESMTP id 8EEB2DDD1C
	for <patchwork-incoming@ozlabs.org>;
	Fri, 27 Feb 2009 10:22:27 +1100 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755276AbZBZXWS (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);
	Thu, 26 Feb 2009 18:22:18 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755181AbZBZXWP
	(ORCPT <rfc822; netdev-outgoing>); Thu, 26 Feb 2009 18:22:15 -0500
Received: from relay.ppgk.com.pl ([80.53.243.36]:41341 "EHLO
	relay.ppgk.com.pl"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754842AbZBZXWK (ORCPT <rfc822;netdev@vger.kernel.org>);
	Thu, 26 Feb 2009 18:22:10 -0500
X-Greylist: delayed 506 seconds by postgrey-1.27 at vger.kernel.org;
	Thu, 26 Feb 2009 18:22:08 EST
Received: from localhost.ppgk.com.pl (localhost [127.0.0.1])
	by relay.ppgk.com.pl (Postfix) with ESMTP id CC752375EA
	for <netdev@vger.kernel.org>; Fri, 27 Feb 2009 00:13:39 +0100 (CET)
X-PPGK-Scanned: amavisd-new 2.6.1 (20080629) at ppgk.com.pl
Received: from relay.ppgk.com.pl ([127.0.0.1])
	by localhost.ppgk.com.pl (relay.ppgk.com.pl [127.0.0.1]) (amavisd-new,
	port 10024) with LMTP id BsZ3jCQ1lkCP for <netdev@vger.kernel.org>;
	Fri, 27 Feb 2009 00:13:39 +0100 (CET)
Received: from localhost.localdomain (87-205-204-139.adsl.inetia.pl
	[87.205.204.139])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by relay.ppgk.com.pl (Postfix) with ESMTPSA id 8629C375E2
	for <netdev@vger.kernel.org>; Fri, 27 Feb 2009 00:13:37 +0100 (CET)
From: Michal Soltys <soltys@ziu.info>
To: netdev@vger.kernel.org
Subject: [PATCH/RFC 1/2] HFSC (7) & (8) documentation + assorted changes
Date: Fri, 27 Feb 2009 00:01:31 +0100
Message-Id: 
 <dd4be29320bb8ab261489328bde6fb5620dbec1c.1235685021.git.soltys@ziu.info>
X-Mailer: git-send-email 1.6.1
In-Reply-To: <cover.1235685021.git.soltys@ziu.info>
References: <cover.1235685021.git.soltys@ziu.info>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID: <netdev.vger.kernel.org>
X-Mailing-List: netdev@vger.kernel.org

This patch adds detailed documentation for HFSC scheduler. It roughly
follows HFSC paper, but tries to not rely too much on math side of things.
Post-paper/Linux specific subjects (timer resolution, ul service curve, etc.)
are also discussed.

I've read it many times over, but it's a lengthy chunk of text - so try
to be understanding in case I made some mistakes.

tc-hfsc(7): explains algorithm in detail (very long)
tc-hfsc(8): explains command line options briefly
tc(8): adds references to new man pages
Makefile: adds man7 directory to install target
q_hfsc.c: minimal help text changes, consistency with tc-hfsc(8)
---
 Makefile           |    2 +
 man/man7/tc-hfsc.7 |  525 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/tc-hfsc.8 |   61 ++++++
 man/man8/tc.8      |    3 +
 tc/q_hfsc.c        |    6 +-
 5 files changed, 596 insertions(+), 1 deletions(-)
 create mode 100644 man/man7/tc-hfsc.7
 create mode 100644 man/man8/tc-hfsc.8

diff --git a/Makefile b/Makefile
index 6096a99..8ac0fe6 100644
--- a/Makefile
+++ b/Makefile
@@ -53,6 +53,8 @@ install: all
 	install -m 0644 $(shell find etc/iproute2 -maxdepth 1 -type f) $(DESTDIR)$(CONFDIR)
 	install -m 0755 -d $(DESTDIR)$(MANDIR)/man8
 	install -m 0644 $(shell find man/man8 -maxdepth 1 -type f) $(DESTDIR)$(MANDIR)/man8
+	install -m 0755 -d $(DESTDIR)$(MANDIR)/man7
+	install -m 0644 $(shell find man/man7 -maxdepth 1 -type f) $(DESTDIR)$(MANDIR)/man7
 	ln -sf tc-bfifo.8  $(DESTDIR)$(MANDIR)/man8/tc-pfifo.8
 	ln -sf lnstat.8  $(DESTDIR)$(MANDIR)/man8/rtstat.8
 	ln -sf lnstat.8  $(DESTDIR)$(MANDIR)/man8/ctstat.8
diff --git a/man/man7/tc-hfsc.7 b/man/man7/tc-hfsc.7
new file mode 100644
index 0000000..99073eb
--- /dev/null
+++ b/man/man7/tc-hfsc.7
@@ -0,0 +1,525 @@
+.TH HFSC 7 "25 February 2009" iproute2 Linux
+.ce 1
+\fBHIERARCHICAL FAIR SERVICE CURVE\fR
+.
+.SH "HISTORY & INTRODUCTION"
+.
+HFSC \- \fBHierarchical Fair Service Curve\fR was first presented at
+SIGCOMM'97. Developed as a part of ALTQ (ALTernative Queuing) on NetBSD, found
+its way quickly to other BSD systems, and then a few years ago became part of
+the linux kernel. Still, it's not the most popular scheduling algorithm \-
+especially if compared to HTB \- and it's not well documented from enduser's
+perspective. This introduction aims to explain how HFSC works without
+going to deep into math side of things (although some if it will be
+inevitable).
+
+In short HFSC aims to:
+.
+.RS 4
+.IP \fB1)\fR 4
+guarantee precise bandwidth and delay allocation for all leaf classes (realtime
+criterion)
+.IP \fB2)\fR
+allocate excess bandwidth fairly as specified by class hierarchy (linkshare &
+upperlimit criterion)
+.IP \fB3)\fR
+minimize any discrepancy between the service curve and the actual amount of
+service provided during linksharing
+.RE
+.PP
+.
+The main "selling" point of HFSC is feature \fB(1)\fR, which is achieved by
+using nonlinear service curves (more about what it actually is later). This is
+particularly useful in VoIP or games, where not only guarantee of consistent
+bandwidth is important, but initial delay of a data stream as well. Note that
+it matters only for leaf classes (where the actual queues are) \- thus class
+hierarchy is ignored in realtime case.
+
+Feature \fB(2)\fR is well, obvious \- any algorithm featuring class hierarchy
+(such as HTB or CBQ) strives to achieve that. HFSC does that well, although
+you might end with unusual situations, if you define service curves carelessly
+\- see section CORNER CASES for examples.
+
+Feature \fB(3)\fR is mentioned due to the nature of the problem. There may be
+situations where it's either not possible to guarantee service of all curves at
+the same time, and/or it's impossible to do so fairly. Both will be explained
+later. Note that this is mainly related to interior (aka aggregate) classes, as
+the leafs are already handled by \fB(1)\fR. Still \- it's perfectly possible to
+create a leaf class w/o realtime service, and in such case \- the caveats will
+naturally extend to leaf classes as well.
+
+.SH ABBREVIATIONS
+For the remaining part of the document, we'll use following shortcuts:
+.nf
+.RS 4
+
+RT \- realtime
+LS \- linkshare
+UL \- upperlimit
+SC \- service curve
+.fi
+.
+.SH "BASICS OF HFSC"
+.
+To understand how HFSC works, we must first introduce a service curve.
+Overall, it's a nondecreasing function of some time unit, returning amount of
+service (allowed or allocated amount of bandwidth) by some specific point in
+time. The purpose of it should be subconsciously obvious \- if a class was
+allowed to transfer not less than the amount specified by its service curve \-
+then service curve is not violated.
+
+Still \- we need more elaborate criterion than just the above (although in
+most generic case it can be reduced to it). The criterion has to take two
+things into account:
+.
+.RS 4
+.IP \(bu 4
+idling periods
+.IP \(bu
+ability to "look back", so if during current active period service curve is violated, maybe it
+isn't if we count excess bandwidth received during earlier active period(s)
+.RE
+.PP
+Let's define the criterion as follows:
+.RS 4
+.nf
+.IP "\fB(1)\fR" 4
+For each t1, there must exist t0 in set B, so S(t1\-t0)\~<=\~w(t0,t1)
+.fi
+.RE
+.
+.PP
+Here 'w' denotes the amount of service received during some time period between t0
+and t1. B is a set of all times, where a session becomes active after idling
+period (further denoted as 'becoming backlogged'). For a clearer picture,
+imagine two situations:
+.
+.RS 4
+.IP \fBa)\fR 4
+our session was active during two periods, with a small time gap between them
+.IP \fBb)\fR
+as in (a), but with a larger gap
+.RE
+.
+.PP
+Consider \fB(a)\fR \- if the service received during both periods meets
+\fB(1)\fR, then all is good. But what if it doesn't do so during the 2nd
+period ? If the amount of service received during the 1st period is bigger
+than the service curve, then it might compensate for smaller service during
+the 2nd period \fIand\fR the gap \- if the gap is small enough.
+
+If the gap is larger \fB(b)\fR \- then it's less likely to happen (unless the
+excess bandwidth allocated during the 1st part was really large). Still, the
+larger the gap \- the less interesting is what happened in the past (e.g. 10
+minutes ago) \- what matters is the current traffic that just started.
+
+From HFSC's perspective, more interesting is answering the following question:
+when should we start transferring packets, so a service curve of a class is not
+violated. Or rephrasing it: How much X() amount of service should a session
+receive by time t, so the service curve is not violated. Function X() defined
+as below is the basic building block of HFSC, used in: eligible, deadline,
+virtual\-time and fit\-time curves. Of course, X() is based on equation
+\fB(1)\fR and is defined recursively:
+
+.RS 4
+.IP \(bu 4
+At the 1st backlogged period beginning function X is initialized to generic
+service curve assigned to a class
+.IP \(bu
+At any subsequent backlogged period, X() is:
+.nf
+\fBmin(X() from previous period ; w(t0)+S(t\-t0) for t>=t0),\fR
+.fi
+\&... where t0 denotes the beginning of the current backlogged period.
+.RE
+.
+.PP
+HFSC uses either linear, or two\-piece linear service curves. In case of
+linear or two\-piece linear convex functions (first slope < second slope),
+min() in X's definition reduces to the 2nd argument. But in case of two\-piece
+concave functions, the 1st argument might quickly become lesser for some
+t>=t0. Note, that for some backlogged period, X() is defined only from that
+period's beginning. We also define X^(\-1)(w) as smallest t>=t0, for which
+X(t)\~=\~w. We have to define it this way, as X() is usually not an injection.
+
+The above generic X() can be one of the following:
+.
+.RS 4
+.IP "E()" 4
+In realtime criterion, selects packets eligible for sending. If none are
+eligible, HFSC will use linkshare criterion. Eligible time \&'et' is calculated
+with reference to packets' heads ( et\~=\~E^(\-1)(w) ). It's based on RT
+service curve, \fIbut in case of a convex curve, uses its 2nd slope only.\fR
+.IP "D()"
+In realtime criterion, selects the most suitable packet from the ones chosen
+by E(). Deadline time \&'dt' corresponds to packets' tails
+(dt\~=\~D^(\-1)(w+l), where \&'l' is packet's length). Based on RT service
+curve.
+.IP "V()"
+In linkshare criterion, arbitrates which packet to send next. Note that V() is
+function of a virtual time \- see \fBLINKSHARE CRITERION\fR section for
+details.  Virtual time \&'vt' corresponds to packets' heads
+(vt\~=\~V^(\-1)(w)). Based on LS service curve.
+.IP "F()"
+An extension to linkshare criterion, used to limit at which speed linkshare
+criterion is allowed to dequeue. Fit\-time 'ft' corresponds to packets' heads
+as well (ft\~=\~F^(\-1)(w)). Based on UL service curve.
+.RE
+
+Be sure to make clean distinction between session's RT, LS and UL service
+curves and the above "utility" functions.
+.
+.SH "REALTIME CRITERION"
+.
+RT criterion \fIignores class hierarchy\fR and guarantees precise bandwidth and
+delay allocation. We say that packet is eligible for sending, when current real
+time is bigger than eligible time. From all packets eligible, the one most
+suited for sending, is the one with the smallest deadline time. Sounds simply,
+but consider following example:
+
+Interface 10mbit, two classes, both with two\-piece linear service curves:
+.RS 4
+.IP \(bu 4
+1st class \- 2mbit for 100ms, then 7mbit (convex \- 1st slope < 2nd slope)
+.IP \(bu
+2nd class \- 7mbit for 100ms, then 2mbit (concave \- 1st slope > 2nd slope)
+.RE
+.PP
+Assume for a moment, that we only use D() for both finding eligible packets,
+and choosing the most fitting one, thus eligible time would be computed as
+D^(\-1)(w) and deadline time would be computed as D^(\-1)(w+l).  If the 2nd
+class starts sending packets 1 second after the 1st class, it's of course
+impossible to guarantee 14mbit, as the interface capability is only 10mbit.
+The only workaround in this scenario is to allow the 1st class to send the
+packets earlier that would normally be allowed. That's where separate E() comes
+to help.  Putting all the math aside (see HFSC paper for details), E() for RT
+concave service curve is just like D(), but for the RT convex service curve \-
+it's constructed using \fIonly\fR RT service curve's 2nd slope (in our example
+\- 7mbit).
+
+The effect of such E() \- packets will be sent earlier, and at the same time
+D() \fIwill\fR be updated \- so current deadline time calculated from it will
+be bigger. Thus, when the 2nd class starts sending packets later, both the 1st
+and the 2nd class will be eligible, but the 2nd session's deadline time will be
+smaller and its packets will be sent first. When the 1st class becomes idle at
+some later point, the 2nd class will be able to "buffer" up again for later
+active period of the 1st class.
+
+A short remark \- in a situation, where the total amount of bandwidth
+available on the interface is bigger than the allocated total realtime parts
+(imagine interface 10 mbit, but 1mbit/2mbit and 2mbit/1mbit classes), the sole
+speed of the interface could suffice to guarantee the times.
+
+Important part of RT criterion is that apart from updating its D() and E(),
+also V() used by LS criterion is updated. Generally the RT criterion is
+secondary to LS one, and used \fIonly\fR if there's a risk of violating precise
+realtime requirements. Still, the "participation" in bandwidth distributed by
+LS criterion is there, so V() has to be updated along the way. LS criterion can
+than properly compensate for non\-ideal fair sharing situation, caused by RT
+scheduling. If you use UL service curve its F() will be updated as well (UL
+service curve is an extension to LS one \- see \fBUPPERLIMIT CRITERION\fR
+section).
+
+Anyway \- careless specification of LS and RT service curves can lead to
+potentially undesired situations (see CORNER CASES for examples). This wasn't
+the case in HFSC paper where LS and RT service curves couldn't be specified
+separately.
+
+.SH "LINKSHARING CRITERION"
+.
+LS criterion's task is to distribute bandwidth according to specified class
+hierarchy. Contrary to RT criterion, there're no comparisons between current
+real time and virtual time \- the decision is based solely on direct comparison
+of virtual times of all active subclasses \- the one with the smallest vt wins
+and gets scheduled. One immediate conclusion from this fact is that absolute
+values don't matter \- only ratios between them (so for example, two children
+classes with simple linear 1mbit service curves will get the same treatment
+from LS criterion's perspective, as if they were 5mbit). The other conclusion
+is, that in perfectly fluid system with linear curves, all virtual times across
+whole class hierarchy would be equal.
+
+Why is VC defined in term of virtual time (and what is it) ?
+
+Imagine an example: class A with two children \- A1 and A2, both with let's say
+10mbit SCs. If A2 is idle, A1 receives all the bandwidth of A (and update its
+V() in the process). When A2 becomes active, A1's virtual time is already
+\fIfar\fR bigger than A2's one. Considering the type of decision made by LS
+criterion, A1 would become idle for a lot of time. We can workaround this
+situation by adjusting virtual time of the class becoming active \- we do that
+by getting such time "up to date". HFSC uses a mean of the smallest and the
+biggest virtual time of currently active children fit for sending. As it's not
+real time anymore (excluding trivial case of situation where all classes become
+active at the same time, and never become idle), it's called virtual time.
+
+Such approach has its price though. The problem is analogous to what was
+presented in previous section and is caused by non\-linearity of service
+curves:
+.IP 1) 4
+either it's impossible to guarantee both service curves and satisfy fairness
+during certain time periods:
+
+.RS 4
+Recall the example from RT section, slightly modified (with 3mbit slopes
+instead of 2mbit ones):
+
+.IP \(bu 4
+1st class \- 3mbit for 100ms, then 7mbit (convex \- 1st slope < 2nd slope)
+.IP \(bu
+2nd class \- 7mbit for 100ms, then 3mbit (concave \- 1st slope > 2nd slope)
+
+.PP
+They sum up nicely to 10mbit \- interface's capacity. But if we wanted to only
+use LS for guarantees and fairness \- it simply won't work. In LS context,
+only V() is used for making decision which class to schedule. If the 2nd class
+becomes active when the 1st one is in its second slope, the fairness will be
+preserved \- ratio will be 1:1 (7mbit:7mbit), but LS itself is of course
+unable to guarantee the absolute values themselves \- as it would have to go
+beyond of what the interface is capable of.
+.RE
+
+.IP 2) 4
+and/or it's impossible to guarantee service curves of all classes at all
+
+.RS 4
+Even if we didn't use virtual time and allowed a session to be "punished",
+there's a possibility that service curves of all classes couldn't be
+guaranteed for a brief period. Consider following, a bit more complicated
+example:
+
+Root interface, classes A and B with concave and convex curve (summing up to
+root), A1 & A2 (children of A), \fIboth\fR with concave curves summing up to A,
+B1 & B2 (children of B), \fIboth\fR with convex curves summing up to B.
+
+Assume that A2, B1 and B2 are constantly backlogged, and at some later point
+A1 becomes backlogged. We can easily choose slopes, so that even if we
+"punish" A2 for earlier excess bandwidth received, A1 will have no chance of
+getting bandwidth corresponding to its first slope. Following from the above
+example:
+
+.nf
+A  \- 7mbit, then 3mbit
+A1 \- 5mbit, then 2mbit
+A2 \- 2mbit, then 1mbit
+
+B  \- 3mbit, then 7mbit
+B1 \- 2mbit, then 5mbit
+B2 \- 1mbit, then 2mbit
+.fi
+
+At the point when A1 starts sending, it should get 5mbit to not violate its
+service curve. A2 gets punished and doesn't send at all, B1 and B2 both keep
+sending at their 5mbit and 2mbit. But as you can see, we already are beyond
+interface's capacity \- at 12mbit. A1 could get 3mbit at most. If we used
+virtual times and kept fairness property, A1 and A2 would send at 3mbit
+together with 5:2 ratio (so respectively at ~2.14mbit and ~0.86mbit).
+.RE
+.
+.SH "UPPERLIMIT CRITERION"
+.
+UL criterion is an extensions to LS one, that permits sending packets only
+if current real time is bigger than fit\-time ('ft'). So the modified LS
+criterion becomes: choose the smallest virtual time from all active children,
+such that fit\-time < current real time also holds. Fit\-time is calculated
+from F(), which is based on UL service curve. As you can see, it's role is
+kinda similar to E() used in RT criterion. Also, for obvious reasons \- you
+can't specify UL service curve without LS one.
+
+Main purpose of UL service curve is to limit HFSC to bandwidth available on the
+upstream router (think adsl home modem/router, and linux server as
+nat/firewall/etc. with 100mbit+ connection to mentioned modem/router).
+Typically, it's used to create a single class directly under root, setting
+linear UL service curve to available bandwidth \- and then creating your class
+structure from that class downwards. Of course, you're free to add UL service
+(linear or not) curve to any class with LS criterion.
+
+Important part about UL service curve is, that whenever at some point in time
+a class doesn't qualify for linksharing due to its fit\-time, the next time it
+does qualify, it will update its virtual time to the smallest virtual time of
+all active children fit for linksharing. This way, one of the main things LS
+criterion tries to achieve \- equality of all virtual times across whole
+hierarchy \- is preserved (in perfectly fluid system with only linear curves,
+all virtual times would be equal).
+
+Without that, 'vt' would lag behind other virtual times, and could cause
+problems. Consider interface with capacity 10mbit, and following leaf classes
+(just in case you're skipping this text quickly \- this example shows behavior
+that \f(BIdoesn't happen\fR):
+
+.nf
+A \- ls 5.0mbit
+B \- ls 2.5mbit
+C \- ls 2.5mbit, ul 2.5mbit
+.fi
+
+If B was idle, while A and C were constantly backlogged, they would normally
+(as far as LS criterion is concerned) divide bandwidth in 2:1 ratio. But due
+to UL service curve in place, C would get at most 2.5mbit, and A would get the
+remaining 7.5mbit. The longer the backlogged period, the more virtual times of
+A and C would drift apart. If B became backlogged at some later point in time,
+its virtual time would be set to (A's\~vt\~+\~C's\~vt)/2, thus blocking A from
+sending any traffic, until B's virtual time catches up with A.
+.
+.SH "SEPARATE LS / RT SCs"
+.
+Another difference from original HFSC paper, is that RT and LS SCs can be
+specified separately. Moreover \- leaf classes are allowed to have only either
+RT SC or LS SC. For interior classes, only LS SCs make sense \- Any RT SC will
+be ignored.
+.
+.SH "CORNER CASES"
+.
+Separate service curves for LS and RT criteria can lead to certain traps,
+that come from "fighting" between ideal linksharing and enforced realtime
+guarantees. Those situations didn't exist in original HFSC paper, where
+specifying separate LS / RT service curves was not discussed.
+
+Consider interface with capacity 10mbit, with following leaf classes:
+
+.nf
+A \- ls 5.0mbit, rt 8mbit
+B \- ls 2.5mbit
+C \- ls 2.5mbit
+.fi
+
+Imagine A and C are constantly backlogged. As B is idle, A and C would divide
+bandwidth in 2:1 ratio, considering LS service curve (so in theory \- 6.66 and
+3.33). Alas RT criterion takes priority, so A will get 8mbit and LS will be
+able to compensate class C for only 2 mbit \- this will cause discrepancy
+between virtual times of A and C.
+
+Assume this situation lasts for a lot of time with no idle periods, and
+suddenly B becomes active. B's virtual time will be updated to
+(A's\~vt\~+\~C's\~vt)/2, effectively landing in the middle between A's and C's
+virtual time. The effect \- B, having no RT guarantees, will be punished and
+will not be allowed to transfer until C's virtual time catches up.
+
+If the interface had higher capacity \- for example 100mbit, this example
+would behave perfectly fine though.
+
+Let's look a bit closer at the above example \- it "cleverly" invalidates one
+of the basic things LS criterion tries to achieve \- equality of all virtual
+times across class hierarchy. Leaf classes without RT service curves are
+literally left to their own fate (governed by messed up virtual times).
+
+Also - it doesn't make much sense. Class A will always be guaranteed up to
+8mbit, and this is more than any absolute bandwidth that could happen from its
+LS criterion (excluding trivial case of only A being active). If the bandwidth
+taken by A is smaller than absolute value from LS criterion, the unused part
+will be automatically assigned to other active classes (as A has idling periods
+in such case). The only "advantage" is, that even in case of low bandwidth on
+average, bursts would be handled at the speed defined by RT criterion. Still,
+if extra speed is needed (e.g. due to latency), non linear service curves
+should be used in such case.
+
+In the other words - LS criterion is meaningless in the above example.
+
+You can quickly "workaround" it by making sure each leaf class has RT service
+curve assigned (thus guaranteeing all of them will get some bandwidth), but it
+doesn't make it any more valid.
+.
+.SH "LINUX AND TIMER RESOLUTION"
+.
+In certain situations, the scheduler can throttle itself and setup so
+called watchdog to wakeup dequeue function at some time later. In case of HFSC
+it happens when for example no packet is eligible for scheduling, and UL
+service curve is used to limit the speed at which LS criterion is allowed to
+dequeue packets. It's called throttling, and accuracy of it is dependent on
+how the kernel is compiled.
+
+There're 3 important options in modern kernels, as far as timers' resolution
+goes: \&'tickless system', \&'high resolution timer support' and \&'timer
+frequency'.
+
+If you have \&'tickless system' enabled, then the timer interrupt will trigger
+as slowly as possible, but each time a scheduler throttles itself (or any
+other part of the kernel needs better accuracy), the rate will be increased as
+needed / possible. The ceiling is either \&'timer frequency' if \&'high
+resolution timer support' is not available or not compiled in. Otherwise it's
+hardware dependent and can go \fIfar\fR beyond the highest \&'timer frequency'
+setting available.
+
+If \&'tickless system' is not enabled, the timer will trigger at a fixed rate
+specified by \&'timer frequency' \- regardless if high resolution timers are
+or aren't available.
+
+This is important to keep those settings in mind, as in scenario like: no
+tickless, no HR timers, frequency set to 100hz \- throttling accuracy would be
+at 10ms. It doesn't automatically mean you would be limited to ~0.8mbit/s
+(assuming packets at ~1KB) \- as long as your queues are prepared to cover for
+timer inaccuracy. Of course, in case of e.g. locally generated udp traffic \-
+appropriate socket size is needed as well. Short example to make it more
+understandable (assume hardcore anti\-schedule settings \- HZ=100, no HR
+timers, no tickless):
+
+.nf
+tc qdisc add dev eth0 root handle 1:0 hfsc default 1
+tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbit
+.fi
+
+Assuming packet of ~1KB size and HZ=100, that averages to ~0.8mbit \- anything
+beyond it (e.g. the above example with specified rate over 10x bigger) will
+require appropriate queuing and cause bursts every ~10 ms.  As you can
+imagine, any HFSC's RT guarantees will be seriously invalidated by that.
+Aforementioned example is mainly important if you deal with old hardware \- as
+it's particularly popular for home server chores. Even then, you can easily
+set HZ=1000 and have very accurate scheduling for typical adsl speeds.
+
+Anything modern (apic or even hpet msi based timers + \&'tickless system')
+will provide enough accuracy for superb 1gbit scheduling. For example, on one
+of basically cheap dual core AMD boards I have with following settings:
+
+.nf
+tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1
+tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbit
+.fi
+
+And simple:
+
+.nf
+nc \-u dst.host.com 54321 </dev/zero
+nc \-l \-p 54321 >/dev/null
+.fi
+
+\&...will yield following effects over period of ~10 seconds (taken from
+/proc/interrupts):
+
+.nf
+319: 42124229   0  HPET_MSI\-edge  hpet2 (before)
+319: 42436214   0  HPET_MSI\-edge  hpet2 (after 10s.)
+.fi
+
+That's roughly 31000/s. Now compare it with HZ=1000 setting. The obvious
+drawback of it is that cpu load can be rather extensive with servicing that
+many timer interrupts. Example with 300mbit RT service curve on 1gbit link is
+particularly ugly, as it requires a lot of throttling with minuscule delays.
+
+Also note that it's just an example showing capability of current hardware.
+The above example (essentially 300mbit TBF emulator) is pointless on internal
+interface to begin with \- you will pretty much always want regular LS service
+curve there, and in such scenario HFSC simply doesn't throttle at all.
+
+300mbit RT service curve (selected columns from mpstat \-P ALL 1):
+
+.nf
+10:56:43 PM  CPU  %sys     %irq   %soft   %idle
+10:56:44 PM  all  20.10    6.53   34.67   37.19
+10:56:44 PM    0  35.00    0.00   63.00    0.00
+10:56:44 PM    1   4.95   12.87    6.93   73.27
+.fi
+
+So, in rare case you need those speeds with only RT service curve, or with UL
+service curve \- remember about drawbacks.
+.
+.SH "LAYER2 ADAPTATION"
+.
+Please refer to \fBtc\-stab\fR(8)
+.
+.SH "SEE ALSO"
+.
+\fBtc\fR(8), \fBtc\-hfsc\fR(8), \fBtc\-stab\fR(8)
+
+Please direct bugreports and patches to: <netdev@vger.kernel.org>
+.
+.SH "AUTHOR"
+.
+Manpage created by Michal Soltys (soltys@ziu.info)
diff --git a/man/man8/tc-hfsc.8 b/man/man8/tc-hfsc.8
new file mode 100644
index 0000000..1d0aada
--- /dev/null
+++ b/man/man8/tc-hfsc.8
@@ -0,0 +1,61 @@
+.TH HFSC 8 "25 February 2009" iproute2 Linux
+.
+.SH NAME
+HFSC \- Hierarchical Fair Service Curve's control under linux
+.
+.SH SYNOPSIS
+.nf
+tc qdisc add ... hfsc [ \fBdefault\fR CLASSID ]
+
+tc class add ... hfsc [ [ \fBrt\fR SC ] [ \fBls\fR SC ] | [ \fBsc\fR SC ] ] [ \fBul\fR SC ]
+
+\fBrt\fR : realtime service curve
+\fBls\fR : linkshare service curve
+\fBsc\fR : rt+ls service curve
+\fBul\fR : upperlimit service curve
+
+\(bu at least one of \fBrt\fR, \fBls\fR or \fBsc\fR must be specified
+\(bu \fBul\fR can only be specified with \fBls\fR or \fBsc\fR
+.
+.IP "SC := [ [ \fBm1\fR BPS ] \fBd\fR SEC ] \fBm2\fR BPS"
+\fBm1\fR : slope of the first segment
+\fBd\fR  : x\-coordinate of intersection
+\fBm2\fR : slope of the second segment
+.PP
+.IP "SC := [ [ \fBumax\fR BYTE ] \fBdmax\fR SEC ] \fBrate\fR BPS"
+\fBumax\fR : maximum unit of work
+\fBdmax\fR : maximum delay
+\fBrate\fR : rate
+.PP
+.fi
+For description of BYTE, BPS and SEC \- please see \fBUNITS\fR
+section of \fBtc\fR(8).
+.
+.SH DESCRIPTION (qdisc)
+HFSC qdisc has only one optional parameter \- \fBdefault\fR.  CLASSID specifies
+the minor part of the default classid, where packets not classified by other
+means (e.g. u32 filter, CLASSIFY target of iptables) will be enqueued. If
+\fBdefault\fR is not specified, unclassified packets will be dropped.
+.
+.SH DESCRIPTION (class)
+HFSC class is used to create a class hierarchy for HFSC scheduler. For
+explanation of the algorithm, and the meaning behind \fBrt\fR, \fBls\fR,
+\fBsc\fR and \fBul\fR service curves \- please refer to \fBtc\-hfsc\fR(7).
+
+As you can see in \fBSYNOPSIS\fR, service curve (SC) can be specified in two
+ways. Either as maximum delay for certain amount of work, or as a bandwidth
+assigned for certain amount of time. Obviously, \fBm1\fR is simply
+\fBumax\fR/\fBdmax\fR.
+
+Both \fBm2\fR and \fBrate\fR are mandatory. If you omit other
+parameters, you will specify linear service curve.
+.
+.SH "SEE ALSO"
+.
+\fBtc\fR(8), \fBtc\-hfsc\fR(7), \fBtc\-stab\fR(8)
+
+Please direct bugreports and patches to: <netdev@vger.kernel.org>
+.
+.SH "AUTHOR"
+.
+Manpage created by Michal Soltys (soltys@ziu.info)
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 8c0880f..da26478 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -368,12 +368,15 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .SH SEE ALSO
 .BR tc-cbq (8),
 .BR tc-htb (8),
+.BR tc-hfsc (8),
+.BR tc-hfsc (7),
 .BR tc-sfq (8),
 .BR tc-red (8),
 .BR tc-tbf (8),
 .BR tc-pfifo (8),
 .BR tc-bfifo (8),
 .BR tc-pfifo_fast (8),
+.BR tc-stab (8),
 .br
 .RB "User documentation at " http://lartc.org/ ", but please direct bugreports and patches to: " <netdev@vger.kernel.org>
 
diff --git a/tc/q_hfsc.c b/tc/q_hfsc.c
index b190c71..03539ec 100644
--- a/tc/q_hfsc.c
+++ b/tc/q_hfsc.c
@@ -43,7 +43,7 @@ explain_class(void)
 	fprintf(stderr,
 		"Usage: ... hfsc [ [ rt SC ] [ ls SC ] | [ sc SC ] ] [ ul SC ]\n"
 		"\n"
-		"SC := [ [ m1 BPS ] [ d SEC ] m2 BPS\n"
+		"SC := [ [ m1 BPS ] d SEC ] m2 BPS\n"
 		"\n"
 		" m1 : slope of first segment\n"
 		" d  : x-coordinate of intersection\n"
@@ -57,6 +57,10 @@ explain_class(void)
 		" dmax : maximum delay\n"
 		" rate : rate\n"
 		"\n"
+		"Remarks:\n"
+		" - at least one of 'rt', 'ls' or 'sc' must be specified\n"
+		" - 'ul' can only be specified with 'ls' or 'sc'\n"
+		"\n"
 	);
 }