[RFC] net: lls epoll support

Message ID 51C1993C.9030204@linux.intel.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eliezer Tamir June 19, 2013, 11:42 a.m. UTC
This is a wild hack, just a POC to show the power of LLS with epoll.

We assume that we only ever need to poll on one device queue,
so the first FD that reports POLL_LL gets saved aside so we can poll on it.

While this assumption is wrong in so many ways, it's very easy to 
satisfy with a micro-benchmark.
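
In essence (condensed from the diff below): ep_item_poll() remembers the
first file whose poll() reports POLLLLS, and ep_poll() busy-polls that one
file before going to sleep:

	/* ep_item_poll(): stash the first LLS-capable item */
	if (events & POLLLLS) {
		events &= ~POLLLLS;
		ep->ll_epi = epi;
	}

	/* ep_poll(): spin on it while the busy-poll budget allows */
	while (can_ll && can_poll_ll(ll_time) && !ep_events_available(ep))
		ep_item_poll_ll(ep->ll_epi);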

[this patch needs the poll patch to be applied first]
with sockperf doing epoll on 1000 sockets I see an avg latency of 6us

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---

  fs/eventpoll.c |   39 +++++++++++++++++++++++++++++++++------
  1 files changed, 33 insertions(+), 6 deletions(-)

*head,
  			       void *priv)
  {
@@ -789,7 +808,7 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
  	init_poll_funcptr(&pt, NULL);

  	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (ep_item_poll(epi, &pt))
+		if (ep_item_poll(epi, &pt, ep))
  			return POLLIN | POLLRDNORM;
  		else {
  			/*
@@ -1271,7 +1290,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
  	 * this operation completes, the poll callback can start hitting
  	 * the new item.
  	 */
-	revents = ep_item_poll(epi, &epq.pt);
+	revents = ep_item_poll(epi, &epq.pt, ep);

  	/*
  	 * We have to check if something went wrong during the poll wait queue
@@ -1403,7 +1422,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
  	 * Get current event bits. We can safely use the file* here because
  	 * its usage count has been increased by the caller of this function.
  	 */
-	revents = ep_item_poll(epi, &pt);
+	revents = ep_item_poll(epi, &pt, ep);

  	/*
  	 * If the item is "hot" and it is not registered inside the ready
@@ -1471,7 +1490,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,

  		list_del_init(&epi->rdllink);

-		revents = ep_item_poll(epi, &pt);
+		revents = ep_item_poll(epi, &pt, ep);

  		/*
  		 * If the event mask intersect the caller-requested one,
@@ -1558,6 +1577,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
  	long slack = 0;
  	wait_queue_t wait;
  	ktime_t expires, *to = NULL;
+	cycles_t ll_time = ll_end_time();
+	//bool try_ll = true;
+	bool can_ll = !!ep->ll_epi;
+

  	if (timeout > 0) {
  		struct timespec end_time = ep_set_mstimeout(timeout);
@@ -1601,6 +1624,10 @@ fetch_events:
  				break;
  			}

+			while (can_ll && can_poll_ll(ll_time)
+					&& !ep_events_available(ep))
+				ep_item_poll_ll(ep->ll_epi);
+
  			spin_unlock_irqrestore(&ep->lock, flags);
  			if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
  				timed_out = 1;



Comments

yaniv saar June 25, 2013, 2:26 p.m. UTC | #1
On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir
<eliezer.tamir@linux.intel.com> wrote:
> This is a wild hack, just as a POC to show the power or LLS with epoll.
>
> We assume that we only ever need to poll on one device queue,
> so the first FD that reports POLL_LL gets saved aside so we can poll on.
>
> While this assumption is wrong in so many ways, it's very easy to satisfy
> with a micro-benchmark.
>
> [this patch needs the poll patch to be applied first]
> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us
>

hi eliezer,

please consider the following solution for epoll that is based on
polling dev+queue.
instead of looping over the sockets as in LLS, maintain in the eventpoll
struct a list of device+queues (qdlist).
the qdlist must be unique w.r.t. device+queue (no two identical
device+queue items in the qdlist).
each device+queue item (qditem) holds:
* device (id)
* queue (id)
* list of epi (epilist) that created this qditem
- I think it won't be possible to extend epitem (it would break the
128-byte cache alignment)... instead you can keep a simple ll_usec list.
* ll_usec, the maximum poll time over all the referring epi items.
finally, polling should iterate over the qdlist once, and then check for events.
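
a rough sketch of the data structures this adds (names and fields are
illustrative only, there is no patch behind them):

	struct qditem {
		struct list_head  link;     /* entry in eventpoll->qdlist            */
		int               dev;      /* device id                             */
		int               queue;    /* queue id on that device               */
		struct list_head  epilist;  /* epi items referring to this dev+queue */
		unsigned long     ll_usec;  /* max poll time over the epilist        */
	};

	/* added to struct eventpoll */
	struct list_head  qdlist;           /* unique device+queue items */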

----
as far as coding goes, this sketch involves:
1) adjust eventpoll struct.
2) initialize on creation (epoll_create)
3) update the list on modification (epoll_ctl)
3.1) ep_insert->add this epi/ll_usec to the relevant qditem (or create a
new one), and update qditem->ll_usec
3.2) ep_remove->remove this epi/ll_usec from the relevant qditem (which MUST
exist -- sort of ref counting), and update qditem->ll_usec
3.3) ep_modify->...
4) on polling for events (epoll_wait)
ep_poll->if the qdlist is not empty, then find the maximum ll_usec (could
be done while maintaining...)
... and just before going into wait ...
if max ll_usec!=0
    poll once on all device+queues in the qdlist (see the sketch after this list).
    continue to the next iteration (check events).
5) to support this flow we also need to implement API for
5.1) given a file/fd/epi, if it is a sock then get the device+queue.
5.2) poll over a given device+queue (dq_poll_ll) once.
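
a rough sketch of the ep_poll side (dq_poll_ll is the proposed helper
from 5.2 and max_ll_usec the value from 4; neither exists in the tree):

	/* illustrative only -- poll each unique device+queue once,
	 * then fall through and re-check ep_events_available(ep) */
	if (!list_empty(&ep->qdlist) && max_ll_usec) {
		struct qditem *qd;

		list_for_each_entry(qd, &ep->qdlist, link)
			dq_poll_ll(qd->dev, qd->queue);
	}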
Eliezer Tamir June 25, 2013, 3:34 p.m. UTC | #2
On 25/06/2013 17:26, yaniv saar wrote:
> On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir
> <eliezer.tamir@linux.intel.com> wrote:
>>
>> [this patch needs the poll patch to be applied first]
>> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us
>>
>
> hi eliezer,
>
> please consider the following solution for epoll that is based on
> polling dev+queue.
> instead of looping over the socket as in LLS, maintain in eventpool
> struct a list of device+queues (qdlist).

Thanks for looking into this.
I'm currently working on a solution that has a lot in common with what
you are proposing.

We don't need a new id mechanism; we already have the napi_id.
The nice thing about the napi_id is that the only locking it needs
is an rcu_read_lock when dereferencing.

We don't need to remember the ll_usec value of each socket because
the patch for select/poll (currently waiting for review) added
a separate sysctl value for poll.

I would like to find a way for the user to specify how long to busy-wait
directly from the system call, but I was not able to find a simple way
of adding this without a change to the system call prototype.

We do, however, need to track when a socket's napi_id changes.
But for that we can hook into sk_mark_ll().

so here is a list of proposed changes:

1. add a linked list of unique napi_id's to struct eventpoll.
each id will have a collision list of sockets that have the same id.
- a hash is gratuitous; we expect the unique list to have 0 to 2
  elements in most cases.

2. when a new socket is added, if its id is new it gets added to the 
unique list, otherwise to the collision list of that id.

3. when a socket is removed, if it's on the unique list, replace it
with the first on its former collision list.

4. add a callback mechanism to sk_mark_ll() which will be activated when
  the mark changes, to update the lists.
(a socket may be polled by more than one epoll, so be careful)

5. add to and remove from the lists in ep_insert and ep_remove
  respectively. check if we need to do something for ep_modify().

6. add an ep_poll helper that will round-robin poll the
files in the unique list.

7. init everything from epoll_create.
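
a rough sketch of the lists from points 1-3 (names are illustrative,
this is not an actual patch):

	struct ep_napi_entry {
		struct list_head  link;       /* entry in eventpoll->napi_list      */
		unsigned int      napi_id;    /* unique napi_id                     */
		struct list_head  collisions; /* other epitems sharing this napi_id */
		struct epitem    *epi;        /* the item we busy-poll on           */
	};

	/* added to struct eventpoll */
	struct list_head  napi_list;          /* unique napi_id entries, usually 0-2 */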

locking:

napi_id's are great since they don't need locking except for an
rcu_read_lock when polling on one.

the lists need a spinlock for adding/removing; maybe they
can use ep->lock.

callback registration/removal needs to use the same mechanism that
ep_add / ep_remove use to protect themselves from the rest of epoll.
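
a rough sketch of the helper from point 6 (the function name is made up;
it reuses ep_item_poll_ll() from the RFC patch, and only the rcu_read_lock()
follows the locking notes above -- list protection details are hand-waved):

	static void ep_ll_round_robin(struct eventpoll *ep)
	{
		struct ep_napi_entry *ne;

		/* poll one file per unique napi_id; repeated calls from the
		 * ep_poll() wait loop give the round-robin effect */
		rcu_read_lock();
		list_for_each_entry(ne, &ep->napi_list, link)
			ep_item_poll_ll(ne->epi);
		rcu_read_unlock();
	}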



Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index deecc72..3c7562b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -41,6 +41,7 @@ 
  #include <linux/proc_fs.h>
  #include <linux/seq_file.h>
  #include <linux/compat.h>
+#include <net/ll_poll.h>

  /*
   * LOCKING:
@@ -214,6 +215,7 @@  struct eventpoll {
  	/* used to optimize loop detection check */
  	int visited;
  	struct list_head visited_list_link;
+	struct epitem *ll_epi;
  };

  /* Wait structure used by the poll hooks */
@@ -773,13 +775,30 @@  static int ep_eventpoll_release(struct inode *inode, struct file *file)
  	return 0;
  }

-static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
+static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt, struct eventpoll *ep)
  {
+	unsigned int events = epi->ffd.file->f_op->poll(epi->ffd.file, pt);
  	pt->_key = epi->event.events;

-	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
+	if (events & POLLLLS) {
+		events &= ~POLLLLS;
+		ep->ll_epi = epi;
+	}
+
+	return events & epi->event.events;
+}
+
+static inline bool ep_item_poll_ll(struct epitem *epi)
+{
+	poll_table wait;
+
+	wait._key = POLLLLS;
+	wait._qproc = NULL;
+
+	return epi->ffd.file->f_op->poll(epi->ffd.file, &wait);
  }

+
  static int ep_read_events_proc(struct eventpoll *ep, struct list_head