[v1,6/8] dmaengine: enhance network subsystem to support DMA device hotplug

Message ID 1335189109-4871-7-git-send-email-jiang.liu@huawei.com
State Not Applicable, archived

Commit Message

Jiang Liu April 23, 2012, 1:51 p.m. UTC
Enhance network subsystem to correctly update DMA channel reference counts,
so it won't break DMA device hotplug logic.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
---
 include/net/netdma.h |   26 ++++++++++++++++++++++++++
 net/ipv4/tcp.c       |   10 +++-------
 net/ipv4/tcp_input.c |    5 +----
 net/ipv4/tcp_ipv4.c  |    4 +---
 net/ipv6/tcp_ipv6.c  |    4 +---
 5 files changed, 32 insertions(+), 17 deletions(-)

Comments

Dan Williams April 23, 2012, 6:30 p.m. UTC | #1
On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu <liuj97@gmail.com> wrote:
> Enhance network subsystem to correctly update DMA channel reference counts,
> so it won't break DMA device hotplug logic.
>
> Signed-off-by: Jiang Liu <liuj97@gmail.com>

This introduces an atomic action on every channel touch, which is more
expensive than what we had previously.  There has always been a
concern about the overhead of offload that sometimes makes it ineffective
or a loss compared to cpu copies.  In the cases where net_dma shows
improvement this will eat into / maybe eliminate that advantage.

Take a look at where dmaengine started [1].  It was from the beginning
going through contortions to avoid something like this.  We made it
simpler here [2], but still kept the principle of not dirtying a
shared cacheline on every channel touch, and certainly not locking it.
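
To illustrate the concern, here is a minimal userspace sketch (not kernel
code) contrasting one shared atomic counter with per-slot counters that each
sit on their own cacheline. NR_SLOTS, CACHELINE and every function name are
assumptions made for illustration; in the kernel the per-slot variant would
presumably be expressed with percpu operations instead.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_SLOTS  4   /* stand-in for the number of CPUs (assumption) */
#define CACHELINE 64  /* assumed cacheline size in bytes */

/* Variant 1: a single counter shared by all CPUs. Every get/put is an
 * atomic read-modify-write on the same cacheline. */
static atomic_long shared_ref;

static void shared_get(void) { atomic_fetch_add(&shared_ref, 1); }
static void shared_put(void) { atomic_fetch_sub(&shared_ref, 1); }

/* Variant 2: one plain counter per slot, each aligned to its own
 * cacheline so that an update never touches another slot's line.
 * Each slot is meant to be updated only by its owning CPU. */
struct slot_ref {
        _Alignas(CACHELINE) long count;
};
static struct slot_ref slot_refs[NR_SLOTS];

static void slot_get(unsigned int slot)
{
        assert(slot < NR_SLOTS);
        slot_refs[slot].count++;
}

static void slot_put(unsigned int slot)
{
        assert(slot < NR_SLOTS);
        slot_refs[slot].count--;
}

/* Only the sum is meaningful: a reference may be taken on one slot and
 * dropped on another, so an individual slot can go negative. */
static long slot_total(void)
{
        long sum = 0;

        for (unsigned int i = 0; i < NR_SLOTS; i++)
                sum += slot_refs[i].count;
        return sum;
}
```

The trade-off is that reading the total becomes a walk over all slots, which
is acceptable if the total is only needed on a rare path such as removal.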

If you are going to hotplug the entire IOH, then you are probably ok
with network links going down, so could you just down the links and
remove the driver with the existing code?

--
Dan

[1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiang Liu April 24, 2012, 2:30 a.m. UTC | #2
Hi Dan,
	Thanks for your great comments!
	gerry
On 2012-4-24 2:30, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu<liuj97@gmail.com>  wrote:
>> Enhance network subsystem to correctly update DMA channel reference counts,
>> so it won't break DMA device hotplug logic.
>>
>> Signed-off-by: Jiang Liu<liuj97@gmail.com>
>
> This introduces an atomic action on every channel touch, which is more
> expensive than what we had previously.  There has always been a
> concern about the overhead of offload that sometimes makes ineffective
> or a loss compared to cpu copies.  In the cases where net_dma shows
> improvement this will eat into / maybe eliminate that advantage.
Good point, we should avoid polluting a shared cacheline here, otherwise
it may eat the benefits of IOAT acceleration.

>
> Take a look at where dmaengine started [1].  It was from the beginning
> going through contortions to avoid something like this.  We made it
> simpler here [2], but still kept the principle of not dirtying a
> shared cacheline on every channel touch, and certainly not locking it.
Thanks for the great background information, especially the second link.
Its commit log message is quoted below.
 >Why?, beyond reducing complication:
 >1/ Tracking reference counts per-transaction in an efficient manner, as
 >   is currently done, requires a complicated scheme to avoid cache-line
 >   bouncing effects.
The real issue here is polluting shared cachelines, right?
Would it help to use a percpu counter instead of atomic operations here?
I will try using a percpu counter for the reference count.
BTW, do you have any DMAEngine benchmarks we could use to
compare the performance difference?

 >2/ Per-transaction ref-counting gives the false impression that a
 >   dma-driver can be gracefully removed ahead of its user (net, md, or
 >   dma-slave)
 >3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but
It seems the situation has changed now:)
The Intel 7500 (Boxboro) chipset supports hotplug, and we are working on
a system which adopts the Boxboro chipset and supports node hotplug.
So we are trying to enhance DMAEngine to support IOAT hotplug.

On the other hand, Intel's next generation processor Ivybridge has
an embedded IOH, so we need to support IOH/IOAT hotplug when supporting
processor hotplug.

 >   if such an engine were built one day we still would not need to notify
 >   clients of remove events.  The driver can simply return NULL to a
 >   ->prep() request, something that is much easier for a client to handle.
Could you please explain a bit more about "The driver can
simply return NULL to a ->prep() request"? I haven't gotten the idea yet.

>
> If you are going to hotplug the entire IOH, then you are probably ok
> with network links going down, so could you just down the links and
> remove the driver with the existing code?
I feel it's a little risky to shut down/restart all network interfaces
for hot-removal of an IOH; that may disturb the applications. And there
are also other kinds of clients, such as ASYNC_TX; it seems we can't
adopt this method to reclaim DMA channels from the ASYNC_TX subsystem.

>
> --
> Dan
>
> [1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
> [2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
>
> .
>


Dan Williams April 24, 2012, 3:09 a.m. UTC | #3
On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu <jiang.liu@huawei.com> wrote:
>> If you are going to hotplug the entire IOH, then you are probably ok
>> with network links going down, so could you just down the links and
>> remove the driver with the existing code?
>
> I feel it's a little risky to shut down/restart all network interfaces
> for hot-removal of IOH, that may disturb the applications.

I guess I'm confused... wouldn't the removal of an entire domain of
pci devices disturb userspace applications?

> And there
> are also other kinds of clients, such as ASYNC_TX, seems we can't
> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.

I say handle this like block device hotplug.  I.e. the driver stays
loaded but the channel is put into an 'offline' state.  So the driver
hides the fact that the hardware went away.  Similar to how you can
remove a disk but /dev/sda sticks around until the last reference is
gone (and the driver 'sd' sticks around until all block devices are
gone).
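
The block-device analogy can be sketched as a tiny userspace model: the
object is marked offline when the hardware goes away, but it is only
released once the last holder drops its reference. The struct and function
names here are hypothetical and exist only for illustration; they are not
dmaengine or block-layer APIs.

```c
#include <assert.h>
#include <stdbool.h>

struct model_dev {
        int refs;       /* 1 for the hardware binding, +1 per user */
        bool offline;   /* hardware is gone; new requests must fail */
        bool released;  /* last reference dropped; safe to free */
};

static void model_dev_init(struct model_dev *d)
{
        d->refs = 1;
        d->offline = false;
        d->released = false;
}

static void model_dev_get(struct model_dev *d)
{
        assert(!d->released);
        d->refs++;
}

static void model_dev_put(struct model_dev *d)
{
        assert(d->refs > 0);
        if (--d->refs == 0)
                d->released = true;
}

/* Hot removal: flag the device offline and drop the binding's own
 * reference. Users that still hold references keep the object alive. */
static void model_dev_hot_remove(struct model_dev *d)
{
        d->offline = true;
        model_dev_put(d);
}

/* Returns 0 on success, -1 once the device is offline. */
static int model_dev_submit(const struct model_dev *d)
{
        return d->offline ? -1 : 0;
}
```

In this model a user that already holds a reference sees submissions start
to fail after removal, and the object is released when that user finally
calls model_dev_put().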

I expect the work will be in making sure existing clients are prepared
to handle NULL returns from ->device_prep_dma_*.  In some cases the
channel is treated more like a cpu, so a NULL return from
->device_prep_dma_memcpy() has been interpreted as "device is
temporarily busy, it is safe to try again".  We would need to change
that to a permanent indication that the device is gone and not attempt
retry.
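
A client that treats a NULL prep return as permanent might look roughly
like the userspace mock below: it tries the channel once and falls back to
a CPU copy instead of looping. mock_chan, mock_desc and both functions are
hypothetical stand-ins, not the kernel's struct dma_chan or descriptor
types, and the "DMA" path here just copies synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct mock_chan {
        bool offline;             /* set once the hardware has gone away */
        unsigned int prep_calls;  /* how often prep was attempted */
};

struct mock_desc {
        void *dst;
        const void *src;
        size_t len;
};

/* Models ->device_prep_dma_memcpy(): NULL once the channel is offline. */
static struct mock_desc *mock_prep_memcpy(struct mock_chan *chan,
                                          struct mock_desc *slot, void *dst,
                                          const void *src, size_t len)
{
        assert(chan != NULL);
        chan->prep_calls++;
        if (chan->offline)
                return NULL;
        slot->dst = dst;
        slot->src = src;
        slot->len = len;
        return slot;
}

enum copy_path { COPY_VIA_DMA, COPY_VIA_CPU };

static enum copy_path copy_with_fallback(struct mock_chan *chan, void *dst,
                                         const void *src, size_t len)
{
        struct mock_desc slot;
        struct mock_desc *desc = mock_prep_memcpy(chan, &slot, dst, src, len);

        if (desc) {
                /* A real engine completes asynchronously; the mock
                 * performs the copy immediately. */
                memcpy(desc->dst, desc->src, desc->len);
                return COPY_VIA_DMA;
        }
        /* NULL is treated as "device is gone": no retry, copy on the CPU. */
        memcpy(dst, src, len);
        return COPY_VIA_CPU;
}
```

The point of the sketch is only the control flow: exactly one prep attempt
per copy, with no retry loop behind the NULL check.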

--
Dan
Jiang Liu April 24, 2012, 3:56 a.m. UTC | #4
On 2012-4-24 11:09, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu<jiang.liu@huawei.com>  wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok
>>> with network links going down, so could you just down the links and
>>> remove the driver with the existing code?
>>
>> I feel it's a little risky to shut down/restart all network interfaces
>> for hot-removal of IOH, that may disturb the applications.
>
> I guess I'm confused... wouldn't the removal of an entire domain of
> pci devices disturb userspace applications?
Here I mean that removing an IOH shouldn't affect devices under other IOHs
if possible.
With the current dmaengine implementation, a DMA device/channel may be used
by clients in other PCI domains. So to safely remove a DMA device, we
need to bring dmaengine_ref_count back to zero by stopping all DMA clients.
For networking, that means we need to stop all network interfaces, which seems
a little heavy:)

>
>> And there
>> are also other kinds of clients, such as ASYNC_TX, seems we can't
>> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.
>
> I say handle this like block device hotplug.  I.e. the driver stays
> loaded but the channel is put into an 'offline' state.  So the driver
> hides the fact that the hardware went away.  Similar to how you can
> remove a disk but /dev/sda sticks around until the last reference is
> gone (and the driver 'sd' sticks around until all block devices are
> gone).
As I understand it, this mechanism could be used to stop the driver from
accessing surprise-removed devices, but it still needs a reference
count mechanism to eventually finish the driver unbinding operation.

For IOH hotplug, we need to wait for the driver unbinding operations to
complete before destroying the PCI device nodes of IOAT, so we still need a
reference count to track channel usage.

Another way is to notify all clients to release all channels when an IOAT
device hotplug event happens, but that may need heavy modification of the
DMA clients.

>
> I expect the work will be in making sure existing clients are prepared
> to handle NULL returns from ->device_prep_dma_*.  In some cases the
> channel is treated more like a cpu, so a NULL return from
> ->device_prep_dma_memcpy() has been interpreted as "device is
> temporarily busy, it is safe to try again".  We would need to change
> that to a permanent indication that the device is gone and not attempt
> retry.
Yes, some ASYNC_TX clients interpret a NULL return as EBUSY and keep
retrying when doing context aware computations. I will investigate
this direction.

>
> --
> Dan
>
> .
>


Jiang Liu April 25, 2012, 3:47 p.m. UTC | #5
Hi Dan,
	Thanks for your great comments about the performance penalty issue. I'm trying
to refine the implementation to reduce the penalty caused by the hotplug logic. If the algorithm works
correctly, the optimized hot path code will be:

------------------------------------------------------------------------------
struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type)
{
        struct dma_chan *chan = this_cpu_read(channel_table[tx_type]->chan);

        /* Per-cpu reference count, so no shared cacheline is dirtied. */
        this_cpu_inc(dmaengine_chan_ref_count);
        /* Jump label: patched to a NOP unless a quiesce is in progress. */
        if (static_key_false(&dmaengine_quiesce)) {
                chan = NULL;
        }

        return chan;
}
EXPORT_SYMBOL(dma_find_channel);

struct dma_chan *dma_get_channel(struct dma_chan *chan)
{
        /* Slow path only: note that a reference was taken during quiesce. */
        if (static_key_false(&dmaengine_quiesce))
                atomic_inc(&dmaengine_dirty);
        this_cpu_inc(dmaengine_chan_ref_count);

        return chan;
}
EXPORT_SYMBOL(dma_get_channel);

void dma_put_channel(struct dma_chan *chan)
{
        /* Drops the per-cpu count; chan itself is not dereferenced. */
        this_cpu_dec(dmaengine_chan_ref_count);
}
EXPORT_SYMBOL(dma_put_channel);
-----------------------------------------------------------------------------

The disassembled code is:
(gdb) disassemble dma_find_channel 
Dump of assembler code for function dma_find_channel:
   0x0000000000000000 <+0>:	push   %rbp
   0x0000000000000001 <+1>:	mov    %rsp,%rbp
   0x0000000000000004 <+4>:	callq  0x9 <dma_find_channel+9>
   0x0000000000000009 <+9>:	mov    %edi,%edi
   0x000000000000000b <+11>:	mov    0x0(,%rdi,8),%rax
   0x0000000000000013 <+19>:	mov    %gs:(%rax),%rax
   0x0000000000000017 <+23>:	incq   %gs:0x0				//overhead: this_cpu_inc(dmaengine_chan_ref_count)
   0x0000000000000020 <+32>:	jmpq   0x25 <dma_find_channel+37>	//overhead: if (static_key_false(&dmaengine_quiesce)), will be replaced as NOP by jump label
   0x0000000000000025 <+37>:	pop    %rbp
   0x0000000000000026 <+38>:	retq   
   0x0000000000000027 <+39>:	nopw   0x0(%rax,%rax,1)
   0x0000000000000030 <+48>:	xor    %eax,%eax
   0x0000000000000032 <+50>:	pop    %rbp
   0x0000000000000033 <+51>:	retq   
End of assembler dump.
(gdb) disassemble dma_put_channel 	// overhead: to decrease channel reference count, 6 instructions
Dump of assembler code for function dma_put_channel:
   0x0000000000000070 <+0>:	push   %rbp
   0x0000000000000071 <+1>:	mov    %rsp,%rbp
   0x0000000000000074 <+4>:	callq  0x79 <dma_put_channel+9>
   0x0000000000000079 <+9>:	decq   %gs:0x0
   0x0000000000000082 <+18>:	pop    %rbp
   0x0000000000000083 <+19>:	retq   
End of assembler dump.
(gdb) disassemble dma_get_channel 
Dump of assembler code for function dma_get_channel:
   0x0000000000000040 <+0>:	push   %rbp
   0x0000000000000041 <+1>:	mov    %rsp,%rbp
   0x0000000000000044 <+4>:	callq  0x49 <dma_get_channel+9>
   0x0000000000000049 <+9>:	mov    %rdi,%rax
   0x000000000000004c <+12>:	jmpq   0x51 <dma_get_channel+17>
   0x0000000000000051 <+17>:	incq   %gs:0x0
   0x000000000000005a <+26>:	pop    %rbp
   0x000000000000005b <+27>:	retq   
   0x000000000000005c <+28>:	nopl   0x0(%rax)
   0x0000000000000060 <+32>:	lock incl 0x0(%rip)        # 0x67 <dma_get_channel+39>
   0x0000000000000067 <+39>:	jmp    0x51 <dma_get_channel+17>
End of assembler dump.

So for a typical dma_find_channel()/dma_put_channel() pair, the total overhead
is about 10 instructions and two percpu (local) memory updates, and there's
no shared cacheline pollution any more. Is this acceptable if the algorithm
works as expected? I will test the code tomorrow.

For typical systems which don't support DMA device hotplug, the overhead
could be completely removed by conditional compilation.
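
For readers who want to play with the idea outside the kernel, below is a
single-threaded userspace model of the hot path above. The find/put
functions mirror the posted code; the removal side (model_try_remove) is
not shown anywhere in this mail and is purely an assumption about how a
remover might use the quiesce flag and the summed counters. All names are
hypothetical, and the cross-CPU synchronization a real implementation
would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 4  /* stand-in for the number of CPUs (assumption) */

struct model_chan { int id; };

static struct model_chan model_table_chan = { .id = 0 };
static long model_ref[NR_SLOTS];  /* models dmaengine_chan_ref_count */
static bool model_quiesce;        /* models the dmaengine_quiesce key */

/* As in the posted code, the count is taken before the quiesce check,
 * so callers pair every find with a put even when NULL is returned. */
static struct model_chan *model_find_channel(unsigned int slot)
{
        struct model_chan *chan = &model_table_chan;

        assert(slot < NR_SLOTS);
        model_ref[slot]++;
        if (model_quiesce)
                chan = NULL;
        return chan;
}

static void model_put_channel(unsigned int slot)
{
        assert(slot < NR_SLOTS);
        model_ref[slot]--;
}

static long model_total_refs(void)
{
        long sum = 0;

        for (unsigned int i = 0; i < NR_SLOTS; i++)
                sum += model_ref[i];
        return sum;
}

/* Assumed removal side: stop handing out channels, then report whether
 * all outstanding references have been dropped. */
static bool model_try_remove(void)
{
        model_quiesce = true;
        return model_total_refs() == 0;
}
```

In the model, a remover would call model_try_remove() repeatedly until it
returns true, while new lookups already see NULL.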

Any comments are welcome!

Thanks!
--gerry


On 04/24/2012 11:09 AM, Dan Williams wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok

Patch

diff --git a/include/net/netdma.h b/include/net/netdma.h
index 8ba8ce2..6d71724 100644
--- a/include/net/netdma.h
+++ b/include/net/netdma.h
@@ -24,6 +24,32 @@ 
 #include <linux/dmaengine.h>
 #include <linux/skbuff.h>
 
+static inline bool
+net_dma_capable(void)
+{
+	struct dma_chan *chan = net_dma_find_channel();
+	dma_put_channel(chan);
+
+	return !!chan;
+}
+
+static inline struct dma_chan *
+net_dma_get_channel(struct tcp_sock *tp)
+{
+	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+		tp->ucopy.dma_chan = net_dma_find_channel();
+	return tp->ucopy.dma_chan;
+}
+
+static inline void
+net_dma_put_channel(struct tcp_sock *tp)
+{
+	if (tp->ucopy.dma_chan) {
+		dma_put_channel(tp->ucopy.dma_chan);
+		tp->ucopy.dma_chan = NULL;
+	}
+}
+
 int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
 		struct sk_buff *skb, int offset, struct iovec *to,
 		size_t len, struct dma_pinned_list *pinned_list);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8bb6ade..aea4032 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1451,8 +1451,7 @@  int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
-		    net_dma_find_channel()) {
+		    !sysctl_tcp_low_latency && net_dma_capable()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
@@ -1666,10 +1665,7 @@  do_prequeue:
 
 		if (!(flags & MSG_TRUNC)) {
 #ifdef CONFIG_NET_DMA
-			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-				tp->ucopy.dma_chan = net_dma_find_channel();
-
-			if (tp->ucopy.dma_chan) {
+			if (net_dma_get_channel(tp)) {
 				tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
 					tp->ucopy.dma_chan, skb, offset,
 					msg->msg_iov, used,
@@ -1758,7 +1754,7 @@  skip_copy:
 
 #ifdef CONFIG_NET_DMA
 	tcp_service_net_dma(sk, true);  /* Wait for queue to drain */
-	tp->ucopy.dma_chan = NULL;
+	net_dma_put_channel(tp);
 
 	if (tp->ucopy.pinned_list) {
 		dma_unpin_iovec_pages(tp->ucopy.pinned_list);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9944c1d..3878916 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5227,10 +5227,7 @@  static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 	if (tp->ucopy.wakeup)
 		return 0;
 
-	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-		tp->ucopy.dma_chan = net_dma_find_channel();
-
-	if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
+	if (net_dma_get_channel(tp) && skb_csum_unnecessary(skb)) {
 
 		dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
 							 skb, hlen,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0cb86ce..90ea1c0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1729,9 +1729,7 @@  process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v4_do_rcv(sk, skb);
 		else
 #endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 86cfe60..fb81bbd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1644,9 +1644,7 @@  process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v6_do_rcv(sk, skb);
 		else
 #endif