Message ID: 1335189109-4871-7-git-send-email-jiang.liu@huawei.com
State: Not Applicable, archived
On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu <liuj97@gmail.com> wrote:
> Enhance network subsystem to correctly update DMA channel reference counts,
> so it won't break DMA device hotplug logic.
>
> Signed-off-by: Jiang Liu <liuj97@gmail.com>

This introduces an atomic action on every channel touch, which is more
expensive than what we had previously. There has always been a concern
about the overhead of offload that sometimes makes it ineffective, or a
loss compared to cpu copies. In the cases where net_dma shows an
improvement, this will eat into, or maybe eliminate, that advantage.

Take a look at where dmaengine started [1]. It was from the beginning
going through contortions to avoid something like this. We made it
simpler here [2], but still kept the principle of not dirtying a shared
cacheline on every channel touch, and certainly not locking it.

If you are going to hotplug the entire IOH, then you are probably ok
with network links going down, so could you just down the links and
remove the driver with the existing code?

--
Dan

[1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
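[Editorial aside: the cacheline concern Dan raises can be modeled in user space. The sketch below is illustrative only, not dmaengine code; every name in it is made up. A shared atomic refcount forces all cpus (here: threads) to bounce one cacheline, while per-cpu slots padded to a cacheline each stay in local cache. Both schemes observe the same number of references; only the sharing pattern differs.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { NTHREADS = 4, ITERS = 100000 };

static atomic_long shared_refs;		/* dirtied by every thread: bounced cacheline */

struct padded_count {
	long v;
	char pad[64 - sizeof(long)];	/* pad to a cacheline to avoid false sharing */
};
static struct padded_count percpu_refs[NTHREADS];

static void *take_refs(void *p)
{
	int id = (int)(size_t)p;

	for (int i = 0; i < ITERS; i++) {
		atomic_fetch_add(&shared_refs, 1);	/* contended update */
		percpu_refs[id].v++;			/* uncontended, thread-local */
	}
	return NULL;
}

/* Run all workers and return the summed per-thread count; the shared
 * atomic total is left in shared_refs for comparison. */
long run_workers(void)
{
	pthread_t t[NTHREADS];
	long sum = 0;

	for (size_t i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, take_refs, (void *)i);
	for (size_t i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (size_t i = 0; i < NTHREADS; i++)
		sum += percpu_refs[i].v;
	return sum;
}
```

Both totals come out equal; the difference Dan is pointing at is purely in how the hot path scales under contention.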
Hi Dan,
	Thanks for your great comments!
	gerry

On 2012-4-24 2:30, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu<liuj97@gmail.com> wrote:
>> Enhance network subsystem to correctly update DMA channel reference counts,
>> so it won't break DMA device hotplug logic.
>>
>> Signed-off-by: Jiang Liu<liuj97@gmail.com>
>
> This introduces an atomic action on every channel touch, which is more
> expensive than what we had previously. There has always been a
> concern about the overhead of offload that sometimes makes ineffective
> or a loss compared to cpu copies. In the cases where net_dma shows
> improvement this will eat into / maybe eliminate that advantage.
Good point, we should avoid polluting a shared cacheline here, otherwise
it may eat the benefits of IOAT acceleration.

> Take a look at where dmaengine started [1]. It was from the beginning
> going through contortions to avoid something like this. We made it
> simpler here [2], but still kept the principle of not dirtying a
> shared cacheline on every channel touch, and certainly not locking it.
Thanks for the great background information, especially the second one.
Its check-in log message is quoted below.

> Why?, beyond reducing complication:
> 1/ Tracking reference counts per-transaction in an efficient manner, as
>    is currently done, requires a complicated scheme to avoid cache-line
>    bouncing effects.
The real issue here is polluting shared cachelines, right? Would it help
to use a percpu counter instead of atomic operations here? I will have a
try with a percpu counter for the reference count. BTW, do you have any
DMAEngine benchmarks we could use to compare the performance difference?

> 2/ Per-transaction ref-counting gives the false impression that a
>    dma-driver can be gracefully removed ahead of its user (net, md, or
>    dma-slave)
> 3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but
Seems the situation has changed now:) The Intel 7500 (Boxboro) chipset
supports hotplug, and we are working on a system which adopts the
Boxboro chipset and supports node hotplug. So we are trying to enhance
the DMAEngine to support IOAT hotplug. On the other hand, Intel's next
generation processor, Ivybridge, has an embedded IOH, so we also need to
support IOH/IOAT hotplug when supporting processor hotplug.

>    if such an engine were built one day we still would not need to notify
>    clients of remove events. The driver can simply return NULL to a
>    ->prep() request, something that is much easier for a client to handle.
Could you please give more explanation of "The driver can simply return
NULL to a ->prep() request"? I haven't quite gotten the idea yet.

> If you are going to hotplug the entire IOH, then you are probably ok
> with network links going down, so could you just down the links and
> remove the driver with the existing code?
I feel it's a little risky to shut down/restart all network interfaces
for hot-removal of an IOH, as that may disturb applications. And there
are also other kinds of clients, such as ASYNC_TX; it seems we can't
adopt this method to reclaim DMA channels from the ASYNC_TX subsystem.

> --
> Dan
>
> [1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
> [2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu <jiang.liu@huawei.com> wrote:
>> If you are going to hotplug the entire IOH, then you are probably ok
>> with network links going down, so could you just down the links and
>> remove the driver with the existing code?
>
> I feel it's a little risky to shut down/restart all network interfaces
> for hot-removal of IOH, that may disturb the applications.

I guess I'm confused... wouldn't the removal of an entire domain of pci
devices disturb userspace applications?

> And there
> are also other kinds of clients, such as ASYNC_TX, seems we can't
> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.

I say handle this like block device hotplug. I.e. the driver stays
loaded but the channel is put into an 'offline' state. So the driver
hides the fact that the hardware went away. Similar to how you can
remove a disk but /dev/sda sticks around until the last reference is
gone (and the driver 'sd' sticks around until all block devices are
gone).

I expect the work will be in making sure existing clients are prepared
to handle NULL returns from ->device_prep_dma_*. In some cases the
channel is treated more like a cpu, so a NULL return from
->device_prep_dma_memcpy() has been interpreted as "device is
temporarily busy, it is safe to try again". We would need to change
that to a permanent indication that the device is gone and not attempt
a retry.

--
Dan
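[Editorial aside: the client-side change Dan describes can be sketched in user space. All names below are made up stand-ins, not real dmaengine symbols. The point is that a NULL return from ->prep() must be split into two cases: "busy, safe to retry" and "device gone, fall back to cpu copies permanently".]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum chan_state { CHAN_ONLINE, CHAN_BUSY, CHAN_OFFLINE };

struct model_chan {
	enum chan_state state;
};

/* Stand-in for ->device_prep_dma_memcpy(): a non-NULL descriptor on
 * success, NULL when nothing can be prepared. */
static void *model_prep_memcpy(struct model_chan *c)
{
	static int desc;	/* dummy descriptor storage */
	return c->state == CHAN_ONLINE ? &desc : NULL;
}

/* Client copy path: decide between offload, retry, and permanent cpu
 * fallback based on why prep failed. */
const char *client_copy(struct model_chan *c, void *dst,
			const void *src, size_t len)
{
	if (model_prep_memcpy(c)) {
		memcpy(dst, src, len);	/* a real client would submit here */
		return "offload";
	}
	if (c->state == CHAN_BUSY)
		return "retry";		/* transient: try again later */
	memcpy(dst, src, len);		/* channel offlined: cpu copy forever */
	return "cpu";
}
```

In today's clients only the "retry" branch exists; teaching them the third branch is the work Dan anticipates.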
On 2012-4-24 11:09, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu<jiang.liu@huawei.com> wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok
>>> with network links going down, so could you just down the links and
>>> remove the driver with the existing code?
>>
>> I feel it's a little risky to shut down/restart all network interfaces
>> for hot-removal of IOH, that may disturb the applications.
>
> I guess I'm confused... wouldn't the removal of an entire domain of
> pci devices disturb userspace applications?
Here I mean removing an IOH shouldn't affect devices under other IOHs if
possible. With the current dmaengine implementation, a DMA device/channel
may be used by clients in other PCI domains. So to safely remove a DMA
device, we need to bring dmaengine_ref_count down to zero by stopping
all DMA clients. For network, that means we need to stop all network
interfaces, which seems a little heavy:)

>> And there
>> are also other kinds of clients, such as ASYNC_TX, seems we can't
>> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.
>
> I say handle this like block device hotplug. I.e. the driver stays
> loaded but the channel is put into an 'offline' state. So the driver
> hides the fact that the hardware went away. Similar to how you can
> remove a disk but /dev/sda sticks around until the last reference is
> gone (and the driver 'sd' sticks around until all block devices are
> gone).
Per my understanding, this mechanism can stop the driver from accessing
a surprise-removed device, but it still needs a reference count
mechanism to eventually finish the driver unbinding operation. For IOH
hotplug, we need to wait for the completion of the driver unbinding
operations before destroying the PCI device nodes of the IOAT, so we
still need a reference count to track channel usage.
Another way is to notify all clients to release all channels when IOAT
device hotplug happens, but that may need heavy modification to the DMA
clients.

> I expect the work will be in making sure existing clients are prepared
> to handle NULL returns from ->device_prep_dma_*. In some cases the
> channel is treated more like a cpu, so a NULL return from
> ->device_prep_dma_memcpy() has been interpreted as "device is
> temporarily busy, it is safe to try again". We would need to change
> that to a permanent indication that the device is gone and not attempt
> retry.
Yes, some ASYNC_TX clients interpret a NULL return as EBUSY and keep
retrying when doing context-aware computations. I will try to
investigate in this direction.

> --
> Dan
Hi Dan,
	Thanks for your great comments about the performance penalty issue.
I'm trying to refine the implementation to reduce the penalty caused by
the hotplug logic. If the algorithm works correctly, the optimized hot
path code will be:
------------------------------------------------------------------------------
struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type)
{
	struct dma_chan *chan = this_cpu_read(channel_table[tx_type]->chan);

	this_cpu_inc(dmaengine_chan_ref_count);
	if (static_key_false(&dmaengine_quiesce))
		chan = NULL;

	return chan;
}
EXPORT_SYMBOL(dma_find_channel);

struct dma_chan *dma_get_channel(struct dma_chan *chan)
{
	if (static_key_false(&dmaengine_quiesce))
		atomic_inc(&dmaengine_dirty);
	this_cpu_inc(dmaengine_chan_ref_count);

	return chan;
}
EXPORT_SYMBOL(dma_get_channel);

void dma_put_channel(struct dma_chan *chan)
{
	this_cpu_dec(dmaengine_chan_ref_count);
}
EXPORT_SYMBOL(dma_put_channel);
-----------------------------------------------------------------------------
The disassembled code is:
(gdb) disassemble dma_find_channel
Dump of assembler code for function dma_find_channel:
0x0000000000000000 <+0>: push %rbp
0x0000000000000001 <+1>: mov %rsp,%rbp
0x0000000000000004 <+4>: callq 0x9 <dma_find_channel+9>
0x0000000000000009 <+9>: mov %edi,%edi
0x000000000000000b <+11>: mov 0x0(,%rdi,8),%rax
0x0000000000000013 <+19>: mov %gs:(%rax),%rax
0x0000000000000017 <+23>: incq %gs:0x0 //overhead: this_cpu_inc(dmaengine_chan_ref_count)
0x0000000000000020 <+32>: jmpq 0x25 <dma_find_channel+37> //overhead: if (static_key_false(&dmaengine_quiesce)), will be replaced as NOP by jump label
0x0000000000000025 <+37>: pop %rbp
0x0000000000000026 <+38>: retq
0x0000000000000027 <+39>: nopw 0x0(%rax,%rax,1)
0x0000000000000030 <+48>: xor %eax,%eax
0x0000000000000032 <+50>: pop %rbp
0x0000000000000033 <+51>: retq
End of assembler dump.
(gdb) disassemble dma_put_channel // overhead: to decrease channel reference count, 6 instructions
Dump of assembler code for function dma_put_channel:
0x0000000000000070 <+0>: push %rbp
0x0000000000000071 <+1>: mov %rsp,%rbp
0x0000000000000074 <+4>: callq 0x79 <dma_put_channel+9>
0x0000000000000079 <+9>: decq %gs:0x0
0x0000000000000082 <+18>: pop %rbp
0x0000000000000083 <+19>: retq
End of assembler dump.
(gdb) disassemble dma_get_channel
Dump of assembler code for function dma_get_channel:
0x0000000000000040 <+0>: push %rbp
0x0000000000000041 <+1>: mov %rsp,%rbp
0x0000000000000044 <+4>: callq 0x49 <dma_get_channel+9>
0x0000000000000049 <+9>: mov %rdi,%rax
0x000000000000004c <+12>: jmpq 0x51 <dma_get_channel+17>
0x0000000000000051 <+17>: incq %gs:0x0
0x000000000000005a <+26>: pop %rbp
0x000000000000005b <+27>: retq
0x000000000000005c <+28>: nopl 0x0(%rax)
0x0000000000000060 <+32>: lock incl 0x0(%rip) # 0x67 <dma_get_channel+39>
0x0000000000000067 <+39>: jmp 0x51 <dma_get_channel+17>
End of assembler dump.
So for a typical dma_find_channel()/dma_put_channel() pair, the total
overhead is about 10 instructions and two percpu (local) memory updates,
and there's no shared cacheline pollution any more. Is this acceptable
if the algorithm works as expected? I will test the code tomorrow.
For typical systems which don't support DMA device hotplug, the overhead
could be completely removed by conditional compilation.
Any comments are welcome!
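[Editorial aside: the quiesce/drain side of the scheme above can be modeled in user space. The sketch below uses made-up names; the jump label is modeled as a plain atomic flag, and, as a simplification, this version takes no reference when it refuses a channel, unlike the kernel sketch above which increments unconditionally.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { NR_CPUS = 4 };

static long chan_refs[NR_CPUS];		/* per-cpu, never shared between cpus */
static atomic_bool quiesce;		/* flipped only at hotplug time */

/* Hot path: refuse new references once quiesce is set; otherwise take a
 * purely cpu-local reference and hand out the channel. */
void *model_find_channel(int cpu, void *chan)
{
	if (atomic_load(&quiesce))
		return NULL;
	chan_refs[cpu]++;
	return chan;
}

void model_put_channel(int cpu)
{
	chan_refs[cpu]--;
}

/* Hotplug path: after setting quiesce, poll the summed per-cpu counters
 * until they drain to zero; only then can the device be torn down. */
bool model_refs_drained(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += chan_refs[cpu];
	return sum == 0;
}
```

The design choice being illustrated: the expensive summation walk happens only on the cold hotplug path, so the hot path never touches a shared counter.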
Thanks!
--gerry
On 04/24/2012 11:09 AM, Dan Williams wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok
diff --git a/include/net/netdma.h b/include/net/netdma.h
index 8ba8ce2..6d71724 100644
--- a/include/net/netdma.h
+++ b/include/net/netdma.h
@@ -24,6 +24,32 @@
 #include <linux/dmaengine.h>
 #include <linux/skbuff.h>
 
+static inline bool
+net_dma_capable(void)
+{
+	struct dma_chan *chan = net_dma_find_channel();
+	dma_put_channel(chan);
+
+	return !!chan;
+}
+
+static inline struct dma_chan *
+net_dma_get_channel(struct tcp_sock *tp)
+{
+	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+		tp->ucopy.dma_chan = net_dma_find_channel();
+	return tp->ucopy.dma_chan;
+}
+
+static inline void
+net_dma_put_channel(struct tcp_sock *tp)
+{
+	if (tp->ucopy.dma_chan) {
+		dma_put_channel(tp->ucopy.dma_chan);
+		tp->ucopy.dma_chan = NULL;
+	}
+}
+
 int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
 		struct sk_buff *skb, int offset, struct iovec *to,
 		size_t len, struct dma_pinned_list *pinned_list);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8bb6ade..aea4032 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1451,8 +1451,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
-		    net_dma_find_channel()) {
+		    !sysctl_tcp_low_latency && net_dma_capable()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
 					dma_pin_iovec_pages(msg->msg_iov, len);
@@ -1666,10 +1665,7 @@ do_prequeue:
 		if (!(flags & MSG_TRUNC)) {
 #ifdef CONFIG_NET_DMA
-			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-				tp->ucopy.dma_chan = net_dma_find_channel();
-
-			if (tp->ucopy.dma_chan) {
+			if (net_dma_get_channel(tp)) {
 				tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
 					tp->ucopy.dma_chan, skb, offset,
 					msg->msg_iov, used,
@@ -1758,7 +1754,7 @@ skip_copy:
 #ifdef CONFIG_NET_DMA
 	tcp_service_net_dma(sk, true);	/* Wait for queue to drain */
-	tp->ucopy.dma_chan = NULL;
+	net_dma_put_channel(tp);
 
 	if (tp->ucopy.pinned_list) {
 		dma_unpin_iovec_pages(tp->ucopy.pinned_list);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9944c1d..3878916 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5227,10 +5227,7 @@ static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 	if (tp->ucopy.wakeup)
 		return 0;
 
-	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-		tp->ucopy.dma_chan = net_dma_find_channel();
-
-	if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
+	if (net_dma_get_channel(tp) && skb_csum_unnecessary(skb)) {
 		dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
 							 skb, hlen,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0cb86ce..90ea1c0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1729,9 +1729,7 @@ process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v4_do_rcv(sk, skb);
 		else
 #endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 86cfe60..fb81bbd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1644,9 +1644,7 @@ process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v6_do_rcv(sk, skb);
 		else
 #endif
Enhance network subsystem to correctly update DMA channel reference
counts, so it won't break DMA device hotplug logic.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
---
 include/net/netdma.h |   26 ++++++++++++++++++++++++++
 net/ipv4/tcp.c       |   10 +++-------
 net/ipv4/tcp_input.c |    5 +----
 net/ipv4/tcp_ipv4.c  |    4 +---
 net/ipv6/tcp_ipv6.c  |    4 +---
 5 files changed, 32 insertions(+), 17 deletions(-)
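[Editorial aside: the netdma.h helpers in the patch follow a get/put pattern that can be modeled in user space. The sketch below uses simplified stand-in names and fields, not the real tcp_sock/dma_chan structures: a per-socket slot caches the channel reference, get acquires it lazily, and put drops it exactly once and clears the slot.]

```c
#include <assert.h>
#include <stddef.h>

struct model_chan {
	int refs;
};

static struct model_chan the_chan;	/* stands in for the channel table */

struct model_sock {
	struct model_chan *dma_chan;	/* cached reference, like tp->ucopy.dma_chan */
	int pinned;			/* like tp->ucopy.pinned_list */
};

/* Stand-in for net_dma_find_channel(): looking up a channel takes a
 * reference on it. */
static struct model_chan *model_find_channel(void)
{
	the_chan.refs++;
	return &the_chan;
}

/* Lazily acquire and cache a channel for this socket, as the patch's
 * net_dma_get_channel() does. */
struct model_chan *model_get_channel(struct model_sock *sk)
{
	if (!sk->dma_chan && sk->pinned)
		sk->dma_chan = model_find_channel();
	return sk->dma_chan;
}

/* Release the cached reference; clearing the slot makes repeated puts
 * harmless, mirroring net_dma_put_channel(). */
void model_put_channel(struct model_sock *sk)
{
	if (sk->dma_chan) {
		sk->dma_chan->refs--;
		sk->dma_chan = NULL;
	}
}
```

The point of the patch is exactly this pairing: every cached `dma_chan` pointer in the TCP paths now has a matching put, so the hotplug drain logic can see the reference count reach zero.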