Message ID | 1283708635.3402.100.camel@edumazet-laptop |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Mon, Sep 6, 2010 at 1:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit : >> af_packet uses tpacket_destruct_skb() to notify its user a frame is >> sent out through NIC, and the memory for that frame is available for >> the others. If the driver calls skb_orphan() before the frame is sent >> out successfully, and the user may fill other data into the space for >> this frame, this frame will be corrupted. It became more likely after >> skb_try_orphan() was added into dev_hard_start_xmit(). >> >> Am I correct? >> > > Yes good catch. We might add a : > > SKBTX_NO_EARLY_ORPHAN = 1 << 4, > > so that skb_orphan_try() do not early orphan this kind of skb > It may solve the issue of skb_orphan_try(), but some NICs still call skb_orphan(). Maybe replacing skb_orphan() with skb_orphan_try() can work around this issue. localhost linux # grep skb_orphan drivers/net/ -r drivers/net/can/dev.c: skb_orphan(skb); drivers/net/mlx4/en_tx.c: skb_orphan(skb); drivers/net/cxgb3/sge.c: skb_orphan(skb); drivers/net/cxgb4/sge.c: skb_orphan(skb); drivers/net/niu.c: skb_orphan(skb); drivers/net/cxgb4vf/sge.c: skb_orphan(skb); drivers/net/tun.c: skb_orphan(skb); drivers/net/virtio_net.c: skb_orphan(skb); drivers/net/loopback.c: skb_orphan(skb); drivers/net/wireless/mac80211_hwsim.c: skb_orphan(skb); drivers/net/wireless/libertas/tx.c: skb_orphan(skb);
On 05.09.2010 19:51, Changli Gao wrote: > On Mon, Sep 6, 2010 at 1:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit : >>> af_packet uses tpacket_destruct_skb() to notify its user a frame is >>> sent out through NIC, and the memory for that frame is available for >>> the others. If the driver calls skb_orphan() before the frame is sent >>> out successfully, and the user may fill other data into the space for >>> this frame, this frame will be corrupted. It became more likely after >>> skb_try_orphan() was added into dev_hard_start_xmit(). >>> >>> Am I correct? >>> >> >> Yes good catch. We might add a : >> >> SKBTX_NO_EARLY_ORPHAN = 1 << 4, >> >> so that skb_orphan_try() do not early orphan this kind of skb >> > > It may solve the issue of skb_orphan_try(), but some NICs still call > skb_orphan(). Maybe replacing skb_orphan() with skb_orphan_try() can > work around this issue. > Hm - i'm not really sure if skb_orphan_try() helps on this level. E.g. for drivers/net/can/dev.c the skb_orphan is used to handle the correct order of echo'ed skbs. Other drivers may do similar things. AFAIK skb_orphan_try() helps to increase the performance for usual network traffic, as it allows the orphan on a 'higher' level inside the networking stack. In some cases the skb needs to be untouched, and this can be indicated in the shared tx_flags. IMO skb_orphan_try() on driver level would break the correct behaviour in most cases. Regards, Oliver > localhost linux # grep skb_orphan drivers/net/ -r > drivers/net/can/dev.c: skb_orphan(skb); > drivers/net/mlx4/en_tx.c: skb_orphan(skb); > drivers/net/cxgb3/sge.c: skb_orphan(skb); > drivers/net/cxgb4/sge.c: skb_orphan(skb); > drivers/net/niu.c: skb_orphan(skb); > drivers/net/cxgb4vf/sge.c: skb_orphan(skb); > drivers/net/tun.c: skb_orphan(skb); > drivers/net/virtio_net.c: skb_orphan(skb); > drivers/net/loopback.c: skb_orphan(skb); > drivers/net/wireless/mac80211_hwsim.c: skb_orphan(skb); > drivers/net/wireless/libertas/tx.c: skb_orphan(skb); > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 05, 2010 at 07:43:55PM +0200, Eric Dumazet wrote: > Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit : > > af_packet uses tpacket_destruct_skb() to notify its user a frame is > > sent out through NIC, and the memory for that frame is available for > > the others. If the driver calls skb_orphan() before the frame is sent > > out successfully, and the user may fill other data into the space for > > this frame, this frame will be corrupted. It became more likely after > > skb_try_orphan() was added into dev_hard_start_xmit(). > > > > Am I correct? > > > > Yes good catch. We might add a : > > SKBTX_NO_EARLY_ORPHAN = 1 << 4, > > so that skb_orphan_try() do not early orphan this kind of skb > I think there are bigger issues here. As was pointed out, drivers might orphan skbs before they transmit them. And at least for tun, the reason is that we might hang on to skbs indefinitely because userspace is not reading them. So in that case, if you just prevent tun from orphaning skbs, the socket will be prevented from sending any more packets out even if they are for a completely unrelated destinations, right? Further, module can't get unloaded and I think socket can not get closed, so user can't kill the task which has the socket? And thinking about this, I think I see another issue related to the use of the destructor callback: static void tpacket_destruct_skb(struct sk_buff *skb) { struct packet_sock *po = pkt_sk(skb->sk); void *ph; BUG_ON(skb == NULL); if (likely(po->tx_ring.pg_vec)) { ph = skb_shinfo(skb)->destructor_arg; BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING); BUG_ON(atomic_read(&po->tx_ring.pending) == 0); atomic_dec(&po->tx_ring.pending); __packet_set_status(po, ph, TP_STATUS_AVAILABLE); } sock_wfree(skb); <----- at this point we still have to execute instructions in this function to return from it. However socket and thus module reference count got already dropped to 0, so I think module could get unloaded and these instructions could get overwritten. } I conclude that destructor callback should never point to a function residing in a module, always to a function that is guaranteed to be builtin, this function must be the one that drops the last module reference. Comments? > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index f900ffc..9c1a480 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -176,6 +176,9 @@ enum { > > /* ensure the originating sk reference is available on driver level */ > SKBTX_DRV_NEEDS_SK_REF = 1 << 3, > + > + /* dont early orphan this skb in skb_orphan_try() */ > + SKBTX_NO_EARLY_ORPHAN = 1 << 4, > }; > > /* This data is invariant across clones and lives at > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 3616f27..306795d 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1029,6 +1029,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) > } > > skb->destructor = tpacket_destruct_skb; > + skb_shinfo(skb)->tx_flags |= SKBTX_NO_EARLY_ORPHAN; > __packet_set_status(po, ph, TP_STATUS_SENDING); > atomic_inc(&po->tx_ring.pending); > > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Le lundi 06 septembre 2010 à 13:35 +0300, Michael S. Tsirkin a écrit : > I think there are bigger issues here. As was pointed out, drivers might > orphan skbs before they transmit them. > And at least for tun, the reason is that we might hang on > to skbs indefinitely because userspace is not reading them. > > So in that case, if you just prevent tun from orphaning skbs, the socket > will be prevented from sending any more packets out even if they are for > a completely unrelated destinations, right? > Further, module can't get unloaded and I think socket can not get > closed, so user can't kill the task which has the socket? > > And thinking about this, I think I see > another issue related to the use of the destructor callback: > > static void tpacket_destruct_skb(struct sk_buff *skb) > { > struct packet_sock *po = pkt_sk(skb->sk); > void *ph; > > BUG_ON(skb == NULL); > > if (likely(po->tx_ring.pg_vec)) { > ph = skb_shinfo(skb)->destructor_arg; > BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING); > BUG_ON(atomic_read(&po->tx_ring.pending) == 0); > atomic_dec(&po->tx_ring.pending); > __packet_set_status(po, ph, TP_STATUS_AVAILABLE); > } > > sock_wfree(skb); > > <----- > at this point we still have to execute instructions > in this function to return from it. However > socket and thus module reference count > got already dropped to 0, so I think module could get unloaded > and these instructions could get overwritten. > > } > > I conclude that destructor callback should never point to a function residing > in a module, always to a function that is guaranteed to be builtin, this > function must be the one that drops the last module reference. It would be a surprise to use tx mmap (presumably to get high performance), and a modular af_unix ;) > > Comments? The whole thing (packet / tx mmap) is broken, if you ask me. skb_orphan() is not about protecting data, but doing per socket memory accounting. We have to skb_orphan() while data is still in use by skb, not only in drivers but in core network stack. (loopback case for example, no need to think about TUN being special ;) ) So I believe using mmap and tx on af_unix is racy in its current design. We probably can remove some skb_orphan() calls (now its done in core network, no real need to make it from some drivers), but not have a complete solution to the problem Changli raised, without adding yet another field into skb_shared_info... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 06, 2010 at 05:44:27PM +0200, Eric Dumazet wrote: > Le lundi 06 septembre 2010 à 13:35 +0300, Michael S. Tsirkin a écrit : > > > I think there are bigger issues here. As was pointed out, drivers might > > orphan skbs before they transmit them. > > And at least for tun, the reason is that we might hang on > > to skbs indefinitely because userspace is not reading them. > > > > So in that case, if you just prevent tun from orphaning skbs, the socket > > will be prevented from sending any more packets out even if they are for > > a completely unrelated destinations, right? > > Further, module can't get unloaded and I think socket can not get > > closed, so user can't kill the task which has the socket? > > > > And thinking about this, I think I see > > another issue related to the use of the destructor callback: > > > > static void tpacket_destruct_skb(struct sk_buff *skb) > > { > > struct packet_sock *po = pkt_sk(skb->sk); > > void *ph; > > > > BUG_ON(skb == NULL); > > > > if (likely(po->tx_ring.pg_vec)) { > > ph = skb_shinfo(skb)->destructor_arg; > > BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING); > > BUG_ON(atomic_read(&po->tx_ring.pending) == 0); > > atomic_dec(&po->tx_ring.pending); > > __packet_set_status(po, ph, TP_STATUS_AVAILABLE); > > } > > > > sock_wfree(skb); > > > > <----- > > at this point we still have to execute instructions > > in this function to return from it. However > > socket and thus module reference count > > got already dropped to 0, so I think module could get unloaded > > and these instructions could get overwritten. > > > > } > > > > I conclude that destructor callback should never point to a function residing > > in a module, always to a function that is guaranteed to be builtin, this > > function must be the one that drops the last module reference. > > It would be a surprise to use tx mmap (presumably to get high > performance), and a modular af_unix ;) Hmm, this seems to be what distros ship. If it's illegal, we should disable this in Kconfig? > > > > Comments? > > The whole thing (packet / tx mmap) is broken, if you ask me. > > skb_orphan() is not about protecting data, but doing per socket memory > accounting. Right. > We have to skb_orphan() while data is still in use by skb, not only in > drivers but in core network stack. (loopback case for example, no need > to think about TUN being special ;) ) > > So I believe using mmap and tx on af_unix is racy in its current design. > > We probably can remove some skb_orphan() calls (now its done in core > network, no real need to make it from some drivers), but not have a > complete solution to the problem Changli raised, without adding yet > another field into skb_shared_info... Whouldn't that just make the problem harder to debug?
From: "Michael S. Tsirkin" <mst@redhat.com> Date: Mon, 6 Sep 2010 22:53:30 +0300 > On Mon, Sep 06, 2010 at 05:44:27PM +0200, Eric Dumazet wrote: >> It would be a surprise to use tx mmap (presumably to get high >> performance), and a modular af_unix ;) > > Hmm, this seems to be what distros ship. If it's illegal, we should disable > this in Kconfig? He said it would be illogical to configure the kernel like this if the goal is top performance, he didn't say it was illegal. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 06 Sep 2010 17:44:27 +0200 > skb_orphan() is not about protecting data, but doing per socket memory > accounting. Right, therefore I would suggest we use a different callback mechanism for af_packet to fix this problem. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index f900ffc..9c1a480 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -176,6 +176,9 @@ enum { /* ensure the originating sk reference is available on driver level */ SKBTX_DRV_NEEDS_SK_REF = 1 << 3, + + /* dont early orphan this skb in skb_orphan_try() */ + SKBTX_NO_EARLY_ORPHAN = 1 << 4, }; /* This data is invariant across clones and lives at diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 3616f27..306795d 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1029,6 +1029,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) } skb->destructor = tpacket_destruct_skb; + skb_shinfo(skb)->tx_flags |= SKBTX_NO_EARLY_ORPHAN; __packet_set_status(po, ph, TP_STATUS_SENDING); atomic_inc(&po->tx_ring.pending);