diff mbox

net: af_packet: skb_orphan should be avoided in TX path.

Message ID 1283708635.3402.100.camel@edumazet-laptop
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 5, 2010, 5:43 p.m. UTC
Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit :
> af_packet uses tpacket_destruct_skb() to notify its user a frame is
> sent out through NIC, and the memory for that frame is available for
> the others. If the driver calls skb_orphan() before the frame is sent
> out successfully, and the user may fill other data into the space for
> this frame, this frame will be corrupted. It became more likely after
> skb_try_orphan() was added into dev_hard_start_xmit().
> 
> Am I correct?
> 

Yes good catch. We might add a :

SKBTX_NO_EARLY_ORPHAN = 1 << 4,

so that skb_orphan_try() do not early orphan this kind of skb




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Changli Gao Sept. 5, 2010, 5:51 p.m. UTC | #1
On Mon, Sep 6, 2010 at 1:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit :
>> af_packet uses tpacket_destruct_skb() to notify its user a frame is
>> sent out through NIC, and the memory for that frame is available for
>> the others. If the driver calls skb_orphan() before the frame is sent
>> out successfully, and the user may fill other data into the space for
>> this frame, this frame will be corrupted. It became more likely after
>> skb_try_orphan() was added into dev_hard_start_xmit().
>>
>> Am I correct?
>>
>
> Yes good catch. We might add a :
>
> SKBTX_NO_EARLY_ORPHAN = 1 << 4,
>
> so that skb_orphan_try() do not early orphan this kind of skb
>

It may solve the issue of skb_orphan_try(), but some NICs still call
skb_orphan(). Maybe replacing skb_orphan() with skb_orphan_try() can
work around this issue.

localhost linux # grep skb_orphan drivers/net/ -r
drivers/net/can/dev.c:                  skb_orphan(skb);
drivers/net/mlx4/en_tx.c:               skb_orphan(skb);
drivers/net/cxgb3/sge.c:                skb_orphan(skb);
drivers/net/cxgb4/sge.c:                skb_orphan(skb);
drivers/net/niu.c:              skb_orphan(skb);
drivers/net/cxgb4vf/sge.c:              skb_orphan(skb);
drivers/net/tun.c:      skb_orphan(skb);
drivers/net/virtio_net.c:       skb_orphan(skb);
drivers/net/loopback.c: skb_orphan(skb);
drivers/net/wireless/mac80211_hwsim.c:  skb_orphan(skb);
drivers/net/wireless/libertas/tx.c:             skb_orphan(skb);
Oliver Hartkopp Sept. 5, 2010, 6:08 p.m. UTC | #2
On 05.09.2010 19:51, Changli Gao wrote:
> On Mon, Sep 6, 2010 at 1:43 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit :
>>> af_packet uses tpacket_destruct_skb() to notify its user a frame is
>>> sent out through NIC, and the memory for that frame is available for
>>> the others. If the driver calls skb_orphan() before the frame is sent
>>> out successfully, and the user may fill other data into the space for
>>> this frame, this frame will be corrupted. It became more likely after
>>> skb_try_orphan() was added into dev_hard_start_xmit().
>>>
>>> Am I correct?
>>>
>>
>> Yes good catch. We might add a :
>>
>> SKBTX_NO_EARLY_ORPHAN = 1 << 4,
>>
>> so that skb_orphan_try() do not early orphan this kind of skb
>>
> 
> It may solve the issue of skb_orphan_try(), but some NICs still call
> skb_orphan(). Maybe replacing skb_orphan() with skb_orphan_try() can
> work around this issue.
> 

Hm - i'm not really sure if skb_orphan_try() helps on this level.

E.g. for drivers/net/can/dev.c the skb_orphan is used to handle the correct
order of echo'ed skbs. Other drivers may do similar things.

AFAIK skb_orphan_try() helps to increase the performance for usual network
traffic, as it allows the orphan on a 'higher' level inside the networking stack.

In some cases the skb needs to be untouched, and this can be indicated in the
shared tx_flags. IMO skb_orphan_try() on driver level would break the correct
behaviour in most cases.

Regards,
Oliver

> localhost linux # grep skb_orphan drivers/net/ -r
> drivers/net/can/dev.c:                  skb_orphan(skb);
> drivers/net/mlx4/en_tx.c:               skb_orphan(skb);
> drivers/net/cxgb3/sge.c:                skb_orphan(skb);
> drivers/net/cxgb4/sge.c:                skb_orphan(skb);
> drivers/net/niu.c:              skb_orphan(skb);
> drivers/net/cxgb4vf/sge.c:              skb_orphan(skb);
> drivers/net/tun.c:      skb_orphan(skb);
> drivers/net/virtio_net.c:       skb_orphan(skb);
> drivers/net/loopback.c: skb_orphan(skb);
> drivers/net/wireless/mac80211_hwsim.c:  skb_orphan(skb);
> drivers/net/wireless/libertas/tx.c:             skb_orphan(skb);
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Sept. 6, 2010, 10:35 a.m. UTC | #3
On Sun, Sep 05, 2010 at 07:43:55PM +0200, Eric Dumazet wrote:
> Le lundi 06 septembre 2010 à 01:18 +0800, Changli Gao a écrit :
> > af_packet uses tpacket_destruct_skb() to notify its user a frame is
> > sent out through NIC, and the memory for that frame is available for
> > the others. If the driver calls skb_orphan() before the frame is sent
> > out successfully, and the user may fill other data into the space for
> > this frame, this frame will be corrupted. It became more likely after
> > skb_try_orphan() was added into dev_hard_start_xmit().
> > 
> > Am I correct?
> > 
> 
> Yes good catch. We might add a :
> 
> SKBTX_NO_EARLY_ORPHAN = 1 << 4,
> 
> so that skb_orphan_try() do not early orphan this kind of skb
> 

I think there are bigger issues here.  As was pointed out, drivers might
orphan skbs before they transmit them.
And at least for tun, the reason is that we might hang on
to skbs indefinitely because userspace is not reading them.

So in that case, if you just prevent tun from orphaning skbs, the socket
will be prevented from sending any more packets out even if they are for
a completely unrelated destinations, right?
Further, module can't get unloaded and I think socket can not get
closed, so user can't kill the task which has the socket?

And thinking about this, I think I see
another issue related to the use of the destructor callback:

static void tpacket_destruct_skb(struct sk_buff *skb)
{
        struct packet_sock *po = pkt_sk(skb->sk);
        void *ph;

        BUG_ON(skb == NULL);

        if (likely(po->tx_ring.pg_vec)) {
                ph = skb_shinfo(skb)->destructor_arg;
                BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING);
                BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
                atomic_dec(&po->tx_ring.pending);
                __packet_set_status(po, ph, TP_STATUS_AVAILABLE);
        }

        sock_wfree(skb);

<-----
at this point we still have to execute instructions
in this function to return from it. However
socket and thus module reference count
got already dropped to 0, so I think module could get unloaded
and these instructions could get overwritten.

}

I conclude that destructor callback should never point to a function residing
in a module, always to a function that is guaranteed to be builtin, this
function must be the one that drops the last module reference.

Comments?


> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index f900ffc..9c1a480 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -176,6 +176,9 @@ enum {
>  
>  	/* ensure the originating sk reference is available on driver level */
>  	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
> +
> +	/* dont early orphan this skb in skb_orphan_try() */
> +	SKBTX_NO_EARLY_ORPHAN = 1 << 4,
>  };
>  
>  /* This data is invariant across clones and lives at
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 3616f27..306795d 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1029,6 +1029,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>  		}
>  
>  		skb->destructor = tpacket_destruct_skb;
> +		skb_shinfo(skb)->tx_flags |= SKBTX_NO_EARLY_ORPHAN;
>  		__packet_set_status(po, ph, TP_STATUS_SENDING);
>  		atomic_inc(&po->tx_ring.pending);
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 6, 2010, 3:44 p.m. UTC | #4
Le lundi 06 septembre 2010 à 13:35 +0300, Michael S. Tsirkin a écrit :

> I think there are bigger issues here.  As was pointed out, drivers might
> orphan skbs before they transmit them.
> And at least for tun, the reason is that we might hang on
> to skbs indefinitely because userspace is not reading them.
> 
> So in that case, if you just prevent tun from orphaning skbs, the socket
> will be prevented from sending any more packets out even if they are for
> a completely unrelated destinations, right?
> Further, module can't get unloaded and I think socket can not get
> closed, so user can't kill the task which has the socket?
> 
> And thinking about this, I think I see
> another issue related to the use of the destructor callback:
> 
> static void tpacket_destruct_skb(struct sk_buff *skb)
> {
>         struct packet_sock *po = pkt_sk(skb->sk);
>         void *ph;
> 
>         BUG_ON(skb == NULL);
> 
>         if (likely(po->tx_ring.pg_vec)) {
>                 ph = skb_shinfo(skb)->destructor_arg;
>                 BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING);
>                 BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
>                 atomic_dec(&po->tx_ring.pending);
>                 __packet_set_status(po, ph, TP_STATUS_AVAILABLE);
>         }
> 
>         sock_wfree(skb);
> 
> <-----
> at this point we still have to execute instructions
> in this function to return from it. However
> socket and thus module reference count
> got already dropped to 0, so I think module could get unloaded
> and these instructions could get overwritten.
> 
> }
> 
> I conclude that destructor callback should never point to a function residing
> in a module, always to a function that is guaranteed to be builtin, this
> function must be the one that drops the last module reference.

It would be a surprise to use tx mmap (presumably to get high
performance), and a modular af_unix ;)

> 
> Comments?

The whole thing (packet / tx mmap) is broken, if you ask me.

skb_orphan() is not about protecting data, but doing per socket memory
accounting.

We have to skb_orphan() while data is still in use by skb, not only in
drivers but in core network stack. (loopback case for example, no need
to think about TUN being special ;) )

So I believe using mmap and tx on af_unix is racy in its current design.

We probably can remove some skb_orphan() calls (now its done in core
network, no real need to make it from some drivers), but not have a
complete solution to the problem Changli raised, without adding yet
another field into skb_shared_info...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Sept. 6, 2010, 7:53 p.m. UTC | #5
On Mon, Sep 06, 2010 at 05:44:27PM +0200, Eric Dumazet wrote:
> Le lundi 06 septembre 2010 à 13:35 +0300, Michael S. Tsirkin a écrit :
> 
> > I think there are bigger issues here.  As was pointed out, drivers might
> > orphan skbs before they transmit them.
> > And at least for tun, the reason is that we might hang on
> > to skbs indefinitely because userspace is not reading them.
> > 
> > So in that case, if you just prevent tun from orphaning skbs, the socket
> > will be prevented from sending any more packets out even if they are for
> > a completely unrelated destinations, right?
> > Further, module can't get unloaded and I think socket can not get
> > closed, so user can't kill the task which has the socket?
> > 
> > And thinking about this, I think I see
> > another issue related to the use of the destructor callback:
> > 
> > static void tpacket_destruct_skb(struct sk_buff *skb)
> > {
> >         struct packet_sock *po = pkt_sk(skb->sk);
> >         void *ph;
> > 
> >         BUG_ON(skb == NULL);
> > 
> >         if (likely(po->tx_ring.pg_vec)) {
> >                 ph = skb_shinfo(skb)->destructor_arg;
> >                 BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING);
> >                 BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
> >                 atomic_dec(&po->tx_ring.pending);
> >                 __packet_set_status(po, ph, TP_STATUS_AVAILABLE);
> >         }
> > 
> >         sock_wfree(skb);
> > 
> > <-----
> > at this point we still have to execute instructions
> > in this function to return from it. However
> > socket and thus module reference count
> > got already dropped to 0, so I think module could get unloaded
> > and these instructions could get overwritten.
> > 
> > }
> > 
> > I conclude that destructor callback should never point to a function residing
> > in a module, always to a function that is guaranteed to be builtin, this
> > function must be the one that drops the last module reference.
> 
> It would be a surprise to use tx mmap (presumably to get high
> performance), and a modular af_unix ;)

Hmm, this seems to be what distros ship. If it's illegal, we should disable
this in Kconfig?

> > 
> > Comments?
> 
> The whole thing (packet / tx mmap) is broken, if you ask me.
> 
> skb_orphan() is not about protecting data, but doing per socket memory
> accounting.

Right.

> We have to skb_orphan() while data is still in use by skb, not only in
> drivers but in core network stack. (loopback case for example, no need
> to think about TUN being special ;) )
> 
> So I believe using mmap and tx on af_unix is racy in its current design.
> 
> We probably can remove some skb_orphan() calls (now its done in core
> network, no real need to make it from some drivers), but not have a
> complete solution to the problem Changli raised, without adding yet
> another field into skb_shared_info...

Whouldn't that just make the problem harder to debug?
David Miller Sept. 6, 2010, 8:11 p.m. UTC | #6
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Mon, 6 Sep 2010 22:53:30 +0300

> On Mon, Sep 06, 2010 at 05:44:27PM +0200, Eric Dumazet wrote:
>> It would be a surprise to use tx mmap (presumably to get high
>> performance), and a modular af_unix ;)
> 
> Hmm, this seems to be what distros ship. If it's illegal, we should disable
> this in Kconfig?

He said it would be illogical to configure the kernel like this if
the goal is top performance, he didn't say it was illegal.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Sept. 8, 2010, 5:39 p.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 06 Sep 2010 17:44:27 +0200

> skb_orphan() is not about protecting data, but doing per socket memory
> accounting.

Right, therefore I would suggest we use a different callback mechanism
for af_packet to fix this problem.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f900ffc..9c1a480 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -176,6 +176,9 @@  enum {
 
 	/* ensure the originating sk reference is available on driver level */
 	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
+
+	/* dont early orphan this skb in skb_orphan_try() */
+	SKBTX_NO_EARLY_ORPHAN = 1 << 4,
 };
 
 /* This data is invariant across clones and lives at
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3616f27..306795d 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1029,6 +1029,7 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 
 		skb->destructor = tpacket_destruct_skb;
+		skb_shinfo(skb)->tx_flags |= SKBTX_NO_EARLY_ORPHAN;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
 		atomic_inc(&po->tx_ring.pending);