IB/ipoib: move back the IB LL address into the hard header

Message ID 1dbd83dfe7f435eecc5bc460e901b47758280f30.1476206016.git.pabeni@redhat.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Paolo Abeni Oct. 11, 2016, 5:15 p.m. UTC
After commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroys the IPoIB address information cached there,
causing a severe performance regression, as described in more
detail here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control block into the IPoIB hard header, as was done before
commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issues, on packet reception the IPoIB driver
stashes a dummy pseudo header into the skb, so that the received
packets actually have a hard header matching the declared length.
Also, the connected mode maximum MTU is reduced by 16 bytes to
cope with the increased hard header length.

After this commit, IPoIB performance is back to its pre-regression
value.

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           | 24 ++++++++----
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 17 ++++----
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        | 12 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 54 ++++++++++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |  6 ++-
 5 files changed, 67 insertions(+), 46 deletions(-)
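
For readers without a 4.x tree at hand: the collision is between two users
of the same 48-byte skb->cb scratch area. The standalone sketch below
reconstructs the layouts of that era to show the overlap; the structures
and offsets are reproduced here for illustration only, not taken from this
patch.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define INFINIBAND_ALEN		20
#define QDISC_CB_PRIV_LEN	20
#define SKB_SGO_CB_OFFSET	32	/* introduced by commit 9207f9d45b0a */

/* reconstructed 4.x-era layouts, for illustration only */
struct qdisc_skb_cb {
	unsigned int	pkt_len;
	uint16_t	slave_dev_queue_mapping;
	uint16_t	tc_classid;
	unsigned char	data[QDISC_CB_PRIV_LEN];
};

struct ipoib_cb {
	struct qdisc_skb_cb	qdisc_cb;
	uint8_t			hwaddr[INFINIBAND_ALEN];
};

int main(void)
{
	size_t hw = offsetof(struct ipoib_cb, hwaddr);

	/* hwaddr spans cb[hw..hw+19]; the GSO CB now lives at cb[32],
	 * inside that range, so segmentation overwrites the cached
	 * destination address */
	printf("hwaddr at cb[%zu..%zu], GSO CB at cb[%d]\n",
	       hw, hw + INFINIBAND_ALEN - 1, SKB_SGO_CB_OFFSET);
	return 0;
}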

Comments

Jason Gunthorpe Oct. 11, 2016, 5:32 p.m. UTC | #1
On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:

> Also, the connected mode maximum MTU is reduced by 16 bytes to
> cope with the increased hard header length.

Changing the MTU is going to cause annoying interop problems, can you
avoid this?

Jason
Paolo Abeni Oct. 11, 2016, 5:37 p.m. UTC | #2
On Tue, 2016-10-11 at 11:32 -0600, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
> 
> > Also, the connected mode maximum MTU is reduced by 16 bytes to
> > cope with the increased hard header length.
> 
> Changing the MTU is going to cause annoying interop problems, can you
> avoid this?

I don't like changing the maximum MTU value either, but I was unable to
find an alternative solution. PMTU detection should protect against
such issues.
Doug Ledford Oct. 11, 2016, 5:41 p.m. UTC | #3
On 10/11/2016 1:32 PM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
> 
>> Also, the connected mode maximum MTU is reduced by 16 bytes to
>> cope with the increased hard header length.
> 
> Changing the MTU is going to cause annoying interop problems, can you
> avoid this?

(Paolo did the work I'm describing here, I'm just giving the explanation
he gave me):

Not with this particular solution, I don't think.  We tried it without
increasing the declared hard header length and it broke in the
skb_clone/GSO paths.  In order to make the LL pseudo header get
copied along with the rest of the encap and data on clone, we had to
declare it as part of the hard header.  The problem then became that the
sg setup limits us to 16 4k pages for the sg array, so that header had
to come out of the 64k maximum MTU.
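
A back-of-envelope check of that constraint (my arithmetic, assuming 4k
pages and the 16-entry limit described above; the numbers match the
IPOIB_CM_MTU change in the patch below):

#include <stdio.h>

int main(void)
{
	int buf = 16 * 4096;		/* 16 SG entries of one page each */
	int old_mtu = buf - 0x10;	/* 65520: 4-byte encap header,
					 * padded to 16 */
	int new_mtu = buf - 0x20;	/* 65504: 24-byte hard header
					 * (encap + pseudo), padded to 32 */

	printf("old=%d new=%d delta=%d\n", old_mtu, new_mtu,
	       old_mtu - new_mtu);
	return 0;
}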
Jason Gunthorpe Oct. 11, 2016, 5:42 p.m. UTC | #4
On Tue, Oct 11, 2016 at 07:37:32PM +0200, Paolo Abeni wrote:
> On Tue, 2016-10-11 at 11:32 -0600, Jason Gunthorpe wrote:
> > On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
> > 
> > > Also, the connected mode maximum MTU is reduced by 16 bytes to
> > > cope with the increased hard header length.
> > 
> > Changing the MTU is going to cause annoying interop problems, can you
> > avoid this?
> 
> I don't like changing the maximum MTU value either, but I was unable to
> find an alternative solution. PMTU detection should protect against
> such issues.

It is more than PMTU: we have instructed all users that this is the MTU
number needed to enable CM mode, so it appears in documentation,
scripts, etc.

There is really no way to re-use some of the existing alignment
padding or exceed 64k?

Jason
Paolo Abeni Oct. 11, 2016, 5:52 p.m. UTC | #5
On Tue, 2016-10-11 at 11:42 -0600, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 07:37:32PM +0200, Paolo Abeni wrote:
> > On Tue, 2016-10-11 at 11:32 -0600, Jason Gunthorpe wrote:
> > > On Tue, Oct 11, 2016 at 07:15:44PM +0200, Paolo Abeni wrote:
> > > 
> > > > Also, the connected mode maximum MTU is reduced by 16 bytes to
> > > > cope with the increased hard header length.
> > > 
> > > Changing the MTU is going to cause annoying interop problems, can you
> > > avoid this?
> > 
> > I don't like changing the maximum MTU value either, but I was unable to
> > find an alternative solution. PMTU detection should protect against
> > such issues.
> 
> It is more than PMTU: we have instructed all users that this is the MTU
> number needed to enable CM mode, so it appears in documentation,
> scripts, etc.

AFAICS the max MTU already depends on the underlying h/w; how are such
differences currently coped with? (I'm sorry, I lack some/a lot of IB
background.)
Jason Gunthorpe Oct. 11, 2016, 6:01 p.m. UTC | #6
On Tue, Oct 11, 2016 at 01:41:56PM -0400, Doug Ledford wrote:

> declare it as part of the hard header.  The problem then became that the
> sg setup limits us to 16 4k pages for the sg array, so that header had
> to come out of the 64k maximum MTU.

Oh, that clarifies things..

Hum, so various options become:
 - Use >=17 SGL entries when creating the QP. Is this possible
   on common adapters?
 - Use the FRWR infrastructure when necessary. Is there any chance
   the majority of skbs will have at least two physically
   contiguous pages to make this overhead rare? Perhaps as a fall
   back if many adaptors can do >=17 SGLs 
 - Pad the hard header out to 4k and discard the first page
   when building the sgl
 - Memcopy the first ~8k into a contiguous 8k region on send
 - Move the pseudo header to the end so it can cross the page
   barrier without needing a sgl entry. (probably impossible?)
 
From Paolo

> AFAICS the max MTU already depends on the underlying h/w; how are such
> differences currently coped with? (I'm sorry, I lack some/a lot of IB
> background.)

It isn't h/w dependent. In CM mode the MTU is 65520 because that is
what is hard coded into the ipoib driver. We tell everyone to use that
number. Eg see RH's docs on the subject:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_IPoIB.html

AFAIK, today everyone just wires that number into their scripts, so we
have to mass change everything to the smaller number. That sounds
really hard; IMHO, if there is any way to avoid it we should, even if
it is a little costly.

Jason
Paolo Abeni Oct. 11, 2016, 6:10 p.m. UTC | #7
On Tue, 2016-10-11 at 12:01 -0600, Jason Gunthorpe wrote:
> > AFAICS the max MTU already depends on the underlying h/w; how are such
> > differences currently coped with? (I'm sorry, I lack some/a lot of IB
> > background.)
> 
> It isn't h/w dependent. In CM mode the MTU is 65520 because that is
> what is hard coded into the ipoib driver. We tell everyone to use that
> number. Eg see RH's docs on the subject:
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_IPoIB.html
> 
> AFAIK, today everyone just wires that number into their scripts, so we
> have to mass change everything to the smaller number. That sounds
> really hard; IMHO, if there is any way to avoid it we should, even if
> it is a little costly.

Thank you for the details!

The first s/g fragment (the head buffer) is not allocated with the page
allocator, so perhaps there is some not too difficult/costly way out of
this.
Doug Ledford Oct. 11, 2016, 6:17 p.m. UTC | #8
On 10/11/2016 2:10 PM, Paolo Abeni wrote:
> On Tue, 2016-10-11 at 12:01 -0600, Jason Gunthorpe wrote:
>>> AFAICS the max MTU already depends on the underlying h/w; how are such
>>> differences currently coped with? (I'm sorry, I lack some/a lot of IB
>>> background.)
>>
>> It isn't h/w dependent. In CM mode the MTU is 65520 because that is
>> what is hard coded into the ipoib driver. We tell everyone to use that
>> number. Eg see RH's docs on the subject:
>>
>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_IPoIB.html
>>
>> AFAIK, today everyone just wires that number into their scripts, so we
>> have to mass change everything to the smaller number.

Well, not exactly.  Even if we put 65520 into the scripts, the kernel
will silently drop it down to 65504.  It actually won't require anyone
to change anything; they just won't get the full value.  I experimented
with this in the past for other reasons and an overly large MTU setting
just resulted in the max MTU.  I don't know if that's changed, but if it
still works that way, this is much less of an issue than it might
otherwise be.

>> That sounds
>> really hard; IMHO, if there is any way to avoid it we should, even if
>> it is a little costly.
> 
> Thank you for the details!
> 
> The first s/g fragment (the head buffer) is not allocated with the page
> allocator, so perhaps there is some not too difficult/costly way out of
> this.
Jason Gunthorpe Oct. 11, 2016, 6:27 p.m. UTC | #9
On Tue, Oct 11, 2016 at 08:10:07PM +0200, Paolo Abeni wrote:

> The first s/g fragment (the head buffer) is not allocated with the page
> allocator, so perhaps there is some not too difficult/costly way out of
> this.

Keep in mind, there is nothing magic about the 16 SGL limit, other
than we know all hardware supports it. That can be bumped up and most
hardware will support a higher value.

We'd just have to figure out if any hardware breaks; Mellanox and Intel
should be able to respond to that question.

Jason
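
For anyone who wants to answer that empirically, a userspace sketch along
these lines (plain libibverbs, nothing from this patch) reports what each
adapter claims to support:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num, i;
	struct ibv_device **list = ibv_get_device_list(&num);
	struct ibv_device_attr attr;

	if (!list)
		return 1;
	for (i = 0; i < num; i++) {
		struct ibv_context *ctx = ibv_open_device(list[i]);

		/* max_sge bounds a send/recv WR's SG list; max_srq_sge is
		 * the SRQ-side limit used by ipoib_cm_dev_init() */
		if (ctx && !ibv_query_device(ctx, &attr))
			printf("%s: max_sge=%d max_srq_sge=%d\n",
			       ibv_get_device_name(list[i]),
			       attr.max_sge, attr.max_srq_sge);
		if (ctx)
			ibv_close_device(ctx);
	}
	ibv_free_device_list(list);
	return 0;
}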
Jason Gunthorpe Oct. 11, 2016, 6:30 p.m. UTC | #10
On Tue, Oct 11, 2016 at 02:17:51PM -0400, Doug Ledford wrote:

> Well, not exactly.  Even if we put 65520 into the scripts, the kernel
> will silently drop it down to 65504.  It actually won't require anyone
> to change anything; they just won't get the full value.  I experimented
> with this in the past for other reasons and an overly large MTU setting
> just resulted in the max MTU.  I don't know if that's changed, but if it
> still works that way, this is much less of an issue than it might
> otherwise be.

So it is just docs and relying on PMTU? That is not as bad..

Still would be nice to avoid if at all possible..

Jason
Doug Ledford Oct. 11, 2016, 6:50 p.m. UTC | #11
On 10/11/2016 2:30 PM, Jason Gunthorpe wrote:
> On Tue, Oct 11, 2016 at 02:17:51PM -0400, Doug Ledford wrote:
> 
>> Well, not exactly.  Even if we put 65520 into the scripts, the kernel
>> will silently drop it down to 65504.  It actually won't require anyone
>> to change anything; they just won't get the full value.  I experimented
>> with this in the past for other reasons and an overly large MTU setting
>> just resulted in the max MTU.  I don't know if that's changed, but if it
>> still works that way, this is much less of an issue than it might
>> otherwise be.
> 
> So it is just docs and relying on PMTU? That is not as bad..
> 
> Still would be nice to avoid if at all possible..

I agree, but we have a test getting ready to commence.  We'll know
shortly how much the reduced MTU affects things because they aren't
going to alter any of their setup, just put the new kernel in place, and
see what happens.
Doug Ledford Oct. 12, 2016, 6:21 p.m. UTC | #12
On 10/11/2016 2:50 PM, Doug Ledford wrote:
> On 10/11/2016 2:30 PM, Jason Gunthorpe wrote:
>> On Tue, Oct 11, 2016 at 02:17:51PM -0400, Doug Ledford wrote:
>>
>>> Well, not exactly.  Even if we put 65520 into the scripts, the kernel
>>> will silently drop it down to 65504.  It actually won't require anyone
>>> to change anything; they just won't get the full value.  I experimented
>>> with this in the past for other reasons and an overly large MTU setting
>>> just resulted in the max MTU.  I don't know if that's changed, but if it
>>> still works that way, this is much less of an issue than it might
>>> otherwise be.
>>
>> So it is just docs and relying on PMTU? That is not as bad..
>>
>> Still would be nice to avoid if at all possible..
> 
> I agree, but we have a test getting ready to commence.  We'll know
> shortly how much the reduced MTU affects things because they aren't
> going to alter any of their setup, just put the new kernel in place, and
> see what happens.
> 
> 

Long story short on the MTU stuff, the setups whined a bit about not
being able to set the desired MTU, used the new max MTU instead, and
things otherwise worked fine.  But, Paolo submitted a v2 patch that
removes this change, so it's all moot anyway.
David Miller Oct. 13, 2016, 2:24 p.m. UTC | #13
From: Paolo Abeni <pabeni@redhat.com>
Date: Tue, 11 Oct 2016 19:15:44 +0200

> After commit 9207f9d45b0a ("net: preserve IP control block
> during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
> That destroys the IPoIB address information cached there,
> causing a severe performance regression, as described in more
> detail here:
> 
> http://marc.info/?l=linux-kernel&m=146787279825501&w=2
> 
> This change moves the data cached by the IPoIB driver from the
> skb control block into the IPoIB hard header, as was done before
> commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
> and use skb->cb to stash LL addresses").
> In order to avoid GRO issues, on packet reception the IPoIB driver
> stashes a dummy pseudo header into the skb, so that the received
> packets actually have a hard header matching the declared length.
> Also, the connected mode maximum MTU is reduced by 16 bytes to
> cope with the increased hard header length.
> 
> After this commit, IPoIB performance is back to its pre-regression
> value.
> 
> Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Not providing an accurate hard_header_len causes many problems.

In fact we recently fixed the mlxsw driver to stop doing this.
Doug Ledford Oct. 13, 2016, 2:35 p.m. UTC | #14
On 10/13/2016 10:24 AM, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 11 Oct 2016 19:15:44 +0200
> 
>> [commit message snipped]
> 
> Not providing an accurate hard_header_len causes many problems.
> 
> In fact we recently fixed the mlxsw driver to stop doing this.
> 

Sure, but there are too many users of the cb struct, and whatever
problems you say arise from lying about the hard header len are
dwarfed by the problems caused by the inability to store the LL address
anywhere between hard_header and send time.
David Miller Oct. 13, 2016, 2:43 p.m. UTC | #15
From: Doug Ledford <dledford@redhat.com>
Date: Thu, 13 Oct 2016 10:35:35 -0400

> On 10/13/2016 10:24 AM, David Miller wrote:
>> From: Paolo Abeni <pabeni@redhat.com>
>> Date: Tue, 11 Oct 2016 19:15:44 +0200
>> 
>>> [commit message snipped]
>> 
>> Not providing an accurate hard_header_len causes many problems.
>> 
>> In fact we recently fixed the mlxsw driver to stop doing this.
>> 
> 
> Sure, but there are too many users of the cb struct, and whatever
>> problems you say arise from lying about the hard header len are
>> dwarfed by the problems caused by the inability to store the LL address
> anywhere between hard_header and send time.

IB wants to pass addressing information between layers; it needs to
find a safe way to do that.  The currently proposed patch does not
meet this criterion.

Pushing metadata before the head of the SKB data pointer is illegal,
as the layers in between might want to push protocol headers, mirror
the packet to another interface, etc.

So this "metadata in SKB data" approach is buggy too.
Paolo Abeni Oct. 13, 2016, 3:17 p.m. UTC | #16
On Thu, 2016-10-13 at 10:24 -0400, David Miller wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> Date: Tue, 11 Oct 2016 19:15:44 +0200
> 
> > [commit message snipped]
> 
> Not providing an accurate hard_header_len causes many problems.
> 
> In fact we recently fixed the mlxsw driver to stop doing this.

AFAICS the mlxsw driver did a different thing: it added an additional
header in the xmit function and pushed packets into the network stack
with a mac length different from the declared hard header length.

The hard header specified by the IPoIB driver is built in the create()
header_ops, and its length matches the mac header length of the packets
pushed into the network stack by the driver. GRO works correctly on top
of that. From the networking stack's PoV the hard_header_len should be
'accurate'.

Can you please give more details on the problems that may arise from
this?

thank you,

Paolo
Doug Ledford Oct. 13, 2016, 3:20 p.m. UTC | #17
On 10/13/2016 10:43 AM, David Miller wrote:
> From: Doug Ledford <dledford@redhat.com>
> Date: Thu, 13 Oct 2016 10:35:35 -0400
> 
>> On 10/13/2016 10:24 AM, David Miller wrote:
>>> From: Paolo Abeni <pabeni@redhat.com>
>>> Date: Tue, 11 Oct 2016 19:15:44 +0200
>>>
>>>> [commit message snipped]
>>>
>>> Not providing an accurate hard_header_len causes many problems.
>>>
>>> In fact we recently fixed the mlxsw driver to stop doing this.
>>>
>>
>> Sure, but there are too many users of the cb struct, and whatever
>> problems you say arise from lying about the hard header len are
>> dwarfed by the problems caused by the inability to store the LL address
>> anywhere between hard_header and send time.
> 
> IB wants to pass addressing information between layers; it needs to
> find a safe way to do that.

We *had* a safe way to do that.  It got broken.  What about increasing
the size of skb->cb?  Or adding a skb->dgid that is a
u8[INFINIBAND_ALEN]?  Or a more generic skb->dest_ll_addr that is sized
to hold the dest address for any link layer?

> Pushing metadata before the head of the SKB data pointer is illegal,
> as the layers in between might want to push protocol headers,

That's a total non-issue for us.  There are no headers that protocols
can add before ours.

> mirror
> the packet to another interface

Doesn't that then mean that the headers specific to this interface
should be stripped before the mirror?  If so, I believe the way Paolo
did this patch, that is what will be done.

>, etc.
> 
> So this "metadata in SKB data" approach is buggy too.
David Miller Oct. 13, 2016, 3:57 p.m. UTC | #18
From: Doug Ledford <dledford@redhat.com>
Date: Thu, 13 Oct 2016 11:20:59 -0400

> We *had* a safe way to do that.  It got broken.  What about increasing
> the size of skb->cb?  Or adding a skb->dgid that is a
> u8[INFINIBAND_ALEN]?  Or a more generic skb->dest_ll_addr that is sized
> to hold the dest address for any link layer?

I understand the situation, and I also believe that making sk_buff any
huger than it already is happens to be a non-starter.

>> Pushing metadata before the head of the SKB data pointer is illegal,
>> as the layers in between might want to push protocol headers,
> 
> That's a total non-issue for us.  There are no headers that protocols
> can add before ours.

Ok, if that's the case, and based upon Paolo's response to me it appears
to be, I guess this is OK for now.

Paolo please resubmit your patch, thanks.

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9dbfcc0..5dd01fa 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -63,12 +63,14 @@  enum ipoib_flush_level {
 
 enum {
 	IPOIB_ENCAP_LEN		  = 4,
+	IPOIB_PSEUDO_LEN	  = 20,
+	IPOIB_HARD_LEN		  = IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN,
 
 	IPOIB_UD_HEAD_SIZE	  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
 	IPOIB_UD_RX_SG		  = 2, /* max buffer needed for 4K mtu */
 
-	IPOIB_CM_MTU		  = 0x10000 - 0x10, /* padding to align header to 16 */
-	IPOIB_CM_BUF_SIZE	  = IPOIB_CM_MTU  + IPOIB_ENCAP_LEN,
+	IPOIB_CM_MTU		  = 0x10000 - 0x20, /* padding to align header to 16 */
+	IPOIB_CM_BUF_SIZE	  = IPOIB_CM_MTU  + IPOIB_HARD_LEN,
 	IPOIB_CM_HEAD_SIZE	  = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
 	IPOIB_CM_RX_SG		  = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE,
 	IPOIB_RX_RING_SIZE	  = 256,
@@ -134,15 +136,21 @@  struct ipoib_header {
 	u16	reserved;
 };
 
-struct ipoib_cb {
-	struct qdisc_skb_cb	qdisc_cb;
-	u8			hwaddr[INFINIBAND_ALEN];
+struct ipoib_pseudo_header {
+	u8	hwaddr[INFINIBAND_ALEN];
 };
 
-static inline struct ipoib_cb *ipoib_skb_cb(const struct sk_buff *skb)
+static inline void skb_add_pseudo_hdr(struct sk_buff *skb)
 {
-	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct ipoib_cb));
-	return (struct ipoib_cb *)skb->cb;
+	char *data = skb_push(skb, IPOIB_PSEUDO_LEN);
+
+	/*
+	 * only the ipoib header is present now, make room for a dummy
+	 * pseudo header and set the skb fields accordingly
+	 */
+	memset(data, 0, IPOIB_PSEUDO_LEN);
+	skb_reset_mac_header(skb);
+	skb_pull(skb, IPOIB_HARD_LEN);
 }
 
 /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 4ad297d..1b04c8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -63,6 +63,8 @@  MODULE_PARM_DESC(cm_data_debug_level,
 #define IPOIB_CM_RX_DELAY       (3 * 256 * HZ)
 #define IPOIB_CM_RX_UPDATE_MASK (0x3)
 
+#define IPOIB_CM_RX_RESERVE     (ALIGN(IPOIB_HARD_LEN, 16) - IPOIB_ENCAP_LEN)
+
 static struct ib_qp_attr ipoib_cm_err_attr = {
 	.qp_state = IB_QPS_ERR
 };
@@ -146,15 +148,15 @@  static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
 	struct sk_buff *skb;
 	int i;
 
-	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
+	skb = dev_alloc_skb(ALIGN(IPOIB_CM_HEAD_SIZE, 16));
 	if (unlikely(!skb))
 		return NULL;
 
 	/*
-	 * IPoIB adds a 4 byte header. So we need 12 more bytes to align the
+	 * IPoIB adds an IPOIB_ENCAP_LEN byte header; this will align the
 	 * IP header to a multiple of 16.
 	 */
-	skb_reserve(skb, 12);
+	skb_reserve(skb, IPOIB_CM_RX_RESERVE);
 
 	mapping[0] = ib_dma_map_single(priv->ca, skb->data, IPOIB_CM_HEAD_SIZE,
 				       DMA_FROM_DEVICE);
@@ -624,9 +626,9 @@  void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->byte_len < IPOIB_CM_COPYBREAK) {
 		int dlen = wc->byte_len;
 
-		small_skb = dev_alloc_skb(dlen + 12);
+		small_skb = dev_alloc_skb(dlen + IPOIB_CM_RX_RESERVE);
 		if (small_skb) {
-			skb_reserve(small_skb, 12);
+			skb_reserve(small_skb, IPOIB_CM_RX_RESERVE);
 			ib_dma_sync_single_for_cpu(priv->ca, rx_ring[wr_id].mapping[0],
 						   dlen, DMA_FROM_DEVICE);
 			skb_copy_from_linear_data(skb, small_skb->data, dlen);
@@ -663,8 +665,7 @@  void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 
 copied:
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
-	skb_reset_mac_header(skb);
-	skb_pull(skb, IPOIB_ENCAP_LEN);
+	skb_add_pseudo_hdr(skb);
 
 	++dev->stats.rx_packets;
 	dev->stats.rx_bytes += skb->len;
@@ -1583,7 +1584,7 @@  int ipoib_cm_dev_init(struct net_device *dev)
 	max_srq_sge = min_t(int, IPOIB_CM_RX_SG, priv->ca->attrs.max_srq_sge);
 	ipoib_cm_create_srq(dev, max_srq_sge);
 	if (ipoib_cm_has_srq(dev)) {
-		priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x10;
+		priv->cm.max_cm_mtu = max_srq_sge * PAGE_SIZE - 0x20;
 		priv->cm.num_frags  = max_srq_sge;
 		ipoib_dbg(priv, "max_cm_mtu = 0x%x, num_frags=%d\n",
 			  priv->cm.max_cm_mtu, priv->cm.num_frags);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index be11d5d..830fecb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -128,16 +128,15 @@  static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 
 	buf_size = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
 
-	skb = dev_alloc_skb(buf_size + IPOIB_ENCAP_LEN);
+	skb = dev_alloc_skb(buf_size + IPOIB_HARD_LEN);
 	if (unlikely(!skb))
 		return NULL;
 
 	/*
-	 * IB will leave a 40 byte gap for a GRH and IPoIB adds a 4 byte
-	 * header.  So we need 4 more bytes to get to 48 and align the
-	 * IP header to a multiple of 16.
+	 * the IP header will be at IPOIB_HARD_LEN + IB_GRH_BYTES, that is
+	 * 64 bytes aligned
 	 */
-	skb_reserve(skb, 4);
+	skb_reserve(skb, sizeof(struct ipoib_pseudo_header));
 
 	mapping = priv->rx_ring[id].mapping;
 	mapping[0] = ib_dma_map_single(priv->ca, skb->data, buf_size,
@@ -253,8 +252,7 @@  static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb_pull(skb, IB_GRH_BYTES);
 
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
-	skb_reset_mac_header(skb);
-	skb_pull(skb, IPOIB_ENCAP_LEN);
+	skb_add_pseudo_hdr(skb);
 
 	++dev->stats.rx_packets;
 	dev->stats.rx_bytes += skb->len;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cc1c1b0..823a528 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -925,9 +925,12 @@  static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
 				ipoib_neigh_free(neigh);
 				goto err_drop;
 			}
-			if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
+			if (skb_queue_len(&neigh->queue) <
+			    IPOIB_MAX_PATH_REC_QUEUE) {
+				/* put pseudoheader back on for next time */
+				skb_push(skb, IPOIB_PSEUDO_LEN);
 				__skb_queue_tail(&neigh->queue, skb);
-			else {
+			} else {
 				ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
 					   skb_queue_len(&neigh->queue));
 				goto err_drop;
@@ -964,7 +967,7 @@  err_drop:
 }
 
 static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
-			     struct ipoib_cb *cb)
+			     struct ipoib_pseudo_header *phdr)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
@@ -972,16 +975,18 @@  static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 
 	spin_lock_irqsave(&priv->lock, flags);
 
-	path = __path_find(dev, cb->hwaddr + 4);
+	path = __path_find(dev, phdr->hwaddr + 4);
 	if (!path || !path->valid) {
 		int new_path = 0;
 
 		if (!path) {
-			path = path_rec_create(dev, cb->hwaddr + 4);
+			path = path_rec_create(dev, phdr->hwaddr + 4);
 			new_path = 1;
 		}
 		if (path) {
 			if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+				/* put pseudoheader back on for next time */
+				skb_push(skb, IPOIB_PSEUDO_LEN);
 				__skb_queue_tail(&path->queue, skb);
 			} else {
 				++dev->stats.tx_dropped;
@@ -1009,10 +1014,12 @@  static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 			  be16_to_cpu(path->pathrec.dlid));
 
 		spin_unlock_irqrestore(&priv->lock, flags);
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(cb->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
 		return;
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+		/* put pseudoheader back on for next time */
+		skb_push(skb, IPOIB_PSEUDO_LEN);
 		__skb_queue_tail(&path->queue, skb);
 	} else {
 		++dev->stats.tx_dropped;
@@ -1026,13 +1033,15 @@  static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_neigh *neigh;
-	struct ipoib_cb *cb = ipoib_skb_cb(skb);
+	struct ipoib_pseudo_header *phdr;
 	struct ipoib_header *header;
 	unsigned long flags;
 
+	phdr = (struct ipoib_pseudo_header *) skb->data;
+	skb_pull(skb, sizeof(*phdr));
 	header = (struct ipoib_header *) skb->data;
 
-	if (unlikely(cb->hwaddr[4] == 0xff)) {
+	if (unlikely(phdr->hwaddr[4] == 0xff)) {
 		/* multicast, arrange "if" according to probability */
 		if ((header->proto != htons(ETH_P_IP)) &&
 		    (header->proto != htons(ETH_P_IPV6)) &&
@@ -1045,13 +1054,13 @@  static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			return NETDEV_TX_OK;
 		}
 		/* Add in the P_Key for multicast*/
-		cb->hwaddr[8] = (priv->pkey >> 8) & 0xff;
-		cb->hwaddr[9] = priv->pkey & 0xff;
+		phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
+		phdr->hwaddr[9] = priv->pkey & 0xff;
 
-		neigh = ipoib_neigh_get(dev, cb->hwaddr);
+		neigh = ipoib_neigh_get(dev, phdr->hwaddr);
 		if (likely(neigh))
 			goto send_using_neigh;
-		ipoib_mcast_send(dev, cb->hwaddr, skb);
+		ipoib_mcast_send(dev, phdr->hwaddr, skb);
 		return NETDEV_TX_OK;
 	}
 
@@ -1060,16 +1069,16 @@  static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	case htons(ETH_P_IP):
 	case htons(ETH_P_IPV6):
 	case htons(ETH_P_TIPC):
-		neigh = ipoib_neigh_get(dev, cb->hwaddr);
+		neigh = ipoib_neigh_get(dev, phdr->hwaddr);
 		if (unlikely(!neigh)) {
-			neigh_add_path(skb, cb->hwaddr, dev);
+			neigh_add_path(skb, phdr->hwaddr, dev);
 			return NETDEV_TX_OK;
 		}
 		break;
 	case htons(ETH_P_ARP):
 	case htons(ETH_P_RARP):
 		/* for unicast ARP and RARP should always perform path find */
-		unicast_arp_send(skb, dev, cb);
+		unicast_arp_send(skb, dev, phdr);
 		return NETDEV_TX_OK;
 	default:
 		/* ethertype not supported by IPoIB */
@@ -1086,11 +1095,13 @@  send_using_neigh:
 			goto unref;
 		}
 	} else if (neigh->ah) {
-		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(cb->hwaddr));
+		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(phdr->hwaddr));
 		goto unref;
 	}
 
 	if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
+		/* put pseudoheader back on for next time */
+		skb_push(skb, sizeof(*phdr));
 		spin_lock_irqsave(&priv->lock, flags);
 		__skb_queue_tail(&neigh->queue, skb);
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1122,8 +1133,8 @@  static int ipoib_hard_header(struct sk_buff *skb,
 			     unsigned short type,
 			     const void *daddr, const void *saddr, unsigned len)
 {
+	struct ipoib_pseudo_header *phdr;
 	struct ipoib_header *header;
-	struct ipoib_cb *cb = ipoib_skb_cb(skb);
 
 	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
@@ -1132,12 +1143,13 @@  static int ipoib_hard_header(struct sk_buff *skb,
 
 	/*
 	 * we don't rely on dst_entry structure,  always stuff the
-	 * destination address into skb->cb so we can figure out where
+	 * destination address into skb hard header so we can figure out where
 	 * to send the packet later.
 	 */
-	memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
+	phdr = (struct ipoib_pseudo_header *) skb_push(skb, sizeof(*phdr));
+	memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN);
 
-	return sizeof *header;
+	return IPOIB_HARD_LEN;
 }
 
 static void ipoib_set_mcast_list(struct net_device *dev)
@@ -1759,7 +1771,7 @@  void ipoib_setup(struct net_device *dev)
 
 	dev->flags		|= IFF_BROADCAST | IFF_MULTICAST;
 
-	dev->hard_header_len	 = IPOIB_ENCAP_LEN;
+	dev->hard_header_len	 = IPOIB_HARD_LEN;
 	dev->addr_len		 = INFINIBAND_ALEN;
 	dev->type		 = ARPHRD_INFINIBAND;
 	dev->tx_queue_len	 = ipoib_sendq_size * 2;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index d3394b6..1909dd2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -796,9 +796,11 @@  void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 			__ipoib_mcast_add(dev, mcast);
 			list_add_tail(&mcast->list, &priv->multicast_list);
 		}
-		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
+		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE) {
+			/* put pseudoheader back on for next time */
+			skb_push(skb, sizeof(struct ipoib_pseudo_header));
 			skb_queue_tail(&mcast->pkt_queue, skb);
-		else {
+		} else {
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
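
As a sanity check of the alignment comments in the hunks above, this
standalone sketch (macros copied from the patch, ALIGN() approximated)
reproduces the claimed IP header offsets for both receive paths:

#include <stdio.h>

#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define IPOIB_ENCAP_LEN		4
#define IPOIB_PSEUDO_LEN	20
#define IPOIB_HARD_LEN		(IPOIB_ENCAP_LEN + IPOIB_PSEUDO_LEN)
#define IPOIB_CM_RX_RESERVE	(ALIGN(IPOIB_HARD_LEN, 16) - IPOIB_ENCAP_LEN)
#define IB_GRH_BYTES		40

int main(void)
{
	/* CM: reserve + 4-byte encap header -> IP at 32, 16-aligned */
	printf("CM: IP header offset = %d\n",
	       IPOIB_CM_RX_RESERVE + IPOIB_ENCAP_LEN);
	/* UD: pseudo-header reserve + GRH + encap -> IP at 64 */
	printf("UD: IP header offset = %d\n",
	       IPOIB_PSEUDO_LEN + IB_GRH_BYTES + IPOIB_ENCAP_LEN);
	return 0;
}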