diff mbox

net: convert mvneta to build_skb()

Message ID 20130704173552.GA23370@1wt.eu
State Deferred, archived
Delegated to: David Miller
Headers show

Commit Message

Willy Tarreau July 4, 2013, 5:35 p.m. UTC
From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
From: Willy Tarreau <w@1wt.eu>
Date: Sat, 15 Jun 2013 23:25:15 +0200
Subject: [PATCH] net: convert mvneta to build_skb()

We store the frag_size in the mvneta_port struct. In practice we'd need
a single bit to know how to free the data, but since we need the size to
call build_skb() anyway, let's store the full size.

With this patch, I observed a reproducible 2% performance improvement on
HTTP-based benchmarks.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 71 ++++++++++++++++++++++++-----------
 1 file changed, 50 insertions(+), 21 deletions(-)

Comments

David Miller July 4, 2013, 9:31 p.m. UTC | #1
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 4 Jul 2013 19:35:52 +0200

> From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Sat, 15 Jun 2013 23:25:15 +0200
> Subject: [PATCH] net: convert mvneta to build_skb()
> 
> We store the frag_size in the mvneta_port struct. In practice we'd need
> a single bit to know how to free the data, but since we need the size to
> call build_skb() anyway, let's store the full size.
> 
> With this patch, I observed a reproducible 2% performance improvement on
> HTTP-based benchmarks.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>

The net-next tree is closed now that the merge window is open, therefore
submissions such as this are not appropriate at this time.

Please resubmit this when net-next opens again, an event which I will
clearly announce here on netdev.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau July 4, 2013, 10:12 p.m. UTC | #2
On Thu, Jul 04, 2013 at 02:31:03PM -0700, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 4 Jul 2013 19:35:52 +0200
> 
> > From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Sat, 15 Jun 2013 23:25:15 +0200
> > Subject: [PATCH] net: convert mvneta to build_skb()
> > 
> > We store the frag_size in the mvneta_port struct. In practice we'd need
> > a single bit to know how to free the data, but since we need the size to
> > call build_skb() anyway, let's store the full size.
> > 
> > With this patch, I observed a reproducible 2% performance improvement on
> > HTTP-based benchmarks.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> 
> The net-next tree is closed now that the merge window is open, therefore
> submissions such as this are not appropriate at this time.
> 
> Please resubmit this when net-next opens again, an event which I will
> clearly announce here on netdev.

OK no problem, thanks.

Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 5, 2013, 7:17 a.m. UTC | #3
Dear Willy Tarreau,

On Thu, 4 Jul 2013 19:35:52 +0200, Willy Tarreau wrote:
> From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Sat, 15 Jun 2013 23:25:15 +0200
> Subject: [PATCH] net: convert mvneta to build_skb()
> 
> We store the frag_size in the mvneta_port struct. In practice we'd need
> a single bit to know how to free the data, but since we need the size to
> call build_skb() anyway, let's store the full size.
> 
> With this patch, I observed a reproducible 2% performance improvement on
> HTTP-based benchmarks.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>
> Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
> Cc: Eric Dumazet <edumazet@google.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 71 ++++++++++++++++++++++++-----------
>  1 file changed, 50 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index c966785..0f2c6df 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -225,6 +225,7 @@ struct mvneta_stats {
>  
>  struct mvneta_port {
>  	int pkt_size;
> +	unsigned int frag_size;
>  	void __iomem *base;
>  	struct mvneta_rx_queue *rxqs;
>  	struct mvneta_tx_queue *txqs;

Thanks Willy. Sorry for asking such a stupid question, but I'm not very
familiar with how this mechanism works. Can you explain why a single
'frag_size' field per port is sufficient? My concern is that this
frag_size seems to be a per-packet information, but we have potentially
multiple packets being received, and multiple RX queues. Is one single
'frag_size' per network interface sufficient?

For example, in mvneta_rx_refill(), you store the skb_size in
pp->frag_size, and then you later re-use it in mvneta_rxq_drop_pkts.
What guarantees you that mvneta_rx_refill() hasn't be called in the
mean time for a different packet, in a different rxq, for the same
network interface, and the value of pp->frag_size has been overridden?

Again, sorry for this stupid question, this is maybe just some basic
networking knowledge that I'm missing here, but I thought it would be
good to ask anyway.

Thanks!

Thomas
Willy Tarreau July 5, 2013, 7:43 a.m. UTC | #4
Hi Thomas,

On Fri, Jul 05, 2013 at 09:17:52AM +0200, Thomas Petazzoni wrote:
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -225,6 +225,7 @@ struct mvneta_stats {
> >  
> >  struct mvneta_port {
> >  	int pkt_size;
> > +	unsigned int frag_size;
> >  	void __iomem *base;
> >  	struct mvneta_rx_queue *rxqs;
> >  	struct mvneta_tx_queue *txqs;
> 
> Thanks Willy. Sorry for asking such a stupid question, but I'm not very
> familiar with how this mechanism works. Can you explain why a single
> 'frag_size' field per port is sufficient? My concern is that this
> frag_size seems to be a per-packet information, but we have potentially
> multiple packets being received, and multiple RX queues. Is one single
> 'frag_size' per network interface sufficient?

I had the exact same question when Eric sent me an experimental patch to
do this on mv64xxx_eth a few months ago. Then I checked how the frag_size
is computed. As you can see below, it does not depend on a per-packet size
but on a per-port size which is in fact the MTU ("pkt_size" is the misleading
part here) :

      skb_size = SKB_DATA_ALIGN(pp->pkt_size + MVNETA_MH_SIZE + NET_SKB_PAD) +
                 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

So if skb_size depends solely on pp (struct mvneta_port), then it makes
sense to have frag_size stored at the same place.

In practice we don't really need to store the frag_size in the struct, we
just need to know if the data were allocated using netdev_alloc_frag()
or kmalloc() to know how to free them. So a single bit would be enough,
and I thought about just doing a +1 on the pointer when we need to free
using kmalloc(). But that would add unneeded extra work in the fast path
to fix the pointers. And since we need to pass frag_size to build_skb()
it was a reasonable solution in my opinion.

> For example, in mvneta_rx_refill(), you store the skb_size in
> pp->frag_size, and then you later re-use it in mvneta_rxq_drop_pkts.
> What guarantees you that mvneta_rx_refill() hasn't be called in the
> mean time for a different packet, in a different rxq, for the same
> network interface, and the value of pp->frag_size has been overridden?

It's not a problem since the refill() applies pp->pkt_size which doesn't
change between calls. It's only changed in mvneta_change_mtu() which
first stops the device. So I think it's safe.

> Again, sorry for this stupid question, this is maybe just some basic
> networking knowledge that I'm missing here, but I thought it would be
> good to ask anyway.

No it's neither stupid nor basic. It's a bit tricky to understand first
but it looks really nice once you get it. I think I will study the ability
to use +/-1 on the pointer and see if we can get rid of frag_size (eg: tg3
doesn't store it).

Hoping this clarifies some points,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 5, 2013, 7:50 a.m. UTC | #5
Dear Willy Tarreau,

On Fri, 5 Jul 2013 09:43:30 +0200, Willy Tarreau wrote:

> > Thanks Willy. Sorry for asking such a stupid question, but I'm not very
> > familiar with how this mechanism works. Can you explain why a single
> > 'frag_size' field per port is sufficient? My concern is that this
> > frag_size seems to be a per-packet information, but we have potentially
> > multiple packets being received, and multiple RX queues. Is one single
> > 'frag_size' per network interface sufficient?
> 
> I had the exact same question when Eric sent me an experimental patch to
> do this on mv64xxx_eth a few months ago. Then I checked how the frag_size
> is computed. As you can see below, it does not depend on a per-packet size
> but on a per-port size which is in fact the MTU ("pkt_size" is the misleading
> part here) :
> 
>       skb_size = SKB_DATA_ALIGN(pp->pkt_size + MVNETA_MH_SIZE + NET_SKB_PAD) +
>                  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> 
> So if skb_size depends solely on pp (struct mvneta_port), then it makes
> sense to have frag_size stored at the same place.
> 
> In practice we don't really need to store the frag_size in the struct, we
> just need to know if the data were allocated using netdev_alloc_frag()
> or kmalloc() to know how to free them. So a single bit would be enough,
> and I thought about just doing a +1 on the pointer when we need to free
> using kmalloc(). But that would add unneeded extra work in the fast path
> to fix the pointers. And since we need to pass frag_size to build_skb()
> it was a reasonable solution in my opinion.

Aah, okay makes sense. So now, the question that comes up is why this
skb_size calculation is done in every call of mvneta_rx_refill() ? It
should be computed once, at the same time pkt_size is calculated, and
stored in the mvneta_port structure? Then you just need to test whether
it is smaller or larger than PAGE_SIZE to decide whether to use
netdev_alloc_frag() vs. kmalloc().

I.e, I would turn all the:

	if (pp->frag_size)
		...
	else
		...

into:

	if (pp->skb_size <= PAGE_SIZE)
		...
	else
		...

Of course, you can always hide this test behind a small macro or inline
function, to make it something like:

	if (mvneta_uses_small_skbs(pp))
		...
	else
		...

A better name than mvneta_uses_small_skbs() would of course be useful.

> > For example, in mvneta_rx_refill(), you store the skb_size in
> > pp->frag_size, and then you later re-use it in mvneta_rxq_drop_pkts.
> > What guarantees you that mvneta_rx_refill() hasn't be called in the
> > mean time for a different packet, in a different rxq, for the same
> > network interface, and the value of pp->frag_size has been overridden?
> 
> It's not a problem since the refill() applies pp->pkt_size which doesn't
> change between calls. It's only changed in mvneta_change_mtu() which
> first stops the device. So I think it's safe.

Yeah, sure, now that I see that pp->frag_size is constant for a given
MTU configuration, it looks safe. But I find the current code that
retests and reassigns pp->frag_size at every call of rxq_refill() to be
very confusing.

Thanks!

Thomas
Willy Tarreau July 5, 2013, 8:09 a.m. UTC | #6
On Fri, Jul 05, 2013 at 09:50:38AM +0200, Thomas Petazzoni wrote:
> > In practice we don't really need to store the frag_size in the struct, we
> > just need to know if the data were allocated using netdev_alloc_frag()
> > or kmalloc() to know how to free them. So a single bit would be enough,
> > and I thought about just doing a +1 on the pointer when we need to free
> > using kmalloc(). But that would add unneeded extra work in the fast path
> > to fix the pointers. And since we need to pass frag_size to build_skb()
> > it was a reasonable solution in my opinion.
> 
> Aah, okay makes sense. So now, the question that comes up is why this
> skb_size calculation is done in every call of mvneta_rx_refill() ? It
> should be computed once, at the same time pkt_size is calculated, and
> stored in the mvneta_port structure? Then you just need to test whether
> it is smaller or larger than PAGE_SIZE to decide whether to use
> netdev_alloc_frag() vs. kmalloc().
> 
> I.e, I would turn all the:
> 
> 	if (pp->frag_size)
> 		...
> 	else
> 		...
> 
> into:
> 
> 	if (pp->skb_size <= PAGE_SIZE)
> 		...
> 	else
> 		...

I was concerned by the same thing and saw that tg3 does the same. Then
I realized that we're only using constants here, so there isn't much
calculation (basically it's skb_size = pkt_size + something). However
it would probably make the code less confusing, especially if we were
two different readers concerned about this frag_size not changing per
packet.

> Of course, you can always hide this test behind a small macro or inline
> function, to make it something like:
> 
> 	if (mvneta_uses_small_skbs(pp))
> 		...
> 	else
> 		...
> 
> A better name than mvneta_uses_small_skbs() would of course be useful.
> 
> > > For example, in mvneta_rx_refill(), you store the skb_size in
> > > pp->frag_size, and then you later re-use it in mvneta_rxq_drop_pkts.
> > > What guarantees you that mvneta_rx_refill() hasn't be called in the
> > > mean time for a different packet, in a different rxq, for the same
> > > network interface, and the value of pp->frag_size has been overridden?
> > 
> > It's not a problem since the refill() applies pp->pkt_size which doesn't
> > change between calls. It's only changed in mvneta_change_mtu() which
> > first stops the device. So I think it's safe.
> 
> Yeah, sure, now that I see that pp->frag_size is constant for a given
> MTU configuration, it looks safe. But I find the current code that
> retests and reassigns pp->frag_size at every call of rxq_refill() to be
> very confusing.

Yes I agree. I think I'll experiment with the +/-1 on the pointer to
see the impact on the rest of the code, because probably that with a
few macros we could make that more transparent this way.

Best regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 15, 2013, 2:34 p.m. UTC | #7
Dear Willy Tarreau,

On Thu, 4 Jul 2013 19:35:52 +0200, Willy Tarreau wrote:
> From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Sat, 15 Jun 2013 23:25:15 +0200
> Subject: [PATCH] net: convert mvneta to build_skb()
> 
> We store the frag_size in the mvneta_port struct. In practice we'd need
> a single bit to know how to free the data, but since we need the size to
> call build_skb() anyway, let's store the full size.
> 
> With this patch, I observed a reproducible 2% performance improvement on
> HTTP-based benchmarks.
> 
> Signed-off-by: Willy Tarreau <w@1wt.eu>

3.11-rc1 is now out. Do you intend to resend an updated version of your
patches, according to the discussions we had?

Thanks!

Thomas
Willy Tarreau July 15, 2013, 3:12 p.m. UTC | #8
Hi Thomas,

On Mon, Jul 15, 2013 at 04:34:08PM +0200, Thomas Petazzoni wrote:
> Dear Willy Tarreau,
> 
> On Thu, 4 Jul 2013 19:35:52 +0200, Willy Tarreau wrote:
> > From 0180a5e651dd771de18bf2c031ecfe7bb4c88d3e Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Sat, 15 Jun 2013 23:25:15 +0200
> > Subject: [PATCH] net: convert mvneta to build_skb()
> > 
> > We store the frag_size in the mvneta_port struct. In practice we'd need
> > a single bit to know how to free the data, but since we need the size to
> > call build_skb() anyway, let's store the full size.
> > 
> > With this patch, I observed a reproducible 2% performance improvement on
> > HTTP-based benchmarks.
> > 
> > Signed-off-by: Willy Tarreau <w@1wt.eu>
> 
> 3.11-rc1 is now out. Do you intend to resend an updated version of your
> patches, according to the discussions we had?

Yes I'll do. This week-end I have tried to pre-compute the frag_size.
The performance did not change, nor the driver size, but the code is
cleaner and more readable (in my opinion).

I have noticed your recent patches to support big-endian, and since
we risk to get conflicts by touching similar areas, I guess you'll
prefer to have a clean patch series that applies well.

I'm still checking a little bit what can be easily improved (if any)
and will resend.

Best regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 15, 2013, 3:23 p.m. UTC | #9
Dear Willy Tarreau,

On Mon, 15 Jul 2013 17:12:40 +0200, Willy Tarreau wrote:

> > 3.11-rc1 is now out. Do you intend to resend an updated version of your
> > patches, according to the discussions we had?
> 
> Yes I'll do. This week-end I have tried to pre-compute the frag_size.
> The performance did not change, nor the driver size, but the code is
> cleaner and more readable (in my opinion).

Ok.

> I have noticed your recent patches to support big-endian, and since
> we risk to get conflicts by touching similar areas, I guess you'll
> prefer to have a clean patch series that applies well.

The big-endian support is pretty light in terms of driver changes, so
if there are conflicts, they should be easy to solve.

> I'm still checking a little bit what can be easily improved (if any)
> and will resend.

Excellent, thanks!

Thomas
Willy Tarreau July 15, 2013, 3:30 p.m. UTC | #10
[ removing Eric from Cc who's probably not interested by this ]

On Mon, Jul 15, 2013 at 05:23:21PM +0200, Thomas Petazzoni wrote:
> The big-endian support is pretty light in terms of driver changes, so
> if there are conflicts, they should be easy to solve.

I agree.

BTW, I'm still having the patch to re-enable phy-polling at boot in my
queue. Last time I tried without it I still only had eth0 working. I
completely forgot about this one. Since I did it blindly, I have no way
to verify if what I've done is necessary according to the specs. But I'm
quite sure that the NICs that are not enabled by the boot loaders don't
work.

If you want, I'll resend it to you at the same time.

Best regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 15, 2013, 3:35 p.m. UTC | #11
Dear Willy Tarreau,

On Mon, 15 Jul 2013 17:30:49 +0200, Willy Tarreau wrote:

> On Mon, Jul 15, 2013 at 05:23:21PM +0200, Thomas Petazzoni wrote:
> > The big-endian support is pretty light in terms of driver changes, so
> > if there are conflicts, they should be easy to solve.
> 
> I agree.
> 
> BTW, I'm still having the patch to re-enable phy-polling at boot in my
> queue. Last time I tried without it I still only had eth0 working. I
> completely forgot about this one. Since I did it blindly, I have no way
> to verify if what I've done is necessary according to the specs. But I'm
> quite sure that the NICs that are not enabled by the boot loaders don't
> work.
> 
> If you want, I'll resend it to you at the same time.

I still believe this is not the right fix, but unfortunately, I haven't
had the time to work on what would be the right fix. Hardware PHY
polling has been intentionally disabled, and the driver should work
without, since the PHY polling is done in software by the phylib.

Best regards,

Thomas
Florian Fainelli July 15, 2013, 3:52 p.m. UTC | #12
Hello Thomas, Willy,

2013/7/15 Thomas Petazzoni <thomas.petazzoni@free-electrons.com>:
> Dear Willy Tarreau,
>
> On Mon, 15 Jul 2013 17:30:49 +0200, Willy Tarreau wrote:
>
>> On Mon, Jul 15, 2013 at 05:23:21PM +0200, Thomas Petazzoni wrote:
>> > The big-endian support is pretty light in terms of driver changes, so
>> > if there are conflicts, they should be easy to solve.
>>
>> I agree.
>>
>> BTW, I'm still having the patch to re-enable phy-polling at boot in my
>> queue. Last time I tried without it I still only had eth0 working. I
>> completely forgot about this one. Since I did it blindly, I have no way
>> to verify if what I've done is necessary according to the specs. But I'm
>> quite sure that the NICs that are not enabled by the boot loaders don't
>> work.
>>
>> If you want, I'll resend it to you at the same time.
>
> I still believe this is not the right fix, but unfortunately, I haven't
> had the time to work on what would be the right fix. Hardware PHY
> polling has been intentionally disabled, and the driver should work
> without, since the PHY polling is done in software by the phylib.

You may now be able to use the PHY interrupt bit from your MAC
interrupt handler and call phy_mac_interrupt() provided by libphy. I
added it specifically to be able to use PHY link interrupts connected
to the MAC.
--
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau July 15, 2013, 5:01 p.m. UTC | #13
Hi Florian,

On Mon, Jul 15, 2013 at 04:52:44PM +0100, Florian Fainelli wrote:
> > I still believe this is not the right fix, but unfortunately, I haven't
> > had the time to work on what would be the right fix. Hardware PHY
> > polling has been intentionally disabled, and the driver should work
> > without, since the PHY polling is done in software by the phylib.
> 
> You may now be able to use the PHY interrupt bit from your MAC
> interrupt handler and call phy_mac_interrupt() provided by libphy. I
> added it specifically to be able to use PHY link interrupts connected
> to the MAC.

Thanks for the tip. I'm not seeing any interrupt when I plug/unplug a
cable, so maybe we first need to set a specific bit in the interrupt
mask to enable interrupt delivery on link state change prior to this.

Regards,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni July 15, 2013, 7:44 p.m. UTC | #14
Dear Florian Fainelli,

On Mon, 15 Jul 2013 16:52:44 +0100, Florian Fainelli wrote:

> > I still believe this is not the right fix, but unfortunately, I haven't
> > had the time to work on what would be the right fix. Hardware PHY
> > polling has been intentionally disabled, and the driver should work
> > without, since the PHY polling is done in software by the phylib.
> 
> You may now be able to use the PHY interrupt bit from your MAC
> interrupt handler and call phy_mac_interrupt() provided by libphy. I
> added it specifically to be able to use PHY link interrupts connected
> to the MAC.

Right, but the MAC feature in question makes the MAC directly talks to
the PHY to configure it, completely bypassing the entire libphy logic.
I am not sure how to handle this together with libphy.

Thomas
Florian Fainelli July 15, 2013, 11:02 p.m. UTC | #15
Hello Thomas,

2013/7/15 Thomas Petazzoni <thomas.petazzoni@free-electrons.com>:
> Dear Florian Fainelli,
>
> On Mon, 15 Jul 2013 16:52:44 +0100, Florian Fainelli wrote:
>
>> > I still believe this is not the right fix, but unfortunately, I haven't
>> > had the time to work on what would be the right fix. Hardware PHY
>> > polling has been intentionally disabled, and the driver should work
>> > without, since the PHY polling is done in software by the phylib.
>>
>> You may now be able to use the PHY interrupt bit from your MAC
>> interrupt handler and call phy_mac_interrupt() provided by libphy. I
>> added it specifically to be able to use PHY link interrupts connected
>> to the MAC.
>
> Right, but the MAC feature in question makes the MAC directly talks to
> the PHY to configure it, completely bypassing the entire libphy logic.
> I am not sure how to handle this together with libphy.

Then you might just replace the default libphy with your PHY state
machine and manage it from within the mvneta driver. This is something
less usual but libphy still allows that.
--
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index c966785..0f2c6df 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -225,6 +225,7 @@  struct mvneta_stats {
 
 struct mvneta_port {
 	int pkt_size;
+	unsigned int frag_size;
 	void __iomem *base;
 	struct mvneta_rx_queue *rxqs;
 	struct mvneta_tx_queue *txqs;
@@ -1259,22 +1260,33 @@  static int mvneta_rx_refill(struct mvneta_port *pp,
 
 {
 	dma_addr_t phys_addr;
-	struct sk_buff *skb;
-
-	skb = netdev_alloc_skb(pp->dev, pp->pkt_size);
-	if (!skb)
+	void *data;
+	unsigned int skb_size;
+
+	skb_size = SKB_DATA_ALIGN(pp->pkt_size + MVNETA_MH_SIZE + NET_SKB_PAD) +
+		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (skb_size <= PAGE_SIZE) {
+		data = netdev_alloc_frag(skb_size);
+		pp->frag_size = skb_size;
+	} else {
+		data = kmalloc(skb_size, GFP_ATOMIC);
+		pp->frag_size = 0;
+	}
+	if (!data)
 		return -ENOMEM;
 
-	phys_addr = dma_map_single(pp->dev->dev.parent, skb->head,
+	phys_addr = dma_map_single(pp->dev->dev.parent, data,
 				   MVNETA_RX_BUF_SIZE(pp->pkt_size),
 				   DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(pp->dev->dev.parent, phys_addr))) {
-		dev_kfree_skb(skb);
+		if (pp->frag_size)
+			put_page(virt_to_head_page(data));
+		else
+			kfree(data);
 		return -ENOMEM;
 	}
 
-	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)skb);
-
+	mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
 	return 0;
 }
 
@@ -1328,9 +1340,13 @@  static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
 	rx_done = mvneta_rxq_busy_desc_num_get(pp, rxq);
 	for (i = 0; i < rxq->size; i++) {
 		struct mvneta_rx_desc *rx_desc = rxq->descs + i;
-		struct sk_buff *skb = (struct sk_buff *)rx_desc->buf_cookie;
+		void *data = (void *)rx_desc->buf_cookie;
+
+		if (pp->frag_size)
+			put_page(virt_to_head_page(data));
+		else
+			kfree(data);
 
-		dev_kfree_skb_any(skb);
 		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
 				 rx_desc->data_size, DMA_FROM_DEVICE);
 	}
@@ -1359,6 +1375,7 @@  static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 	while (rx_done < rx_todo) {
 		struct mvneta_rx_desc *rx_desc = mvneta_rxq_next_desc_get(rxq);
 		struct sk_buff *skb;
+		unsigned char *data;
 		u32 rx_status;
 		int rx_bytes, err;
 
@@ -1366,14 +1383,14 @@  static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		rx_done++;
 		rx_filled++;
 		rx_status = rx_desc->status;
-		skb = (struct sk_buff *)rx_desc->buf_cookie;
+		data = (unsigned char *)rx_desc->buf_cookie;
 
 		if (!mvneta_rxq_desc_is_first_last(rx_desc) ||
-		    (rx_status & MVNETA_RXD_ERR_SUMMARY)) {
+		    (rx_status & MVNETA_RXD_ERR_SUMMARY) ||
+		    !(skb = build_skb(data, pp->frag_size))) {
 			dev->stats.rx_errors++;
 			mvneta_rx_error(pp, rx_desc);
-			mvneta_rx_desc_fill(rx_desc, rx_desc->buf_phys_addr,
-					    (u32)skb);
+			/* leave the descriptor untouched */
 			continue;
 		}
 
@@ -1388,7 +1405,7 @@  static int mvneta_rx(struct mvneta_port *pp, int rx_todo,
 		u64_stats_update_end(&pp->rx_stats.syncp);
 
 		/* Linux processing */
-		skb_reserve(skb, MVNETA_MH_SIZE);
+		skb_reserve(skb, MVNETA_MH_SIZE + NET_SKB_PAD);
 		skb_put(skb, rx_bytes);
 
 		skb->protocol = eth_type_trans(skb, dev);
@@ -1905,12 +1922,21 @@  static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	int i;
 
 	for (i = 0; i < num; i++) {
-		struct sk_buff *skb;
+		void *data;
 		struct mvneta_rx_desc *rx_desc;
 		unsigned long phys_addr;
+		unsigned int skb_size;
 
-		skb = dev_alloc_skb(pp->pkt_size);
-		if (!skb) {
+		skb_size = SKB_DATA_ALIGN(pp->pkt_size + MVNETA_MH_SIZE + NET_SKB_PAD) +
+			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (skb_size <= PAGE_SIZE) {
+			data = netdev_alloc_frag(skb_size);
+			pp->frag_size = skb_size;
+		} else {
+			data = kmalloc(skb_size, GFP_ATOMIC);
+			pp->frag_size = 0;
+		}
+		if (!data) {
 			netdev_err(dev, "%s:rxq %d, %d of %d buffs  filled\n",
 				__func__, rxq->id, i, num);
 			break;
@@ -1918,15 +1944,18 @@  static int mvneta_rxq_fill(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 		rx_desc = rxq->descs + i;
 		memset(rx_desc, 0, sizeof(struct mvneta_rx_desc));
-		phys_addr = dma_map_single(dev->dev.parent, skb->head,
+		phys_addr = dma_map_single(dev->dev.parent, data,
 					   MVNETA_RX_BUF_SIZE(pp->pkt_size),
 					   DMA_FROM_DEVICE);
 		if (unlikely(dma_mapping_error(dev->dev.parent, phys_addr))) {
-			dev_kfree_skb(skb);
+			if (pp->frag_size)
+				put_page(virt_to_head_page(data));
+			else
+				kfree(data);
 			break;
 		}
 
-		mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)skb);
+		mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
 	}
 
 	/* Add this number of RX descriptors as non occupied (ready to