diff mbox series

[LEDE-DEV] ag71xx: Add some unlikely calls + rearange some stuff in hard_start_xmit.

Message ID 20180213225351.8560-1-rosenp@gmail.com
State Changes Requested
Delegated to: John Crispin
Headers show
Series [LEDE-DEV] ag71xx: Add some unlikely calls + rearange some stuff in hard_start_xmit. | expand

Commit Message

Rosen Penev Feb. 13, 2018, 10:53 p.m. UTC
Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
---
 .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

Comments

John Crispin Feb. 14, 2018, 6:52 a.m. UTC | #1
On 13/02/18 23:53, Rosen Penev wrote:
> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.

this is missing the description of what the patch does.

     John
>
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>   .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> index 95682b7641..d32f220178 100644
> --- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> +++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>   	if (ag71xx_has_ar8216(ag))
>   		ag71xx_add_ar8216_header(ag, skb);
>   
> -	if (skb->len <= 4) {
> +	dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
> +
> +	if (unlikely(skb->len <= 4)) {
>   		DBG("%s: packet len is too small\n", ag->dev->name);
>   		goto err_drop;
>   	}
>   
> +	netdev_sent_queue(dev, skb->len);
>   	dma_addr = dma_map_single(&dev->dev, skb->data, skb->len,
>   				  DMA_TO_DEVICE);
>   
> @@ -817,27 +820,24 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>   	ring->buf[i].len = skb->len;
>   	ring->buf[i].skb = skb;
>   
> -	netdev_sent_queue(dev, skb->len);
> -
>   	skb_tx_timestamp(skb);
>   
>   	desc->ctrl &= ~DESC_EMPTY;
>   	ring->curr += n;
>   
> -	/* flush descriptor */
> -	wmb();
> -
>   	ring_min = 2;
>   	if (ring->desc_split)
>   	    ring_min *= AG71XX_TX_RING_DS_PER_PKT;
>   
> -	if (ring->curr - ring->dirty >= ring_size - ring_min) {
> +	if (unlikely(ring->curr - ring->dirty >= ring_size - ring_min)) {
>   		DBG("%s: tx queue full\n", dev->name);
>   		netif_stop_queue(dev);
>   	}
>   
>   	DBG("%s: packet injected into TX queue\n", ag->dev->name);
>   
> +	/* flush descriptor */
> +	wmb();
>   	/* enable TX engine */
>   	ag71xx_wr(ag, AG71XX_REG_TX_CTRL, TX_CTRL_TXE);
>
Felix Fietkau Feb. 14, 2018, 7:10 a.m. UTC | #2
On 2018-02-13 23:53, Rosen Penev wrote:
> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.
> 
> Signed-off-by: Rosen Penev <rosenp@gmail.com>
> ---
>  .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> index 95682b7641..d32f220178 100644
> --- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> +++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>  	if (ag71xx_has_ar8216(ag))
>  		ag71xx_add_ar8216_header(ag, skb);
>  
> -	if (skb->len <= 4) {
> +	dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
The use of dma_cache_sync here makes no sense, since it's the wrong API.
Also, effectively it results in the same kind of cache flush as the one
that's done by the DMA mapping done later.

- Felix
Rosen Penev Feb. 14, 2018, 3:20 p.m. UTC | #3
On Tue, Feb 13, 2018 at 11:10 PM, Felix Fietkau <nbd@nbd.name> wrote:
> On 2018-02-13 23:53, Rosen Penev wrote:
>> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.
>>
>> Signed-off-by: Rosen Penev <rosenp@gmail.com>
>> ---
>>  .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> index 95682b7641..d32f220178 100644
>> --- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> +++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>>       if (ag71xx_has_ar8216(ag))
>>               ag71xx_add_ar8216_header(ag, skb);
>>
>> -     if (skb->len <= 4) {
>> +     dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
> The use of dma_cache_sync here makes no sense, since it's the wrong API.
> Also, effectively it results in the same kind of cache flush as the one
> that's done by the DMA mapping done later.

From my reading of dma_cache_flush, I agree. However that's the part
of this patch that gives the biggest speedup. Before this patch, I
tested just adding that and it worked. I can back this up with
benchmarks later on.
>
> - Felix
Rosen Penev Feb. 14, 2018, 3:21 p.m. UTC | #4
On Tue, Feb 13, 2018 at 10:52 PM, John Crispin <john@phrozen.org> wrote:
>
>
> On 13/02/18 23:53, Rosen Penev wrote:
>>
>> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on
>> transmit on Archer C7v4.
>
>
> this is missing the description of what the patch does.
Unfortunately I have no real idea as of yet.
>
>     John
>
>>
>> Signed-off-by: Rosen Penev <rosenp@gmail.com>
>> ---
>>   .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14
>> +++++++-------
>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git
>> a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> index 95682b7641..d32f220178 100644
>> ---
>> a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> +++
>> b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct
>> sk_buff *skb,
>>         if (ag71xx_has_ar8216(ag))
>>                 ag71xx_add_ar8216_header(ag, skb);
>>   -     if (skb->len <= 4) {
>> +       dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
>> +
>> +       if (unlikely(skb->len <= 4)) {
>>                 DBG("%s: packet len is too small\n", ag->dev->name);
>>                 goto err_drop;
>>         }
>>   +     netdev_sent_queue(dev, skb->len);
>>         dma_addr = dma_map_single(&dev->dev, skb->data, skb->len,
>>                                   DMA_TO_DEVICE);
>>   @@ -817,27 +820,24 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct
>> sk_buff *skb,
>>         ring->buf[i].len = skb->len;
>>         ring->buf[i].skb = skb;
>>   -     netdev_sent_queue(dev, skb->len);
>> -
>>         skb_tx_timestamp(skb);
>>         desc->ctrl &= ~DESC_EMPTY;
>>         ring->curr += n;
>>   -     /* flush descriptor */
>> -       wmb();
>> -
>>         ring_min = 2;
>>         if (ring->desc_split)
>>             ring_min *= AG71XX_TX_RING_DS_PER_PKT;
>>   -     if (ring->curr - ring->dirty >= ring_size - ring_min) {
>> +       if (unlikely(ring->curr - ring->dirty >= ring_size - ring_min)) {
>>                 DBG("%s: tx queue full\n", dev->name);
>>                 netif_stop_queue(dev);
>>         }
>>         DBG("%s: packet injected into TX queue\n", ag->dev->name);
>>   +     /* flush descriptor */
>> +       wmb();
>>         /* enable TX engine */
>>         ag71xx_wr(ag, AG71XX_REG_TX_CTRL, TX_CTRL_TXE);
>>
>
>
Felix Fietkau Feb. 15, 2018, 11:29 a.m. UTC | #5
On 2018-02-14 16:20, Rosen Penev wrote:
> On Tue, Feb 13, 2018 at 11:10 PM, Felix Fietkau <nbd@nbd.name> wrote:
>> On 2018-02-13 23:53, Rosen Penev wrote:
>>> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.
>>>
>>> Signed-off-by: Rosen Penev <rosenp@gmail.com>
>>> ---
>>>  .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>> index 95682b7641..d32f220178 100644
>>> --- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>> +++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>>>       if (ag71xx_has_ar8216(ag))
>>>               ag71xx_add_ar8216_header(ag, skb);
>>>
>>> -     if (skb->len <= 4) {
>>> +     dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
>> The use of dma_cache_sync here makes no sense, since it's the wrong API.
>> Also, effectively it results in the same kind of cache flush as the one
>> that's done by the DMA mapping done later.
> 
> From my reading of dma_cache_flush, I agree. However that's the part
> of this patch that gives the biggest speedup. Before this patch, I
> tested just adding that and it worked. I can back this up with
> benchmarks later on.
In my test I quite often encountered big differences in throughput from
minor (and often very much unrelated) changes. I haven't found the real
source of these differences yet, so it's hard to know which changes
really help in the long run. My best guess is that some changes affect
the alignment of critical functions and thus affect the icache footprint
somehow.

Until I see a reasonable explanation of how this change helps, I'm going
to assume it's the same kind of random change fluctuation that I've seen
so often and NACK this change.

- Felix
Rosen Penev Feb. 16, 2018, 10:44 p.m. UTC | #6
On Thu, Feb 15, 2018 at 3:29 AM, Felix Fietkau <nbd@nbd.name> wrote:
> On 2018-02-14 16:20, Rosen Penev wrote:
>> On Tue, Feb 13, 2018 at 11:10 PM, Felix Fietkau <nbd@nbd.name> wrote:
>>> On 2018-02-13 23:53, Rosen Penev wrote:
>>>> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on transmit on Archer C7v4.
>>>>
>>>> Signed-off-by: Rosen Penev <rosenp@gmail.com>
>>>> ---
>>>>  .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 +++++++-------
>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> index 95682b7641..d32f220178 100644
>>>> --- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> +++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> @@ -797,11 +797,14 @@ static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
>>>>       if (ag71xx_has_ar8216(ag))
>>>>               ag71xx_add_ar8216_header(ag, skb);
>>>>
>>>> -     if (skb->len <= 4) {
>>>> +     dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
>>> The use of dma_cache_sync here makes no sense, since it's the wrong API.
>>> Also, effectively it results in the same kind of cache flush as the one
>>> that's done by the DMA mapping done later.
>>
>> From my reading of dma_cache_flush, I agree. However that's the part
>> of this patch that gives the biggest speedup. Before this patch, I
>> tested just adding that and it worked. I can back this up with
>> benchmarks later on.
> In my test I quite often encountered big differences in throughput from
> minor (and often very much unrelated) changes. I haven't found the real
> source of these differences yet, so it's hard to know which changes
> really help in the long run. My best guess is that some changes affect
> the alignment of critical functions and thus affect the icache footprint
> somehow.
>
> Until I see a reasonable explanation of how this change helps, I'm going
> to assume it's the same kind of random change fluctuation that I've seen
> so often and NACK this change.

So I just benchmarked this while not connected to anything. Basically,
WAN port assigned to LAN interface (it's the same CPU port - eth0) and
br-lan removed so that eth0.1 is assigned to LAN.

I noticed something bizarre. With the patch applied, I was getting
~150mbps on iperf3 consistently. With the patch not applied, I was
getting the same speed, EXCEPT the speed would sometimes bump up to
~250mbps.

Adding dma_cache_sync eliminated this boost. I have no explanation for
this. Maybe TX queue gets full? Sample result:

Server listening on 5201
-----------------------------------------------------------
Accepted connection from 192.168.1.1, port 40080
[  5] local 192.168.1.2 port 5201 connected to 192.168.1.1 port 40082
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  17.0 MBytes   143 Mbits/sec
[  5]   1.00-2.00   sec  18.1 MBytes   152 Mbits/sec
[  5]   2.00-3.00   sec  17.5 MBytes   147 Mbits/sec
[  5]   3.00-4.00   sec  17.6 MBytes   148 Mbits/sec
[  5]   4.00-5.00   sec  17.8 MBytes   149 Mbits/sec
[  5]   5.00-6.00   sec  18.0 MBytes   151 Mbits/sec
[  5]   6.00-7.00   sec  25.5 MBytes   214 Mbits/sec
[  5]   7.00-8.00   sec  30.9 MBytes   259 Mbits/sec
[  5]   8.00-9.00   sec  30.4 MBytes   255 Mbits/sec
[  5]   9.00-10.00  sec  31.1 MBytes   261 Mbits/sec
[  5]  10.00-10.06  sec  1.77 MBytes   250 Mbits/sec

>
> - Felix
Roman Yeryomin Feb. 19, 2018, 10:55 a.m. UTC | #7
On 2018-02-15 13:29, Felix Fietkau wrote:
> On 2018-02-14 16:20, Rosen Penev wrote:
>> On Tue, Feb 13, 2018 at 11:10 PM, Felix Fietkau <nbd@nbd.name> wrote:
>>> On 2018-02-13 23:53, Rosen Penev wrote:
>>>> Based on Qualcomm driver. Improves iperf3 throughput by ~20mbps on 
>>>> transmit on Archer C7v4.
>>>> 
>>>> Signed-off-by: Rosen Penev <rosenp@gmail.com>
>>>> ---
>>>>  .../drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c      | 14 
>>>> +++++++-------
>>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>> 
>>>> diff --git 
>>>> a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c 
>>>> b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> index 95682b7641..d32f220178 100644
>>>> --- 
>>>> a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> +++ 
>>>> b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
>>>> @@ -797,11 +797,14 @@ static netdev_tx_t 
>>>> ag71xx_hard_start_xmit(struct sk_buff *skb,
>>>>       if (ag71xx_has_ar8216(ag))
>>>>               ag71xx_add_ar8216_header(ag, skb);
>>>> 
>>>> -     if (skb->len <= 4) {
>>>> +     dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
>>> The use of dma_cache_sync here makes no sense, since it's the wrong 
>>> API.
>>> Also, effectively it results in the same kind of cache flush as the 
>>> one
>>> that's done by the DMA mapping done later.
>> 
>> From my reading of dma_cache_flush, I agree. However that's the part
>> of this patch that gives the biggest speedup. Before this patch, I
>> tested just adding that and it worked. I can back this up with
>> benchmarks later on.
> In my test I quite often encountered big differences in throughput from
> minor (and often very much unrelated) changes. I haven't found the real
> source of these differences yet, so it's hard to know which changes
> really help in the long run. My best guess is that some changes affect
> the alignment of critical functions and thus affect the icache 
> footprint
> somehow.

Exactly my impression I got when have made a mistake and jumped into 
this rabbit hole some time ago.
Trying to port the changes QCA did I was getting random tput changes, 
often not logical at all. Then I started to suspect it's related to code 
alignment and MIPS architecture features.
I realized that after I got down to (ar8327) switch port buffers and 
wrote few additional functions to control them. And got +50Mbps stable 
right away, not even touching default buffer settings.
Functions were never run, not even at switch init.

> Until I see a reasonable explanation of how this change helps, I'm 
> going
> to assume it's the same kind of random change fluctuation that I've 
> seen
> so often and NACK this change.
> 
> - Felix
> 
> _______________________________________________
> Lede-dev mailing list
> Lede-dev@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/lede-dev
diff mbox series

Patch

diff --git a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
index 95682b7641..d32f220178 100644
--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -797,11 +797,14 @@  static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
 	if (ag71xx_has_ar8216(ag))
 		ag71xx_add_ar8216_header(ag, skb);
 
-	if (skb->len <= 4) {
+	dma_cache_sync (NULL, skb->data, skb->len, DMA_TO_DEVICE);
+
+	if (unlikely(skb->len <= 4)) {
 		DBG("%s: packet len is too small\n", ag->dev->name);
 		goto err_drop;
 	}
 
+	netdev_sent_queue(dev, skb->len);
 	dma_addr = dma_map_single(&dev->dev, skb->data, skb->len,
 				  DMA_TO_DEVICE);
 
@@ -817,27 +820,24 @@  static netdev_tx_t ag71xx_hard_start_xmit(struct sk_buff *skb,
 	ring->buf[i].len = skb->len;
 	ring->buf[i].skb = skb;
 
-	netdev_sent_queue(dev, skb->len);
-
 	skb_tx_timestamp(skb);
 
 	desc->ctrl &= ~DESC_EMPTY;
 	ring->curr += n;
 
-	/* flush descriptor */
-	wmb();
-
 	ring_min = 2;
 	if (ring->desc_split)
 	    ring_min *= AG71XX_TX_RING_DS_PER_PKT;
 
-	if (ring->curr - ring->dirty >= ring_size - ring_min) {
+	if (unlikely(ring->curr - ring->dirty >= ring_size - ring_min)) {
 		DBG("%s: tx queue full\n", dev->name);
 		netif_stop_queue(dev);
 	}
 
 	DBG("%s: packet injected into TX queue\n", ag->dev->name);
 
+	/* flush descriptor */
+	wmb();
 	/* enable TX engine */
 	ag71xx_wr(ag, AG71XX_REG_TX_CTRL, TX_CTRL_TXE);