
[net-next,1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

Message ID 1481549958-1265-1-git-send-email-volodymyr.bendiuga@gmail.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Volodymyr Bendiuga Dec. 12, 2016, 1:39 p.m. UTC
Using a hashtable makes inserting fdb entries into the forwarding
database significantly faster.

Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Andrew Lunn Dec. 12, 2016, 1:54 p.m. UTC | #1
On Mon, Dec 12, 2016 at 02:39:18PM +0100, Volodymyr Bendiuga wrote:
> Hashtable will make it extremely faster when inserting fdb entries
> into the forwarding database.
> 
> Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
> ---
>  drivers/net/dsa/mv88e6xxx/mv88e6xxx.h | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> index 431e954..407e6db 100644
> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
> @@ -15,6 +15,8 @@
>  #include <linux/if_vlan.h>
>  #include <linux/irq.h>
>  #include <linux/gpio/consumer.h>
> +#include <linux/crc32.h>
> +#include <linux/hashtable.h>
>  
>  #ifndef UINT64_MAX
>  #define UINT64_MAX		(u64)(~((u64)0))
> @@ -672,6 +674,16 @@ struct mv88e6xxx_info {
>  	const struct mv88e6xxx_ops *ops;
>  };
>  
> +struct pvec_tbl_entry {

Please use the mv88e6xxx_ prefix.

       Andrew
Andrew Lunn Dec. 12, 2016, 2:18 p.m. UTC | #2
Hi Volodymyr

Any multi-patch patchset should include a cover note. See git
format-patch --cover-letter. This should explain the big picture: what
the patchset does, why, etc.

Also, David sent out an email saying net-next was closed. While it is
closed, please send patches as RFC.

Thanks
	Andrew
Vivien Didelot Dec. 12, 2016, 3:03 p.m. UTC | #3
Hi Volodymyr,

Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> writes:

> Hashtable will make it extremely faster when inserting fdb entries
> into the forwarding database.

This is hard to follow. As Andrew correctly mentioned, when you have two
or more patches, please format them with --cover-letter, describe them
in patch 0 and send them all together, so that we get a single thread.

You are correct about the ATU get operations being slow. However, we
intentionally keep the driver stateless at the moment to keep it simple.

Unless speed is an issue to fix here, I am quite reluctant to add
caching to this driver yet. What is the issue you are having with the
ATU being slow?

Thanks,

	Vivien
Andrew Lunn Dec. 12, 2016, 4:07 p.m. UTC | #4
On Mon, Dec 12, 2016 at 04:22:13PM +0100, Volodymyr Bendiuga wrote:
> Hi,
> 
> I apologise for incorrectly formatted patch, I will fix and resend it.

Please don't resend with just the white space fixed. The memory model
is very important, using the fdb_prepare() call to allocate memory,
etc.

   Andrew
Vivien Didelot Dec. 12, 2016, 4:22 p.m. UTC | #5
Hi Volodymyr,

Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> writes:

> +struct pvec_tbl_entry {
> +        struct hlist_node entry;
> +        u32 key_crc32; /* key */
> +        u16 pvec;
> +        struct pvec_tbl_key {
> +                u8 addr[ETH_ALEN];
> +                u16 fid;
> +        } key;
> +};
> +
>  struct mv88e6xxx_atu_entry {
>  	u16	fid;
>  	u8	state;

Also, if we were to cache some values in mv88e6xxx, I'd use the existing
structures which match the hardware definition. Easier to understand.

mv88e6xxx_atu_entry already represents an ATU entry with its port
vector, FID, MAC address and more. Do you think it can be used here?

Thanks,

        Vivien
Florian Fainelli Dec. 12, 2016, 4:37 p.m. UTC | #6
On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> Hi,
> 
> I apologise for incorrectly formatted patch, I will fix and resend it.
> The problem with the ATU right now is that it is too slow when inserting
> entries.
> When the OS boots up, it might insert some multicast entries into the
> atu (if
> they are preconfigured by user). I run a test with 10 mc entries being
> configured for
> each port (13 ports), and it took 15 seconds, which made system quite
> slow on responding to
> other commands, as it has been inserting mc entries. The implementation
> with hashtable
> made insert command for 13 ports and 10 entries per port about 700 msec
> long.


Just wondering, how do you achieve such a speed up? What part of using a
hashtable allows you not to write down all 10 MC entries across the
ports? I would assume that the number of MDIO (is that the bus you are
using here?) operations would be identical.

Seeing such a change makes me wonder if we should not try to push some
of this hashtable abstraction (provided that we agree we want it) at a
higher layer, like net/dsa/slave.c?

Thanks!
Vivien Didelot Dec. 12, 2016, 5:11 p.m. UTC | #7
Hi all,

Florian Fainelli <f.fainelli@gmail.com> writes:

> Seeing such a change makes me wonder if we should not try to push some
> of this hashtable abstraction (provided that we agree we want it) at a
> higher layer, like net/dsa/slave.c?

That is the major reason why I am reluctant to cache stuff in drivers.

In most cases, we want the DSA drivers to be "stupid", as stateless
as possible, simply implementing the supported DSA switch operations.

The DSA core then handles the generic logic of how switch fabrics should
behave, and thus all DSA drivers are consistent and benefit from this.

Thanks,

        Vivien
Andrew Lunn Dec. 12, 2016, 7:09 p.m. UTC | #8
On Mon, Dec 12, 2016 at 08:37:50AM -0800, Florian Fainelli wrote:
> On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> > Hi,
> > 
> > I apologise for incorrectly formatted patch, I will fix and resend it.
> > The problem with the ATU right now is that it is too slow when inserting
> > entries.
> > When the OS boots up, it might insert some multicast entries into the
> > atu (if
> > they are preconfigured by user). I run a test with 10 mc entries being
> > configured for
> > each port (13 ports), and it took 15 seconds, which made system quite
> > slow on responding to
> > other commands, as it has been inserting mc entries. The implementation
> > with hashtable
> > made insert command for 13 ports and 10 entries per port about 700 msec
> > long.

Humm, it looks like we are doing the atu_get wrong. We are looking for
a specific MAC address. Yet we seem to be walking the whole table to
find it, rather than getting the hardware to do the search. 

The current code is:

static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
                             const u8 *addr, struct mv88e6xxx_atu_entry *entry)
{
        struct mv88e6xxx_atu_entry next;
        int err;

        eth_broadcast_addr(next.mac);

        err = _mv88e6xxx_atu_mac_write(chip, next.mac);

We should be setting next.mac to one less than the address we are
looking for.

Volodymyr, please could you try that, and see how much of a speed up
you get.

There is another optimization which can be made. We only say there is
no such entry once we have reached the end of the table. But it will
return the entries in ascending order. So if the entry it returned is
bigger than what we are looking for, we can immediately abort the
search and say it does not exist.

   Andrew
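Andrew's two suggestions above can be sketched in userspace C. This is an editor's model, not driver code; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical helper: decrement a MAC address by one, with borrow,
 * so that a GetNext started from it returns the target address first
 * if that address is present in the ATU. */
static void mac_addr_dec(uint8_t mac[ETH_ALEN])
{
	int i;

	for (i = ETH_ALEN - 1; i >= 0; i--)
		if (mac[i]--)
			break;	/* byte was non-zero: no borrow needed */
}

/* Early abort: GetNext returns entries in ascending order, so once a
 * returned address compares greater than the target, the target is
 * not in the table and the search can stop. */
static int mac_addr_gt(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, ETH_ALEN) > 0;
}
```

With these, an atu_get would start GetNext at target minus one and bail out as soon as a returned entry exceeds the target.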
Vivien Didelot Dec. 12, 2016, 8:03 p.m. UTC | #9
Hi Andrew,

Andrew Lunn <andrew@lunn.ch> writes:

> Humm, it looks like we are doing the atu_get wrong. We are looking for
> a specific MAC address. Yet we seem to be walking the whole table to
> find it, rather than getting the hardware to do the search. 

We are not doing it wrong, the hardware does the search. A classic dump
of an ATU database consists of starting from the broadcast address
ff:ff:ff:ff:ff:ff and issuing GetNext operations until we get the
broadcast address back. Only addresses in use are returned by GetNext,
thus dumping an empty database completes in a single operation.

I implemented atu_get this way intentionally because it gives simpler
code than doing arithmetic on MAC addresses (unless there is simple
increment/decrement code I am unaware of).

> The current code is:
>
> static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
>                              const u8 *addr, struct mv88e6xxx_atu_entry *entry)
> {
>         struct mv88e6xxx_atu_entry next;
>         int err;
>
>         eth_broadcast_addr(next.mac);
>
>         err = _mv88e6xxx_atu_mac_write(chip, next.mac);
>
> We should be setting next.mac to one less than the address we are
> looking for.
>
> Volodymyr, please could you try that, and see how much of a speed up
> you get.
>
> There is another optimization which can be made. We only say there is
> no such entry once we have reached the end of the table. But it will
> return the entries in ascending order. So if the entry it returned is
> bigger than what we are looking for, we can immediately abort the
> search and say it does not exist.

However your two suggestions to optimize the lookup are correct. It'd be
interesting to see if that makes a significant difference or not.

Thanks,

        Vivien
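The dump procedure Vivien describes (start at the broadcast address, GetNext until it comes back) can be modeled in userspace; the sorted "toy table" below stands in for the hardware, and all names are illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

static const uint8_t bcast[ETH_ALEN] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};

/* Toy GetNext: return the lowest in-use address strictly above @cur,
 * or the broadcast address when the table is exhausted. Starting from
 * the broadcast address wraps around to the lowest in-use address. */
static void atu_getnext(const uint8_t (*tbl)[ETH_ALEN], int n,
			const uint8_t *cur, uint8_t *out)
{
	const uint8_t *best = NULL;
	int from_bcast = !memcmp(cur, bcast, ETH_ALEN);
	int i;

	for (i = 0; i < n; i++) {
		if (!from_bcast && memcmp(tbl[i], cur, ETH_ALEN) <= 0)
			continue;
		if (!best || memcmp(tbl[i], best, ETH_ALEN) < 0)
			best = tbl[i];
	}
	memcpy(out, best ? best : bcast, ETH_ALEN);
}

/* Dump: issue GetNext until the broadcast address comes back, so an
 * empty table finishes in a single operation. */
static int atu_dump(const uint8_t (*tbl)[ETH_ALEN], int n,
		    uint8_t (*out)[ETH_ALEN])
{
	uint8_t cur[ETH_ALEN];
	int count = 0;

	memcpy(cur, bcast, ETH_ALEN);
	for (;;) {
		atu_getnext(tbl, n, cur, cur);
		if (!memcmp(cur, bcast, ETH_ALEN))
			return count;
		memcpy(out[count++], cur, ETH_ALEN);
	}
}
```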
Andrew Lunn Dec. 13, 2016, 3:09 p.m. UTC | #10
> Another interesting thing is that fdb dump operation is slow
> too. The more entries you have the slower it gets. We have
> implemented caching with timing there as well (I have not submitted
> such patch yet), which makes fdb dump much more pleasant in user
> space, the operation happens instantly, where regular fdb dump op
> might take few seconds.

I know some people won't like the added MDIO transactions, because
they are doing PTP and want low jitter when talking to the switch and
PHYs. As things are now, once the system is configured, the MDIO bus
is idle. Hence there is very low jitter for PTP. For your cache to be
useful, i assume you are refreshing it at regular intervals. So you
are adding a constant load which, i guess, is never used 99.9% of the
time because there is no user looking at the table, whereas the PTP
load is used to keep the clocks synchronised.

   Andrew
Andrew Lunn Dec. 14, 2016, 10:46 a.m. UTC | #11
On Wed, Dec 14, 2016 at 11:03:14AM +0100, Volodymyr Bendiuga wrote:
> Hi,
> 
> I understand what you wrote, but we do not refresh anything
> at intervals. FDB dump works only on user request, i.e. user
> run bridge fdb dump command. What we did is just a smarter
> way of feeding the dump request with entries obtained from
> switchcore. Interesting thing is that fdb dump in switchcore
> reads all entries, and from those entries only one is returned at
> a time, others are discarded, because request comes as per port.

Ah, O.K.

> What we do is we dump entries from switchcore once, when the
> first dump request comes, save all of them in the cache, and then
> all following fdb dump requests for each port will be fed from the cache,
> so no more hardware dump operations. We set the cache to be valid for
> 2 seconds, but this could be adjusted and tweaked. So in our case
> we decrease the number of MDIO reads quite a lot.

> What we are thinking now, is that this logics could be moved
> to net/dsa layer, and maybe unified with fdb load logics, if possible.

We should check what the other switches do. Can they perform a dump
with the hardware performing the port filter? Those switches will not
benefit from such a cache. But they should also not suffer.

Combining it with load is a bigger question. Currently the drivers are
stateless. That makes the error handling very simple. We don't have to
worry about running out of memory, since we don't allocate any memory.

If we run out of memory for this dump cache, it is not serious, since
a dump does not result in state change. But if we are out of memory on
a load, we do have state changes. We have to deal with the complexity
of allocating resources in the _prepare call, and then use the
resource in the do call. I would much prefer to avoid this at first,
by optimising the atu get. If we don't see a significant speed up,
then we can consider this complex solution of a cache for load.

     Andrew
Vivien Didelot Dec. 15, 2016, 5:21 p.m. UTC | #12
Hi Volodymyr,

Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> writes:

> Hi Andrew,
>
> I have tested the approach you wrote in previous mails, the one
> with setting next.mac to address we are looking for -1. It seems
> to be as slow as the original implementation, unfortunately.

Hum, that is what I was expecting... The ATU GetNext operation
(alongside an ether_addr_equal() call) should be quite fast.

> We use 6097 and 6352 chips, and both of them can not do any port
> filtering in hardware for fdb dump operation. Seems like they would
> benefit from cache. But I am not sure about other switches.
>
> Does anyone know about such feature in other switches?

Marvell switches cannot filter ATU entries for a specific port;
entries contain a port vector.

I guess Florian might answer for Broadcom switches, and John might
answer for Qualcomm switches.

In all cases *if caching is really needed*, I think it won't hurt to do
it in DSA core even if a switch supports FDB dump operations on a
per-port basis, as Andrew mentioned.

Thanks,

        Vivien
Florian Fainelli Dec. 15, 2016, 5:32 p.m. UTC | #13
On 12/15/2016 09:21 AM, Vivien Didelot wrote:
> Hi Volodymyr,
> 
> Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> writes:
> 
>> Hi Andrew,
>>
>> I have tested the approach you wrote in previous mails, the one
>> with setting next.mac to address we are looking for -1. It seems
>> to be as slow as the original implementation, unfortunately.
> 
> Hum, that is what I was expecting... The ATU GetNext operation
> (alongside an ether_addr_equal() call) should be quite fast.
> 
>> We use 6097 and 6352 chips, and both of them can not do any port
>> filtering in hardware for fdb dump operation. Seems like they would
>> benefit from cache. But I am not sure about other switches.
>>
>> Does anyone know about such feature in other switches?
> 
> Marvell switches cannot filter ATU entries for a specific port, they
> contain a port vector.
> 
> I guess Florian might answer for Broadcom switches, and John might
> answer for Qualcomm switches.

For Broadcom switches, we use the ARL search and then apply software
filtering to discard entries that are not for the target port bridge fdb
show was called with.

> 
> In all cases *if caching is really needed*, I think it won't hurt to do
> it in DSA core even if a switch support FDB dump operations on a
> per-port basis, as Andrew mentioned.

Agreed, and there does not appear to be any need for new dsa_switch_ops
operations to be introduced?
John Crispin Dec. 15, 2016, 5:47 p.m. UTC | #14
On 15/12/2016 18:21, Vivien Didelot wrote:
> Hi Volodymyr,
> 
> Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com> writes:
> 
>> Hi Andrew,
>>
>> I have tested the approach you wrote in previous mails, the one
>> with setting next.mac to address we are looking for -1. It seems
>> to be as slow as the original implementation, unfortunately.
> 
> Hum, that is what I was expecting... The ATU GetNext operation
> (alongside an ether_addr_equal() call) should be quite fast.
> 
>> We use 6097 and 6352 chips, and both of them can not do any port
>> filtering in hardware for fdb dump operation. Seems like they would
>> benefit from cache. But I am not sure about other switches.
>>
>> Does anyone know about such feature in other switches?
> 
> Marvell switches cannot filter ATU entries for a specific port, they
> contain a port vector.
> 
> I guess Florian might answer for Broadcom switches, and John might
> answer for Qualcomm switches.
> 
> In all cases *if caching is really needed*, I think it won't hurt to do
> it in DSA core even if a switch support FDB dump operations on a
> per-port basis, as Andrew mentioned.


QCA switches allow defining a port mask in an fdb search/iterate
operation. just had a look in the brcm driver and it filters in software

	John
Vivien Didelot Dec. 15, 2016, 5:50 p.m. UTC | #15
Florian Fainelli <f.fainelli@gmail.com> writes:

>> In all cases *if caching is really needed*, I think it won't hurt to do
>> it in DSA core even if a switch support FDB dump operations on a
>> per-port basis, as Andrew mentioned.
>
> Agreed, and there does not appear to be any need to new dsa_switch_ops
> operations to be introduced?

Nope.
John Crispin Dec. 16, 2016, 9:31 a.m. UTC | #16
On 16/12/2016 10:24, Volodymyr Bendiuga wrote:
> Hi all,
> 
> Does this mean we all agree on implementing caching
> mechanism in net/dsa layer? If yes, then I can start
> working on it immediately.
> 
> Regards,
> Volodymyr

i think that the brcm approach is better as, at least on QCA, you can
also read auto-learned values from the table, which would get broken by
a caching mechanism. filtering inside the dump function or the code
calling the dump function seems a lot simpler and more lightweight to me.

	John

> 
> On Thu, Dec 15, 2016 at 6:50 PM, Vivien Didelot
> <vivien.didelot@savoirfairelinux.com
> <mailto:vivien.didelot@savoirfairelinux.com>> wrote:
> 
>     Florian Fainelli <f.fainelli@gmail.com
>     <mailto:f.fainelli@gmail.com>> writes:
> 
>     >> In all cases *if caching is really needed*, I think it won't hurt to do
>     >> it in DSA core even if a switch support FDB dump operations on a
>     >> per-port basis, as Andrew mentioned.
>     >
>     > Agreed, and there does not appear to be any need to new dsa_switch_ops
>     > operations to be introduced?
> 
>     Nope.
> 
>
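The software filtering John and Florian mention boils down to masking each entry's port vector. A minimal userspace sketch (the struct layout and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* A dumped entry as the switch hands it back: one port-vector bit per
 * port, plus the FID and MAC address. */
struct fdb_entry {
	uint16_t pvec;
	uint16_t fid;
	uint8_t addr[6];
};

static int entry_matches_port(const struct fdb_entry *e, int port)
{
	return (e->pvec >> port) & 1;
}

/* Filter a full-table dump down to the entries for one port, as the
 * per-port dump callback would. Returns the number of entries kept. */
static int filter_for_port(const struct fdb_entry *in, int n, int port,
			   const struct fdb_entry **out)
{
	int i, kept = 0;

	for (i = 0; i < n; i++)
		if (entry_matches_port(&in[i], port))
			out[kept++] = &in[i];
	return kept;
}
```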
Andrew Lunn Dec. 16, 2016, 10:08 a.m. UTC | #17
On Fri, Dec 16, 2016 at 10:24:28AM +0100, Volodymyr Bendiuga wrote:
> Hi all,
> 
> Does this mean we all agree on implementing caching
> mechanism in net/dsa layer? If yes, then I can start
> working on it immediately.

Volodymyr

Could you provide us with your test script, which adds multicast MAC
addresses to ports. I would like to do some profiling with it.

	  Thanks
		Andrew
Andrew Lunn Dec. 16, 2016, 11:01 a.m. UTC | #18
On Fri, Dec 16, 2016 at 11:26:01AM +0100, Volodymyr Bendiuga wrote:
> Hi Andrew,
> 
> I don't have any script right now, the code I have is a part of
> the OS, but I could write simple C program which represents
> what we are doing with mc entries and send it to you for profiling.

It would be nice to have a benchmark test we can use to test out
ideas.

Please try to make the code flexible. The slave interface names on my
boards are probably not the same as on yours. Also, the number of ports
may differ.

    Thanks
	Andrew
Volodymyr Bendiuga Dec. 23, 2016, 9:30 a.m. UTC | #19
Hi Andrew,
Here is the program I promised you.
There is a .c file and a Makefile attached to this mail. Simply type "make" to build it.
There is a dependency on libnl3, which needs to be installed. If you don't have it,
just install it: "apt-get install libnl-3-dev libnl-route-3-dev" if you use an Ubuntu-based
Linux.

What the program does is create a libnl cache, insert some hardcoded multicast
entries into it, and then add the entries to the kernel (to the bridge and to hardware).
By default the program looks for interfaces that have "eth" in their names. If your
interfaces have different names, just correct that line in the code. Otherwise it should
just run straight away. When entries are added, the timing is presented. By default
the program uses vlan id 1. If you need to override it, just pass "-v 2" or whatever
vlan id you wish.
Please try running the program with the hash table patch and without. You should
see a significant difference.

Let me know if there is something you need help with.

Thanks,
Volodymyr
> On 16 Dec 2016, at 12:01, Andrew Lunn <andrew@lunn.ch> wrote:
> 
> On Fri, Dec 16, 2016 at 11:26:01AM +0100, Volodymyr Bendiuga wrote:
>> Hi Andrew,
>> 
>> I don't have any script right now, the code I have is a part of
>> the OS, but I could write simple C program which represents
>> what we are doing with mc entries and send it to you for profiling.
> 
> It would be nice to have a benchmark test we can use to test out
> ideas.
> 
> Please try to make the code flexible. The slave interface names on my
> boards are probably not the same as on your. Also, the number of ports
> may differ.
> 
>    Thanks
> 	Andrew
Andrew Lunn Dec. 23, 2016, 10:17 a.m. UTC | #20
On Fri, Dec 23, 2016 at 10:30:27AM +0100, Volodymyr Bendiuga wrote:
> Hi Andrew,
> Here is the program I promised you.

Hi Volodymyr

Thanks for this. I will try it out beginning of January. I don't have
access to the hardware at the moment.

       Thanks
		Andrew

Patch

diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
index 431e954..407e6db 100644
--- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
@@ -15,6 +15,8 @@ 
 #include <linux/if_vlan.h>
 #include <linux/irq.h>
 #include <linux/gpio/consumer.h>
+#include <linux/crc32.h>
+#include <linux/hashtable.h>
 
 #ifndef UINT64_MAX
 #define UINT64_MAX		(u64)(~((u64)0))
@@ -672,6 +674,16 @@  struct mv88e6xxx_info {
 	const struct mv88e6xxx_ops *ops;
 };
 
+struct pvec_tbl_entry {
+        struct hlist_node entry;
+        u32 key_crc32; /* key */
+        u16 pvec;
+        struct pvec_tbl_key {
+                u8 addr[ETH_ALEN];
+                u16 fid;
+        } key;
+};
+
 struct mv88e6xxx_atu_entry {
 	u16	fid;
 	u8	state;
@@ -736,6 +748,9 @@  struct mv88e6xxx_chip {
 
 	struct mv88e6xxx_priv_port	ports[DSA_MAX_PORTS];
 
+	/* This table holds a port vector for each multicast address */
+	DECLARE_HASHTABLE(pvec_tbl, 12);
+
 	/* A switch may have a GPIO line tied to its reset pin. Parse
 	 * this from the device tree, and use it before performing
 	 * switch soft reset.
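For readers unfamiliar with the scheme the patch sets up, here is a userspace model of the hashed lookup: a {MAC, FID} key is hashed into a fixed-size bucket array of chained entries, much as DECLARE_HASHTABLE()/hash_add()/hash_for_each_possible() would do in the kernel. FNV-1a stands in here for the crc32() the patch includes, and all names below are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define PVEC_TBL_BITS 4
#define PVEC_TBL_SIZE (1u << PVEC_TBL_BITS)

struct pvec_entry {
	struct pvec_entry *next;	/* models struct hlist_node */
	uint32_t key;			/* full hash of {addr, fid} */
	uint16_t pvec;
	uint8_t addr[ETH_ALEN];
	uint16_t fid;
};

/* Hash the {MAC, FID} pair; FNV-1a as a stand-in for crc32(). */
static uint32_t pvec_key(const uint8_t *addr, uint16_t fid)
{
	uint32_t h = 2166136261u;
	uint8_t buf[ETH_ALEN + 2];
	size_t i;

	memcpy(buf, addr, ETH_ALEN);
	buf[ETH_ALEN] = fid & 0xff;
	buf[ETH_ALEN + 1] = fid >> 8;
	for (i = 0; i < sizeof(buf); i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

static void pvec_add(struct pvec_entry **tbl, struct pvec_entry *e)
{
	e->key = pvec_key(e->addr, e->fid);
	e->next = tbl[e->key % PVEC_TBL_SIZE];
	tbl[e->key % PVEC_TBL_SIZE] = e;
}

static struct pvec_entry *pvec_find(struct pvec_entry **tbl,
				    const uint8_t *addr, uint16_t fid)
{
	uint32_t key = pvec_key(addr, fid);
	struct pvec_entry *e;

	/* Walk only the one bucket, comparing the full key and then
	 * the raw fields, like hash_for_each_possible() users do. */
	for (e = tbl[key % PVEC_TBL_SIZE]; e; e = e->next)
		if (e->key == key && e->fid == fid &&
		    !memcmp(e->addr, addr, ETH_ALEN))
			return e;
	return NULL;
}
```

The speedup in the thread comes from answering "is this {addr, fid} already programmed?" from this table instead of issuing GetNext operations over MDIO.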