
[RFC] Efficiency of the phandle_cache on ppc64/SLOF

Message ID 20191129151056.o5c44lm5lb4wsr4r@linutronix.de
State Superseded, archived
Series [RFC] Efficiency of the phandle_cache on ppc64/SLOF

Checks

Context Check Description
robh/checkpatch warning "total: 1 errors, 3 warnings, 39 lines checked"

Commit Message

Sebastian Andrzej Siewior Nov. 29, 2019, 3:10 p.m. UTC
I've been looking at phandle_cache and noticed the following: the raw
phandle value as generated by dtc starts at zero and is incremented by
one for each phandle entry. The qemu pSeries model uses SLOF (which is
probably the same thing as used on real hardware), and there the phandle
looks like a pointer value.
With
	qemu-system-ppc64le -m 16G -machine pseries -smp 8 

I got the following output:
| entries: 64
| phandle 7e732468 slot 28 hash c
| phandle 7e732ad0 slot 10 hash 27
| phandle 7e732ee8 slot 28 hash 3a
| phandle 7e734160 slot 20 hash 36
| phandle 7e734318 slot 18 hash 3a
| phandle 7e734428 slot 28 hash 33
| phandle 7e734538 slot 38 hash 2c
| phandle 7e734850 slot 10 hash e
| phandle 7e735220 slot 20 hash 2d
| phandle 7e735bf0 slot 30 hash d
| phandle 7e7365c0 slot 0 hash 2d
| phandle 7e736f90 slot 10 hash d
| phandle 7e737960 slot 20 hash 2d
| phandle 7e738330 slot 30 hash d
| phandle 7e738d00 slot 0 hash 2d
| phandle 7e739730 slot 30 hash 38
| phandle 7e73bd08 slot 8 hash 17
| phandle 7e73c2e0 slot 20 hash 32
| phandle 7e73c7f8 slot 38 hash 37
| phandle 7e782420 slot 20 hash 13
| phandle 7e782ed8 slot 18 hash 1b
| phandle 7e73ce28 slot 28 hash 39
| phandle 7e73d390 slot 10 hash 22
| phandle 7e73d9a8 slot 28 hash 1a
| phandle 7e73dc28 slot 28 hash 37
| phandle 7e73de00 slot 0 hash a
| phandle 7e73e028 slot 28 hash 0
| phandle 7e7621a8 slot 28 hash 36
| phandle 7e73e458 slot 18 hash 1e
| phandle 7e73e608 slot 8 hash 1e
| phandle 7e740078 slot 38 hash 28
| phandle 7e740180 slot 0 hash 1d
| phandle 7e740240 slot 0 hash 33
| phandle 7e740348 slot 8 hash 29
| phandle 7e740410 slot 10 hash 2
| phandle 7e740eb0 slot 30 hash 3e
| phandle 7e745390 slot 10 hash 33
| phandle 7e747b08 slot 8 hash c
| phandle 7e748528 slot 28 hash f
| phandle 7e74a6e0 slot 20 hash 18
| phandle 7e74aab0 slot 30 hash b
| phandle 7e74f788 slot 8 hash d
| Used entries: 8, hashed: 29

So the hash array has 64 entries, out of which only 8 are populated. Using
hash_32() populates 29 entries.
Could someone with real hardware verify this?
I'm not sure how important this is performance-wise; it just looks like a
waste to use only 1/8 of the array.

The patch used for testing:


Sebastian

Comments

Frank Rowand Nov. 30, 2019, 2:14 a.m. UTC | #1
On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using Slof (which
> is probably the same thing as used on real hardware) and this looks like
> a poiner value for the phandle.
> With
> 	qemu-system-ppc64le -m 16G -machine pseries -smp 8 
> 
> I got the following output:
> | entries: 64
> | phandle 7e732468 slot 28 hash c
> | phandle 7e732ad0 slot 10 hash 27
> | phandle 7e732ee8 slot 28 hash 3a
> | phandle 7e734160 slot 20 hash 36
> | phandle 7e734318 slot 18 hash 3a
> | phandle 7e734428 slot 28 hash 33
> | phandle 7e734538 slot 38 hash 2c
> | phandle 7e734850 slot 10 hash e
> | phandle 7e735220 slot 20 hash 2d
> | phandle 7e735bf0 slot 30 hash d
> | phandle 7e7365c0 slot 0 hash 2d
> | phandle 7e736f90 slot 10 hash d
> | phandle 7e737960 slot 20 hash 2d
> | phandle 7e738330 slot 30 hash d
> | phandle 7e738d00 slot 0 hash 2d
> | phandle 7e739730 slot 30 hash 38
> | phandle 7e73bd08 slot 8 hash 17
> | phandle 7e73c2e0 slot 20 hash 32
> | phandle 7e73c7f8 slot 38 hash 37
> | phandle 7e782420 slot 20 hash 13
> | phandle 7e782ed8 slot 18 hash 1b
> | phandle 7e73ce28 slot 28 hash 39
> | phandle 7e73d390 slot 10 hash 22
> | phandle 7e73d9a8 slot 28 hash 1a
> | phandle 7e73dc28 slot 28 hash 37
> | phandle 7e73de00 slot 0 hash a
> | phandle 7e73e028 slot 28 hash 0
> | phandle 7e7621a8 slot 28 hash 36
> | phandle 7e73e458 slot 18 hash 1e
> | phandle 7e73e608 slot 8 hash 1e
> | phandle 7e740078 slot 38 hash 28
> | phandle 7e740180 slot 0 hash 1d
> | phandle 7e740240 slot 0 hash 33
> | phandle 7e740348 slot 8 hash 29
> | phandle 7e740410 slot 10 hash 2
> | phandle 7e740eb0 slot 30 hash 3e
> | phandle 7e745390 slot 10 hash 33
> | phandle 7e747b08 slot 8 hash c
> | phandle 7e748528 slot 28 hash f
> | phandle 7e74a6e0 slot 20 hash 18
> | phandle 7e74aab0 slot 30 hash b
> | phandle 7e74f788 slot 8 hash d
> | Used entries: 8, hashed: 29
> 
> So the hash array has 64 entries out which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this performance wise, it looks just like a
> waste using only 1/8 of the array.

The hash used is based on the assumptions you noted, and as stated in the
code: that phandle property values are in a contiguous range of 1..n
(not starting from zero), which is what dtc generates.

We knew that for systems that do not match those assumptions the hash
will not be optimal.  Unless there is a serious performance problem for
such systems, I do not want to make the phandle hash code more complicated
to optimize for these cases.  And pseries has been performing OK, without
any phandle-related performance issues that I remember hearing about,
since before the cache was added; the cache could only have helped
performance.  Yes, if your observations are correct, some memory is being
wasted, but a 64-entry cache is not very large on a pseries.

There is already some push back from Rob that the existing code is more
complex than needed (eg variable cache size).

-Frank

> 
> The patch used for testing:
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 1d667eb730e19..2640d4bc81a9a 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -197,6 +197,7 @@ void of_populate_phandle_cache(void)
>  	u32 cache_entries;
>  	struct device_node *np;
>  	u32 phandles = 0;
> +	struct device_node **cache2;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> @@ -214,14 +215,32 @@ void of_populate_phandle_cache(void)
>  
>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>  				GFP_ATOMIC);
> +	cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
>  	if (!phandle_cache)
>  		goto out;
>  
> +	pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
>  	for_each_of_allnodes(np)
>  		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> +			int slot;
>  			of_node_get(np);
>  			phandle_cache[np->phandle & phandle_cache_mask] = np;
> +			slot = hash_32(np->phandle, __ffs(cache_entries));
> +			cache2[slot] = np;
> +			pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, __LINE__,
> +			       np->phandle, np->phandle & phandle_cache_mask, slot);
>  		}
> +	{
> +		int i, filled = 0, filled_hash = 0;
> +
> +		for (i = 0; i < cache_entries; i++) {
> +			if (phandle_cache[i])
> +				filled++;
> +			if (cache2[i])
> +				filled_hash++;
> +		}
> +		pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__, __LINE__, filled, filled_hash);
> +	}
>  
>  out:
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
> 
> Sebastian
>
Sebastian Andrzej Siewior Dec. 2, 2019, 11:07 a.m. UTC | #2
On 2019-11-29 20:14:47 [-0600], Frank Rowand wrote:
> The hash used is based on the assumptions you noted, and as stated in the
> code, that phandle property values are in a contiguous range of 1..n
> (not starting from zero), which is what dtc generates.
> 
> We knew that for systems that do not match the assumptions that the hash
> will not be optimal.  Unless there is a serious performance problem for
> such systems, I do not want to make the phandle hash code more complicated
> to optimize for these cases.  And the pseries have been performing ok
> without phandle related performance issues that I remember hearing since
> before the cache was added, which could have only helped the performance.
> Yes, if your observations are correct, some memory is being wasted, but
> a 64 entry cache is not very large on a pseries.

Okay, so this is nothing new and everyone is aware of the situation. I'll
move on then :)

> -Frank

Sebastian
Michael Ellerman Dec. 3, 2019, 4:03 a.m. UTC | #3
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> I've been looking at phandle_cache and noticed the following: The raw
> phandle value as generated by dtc starts at zero and is incremented by
> one for each phandle entry. The qemu pSeries model is using Slof (which
> is probably the same thing as used on real hardware) and this looks like
> a poiner value for the phandle.

We don't use SLOF on bare metal these days.

I've certainly heard it said that on some OF's the phandle was just ==
the address of the internal representation, and I guess maybe for SLOF
that is true.

They seem to vary wildly though, eg. on an Apple G5:

  $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
  /proc/device-tree/vsp@0,f9000000/veo@f9180000/phandle ff970848
  /proc/device-tree/vsp@0,f9000000/phandle ff970360
  /proc/device-tree/vsp@0,f9000000/veo@f9080000/phandle ff970730
  /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
  /proc/device-tree/xmodem/phandle ff9655e8
  /proc/device-tree/multiboot/phandle ff9504f0
  /proc/device-tree/diagnostics/phandle ff965550
  /proc/device-tree/options/phandle ff893cf0
  /proc/device-tree/openprom/client-services/phandle ff8925b8
  /proc/device-tree/openprom/phandle ff892458

That machine does not have enough RAM for those to be 32-bit real
addresses. I think Apple OF is running in virtual mode though (?), so
maybe they are pointers?

And on an IBM pseries machine they're a bit all over the place:

  /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 10000040
  /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 00002005
  /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 10000030
  /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 10000020
  /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 10000010
  /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 00002003
  /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 0000200a
  /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 00003108
  /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 00002001
  /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 00003106
  /proc/device-tree/cpus/ibm,phandle fffffff8
  /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 00003104
  /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 00002008
  /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 00003102
  /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 00002006
  /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 00003100
  /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 10000008
  /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 00002004
  /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 10000048
  /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 10000038
  /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 00002002
  /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 10000028
  /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 00003107
  /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 10000018
  /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 00002000
  /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 00003105
  /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 00003103
  /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 0000310a
  /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 10000000
  /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 00002007
  /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 00003101
  /proc/device-tree/pci@80000002000001b/ibm,phandle 2000001b


> With
> 	qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>
> I got the following output:
> | entries: 64
> | phandle 7e732468 slot 28 hash c
> | phandle 7e732ad0 slot 10 hash 27
> | phandle 7e732ee8 slot 28 hash 3a
> | phandle 7e734160 slot 20 hash 36
> | phandle 7e734318 slot 18 hash 3a
> | phandle 7e734428 slot 28 hash 33
> | phandle 7e734538 slot 38 hash 2c
> | phandle 7e734850 slot 10 hash e
> | phandle 7e735220 slot 20 hash 2d
> | phandle 7e735bf0 slot 30 hash d
> | phandle 7e7365c0 slot 0 hash 2d
> | phandle 7e736f90 slot 10 hash d
> | phandle 7e737960 slot 20 hash 2d
> | phandle 7e738330 slot 30 hash d
> | phandle 7e738d00 slot 0 hash 2d
> | phandle 7e739730 slot 30 hash 38
> | phandle 7e73bd08 slot 8 hash 17
> | phandle 7e73c2e0 slot 20 hash 32
> | phandle 7e73c7f8 slot 38 hash 37
> | phandle 7e782420 slot 20 hash 13
> | phandle 7e782ed8 slot 18 hash 1b
> | phandle 7e73ce28 slot 28 hash 39
> | phandle 7e73d390 slot 10 hash 22
> | phandle 7e73d9a8 slot 28 hash 1a
> | phandle 7e73dc28 slot 28 hash 37
> | phandle 7e73de00 slot 0 hash a
> | phandle 7e73e028 slot 28 hash 0
> | phandle 7e7621a8 slot 28 hash 36
> | phandle 7e73e458 slot 18 hash 1e
> | phandle 7e73e608 slot 8 hash 1e
> | phandle 7e740078 slot 38 hash 28
> | phandle 7e740180 slot 0 hash 1d
> | phandle 7e740240 slot 0 hash 33
> | phandle 7e740348 slot 8 hash 29
> | phandle 7e740410 slot 10 hash 2
> | phandle 7e740eb0 slot 30 hash 3e
> | phandle 7e745390 slot 10 hash 33
> | phandle 7e747b08 slot 8 hash c
> | phandle 7e748528 slot 28 hash f
> | phandle 7e74a6e0 slot 20 hash 18
> | phandle 7e74aab0 slot 30 hash b
> | phandle 7e74f788 slot 8 hash d
> | Used entries: 8, hashed: 29
>
> So the hash array has 64 entries out which only 8 are populated. Using
> hash_32() populates 29 entries.
> Could someone with real hardware verify this?
> I'm not sure how important this performance wise, it looks just like a
> waste using only 1/8 of the array.

On the G5 it's similarly inefficient:

[    0.005444] OF: of_populate_phandle_cache(222) entries: 256
[    0.005457] OF: of_populate_phandle_cache(231) phandle ff88e0c0 slot c0 hash da
[    0.005469] OF: of_populate_phandle_cache(231) phandle ff890a20 slot 20 hash a3
[    0.005480] OF: of_populate_phandle_cache(231) phandle ff890c88 slot 88 hash ed
[    0.005499] OF: of_populate_phandle_cache(231) phandle ff891138 slot 38 hash 49
[    0.005513] OF: of_populate_phandle_cache(231) phandle ff8920c0 slot c0 hash fc
[    0.005528] OF: of_populate_phandle_cache(231) phandle ff892248 slot 48 hash b7
[    0.005542] OF: of_populate_phandle_cache(231) phandle ff892458 slot 58 hash 64
[    0.005557] OF: of_populate_phandle_cache(231) phandle ff8925b8 slot b8 hash d8
[    0.005571] OF: of_populate_phandle_cache(231) phandle ff8938a8 slot a8 hash 9d
[    0.005585] OF: of_populate_phandle_cache(231) phandle ff893a68 slot 68 hash bc
[    0.005599] OF: of_populate_phandle_cache(231) phandle ff893c58 slot 58 hash 31
[    0.005614] OF: of_populate_phandle_cache(231) phandle ff893cf0 slot f0 hash 40
[    0.005628] OF: of_populate_phandle_cache(231) phandle ff893d88 slot 88 hash 4f
[    0.005642] OF: of_populate_phandle_cache(231) phandle ff894178 slot 78 hash 55
[    0.005657] OF: of_populate_phandle_cache(231) phandle ff894ac8 slot c8 hash f0
[    0.005671] OF: of_populate_phandle_cache(231) phandle ff895548 slot 48 hash a9
[    0.005685] OF: of_populate_phandle_cache(231) phandle ff8a0118 slot 18 hash e
[    0.005700] OF: of_populate_phandle_cache(231) phandle ff8a09d0 slot d0 hash 9a
[    0.005714] OF: of_populate_phandle_cache(231) phandle ff8a22f8 slot f8 hash 77
[    0.005729] OF: of_populate_phandle_cache(231) phandle ff8a5470 slot 70 hash af
[    0.005743] OF: of_populate_phandle_cache(231) phandle ff8aa718 slot 18 hash 15
[    0.005757] OF: of_populate_phandle_cache(231) phandle ff8ad4b8 slot b8 hash 72
[    0.005771] OF: of_populate_phandle_cache(231) phandle ff8ae2d0 slot d0 hash 94
[    0.005786] OF: of_populate_phandle_cache(231) phandle ff8aff38 slot 38 hash 3c
[    0.005800] OF: of_populate_phandle_cache(231) phandle ff8b0a10 slot 10 hash 93
[    0.005815] OF: of_populate_phandle_cache(231) phandle ff8b3880 slot 80 hash 63
[    0.005829] OF: of_populate_phandle_cache(231) phandle ff8b4288 slot 88 hash 46
[    0.005843] OF: of_populate_phandle_cache(231) phandle ff8b61d0 slot d0 hash f
[    0.005858] OF: of_populate_phandle_cache(231) phandle ff8b8d20 slot 20 hash 4c
[    0.005872] OF: of_populate_phandle_cache(231) phandle ff8bba60 slot 60 hash fe
[    0.005886] OF: of_populate_phandle_cache(231) phandle ff929568 slot 68 hash bd
[    0.005901] OF: of_populate_phandle_cache(231) phandle ff92bb30 slot 30 hash 1d
[    0.005915] OF: of_populate_phandle_cache(231) phandle ff92e180 slot 80 hash 6f
[    0.005929] OF: of_populate_phandle_cache(231) phandle ff931b98 slot 98 hash 8
[    0.005944] OF: of_populate_phandle_cache(231) phandle ff938788 slot 88 hash 85
[    0.005958] OF: of_populate_phandle_cache(231) phandle ff938850 slot 50 hash e9
[    0.005972] OF: of_populate_phandle_cache(231) phandle ff94f338 slot 38 hash 15
[    0.005987] OF: of_populate_phandle_cache(231) phandle ff94f3f0 slot f0 hash 5d
[    0.006001] OF: of_populate_phandle_cache(231) phandle ff94fb08 slot 8 hash 3
[    0.006014] OF: of_populate_phandle_cache(231) phandle ff950058 slot 58 hash 7d
[    0.006028] OF: of_populate_phandle_cache(231) phandle ff9504f0 slot f0 hash ae
[    0.006043] OF: of_populate_phandle_cache(231) phandle ff965550 slot 50 hash 89
[    0.006057] OF: of_populate_phandle_cache(231) phandle ff9655e8 slot e8 hash 98
[    0.006071] OF: of_populate_phandle_cache(231) phandle ff967fb8 slot b8 hash 29
[    0.006086] OF: of_populate_phandle_cache(231) phandle ff969740 slot 40 hash 1f
[    0.006100] OF: of_populate_phandle_cache(231) phandle ff969a20 slot 20 hash 40
[    0.006114] OF: of_populate_phandle_cache(231) phandle ff96a5b0 slot b0 hash df
[    0.006129] OF: of_populate_phandle_cache(231) phandle ff96acb0 slot b0 hash 5a
[    0.006143] OF: of_populate_phandle_cache(231) phandle ff96adf8 slot f8 hash a3
[    0.006157] OF: of_populate_phandle_cache(231) phandle ff96b088 slot 88 hash 35
[    0.006171] OF: of_populate_phandle_cache(231) phandle ff9c1100 slot 0 hash dd
[    0.006186] OF: of_populate_phandle_cache(231) phandle ff9f0428 slot 28 hash 88
[    0.006200] OF: of_populate_phandle_cache(231) phandle ff9f3c60 slot 60 hash c9
[    0.006214] OF: of_populate_phandle_cache(231) phandle ff9f1230 slot 30 hash 8e
[    0.006228] OF: of_populate_phandle_cache(231) phandle ff96c240 slot 40 hash ce
[    0.006242] OF: of_populate_phandle_cache(231) phandle ff96d050 slot 50 hash e2
[    0.006256] OF: of_populate_phandle_cache(231) phandle ff9bca00 slot 0 hash 3f
[    0.006271] OF: of_populate_phandle_cache(231) phandle ff9f3080 slot 80 hash 9c
[    0.006285] OF: of_populate_phandle_cache(231) phandle ff96e148 slot 48 hash 25
[    0.006299] OF: of_populate_phandle_cache(231) phandle ff970960 slot 60 hash a4
[    0.006313] OF: of_populate_phandle_cache(231) phandle ff972038 slot 38 hash 61
[    0.006328] OF: of_populate_phandle_cache(231) phandle ff9723e0 slot e0 hash e6
[    0.006342] OF: of_populate_phandle_cache(231) phandle ff972808 slot 8 hash 50
[    0.006356] OF: of_populate_phandle_cache(231) phandle ff9729c0 slot c0 hash 60
[    0.006370] OF: of_populate_phandle_cache(231) phandle ff972b78 slot 78 hash 71
[    0.006385] OF: of_populate_phandle_cache(231) phandle ff972da8 slot a8 hash 58
[    0.006399] OF: of_populate_phandle_cache(231) phandle ff972f00 slot 0 hash bd
[    0.006414] OF: of_populate_phandle_cache(231) phandle ff973058 slot 58 hash 22
[    0.006428] OF: of_populate_phandle_cache(231) phandle ff973210 slot 10 hash 33
[    0.006442] OF: of_populate_phandle_cache(231) phandle ff973360 slot 60 hash 8a
[    0.006456] OF: of_populate_phandle_cache(231) phandle ff973520 slot 20 hash a9
[    0.006471] OF: of_populate_phandle_cache(231) phandle ff973670 slot 70 hash 0
[    0.006485] OF: of_populate_phandle_cache(231) phandle ff973828 slot 28 hash 11
[    0.006499] OF: of_populate_phandle_cache(231) phandle ff9739e0 slot e0 hash 22
[    0.006513] OF: of_populate_phandle_cache(231) phandle ff973b40 slot 40 hash 95
[    0.006528] OF: of_populate_phandle_cache(231) phandle ff973d88 slot 88 hash a7
[    0.006542] OF: of_populate_phandle_cache(231) phandle ff973fb0 slot b0 hash 7f
[    0.006556] OF: of_populate_phandle_cache(231) phandle ff974168 slot 68 hash 90
[    0.006570] OF: of_populate_phandle_cache(231) phandle ff974320 slot 20 hash a1
[    0.006584] OF: of_populate_phandle_cache(231) phandle ff974560 slot 60 hash a4
[    0.006599] OF: of_populate_phandle_cache(231) phandle ff975178 slot 78 hash 35
[    0.006613] OF: of_populate_phandle_cache(231) phandle ff975ce8 slot e8 hash 9a
[    0.006628] OF: of_populate_phandle_cache(231) phandle ff9768a8 slot a8 hash 8f
[    0.006642] OF: of_populate_phandle_cache(231) phandle ff976fa8 slot a8 hash a
[    0.006656] OF: of_populate_phandle_cache(231) phandle ff9770a8 slot a8 hash d3
[    0.006670] OF: of_populate_phandle_cache(231) phandle ff9772a0 slot a0 hash 56
[    0.006685] OF: of_populate_phandle_cache(231) phandle ff977468 slot 68 hash 83
[    0.006699] OF: of_populate_phandle_cache(231) phandle ff9f3710 slot 10 hash 50
[    0.006713] OF: of_populate_phandle_cache(231) phandle ff977600 slot 0 hash 5a
[    0.006728] OF: of_populate_phandle_cache(231) phandle ff977978 slot 78 hash 8a
[    0.006742] OF: of_populate_phandle_cache(231) phandle ff994880 slot 80 hash 43
[    0.006757] OF: of_populate_phandle_cache(231) phandle ff9f1f80 slot 80 hash 4b
[    0.006771] OF: of_populate_phandle_cache(231) phandle ff9f2190 slot 90 hash f9
[    0.006785] OF: of_populate_phandle_cache(231) phandle ff9f24b8 slot b8 hash 9a
[    0.006800] OF: of_populate_phandle_cache(231) phandle ff9f2638 slot 38 hash 46
[    0.006814] OF: of_populate_phandle_cache(231) phandle ff9f2980 slot 80 hash 20
[    0.006828] OF: of_populate_phandle_cache(231) phandle ff99ccd0 slot d0 hash 37
[    0.006843] OF: of_populate_phandle_cache(231) phandle ff9a5120 slot 20 hash 2b
[    0.006857] OF: of_populate_phandle_cache(231) phandle ff96f258 slot 58 hash 92
[    0.006872] OF: of_populate_phandle_cache(231) phandle ff9a5520 slot 20 hash 4d
[    0.006886] OF: of_populate_phandle_cache(231) phandle ff9a6438 slot 38 hash 38
[    0.006901] OF: of_populate_phandle_cache(231) phandle ff9a9340 slot 40 hash 16
[    0.006915] OF: of_populate_phandle_cache(231) phandle ff9a99d0 slot d0 hash ca
[    0.006929] OF: of_populate_phandle_cache(231) phandle ff9acba0 slot a0 hash 9f
[    0.006944] OF: of_populate_phandle_cache(231) phandle ff9ad210 slot 10 hash 1a
[    0.006959] OF: of_populate_phandle_cache(231) phandle ff970360 slot 60 hash f1
[    0.006973] OF: of_populate_phandle_cache(231) phandle ff970730 slot 30 hash be
[    0.006987] OF: of_populate_phandle_cache(231) phandle ff970848 slot 48 hash b1
[    0.007002] OF: of_populate_phandle_cache(231) phandle ff977ab8 slot b8 hash c4
[    0.007016] OF: of_populate_phandle_cache(231) phandle ff977c98 slot 98 hash 1c
[    0.007030] OF: of_populate_phandle_cache(231) phandle ff97f300 slot 0 hash 44
[    0.007044] OF: of_populate_phandle_cache(231) phandle ff97f4e8 slot e8 hash aa
[    0.007059] OF: of_populate_phandle_cache(231) phandle ff97f708 slot 8 hash 74
[    0.007073] OF: of_populate_phandle_cache(231) phandle ff97f920 slot 20 hash 30
[    0.007087] OF: of_populate_phandle_cache(231) phandle ff97fb40 slot 40 hash fa
[    0.007102] OF: of_populate_phandle_cache(231) phandle ff97fd18 slot 18 hash 44
[    0.007116] OF: of_populate_phandle_cache(231) phandle ff97fee8 slot e8 hash 7f
[    0.007130] OF: of_populate_phandle_cache(231) phandle ff9800c0 slot c0 hash c9
[    0.007144] OF: of_populate_phandle_cache(231) phandle ff980298 slot 98 hash 13
[    0.007158] OF: of_populate_phandle_cache(231) phandle ff980430 slot 30 hash ea
[    0.007172] OF: of_populate_phandle_cache(231) phandle ff9806c8 slot c8 hash 8a
[    0.007187] OF: of_populate_phandle_cache(231) phandle ff980898 slot 98 hash c6
[    0.007201] OF: of_populate_phandle_cache(231) phandle ff980ba8 slot a8 hash 3c
[    0.007215] OF: of_populate_phandle_cache(231) phandle ff982648 slot 48 hash b7
[    0.007230] OF: of_populate_phandle_cache(231) phandle ff982df0 slot f0 hash 5e
[    0.007244] OF: of_populate_phandle_cache(231) phandle ff98e7a0 slot a0 hash 81
[    0.007259] OF: of_populate_phandle_cache(231) phandle ff98e998 slot 98 hash 4
[    0.007273] OF: of_populate_phandle_cache(231) phandle ff98eb28 slot 28 hash cd
[    0.007287] OF: of_populate_phandle_cache(231) phandle ff98ecb8 slot b8 hash 97
[    0.007301] OF: of_populate_phandle_cache(231) phandle ff98ee50 slot 50 hash 6e
[    0.007316] OF: of_populate_phandle_cache(231) phandle ff98efe8 slot e8 hash 46
[    0.007330] OF: of_populate_phandle_cache(231) phandle ff98f178 slot 78 hash f
[    0.007344] OF: of_populate_phandle_cache(231) phandle ff98f310 slot 10 hash e7
[    0.007358] OF: of_populate_phandle_cache(231) phandle ff98f4a8 slot a8 hash be
[    0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111


And some output from a "real" pseries machine (IBM OF), which is
slightly better:

[    0.129026] OF: of_populate_phandle_cache(222) entries: 128
[    0.129030] OF: of_populate_phandle_cache(231) phandle ffffffff slot 7f hash 4f
[    0.129034] OF: of_populate_phandle_cache(231) phandle cbffa0 slot 20 hash 2a
[    0.129038] OF: of_populate_phandle_cache(231) phandle fffffff7 slot 77 hash 47
[    0.129043] OF: of_populate_phandle_cache(231) phandle fffffff8 slot 78 hash 78
[    0.129046] OF: of_populate_phandle_cache(231) phandle 10000000 slot 0 hash 38
[    0.129050] OF: of_populate_phandle_cache(231) phandle 10000008 slot 8 hash 3f
[    0.129055] OF: of_populate_phandle_cache(231) phandle 10000010 slot 10 hash 46
[    0.129058] OF: of_populate_phandle_cache(231) phandle 10000018 slot 18 hash 4d
[    0.129062] OF: of_populate_phandle_cache(231) phandle 10000020 slot 20 hash 54
[    0.129066] OF: of_populate_phandle_cache(231) phandle 10000028 slot 28 hash 5b
[    0.129070] OF: of_populate_phandle_cache(231) phandle 10000030 slot 30 hash 62
[    0.129074] OF: of_populate_phandle_cache(231) phandle 10000038 slot 38 hash 69
[    0.129078] OF: of_populate_phandle_cache(231) phandle 10000040 slot 40 hash 71
[    0.129082] OF: of_populate_phandle_cache(231) phandle 10000048 slot 48 hash 78
[    0.129086] OF: of_populate_phandle_cache(231) phandle 2000 slot 0 hash 8
[    0.129089] OF: of_populate_phandle_cache(231) phandle 2001 slot 1 hash 39
[    0.129093] OF: of_populate_phandle_cache(231) phandle 2002 slot 2 hash 6a
[    0.129097] OF: of_populate_phandle_cache(231) phandle 2003 slot 3 hash 1b
[    0.129100] OF: of_populate_phandle_cache(231) phandle 2004 slot 4 hash 4b
[    0.129104] OF: of_populate_phandle_cache(231) phandle 2005 slot 5 hash 7c
[    0.129108] OF: of_populate_phandle_cache(231) phandle 2006 slot 6 hash 2d
[    0.129112] OF: of_populate_phandle_cache(231) phandle 2007 slot 7 hash 5e
[    0.129115] OF: of_populate_phandle_cache(231) phandle 2008 slot 8 hash f
[    0.129119] OF: of_populate_phandle_cache(231) phandle 200a slot a hash 71
[    0.129123] OF: of_populate_phandle_cache(231) phandle 3100 slot 0 hash 30
[    0.129127] OF: of_populate_phandle_cache(231) phandle 3101 slot 1 hash 61
[    0.129130] OF: of_populate_phandle_cache(231) phandle 3102 slot 2 hash 12
[    0.129134] OF: of_populate_phandle_cache(231) phandle 3103 slot 3 hash 43
[    0.129138] OF: of_populate_phandle_cache(231) phandle 3104 slot 4 hash 74
[    0.129141] OF: of_populate_phandle_cache(231) phandle 3105 slot 5 hash 25
[    0.129145] OF: of_populate_phandle_cache(231) phandle 3106 slot 6 hash 56
[    0.129148] OF: of_populate_phandle_cache(231) phandle 3107 slot 7 hash 7
[    0.129152] OF: of_populate_phandle_cache(231) phandle 3108 slot 8 hash 37
[    0.129156] OF: of_populate_phandle_cache(231) phandle 310a slot a hash 19
[    0.129160] OF: of_populate_phandle_cache(231) phandle dd08a0 slot 20 hash 26
[    0.129164] OF: of_populate_phandle_cache(231) phandle dd2a58 slot 58 hash 37
[    0.129168] OF: of_populate_phandle_cache(231) phandle fffffff9 slot 79 hash 29
[    0.129172] OF: of_populate_phandle_cache(231) phandle fffffff6 slot 76 hash 17
[    0.129176] OF: of_populate_phandle_cache(231) phandle fffffff4 slot 74 hash 35
[    0.129180] OF: of_populate_phandle_cache(231) phandle fffffff5 slot 75 hash 66
[    0.129184] OF: of_populate_phandle_cache(231) phandle fffffff3 slot 73 hash 4
[    0.129188] OF: of_populate_phandle_cache(231) phandle c9cca8 slot 28 hash 32
[    0.129191] OF: of_populate_phandle_cache(231) phandle dd2bc8 slot 48 hash 7f
[    0.129195] OF: of_populate_phandle_cache(231) phandle 25000015 slot 15 hash 24
[    0.129199] OF: of_populate_phandle_cache(231) phandle 2500001b slot 1b hash 49
[    0.129203] OF: of_populate_phandle_cache(231) phandle 2500001e slot 1e hash 5c
[    0.129207] OF: of_populate_phandle_cache(231) phandle fffffffa slot 7a hash 5a
[    0.129211] OF: of_populate_phandle_cache(231) phandle 80000100 slot 0 hash 24
[    0.129215] OF: of_populate_phandle_cache(231) phandle 80000140 slot 40 hash 5d
[    0.129219] OF: of_populate_phandle_cache(231) phandle fffffffe slot 7e hash 1e
[    0.129223] OF: of_populate_phandle_cache(231) phandle fffffffb slot 7b hash b
[    0.129227] OF: of_populate_phandle_cache(231) phandle c9ed48 slot 48 hash 49
[    0.129231] OF: of_populate_phandle_cache(231) phandle db9740 slot 40 hash 4c
[    0.129235] OF: of_populate_phandle_cache(231) phandle d609a0 slot 20 hash 34
[    0.129238] OF: of_populate_phandle_cache(231) phandle dcd820 slot 20 hash 67
[    0.129242] OF: of_populate_phandle_cache(231) phandle dae0a0 slot 20 hash 75
[    0.129246] OF: of_populate_phandle_cache(231) phandle d21df0 slot 70 hash 44
[    0.129250] OF: of_populate_phandle_cache(231) phandle cc2ec8 slot 48 hash 36
[    0.129254] OF: of_populate_phandle_cache(231) phandle d6a2a8 slot 28 hash 27
[    0.129258] OF: of_populate_phandle_cache(231) phandle d23330 slot 30 hash 37
[    0.129262] OF: of_populate_phandle_cache(231) phandle dd0320 slot 20 hash 3e
[    0.129266] OF: of_populate_phandle_cache(231) phandle dae248 slot 48 hash 6f
[    0.129270] OF: of_populate_phandle_cache(231) phandle d981d8 slot 58 hash 2f
[    0.129274] OF: of_populate_phandle_cache(231) phandle da3da8 slot 28 hash 26
[    0.129278] OF: of_populate_phandle_cache(231) phandle d88480 slot 0 hash 4a
[    0.129282] OF: of_populate_phandle_cache(231) phandle d8e118 slot 18 hash 5a
[    0.129285] OF: of_populate_phandle_cache(231) phandle d8ff50 slot 50 hash 4c
[    0.129289] OF: of_populate_phandle_cache(231) phandle e9da68 slot 68 hash 59
[    0.129293] OF: of_populate_phandle_cache(231) phandle ebc460 slot 60 hash 3a
[    0.129297] OF: of_populate_phandle_cache(231) phandle d91c48 slot 48 hash 20
[    0.129301] OF: of_populate_phandle_cache(231) phandle d3b5f0 slot 70 hash f
[    0.129305] OF: of_populate_phandle_cache(231) phandle d428d0 slot 50 hash 7d
[    0.129309] OF: of_populate_phandle_cache(231) phandle d80ae0 slot 60 hash 58
[    0.129313] OF: of_populate_phandle_cache(231) phandle d9a500 slot 0 hash 8
[    0.129316] OF: of_populate_phandle_cache(231) phandle dacc78 slot 78 hash 7c
[    0.129320] OF: of_populate_phandle_cache(231) phandle cc4930 slot 30 hash 42
[    0.129324] OF: of_populate_phandle_cache(231) phandle db2510 slot 10 hash 7b
[    0.129328] OF: of_populate_phandle_cache(231) phandle dcca28 slot 28 hash 73
[    0.129332] OF: of_populate_phandle_cache(231) phandle d28488 slot 8 hash 3f
[    0.129336] OF: of_populate_phandle_cache(231) phandle d589f8 slot 78 hash 61
[    0.129340] OF: of_populate_phandle_cache(231) phandle ecf870 slot 70 hash 69
[    0.129344] OF: of_populate_phandle_cache(231) phandle d97948 slot 48 hash d
[    0.129348] OF: of_populate_phandle_cache(231) phandle d27720 slot 20 hash 4a
[    0.129352] OF: of_populate_phandle_cache(231) phandle d513c8 slot 48 hash 7f
[    0.129356] OF: of_populate_phandle_cache(231) phandle dd0228 slot 28 hash 61
[    0.129360] OF: of_populate_phandle_cache(231) phandle d76568 slot 68 hash 4e
[    0.129364] OF: of_populate_phandle_cache(231) phandle d4a390 slot 10 hash 70
[    0.129368] OF: of_populate_phandle_cache(231) phandle cf3a10 slot 10 hash f
[    0.129372] OF: of_populate_phandle_cache(231) phandle d0bc58 slot 58 hash 7c
[    0.129376] OF: of_populate_phandle_cache(231) phandle 20000015 slot 15 hash 72
[    0.129380] OF: of_populate_phandle_cache(231) phandle e21228 slot 28 hash 75
[    0.129384] OF: of_populate_phandle_cache(231) phandle e33b48 slot 48 hash 64
[    0.129388] OF: of_populate_phandle_cache(231) phandle e38f08 slot 8 hash 11
[    0.129392] OF: of_populate_phandle_cache(231) phandle e3b8a0 slot 20 hash 27
[    0.129396] OF: of_populate_phandle_cache(231) phandle e3e378 slot 78 hash 5b
[    0.129400] OF: of_populate_phandle_cache(231) phandle e43740 slot 40 hash f
[    0.129404] OF: of_populate_phandle_cache(231) phandle 2000001b slot 1b hash 18
[    0.129408] OF: of_populate_phandle_cache(231) phandle e45fc8 slot 48 hash 32
[    0.129412] OF: of_populate_phandle_cache(231) phandle e5be68 slot 68 hash 55
[    0.129416] OF: of_populate_phandle_cache(231) phandle 2000001e slot 1e hash 2a
[    0.129420] OF: of_populate_phandle_cache(231) phandle 2204001e slot 1e hash 7e
[    0.129423] OF: of_populate_phandle_cache(231) phandle e6c880 slot 0 hash 18
[    0.129427] OF: of_populate_phandle_cache(231) phandle e7bca0 slot 20 hash 45
[    0.129431] OF: of_populate_phandle_cache(231) phandle e8b0c0 slot 40 hash 71
[    0.129435] OF: of_populate_phandle_cache(231) phandle fffffffd slot 7d hash 6d
[    0.129439] OF: of_populate_phandle_cache(231) phandle fffffffc slot 7c hash 3c
[    0.129443] OF: of_populate_phandle_cache(231) phandle df1d80 slot 0 hash 49
[    0.129447] OF: of_populate_phandle_cache(231) phandle df3488 slot 8 hash 52
[    0.129451] OF: of_populate_phandle_cache(231) phandle df3d80 slot 0 hash 52
[    0.129454] OF: of_populate_phandle_cache(231) phandle df3198 slot 18 hash 34
[    0.129458] OF: of_populate_phandle_cache(231) phandle df2888 slot 8 hash 1f
[    0.129462] OF: of_populate_phandle_cache(231) phandle 30000000 slot 0 hash 28
[    0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81



So yeah using hash_32() is quite a bit better in both cases.

And if I'm reading your patch right it would be a single line change to
switch, so that seems like it's worth doing to me.
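
i.e. something like this, sketched against your debug patch (the lookup
in of_find_node_by_phandle() would need the same slot calculation):

-			phandle_cache[np->phandle & phandle_cache_mask] = np;
+			phandle_cache[hash_32(np->phandle, __ffs(cache_entries))] = np;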

cheers


> The patch used for testing:
>
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 1d667eb730e19..2640d4bc81a9a 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -197,6 +197,7 @@ void of_populate_phandle_cache(void)
>  	u32 cache_entries;
>  	struct device_node *np;
>  	u32 phandles = 0;
> +	struct device_node **cache2;
>  
>  	raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> @@ -214,14 +215,32 @@ void of_populate_phandle_cache(void)
>  
>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>  				GFP_ATOMIC);
> +	cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
>  	if (!phandle_cache)
>  		goto out;
>  
> +	pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
>  	for_each_of_allnodes(np)
>  		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> +			int slot;
>  			of_node_get(np);
>  			phandle_cache[np->phandle & phandle_cache_mask] = np;
> +			slot = hash_32(np->phandle, __ffs(cache_entries));
> +			cache2[slot] = np;
> +			pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, __LINE__,
> +			       np->phandle, np->phandle & phandle_cache_mask, slot);
>  		}
> +	{
> +		int i, filled = 0, filled_hash = 0;
> +
> +		for (i = 0; i < cache_entries; i++) {
> +			if (phandle_cache[i])
> +				filled++;
> +			if (cache2[i])
> +				filled_hash++;
> +		}
> +		pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__, __LINE__, filled, filled_hash);
> +	}
>  
>  out:
>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>
> Sebastian
Michael Ellerman Dec. 3, 2019, 4:12 a.m. UTC | #4
Frank Rowand <frowand.list@gmail.com> writes:
> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>> I've been looking at phandle_cache and noticed the following: The raw
>> phandle value as generated by dtc starts at zero and is incremented by
>> one for each phandle entry. The qemu pSeries model is using Slof (which
>> is probably the same thing as used on real hardware) and this looks like
>> a poiner value for the phandle.
>> With
>> 	qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>> 
>> I got the following output:
>> | entries: 64
>> | phandle 7e732468 slot 28 hash c
>> | phandle 7e732ad0 slot 10 hash 27
>> | phandle 7e732ee8 slot 28 hash 3a
>> | phandle 7e734160 slot 20 hash 36
>> | phandle 7e734318 slot 18 hash 3a
>> | phandle 7e734428 slot 28 hash 33
>> | phandle 7e734538 slot 38 hash 2c
>> | phandle 7e734850 slot 10 hash e
>> | phandle 7e735220 slot 20 hash 2d
>> | phandle 7e735bf0 slot 30 hash d
>> | phandle 7e7365c0 slot 0 hash 2d
>> | phandle 7e736f90 slot 10 hash d
>> | phandle 7e737960 slot 20 hash 2d
>> | phandle 7e738330 slot 30 hash d
>> | phandle 7e738d00 slot 0 hash 2d
>> | phandle 7e739730 slot 30 hash 38
>> | phandle 7e73bd08 slot 8 hash 17
>> | phandle 7e73c2e0 slot 20 hash 32
>> | phandle 7e73c7f8 slot 38 hash 37
>> | phandle 7e782420 slot 20 hash 13
>> | phandle 7e782ed8 slot 18 hash 1b
>> | phandle 7e73ce28 slot 28 hash 39
>> | phandle 7e73d390 slot 10 hash 22
>> | phandle 7e73d9a8 slot 28 hash 1a
>> | phandle 7e73dc28 slot 28 hash 37
>> | phandle 7e73de00 slot 0 hash a
>> | phandle 7e73e028 slot 28 hash 0
>> | phandle 7e7621a8 slot 28 hash 36
>> | phandle 7e73e458 slot 18 hash 1e
>> | phandle 7e73e608 slot 8 hash 1e
>> | phandle 7e740078 slot 38 hash 28
>> | phandle 7e740180 slot 0 hash 1d
>> | phandle 7e740240 slot 0 hash 33
>> | phandle 7e740348 slot 8 hash 29
>> | phandle 7e740410 slot 10 hash 2
>> | phandle 7e740eb0 slot 30 hash 3e
>> | phandle 7e745390 slot 10 hash 33
>> | phandle 7e747b08 slot 8 hash c
>> | phandle 7e748528 slot 28 hash f
>> | phandle 7e74a6e0 slot 20 hash 18
>> | phandle 7e74aab0 slot 30 hash b
>> | phandle 7e74f788 slot 8 hash d
>> | Used entries: 8, hashed: 29
>> 
>> So the hash array has 64 entries out which only 8 are populated. Using
>> hash_32() populates 29 entries.
>> Could someone with real hardware verify this?
>> I'm not sure how important this performance wise, it looks just like a
>> waste using only 1/8 of the array.
>
> The hash used is based on the assumptions you noted, and as stated in the
> code, that phandle property values are in a contiguous range of 1..n
> (not starting from zero), which is what dtc generates.

> We knew that for systems that do not match the assumptions that the hash
> will not be optimal.

If we're going to have the phandle cache it should at least make some
attempt to work across the systems that Linux supports.

> Unless there is a serious performance problem for
> such systems, I do not want to make the phandle hash code more complicated
> to optimize for these cases.  And the pseries have been performing ok
> without phandle related performance issues that I remember hearing since
> before the cache was added, which could have only helped the performance.
> Yes, if your observations are correct, some memory is being wasted, but
> a 64 entry cache is not very large on a pseries.

A single line change to use an actual hash function is hardly
complicating it, compared to the amount of code already there.

cheers
Frank Rowand Dec. 3, 2019, 4:28 a.m. UTC | #5
On 12/2/19 10:12 PM, Michael Ellerman wrote:
> Frank Rowand <frowand.list@gmail.com> writes:
>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>> I've been looking at phandle_cache and noticed the following: The raw
>>> phandle value as generated by dtc starts at zero and is incremented by
>>> one for each phandle entry. The qemu pSeries model is using Slof (which
>>> is probably the same thing as used on real hardware) and this looks like
>>> a poiner value for the phandle.
>>> With
>>> 	qemu-system-ppc64le -m 16G -machine pseries -smp 8 
>>>
>>> I got the following output:
>>> | entries: 64
>>> | phandle 7e732468 slot 28 hash c
>>> | phandle 7e732ad0 slot 10 hash 27
>>> | phandle 7e732ee8 slot 28 hash 3a
>>> | phandle 7e734160 slot 20 hash 36
>>> | phandle 7e734318 slot 18 hash 3a
>>> | phandle 7e734428 slot 28 hash 33
>>> | phandle 7e734538 slot 38 hash 2c
>>> | phandle 7e734850 slot 10 hash e
>>> | phandle 7e735220 slot 20 hash 2d
>>> | phandle 7e735bf0 slot 30 hash d
>>> | phandle 7e7365c0 slot 0 hash 2d
>>> | phandle 7e736f90 slot 10 hash d
>>> | phandle 7e737960 slot 20 hash 2d
>>> | phandle 7e738330 slot 30 hash d
>>> | phandle 7e738d00 slot 0 hash 2d
>>> | phandle 7e739730 slot 30 hash 38
>>> | phandle 7e73bd08 slot 8 hash 17
>>> | phandle 7e73c2e0 slot 20 hash 32
>>> | phandle 7e73c7f8 slot 38 hash 37
>>> | phandle 7e782420 slot 20 hash 13
>>> | phandle 7e782ed8 slot 18 hash 1b
>>> | phandle 7e73ce28 slot 28 hash 39
>>> | phandle 7e73d390 slot 10 hash 22
>>> | phandle 7e73d9a8 slot 28 hash 1a
>>> | phandle 7e73dc28 slot 28 hash 37
>>> | phandle 7e73de00 slot 0 hash a
>>> | phandle 7e73e028 slot 28 hash 0
>>> | phandle 7e7621a8 slot 28 hash 36
>>> | phandle 7e73e458 slot 18 hash 1e
>>> | phandle 7e73e608 slot 8 hash 1e
>>> | phandle 7e740078 slot 38 hash 28
>>> | phandle 7e740180 slot 0 hash 1d
>>> | phandle 7e740240 slot 0 hash 33
>>> | phandle 7e740348 slot 8 hash 29
>>> | phandle 7e740410 slot 10 hash 2
>>> | phandle 7e740eb0 slot 30 hash 3e
>>> | phandle 7e745390 slot 10 hash 33
>>> | phandle 7e747b08 slot 8 hash c
>>> | phandle 7e748528 slot 28 hash f
>>> | phandle 7e74a6e0 slot 20 hash 18
>>> | phandle 7e74aab0 slot 30 hash b
>>> | phandle 7e74f788 slot 8 hash d
>>> | Used entries: 8, hashed: 29
>>>
>>> So the hash array has 64 entries out which only 8 are populated. Using
>>> hash_32() populates 29 entries.
>>> Could someone with real hardware verify this?
>>> I'm not sure how important this performance wise, it looks just like a
>>> waste using only 1/8 of the array.
>>
>> The hash used is based on the assumptions you noted, and as stated in the
>> code, that phandle property values are in a contiguous range of 1..n
>> (not starting from zero), which is what dtc generates.
> 
>> We knew that for systems that do not match the assumptions that the hash
>> will not be optimal.
> 
> If we're going to have the phandle cache it should at least make some
> attempt to work across the systems that Linux supports.
> 
>> Unless there is a serious performance problem for
>> such systems, I do not want to make the phandle hash code more complicated
>> to optimize for these cases.  And the pseries have been performing ok
>> without phandle related performance issues that I remember hearing since
>> before the cache was added, which could have only helped the performance.
>> Yes, if your observations are correct, some memory is being wasted, but
>> a 64 entry cache is not very large on a pseries.
> 
> A single line change to use an actual hash function is hardly
> complicating it, compared to the amount of code already there.

With a dtc generated devicetree, the hash is perfect, with no
misses.  That single line change then makes the hash bad for
dtc generated devicetrees.

The cache was added to address a problem with a system with a
dtc generated devicetree.  I had not heard of any phandle
lookup performance issues on ppc systems.  An imperfect hash
for ppc should not make the ppc performance worse (unless
every single phandle value hashes to a single bucket).  So
the ppc performance is likely better than it was before
the hash was added, even without an optimal hash algorithm.

So the change would not be a single line change.  It would
be a change to use different hash algorithms for dtc
generated device trees versus other device trees.

Another possibility would be to make the cache depend on
!CONFIG_PPC.  It might be possible to disable the cache with a
minimal code change.

> 
> cheers
>
Rob Herring Dec. 3, 2019, 4:56 p.m. UTC | #6
On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand <frowand.list@gmail.com> wrote:
>
> On 12/2/19 10:12 PM, Michael Ellerman wrote:
> > Frank Rowand <frowand.list@gmail.com> writes:
> >> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> >>> I've been looking at phandle_cache and noticed the following: The raw
> >>> phandle value as generated by dtc starts at zero and is incremented by
> >>> one for each phandle entry. The qemu pSeries model is using Slof (which
> >>> is probably the same thing as used on real hardware) and this looks like
> >>> a poiner value for the phandle.
> >>> With
> >>>     qemu-system-ppc64le -m 16G -machine pseries -smp 8
> >>>
> >>> I got the following output:
> >>> | entries: 64
> >>> | phandle 7e732468 slot 28 hash c
> >>> | phandle 7e732ad0 slot 10 hash 27
> >>> | phandle 7e732ee8 slot 28 hash 3a
> >>> | phandle 7e734160 slot 20 hash 36
> >>> | phandle 7e734318 slot 18 hash 3a
> >>> | phandle 7e734428 slot 28 hash 33
> >>> | phandle 7e734538 slot 38 hash 2c
> >>> | phandle 7e734850 slot 10 hash e
> >>> | phandle 7e735220 slot 20 hash 2d
> >>> | phandle 7e735bf0 slot 30 hash d
> >>> | phandle 7e7365c0 slot 0 hash 2d
> >>> | phandle 7e736f90 slot 10 hash d
> >>> | phandle 7e737960 slot 20 hash 2d
> >>> | phandle 7e738330 slot 30 hash d
> >>> | phandle 7e738d00 slot 0 hash 2d
> >>> | phandle 7e739730 slot 30 hash 38
> >>> | phandle 7e73bd08 slot 8 hash 17
> >>> | phandle 7e73c2e0 slot 20 hash 32
> >>> | phandle 7e73c7f8 slot 38 hash 37
> >>> | phandle 7e782420 slot 20 hash 13
> >>> | phandle 7e782ed8 slot 18 hash 1b
> >>> | phandle 7e73ce28 slot 28 hash 39
> >>> | phandle 7e73d390 slot 10 hash 22
> >>> | phandle 7e73d9a8 slot 28 hash 1a
> >>> | phandle 7e73dc28 slot 28 hash 37
> >>> | phandle 7e73de00 slot 0 hash a
> >>> | phandle 7e73e028 slot 28 hash 0
> >>> | phandle 7e7621a8 slot 28 hash 36
> >>> | phandle 7e73e458 slot 18 hash 1e
> >>> | phandle 7e73e608 slot 8 hash 1e
> >>> | phandle 7e740078 slot 38 hash 28
> >>> | phandle 7e740180 slot 0 hash 1d
> >>> | phandle 7e740240 slot 0 hash 33
> >>> | phandle 7e740348 slot 8 hash 29
> >>> | phandle 7e740410 slot 10 hash 2
> >>> | phandle 7e740eb0 slot 30 hash 3e
> >>> | phandle 7e745390 slot 10 hash 33
> >>> | phandle 7e747b08 slot 8 hash c
> >>> | phandle 7e748528 slot 28 hash f
> >>> | phandle 7e74a6e0 slot 20 hash 18
> >>> | phandle 7e74aab0 slot 30 hash b
> >>> | phandle 7e74f788 slot 8 hash d
> >>> | Used entries: 8, hashed: 29
> >>>
> >>> So the hash array has 64 entries out which only 8 are populated. Using
> >>> hash_32() populates 29 entries.
> >>> Could someone with real hardware verify this?
> >>> I'm not sure how important this performance wise, it looks just like a
> >>> waste using only 1/8 of the array.
> >>
> >> The hash used is based on the assumptions you noted, and as stated in the
> >> code, that phandle property values are in a contiguous range of 1..n
> >> (not starting from zero), which is what dtc generates.
> >
> >> We knew that for systems that do not match the assumptions that the hash
> >> will not be optimal.
> >
> > If we're going to have the phandle cache it should at least make some
> > attempt to work across the systems that Linux supports.
> >
> >> Unless there is a serious performance problem for
> >> such systems, I do not want to make the phandle hash code more complicated
> >> to optimize for these cases.  And the pseries have been performing ok
> >> without phandle related performance issues that I remember hearing since
> >> before the cache was added, which could have only helped the performance.
> >> Yes, if your observations are correct, some memory is being wasted, but
> >> a 64 entry cache is not very large on a pseries.
> >
> > A single line change to use an actual hash function is hardly
> > complicating it, compared to the amount of code already there.
>
> With a dtc generated devicetree, the hash is perfect, with no
> misses.  That single line change then makes the hash bad for
> dtc generated devicetrees.

To quantify that, I did some tests with the hash algo and it's about a
10% collision rate using phandle values of 1-$cache_size. There's
never more than 2 phandles colliding in a slot. The actual impact
would be less, given that 100% thrashing seems unlikely.
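
For reference, that kind of measurement can be reproduced in userspace
along these lines (a sketch, not the exact test; hash_32() is
re-implemented from the kernel's generic version, and the mask is
collision-free for a contiguous 1..size range):

#include <stdio.h>
#include <stdint.h>

/* the kernel's generic hash_32() from include/linux/hash.h,
 * re-implemented for this userspace illustration only */
#define GOLDEN_RATIO_32 0x61C88647u

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

int main(void)
{
	unsigned int bits;

	for (bits = 6; bits <= 10; bits++) {
		uint32_t size = 1u << bits, ph;
		unsigned char used[1 << 10] = { 0 };
		unsigned int collisions = 0;

		/* dtc-style contiguous phandles 1..size; by contrast,
		 * "ph & (size - 1)" never collides for this range */
		for (ph = 1; ph <= size; ph++)
			if (used[hash_32(ph, bits)]++)
				collisions++;

		printf("%4u slots: %3u collisions (%.0f%%)\n",
		       (unsigned)size, collisions,
		       100.0 * collisions / size);
	}
	return 0;
}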

> The cache was added to address a problem with a system with a
> dtc generated devicetree.

The cache was added to address a problem with a system with a large
number of nodes and phandles (814 phandles in the reported case). dtc
happens to be used in that case.

> I had not heard of any phandle
> lookup performance issues on ppc systems.  An imperfect hash
> for ppc should not make the ppc performance worse (unless
> every single phandle value hashes to a single bucket).  So
> the ppc performance is likely better than it was before
> the hash was added, even without an optimal hash algorithm.
>
> So the change would not be a single line change.  It would
> be a change to use different hash algorithms for dtc
> generated device trees versus other device trees.
>
> Another possibility would be to make the cache be dependent
> upon not CONFIG_PPC.  It might be possible to disable the
> cache with a minimal code change.

I'd rather not do that.

And yes, as mentioned earlier, I don't like the complexity. I didn't
from the start, and I'm still of the opinion we should have a fixed or
one-time-sized true cache (i.e. smaller than the total number of
phandles). That would solve the RT memory allocation and locking issue
too.

For reference, the improvement from the current implementation (assuming
fixes haven't regressed it) was ~400ms, vs. a ~340ms improvement with a
64-entry cache (using a mask, not a hash). IMO, a 340ms improvement was
good enough.

Rob
Segher Boessenkool Dec. 3, 2019, 6:35 p.m. UTC | #7
Hi!

On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
> I've certainly heard it said that on some OF's the phandle was just ==
> the address of the internal representation, and I guess maybe for SLOF
> that is true.

It is (or was).  In many OFs it is just the effective address of some
node structure.  SLOF runs with translation off normally.

> They seem to vary wildly though, eg. on an Apple G5:

Apple OF runs with translation on usually.  IIRC these are effective
addresses as well.

The OF they have on G5 machines is mostly 32-bit, for compatibility is my
guess (for userland things dealing with addresses from OF, importantly).

>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>   /proc/device-tree/vsp@0,f9000000/veo@f9180000/phandle ff970848
>   /proc/device-tree/vsp@0,f9000000/phandle ff970360
>   /proc/device-tree/vsp@0,f9000000/veo@f9080000/phandle ff970730
>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>   /proc/device-tree/xmodem/phandle ff9655e8
>   /proc/device-tree/multiboot/phandle ff9504f0
>   /proc/device-tree/diagnostics/phandle ff965550
>   /proc/device-tree/options/phandle ff893cf0
>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>   /proc/device-tree/openprom/phandle ff892458
> 
> That machine does not have enough RAM for those to be 32-bit real
> addresses. I think Apple OF is running in virtual mode though (?), so
> maybe they are pointers?

Yes, I think the default is to have 8MB ram at the top of 4GB (which is
the physical address of the bootrom, btw) for OF.

> And on an IBM pseries machine they're a bit all over the place:
> 
>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 10000040
>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 00002005
>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 10000030
>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 10000020
>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 10000010
>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 00002003
>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 0000200a
>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 00003108
>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 00002001
>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 00003106
>   /proc/device-tree/cpus/ibm,phandle fffffff8
>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 00003104
>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 00002008
>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 00003102
>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 00002006
>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 00003100
>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 10000008
>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 00002004
>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 10000048
>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 10000038
>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 00002002
>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 10000028
>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 00003107
>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 10000018
>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 00002000
>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 00003105
>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 00003103
>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 0000310a
>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 10000000
>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 00002007
>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 00003101
>   /proc/device-tree/pci@80000002000001b/ibm,phandle 2000001b

Some (the 1000xxxx) look like addresses as well.

> > So the hash array has 64 entries out which only 8 are populated. Using
> > hash_32() populates 29 entries.

> On the G5 it's similarly inefficient:
> [    0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111

> And some output from a "real" pseries machine (IBM OF), which is
> slightly better:
> [    0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81

> So yeah using hash_32() is quite a bit better in both cases.

Yup, no surprise there.  And hash_32 is very cheap to compute.

> And if I'm reading your patch right it would be a single line change to
> switch, so that seems like it's worth doing to me.

Agreed!

Btw.  Some OFs mangle the phandles in some way, to make it easier to
catch people using them as addresses (and similarly, mangle ihandles
differently, so you catch confusion between ihandles and phandles as
well).  Like a simple xor, with some odd number preferably.  You should
assume *nothing* about phandles, they are opaque identifiers.
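
Something like this, purely as an illustration (no particular OF does
exactly this):

#include <stdint.h>

/* arbitrary odd constants; odd, so a mangled value also looks like a
 * misaligned (i.e. obviously invalid) pointer */
#define PHANDLE_XOR	0x35a74f21u
#define IHANDLE_XOR	0x6c19b583u

static uint32_t phandle_to_client(uint32_t internal) { return internal ^ PHANDLE_XOR; }
static uint32_t phandle_from_client(uint32_t ext)    { return ext ^ PHANDLE_XOR; }
static uint32_t ihandle_to_client(uint32_t internal) { return internal ^ IHANDLE_XOR; }
static uint32_t ihandle_from_client(uint32_t ext)    { return ext ^ IHANDLE_XOR; }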


Segher
Sebastian Andrzej Siewior Dec. 5, 2019, 4:35 p.m. UTC | #8
On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
> > Another possibility would be to make the cache be dependent
> > upon not CONFIG_PPC.  It might be possible to disable the
> > cache with a minimal code change.
> 
> I'd rather not do that.
> 
> And yes, as mentioned earlier I don't like the complexity. I didn't
> from the start and I'm  I'm still of the opinion we should have a
> fixed or 1 time sized true cache (i.e. smaller than total # of
> phandles). That would solve the RT memory allocation and locking issue
> too.
> 
> For reference, the performance difference between the current
> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
> IMO, 340ms improvement was good enough.

Okay. So the 814 phandles would result in an array with 1024 slots. That
would need 4KiB of memory.
What about going back to the fixed 64-slot array, but with hash_32() for
the lookup? Without the hash we would be 60ms slower during boot
(compared to now, based on ancient data), but the hash isn't expensive,
so we end up with better coverage of the cache memory on systems which
don't have a plain enumeration of phandles.
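
Roughly something like this (untested sketch; it reuses the existing
drivers/of bits such as devtree_lock, for_each_of_allnodes(),
OF_PHANDLE_ILLEGAL and of_node_get(), the OF_PHANDLE_CACHE_BITS name is
made up here, and cache invalidation on node removal is left out):

#define OF_PHANDLE_CACHE_BITS	6	/* 64 slots */

static struct device_node *phandle_cache[1 << OF_PHANDLE_CACHE_BITS];

struct device_node *of_find_node_by_phandle(phandle handle)
{
	struct device_node *np = NULL;
	unsigned long flags;
	u32 slot;

	if (!handle)
		return NULL;

	slot = hash_32(handle, OF_PHANDLE_CACHE_BITS);

	raw_spin_lock_irqsave(&devtree_lock, flags);

	/* fixed-size true cache: check the slot, fall back to a full walk */
	if (phandle_cache[slot] && phandle_cache[slot]->phandle == handle)
		np = phandle_cache[slot];

	if (!np) {
		for_each_of_allnodes(np)
			if (np->phandle == handle &&
			    np->phandle != OF_PHANDLE_ILLEGAL) {
				phandle_cache[slot] = np;
				break;
			}
	}

	of_node_get(np);
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	return np;
}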

> Rob

Sebastian
Frank Rowand Dec. 6, 2019, 1:37 a.m. UTC | #9
On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
>> Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:
>> I've certainly heard it said that on some OF's the phandle was just ==
>> the address of the internal representation, and I guess maybe for SLOF
>> that is true.
> 
> It is (or was).  In many OFs it is just the effective address of some
> node structure.  SLOF runs with translation off normally.
> 
>> They seem to vary wildly though, eg. on an Apple G5:
> 
> Apple OF runs with translation on usually.  IIRC these are effective
> addresses as well.
> 
> The OF they have on G5 machines is mostly 32-bit, for compatibility is my
> guess (for userland things dealing with addresses from OF, importantly).
> 
>>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>>   /proc/device-tree/vsp@0,f9000000/veo@f9180000/phandle ff970848
>>   /proc/device-tree/vsp@0,f9000000/phandle ff970360
>>   /proc/device-tree/vsp@0,f9000000/veo@f9080000/phandle ff970730
>>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>>   /proc/device-tree/xmodem/phandle ff9655e8
>>   /proc/device-tree/multiboot/phandle ff9504f0
>>   /proc/device-tree/diagnostics/phandle ff965550
>>   /proc/device-tree/options/phandle ff893cf0
>>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>>   /proc/device-tree/openprom/phandle ff892458
>>
>> That machine does not have enough RAM for those to be 32-bit real
>> addresses. I think Apple OF is running in virtual mode though (?), so
>> maybe they are pointers?
> 
> Yes, I think the default is to have 8MB ram at the top of 4GB (which is
> the physical address of the bootrom, btw) for OF.
> 
>> And on an IBM pseries machine they're a bit all over the place:
>>
>>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 10000040
>>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 00002005
>>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 10000030
>>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 10000020
>>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 10000010
>>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 00002003
>>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 0000200a
>>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 00003108
>>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 00002001
>>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 00003106
>>   /proc/device-tree/cpus/ibm,phandle fffffff8
>>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 00003104
>>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 00002008
>>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 00003102
>>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 00002006
>>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 00003100
>>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 10000008
>>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 00002004
>>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 10000048
>>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 10000038
>>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 00002002
>>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 10000028
>>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 00003107
>>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 10000018
>>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 00002000
>>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 00003105
>>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 00003103
>>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 0000310a
>>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 10000000
>>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 00002007
>>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 00003101
>>   /proc/device-tree/pci@80000002000001b/ibm,phandle 2000001b
> 
> Some (the 1000xxxx) look like addresses as well.
> 
>>> So the hash array has 64 entries out which only 8 are populated. Using
>>> hash_32() populates 29 entries.
> 
>> On the G5 it's similarly inefficient:
>> [    0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111
> 
>> And some output from a "real" pseries machine (IBM OF), which is
>> slightly better:
>> [    0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81
> 
>> So yeah using hash_32() is quite a bit better in both cases.
> 
> Yup, no surprise there.  And hash_32 is very cheap to compute.
> 
>> And if I'm reading your patch right it would be a single line change to
>> switch, so that seems like it's worth doing to me.
> 
> Agreed!
> 
> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
> people using it as an address (and similarly, mangle ihandles differently,
> so you catch confusion between ihandles and phandles as well).  Like a
> simple xor, with some odd number preferably.  You should assume *nothing*
> about phandles, they are opaque identifiers.

For arm32 machines that use dtc to generate the devicetree, which is a
very large user base, we certainly can make assumptions about phandles.
Especially because the complaints about the overhead of phandle based
lookups have been voiced by users of this specific set of machines.

For systems with a devicetree that does not follow these assumptions, the
phandle cache should not measurably increase the overhead of
phandle-based lookups.  Such systems may or may not see an overhead
reduction from the existence of the cache.  This was explicitly stated
during the reviews of the possible phandle cache implementation
alternatives.

If you have measurements of a system where implementing the phandle
cache increased the overhead, and the additional overhead is a concern
(such as significantly increasing boot time) then please share that
information with us.  Otherwise this is just a theoretical exercise.

-Frank

> 
> 
> Segher
>
Frank Rowand Dec. 6, 2019, 1:52 a.m. UTC | #10
On 12/3/19 10:56 AM, Rob Herring wrote:
> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand <frowand.list@gmail.com> wrote:
>>
>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>> Frank Rowand <frowand.list@gmail.com> writes:
>>>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>>>> I've been looking at phandle_cache and noticed the following: The raw
>>>>> phandle value as generated by dtc starts at zero and is incremented by
>>>>> one for each phandle entry. The qemu pSeries model is using Slof (which
>>>>> is probably the same thing as used on real hardware) and this looks like
>>>>> a poiner value for the phandle.
>>>>> With
>>>>>     qemu-system-ppc64le -m 16G -machine pseries -smp 8
>>>>>
>>>>> I got the following output:
>>>>> | entries: 64
>>>>> | phandle 7e732468 slot 28 hash c
>>>>> | phandle 7e732ad0 slot 10 hash 27
>>>>> | phandle 7e732ee8 slot 28 hash 3a
>>>>> | phandle 7e734160 slot 20 hash 36
>>>>> | phandle 7e734318 slot 18 hash 3a
>>>>> | phandle 7e734428 slot 28 hash 33
>>>>> | phandle 7e734538 slot 38 hash 2c
>>>>> | phandle 7e734850 slot 10 hash e
>>>>> | phandle 7e735220 slot 20 hash 2d
>>>>> | phandle 7e735bf0 slot 30 hash d
>>>>> | phandle 7e7365c0 slot 0 hash 2d
>>>>> | phandle 7e736f90 slot 10 hash d
>>>>> | phandle 7e737960 slot 20 hash 2d
>>>>> | phandle 7e738330 slot 30 hash d
>>>>> | phandle 7e738d00 slot 0 hash 2d
>>>>> | phandle 7e739730 slot 30 hash 38
>>>>> | phandle 7e73bd08 slot 8 hash 17
>>>>> | phandle 7e73c2e0 slot 20 hash 32
>>>>> | phandle 7e73c7f8 slot 38 hash 37
>>>>> | phandle 7e782420 slot 20 hash 13
>>>>> | phandle 7e782ed8 slot 18 hash 1b
>>>>> | phandle 7e73ce28 slot 28 hash 39
>>>>> | phandle 7e73d390 slot 10 hash 22
>>>>> | phandle 7e73d9a8 slot 28 hash 1a
>>>>> | phandle 7e73dc28 slot 28 hash 37
>>>>> | phandle 7e73de00 slot 0 hash a
>>>>> | phandle 7e73e028 slot 28 hash 0
>>>>> | phandle 7e7621a8 slot 28 hash 36
>>>>> | phandle 7e73e458 slot 18 hash 1e
>>>>> | phandle 7e73e608 slot 8 hash 1e
>>>>> | phandle 7e740078 slot 38 hash 28
>>>>> | phandle 7e740180 slot 0 hash 1d
>>>>> | phandle 7e740240 slot 0 hash 33
>>>>> | phandle 7e740348 slot 8 hash 29
>>>>> | phandle 7e740410 slot 10 hash 2
>>>>> | phandle 7e740eb0 slot 30 hash 3e
>>>>> | phandle 7e745390 slot 10 hash 33
>>>>> | phandle 7e747b08 slot 8 hash c
>>>>> | phandle 7e748528 slot 28 hash f
>>>>> | phandle 7e74a6e0 slot 20 hash 18
>>>>> | phandle 7e74aab0 slot 30 hash b
>>>>> | phandle 7e74f788 slot 8 hash d
>>>>> | Used entries: 8, hashed: 29
>>>>>
>>>>> So the hash array has 64 entries out which only 8 are populated. Using
>>>>> hash_32() populates 29 entries.
>>>>> Could someone with real hardware verify this?
>>>>> I'm not sure how important this performance wise, it looks just like a
>>>>> waste using only 1/8 of the array.
>>>>
>>>> The hash used is based on the assumptions you noted, and as stated in the
>>>> code, that phandle property values are in a contiguous range of 1..n
>>>> (not starting from zero), which is what dtc generates.
>>>
>>>> We knew that for systems that do not match the assumptions that the hash
>>>> will not be optimal.
>>>
>>> If we're going to have the phandle cache it should at least make some
>>> attempt to work across the systems that Linux supports.
>>>
>>>> Unless there is a serious performance problem for
>>>> such systems, I do not want to make the phandle hash code more complicated
>>>> to optimize for these cases.  And the pseries have been performing ok
>>>> without phandle related performance issues that I remember hearing since
>>>> before the cache was added, which could have only helped the performance.
>>>> Yes, if your observations are correct, some memory is being wasted, but
>>>> a 64 entry cache is not very large on a pseries.
>>>
>>> A single line change to use an actual hash function is hardly
>>> complicating it, compared to the amount of code already there.
>>
>> With a dtc generated devicetree, the hash is perfect, with no
>> misses.  That single line change then makes the hash bad for
>> dtc generated devicetrees.
> 
> To quantify that, I did some tests with the hash algo and it's about a
> 10% collision rate using phandle values of 1-$cache_size. There's
> never more than 2 phandles colliding in a slot. The actual impact
> would be less given 100% thrashing seems unlikely.

Thank you for doing the tests.  Actual data helps a lot.

If there is only a 10% collision rate for this case, that does not
sound bad to me.  There is the possibility of current or future
code resulting in ping-ponging between two phandle values which
collide in the cache, but that should not be the common case.

However, given a choice between two algorithms, one of which
guarantees no thrashing (the current cache algorithm) and one
which allows a pathological use case that results in thrashing,
I prefer the first algorithm.  This may seem theoretical, but
I have seen enough pathological code paths in my years of
performance analysis and tuning to be sensitive to this issue.

> 
>> The cache was added to address a problem with a system with a
>> dtc generated devicetree.
> 
> The cache was added to address a problem with a system with a large
> number of nodes and phandles (814 phandles in the reported case). dtc
> happens to be used in that case.

Yes, you stated that in a more general way than I did.  Agreed.


> 
>> I had not heard of any phandle
>> lookup performance issues on ppc systems.  An imperfect hash
>> for ppc should not make the ppc performance worse (unless
>> every single phandle value hashes to a single bucket).  So
>> the ppc performance is likely better than it was before
>> the hash was added, even without an optimal hash algorithm.
>>
>> So the change would not be a single line change.  It would
>> be a change to use different hash algorithms for dtc
>> generated device trees versus other device trees.
>>
>> Another possibility would be to make the cache be dependent
>> upon not CONFIG_PPC.  It might be possible to disable the
>> cache with a minimal code change.
> 
> I'd rather not do that.

I also agree with that.  I threw that out as kind of a nuclear option.
If ppc did not benefit from the cache having been implemented and
perceived a negative impact from the cache, this is simply a way
to remove that negative impact.  Then ppc would be in the same
state as before we implemented the cache.  Again, I also do not
like such special casing.

> 
> And yes, as mentioned earlier I don't like the complexity. I didn't
> from the start and I'm  I'm still of the opinion we should have a
> fixed or 1 time sized true cache (i.e. smaller than total # of
> phandles). That would solve the RT memory allocation and locking issue
> too.

I could be convinced to go to a one-time-sized cache, especially now
that the RT issue exists.  If we change to that, it would be documented
as a possible overhead impact that devicetree overlays would have to
accept.

-Frank

> 
> For reference, the performance difference between the current
> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
> IMO, 340ms improvement was good enough.
> 
> Rob
>
Frank Rowand Dec. 6, 2019, 2:01 a.m. UTC | #11
On 12/5/19 10:35 AM, Sebastian Andrzej Siewior wrote:
> On 2019-12-03 10:56:35 [-0600], Rob Herring wrote:
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC.  It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm  I'm still of the opinion we should have a
>> fixed or 1 time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
> 
> Okay. So the 814 phandles would result in an array with 1024 slots. That
> would need 4KiB of memory.

Is this amount of memory an issue for this system?

If module support is not configured into the kernel then the cache is
removed and memory freed in a late initcall.  I don't know if that helps
your use case or not.
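
(For illustration, the freeing mentioned above works along these lines --
a sketch only, the internal helper name is hypothetical rather than the
exact in-tree one:

#ifndef CONFIG_MODULES
static int __init of_free_phandle_cache(void)
{
        /* without loadable modules, no more phandle lookups are
         * expected after boot, so drop the cache and its node refs */
        of_free_phandle_cache_internal();	/* hypothetical helper */
        return 0;
}
late_initcall_sync(of_free_phandle_cache);
#endif
)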


> What about we go back to the fix 64 slots array but with hash32 for the
> lookup? Without the hash we would be 60ms slower during boot (compared
> to now, based on ancient data) but then the hash isn't expensive so we
> end up with better coverage of the memory on systems which don't have a
> plain enumeration of the phandle.

That performance data is specific to one particular system.  It does not
generalize to all devicetree based systems.  So don't draw too many
conclusions from it.  If you want to understand the boot performance
impact for your system, you need to measure the alternatives on
your system.

Is there a memory usage issue for the systems that led to this thread?
Unless there is a documented memory issue, I do not want to expand the
issue of poor cache-bucket utilization into the separate issue of cache
size.

-Frank

> 
>> Rob
> 
> Sebastian
>
Segher Boessenkool Dec. 6, 2019, 11:40 p.m. UTC | #12
Hi,

On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote:
> On 12/3/19 12:35 PM, Segher Boessenkool wrote:
> > Btw.  Some OFs mangle the phandles some way, to make it easier to catch
> > people using it as an address (and similarly, mangle ihandles differently,
> > so you catch confusion between ihandles and phandles as well).  Like a
> > simple xor, with some odd number preferably.  You should assume *nothing*
> > about phandles, they are opaque identifiers.
> 
> For arm32 machines that use dtc to generate the devicetree, which is a
> very large user base, we certainly can make assumptions about phandles.

I was talking about OF.  Phandles are explicitly defined to be opaque
tokens.  If there is an extra meaning to them in flattened device trees,
well, then the kernel should depend on that only there, not for more
general phandles.  Where is this documented, btw?

> Especially because the complaints about the overhead of phandle based
> lookups have been voiced by users of this specific set of machines.
> 
> For systems with a devicetree that does not follow the assumptions, the
> phandle cache should not measurably increase the overhead of phandle
> based lookups.

It's an extra memory access and extra code to execute, for not much gain
(if anything), while with a reasonable hash function it will be good
for everyone.

> If you have measurements of a system where implementing the phandle
> cache increased the overhead,

Are you seriously saying you think this code can run in zero time and
space on most systems?

> and the additional overhead is a concern
> (such as significantly increasing boot time) then please share that
> information with us.  Otherwise this is just a theoretical exercise.

The point is that this code could be easily beneficial for most (or all)
users, not just those that use dtc-constructed device trees.  It is
completely obvious that having a worse cache hash function results in
many more lookups.  Whether that results in something expressed as
milliseconds on tiny systems or microseconds on bigger systems is
completely beside the point.


Segher
Frank Rowand Dec. 8, 2019, 4:30 a.m. UTC | #13
On 12/6/19 5:40 PM, Segher Boessenkool wrote:
> Hi,
> 
> On Thu, Dec 05, 2019 at 07:37:24PM -0600, Frank Rowand wrote:
>> On 12/3/19 12:35 PM, Segher Boessenkool wrote:
>>> Btw.  Some OFs mangle the phandles some way, to make it easier to catch
>>> people using it as an address (and similarly, mangle ihandles differently,
>>> so you catch confusion between ihandles and phandles as well).  Like a
>>> simple xor, with some odd number preferably.  You should assume *nothing*
>>> about phandles, they are opaque identifiers.
>>
>> For arm32 machines that use dtc to generate the devicetree, which is a
>> very large user base, we certainly can make assumptions about phandles.
> 
> I was talking about OF.  Phandles are explicitly defined to be opaque
> tokens.  If there is an extra meaning to them in flattened device trees,
> well, the kernel should then only depend on that there, not for more
> general phandles.  Where is this documented btw?

And dtc generated devicetrees are a huge proportion of the OF systems.

It is not documented.

As an aside, overlays also depend upon the current dtc implementation.
If an extremely large value is used for a phandle then overlay
application will fail.


> 
>> Especially because the complaints about the overhead of phandle based
>> lookups have been voiced by users of this specific set of machines.
>>
>> For systems with a devicetree that does not follow the assumptions, the
>> phandle cache should not measurably increase the overhead of phandle
>> based lookups.
> 
> It's an extra memory access and extra code to execute, for not much gain
> (if anything).  While with a reasonable hash function it will be good
> for everyone.
> 
>> If you have measurements of a system where implementing the phandle
>> cache increased the overhead,
> 
> Are you seriously saying you think this code can run in zero time and
> space on most systems?

No.  I made no such claim.  Note the additional words in the following
sentences.


>> and the additional overhead is a concern
>> (such as significantly increasing boot time) then please share that
>> information with us.  Otherwise this is just a theoretical exercise.
> 
> The point is that this code could be easily beneficial for most (or all)
> users, not just those that use dtc-constructed device trees.  It is

The point is that the cache was implemented to solve a specific problem
for certain specific systems.  There had been a few reports of various
machines having the same issue, but finally someone measured a **significant**
improvement in boot time for a specific machine.  The boot time with
the cache was **measured** to be much shorter.  The boot time for all
systems with a dtc generated devicetree is expected to be faster.  No
one responded to the implementation when it was proposed with a
**measurement** that showed increased boot time.  A concern of using
more memory was raised and discussed, with at least one feature added
as a result (freeing the cache in late init if modules are not
enabled).

Being "beneficial for most (or all) users" has to be balanced against
whether the change would remove the benefit for the system that the
feature was originally implemented to solve a problem for.  There
was no performance data supplied to answer this question.  (Though
eventually Rob did some measurements of the impact on hash efficiency
for such a system.)


> completely obvious that having a worse cache hash function results in
> many more lookups.  Whether that results in something expressed as
> milliseconds on tiny systems or microseconds on bigger systems is
> completely beside the point.

There was no performance data accompanying the proposed change that
started this thread.  There was no data showing whether the systems
that this feature was created for would suffer.  There was no data
showing that the boot time of the pseries systems would improve.
There was no assertion made that too much memory was being used
by the cache (there was an implied assertion that a large
percentage of the memory used for the cache was not used, thus
the performance benefit of the cache could be improved by
changing to using a hash instead of mask).  We had rejected
creating a cache for several years until finally some solid
data was provided showing an actual need for it.

It is not a question of "milliseconds on tiny systems or
microseconds on bigger systems".  I agree with that.  But it
does matter whether the performance impact of various
implementations is large enough to either solve a problem
or create a problem.  On the other hand, if the amount of
memory used by the cache is a problem (which is _not_
what was asserted by the patch submitter) then we can have
a conversation about how to resolve that.

-Frank

> 
> 
> Segher
>
Frank Rowand Dec. 8, 2019, 6:59 a.m. UTC | #14
On 12/5/19 7:52 PM, Frank Rowand wrote:
> On 12/3/19 10:56 AM, Rob Herring wrote:
>> On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand <frowand.list@gmail.com> wrote:
>>>
>>> On 12/2/19 10:12 PM, Michael Ellerman wrote:
>>>> Frank Rowand <frowand.list@gmail.com> writes:
>>>>> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
>>>>>> I've been looking at phandle_cache and noticed the following: The raw
>>>>>> phandle value as generated by dtc starts at zero and is incremented by
>>>>>> one for each phandle entry. The qemu pSeries model is using Slof (which
>>>>>> is probably the same thing as used on real hardware) and this looks like
>>>>>> a poiner value for the phandle.
>>>>>> With
>>>>>>     qemu-system-ppc64le -m 16G -machine pseries -smp 8
>>>>>>
>>>>>> I got the following output:
>>>>>> | entries: 64
>>>>>> | phandle 7e732468 slot 28 hash c
>>>>>> | phandle 7e732ad0 slot 10 hash 27
>>>>>> | phandle 7e732ee8 slot 28 hash 3a
>>>>>> | phandle 7e734160 slot 20 hash 36
>>>>>> | phandle 7e734318 slot 18 hash 3a
>>>>>> | phandle 7e734428 slot 28 hash 33
>>>>>> | phandle 7e734538 slot 38 hash 2c
>>>>>> | phandle 7e734850 slot 10 hash e
>>>>>> | phandle 7e735220 slot 20 hash 2d
>>>>>> | phandle 7e735bf0 slot 30 hash d
>>>>>> | phandle 7e7365c0 slot 0 hash 2d
>>>>>> | phandle 7e736f90 slot 10 hash d
>>>>>> | phandle 7e737960 slot 20 hash 2d
>>>>>> | phandle 7e738330 slot 30 hash d
>>>>>> | phandle 7e738d00 slot 0 hash 2d
>>>>>> | phandle 7e739730 slot 30 hash 38
>>>>>> | phandle 7e73bd08 slot 8 hash 17
>>>>>> | phandle 7e73c2e0 slot 20 hash 32
>>>>>> | phandle 7e73c7f8 slot 38 hash 37
>>>>>> | phandle 7e782420 slot 20 hash 13
>>>>>> | phandle 7e782ed8 slot 18 hash 1b
>>>>>> | phandle 7e73ce28 slot 28 hash 39
>>>>>> | phandle 7e73d390 slot 10 hash 22
>>>>>> | phandle 7e73d9a8 slot 28 hash 1a
>>>>>> | phandle 7e73dc28 slot 28 hash 37
>>>>>> | phandle 7e73de00 slot 0 hash a
>>>>>> | phandle 7e73e028 slot 28 hash 0
>>>>>> | phandle 7e7621a8 slot 28 hash 36
>>>>>> | phandle 7e73e458 slot 18 hash 1e
>>>>>> | phandle 7e73e608 slot 8 hash 1e
>>>>>> | phandle 7e740078 slot 38 hash 28
>>>>>> | phandle 7e740180 slot 0 hash 1d
>>>>>> | phandle 7e740240 slot 0 hash 33
>>>>>> | phandle 7e740348 slot 8 hash 29
>>>>>> | phandle 7e740410 slot 10 hash 2
>>>>>> | phandle 7e740eb0 slot 30 hash 3e
>>>>>> | phandle 7e745390 slot 10 hash 33
>>>>>> | phandle 7e747b08 slot 8 hash c
>>>>>> | phandle 7e748528 slot 28 hash f
>>>>>> | phandle 7e74a6e0 slot 20 hash 18
>>>>>> | phandle 7e74aab0 slot 30 hash b
>>>>>> | phandle 7e74f788 slot 8 hash d
>>>>>> | Used entries: 8, hashed: 29
>>>>>>
>>>>>> So the hash array has 64 entries out which only 8 are populated. Using
>>>>>> hash_32() populates 29 entries.
>>>>>> Could someone with real hardware verify this?
>>>>>> I'm not sure how important this performance wise, it looks just like a
>>>>>> waste using only 1/8 of the array.
>>>>>
>>>>> The hash used is based on the assumptions you noted, and as stated in the
>>>>> code, that phandle property values are in a contiguous range of 1..n
>>>>> (not starting from zero), which is what dtc generates.
>>>>
>>>>> We knew that for systems that do not match the assumptions that the hash
>>>>> will not be optimal.
>>>>
>>>> If we're going to have the phandle cache it should at least make some
>>>> attempt to work across the systems that Linux supports.
>>>>
>>>>> Unless there is a serious performance problem for
>>>>> such systems, I do not want to make the phandle hash code more complicated
>>>>> to optimize for these cases.  And the pseries have been performing ok
>>>>> without phandle related performance issues that I remember hearing since
>>>>> before the cache was added, which could have only helped the performance.
>>>>> Yes, if your observations are correct, some memory is being wasted, but
>>>>> a 64 entry cache is not very large on a pseries.
>>>>
>>>> A single line change to use an actual hash function is hardly
>>>> complicating it, compared to the amount of code already there.
>>>
>>> With a dtc generated devicetree, the hash is perfect, with no
>>> misses.  That single line change then makes the hash bad for
>>> dtc generated devicetrees.
>>
>> To quantify that, I did some tests with the hash algo and it's about a
>> 10% collision rate using phandle values of 1-$cache_size. There's
>> never more than 2 phandles colliding in a slot. The actual impact
>> would be less given 100% thrashing seems unlikely.

I also tested a few cache sizes.

For a cache size of 64 entries, I get a 14% collision rate (and also 14% of
entries unused).  For 1024 entries, I get a 12% collision rate.  And
I can confirm no more than two phandles hashing to the same slot.

So essentially confirming your data.
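
For anyone who wants to reproduce these collision counts, a standalone
userspace check can look like the following.  This is only a sketch:
the GOLDEN_RATIO_32 constant and the shift mirror the kernel's generic
hash_32(), and the phandle values are assumed to be the dtc-style
contiguous 1..n range:

#include <stdio.h>
#include <stdint.h>

#define GOLDEN_RATIO_32	0x61C88647U

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
        return (val * GOLDEN_RATIO_32) >> (32 - bits);
}

int main(void)
{
        unsigned int bits = 6;		/* 6 -> 64 slots, 10 -> 1024 */
        unsigned int size = 1U << bits;
        unsigned int slots[1024] = { 0 };
        unsigned int collisions = 0, unused = 0, i;
        uint32_t phandle;

        /* one dtc-style phandle per slot's worth of values */
        for (phandle = 1; phandle <= size; phandle++)
                slots[hash_32(phandle, bits)]++;

        for (i = 0; i < size; i++) {
                if (!slots[i])
                        unused++;
                else if (slots[i] > 1)
                        collisions += slots[i] - 1;
        }

        printf("%u slots: %u collisions, %u unused\n",
               size, collisions, unused);
        return 0;
}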

-Frank

> 
> Thank you for doing the tests.  Actual data helps a lot.
> 
> If there is only a 10% collision rate for this case, that does not
> sound bad to me.  There is the possibility of current or future
> code resulting in ping ponging between two phandle values which
> collide in the cache, but that should not be the common case.
> 
> However, given a choice between two algorithms, one of which
> guarantees no thrashing (the current cache algorithm) and one
> which allows a pathologic use case which results in thrashing,
> I prefer the first algorithm.  This may seem theoretical, but
> I have seen enough pathological code paths in my years of
> performance analysis and tuning to be sensitive to this issue.
> 
>>
>>> The cache was added to address a problem with a system with a
>>> dtc generated devicetree.
>>
>> The cache was added to address a problem with a system with a large
>> number of nodes and phandles (814 phandles in the reported case). dtc
>> happens to be used in that case.
> 
> Yes, you stated that in a more general way than I did.  Agreed.
> 
> 
>>
>>> I had not heard of any phandle
>>> lookup performance issues on ppc systems.  An imperfect hash
>>> for ppc should not make the ppc performance worse (unless
>>> every single phandle value hashes to a single bucket).  So
>>> the ppc performance is likely better than it was before
>>> the hash was added, even without an optimal hash algorithm.
>>>
>>> So the change would not be a single line change.  It would
>>> be a change to use different hash algorithms for dtc
>>> generated device trees versus other device trees.
>>>
>>> Another possibility would be to make the cache be dependent
>>> upon not CONFIG_PPC.  It might be possible to disable the
>>> cache with a minimal code change.
>>
>> I'd rather not do that.
> 
> I also agree with that.  I threw that out as kind of a nuclear option.
> If ppc did not benefit from the cache having been implemented and
> perceived a negative impact from the cache, this is simply a way
> to remove that negative impact.  Then ppc would be in the same
> state as before we implemented the cache.  Again, I also do not
> like such a special casing.
> 
>>
>> And yes, as mentioned earlier I don't like the complexity. I didn't
>> from the start and I'm  I'm still of the opinion we should have a
>> fixed or 1 time sized true cache (i.e. smaller than total # of
>> phandles). That would solve the RT memory allocation and locking issue
>> too.
> 
> I could be convinced to go to a one time sized cache, especially now
> that the RT issue exists.  If we change to that, it would be documented
> as a possible overhead impact that devicetree overlays would have to
> accept.
> 
> -Frank
> 
>>
>> For reference, the performance difference between the current
>> implementation (assuming fixes haven't regressed it) was ~400ms vs. a
>> ~340ms improvement with a 64 entry cache (using a mask, not a hash).
>> IMO, 340ms improvement was good enough.
>>
>> Rob
>>
> 
>
Sebastian Andrzej Siewior Dec. 9, 2019, 1:35 p.m. UTC | #15
On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> Is there a memory usage issue for the systems that led to this thread?

No, no memory issue led to this thread. I was just testing my patch and
I assumed that I did something wrong in the counting/lock drop/lock
acquire/allocate path because the array was hardly used. So I started to
look deeper…
Once I figured out everything was fine, I was curious whether everyone was
aware of the different phandle creation by dtc vs POWER. And I posted
the mail in the thread.
Once you confirmed that everything is "known / not an issue" I was ready
to take off [0].

Later more replies came in, such as one mail [1] from Rob describing the
original reason: a system with 814 phandles. _Here_ I was just surprised
that 1024 entries were used instead of 64 for a benefit of 60ms. I
understand that this is of low concern for you because that memory is
released if modules are not enabled; however, I usually see module
support left enabled.

However, Rob suggested / asked about the fixed size array (this is how I
understood it):
|And yes, as mentioned earlier I don't like the complexity. I didn't
|from the start and I'm  I'm still of the opinion we should have a
|fixed or 1 time sized true cache (i.e. smaller than total # of
|phandles). That would solve the RT memory allocation and locking issue
|too.

so I attempted to ask whether we should have the fixed-size array, maybe
with hash_32() instead of the mask. This would make my other patch
obsolete because the fixed-size array should not have an RT issue. The
hash_32() part would address the POWER issue where the cache is
currently not used efficiently.

If you want instead to keep things as-is then this is okay from my side.
If you want to keep this cache off on POWER then I could contribute a
patch doing so.

[0] https://lore.kernel.org/linux-devicetree/20191202110732.4dvzrro5o6zrlpax@linutronix.de/
[1] https://lore.kernel.org/linux-devicetree/CAL_JsqKieG5=teL7gABPKbJOQfvoS9s-ZPF-=R0yEE_LUoy-Kw@mail.gmail.com/
> -Frank

Sebastian
Rob Herring Dec. 10, 2019, 1:51 a.m. UTC | #16
On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> > Is there a memory usage issue for the systems that led to this thread?
>
> No, no memory issue led to this thread. I was just testing my patch and
> I assumed that I did something wrong in the counting/lock drop/lock
> acquire/allocate path because the array was hardly used. So I started to
> look deeper…
> Once I figured out everything was fine, I was curious if everyone is
> aware of the different phandle creation by dtc vs POWER. And I posted
> the mail in the thread.
> Once you confirmed that everything is "known / not an issue" I was ready
> to take off [0].
>
> Later more replies came in such as one mail [1] from Rob describing the
> original reason with 814 phandles. _Here_ I was just surprised that 1024
> were used over 64 entries for a benefit of 60ms. I understand that this
> is low concern for you because that memory is released if modules are
> not enabled. I usually see that module support is left enabled.
>
> However, Rob suggested / asked about the fixed size array (this is how I
> understood it):
> |And yes, as mentioned earlier I don't like the complexity. I didn't
> |from the start and I'm  I'm still of the opinion we should have a
> |fixed or 1 time sized true cache (i.e. smaller than total # of
> |phandles). That would solve the RT memory allocation and locking issue
> |too.
>
> so I attempted to ask if we should have the fixed size array maybe
> with the hash_32() instead the mask. This would make my other patch
> obsolete because the fixed size array should not have a RT issue. The
> hash_32() part here would address the POWER issue where the cache is
> currently not used efficiently.
>
> If you want instead to keep things as-is then this is okay from my side.
> If you want to keep this cache off on POWER then I could contribute a
> patch doing so.

It turns out there's actually a bug in the current implementation. If
we have multiple phandles with the same masked value, then we leak node
references when we miss in the cache and re-assign the cache entry.
Easily fixed, I suppose, but holding a ref count for a cached entry
seems wrong: it means that no node with a phandle ever has a ref count
of 0.
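
For illustration, the leak is in the miss path of
of_find_node_by_phandle(), paraphrased below with the names used in the
existing drivers/of/base.c code; the minimal kind of fix is to put the
evicted entry before reusing the slot.  This is a sketch, not
necessarily the exact patch that was sent:

	if (!np) {
		for_each_of_allnodes(np)
			if (np->phandle == handle &&
			    !of_node_check_flag(np, OF_DETACHED)) {
				if (phandle_cache) {
					/* drop the old entry's reference
					 * before overwriting the slot,
					 * of_node_put() is NULL-safe */
					of_node_put(phandle_cache[masked_handle]);
					phandle_cache[masked_handle] = np;
					of_node_get(np);
				}
				break;
			}
	}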

I've done some more experiments with the performance. I've come to the
conclusion that just measuring boot time is not a good approach, at least
if the time is not a significant percentage of the total; I couldn't get
any measurable data. I'm using a RK3399 Rock960 board. It has 190
phandles, and I get about 1500 calls to of_find_node_by_phandle() during
boot. Note that about the first 300 are before we have any timekeeping
(the prior measurements also would not account for this). Those calls
see no cache in the current implementation but are cached in my
implementation.

no cache:  20124 us
current cache: 819 us

new cache (64 entry): 4342 us
new cache (128 entry): 2875 us
new cache (256 entry): 1235 us

To get some idea of the time spent before timekeeping is up: if we
take the average miss time of ~17us (20124/1200), then we're spending
about ~5ms on lookups before the cache is enabled. I'd estimate the
new cache takes ~400us before timekeeping is up, as there are 11 early
misses.

From these numbers, it seems the miss rate has a significant impact on
performance for the different sizes. But taken from the original 20+
ms, they all look like good improvement.

Rob
Frank Rowand Dec. 10, 2019, 8:17 a.m. UTC | #17
On 12/9/19 7:51 PM, Rob Herring wrote:
> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
>>> Is there a memory usage issue for the systems that led to this thread?
>>
>> No, no memory issue led to this thread. I was just testing my patch and
>> I assumed that I did something wrong in the counting/lock drop/lock
>> acquire/allocate path because the array was hardly used. So I started to
>> look deeper…
>> Once I figured out everything was fine, I was curious if everyone is
>> aware of the different phandle creation by dtc vs POWER. And I posted
>> the mail in the thread.
>> Once you confirmed that everything is "known / not an issue" I was ready
>> to take off [0].
>>
>> Later more replies came in such as one mail [1] from Rob describing the
>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>> were used over 64 entries for a benefit of 60ms. I understand that this
>> is low concern for you because that memory is released if modules are
>> not enabled. I usually see that module support is left enabled.
>>
>> However, Rob suggested / asked about the fixed size array (this is how I
>> understood it):
>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>> |from the start and I'm  I'm still of the opinion we should have a
>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>> |phandles). That would solve the RT memory allocation and locking issue
>> |too.
>>
>> so I attempted to ask if we should have the fixed size array maybe
>> with the hash_32() instead the mask. This would make my other patch
>> obsolete because the fixed size array should not have a RT issue. The
>> hash_32() part here would address the POWER issue where the cache is
>> currently not used efficiently.
>>
>> If you want instead to keep things as-is then this is okay from my side.
>> If you want to keep this cache off on POWER then I could contribute a
>> patch doing so.
> 
> It turns out there's actually a bug in the current implementation. If
> we have multiple phandles with the same mask, then we leak node
> references if we miss in the cache and re-assign the cache entry.

Aaargh.  Patch sent.

> Easily fixed I suppose, but holding a ref count for a cached entry
> seems wrong. That means we never have a ref count of 0 on every node
> with a phandle.

It will go to zero when the cache is freed.

My memory is that we free the cache as part of removing an overlay.  I'll
verify whether my memory is correct.

-Frank


> 
> I've done some more experiments with the performance. I've come to the
> conclusion that just measuring boot time is not a good way at least if
> the time is not a significant percentage of the total. I couldn't get
> any measurable data. I'm using a RK3399 Rock960 board. It has 190
> phandles. I get about 1500 calls to of_find_node_by_phandle() during
> boot. Note that about the first 300 are before we have any timekeeping
> (the prior measurements also would not account for this). Those calls
> have no cache in the current implementation and are cached in my
> implementation.
> 
> no cache:  20124 us
> current cache: 819 us
> 
> new cache (64 entry): 4342 us
> new cache (128 entry): 2875 us
> new cache (256 entry): 1235 us
> 
> To get some idea on the time spent before timekeeping is up, if we
> take the avg miss time is ~17us (20124/1200), then we're spending
> about ~5ms on lookups before the cache is enabled. I'd estimate the
> new cache takes ~400us before timekeeping is up as there's 11 misses
> early.
> 
> From these numbers, it seems the miss rate has a significant impact on
> performance for the different sizes. But taken from the original 20+
> ms, they all look like good improvement.
> 
> Rob
>
Frank Rowand Dec. 10, 2019, 12:46 p.m. UTC | #18
On 12/10/19 2:17 AM, Frank Rowand wrote:
> On 12/9/19 7:51 PM, Rob Herring wrote:
>> On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
>> <bigeasy@linutronix.de> wrote:
>>>
>>> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
>>>> Is there a memory usage issue for the systems that led to this thread?
>>>
>>> No, no memory issue led to this thread. I was just testing my patch and
>>> I assumed that I did something wrong in the counting/lock drop/lock
>>> acquire/allocate path because the array was hardly used. So I started to
>>> look deeper…
>>> Once I figured out everything was fine, I was curious if everyone is
>>> aware of the different phandle creation by dtc vs POWER. And I posted
>>> the mail in the thread.
>>> Once you confirmed that everything is "known / not an issue" I was ready
>>> to take off [0].
>>>
>>> Later more replies came in such as one mail [1] from Rob describing the
>>> original reason with 814 phandles. _Here_ I was just surprised that 1024
>>> were used over 64 entries for a benefit of 60ms. I understand that this
>>> is low concern for you because that memory is released if modules are
>>> not enabled. I usually see that module support is left enabled.
>>>
>>> However, Rob suggested / asked about the fixed size array (this is how I
>>> understood it):
>>> |And yes, as mentioned earlier I don't like the complexity. I didn't
>>> |from the start and I'm  I'm still of the opinion we should have a
>>> |fixed or 1 time sized true cache (i.e. smaller than total # of
>>> |phandles). That would solve the RT memory allocation and locking issue
>>> |too.
>>>
>>> so I attempted to ask if we should have the fixed size array maybe
>>> with the hash_32() instead the mask. This would make my other patch
>>> obsolete because the fixed size array should not have a RT issue. The
>>> hash_32() part here would address the POWER issue where the cache is
>>> currently not used efficiently.
>>>
>>> If you want instead to keep things as-is then this is okay from my side.
>>> If you want to keep this cache off on POWER then I could contribute a
>>> patch doing so.
>>
>> It turns out there's actually a bug in the current implementation. If
>> we have multiple phandles with the same mask, then we leak node
>> references if we miss in the cache and re-assign the cache entry.
> 
> Aaargh.  Patch sent.
> 
>> Easily fixed I suppose, but holding a ref count for a cached entry
>> seems wrong. That means we never have a ref count of 0 on every node
>> with a phandle.
> 
> It will go to zero when the cache is freed.
> 
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

And I'll look at non-overlay use of dynamic devicetree too.

-Frank

> 
> -Frank
> 
> 
>>
>> I've done some more experiments with the performance. I've come to the
>> conclusion that just measuring boot time is not a good way at least if
>> the time is not a significant percentage of the total. I couldn't get
>> any measurable data. I'm using a RK3399 Rock960 board. It has 190
>> phandles. I get about 1500 calls to of_find_node_by_phandle() during
>> boot. Note that about the first 300 are before we have any timekeeping
>> (the prior measurements also would not account for this). Those calls
>> have no cache in the current implementation and are cached in my
>> implementation.
>>
>> no cache:  20124 us
>> current cache: 819 us
>>
>> new cache (64 entry): 4342 us
>> new cache (128 entry): 2875 us
>> new cache (256 entry): 1235 us
>>
>> To get some idea on the time spent before timekeeping is up, if we
>> take the avg miss time is ~17us (20124/1200), then we're spending
>> about ~5ms on lookups before the cache is enabled. I'd estimate the
>> new cache takes ~400us before timekeeping is up as there's 11 misses
>> early.
>>
>> From these numbers, it seems the miss rate has a significant impact on
>> performance for the different sizes. But taken from the original 20+
>> ms, they all look like good improvement.
>>
>> Rob
>>
> 
>
Rob Herring Dec. 11, 2019, 2:42 p.m. UTC | #19
On Tue, Dec 10, 2019 at 2:17 AM Frank Rowand <frowand.list@gmail.com> wrote:
>
> On 12/9/19 7:51 PM, Rob Herring wrote:
> > On Mon, Dec 9, 2019 at 7:35 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2019-12-05 20:01:41 [-0600], Frank Rowand wrote:
> >>> Is there a memory usage issue for the systems that led to this thread?
> >>
> >> No, no memory issue led to this thread. I was just testing my patch and
> >> I assumed that I did something wrong in the counting/lock drop/lock
> >> acquire/allocate path because the array was hardly used. So I started to
> >> look deeper…
> >> Once I figured out everything was fine, I was curious if everyone is
> >> aware of the different phandle creation by dtc vs POWER. And I posted
> >> the mail in the thread.
> >> Once you confirmed that everything is "known / not an issue" I was ready
> >> to take off [0].
> >>
> >> Later more replies came in such as one mail [1] from Rob describing the
> >> original reason with 814 phandles. _Here_ I was just surprised that 1024
> >> were used over 64 entries for a benefit of 60ms. I understand that this
> >> is low concern for you because that memory is released if modules are
> >> not enabled. I usually see that module support is left enabled.
> >>
> >> However, Rob suggested / asked about the fixed size array (this is how I
> >> understood it):
> >> |And yes, as mentioned earlier I don't like the complexity. I didn't
> >> |from the start and I'm  I'm still of the opinion we should have a
> >> |fixed or 1 time sized true cache (i.e. smaller than total # of
> >> |phandles). That would solve the RT memory allocation and locking issue
> >> |too.
> >>
> >> so I attempted to ask if we should have the fixed size array maybe
> >> with the hash_32() instead the mask. This would make my other patch
> >> obsolete because the fixed size array should not have a RT issue. The
> >> hash_32() part here would address the POWER issue where the cache is
> >> currently not used efficiently.
> >>
> >> If you want instead to keep things as-is then this is okay from my side.
> >> If you want to keep this cache off on POWER then I could contribute a
> >> patch doing so.
> >
> > It turns out there's actually a bug in the current implementation. If
> > we have multiple phandles with the same mask, then we leak node
> > references if we miss in the cache and re-assign the cache entry.
>
> Aaargh.  Patch sent.
>
> > Easily fixed I suppose, but holding a ref count for a cached entry
> > seems wrong. That means we never have a ref count of 0 on every node
> > with a phandle.
>
> It will go to zero when the cache is freed.
>
> My memory is that we free the cache as part of removing an overlay.  I'll
> verify whether my memory is correct.

Yes, as part of having entries for every phandle we release and
realloc when the number of phandles changes. If the size is fixed, then we
can stop doing that. We only need to remove entries in
of_detach_node(), as that should always happen before nodes are
removed, right?
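
A sketch of that detach-time invalidation could be as small as the
following (the helper and hash names are illustrative, matching the
fixed-size sketch earlier in the thread; it would be called with
devtree_lock held from the detach path):

static void __of_phandle_cache_inv_entry(phandle handle)
{
        struct device_node *np;
        u32 slot;

        if (!handle)
                return;

        slot = of_phandle_cache_hash(handle);
        np = phandle_cache[slot];
        /* only clear the slot if it still holds the detached node */
        if (np && np->phandle == handle)
                phandle_cache[slot] = NULL;
}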

Rob
diff mbox series

Patch

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 1d667eb730e19..2640d4bc81a9a 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -197,6 +197,7 @@  void of_populate_phandle_cache(void)
 	u32 cache_entries;
 	struct device_node *np;
 	u32 phandles = 0;
+	struct device_node **cache2;
 
 	raw_spin_lock_irqsave(&devtree_lock, flags);
 
@@ -214,14 +215,32 @@  void of_populate_phandle_cache(void)
 
 	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
 				GFP_ATOMIC);
+	cache2 = kcalloc(cache_entries, sizeof(*phandle_cache), GFP_ATOMIC);
 	if (!phandle_cache)
 		goto out;
 
+	pr_err("%s(%d) entries: %d\n", __func__, __LINE__, cache_entries);
 	for_each_of_allnodes(np)
 		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
+			int slot;
 			of_node_get(np);
 			phandle_cache[np->phandle & phandle_cache_mask] = np;
+			slot = hash_32(np->phandle, __ffs(cache_entries));
+			cache2[slot] = np;
+			pr_err("%s(%d) phandle %x slot %x hash %x\n", __func__, __LINE__,
+			       np->phandle, np->phandle & phandle_cache_mask, slot);
 		}
+	{
+		int i, filled = 0, filled_hash = 0;
+
+		for (i = 0; i < cache_entries; i++) {
+			if (phandle_cache[i])
+				filled++;
+			if (cache2[i])
+				filled_hash++;
+		}
+		pr_err("%s(%d) Used entries: %d, hashed: %d\n", __func__, __LINE__, filled, filled_hash);
+	}
 
 out:
 	raw_spin_unlock_irqrestore(&devtree_lock, flags);