diff mbox series

[RESEND,04/18] i386/cpu: Fix number of addressable IDs in CPUID.04H

Message ID 20230213093625.158170-5-zhao1.liu@linux.intel.com
State New
Series: Support smp.clusters for x86

Commit Message

Zhao Liu Feb. 13, 2023, 9:36 a.m. UTC
From: Zhao Liu <zhao1.liu@intel.com>

For the i-cache and d-cache, the maximum IDs for CPUs sharing cache
(CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
both 0, which means the i-cache and d-cache are shared at the SMT level.
This is correct if there is a single thread per core, but is wrong for
the hyper-threading case (one core contains multiple threads), since the
i-cache and d-cache are shared at the core level rather than at the SMT
level.

Therefore, in order to be compatible with both the multi-threaded and
single-threaded cases, we should set the i-cache and d-cache to be
shared at the core level by default.

Referring to the fixes for cache_info_passthrough ([1], [2]) and to the
SDM, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use
the nearest power-of-2 integer.

The nearest power-of-2 integer can be calculated by pow2ceil() or by
using the APIC ID offset (like the L3 topology using 1 << die_offset [3]).

But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]
are associated with the APIC ID. For example, in the Linux kernel, the
field "num_threads_sharing" (bits 25:14) is parsed from the APIC ID. As
another example, on Alder Lake-P, CPUID.04H:EAX[bits 31:26] does not
match the actual core count and is calculated as
"(1 << (pkg_offset - core_offset)) - 1".

Therefore, the APIC ID offsets should be preferred for calculating the
nearest power-of-2 integer for CPUID.04H:EAX[bits 25:14] and
CPUID.04H:EAX[bits 31:26]:
1. The d/i-cache is shared in a core, so 1 << core_offset should be used
   instead of "1" in encode_cache_cpuid4() for CPUID.04H.00H:EAX[bits
   25:14] and CPUID.04H.01H:EAX[bits 25:14].
2. The L2 cache is supposed to be shared in a core for now, so
   1 << core_offset should also be used instead of "cs->nr_threads" in
   encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14].
3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be
   derived from the offsets above the SMT level in the APIC ID.

And since [1] and [2] are good enough to make cache_info_passthrough
work well, their uses of pow2ceil() are sufficient, so they do not need
to be replaced by the APIC ID offset approach.

[1]: efb3934 (x86: cpu: make sure number of addressable IDs for processor cores meets the spec)
[2]: d7caf13 (x86: cpu: fixup number of addressable IDs for logical processors sharing cache)
[3]: d65af28 (i386: Update new x86_apicid parsing rules with die_offset support)

Fixes: 7e3482f (i386: Helpers to encode cache information consistently)
Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
---
 target/i386/cpu.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

Comments

Zhao Liu Feb. 15, 2023, 2:33 p.m. UTC | #1
On Wed, Feb 15, 2023 at 06:11:23PM +0800, wangyanan (Y) wrote:
> Date: Wed, 15 Feb 2023 18:11:23 +0800
> From: "wangyanan (Y)" <wangyanan55@huawei.com>
> Subject: Re: [PATCH RESEND 04/18] i386/cpu: Fix number of addressable IDs
>  in CPUID.04H
> 
> Hi Zhao,
> 
> On 2023/2/13 17:36, Zhao Liu wrote:
> > From: Zhao Liu <zhao1.liu@intel.com>
> > 
> > For the i-cache and d-cache, the maximum IDs for CPUs sharing cache
> > (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
> > both 0, which means the i-cache and d-cache are shared at the SMT level.
> > This is correct if there is a single thread per core, but is wrong for
> > the hyper-threading case (one core contains multiple threads), since the
> > i-cache and d-cache are shared at the core level rather than at the SMT
> > level.
> > 
> > Therefore, in order to be compatible with both the multi-threaded and
> > single-threaded cases, we should set the i-cache and d-cache to be
> > shared at the core level by default.
> > 
> > Referring to the fixes for cache_info_passthrough ([1], [2]) and to the
> > SDM, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26] should use
> > the nearest power-of-2 integer.
> > 
> > The nearest power-of-2 integer can be calculated by pow2ceil() or by
> > using the APIC ID offset (like the L3 topology using 1 << die_offset [3]).
> > 
> > But in fact, CPUID.04H:EAX[bits 25:14] and CPUID.04H:EAX[bits 31:26]
> > are associated with the APIC ID. For example, in the Linux kernel, the
> > field "num_threads_sharing" (bits 25:14) is parsed from the APIC ID. As
> > another example, on Alder Lake-P, CPUID.04H:EAX[bits 31:26] does not
> > match the actual core count and is calculated as
> > "(1 << (pkg_offset - core_offset)) - 1".
> > 
> > Therefore, the APIC ID offsets should be preferred for calculating the
> > nearest power-of-2 integer for CPUID.04H:EAX[bits 25:14] and
> > CPUID.04H:EAX[bits 31:26]:
> > 1. The d/i-cache is shared in a core, so 1 << core_offset should be
> >     used instead of "1" in encode_cache_cpuid4() for
> >     CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14].
> > 2. The L2 cache is supposed to be shared in a core for now, so
> >     1 << core_offset should also be used instead of "cs->nr_threads" in
> >     encode_cache_cpuid4() for CPUID.04H.02H:EAX[bits 25:14].
> > 3. Similarly, the value for CPUID.04H:EAX[bits 31:26] should also be
> >     derived from the offsets above the SMT level in the APIC ID.
> > 
> > And since [1] and [2] are good enough to make cache_info_passthrough
> > work well, their uses of pow2ceil() are sufficient, so they do not need
> > to be replaced by the APIC ID offset approach.
> If you uniformly tweak these two places with the APIC ID offset too, then
> you can also use the more spec-compliant helpers
> (e.g. max_processor_ids_for_cache and max_core_ids_in_pkg) here in
> future patch #18. Would it be best to unify the code?

Yes, it makes sense! Will also do that. Thanks!

> 
> Thanks,
> Yanan
> > [1]: efb3934 (x86: cpu: make sure number of addressable IDs for processor cores meets the spec)
> > [2]: d7caf13 (x86: cpu: fixup number of addressable IDs for logical processors sharing cache)
> > [3]: d65af28 (i386: Update new x86_apicid parsing rules with die_offset support)
> > 
> > Fixes: 7e3482f (i386: Helpers to encode cache information consistently)
> > Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
> > Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
> > ---
> >   target/i386/cpu.c | 20 +++++++++++++++-----
> >   1 file changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index 29afec12c281..7833505092d8 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -5212,7 +5212,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >   {
> >       X86CPU *cpu = env_archcpu(env);
> >       CPUState *cs = env_cpu(env);
> > -    uint32_t die_offset;
> >       uint32_t limit;
> >       uint32_t signature[3];
> >       X86CPUTopoInfo topo_info;
> > @@ -5308,27 +5307,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
> >               *eax = *ebx = *ecx = *edx = 0;
> >           } else {
> >               *eax = 0;
> > +            int addressable_cores_offset = apicid_pkg_offset(&topo_info) -
> > +                                           apicid_core_offset(&topo_info);
> > +            int core_offset, die_offset;
> > +
> >               switch (count) {
> >               case 0: /* L1 dcache info */
> > +                core_offset = apicid_core_offset(&topo_info);
> >                   encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
> > -                                    1, cs->nr_cores,
> > +                                    (1 << core_offset),
> > +                                    (1 << addressable_cores_offset),
> >                                       eax, ebx, ecx, edx);
> >                   break;
> >               case 1: /* L1 icache info */
> > +                core_offset = apicid_core_offset(&topo_info);
> >                   encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
> > -                                    1, cs->nr_cores,
> > +                                    (1 << core_offset),
> > +                                    (1 << addressable_cores_offset),
> >                                       eax, ebx, ecx, edx);
> >                   break;
> >               case 2: /* L2 cache info */
> > +                core_offset = apicid_core_offset(&topo_info);
> >                   encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
> > -                                    cs->nr_threads, cs->nr_cores,
> > +                                    (1 << core_offset),
> > +                                    (1 << addressable_cores_offset),
> >                                       eax, ebx, ecx, edx);
> >                   break;
> >               case 3: /* L3 cache info */
> >                   die_offset = apicid_die_offset(&topo_info);
> >                   if (cpu->enable_l3_cache) {
> >                       encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
> > -                                        (1 << die_offset), cs->nr_cores,
> > +                                        (1 << die_offset),
> > +                                        (1 << addressable_cores_offset),
> >                                           eax, ebx, ecx, edx);
> >                       break;
> >                   }
>
Xiaoyao Li Feb. 20, 2023, 6:59 a.m. UTC | #2
On 2/13/2023 5:36 PM, Zhao Liu wrote:
> For the i-cache and d-cache, the maximum IDs for CPUs sharing cache
> (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
> both 0, which means the i-cache and d-cache are shared at the SMT level.
> This is correct if there is a single thread per core, but is wrong for
> the hyper-threading case (one core contains multiple threads), since the
> i-cache and d-cache are shared at the core level rather than at the SMT
> level.
> 
> Therefore, in order to be compatible with both the multi-threaded and
> single-threaded cases, we should set the i-cache and d-cache to be
> shared at the core level by default.

It's true for a VM only when the exact HW topology is configured for the
VM, i.e., the two virtual LPs of one virtual core are pinned to two
physical LPs of one physical core. Otherwise it's incorrect for the VM.

For example, given a VM with 4 threads and 2 cores: if the 4 threads are
not pinned to 4 physical LPs on 2 cores, each vCPU is likely running on
an LP of a different physical core, so no vCPUs share the L1i/L1d cache
at the core level.
Zhao Liu Feb. 22, 2023, 6:37 a.m. UTC | #3
Hi Xiaoyao,

Thanks, I've spent some time thinking about it here.

On Mon, Feb 20, 2023 at 02:59:20PM +0800, Xiaoyao Li wrote:
> Date: Mon, 20 Feb 2023 14:59:20 +0800
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> Subject: Re: [PATCH RESEND 04/18] i386/cpu: Fix number of addressable IDs
>  in CPUID.04H
> 
> On 2/13/2023 5:36 PM, Zhao Liu wrote:
> > For the i-cache and d-cache, the maximum IDs for CPUs sharing cache
> > (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
> > both 0, which means the i-cache and d-cache are shared at the SMT level.
> > This is correct if there is a single thread per core, but is wrong for
> > the hyper-threading case (one core contains multiple threads), since the
> > i-cache and d-cache are shared at the core level rather than at the SMT
> > level.
> > 
> > Therefore, in order to be compatible with both the multi-threaded and
> > single-threaded cases, we should set the i-cache and d-cache to be
> > shared at the core level by default.
> 
> It's true for a VM only when the exact HW topology is configured for the
> VM, i.e., the two virtual LPs of one virtual core are pinned to two
> physical LPs of one physical core.

Yes, in this case the host and guest have the same topology, and their
topologies can match.

> Otherwise it's incorrect for the VM.

My understanding here is that what we do in QEMU is to create a
self-consistent CPU topology and cache topology for the guest.

If the VM topology is self-consistent and emulated to be almost
identical to a real machine, then the emulation in QEMU is correct,
right? ;-)

> 
> For example, given a VM with 4 threads and 2 cores: if the 4 threads are
> not pinned to 4 physical LPs on 2 cores, each vCPU is likely running on
> an LP of a different physical core.

Thanks for bringing this up; this is worth discussing.

I looked into it and found that the specific scheduling policy for the
vCPU actually depends on the host's settings. For example, (IIUC) if the
host enables core scheduling, then the host will schedule the vCPUs on
SMT threads of the same core.

Also, to explore the original purpose of the "per thread" i/d cache
topology, I have retraced its history.

The related commit is from '09: 400281a (set CPUID bits to present
cores and threads topology). That commit added the multithreading cache
topology to CPUID.04H. In particular, it set the L2 cache level to per
core, but it did not change the level of the L1 (i/d cache); that is,
L1 still remained per thread.

I think this is the problem: L1 should also be per core in the
multithreading case. (So the fix in this patch is worth it?)

Another thing we can refer to is that AMD's i/d cache topology is per
core rather than per thread (in a different CPUID leaf than Intel's): in
encode_cache_cpuid8000001d() (target/i386/cpu.c), the i/d caches and L2
are encoded as core level in EAX. They presumably set up the per-core
level to emulate the L1 topology of the real machine as well.

So, I guess this example is "unintentionally" benefiting from the
"per thread" level of the i/d cache.

What do you think?

> So no vcpu shares L1i/L1d cache at core level.

Yes. The scheduling on the host is not guaranteed, and workload-balancing
policies in various scenarios and some security mitigations may break
the delicate balance we have carefully set up.

Perhaps another way is to also add a new property "x-l1-cache-topo" (like
[1] did for L2) that can adjust the i/d cache level from core to SMT to
benefit cases like this.

[1]: https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg03201.html

Thanks,
Zhao
Xiaoyao Li Feb. 23, 2023, 3:52 a.m. UTC | #4
On 2/22/2023 2:37 PM, Zhao Liu wrote:
> Hi Xiaoyao,
> 
> Thanks, I've spent some time thinking about it here.
> 
> On Mon, Feb 20, 2023 at 02:59:20PM +0800, Xiaoyao Li wrote:
>> Date: Mon, 20 Feb 2023 14:59:20 +0800
>> From: Xiaoyao Li <xiaoyao.li@intel.com>
>> Subject: Re: [PATCH RESEND 04/18] i386/cpu: Fix number of addressable IDs
>>   in CPUID.04H
>>
>> On 2/13/2023 5:36 PM, Zhao Liu wrote:
>>> For the i-cache and d-cache, the maximum IDs for CPUs sharing cache
>>> (CPUID.04H.00H:EAX[bits 25:14] and CPUID.04H.01H:EAX[bits 25:14]) are
>>> both 0, which means the i-cache and d-cache are shared at the SMT level.
>>> This is correct if there is a single thread per core, but is wrong for
>>> the hyper-threading case (one core contains multiple threads), since the
>>> i-cache and d-cache are shared at the core level rather than at the SMT
>>> level.
>>>
>>> Therefore, in order to be compatible with both the multi-threaded and
>>> single-threaded cases, we should set the i-cache and d-cache to be
>>> shared at the core level by default.
>>
>> It's true for a VM only when the exact HW topology is configured for the
>> VM, i.e., the two virtual LPs of one virtual core are pinned to two
>> physical LPs of one physical core.
> 
> Yes, in this case the host and guest have the same topology, and their
> topologies can match.
> 
>> Otherwise it's incorrect for the VM.
> 
> My understanding here is that what we do in QEMU is to create a
> self-consistent CPU topology and cache topology for the guest.
> 
> If the VM topology is self-consistent and emulated to be almost
> identical to a real machine, then the emulation in QEMU is correct,
> right? ;-)

A real machine tells software, via CPUID, that two threads in the same
core share the L1 cache because that is a fact: it is exactly how the
hardware resources are laid out. However, for a VM, when you tell the
guest the same thing (two threads share the L1 cache), is it true for
the vCPUs?

The target is to emulate things correctly, not to emulate them
identically to a real machine. In fact, for these shared resources, it
is mostly infeasible to emulate them correctly without pinning vCPUs to
physical LPs.

>>
>> For example, given a VM with 4 threads and 2 cores: if the 4 threads are
>> not pinned to 4 physical LPs on 2 cores, each vCPU is likely running on
>> an LP of a different physical core.
> 
> Thanks for bringing this up, this is worth discussing.
> 
> I looked into it and found that the specific scheduling policy for the
> vCPU actually depends on the host's settings. For example, (IIUC) if the
> host enables core scheduling, then the host will schedule the vCPUs on
> SMT threads of the same core.
> 
> Also, to explore the original purpose of the "per thread" i/d cache
> topology, I have retraced its history.
> 
> The related commit is from '09: 400281a (set CPUID bits to present
> cores and threads topology). That commit added the multithreading cache
> topology to CPUID.04H. In particular, it set the L2 cache level to per
> core, but it did not change the level of the L1 (i/d cache); that is,
> L1 still remained per thread.
> 
> I think this is the problem: L1 should also be per core in the
> multithreading case. (So the fix in this patch is worth it?)
> 
> Another thing we can refer to is that AMD's i/d cache topology is per
> core rather than per thread (in a different CPUID leaf than Intel's): in
> encode_cache_cpuid8000001d() (target/i386/cpu.c), the i/d caches and L2
> are encoded as core level in EAX. They presumably set up the per-core
> level to emulate the L1 topology of the real machine as well.
> 
> So, I guess this example is "unintentionally" benefiting from the
> "per thread" level of the i/d cache.
> 
> What do you think?
> 
>> So no vcpu shares L1i/L1d cache at core level.
> 
> Yes. The scheduling on the host is not guaranteed, and workload-balancing
> policies in various scenarios and some security mitigations may break
> the delicate balance we have carefully set up.
> 
> Perhaps another way is to also add a new property "x-l1-cache-topo" (like
> [1] did for L2) that can adjust the i/d cache level from core to SMT to
> benefit cases like this.
> 
> [1]: https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg03201.html
> 
> Thanks,
> Zhao
>

Patch

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 29afec12c281..7833505092d8 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5212,7 +5212,6 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
 {
     X86CPU *cpu = env_archcpu(env);
     CPUState *cs = env_cpu(env);
-    uint32_t die_offset;
     uint32_t limit;
     uint32_t signature[3];
     X86CPUTopoInfo topo_info;
@@ -5308,27 +5307,38 @@  void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
             *eax = *ebx = *ecx = *edx = 0;
         } else {
             *eax = 0;
+            int addressable_cores_offset = apicid_pkg_offset(&topo_info) -
+                                           apicid_core_offset(&topo_info);
+            int core_offset, die_offset;
+
             switch (count) {
             case 0: /* L1 dcache info */
+                core_offset = apicid_core_offset(&topo_info);
                 encode_cache_cpuid4(env->cache_info_cpuid4.l1d_cache,
-                                    1, cs->nr_cores,
+                                    (1 << core_offset),
+                                    (1 << addressable_cores_offset),
                                     eax, ebx, ecx, edx);
                 break;
             case 1: /* L1 icache info */
+                core_offset = apicid_core_offset(&topo_info);
                 encode_cache_cpuid4(env->cache_info_cpuid4.l1i_cache,
-                                    1, cs->nr_cores,
+                                    (1 << core_offset),
+                                    (1 << addressable_cores_offset),
                                     eax, ebx, ecx, edx);
                 break;
             case 2: /* L2 cache info */
+                core_offset = apicid_core_offset(&topo_info);
                 encode_cache_cpuid4(env->cache_info_cpuid4.l2_cache,
-                                    cs->nr_threads, cs->nr_cores,
+                                    (1 << core_offset),
+                                    (1 << addressable_cores_offset),
                                     eax, ebx, ecx, edx);
                 break;
             case 3: /* L3 cache info */
                 die_offset = apicid_die_offset(&topo_info);
                 if (cpu->enable_l3_cache) {
                     encode_cache_cpuid4(env->cache_info_cpuid4.l3_cache,
-                                        (1 << die_offset), cs->nr_cores,
+                                        (1 << die_offset),
+                                        (1 << addressable_cores_offset),
                                         eax, ebx, ecx, edx);
                     break;
                 }