Message ID | 20211101142008.18228-1-tim.gardner@canonical.com |
---|---|
State | New |
Headers | show |
Series | [focal:linux-azure] Drivers: hv: vmbus: Fix duplicate CPU assignments within a device | expand |
On 01.11.21 15:20, Tim Gardner wrote: > From: Haiyang Zhang <haiyangz@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1937078 > > commit 7c9ff3deeee61b253715dcf968a6307af148c9b2 upstream > > The vmbus module uses a rotational algorithm to assign target CPUs to > a device's channels. Depending on the timing of different device's channel > offers, different channels of a device may be assigned to the same CPU. > > For example on a VM with 2 CPUs, if NIC A and B's channels are offered > in the following order, NIC A will have both channels on CPU0, and > NIC B will have both channels on CPU1 -- see below. This kind of > assignment causes RSS load that is spreading across different channels > to end up on the same CPU. > > Timing of channel offers: > NIC A channel 0 > NIC B channel 0 > NIC A channel 1 > NIC B channel 1 > > VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd} > Sysfs path: /sys/bus/vmbus/devices/cab064cd-1f31-47d5-a8b4-9d57e320cccd > Rel_ID=14, target_cpu=0 > Rel_ID=17, target_cpu=0 > > VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea} > Sysfs path: /sys/bus/vmbus/devices/244225ca-743e-4020-a17d-d7baa13d6cea > Rel_ID=16, target_cpu=1 > Rel_ID=18, target_cpu=1 > > Update the vmbus CPU assignment algorithm to avoid duplicate CPU > assignments within a device. > > The new algorithm iterates num_online_cpus + 1 times. > The existing rotational algorithm to find "next NUMA & CPU" is still here. > But if the resulting CPU is already used by the same device, it will try > the next CPU. > In the last iteration, it assigns the channel to the next available CPU > like the existing algorithm. This is not normally expected, because > during device probe, we limit the number of channels of a device to > be <= number of online CPUs. > > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Tested-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/1626459673-17420-1-git-send-email-haiyangz@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (backported from commit 7c9ff3deeee61b253715dcf968a6307af148c9b2) > Signed-off-by: Tim Gardner <tim.gardner@canonical.com> > [ rtg - backported by Haiyang Zhang and has been > verified to fix the bug on A2 and D32 instance types. ] Acked-by: Stefan Bader <stefan.bader@canonical.com> > --- > > v2 - v1 of this patch caused boot hangs on certain instance types and was dropped from > the SRU cycle (https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1937078/comments/6). > This backport was developed by the original author who likely has a better handle on > things then I do. From SF, "Haiyang backported the patch to azure-5.4-next (attached). He > also reproduced the bug on 5.4 kernel with VM size A2 and D32, and verified the patch fixes the bug." > > --- > drivers/hv/channel_mgmt.c | 33 +++++++++++++++++++++++++++++++++ > 1 file changed, 33 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 2b5d20965b1bf..ceac42e48f3a2 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -628,6 +628,32 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) > queue_work(wq, &newchannel->add_channel_work); > } > > +/* > + * Check if CPU is used by other channels of the same device. > + * It should only be called by init_vp_index(). > + */ > +static bool hv_cpuself_used(u32 cpu, struct vmbus_channel *chn) > +{ > + struct vmbus_channel *primary = chn->primary_channel; > + struct vmbus_channel *sc; > + bool self_used = false; > + unsigned long flags; > + > + if (!primary) > + return false; > + > + if (primary->target_cpu == cpu) > + return true; > + > + spin_lock_irqsave(&primary->lock, flags); > + list_for_each_entry(sc, &primary->sc_list, sc_list) > + if (sc != chn && sc->target_cpu == cpu) > + self_used = true; > + spin_unlock_irqrestore(&primary->lock, flags); > + > + return self_used; > +} > + > /* > * We use this state to statically distribute the channel interrupt load. > */ > @@ -654,6 +680,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > u32 cur_cpu; > bool perf_chn = vmbus_devs[dev_type].perf_device; > struct vmbus_channel *primary = channel->primary_channel; > + u32 cnt = 1, ncpu = num_online_cpus(); > int next_node; > cpumask_var_t available_mask; > struct cpumask *alloced_mask; > @@ -676,6 +703,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > > spin_lock(&bind_channel_to_cpu_lock); > > +retry: > /* > * Based on the channel affinity policy, we will assign the NUMA > * nodes. > @@ -755,6 +783,11 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > } > } > > + if (channel->affinity_policy == HV_BALANCED && > + channel->offermsg.offer.sub_channel_index < ncpu && > + cnt++ < ncpu + 1 && hv_cpuself_used(cur_cpu, channel)) > + goto retry; > + > channel->target_cpu = cur_cpu; > channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu); > >
Acked-by: Kelsey Skunberg <kelsey.skunberg@canonical.com> On 2021-11-01 08:20:08 , Tim Gardner wrote: > From: Haiyang Zhang <haiyangz@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1937078 > > commit 7c9ff3deeee61b253715dcf968a6307af148c9b2 upstream > > The vmbus module uses a rotational algorithm to assign target CPUs to > a device's channels. Depending on the timing of different device's channel > offers, different channels of a device may be assigned to the same CPU. > > For example on a VM with 2 CPUs, if NIC A and B's channels are offered > in the following order, NIC A will have both channels on CPU0, and > NIC B will have both channels on CPU1 -- see below. This kind of > assignment causes RSS load that is spreading across different channels > to end up on the same CPU. > > Timing of channel offers: > NIC A channel 0 > NIC B channel 0 > NIC A channel 1 > NIC B channel 1 > > VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd} > Sysfs path: /sys/bus/vmbus/devices/cab064cd-1f31-47d5-a8b4-9d57e320cccd > Rel_ID=14, target_cpu=0 > Rel_ID=17, target_cpu=0 > > VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea} > Sysfs path: /sys/bus/vmbus/devices/244225ca-743e-4020-a17d-d7baa13d6cea > Rel_ID=16, target_cpu=1 > Rel_ID=18, target_cpu=1 > > Update the vmbus CPU assignment algorithm to avoid duplicate CPU > assignments within a device. > > The new algorithm iterates num_online_cpus + 1 times. > The existing rotational algorithm to find "next NUMA & CPU" is still here. > But if the resulting CPU is already used by the same device, it will try > the next CPU. > In the last iteration, it assigns the channel to the next available CPU > like the existing algorithm. This is not normally expected, because > during device probe, we limit the number of channels of a device to > be <= number of online CPUs. > > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Tested-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/1626459673-17420-1-git-send-email-haiyangz@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (backported from commit 7c9ff3deeee61b253715dcf968a6307af148c9b2) > Signed-off-by: Tim Gardner <tim.gardner@canonical.com> > [ rtg - backported by Haiyang Zhang and has been > verified to fix the bug on A2 and D32 instance types. ] > --- > > v2 - v1 of this patch caused boot hangs on certain instance types and was dropped from > the SRU cycle (https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1937078/comments/6). > This backport was developed by the original author who likely has a better handle on > things then I do. From SF, "Haiyang backported the patch to azure-5.4-next (attached). He > also reproduced the bug on 5.4 kernel with VM size A2 and D32, and verified the patch fixes the bug." > > --- > drivers/hv/channel_mgmt.c | 33 +++++++++++++++++++++++++++++++++ > 1 file changed, 33 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 2b5d20965b1bf..ceac42e48f3a2 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -628,6 +628,32 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) > queue_work(wq, &newchannel->add_channel_work); > } > > +/* > + * Check if CPU is used by other channels of the same device. > + * It should only be called by init_vp_index(). > + */ > +static bool hv_cpuself_used(u32 cpu, struct vmbus_channel *chn) > +{ > + struct vmbus_channel *primary = chn->primary_channel; > + struct vmbus_channel *sc; > + bool self_used = false; > + unsigned long flags; > + > + if (!primary) > + return false; > + > + if (primary->target_cpu == cpu) > + return true; > + > + spin_lock_irqsave(&primary->lock, flags); > + list_for_each_entry(sc, &primary->sc_list, sc_list) > + if (sc != chn && sc->target_cpu == cpu) > + self_used = true; > + spin_unlock_irqrestore(&primary->lock, flags); > + > + return self_used; > +} > + > /* > * We use this state to statically distribute the channel interrupt load. > */ > @@ -654,6 +680,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > u32 cur_cpu; > bool perf_chn = vmbus_devs[dev_type].perf_device; > struct vmbus_channel *primary = channel->primary_channel; > + u32 cnt = 1, ncpu = num_online_cpus(); > int next_node; > cpumask_var_t available_mask; > struct cpumask *alloced_mask; > @@ -676,6 +703,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > > spin_lock(&bind_channel_to_cpu_lock); > > +retry: > /* > * Based on the channel affinity policy, we will assign the NUMA > * nodes. > @@ -755,6 +783,11 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > } > } > > + if (channel->affinity_policy == HV_BALANCED && > + channel->offermsg.offer.sub_channel_index < ncpu && > + cnt++ < ncpu + 1 && hv_cpuself_used(cur_cpu, channel)) > + goto retry; > + > channel->target_cpu = cur_cpu; > channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu); > > -- > 2.33.1 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team
Applied to focal/linux-azure. Thanks. -rtg On 11/1/21 8:20 AM, Tim Gardner wrote: > From: Haiyang Zhang <haiyangz@microsoft.com> > > BugLink: https://bugs.launchpad.net/bugs/1937078 > > commit 7c9ff3deeee61b253715dcf968a6307af148c9b2 upstream > > The vmbus module uses a rotational algorithm to assign target CPUs to > a device's channels. Depending on the timing of different device's channel > offers, different channels of a device may be assigned to the same CPU. > > For example on a VM with 2 CPUs, if NIC A and B's channels are offered > in the following order, NIC A will have both channels on CPU0, and > NIC B will have both channels on CPU1 -- see below. This kind of > assignment causes RSS load that is spreading across different channels > to end up on the same CPU. > > Timing of channel offers: > NIC A channel 0 > NIC B channel 0 > NIC A channel 1 > NIC B channel 1 > > VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd} > Sysfs path: /sys/bus/vmbus/devices/cab064cd-1f31-47d5-a8b4-9d57e320cccd > Rel_ID=14, target_cpu=0 > Rel_ID=17, target_cpu=0 > > VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter > Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea} > Sysfs path: /sys/bus/vmbus/devices/244225ca-743e-4020-a17d-d7baa13d6cea > Rel_ID=16, target_cpu=1 > Rel_ID=18, target_cpu=1 > > Update the vmbus CPU assignment algorithm to avoid duplicate CPU > assignments within a device. > > The new algorithm iterates num_online_cpus + 1 times. > The existing rotational algorithm to find "next NUMA & CPU" is still here. > But if the resulting CPU is already used by the same device, it will try > the next CPU. > In the last iteration, it assigns the channel to the next available CPU > like the existing algorithm. This is not normally expected, because > during device probe, we limit the number of channels of a device to > be <= number of online CPUs. > > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> > Reviewed-by: Michael Kelley <mikelley@microsoft.com> > Tested-by: Michael Kelley <mikelley@microsoft.com> > Link: https://lore.kernel.org/r/1626459673-17420-1-git-send-email-haiyangz@microsoft.com > Signed-off-by: Wei Liu <wei.liu@kernel.org> > (backported from commit 7c9ff3deeee61b253715dcf968a6307af148c9b2) > Signed-off-by: Tim Gardner <tim.gardner@canonical.com> > [ rtg - backported by Haiyang Zhang and has been > verified to fix the bug on A2 and D32 instance types. ] > --- > > v2 - v1 of this patch caused boot hangs on certain instance types and was dropped from > the SRU cycle (https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1937078/comments/6). > This backport was developed by the original author who likely has a better handle on > things then I do. From SF, "Haiyang backported the patch to azure-5.4-next (attached). He > also reproduced the bug on 5.4 kernel with VM size A2 and D32, and verified the patch fixes the bug." > > --- > drivers/hv/channel_mgmt.c | 33 +++++++++++++++++++++++++++++++++ > 1 file changed, 33 insertions(+) > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c > index 2b5d20965b1bf..ceac42e48f3a2 100644 > --- a/drivers/hv/channel_mgmt.c > +++ b/drivers/hv/channel_mgmt.c > @@ -628,6 +628,32 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) > queue_work(wq, &newchannel->add_channel_work); > } > > +/* > + * Check if CPU is used by other channels of the same device. > + * It should only be called by init_vp_index(). > + */ > +static bool hv_cpuself_used(u32 cpu, struct vmbus_channel *chn) > +{ > + struct vmbus_channel *primary = chn->primary_channel; > + struct vmbus_channel *sc; > + bool self_used = false; > + unsigned long flags; > + > + if (!primary) > + return false; > + > + if (primary->target_cpu == cpu) > + return true; > + > + spin_lock_irqsave(&primary->lock, flags); > + list_for_each_entry(sc, &primary->sc_list, sc_list) > + if (sc != chn && sc->target_cpu == cpu) > + self_used = true; > + spin_unlock_irqrestore(&primary->lock, flags); > + > + return self_used; > +} > + > /* > * We use this state to statically distribute the channel interrupt load. > */ > @@ -654,6 +680,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > u32 cur_cpu; > bool perf_chn = vmbus_devs[dev_type].perf_device; > struct vmbus_channel *primary = channel->primary_channel; > + u32 cnt = 1, ncpu = num_online_cpus(); > int next_node; > cpumask_var_t available_mask; > struct cpumask *alloced_mask; > @@ -676,6 +703,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > > spin_lock(&bind_channel_to_cpu_lock); > > +retry: > /* > * Based on the channel affinity policy, we will assign the NUMA > * nodes. > @@ -755,6 +783,11 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) > } > } > > + if (channel->affinity_policy == HV_BALANCED && > + channel->offermsg.offer.sub_channel_index < ncpu && > + cnt++ < ncpu + 1 && hv_cpuself_used(cur_cpu, channel)) > + goto retry; > + > channel->target_cpu = cur_cpu; > channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu); > >
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 2b5d20965b1bf..ceac42e48f3a2 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -628,6 +628,32 @@ static void vmbus_process_offer(struct vmbus_channel *newchannel) queue_work(wq, &newchannel->add_channel_work); } +/* + * Check if CPU is used by other channels of the same device. + * It should only be called by init_vp_index(). + */ +static bool hv_cpuself_used(u32 cpu, struct vmbus_channel *chn) +{ + struct vmbus_channel *primary = chn->primary_channel; + struct vmbus_channel *sc; + bool self_used = false; + unsigned long flags; + + if (!primary) + return false; + + if (primary->target_cpu == cpu) + return true; + + spin_lock_irqsave(&primary->lock, flags); + list_for_each_entry(sc, &primary->sc_list, sc_list) + if (sc != chn && sc->target_cpu == cpu) + self_used = true; + spin_unlock_irqrestore(&primary->lock, flags); + + return self_used; +} + /* * We use this state to statically distribute the channel interrupt load. */ @@ -654,6 +680,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) u32 cur_cpu; bool perf_chn = vmbus_devs[dev_type].perf_device; struct vmbus_channel *primary = channel->primary_channel; + u32 cnt = 1, ncpu = num_online_cpus(); int next_node; cpumask_var_t available_mask; struct cpumask *alloced_mask; @@ -676,6 +703,7 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) spin_lock(&bind_channel_to_cpu_lock); +retry: /* * Based on the channel affinity policy, we will assign the NUMA * nodes. @@ -755,6 +783,11 @@ static void init_vp_index(struct vmbus_channel *channel, u16 dev_type) } } + if (channel->affinity_policy == HV_BALANCED && + channel->offermsg.offer.sub_channel_index < ncpu && + cnt++ < ncpu + 1 && hv_cpuself_used(cur_cpu, channel)) + goto retry; + channel->target_cpu = cur_cpu; channel->target_vp = hv_cpu_number_to_vp_number(cur_cpu);