[0/3] PCI: vmd: Reducing tail latency by affining to the storage stack

Message ID 1573040408-3831-1-git-send-email-jonathan.derrick@intel.com

Message

Jon Derrick Nov. 6, 2019, 11:40 a.m. UTC
This patchset optimizes VMD performance through the storage stack by locating
commonly-affined NVMe interrupts on the same VMD interrupt handler lists.

The current strategy of round-robin assignment to VMD IRQ lists can be
suboptimal when vectors with different affinities are assigned to the same VMD
IRQ list. VMD is an NVMe storage domain, so this set aligns the vector
allocation and affinity strategy with that of the NVMe driver. This lets the
kernel do the right thing when affining NVMe submission CPUs to NVMe
completion vectors, as serviced through the VMD interrupt handler lists.

This set greatly reduced tail latency when testing 8 threads of random 4k reads
against two drives at queue depth=128. After pinning the tasks to reduce test
variability, the tests still showed a moderate tail latency reduction. A
one-drive configuration also shows improvements due to the alignment of VMD IRQ
list affinities with NVMe affinities.
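
To make the change of strategy concrete, here is a rough userspace
sketch (hypothetical names, 64-bit words standing in for cpumasks; not
the patch itself): round-robin can split two identically-affined child
vectors across different VMD IRQ lists, while keying placement on the
affinity mask keeps them on the same list.

#include <stdio.h>
#include <stdint.h>

#define NR_VMD_LISTS 4

/* Old strategy: place child vector N on list N % NR_VMD_LISTS. */
static int round_robin_list(unsigned int vector)
{
        return vector % NR_VMD_LISTS;
}

/* New strategy: reuse the list whose affinity matches the vector's. */
static int affinity_list(uint64_t mask, const uint64_t *list_mask)
{
        int i;

        for (i = 0; i < NR_VMD_LISTS; i++)
                if (list_mask[i] == mask)
                        return i;
        return -1;      /* no match: caller would pick a fresh list */
}

int main(void)
{
        /* list affinities, then two child vectors both affined to CPUs 2-3 */
        uint64_t list_mask[NR_VMD_LISTS] = { 0x3, 0xc, 0x30, 0xc0 };
        uint64_t vec_mask[2] = { 0xc, 0xc };
        unsigned int v;

        for (v = 0; v < 2; v++)
                printf("vector %u: round-robin list %d, affinity list %d\n",
                       v, round_robin_list(v),
                       affinity_list(vec_mask[v], list_mask));
        return 0;
}

Round-robin lands the two vectors on lists 0 and 1; affinity matching
lands both on list 1, which is what keeps commonly-affined NVMe
interrupts on the same handler list.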

An example with two NVMe drives and a 33-vector VMD:
VMD irq[42]  Affinity[0-27,56-83]   Effective[10]
VMD irq[43]  Affinity[28-29,84-85]  Effective[85]
VMD irq[44]  Affinity[30-31,86-87]  Effective[87]
VMD irq[45]  Affinity[32-33,88-89]  Effective[89]
VMD irq[46]  Affinity[34-35,90-91]  Effective[91]
VMD irq[47]  Affinity[36-37,92-93]  Effective[93]
VMD irq[48]  Affinity[38-39,94-95]  Effective[95]
VMD irq[49]  Affinity[40-41,96-97]  Effective[97]
VMD irq[50]  Affinity[42-43,98-99]  Effective[99]
VMD irq[51]  Affinity[44-45,100]    Effective[100]
VMD irq[52]  Affinity[46-47,102]    Effective[102]
VMD irq[53]  Affinity[48-49,104]    Effective[104]
VMD irq[54]  Affinity[50-51,106]    Effective[106]
VMD irq[55]  Affinity[52-53,108]    Effective[108]
VMD irq[56]  Affinity[54-55,110]    Effective[110]
VMD irq[57]  Affinity[101,103,105]  Effective[105]
VMD irq[58]  Affinity[107,109,111]  Effective[111]
VMD irq[59]  Affinity[0-1,56-57]    Effective[57]
VMD irq[60]  Affinity[2-3,58-59]    Effective[59]
VMD irq[61]  Affinity[4-5,60-61]    Effective[61]
VMD irq[62]  Affinity[6-7,62-63]    Effective[63]
VMD irq[63]  Affinity[8-9,64-65]    Effective[65]
VMD irq[64]  Affinity[10-11,66-67]  Effective[67]
VMD irq[65]  Affinity[12-13,68-69]  Effective[69]
VMD irq[66]  Affinity[14-15,70-71]  Effective[71]
VMD irq[67]  Affinity[16-17,72]     Effective[72]
VMD irq[68]  Affinity[18-19,74]     Effective[74]
VMD irq[69]  Affinity[20-21,76]     Effective[76]
VMD irq[70]  Affinity[22-23,78]     Effective[78]
VMD irq[71]  Affinity[24-25,80]     Effective[80]
VMD irq[72]  Affinity[26-27,82]     Effective[82]
VMD irq[73]  Affinity[73,75,77]     Effective[77]
VMD irq[74]  Affinity[79,81,83]     Effective[83]

nvme0n1q1   MQ CPUs[28, 29, 84, 85]
nvme0n1q2   MQ CPUs[30, 31, 86, 87]
nvme0n1q3   MQ CPUs[32, 33, 88, 89]
nvme0n1q4   MQ CPUs[34, 35, 90, 91]
nvme0n1q5   MQ CPUs[36, 37, 92, 93]
nvme0n1q6   MQ CPUs[38, 39, 94, 95]
nvme0n1q7   MQ CPUs[40, 41, 96, 97]
nvme0n1q8   MQ CPUs[42, 43, 98, 99]
nvme0n1q9   MQ CPUs[44, 45, 100]
nvme0n1q10  MQ CPUs[46, 47, 102]
nvme0n1q11  MQ CPUs[48, 49, 104]
nvme0n1q12  MQ CPUs[50, 51, 106]
nvme0n1q13  MQ CPUs[52, 53, 108]
nvme0n1q14  MQ CPUs[54, 55, 110]
nvme0n1q15  MQ CPUs[101, 103, 105]
nvme0n1q16  MQ CPUs[107, 109, 111]
nvme0n1q17  MQ CPUs[0, 1, 56, 57]
nvme0n1q18  MQ CPUs[2, 3, 58, 59]
nvme0n1q19  MQ CPUs[4, 5, 60, 61]
nvme0n1q20  MQ CPUs[6, 7, 62, 63]
nvme0n1q21  MQ CPUs[8, 9, 64, 65]
nvme0n1q22  MQ CPUs[10, 11, 66, 67]
nvme0n1q23  MQ CPUs[12, 13, 68, 69]
nvme0n1q24  MQ CPUs[14, 15, 70, 71]
nvme0n1q25  MQ CPUs[16, 17, 72]
nvme0n1q26  MQ CPUs[18, 19, 74]
nvme0n1q27  MQ CPUs[20, 21, 76]
nvme0n1q28  MQ CPUs[22, 23, 78]
nvme0n1q29  MQ CPUs[24, 25, 80]
nvme0n1q30  MQ CPUs[26, 27, 82]
nvme0n1q31  MQ CPUs[73, 75, 77]
nvme0n1q32  MQ CPUs[79, 81, 83]

nvme1n1q1   MQ CPUs[28, 29, 84, 85]
nvme1n1q2   MQ CPUs[30, 31, 86, 87]
nvme1n1q3   MQ CPUs[32, 33, 88, 89]
nvme1n1q4   MQ CPUs[34, 35, 90, 91]
nvme1n1q5   MQ CPUs[36, 37, 92, 93]
nvme1n1q6   MQ CPUs[38, 39, 94, 95]
nvme1n1q7   MQ CPUs[40, 41, 96, 97]
nvme1n1q8   MQ CPUs[42, 43, 98, 99]
nvme1n1q9   MQ CPUs[44, 45, 100]
nvme1n1q10  MQ CPUs[46, 47, 102]
nvme1n1q11  MQ CPUs[48, 49, 104]
nvme1n1q12  MQ CPUs[50, 51, 106]
nvme1n1q13  MQ CPUs[52, 53, 108]
nvme1n1q14  MQ CPUs[54, 55, 110]
nvme1n1q15  MQ CPUs[101, 103, 105]
nvme1n1q16  MQ CPUs[107, 109, 111]
nvme1n1q17  MQ CPUs[0, 1, 56, 57]
nvme1n1q18  MQ CPUs[2, 3, 58, 59]
nvme1n1q19  MQ CPUs[4, 5, 60, 61]
nvme1n1q20  MQ CPUs[6, 7, 62, 63]
nvme1n1q21  MQ CPUs[8, 9, 64, 65]
nvme1n1q22  MQ CPUs[10, 11, 66, 67]
nvme1n1q23  MQ CPUs[12, 13, 68, 69]
nvme1n1q24  MQ CPUs[14, 15, 70, 71]
nvme1n1q25  MQ CPUs[16, 17, 72]
nvme1n1q26  MQ CPUs[18, 19, 74]
nvme1n1q27  MQ CPUs[20, 21, 76]
nvme1n1q28  MQ CPUs[22, 23, 78]
nvme1n1q29  MQ CPUs[24, 25, 80]
nvme1n1q30  MQ CPUs[26, 27, 82]
nvme1n1q31  MQ CPUs[73, 75, 77]
nvme1n1q32  MQ CPUs[79, 81, 83]
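
For reference, listings like the ones above come from standard
procfs/sysfs attributes; a minimal sketch that reads them (IRQ 42 and
nvme0n1 are examples from this system, and blk-mq queue directories
are 0-indexed, so mq/0 corresponds to nvme0n1q1):

#include <stdio.h>

static void print_file(const char *label, const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof(buf), f))
                printf("%s: %s", label, buf);   /* buf keeps its newline */
        if (f)
                fclose(f);
}

int main(void)
{
        print_file("Affinity", "/proc/irq/42/smp_affinity_list");
        print_file("Effective", "/proc/irq/42/effective_affinity_list");
        print_file("nvme0n1q1 MQ CPUs", "/sys/block/nvme0n1/mq/0/cpu_list");
        return 0;
}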


This patchset applies after the VMD IRQ List indirection patch:
https://lore.kernel.org/linux-pci/1572527333-6212-1-git-send-email-jonathan.derrick@intel.com/

Jon Derrick (3):
  PCI: vmd: Reduce VMD vectors using NVMe calculation
  PCI: vmd: Align IRQ lists with child device vectors
  PCI: vmd: Use managed irq affinities

 drivers/pci/controller/vmd.c | 90 +++++++++++++++++++-------------------------
 1 file changed, 39 insertions(+), 51 deletions(-)
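
For orientation, the NVMe driver asks for roughly one I/O vector per
possible CPU plus one reserved (admin) vector; a hedged sketch of an
NVMe-style vector cap, as the first patch's title suggests (the helper
below is hypothetical, drawn from the title rather than the patch
contents):

#include <stdio.h>

/*
 * Hypothetical helper: an NVMe-style vector count, one I/O vector per
 * CPU plus one reserved vector, clamped to what the hardware offers.
 */
static unsigned int nvme_style_vectors(unsigned int ncpus,
                                       unsigned int hw_max)
{
        unsigned int want = ncpus + 1;

        return want < hw_max ? want : hw_max;
}

int main(void)
{
        /* 112 CPUs clamped to the 33-vector VMD from the example above */
        printf("%u\n", nvme_style_vectors(112, 33));    /* prints 33 */
        return 0;
}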

Comments

Christoph Hellwig Nov. 7, 2019, 9:39 a.m. UTC | #1
On Wed, Nov 06, 2019 at 04:40:05AM -0700, Jon Derrick wrote:
> This patchset optimizes VMD performance through the storage stack by locating
> commonly-affined NVMe interrupts on the same VMD interrupt handler lists.
> 
> The current strategy of round-robin assignment to VMD IRQ lists can be
> suboptimal when vectors with different affinities are assigned to the same VMD
> IRQ list. VMD is an NVMe storage domain, so this set aligns the vector
> allocation and affinity strategy with that of the NVMe driver. This lets the
> kernel do the right thing when affining NVMe submission CPUs to NVMe
> completion vectors, as serviced through the VMD interrupt handler lists.
> 
> This set greatly reduced tail latency when testing 8 threads of random 4k reads
> against two drives at queue depth=128. After pinning the tasks to reduce test
> variability, the tests still showed a moderate tail latency reduction. A
> one-drive configuration also shows improvements due to the alignment of VMD IRQ
> list affinities with NVMe affinities.

How does this compare to simply disabling VMD?
Jon Derrick Nov. 7, 2019, 2:12 p.m. UTC | #2
On Thu, 2019-11-07 at 01:39 -0800, Christoph Hellwig wrote:
> On Wed, Nov 06, 2019 at 04:40:05AM -0700, Jon Derrick wrote:
> > This patchset optimizes VMD performance through the storage stack by locating
> > commonly-affined NVMe interrupts on the same VMD interrupt handler lists.
> > 
> > The current strategy of round-robin assignment to VMD IRQ lists can be
> > suboptimal when vectors with different affinities are assigned to the same VMD
> > IRQ list. VMD is an NVMe storage domain, so this set aligns the vector
> > allocation and affinity strategy with that of the NVMe driver. This lets the
> > kernel do the right thing when affining NVMe submission CPUs to NVMe
> > completion vectors, as serviced through the VMD interrupt handler lists.
> > 
> > This set greatly reduced tail latency when testing 8 threads of random 4k reads
> > against two drives at queue depth=128. After pinning the tasks to reduce test
> > variability, the tests still showed a moderate tail latency reduction. A
> > one-drive configuration also shows improvements due to the alignment of VMD IRQ
> > list affinities with NVMe affinities.
> 
> How does this compare to simply disabling VMD?

It's a moot point since Keith pointed out a few flaws with this set;
however, disabling VMD is not an option for users who wish to pass
through VMD.
Christoph Hellwig Nov. 7, 2019, 3:37 p.m. UTC | #3
On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > How does this compare to simply disabling VMD?
> 
> It's a moot point since Keith pointed out a few flaws with this set;
> however, disabling VMD is not an option for users who wish to pass
> through VMD.

And why would you ever pass through vmd instead of the actual device?
That just makes things go slower and adds zero value.
Jon Derrick Nov. 7, 2019, 3:40 p.m. UTC | #4
On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > How does this compare to simply disabling VMD?
> > 
> > It's a moot point since Keith pointed out a few flaws with this set;
> > however, disabling VMD is not an option for users who wish to pass
> > through VMD.
> 
> And why would you ever pass through vmd instead of the actual device?
> That just makes things go slower and adds zero value.

The ability to use physical Root Ports/DSPs/etc. in a guest. Slower is
acceptable for many users if it fits within a performance window.
Christoph Hellwig Nov. 7, 2019, 3:42 p.m. UTC | #5
On Thu, Nov 07, 2019 at 03:40:15PM +0000, Derrick, Jonathan wrote:
> On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> > On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > > How does this compare to simply disabling VMD?
> > > 
> > > It's a moot point since Keith pointed out a few flaws with this set;
> > > however, disabling VMD is not an option for users who wish to pass
> > > through VMD.
> > 
> > And why would you ever pass through vmd instead of the actual device?
> > That just makes things go slower and adds zero value.
> 
> The ability to use physical Root Ports/DSPs/etc. in a guest. Slower is
> acceptable for many users if it fits within a performance window.

What is the actual use case?  What does it enable that otherwise doesn't
work and is actually useful?  And real use cases please and no marketing
mumbo jumbo.
Jon Derrick Nov. 7, 2019, 3:47 p.m. UTC | #6
On Thu, 2019-11-07 at 07:42 -0800, hch@infradead.org wrote:
> On Thu, Nov 07, 2019 at 03:40:15PM +0000, Derrick, Jonathan wrote:
> > On Thu, 2019-11-07 at 07:37 -0800, hch@infradead.org wrote:
> > > On Thu, Nov 07, 2019 at 02:12:50PM +0000, Derrick, Jonathan wrote:
> > > > > How does this compare to simply disabling VMD?
> > > > 
> > > > It's a moot point since Keith pointed out a few flaws with this set;
> > > > however, disabling VMD is not an option for users who wish to pass
> > > > through VMD.
> > > 
> > > And why would you ever pass through vmd instead of the actual device?
> > > That just makes things go slower and adds zero value.
> > 
> > The ability to use physical Root Ports/DSPs/etc. in a guest. Slower is
> > acceptable for many users if it fits within a performance window.
> 
> What is the actual use case?  What does it enable that otherwise doesn't
> work and is actually useful?  And real use cases please and no marketing
> mumbo jumbo.

A cloud service provider might have several VMs on a single system and
wish to provide surprise hotplug functionality within the guests so
that they don't need to bring the whole server down or migrate VMs in
order to swap disks.
Christoph Hellwig Nov. 11, 2019, 5:03 p.m. UTC | #7
On Thu, Nov 07, 2019 at 03:47:09PM +0000, Derrick, Jonathan wrote:
> A cloud service provider might have several VMs on a single system and
> wish to provide surprise hotplug functionality within the guests so
> that they don't need to bring the whole server down or migrate VMs in
> order to swap disks.

And how does the vmd mechanism help with that?  Maybe qemu is missing
a memremap to not access the removed device right now, but adding that
is way simpler than having to deal with a device that makes everyone's
life complicated.
Kai-Heng Feng Dec. 23, 2022, 2:33 a.m. UTC | #8
Hi Jon,

On Wed, Nov 6, 2019 at 7:40 PM Jon Derrick <jonathan.derrick@intel.com> wrote:
>
> This patchset optimizes VMD performance through the storage stack by locating
> commonly-affined NVMe interrupts on the same VMD interrupt handler lists.
>
> The current strategy of round-robin assignment to VMD IRQ lists can be
> suboptimal when vectors with different affinities are assigned to the same VMD
> IRQ list. VMD is an NVMe storage domain, so this set aligns the vector
> allocation and affinity strategy with that of the NVMe driver. This lets the
> kernel do the right thing when affining NVMe submission CPUs to NVMe
> completion vectors, as serviced through the VMD interrupt handler lists.
>
> This set greatly reduced tail latency when testing 8 threads of random 4k reads
> against two drives at queue depth=128. After pinning the tasks to reduce test
> variability, the tests still showed a moderate tail latency reduction. A
> one-drive configuration also shows improvements due to the alignment of VMD IRQ
> list affinities with NVMe affinities.

Is there any followup on this series? Because vmd_irq_set_affinity()
always returns -EINVAL, the system can't perform S3 and CPU hotplug.
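
For context, a .irq_set_affinity hook that fails unconditionally looks
roughly like this (a sketch of the pattern described above, not the
exact mainline source):

#include <linux/irq.h>
#include <linux/errno.h>

/* Rejects every attempt to move a VMD sub-IRQ to another CPU. */
static int vmd_irq_set_affinity(struct irq_data *data,
                                const struct cpumask *dest, bool force)
{
        return -EINVAL;
}

When a CPU goes offline (which both S3 and CPU hotplug require), the
IRQ core tries to migrate vectors off the dying CPU through this hook,
so an unconditional -EINVAL makes the offline path fail.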

Bug filed here:
https://bugzilla.kernel.org/show_bug.cgi?id=216835

Kai-Heng

> [...]