Patchwork VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars

Submitter Bharat Bhushan
Date Feb. 14, 2013, 8:22 a.m.
Message ID <6A3DF150A5B70D4F9B66A25E3F7C888D065ABF7B@039-SN2MPN1-022.039d.mgd.msft.net>
Permalink /patch/220392/
State New

Comments

Bharat Bhushan - Feb. 14, 2013, 8:22 a.m.
Hi Alex Williamson,

I am able (with some hacks :)) to directly assign the e1000 PCI device to a KVM guest using VFIO on a Freescale device. One of the problems I am facing is the DMA mapping in the IOMMU for PCI device BARs. On Freescale devices, the mmaped PCI device BARs are not required to be mapped in the IOMMU. Typically the flow of in/out transactions (from the CPU) is:

Incoming flow:

-----|                     |----------|         |---------------|                   |-------------|
CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
-----|                     |----------|         |---------------|                   |-------------|

Outgoing Flow: IOMMU is bypassed for out transactions

-----|                     |----------|         |---------------|                   |-------------|
CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
-----|                 |   |----------|   ^     |---------------|                   |-------------|
                       |                  |
                       |------------------|


Also, because of some hardware limitations of our IOMMU, it is difficult to map these BAR regions alongside the RAM (DDR) regions. So on Freescale devices we want the VFIO_IOMMU_MAP_DMA ioctl to be called for RAM (DDR) regions only and _not_ for these mmaped RAM regions of PCI device BARs. I understand that we need to add these mmaped PCI BARs as RAM type in the QEMU memory_region_*() API, so I tried to skip these regions in the VFIO memory listeners. The changes below work for me; I am not sure whether this is the correct approach, please suggest.

-------------
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index c51ae67..63728d8 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+static int memory_region_is_mmap_bars(VFIOContainer *container,
+                                      MemoryRegionSection *section)
 {
-    return !memory_region_is_ram(section->mr);
+    VFIOGroup *group;
+    VFIODevice *vdev;
+    int i;
+
+    QLIST_FOREACH(group, &container->group_list, next) {
+        QLIST_FOREACH(vdev, &group->device_list, next) {
+            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)
+                return 1;
+            for (i = 0; i < PCI_ROM_SLOT; i++) {
+                VFIOBAR *bar = &vdev->bars[i];
+                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
+                    return 1;
+            }
+        }
+    }
+
+    return 0;
+}
+
+static bool vfio_listener_skipped_section(VFIOContainer *container,
+                                          MemoryRegionSection *section)
+{
+    if (!memory_region_is_ram(section->mr))
+       return 1;
+
+    return memory_region_is_mmap_bars(container, section);
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
@@ -1129,7 +1155,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(container, section)) {
         DPRINTF("vfio: SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
                 section->offset_within_address_space,
                 section->offset_within_address_space + section->size - 1);
@@ -1173,7 +1199,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
     hwaddr iova, end;
     int ret;
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(container, section)) {
         DPRINTF("vfio: SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
                 section->offset_within_address_space,
                 section->offset_within_address_space + section->size - 1);
-----------------


Thanks
-Bharat
Alex Williamson - Feb. 14, 2013, 6:27 p.m.
On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> Hi Alex Williamson,
> 
> I am able (with some hacks :)) to directly assign the e1000 PCI device
> to KVM guest using VFIO on Freescale device. One of the problem I am
> facing is about the DMA mapping in IOMMU for PCI device BARs. On
> Freescale devices, the mmaped PCI device BARs are not required to be
> mapped in IOMMU. Typically the flow of in/out transaction (from CPU)
> is: 
> 
> Incoming flow:
> 
> -----|                     |----------|         |---------------|                   |-------------|
> CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> -----|                     |----------|         |---------------|                   |-------------|
> 
> Outgoing Flow: IOMMU is bypassed for out transactions
> 
> -----|                     |----------|         |---------------|                   |-------------|
> CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
> -----|                 |   |----------|   ^     |---------------|                   |-------------|
>                        |                  |
>                        |------------------|

Mapping the BAR into the IOMMU isn't for this second use case, CPU to
device is never IOMMU translated.  Instead, it's for this:

|----------|         |---------------|                   |-------------|
| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
|----------|         |---------------|                   |-------------|
      |
      |              |---------------|                   |-------------|
      +-->---->--- ->| PCI-Controller|>------>----->---->| PCI device  |
                     |---------------|                   |-------------|

It's perfectly fine to not support peer-to-peer DMA, but let's skip it
where it's not supported in case it might actually work in some cases.

> Also because of some hardware limitations on our IOMMU it is difficult
> to map these BAR regions with RAM (DDR) regions. So on Freescale
> device we want the VFIO_IOMMU_MAP_DMA ioctl to be called for RAM
> regions (DDR) only and _not_ for these mmaped ram regions of PCI
> device bars. I can understand that we need to add these mmaped PCI
> bars as RAM type in qemu memory_region_*(). So for that I tried to
> skip these regions in VFIO memory_listeners. Below changes which works
> for me. I am not sure whether this is correct approach, please
> suggest.
> 
> -------------
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index c51ae67..63728d8 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>      return -errno;
>  }
>  
> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +static int memory_region_is_mmap_bars(VFIOContainer *container,
> +                                      MemoryRegionSection *section)
>  {
> -    return !memory_region_is_ram(section->mr);
> +    VFIOGroup *group;
> +    VFIODevice *vdev;
> +    int i;
> +
> +    QLIST_FOREACH(group, &container->group_list, next) {

I think you want to start at the global &group_list

> +        QLIST_FOREACH(vdev, &group->device_list, next) {
> +            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)

Note the comment in memory.h:

struct MemoryRegion {
    /* All fields are private - violators will be prosecuted */
    ...
    ram_addr_t ram_addr;

You'll need to invent some kind of memory region comparison interfaces
instead of accessing these private fields.

> +                return 1;
> +            for (i = 0; i < PCI_ROM_SLOT; i++) {
> +                VFIOBAR *bar = &vdev->bars[i];
> +                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
> +                    return 1;
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}

It's a pretty heavy weight function, I think the memory API should help
us differentiate MMIO from RAM.

> +static bool vfio_listener_skipped_section(VFIOContainer *container,
> +                                          MemoryRegionSection *section)
> +{
> +    if (!memory_region_is_ram(section->mr))
> +       return 1;

Need some kind of conditional here for platforms that support
peer-to-peer.  Thanks,

Alex

> +    return memory_region_is_mmap_bars(container, section);
>  }
>  
>  static void vfio_listener_region_add(MemoryListener *listener,
> @@ -1129,7 +1155,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      void *vaddr;
>      int ret;
>  
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(container, section)) {
>          DPRINTF("vfio: SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>                  section->offset_within_address_space,
>                  section->offset_within_address_space + section->size - 1);
> @@ -1173,7 +1199,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      hwaddr iova, end;
>      int ret;
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(container, section)) {
>          DPRINTF("vfio: SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
>                  section->offset_within_address_space,
>                  section->offset_within_address_space + section->size - 1);
> 
> -----------------
> 
> 
> Thanks
> -Bharat
>
Bharat Bhushan - Feb. 19, 2013, 4:05 a.m.
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, February 14, 2013 11:57 PM
> To: Bhushan Bharat-R65777
> Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
>
> On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > Hi Alex Williamson,
> >
> > I am able (with some hacks :)) to directly assign the e1000 PCI device
> > to KVM guest using VFIO on Freescale device. One of the problem I am
> > facing is about the DMA mapping in IOMMU for PCI device BARs. On
> > Freescale devices, the mmaped PCI device BARs are not required to be
> > mapped in IOMMU. Typically the flow of in/out transaction (from CPU)
> > is:
> >
> > Incoming flow:
> >
> > -----|                     |----------|         |---------------|                   |-------------|
> > CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > -----|                     |----------|         |---------------|                   |-------------|
> >
> > Outgoing Flow: IOMMU is bypassed for out transactions
> >
> > -----|                     |----------|         |---------------|                   |-------------|
> > CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
> > -----|                 |   |----------|   ^     |---------------|                   |-------------|
> >                        |                  |
> >                        |------------------|
>
> Mapping the BAR into the IOMMU isn't for this second use case, CPU to
> device is never IOMMU translated.  Instead, it's for this:
>
> |----------|         |---------------|                   |-------------|
> | IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> |----------|         |---------------|                   |-------------|
>       |
>       |              |---------------|                   |-------------|
>       +-->---->--- ->| PCI-Controller|>------>----->---->| PCI device  |
>                      |---------------|                   |-------------|
>
> It's perfectly fine to not support peer-to-peer DMA, but let's skip it
> where it's not supported in case it might actually work in some cases.
>
> > Also because of some hardware limitations on our IOMMU it is difficult
> > to map these BAR regions with RAM (DDR) regions. So on Freescale
> > device we want the VFIO_IOMMU_MAP_DMA ioctl to be called for RAM
> > regions (DDR) only and _not_ for these mmaped ram regions of PCI
> > device bars. I can understand that we need to add these mmaped PCI
> > bars as RAM type in qemu memory_region_*(). So for that I tried to
> > skip these regions in VFIO memory_listeners. Below changes which works
> > for me. I am not sure whether this is correct approach, please
> > suggest.
> >
> > -------------
> > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > index c51ae67..63728d8 100644
> > --- a/hw/vfio_pci.c
> > +++ b/hw/vfio_pci.c
> > @@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >      return -errno;
> >  }
> >
> > -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> > +static int memory_region_is_mmap_bars(VFIOContainer *container,
> > +                                      MemoryRegionSection *section)
> >  {
> > -    return !memory_region_is_ram(section->mr);
> > +    VFIOGroup *group;
> > +    VFIODevice *vdev;
> > +    int i;
> > +
> > +    QLIST_FOREACH(group, &container->group_list, next) {
>
> I think you want to start at the global &group_list

Why do you think global? My understanding is that the operation on dma_map/unmap is limited to a group in the given container.

> > +        QLIST_FOREACH(vdev, &group->device_list, next) {
> > +            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)
>
> Note the comment in memory.h:
>
> struct MemoryRegion {
>     /* All fields are private - violators will be prosecuted */
>     ...
>     ram_addr_t ram_addr;
>
> You'll need to invent some kind of memory region comparison interfaces
> instead of accessing these private fields.

You mean adding some API in memory.c?

> > +                return 1;
> > +            for (i = 0; i < PCI_ROM_SLOT; i++) {
> > +                VFIOBAR *bar = &vdev->bars[i];
> > +                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
> > +                    return 1;
> > +            }
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
>
> It's a pretty heavy weight function, I think the memory API should help
> us differentiate MMIO from RAM.

What about adding a bool in "struct MemoryRegion":

    bool require_iommu_map;

There would be APIs to set/clear this bool, and by default everything with "ram = true" would also have "require_iommu_map = true":

    if (!(memory_region->ram == true && memory_region->require_iommu_map == true))
        skip_iommumap;

> > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > +                                          MemoryRegionSection *section)
> > +{
> > +    if (!memory_region_is_ram(section->mr))
> > +       return 1;
>
> Need some kind of conditional here for platforms that support
> peer-to-peer.  Thanks,

What about adding an iommu_type here (VFIO_TYPE2_IOMMU) which does not require BARs to be mapped in the IOMMU?

Thanks
-Bharat

>
> Alex
>
> > +    return memory_region_is_mmap_bars(container, section);
> >  }
> >
> >  static void vfio_listener_region_add(MemoryListener *listener,
> > @@ -1129,7 +1155,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >      void *vaddr;
> >      int ret;
> >
> > -    if (vfio_listener_skipped_section(section)) {
> > +    if (vfio_listener_skipped_section(container, section)) {
> >          DPRINTF("vfio: SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> >                  section->offset_within_address_space,
> >                  section->offset_within_address_space + section->size - 1);
> > @@ -1173,7 +1199,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >      hwaddr iova, end;
> >      int ret;
> > -    if (vfio_listener_skipped_section(section)) {
> > +    if (vfio_listener_skipped_section(container, section)) {
> >          DPRINTF("vfio: SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
> >                  section->offset_within_address_space,
> >                  section->offset_within_address_space + section->size - 1);
> >
> > -----------------
> >
> >
> > Thanks
> > -Bharat
> >
Scott Wood - Feb. 19, 2013, 4:08 a.m.
On 02/18/2013 10:05:20 PM, Bhushan Bharat-R65777 wrote:
>
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, February 14, 2013 11:57 PM
> > To: Bhushan Bharat-R65777
> > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> >
> > On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > > +                                          MemoryRegionSection *section)
> > > +{
> > > +    if (!memory_region_is_ram(section->mr))
> > > +       return 1;
> >
> > Need some kind of conditional here for platforms that support
> > peer-to-peer.  Thanks,
>
> What about adding a iommu_type will help here (VFIO_TYPE2_IOMMU), which does not require BARs to be mapped in iommu ?

What specifically does "type 2" mean again?  If we're going to use it  
as the thing to key off of for all sorts of PAMU quirks, might as well  
just call it VFIO_IOMMU_FSL_PAMU.

-Scott
Alex Williamson - Feb. 19, 2013, 4:24 a.m.
On Tue, 2013-02-19 at 04:05 +0000, Bhushan Bharat-R65777 wrote:
> 
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, February 14, 2013 11:57 PM
> > To: Bhushan Bharat-R65777
> > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-
> > devel@nongnu.org; qemu-ppc@nongnu.org
> > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> > 
> > On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > > Hi Alex Williamson,
> > >
> > > I am able (with some hacks :)) to directly assign the e1000 PCI device
> > > to KVM guest using VFIO on Freescale device. One of the problem I am
> > > facing is about the DMA mapping in IOMMU for PCI device BARs. On
> > > Freescale devices, the mmaped PCI device BARs are not required to be
> > > mapped in IOMMU. Typically the flow of in/out transaction (from CPU)
> > > is:
> > >
> > > Incoming flow:
> > >
> > > -----|                     |----------|         |---------------|                   |-------------|
> > > CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > > -----|                     |----------|         |---------------|                   |-------------|
> > >
> > > Outgoing Flow: IOMMU is bypassed for out transactions
> > >
> > > -----|                     |----------|         |---------------|                   |-------------|
> > > CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
> > > -----|                 |   |----------|   ^     |---------------|                   |-------------|
> > >                        |                  |
> > >                        |------------------|
> > 
> > Mapping the BAR into the IOMMU isn't for this second use case, CPU to device is
> > never IOMMU translated.  Instead, it's for this:
> > 
> > |----------|         |---------------|                   |-------------|
> > | IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > |----------|         |---------------|                   |-------------|
> >       |
> >       |              |---------------|                   |-------------|
> >       +-->---->--- ->| PCI-Controller|>------>----->---->| PCI device  |
> >                      |---------------|                   |-------------|
> > 
> > It's perfectly fine to not support peer-to-peer DMA, but let's skip it where
> > it's not supported in case it might actually work in some cases.
> > 
> > > Also because of some hardware limitations on our IOMMU it is difficult
> > > to map these BAR regions with RAM (DDR) regions. So on Freescale
> > > device we want the VFIO_IOMMU_MAP_DMA ioctl to be called for RAM
> > > regions (DDR) only and _not_ for these mmaped ram regions of PCI
> > > device bars. I can understand that we need to add these mmaped PCI
> > > bars as RAM type in qemu memory_region_*(). So for that I tried to
> > > skip these regions in VFIO memory_listeners. Below changes which works
> > > for me. I am not sure whether this is correct approach, please
> > > suggest.
> > >
> > > -------------
> > > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > > index c51ae67..63728d8 100644
> > > --- a/hw/vfio_pci.c
> > > +++ b/hw/vfio_pci.c
> > > @@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> > >      return -errno;
> > >  }
> > >
> > > -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> > > +static int memory_region_is_mmap_bars(VFIOContainer *container,
> > > +                                      MemoryRegionSection *section)
> > >  {
> > > -    return !memory_region_is_ram(section->mr);
> > > +    VFIOGroup *group;
> > > +    VFIODevice *vdev;
> > > +    int i;
> > > +
> > > +    QLIST_FOREACH(group, &container->group_list, next) {
> > 
> > I think you want to start at the global &group_list
> 
> Why you think global? My understanding is that the operation on dma_map/unmap is limited to a group in the given container.

There can theoretically be more than one container, so if you're only
searching groups within a container then you may not be searching all
the devices.

> > > +        QLIST_FOREACH(vdev, &group->device_list, next) {
> > > +            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)
> > 
> > Note the comment in memory.h:
> > 
> > struct MemoryRegion {
> >     /* All fields are private - violators will be prosecuted */
> >     ...
> >     ram_addr_t ram_addr;
> > 
> > You'll need to invent some kind of memory region comparison interfaces instead
> > of accessing these private fields.
> 
> You mean adding some api ine memory.c?

Yes

> > > +                return 1;
> > > +            for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > +                VFIOBAR *bar = &vdev->bars[i];
> > > +                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
> > > +                    return 1;
> > > +            }
> > > +        }
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > 
> > It's a pretty heavy weight function, I think the memory API should help us
> > differentiate MMIO from RAM.
> 
> What about adding a bool in " struct MemoryRegion {" 
> 
> Bool require_iommu_map;
> There will be APIs to set/clear this bool and default all "ram = true", will have "require_iommu_map = true"
> 
> if (!(memory_region->ram == true && memory_region->require_iommu_map == true)
> 	skip_iommumap;

I don't think memory.c wants to care about requiring iommu mapping or
not.  Perhaps an mmio address space vs a ram address space.
> > 
> > > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > > +                                          MemoryRegionSection *section)
> > > +{
> > > +    if (!memory_region_is_ram(section->mr))
> > > +       return 1;
> > 
> > Need some kind of conditional here for platforms that support peer-to-peer.
> > Thanks,
> 
> What about adding a iommu_type will help here (VFIO_TYPE2_IOMMU), which does not require BARs to be mapped in iommu ?

"Type2" seems excessive, there's already an VFIO_IOMMU_GET_INFO ioctl
that could easily add a flag and maybe a field.  I think you're doing
something with iommu geometry, perhaps inferring the geometry somehow.
Why not add that to the GET_INFO data and use that to decide whether to
restrict mapping to ram?  Thanks,

Alex
Bharat Bhushan - Feb. 19, 2013, 4:24 a.m.
> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Tuesday, February 19, 2013 9:39 AM
> To: Bhushan Bharat-R65777
> Cc: Alex Williamson; Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-
> B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> 
> On 02/18/2013 10:05:20 PM, Bhushan Bharat-R65777 wrote:
> >
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, February 14, 2013 11:57 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> > > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> > >
> > > On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > > > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > > > +                                          MemoryRegionSection *section)
> > > > +{
> > > > +    if (!memory_region_is_ram(section->mr))
> > > > +       return 1;
> > >
> > > Need some kind of conditional here for platforms that support peer-to-peer.
> > > Thanks,
> >
> > What about adding a iommu_type will help here (VFIO_TYPE2_IOMMU),
> > which does not require BARs to be mapped in iommu ?
> 
> What specifically does "type 2" mean again?  If we're going to use it as the
> thing to key off of for all sorts of PAMU quirks, might as well just call it
> VFIO_IOMMU_FSL_PAMU.

I just thought of a common name in case some other IOMMU also falls into this category; we can add a comment where this is defined.
I am fine with this name as well :)

Thanks
-Bharat
Bharat Bhushan - Feb. 19, 2013, 4:56 a.m.
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, February 19, 2013 9:55 AM
> To: Bhushan Bharat-R65777
> Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
>
> On Tue, 2013-02-19 at 04:05 +0000, Bhushan Bharat-R65777 wrote:
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, February 14, 2013 11:57 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-devel@nongnu.org; qemu-ppc@nongnu.org
> > > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> > >
> > > On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > > > Hi Alex Williamson,
> > > >
> > > > I am able (with some hacks :)) to directly assign the e1000 PCI
> > > > device to KVM guest using VFIO on Freescale device. One of the
> > > > problem I am facing is about the DMA mapping in IOMMU for PCI
> > > > device BARs. On Freescale devices, the mmaped PCI device BARs are
> > > > not required to be mapped in IOMMU. Typically the flow of in/out
> > > > transaction (from CPU) is:
> > > >
> > > > Incoming flow:
> > > >
> > > > -----|                     |----------|         |---------------|                   |-------------|
> > > > CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > > > -----|                     |----------|         |---------------|                   |-------------|
> > > >
> > > > Outgoing Flow: IOMMU is bypassed for out transactions
> > > >
> > > > -----|                     |----------|         |---------------|                   |-------------|
> > > > CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
> > > > -----|                 |   |----------|   ^     |---------------|                   |-------------|
> > > >                        |                  |
> > > >                        |------------------|
> > >
> > > Mapping the BAR into the IOMMU isn't for this second use case, CPU
> > > to device is never IOMMU translated.  Instead, it's for this:
> > >
> > > |----------|         |---------------|                   |-------------|
> > > | IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > > |----------|         |---------------|                   |-------------|
> > >       |
> > >       |              |---------------|                   |-------------|
> > >       +-->---->--- ->| PCI-Controller|>------>----->---->| PCI device  |
> > >                      |---------------|                   |-------------|
> > >
> > > It's perfectly fine to not support peer-to-peer DMA, but let's skip
> > > it where it's not supported in case it might actually work in some cases.
> > >
> > > > Also because of some hardware limitations on our IOMMU it is
> > > > difficult to map these BAR regions with RAM (DDR) regions. So on
> > > > Freescale device we want the VFIO_IOMMU_MAP_DMA ioctl to be called
> > > > for RAM regions (DDR) only and _not_ for these mmaped ram regions
> > > > of PCI device bars. I can understand that we need to add these
> > > > mmaped PCI bars as RAM type in qemu memory_region_*(). So for that
> > > > I tried to skip these regions in VFIO memory_listeners. Below
> > > > changes which works for me. I am not sure whether this is correct
> > > > approach, please suggest.
> > > >
> > > > -------------
> > > > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > > > index c51ae67..63728d8 100644
> > > > --- a/hw/vfio_pci.c
> > > > +++ b/hw/vfio_pci.c
> > > > @@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> > > >      return -errno;
> > > >  }
> > > >
> > > > -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> > > > +static int memory_region_is_mmap_bars(VFIOContainer *container,
> > > > +                                      MemoryRegionSection *section)
> > > >  {
> > > > -    return !memory_region_is_ram(section->mr);
> > > > +    VFIOGroup *group;
> > > > +    VFIODevice *vdev;
> > > > +    int i;
> > > > +
> > > > +    QLIST_FOREACH(group, &container->group_list, next) {
> > >
> > > I think you want to start at the global &group_list
> >
> > Why do you think global? My understanding is that the operation on dma_map/unmap is limited to a group in the given container.
>
> There can theoretically be more than one container, so if you're only searching
> groups within a container then you may not be searching all the devices.
>
> > > > +        QLIST_FOREACH(vdev, &group->device_list, next) {
> > > > +            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)
> > >
> > > Note the comment in memory.h:
> > >
> > > struct MemoryRegion {
> > >     /* All fields are private - violators will be prosecuted */
> > >     ...
> > >     ram_addr_t ram_addr;
> > >
> > > You'll need to invent some kind of memory region comparison
> > > interfaces instead of accessing these private fields.
> >
> > You mean adding some API in memory.c?
>
> Yes
>
> > > > +                return 1;
> > > > +            for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > > +                VFIOBAR *bar = &vdev->bars[i];
> > > > +                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
> > > > +                    return 1;
> > > > +            }
> > > > +        }
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > >
> > > It's a pretty heavy weight function, I think the memory API should
> > > help us differentiate MMIO from RAM.
> >
> > What about adding a bool in "struct MemoryRegion":
> >
> >     bool require_iommu_map;
> >
> > There would be APIs to set/clear this bool, and by default everything with "ram = true" would also have "require_iommu_map = true":
> >
> >     if (!(memory_region->ram == true && memory_region->require_iommu_map == true))
> >         skip_iommumap;
>
> I don't think memory.c wants to care about requiring iommu mapping or not.
> Perhaps an mmio address space vs a ram address space.

Can you please describe what you mean by an mmio address space vs a ram address space here? Which elements of struct MemoryRegion are you talking about?

> > >
> > > > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > > > +                                          MemoryRegionSection *section)
> > > > +{
> > > > +    if (!memory_region_is_ram(section->mr))
> > > > +       return 1;
> > >
> > > Need some kind of conditional here for platforms that support peer-to-peer.
> > > Thanks,
> >
> > What about adding an iommu_type here (VFIO_TYPE2_IOMMU) which does not require BARs to be mapped in the IOMMU?
>
> "Type2" seems excessive, there's already an VFIO_IOMMU_GET_INFO ioctl that could
> easily add a flag and maybe a field.  I think you're doing something with iommu
> geometry, perhaps inferring the geometry somehow.
> Why not add that to the GET_INFO data and use that to decide whether to restrict
> mapping to ram?  Thanks,

I agree that getting the iommu type in VFIO_IOMMU_GET_INFO and then using a flag is a good idea. What I meant was that the flag would have the value either TYPE1 or TYPE2/FSL_PAMU etc.

BTW: currently hw/vfio_pci.c uses the VFIO_TYPE1_IOMMU defines.

Thanks
-Bharat

>
> Alex
>
Alex Williamson - Feb. 19, 2013, 5:40 a.m.
On Tue, 2013-02-19 at 04:56 +0000, Bhushan Bharat-R65777 wrote:
> 
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, February 19, 2013 9:55 AM
> > To: Bhushan Bharat-R65777
> > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701; qemu-
> > devel@nongnu.org; qemu-ppc@nongnu.org
> > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED PCI bars
> > 
> > On Tue, 2013-02-19 at 04:05 +0000, Bhushan Bharat-R65777 wrote:
> > >
> > > > -----Original Message-----
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, February 14, 2013 11:57 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: Yoder Stuart-B08248; Wood Scott-B07421; Graf Alexander-B36701;
> > > > qemu- devel@nongnu.org; qemu-ppc@nongnu.org
> > > > Subject: Re: VFIO: Not require to make VFIO_IOMMU_MAP_DMA for MMAPED
> > > > PCI bars
> > > >
> > > > On Thu, 2013-02-14 at 08:22 +0000, Bhushan Bharat-R65777 wrote:
> > > > > Hi Alex Williamson,
> > > > >
> > > > > I am able (with some hacks :)) to directly assign the e1000 PCI
> > > > > device to KVM guest using VFIO on Freescale device. One of the
> > > > > problem I am facing is about the DMA mapping in IOMMU for PCI
> > > > > device BARs. On Freescale devices, the mmaped PCI device BARs are
> > > > > not required to be mapped in IOMMU. Typically the flow of in/out
> > > > > transaction (from CPU)
> > > > > is:
> > > > >
> > > > > Incoming flow:
> > > > >
> > > > > -----|                     |----------|         |---------------|                   |-------------|
> > > > > CORE |<----<------<-----<--| IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > > > > -----|                     |----------|         |---------------|                   |-------------|
> > > > >
> > > > > Outgoing Flow: IOMMU is bypassed for out transactions
> > > > >
> > > > > -----|                     |----------|         |---------------|                   |-------------|
> > > > > CORE |>---->------>----|   | IOMMU    |    ->-->| PCI-Controller|>------>----->---->| PCI device  |
> > > > > -----|                 |   |----------|   ^     |---------------|                   |-------------|
> > > > >                        |                  |
> > > > >                        |------------------|
> > > >
> > > > Mapping the BAR into the IOMMU isn't for this second use case, CPU
> > > > to device is never IOMMU translated.  Instead, it's for this:
> > > >
> > > > |----------|         |---------------|                   |-------------|
> > > > | IOMMU    |<---<---<| PCI-Controller|<------<-----<----<| PCI device  |
> > > > |----------|         |---------------|                   |-------------|
> > > >       |
> > > >       |              |---------------|                   |-------------|
> > > >       +-->---->--- ->| PCI-Controller|>------>----->---->| PCI device  |
> > > >                      |---------------|                   |-------------|
> > > >
> > > > It's perfectly fine to not support peer-to-peer DMA, but let's skip
> > > > it where it's not supported in case it might actually work in some cases.
> > > >
> > > > > Also because of some hardware limitations on our IOMMU it is
> > > > > difficult to map these BAR regions with RAM (DDR) regions. So on
> > > > > Freescale device we want the VFIO_IOMMU_MAP_DMA ioctl to be called
> > > > > for RAM regions (DDR) only and _not_ for these mmaped ram regions
> > > > > of PCI device bars. I can understand that we need to add these
> > > > > mmaped PCI bars as RAM type in qemu memory_region_*(). So for that
> > > > > I tried to skip these regions in VFIO memory_listeners. Below
> > > > > changes which works for me. I am not sure whether this is correct
> > > > > approach, please suggest.
> > > > >
> > > > > -------------
> > > > > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index c51ae67..63728d8
> > > > > 100644
> > > > > --- a/hw/vfio_pci.c
> > > > > +++ b/hw/vfio_pci.c
> > > > > @@ -1115,9 +1115,35 @@ static int vfio_dma_map(VFIOContainer
> > > > > *container,
> > > > hwaddr iova,
> > > > >      return -errno;
> > > > >  }
> > > > >
> > > > > -static bool vfio_listener_skipped_section(MemoryRegionSection
> > > > > *section)
> > > > > +static int memory_region_is_mmap_bars(VFIOContainer *container,
> > > > > +                                      MemoryRegionSection
> > > > > +*section)
> > > > >  {
> > > > > -    return !memory_region_is_ram(section->mr);
> > > > > +    VFIOGroup *group;
> > > > > +    VFIODevice *vdev;
> > > > > +    int i;
> > > > > +
> > > > > +    QLIST_FOREACH(group, &container->group_list, next) {
> > > >
> > > > I think you want to start at the global &group_list
> > >
> > > Why you think global? My understanding is that the operation on dma_map/unmap
> > > is limited to a group in the given container.
> > 
> > There can theoretically be more than one container, so if you're only searching
> > groups within a container then you may not be searching all the devices.
> > 
> > > > > +        QLIST_FOREACH(vdev, &group->device_list, next) {
> > > > > +            if (vdev->msix->mmap_mem.ram_addr ==
> > > > > + section->mr->ram_addr)
> > > >
> > > > Note the comment in memory.h:
> > > >
> > > > struct MemoryRegion {
> > > >     /* All fields are private - violators will be prosecuted */
> > > >     ...
> > > >     ram_addr_t ram_addr;
> > > >
> > > > You'll need to invent some kind of memory region comparison
> > > > interfaces instead of accessing these private fields.
> > >
> > > You mean adding some api ine memory.c?
> > 
> > Yes
> > 
> > > > > +                return 1;
> > > > > +            for (i = 0; i < PCI_ROM_SLOT; i++) {
> > > > > +                VFIOBAR *bar = &vdev->bars[i];
> > > > > +                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
> > > > > +                    return 1;
> > > > > +            }
> > > > > +        }
> > > > > +    }
> > > > > +
> > > > > +    return 0;
> > > > > +}
> > > >
> > > > It's a pretty heavy weight function, I think the memory API should
> > > > help us differentiate MMIO from RAM.
> > >
> > > What about adding a bool in " struct MemoryRegion {"
> > >
> > > Bool require_iommu_map;
> > > There will be APIs to set/clear this bool and default all "ram = true", will
> > > have "require_iommu_map = true"
> > >
> > > if (!(memory_region->ram == true && memory_region->require_iommu_map == true)
> > > 	skip_iommumap;
> > 
> > I don't think memory.c wants to care about requiring iommu mapping or not.
> > Perhaps an mmio address space vs a ram address space.
> 
> Can you please describe what you mean by mmio address space vs ram address space here? Which elements on struct MemoryRegion you are talking about?

MemoryListeners can be registered per address space, so if MMIO was
registered in a different address space from system RAM, then it would
be a trivial matter to register for one or the other or both.  I don't
know how feasible that is, but I think something like that is what we'd
want.  I don't think it's as simple as pointing at an existing field in
struct MemoryRegion.
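
The dispatch idea above can be sketched with a toy model. This is purely an illustration of per-address-space listener registration as discussed here; the `AddressSpace`, `Listener`, and register/notify functions below are simplified stand-ins, not QEMU's actual `MemoryListener` API:

```c
#include <assert.h>

/*
 * Toy model of per-address-space listeners: a listener registered only
 * on the "RAM" address space never sees regions added to the "MMIO"
 * address space, so no explicit skip logic is needed in the listener.
 */
#define MAX_LISTENERS 4

typedef struct Listener {
    int region_adds;                    /* region_add callbacks delivered */
} Listener;

typedef struct AddressSpace {
    Listener *listeners[MAX_LISTENERS];
    int nlisteners;
} AddressSpace;

static void listener_register(AddressSpace *as, Listener *l)
{
    assert(as->nlisteners < MAX_LISTENERS);
    as->listeners[as->nlisteners++] = l;
}

static void region_add(AddressSpace *as)
{
    /* notify only the listeners registered on this address space */
    for (int i = 0; i < as->nlisteners; i++) {
        as->listeners[i]->region_adds++;
    }
}
```

With separate RAM and MMIO address spaces, a vfio-style listener that registers only on the RAM space is simply never called for mmaped BAR regions.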

> > > >
> > > > > +static bool vfio_listener_skipped_section(VFIOContainer *container,
> > > > > +                                          MemoryRegionSection
> > > > > +*section) {
> > > > > +    if (!memory_region_is_ram(section->mr))
> > > > > +       return 1;
> > > >
> > > > Need some kind of conditional here for platforms that support peer-to-peer.
> > > > Thanks,
> > >
> > > What about adding a iommu_type will help here (VFIO_TYPE2_IOMMU), which does
> > > not require BARs to be mapped in iommu ?
> > 
> > "Type2" seems excessive, there's already an VFIO_IOMMU_GET_INFO ioctl that could
> > easily add a flag and maybe a field.  I think you're doing something with iommu
> > geometry, perhaps inferring the geometry somehow.
> > Why not add that to the GET_INFO data and use that to decide whether to restrict
> > mapping to ram?  Thanks,
> 
> I agree that getting the iommu type in VFIO_IOMMU_GET_INFO and then using a flag is a good idea. What I mean was that the flag have value either TYPE1 or TYPE2/FSL_PAMU etc.

Why would VFIO_TYPE1_IOMMU have another flag that reconfirms that it's
type1?  What I'm saying is that instead of declaring a new type or some
new proprietary info, figure out what the restriction is and describe
that.  AIUI, your restriction has something to do with a limited iommu
aperture that can map guest memory, but not I/O devices.  Isn't that
described by the iommu geometry?  If so, describe the geometry and let
qemu come to the same conclusion; don't just make a flag that declares
your platform TYPE2 if you can avoid it.
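
A minimal sketch of how userspace could act on advertised geometry. The struct below mirrors the spirit of `struct vfio_iommu_type1_info` from `<linux/vfio.h>`, but the aperture flag and fields are hypothetical additions for the sake of this discussion, not part of the existing VFIO_IOMMU_GET_INFO ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Info as it might be returned by VFIO_IOMMU_GET_INFO.  argsz, flags
 * and iova_pgsizes follow struct vfio_iommu_type1_info; the aperture
 * flag and fields are HYPOTHETICAL extensions describing the iommu
 * geometry, as suggested in the discussion above.
 */
struct iommu_info {
    uint32_t argsz;
    uint32_t flags;
#define IOMMU_INFO_APERTURE (1 << 1)    /* hypothetical: aperture fields valid */
    uint64_t iova_pgsizes;
    uint64_t aperture_start;            /* hypothetical geometry */
    uint64_t aperture_end;              /* hypothetical geometry */
};

/*
 * Decide from the described geometry whether an IOVA range can be
 * mapped, instead of keying off a platform-specific "TYPE2" flag.
 */
static bool region_mappable(const struct iommu_info *info,
                            uint64_t start, uint64_t end)
{
    if (!(info->flags & IOMMU_INFO_APERTURE)) {
        return true;                    /* no restriction advertised */
    }
    return start >= info->aperture_start && end <= info->aperture_end;
}
```

A memory listener could then skip mmaped BAR regions simply because they fall outside the advertised aperture, with no Freescale-specific knowledge in QEMU.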

> BTW: Currently the hw/vfio_pci.c uses VFIO_TYPE1_IOMMU defines

Can you be more specific?  Is this a problem?  QEMU currently
understands VFIO_TYPE1_IOMMU as necessitated by vfio's modular iommu
interface.  See the patches Alexey's been working on for POWER if you're
curious how to initialize a non-TYPE1 backend.  Thanks,

Alex

Patch

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index c51ae67..63728d8 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -1115,9 +1115,35 @@  static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
     return -errno;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+static int memory_region_is_mmap_bars(VFIOContainer *container,
+                                      MemoryRegionSection *section)
 {
-    return !memory_region_is_ram(section->mr);
+    VFIOGroup *group;
+    VFIODevice *vdev;
+    int i;
+
+    QLIST_FOREACH(group, &container->group_list, next) {
+        QLIST_FOREACH(vdev, &group->device_list, next) {
+            if (vdev->msix->mmap_mem.ram_addr == section->mr->ram_addr)
+                return 1;
+            for (i = 0; i < PCI_ROM_SLOT; i++) {
+                VFIOBAR *bar = &vdev->bars[i];
+                if (bar->mmap_mem.ram_addr == section->mr->ram_addr)
+                    return 1;
+            }
+        }
+    }
+
+    return 0;
+}
+
+static bool vfio_listener_skipped_section(VFIOContainer *container,
+                                          MemoryRegionSection *section)
+{
+    if (!memory_region_is_ram(section->mr))
+       return 1;
+
+    return memory_region_is_mmap_bars(container, section);
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
@@ -1129,7 +1155,7 @@  static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(container, section)) {
         DPRINTF("vfio: SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
                 section->offset_within_address_space,
                 section->offset_within_address_space + section->size - 1);
@@ -1173,7 +1199,7 @@  static void vfio_listener_region_del(MemoryListener *listener,
     hwaddr iova, end;
     int ret;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(container, section)) {
         DPRINTF("vfio: SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n",
                 section->offset_within_address_space,
                 section->offset_within_address_space + section->size - 1);