diff mbox series

ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge

Message ID 20180626135928.23950-1-clg@kaod.org
State New
Headers show
Series ppc/pnv: Add model for Power8 PHB3 PCIe Host bridge | expand

Commit Message

Cédric Le Goater June 26, 2018, 1:59 p.m. UTC
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

This is a model of the PCIe host bridge found on Power8 chips,
including PowerBus logic interface, IOMMU support, PCIe root complex,
XICS MSI and LSI interrupt sources.

4 PHBs are provisioned under the Power8 chip model to fit hardware but
only one is currently initialized. No default device layout is
provided and devices should be added on the pci.0 bus using command
line options such as :

  -device e1000,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pci.0,addr=0x2
  -netdev bridge,id=net0,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=hostnet0

  -device megasas,id=scsi0,bus=pci.0,addr=0x1
  -drive file=$disk,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2

This implementation doesn't emulate the EEH error handling (and may
never do).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[clg: rewrote the QOM models
      misc fixes
      lots of love and care]
Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 default-configs/ppc64-softmmu.mak   |    1 +
 include/hw/pci-host/pnv_phb3.h      |  158 +++++
 include/hw/pci-host/pnv_phb3_regs.h |  510 ++++++++++++++++
 include/hw/ppc/pnv.h                |    6 +
 include/hw/ppc/pnv_xscom.h          |    9 +
 include/hw/ppc/xics.h               |    1 +
 hw/intc/xics.c                      |    2 +-
 hw/pci-host/pnv_phb3.c              | 1089 +++++++++++++++++++++++++++++++++++
 hw/pci-host/pnv_phb3_msi.c          |  315 ++++++++++
 hw/pci-host/pnv_phb3_pbcq.c         |  350 +++++++++++
 hw/ppc/pnv.c                        |  106 +++-
 hw/ppc/pnv_xscom.c                  |    6 +-
 hw/pci-host/Makefile.objs           |    1 +
 13 files changed, 2549 insertions(+), 5 deletions(-)
 create mode 100644 include/hw/pci-host/pnv_phb3.h
 create mode 100644 include/hw/pci-host/pnv_phb3_regs.h
 create mode 100644 hw/pci-host/pnv_phb3.c
 create mode 100644 hw/pci-host/pnv_phb3_msi.c
 create mode 100644 hw/pci-host/pnv_phb3_pbcq.c

Comments

Andrea Bolognani June 26, 2018, 3:57 p.m. UTC | #1
On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
> This is a model of the PCIe host bridge found on Power8 chips,
> including PowerBus logic interface, IOMMU support, PCIe root complex,
> XICS MSI and LSI interrupt sources.
> 
> 4 PHBs are provisioned under the Power8 chip model to fit hardware but
> only one is currently initialized.

What's the advantage in creating 4 PHBs instead of a single one,
like we already do for pSeries guests? As it is, this will confuse
the heck out of libvirt's PCI address allocation algorithm :)
Cédric Le Goater June 26, 2018, 5:02 p.m. UTC | #2
On 06/26/2018 05:57 PM, Andrea Bolognani wrote:
> On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
>> This is a model of the PCIe host bridge found on Power8 chips,
>> including PowerBus logic interface, IOMMU support, PCIe root complex,
>> XICS MSI and LSI interrupt sources.
>>
>> 4 PHBs are provisioned under the Power8 chip model to fit hardware but
>> only one is currently initialized.
> 
> What's the advantage in creating 4 PHBs instead of a single one,

The Power8 chip comes in different flavors: Venice, Murano, Naple, 
each having a different number of PHBs. We don't need to initialize 
them all to plug only a couple of devices (net, storage, usbs) 

When time comes, we might want to test some more complex configurations
or extend the modeling with CAPI support. That's why we have a :

	#define PNV_MAX_CHIP_PHB 4
	    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];

under the chip, and a 'num_phbs' attribute to increase the number
of controllers. It still needs to be tested but that's the goal.

> like we already do for pSeries guests? 

I didn't follow that discussion but this is "another" kind of PHB.
This one models the baremetal controller as found on OpenPOWER and
IBM Power machines. pSeries has a virtual PHB.  

> As it is, this will confuse the heck out of libvirt's PCI address > allocation algorithm :)

The pci bus name should be directly related to the PHB index. But
I agree we need to be careful. That's why you are in cc: :)

Thanks,

C.
Benjamin Herrenschmidt June 26, 2018, 10:21 p.m. UTC | #3
On Tue, 2018-06-26 at 17:57 +0200, Andrea Bolognani wrote:
> On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
> > This is a model of the PCIe host bridge found on Power8 chips,
> > including PowerBus logic interface, IOMMU support, PCIe root complex,
> > XICS MSI and LSI interrupt sources.
> > 
> > 4 PHBs are provisioned under the Power8 chip model to fit hardware but
> > only one is currently initialized.
> 
> What's the advantage in creating 4 PHBs instead of a single one,
> like we already do for pSeries guests? As it is, this will confuse
> the heck out of libvirt's PCI address allocation algorithm :)

This matches the actual HW. POWER9 will have 6 per chip :-)

The goal of the "powernv" platform in qemu is to closely match the
actual HW.

Note that pseries guests can (and will under some cirscumstances) have
multiple PHBs as well.

Cheers,
Ben.
Michael S. Tsirkin June 27, 2018, 12:35 a.m. UTC | #4
On Tue, Jun 26, 2018 at 03:59:28PM +0200, Cédric Le Goater wrote:
> diff --git a/include/hw/pci-host/pnv_phb3_regs.h b/include/hw/pci-host/pnv_phb3_regs.h
> new file mode 100644
> index 000000000000..a1672726b908
> --- /dev/null
> +++ b/include/hw/pci-host/pnv_phb3_regs.h
> @@ -0,0 +1,510 @@
> +/* Copyright (c) 2013-2018, IBM Corporation.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at
> + *
> + *      http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> + * implied.
> + *
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef PCI_HOST_PNV_PHB3_REGS_H
> +#define PCI_HOST_PNV_PHB3_REGS_H
> +
> +/*
> + * Duplicated from target/ppc/cpu.h
> + */
> +#define PPC_BIT(bit)            (0x8000000000000000UL >> (bit))
> +#define PPC_BIT32(bit)          (0x80000000UL >> (bit))
> +#define PPC_BIT8(bit)           (0x80UL >> (bit))
> +#define PPC_BITMASK(bs, be)     ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
> +#define PPC_BITMASK32(bs, be)   ((PPC_BIT32(bs) - PPC_BIT32(be)) | \
> +                                 PPC_BIT32(bs))
> +#define PPC_BITLSHIFT(be)       (63 - (be))
> +#define PPC_BITLSHIFT32(be)     (31 - (be))



> +
> +/* Extract field fname from val */
> +#define GETFIELD(fname, val)                    \
> +        (((val) & fname##_MASK) >> fname##_LSH)
> +
> +/* Set field fname of oval to fval
> + * NOTE: oval isn't modified, the combined result is returned
> + */
> +#define SETFIELD(fname, oval, fval)                     \
> +        (((oval) & ~fname##_MASK) | \
> +         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
> +

Pls don't make up macros like these. We can't have each device do it.


> @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
>      DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
>      DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
>      DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
> +    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
>      DEFINE_PROP_END_OF_LIST(),
>  };

How about instanciating each extra phb using -device?
Seems better than teaching everyone about platform-specific
options.
Benjamin Herrenschmidt June 27, 2018, 1:38 a.m. UTC | #5
On Wed, 2018-06-27 at 03:35 +0300, Michael S. Tsirkin wrote:
> 
> > +
> > +/* Extract field fname from val */
> > +#define GETFIELD(fname, val)                    \
> > +        (((val) & fname##_MASK) >> fname##_LSH)
> > +
> > +/* Set field fname of oval to fval
> > + * NOTE: oval isn't modified, the combined result is returned
> > + */
> > +#define SETFIELD(fname, oval, fval)                     \
> > +        (((oval) & ~fname##_MASK) | \
> > +         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
> > +
> 
> Pls don't make up macros like these. We can't have each device do it.

So what ? We move the macros in a generic place ? These are MUCH better
than open-coding the masks & shifts and much less error prone.

> > @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
> >      DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
> >      DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
> >      DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
> > +    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> 
> How about instanciating each extra phb using -device?
> Seems better than teaching everyone about platform-specific
> options.

It's about which PHBs are enabled, not which are instanciated, if I
understand Cedric changes ...

This aims are implementing the POWER8 chip correctly, it has a fixed
number of PHBs per-chip at very specific XSCOM addresses, that the
firwmare knows about.

Cheers,
Ben.
Michael S. Tsirkin June 27, 2018, 2:39 a.m. UTC | #6
On Wed, Jun 27, 2018 at 11:38:17AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-27 at 03:35 +0300, Michael S. Tsirkin wrote:
> > 
> > > +
> > > +/* Extract field fname from val */
> > > +#define GETFIELD(fname, val)                    \
> > > +        (((val) & fname##_MASK) >> fname##_LSH)
> > > +
> > > +/* Set field fname of oval to fval
> > > + * NOTE: oval isn't modified, the combined result is returned
> > > + */
> > > +#define SETFIELD(fname, oval, fval)                     \
> > > +        (((oval) & ~fname##_MASK) | \
> > > +         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
> > > +
> > 
> > Pls don't make up macros like these. We can't have each device do it.
> 
> So what ? We move the macros in a generic place ? These are MUCH better
> than open-coding the masks & shifts and much less error prone.

include/qemu/bitops.h has a ton of handy macros.

> > > @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
> > >      DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
> > >      DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
> > >      DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
> > > +    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
> > >      DEFINE_PROP_END_OF_LIST(),
> > >  };
> > 
> > How about instanciating each extra phb using -device?
> > Seems better than teaching everyone about platform-specific
> > options.
> 
> It's about which PHBs are enabled, not which are instanciated, if I
> understand Cedric changes ...
> 
> This aims are implementing the POWER8 chip correctly, it has a fixed
> number of PHBs per-chip at very specific XSCOM addresses, that the
> firwmare knows about.
> 
> Cheers,
> Ben.
David Gibson June 27, 2018, 7:28 a.m. UTC | #7
On Wed, Jun 27, 2018 at 11:38:17AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-27 at 03:35 +0300, Michael S. Tsirkin wrote:
> > 
> > > +
> > > +/* Extract field fname from val */
> > > +#define GETFIELD(fname, val)                    \
> > > +        (((val) & fname##_MASK) >> fname##_LSH)
> > > +
> > > +/* Set field fname of oval to fval
> > > + * NOTE: oval isn't modified, the combined result is returned
> > > + */
> > > +#define SETFIELD(fname, oval, fval)                     \
> > > +        (((oval) & ~fname##_MASK) | \
> > > +         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
> > > +
> > 
> > Pls don't make up macros like these. We can't have each device do it.
> 
> So what ? We move the macros in a generic place ? These are MUCH better
> than open-coding the masks & shifts and much less error prone.

There are already deposit32 and deposit64() functions which I think
would cover this.

> 
> > > @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
> > >      DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
> > >      DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
> > >      DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
> > > +    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
> > >      DEFINE_PROP_END_OF_LIST(),
> > >  };
> > 
> > How about instanciating each extra phb using -device?
> > Seems better than teaching everyone about platform-specific
> > options.
> 
> It's about which PHBs are enabled, not which are instanciated, if I
> understand Cedric changes ...
> 
> This aims are implementing the POWER8 chip correctly, it has a fixed
> number of PHBs per-chip at very specific XSCOM addresses, that the
> firwmare knows about.

Yeah, this is a recurring design conflict, which I'm not sure how to
address.  Usually instantiating things with -device works better, but
when do we do when that allows more flexibility with hardware
arrangement than is actually possible in the hardware.
Cédric Le Goater June 27, 2018, 7:46 a.m. UTC | #8
On 06/27/2018 09:28 AM, David Gibson wrote:
> On Wed, Jun 27, 2018 at 11:38:17AM +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-06-27 at 03:35 +0300, Michael S. Tsirkin wrote:
>>>
>>>> +
>>>> +/* Extract field fname from val */
>>>> +#define GETFIELD(fname, val)                    \
>>>> +        (((val) & fname##_MASK) >> fname##_LSH)
>>>> +
>>>> +/* Set field fname of oval to fval
>>>> + * NOTE: oval isn't modified, the combined result is returned
>>>> + */
>>>> +#define SETFIELD(fname, oval, fval)                     \
>>>> +        (((oval) & ~fname##_MASK) | \
>>>> +         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
>>>> +
>>>
>>> Pls don't make up macros like these. We can't have each device do it.
>>
>> So what ? We move the macros in a generic place ? These are MUCH better
>> than open-coding the masks & shifts and much less error prone.
> 
> There are already deposit32 and deposit64() functions which I think
> would cover this.

OK. I understand but we can't include target/ppc/cpu.h in this file
either and we want to use these specific macros to keep the register 
definition file similar to the one in skiboot.

Is their a common area for such needs ? 

>>
>>>> @@ -1031,6 +1110,7 @@ static Property pnv_chip_properties[] = {
>>>>      DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
>>>>      DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
>>>>      DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
>>>> +    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>
>>> How about instanciating each extra phb using -device?
>>> Seems better than teaching everyone about platform-specific
>>> options.
>>
>> It's about which PHBs are enabled, not which are instanciated, if I
>> understand Cedric changes ...
>>
>> This aims are implementing the POWER8 chip correctly, it has a fixed
>> number of PHBs per-chip at very specific XSCOM addresses, that the
>> firwmare knows about.
> 
> Yeah, this is a recurring design conflict, which I'm not sure how to
> address.  Usually instantiating things with -device works better, but
> when do we do when that allows more flexibility with hardware
> arrangement than is actually possible in the hardware.
> 

So the "IBM PHB3 PCIE Root Port" is already user createable.

I can take a look at user createable PHB3s. I think this is OK from a model
perspective. The object is rather standalone, it needs the machine for 
the XICS fabric and a couple of ids, phb id and chip id. These can come
from the command line.

We want at least one PHB3 per socket/chip though. 

C.
Benjamin Herrenschmidt June 27, 2018, 8:41 a.m. UTC | #9
On Wed, 2018-06-27 at 09:46 +0200, Cédric Le Goater wrote:
> So the "IBM PHB3 PCIE Root Port" is already user createable.
> 
> I can take a look at user createable PHB3s. I think this is OK from a model
> perspective. The object is rather standalone, it needs the machine for 
> the XICS fabric and a couple of ids, phb id and chip id. These can come
> from the command line.
> 
> We want at least one PHB3 per socket/chip though. 

We don't want the user to specify the SCOM addresses though (for the
MMIO windows we should get skiboot to assign them).

If the user gets to specify a thing it would be which of the 3 or 4 HW
PHBs of the chip it is, the SCOM addresses gets deduced.

Cheers,
Ben.
Andrea Bolognani June 27, 2018, 10:22 a.m. UTC | #10
On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> On 06/26/2018 05:57 PM, Andrea Bolognani wrote:
> > On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
> > > This is a model of the PCIe host bridge found on Power8 chips,
> > > including PowerBus logic interface, IOMMU support, PCIe root complex,
> > > XICS MSI and LSI interrupt sources.
> > > 
> > > 4 PHBs are provisioned under the Power8 chip model to fit hardware but
> > > only one is currently initialized.
> > 
> > What's the advantage in creating 4 PHBs instead of a single one,
> 
> The Power8 chip comes in different flavors: Venice, Murano, Naple, 
> each having a different number of PHBs. We don't need to initialize 
> them all to plug only a couple of devices (net, storage, usbs) 
> 
> When time comes, we might want to test some more complex configurations
> or extend the modeling with CAPI support. That's why we have a :
> 
> 	#define PNV_MAX_CHIP_PHB 4
> 	    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];
> 
> under the chip, and a 'num_phbs' attribute to increase the number
> of controllers. It still needs to be tested but that's the goal.

I was a bit confused about the difference between "provisioning"
and "initializing" a PHB, I guess.

Now that I've looked at the code, my (very uneducated) reading is
that you're allocating memory for 4 PHBs along with the chip, but
only actually initializing num_phbs of them, with 1 being the
default.

I have no idea what this implies when it comes to adding PCI
controller to the guest, though. If I run a guest with num_phbs=1,
are ids pci.1..pci.3 reserved even though the corresponding PHBs
have not been initialized? Is num_phbs the only way you can control
how many PHBs a PowerNV guest will have?

I would play around and try to figure out all the kinks on my own,
but I can't actually compile QEMU with this patch applied:

  hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_reset’:
  hw/pci-host/pnv_phb3_msi.c:229:11: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_reset’; did you mean ‘parent_class’?
       icsc->parent_reset(dev);
             ^~~~~~~~~~~~
             parent_class
  hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_realize’:
  hw/pci-host/pnv_phb3_msi.c:260:11: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_realize’; did you mean ‘parent_class’?
       icsc->parent_realize(dev, &local_err);
             ^~~~~~~~~~~~~~
             parent_class
  hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_class_init’:
  hw/pci-host/pnv_phb3_msi.c:293:43: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_realize’; did you mean ‘parent_class’?
                                       &isc->parent_realize);
                                             ^~~~~~~~~~~~~~
                                             parent_class
  hw/pci-host/pnv_phb3_msi.c:295:41: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_reset’; did you mean ‘parent_class’?
                                     &isc->parent_reset);
                                           ^~~~~~~~~~~~
                                           parent_class

> > like we already do for pSeries guests? 
> 
> I didn't follow that discussion but this is "another" kind of PHB.
> This one models the baremetal controller as found on OpenPOWER and
> IBM Power machines. pSeries has a virtual PHB.

I understand that, and of course libvirt will need to learn about
this new type of PHB and make sure both pSeries and PowerNV guests
get the correct one assigned to them.

What I meant is that pSeries guests get a single PHB by default,
with additional ones being instantiable through -device; this is
also consistent with how PCI controllers are added to other guest
types including pc, q35 and aarch64/virt, so it would be really
nice if PowerNV behaved the same way.

> > As it is, this will confuse the heck out of libvirt's PCI address > allocation algorithm :)
> 
> The pci bus name should be directly related to the PHB index. But
> I agree we need to be careful. That's why you are in cc: :)

Thanks, I really appreciate you keeping me in the loop :)
Andrea Bolognani June 27, 2018, 10:40 a.m. UTC | #11
On Wed, 2018-06-27 at 18:41 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-27 at 09:46 +0200, Cédric Le Goater wrote:
> > So the "IBM PHB3 PCIE Root Port" is already user createable.
> > 
> > I can take a look at user createable PHB3s. I think this is OK from a model
> > perspective. The object is rather standalone, it needs the machine for 
> > the XICS fabric and a couple of ids, phb id and chip id. These can come
> > from the command line.
> > 
> > We want at least one PHB3 per socket/chip though. 
> 
> We don't want the user to specify the SCOM addresses though (for the
> MMIO windows we should get skiboot to assign them).
> 
> If the user gets to specify a thing it would be which of the 3 or 4 HW
> PHBs of the chip it is, the SCOM addresses gets deduced.

For pSeries guests libvirt will either automatically create, or
allow users to configure manually, PHBs with something like

  <controller type='pci' index='1' model='pci-root'>
    <target index='1'/>
  </controller>

which is ultimately converted to

  -device spapr-pci-host-bridge,index=1,id=pci.1

Ideally the interface for PowerNV guests can be made to be similar
if not identical at the libvirt level, without having to add too
many hacks... It would certainly help a lot if the QEMU interface
for PowerNV PHBs didn't stray too far from the above.
Cédric Le Goater June 27, 2018, 11:51 a.m. UTC | #12
On 06/27/2018 10:41 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-27 at 09:46 +0200, Cédric Le Goater wrote:
>> So the "IBM PHB3 PCIE Root Port" is already user createable.
>>
>> I can take a look at user createable PHB3s. I think this is OK from a model
>> perspective. The object is rather standalone, it needs the machine for 
>> the XICS fabric and a couple of ids, phb id and chip id. These can come
>> from the command line.
>>
>> We want at least one PHB3 per socket/chip though. 
> 
> We don't want the user to specify the SCOM addresses though

We don't need to.

>  (for the MMIO windows we should get skiboot to assign them).

yes. That's a small fix yet to be done. It can come later.

> If the user gets to specify a thing it would be which of the 3 or 4 HW
> PHBs of the chip it is, the SCOM addresses gets deduced.


So I modified a bit the model to support user creatable PHB3 devices and
could start a QEMU powernv machine with these options :
  
Raid controller on PHB0 :

  -device megasas,id=scsi0,bus=pci.0,addr=0x1
  -drive file=$file,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1

Ethernet controller on PHB0 :

  -device e1000,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pci.0,addr=0x2
  -netdev bridge,id=net0,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=net0

USB controller on PHB0 :

  -device nec-usb-xhci,bus=pci.0,addr=0x7

Extra PHB1 on chip 0 :

  -device pnv-phb3,chip-id=0,phb-id=1,id=phb0.1

USB controller on PHB1 :

  -device nec-usb-xhci,bus=pci.1,addr=0x7 

Most of the code changes are fine and even fix a few problems.  

But, user created PHB3 devices are not parented to the PnvChip and 
this means that we need to : 

 1 - scan the full object hierarchy to populate the device tree 
 2 - look for the owning chip in the PHP3 realize routine 
 3 - grab the pnv machine to get the XICS fabric

2 & 3 are somewhat violation of the QOM models but we can live with 
that. 1 is annoying. I think.

So may be we should forget about the -device possibility and use fixed
values for the number of PHBs. The value would depend on the chip flavor.
 

Cheers,

C.
Cédric Le Goater June 27, 2018, 12:18 p.m. UTC | #13
On 06/27/2018 12:22 PM, Andrea Bolognani wrote:
> On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
>> On 06/26/2018 05:57 PM, Andrea Bolognani wrote:
>>> On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
>>>> This is a model of the PCIe host bridge found on Power8 chips,
>>>> including PowerBus logic interface, IOMMU support, PCIe root complex,
>>>> XICS MSI and LSI interrupt sources.
>>>>
>>>> 4 PHBs are provisioned under the Power8 chip model to fit hardware but
>>>> only one is currently initialized.
>>>
>>> What's the advantage in creating 4 PHBs instead of a single one,
>>
>> The Power8 chip comes in different flavors: Venice, Murano, Naple, 
>> each having a different number of PHBs. We don't need to initialize 
>> them all to plug only a couple of devices (net, storage, usbs) 
>>
>> When time comes, we might want to test some more complex configurations
>> or extend the modeling with CAPI support. That's why we have a :
>>
>> 	#define PNV_MAX_CHIP_PHB 4
>> 	    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];
>>
>> under the chip, and a 'num_phbs' attribute to increase the number
>> of controllers. It still needs to be tested but that's the goal.
> 
> I was a bit confused about the difference between "provisioning"
> and "initializing" a PHB, I guess.

objects have different states : allocated, initialized, realized
I guess "provision (memory)" is not the best choice for the first
state.

> Now that I've looked at the code, my (very uneducated) reading is
> that you're allocating memory for 4 PHBs along with the chip, but
> only actually initializing num_phbs of them, with 1 being the
> default.

Yes. I had the idea to increase the number of PHBs with a machine
option later on if needed.

> I have no idea what this implies when it comes to adding PCI
> controller to the guest, though. If I run a guest with num_phbs=1,
> are ids pci.1..pci.3 reserved even though the corresponding PHBs
> have not been initialized? 

They don't exist on the PowerNV machine if you don't have a PHB3
device backing them.

> Is num_phbs the only way you can control how many PHBs a PowerNV 
> guest will have?

yes. Unless we make them user creatable but that's a discussion in
progress. I am not sure user creatable PHB3s are needed. We will
see.
 
> I would play around and try to figure out all the kinks on my own,
> but I can't actually compile QEMU with this patch applied:

you need to use David's branch :

	https://github.com/dgibson/qemu/tree/ppc-for-3.0


>   hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_reset’:
>   hw/pci-host/pnv_phb3_msi.c:229:11: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_reset’; did you mean ‘parent_class’?
>        icsc->parent_reset(dev);
>              ^~~~~~~~~~~~
>              parent_class
>   hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_realize’:
>   hw/pci-host/pnv_phb3_msi.c:260:11: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_realize’; did you mean ‘parent_class’?
>        icsc->parent_realize(dev, &local_err);
>              ^~~~~~~~~~~~~~
>              parent_class
>   hw/pci-host/pnv_phb3_msi.c: In function ‘phb3_msi_class_init’:
>   hw/pci-host/pnv_phb3_msi.c:293:43: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_realize’; did you mean ‘parent_class’?
>                                        &isc->parent_realize);
>                                              ^~~~~~~~~~~~~~
>                                              parent_class
>   hw/pci-host/pnv_phb3_msi.c:295:41: error: ‘ICSStateClass’ {aka ‘struct ICSStateClass’} has no member named ‘parent_reset’; did you mean ‘parent_class’?
>                                      &isc->parent_reset);
>                                            ^~~~~~~~~~~~
>                                            parent_class
> 
>>> like we already do for pSeries guests? 
>>
>> I didn't follow that discussion but this is "another" kind of PHB.
>> This one models the baremetal controller as found on OpenPOWER and
>> IBM Power machines. pSeries has a virtual PHB.
> 
> I understand that, and of course libvirt will need to learn about
> this new type of PHB and make sure both pSeries and PowerNV guests
> get the correct one assigned to them.

yes. we don't want to change libvirt too much so this is a requirement 
to take into account.

> What I meant is that pSeries guests get a single PHB by default,

yes

> with additional ones being instantiable through -device; 

yes.

> this is
> also consistent with how PCI controllers are added to other guest
> types including pc, q35 and aarch64/virt, so it would be really
> nice if PowerNV behaved the same way.

OK. So that's a +1 in favor of user creatable PHB3s devices 

>>> As it is, this will confuse the heck out of libvirt's PCI address > allocation algorithm :)
>>
>> The pci bus name should be directly related to the PHB index. But
>> I agree we need to be careful. That's why you are in cc: :)
> 
> Thanks, I really appreciate you keeping me in the loop :)
> 

Thanks for the feedback.

C.
Cédric Le Goater June 27, 2018, 1:03 p.m. UTC | #14
On 06/27/2018 12:40 PM, Andrea Bolognani wrote:
> On Wed, 2018-06-27 at 18:41 +1000, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-06-27 at 09:46 +0200, Cédric Le Goater wrote:
>>> So the "IBM PHB3 PCIE Root Port" is already user createable.
>>>
>>> I can take a look at user createable PHB3s. I think this is OK from a model
>>> perspective. The object is rather standalone, it needs the machine for 
>>> the XICS fabric and a couple of ids, phb id and chip id. These can come
>>> from the command line.
>>>
>>> We want at least one PHB3 per socket/chip though. 
>>
>> We don't want the user to specify the SCOM addresses though (for the
>> MMIO windows we should get skiboot to assign them).
>>
>> If the user gets to specify a thing it would be which of the 3 or 4 HW
>> PHBs of the chip it is, the SCOM addresses gets deduced.
> 
> For pSeries guests libvirt will either automatically create, or
> allow users to configure manually, PHBs with something like
> 
>   <controller type='pci' index='1' model='pci-root'>
>     <target index='1'/>
>   </controller>
> 
> which is ultimately converted to
> 
>   -device spapr-pci-host-bridge,index=1,id=pci.1
> 
> Ideally the interface for PowerNV guests can be made to be similar
> if not identical at the libvirt level, without having to add too
> many hacks... It would certainly help a lot if the QEMU interface
> for PowerNV PHBs didn't stray too far from the above.
> 

I think that we will need an extra attribute to specify the chip, but 
only in the case of a multichip system, which is not the common scenario.

So the 'index' attribute should work fine. 

Thanks,

C.
Cédric Le Goater June 27, 2018, 7:48 p.m. UTC | #15
On 06/27/2018 02:18 PM, Cédric Le Goater wrote:
> On 06/27/2018 12:22 PM, Andrea Bolognani wrote:
>> On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
>>> On 06/26/2018 05:57 PM, Andrea Bolognani wrote:
>>>> On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
>>>>> This is a model of the PCIe host bridge found on Power8 chips,
>>>>> including PowerBus logic interface, IOMMU support, PCIe root complex,
>>>>> XICS MSI and LSI interrupt sources.
>>>>>
>>>>> 4 PHBs are provisioned under the Power8 chip model to fit hardware but
>>>>> only one is currently initialized.
>>>>
>>>> What's the advantage in creating 4 PHBs instead of a single one,
>>>
>>> The Power8 chip comes in different flavors: Venice, Murano, Naple, 
>>> each having a different number of PHBs. We don't need to initialize 
>>> them all to plug only a couple of devices (net, storage, usbs) 
>>>
>>> When time comes, we might want to test some more complex configurations
>>> or extend the modeling with CAPI support. That's why we have a :
>>>
>>> 	#define PNV_MAX_CHIP_PHB 4
>>> 	    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];
>>>
>>> under the chip, and a 'num_phbs' attribute to increase the number
>>> of controllers. It still needs to be tested but that's the goal.
>>
>> I was a bit confused about the difference between "provisioning"
>> and "initializing" a PHB, I guess.
> 
> objects have different states : allocated, initialized, realized
> I guess "provision (memory)" is not the best choice for the first
> state.
> 
>> Now that I've looked at the code, my (very uneducated) reading is
>> that you're allocating memory for 4 PHBs along with the chip, but
>> only actually initializing num_phbs of them, with 1 being the
>> default.
> 
> Yes. I had the idea to increase the number of PHBs with a machine
> option later on if needed.
> 
>> I have no idea what this implies when it comes to adding PCI
>> controller to the guest, though. If I run a guest with num_phbs=1,
>> are ids pci.1..pci.3 reserved even though the corresponding PHBs
>> have not been initialized? 
> 
> They don't exist on the PowerNV machine if you don't have a PHB3
> device backing them.
> 
>> Is num_phbs the only way you can control how many PHBs a PowerNV 
>> guest will have?
> 
> yes. Unless we make them user creatable but that's a discussion in
> progress. I am not sure user creatable PHB3s are needed. We will
> see.
>  
>> I would play around and try to figure out all the kinks on my own,
>> but I can't actually compile QEMU with this patch applied:
> 
> you need to use David's branch :
> 
> 	https://github.com/dgibson/qemu/tree/ppc-for-3.0

I have pushed a tree here :

	https://github.com/legoater/qemu/tree/phb3-3.0

with two patches, the one we have been discussing about, with some
cleanups and fixes, and one adding user creatable PHB3 devices so 
that extra phbs and devices can be defined on the command line :

-device pnv-phb3,chip-id=1,index=1,id=phb-1.1 -device nec-usb-xhci,bus=pci.1,addr=0x7

The PHB identifier is named 'index' now, to be compatible with the 
libvirt attribute. 

Cheers,

C.
David Gibson June 28, 2018, 3:59 a.m. UTC | #16
On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
> On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> > On 06/26/2018 05:57 PM, Andrea Bolognani wrote:
> > > On Tue, 2018-06-26 at 15:59 +0200, Cédric Le Goater wrote:
> > > > This is a model of the PCIe host bridge found on Power8 chips,
> > > > including PowerBus logic interface, IOMMU support, PCIe root complex,
> > > > XICS MSI and LSI interrupt sources.
> > > > 
> > > > 4 PHBs are provisioned under the Power8 chip model to fit hardware but
> > > > only one is currently initialized.
> > > 
> > > What's the advantage in creating 4 PHBs instead of a single one,
> > 
> > The Power8 chip comes in different flavors: Venice, Murano, Naple, 
> > each having a different number of PHBs. We don't need to initialize 
> > them all to plug only a couple of devices (net, storage, usbs) 
> > 
> > When time comes, we might want to test some more complex configurations
> > or extend the modeling with CAPI support. That's why we have a :
> > 
> > 	#define PNV_MAX_CHIP_PHB 4
> > 	    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];
> > 
> > under the chip, and a 'num_phbs' attribute to increase the number
> > of controllers. It still needs to be tested but that's the goal.
[snip]
> > I didn't follow that discussion but this is "another" kind of PHB.
> > This one models the baremetal controller as found on OpenPOWER and
> > IBM Power machines. pSeries has a virtual PHB.
> 
> I understand that, and of course libvirt will need to learn about
> this new type of PHB and make sure both pSeries and PowerNV guests
> get the correct one assigned to them.

Hmm.. does it?  I would have thought pnv could act more like x86, in
that libvirt doesn't attempt to create PHBs at all and just use the
ones that are built in.

Though, come to that, I wouldn't think pnv support for libvirt would
be much of a priority anyway.  The machine type is still very much in
flux, and it's designed primarily for testing and development, not
"real world" usage.

> What I meant is that pSeries guests get a single PHB by default,
> with additional ones being instantiable through -device; this is
> also consistent with how PCI controllers are added to other guest
> types including pc, q35 and aarch64/virt, so it would be really
> nice if PowerNV behaved the same way.

Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
have a reasonable amount of flexibility to define it as we want.
PowerNV is an emulation of existing hardware which has a specific
behaviour which we need to match.

> > > As it is, this will confuse the heck out of libvirt's PCI address > allocation algorithm :)
> > 
> > The pci bus name should be directly related to the PHB index. But
> > I agree we need to be careful. That's why you are in cc: :)
> 
> Thanks, I really appreciate you keeping me in the loop :)
>
Andrea Bolognani June 28, 2018, 8 a.m. UTC | #17
On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
> On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
> > On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> > > I didn't follow that discussion but this is "another" kind of PHB.
> > > This one models the baremetal controller as found on OpenPOWER and
> > > IBM Power machines. pSeries has a virtual PHB.
> > 
> > I understand that, and of course libvirt will need to learn about
> > this new type of PHB and make sure both pSeries and PowerNV guests
> > get the correct one assigned to them.
> 
> Hmm.. does it?  I would have thought pnv could act more like x86, in
> that libvirt doesn't attempt to create PHBs at all and just use the
> ones that are built in.

AFAIK x86 guests have a single PHB and additional ones cannot be
created in any way, which means we don't have to do any additional
second-guessing when assigning IDs to additional PCI controllers.

> Though, come to that, I wouldn't think pnv support for libvirt would
> be much of a priority anyway.  The machine type is still very much in
> flux, and it's designed primarily for testing and development, not
> "real world" usage.

Can you *guarantee* that someone won't ask for PowerNV support in
libvirt at some point? Because if you can't (and I don't think you
can ;) then this is still a valuable conversation to have.

> > What I meant is that pSeries guests get a single PHB by default,
> > with additional ones being instantiable through -device; this is
> > also consistent with how PCI controllers are added to other guest
> > types including pc, q35 and aarch64/virt, so it would be really
> > nice if PowerNV behaved the same way.
> 
> Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
> have a reasonable amount of flexibility to define it as we want.
> PowerNV is an emulation of existing hardware which has a specific
> behaviour which we need to match.

Sure, that's something to keep in mind.

But the thing is, you still need to have *some* flexibility in
the number of PHBs, since there is variation among real Power8
and Power9 chips; in the current incarnation, that flexibility
is provided by the num_phbs parameter, which is an entirely new
interface that's exclusive to PowerNV.

What I'm suggesting is that the same amount of flexibility is
offered through a standard interface, namely -device, instead.
Cédric Le Goater June 28, 2018, 10:04 a.m. UTC | #18
On 06/28/2018 10:00 AM, Andrea Bolognani wrote:
> On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
>> On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
>>> On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
>>>> I didn't follow that discussion but this is "another" kind of PHB.
>>>> This one models the baremetal controller as found on OpenPOWER and
>>>> IBM Power machines. pSeries has a virtual PHB.
>>>
>>> I understand that, and of course libvirt will need to learn about
>>> this new type of PHB and make sure both pSeries and PowerNV guests
>>> get the correct one assigned to them.
>>
>> Hmm.. does it?  I would have thought pnv could act more like x86, in
>> that libvirt doesn't attempt to create PHBs at all and just use the
>> ones that are built in.
> 
> AFAIK x86 guests have a single PHB and additional ones cannot be
> created in any way, which means we don't have to do any additional
> second-guessing when assigning IDs to additional PCI controllers.
> 
>> Though, come to that, I wouldn't think pnv support for libvirt would
>> be much of a priority anyway.  The machine type is still very much in
>> flux, and it's designed primarily for testing and development, not
>> "real world" usage.
> 
> Can you *guarantee* that someone won't ask for PowerNV support in
> libvirt at some point? Because if you can't (and I don't think you
> can ;) then this is still a valuable conversation to have.
> 
>>> What I meant is that pSeries guests get a single PHB by default,
>>> with additional ones being instantiable through -device; this is
>>> also consistent with how PCI controllers are added to other guest
>>> types including pc, q35 and aarch64/virt, so it would be really
>>> nice if PowerNV behaved the same way.
>>
>> Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
>> have a reasonable amount of flexibility to define it as we want.
>> PowerNV is an emulation of existing hardware which has a specific
>> behaviour which we need to match.
> 
> Sure, that's something to keep in mind.
> 
> But the thing is, you still need to have *some* flexibility in
> the number of PHBs, since there is variation among real Power8
> and Power9 chips; in the current incarnation, that flexibility
> is provided by the num_phbs parameter, which is an entirely new
> interface that's exclusive to PowerNV.
> 
> What I'm suggesting is that the same amount of flexibility is
> offered through a standard interface, namely -device, instead.

Yes. I don't know to be honest. Adding support for -device is not 
complex.

v2 proposes to initialize a fixed set of PHBs 2, 3, 4 depending on 
the CPU. I think this is the best modeling option to fit the HW.

Thanks,

C.
Andrea Bolognani June 28, 2018, 11:40 a.m. UTC | #19
On Thu, 2018-06-28 at 12:04 +0200, Cédric Le Goater wrote:
> On 06/28/2018 10:00 AM, Andrea Bolognani wrote:
> > On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
> > > Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
> > > have a reasonable amount of flexibility to define it as we want.
> > > PowerNV is an emulation of existing hardware which has a specific
> > > behaviour which we need to match.
> > 
> > Sure, that's something to keep in mind.
> > 
> > But the thing is, you still need to have *some* flexibility in
> > the number of PHBs, since there is variation among real Power8
> > and Power9 chips; in the current incarnation, that flexibility
> > is provided by the num_phbs parameter, which is an entirely new
> > interface that's exclusive to PowerNV.
> > 
> > What I'm suggesting is that the same amount of flexibility is
> > offered through a standard interface, namely -device, instead.
> 
> Yes. I don't know to be honest. Adding support for -device is not 
> complex.
> 
> v2 proposes to initialize a fixed set of PHBs 2, 3, 4 depending on 
> the CPU. I think this is the best modeling option to fit the HW.

That approach would require even more hacks in libvirt if we ever
wanted to support PowerNV - basing the PCI address allocation on
the CPU model is not something that's really ever happened before.

To make it somewhat reasonable, information about the number of
PHBs created for each CPU model would have to be exposed through
QMP. And I wonder what a multi-chip guest would look like...

Plus, as soon as you try something like

  $ qemu-system-ppc64 \
    -nodefaults -display none \
    -machine powernv -cpu POWER8E \
    -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1 \
    -device megasas,id=scsi0,bus=pci.1,addr=0x1

very interesting things will start happening :)
Benjamin Herrenschmidt June 28, 2018, 12:14 p.m. UTC | #20
On Thu, 2018-06-28 at 10:00 +0200, Andrea Bolognani wrote:
> On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
> > On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
> > > On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> > > > I didn't follow that discussion but this is "another" kind of PHB.
> > > > This one models the baremetal controller as found on OpenPOWER and
> > > > IBM Power machines. pSeries has a virtual PHB.
> > > 
> > > I understand that, and of course libvirt will need to learn about
> > > this new type of PHB and make sure both pSeries and PowerNV guests
> > > get the correct one assigned to them.
> > 
> > Hmm.. does it?  I would have thought pnv could act more like x86, in
> > that libvirt doesn't attempt to create PHBs at all and just use the
> > ones that are built in.
> 
> AFAIK x86 guests have a single PHB and additional ones cannot be
> created in any way, which means we don't have to do any additional
> second-guessing when assigning IDs to additional PCI controllers.

That's a surprising limitation. A single PHB only supports a limited
number of MSIs no ? And only 256 bus numbers...

> > Though, come to that, I wouldn't think pnv support for libvirt would
> > be much of a priority anyway.  The machine type is still very much in
> > flux, and it's designed primarily for testing and development, not
> > "real world" usage.
> 
> Can you *guarantee* that someone won't ask for PowerNV support in
> libvirt at some point? Because if you can't (and I don't think you
> can ;) then this is still a valuable conversation to have.

It's rather unlikely for now as there is no KVM suport for it (it's
tricky, our chips aren't designed for full virtualization). That might
change in the future but not soon.

> > > What I meant is that pSeries guests get a single PHB by default,
> > > with additional ones being instantiable through -device; this is
> > > also consistent with how PCI controllers are added to other guest
> > > types including pc, q35 and aarch64/virt, so it would be really
> > > nice if PowerNV behaved the same way.
> > 
> > Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
> > have a reasonable amount of flexibility to define it as we want.
> > PowerNV is an emulation of existing hardware which has a specific
> > behaviour which we need to match.
> 
> Sure, that's something to keep in mind.
> 
> But the thing is, you still need to have *some* flexibility in
> the number of PHBs, since there is variation among real Power8
> and Power9 chips; in the current incarnation, that flexibility
> is provided by the num_phbs parameter, which is an entirely new
> interface that's exclusive to PowerNV.
> 
> What I'm suggesting is that the same amount of flexibility is
> offered through a standard interface, namely -device, instead.

But that's harder internally to qemu to properly "bind" to the chip
where the PHB resides etc... 

Cheers,
Ben.
Cédric Le Goater June 28, 2018, 12:20 p.m. UTC | #21
On 06/28/2018 01:40 PM, Andrea Bolognani wrote:
> On Thu, 2018-06-28 at 12:04 +0200, Cédric Le Goater wrote:
>> On 06/28/2018 10:00 AM, Andrea Bolognani wrote:
>>> On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
>>>> Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
>>>> have a reasonable amount of flexibility to define it as we want.
>>>> PowerNV is an emulation of existing hardware which has a specific
>>>> behaviour which we need to match.
>>>
>>> Sure, that's something to keep in mind.
>>>
>>> But the thing is, you still need to have *some* flexibility in
>>> the number of PHBs, since there is variation among real Power8
>>> and Power9 chips; in the current incarnation, that flexibility
>>> is provided by the num_phbs parameter, which is an entirely new
>>> interface that's exclusive to PowerNV.
>>>
>>> What I'm suggesting is that the same amount of flexibility is
>>> offered through a standard interface, namely -device, instead.
>>
>> Yes. I don't know to be honest. Adding support for -device is not 
>> complex.
>>
>> v2 proposes to initialize a fixed set of PHBs 2, 3, 4 depending on 
>> the CPU. I think this is the best modeling option to fit the HW.
> 
> That approach would require even more hacks in libvirt if we ever
> wanted to support PowerNV - basing the PCI address allocation on
> the CPU model is not something that's really ever happened before.

ah. hmm, so that's another +1 for the -device approach.

> To make it somewhat reasonable, information about the number of
> PHBs created for each CPU model would have to be exposed through
> QMP. And I wonder what a multi-chip guest would look like...
> 
> Plus, as soon as you try something like
> 
>   $ qemu-system-ppc64 \
>     -nodefaults -display none \
>     -machine powernv -cpu POWER8E \
>     -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1 \
>     -device megasas,id=scsi0,bus=pci.1,addr=0x1
> 
> very interesting things will start happening :)

Well it boots :)


/ # lspci 
0000:00:00.0 PCI bridge: IBM POWER8 Host Bridge (PHB3)
0000:01:00.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0000:02:01.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0000:02:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
0000:02:07.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
0001:00:00.0 PCI bridge: IBM POWER8 Host Bridge (PHB3)
0001:01:00.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge
0001:02:01.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078


C.
Cédric Le Goater June 28, 2018, 1:05 p.m. UTC | #22
On 06/28/2018 01:40 PM, Andrea Bolognani wrote:
> On Thu, 2018-06-28 at 12:04 +0200, Cédric Le Goater wrote:
>> On 06/28/2018 10:00 AM, Andrea Bolognani wrote:
>>> On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
>>>> Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
>>>> have a reasonable amount of flexibility to define it as we want.
>>>> PowerNV is an emulation of existing hardware which has a specific
>>>> behaviour which we need to match.
>>>
>>> Sure, that's something to keep in mind.
>>>
>>> But the thing is, you still need to have *some* flexibility in
>>> the number of PHBs, since there is variation among real Power8
>>> and Power9 chips; in the current incarnation, that flexibility
>>> is provided by the num_phbs parameter, which is an entirely new
>>> interface that's exclusive to PowerNV.
>>>
>>> What I'm suggesting is that the same amount of flexibility is
>>> offered through a standard interface, namely -device, instead.
>>
>> Yes. I don't know to be honest. Adding support for -device is not 
>> complex.
>>
>> v2 proposes to initialize a fixed set of PHBs 2, 3, 4 depending on 
>> the CPU. I think this is the best modeling option to fit the HW.
> 
> That approach would require even more hacks in libvirt if we ever
> wanted to support PowerNV - basing the PCI address allocation on
> the CPU model is not something that's really ever happened before.

It's even more complex than that but let's keep it "simple" and
consider that some CPU flavors can have more or less PHBs.

May be, we could agree on a common number of PHBs per chip which 
would be statically initialized. *two* PHBs should cover all CPU 
flavors. Extra PHBs created by the user with -device would be 
allowed but capped by a per CPU flavor limit. How's that ? 

C.   

> 
> To make it somewhat reasonable, information about the number of
> PHBs created for each CPU model would have to be exposed through
> QMP. And I wonder what a multi-chip guest would look like...
> 
> Plus, as soon as you try something like
> 
>   $ qemu-system-ppc64 \
>     -nodefaults -display none \
>     -machine powernv -cpu POWER8E \
>     -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1 \
>     -device megasas,id=scsi0,bus=pci.1,addr=0x1
> 
> very interesting things will start happening :)
>
David Gibson July 2, 2018, 6:23 a.m. UTC | #23
On Thu, Jun 28, 2018 at 10:14:53PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2018-06-28 at 10:00 +0200, Andrea Bolognani wrote:
> > On Thu, 2018-06-28 at 13:59 +1000, David Gibson wrote:
> > > On Wed, Jun 27, 2018 at 12:22:31PM +0200, Andrea Bolognani wrote:
> > > > On Tue, 2018-06-26 at 19:02 +0200, Cédric Le Goater wrote:
> > > > > I didn't follow that discussion but this is "another" kind of PHB.
> > > > > This one models the baremetal controller as found on OpenPOWER and
> > > > > IBM Power machines. pSeries has a virtual PHB.
> > > > 
> > > > I understand that, and of course libvirt will need to learn about
> > > > this new type of PHB and make sure both pSeries and PowerNV guests
> > > > get the correct one assigned to them.
> > > 
> > > Hmm.. does it?  I would have thought pnv could act more like x86, in
> > > that libvirt doesn't attempt to create PHBs at all and just use the
> > > ones that are built in.
> > 
> > AFAIK x86 guests have a single PHB and additional ones cannot be
> > created in any way, which means we don't have to do any additional
> > second-guessing when assigning IDs to additional PCI controllers.
> 
> That's a surprising limitation. A single PHB only supports a limited
> number of MSIs no ? And only 256 bus numbers...

I think it depends exactly what you call a "PHB".  AIUI, on modern x86
systems, multiple PCI domains are supported, but you access them all
through the same IO ports, using a 'domain' field in some register to
distinguish which you're operating on

Wheter you want to call that multiple PHBs with a register multiplexer
in front of them, or a single PHB off which hang multiple domains is
kind of arbitrary (at least from the guest PoV).

> > > Though, come to that, I wouldn't think pnv support for libvirt would
> > > be much of a priority anyway.  The machine type is still very much in
> > > flux, and it's designed primarily for testing and development, not
> > > "real world" usage.
> > 
> > Can you *guarantee* that someone won't ask for PowerNV support in
> > libvirt at some point? Because if you can't (and I don't think you
> > can ;) then this is still a valuable conversation to have.
> 
> It's rather unlikely for now as there is no KVM suport for it (it's
> tricky, our chips aren't designed for full virtualization). That might
> change in the future but not soon.

KVM support isn't really a prerequisite for libvirt support.  More
relevant is that the qemu level machine is still changing a lot.  I
don't believe we're really maintaining version to version option
compatibility at this point, we're certainly not attempting to support
cross version migration for it.

> > > > What I meant is that pSeries guests get a single PHB by default,
> > > > with additional ones being instantiable through -device; this is
> > > > also consistent with how PCI controllers are added to other guest
> > > > types including pc, q35 and aarch64/virt, so it would be really
> > > > nice if PowerNV behaved the same way.
> > > 
> > > Well.. sure.. but it doesn't.  pSeries is a virtual platform, so we
> > > have a reasonable amount of flexibility to define it as we want.
> > > PowerNV is an emulation of existing hardware which has a specific
> > > behaviour which we need to match.
> > 
> > Sure, that's something to keep in mind.
> > 
> > But the thing is, you still need to have *some* flexibility in
> > the number of PHBs, since there is variation among real Power8
> > and Power9 chips; in the current incarnation, that flexibility
> > is provided by the num_phbs parameter, which is an entirely new
> > interface that's exclusive to PowerNV.
> > 
> > What I'm suggesting is that the same amount of flexibility is
> > offered through a standard interface, namely -device, instead.
> 
> But that's harder internally to qemu to properly "bind" to the chip
> where the PHB resides etc... 
>
diff mbox series

Patch

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index b94af6c7c62a..deebba2b044a 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -9,6 +9,7 @@  CONFIG_IPMI=y
 CONFIG_IPMI_LOCAL=y
 CONFIG_IPMI_EXTERN=y
 CONFIG_ISA_IPMI_BT=y
+CONFIG_PCIE_PORT=y
 
 # For pSeries
 CONFIG_PSERIES=y
diff --git a/include/hw/pci-host/pnv_phb3.h b/include/hw/pci-host/pnv_phb3.h
new file mode 100644
index 000000000000..795e9b3af2c0
--- /dev/null
+++ b/include/hw/pci-host/pnv_phb3.h
@@ -0,0 +1,158 @@ 
+/*
+ * QEMU PowerPC PowerNV PHB3 model
+ *
+ * Copyright (c) 2014-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PCI_HOST_PNV_PHB3_H
+#define PCI_HOST_PNV_PHB3_H
+
+#include "hw/pci/pci_host.h"
+#include "hw/ppc/xics.h"
+
+typedef struct PnvPHB3 PnvPHB3;
+
+/*
+ * PHB3 XICS Source for MSIs
+ */
+#define TYPE_PHB3_MSI "phb3-msi"
+#define PHB3_MSI(obj) OBJECT_CHECK(Phb3MsiState, (obj), TYPE_PHB3_MSI)
+
+#define PHB3_MAX_MSI     2048
+
+typedef struct Phb3MsiState {
+    ICSState ics;
+
+    PnvPHB3 *phb;
+    uint64_t rba[PHB3_MAX_MSI / 64];
+    uint32_t rba_sum;
+} Phb3MsiState;
+
+void pnv_phb3_msi_update_config(Phb3MsiState *msis, uint32_t base,
+                                uint32_t count);
+void pnv_phb3_msi_send(Phb3MsiState *msis, uint64_t addr, uint16_t data,
+                       int32_t dev_pe);
+void pnv_phb3_msi_ffi(Phb3MsiState *msis, uint64_t val);
+
+
+/* We have one such address space wrapper per possible device
+ * under the PHB since they need to be assigned statically at
+ * qemu device creation time. The relationship to a PE is done
+ * later dynamically. This means we can potentially create a lot
+ * of these guys. Q35 stores them as some kind of radix tree but
+ * we never really need to do fast lookups so instead we simply
+ * keep a QLIST of them for now, we can add the radix if needed
+ * later on.
+ *
+ * We do cache the PE number to speed things up a bit though.
+ */
+typedef struct PnvPhb3DMASpace {
+    PCIBus *bus;
+    uint8_t devfn;
+    int pe_num;         /* Cached PE number */
+#define PHB_INVALID_PE (-1)
+    PnvPHB3 *phb;
+    AddressSpace dma_as;
+    IOMMUMemoryRegion dma_mr;
+    MemoryRegion msi32_mr;
+    MemoryRegion msi64_mr;
+    bool msi32_mapped;
+    bool msi64_mapped;
+    QLIST_ENTRY(PnvPhb3DMASpace) list;
+} PnvPhb3DMASpace;
+
+/*
+ * PHB3 Power Bus Common Queue
+ */
+#define TYPE_PNV_PBCQ "pnv-pbcq"
+#define PNV_PBCQ(obj) OBJECT_CHECK(PnvPBCQState, (obj), TYPE_PNV_PBCQ)
+
+typedef struct PnvPBCQState {
+    DeviceState parent;
+
+    uint32_t chip_id;
+    uint32_t phb_id;
+    uint32_t nest_xbase;
+    uint32_t spci_xbase;
+    uint32_t pci_xbase;
+#define PBCQ_NEST_REGS_COUNT    0x46
+#define PBCQ_PCI_REGS_COUNT     0x15
+#define PBCQ_SPCI_REGS_COUNT    0x5
+
+    uint64_t nest_regs[PBCQ_NEST_REGS_COUNT];
+    uint64_t spci_regs[PBCQ_SPCI_REGS_COUNT];
+    uint64_t pci_regs[PBCQ_PCI_REGS_COUNT];
+    MemoryRegion mmbar0;
+    MemoryRegion mmbar1;
+    MemoryRegion phbbar;
+    bool mmio0_mapped;
+    bool mmio1_mapped;
+    bool phb_mapped;
+    uint64_t mmio0_base;
+    uint64_t mmio0_size;
+    uint64_t mmio1_base;
+    uint64_t mmio1_size;
+    PnvPHB3 *phb;
+
+    MemoryRegion xscom_nest_regs;
+    MemoryRegion xscom_pci_regs;
+    MemoryRegion xscom_spci_regs;
+} PnvPBCQState;
+
+/*
+ * PHB3 PCI Host Bridge for PowerNV machines (POWER8)
+ */
+#define TYPE_PNV_PHB3 "pnv-phb3"
+#define PNV_PHB3(obj) OBJECT_CHECK(PnvPHB3, (obj), TYPE_PNV_PHB3)
+
+#define PNV_PHB3_NUM_M64      16
+#define PNV_PHB3_NUM_REGS     (0x1000 >> 3)
+#define PNV_PHB3_NUM_LSI      8
+#define PNV_PHB3_NUM_PE       256
+
+#define PCI_MMIO_TOTAL_SIZE   (0x1ull << 60)
+
+struct PnvPHB3 {
+    PCIHostState parent_obj;
+
+    uint64_t regs[PNV_PHB3_NUM_REGS];
+    MemoryRegion mr_regs;
+
+    MemoryRegion mr_m32;
+    MemoryRegion mr_m64[PNV_PHB3_NUM_M64];
+    bool regs_mapped;
+    bool m32_mapped;
+    bool m64_mapped[PNV_PHB3_NUM_M64];
+    MemoryRegion pci_mmio;
+    MemoryRegion pci_io;
+
+    uint64_t ioda_LIST[8];
+    uint64_t ioda_LXIVT[8];
+    uint64_t ioda_TVT[512];
+    uint64_t ioda_M64BT[16];
+    uint64_t ioda_MDT[256];
+    uint64_t ioda_PEEV[4];
+
+    uint32_t total_irq;
+    ICSState lsis;
+    Phb3MsiState msis;
+
+    PnvPBCQState pbcq;
+
+    QLIST_HEAD(, PnvPhb3DMASpace) dma_spaces;
+};
+
+#define TYPE_PNV_PHB3_RC "pnv-phb3-rc"
+
+#define TYPE_PNV_PHB3_ROOT_BUS "pnv-phb3-root-bus"
+
+uint64_t pnv_phb3_reg_read(void *opaque, hwaddr off, unsigned size);
+void pnv_phb3_reg_write(void *opaque, hwaddr off, uint64_t val, unsigned size);
+void pnv_phb3_update_regions(PnvPHB3 *phb);
+void pnv_phb3_remap_irqs(PnvPHB3 *phb);
+
+
+#endif /* PCI_HOST_PNV_PHB3_H */
diff --git a/include/hw/pci-host/pnv_phb3_regs.h b/include/hw/pci-host/pnv_phb3_regs.h
new file mode 100644
index 000000000000..a1672726b908
--- /dev/null
+++ b/include/hw/pci-host/pnv_phb3_regs.h
@@ -0,0 +1,510 @@ 
+/* Copyright (c) 2013-2018, IBM Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ * implied.
+ *
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef PCI_HOST_PNV_PHB3_REGS_H
+#define PCI_HOST_PNV_PHB3_REGS_H
+
+/*
+ * Duplicated from target/ppc/cpu.h
+ */
+#define PPC_BIT(bit)            (0x8000000000000000UL >> (bit))
+#define PPC_BIT32(bit)          (0x80000000UL >> (bit))
+#define PPC_BIT8(bit)           (0x80UL >> (bit))
+#define PPC_BITMASK(bs, be)     ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
+#define PPC_BITMASK32(bs, be)   ((PPC_BIT32(bs) - PPC_BIT32(be)) | \
+                                 PPC_BIT32(bs))
+#define PPC_BITLSHIFT(be)       (63 - (be))
+#define PPC_BITLSHIFT32(be)     (31 - (be))
+
+/* Extract field fname from val */
+#define GETFIELD(fname, val)                    \
+        (((val) & fname##_MASK) >> fname##_LSH)
+
+/* Set field fname of oval to fval
+ * NOTE: oval isn't modified, the combined result is returned
+ */
+#define SETFIELD(fname, oval, fval)                     \
+        (((oval) & ~fname##_MASK) | \
+         ((((typeof(oval))(fval)) << fname##_LSH) & fname##_MASK))
+
+/*
+ * PBCQ XSCOM registers
+ */
+
+#define PBCQ_NEST_IRSN_COMPARE  0x1a
+#define PBCQ_NEST_IRSN_COMP_MASK      PPC_BITMASK(0, 18)
+#define PBCQ_NEST_IRSN_COMP_LSH       PPC_BITLSHIFT(18)
+#define PBCQ_NEST_IRSN_MASK     0x1b
+#define PBCQ_NEST_LSI_SRC_ID    0x1f
+#define   PBCQ_NEST_LSI_SRC_MASK     PPC_BITMASK(0, 7)
+#define   PBCQ_NEST_LSI_SRC_LSH      PPC_BITLSHIFT(7)
+#define PBCQ_NEST_REGS_COUNT    0x46
+#define PBCQ_NEST_MMIO_BAR0     0x40
+#define PBCQ_NEST_MMIO_BAR1     0x41
+#define PBCQ_NEST_PHB_BAR       0x42
+#define PBCQ_NEST_MMIO_MASK0    0x43
+#define PBCQ_NEST_MMIO_MASK1    0x44
+#define PBCQ_NEST_BAR_EN        0x45
+#define   PBCQ_NEST_BAR_EN_MMIO0    PPC_BIT(0)
+#define   PBCQ_NEST_BAR_EN_MMIO1    PPC_BIT(1)
+#define   PBCQ_NEST_BAR_EN_PHB      PPC_BIT(2)
+#define   PBCQ_NEST_BAR_EN_IRSN_RX  PPC_BIT(3)
+#define   PBCQ_NEST_BAR_EN_IRSN_TX  PPC_BIT(4)
+
+#define PBCQ_PCI_REGS_COUNT     0x15
+#define PBCQ_PCI_BAR2           0x0b
+
+#define PBCQ_SPCI_REGS_COUNT    0x5
+#define PBCQ_SPCI_ASB_ADDR      0x0
+#define PBCQ_SPCI_ASB_STATUS    0x1
+#define PBCQ_SPCI_ASB_DATA      0x2
+#define PBCQ_SPCI_AIB_CAPP_EN   0x3
+#define PBCQ_SPCI_CAPP_SEC_TMR  0x4
+
+/*
+ * PHB MMIO registers
+ */
+
+/* PHB Fundamental register set A */
+#define PHB_LSI_SOURCE_ID               0x100
+#define   PHB_LSI_SRC_ID_MASK           PPC_BITMASK(5, 12)
+#define   PHB_LSI_SRC_ID_LSH            PPC_BITLSHIFT(12)
+#define PHB_DMA_CHAN_STATUS             0x110
+#define   PHB_DMA_CHAN_ANY_ERR          PPC_BIT(27)
+#define   PHB_DMA_CHAN_ANY_ERR1         PPC_BIT(28)
+#define   PHB_DMA_CHAN_ANY_FREEZE       PPC_BIT(29)
+#define PHB_CPU_LOADSTORE_STATUS        0x120
+#define   PHB_CPU_LS_ANY_ERR            PPC_BIT(27)
+#define   PHB_CPU_LS_ANY_ERR1           PPC_BIT(28)
+#define   PHB_CPU_LS_ANY_FREEZE         PPC_BIT(29)
+#define PHB_DMA_MSI_NODE_ID             0x128
+#define   PHB_DMAMSI_NID_FIXED          PPC_BIT(0)
+#define   PHB_DMAMSI_NID_MASK           PPC_BITMASK(24, 31)
+#define   PHB_DMAMSI_NID_LSH            PPC_BITLSHIFT(31)
+#define PHB_CONFIG_DATA                 0x130
+#define PHB_LOCK0                       0x138
+#define PHB_CONFIG_ADDRESS              0x140
+#define   PHB_CA_ENABLE                 PPC_BIT(0)
+#define   PHB_CA_BUS_MASK               PPC_BITMASK(4, 11)
+#define   PHB_CA_BUS_LSH                PPC_BITLSHIFT(11)
+#define   PHB_CA_DEV_MASK               PPC_BITMASK(12, 16)
+#define   PHB_CA_DEV_LSH                PPC_BITLSHIFT(16)
+#define   PHB_CA_FUNC_MASK              PPC_BITMASK(17, 19)
+#define   PHB_CA_FUNC_LSH               PPC_BITLSHIFT(19)
+#define   PHB_CA_REG_MASK               PPC_BITMASK(20, 31)
+#define   PHB_CA_REG_LSH                PPC_BITLSHIFT(31)
+#define   PHB_CA_PE_MASK                PPC_BITMASK(40, 47)
+#define   PHB_CA_PE_LSH                 PPC_BITLSHIFT(47)
+#define PHB_LOCK1                       0x148
+#define PHB_IVT_BAR                     0x150
+#define   PHB_IVT_BAR_ENABLE            PPC_BIT(0)
+#define   PHB_IVT_BASE_ADDRESS_MASK     PPC_BITMASK(14, 48)
+#define   PHB_IVT_BASE_ADDRESS_LSH      PPC_BITLSHIFT(48)
+#define   PHB_IVT_LENGTH_MASK           PPC_BITMASK(52, 63)
+#define   PHB_IVT_LENGTH_ADDRESS_LSH    PPC_BITLSHIFT(63)
+#define PHB_RBA_BAR                     0x158
+#define   PHB_RBA_BAR_ENABLE            PPC_BIT(0)
+#define   PHB_RBA_BASE_ADDRESS_MASK     PPC_BITMASK(14, 55)
+#define   PHB_RBA_BASE_ADDRESS_LSH      PPC_BITLSHIFT(55)
+#define PHB_PHB3_CONFIG                 0x160
+#define   PHB_PHB3C_64B_TCE_EN          PPC_BIT(2)
+#define   PHB_PHB3C_32BIT_MSI_EN        PPC_BIT(8)
+#define   PHB_PHB3C_64BIT_MSI_EN        PPC_BIT(14)
+#define   PHB_PHB3C_M32_EN              PPC_BIT(16)
+#define PHB_RTT_BAR                     0x168
+#define   PHB_RTT_BAR_ENABLE            PPC_BIT(0)
+#define   PHB_RTT_BASE_ADDRESS_MASK     PPC_BITMASK(14, 46)
+#define   PHB_RTT_BASE_ADDRESS_LSH      PPC_BITLSHIFT(46)
+#define PHB_PELTV_BAR                   0x188
+#define   PHB_PELTV_BAR_ENABLE          PPC_BIT(0)
+#define   PHB_PELTV_BASE_ADDRESS_MASK   PPC_BITMASK(14, 50)
+#define   PHB_PELTV_BASE_ADDRESS_LSH    PPC_BITLSHIFT(50)
+#define PHB_M32_BASE_ADDR               0x190
+#define PHB_M32_BASE_MASK               0x198
+#define PHB_M32_START_ADDR              0x1a0
+#define PHB_PEST_BAR                    0x1a8
+#define   PHB_PEST_BAR_ENABLE           PPC_BIT(0)
+#define   PHB_PEST_BASE_ADDRESS_MASK    PPC_BITMASK(14, 51)
+#define   PHB_PEST_BASE_ADDRESS_LSH     PPC_BITLSHIFT(51)
+#define PHB_M64_UPPER_BITS              0x1f0
+#define PHB_INTREP_TIMER                0x1f8
+#define PHB_DMARD_SYNC                  0x200
+#define PHB_RTC_INVALIDATE              0x208
+#define   PHB_RTC_INVALIDATE_ALL        PPC_BIT(0)
+#define   PHB_RTC_INVALIDATE_RID_MASK   PPC_BITMASK(16, 31)
+#define   PHB_RTC_INVALIDATE_RID_LSH    PPC_BITLSHIFT(31)
+#define PHB_TCE_KILL                    0x210
+#define   PHB_TCE_KILL_ALL              PPC_BIT(0)
+#define PHB_TCE_SPEC_CTL                0x218
+#define PHB_IODA_ADDR                   0x220
+#define   PHB_IODA_AD_AUTOINC           PPC_BIT(0)
+#define   PHB_IODA_AD_TSEL_MASK         PPC_BITMASK(11, 15)
+#define   PHB_IODA_AD_TSEL_LSH          PPC_BITLSHIFT(15)
+#define   PHB_IODA_AD_TADR_MASK         PPC_BITMASK(55, 63)
+#define   PHB_IODA_AD_TADR_LSH          PPC_BITLSHIFT(63)
+#define PHB_IODA_DATA0                  0x228
+#define PHB_FFI_REQUEST                 0x238
+#define   PHB_FFI_LOCK_CLEAR            PPC_BIT(3)
+#define   PHB_FFI_REQUEST_ISN_MASK      PPC_BITMASK(49, 59)
+#define   PHB_FFI_REQUEST_ISN_LSH       PPC_BITLSHIFT(59)
+#define PHB_FFI_LOCK                    0x240
+#define   PHB_FFI_LOCK_STATE            PPC_BIT(0)
+#define PHB_XIVE_UPDATE                 0x248 /* Broken in DD1 */
+#define PHB_PHB3_GEN_CAP                0x250
+#define PHB_PHB3_TCE_CAP                0x258
+#define PHB_PHB3_IRQ_CAP                0x260
+#define PHB_PHB3_EEH_CAP                0x268
+#define PHB_IVC_INVALIDATE              0x2a0
+#define   PHB_IVC_INVALIDATE_ALL        PPC_BIT(0)
+#define   PHB_IVC_INVALIDATE_SID_MASK   PPC_BITMASK(16, 31)
+#define   PHB_IVC_INVALIDATE_SID_LSH    PPC_BITLSHIFT(31)
+#define PHB_IVC_UPDATE                  0x2a8
+#define   PHB_IVC_UPDATE_ENABLE_P       PPC_BIT(0)
+#define   PHB_IVC_UPDATE_ENABLE_Q       PPC_BIT(1)
+#define   PHB_IVC_UPDATE_ENABLE_SERVER  PPC_BIT(2)
+#define   PHB_IVC_UPDATE_ENABLE_PRI     PPC_BIT(3)
+#define   PHB_IVC_UPDATE_ENABLE_GEN     PPC_BIT(4)
+#define   PHB_IVC_UPDATE_ENABLE_CON     PPC_BIT(5)
+#define   PHB_IVC_UPDATE_GEN_MATCH_MASK PPC_BITMASK(6, 7)
+#define   PHB_IVC_UPDATE_GEN_MATCH_LSH  PPC_BITLSHIFT(7)
+#define   PHB_IVC_UPDATE_SERVER_MASK    PPC_BITMASK(8, 23)
+#define   PHB_IVC_UPDATE_SERVER_LSH     PPC_BITLSHIFT(23)
+#define   PHB_IVC_UPDATE_PRI_MASK       PPC_BITMASK(24, 31)
+#define   PHB_IVC_UPDATE_PRI_LSH        PPC_BITLSHIFT(31)
+#define   PHB_IVC_UPDATE_GEN_MASK       PPC_BITMASK(32, 33)
+#define   PHB_IVC_UPDATE_GEN_LSH        PPC_BITLSHIFT(33)
+#define   PHB_IVC_UPDATE_P_MASK         PPC_BITMASK(34, 34)
+#define   PHB_IVC_UPDATE_P_LSH          PPC_BITLSHIFT(34)
+#define   PHB_IVC_UPDATE_Q_MASK         PPC_BITMASK(35, 35)
+#define   PHB_IVC_UPDATE_Q_LSH          PPC_BITLSHIFT(35)
+#define   PHB_IVC_UPDATE_SID_MASK       PPC_BITMASK(48, 63)
+#define   PHB_IVC_UPDATE_SID_LSH        PPC_BITLSHIFT(63)
+#define PHB_PAPR_ERR_INJ_CTL            0x2b0
+#define   PHB_PAPR_ERR_INJ_CTL_INB      PPC_BIT(0)
+#define   PHB_PAPR_ERR_INJ_CTL_OUTB     PPC_BIT(1)
+#define   PHB_PAPR_ERR_INJ_CTL_STICKY   PPC_BIT(2)
+#define   PHB_PAPR_ERR_INJ_CTL_CFG      PPC_BIT(3)
+#define   PHB_PAPR_ERR_INJ_CTL_RD       PPC_BIT(4)
+#define   PHB_PAPR_ERR_INJ_CTL_WR       PPC_BIT(5)
+#define   PHB_PAPR_ERR_INJ_CTL_FREEZE   PPC_BIT(6)
+#define PHB_PAPR_ERR_INJ_ADDR           0x2b8
+#define   PHB_PAPR_ERR_INJ_ADDR_MMIO_MASK       PPC_BITMASK(16, 63)
+#define   PHB_PAPR_ERR_INJ_ADDR_MMIO_LSH        PPC_BITLSHIFT(63)
+#define PHB_PAPR_ERR_INJ_MASK           0x2c0
+#define   PHB_PAPR_ERR_INJ_MASK_CFG_MASK        PPC_BITMASK(4, 11)
+#define   PHB_PAPR_ERR_INJ_MASK_CFG_LSH         PPC_BITLSHIFT(11)
+#define   PHB_PAPR_ERR_INJ_MASK_MMIO_MASK       PPC_BITMASK(16, 63)
+#define   PHB_PAPR_ERR_INJ_MASK_MMIO_LSH        PPC_BITLSHIFT(63)
+#define PHB_ETU_ERR_SUMMARY             0x2c8
+
+/*  UTL registers */
+#define UTL_SYS_BUS_CONTROL             0x400
+#define UTL_STATUS                      0x408
+#define UTL_SYS_BUS_AGENT_STATUS        0x410
+#define UTL_SYS_BUS_AGENT_ERR_SEVERITY  0x418
+#define UTL_SYS_BUS_AGENT_IRQ_EN        0x420
+#define UTL_SYS_BUS_BURST_SZ_CONF       0x440
+#define UTL_REVISION_ID                 0x448
+#define UTL_BCLK_DOMAIN_DBG1            0x460
+#define UTL_BCLK_DOMAIN_DBG2            0x468
+#define UTL_BCLK_DOMAIN_DBG3            0x470
+#define UTL_BCLK_DOMAIN_DBG4            0x478
+#define UTL_BCLK_DOMAIN_DBG5            0x480
+#define UTL_BCLK_DOMAIN_DBG6            0x488
+#define UTL_OUT_POST_HDR_BUF_ALLOC      0x4c0
+#define UTL_OUT_POST_DAT_BUF_ALLOC      0x4d0
+#define UTL_IN_POST_HDR_BUF_ALLOC       0x4e0
+#define UTL_IN_POST_DAT_BUF_ALLOC       0x4f0
+#define UTL_OUT_NP_BUF_ALLOC            0x500
+#define UTL_IN_NP_BUF_ALLOC             0x510
+#define UTL_PCIE_TAGS_ALLOC             0x520
+#define UTL_GBIF_READ_TAGS_ALLOC        0x530
+#define UTL_PCIE_PORT_CONTROL           0x540
+#define UTL_PCIE_PORT_STATUS            0x548
+#define UTL_PCIE_PORT_ERROR_SEV         0x550
+#define UTL_PCIE_PORT_IRQ_EN            0x558
+#define UTL_RC_STATUS                   0x560
+#define UTL_RC_ERR_SEVERITY             0x568
+#define UTL_RC_IRQ_EN                   0x570
+#define UTL_EP_STATUS                   0x578
+#define UTL_EP_ERR_SEVERITY             0x580
+#define UTL_EP_ERR_IRQ_EN               0x588
+#define UTL_PCI_PM_CTRL1                0x590
+#define UTL_PCI_PM_CTRL2                0x598
+#define UTL_GP_CTL1                     0x5a0
+#define UTL_GP_CTL2                     0x5a8
+#define UTL_PCLK_DOMAIN_DBG1            0x5b0
+#define UTL_PCLK_DOMAIN_DBG2            0x5b8
+#define UTL_PCLK_DOMAIN_DBG3            0x5c0
+#define UTL_PCLK_DOMAIN_DBG4            0x5c8
+
+/* PCI-E Stack registers */
+#define PHB_PCIE_SYSTEM_CONFIG          0x600
+#define PHB_PCIE_BUS_NUMBER             0x608
+#define PHB_PCIE_SYSTEM_TEST            0x618
+#define PHB_PCIE_LINK_MANAGEMENT        0x630
+#define   PHB_PCIE_LM_LINK_ACTIVE       PPC_BIT(8)
+#define PHB_PCIE_DLP_TRAIN_CTL          0x640
+#define   PHB_PCIE_DLP_TCTX_DISABLE     PPC_BIT(1)
+#define   PHB_PCIE_DLP_TCRX_DISABLED    PPC_BIT(16)
+#define   PHB_PCIE_DLP_INBAND_PRESENCE  PPC_BIT(19)
+#define   PHB_PCIE_DLP_TC_DL_LINKUP     PPC_BIT(21)
+#define   PHB_PCIE_DLP_TC_DL_PGRESET    PPC_BIT(22)
+#define   PHB_PCIE_DLP_TC_DL_LINKACT    PPC_BIT(23)
+#define PHB_PCIE_SLOP_LOOPBACK_STATUS   0x648
+#define PHB_PCIE_SYS_LINK_INIT          0x668
+#define PHB_PCIE_UTL_CONFIG             0x670
+#define PHB_PCIE_DLP_CONTROL            0x678
+#define PHB_PCIE_UTL_ERRLOG1            0x680
+#define PHB_PCIE_UTL_ERRLOG2            0x688
+#define PHB_PCIE_UTL_ERRLOG3            0x690
+#define PHB_PCIE_UTL_ERRLOG4            0x698
+#define PHB_PCIE_DLP_ERRLOG1            0x6a0
+#define PHB_PCIE_DLP_ERRLOG2            0x6a8
+#define PHB_PCIE_DLP_ERR_STATUS         0x6b0
+#define PHB_PCIE_DLP_ERR_COUNTERS       0x6b8
+#define PHB_PCIE_UTL_ERR_INJECT         0x6c0
+#define PHB_PCIE_TLDLP_ERR_INJECT       0x6c8
+#define PHB_PCIE_LANE_EQ_CNTL0          0x6d0
+#define PHB_PCIE_LANE_EQ_CNTL1          0x6d8
+#define PHB_PCIE_LANE_EQ_CNTL2          0x6e0
+#define PHB_PCIE_LANE_EQ_CNTL3          0x6e8
+#define PHB_PCIE_STRAPPING              0x700
+
+/* Fundamental register set B */
+#define PHB_VERSION                     0x800
+#define PHB_RESET                       0x808
+#define PHB_CONTROL                     0x810
+#define   PHB_CTRL_IVE_128_BYTES        PPC_BIT(24)
+#define PHB_AIB_RX_CRED_INIT_TIMER      0x818
+#define PHB_AIB_RX_CMD_CRED             0x820
+#define PHB_AIB_RX_DATA_CRED            0x828
+#define PHB_AIB_TX_CMD_CRED             0x830
+#define PHB_AIB_TX_DATA_CRED            0x838
+#define PHB_AIB_TX_CHAN_MAPPING         0x840
+#define PHB_AIB_TAG_ENABLE              0x858
+#define PHB_AIB_FENCE_CTRL              0x860
+#define PHB_TCE_TAG_ENABLE              0x868
+#define PHB_TCE_WATERMARK               0x870
+#define PHB_TIMEOUT_CTRL1               0x878
+#define PHB_TIMEOUT_CTRL2               0x880
+#define PHB_QUIESCE_DMA_G               0x888
+#define PHB_AIB_TAG_STATUS              0x900
+#define PHB_TCE_TAG_STATUS              0x908
+
+/* FIR & Error registers */
+#define PHB_LEM_FIR_ACCUM               0xc00
+#define PHB_LEM_FIR_AND_MASK            0xc08
+#define PHB_LEM_FIR_OR_MASK             0xc10
+#define PHB_LEM_ERROR_MASK              0xc18
+#define PHB_LEM_ERROR_AND_MASK          0xc20
+#define PHB_LEM_ERROR_OR_MASK           0xc28
+#define PHB_LEM_ACTION0                 0xc30
+#define PHB_LEM_ACTION1                 0xc38
+#define PHB_LEM_WOF                     0xc40
+#define PHB_ERR_STATUS                  0xc80
+#define PHB_ERR1_STATUS                 0xc88
+#define PHB_ERR_INJECT                  0xc90
+#define PHB_ERR_LEM_ENABLE              0xc98
+#define PHB_ERR_IRQ_ENABLE              0xca0
+#define PHB_ERR_FREEZE_ENABLE           0xca8
+#define PHB_ERR_AIB_FENCE_ENABLE        0xcb0
+#define PHB_ERR_LOG_0                   0xcc0
+#define PHB_ERR_LOG_1                   0xcc8
+#define PHB_ERR_STATUS_MASK             0xcd0
+#define PHB_ERR1_STATUS_MASK            0xcd8
+
+#define PHB_OUT_ERR_STATUS              0xd00
+#define PHB_OUT_ERR1_STATUS             0xd08
+#define PHB_OUT_ERR_INJECT              0xd10
+#define PHB_OUT_ERR_LEM_ENABLE          0xd18
+#define PHB_OUT_ERR_IRQ_ENABLE          0xd20
+#define PHB_OUT_ERR_FREEZE_ENABLE       0xd28
+#define PHB_OUT_ERR_AIB_FENCE_ENABLE    0xd30
+#define PHB_OUT_ERR_LOG_0               0xd40
+#define PHB_OUT_ERR_LOG_1               0xd48
+#define PHB_OUT_ERR_STATUS_MASK         0xd50
+#define PHB_OUT_ERR1_STATUS_MASK        0xd58
+
+#define PHB_INA_ERR_STATUS              0xd80
+#define PHB_INA_ERR1_STATUS             0xd88
+#define PHB_INA_ERR_INJECT              0xd90
+#define PHB_INA_ERR_LEM_ENABLE          0xd98
+#define PHB_INA_ERR_IRQ_ENABLE          0xda0
+#define PHB_INA_ERR_FREEZE_ENABLE       0xda8
+#define PHB_INA_ERR_AIB_FENCE_ENABLE    0xdb0
+#define PHB_INA_ERR_LOG_0               0xdc0
+#define PHB_INA_ERR_LOG_1               0xdc8
+#define PHB_INA_ERR_STATUS_MASK         0xdd0
+#define PHB_INA_ERR1_STATUS_MASK        0xdd8
+
+#define PHB_INB_ERR_STATUS              0xe00
+#define PHB_INB_ERR1_STATUS             0xe08
+#define PHB_INB_ERR_INJECT              0xe10
+#define PHB_INB_ERR_LEM_ENABLE          0xe18
+#define PHB_INB_ERR_IRQ_ENABLE          0xe20
+#define PHB_INB_ERR_FREEZE_ENABLE       0xe28
+#define PHB_INB_ERR_AIB_FENCE_ENABLE    0xe30
+#define PHB_INB_ERR_LOG_0               0xe40
+#define PHB_INB_ERR_LOG_1               0xe48
+#define PHB_INB_ERR_STATUS_MASK         0xe50
+#define PHB_INB_ERR1_STATUS_MASK        0xe58
+
+/* Performance monitor & Debug registers */
+#define PHB_TRACE_CONTROL               0xf80
+#define PHB_PERFMON_CONFIG              0xf88
+#define PHB_PERFMON_CTR0                0xf90
+#define PHB_PERFMON_CTR1                0xf98
+#define PHB_PERFMON_CTR2                0xfa0
+#define PHB_PERFMON_CTR3                0xfa8
+#define PHB_HOTPLUG_OVERRIDE            0xfb0
+#define   PHB_HPOVR_FORCE_RESAMPLE      PPC_BIT(9)
+#define   PHB_HPOVR_PRESENCE_A          PPC_BIT(10)
+#define   PHB_HPOVR_PRESENCE_B          PPC_BIT(11)
+#define   PHB_HPOVR_LINK_ACTIVE         PPC_BIT(12)
+#define   PHB_HPOVR_LINK_BIFURCATED     PPC_BIT(13)
+#define   PHB_HPOVR_LINK_LANE_SWAPPED   PPC_BIT(14)
+
+/*
+ * IODA2 on-chip tables
+ */
+
+#define IODA2_TBL_LIST          1
+#define IODA2_TBL_LXIVT         2
+#define IODA2_TBL_IVC_CAM       3
+#define IODA2_TBL_RBA           4
+#define IODA2_TBL_RCAM          5
+#define IODA2_TBL_MRT           6
+#define IODA2_TBL_PESTA         7
+#define IODA2_TBL_PESTB         8
+#define IODA2_TBL_TVT           9
+#define IODA2_TBL_TCAM          10
+#define IODA2_TBL_TDR           11
+#define IODA2_TBL_M64BT         16
+#define IODA2_TBL_M32DT         17
+#define IODA2_TBL_PEEV          20
+
+/* LXIVT */
+#define IODA2_LXIVT_SERVER_MASK         PPC_BITMASK(8, 23)
+#define IODA2_LXIVT_SERVER_LSH          PPC_BITLSHIFT(23)
+#define IODA2_LXIVT_PRIORITY_MASK       PPC_BITMASK(24, 31)
+#define IODA2_LXIVT_PRIORITY_LSH        PPC_BITLSHIFT(31)
+#define IODA2_LXIVT_NODE_ID_MASK        PPC_BITMASK(56, 63)
+#define IODA2_LXIVT_NODE_ID_LSH         PPC_BITLSHIFT(63)
+
+/* IVT */
+#define IODA2_IVT_SERVER_MASK           PPC_BITMASK(0, 23)
+#define IODA2_IVT_SERVER_LSH            PPC_BITLSHIFT(23)
+#define IODA2_IVT_PRIORITY_MASK         PPC_BITMASK(24, 31)
+#define IODA2_IVT_PRIORITY_LSH          PPC_BITLSHIFT(31)
+#define IODA2_IVT_GEN_MASK              PPC_BITMASK(37, 38)
+#define IODA2_IVT_GEN_LSH               PPC_BITLSHIFT(38)
+#define IODA2_IVT_P_MASK                PPC_BITMASK(39, 39)
+#define IODA2_IVT_P_LSH                 PPC_BITLSHIFT(39)
+#define IODA2_IVT_Q_MASK                PPC_BITMASK(47, 47)
+#define IODA2_IVT_Q_LSH                 PPC_BITLSHIFT(47)
+#define IODA2_IVT_PE_MASK               PPC_BITMASK(48, 63)
+#define IODA2_IVT_PE_LSH                PPC_BITLSHIFT(63)
+
+/* TVT */
+#define IODA2_TVT_TABLE_ADDR_MASK       PPC_BITMASK(0, 47)
+#define IODA2_TVT_TABLE_ADDR_LSH        PPC_BITLSHIFT(47)
+#define IODA2_TVT_NUM_LEVELS_MASK       PPC_BITMASK(48, 50)
+#define IODA2_TVT_NUM_LEVELS_LSH        PPC_BITLSHIFT(50)
+#define   IODA2_TVE_1_LEVEL     0
+#define   IODA2_TVE_2_LEVELS    1
+#define   IODA2_TVE_3_LEVELS    2
+#define   IODA2_TVE_4_LEVELS    3
+#define   IODA2_TVE_5_LEVELS    4
+#define IODA2_TVT_TCE_TABLE_SIZE_MASK   PPC_BITMASK(51, 55)
+#define IODA2_TVT_TCE_TABLE_SIZE_LSH    PPC_BITLSHIFT(55)
+#define IODA2_TVT_IO_PSIZE_MASK         PPC_BITMASK(59, 63)
+#define IODA2_TVT_IO_PSIZE_LSH          PPC_BITLSHIFT(63)
+
+/* PESTA */
+#define IODA2_PESTA_MMIO_FROZEN         PPC_BIT(0)
+
+/* PESTB */
+#define IODA2_PESTB_DMA_STOPPED         PPC_BIT(0)
+
+/* M32DT */
+#define IODA2_M32DT_PE_MASK             PPC_BITMASK(8, 15)
+#define IODA2_M32DT_PE_LSH              PPC_BITLSHIFT(15)
+
+/* M64BT */
+#define IODA2_M64BT_ENABLE              PPC_BIT(0)
+#define IODA2_M64BT_SINGLE_PE           PPC_BIT(1)
+#define IODA2_M64BT_BASE_MASK           PPC_BITMASK(2, 31)
+#define IODA2_M64BT_BASE_LSH            PPC_BITLSHIFT(31)
+#define IODA2_M64BT_MASK_MASK           PPC_BITMASK(34, 63)
+#define IODA2_M64BT_MASK_LSH            PPC_BITLSHIFT(63)
+#define IODA2_M64BT_SINGLE_BASE_MASK    PPC_BITMASK(2, 26)
+#define IODA2_M64BT_SINGLE_BASE_LSH     PPC_BITLSHIFT(26)
+#define IODA2_M64BT_PE_HI_MASK          PPC_BITMASK(27, 31)
+#define IODA2_M64BT_PE_HI_LSH           PPC_BITLSHIFT(31)
+#define IODA2_M64BT_SINGLE_MASK_MASK    PPC_BITMASK(34, 58)
+#define IODA2_M64BT_SINGLE_MASK_LSH     PPC_BITLSHIFT(58)
+#define IODA2_M64BT_PE_LOW_MASK         PPC_BITMASK(59, 63)
+#define IODA2_M64BT_PE_LOW_LSH          PPC_BITLSHIFT(63)
+
+/*
+ * IODA2 in-memory tables
+ */
+
+/* PEST
+ *
+ * 2x8 bytes entries, PEST0 and PEST1
+ */
+
+#define IODA2_PEST0_MMIO_CAUSE          PPC_BIT(2)
+#define IODA2_PEST0_CFG_READ            PPC_BIT(3)
+#define IODA2_PEST0_CFG_WRITE           PPC_BIT(4)
+#define IODA2_PEST0_TTYPE_MASK          PPC_BITMASK(5, 7)
+#define IODA2_PEST0_TTYPE_LSH           PPC_BITLSHIFT(7)
+#define   PEST_TTYPE_DMA_WRITE          0
+#define   PEST_TTYPE_MSI                1
+#define   PEST_TTYPE_DMA_READ           2
+#define   PEST_TTYPE_DMA_READ_RESP      3
+#define   PEST_TTYPE_MMIO_LOAD          4
+#define   PEST_TTYPE_MMIO_STORE         5
+#define   PEST_TTYPE_OTHER              7
+#define IODA2_PEST0_CA_RETURN           PPC_BIT(8)
+#define IODA2_PEST0_UTL_RTOS_TIMEOUT    PPC_BIT(8) /* Same bit as CA return */
+#define IODA2_PEST0_UR_RETURN           PPC_BIT(9)
+#define IODA2_PEST0_UTL_NONFATAL        PPC_BIT(10)
+#define IODA2_PEST0_UTL_FATAL           PPC_BIT(11)
+#define IODA2_PEST0_PARITY_UE           PPC_BIT(13)
+#define IODA2_PEST0_UTL_CORRECTABLE     PPC_BIT(14)
+#define IODA2_PEST0_UTL_INTERRUPT       PPC_BIT(15)
+#define IODA2_PEST0_MMIO_XLATE          PPC_BIT(16)
+#define IODA2_PEST0_IODA2_ERROR         PPC_BIT(16) /* Same bit as MMIO xlate */
+#define IODA2_PEST0_TCE_PAGE_FAULT      PPC_BIT(18)
+#define IODA2_PEST0_TCE_ACCESS_FAULT    PPC_BIT(19)
+#define IODA2_PEST0_DMA_RESP_TIMEOUT    PPC_BIT(20)
+#define IODA2_PEST0_AIB_SIZE_INVALID    PPC_BIT(21)
+#define IODA2_PEST0_LEM_BIT_MASK        PPC_BITMASK(26, 31)
+#define IODA2_PEST0_LEM_BIT_LSH         PPC_BITLSHIFT(31)
+#define IODA2_PEST0_RID_MASK            PPC_BITMASK(32, 47)
+#define IODA2_PEST0_RID_LSH             PPC_BITLSHIFT(47)
+#define IODA2_PEST0_MSI_DATA_MASK       PPC_BITMASK(48, 63)
+#define IODA2_PEST0_MSI_DATA_LSH        PPC_BITLSHIFT(63)
+
+#define IODA2_PEST1_FAIL_ADDR_MASK      PPC_BITMASK(3, 63)
+#define IODA2_PEST1_FAIL_ADDR_LSH       PPC_BITLSHIFT(63)
+
+
+#endif /* PCI_HOST_PNV_PHB3_REGS_H */
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 86d5f54e5459..c875c552fddc 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -25,6 +25,7 @@ 
 #include "hw/ppc/pnv_lpc.h"
 #include "hw/ppc/pnv_psi.h"
 #include "hw/ppc/pnv_occ.h"
+#include "hw/pci-host/pnv_phb3.h"
 
 #define TYPE_PNV_CHIP "pnv-chip"
 #define PNV_CHIP(obj) OBJECT_CHECK(PnvChip, (obj), TYPE_PNV_CHIP)
@@ -57,6 +58,8 @@  typedef struct PnvChip {
     MemoryRegion xscom_mmio;
     MemoryRegion xscom;
     AddressSpace xscom_as;
+
+    uint32_t     num_phbs;
 } PnvChip;
 
 #define TYPE_PNV8_CHIP "pnv8-chip"
@@ -72,6 +75,9 @@  typedef struct Pnv8Chip {
     PnvLpcController lpc;
     PnvPsi       psi;
     PnvOCC       occ;
+
+#define PNV_MAX_CHIP_PHB 4
+    PnvPHB3      phbs[PNV_MAX_CHIP_PHB];
 } Pnv8Chip;
 
 #define TYPE_PNV9_CHIP "pnv9-chip"
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 255b26a5aaf6..e7ebffa5cf8a 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -73,6 +73,15 @@  typedef struct PnvXScomInterfaceClass {
 #define PNV_XSCOM_OCC_BASE        0x0066000
 #define PNV_XSCOM_OCC_SIZE        0x6000
 
+#define PNV_XSCOM_PBCQ_NEST_BASE  0x2012000
+#define PNV_XSCOM_PBCQ_NEST_SIZE  0x46
+
+#define PNV_XSCOM_PBCQ_PCI_BASE   0x9012000
+#define PNV_XSCOM_PBCQ_PCI_SIZE   0x15
+
+#define PNV_XSCOM_PBCQ_SPCI_BASE  0x9013c00
+#define PNV_XSCOM_PBCQ_SPCI_SIZE  0x5
+
 extern void pnv_xscom_realize(PnvChip *chip, Error **errp);
 extern int pnv_dt_xscom(PnvChip *chip, void *fdt, int offset);
 
diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
index 6ac8a9392da6..966a996c2eac 100644
--- a/include/hw/ppc/xics.h
+++ b/include/hw/ppc/xics.h
@@ -194,6 +194,7 @@  void icp_set_mfrr(ICPState *icp, uint8_t mfrr);
 uint32_t icp_accept(ICPState *ss);
 uint32_t icp_ipoll(ICPState *ss, uint32_t *mfrr);
 void icp_eoi(ICPState *icp, uint32_t xirr);
+void icp_irq(ICSState *ics, int server, int nr, uint8_t priority);
 
 void ics_simple_write_xive(ICSState *ics, int nr, int server,
                            uint8_t priority, uint8_t saved_priority);
diff --git a/hw/intc/xics.c b/hw/intc/xics.c
index b9f1a3c97214..59e2a5217dcc 100644
--- a/hw/intc/xics.c
+++ b/hw/intc/xics.c
@@ -230,7 +230,7 @@  void icp_eoi(ICPState *icp, uint32_t xirr)
     }
 }
 
-static void icp_irq(ICSState *ics, int server, int nr, uint8_t priority)
+void icp_irq(ICSState *ics, int server, int nr, uint8_t priority)
 {
     ICPState *icp = xics_icp_get(ics->xics, server);
 
diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c
new file mode 100644
index 000000000000..d4b182a45f7e
--- /dev/null
+++ b/hw/pci-host/pnv_phb3.c
@@ -0,0 +1,1089 @@ 
+/*
+ * QEMU PowerPC PowerNV PHB3 model
+ *
+ * Copyright (c) 2014-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/pci-host/pnv_phb3_regs.h"
+#include "hw/pci-host/pnv_phb3.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pcie_port.h"
+
+static PCIDevice *pnv_phb3_find_cfg_dev(PnvPHB3 *phb)
+{
+    PCIHostState *pci = PCI_HOST_BRIDGE(phb);
+    uint64_t addr = phb->regs[PHB_CONFIG_ADDRESS >> 3];
+    uint8_t bus, devfn;
+
+    if (!(addr >> 63)) {
+        return NULL;
+    }
+    bus = (addr >> 52) & 0xff;
+    devfn = (addr >> 44) & 0xff;
+
+    return pci_find_device(pci->bus, bus, devfn);
+}
+
+/*
+ * The CONFIG_DATA register expects little endian accesses, but as the
+ * region is big endian, we have to swap the value.
+ */
+static void pnv_phb3_config_write(PnvPHB3 *phb, unsigned off,
+                                  unsigned size, uint64_t val)
+{
+    uint32_t cfg_addr, limit;
+    PCIDevice *pdev;
+
+    pdev = pnv_phb3_find_cfg_dev(phb);
+    if (!pdev) {
+        return;
+    }
+    cfg_addr = (phb->regs[PHB_CONFIG_ADDRESS >> 3] >> 32) & 0xfff;
+    cfg_addr |= off;
+    limit = pci_config_size(pdev);
+    if (limit <= cfg_addr) {
+        /* conventional pci device can be behind pcie-to-pci bridge.
+           256 <= addr < 4K has no effects. */
+        return;
+    }
+    switch (size) {
+    case 1:
+        break;
+    case 2:
+        val = bswap16(val);
+        break;
+    case 4:
+        val = bswap32(val);
+        break;
+    default:
+        g_assert_not_reached();
+    }
+    pci_host_config_write_common(pdev, cfg_addr, limit, val, size);
+}
+
+static uint64_t pnv_phb3_config_read(PnvPHB3 *phb, unsigned off,
+                                     unsigned size)
+{
+    uint32_t cfg_addr, limit;
+    PCIDevice *pdev;
+    uint64_t val;
+
+    pdev = pnv_phb3_find_cfg_dev(phb);
+    if (!pdev) {
+        return ~0ull;
+    }
+    cfg_addr = (phb->regs[PHB_CONFIG_ADDRESS >> 3] >> 32) & 0xffc;
+    cfg_addr |= off;
+    limit = pci_config_size(pdev);
+    if (limit <= cfg_addr) {
+        /* conventional pci device can be behind pcie-to-pci bridge.
+           256 <= addr < 4K has no effects. */
+        return ~0ull;
+    }
+    val = pci_host_config_read_common(pdev, cfg_addr, limit, size);
+    switch (size) {
+    case 1:
+        return val;
+    case 2:
+        return bswap16(val);
+    case 4:
+        return bswap32(val);
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static void pnv_phb3_check_m32(PnvPHB3 *phb)
+{
+    uint64_t base, start, size;
+    MemoryRegion *parent;
+    PnvPBCQState *pbcq = &phb->pbcq;
+
+    if (phb->m32_mapped) {
+        memory_region_del_subregion(phb->mr_m32.container, &phb->mr_m32);
+        phb->m32_mapped = false;
+    }
+
+    /* Disabled ? move on with life ... */
+    if (!(phb->regs[PHB_PHB3_CONFIG >> 3] & PHB_PHB3C_M32_EN)) {
+        return;
+    }
+
+    /* Grab geometry from registers */
+    base = phb->regs[PHB_M32_BASE_ADDR >> 3];
+    start = phb->regs[PHB_M32_START_ADDR >> 3];
+    size = ~(phb->regs[PHB_M32_BASE_MASK >> 3] | 0xfffc000000000000ull) + 1;
+
+    /* Check if it matches an enabled MMIO region in the PBCQ */
+    if (pbcq->mmio0_mapped && base >= pbcq->mmio0_base &&
+        (base + size) <= (pbcq->mmio0_base + pbcq->mmio0_size)) {
+        parent = &pbcq->mmbar0;
+        base -= pbcq->mmio0_base;
+    } else if (pbcq->mmio1_mapped && base >= pbcq->mmio1_base &&
+        (base + size) <= (pbcq->mmio1_base + pbcq->mmio1_size)) {
+        parent = &pbcq->mmbar1;
+        base -= pbcq->mmio1_base;
+    } else {
+        return;
+    }
+
+    /* Create alias */
+    memory_region_init_alias(&phb->mr_m32, OBJECT(phb), "phb3-m32",
+                             &phb->pci_mmio, start, size);
+    memory_region_add_subregion(parent, base, &phb->mr_m32);
+    phb->m32_mapped = true;
+}
+
+static void pnv_phb3_check_m64(PnvPHB3 *phb, uint32_t index)
+{
+    uint64_t base, start, size, m64;
+    MemoryRegion *parent;
+    PnvPBCQState *pbcq = &phb->pbcq;
+
+    if (phb->m64_mapped[index]) {
+        /* Should we destroy it in RCU friendly way... ? */
+        memory_region_del_subregion(phb->mr_m64[index].container,
+                                    &phb->mr_m64[index]);
+        phb->m64_mapped[index] = false;
+    }
+
+    /* Get table entry */
+    m64 = phb->ioda_M64BT[index];
+
+    /* Disabled ? move on with life ... */
+    if (!(m64 & IODA2_M64BT_ENABLE)) {
+        return;
+    }
+
+    /* Grab geometry from registers */
+    base = GETFIELD(IODA2_M64BT_BASE, m64) << 20;
+    if (m64 & IODA2_M64BT_SINGLE_PE) {
+        base &= ~0x1ffffffull;
+    }
+    size = GETFIELD(IODA2_M64BT_MASK, m64) << 20;
+    size |= 0xfffc000000000000ull;
+    size = ~size + 1;
+    start = base | (phb->regs[PHB_M64_UPPER_BITS >> 3]);
+
+    /* Check if it matches an enabled MMIO region in the PBCQ */
+    if (pbcq->mmio0_mapped && base >= pbcq->mmio0_base &&
+        (base + size) <= (pbcq->mmio0_base + pbcq->mmio0_size)) {
+        parent = &pbcq->mmbar0;
+        base -= pbcq->mmio0_base;
+    } else if (pbcq->mmio1_mapped && base >= pbcq->mmio1_base &&
+        (base + size) <= (pbcq->mmio1_base + pbcq->mmio1_size)) {
+        parent = &pbcq->mmbar1;
+        base -= pbcq->mmio1_base;
+    } else {
+        return;
+    }
+
+    /* Create alias */
+    memory_region_init_alias(&phb->mr_m64[index], OBJECT(phb), "phb3-m64",
+                             &phb->pci_mmio, start, size);
+    memory_region_add_subregion(parent, base, &phb->mr_m64[index]);
+    phb->m64_mapped[index] = true;
+}
+
+static void pnv_phb3_check_all_m64s(PnvPHB3 *phb)
+{
+    uint64_t i;
+
+    for (i = 0; i < PNV_PHB3_NUM_M64; i++) {
+        pnv_phb3_check_m64(phb, i);
+    }
+}
+
+static void pnv_phb3_lxivt_write(PnvPHB3 *phb, unsigned idx, uint64_t val)
+{
+    ICSState *ics = &phb->lsis;
+    uint8_t server, prio;
+
+    phb->ioda_LXIVT[idx] = val & (IODA2_LXIVT_SERVER_MASK |
+                                  IODA2_LXIVT_PRIORITY_MASK |
+                                  IODA2_LXIVT_NODE_ID_MASK);
+    server = GETFIELD(IODA2_LXIVT_SERVER, val);
+    prio = GETFIELD(IODA2_LXIVT_PRIORITY, val);
+
+    /*
+     * The low order 2 bits are the link pointer (Type II interrupts).
+     * Shift back to get a valid IRQ server.
+     */
+    server >>= 2;
+
+    ics_simple_write_xive(ics, idx, server, prio, prio);
+}
+
+static uint64_t *pnv_phb3_ioda_access(PnvPHB3 *phb,
+                                      unsigned *out_table, unsigned *out_idx)
+{
+    uint64_t adreg = phb->regs[PHB_IODA_ADDR >> 3];
+    unsigned int index = GETFIELD(PHB_IODA_AD_TADR, adreg);
+    unsigned int table = GETFIELD(PHB_IODA_AD_TSEL, adreg);
+    unsigned int mask;
+    uint64_t *tptr = NULL;
+
+    switch (table) {
+    case IODA2_TBL_LIST:
+        tptr = phb->ioda_LIST;
+        mask = 7;
+        break;
+    case IODA2_TBL_LXIVT:
+        tptr = phb->ioda_LXIVT;
+        mask = 7;
+        break;
+    case IODA2_TBL_IVC_CAM:
+    case IODA2_TBL_RBA:
+        mask = 31;
+        break;
+    case IODA2_TBL_RCAM:
+        mask = 63;
+        break;
+    case IODA2_TBL_MRT:
+        mask = 7;
+        break;
+    case IODA2_TBL_PESTA:
+    case IODA2_TBL_PESTB:
+        mask = 255;
+        break;
+    case IODA2_TBL_TVT:
+        tptr = phb->ioda_TVT;
+        mask = 511;
+        break;
+    case IODA2_TBL_TCAM:
+    case IODA2_TBL_TDR:
+        mask = 63;
+        break;
+    case IODA2_TBL_M64BT:
+        tptr = phb->ioda_M64BT;
+        mask = 15;
+        break;
+    case IODA2_TBL_M32DT:
+        tptr = phb->ioda_MDT;
+        mask = 255;
+        break;
+    case IODA2_TBL_PEEV:
+        tptr = phb->ioda_PEEV;
+        mask = 3;
+        break;
+    default:
+        return NULL;
+    }
+    index &= mask;
+    if (out_idx) {
+        *out_idx = index;
+    }
+    if (out_table) {
+        *out_table = table;
+    }
+    if (adreg & PHB_IODA_AD_AUTOINC) {
+        index = (index + 1) & mask;
+        adreg = SETFIELD(PHB_IODA_AD_TADR, adreg, index);
+    }
+    if (tptr) {
+        tptr += index;
+    }
+    phb->regs[PHB_IODA_ADDR >> 3] = adreg;
+    return tptr;
+}
+
+static uint64_t pnv_phb3_ioda_read(PnvPHB3 *phb)
+{
+        unsigned table;
+        uint64_t *tptr;
+
+        tptr = pnv_phb3_ioda_access(phb, &table, NULL);
+        if (!tptr) {
+            /* Return 0 on unsupported tables, not ff's */
+            return 0;
+        }
+        return *tptr;
+}
+
+static void pnv_phb3_ioda_write(PnvPHB3 *phb, uint64_t val)
+{
+        unsigned table, idx;
+        uint64_t *tptr;
+
+        tptr = pnv_phb3_ioda_access(phb, &table, &idx);
+        if (!tptr) {
+            return;
+        }
+
+        /* Handle side effects */
+        switch (table) {
+        case IODA2_TBL_LXIVT:
+            pnv_phb3_lxivt_write(phb, idx, val);
+            break;
+        case IODA2_TBL_M64BT:
+            *tptr = val;
+            pnv_phb3_check_m64(phb, idx);
+            break;
+        default:
+            *tptr = val;
+        }
+}
+
+/* This is called whenever the PHB LSI, MSI source ID register or
+ * the PBCQ irq filters are written.
+ */
+void pnv_phb3_remap_irqs(PnvPHB3 *phb)
+{
+    ICSState *ics = &phb->lsis;
+    uint32_t local, global, count, mask, comp;
+    uint64_t baren;
+    PnvPBCQState *pbcq = &phb->pbcq;
+
+    /* First check if we are enabled. Unlike real HW we don't separate TX and RX
+     * so we enable if both are set
+     */
+    baren = pbcq->nest_regs[PBCQ_NEST_BAR_EN];
+    if (!(baren & PBCQ_NEST_BAR_EN_IRSN_RX) ||
+        !(baren & PBCQ_NEST_BAR_EN_IRSN_TX)) {
+        ics->offset = 0;
+        return;
+    }
+
+    /* Grab local LSI source ID */
+    local = GETFIELD(PHB_LSI_SRC_ID, phb->regs[PHB_LSI_SOURCE_ID >> 3]) << 3;
+
+    /* Grab global one and compare */
+    global = GETFIELD(PBCQ_NEST_LSI_SRC,
+                      pbcq->nest_regs[PBCQ_NEST_LSI_SRC_ID]) << 3;
+    if (global != local) {
+        /* This happens during initialization, let's come back when we
+         * are properly configured
+         */
+        ics->offset = 0;
+        return;
+    }
+
+    /* Get the base on the powerbus */
+    comp = GETFIELD(PBCQ_NEST_IRSN_COMP,
+                    pbcq->nest_regs[PBCQ_NEST_IRSN_COMPARE]);
+    mask = GETFIELD(PBCQ_NEST_IRSN_COMP,
+                    pbcq->nest_regs[PBCQ_NEST_IRSN_MASK]);
+    count = ((~mask) + 1) & 0x7ffff;
+    phb->total_irq = count;
+
+    /* Sanity checks */
+    if ((global + PNV_PHB3_NUM_LSI) > count) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "LSIs out of reach: LSI base=%d total irq=%d",
+                      global, count);
+    }
+
+    if (count > 2048) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "More interrupts than supported: %d", count);
+    }
+
+    if ((comp & mask) != comp) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "IRQ compare bits not in mask: comp=0x%x mask=0x%x",
+                      comp, mask);
+        comp &= mask;
+    }
+    /* Setup LSI offset */
+    ics->offset = comp + global;
+
+    /* Setup MSI offset */
+    pnv_phb3_msi_update_config(&phb->msis, comp, count - PNV_PHB3_NUM_LSI);
+}
+
+static void pnv_phb3_lsi_src_id_write(PnvPHB3 *phb, uint64_t val)
+{
+    /* Sanitize content */
+    val &= PHB_LSI_SRC_ID_MASK;
+    phb->regs[PHB_LSI_SOURCE_ID >> 3] = val;
+    pnv_phb3_remap_irqs(phb);
+}
+
+static void pnv_phb3_rtc_invalidate(PnvPHB3 *phb, uint64_t val)
+{
+    PnvPhb3DMASpace *ds;
+
+    /* Always invalidate all for now ... */
+    QLIST_FOREACH(ds, &phb->dma_spaces, list) {
+        ds->pe_num = PHB_INVALID_PE;
+    }
+}
+
+
+static void pnv_phb3_update_msi_regions(PnvPhb3DMASpace *ds)
+{
+    uint64_t cfg = ds->phb->regs[PHB_PHB3_CONFIG >> 3];
+
+    if (cfg & PHB_PHB3C_32BIT_MSI_EN) {
+        if (!ds->msi32_mapped) {
+            memory_region_add_subregion(MEMORY_REGION(&ds->dma_mr),
+                                        0xffff0000, &ds->msi32_mr);
+            ds->msi32_mapped = true;
+        }
+    } else {
+        if (ds->msi32_mapped) {
+            memory_region_del_subregion(MEMORY_REGION(&ds->dma_mr),
+                                        &ds->msi32_mr);
+            ds->msi32_mapped = false;
+        }
+    }
+
+    if (cfg & PHB_PHB3C_64BIT_MSI_EN) {
+        if (!ds->msi64_mapped) {
+            memory_region_add_subregion(MEMORY_REGION(&ds->dma_mr),
+                                        (1ull << 60), &ds->msi64_mr);
+            ds->msi64_mapped = true;
+        }
+    } else {
+        if (ds->msi64_mapped) {
+            memory_region_del_subregion(MEMORY_REGION(&ds->dma_mr),
+                                        &ds->msi64_mr);
+            ds->msi64_mapped = false;
+        }
+    }
+}
+
+static void pnv_phb3_update_all_msi_regions(PnvPHB3 *phb)
+{
+    PnvPhb3DMASpace *ds;
+
+    QLIST_FOREACH(ds, &phb->dma_spaces, list) {
+        pnv_phb3_update_msi_regions(ds);
+    }
+}
+
+void pnv_phb3_reg_write(void *opaque, hwaddr off, uint64_t val, unsigned size)
+{
+    PnvPHB3 *phb = opaque;
+    bool changed;
+
+    /* Special case configuration data */
+    if ((off & 0xfffc) == PHB_CONFIG_DATA) {
+        pnv_phb3_config_write(phb, off & 0x3, size, val);
+        return;
+    }
+
+    /* Other registers are 64-bit only */
+    if (size != 8 || off & 0x7) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Invalid register access, offset: 0x%"PRIx64" size: %d",
+                      off, size);
+        return;
+    }
+
+    /* Handle masking */
+    switch (off) {
+    case PHB_M64_UPPER_BITS:
+        val &= 0xfffc000000000000ull;
+        break;
+    }
+
+    /* Record whether it changed */
+    changed = phb->regs[off >> 3] != val;
+
+    /* Store in register cache first */
+    phb->regs[off >> 3] = val;
+
+    /* Handle side effects */
+    switch (off) {
+    case PHB_PHB3_CONFIG:
+        if (changed) {
+            pnv_phb3_update_all_msi_regions(phb);
+        }
+        /* fall through */
+    case PHB_M32_BASE_ADDR:
+    case PHB_M32_BASE_MASK:
+    case PHB_M32_START_ADDR:
+        if (changed) {
+            pnv_phb3_check_m32(phb);
+        }
+        break;
+    case PHB_M64_UPPER_BITS:
+        if (changed) {
+            pnv_phb3_check_all_m64s(phb);
+        }
+        break;
+    case PHB_LSI_SOURCE_ID:
+        if (changed) {
+            pnv_phb3_lsi_src_id_write(phb, val);
+        }
+        break;
+
+    /* IODA table accesses */
+    case PHB_IODA_DATA0:
+        pnv_phb3_ioda_write(phb, val);
+        break;
+
+    /* RTC invalidation */
+    case PHB_RTC_INVALIDATE:
+        pnv_phb3_rtc_invalidate(phb, val);
+        break;
+
+    /* FFI request */
+    case PHB_FFI_REQUEST:
+        pnv_phb3_msi_ffi(&phb->msis, val);
+        break;
+
+    /* Silent simple writes */
+    case PHB_CONFIG_ADDRESS:
+    case PHB_IODA_ADDR:
+    case PHB_TCE_KILL:
+    case PHB_TCE_SPEC_CTL:
+    case PHB_PEST_BAR:
+    case PHB_PELTV_BAR:
+    case PHB_RTT_BAR:
+    case PHB_RBA_BAR:
+    case PHB_IVT_BAR:
+    case PHB_FFI_LOCK:
+        break;
+
+    /* Noise on anything else */
+    default:
+        qemu_log_mask(LOG_UNIMP, "reg_write 0x%"PRIx64"=%"PRIx64, off, val);
+    }
+}
+
+uint64_t pnv_phb3_reg_read(void *opaque, hwaddr off, unsigned size)
+{
+    PnvPHB3 *phb = opaque;
+    uint64_t val;
+
+    if ((off & 0xfffc) == PHB_CONFIG_DATA) {
+        return pnv_phb3_config_read(phb, off & 0x3, size);
+    }
+
+    /* Other registers are 64-bit only */
+    if (size != 8 || off & 0x7) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Invalid register access, offset: 0x%"PRIx64" size: %d",
+                      off, size);
+        return ~0ull;
+    }
+
+    /* Default read from cache */
+    val = phb->regs[off >> 3];
+
+    switch (off) {
+    /* Simulate venice DD2.0 */
+    case PHB_VERSION:
+        return 0x000000a300000005ull;
+
+    /* IODA table accesses */
+    case PHB_IODA_DATA0:
+        return pnv_phb3_ioda_read(phb);
+
+    /* Link training always appears trained */
+    case PHB_PCIE_DLP_TRAIN_CTL:
+        return PHB_PCIE_DLP_INBAND_PRESENCE | PHB_PCIE_DLP_TC_DL_LINKACT;
+
+    /* FFI Lock */
+    case PHB_FFI_LOCK:
+        /* Set lock and return previous value */
+        phb->regs[off >> 3] |= PHB_FFI_LOCK_STATE;
+        return val;
+
+    /* Silent simple reads */
+    case PHB_PHB3_CONFIG:
+    case PHB_M32_BASE_ADDR:
+    case PHB_M32_BASE_MASK:
+    case PHB_M32_START_ADDR:
+    case PHB_CONFIG_ADDRESS:
+    case PHB_IODA_ADDR:
+    case PHB_RTC_INVALIDATE:
+    case PHB_TCE_KILL:
+    case PHB_TCE_SPEC_CTL:
+    case PHB_PEST_BAR:
+    case PHB_PELTV_BAR:
+    case PHB_RTT_BAR:
+    case PHB_RBA_BAR:
+    case PHB_IVT_BAR:
+    case PHB_M64_UPPER_BITS:
+        break;
+
+    /* Noise on anything else */
+    default:
+        qemu_log_mask(LOG_UNIMP, "reg_read 0x%"PRIx64"=%"PRIx64, off, val);
+    }
+    return val;
+}
+
+static const MemoryRegionOps pnv_phb3_reg_ops = {
+    .read = pnv_phb3_reg_read,
+    .write = pnv_phb3_reg_write,
+    .valid.min_access_size = 1,
+    .valid.max_access_size = 8,
+    .impl.min_access_size = 1,
+    .impl.max_access_size = 8,
+    .endianness = DEVICE_BIG_ENDIAN,
+};
+
+static int pnv_phb3_map_irq(PCIDevice *pci_dev, int irq_num)
+{
+    /* Check that out properly ... */
+    return irq_num & 3;
+}
+
+static void pnv_phb3_set_irq(void *opaque, int irq_num, int level)
+{
+    PnvPHB3 *phb = opaque;
+
+    /* LSI only ... */
+    if (irq_num > 3) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Unknown IRQ to set %d", irq_num);
+    }
+    qemu_set_irq(phb->lsis.qirqs[irq_num], level);
+}
+
+static bool pnv_phb3_resolve_pe(PnvPhb3DMASpace *ds)
+{
+    uint64_t rtt, addr;
+    uint16_t rte;
+    int bus_num;
+
+    /* Already resolved ? */
+    if (ds->pe_num != PHB_INVALID_PE) {
+        return true;
+    }
+
+    /* We need to lookup the RTT */
+    rtt = ds->phb->regs[PHB_RTT_BAR >> 3];
+    if (!(rtt & PHB_RBA_BAR_ENABLE)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "DMA with RTT BAR disabled !");
+        /* Set error bits ? fence ? ... */
+        return false;
+    }
+
+    /* Read RTE */
+    bus_num = pci_bus_num(ds->bus);
+    addr = rtt & PHB_RTT_BASE_ADDRESS_MASK;
+    addr += 2 * ((bus_num << 8) | ds->devfn);
+    if (dma_memory_read(&address_space_memory, addr, &rte, sizeof(rte))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Failed to read RTT entry at 0x%"PRIx64,
+                      addr);
+        /* Set error bits ? fence ? ... */
+        return false;
+    }
+    rte = be16_to_cpu(rte);
+
+    /* Fail upon reading of invalid PE# */
+    if (rte >= PNV_PHB3_NUM_PE) {
+        qemu_log_mask(LOG_GUEST_ERROR, "RTE for RID 0x%x invalid (%04x)",
+                      ds->devfn, rte);
+        /* Set error bits ? fence ? ... */
+        return false;
+    }
+    ds->pe_num = rte;
+    return true;
+}
+
+static void pnv_phb3_translate_tve(PnvPhb3DMASpace *ds, hwaddr addr,
+                                   bool is_write, uint64_t tve,
+                                   IOMMUTLBEntry *tlb)
+{
+    uint64_t tta = GETFIELD(IODA2_TVT_TABLE_ADDR, tve);
+    int32_t  lev = GETFIELD(IODA2_TVT_NUM_LEVELS, tve);
+    uint32_t tts = GETFIELD(IODA2_TVT_TCE_TABLE_SIZE, tve);
+    uint32_t tps = GETFIELD(IODA2_TVT_IO_PSIZE, tve);
+
+    /* Invalid levels */
+    if (lev > 4) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Invalid #levels in TVE %d", lev);
+        return;
+    }
+
+    /* IO Page Size of 0 means untranslated, else use TCEs */
+    if (tps == 0) {
+        /* We only support non-translate in top window
+         * XXX FIX THAT, Venice/Murano support it on bottom window
+         * above 4G and Naples suports it on everything
+         */
+        if (!(tve & PPC_BIT(51))) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "xlate for invalid non-translate TVE");
+            return;
+        }
+        /* XXX Handle boundaries */
+
+        /* XXX Use 4k pages like q35 ... for now */
+        tlb->iova = addr & 0xfffffffffffff000ull;
+        tlb->translated_addr = addr & 0x0003fffffffff000ull;
+        tlb->addr_mask = 0xfffull;
+        tlb->perm = IOMMU_RW;
+    } else {
+        uint32_t tce_shift, tbl_shift, sh;
+        uint64_t base, taddr, tce, tce_mask;
+
+        /* TVE disabled ? */
+        if (tts == 0) {
+            qemu_log_mask(LOG_GUEST_ERROR, "xlate for invalid translated TVE");
+            return;
+        }
+
+        /* Address bits per bottom level TCE entry */
+        tce_shift = tps + 11;
+
+        /* Address bits per table level */
+        tbl_shift = tts + 8;
+
+        /* Top level table base address */
+        base = tta << 12;
+
+        /* Total shift to first level */
+        sh = tbl_shift * lev + tce_shift;
+
+        /* XXX Multi-level untested */
+        while ((lev--) >= 0) {
+            /* Grab the TCE address */
+            taddr = base | (((addr >> sh) & ((1ul << tbl_shift) - 1)) << 3);
+            if (dma_memory_read(&address_space_memory, taddr, &tce,
+                                sizeof(tce))) {
+                qemu_log_mask(LOG_GUEST_ERROR,
+                              "Failed to read TCE at 0x%"PRIx64, taddr);
+                return;
+            }
+            tce = be64_to_cpu(tce);
+
+            /* Check permission for indirect TCE */
+            if ((lev >= 0) && !(tce & 3)) {
+                qemu_log_mask(LOG_GUEST_ERROR,
+                              "Invalid indirect TCE at 0x%"PRIx64, taddr);
+                qemu_log_mask(LOG_GUEST_ERROR,
+                              " xlate %"PRIx64":%c TVE=%"PRIx64,
+                              addr, is_write ? 'W' : 'R', tve);
+                qemu_log_mask(LOG_GUEST_ERROR,
+                              " tta=%"PRIx64" lev=%d tts=%d tps=%d",
+                              tta, lev, tts, tps);
+                return;
+            }
+            sh -= tbl_shift;
+            base = tce & ~0xfffull;
+        }
+
+        /* We exit the loop with TCE being the final TCE */
+        tce_mask = ~((1ull << tce_shift) - 1);
+        tlb->iova = addr & tce_mask;
+        tlb->translated_addr = tce & tce_mask;
+        tlb->addr_mask = ~tce_mask;
+        tlb->perm = tce & 3;
+        if ((is_write & !(tce & 2)) || ((!is_write) && !(tce & 1))) {
+            qemu_log_mask(LOG_GUEST_ERROR, "TCE access fault at 0x%"PRIx64,
+                          taddr);
+            qemu_log_mask(LOG_GUEST_ERROR, " xlate %"PRIx64":%c TVE=%"PRIx64,
+                          addr, is_write ? 'W' : 'R', tve);
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          " tta=%"PRIx64" lev=%d tts=%d tps=%d",
+                          tta, lev, tts, tps);
+        }
+    }
+}
+
+static IOMMUTLBEntry pnv_phb3_translate_iommu(IOMMUMemoryRegion *iommu,
+                                              hwaddr addr,
+                                              IOMMUAccessFlags flag,
+                                              int iommu_idx)
+{
+    PnvPhb3DMASpace *ds = container_of(iommu, PnvPhb3DMASpace, dma_mr);
+    int tve_sel;
+    uint64_t tve, cfg;
+    IOMMUTLBEntry ret = {
+        .target_as = &address_space_memory,
+        .iova = addr,
+        .translated_addr = 0,
+        .addr_mask = ~(hwaddr)0,
+        .perm = IOMMU_NONE,
+    };
+
+    /* Resolve PE# */
+    if (!pnv_phb3_resolve_pe(ds)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Failed to resolve PE# for bus @%p (%d) devfn 0x%x",
+                      ds->bus, pci_bus_num(ds->bus), ds->devfn);
+        return ret;
+    }
+
+    /* Check top bits */
+    switch (addr >> 60) {
+    case 00:
+        /* DMA or 32-bit MSI ? */
+        cfg = ds->phb->regs[PHB_PHB3_CONFIG >> 3];
+        if ((cfg & PHB_PHB3C_32BIT_MSI_EN) &&
+            ((addr & 0xffffffffffff0000ull) == 0xffff0000ull)) {
+            qemu_log_mask(LOG_GUEST_ERROR, "xlate on 32-bit MSI region");
+            return ret;
+        }
+        /* Choose TVE XXX Use PHB3 Control Register */
+        tve_sel = (addr >> 59) & 1;
+        tve = ds->phb->ioda_TVT[ds->pe_num * 2 + tve_sel];
+        pnv_phb3_translate_tve(ds, addr, flag & IOMMU_WO, tve, &ret);
+        break;
+    case 01:
+        qemu_log_mask(LOG_GUEST_ERROR, "xlate on 64-bit MSI region");
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "xlate on unsupported address 0x%"PRIx64,
+                      addr);
+    }
+    return ret;
+}
+
+#define TYPE_PNV_PHB3_IOMMU_MEMORY_REGION "pnv-phb3-iommu-memory-region"
+#define PNV_PHB3_IOMMU_MEMORY_REGION(obj) \
+    OBJECT_CHECK(IOMMUMemoryRegion, (obj), TYPE_PNV_PHB3_IOMMU_MEMORY_REGION)
+
+static void pnv_phb3_iommu_memory_region_class_init(ObjectClass *klass,
+                                                    void *data)
+{
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
+
+    imrc->translate = pnv_phb3_translate_iommu;
+}
+
+static const TypeInfo pnv_phb3_iommu_memory_region_info = {
+    .parent = TYPE_IOMMU_MEMORY_REGION,
+    .name = TYPE_PNV_PHB3_IOMMU_MEMORY_REGION,
+    .class_init = pnv_phb3_iommu_memory_region_class_init,
+};
+
+/*
+ * MSI/MSIX memory region implementation.
+ * The handler handles both MSI and MSIX.
+ */
+static void pnv_phb3_msi_write(void *opaque, hwaddr addr,
+                               uint64_t data, unsigned size)
+{
+    PnvPhb3DMASpace *ds = opaque;
+
+    /* Resolve PE# */
+    if (!pnv_phb3_resolve_pe(ds)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Failed to resolve PE# for bus @%p (%d) devfn 0x%x",
+                      ds->bus, pci_bus_num(ds->bus), ds->devfn);
+        return;
+    }
+
+    pnv_phb3_msi_send(&ds->phb->msis, addr, data, ds->pe_num);
+}
+
+static const MemoryRegionOps pnv_phb3_msi_ops = {
+    /* There is no .read as the read result is undefined by PCI spec */
+    .read = NULL,
+    .write = pnv_phb3_msi_write,
+    .endianness = DEVICE_LITTLE_ENDIAN
+};
+
+static AddressSpace *pnv_phb3_dma_iommu(PCIBus *bus, void *opaque, int devfn)
+{
+    PnvPHB3 *phb = opaque;
+    PnvPhb3DMASpace *ds;
+
+    QLIST_FOREACH(ds, &phb->dma_spaces, list) {
+        if (ds->bus == bus && ds->devfn == devfn) {
+            break;
+        }
+    }
+
+    if (ds == NULL) {
+        ds = g_malloc0(sizeof(PnvPhb3DMASpace));
+        ds->bus = bus;
+        ds->devfn = devfn;
+        ds->pe_num = PHB_INVALID_PE;
+        ds->phb = phb;
+        memory_region_init_iommu(&ds->dma_mr, sizeof(ds->dma_mr),
+                                 TYPE_PNV_PHB3_IOMMU_MEMORY_REGION,
+                                 OBJECT(phb), "phb3_iommu", UINT64_MAX);
+        address_space_init(&ds->dma_as, MEMORY_REGION(&ds->dma_mr),
+                           "phb3_iommu");
+        memory_region_init_io(&ds->msi32_mr, OBJECT(phb), &pnv_phb3_msi_ops,
+                              ds, "msi32", 0x10000);
+        memory_region_init_io(&ds->msi64_mr, OBJECT(phb), &pnv_phb3_msi_ops,
+                              ds, "msi64", 0x100000);
+        pnv_phb3_update_msi_regions(ds);
+
+        QLIST_INSERT_HEAD(&phb->dma_spaces, ds, list);
+    }
+    return &ds->dma_as;
+}
+
+static void pnv_phb3_instance_init(Object *obj)
+{
+    PnvPHB3 *phb = PNV_PHB3(obj);
+
+    /* Create LSI source */
+    object_initialize(&phb->lsis, sizeof(phb->lsis), TYPE_ICS_SIMPLE);
+    object_property_add_child(obj, "ics-phb-lsi", OBJECT(&phb->lsis), NULL);
+
+    /* Default init ... will be fixed by HW inits */
+    phb->lsis.offset = 0;
+
+    /* Create MSI source */
+    object_initialize(&phb->msis, sizeof(phb->msis), TYPE_PHB3_MSI);
+    object_property_add_const_link(OBJECT(&phb->msis), "phb", obj,
+                                   &error_abort);
+    object_property_add_child(obj, "ics-phb-msi", OBJECT(&phb->msis), NULL);
+
+    /* Create PBCQ */
+    object_initialize(&phb->pbcq, sizeof(phb->pbcq), TYPE_PNV_PBCQ);
+    object_property_add_const_link(OBJECT(&phb->pbcq), "phb", obj,
+                                   &error_abort);
+    object_property_add_child(obj, "pbcq", OBJECT(&phb->pbcq), NULL);
+    object_property_add_alias(obj, "chip-id", OBJECT(&phb->pbcq), "chip-id",
+                              &error_abort);
+    object_property_add_alias(obj, "phb-id", OBJECT(&phb->pbcq), "phb-id",
+                              &error_abort);
+
+    QLIST_INIT(&phb->dma_spaces);
+}
+
+static void pnv_phb3_realize(DeviceState *dev, Error **errp)
+{
+    PnvPHB3 *phb = PNV_PHB3(dev);
+    PCIHostState *pci = PCI_HOST_BRIDGE(dev);
+    Object *obj;
+    Error *error = NULL;
+    int i;
+
+    memory_region_init(&phb->pci_mmio, OBJECT(phb), "pci-mmio",
+                       PCI_MMIO_TOTAL_SIZE);
+
+    /* PHB3 doesn't support IO space. However, qemu gets very upset if
+     * we don't have an IO region to anchor IO BARs onto so we just
+     * initialize one which we never hook up to anything
+     */
+    memory_region_init(&phb->pci_io, OBJECT(phb), "pci-io", 0x10000);
+
+    memory_region_init_io(&phb->mr_regs, OBJECT(phb), &pnv_phb3_reg_ops, phb,
+                          "phb3-regs", 0x1000);
+
+   /* get XICSFabric from chip */
+    obj = object_property_get_link(OBJECT(dev), "xics", &error);
+    if (!obj) {
+        error_propagate(errp, error);
+        error_prepend(errp, "required link '" ICP_PROP_XICS "' not found: ");
+        return;
+    }
+
+    object_property_set_int(OBJECT(&phb->lsis), PNV_PHB3_NUM_LSI,
+                            "nr-irqs", &error_abort);
+    object_property_add_const_link(OBJECT(&phb->lsis), "xics", obj,
+                                   &error_abort);
+    object_property_set_bool(OBJECT(&phb->lsis), true, "realized", &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    for (i = 0; i < PNV_PHB3_NUM_LSI; i++) {
+        ics_set_irq_type(&phb->lsis, i, true);
+    }
+
+    object_property_add_const_link(OBJECT(&phb->msis), "xics", obj,
+                                   &error_abort);
+    object_property_set_int(OBJECT(&phb->msis), PHB3_MAX_MSI,
+                            "nr-irqs", &error_abort);
+    object_property_set_bool(OBJECT(&phb->msis), true, "realized", &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    object_property_set_bool(OBJECT(&phb->pbcq), true, "realized", &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    pci->bus = pci_register_root_bus(dev, "phb3-root-bus",
+                                pnv_phb3_set_irq, pnv_phb3_map_irq, phb,
+                                &phb->pci_mmio, &phb->pci_io,
+                                0, 4, TYPE_PNV_PHB3_ROOT_BUS);
+    pci_setup_iommu(pci->bus, pnv_phb3_dma_iommu, phb);
+}
+
+void pnv_phb3_update_regions(PnvPHB3 *phb)
+{
+    PnvPBCQState *pbcq = &phb->pbcq;
+
+    /* Unmap first always */
+    if (phb->regs_mapped) {
+        memory_region_del_subregion(&pbcq->phbbar, &phb->mr_regs);
+        phb->regs_mapped = false;
+    }
+
+    /* Map registers if enabled */
+    if (pbcq->phb_mapped) {
+        /* XXX We should use the PHB BAR 2 register but we don't ... */
+        memory_region_add_subregion(&pbcq->phbbar, 0, &phb->mr_regs);
+        phb->regs_mapped = true;
+    }
+
+    /* Check/update m32 */
+    if (phb->m32_mapped) {
+        pnv_phb3_check_m32(phb);
+    }
+}
+
+static void pnv_phb3_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = pnv_phb3_realize;
+}
+
+static const TypeInfo pnv_phb3_type_info = {
+    .name          = TYPE_PNV_PHB3,
+    .parent        = TYPE_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(PnvPHB3),
+    .class_init    = pnv_phb3_class_init,
+    .instance_init = pnv_phb3_instance_init,
+};
+
+static void pnv_phb3_root_bus_class_init(ObjectClass *klass, void *data)
+{
+    BusClass *k = BUS_CLASS(klass);
+
+    k->max_dev = 1;
+}
+
+static const TypeInfo pnv_phb3_root_bus_info = {
+    .name = TYPE_PNV_PHB3_ROOT_BUS,
+    .parent = TYPE_PCIE_BUS,
+    .class_init = pnv_phb3_root_bus_class_init,
+};
+
+static void pnv_phb3_rc_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+    PCIERootPortClass *rpc = PCIE_ROOT_PORT_CLASS(klass);
+
+    dc->desc     = "IBM PHB3 PCIE Root Port";
+
+    k->vendor_id = PCI_VENDOR_ID_IBM;
+    k->device_id = 0x03dc;
+    k->revision  = 0;
+    rpc->exp_offset = 0x48;
+    rpc->aer_offset = 0x100;
+}
+
+static const TypeInfo pnv_phb3_rc_info = {
+    .name          = TYPE_PNV_PHB3_RC,
+    .parent        = TYPE_PCIE_ROOT_PORT,
+    .class_init    = pnv_phb3_rc_class_init,
+};
+
+static void pnv_phb3_register_types(void)
+{
+    type_register_static(&pnv_phb3_rc_info);
+    type_register_static(&pnv_phb3_type_info);
+    type_register_static(&pnv_phb3_root_bus_info);
+    type_register_static(&pnv_phb3_iommu_memory_region_info);
+}
+
+type_init(pnv_phb3_register_types)
diff --git a/hw/pci-host/pnv_phb3_msi.c b/hw/pci-host/pnv_phb3_msi.c
new file mode 100644
index 000000000000..0370211aebfd
--- /dev/null
+++ b/hw/pci-host/pnv_phb3_msi.c
@@ -0,0 +1,315 @@ 
+/*
+ * QEMU PowerPC PowerNV PHB3 model
+ *
+ * Copyright (c) 2014-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "hw/pci-host/pnv_phb3_regs.h"
+#include "hw/pci-host/pnv_phb3.h"
+#include "hw/ppc/pnv.h"
+#include "hw/pci/msi.h"
+
+static uint64_t phb3_msi_ive_addr(PnvPHB3 *phb, int srcno)
+{
+    uint64_t ivtbar = phb->regs[PHB_IVT_BAR >> 3];
+    uint64_t phbctl = phb->regs[PHB_CONTROL >> 3];
+
+    if (!(ivtbar & PHB_IVT_BAR_ENABLE)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Failed access to disable IVT BAR !");
+        return 0;
+    }
+
+    if (srcno >= (ivtbar & PHB_IVT_LENGTH_MASK)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "MSI out of bounds (%d vs  0x%"PRIx64")",
+                      srcno, ivtbar & PHB_IVT_LENGTH_MASK);
+        return 0;
+    }
+
+    ivtbar &= PHB_IVT_BASE_ADDRESS_MASK;
+
+    if (phbctl & PHB_CTRL_IVE_128_BYTES) {
+        return ivtbar + 128 * srcno;
+    } else {
+        return ivtbar + 16 * srcno;
+    }
+}
+
+static bool phb3_msi_read_ive(PnvPHB3 *phb, int srcno, uint64_t *out_ive)
+{
+    uint64_t ive_addr, ive;
+
+    ive_addr = phb3_msi_ive_addr(phb, srcno);
+    if (!ive_addr) {
+        return false;
+    }
+
+    if (dma_memory_read(&address_space_memory, ive_addr, &ive, sizeof(ive))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "Failed to read IVE at 0x%" PRIx64,
+                      ive_addr);
+        return false;
+    }
+    *out_ive = be64_to_cpu(ive);
+
+    return true;
+}
+
+static void phb3_msi_set_p(Phb3MsiState *msi, int srcno, uint8_t gen)
+{
+    uint64_t ive_addr;
+    uint8_t p = 0x01 | (gen << 1);
+
+    ive_addr = phb3_msi_ive_addr(msi->phb, srcno);
+    if (!ive_addr) {
+        return;
+    }
+
+    if (dma_memory_write(&address_space_memory, ive_addr + 4, &p, 1)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Failed to write IVE (set P) at 0x%" PRIx64, ive_addr);
+    }
+}
+
+static void phb3_msi_set_q(Phb3MsiState *msi, int srcno)
+{
+    uint64_t ive_addr;
+    uint8_t q = 0x01;
+
+    ive_addr = phb3_msi_ive_addr(msi->phb, srcno);
+    if (!ive_addr) {
+        return;
+    }
+
+    if (dma_memory_write(&address_space_memory, ive_addr + 5, &q, 1)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Failed to write IVE (set Q) at 0x%" PRIx64, ive_addr);
+    }
+}
+
+static void phb3_msi_try_send(Phb3MsiState *msi, int srcno, bool ignore_p)
+{
+    ICSState *ics = ICS_BASE(msi);
+    uint64_t ive;
+    uint64_t server, prio, pq, gen;
+
+    if (!phb3_msi_read_ive(msi->phb, srcno, &ive)) {
+        return;
+    }
+
+    server = GETFIELD(IODA2_IVT_SERVER, ive);
+    prio = GETFIELD(IODA2_IVT_PRIORITY, ive);
+    pq = GETFIELD(IODA2_IVT_Q, ive);
+    if (!ignore_p) {
+        pq |= GETFIELD(IODA2_IVT_P, ive) << 1;
+    }
+    gen = GETFIELD(IODA2_IVT_GEN, ive);
+
+    /*
+     * The low order 2 bits are the link pointer (Type II interrupts).
+     * Shift back to get a valid IRQ server.
+     */
+    server >>= 2;
+
+    switch (pq) {
+    case 0: /* 00 */
+        if (prio == 0xff) {
+            /* Masked, set Q */
+            phb3_msi_set_q(msi, srcno);
+        } else {
+            /* Enabled, set P and send */
+            phb3_msi_set_p(msi, srcno, gen);
+            icp_irq(ics, server, srcno + ics->offset, prio);
+        }
+        break;
+    case 2: /* 10 */
+        /* Already pending, set Q */
+        phb3_msi_set_q(msi, srcno);
+        break;
+    case 1: /* 01 */
+    case 3: /* 11 */
+    default:
+        /* Just drop stuff if Q already set */
+        break;
+    }
+}
+
+static void phb3_msi_set_irq(void *opaque, int srcno, int val)
+{
+    Phb3MsiState *msi = PHB3_MSI(opaque);
+
+    if (val) {
+        phb3_msi_try_send(msi, srcno, false);
+    }
+}
+
+
+void pnv_phb3_msi_send(Phb3MsiState *msi, uint64_t addr, uint16_t data,
+                       int32_t dev_pe)
+{
+    ICSState *ics = ICS_BASE(msi);
+    uint64_t ive;
+    uint16_t pe;
+    uint32_t src = ((addr >> 4) & 0xffff) | (data & 0x1f);
+
+    if (src >= ics->nr_irqs) {
+        qemu_log_mask(LOG_GUEST_ERROR, "MSI %d out of bounds", src);
+        return;
+    }
+    if (dev_pe >= 0) {
+        if (!phb3_msi_read_ive(msi->phb, src, &ive)) {
+            return;
+        }
+        pe = GETFIELD(IODA2_IVT_PE, ive);
+        if (pe != dev_pe) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "MSI %d send by PE#%d but assigned to PE#%d",
+                          src, dev_pe, pe);
+            return;
+        }
+    }
+    qemu_irq_pulse(ics->qirqs[src]);
+}
+
+void pnv_phb3_msi_ffi(Phb3MsiState *msi, uint64_t val)
+{
+    /* Emit interrupt */
+    pnv_phb3_msi_send(msi, val, 0, -1);
+
+    /* Clear FFI lock */
+    msi->phb->regs[PHB_FFI_LOCK >> 3] = 0;
+}
+
+static void phb3_msi_reject(ICSState *ics, uint32_t nr)
+{
+    Phb3MsiState *msi = PHB3_MSI(ics);
+    unsigned int srcno = nr - ics->offset;
+    unsigned int idx = srcno >> 6;
+    unsigned int bit = 1ull << (srcno & 0x3f);
+
+    assert(srcno < PHB3_MAX_MSI);
+
+    msi->rba[idx] |= bit;
+    msi->rba_sum |= (1u << idx);
+}
+
+static void phb3_msi_resend(ICSState *ics)
+{
+    Phb3MsiState *msi = PHB3_MSI(ics);
+    unsigned int i, j;
+
+    if (msi->rba_sum == 0) {
+        return;
+    }
+
+    for (i = 0; i < 32; i++) {
+        if ((msi->rba_sum & (1u << i)) == 0) {
+            continue;
+        }
+        msi->rba_sum &= ~(1u << i);
+        for (j = 0; j < 64; j++) {
+            if ((msi->rba[i] & (1ull << j)) == 0) {
+                continue;
+            }
+            msi->rba[i] &= ~(1u << j);
+            phb3_msi_try_send(msi, i * 64 + j, true);
+        }
+    }
+}
+
+static void phb3_msi_reset(DeviceState *dev)
+{
+    Phb3MsiState *msi = PHB3_MSI(dev);
+    ICSStateClass *icsc = ICS_BASE_GET_CLASS(dev);
+
+    icsc->parent_reset(dev);
+
+    memset(msi->rba, 0, sizeof(msi->rba));
+    msi->rba_sum = 0;
+}
+
+static void phb3_msi_reset_handler(void *dev)
+{
+    phb3_msi_reset(dev);
+}
+
+void pnv_phb3_msi_update_config(Phb3MsiState *msi, uint32_t base,
+                                uint32_t count)
+{
+    ICSState *ics = ICS_BASE(msi);
+
+    if (count > PHB3_MAX_MSI) {
+        count = PHB3_MAX_MSI;
+    }
+    ics->nr_irqs = count;
+    ics->offset = base;
+}
+
+static void phb3_msi_realize(DeviceState *dev, Error **errp)
+{
+    Phb3MsiState *msi = PHB3_MSI(dev);
+    ICSState *ics = ICS_BASE(msi);
+    ICSStateClass *icsc = ICS_BASE_GET_CLASS(ics);
+    Object *obj;
+    Error *local_err = NULL;
+
+    icsc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    obj = object_property_get_link(OBJECT(dev), "phb", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'phb' not found: ");
+        return;
+    }
+    msi->phb = PNV_PHB3(obj);
+
+    ics->qirqs = qemu_allocate_irqs(phb3_msi_set_irq, msi, PHB3_MAX_MSI);
+
+    qemu_register_reset(phb3_msi_reset_handler, dev);
+}
+
+static void phb3_msi_instance_init(Object *obj)
+{
+    ICSState *ics = ICS_BASE(obj);
+
+    /* Will be overriden later */
+    ics->offset = 0;
+}
+
+static void phb3_msi_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    ICSStateClass *isc = ICS_BASE_CLASS(klass);
+
+    device_class_set_parent_realize(dc, phb3_msi_realize,
+                                    &isc->parent_realize);
+    device_class_set_parent_reset(dc, phb3_msi_reset,
+                                  &isc->parent_reset);
+
+    isc->reject = phb3_msi_reject;
+    isc->resend = phb3_msi_resend;
+}
+
+static const TypeInfo phb3_msi_info = {
+    .name = TYPE_PHB3_MSI,
+    .parent = TYPE_ICS_BASE,
+    .instance_size = sizeof(Phb3MsiState),
+    .class_init = phb3_msi_class_init,
+    .class_size = sizeof(ICSStateClass),
+    .instance_init = phb3_msi_instance_init,
+};
+
+static void pnv_phb3_msi_register_types(void)
+{
+    type_register_static(&phb3_msi_info);
+}
+
+type_init(pnv_phb3_msi_register_types)
diff --git a/hw/pci-host/pnv_phb3_pbcq.c b/hw/pci-host/pnv_phb3_pbcq.c
new file mode 100644
index 000000000000..9a2bae6637d1
--- /dev/null
+++ b/hw/pci-host/pnv_phb3_pbcq.c
@@ -0,0 +1,350 @@ 
+/*
+ * QEMU PowerPC PowerNV PHB3 model
+ *
+ * Copyright (c) 2014-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu-common.h"
+#include "qemu/log.h"
+#include "hw/ppc/fdt.h"
+#include "hw/pci-host/pnv_phb3_regs.h"
+#include "hw/pci-host/pnv_phb3.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
+
+#include <libfdt.h>
+
+static uint64_t pnv_pbcq_nest_xscom_read(void *opaque, hwaddr addr,
+                                         unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t offset = addr >> 3;
+
+    return pbcq->nest_regs[offset];
+}
+
+static uint64_t pnv_pbcq_pci_xscom_read(void *opaque, hwaddr addr,
+                                        unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t offset = addr >> 3;
+
+    return pbcq->pci_regs[offset];
+}
+
+static uint64_t pnv_pbcq_spci_xscom_read(void *opaque, hwaddr addr,
+                                         unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t offset = addr >> 3;
+
+    if (offset == PBCQ_SPCI_ASB_DATA) {
+        return pnv_phb3_reg_read(pbcq->phb,
+                                 pbcq->spci_regs[PBCQ_SPCI_ASB_ADDR], 8);
+    }
+    return pbcq->spci_regs[offset];
+}
+
+static void pnv_pbcq_update_map(PnvPBCQState *pbcq)
+{
+    uint64_t bar_en = pbcq->nest_regs[PBCQ_NEST_BAR_EN];
+    uint64_t bar, mask, size;
+
+    /*
+     * NOTE: This will really not work well if those are remapped
+     * after the PHB has created its sub regions. We could do better
+     * if we had a way to resize regions but we don't really care
+     * that much in practice as the stuff below really only happens
+     * once early during boot
+     */
+
+    /* Handle unmaps */
+    if (pbcq->mmio0_mapped && !(bar_en & PBCQ_NEST_BAR_EN_MMIO0)) {
+        memory_region_del_subregion(get_system_memory(), &pbcq->mmbar0);
+        pbcq->mmio0_mapped = false;
+    }
+    if (pbcq->mmio1_mapped && !(bar_en & PBCQ_NEST_BAR_EN_MMIO1)) {
+        memory_region_del_subregion(get_system_memory(), &pbcq->mmbar1);
+        pbcq->mmio1_mapped = false;
+    }
+    if (pbcq->phb_mapped && !(bar_en & PBCQ_NEST_BAR_EN_PHB)) {
+        memory_region_del_subregion(get_system_memory(), &pbcq->phbbar);
+        pbcq->phb_mapped = false;
+    }
+
+    /* Update PHB */
+    pnv_phb3_update_regions(pbcq->phb);
+
+    /* Handle maps */
+    if (!pbcq->mmio0_mapped && (bar_en & PBCQ_NEST_BAR_EN_MMIO0)) {
+        bar = pbcq->nest_regs[PBCQ_NEST_MMIO_BAR0] >> 14;
+        mask = pbcq->nest_regs[PBCQ_NEST_MMIO_MASK0];
+        size = ((~mask) >> 14) + 1;
+        memory_region_init(&pbcq->mmbar0, OBJECT(pbcq), "pbcq-mmio0", size);
+        memory_region_add_subregion(get_system_memory(), bar, &pbcq->mmbar0);
+        pbcq->mmio0_mapped = true;
+        pbcq->mmio0_base = bar;
+        pbcq->mmio0_size = size;
+    }
+    if (!pbcq->mmio1_mapped && (bar_en & PBCQ_NEST_BAR_EN_MMIO1)) {
+        bar = pbcq->nest_regs[PBCQ_NEST_MMIO_BAR1] >> 14;
+        mask = pbcq->nest_regs[PBCQ_NEST_MMIO_MASK1];
+        size = ((~mask) >> 14) + 1;
+        memory_region_init(&pbcq->mmbar1, OBJECT(pbcq), "pbcq-mmio1", size);
+        memory_region_add_subregion(get_system_memory(), bar, &pbcq->mmbar1);
+        pbcq->mmio1_mapped = true;
+        pbcq->mmio1_base = bar;
+        pbcq->mmio1_size = size;
+    }
+    if (!pbcq->phb_mapped && (bar_en & PBCQ_NEST_BAR_EN_PHB)) {
+        bar = pbcq->nest_regs[PBCQ_NEST_PHB_BAR] >> 14;
+        size = 0x1000;
+        memory_region_init(&pbcq->phbbar, OBJECT(pbcq), "pbcq-phb", size);
+        memory_region_add_subregion(get_system_memory(), bar, &pbcq->phbbar);
+        pbcq->phb_mapped = true;
+    }
+
+    /* Update PHB */
+    pnv_phb3_update_regions(pbcq->phb);
+}
+
+static void pnv_pbcq_nest_xscom_write(void *opaque, hwaddr addr,
+                                uint64_t val, unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t reg = addr >> 3;
+
+    switch (reg) {
+    case PBCQ_NEST_MMIO_BAR0:
+    case PBCQ_NEST_MMIO_BAR1:
+    case PBCQ_NEST_MMIO_MASK0:
+    case PBCQ_NEST_MMIO_MASK1:
+        if (pbcq->nest_regs[PBCQ_NEST_BAR_EN] &
+            (PBCQ_NEST_BAR_EN_MMIO0 |
+             PBCQ_NEST_BAR_EN_MMIO1)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                           "PHB3: Changing enabled BAR unsupported\n");
+        }
+        pbcq->nest_regs[reg] = val & 0xffffffffc0000000ull;
+        break;
+    case PBCQ_NEST_PHB_BAR:
+        if (pbcq->nest_regs[PBCQ_NEST_BAR_EN] & PBCQ_NEST_BAR_EN_PHB) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                           "PHB3: Changing enabled BAR unsupported\n");
+        }
+        pbcq->nest_regs[reg] = val & 0xfffffffffc000000ull;
+        break;
+    case PBCQ_NEST_BAR_EN:
+        pbcq->nest_regs[reg] = val & 0xf800000000000000ull;
+        pnv_pbcq_update_map(pbcq);
+        pnv_phb3_remap_irqs(pbcq->phb);
+        break;
+    case PBCQ_NEST_IRSN_COMPARE:
+    case PBCQ_NEST_IRSN_MASK:
+        pbcq->nest_regs[reg] = val & PBCQ_NEST_IRSN_COMP_MASK;
+        pnv_phb3_remap_irqs(pbcq->phb);
+        break;
+    case PBCQ_NEST_LSI_SRC_ID:
+        pbcq->nest_regs[reg] = val & PBCQ_NEST_LSI_SRC_MASK;
+        pnv_phb3_remap_irqs(pbcq->phb);
+        break;
+    }
+
+    /* XXX Don't error out on other regs for now ... */
+}
+
+static void pnv_pbcq_pci_xscom_write(void *opaque, hwaddr addr,
+                                     uint64_t val, unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t reg = addr >> 3;
+
+    switch (reg) {
+    case PBCQ_PCI_BAR2:
+        pbcq->pci_regs[reg] = val & 0xfffffffffc000000ull;
+        pnv_pbcq_update_map(pbcq);
+        break;
+    }
+
+    /* XXX Don't error out on other regs for now ... */
+}
+
+static void pnv_pbcq_spci_xscom_write(void *opaque, hwaddr addr,
+                                uint64_t val, unsigned size)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(opaque);
+    uint32_t reg = addr >> 3;
+
+    switch (reg) {
+    case PBCQ_SPCI_ASB_ADDR:
+        pbcq->spci_regs[reg] = val & 0xfff;
+        break;
+    case PBCQ_SPCI_ASB_STATUS:
+        pbcq->spci_regs[reg] &= ~val;
+        break;
+    case PBCQ_SPCI_ASB_DATA:
+        pnv_phb3_reg_write(pbcq->phb, pbcq->spci_regs[PBCQ_SPCI_ASB_ADDR],
+                           val, 8);
+        break;
+    case PBCQ_SPCI_AIB_CAPP_EN:
+    case PBCQ_SPCI_CAPP_SEC_TMR:
+        break;
+    }
+
+    /* XXX Don't error out on other regs for now ... */
+}
+
+static const MemoryRegionOps pnv_pbcq_nest_xscom_ops = {
+    .read = pnv_pbcq_nest_xscom_read,
+    .write = pnv_pbcq_nest_xscom_write,
+    .valid.min_access_size = 8,
+    .valid.max_access_size = 8,
+    .impl.min_access_size = 8,
+    .impl.max_access_size = 8,
+    .endianness = DEVICE_BIG_ENDIAN,
+};
+
+static const MemoryRegionOps pnv_pbcq_pci_xscom_ops = {
+    .read = pnv_pbcq_pci_xscom_read,
+    .write = pnv_pbcq_pci_xscom_write,
+    .valid.min_access_size = 8,
+    .valid.max_access_size = 8,
+    .impl.min_access_size = 8,
+    .impl.max_access_size = 8,
+    .endianness = DEVICE_BIG_ENDIAN,
+};
+
+static const MemoryRegionOps pnv_pbcq_spci_xscom_ops = {
+    .read = pnv_pbcq_spci_xscom_read,
+    .write = pnv_pbcq_spci_xscom_write,
+    .valid.min_access_size = 8,
+    .valid.max_access_size = 8,
+    .impl.min_access_size = 8,
+    .impl.max_access_size = 8,
+    .endianness = DEVICE_BIG_ENDIAN,
+};
+
+static void pnv_pbcq_default_bars(PnvPBCQState *pbcq)
+{
+    uint64_t mm0, mm1, reg;
+
+    mm0 = 0x3d00000000000ull + 0x4000000000ull * pbcq->chip_id +
+            0x1000000000ull * pbcq->phb_id;
+    mm1 = 0x3ff8000000000ull + 0x0200000000ull * pbcq->chip_id +
+            0x0080000000ull * pbcq->phb_id;
+    reg = 0x3fffe40000000ull + 0x0000400000ull * pbcq->chip_id +
+            0x0000100000ull * pbcq->phb_id;
+
+    pbcq->nest_regs[PBCQ_NEST_MMIO_BAR0] = mm0 << 14;
+    pbcq->nest_regs[PBCQ_NEST_MMIO_BAR1] = mm1 << 14;
+    pbcq->nest_regs[PBCQ_NEST_PHB_BAR] = reg << 14;
+    pbcq->nest_regs[PBCQ_NEST_MMIO_MASK0] = 0x3fff000000000ull << 14;
+    pbcq->nest_regs[PBCQ_NEST_MMIO_MASK1] = 0x3ffff80000000ull << 14;
+    pbcq->pci_regs[PBCQ_PCI_BAR2] = reg << 14;
+}
+
+static void pnv_pbcq_realize(DeviceState *dev, Error **errp)
+{
+    PnvPBCQState *pbcq = PNV_PBCQ(dev);
+    Object *obj;
+    Error *local_err = NULL;
+
+    assert(pbcq->phb_id < 4);
+
+    obj = object_property_get_link(OBJECT(dev), "phb", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'phb' not found: ");
+        return;
+    }
+    pbcq->phb = PNV_PHB3(obj);
+
+    /* XXX Fix OPAL to do that: establish default BAR values */
+    pnv_pbcq_default_bars(pbcq);
+
+    /* XScom region for PBCQ registers */
+    pnv_xscom_region_init(&pbcq->xscom_nest_regs, OBJECT(dev),
+                          &pnv_pbcq_nest_xscom_ops,
+                          pbcq, "xscom-pbcq-nest",
+                          PNV_XSCOM_PBCQ_NEST_SIZE);
+    pnv_xscom_region_init(&pbcq->xscom_pci_regs, OBJECT(dev),
+                          &pnv_pbcq_pci_xscom_ops,
+                          pbcq, "xscom-pbcq-pci",
+                          PNV_XSCOM_PBCQ_PCI_SIZE);
+    pnv_xscom_region_init(&pbcq->xscom_spci_regs, OBJECT(dev),
+                          &pnv_pbcq_spci_xscom_ops,
+                          pbcq, "xscom-pbcq-spci",
+                          PNV_XSCOM_PBCQ_SPCI_SIZE);
+}
+
+static int pnv_pbcq_dt_xscom(PnvXScomInterface *dev, void *fdt,
+                             int xscom_offset)
+{
+    const char compat[] = "ibm,power8-pbcq";
+    PnvPBCQState *pbcq = PNV_PBCQ(dev);
+    char *name;
+    int offset;
+    uint32_t lpc_pcba = PNV_XSCOM_PBCQ_NEST_BASE;
+    uint32_t reg[] = {
+        cpu_to_be32(lpc_pcba),
+        cpu_to_be32(PNV_XSCOM_PBCQ_NEST_SIZE),
+        cpu_to_be32(PNV_XSCOM_PBCQ_PCI_BASE),
+        cpu_to_be32(PNV_XSCOM_PBCQ_PCI_SIZE),
+        cpu_to_be32(PNV_XSCOM_PBCQ_SPCI_BASE),
+        cpu_to_be32(PNV_XSCOM_PBCQ_SPCI_SIZE)
+    };
+
+    name = g_strdup_printf("pbcq@%x", lpc_pcba);
+    offset = fdt_add_subnode(fdt, xscom_offset, name);
+    _FDT(offset);
+    g_free(name);
+
+    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
+
+    _FDT((fdt_setprop_cell(fdt, offset, "ibm,phb-index", pbcq->phb_id)));
+    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
+                      sizeof(compat))));
+    return 0;
+}
+
+static Property pnv_pbcq_properties[] = {
+        DEFINE_PROP_UINT32("phb-id", PnvPBCQState, phb_id, 0),
+        DEFINE_PROP_UINT32("chip-id", PnvPBCQState, chip_id, 0),
+        DEFINE_PROP_END_OF_LIST(),
+
+};
+
+static void pnv_pbcq_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvXScomInterfaceClass *xdc = PNV_XSCOM_INTERFACE_CLASS(klass);
+
+    xdc->dt_xscom = pnv_pbcq_dt_xscom;
+
+    dc->realize = pnv_pbcq_realize;
+    dc->props = pnv_pbcq_properties;
+}
+
+static const TypeInfo pnv_pbcq_type_info = {
+    .name          = TYPE_PNV_PBCQ,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(PnvPBCQState),
+    .class_init    = pnv_pbcq_class_init,
+    .interfaces    = (InterfaceInfo[]) {
+        { TYPE_PNV_XSCOM_INTERFACE },
+        { }
+    }
+};
+
+static void pnv_pbcq_register_types(void)
+{
+    type_register_static(&pnv_pbcq_type_info);
+}
+
+type_init(pnv_pbcq_register_types)
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 7401ffe5b01c..cb88bb21f189 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -44,6 +44,10 @@ 
 #include "hw/isa/isa.h"
 #include "hw/char/serial.h"
 #include "hw/timer/mc146818rtc.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/msi.h"
 
 #include <libfdt.h>
 
@@ -724,7 +728,9 @@  static Object *pnv_chip_power9_intc_create(PnvChip *chip, Object *child,
 
 static void pnv_chip_power8_instance_init(Object *obj)
 {
+    PnvChip *chip = PNV_CHIP(obj);
     Pnv8Chip *chip8 = PNV8_CHIP(obj);
+    int i;
 
     object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI);
     object_property_add_child(obj, "psi", OBJECT(&chip8->psi), NULL);
@@ -740,6 +746,16 @@  static void pnv_chip_power8_instance_init(Object *obj)
     object_property_add_child(obj, "occ", OBJECT(&chip8->occ), NULL);
     object_property_add_const_link(OBJECT(&chip8->occ), "psi",
                                    OBJECT(&chip8->psi), &error_abort);
+
+    /* Create Power system Host Bridges 3 (PHB3) */
+    for (i = 0; i < chip->num_phbs; i++) {
+        object_initialize(&chip8->phbs[i], sizeof(chip8->phbs[i]),
+                          TYPE_PNV_PHB3);
+        object_property_add_child(obj, "phb[*]", OBJECT(&chip8->phbs[i]), NULL);
+        object_property_add_const_link(OBJECT(&chip8->phbs[i]), "xics",
+                                       OBJECT(qdev_get_machine()),
+                                       &error_abort);
+    }
 }
 
 static void pnv_chip_icp_realize(Pnv8Chip *chip8, Error **errp)
@@ -774,12 +790,43 @@  static void pnv_chip_icp_realize(Pnv8Chip *chip8, Error **errp)
     }
 }
 
+static PCIBus *pnv_chip_phb3_pci_create(PnvPHB3 *phb, Error **errp)
+{
+    PnvPBCQState *pbcq = &phb->pbcq;
+    PCIHostState *pcih = PCI_HOST_BRIDGE(phb);
+    PCIDevice *brdev;
+    PCIDevice *pdev;
+    PCIBus *parent;
+    uint8_t chassis = pbcq->chip_id * 4 + pbcq->phb_id;
+    uint8_t chassis_nr = 128;
+
+    /* Add root complex */
+    pdev = pci_create(pcih->bus, 0, TYPE_PNV_PHB3_RC);
+    qdev_prop_set_uint8(DEVICE(pdev), "chassis", chassis);
+    qdev_prop_set_uint16(DEVICE(pdev), "slot", 1);
+    object_property_add_child(OBJECT(phb), "phb3-rc", OBJECT(pdev), NULL);
+    qdev_init_nofail(DEVICE(pdev));
+
+    /* Setup bus for that chip */
+    parent = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
+
+    brdev = pci_create(parent, 0, "pci-bridge");
+    qdev_prop_set_uint8(DEVICE(brdev), PCI_BRIDGE_DEV_PROP_CHASSIS_NR,
+                        chassis_nr);
+    object_property_add_child(OBJECT(parent), "pci-bridge", OBJECT(brdev),
+                              NULL);
+    qdev_init_nofail(DEVICE(brdev));
+
+    return pci_bridge_get_sec_bus(PCI_BRIDGE(brdev));
+}
+
 static void pnv_chip_power8_realize(DeviceState *dev, Error **errp)
 {
     PnvChipClass *pcc = PNV_CHIP_GET_CLASS(dev);
     PnvChip *chip = PNV_CHIP(dev);
     Pnv8Chip *chip8 = PNV8_CHIP(dev);
     Error *local_err = NULL;
+    int i;
 
     pcc->parent_realize(dev, &local_err);
     if (local_err) {
@@ -817,6 +864,38 @@  static void pnv_chip_power8_realize(DeviceState *dev, Error **errp)
         return;
     }
     pnv_xscom_add_subregion(chip, PNV_XSCOM_OCC_BASE, &chip8->occ.xscom_regs);
+
+    /* MSIs are supported on this platform */
+    msi_nonbroken = true;
+
+    /* Create Power system Host Bridges 3 (PHB3) */
+    for (i = 0; i < chip->num_phbs; i++) {
+        Object *obj = OBJECT(&chip8->phbs[i]);
+        PnvPBCQState *pbcq = &chip8->phbs[i].pbcq;
+
+        object_property_set_int(obj, i, "phb-id", &error_fatal);
+        object_property_set_int(obj, chip->chip_id, "chip-id", &error_fatal);
+        object_property_set_bool(obj, true, "realized", &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+        qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+
+        pnv_xscom_add_subregion(chip, PNV_XSCOM_PBCQ_NEST_BASE + 0x400 * i,
+                                &pbcq->xscom_nest_regs);
+        pnv_xscom_add_subregion(chip, PNV_XSCOM_PBCQ_PCI_BASE + 0x400 * i,
+                                &pbcq->xscom_pci_regs);
+        pnv_xscom_add_subregion(chip, PNV_XSCOM_PBCQ_SPCI_BASE + 0x040 * i,
+                                &pbcq->xscom_spci_regs);
+
+        /* Setup the PCI busses */
+        pnv_chip_phb3_pci_create(&chip8->phbs[i], &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
 }
 
 static void pnv_chip_power8e_class_init(ObjectClass *klass, void *data)
@@ -1031,6 +1110,7 @@  static Property pnv_chip_properties[] = {
     DEFINE_PROP_UINT64("ram-size", PnvChip, ram_size, 0),
     DEFINE_PROP_UINT32("nr-cores", PnvChip, nr_cores, 1),
     DEFINE_PROP_UINT64("cores-mask", PnvChip, cores_mask, 0x0),
+    DEFINE_PROP_UINT32("num-phbs", PnvChip, num_phbs, 1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -1047,14 +1127,23 @@  static void pnv_chip_class_init(ObjectClass *klass, void *data)
 static ICSState *pnv_ics_get(XICSFabric *xi, int irq)
 {
     PnvMachineState *pnv = PNV_MACHINE(xi);
-    int i;
+    int i, j;
 
     for (i = 0; i < pnv->num_chips; i++) {
+        PnvChip *chip = pnv->chips[i];
         Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
 
         if (ics_valid_irq(&chip8->psi.ics, irq)) {
             return &chip8->psi.ics;
         }
+        for (j = 0; j < chip->num_phbs; j++) {
+            if (ics_valid_irq(&chip8->phbs[j].lsis, irq)) {
+                return &chip8->phbs[j].lsis;
+            }
+            if (ics_valid_irq(ICS_BASE(&chip8->phbs[j].msis), irq)) {
+                return ICS_BASE(&chip8->phbs[j].msis);
+            }
+        }
     }
     return NULL;
 }
@@ -1062,11 +1151,16 @@  static ICSState *pnv_ics_get(XICSFabric *xi, int irq)
 static void pnv_ics_resend(XICSFabric *xi)
 {
     PnvMachineState *pnv = PNV_MACHINE(xi);
-    int i;
+    int i, j;
 
     for (i = 0; i < pnv->num_chips; i++) {
+        PnvChip *chip = pnv->chips[i];
         Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
         ics_resend(&chip8->psi.ics);
+        for (j = 0; j < chip->num_phbs; j++) {
+            ics_resend(&chip8->phbs[j].lsis);
+            ics_resend(ICS_BASE(&chip8->phbs[j].msis));
+        }
     }
 }
 
@@ -1097,7 +1191,7 @@  static void pnv_pic_print_info(InterruptStatsProvider *obj,
                                Monitor *mon)
 {
     PnvMachineState *pnv = PNV_MACHINE(obj);
-    int i;
+    int i, j;
     CPUState *cs;
 
     CPU_FOREACH(cs) {
@@ -1107,8 +1201,14 @@  static void pnv_pic_print_info(InterruptStatsProvider *obj,
     }
 
     for (i = 0; i < pnv->num_chips; i++) {
+        PnvChip *chip = pnv->chips[i];
         Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
         ics_pic_print_info(&chip8->psi.ics, mon);
+
+        for (j = 0; j < chip->num_phbs; j++) {
+            ics_pic_print_info(&chip8->phbs[j].lsis, mon);
+            ics_pic_print_info(ICS_BASE(&chip8->phbs[j].msis), mon);
+        }
     }
 }
 
diff --git a/hw/ppc/pnv_xscom.c b/hw/ppc/pnv_xscom.c
index 46fae41f32b0..83f05d06053f 100644
--- a/hw/ppc/pnv_xscom.c
+++ b/hw/ppc/pnv_xscom.c
@@ -252,7 +252,11 @@  int pnv_dt_xscom(PnvChip *chip, void *fdt, int root_offset)
     args.fdt = fdt;
     args.xscom_offset = xscom_offset;
 
-    object_child_foreach(OBJECT(chip), xscom_dt_child, &args);
+    /* Some PnvXScomInterface objects lie a bit deeper (PnvPBCQState)
+     * than the first layer, so we need to loop on the whole object
+     * hierarchy to catch them
+     */
+    object_child_foreach_recursive(OBJECT(chip), xscom_dt_child, &args);
     return 0;
 }
 
diff --git a/hw/pci-host/Makefile.objs b/hw/pci-host/Makefile.objs
index 6d6597c06563..1b5b782bd986 100644
--- a/hw/pci-host/Makefile.objs
+++ b/hw/pci-host/Makefile.objs
@@ -19,3 +19,4 @@  common-obj-$(CONFIG_PCI_GENERIC) += gpex.o
 common-obj-$(CONFIG_PCI_XILINX) += xilinx-pcie.o
 
 common-obj-$(CONFIG_PCI_DESIGNWARE) += designware.o
+obj-$(CONFIG_POWERNV) += pnv_phb3.o pnv_phb3_pbcq.o pnv_phb3_msi.o