
[U-Boot,8/8] x86: quark: Optimize MRC execution time

Message ID 1441014773-10902-9-git-send-email-bmeng.cn@gmail.com
State Superseded
Delegated to: Simon Glass

Commit Message

Bin Meng Aug. 31, 2015, 9:52 a.m. UTC
Boot time performance degradation is observed with the conversion
to use dm pci. The Intel Quark SoC has a low-end x86 processor
running at only 400 MHz, and the most time-consuming part of the
boot is the MRC. Each MRC register programming requires indirect
access via the PCI bus, and with dm pci each access to PCI
configuration space carries some overhead. Unfortunately this
per-access overhead accumulates over the whole MRC process, and
ends up doubling the boot time (25 seconds, up from 12 seconds).

To speed up the boot, create an optimized version of the PCI config
read/write routines that does not go through driver model. Now it
takes only about 3 seconds to finish the MRC, which is really fast
(8 times faster than with dm pci, or 4 times faster than before).

Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
---

 arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 22 deletions(-)

Comments

Simon Glass Aug. 31, 2015, 1:20 p.m. UTC | #1
Hi Bin,

On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
> Boot time performance degradation is observed with the conversion
> to use dm pci. Intel Quark SoC has a low end x86 processor with
> only 400MHz frequency and the most time consuming part is with MRC.
> Each MRC register programming requires indirect access via pci bus.
> With dm pci, accessing pci configuration space has some overhead.
> Unfortunately this single access overhead gets accumulated in the
> whole MRC process, and finally leads to twice boot time (25 seconds)
> than before (12 seconds).
>
> To speed up the boot, create an optimized version of pci config
> read/write routines without bothering to go through driver model.
> Now it only takes about 3 seconds to finish MRC, which is really
> fast (8 times faster than dm pci, or 4 times faster than before).
>
> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
> ---
>
>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>  1 file changed, 37 insertions(+), 22 deletions(-)

Before I delve into the patch - with driver model we are using the I/O
method - see pci_x86_read_config(). Is that the source of the slowdown,
or is it just general driver model overhead?

If the former then perhaps we should change this. If the latter then
we have work to do...

Regards,
Simon
Bin Meng Aug. 31, 2015, 1:43 p.m. UTC | #2
Hi Simon,

On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>> Boot time performance degradation is observed with the conversion
>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>> only 400MHz frequency and the most time consuming part is with MRC.
>> Each MRC register programming requires indirect access via pci bus.
>> With dm pci, accessing pci configuration space has some overhead.
>> Unfortunately this single access overhead gets accumulated in the
>> whole MRC process, and finally leads to twice boot time (25 seconds)
>> than before (12 seconds).
>>
>> To speed up the boot, create an optimized version of pci config
>> read/write routines without bothering to go through driver model.
>> Now it only takes about 3 seconds to finish MRC, which is really
>> fast (8 times faster than dm pci, or 4 times faster than before).
>>
>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>> ---
>>
>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>  1 file changed, 37 insertions(+), 22 deletions(-)
>
> Before I delve into the patch - with driver model we are using the I/O
> method - see pci_x86_read_config(). Is that the source of the slowdown
> or is it just general driver model overhead.

The MRC calls the APIs in arch/x86/cpu/quark/msg_port.c to program the
DDR controller. Inside msg_port.c, pci_write_config_dword() and
pci_read_config_dword() are called.

With driver model, the overhead is:

pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
-> uclass_get_device_by_seq(), then pci_bus_write_config() finally
calls pci_x86_read_config().

Without driver model, there is still some overhead (which is why the
MRC time was previously about 12 seconds):

pci_write_config_dword() -> pci_hose_write_config_dword() ->
TYPE1_PCI_OP(write, dword, u32, outl, 0)

With my optimized version, pci_write_config_dword() directly performs
a hardcoded dword-size PCI config access, with no need to handle the
offset and mask or to dereference hose->cfg_addr/hose->cfg_data.
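
For reference, the optimized accessor is essentially just this (an
untested sketch rather than the exact patch; the constant names here
are illustrative):

#include <common.h>
#include <pci.h>	/* pci_dev_t, PCI_BDF() */
#include <asm/io.h>	/* outl() */

#define CFG_ADDR_PORT	0xcf8		/* type 1 config address port */
#define CFG_DATA_PORT	0xcfc		/* type 1 config data port */
#define CFG_ENABLE	0x80000000	/* config cycle enable bit */

/*
 * Dword-only type 1 config write: 'dev' is already in PCI_BDF()
 * encoding and 'offset' is assumed to be dword-aligned, so no offset
 * masking and no size switch/case are needed.
 */
static void qrk_pci_write_config_dword(pci_dev_t dev, int offset, u32 value)
{
	outl(dev | offset | CFG_ENABLE, CFG_ADDR_PORT);
	outl(value, CFG_DATA_PORT);
}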

>
> If the former then perhaps we should change this. If the latter then
> we have work to do...
>

Regards,
Bin
Simon Glass Aug. 31, 2015, 1:54 p.m. UTC | #3
Hi Bin,

On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>> Boot time performance degradation is observed with the conversion
>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>> only 400MHz frequency and the most time consuming part is with MRC.
>>> Each MRC register programming requires indirect access via pci bus.
>>> With dm pci, accessing pci configuration space has some overhead.
>>> Unfortunately this single access overhead gets accumulated in the
>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>> than before (12 seconds).
>>>
>>> To speed up the boot, create an optimized version of pci config
>>> read/write routines without bothering to go through driver model.
>>> Now it only takes about 3 seconds to finish MRC, which is really
>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>
>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>> ---
>>>
>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>
>> Before I delve into the patch - with driver model we are using the I/O
>> method - see pci_x86_read_config(). Is that the source of the slowdown
>> or is it just general driver model overhead.
>
> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
> controller. Inside msg_port.c, pci_write_config_dword() and
> pci_read_config_dword() are called.
>
> With driver model, the overhead is:
>
> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
> call pci_x86_read_config().
>
> Without driver model, there is still some overhead (so previously the
> MRC time was about 12 seconds)
>
> pci_write_config_dword() -> pci_hose_write_config_dword() ->
> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>
> With my optimized version, pci_write_config_dword() directly calls a
> hardcoded dword size pci config access, without the need to consider
> offset and mask, and dereferencing hose->cfg_addr/cfg->data.

What about using dm_pci_read_config32()? We should try to move PCI
access to driver model to avoid the uclass_get_device_by_seq()
everywhere.

>
>>
>> If the former then perhaps we should change this. If the latter then
>> we have work to do...
>>

Regards,
Simon
Bin Meng Aug. 31, 2015, 2:04 p.m. UTC | #4
Hi Simon,

On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>> Hi Simon,
>>
>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>> Boot time performance degradation is observed with the conversion
>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>> Each MRC register programming requires indirect access via pci bus.
>>>> With dm pci, accessing pci configuration space has some overhead.
>>>> Unfortunately this single access overhead gets accumulated in the
>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>> than before (12 seconds).
>>>>
>>>> To speed up the boot, create an optimized version of pci config
>>>> read/write routines without bothering to go through driver model.
>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>
>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>> ---
>>>>
>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>
>>> Before I delve into the patch - with driver model we are using the I/O
>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>> or is it just general driver model overhead.
>>
>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>> controller. Inside msg_port.c, pci_write_config_dword() and
>> pci_read_config_dword() are called.
>>
>> With driver model, the overhead is:
>>
>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>> call pci_x86_read_config().
>>
>> Without driver model, there is still some overhead (so previously the
>> MRC time was about 12 seconds)
>>
>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>
>> With my optimized version, pci_write_config_dword() directly calls a
>> hardcoded dword size pci config access, without the need to consider
>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>
> What about if we use dm_pci_read_config32()? We should try to move PCI
> access to driver model to avoid the uclass_get_device_by_seq()
> everywhere.

I don't think that helps. dm_pci_read_config32() requires a dm driver,
and the MRC is just something that programs a bunch of registers with
PCI config read/write calls.

>
>>
>>>
>>> If the former then perhaps we should change this. If the latter then
>>> we have work to do...
>>>

Also, for this patch, I just realized that it not only helps to
reduce the boot time, but also helps to support the PCIe root ports
in future patches.

Checking the Quark SoC manual, I found something that needs to be done
to get the two PCIe root ports on the Quark SoC to work. In order to
make the PCIe root ports show up in the PCI configuration space, we
need to program some registers on the message bus (using the APIs in
arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
pci_write_config_dword() in msg_port.c, we end up triggering the PCI
enumeration process before we actually write anything to the
configuration space; however, enumeration will hang when it scans the
PCIe root ports if we have not properly initialized the message bus
registers. This is a chicken-and-egg problem.
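
In other words, the init ordering has to look something like this
(a rough sketch with hypothetical helper names, just to show the
shape):

#include <common.h>

void quark_enable_pcie_ports(void);	/* hypothetical helper */

int arch_cpu_init(void)
{
	/*
	 * 1. Program the message bus registers that make the two PCIe
	 *    root ports appear, using the raw qrk_pci_*_config_dword()
	 *    accessors so that no PCI enumeration is triggered.
	 */
	quark_enable_pcie_ports();

	/*
	 * 2. Only after this is PCI enumeration safe; the dm pci
	 *    config space accessors can be used from here on.
	 */
	return 0;
}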

Regards,
Bin
Simon Glass Aug. 31, 2015, 11:12 p.m. UTC | #5
Hi Bin,

On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>> Hi Simon,
>>>
>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>> Hi Bin,
>>>>
>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>> Boot time performance degradation is observed with the conversion
>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>> than before (12 seconds).
>>>>>
>>>>> To speed up the boot, create an optimized version of pci config
>>>>> read/write routines without bothering to go through driver model.
>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>
>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>> ---
>>>>>
>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>
>>>> Before I delve into the patch - with driver model we are using the I/O
>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>> or is it just general driver model overhead.
>>>
>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>> pci_read_config_dword() are called.
>>>
>>> With driver model, the overhead is:
>>>
>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>> call pci_x86_read_config().
>>>
>>> Without driver model, there is still some overhead (so previously the
>>> MRC time was about 12 seconds)
>>>
>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>
>>> With my optimized version, pci_write_config_dword() directly calls a
>>> hardcoded dword size pci config access, without the need to consider
>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>
>> What about if we use dm_pci_read_config32()? We should try to move PCI
>> access to driver model to avoid the uclass_get_device_by_seq()
>> everywhere.
>
> I don't think that helps. dm_pci_read_config32() requires a dm driver.
> MRC is just something that program a bunch of registers with pci
> config rw call.

My question is really: what takes the time? It's not clear whether it
is the driver model overhead or something else. The code you add in
qrk_pci_write_config_dword() looks very similar to
pci_x86_read_config().

>
>>
>>>
>>>>
>>>> If the former then perhaps we should change this. If the latter then
>>>> we have work to do...
>>>>
>
> Also for this patch, I just realized that not only it helps to reduce
> the boot time, but also it helps to support PCIe root port in future
> patches.
>
> By checking the Quark SoC manual, I found something that needs to be
> done to get the two PCIe root ports on Quark SoC to work. In order to
> get PCIe root ports show up in the PCI configuration space, we need
> program some registers in the message bus (using APIs in
> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
> enumeration process first before we actually write something to the
> configuration space, however the enumeration process will hang when
> scanning to the PCIe root port if we don't properly initialize the
> message bus registers. This is a chicken-egg problem.

Sure, although I see that as a separate problem.

We can't have a driver model implementation that is really slow, so
I'd like to clear that up first. Of course your patch makes sense for
other reasons, but I don't want to play whack-a-mole here.

Also we should start to move things away from the non-driver-model pci
functions.

Regards,
Simon
Bin Meng Sept. 1, 2015, 1:40 a.m. UTC | #6
Hi Simon,

On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>> Hi Simon,
>>
>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>> Hi Simon,
>>>>
>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>> Hi Bin,
>>>>>
>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>> Boot time performance degradation is observed with the conversion
>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>> than before (12 seconds).
>>>>>>
>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>> read/write routines without bothering to go through driver model.
>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>
>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>> ---
>>>>>>
>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>
>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>> or is it just general driver model overhead.
>>>>
>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>> pci_read_config_dword() are called.
>>>>
>>>> With driver model, the overhead is:
>>>>
>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>> call pci_x86_read_config().
>>>>
>>>> Without driver model, there is still some overhead (so previously the
>>>> MRC time was about 12 seconds)
>>>>
>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>
>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>> hardcoded dword size pci config access, without the need to consider
>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>
>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>> access to driver model to avoid the uclass_get_device_by_seq()
>>> everywhere.
>>
>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>> MRC is just something that program a bunch of registers with pci
>> config rw call.
>
> My question is really what takes the time? It's not clear whether it
> is the driver model overhead or something else. The code you add in
> qrk_pci_write_config_dword() looks very similar to
> pci_x86_read_config().
>

It is the driver model overhead. In order to get to
pci_x86_read_config(), we need to go through a bunch of function calls
(see above). Yes, my version is very similar to pci_x86_read_config(),
but mine is simpler as it only needs to deal with the dword size, so
there is no need for the offset mask and the switch/case. If you look
at the Quark MRC code, there are thousands of calls to msg_port_read()
and msg_port_write().
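
To make the cost concrete, each message port access is itself several
config cycles through the host bridge. Roughly (an untested sketch;
the 0xd0/0xd4/0xd8 offsets and the 0x10 read opcode come from the
Quark datasheet, while the macro names are illustrative):

#include <common.h>
#include <pci.h>

#define QUARK_HOST_BRIDGE	PCI_BDF(0, 0, 0)
#define MSG_CTRL_REG		0xd0	/* MCR: opcode/port/offset */
#define MSG_DATA_REG		0xd4	/* MDR: read/write data */
#define MSG_CTRL_EXT_REG	0xd8	/* MCRX: extended offset */
#define MSG_OP_READ		0x10	/* register read opcode */
#define MSG_BYTE_ENABLE		0xf0	/* enable all four bytes */

/* the raw dword accessors sketched earlier in the thread */
void qrk_pci_write_config_dword(pci_dev_t dev, int offset, u32 value);
void qrk_pci_read_config_dword(pci_dev_t dev, int offset, u32 *valuep);

/* one message port read = three config cycles via the host bridge */
u32 msg_port_read(u8 port, u32 offset)
{
	u32 value;

	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
				   offset & 0xffffff00);
	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_REG,
				   (MSG_OP_READ << 24) | (port << 16) |
				   ((offset << 8) & 0xff00) |
				   MSG_BYTE_ENABLE);
	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);

	return value;
}

So with thousands of msg_port_read()/msg_port_write() calls, any
per-config-cycle overhead is multiplied by roughly three times that
count.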

>>
>>>
>>>>
>>>>>
>>>>> If the former then perhaps we should change this. If the latter then
>>>>> we have work to do...
>>>>>
>>
>> Also for this patch, I just realized that not only it helps to reduce
>> the boot time, but also it helps to support PCIe root port in future
>> patches.
>>
>> By checking the Quark SoC manual, I found something that needs to be
>> done to get the two PCIe root ports on Quark SoC to work. In order to
>> get PCIe root ports show up in the PCI configuration space, we need
>> program some registers in the message bus (using APIs in
>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>> enumeration process first before we actually write something to the
>> configuration space, however the enumeration process will hang when
>> scanning to the PCIe root port if we don't properly initialize the
>> message bus registers. This is a chicken-egg problem.
>
> Sure, although I see that as a separate problem.
>

Yes, it is a separate problem. It occurred to me when I was reading
the manual after I submitted the patch.

> We can't have a driver model implementation that is really slow, so
> I'd like to clear that up first. Of course your patch makes sense for
> other reasons, but I don't want to play whack-a-mole here.
>

Agreed. So far driver model PCI is used on x86 boards and sandbox.
The performance issue was not obvious on those targets, but it is
quite noticeable on Intel Quark. These PCI config read/write routines
go through lots of function calls before we actually touch the I/O
ports 0xcf8 and 0xcfc, especially when the device is not on the root
bus (the access needs to go through its parent, followed by its
parent's parent).

But anyway, I think this optimization for Quark is needed. I doubt we
can optimize driver model PCI to such an extent.

> Also we should start to move things away from the non-driver-model pci
> functions.
>

Regards,
Bin
Simon Glass Sept. 1, 2015, 2:57 a.m. UTC | #7
Hi Bin,

On 31 August 2015 at 19:40, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>> Hi Simon,
>>>
>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>>> Hi Bin,
>>>>
>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>> Hi Bin,
>>>>>>
>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>> than before (12 seconds).
>>>>>>>
>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>
>>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>>> ---
>>>>>>>
>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>
>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>> or is it just general driver model overhead.
>>>>>
>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>> pci_read_config_dword() are called.
>>>>>
>>>>> With driver model, the overhead is:
>>>>>
>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>> call pci_x86_read_config().
>>>>>
>>>>> Without driver model, there is still some overhead (so previously the
>>>>> MRC time was about 12 seconds)
>>>>>
>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>
>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>> hardcoded dword size pci config access, without the need to consider
>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>
>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>> everywhere.
>>>
>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>> MRC is just something that program a bunch of registers with pci
>>> config rw call.
>>
>> My question is really what takes the time? It's not clear whether it
>> is the driver model overhead or something else. The code you add in
>> qrk_pci_write_config_dword() looks very similar to
>> pci_x86_read_config().
>>
>
> It is the driver model overhead. In order to get to
> pci_x86_read_config(), we need go through a bunch of function calls
> (see above). Yes, my version is very similar to pci_x86_read_config(),
> but my version is more simpler as it only needs to deal with dword
> size thus no need to do offset mask and switch/case. If you look at
> the Quark MRC codes, there are thousands of calls to msg_port_read()
> and msg_port_write().
>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>> we have work to do...
>>>>>>
>>>
>>> Also for this patch, I just realized that not only it helps to reduce
>>> the boot time, but also it helps to support PCIe root port in future
>>> patches.
>>>
>>> By checking the Quark SoC manual, I found something that needs to be
>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>> get PCIe root ports show up in the PCI configuration space, we need
>>> program some registers in the message bus (using APIs in
>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>> enumeration process first before we actually write something to the
>>> configuration space, however the enumeration process will hang when
>>> scanning to the PCIe root port if we don't properly initialize the
>>> message bus registers. This is a chicken-egg problem.
>>
>> Sure, although I see that as a separate problem.
>>
>
> Yes, it is a separate problem. It came to me when I was reading the
> manual after I submitted the patch.
>
>> We can't have a driver model implementation that is really slow, so
>> I'd like to clear that up first. Of course your patch makes sense for
>> other reasons, but I don't want to play whack-a-mole here.
>>
>
> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
> The performance issue was not obvious on these targets, but it is
> quite noticeable on Intel Quark. These PCI config read/write routines
> will go through lots of function calls before we actually touch the
> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
> root bus (it needs to go through its parent followed by its parent's
> parent).
>
> But anyway I think this optimization for Quark is needed. I doubt we
> can optimize driver model pci to such an extent.

Can we use this as an opportunity to try a few things? If we use the
dm_pci functions that should cut out some overhead. Can you try an
experiment to see how much difference it makes?

dm_pci_read_config32()
   pci_get_bdf()
pci_read_config()
   for loop which probably terminates immediately
pci_bus_read_config()
read_config(), which is pci_x86_read_config()

So it's not great but it doesn't look too bad.

Also, is this an Intel Galileo Gen 2 development board? I'm thinking
of getting one.

>
>> Also we should start to move things away from the non-driver-model pci
>> functions.
>>
>
> Regards,
> Bin

Regards,
Simon
Bin Meng Sept. 1, 2015, 3:04 a.m. UTC | #8
Hi Simon,

On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg@chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn@gmail.com> wrote:
>> Hi Simon,
>>
>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>> Hi Simon,
>>>>
>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>> Hi Bin,
>>>>>
>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>> Hi Simon,
>>>>>>
>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>> Hi Bin,
>>>>>>>
>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>> than before (12 seconds).
>>>>>>>>
>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>
>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>
>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>> or is it just general driver model overhead.
>>>>>>
>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>> pci_read_config_dword() are called.
>>>>>>
>>>>>> With driver model, the overhead is:
>>>>>>
>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>> call pci_x86_read_config().
>>>>>>
>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>> MRC time was about 12 seconds)
>>>>>>
>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>
>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>
>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>> everywhere.
>>>>
>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>> MRC is just something that program a bunch of registers with pci
>>>> config rw call.
>>>
>>> My question is really what takes the time? It's not clear whether it
>>> is the driver model overhead or something else. The code you add in
>>> qrk_pci_write_config_dword() looks very similar to
>>> pci_x86_read_config().
>>>
>>
>> It is the driver model overhead. In order to get to
>> pci_x86_read_config(), we need go through a bunch of function calls
>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>> but my version is more simpler as it only needs to deal with dword
>> size thus no need to do offset mask and switch/case. If you look at
>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>> and msg_port_write().
>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>> we have work to do...
>>>>>>>
>>>>
>>>> Also for this patch, I just realized that not only it helps to reduce
>>>> the boot time, but also it helps to support PCIe root port in future
>>>> patches.
>>>>
>>>> By checking the Quark SoC manual, I found something that needs to be
>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>> program some registers in the message bus (using APIs in
>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>> enumeration process first before we actually write something to the
>>>> configuration space, however the enumeration process will hang when
>>>> scanning to the PCIe root port if we don't properly initialize the
>>>> message bus registers. This is a chicken-egg problem.
>>>
>>> Sure, although I see that as a separate problem.
>>>
>>
>> Yes, it is a separate problem. It came to me when I was reading the
>> manual after I submitted the patch.
>>
>>> We can't have a driver model implementation that is really slow, so
>>> I'd like to clear that up first. Of course your patch makes sense for
>>> other reasons, but I don't want to play whack-a-mole here.
>>>
>>
>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>> The performance issue was not obvious on these targets, but it is
>> quite noticeable on Intel Quark. These PCI config read/write routines
>> will go through lots of function calls before we actually touch the
>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>> root bus (it needs to go through its parent followed by its parent's
>> parent).
>>
>> But anyway I think this optimization for Quark is needed. I doubt we
>> can optimize driver model pci to such an extent.
>
> Can we use this as an opportunity to try a few things? If we use the
> dm_pci functions that should cut out some overhead. Can you try an
> experiment to see how much difference it makes?
>
> dm_pci_read_config32()

We can't use this API, as the MRC is not a dm driver.

>    pci_get_bdf()
> pci_read_config()
>    for loop which probably terminates immediately
> pci_bus_read_config()
> read_config(), which is pci_x86_read_config()
>
> So it's not great but it doesn't look too bad.
>
> Also is this an Intel Gallileo gen 2 development board? I'm thinking
> of getting one.
>

Yes, this is an Intel Galileo Gen 2 development board. Although there
was a Gen 1 board in the past, and the same u-boot.rom can boot on
both the Gen 1 and Gen 2 boards, Intel is now shipping only the Gen 2
board.

>>
>>> Also we should start to move things away from the non-driver-model pci
>>> functions.
>>>
>>

Regards,
Bin
Simon Glass Sept. 1, 2015, 3:12 a.m. UTC | #9
Hi Bin,

On 31 August 2015 at 21:04, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg@chromium.org> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn@gmail.com> wrote:
>>> Hi Simon,
>>>
>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
>>>> Hi Bin,
>>>>
>>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>> Hi Bin,
>>>>>>
>>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>>> Hi Bin,
>>>>>>>>
>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>>> than before (12 seconds).
>>>>>>>>>
>>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>>
>>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>>> or is it just general driver model overhead.
>>>>>>>
>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>>> pci_read_config_dword() are called.
>>>>>>>
>>>>>>> With driver model, the overhead is:
>>>>>>>
>>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>>> call pci_x86_read_config().
>>>>>>>
>>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>>> MRC time was about 12 seconds)
>>>>>>>
>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>>
>>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>>
>>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>>> everywhere.
>>>>>
>>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>>> MRC is just something that program a bunch of registers with pci
>>>>> config rw call.
>>>>
>>>> My question is really what takes the time? It's not clear whether it
>>>> is the driver model overhead or something else. The code you add in
>>>> qrk_pci_write_config_dword() looks very similar to
>>>> pci_x86_read_config().
>>>>
>>>
>>> It is the driver model overhead. In order to get to
>>> pci_x86_read_config(), we need go through a bunch of function calls
>>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>>> but my version is more simpler as it only needs to deal with dword
>>> size thus no need to do offset mask and switch/case. If you look at
>>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>>> and msg_port_write().
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>>> we have work to do...
>>>>>>>>
>>>>>
>>>>> Also for this patch, I just realized that not only it helps to reduce
>>>>> the boot time, but also it helps to support PCIe root port in future
>>>>> patches.
>>>>>
>>>>> By checking the Quark SoC manual, I found something that needs to be
>>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>>> program some registers in the message bus (using APIs in
>>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>>> enumeration process first before we actually write something to the
>>>>> configuration space, however the enumeration process will hang when
>>>>> scanning to the PCIe root port if we don't properly initialize the
>>>>> message bus registers. This is a chicken-egg problem.
>>>>
>>>> Sure, although I see that as a separate problem.
>>>>
>>>
>>> Yes, it is a separate problem. It came to me when I was reading the
>>> manual after I submitted the patch.
>>>
>>>> We can't have a driver model implementation that is really slow, so
>>>> I'd like to clear that up first. Of course your patch makes sense for
>>>> other reasons, but I don't want to play whack-a-mole here.
>>>>
>>>
>>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>>> The performance issue was not obvious on these targets, but it is
>>> quite noticeable on Intel Quark. These PCI config read/write routines
>>> will go through lots of function calls before we actually touch the
>>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>>> root bus (it needs to go through its parent followed by its parent's
>>> parent).
>>>
>>> But anyway I think this optimization for Quark is needed. I doubt we
>>> can optimize driver model pci to such an extent.
>>
>> Can we use this as an opportunity to try a few things? If we use the
>> dm_pci functions that should cut out some overhead. Can you try an
>> experiment to see how much difference it makes?
>>
>> dm_pci_read_config32()
>
> We can't use this API as MRC is not a dm driver.

OK, probably I need to dig in and understand this a little better. Is
it running pre-relocation with the early PCI stuff? We could make a
driver with UCLASS_RAM perhaps.
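
Something like this skeleton, perhaps (untested, and assuming a
UCLASS_RAM along the lines of our other uclasses, with a get_info()
operation):

#include <common.h>
#include <dm.h>
#include <ram.h>

static int quark_ddr_probe(struct udevice *dev)
{
	/* run the MRC here, pre-relocation */
	return 0;
}

static int quark_ddr_get_info(struct udevice *dev, struct ram_info *info)
{
	/* report base/size of the RAM the MRC set up */
	return 0;
}

static const struct ram_ops quark_ddr_ops = {
	.get_info = quark_ddr_get_info,
};

U_BOOT_DRIVER(quark_ddr) = {
	.name	= "quark_ddr",
	.id	= UCLASS_RAM,
	.ops	= &quark_ddr_ops,
	.probe	= quark_ddr_probe,
};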

>
>>    pci_get_bdf()
>> pci_read_config()
>>    for loop which probably terminates immediately
>> pci_bus_read_config()
>> read_config(), which is pci_x86_read_config()
>>
>> So it's not great but it doesn't look too bad.
>>
>> Also is this an Intel Gallileo gen 2 development board? I'm thinking
>> of getting one.
>>
>
> Yes, this is an Intel Galileo gen2 development board. Although there
> is an gen1 board in the past and the same u-boot.rom can boot on both
> gen1 and gen2 board, Intel is now shipping only gen2 board.

OK I've ordered one to try out.

>
>>>
>>>> Also we should start to move things away from the non-driver-model pci
>>>> functions.

Regards,
Simon
Bin Meng Sept. 1, 2015, 3:22 a.m. UTC | #10
Hi Simon,

On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass <sjg@chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 21:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>> Hi Simon,
>>
>> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg@chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>> Hi Simon,
>>>>
>>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
>>>>> Hi Bin,
>>>>>
>>>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>> Hi Simon,
>>>>>>
>>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>> Hi Bin,
>>>>>>>
>>>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>>>> Hi Bin,
>>>>>>>>>
>>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>>>> than before (12 seconds).
>>>>>>>>>>
>>>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>>>
>>>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>>>> or is it just general driver model overhead.
>>>>>>>>
>>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>>>> pci_read_config_dword() are called.
>>>>>>>>
>>>>>>>> With driver model, the overhead is:
>>>>>>>>
>>>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>>>> call pci_x86_read_config().
>>>>>>>>
>>>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>>>> MRC time was about 12 seconds)
>>>>>>>>
>>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>>>
>>>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>>>
>>>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>>>> everywhere.
>>>>>>
>>>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>>>> MRC is just something that program a bunch of registers with pci
>>>>>> config rw call.
>>>>>
>>>>> My question is really what takes the time? It's not clear whether it
>>>>> is the driver model overhead or something else. The code you add in
>>>>> qrk_pci_write_config_dword() looks very similar to
>>>>> pci_x86_read_config().
>>>>>
>>>>
>>>> It is the driver model overhead. In order to get to
>>>> pci_x86_read_config(), we need go through a bunch of function calls
>>>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>>>> but my version is more simpler as it only needs to deal with dword
>>>> size thus no need to do offset mask and switch/case. If you look at
>>>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>>>> and msg_port_write().
>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>>>> we have work to do...
>>>>>>>>>
>>>>>>
>>>>>> Also for this patch, I just realized that not only it helps to reduce
>>>>>> the boot time, but also it helps to support PCIe root port in future
>>>>>> patches.
>>>>>>
>>>>>> By checking the Quark SoC manual, I found something that needs to be
>>>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>>>> program some registers in the message bus (using APIs in
>>>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>>>> enumeration process first before we actually write something to the
>>>>>> configuration space, however the enumeration process will hang when
>>>>>> scanning to the PCIe root port if we don't properly initialize the
>>>>>> message bus registers. This is a chicken-egg problem.
>>>>>
>>>>> Sure, although I see that as a separate problem.
>>>>>
>>>>
>>>> Yes, it is a separate problem. It came to me when I was reading the
>>>> manual after I submitted the patch.
>>>>
>>>>> We can't have a driver model implementation that is really slow, so
>>>>> I'd like to clear that up first. Of course your patch makes sense for
>>>>> other reasons, but I don't want to play whack-a-mole here.
>>>>>
>>>>
>>>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>>>> The performance issue was not obvious on these targets, but it is
>>>> quite noticeable on Intel Quark. These PCI config read/write routines
>>>> will go through lots of function calls before we actually touch the
>>>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>>>> root bus (it needs to go through its parent followed by its parent's
>>>> parent).
>>>>
>>>> But anyway I think this optimization for Quark is needed. I doubt we
>>>> can optimize driver model pci to such an extent.
>>>
>>> Can we use this as an opportunity to try a few things? If we use the
>>> dm_pci functions that should cut out some overhead. Can you try an
>>> experiment to see how much difference it makes?
>>>
>>> dm_pci_read_config32()
>>
>> We can't use this API as MRC is not a dm driver.
>
> OK, probably I need to dig in and understand this a little better. Is
> it running pre-relocation with the early PCI stuff? We could make a
> driver with UCLASS_RAM perhaps.

Yes, it is running pre-relocation with the early PCI stuff. But I
doubt the need to create a UCLASS_RAM driver for x86 targets, as most
x86 targets use FSP to initialize the RAM. The best candidate I can
think of right now for implementing UCLASS_RAM is the Freescale DDR
controller driver on PowerPC, and recently on ARM. It supports both
SPD and memory-down. On ARM, most targets' DDR initialization is
memory-down, I believe.

>
>>
>>>    pci_get_bdf()
>>> pci_read_config()
>>>    for loop which probably terminates immediately
>>> pci_bus_read_config()
>>> read_config(), which is pci_x86_read_config()
>>>
>>> So it's not great but it doesn't look too bad.
>>>
>>> Also is this an Intel Gallileo gen 2 development board? I'm thinking
>>> of getting one.
>>>
>>
>> Yes, this is an Intel Galileo gen2 development board. Although there
>> is an gen1 board in the past and the same u-boot.rom can boot on both
>> gen1 and gen2 board, Intel is now shipping only gen2 board.
>
> OK I've ordered one to try out.
>
>>
>>>>
>>>>> Also we should start to move things away from the non-driver-model pci
>>>>> functions.
>

Regards,
Bin
Bin Meng Sept. 1, 2015, 10:29 a.m. UTC | #11
Hi Simon,

On Tue, Sep 1, 2015 at 11:22 AM, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass <sjg@chromium.org> wrote:
>> Hi Bin,
>>
>> On 31 August 2015 at 21:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>> Hi Simon,
>>>
>>> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg@chromium.org> wrote:
>>>> Hi Bin,
>>>>
>>>> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>> Hi Simon,
>>>>>
>>>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg@chromium.org> wrote:
>>>>>> Hi Bin,
>>>>>>
>>>>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>>> Hi Bin,
>>>>>>>>
>>>>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>>> Hi Simon,
>>>>>>>>>
>>>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg@chromium.org> wrote:
>>>>>>>>>> Hi Bin,
>>>>>>>>>>
>>>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn@gmail.com> wrote:
>>>>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>>>>> than before (12 seconds).
>>>>>>>>>>>
>>>>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn@gmail.com>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>>>>
>>>>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>>>>> or is it just general driver model overhead.
>>>>>>>>>
>>>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>>>>> pci_read_config_dword() are called.
>>>>>>>>>
>>>>>>>>> With driver model, the overhead is:
>>>>>>>>>
>>>>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>>>>> call pci_x86_read_config().
>>>>>>>>>
>>>>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>>>>> MRC time was about 12 seconds)
>>>>>>>>>
>>>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>>>>
>>>>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>>>>
>>>>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>>>>> everywhere.
>>>>>>>
>>>>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>>>>> MRC is just something that program a bunch of registers with pci
>>>>>>> config rw call.
>>>>>>
>>>>>> My question is really what takes the time? It's not clear whether it
>>>>>> is the driver model overhead or something else. The code you add in
>>>>>> qrk_pci_write_config_dword() looks very similar to
>>>>>> pci_x86_read_config().
>>>>>>
>>>>>
>>>>> It is the driver model overhead. In order to get to
>>>>> pci_x86_read_config(), we need go through a bunch of function calls
>>>>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>>>>> but my version is more simpler as it only needs to deal with dword
>>>>> size thus no need to do offset mask and switch/case. If you look at
>>>>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>>>>> and msg_port_write().
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>>>>> we have work to do...
>>>>>>>>>>
>>>>>>>
>>>>>>> Also for this patch, I just realized that not only it helps to reduce
>>>>>>> the boot time, but also it helps to support PCIe root port in future
>>>>>>> patches.
>>>>>>>
>>>>>>> By checking the Quark SoC manual, I found something that needs to be
>>>>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>>>>> program some registers in the message bus (using APIs in
>>>>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>>>>> enumeration process first before we actually write something to the
>>>>>>> configuration space, however the enumeration process will hang when
>>>>>>> scanning to the PCIe root port if we don't properly initialize the
>>>>>>> message bus registers. This is a chicken-egg problem.
>>>>>>
>>>>>> Sure, although I see that as a separate problem.
>>>>>>
>>>>>
>>>>> Yes, it is a separate problem. It came to me when I was reading the
>>>>> manual after I submitted the patch.
>>>>>
>>>>>> We can't have a driver model implementation that is really slow, so
>>>>>> I'd like to clear that up first. Of course your patch makes sense for
>>>>>> other reasons, but I don't want to play whack-a-mole here.
>>>>>>
>>>>>
>>>>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>>>>> The performance issue was not obvious on these targets, but it is
>>>>> quite noticeable on Intel Quark. These PCI config read/write routines
>>>>> will go through lots of function calls before we actually touch the
>>>>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>>>>> root bus (it needs to go through its parent followed by its parent's
>>>>> parent).
>>>>>
>>>>> But anyway I think this optimization for Quark is needed. I doubt we
>>>>> can optimize driver model pci to such an extent.
>>>>
>>>> Can we use this as an opportunity to try a few things? If we use the
>>>> dm_pci functions that should cut out some overhead. Can you try an
>>>> experiment to see how much difference it makes?
>>>>
>>>> dm_pci_read_config32()
>>>
>>> We can't use this API as MRC is not a dm driver.
>>
>> OK, I probably need to dig in and understand this a little better. Is
>> it running pre-relocation with the early PCI stuff? Perhaps we could
>> make a driver with UCLASS_RAM.
>
> Yes, it is running pre-relocation with the early PCI stuff. But I
> doubt the need to create a UCLASS_RAM driver for x86 targets, as most
> x86 targets use FSP to initialize the RAM. The best candidate for
> implementing UCLASS_RAM that I can think of right now is the Freescale
> DDR controller driver, on powerpc and recently on ARM. It supports
> both SPD and memory-down configurations. On ARM, most targets' DDR
> initialization is memory-down, I believe.
>
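
(For context, a minimal sketch of what Simon's UCLASS_RAM suggestion could
look like if MRC were wrapped in a driver -- the driver name and probe
logic are assumptions for illustration, not actual U-Boot code:)

#include <common.h>
#include <dm.h>
#include <ram.h>

DECLARE_GLOBAL_DATA_PTR;

static int quark_ram_probe(struct udevice *dev)
{
	/* MRC would run here, programming the DDR controller */
	return 0;
}

static int quark_ram_get_info(struct udevice *dev, struct ram_info *info)
{
	/* report what MRC set up (illustrative) */
	info->base = 0;
	info->size = gd->ram_size;

	return 0;
}

static struct ram_ops quark_ram_ops = {
	.get_info = quark_ram_get_info,
};

U_BOOT_DRIVER(quark_ram) = {
	.name	= "quark_ram",
	.id	= UCLASS_RAM,
	.ops	= &quark_ram_ops,
	.probe	= quark_ram_probe,
};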

Some updates today when trying to support the PCIe root ports in v2:

I moved qrk_pci_write_config_dword() and qrk_pci_read_config_dword()
from msg_port.c to quark.c and updated all the code in quark.c to call
these two routines, to avoid the chicken-and-egg problem. With this
change, I noticed that the MRC execution time went from 3 seconds to
4 seconds, so 1 additional second is needed. I disassembled U-Boot and
found that in v1, since qrk_pci_write_config_dword() and
qrk_pci_read_config_dword() were declared static in msg_port.c, the
compiler inlined them into the message port APIs. With the v2 changes,
they are no longer inlined but are normal function calls, which is
what costs the additional 1 second.

So even inlining makes a measurable improvement for MRC, not to mention
what we could gain by avoiding the multiple call chains in the driver
model PCI APIs.
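
(A minimal sketch of one way to keep the fast path inlined once it is
shared between msg_port.c and quark.c: define the helpers as static inline
in a header. The header location is hypothetical; this is not what v2
actually does:)

/* in some shared header, e.g. asm/arch/quark.h (hypothetical location) */
static inline void qrk_pci_write_config_dword(pci_dev_t dev, int offset,
					      u32 value)
{
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	outl(value, PCI_REG_DATA);
}

static inline void qrk_pci_read_config_dword(pci_dev_t dev, int offset,
					     u32 *valuep)
{
	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
	*valuep = inl(PCI_REG_DATA);
}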

Regards,
Bin
Simon Glass Sept. 2, 2015, 2:48 a.m. UTC | #12
Hi Bin,

On 1 September 2015 at 04:29, Bin Meng <bmeng.cn@gmail.com> wrote:
> Hi Simon,
>
> Some updates today when trying to support the PCIe root ports in v2:
>
> I moved qrk_pci_write_config_dword() and qrk_pci_read_config_dword()
> from msg_port.c to quark.c and updated all the code in quark.c to call
> these two routines, to avoid the chicken-and-egg problem. With this
> change, I noticed that the MRC execution time went from 3 seconds to
> 4 seconds, so 1 additional second is needed. I disassembled U-Boot and
> found that in v1, since qrk_pci_write_config_dword() and
> qrk_pci_read_config_dword() were declared static in msg_port.c, the
> compiler inlined them into the message port APIs. With the v2 changes,
> they are no longer inlined but are normal function calls, which is
> what costs the additional 1 second.
>
> So even inlining makes a measurable improvement for MRC, not to mention
> what we could gain by avoiding the multiple call chains in the driver
> model PCI APIs.

OK, so at this point I think we should give up and go with your
original code. I hope that in future we can figure out a way to make
this more efficient, but for now let's run with it.

Regards,
Simon

Patch

diff --git a/arch/x86/cpu/quark/msg_port.c b/arch/x86/cpu/quark/msg_port.c
index 31713e3..a75ef23 100644
--- a/arch/x86/cpu/quark/msg_port.c
+++ b/arch/x86/cpu/quark/msg_port.c
@@ -5,34 +5,49 @@ 
  */
 
 #include <common.h>
-#include <pci.h>
 #include <asm/arch/device.h>
 #include <asm/arch/msg_port.h>
+#include <asm/io.h>
+#include <asm/pci.h>
+
+/* Optimized pci config write dword routine */
+static void qrk_pci_write_config_dword(pci_dev_t dev, int offset, u32 value)
+{
+	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
+	outl(value, PCI_REG_DATA);
+}
+
+/* Optimized pci config read dword routine */
+static void qrk_pci_read_config_dword(pci_dev_t dev, int offset, u32 *valuep)
+{
+	outl(dev | offset | PCI_CFG_EN, PCI_REG_ADDR);
+	*valuep = inl(PCI_REG_DATA);
+}
 
 void msg_port_setup(int op, int port, int reg)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_REG,
-			       (((op) << 24) | ((port) << 16) |
-			       (((reg) << 8) & 0xff00) | MSG_BYTE_ENABLE));
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_REG,
+				   (((op) << 24) | ((port) << 16) |
+				   (((reg) << 8) & 0xff00) | MSG_BYTE_ENABLE));
 }
 
 u32 msg_port_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_WRITE, port, reg);
 }
 
@@ -40,19 +55,19 @@  u32 msg_port_alt_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_ALT_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_alt_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_ALT_WRITE, port, reg);
 }
 
@@ -60,18 +75,18 @@  u32 msg_port_io_read(u8 port, u32 reg)
 {
 	u32 value;
 
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_IO_READ, port, reg);
-	pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
+	qrk_pci_read_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, &value);
 
 	return value;
 }
 
 void msg_port_io_write(u8 port, u32 reg, u32 value)
 {
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
-	pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
-			       reg & 0xffffff00);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_DATA_REG, value);
+	qrk_pci_write_config_dword(QUARK_HOST_BRIDGE, MSG_CTRL_EXT_REG,
+				   reg & 0xffffff00);
 	msg_port_setup(MSG_OP_IO_WRITE, port, reg);
 }
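
(To close, a short sketch of how MRC-side code consumes this API. The port
and register names are illustrative placeholders, not taken from the actual
MRC sources:)

#include <asm/arch/msg_port.h>

#define EXAMPLE_MEM_CTLR_PORT	0x01	/* illustrative message port number */
#define EXAMPLE_DTR_REG		0x01	/* illustrative register offset */

/* Read-modify-write one DDR controller register via the message port */
static void example_set_dtr_bits(u32 set_bits)
{
	u32 val;

	val = msg_port_read(EXAMPLE_MEM_CTLR_PORT, EXAMPLE_DTR_REG);
	val |= set_bits;
	msg_port_write(EXAMPLE_MEM_CTLR_PORT, EXAMPLE_DTR_REG, val);
}

Each msg_port_read() or msg_port_write() above expands to three of the
optimized dword config accesses (extended address, control and data
registers), so the per-access saving is repaid thousands of times over a
full MRC run.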