[0/2] x86, ia64: Move EFI_FB vga_default_device() initialization to pci_vga_fixup()

Message ID 20140821233435.19a9cffa@neptune.home
State Not Applicable
Commit Message

Bruno Prémont Aug. 21, 2014, 9:34 p.m. UTC
On Thu, 21 August 2014 Andreas Noever <andreas.noever@gmail.com> wrote:
> dmesg with your patches and vga_set_default_device commented out
> (after "vgaarb: Boot video device...") as otherwise the system won't
> boot.

Do you know more precisely where your system hangs when it does not boot?
That's the part I can't find in this thread.
Is it deadlocking/freezing, or just booting without displaying anything
(with the network coming up if connected and the keyboard still working,
e.g. the caps lock key)?

Try blacklisting both i915 and nouveau modules (and each one individually)
and see how far your system gets. Also make sure your network comes up
automatically, so that even if display remains black you can check via
network if your system is alive and what it complains about.

> dmesg | grep vgaarb
> [    1.340118] vgaarb: PCI:0000:00:02.0 PCI_COMMAND=0007
> [    1.340119] vgaarb: Boot video device: PCI:0000:00:02.0
> [    1.340120] vgaarb: device added:
> PCI:0000:00:02.0,decodes=io+mem,owns=io+mem,locks=none
> [    1.340130] vgaarb: PCI:0000:01:00.0 PCI_COMMAND=0006
> [    1.340132] vgaarb: PCI:0000:01:00.0, bridge PCI:0000:00:01.0
> PCI_BRIDGE_CONTROL=0000
> [    1.340133] vgaarb: device added:
> PCI:0000:01:00.0,decodes=io+mem,owns=none,locks=none
> [    1.340135] vgaarb: loaded
> [    1.340136] vgaarb: bridge control possible 0000:01:00.0
> [    1.340136] vgaarb: no bridge control possible 0000:00:02.0
> [    3.798430] vgaarb: device changed decodes:
> PCI:0000:00:02.0,olddecodes=io+mem,decodes=none:owns=io+mem
> 
> 
> If the line is not commented out then vgaarb simply declares the first
> (enabled) device to be the default one, which is incorrect. And the
> overwrite logic in pci_fixup_video is not triggered, since a default
> device has already been set.

The initial selection I am doing does match the PCI_COMMAND flags
as set for the devices (or masked by parent bridge), but probably none
of them has active legacy VGA I/O ports.
So the question would rather be how to determine which I/O ports are active
for the Intel graphics and adjust vgaarb's "decodes"/"owns" interpretation
on that basis (there is no I/O active for the nvidia one).
I'm thinking about selecting only a device that decodes the legacy VGA I/O
range, and not those that merely decode some other I/O range.
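
To make the PCI_COMMAND values from the dmesg above concrete, here is a
small userspace sketch (not kernel code) that decodes them. The bit values
are the ones the kernel's pci_regs.h defines for PCI_COMMAND_IO,
PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER; the helper names are made up
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in the kernel's pci_regs.h. */
#define PCI_COMMAND_IO		0x1	/* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY	0x2	/* respond to memory space accesses */
#define PCI_COMMAND_MASTER	0x4	/* bus mastering enabled */

/* Illustrative helpers (not kernel API). */
static bool cmd_decodes_io(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_IO) != 0;
}

static bool cmd_decodes_mem(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_MEMORY) != 0;
}

/* PCI_COMMAND values taken from the dmesg quoted above. */
static void check_dmesg_values(void)
{
	assert(cmd_decodes_io(0x0007));		/* 00:02.0 (Intel): I/O on */
	assert(cmd_decodes_mem(0x0007));
	assert(!cmd_decodes_io(0x0006));	/* 01:00.0 (nvidia): I/O off */
	assert(cmd_decodes_mem(0x0006));
}
```

Note that the I/O bit only says that some I/O decoding is enabled. It does
not say whether the legacy VGA range is among the decoded ports, which is
exactly the information that is missing here.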

The short-term fix probably is to just unconditionally perform the
screen_info check in pci_fixup_video() while leaving vgaarb's initial
card selection alone for legacy hardware. This replicates efifb's
original behavior (and also brings back the incorrect ROM_SHADOW flagging).
Corresponding patch below (on top of both patches in this series, but
should apply without them as well). As mentioned in the patch this
papers over the real issue.
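
For readers not familiar with the screen_info check: the idea is to see
whether the firmware framebuffer lies inside one of a device's memory
BARs. Below is a simplified userspace model of that idea, not the code
from fixup.c (which may differ in detail). struct mem_range,
ranges_contain_fb() and all addresses are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one memory BAR of a PCI device. */
struct mem_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Hypothetical helper: true if the framebuffer
 * [fb_base, fb_base + fb_size) fits inside one of the ranges.
 */
static bool ranges_contain_fb(const struct mem_range *r, size_t n,
			      uint64_t fb_base, uint64_t fb_size)
{
	size_t i;

	/* Reject an empty framebuffer and address wrap-around. */
	if (fb_size == 0 || fb_size - 1 > UINT64_MAX - fb_base)
		return false;

	for (i = 0; i < n; i++) {
		if (r[i].start == 0 || r[i].end < r[i].start)
			continue;	/* unassigned or bogus range */
		if (fb_base >= r[i].start &&
		    fb_base + fb_size - 1 <= r[i].end)
			return true;
	}
	return false;
}

/* All addresses below are made up for illustration. */
static void fb_owner_example(void)
{
	const struct mem_range igpu[] = { { 0xb0000000, 0xbfffffff } };
	const struct mem_range dgpu[] = { { 0xc0000000, 0xcfffffff } };
	uint64_t fb_base = 0xc0010000, fb_size = 0x00800000;

	assert(!ranges_contain_fb(igpu, 1, fb_base, fb_size));
	assert(ranges_contain_fb(dgpu, 1, fb_base, fb_size));
}
```

In this model the device whose BAR contains the framebuffer is the one
the firmware used for output, regardless of which device was enumerated
first.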


A second step would then be to tune vgaarb's initial selection.
Bjorn, is it possible to verify which I/O ports are decoded by a PCI
device at the time of adding it to vgaarb? If so, how? I would like to
check for legacy VGA I/O range (0x03B0-0x03DF) and only let vgaarb set
a device as default if that I/O range is decoded by the device.

Bruno



> On Wed, Aug 20, 2014 at 9:11 AM, Bruno Prémont wrote:
> > On Wed, 20 Aug 2014 07:55:08 +0200 Bruno Prémont wrote:
> >> On Tue, 19 Aug 2014 17:45:00 +0200 Andreas Noever wrote:
> >> > On Sat, Aug 16, 2014 at 7:21 PM, Bruno Prémont wrote:
> >> > > This series improves on commit 20cde694027e (x86, ia64: Move EFI_FB
> >> > > vga_default_device() initialization to pci_vga_fixup()):
> >> > > - cleanup remaining but always-true #ifndefs
> >> > > - fix boot regression on dual-GPU Macs
> >> > >
> >> > > Andreas, can you please test this series? It is a modification from
> >> > > previous testing patch that should still work fine for you.
> >> > > That testing patch would have been failing X startup on old BIOS systems
> >> > > booted with vga=normal (or otherwise in VGA text mode).
> >> > >
> >> > >
> >> > > Greg, in case you have scheduled above-mentioned commit for your next
> >> > > stable iteration, please hold it back in the queue until this follow-up
> >> > > has landed and can be included within the same stable update as alone
> >> > > that patch regresses for Macs with dual-GPU and using efifb.
> >> > >
> >> > > Bruno
> >> >
> >> > Fails again (with and without efifb).
> >> >
> >> > The vga_set_default_device in vga_arbiter_add_pci_device is at fault.
> >> > It sets the boot video device to intel. Removing it makes the system
> >> > bootable again.
> >>
> >> Could you provide your whole kernel log? I would like to understand
> >> how your vga devices are setup and why it starts the wrong way.
> >>
> >> If you can grab kernel log from both working and failing setups it
> >> would be even better. The failing one is interesting for where exactly it
> >> starts failing at boot.
> >
> > While collecting debug logs, please apply following patch to get
> > PCI command and bridge control registers as configured when vgaarb looks
> > at them.

From: Bruno Prémont <bonbons@linux-vserver.org>
Subject: [PATCH] x86: Force selection of vga_default_device on screen_info

Apple dual-GPU systems get the wrong GPU chosen by vgaarb because the
built-in Intel GPU has I/O ports active and no bridge in front that
would block legacy VGA I/O ports (though no legacy VGA is set up).

The wrong initial selection prevents system from booting properly.

The proper solution would be to improve vgaarb's initial device
selection. Until that has been done return to behavior that efifb
implemented before the move to pci_fixup_video.

The drawback of this old operation mode is that a wrong device
gets the IORESOURCE_ROM_SHADOW flag set.

Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org>
CC: Matthew Garrett <matthew.garrett@nebula.com>
CC: stable@vger.kernel.org # v3.5+
---
 arch/x86/pci/fixup.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas Aug. 22, 2014, 4:39 a.m. UTC | #1
On Thu, Aug 21, 2014 at 4:34 PM, Bruno Prémont
<bonbons@linux-vserver.org> wrote:

> A second step would then be to tune vgaarb's initial selection.
> Bjorn, is it possible to verify which I/O ports are decoded by a PCI
> device at the time of adding it to vgaarb? If so, how? I would like to
> check for legacy VGA I/O range (0x03B0-0x03DF) and only let vgaarb set
> a device as default if that I/O range is decoded by the device.

I don't know of a way.  I'm pretty sure VGA devices are allowed to
respond to those legacy addresses even if there's no BAR for them, but
I haven't found a spec reference for this.  There is the VGA Enable
bit in bridges, of course (PCI Bridge spec, sec 12.1.1).  If the VGA
device is behind a bridge that doesn't have the VGA Enable bit set, it
probably isn't the default device.
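
As a rough illustration of the bridge check (a userspace sketch, not
kernel code): the VGA Enable bit is bit 3 of the Bridge Control register,
which the kernel's pci_regs.h names PCI_BRIDGE_CTL_VGA. The helper names
are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BRIDGE_CTL_VGA	0x08	/* VGA Enable, as in pci_regs.h */

/* Illustrative helper (not kernel API). */
static bool bridge_forwards_vga(uint16_t bridge_control)
{
	return (bridge_control & PCI_BRIDGE_CTL_VGA) != 0;
}

/* PCI_BRIDGE_CONTROL=0000 is the value from the dmesg in this thread. */
static void check_bridge_value(void)
{
	assert(!bridge_forwards_vga(0x0000));
	assert(bridge_forwards_vga(0x0008));
}
```

With PCI_BRIDGE_CONTROL=0000 the bridge in front of 01:00.0 does not
forward legacy VGA cycles. A device that sits directly on the root bus
has no such bridge to consult.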

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bruno Prémont Aug. 22, 2014, 6:23 a.m. UTC | #2
On Thu, 21 Aug 2014 23:39:31 -0500 Bjorn Helgaas wrote:
> On Thu, Aug 21, 2014 at 4:34 PM, Bruno Prémont wrote:
> 
> > A second step would then be to tune vgaarb's initial selection.
> > Bjorn, is it possible to verify which I/O ports are decoded by a PCI
> > device at the time of adding it to vgaarb? If so, how? I would like to
> > check for legacy VGA I/O range (0x03B0-0x03DF) and only let vgaarb set
> > a device as default if that I/O range is decoded by the device.
> 
> I don't know of a way.  I'm pretty sure VGA devices are allowed to
> respond to those legacy addresses even if there's no BAR for them, but
> I haven't found a spec reference for this.  There is the VGA Enable
> bit in bridges, of course (PCI Bridge spec, sec 12.1.1).  If the VGA
> device is behind a bridge that doesn't have the VGA Enable bit set, it
> probably isn't the default device.

Those VGA devices behind bridges are the easy ones that vgaarb selects
properly.
It's the ones not behind a bridge (integrated graphics) like the intel
one that cause problems.

For Andreas's system the discrete nvidia GPU has no I/O enabled
according to PCI_COMMAND flags while the integrated intel one does have
them (that's why the Intel GPU is chosen).

Unfortunately I don't know what makes his system choke at boot time as
he did not provide logs for the failing case.


If there is no better way to detect the proper legacy VGA device the
only remaining option would be to perform the screen_info testing in 
vga_arb_device_init() enclosed in arch #ifdef...

Bruno
--
Bruno Prémont Aug. 23, 2014, 11:06 a.m. UTC | #3
On Fri, 22 August 2014 Andreas Noever <andreas.noever@gmail.com> wrote:
> > For Andreas's system the discrete nvidia GPU has no I/O enabled
> > according to PCI_COMMAND flags while the integrated intel one does have
> > them (that's why the Intel GPU is chosen).
> >
> > Unfortunately I don't know what makes his system choke at boot time as
> > he did not provide logs for the failing case.
> Attached dmesg for the failing case (obtained via ssh).
>
> Without blacklisting a small horizontal bar of vertical green bars
> appears (no x, no console).

It's good to know that it's just the graphics (console / X) that are not
displaying properly.

> If nouveau is blacklisted then I get a console, but X will not start
> (No devices found).

The console you get is EFIFB (on the nvidia GPU to which display is routed).

Here the reason why X does not start is probably that i915 did not find
its VBIOS tables nor any connected monitor and thus X thinks "no active
output => I don't start".
Though your X would be able to start if it did not find xf86-video-intel
(intel_drv.so) and/or did find/had an explicit reference to xf86-video-fbdev
(fbdev_drv.so).

If under OSX you told your system to start on the Intel GPU (I think there
is an option in this direction), your system would probably boot fine as the
initial choice by vgaarb would match gmux/switcheroo settings.

> If i915 is blacklisted then I do not get a console. The screen just
> freezes after a few boot messages.

This is more interesting.

Initially you had efifb printing kernel logs until nouveau gets loaded
by udev and replaces efifb. From there on possibly applegmux does not
take over correctly (it may need both i915 and nouveau active to properly
route framebuffer to panel or connector).

Though your X should be reporting the same thing as with nouveau blacklisted,
as the nvidia GPU is not the one having boot_vga set...

If not it may be worth finding out in what state your system exactly is
with regards to graphics.

> What is vga_default_device() used for? Is it supposed to hold the
> device that is controlling the (boot) screen? Why can't we just read
> the configuration from vga_switcheroo/gmux?

For systems not using vga_switcheroo:
  vga_default_device represents the PCI GPU that was used to boot (and
  normally handles legacy VGA I/O).
  It's never changed after boot (except possibly when a GPU gets
  hotplugged)

For systems with vga_switcheroo:
  vga_default_device represents the active GPU (the one that would be
  handling legacy VGA I/O if used - and the one controlling the output
  connectors)
  vga_switcheroo is actively changing vga_default_device.


gmux is a driver for vga_switcheroo to perform the low-level platform
operations allowing switching (outputs) from one GPU to the other.


So a guess on my side would be that with both i915 and nouveau loaded
you may be able to get your display working if you can tell X to
switch GPU twice (and thus end up with matching vga_default_device
and device selected by gmux) - though I don't know how one asks for this
switch to happen.

> > If there is no better way to detect the proper legacy VGA device the
> > only remaining option would be to perform the screen_info testing in
> > vga_arb_device_init() enclosed in arch #ifdef...

I will propose a patch in this direction later this weekend.

Bruno
--
Daniel Vetter Aug. 25, 2014, 12:16 p.m. UTC | #4
On Fri, Aug 22, 2014 at 08:23:24AM +0200, Bruno Prémont wrote:
> On Thu, 21 Aug 2014 23:39:31 -0500 Bjorn Helgaas wrote:
> > On Thu, Aug 21, 2014 at 4:34 PM, Bruno Prémont wrote:
> > 
> > > A second step would then be to tune vgaarb's initial selection.
> > > Bjorn, is it possible to verify which I/O ports are decoded by a PCI
> > > device at the time of adding it to vgaarb? If so, how? I would like to
> > > check for legacy VGA I/O range (0x03B0-0x03DF) and only let vgaarb set
> > > a device as default if that I/O range is decoded by the device.
> > 
> > I don't know of a way.  I'm pretty sure VGA devices are allowed to
> > respond to those legacy addresses even if there's no BAR for them, but
> > I haven't found a spec reference for this.  There is the VGA Enable
> > bit in bridges, of course (PCI Bridge spec, sec 12.1.1).  If the VGA
> > device is behind a bridge that doesn't have the VGA Enable bit set, it
> > probably isn't the default device.
> 
> Those VGA devices behind bridges are the easy ones that vgaarb selects
> properly.
> It's the ones not behind a bridge (integrated graphics) like the intel
> one that cause problems.
> 
> For Andreas's system the discrete nvidia GPU has no I/O enabled
> according to PCI_COMMAND flags while the integrated intel one does have
> them (that's why the Intel GPU is chosen).
> 
> Unfortunately I don't know what makes his system choke at boot time as
> he did not provide logs for the failing case.

Very often when something goes wrong with a kms driver we hang while doing
the initial modeset. Which is all done while holding the console_lock
(because fbdev+vt locking is just insane). You can try to get a closer
look with I915_FBDEV=n which will avoid the console_lock, but which also
won't register the legacy/compat i915 fbdev emulation any more, so greatly
changes boot behaviour.

If that doesn't lead to clues the next approach is to "carefully"
drop&reacquire console_lock at a few "interesting" places to get a few
printks out over netconsole or similar. Or just hack up entire netconsole
logging infrastructure which bypasses printk and so all the console_lock
insanity.

It's not pretty, I know :(

Cheers, Daniel
Bruno Prémont Aug. 25, 2014, 12:39 p.m. UTC | #5
Hi Daniel,

On Mon, 25 Aug 2014 14:16:02 +0200 Daniel Vetter wrote:
> Very often when something goes wrong with a kms driver we hang while doing
> the initial modeset. Which is all done while holding the console_lock
> (because fbdev+vt locking is just insane). You can try to get a closer
> look with I915_FBDEV=n which will avoid the console_lock, but which also
> won't register the legacy/compat i915 fbdev emulation any more, so greatly
> changes boot behaviour.
> 
> If that doesn't lead to clues the next approach is to "carefully"
> drop&reacquire console_lock at a few "interesting" places to get a few
> printks out over netconsole or similar. Or just hack up entire netconsole
> logging infrastructure which bypasses printk and so all the console_lock
> insanity.

In this case it's not that bad as Andreas could send the logs for all
cases (captured via ssh).

So probably console lock is not held (unless he did have to do
terminal-free ssh which I doubt).
It looks much more as if it's just the output routing that gets weird
on his Mac (or possibly any other dual-GPU MacBook where discrete GPU is
primary). Black screen but alive system :)

See follow-up posts in this thread.

If you have some uncommon or otherwise weird (EFI) multi-GPU systems
around and want to give my patches sent yesterday evening a try, you're
welcome! Some with non-Apple GPU multiplexer would be nice to have
tested as well.


The following part mentioned earlier by Andreas might be of interest to
you though (and my latest patch series should bring the improvement):
> > vga_arbiter_add_pci_device chooses intel simply because it is the
> > first device. Next pci_fixup_video(intel) sees that it is the default
> > device, sets the IORESOURCE_ROM_SHADOW flag and calls
> > vga_set_default_device again. And finally (if the check is removed)
> > pci_fixup_video(nvidia) sees that it owns the framebuffer and sets
> > itself as the default device which allows the system to boot again.
> >
> > Does setting the ROM_SHADOW flag on (possibly) the wrong device have
> > any effect?  
> Yes it does. Removing the line changes a long standing
> i915 0000:00:02.0: Invalid ROM contents
> into a
> i915 0000:00:02.0: BAR 6: can't assign [??? 0x00000000 flags 0x20000000] (bogus alignment).
> 
> The first is logged at KERN_ERR and the second one only at KERN_INFO.
> We are making progress.
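
The ordering problem Andreas describes can be modelled in a few lines of
userspace C. This is a deliberately simplified sketch: struct gpu,
pick_default() and the force_fb_check flag are made up, the ROM_SHADOW
step is left out, and force_fb_check stands in for the `|| 1` in the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gpu {
	const char *name;
	bool owns_fb;	/* firmware framebuffer lives in this GPU's BAR */
};

/* Hypothetical model of the two selection passes. */
static const struct gpu *pick_default(const struct gpu *g, size_t n,
				      bool force_fb_check)
{
	const struct gpu *def = NULL;
	size_t i;

	/* vgaarb-like pass: the first device becomes the default. */
	if (n > 0)
		def = &g[0];

	/* fixup-like pass: overrides only if no default is set, or if forced. */
	for (i = 0; i < n; i++) {
		if ((!def || force_fb_check) && g[i].owns_fb)
			def = &g[i];
	}
	return def;
}

static void selection_example(void)
{
	const struct gpu gpus[] = {
		{ "intel", false },
		{ "nvidia", true },
	};

	/* Without forcing, the first device (intel) stays the default. */
	assert(pick_default(gpus, 2, false) == &gpus[0]);
	/* With the check forced, the framebuffer owner (nvidia) wins. */
	assert(pick_default(gpus, 2, true) == &gpus[1]);
}
```

In the model, as on the real system, the override never runs once the
first pass has picked a default, unless the check is forced.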

Bruno
--

Patch

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index 5b392d2..fc509d5 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -326,7 +326,10 @@  static void pci_fixup_video(struct pci_dev *pdev)
 	struct pci_bus *bus;
 	u16 config;
 
-	if (!vga_default_device()) {
+	if (!vga_default_device() || 1) {
+		/* The `|| 1` condition papers over vgaarb initial GPU selection limitation
+		 * on Apple dual-GPU systems using EFI.
+		 */
 		resource_size_t start, end;
 		int i;