[v1] gpu: host1x: Ignore clients initialization failure
diff mbox series

Message ID 20180801150807.15926-1-digetx@gmail.com
State New
Headers show
Series
  • [v1] gpu: host1x: Ignore clients initialization failure
Related show

Commit Message

Dmitry Osipenko Aug. 1, 2018, 3:08 p.m. UTC
From time to time new bugs are popping up, causing some host1x client to
fail its initialization. Currently a single clients initialization failure
causes whole host1x device registration to fail, as a result a single DRM
sub-device initialization failure makes whole DRM initialization to fail.
Let's ignore clients initialization failure, as a result display panel
lights up even if some DRM clients (say GR2D or VIC) fail to initialize.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/gpu/host1x/bus.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

Comments

Thierry Reding Aug. 3, 2018, 10:50 a.m. UTC | #1
On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> From time to time new bugs are popping up, causing some host1x client to
> fail its initialization. Currently a single clients initialization failure
> causes whole host1x device registration to fail, as a result a single DRM
> sub-device initialization failure makes whole DRM initialization to fail.
> Let's ignore clients initialization failure, as a result display panel
> lights up even if some DRM clients (say GR2D or VIC) fail to initialize.
> 
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>  drivers/gpu/host1x/bus.c | 18 +++++++-----------
>  1 file changed, 7 insertions(+), 11 deletions(-)

This is actually done on purpose. I can't think of a case where we would
actively like to keep a half-broken DRM device operational. The errors
that you're talking about should only happen during development, and the
device not showing up is a pretty good indicator that something is wrong
as opposed to everything booting normally and then getting some cryptic
error at runtime because one of the clients didn't initialize.

From my perspective, all clients should always be operational in
whatever baseline version you use. If it isn't that's a bug that should
be fixed. Ideally those bugs should get fixed before making it into a
baseline version (mainline, linux-next, ...), so that this never impacts
even developers, unless they break it themselves, in which case refusing
to register the DRM device is a pretty good incentive to fix it.

Thierry
Dmitry Osipenko Aug. 3, 2018, 11:30 a.m. UTC | #2
On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > From time to time new bugs are popping up, causing some host1x client to
> > fail its initialization. Currently a single clients initialization failure
> > causes whole host1x device registration to fail, as a result a single DRM
> > sub-device initialization failure makes whole DRM initialization to fail.
> > Let's ignore clients initialization failure, as a result display panel
> > lights up even if some DRM clients (say GR2D or VIC) fail to initialize.
> > 
> > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > ---
> > 
> >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> >  1 file changed, 7 insertions(+), 11 deletions(-)
> 
> This is actually done on purpose. I can't think of a case where we would
> actively like to keep a half-broken DRM device operational. The errors
> that you're talking about should only happen during development,

[only in a perfect world]

> and the
> device not showing up is a pretty good indicator that something is wrong
> as opposed to everything booting normally and then getting some cryptic
> error at runtime because one of the clients didn't initialize.

If machine doesn't have a serial port and it doesn't have ssh server running 
or network doesn't come up, you'll end up with a completely dysfunctional 
piece of hardware. Hence there is no chance for you to even check what is 
wrong. The argument about a cryptic error doesn't make much sense, you have to 
improve your runtime as well (and you'll get a error message in the kernels 
log).

> From my perspective, all clients should always be operational in
> whatever baseline version you use. If it isn't that's a bug that should
> be fixed. Ideally those bugs should get fixed before making it into a
> baseline version (mainline, linux-next, ...), so that this never impacts
> even developers, unless they break it themselves, in which case refusing
> to register the DRM device is a pretty good incentive to fix it.

Sounds like you're assuming that only kernel developers are supposed to use 
Tegra, though at least (for now) for upstream it is kinda true. Of course that 
could be changed ;-)



--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Aug. 9, 2018, 10:45 a.m. UTC | #3
On Fri, Aug 03, 2018 at 02:30:47PM +0300, Dmitry Osipenko wrote:
> On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> > On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > > From time to time new bugs are popping up, causing some host1x client to
> > > fail its initialization. Currently a single clients initialization failure
> > > causes whole host1x device registration to fail, as a result a single DRM
> > > sub-device initialization failure makes whole DRM initialization to fail.
> > > Let's ignore clients initialization failure, as a result display panel
> > > lights up even if some DRM clients (say GR2D or VIC) fail to initialize.
> > > 
> > > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > > ---
> > > 
> > >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> > >  1 file changed, 7 insertions(+), 11 deletions(-)
> > 
> > This is actually done on purpose. I can't think of a case where we would
> > actively like to keep a half-broken DRM device operational. The errors
> > that you're talking about should only happen during development,
> 
> [only in a perfect world]

gr2d and VIC are fairly deterministic. What are the errors you are
seeing that cause initialization failure? Note that it is possible for
these devices to malfunction even if initialization succeeds. A failure
to initialize can only happen if there's something wrong in the device
tree, firmware is missing (in case of VIC) or a regression was
introduced in some subsystem that causes a failure (maybe deferred probe
or something similar).

> > and the
> > device not showing up is a pretty good indicator that something is wrong
> > as opposed to everything booting normally and then getting some cryptic
> > error at runtime because one of the clients didn't initialize.
> 
> If machine doesn't have a serial port and it doesn't have ssh server running 
> or network doesn't come up, you'll end up with a completely dysfunctional 
> piece of hardware. Hence there is no chance for you to even check what is 
> wrong. The argument about a cryptic error doesn't make much sense, you have to 
> improve your runtime as well (and you'll get a error message in the kernels 
> log).

You make a good point here. I'm not aware of any devices we support in
the kernel that don't have a serial console, but that doesn't mean the
kernel could be used on an "unsupported" device  that doesn't have one.

> > From my perspective, all clients should always be operational in
> > whatever baseline version you use. If it isn't that's a bug that should
> > be fixed. Ideally those bugs should get fixed before making it into a
> > baseline version (mainline, linux-next, ...), so that this never impacts
> > even developers, unless they break it themselves, in which case refusing
> > to register the DRM device is a pretty good incentive to fix it.
> 
> Sounds like you're assuming that only kernel developers are supposed to use 
> Tegra, though at least (for now) for upstream it is kinda true. Of course that 
> could be changed ;-)

That's not at all what I'm saying. What I am saying is that we should
make it difficult for developers to break the kernel in a way that would
put users into a position that is difficult to get themselves out of. If
we refuse to register the complete DRM device in case some subdevice
fails to initialize we increase the chances of developers noticing and
fixing it.

If all we do on failure is output an error message, it's very likely to
go unnoticed. Developers are likely to check specifically the
functionality that they're working on and ignore failures that they may
have caused in other parts of the code as a side-effect of their current
work.

Perhaps a good middle ground would be to turn initialization failures
into WARN_ON() to increase the chances of them getting notified and then
continue on a best effort basis in the hopes of having enough
initialized to get a message to users. Another alternative would be to
mark essential subdevices (such as the display controller) so that only
they will cause failure to register the whole DRM device.

Thierry
Dmitry Osipenko Aug. 9, 2018, 12:53 p.m. UTC | #4
On Thursday, 9 August 2018 13:45:41 MSK Thierry Reding wrote:
> On Fri, Aug 03, 2018 at 02:30:47PM +0300, Dmitry Osipenko wrote:
> > On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> > > On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > > > From time to time new bugs are popping up, causing some host1x client
> > > > to
> > > > fail its initialization. Currently a single clients initialization
> > > > failure
> > > > causes whole host1x device registration to fail, as a result a single
> > > > DRM
> > > > sub-device initialization failure makes whole DRM initialization to
> > > > fail.
> > > > Let's ignore clients initialization failure, as a result display panel
> > > > lights up even if some DRM clients (say GR2D or VIC) fail to
> > > > initialize.
> > > > 
> > > > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > > > ---
> > > > 
> > > >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> > > >  1 file changed, 7 insertions(+), 11 deletions(-)
> > > 
> > > This is actually done on purpose. I can't think of a case where we would
> > > actively like to keep a half-broken DRM device operational. The errors
> > > that you're talking about should only happen during development,
> > 
> > [only in a perfect world]
> 
> gr2d and VIC are fairly deterministic. What are the errors you are
> seeing that cause initialization failure?

Right now there are no specific errors. There is only a known trouble with the 
ARM_DMA_USE_IOMMU, but that causes to fail all of the clients.

> Note that it is possible for
> these devices to malfunction even if initialization succeeds. A failure
> to initialize can only happen if there's something wrong in the device
> tree, firmware is missing (in case of VIC) or a regression was
> introduced in some subsystem that causes a failure (maybe deferred probe
> or something similar).

A missing firmware is an actual good example. Though can't VIC driver be 
changed to load firmware at the time of a its first use by userspace? It 
should be very annoying that kernel driver forces you to churn with initramfs.

> > > and the
> > > device not showing up is a pretty good indicator that something is wrong
> > > as opposed to everything booting normally and then getting some cryptic
> > > error at runtime because one of the clients didn't initialize.
> > 
> > If machine doesn't have a serial port and it doesn't have ssh server
> > running or network doesn't come up, you'll end up with a completely
> > dysfunctional piece of hardware. Hence there is no chance for you to even
> > check what is wrong. The argument about a cryptic error doesn't make much
> > sense, you have to improve your runtime as well (and you'll get a error
> > message in the kernels log).
> 
> You make a good point here. I'm not aware of any devices we support in
> the kernel that don't have a serial console, but that doesn't mean the
> kernel could be used on an "unsupported" device  that doesn't have one.

AFAIK, enabling serial on AC100 require soldering iron.

> > > From my perspective, all clients should always be operational in
> > > whatever baseline version you use. If it isn't that's a bug that should
> > > be fixed. Ideally those bugs should get fixed before making it into a
> > > baseline version (mainline, linux-next, ...), so that this never impacts
> > > even developers, unless they break it themselves, in which case refusing
> > > to register the DRM device is a pretty good incentive to fix it.
> > 
> > Sounds like you're assuming that only kernel developers are supposed to
> > use
> > Tegra, though at least (for now) for upstream it is kinda true. Of course
> > that could be changed ;-)
> 
> That's not at all what I'm saying. What I am saying is that we should
> make it difficult for developers to break the kernel in a way that would
> put users into a position that is difficult to get themselves out of. If
> we refuse to register the complete DRM device in case some subdevice
> fails to initialize we increase the chances of developers noticing and
> fixing it.
> 
> If all we do on failure is output an error message, it's very likely to
> go unnoticed. Developers are likely to check specifically the
> functionality that they're working on and ignore failures that they may
> have caused in other parts of the code as a side-effect of their current
> work.

I can try to help with improving of your automated testing suite, if you'll 
make it accessible to the public.

> Perhaps a good middle ground would be to turn initialization failures
> into WARN_ON() to increase the chances of them getting notified and then
> continue on a best effort basis in the hopes of having enough
> initialized to get a message to users.

Good to me, I'll make v2.

> Another alternative would be to
> mark essential subdevices (such as the display controller) so that only
> they will cause failure to register the whole DRM device.

I don't think that making some subdevice more valuable than the others is a 
feasible approach, how you are going to differentiate what of the two display 
controllers is more important than the other.



--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Aug. 9, 2018, 1:10 p.m. UTC | #5
On Thu, Aug 09, 2018 at 03:53:03PM +0300, Dmitry Osipenko wrote:
> On Thursday, 9 August 2018 13:45:41 MSK Thierry Reding wrote:
> > On Fri, Aug 03, 2018 at 02:30:47PM +0300, Dmitry Osipenko wrote:
> > > On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> > > > On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > > > > From time to time new bugs are popping up, causing some host1x client
> > > > > to
> > > > > fail its initialization. Currently a single clients initialization
> > > > > failure
> > > > > causes whole host1x device registration to fail, as a result a single
> > > > > DRM
> > > > > sub-device initialization failure makes whole DRM initialization to
> > > > > fail.
> > > > > Let's ignore clients initialization failure, as a result display panel
> > > > > lights up even if some DRM clients (say GR2D or VIC) fail to
> > > > > initialize.
> > > > > 
> > > > > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > > > > ---
> > > > > 
> > > > >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> > > > >  1 file changed, 7 insertions(+), 11 deletions(-)
> > > > 
> > > > This is actually done on purpose. I can't think of a case where we would
> > > > actively like to keep a half-broken DRM device operational. The errors
> > > > that you're talking about should only happen during development,
> > > 
> > > [only in a perfect world]
> > 
> > gr2d and VIC are fairly deterministic. What are the errors you are
> > seeing that cause initialization failure?
> 
> Right now there are no specific errors. There is only a known trouble with the 
> ARM_DMA_USE_IOMMU, but that causes to fail all of the clients.
> 
> > Note that it is possible for
> > these devices to malfunction even if initialization succeeds. A failure
> > to initialize can only happen if there's something wrong in the device
> > tree, firmware is missing (in case of VIC) or a regression was
> > introduced in some subsystem that causes a failure (maybe deferred probe
> > or something similar).
> 
> A missing firmware is an actual good example. Though can't VIC driver be 
> changed to load firmware at the time of a its first use by userspace? It 
> should be very annoying that kernel driver forces you to churn with initramfs.

No, that's not really how firmware loading works. There's no technical
barrier to doing it, but it'd be strange. If a device needs firmware to
work, it'd be unusual to make it available before you know that it can
be used.

What if the firmware can't be loaded at the time of the first use? How
do you report that back to userspace? -ENOENT because the firmware file
can't be found? How is userspace to know that this is a problem with
firmware loading rather than some other error having to do with the VIC
command stream being broken, for example?

Modifying the initramfs is only necessary if you have the module built-
in or the module in the initramfs. If the module is in a root
filesystem, simply put the firmware there as well and you should be
good. This is all very standard Linux behaviour, nothing new or exotic
there.

> > > > and the
> > > > device not showing up is a pretty good indicator that something is wrong
> > > > as opposed to everything booting normally and then getting some cryptic
> > > > error at runtime because one of the clients didn't initialize.
> > > 
> > > If machine doesn't have a serial port and it doesn't have ssh server
> > > running or network doesn't come up, you'll end up with a completely
> > > dysfunctional piece of hardware. Hence there is no chance for you to even
> > > check what is wrong. The argument about a cryptic error doesn't make much
> > > sense, you have to improve your runtime as well (and you'll get a error
> > > message in the kernels log).
> > 
> > You make a good point here. I'm not aware of any devices we support in
> > the kernel that don't have a serial console, but that doesn't mean the
> > kernel could be used on an "unsupported" device  that doesn't have one.
> 
> AFAIK, enabling serial on AC100 require soldering iron.
> 
> > > > From my perspective, all clients should always be operational in
> > > > whatever baseline version you use. If it isn't that's a bug that should
> > > > be fixed. Ideally those bugs should get fixed before making it into a
> > > > baseline version (mainline, linux-next, ...), so that this never impacts
> > > > even developers, unless they break it themselves, in which case refusing
> > > > to register the DRM device is a pretty good incentive to fix it.
> > > 
> > > Sounds like you're assuming that only kernel developers are supposed to
> > > use
> > > Tegra, though at least (for now) for upstream it is kinda true. Of course
> > > that could be changed ;-)
> > 
> > That's not at all what I'm saying. What I am saying is that we should
> > make it difficult for developers to break the kernel in a way that would
> > put users into a position that is difficult to get themselves out of. If
> > we refuse to register the complete DRM device in case some subdevice
> > fails to initialize we increase the chances of developers noticing and
> > fixing it.
> > 
> > If all we do on failure is output an error message, it's very likely to
> > go unnoticed. Developers are likely to check specifically the
> > functionality that they're working on and ignore failures that they may
> > have caused in other parts of the code as a side-effect of their current
> > work.
> 
> I can try to help with improving of your automated testing suite, if you'll 
> make it accessible to the public.

This has nothing to do with automated test suites. The problem with test
suites is that you can't force everyone to run them, so it's likely that
people would still miss it.

The whole point of failing the whole thing is to force people to do the
right thing irrespective of any test suite or log or whatever. So if I
work on display and accidentally break VIC initialization, I'm now
forced to fix VIC initialization if I want to be able to test display.

> > Perhaps a good middle ground would be to turn initialization failures
> > into WARN_ON() to increase the chances of them getting notified and then
> > continue on a best effort basis in the hopes of having enough
> > initialized to get a message to users.
> 
> Good to me, I'll make v2.
> 
> > Another alternative would be to
> > mark essential subdevices (such as the display controller) so that only
> > they will cause failure to register the whole DRM device.
> 
> I don't think that making some subdevice more valuable than the others is a 
> feasible approach, how you are going to differentiate what of the two display 
> controllers is more important than the other.

I wasn't talking about one display controller vs. another, but rather
about "optional" subdevices. For example, I'd consider gr2d, gr3d and
VIC as optional in that they aren't needed to get something on the
screen. Anything display related would be considered required to make
sure people get at least a working display, even if all optional sub-
devices fail.

Thierry
Dmitry Osipenko Aug. 9, 2018, 1:46 p.m. UTC | #6
On Thursday, 9 August 2018 16:10:35 MSK Thierry Reding wrote:
> On Thu, Aug 09, 2018 at 03:53:03PM +0300, Dmitry Osipenko wrote:
> > On Thursday, 9 August 2018 13:45:41 MSK Thierry Reding wrote:
> > > On Fri, Aug 03, 2018 at 02:30:47PM +0300, Dmitry Osipenko wrote:
> > > > On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> > > > > On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > > > > > From time to time new bugs are popping up, causing some host1x
> > > > > > client
> > > > > > to
> > > > > > fail its initialization. Currently a single clients initialization
> > > > > > failure
> > > > > > causes whole host1x device registration to fail, as a result a
> > > > > > single
> > > > > > DRM
> > > > > > sub-device initialization failure makes whole DRM initialization
> > > > > > to
> > > > > > fail.
> > > > > > Let's ignore clients initialization failure, as a result display
> > > > > > panel
> > > > > > lights up even if some DRM clients (say GR2D or VIC) fail to
> > > > > > initialize.
> > > > > > 
> > > > > > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > > > > > ---
> > > > > > 
> > > > > >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> > > > > >  1 file changed, 7 insertions(+), 11 deletions(-)
> > > > > 
> > > > > This is actually done on purpose. I can't think of a case where we
> > > > > would
> > > > > actively like to keep a half-broken DRM device operational. The
> > > > > errors
> > > > > that you're talking about should only happen during development,
> > > > 
> > > > [only in a perfect world]
> > > 
> > > gr2d and VIC are fairly deterministic. What are the errors you are
> > > seeing that cause initialization failure?
> > 
> > Right now there are no specific errors. There is only a known trouble with
> > the ARM_DMA_USE_IOMMU, but that causes to fail all of the clients.
> > 
> > > Note that it is possible for
> > > these devices to malfunction even if initialization succeeds. A failure
> > > to initialize can only happen if there's something wrong in the device
> > > tree, firmware is missing (in case of VIC) or a regression was
> > > introduced in some subsystem that causes a failure (maybe deferred probe
> > > or something similar).
> > 
> > A missing firmware is an actual good example. Though can't VIC driver be
> > changed to load firmware at the time of a its first use by userspace? It
> > should be very annoying that kernel driver forces you to churn with
> > initramfs.
> No, that's not really how firmware loading works. There's no technical
> barrier to doing it, but it'd be strange. If a device needs firmware to
> work, it'd be unusual to make it available before you know that it can
> be used.
> 
> What if the firmware can't be loaded at the time of the first use? How
> do you report that back to userspace? -ENOENT because the firmware file
> can't be found? How is userspace to know that this is a problem with
> firmware loading rather than some other error having to do with the VIC
> command stream being broken, for example?

You'll have to check kernels log if there is some kernel-related failure in 
any case for the detailed information.

> Modifying the initramfs is only necessary if you have the module built-
> in or the module in the initramfs. If the module is in a root
> filesystem, simply put the firmware there as well and you should be
> good. This is all very standard Linux behaviour, nothing new or exotic
> there.

Having display driver built-in usually more preferable, it's not fun if kernel 
can't get to FS mount stage.

> > > > > and the
> > > > > device not showing up is a pretty good indicator that something is
> > > > > wrong
> > > > > as opposed to everything booting normally and then getting some
> > > > > cryptic
> > > > > error at runtime because one of the clients didn't initialize.
> > > > 
> > > > If machine doesn't have a serial port and it doesn't have ssh server
> > > > running or network doesn't come up, you'll end up with a completely
> > > > dysfunctional piece of hardware. Hence there is no chance for you to
> > > > even
> > > > check what is wrong. The argument about a cryptic error doesn't make
> > > > much
> > > > sense, you have to improve your runtime as well (and you'll get a
> > > > error
> > > > message in the kernels log).
> > > 
> > > You make a good point here. I'm not aware of any devices we support in
> > > the kernel that don't have a serial console, but that doesn't mean the
> > > kernel could be used on an "unsupported" device  that doesn't have one.
> > 
> > AFAIK, enabling serial on AC100 require soldering iron.
> > 
> > > > > From my perspective, all clients should always be operational in
> > > > > whatever baseline version you use. If it isn't that's a bug that
> > > > > should
> > > > > be fixed. Ideally those bugs should get fixed before making it into
> > > > > a
> > > > > baseline version (mainline, linux-next, ...), so that this never
> > > > > impacts
> > > > > even developers, unless they break it themselves, in which case
> > > > > refusing
> > > > > to register the DRM device is a pretty good incentive to fix it.
> > > > 
> > > > Sounds like you're assuming that only kernel developers are supposed
> > > > to
> > > > use
> > > > Tegra, though at least (for now) for upstream it is kinda true. Of
> > > > course
> > > > that could be changed ;-)
> > > 
> > > That's not at all what I'm saying. What I am saying is that we should
> > > make it difficult for developers to break the kernel in a way that would
> > > put users into a position that is difficult to get themselves out of. If
> > > we refuse to register the complete DRM device in case some subdevice
> > > fails to initialize we increase the chances of developers noticing and
> > > fixing it.
> > > 
> > > If all we do on failure is output an error message, it's very likely to
> > > go unnoticed. Developers are likely to check specifically the
> > > functionality that they're working on and ignore failures that they may
> > > have caused in other parts of the code as a side-effect of their current
> > > work.
> > 
> > I can try to help with improving of your automated testing suite, if
> > you'll
> > make it accessible to the public.
> 
> This has nothing to do with automated test suites. The problem with test
> suites is that you can't force everyone to run them, so it's likely that
> people would still miss it.
> 
> The whole point of failing the whole thing is to force people to do the
> right thing irrespective of any test suite or log or whatever. So if I
> work on display and accidentally break VIC initialization, I'm now
> forced to fix VIC initialization if I want to be able to test display.
> 
> > > Perhaps a good middle ground would be to turn initialization failures
> > > into WARN_ON() to increase the chances of them getting notified and then
> > > continue on a best effort basis in the hopes of having enough
> > > initialized to get a message to users.
> > 
> > Good to me, I'll make v2.
> > 
> > > Another alternative would be to
> > > mark essential subdevices (such as the display controller) so that only
> > > they will cause failure to register the whole DRM device.
> > 
> > I don't think that making some subdevice more valuable than the others is
> > a
> > feasible approach, how you are going to differentiate what of the two
> > display controllers is more important than the other.
> 
> I wasn't talking about one display controller vs. another, but rather
> about "optional" subdevices. For example, I'd consider gr2d, gr3d and
> VIC as optional in that they aren't needed to get something on the
> screen. Anything display related would be considered required to make
> sure people get at least a working display, even if all optional sub-
> devices fail.

Well, from a users perspective, likely that display panel is more important 
than HDMI port. It doesn't sound good that display panel won't work because of 
an unrelated HDMI issue (and vice versa).



--
To unsubscribe from this list: send the line "unsubscribe linux-tegra" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding Aug. 9, 2018, 2:15 p.m. UTC | #7
On Thu, Aug 09, 2018 at 04:46:55PM +0300, Dmitry Osipenko wrote:
> On Thursday, 9 August 2018 16:10:35 MSK Thierry Reding wrote:
> > On Thu, Aug 09, 2018 at 03:53:03PM +0300, Dmitry Osipenko wrote:
> > > On Thursday, 9 August 2018 13:45:41 MSK Thierry Reding wrote:
> > > > On Fri, Aug 03, 2018 at 02:30:47PM +0300, Dmitry Osipenko wrote:
> > > > > On Friday, 3 August 2018 13:50:55 MSK Thierry Reding wrote:
> > > > > > On Wed, Aug 01, 2018 at 06:08:07PM +0300, Dmitry Osipenko wrote:
> > > > > > > From time to time new bugs are popping up, causing some host1x
> > > > > > > client
> > > > > > > to
> > > > > > > fail its initialization. Currently a single clients initialization
> > > > > > > failure
> > > > > > > causes whole host1x device registration to fail, as a result a
> > > > > > > single
> > > > > > > DRM
> > > > > > > sub-device initialization failure makes whole DRM initialization
> > > > > > > to
> > > > > > > fail.
> > > > > > > Let's ignore clients initialization failure, as a result display
> > > > > > > panel
> > > > > > > lights up even if some DRM clients (say GR2D or VIC) fail to
> > > > > > > initialize.
> > > > > > > 
> > > > > > > Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> > > > > > > ---
> > > > > > > 
> > > > > > >  drivers/gpu/host1x/bus.c | 18 +++++++-----------
> > > > > > >  1 file changed, 7 insertions(+), 11 deletions(-)
> > > > > > 
> > > > > > This is actually done on purpose. I can't think of a case where we
> > > > > > would
> > > > > > actively like to keep a half-broken DRM device operational. The
> > > > > > errors
> > > > > > that you're talking about should only happen during development,
> > > > > 
> > > > > [only in a perfect world]
> > > > 
> > > > gr2d and VIC are fairly deterministic. What are the errors you are
> > > > seeing that cause initialization failure?
> > > 
> > > Right now there are no specific errors. There is only a known trouble with
> > > the ARM_DMA_USE_IOMMU, but that causes to fail all of the clients.
> > > 
> > > > Note that it is possible for
> > > > these devices to malfunction even if initialization succeeds. A failure
> > > > to initialize can only happen if there's something wrong in the device
> > > > tree, firmware is missing (in case of VIC) or a regression was
> > > > introduced in some subsystem that causes a failure (maybe deferred probe
> > > > or something similar).
> > > 
> > > A missing firmware is an actual good example. Though can't VIC driver be
> > > changed to load firmware at the time of a its first use by userspace? It
> > > should be very annoying that kernel driver forces you to churn with
> > > initramfs.
> > No, that's not really how firmware loading works. There's no technical
> > barrier to doing it, but it'd be strange. If a device needs firmware to
> > work, it'd be unusual to make it available before you know that it can
> > be used.
> > 
> > What if the firmware can't be loaded at the time of the first use? How
> > do you report that back to userspace? -ENOENT because the firmware file
> > can't be found? How is userspace to know that this is a problem with
> > firmware loading rather than some other error having to do with the VIC
> > command stream being broken, for example?
> 
> You'll have to check kernels log if there is some kernel-related failure in 
> any case for the detailed information.

Well, it depends. It might also send you on a wild goose chase for a
while until you realize that it's just a simple failure to find the
firmware binaries.

> > Modifying the initramfs is only necessary if you have the module built-
> > in or the module in the initramfs. If the module is in a root
> > filesystem, simply put the firmware there as well and you should be
> > good. This is all very standard Linux behaviour, nothing new or exotic
> > there.
> 
> Having display driver built-in usually more preferable, it's not fun if kernel 
> can't get to FS mount stage.

Yeah, in that case you can include the firmware in the initramfs and the
driver should be able to pick it up.

> > > > > > and the
> > > > > > device not showing up is a pretty good indicator that something is
> > > > > > wrong
> > > > > > as opposed to everything booting normally and then getting some
> > > > > > cryptic
> > > > > > error at runtime because one of the clients didn't initialize.
> > > > > 
> > > > > If machine doesn't have a serial port and it doesn't have ssh server
> > > > > running or network doesn't come up, you'll end up with a completely
> > > > > dysfunctional piece of hardware. Hence there is no chance for you to
> > > > > even
> > > > > check what is wrong. The argument about a cryptic error doesn't make
> > > > > much
> > > > > sense, you have to improve your runtime as well (and you'll get a
> > > > > error
> > > > > message in the kernels log).
> > > > 
> > > > You make a good point here. I'm not aware of any devices we support in
> > > > the kernel that don't have a serial console, but that doesn't mean the
> > > > kernel could be used on an "unsupported" device  that doesn't have one.
> > > 
> > > AFAIK, enabling serial on AC100 require soldering iron.
> > > 
> > > > > > From my perspective, all clients should always be operational in
> > > > > > whatever baseline version you use. If it isn't that's a bug that
> > > > > > should
> > > > > > be fixed. Ideally those bugs should get fixed before making it into
> > > > > > a
> > > > > > baseline version (mainline, linux-next, ...), so that this never
> > > > > > impacts
> > > > > > even developers, unless they break it themselves, in which case
> > > > > > refusing
> > > > > > to register the DRM device is a pretty good incentive to fix it.
> > > > > 
> > > > > Sounds like you're assuming that only kernel developers are supposed
> > > > > to
> > > > > use
> > > > > Tegra, though at least (for now) for upstream it is kinda true. Of
> > > > > course
> > > > > that could be changed ;-)
> > > > 
> > > > That's not at all what I'm saying. What I am saying is that we should
> > > > make it difficult for developers to break the kernel in a way that would
> > > > put users into a position that is difficult to get themselves out of. If
> > > > we refuse to register the complete DRM device in case some subdevice
> > > > fails to initialize we increase the chances of developers noticing and
> > > > fixing it.
> > > > 
> > > > If all we do on failure is output an error message, it's very likely to
> > > > go unnoticed. Developers are likely to check specifically the
> > > > functionality that they're working on and ignore failures that they may
> > > > have caused in other parts of the code as a side-effect of their current
> > > > work.
> > > 
> > > I can try to help with improving of your automated testing suite, if
> > > you'll
> > > make it accessible to the public.
> > 
> > This has nothing to do with automated test suites. The problem with test
> > suites is that you can't force everyone to run them, so it's likely that
> > people would still miss it.
> > 
> > The whole point of failing the whole thing is to force people to do the
> > right thing irrespective of any test suite or log or whatever. So if I
> > work on display and accidentally break VIC initialization, I'm now
> > forced to fix VIC initialization if I want to be able to test display.
> > 
> > > > Perhaps a good middle ground would be to turn initialization failures
> > > > into WARN_ON() to increase the chances of them getting notified and then
> > > > continue on a best effort basis in the hopes of having enough
> > > > initialized to get a message to users.
> > > 
> > > Good to me, I'll make v2.
> > > 
> > > > Another alternative would be to
> > > > mark essential subdevices (such as the display controller) so that only
> > > > they will cause failure to register the whole DRM device.
> > > 
> > > I don't think that making some subdevice more valuable than the others is
> > > a
> > > feasible approach, how you are going to differentiate what of the two
> > > display controllers is more important than the other.
> > 
> > I wasn't talking about one display controller vs. another, but rather
> > about "optional" subdevices. For example, I'd consider gr2d, gr3d and
> > VIC as optional in that they aren't needed to get something on the
> > screen. Anything display related would be considered required to make
> > sure people get at least a working display, even if all optional sub-
> > devices fail.
> 
> Well, from a users perspective, likely that display panel is more important 
> than HDMI port. It doesn't sound good that display panel won't work because of 
> an unrelated HDMI issue (and vice versa).

Yes. That's why I said *anything* display related needs to be considered
important so that we can increase the chances of getting any output at
all. Only the bits that are later used by userspace should be optional.
Anyway, this is somewhat irrelevant since we already agreed that a WARN
should be sufficient.

Thierry

Patch
diff mbox series

diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 815bdb42e3f0..145019e105e4 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -199,34 +199,30 @@  static void host1x_subdev_unregister(struct host1x_device *device,
  */
 int host1x_device_init(struct host1x_device *device)
 {
-	struct host1x_client *client;
+	struct host1x_client *client, *cl;
 	int err;
 
+	mutex_lock(&clients_lock);
 	mutex_lock(&device->clients_lock);
 
-	list_for_each_entry(client, &device->clients, list) {
+	list_for_each_entry_safe(client, cl, &device->clients, list) {
 		if (client->ops && client->ops->init) {
 			err = client->ops->init(client);
 			if (err < 0) {
 				dev_err(&device->dev,
 					"failed to initialize %s: %d\n",
 					dev_name(client->dev), err);
-				goto teardown;
+
+				/* add the client to the list of idle clients */
+				list_add_tail(&client->list, &clients);
 			}
 		}
 	}
 
 	mutex_unlock(&device->clients_lock);
+	mutex_unlock(&clients_lock);
 
 	return 0;
-
-teardown:
-	list_for_each_entry_continue_reverse(client, &device->clients, list)
-		if (client->ops->exit)
-			client->ops->exit(client);
-
-	mutex_unlock(&device->clients_lock);
-	return err;
 }
 EXPORT_SYMBOL(host1x_device_init);