mbox

[Pull,Request,Lucid] Fix Xorg hangs with some i915 chipsets

Message ID 20110617163838.GB27800@thinkpad-t410
State New
Headers show

Pull-request

git://kernel.ubuntu.com/sforshee/ubuntu-lucid.git lp599017

Message

Seth Forshee June 17, 2011, 4:38 p.m. UTC
This is a rather large series of non-trivial backports to fix a Xorg
hang caused by the i915 driver. Patches 5 and 6 are the key patches that
implement an LRU-based eviction algorithm; patches 1-4 are
prerequisites, patch 7 fixes some of the clean-up, and patch 8 adds a
periodic flush to retire active buffers.

This is a pretty severe bug for those affected by it, but it's also a
huge amount of change for a stable kernel, so let me know if it's beyond
hope for an SRU.

SRU Justification:

  Impact: When the GPU aperture becomes sufficiently fragmented two or
  more mmapped buffers in active use can repeatedly push each other out
  of the aperture. This results in a sort of livelock situation, with
  Xorg appearing to be hung with high CPU utilization.

  Fix: Backport of a series of patches that convert the i915 eviction
  algorithm from a best-fit approach to one based on evicting the least
  recently used objects and a patch that adds a periodic flush requests
  to retire active buffers when no client is active.

  Testcase: Without these patches this scenario can be triggered
  ocassionally when visiting certain web pages that utilize Flash or
  viewing certain images in Firefox, as described on bugs #599017 and
  #330460. Testing the same pages/images with these patches does not
  induce the hang.


The following changes since commit 9cc333d02cb4620561e91a63ed86a2159ec32638:

  UBUNTU: Ubuntu-2.6.32-33.67 (2011-06-16 11:26:04 -0500)

are available in the git repository at:
  git://kernel.ubuntu.com/sforshee/ubuntu-lucid.git lp599017

Chris Wilson (5):
      drm/i915: Move the eviction logic to its own file.
      drm/i915: Implement fair lru eviction across both rings.  (v2)
      drm/i915: Maintain LRU order of inactive objects upon access by CPU (v2)
      drm/i915/evict: Ensure we completely cleanup on failure
      drm/i915: Periodically flush the active lists and requests

Daniel Vetter (3):
      drm_mm: extract check_free_mm_node
      drm: implement helper functions for scanning lru list
      drm/i915: prepare for fair lru eviction

 drivers/gpu/drm/drm_mm.c              |  236 +++++++++++++++++++++++++-----
 drivers/gpu/drm/i915/Makefile         |    1 +
 drivers/gpu/drm/i915/i915_drv.h       |   13 ++
 drivers/gpu/drm/i915/i915_gem.c       |  228 ++++-------------------------
 drivers/gpu/drm/i915/i915_gem_evict.c |  253 +++++++++++++++++++++++++++++++++
 include/drm/drm_mm.h                  |   15 ++-
 6 files changed, 510 insertions(+), 236 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_gem_evict.c

Comments

Tim Gardner June 20, 2011, 5:04 p.m. UTC | #1
On 06/17/2011 10:38 AM, Seth Forshee wrote:
> This is a rather large series of non-trivial backports to fix a Xorg
> hang caused by the i915 driver. Patches 5 and 6 are the key patches that
> implement an LRU-based eviction algorithm; patches 1-4 are
> prerequisites, patch 7 fixes some of the clean-up, and patch 8 adds a
> periodic flush to retire active buffers.
>
> This is a pretty severe bug for those affected by it, but it's also a
> huge amount of change for a stable kernel, so let me know if it's beyond
> hope for an SRU.
>
> SRU Justification:
>
>    Impact: When the GPU aperture becomes sufficiently fragmented two or
>    more mmapped buffers in active use can repeatedly push each other out
>    of the aperture. This results in a sort of livelock situation, with
>    Xorg appearing to be hung with high CPU utilization.
>
>    Fix: Backport of a series of patches that convert the i915 eviction
>    algorithm from a best-fit approach to one based on evicting the least
>    recently used objects and a patch that adds a periodic flush requests
>    to retire active buffers when no client is active.
>
>    Testcase: Without these patches this scenario can be triggered
>    ocassionally when visiting certain web pages that utilize Flash or
>    viewing certain images in Firefox, as described on bugs #599017 and
>    #330460. Testing the same pages/images with these patches does not
>    induce the hang.
>
>
> The following changes since commit 9cc333d02cb4620561e91a63ed86a2159ec32638:
>
>    UBUNTU: Ubuntu-2.6.32-33.67 (2011-06-16 11:26:04 -0500)
>
> are available in the git repository at:
>    git://kernel.ubuntu.com/sforshee/ubuntu-lucid.git lp599017
>
> Chris Wilson (5):
>        drm/i915: Move the eviction logic to its own file.
>        drm/i915: Implement fair lru eviction across both rings.  (v2)
>        drm/i915: Maintain LRU order of inactive objects upon access by CPU (v2)
>        drm/i915/evict: Ensure we completely cleanup on failure
>        drm/i915: Periodically flush the active lists and requests
>
> Daniel Vetter (3):
>        drm_mm: extract check_free_mm_node
>        drm: implement helper functions for scanning lru list
>        drm/i915: prepare for fair lru eviction
>
>   drivers/gpu/drm/drm_mm.c              |  236 +++++++++++++++++++++++++-----
>   drivers/gpu/drm/i915/Makefile         |    1 +
>   drivers/gpu/drm/i915/i915_drv.h       |   13 ++
>   drivers/gpu/drm/i915/i915_gem.c       |  228 ++++-------------------------
>   drivers/gpu/drm/i915/i915_gem_evict.c |  253 +++++++++++++++++++++++++++++++++
>   include/drm/drm_mm.h                  |   15 ++-
>   6 files changed, 510 insertions(+), 236 deletions(-)
>   create mode 100644 drivers/gpu/drm/i915/i915_gem_evict.c
>

I wouldn't do this for a non-LTS release, but I think this case is 
warranted since the test results look so good. Now we just have to make 
sure there are no regressions since some of the backport patches differ 
significantly from upstream.

Acked-by: Tim Gardner <tim.gardner@canonical.com>
Stefan Bader June 22, 2011, 2:50 p.m. UTC | #2
On 17.06.2011 18:38, Seth Forshee wrote:
> This is a rather large series of non-trivial backports to fix a Xorg
> hang caused by the i915 driver. Patches 5 and 6 are the key patches that
> implement an LRU-based eviction algorithm; patches 1-4 are
> prerequisites, patch 7 fixes some of the clean-up, and patch 8 adds a
> periodic flush to retire active buffers.
> 
> This is a pretty severe bug for those affected by it, but it's also a
> huge amount of change for a stable kernel, so let me know if it's beyond
> hope for an SRU.
> 
> SRU Justification:
> 
>   Impact: When the GPU aperture becomes sufficiently fragmented two or
>   more mmapped buffers in active use can repeatedly push each other out
>   of the aperture. This results in a sort of livelock situation, with
>   Xorg appearing to be hung with high CPU utilization.
> 
>   Fix: Backport of a series of patches that convert the i915 eviction
>   algorithm from a best-fit approach to one based on evicting the least
>   recently used objects and a patch that adds a periodic flush requests
>   to retire active buffers when no client is active.
> 
>   Testcase: Without these patches this scenario can be triggered
>   ocassionally when visiting certain web pages that utilize Flash or
>   viewing certain images in Firefox, as described on bugs #599017 and
>   #330460. Testing the same pages/images with these patches does not
>   induce the hang.
> 
> 
> The following changes since commit 9cc333d02cb4620561e91a63ed86a2159ec32638:
> 
>   UBUNTU: Ubuntu-2.6.32-33.67 (2011-06-16 11:26:04 -0500)
> 
> are available in the git repository at:
>   git://kernel.ubuntu.com/sforshee/ubuntu-lucid.git lp599017
> 
> Chris Wilson (5):
>       drm/i915: Move the eviction logic to its own file.
>       drm/i915: Implement fair lru eviction across both rings.  (v2)
>       drm/i915: Maintain LRU order of inactive objects upon access by CPU (v2)
>       drm/i915/evict: Ensure we completely cleanup on failure
>       drm/i915: Periodically flush the active lists and requests
> 
> Daniel Vetter (3):
>       drm_mm: extract check_free_mm_node
>       drm: implement helper functions for scanning lru list
>       drm/i915: prepare for fair lru eviction
> 
>  drivers/gpu/drm/drm_mm.c              |  236 +++++++++++++++++++++++++-----
>  drivers/gpu/drm/i915/Makefile         |    1 +
>  drivers/gpu/drm/i915/i915_drv.h       |   13 ++
>  drivers/gpu/drm/i915/i915_gem.c       |  228 ++++-------------------------
>  drivers/gpu/drm/i915/i915_gem_evict.c |  253 +++++++++++++++++++++++++++++++++
>  include/drm/drm_mm.h                  |   15 ++-
>  6 files changed, 510 insertions(+), 236 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_gem_evict.c
> 

I also tend to say the bug sounds serious enough to make an exception to the
rule of limiting the change.

I would probably add those to the 2.6.32+drm33 tree. Though there are two
details which I think should be resolved before:
1. In the bug report (#599017) the last request for testing points to a place
which only shows 6 patches. Does the kernel provided actually contain all of them?
2. It would be good to get some feedback about the test kernel in the bug.
Seth Forshee June 22, 2011, 3:44 p.m. UTC | #3
On Wed, Jun 22, 2011 at 04:50:34PM +0200, Stefan Bader wrote:
> On 17.06.2011 18:38, Seth Forshee wrote:
> > This is a rather large series of non-trivial backports to fix a Xorg
> > hang caused by the i915 driver. Patches 5 and 6 are the key patches that
> > implement an LRU-based eviction algorithm; patches 1-4 are
> > prerequisites, patch 7 fixes some of the clean-up, and patch 8 adds a
> > periodic flush to retire active buffers.
> > 
> > This is a pretty severe bug for those affected by it, but it's also a
> > huge amount of change for a stable kernel, so let me know if it's beyond
> > hope for an SRU.
> > 
> > SRU Justification:
> > 
> >   Impact: When the GPU aperture becomes sufficiently fragmented two or
> >   more mmapped buffers in active use can repeatedly push each other out
> >   of the aperture. This results in a sort of livelock situation, with
> >   Xorg appearing to be hung with high CPU utilization.
> > 
> >   Fix: Backport of a series of patches that convert the i915 eviction
> >   algorithm from a best-fit approach to one based on evicting the least
> >   recently used objects and a patch that adds a periodic flush requests
> >   to retire active buffers when no client is active.
> > 
> >   Testcase: Without these patches this scenario can be triggered
> >   ocassionally when visiting certain web pages that utilize Flash or
> >   viewing certain images in Firefox, as described on bugs #599017 and
> >   #330460. Testing the same pages/images with these patches does not
> >   induce the hang.
> > 
> > 
> > The following changes since commit 9cc333d02cb4620561e91a63ed86a2159ec32638:
> > 
> >   UBUNTU: Ubuntu-2.6.32-33.67 (2011-06-16 11:26:04 -0500)
> > 
> > are available in the git repository at:
> >   git://kernel.ubuntu.com/sforshee/ubuntu-lucid.git lp599017
> > 
> > Chris Wilson (5):
> >       drm/i915: Move the eviction logic to its own file.
> >       drm/i915: Implement fair lru eviction across both rings.  (v2)
> >       drm/i915: Maintain LRU order of inactive objects upon access by CPU (v2)
> >       drm/i915/evict: Ensure we completely cleanup on failure
> >       drm/i915: Periodically flush the active lists and requests
> > 
> > Daniel Vetter (3):
> >       drm_mm: extract check_free_mm_node
> >       drm: implement helper functions for scanning lru list
> >       drm/i915: prepare for fair lru eviction
> > 
> >  drivers/gpu/drm/drm_mm.c              |  236 +++++++++++++++++++++++++-----
> >  drivers/gpu/drm/i915/Makefile         |    1 +
> >  drivers/gpu/drm/i915/i915_drv.h       |   13 ++
> >  drivers/gpu/drm/i915/i915_gem.c       |  228 ++++-------------------------
> >  drivers/gpu/drm/i915/i915_gem_evict.c |  253 +++++++++++++++++++++++++++++++++
> >  include/drm/drm_mm.h                  |   15 ++-
> >  6 files changed, 510 insertions(+), 236 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/i915_gem_evict.c
> > 
> 
> I also tend to say the bug sounds serious enough to make an exception to the
> rule of limiting the change.
> 
> I would probably add those to the 2.6.32+drm33 tree. Though there are two
> details which I think should be resolved before:
> 1. In the bug report (#599017) the last request for testing points to a place
> which only shows 6 patches. Does the kernel provided actually contain all of them?

No, I guess I didn't post a link to that build on the bug, although I
just did so. I have an affected machine though and have tested the full
series myself for several days. I was able to trigger the bug very
ocassionally without the last patch but have never triggered it with
that patch.

> 2. It would be good to get some feedback about the test kernel in the bug.

I have received some offline feedback, but I'll ask for the feedback to
be posted to the bug.