diff mbox

FSP: Fix poller lock issue

Message ID 20150308094920.25374.94537.stgit@localhost.localdomain
State Changes Requested
Headers show

Commit Message

Vasant Hegde March 8, 2015, 9:49 a.m. UTC
This patch attempts to fix the "poller lock issue" by removing the
opal_run_pollers() call from fsp_sync_msg().

sample log:
----------
[4130907342,3] Running pollers with lock held !
CPU 08b0 Backtrace:
 S: 0000000033cc3990 R: 0000000030013520   .backtrace+0x2c
 S: 0000000033cc3a20 R: 0000000030017d9c   .opal_run_pollers+0x44
 S: 0000000033cc3aa0 R: 0000000030045f04   .fsp_sync_msg+0x40
 S: 0000000033cc3b30 R: 0000000030047840   .fsp_fetch_data+0x138
 S: 0000000033cc3c20 R: 0000000030021204   .hservices_lid_preload+0x148
 S: 0000000033cc3d10 R: 0000000030058038   .ibm_fsp_init+0xe4
 S: 0000000033cc3dc0 R: 0000000030058df8   .firenze_init+0x18
 S: 0000000033cc3e30 R: 00000000300141a4   .main_cpu_entry+0x3e0
 S: 0000000033cc3f00 R: 0000000030002534   boot_entry+0x18c

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

---
Ben, Stewart,
  Presently we have two places (fsp_fetch_data & capp_lid_download) where
  we hold the lock and make an opal_run_pollers() call (via fsp_sync_msg).

  Earlier we replaced __fsp_poll with opal_run_pollers (commit 488dc8a3).
  But I think it's fine to just call __fsp_poll here. This seems to work,
  but it needs to be reviewed before merging.

  I tried the alternative approach we discussed earlier (introduce a queue in
  fsp_fetch_data), but it requires quite a bit of work in the callback function.
---
 hw/fsp/fsp.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Vasant Hegde March 8, 2015, 11:26 a.m. UTC | #1
On 03/08/2015 03:19 PM, Vasant Hegde wrote:
> This patch attempts to fix "poller lock issue" by removing
> opal_run_pollers function call from fsp_sync_msg.
> 
> sample log:
> ----------
> [4130907342,3] Running pollers with lock held !
> CPU 08b0 Backtrace:
>  S: 0000000033cc3990 R: 0000000030013520   .backtrace+0x2c
>  S: 0000000033cc3a20 R: 0000000030017d9c   .opal_run_pollers+0x44
>  S: 0000000033cc3aa0 R: 0000000030045f04   .fsp_sync_msg+0x40
>  S: 0000000033cc3b30 R: 0000000030047840   .fsp_fetch_data+0x138
>  S: 0000000033cc3c20 R: 0000000030021204   .hservices_lid_preload+0x148
>  S: 0000000033cc3d10 R: 0000000030058038   .ibm_fsp_init+0xe4
>  S: 0000000033cc3dc0 R: 0000000030058df8   .firenze_init+0x18
>  S: 0000000033cc3e30 R: 00000000300141a4   .main_cpu_entry+0x3e0
>  S: 0000000033cc3f00 R: 0000000030002534   boot_entry+0x18c
> 
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> 
> ---
> Ben, Stewart,
>   Presently we have two places (fsp_fetch_data & capp_lid_download) where
>   we held lock and make opal_run_pollers call (via fsp_sync_msg).
> 
>   Earlier we had replaced __fsp_poll with opal_run_pollers (commit 488dc8a3).
>   But I think its fine just to call __fsp_poll here. This seems to be working.
>   But needs to be reviewed before merging.
> 
>   I tried other alternative approach we discussed earlier. (introduce queue in
>   fsp_fetch_data). This requies quite a bit of work in callback function.

Hmmm.. This patch fixes the poller lock issue, but it *doesn't* seem to be
complete. We still hit the lock issue in fsp_fetch_data if we get an OCC load
request while fetching skiroot :-(

Another possible approach:
  Presently we use the same DMA mapping for all fetch-data requests. If we
instead use a separate DMA mapping for OCC load requests, then we can simply
queue that request via fsp_fetch_data_queue and avoid the lock issue in
fsp_fetch_data. Looking through the code base, I don't see any other place
where we get such a notification that results in an fsp_fetch_data call.

-Vasant

> ---
>  hw/fsp/fsp.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/fsp/fsp.c b/hw/fsp/fsp.c
> index e57c611..5fbc33f 100644
> --- a/hw/fsp/fsp.c
> +++ b/hw/fsp/fsp.c
> @@ -1663,7 +1663,10 @@ int fsp_sync_msg(struct fsp_msg *msg, bool autofree)
>  
>  	while(fsp_msg_busy(msg)) {
>  		cpu_relax();
> -		opal_run_pollers();
> +		lock(&fsp_lock);
> +		__fsp_poll(false);
> +		unlock(&fsp_lock);
>  	}
>  
>  	switch(msg->state) {
> 
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
>
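[Editor's note] The queued-fetch idea discussed above (separate per-request
buffers plus an fsp_fetch_data_queue-style entry point, with completion
callbacks run from the poller) could be sketched roughly as below. This is an
illustrative model only: all names (fetch_req, fetch_data_queue, fetch_poll)
are hypothetical and the "DMA transfer" is simulated; it is not the skiboot
API.

```c
/* Illustrative sketch: a minimal queued-fetch model with completion
 * callbacks, loosely mirroring the fsp_fetch_data_queue idea above.
 * All names are hypothetical, not the skiboot API. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REQS 8

struct fetch_req {
	void *buf;                      /* per-request buffer ("separate DMA mapping") */
	size_t len;
	void (*cb)(struct fetch_req *); /* completion callback */
	int done;
};

static struct fetch_req *queue[MAX_REQS];
static int q_head, q_tail;

/* Queue a request instead of fetching synchronously under a lock. */
static int fetch_data_queue(struct fetch_req *req)
{
	if ((q_tail + 1) % MAX_REQS == q_head)
		return -1;              /* queue full */
	queue[q_tail] = req;
	q_tail = (q_tail + 1) % MAX_REQS;
	return 0;
}

/* Poller side: completes one queued request per invocation and
 * invokes its callback, so callers never spin with a lock held. */
static void fetch_poll(void)
{
	struct fetch_req *req;

	if (q_head == q_tail)
		return;                 /* nothing pending */
	req = queue[q_head];
	q_head = (q_head + 1) % MAX_REQS;
	memset(req->buf, 0xAB, req->len); /* stand-in for the real transfer */
	req->done = 1;
	if (req->cb)
		req->cb(req);
}
```

As the thread notes, the real work is in the callback functions: each caller
of the queue API has to be restructured to continue from its callback rather
than block.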
Vaidyanathan Srinivasan March 8, 2015, 5:11 p.m. UTC | #2
* Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [2015-03-08 16:56:31]:

> On 03/08/2015 03:19 PM, Vasant Hegde wrote:
> > This patch attempts to fix "poller lock issue" by removing
> > opal_run_pollers function call from fsp_sync_msg.
> > 
> > sample log:
> > ----------
> > [4130907342,3] Running pollers with lock held !
> > CPU 08b0 Backtrace:
> >  S: 0000000033cc3990 R: 0000000030013520   .backtrace+0x2c
> >  S: 0000000033cc3a20 R: 0000000030017d9c   .opal_run_pollers+0x44
> >  S: 0000000033cc3aa0 R: 0000000030045f04   .fsp_sync_msg+0x40
> >  S: 0000000033cc3b30 R: 0000000030047840   .fsp_fetch_data+0x138
> >  S: 0000000033cc3c20 R: 0000000030021204   .hservices_lid_preload+0x148
> >  S: 0000000033cc3d10 R: 0000000030058038   .ibm_fsp_init+0xe4
> >  S: 0000000033cc3dc0 R: 0000000030058df8   .firenze_init+0x18
> >  S: 0000000033cc3e30 R: 00000000300141a4   .main_cpu_entry+0x3e0
> >  S: 0000000033cc3f00 R: 0000000030002534   boot_entry+0x18c
> > 
> > Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> > 
> > ---
> > Ben, Stewart,
> >   Presently we have two places (fsp_fetch_data & capp_lid_download) where
> >   we held lock and make opal_run_pollers call (via fsp_sync_msg).
> > 
> >   Earlier we had replaced __fsp_poll with opal_run_pollers (commit 488dc8a3).
> >   But I think its fine just to call __fsp_poll here. This seems to be working.
> >   But needs to be reviewed before merging.
> > 
> >   I tried other alternative approach we discussed earlier. (introduce queue in
> >   fsp_fetch_data). This requies quite a bit of work in callback function.
> 
> Hmmm.. This patch fixes poller lock issue.. But this *doesn't* seems to be
> complete.. We still hit lock issue in fsp_fetch_data, if we get OCC load request
> while fetching skiroot :-(
> 
> Another possible approach:
>   Presently we use same DMA mapping for all the fetch data.. Instead if we use
> separate DMA mapping for OCC load requests then we can simply queue that request
> using fsp_fetch_data_queue call..That way we will not hit lock issue in
> fsp_fetch_data.. Looking into code base I don't see any other places we get such
> notification which will result in fsp_fetch_data call.

Hi Vasant,

The OCC load request can be made to wait. I can make it work with
a callback: when the FSP sends an OCC load request, I can issue a
preload of the OCC LID and queue it to your functions. The callback
after completion will do the actual OCC load request processing by
calling hostboot.

Since we have preloaded the OCC LID, when hostboot calls back for the
OCC LID we will already have it, and we will not reach
fsp_fetch_data_queue at all.

What I still need to understand is the CAPI ucode loading request and
response sequence: at what point will we be able to hold the request
and continue after the callback?

--Vaidy
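[Editor's note] The defer-via-callback flow Vaidy describes (queue a preload
on the incoming OCC load request, do the real processing from the completion
callback) could be sketched as the small state machine below. All names are
hypothetical and the "hostboot call" and "fetch completion" are simulated;
this is not the actual skiboot flow.

```c
/* Illustrative sketch of deferring an OCC load request behind a
 * queued LID preload. Hypothetical names; not skiboot code. */
#include <assert.h>

enum occ_state { OCC_IDLE, OCC_PRELOADING, OCC_LOADED };

static enum occ_state state = OCC_IDLE;
static int hostboot_calls;
static void (*pending_cb)(void);

/* Preload-completion callback: the LID is in memory now, so the real
 * OCC load processing (calling into hostboot) can run. */
static void occ_preload_done(void)
{
	state = OCC_LOADED;
	hostboot_calls++;	/* stand-in for the hostboot call */
}

/* Entry point when the FSP sends an OCC load request: instead of
 * fetching synchronously under a lock, queue a preload and defer. */
static void occ_load_request(void)
{
	if (state != OCC_IDLE)
		return;		/* preload already in flight or done */
	state = OCC_PRELOADING;
	pending_cb = occ_preload_done;
}

/* Poller side: pretend the queued fetch just completed. */
static void fetch_complete(void)
{
	void (*cb)(void) = pending_cb;
	pending_cb = 0;
	if (cb)
		cb();
}
```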
Ananth N Mavinakayanahalli March 9, 2015, 6:36 a.m. UTC | #3
On Sun, Mar 08, 2015 at 03:19:41PM +0530, Vasant Hegde wrote:
> This patch attempts to fix "poller lock issue" by removing
> opal_run_pollers function call from fsp_sync_msg.
> 
> sample log:
> ----------
> [4130907342,3] Running pollers with lock held !
> CPU 08b0 Backtrace:
>  S: 0000000033cc3990 R: 0000000030013520   .backtrace+0x2c
>  S: 0000000033cc3a20 R: 0000000030017d9c   .opal_run_pollers+0x44
>  S: 0000000033cc3aa0 R: 0000000030045f04   .fsp_sync_msg+0x40
>  S: 0000000033cc3b30 R: 0000000030047840   .fsp_fetch_data+0x138
>  S: 0000000033cc3c20 R: 0000000030021204   .hservices_lid_preload+0x148
>  S: 0000000033cc3d10 R: 0000000030058038   .ibm_fsp_init+0xe4
>  S: 0000000033cc3dc0 R: 0000000030058df8   .firenze_init+0x18
>  S: 0000000033cc3e30 R: 00000000300141a4   .main_cpu_entry+0x3e0
>  S: 0000000033cc3f00 R: 0000000030002534   boot_entry+0x18c
> 
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> 
> ---
> Ben, Stewart,
>   Presently we have two places (fsp_fetch_data & capp_lid_download) where
>   we held lock and make opal_run_pollers call (via fsp_sync_msg).
> 
>   Earlier we had replaced __fsp_poll with opal_run_pollers (commit 488dc8a3).
>   But I think its fine just to call __fsp_poll here. This seems to be working.
>   But needs to be reviewed before merging.
> 
>   I tried other alternative approach we discussed earlier. (introduce queue in
>   fsp_fetch_data). This requies quite a bit of work in callback function.

This explanation needs to go in the commit message.

> ---
>  hw/fsp/fsp.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/fsp/fsp.c b/hw/fsp/fsp.c
> index e57c611..5fbc33f 100644
> --- a/hw/fsp/fsp.c
> +++ b/hw/fsp/fsp.c
> @@ -1663,7 +1663,10 @@ int fsp_sync_msg(struct fsp_msg *msg, bool autofree)
>  
>  	while(fsp_msg_busy(msg)) {
>  		cpu_relax();
> -		opal_run_pollers();
> +		lock(&fsp_lock);
> +		__fsp_poll(false);
> +		unlock(&fsp_lock);

You could just use fsp_opal_poll() instead.

Ananth
Benjamin Herrenschmidt March 9, 2015, 7:30 a.m. UTC | #4
On Mon, 2015-03-09 at 12:06 +0530, Ananth N Mavinakayanahalli wrote:
> On Sun, Mar 08, 2015 at 03:19:41PM +0530, Vasant Hegde wrote:
> > This patch attempts to fix "poller lock issue" by removing
> > opal_run_pollers function call from fsp_sync_msg.
> >

I disagree with generically changing the call to __fsp_poll(). That
means, for example, that during such a potentially long call we will
not monitor the PSI link.

While a full queued download-request mechanism is better, we could get
away with a simpler hack, provided we are certain that we will never call
fsp_fetch_data() from a poller callback.

It's a bit gross, but using a busy flag will provide a quick fix while
you work on a better solution:

for (;;) {
	lock(...);
	/* test and set the busy flag */
	busy = fetch_busy;
	fetch_busy = true;
	unlock(...);
	if (!busy)
		break;
	cpu_relax();
}

 .. do load

/* clear the flag */
fetch_busy = false;
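[Editor's note] The test-and-set pattern Ben sketches can be shown in a
self-contained form using C11 `atomic_flag` in place of skiboot's
lock()/unlock() and an explicit flag variable; the names below are made up
for this example.

```c
/* Self-contained illustration of the busy-flag pattern: only one
 * caller at a time wins the flag; the rest spin until it is cleared. */
#include <assert.h>
#include <stdatomic.h>

static atomic_flag fetch_busy = ATOMIC_FLAG_INIT;

/* Spin until we win the flag (the atomic test-and-set replaces the
 * lock/test/set/unlock sequence in the sketch above). */
static void fetch_begin(void)
{
	while (atomic_flag_test_and_set(&fetch_busy))
		;	/* cpu_relax() would go here */
}

/* ... do the load between fetch_begin() and fetch_end() ... */

static void fetch_end(void)
{
	atomic_flag_clear(&fetch_busy);
}
```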

	
 
> > sample log:
> > ----------
> > [4130907342,3] Running pollers with lock held !
> > CPU 08b0 Backtrace:
> >  S: 0000000033cc3990 R: 0000000030013520   .backtrace+0x2c
> >  S: 0000000033cc3a20 R: 0000000030017d9c   .opal_run_pollers+0x44
> >  S: 0000000033cc3aa0 R: 0000000030045f04   .fsp_sync_msg+0x40
> >  S: 0000000033cc3b30 R: 0000000030047840   .fsp_fetch_data+0x138
> >  S: 0000000033cc3c20 R: 0000000030021204   .hservices_lid_preload+0x148
> >  S: 0000000033cc3d10 R: 0000000030058038   .ibm_fsp_init+0xe4
> >  S: 0000000033cc3dc0 R: 0000000030058df8   .firenze_init+0x18
> >  S: 0000000033cc3e30 R: 00000000300141a4   .main_cpu_entry+0x3e0
> >  S: 0000000033cc3f00 R: 0000000030002534   boot_entry+0x18c
> > 
> > Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> > 
> > ---
> > Ben, Stewart,
> >   Presently we have two places (fsp_fetch_data & capp_lid_download) where
> >   we held lock and make opal_run_pollers call (via fsp_sync_msg).
> > 
> >   Earlier we had replaced __fsp_poll with opal_run_pollers (commit 488dc8a3).
> >   But I think its fine just to call __fsp_poll here. This seems to be working.
> >   But needs to be reviewed before merging.
> > 
> >   I tried other alternative approach we discussed earlier. (introduce queue in
> >   fsp_fetch_data). This requies quite a bit of work in callback function.
> 
> This explanation needs to go in the commit message.
> 
> > ---
> >  hw/fsp/fsp.c |    5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/fsp/fsp.c b/hw/fsp/fsp.c
> > index e57c611..5fbc33f 100644
> > --- a/hw/fsp/fsp.c
> > +++ b/hw/fsp/fsp.c
> > @@ -1663,7 +1663,10 @@ int fsp_sync_msg(struct fsp_msg *msg, bool autofree)
> >  
> >  	while(fsp_msg_busy(msg)) {
> >  		cpu_relax();
> > -		opal_run_pollers();
> > +		lock(&fsp_lock);
> > +		__fsp_poll(false);
> > +		unlock(&fsp_lock);
> 
> You could just use fsp_opal_poll() instead.
> 
> Ananth
> 
diff mbox

Patch

diff --git a/hw/fsp/fsp.c b/hw/fsp/fsp.c
index e57c611..5fbc33f 100644
--- a/hw/fsp/fsp.c
+++ b/hw/fsp/fsp.c
@@ -1663,7 +1663,10 @@  int fsp_sync_msg(struct fsp_msg *msg, bool autofree)
 
 	while(fsp_msg_busy(msg)) {
 		cpu_relax();
-		opal_run_pollers();
+		lock(&fsp_lock);
+		__fsp_poll(false);
+		unlock(&fsp_lock);
 	}
 
 	switch(msg->state) {