diff mbox series

[v3,1/6] vfio-ccw: make it safe to access channel programs

Message ID 20190130132212.7376-2-cohuck@redhat.com
State New
Headers show
Series vfio-ccw: support hsch/csch (kernel part) | expand

Commit Message

Cornelia Huck Jan. 30, 2019, 1:22 p.m. UTC
When we get a solicited interrupt, the start function may have
been cleared by a csch, but we still have a channel program
structure allocated. Make it safe to call the cp accessors in
any case, so we can call them unconditionally.

While at it, also make sure that functions called from other parts
of the code return gracefully if the channel program structure
has not been initialized (even though that is a bug in the caller).

Signed-off-by: Cornelia Huck <cohuck@redhat.com>
---
 drivers/s390/cio/vfio_ccw_cp.c  | 20 +++++++++++++++++++-
 drivers/s390/cio/vfio_ccw_cp.h  |  2 ++
 drivers/s390/cio/vfio_ccw_fsm.c |  5 +++++
 3 files changed, 26 insertions(+), 1 deletion(-)

Comments

Halil Pasic Jan. 30, 2019, 6:51 p.m. UTC | #1
On Wed, 30 Jan 2019 14:22:07 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> When we get a solicited interrupt, the start function may have
> been cleared by a csch, but we still have a channel program
> structure allocated. Make it safe to call the cp accessors in
> any case, so we can call them unconditionally.

I read this like it is supposed to be safe regardless of
parallelism and threads. However I don't see any explicit
synchronization done for cp->initialized.

I've managed to figure out how is that supposed to be safe
for the cp_free() (which is probably our main concern) in
vfio_ccw_sch_io_todo(), but if fail when it comes to the one
in vfio_ccw_mdev_notifier().

Can you explain us how does the synchronization work?

Regards,
Halil
Cornelia Huck Jan. 31, 2019, 11:52 a.m. UTC | #2
On Wed, 30 Jan 2019 19:51:27 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Wed, 30 Jan 2019 14:22:07 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > When we get a solicited interrupt, the start function may have
> > been cleared by a csch, but we still have a channel program
> > structure allocated. Make it safe to call the cp accessors in
> > any case, so we can call them unconditionally.  
> 
> I read this like it is supposed to be safe regardless of
> parallelism and threads. However I don't see any explicit
> synchronization done for cp->initialized.
> 
> I've managed to figure out how is that supposed to be safe
> for the cp_free() (which is probably our main concern) in
> vfio_ccw_sch_io_todo(), but if fail when it comes to the one
> in vfio_ccw_mdev_notifier().
> 
> Can you explain us how does the synchronization work?

You read that wrong, I don't add synchronization, I just add a check.
Halil Pasic Jan. 31, 2019, 12:34 p.m. UTC | #3
On Thu, 31 Jan 2019 12:52:20 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Wed, 30 Jan 2019 19:51:27 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> > On Wed, 30 Jan 2019 14:22:07 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> > 
> > > When we get a solicited interrupt, the start function may have
> > > been cleared by a csch, but we still have a channel program
> > > structure allocated. Make it safe to call the cp accessors in
> > > any case, so we can call them unconditionally.  
> > 
> > I read this like it is supposed to be safe regardless of
> > parallelism and threads. However I don't see any explicit
> > synchronization done for cp->initialized.
> > 
> > I've managed to figure out how is that supposed to be safe
> > for the cp_free() (which is probably our main concern) in
> > vfio_ccw_sch_io_todo(), but if fail when it comes to the one
> > in vfio_ccw_mdev_notifier().
> > 
> > Can you explain us how does the synchronization work?
> 
> You read that wrong, I don't add synchronization, I just add a check.
> 

Now I'm confused. Does that mean we don't need synchronization for this?

Regards,
Halil
Cornelia Huck Feb. 4, 2019, 3:31 p.m. UTC | #4
On Thu, 31 Jan 2019 13:34:55 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Thu, 31 Jan 2019 12:52:20 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Wed, 30 Jan 2019 19:51:27 +0100
> > Halil Pasic <pasic@linux.ibm.com> wrote:
> >   
> > > On Wed, 30 Jan 2019 14:22:07 +0100
> > > Cornelia Huck <cohuck@redhat.com> wrote:
> > >   
> > > > When we get a solicited interrupt, the start function may have
> > > > been cleared by a csch, but we still have a channel program
> > > > structure allocated. Make it safe to call the cp accessors in
> > > > any case, so we can call them unconditionally.    
> > > 
> > > I read this like it is supposed to be safe regardless of
> > > parallelism and threads. However I don't see any explicit
> > > synchronization done for cp->initialized.
> > > 
> > > I've managed to figure out how is that supposed to be safe
> > > for the cp_free() (which is probably our main concern) in
> > > vfio_ccw_sch_io_todo(), but if fail when it comes to the one
> > > in vfio_ccw_mdev_notifier().
> > > 
> > > Can you explain us how does the synchronization work?  
> > 
> > You read that wrong, I don't add synchronization, I just add a check.
> >   
> 
> Now I'm confused. Does that mean we don't need synchronization for this?

If we lack synchronization (that is not provided by the current state
machine handling, or the rework here), we should do a patch on top
(preferably on top of the whole series, so this does not get even more
tangled up.) This is really just about the extra check.
Eric Farman Feb. 4, 2019, 7:25 p.m. UTC | #5
On 01/30/2019 08:22 AM, Cornelia Huck wrote:
> When we get a solicited interrupt, the start function may have
> been cleared by a csch, but we still have a channel program
> structure allocated. Make it safe to call the cp accessors in
> any case, so we can call them unconditionally.
> 
> While at it, also make sure that functions called from other parts
> of the code return gracefully if the channel program structure
> has not been initialized (even though that is a bug in the caller).
> 
> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> ---
>   drivers/s390/cio/vfio_ccw_cp.c  | 20 +++++++++++++++++++-
>   drivers/s390/cio/vfio_ccw_cp.h  |  2 ++
>   drivers/s390/cio/vfio_ccw_fsm.c |  5 +++++
>   3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> index ba08fe137c2e..0bc0c38edda7 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.c
> +++ b/drivers/s390/cio/vfio_ccw_cp.c
> @@ -335,6 +335,7 @@ static void cp_unpin_free(struct channel_program *cp)
>   	struct ccwchain *chain, *temp;
>   	int i;
>   
> +	cp->initialized = false;
>   	list_for_each_entry_safe(chain, temp, &cp->ccwchain_list, next) {
>   		for (i = 0; i < chain->ch_len; i++) {
>   			pfn_array_table_unpin_free(chain->ch_pat + i,
> @@ -701,6 +702,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
>   	 */
>   	cp->orb.cmd.c64 = 1;
>   
> +	cp->initialized = true;
> +

Not seen in this hunk, but we call ccwchain_loop_tic() just prior to 
this point.  If that returns non-zero, we call cp_unpin_free()[1] (and 
set initailized to false), and then fall through to here.  So this is 
going to set initialized to true, even though we're taking an error 
path.  :-(

[1] Wait, why is it calling cp_unpin_free()?  Oh, I had proposed 
squashing cp_free() and cp_unpin_free() back in November[2], got an r-b 
from Pierre but haven't gotten back to tidy up the series for a v2. 
Okay, I'll try to do that again soon.  :-)
[2] https://patchwork.kernel.org/patch/10675261/

>   	return ret;
>   }
>   
> @@ -715,7 +718,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
>    */
>   void cp_free(struct channel_program *cp)
>   {
> -	cp_unpin_free(cp);
> +	if (cp->initialized)
> +		cp_unpin_free(cp);
>   }
>   
>   /**
> @@ -760,6 +764,10 @@ int cp_prefetch(struct channel_program *cp)
>   	struct ccwchain *chain;
>   	int len, idx, ret;
>   
> +	/* this is an error in the caller */
> +	if (!cp || !cp->initialized)
> +		return -EINVAL;
> +
>   	list_for_each_entry(chain, &cp->ccwchain_list, next) {
>   		len = chain->ch_len;
>   		for (idx = 0; idx < len; idx++) {
> @@ -795,6 +803,10 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
>   	struct ccwchain *chain;
>   	struct ccw1 *cpa;
>   
> +	/* this is an error in the caller */
> +	if (!cp || !cp->initialized)
> +		return NULL;
> +
>   	orb = &cp->orb;
>   
>   	orb->cmd.intparm = intparm;
> @@ -831,6 +843,9 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>   	u32 cpa = scsw->cmd.cpa;
>   	u32 ccw_head, ccw_tail;
>   
> +	if (!cp->initialized)
> +		return;
> +
>   	/*
>   	 * LATER:
>   	 * For now, only update the cmd.cpa part. We may need to deal with
> @@ -869,6 +884,9 @@ bool cp_iova_pinned(struct channel_program *cp, u64 iova)
>   	struct ccwchain *chain;
>   	int i;
>   
> +	if (!cp->initialized)

So, two of the checks added above look for a nonzero cp pointer prior to 
checking initialized, while two don't.  I guess cp can't be NULL, since 
it's embedded in the private struct directly and that's only free'd when 
we do vfio_ccw_sch_remove() ... But I guess some consistency in how we 
look would be nice.

> +		return false;
> +
>   	list_for_each_entry(chain, &cp->ccwchain_list, next) {
>   		for (i = 0; i < chain->ch_len; i++)
>   			if (pfn_array_table_iova_pinned(chain->ch_pat + i,
> diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
> index a4b74fb1aa57..3c20cd208da5 100644
> --- a/drivers/s390/cio/vfio_ccw_cp.h
> +++ b/drivers/s390/cio/vfio_ccw_cp.h
> @@ -21,6 +21,7 @@
>    * @ccwchain_list: list head of ccwchains
>    * @orb: orb for the currently processed ssch request
>    * @mdev: the mediated device to perform page pinning/unpinning
> + * @initialized: whether this instance is actually initialized
>    *
>    * @ccwchain_list is the head of a ccwchain list, that contents the
>    * translated result of the guest channel program that pointed out by
> @@ -30,6 +31,7 @@ struct channel_program {
>   	struct list_head ccwchain_list;
>   	union orb orb;
>   	struct device *mdev;
> +	bool initialized;
>   };
>   
>   extern int cp_init(struct channel_program *cp, struct device *mdev,
> diff --git a/drivers/s390/cio/vfio_ccw_fsm.c b/drivers/s390/cio/vfio_ccw_fsm.c
> index cab17865aafe..e7c9877c9f1e 100644
> --- a/drivers/s390/cio/vfio_ccw_fsm.c
> +++ b/drivers/s390/cio/vfio_ccw_fsm.c
> @@ -31,6 +31,10 @@ static int fsm_io_helper(struct vfio_ccw_private *private)
>   	private->state = VFIO_CCW_STATE_BUSY;
>   
>   	orb = cp_get_orb(&private->cp, (u32)(addr_t)sch, sch->lpm);
> +	if (!orb) {
> +		ret = -EIO;
> +		goto out;
> +	}
>   
>   	/* Issue "Start Subchannel" */
>   	ccode = ssch(sch->schid, orb);
> @@ -64,6 +68,7 @@ static int fsm_io_helper(struct vfio_ccw_private *private)
>   	default:
>   		ret = ccode;
>   	}
> +out:
>   	spin_unlock_irqrestore(sch->lock, flags);
>   	return ret;
>   }
>
Halil Pasic Feb. 5, 2019, 11:52 a.m. UTC | #6
On Mon, 4 Feb 2019 16:31:02 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Thu, 31 Jan 2019 13:34:55 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> > On Thu, 31 Jan 2019 12:52:20 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> > 
> > > On Wed, 30 Jan 2019 19:51:27 +0100
> > > Halil Pasic <pasic@linux.ibm.com> wrote:
> > >   
> > > > On Wed, 30 Jan 2019 14:22:07 +0100
> > > > Cornelia Huck <cohuck@redhat.com> wrote:
> > > >   
> > > > > When we get a solicited interrupt, the start function may have
> > > > > been cleared by a csch, but we still have a channel program
> > > > > structure allocated. Make it safe to call the cp accessors in
> > > > > any case, so we can call them unconditionally.    
> > > > 
> > > > I read this like it is supposed to be safe regardless of
> > > > parallelism and threads. However I don't see any explicit
> > > > synchronization done for cp->initialized.
> > > > 
> > > > I've managed to figure out how is that supposed to be safe
> > > > for the cp_free() (which is probably our main concern) in
> > > > vfio_ccw_sch_io_todo(), but if fail when it comes to the one
> > > > in vfio_ccw_mdev_notifier().
> > > > 
> > > > Can you explain us how does the synchronization work?  
> > > 
> > > You read that wrong, I don't add synchronization, I just add a check.
> > >   
> > 
> > Now I'm confused. Does that mean we don't need synchronization for this?
> 
> If we lack synchronization (that is not provided by the current state
> machine handling, or the rework here), we should do a patch on top
> (preferably on top of the whole series, so this does not get even more
> tangled up.) This is really just about the extra check.
> 

I'm not a huge fan of keeping or introducing races -- it makes things
difficult to reason about, but I do have some understanging your
position.

This patch-series is AFAICT a big improvement over what we have. I would
like Farhan confirming that it makes these hick-ups when he used to hit
BUSY with another ssch request disappear. If it does (I hope it does)
it's definitely a good thing for anybody who wants to use vfio-ccw.

Yet I find it difficult to slap my r-b over racy code, or partial
solutions. In the latter case, when I lack conceptual clarity, I find it
difficult to tell if we are heading into the right direction, or is what
we build today going to turn against us tomorrow. Sorry for being a drag.

Regards,
Halil
Cornelia Huck Feb. 5, 2019, 12:03 p.m. UTC | #7
On Mon, 4 Feb 2019 14:25:34 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 01/30/2019 08:22 AM, Cornelia Huck wrote:
> > When we get a solicited interrupt, the start function may have
> > been cleared by a csch, but we still have a channel program
> > structure allocated. Make it safe to call the cp accessors in
> > any case, so we can call them unconditionally.
> > 
> > While at it, also make sure that functions called from other parts
> > of the code return gracefully if the channel program structure
> > has not been initialized (even though that is a bug in the caller).
> > 
> > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > ---
> >   drivers/s390/cio/vfio_ccw_cp.c  | 20 +++++++++++++++++++-
> >   drivers/s390/cio/vfio_ccw_cp.h  |  2 ++
> >   drivers/s390/cio/vfio_ccw_fsm.c |  5 +++++
> >   3 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
> > index ba08fe137c2e..0bc0c38edda7 100644
> > --- a/drivers/s390/cio/vfio_ccw_cp.c
> > +++ b/drivers/s390/cio/vfio_ccw_cp.c
> > @@ -335,6 +335,7 @@ static void cp_unpin_free(struct channel_program *cp)
> >   	struct ccwchain *chain, *temp;
> >   	int i;
> >   
> > +	cp->initialized = false;
> >   	list_for_each_entry_safe(chain, temp, &cp->ccwchain_list, next) {
> >   		for (i = 0; i < chain->ch_len; i++) {
> >   			pfn_array_table_unpin_free(chain->ch_pat + i,
> > @@ -701,6 +702,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
> >   	 */
> >   	cp->orb.cmd.c64 = 1;
> >   
> > +	cp->initialized = true;
> > +  
> 
> Not seen in this hunk, but we call ccwchain_loop_tic() just prior to 
> this point.  If that returns non-zero, we call cp_unpin_free()[1] (and 
> set initailized to false), and then fall through to here.  So this is 
> going to set initialized to true, even though we're taking an error 
> path.  :-(

Eek, setting c64 unconditionally threw me off. This needs to check
for !ret, of course.

> 
> [1] Wait, why is it calling cp_unpin_free()?  Oh, I had proposed 
> squashing cp_free() and cp_unpin_free() back in November[2], got an r-b 
> from Pierre but haven't gotten back to tidy up the series for a v2. 
> Okay, I'll try to do that again soon.  :-)

:)

> [2] https://patchwork.kernel.org/patch/10675261/
> 
> >   	return ret;
> >   }
> >   
> > @@ -715,7 +718,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
> >    */
> >   void cp_free(struct channel_program *cp)
> >   {
> > -	cp_unpin_free(cp);
> > +	if (cp->initialized)
> > +		cp_unpin_free(cp);
> >   }
> >   
> >   /**
> > @@ -760,6 +764,10 @@ int cp_prefetch(struct channel_program *cp)
> >   	struct ccwchain *chain;
> >   	int len, idx, ret;
> >   
> > +	/* this is an error in the caller */
> > +	if (!cp || !cp->initialized)
> > +		return -EINVAL;
> > +
> >   	list_for_each_entry(chain, &cp->ccwchain_list, next) {
> >   		len = chain->ch_len;
> >   		for (idx = 0; idx < len; idx++) {
> > @@ -795,6 +803,10 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
> >   	struct ccwchain *chain;
> >   	struct ccw1 *cpa;
> >   
> > +	/* this is an error in the caller */
> > +	if (!cp || !cp->initialized)
> > +		return NULL;
> > +
> >   	orb = &cp->orb;
> >   
> >   	orb->cmd.intparm = intparm;
> > @@ -831,6 +843,9 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
> >   	u32 cpa = scsw->cmd.cpa;
> >   	u32 ccw_head, ccw_tail;
> >   
> > +	if (!cp->initialized)
> > +		return;
> > +
> >   	/*
> >   	 * LATER:
> >   	 * For now, only update the cmd.cpa part. We may need to deal with
> > @@ -869,6 +884,9 @@ bool cp_iova_pinned(struct channel_program *cp, u64 iova)
> >   	struct ccwchain *chain;
> >   	int i;
> >   
> > +	if (!cp->initialized)  
> 
> So, two of the checks added above look for a nonzero cp pointer prior to 
> checking initialized, while two don't.  I guess cp can't be NULL, since 
> it's embedded in the private struct directly and that's only free'd when 
> we do vfio_ccw_sch_remove() ... But I guess some consistency in how we 
> look would be nice.

The idea was: In which context is this called? Is there a legitimate
reason for the caller to pass in an uninitialized cp, or would that
mean the caller had messed up (and we should not trust cp to be !NULL
either?)

But you're right, that does look inconsistent. Always checking for
cp != NULL probably looks least odd, although it is overkill. Opinions?

> 
> > +		return false;
> > +
> >   	list_for_each_entry(chain, &cp->ccwchain_list, next) {
> >   		for (i = 0; i < chain->ch_len; i++)
> >   			if (pfn_array_table_iova_pinned(chain->ch_pat + i,
Cornelia Huck Feb. 5, 2019, 12:35 p.m. UTC | #8
On Tue, 5 Feb 2019 12:52:29 +0100
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Mon, 4 Feb 2019 16:31:02 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Thu, 31 Jan 2019 13:34:55 +0100
> > Halil Pasic <pasic@linux.ibm.com> wrote:
> >   
> > > On Thu, 31 Jan 2019 12:52:20 +0100
> > > Cornelia Huck <cohuck@redhat.com> wrote:
> > >   
> > > > On Wed, 30 Jan 2019 19:51:27 +0100
> > > > Halil Pasic <pasic@linux.ibm.com> wrote:
> > > >     
> > > > > On Wed, 30 Jan 2019 14:22:07 +0100
> > > > > Cornelia Huck <cohuck@redhat.com> wrote:
> > > > >     
> > > > > > When we get a solicited interrupt, the start function may have
> > > > > > been cleared by a csch, but we still have a channel program
> > > > > > structure allocated. Make it safe to call the cp accessors in
> > > > > > any case, so we can call them unconditionally.      
> > > > > 
> > > > > I read this like it is supposed to be safe regardless of
> > > > > parallelism and threads. However I don't see any explicit
> > > > > synchronization done for cp->initialized.
> > > > > 
> > > > > I've managed to figure out how is that supposed to be safe
> > > > > for the cp_free() (which is probably our main concern) in
> > > > > vfio_ccw_sch_io_todo(), but if fail when it comes to the one
> > > > > in vfio_ccw_mdev_notifier().
> > > > > 
> > > > > Can you explain us how does the synchronization work?    
> > > > 
> > > > You read that wrong, I don't add synchronization, I just add a check.
> > > >     
> > > 
> > > Now I'm confused. Does that mean we don't need synchronization for this?  
> > 
> > If we lack synchronization (that is not provided by the current state
> > machine handling, or the rework here), we should do a patch on top
> > (preferably on top of the whole series, so this does not get even more
> > tangled up.) This is really just about the extra check.
> >   
> 
> I'm not a huge fan of keeping or introducing races -- it makes things
> difficult to reason about, but I do have some understanging your
> position.

The only thing I want to avoid is knowingly making things worse than
before, and I don't think this patch does that.

> 
> This patch-series is AFAICT a big improvement over what we have. I would
> like Farhan confirming that it makes these hick-ups when he used to hit
> BUSY with another ssch request disappear. If it does (I hope it does)
> it's definitely a good thing for anybody who wants to use vfio-ccw.

Yep. There remains a lot to be done, but it's a first step.

> 
> Yet I find it difficult to slap my r-b over racy code, or partial
> solutions. In the latter case, when I lack conceptual clarity, I find it
> difficult to tell if we are heading into the right direction, or is what
> we build today going to turn against us tomorrow. Sorry for being a drag.

As long as we don't introduce bad user space interfaces we have to drag
around forever, I think anything is fair game if we think it's a good
idea at that moment. We can rewrite things if it turned out to be a bad
idea (although I'm not arguing for doing random crap, of course :)
Eric Farman Feb. 5, 2019, 2:41 p.m. UTC | #9
On 02/05/2019 07:03 AM, Cornelia Huck wrote:
> On Mon, 4 Feb 2019 14:25:34 -0500
> Eric Farman <farman@linux.ibm.com> wrote:
> 
>> On 01/30/2019 08:22 AM, Cornelia Huck wrote:
>>> When we get a solicited interrupt, the start function may have
>>> been cleared by a csch, but we still have a channel program
>>> structure allocated. Make it safe to call the cp accessors in
>>> any case, so we can call them unconditionally.
>>>
>>> While at it, also make sure that functions called from other parts
>>> of the code return gracefully if the channel program structure
>>> has not been initialized (even though that is a bug in the caller).
>>>
>>> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
>>> ---
>>>    drivers/s390/cio/vfio_ccw_cp.c  | 20 +++++++++++++++++++-
>>>    drivers/s390/cio/vfio_ccw_cp.h  |  2 ++
>>>    drivers/s390/cio/vfio_ccw_fsm.c |  5 +++++
>>>    3 files changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
>>> index ba08fe137c2e..0bc0c38edda7 100644
>>> --- a/drivers/s390/cio/vfio_ccw_cp.c
>>> +++ b/drivers/s390/cio/vfio_ccw_cp.c
>>> @@ -335,6 +335,7 @@ static void cp_unpin_free(struct channel_program *cp)
>>>    	struct ccwchain *chain, *temp;
>>>    	int i;
>>>    
>>> +	cp->initialized = false;
>>>    	list_for_each_entry_safe(chain, temp, &cp->ccwchain_list, next) {
>>>    		for (i = 0; i < chain->ch_len; i++) {
>>>    			pfn_array_table_unpin_free(chain->ch_pat + i,
>>> @@ -701,6 +702,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
>>>    	 */
>>>    	cp->orb.cmd.c64 = 1;
>>>    
>>> +	cp->initialized = true;
>>> +
>>
>> Not seen in this hunk, but we call ccwchain_loop_tic() just prior to
>> this point.  If that returns non-zero, we call cp_unpin_free()[1] (and
>> set initailized to false), and then fall through to here.  So this is
>> going to set initialized to true, even though we're taking an error
>> path.  :-(
> 
> Eek, setting c64 unconditionally threw me off. This needs to check
> for !ret, of course.
> 
>>
>> [1] Wait, why is it calling cp_unpin_free()?  Oh, I had proposed
>> squashing cp_free() and cp_unpin_free() back in November[2], got an r-b
>> from Pierre but haven't gotten back to tidy up the series for a v2.
>> Okay, I'll try to do that again soon.  :-)
> 
> :)
> 
>> [2] https://patchwork.kernel.org/patch/10675261/
>>
>>>    	return ret;
>>>    }
>>>    
>>> @@ -715,7 +718,8 @@ int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
>>>     */
>>>    void cp_free(struct channel_program *cp)
>>>    {
>>> -	cp_unpin_free(cp);
>>> +	if (cp->initialized)
>>> +		cp_unpin_free(cp);
>>>    }
>>>    
>>>    /**
>>> @@ -760,6 +764,10 @@ int cp_prefetch(struct channel_program *cp)
>>>    	struct ccwchain *chain;
>>>    	int len, idx, ret;
>>>    
>>> +	/* this is an error in the caller */
>>> +	if (!cp || !cp->initialized)
>>> +		return -EINVAL;
>>> +
>>>    	list_for_each_entry(chain, &cp->ccwchain_list, next) {
>>>    		len = chain->ch_len;
>>>    		for (idx = 0; idx < len; idx++) {
>>> @@ -795,6 +803,10 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
>>>    	struct ccwchain *chain;
>>>    	struct ccw1 *cpa;
>>>    
>>> +	/* this is an error in the caller */
>>> +	if (!cp || !cp->initialized)
>>> +		return NULL;
>>> +
>>>    	orb = &cp->orb;
>>>    
>>>    	orb->cmd.intparm = intparm;
>>> @@ -831,6 +843,9 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
>>>    	u32 cpa = scsw->cmd.cpa;
>>>    	u32 ccw_head, ccw_tail;
>>>    
>>> +	if (!cp->initialized)
>>> +		return;
>>> +
>>>    	/*
>>>    	 * LATER:
>>>    	 * For now, only update the cmd.cpa part. We may need to deal with
>>> @@ -869,6 +884,9 @@ bool cp_iova_pinned(struct channel_program *cp, u64 iova)
>>>    	struct ccwchain *chain;
>>>    	int i;
>>>    
>>> +	if (!cp->initialized)
>>
>> So, two of the checks added above look for a nonzero cp pointer prior to
>> checking initialized, while two don't.  I guess cp can't be NULL, since
>> it's embedded in the private struct directly and that's only free'd when
>> we do vfio_ccw_sch_remove() ... But I guess some consistency in how we
>> look would be nice.
> 
> The idea was: In which context is this called? Is there a legitimate
> reason for the caller to pass in an uninitialized cp, or would that
> mean the caller had messed up (and we should not trust cp to be !NULL
> either?)
> 
> But you're right, that does look inconsistent. Always checking for
> cp != NULL probably looks least odd, although it is overkill. Opinions?

My opinion?  Since cp is embedded in vfio_ccw_private, rather than a 
pointer to a separately malloc'd struct, we pass &private->cp to those 
functions.  So a check for !cp doesn't really buy us anything because 
what we are actually concerned about is whether or not private is NULL, 
which only changes on the probe/remove boundaries.

> 
>>
>>> +		return false;
>>> +
>>>    	list_for_each_entry(chain, &cp->ccwchain_list, next) {
>>>    		for (i = 0; i < chain->ch_len; i++)
>>>    			if (pfn_array_table_iova_pinned(chain->ch_pat + i,
>
Eric Farman Feb. 5, 2019, 2:48 p.m. UTC | #10
On 02/05/2019 07:35 AM, Cornelia Huck wrote:
> On Tue, 5 Feb 2019 12:52:29 +0100
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>> On Mon, 4 Feb 2019 16:31:02 +0100
>> Cornelia Huck <cohuck@redhat.com> wrote:
>>
>>> On Thu, 31 Jan 2019 13:34:55 +0100
>>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>>    
>>>> On Thu, 31 Jan 2019 12:52:20 +0100
>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>    
>>>>> On Wed, 30 Jan 2019 19:51:27 +0100
>>>>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>>      
>>>>>> On Wed, 30 Jan 2019 14:22:07 +0100
>>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>>      
>>>>>>> When we get a solicited interrupt, the start function may have
>>>>>>> been cleared by a csch, but we still have a channel program
>>>>>>> structure allocated. Make it safe to call the cp accessors in
>>>>>>> any case, so we can call them unconditionally.
>>>>>>
>>>>>> I read this like it is supposed to be safe regardless of
>>>>>> parallelism and threads. However I don't see any explicit
>>>>>> synchronization done for cp->initialized.
>>>>>>
>>>>>> I've managed to figure out how is that supposed to be safe
>>>>>> for the cp_free() (which is probably our main concern) in
>>>>>> vfio_ccw_sch_io_todo(), but if fail when it comes to the one
>>>>>> in vfio_ccw_mdev_notifier().
>>>>>>
>>>>>> Can you explain us how does the synchronization work?
>>>>>
>>>>> You read that wrong, I don't add synchronization, I just add a check.
>>>>>      
>>>>
>>>> Now I'm confused. Does that mean we don't need synchronization for this?
>>>
>>> If we lack synchronization (that is not provided by the current state
>>> machine handling, or the rework here), we should do a patch on top
>>> (preferably on top of the whole series, so this does not get even more
>>> tangled up.) This is really just about the extra check.
>>>    
>>
>> I'm not a huge fan of keeping or introducing races -- it makes things
>> difficult to reason about, but I do have some understanging your
>> position.
> 
> The only thing I want to avoid is knowingly making things worse than
> before, and I don't think this patch does that.
> 
>>
>> This patch-series is AFAICT a big improvement over what we have. I would
>> like Farhan confirming that it makes these hick-ups when he used to hit
>> BUSY with another ssch request disappear. If it does (I hope it does)
>> it's definitely a good thing for anybody who wants to use vfio-ccw.
> 
> Yep. There remains a lot to be done, but it's a first step.

s/a first step/an excellent first step/  :)

Can't speak for Farhan, but this makes things somewhat better for me. 
I'm still getting some periodic errors, but they happen infrequently 
enough now that debugging them is frustrating.  ;-)

  - Eric

> 
>>
>> Yet I find it difficult to slap my r-b over racy code, or partial
>> solutions. In the latter case, when I lack conceptual clarity, I find it
>> difficult to tell if we are heading into the right direction, or is what
>> we build today going to turn against us tomorrow. Sorry for being a drag.
> 
> As long as we don't introduce bad user space interfaces we have to drag
> around forever, I think anything is fair game if we think it's a good
> idea at that moment. We can rewrite things if it turned out to be a bad
> idea (although I'm not arguing for doing random crap, of course :)
>
Farhan Ali Feb. 5, 2019, 3:14 p.m. UTC | #11
On 02/05/2019 09:48 AM, Eric Farman wrote:
> 
> 
> On 02/05/2019 07:35 AM, Cornelia Huck wrote:
>> On Tue, 5 Feb 2019 12:52:29 +0100
>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> On Mon, 4 Feb 2019 16:31:02 +0100
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>
>>>> On Thu, 31 Jan 2019 13:34:55 +0100
>>>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>> On Thu, 31 Jan 2019 12:52:20 +0100
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>> On Wed, 30 Jan 2019 19:51:27 +0100
>>>>>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>>>>>> On Wed, 30 Jan 2019 14:22:07 +0100
>>>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>>>> When we get a solicited interrupt, the start function may have
>>>>>>>> been cleared by a csch, but we still have a channel program
>>>>>>>> structure allocated. Make it safe to call the cp accessors in
>>>>>>>> any case, so we can call them unconditionally.
>>>>>>>
>>>>>>> I read this like it is supposed to be safe regardless of
>>>>>>> parallelism and threads. However I don't see any explicit
>>>>>>> synchronization done for cp->initialized.
>>>>>>>
>>>>>>> I've managed to figure out how is that supposed to be safe
>>>>>>> for the cp_free() (which is probably our main concern) in
>>>>>>> vfio_ccw_sch_io_todo(), but if fail when it comes to the one
>>>>>>> in vfio_ccw_mdev_notifier().
>>>>>>>
>>>>>>> Can you explain us how does the synchronization work?
>>>>>>
>>>>>> You read that wrong, I don't add synchronization, I just add a check.
>>>>>
>>>>> Now I'm confused. Does that mean we don't need synchronization for 
>>>>> this?
>>>>
>>>> If we lack synchronization (that is not provided by the current state
>>>> machine handling, or the rework here), we should do a patch on top
>>>> (preferably on top of the whole series, so this does not get even more
>>>> tangled up.) This is really just about the extra check.
>>>
>>> I'm not a huge fan of keeping or introducing races -- it makes things
>>> difficult to reason about, but I do have some understanging your
>>> position.
>>
>> The only thing I want to avoid is knowingly making things worse than
>> before, and I don't think this patch does that.
>>
>>>
>>> This patch-series is AFAICT a big improvement over what we have. I would
>>> like Farhan confirming that it makes these hick-ups when he used to hit
>>> BUSY with another ssch request disappear. If it does (I hope it does)
>>> it's definitely a good thing for anybody who wants to use vfio-ccw.
>>
>> Yep. There remains a lot to be done, but it's a first step.
> 
> s/a first step/an excellent first step/  :)
> 
> Can't speak for Farhan, but this makes things somewhat better for me. 
> I'm still getting some periodic errors, but they happen infrequently 
> enough now that debugging them is frustrating.  ;-)
> 
>   - Eric
> 

I ran the my workloads/tests with the patches and like Eric I notice the 
errors I previously hit less frequently.


>>
>>>
>>> Yet I find it difficult to slap my r-b over racy code, or partial
>>> solutions. In the latter case, when I lack conceptual clarity, I find it
>>> difficult to tell if we are heading into the right direction, or is what
>>> we build today going to turn against us tomorrow. Sorry for being a 
>>> drag.
>>
>> As long as we don't introduce bad user space interfaces we have to drag
>> around forever, I think anything is fair game if we think it's a good
>> idea at that moment. We can rewrite things if it turned out to be a bad
>> idea (although I'm not arguing for doing random crap, of course :)
>>
>
Cornelia Huck Feb. 5, 2019, 4:13 p.m. UTC | #12
On Tue, 5 Feb 2019 10:14:46 -0500
Farhan Ali <alifm@linux.ibm.com> wrote:

> On 02/05/2019 09:48 AM, Eric Farman wrote:

> >>> This patch-series is AFAICT a big improvement over what we have. I would
> >>> like Farhan confirming that it makes these hick-ups when he used to hit
> >>> BUSY with another ssch request disappear. If it does (I hope it does)
> >>> it's definitely a good thing for anybody who wants to use vfio-ccw.  
> >>
> >> Yep. There remains a lot to be done, but it's a first step.  
> > 
> > s/a first step/an excellent first step/  :)
> > 
> > Can't speak for Farhan, but this makes things somewhat better for me. 
> > I'm still getting some periodic errors, but they happen infrequently 
> > enough now that debugging them is frustrating.  ;-)
> > 
> >   - Eric
> >   
> 
> I ran the my workloads/tests with the patches and like Eric I notice the 
> errors I previously hit less frequently.

Great, thanks for testing!
Cornelia Huck Feb. 5, 2019, 4:29 p.m. UTC | #13
On Tue, 5 Feb 2019 09:41:15 -0500
Eric Farman <farman@linux.ibm.com> wrote:

> On 02/05/2019 07:03 AM, Cornelia Huck wrote:
> > On Mon, 4 Feb 2019 14:25:34 -0500
> > Eric Farman <farman@linux.ibm.com> wrote:
> >   
> >> On 01/30/2019 08:22 AM, Cornelia Huck wrote:  

> >>> @@ -760,6 +764,10 @@ int cp_prefetch(struct channel_program *cp)
> >>>    	struct ccwchain *chain;
> >>>    	int len, idx, ret;
> >>>    
> >>> +	/* this is an error in the caller */
> >>> +	if (!cp || !cp->initialized)
> >>> +		return -EINVAL;
> >>> +
> >>>    	list_for_each_entry(chain, &cp->ccwchain_list, next) {
> >>>    		len = chain->ch_len;
> >>>    		for (idx = 0; idx < len; idx++) {
> >>> @@ -795,6 +803,10 @@ union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
> >>>    	struct ccwchain *chain;
> >>>    	struct ccw1 *cpa;
> >>>    
> >>> +	/* this is an error in the caller */
> >>> +	if (!cp || !cp->initialized)
> >>> +		return NULL;
> >>> +
> >>>    	orb = &cp->orb;
> >>>    
> >>>    	orb->cmd.intparm = intparm;
> >>> @@ -831,6 +843,9 @@ void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
> >>>    	u32 cpa = scsw->cmd.cpa;
> >>>    	u32 ccw_head, ccw_tail;
> >>>    
> >>> +	if (!cp->initialized)
> >>> +		return;
> >>> +
> >>>    	/*
> >>>    	 * LATER:
> >>>    	 * For now, only update the cmd.cpa part. We may need to deal with
> >>> @@ -869,6 +884,9 @@ bool cp_iova_pinned(struct channel_program *cp, u64 iova)
> >>>    	struct ccwchain *chain;
> >>>    	int i;
> >>>    
> >>> +	if (!cp->initialized)  
> >>
> >> So, two of the checks added above look for a nonzero cp pointer prior to
> >> checking initialized, while two don't.  I guess cp can't be NULL, since
> >> it's embedded in the private struct directly and that's only free'd when
> >> we do vfio_ccw_sch_remove() ... But I guess some consistency in how we
> >> look would be nice.  
> > 
> > The idea was: In which context is this called? Is there a legitimate
> > reason for the caller to pass in an uninitialized cp, or would that
> > mean the caller had messed up (and we should not trust cp to be !NULL
> > either?)
> > 
> > But you're right, that does look inconsistent. Always checking for
> > cp != NULL probably looks least odd, although it is overkill. Opinions?  
> 
> My opinion?  Since cp is embedded in vfio_ccw_private, rather than a 
> pointer to a separately malloc'd struct, we pass &private->cp to those 
> functions.  So a check for !cp doesn't really buy us anything because 
> what we are actually concerned about is whether or not private is NULL, 
> which only changes on the probe/remove boundaries.

I guess if we pass in crap (or NULL) instead of &private->cp, it's our
own fault and we can disregard fencing that case. The probe/remove path
does not really bother me, for the reasons you said.
diff mbox series

Patch

diff --git a/drivers/s390/cio/vfio_ccw_cp.c b/drivers/s390/cio/vfio_ccw_cp.c
index ba08fe137c2e..0bc0c38edda7 100644
--- a/drivers/s390/cio/vfio_ccw_cp.c
+++ b/drivers/s390/cio/vfio_ccw_cp.c
@@ -335,6 +335,7 @@  static void cp_unpin_free(struct channel_program *cp)
 	struct ccwchain *chain, *temp;
 	int i;
 
+	cp->initialized = false;
 	list_for_each_entry_safe(chain, temp, &cp->ccwchain_list, next) {
 		for (i = 0; i < chain->ch_len; i++) {
 			pfn_array_table_unpin_free(chain->ch_pat + i,
@@ -701,6 +702,8 @@  int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
 	 */
 	cp->orb.cmd.c64 = 1;
 
+	cp->initialized = true;
+
 	return ret;
 }
 
@@ -715,7 +718,8 @@  int cp_init(struct channel_program *cp, struct device *mdev, union orb *orb)
  */
 void cp_free(struct channel_program *cp)
 {
-	cp_unpin_free(cp);
+	if (cp->initialized)
+		cp_unpin_free(cp);
 }
 
 /**
@@ -760,6 +764,10 @@  int cp_prefetch(struct channel_program *cp)
 	struct ccwchain *chain;
 	int len, idx, ret;
 
+	/* this is an error in the caller */
+	if (!cp || !cp->initialized)
+		return -EINVAL;
+
 	list_for_each_entry(chain, &cp->ccwchain_list, next) {
 		len = chain->ch_len;
 		for (idx = 0; idx < len; idx++) {
@@ -795,6 +803,10 @@  union orb *cp_get_orb(struct channel_program *cp, u32 intparm, u8 lpm)
 	struct ccwchain *chain;
 	struct ccw1 *cpa;
 
+	/* this is an error in the caller */
+	if (!cp || !cp->initialized)
+		return NULL;
+
 	orb = &cp->orb;
 
 	orb->cmd.intparm = intparm;
@@ -831,6 +843,9 @@  void cp_update_scsw(struct channel_program *cp, union scsw *scsw)
 	u32 cpa = scsw->cmd.cpa;
 	u32 ccw_head, ccw_tail;
 
+	if (!cp->initialized)
+		return;
+
 	/*
 	 * LATER:
 	 * For now, only update the cmd.cpa part. We may need to deal with
@@ -869,6 +884,9 @@  bool cp_iova_pinned(struct channel_program *cp, u64 iova)
 	struct ccwchain *chain;
 	int i;
 
+	if (!cp->initialized)
+		return false;
+
 	list_for_each_entry(chain, &cp->ccwchain_list, next) {
 		for (i = 0; i < chain->ch_len; i++)
 			if (pfn_array_table_iova_pinned(chain->ch_pat + i,
diff --git a/drivers/s390/cio/vfio_ccw_cp.h b/drivers/s390/cio/vfio_ccw_cp.h
index a4b74fb1aa57..3c20cd208da5 100644
--- a/drivers/s390/cio/vfio_ccw_cp.h
+++ b/drivers/s390/cio/vfio_ccw_cp.h
@@ -21,6 +21,7 @@ 
  * @ccwchain_list: list head of ccwchains
  * @orb: orb for the currently processed ssch request
  * @mdev: the mediated device to perform page pinning/unpinning
+ * @initialized: whether this instance is actually initialized
  *
  * @ccwchain_list is the head of a ccwchain list, that contents the
  * translated result of the guest channel program that pointed out by
@@ -30,6 +31,7 @@  struct channel_program {
 	struct list_head ccwchain_list;
 	union orb orb;
 	struct device *mdev;
+	bool initialized;
 };
 
 extern int cp_init(struct channel_program *cp, struct device *mdev,
diff --git a/drivers/s390/cio/vfio_ccw_fsm.c b/drivers/s390/cio/vfio_ccw_fsm.c
index cab17865aafe..e7c9877c9f1e 100644
--- a/drivers/s390/cio/vfio_ccw_fsm.c
+++ b/drivers/s390/cio/vfio_ccw_fsm.c
@@ -31,6 +31,10 @@  static int fsm_io_helper(struct vfio_ccw_private *private)
 	private->state = VFIO_CCW_STATE_BUSY;
 
 	orb = cp_get_orb(&private->cp, (u32)(addr_t)sch, sch->lpm);
+	if (!orb) {
+		ret = -EIO;
+		goto out;
+	}
 
 	/* Issue "Start Subchannel" */
 	ccode = ssch(sch->schid, orb);
@@ -64,6 +68,7 @@  static int fsm_io_helper(struct vfio_ccw_private *private)
 	default:
 		ret = ccode;
 	}
+out:
 	spin_unlock_irqrestore(sch->lock, flags);
 	return ret;
 }