diff mbox series

spi: fsi: Implement a timeout for polling status

Message ID 20220317211426.38940-1-eajames@linux.ibm.com
State New
Headers show
Series spi: fsi: Implement a timeout for polling status | expand

Commit Message

Eddie James March 17, 2022, 9:14 p.m. UTC
The data transfer routines must poll the status register to
determine when more data can be shifted in or out. If the hardware
gets into a bad state, these polling loops may never exit. Prevent
this by returning an error if a timeout is exceeded.

Signed-off-by: Eddie James <eajames@linux.ibm.com>
---
 drivers/spi/spi-fsi.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Joel Stanley March 18, 2022, 4:19 a.m. UTC | #1
On Thu, 17 Mar 2022 at 21:14, Eddie James <eajames@linux.ibm.com> wrote:
>
> The data transfer routines must poll the status register to
> determine when more data can be shifted in or out. If the hardware
> gets into a bad state, these polling loops may never exit. Prevent
> this by returning an error if a timeout is exceeded.

This makes sense. We may even want to put this code in regardless.

However, I'm wondering why the code in fsi_spi_status didn't catch this.

> static int fsi_spi_status(struct fsi_spi *ctx, u64 *status, const char *dir)
> {
>        int rc = fsi_spi_read_reg(ctx, SPI_FSI_STATUS, status);

You mentioned the error condition is we get back 0xff. That means that
status will be 0xffff_ffff_ffff_ffff ?

Did you observe status being this value?

>        if (rc)
>                return rc;
>
>        if (*status & SPI_FSI_STATUS_ANY_ERROR) {

I think that we're checking against 0xffe0f000.

>                dev_err(ctx->dev, "%s error: %016llx\n", dir, *status);
>
>                rc = fsi_spi_reset(ctx);
>                if (rc)
>                        return rc;

Is the problem here? fsi_spi_reset writes to the clock config
registers, but doesn't read the status.

Obviously doing the writes causes a call to fsi_spi_check_status, but
that checks the FSI2SPI bridge, not the SPI master.

...but it doesn't matter, because we're either going to return an
error from the reset, or return EREMOTEIO, so there's no masking of
the error.

>
>                return -EREMOTEIO;
>        }
>
>        return 0;
> }


>
> Signed-off-by: Eddie James <eajames@linux.ibm.com>
> ---
>  drivers/spi/spi-fsi.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
> index b6c7467f0b59..d403a7a3021d 100644
> --- a/drivers/spi/spi-fsi.c
> +++ b/drivers/spi/spi-fsi.c
> @@ -25,6 +25,7 @@
>
>  #define SPI_FSI_BASE                   0x70000
>  #define SPI_FSI_INIT_TIMEOUT_MS                1000
> +#define SPI_FSI_STATUS_TIMEOUT_MS      100

Can you add a comment (or put something in the commit message) about
why you chose 100ms.

>  #define SPI_FSI_MAX_RX_SIZE            8
>  #define SPI_FSI_MAX_TX_SIZE            40
>
> @@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>                                  struct spi_transfer *transfer)
>  {
>         int rc = 0;
> +       unsigned long end;
>         u64 status = 0ULL;
>
>         if (transfer->tx_buf) {
> @@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>                         if (rc)
>                                 return rc;
>
> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>                         do {
>                                 rc = fsi_spi_status(ctx, &status, "TX");
>                                 if (rc)
>                                         return rc;
> +
> +                               if (time_after(jiffies, end))
> +                                       return -ETIMEDOUT;
>                         } while (status & SPI_FSI_STATUS_TDR_FULL);
>
>                         sent += nb;
> @@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>                 u8 *rx = transfer->rx_buf;
>
>                 while (transfer->len > recv) {
> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>                         do {
>                                 rc = fsi_spi_status(ctx, &status, "RX");
>                                 if (rc)
>                                         return rc;
> +
> +                               if (time_after(jiffies, end))
> +                                       return -ETIMEDOUT;
>                         } while (!(status & SPI_FSI_STATUS_RDR_FULL));
>
>                         rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
> --
> 2.27.0
>
Eddie James March 18, 2022, 2:10 p.m. UTC | #2
On 3/17/22 23:19, Joel Stanley wrote:
> On Thu, 17 Mar 2022 at 21:14, Eddie James <eajames@linux.ibm.com> wrote:
>> The data transfer routines must poll the status register to
>> determine when more data can be shifted in or out. If the hardware
>> gets into a bad state, these polling loops may never exit. Prevent
>> this by returning an error if a timeout is exceeded.
> This makes sense. We may even want to put this code in regardless.
>
> However, I'm wondering why the code in fsi_spi_status didn't catch this.


Same, which is why I thought the problem couldn't be happening here for 
a long time. See below with what I think is going on.


>
>> static int fsi_spi_status(struct fsi_spi *ctx, u64 *status, const char *dir)
>> {
>>         int rc = fsi_spi_read_reg(ctx, SPI_FSI_STATUS, status);
> You mentioned the error condition is we get back 0xff. That means that
> status will be 0xffff_ffff_ffff_ffff ?
>
> Did you observe status being this value?


No, I think my observation of 0xff is not universal. I suspect that 
while the CFAM is IN reset, 0xff is returned, but once it's been reset, 
valid (though maybe uninitialized) data is returned. I observed a status 
of 0x0001100000000000, which means the controller is idle, which makes 
sense since it's been reset. So the issue occurs if we start an 
operation before a CFAM reset and are waiting for it to complete during 
the CFAM reset, but we also don't get any failed/invalid data FSI 
operations during/after the reset (very timing dependent - the FSI 
master does lock during the reset but doesn't wait afterwards for the 
hardware to initialize).


>
>>         if (rc)
>>                 return rc;
>>
>>         if (*status & SPI_FSI_STATUS_ANY_ERROR) {
> I think that we're checking against 0xffe0f000.
>
>>                 dev_err(ctx->dev, "%s error: %016llx\n", dir, *status);
>>
>>                 rc = fsi_spi_reset(ctx);
>>                 if (rc)
>>                         return rc;
> Is the problem here? fsi_spi_reset writes to the clock config
> registers, but doesn't read the status.
>
> Obviously doing the writes causes a call to fsi_spi_check_status, but
> that checks the FSI2SPI bridge, not the SPI master.
>
> ...but it doesn't matter, because we're either going to return an
> error from the reset, or return EREMOTEIO, so there's no masking of
> the error.


Not sure I follow. I don't think we were hitting this path in this error 
scenario. Do you think we need to check the status after a reset? It 
should always be the same.


>
>>                 return -EREMOTEIO;
>>         }
>>
>>         return 0;
>> }
>
>> Signed-off-by: Eddie James <eajames@linux.ibm.com>
>> ---
>>   drivers/spi/spi-fsi.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
>> index b6c7467f0b59..d403a7a3021d 100644
>> --- a/drivers/spi/spi-fsi.c
>> +++ b/drivers/spi/spi-fsi.c
>> @@ -25,6 +25,7 @@
>>
>>   #define SPI_FSI_BASE                   0x70000
>>   #define SPI_FSI_INIT_TIMEOUT_MS                1000
>> +#define SPI_FSI_STATUS_TIMEOUT_MS      100
> Can you add a comment (or put something in the commit message) about
> why you chose 100ms.


Hm, sure, but I chose it pretty arbitrarily. I'm not sure how to choose 
something like this.


>
>>   #define SPI_FSI_MAX_RX_SIZE            8
>>   #define SPI_FSI_MAX_TX_SIZE            40
>>
>> @@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                                   struct spi_transfer *transfer)
>>   {
>>          int rc = 0;
>> +       unsigned long end;
>>          u64 status = 0ULL;
>>
>>          if (transfer->tx_buf) {
>> @@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                          if (rc)
>>                                  return rc;
>>
>> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>>                          do {
>>                                  rc = fsi_spi_status(ctx, &status, "TX");
>>                                  if (rc)
>>                                          return rc;
>> +
>> +                               if (time_after(jiffies, end))
>> +                                       return -ETIMEDOUT;
>>                          } while (status & SPI_FSI_STATUS_TDR_FULL);
>>
>>                          sent += nb;
>> @@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>>                  u8 *rx = transfer->rx_buf;
>>
>>                  while (transfer->len > recv) {
>> +                       end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>>                          do {
>>                                  rc = fsi_spi_status(ctx, &status, "RX");
>>                                  if (rc)
>>                                          return rc;
>> +
>> +                               if (time_after(jiffies, end))
>> +                                       return -ETIMEDOUT;
>>                          } while (!(status & SPI_FSI_STATUS_RDR_FULL));
>>
>>                          rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
>> --
>> 2.27.0
>>
Mark Brown March 18, 2022, 8:58 p.m. UTC | #3
On Thu, 17 Mar 2022 16:14:26 -0500, Eddie James wrote:
> The data transfer routines must poll the status register to
> determine when more data can be shifted in or out. If the hardware
> gets into a bad state, these polling loops may never exit. Prevent
> this by returning an error if a timeout is exceeded.
> 
> 

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: fsi: Implement a timeout for polling status
      commit: 89b35e3f28514087d3f1e28e8f5634fbfd07c554

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
diff mbox series

Patch

diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
index b6c7467f0b59..d403a7a3021d 100644
--- a/drivers/spi/spi-fsi.c
+++ b/drivers/spi/spi-fsi.c
@@ -25,6 +25,7 @@ 
 
 #define SPI_FSI_BASE			0x70000
 #define SPI_FSI_INIT_TIMEOUT_MS		1000
+#define SPI_FSI_STATUS_TIMEOUT_MS	100
 #define SPI_FSI_MAX_RX_SIZE		8
 #define SPI_FSI_MAX_TX_SIZE		40
 
@@ -299,6 +300,7 @@  static int fsi_spi_transfer_data(struct fsi_spi *ctx,
 				 struct spi_transfer *transfer)
 {
 	int rc = 0;
+	unsigned long end;
 	u64 status = 0ULL;
 
 	if (transfer->tx_buf) {
@@ -315,10 +317,14 @@  static int fsi_spi_transfer_data(struct fsi_spi *ctx,
 			if (rc)
 				return rc;
 
+			end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
 			do {
 				rc = fsi_spi_status(ctx, &status, "TX");
 				if (rc)
 					return rc;
+
+				if (time_after(jiffies, end))
+					return -ETIMEDOUT;
 			} while (status & SPI_FSI_STATUS_TDR_FULL);
 
 			sent += nb;
@@ -329,10 +335,14 @@  static int fsi_spi_transfer_data(struct fsi_spi *ctx,
 		u8 *rx = transfer->rx_buf;
 
 		while (transfer->len > recv) {
+			end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
 			do {
 				rc = fsi_spi_status(ctx, &status, "RX");
 				if (rc)
 					return rc;
+
+				if (time_after(jiffies, end))
+					return -ETIMEDOUT;
 			} while (!(status & SPI_FSI_STATUS_RDR_FULL));
 
 			rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);