opal-prd: Do not error out on first failure for soft/hard offline.

Message ID 152604712300.701.13496749985386675066.stgit@jupiter.in.ibm.com
State Accepted
Headers show
Series
  • opal-prd: Do not error out on first failure for soft/hard offline.
Related show

Commit Message

Mahesh J Salgaonkar May 11, 2018, 1:58 p.m.
From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

The memory errors (CEs and UEs) that are detected as part of background
memory scrubbing are reported by PRD asynchronously to opal-prd along with
affected memory ranges. hservice_memory_error() converts these ranges into
page granularity before hooking up them to soft/hard offline-ing
infrastructure.

But the current implementation of hservice_memory_error() does not hookup
all the pages to soft/hard offline-ing if any of the page offline action
fails. e.g hard offline can fail for:
      - Pages that are not part of buddy managed pool.
      - Pages that are reserved by kernel using memblock_reserved()
      - Pages that are in use by kernel.

But for the pages that are in use by user space application, the hard
offline marks the page as hwpoison, sends SIGBUS signal to kill the
affected application as recovery action and returns success.

Hence, It is possible that some of the pages in that memory range are in
use by application or free. By stopping on first error we loose the
opportunity to hwpoison the subsequent pages which may be free or in use by
application. This patch fixes this issue.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 external/opal-prd/opal-prd.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Stewart Smith May 25, 2018, 12:23 a.m. | #1
Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> The memory errors (CEs and UEs) that are detected as part of background
> memory scrubbing are reported by PRD asynchronously to opal-prd along with
> affected memory ranges. hservice_memory_error() converts these ranges into
> page granularity before hooking up them to soft/hard offline-ing
> infrastructure.
>
> But the current implementation of hservice_memory_error() does not hookup
> all the pages to soft/hard offline-ing if any of the page offline action
> fails. e.g hard offline can fail for:
>       - Pages that are not part of buddy managed pool.
>       - Pages that are reserved by kernel using memblock_reserved()
>       - Pages that are in use by kernel.
>
> But for the pages that are in use by user space application, the hard
> offline marks the page as hwpoison, sends SIGBUS signal to kill the
> affected application as recovery action and returns success.
>
> Hence, It is possible that some of the pages in that memory range are in
> use by application or free. By stopping on first error we loose the
> opportunity to hwpoison the subsequent pages which may be free or in use by
> application. This patch fixes this issue.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  external/opal-prd/opal-prd.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.

Should this also head to stable?
Mahesh J Salgaonkar May 25, 2018, 5:25 a.m. | #2
On 05/25/2018 05:53 AM, Stewart Smith wrote:
> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> The memory errors (CEs and UEs) that are detected as part of background
>> memory scrubbing are reported by PRD asynchronously to opal-prd along with
>> affected memory ranges. hservice_memory_error() converts these ranges into
>> page granularity before hooking up them to soft/hard offline-ing
>> infrastructure.
>>
>> But the current implementation of hservice_memory_error() does not hookup
>> all the pages to soft/hard offline-ing if any of the page offline action
>> fails. e.g hard offline can fail for:
>>       - Pages that are not part of buddy managed pool.
>>       - Pages that are reserved by kernel using memblock_reserved()
>>       - Pages that are in use by kernel.
>>
>> But for the pages that are in use by user space application, the hard
>> offline marks the page as hwpoison, sends SIGBUS signal to kill the
>> affected application as recovery action and returns success.
>>
>> Hence, It is possible that some of the pages in that memory range are in
>> use by application or free. By stopping on first error we loose the
>> opportunity to hwpoison the subsequent pages which may be free or in use by
>> application. This patch fixes this issue.
>>
>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>> ---
>>  external/opal-prd/opal-prd.c |    6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.
> 
> Should this also head to stable?
> 

Yes. We been broken from day 1.

Thanks,
-Mahesh.
Stewart Smith May 28, 2018, 2:27 a.m. | #3
Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
> On 05/25/2018 05:53 AM, Stewart Smith wrote:
>> Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> The memory errors (CEs and UEs) that are detected as part of background
>>> memory scrubbing are reported by PRD asynchronously to opal-prd along with
>>> affected memory ranges. hservice_memory_error() converts these ranges into
>>> page granularity before hooking up them to soft/hard offline-ing
>>> infrastructure.
>>>
>>> But the current implementation of hservice_memory_error() does not hookup
>>> all the pages to soft/hard offline-ing if any of the page offline action
>>> fails. e.g hard offline can fail for:
>>>       - Pages that are not part of buddy managed pool.
>>>       - Pages that are reserved by kernel using memblock_reserved()
>>>       - Pages that are in use by kernel.
>>>
>>> But for the pages that are in use by user space application, the hard
>>> offline marks the page as hwpoison, sends SIGBUS signal to kill the
>>> affected application as recovery action and returns success.
>>>
>>> Hence, It is possible that some of the pages in that memory range are in
>>> use by application or free. By stopping on first error we loose the
>>> opportunity to hwpoison the subsequent pages which may be free or in use by
>>> application. This patch fixes this issue.
>>>
>>> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>> ---
>>>  external/opal-prd/opal-prd.c |    6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> Merged to master as of e9ee7c7d357160a704c8248a1787124f94df8c54.
>> 
>> Should this also head to stable?
>> 
>
> Yes. We been broken from day 1.

Okay, I've cherry-picked back to:
6.0.x as of 3efceb1691846f450f0541c0156ad7258a57870b
5.10.x as of 6ca368e2e2254eac8682b6af43758ba134aa3763
5.4.x as of 92bd1c4bbebd25886adabe7ff275e3ca0c600234

Which seem to be the branches in use by current distros, which I guess
we now have to have them bump up what they're using to get this fix.

It's things like this that make me wonder if continuing to have opal-prd in
the same repository/verisoning scheme as skiboot continues to make sense....

Patch

diff --git a/external/opal-prd/opal-prd.c b/external/opal-prd/opal-prd.c
index bc092d186..e3b4439c6 100644
--- a/external/opal-prd/opal-prd.c
+++ b/external/opal-prd/opal-prd.c
@@ -701,7 +701,7 @@  int hservice_memory_error(uint64_t i_start_addr, uint64_t i_endAddr,
 {
 	const char *sysfsfile, *typestr;
 	char buf[ADDR_STRING_SZ];
-	int memfd, rc, n;
+	int memfd, rc, n, ret = 0;
 	uint64_t addr;
 
 	switch(i_errorType) {
@@ -737,11 +737,11 @@  int hservice_memory_error(uint64_t i_start_addr, uint64_t i_endAddr,
 			pr_log(LOG_CRIT, "MEM: Failed to offline memory! "
 					"page addr: %016lx type: %d: %m",
 				addr, i_errorType);
-			return rc;
+			ret = rc;
 		}
 	}
 
-	return 0;
+	return ret;
 }
 
 uint64_t hservice_get_interface_capabilities(uint64_t set)