core/mem_region: Give up on freeing unmanaged memory block

Message ID 1495776116-27890-1-git-send-email-gwshan@linux.vnet.ibm.com
State New
Headers show

Commit Message

Gavin Shan May 26, 2017, 5:21 a.m.
When adding memory regions to global list, the overlapped one
(including it's descriptor) is purged. However, the descriptor might
be in BSS (not HEAP) section. It will trigger the exception caused
by assert() in mem_free() as below backtrace indicates. The memory
regions causing this are "ibm,os-reserve" and "hostboot-reserve-0"
separately. The former descriptor is resident in BSS section while
the later one is in HEAP:

CPU 0050 Backtrace:
 S: 0000000031d43a00 R: 00000000300136d0   .backtrace+0x2c
 S: 0000000031d43a90 R: 00000000300191ec   ._abort+0x4c
 S: 0000000031d43b10 R: 0000000030019268   .assert_fail+0x34
 S: 0000000031d43b90 R: 0000000030015d08   .mem_free+0x64
 S: 0000000031d43c20 R: 0000000030017080   .__free+0x38
 S: 0000000031d43cb0 R: 0000000030015518   .add_region+0x230
 S: 0000000031d43d60 R: 0000000030016854   .mem_region_init+0x2c8
 S: 0000000031d43e30 R: 0000000030014994   .main_cpu_entry+0x3ec
 S: 0000000031d43f00 R: 0000000030002648   boot_entry+0x198

This fixes the issue by doing nothing when trying to freeing memory
block which is resident in unmanaged (not HEAP) region.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 core/mem_region.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Oliver O'Halloran May 26, 2017, 6:12 a.m. | #1
On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> When adding memory regions to global list, the overlapped one
> (including it's descriptor) is purged. However, the descriptor might
> be in BSS (not HEAP) section. It will trigger the exception caused
> by assert() in mem_free() as below backtrace indicates. The memory
> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0"
> separately. The former descriptor is resident in BSS section while
> the later one is in HEAP:
>
> CPU 0050 Backtrace:
>  S: 0000000031d43a00 R: 00000000300136d0   .backtrace+0x2c
>  S: 0000000031d43a90 R: 00000000300191ec   ._abort+0x4c
>  S: 0000000031d43b10 R: 0000000030019268   .assert_fail+0x34
>  S: 0000000031d43b90 R: 0000000030015d08   .mem_free+0x64
>  S: 0000000031d43c20 R: 0000000030017080   .__free+0x38
>  S: 0000000031d43cb0 R: 0000000030015518   .add_region+0x230
>  S: 0000000031d43d60 R: 0000000030016854   .mem_region_init+0x2c8
>  S: 0000000031d43e30 R: 0000000030014994   .main_cpu_entry+0x3ec
>  S: 0000000031d43f00 R: 0000000030002648   boot_entry+0x198
>
> This fixes the issue by doing nothing when trying to freeing memory
> block which is resident in unmanaged (not HEAP) region.
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

Not a fan. The specific bug you're seeing here is because of hostboot
being broken on FSP machines and reserving memory that it shouldn't.
This patch papers over the problem by making any invalid free()
silently fail which is something we should avoid.

> ---
>  core/mem_region.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/core/mem_region.c b/core/mem_region.c
> index a314558..bfb5ac9 100644
> --- a/core/mem_region.c
> +++ b/core/mem_region.c
> @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location)
>         if (!mem)
>                 return;
>
> -       /* Your memory is in the region, right? */
> -       assert(mem >= region_start(region) + sizeof(*hdr));
> -       assert(mem < region_start(region) + region->len);
> +       /* Give up if it is in unmanaged and invalid regions */
> +       if (mem < (region_start(region) + sizeof(*hdr)) ||
> +           mem >= (region_start(region) + region->len))
> +               return;
>
>         /* Grab header. */
>         hdr = mem - sizeof(*hdr);
> --
> 2.7.4
>
> _______________________________________________
> Skiboot mailing list
> Skiboot@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/skiboot
Gavin Shan May 26, 2017, 6:30 a.m. | #2
On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote:
>On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>> When adding memory regions to global list, the overlapped one
>> (including it's descriptor) is purged. However, the descriptor might
>> be in BSS (not HEAP) section. It will trigger the exception caused
>> by assert() in mem_free() as below backtrace indicates. The memory
>> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0"
>> separately. The former descriptor is resident in BSS section while
>> the later one is in HEAP:
>>
>> CPU 0050 Backtrace:
>>  S: 0000000031d43a00 R: 00000000300136d0   .backtrace+0x2c
>>  S: 0000000031d43a90 R: 00000000300191ec   ._abort+0x4c
>>  S: 0000000031d43b10 R: 0000000030019268   .assert_fail+0x34
>>  S: 0000000031d43b90 R: 0000000030015d08   .mem_free+0x64
>>  S: 0000000031d43c20 R: 0000000030017080   .__free+0x38
>>  S: 0000000031d43cb0 R: 0000000030015518   .add_region+0x230
>>  S: 0000000031d43d60 R: 0000000030016854   .mem_region_init+0x2c8
>>  S: 0000000031d43e30 R: 0000000030014994   .main_cpu_entry+0x3ec
>>  S: 0000000031d43f00 R: 0000000030002648   boot_entry+0x198
>>
>> This fixes the issue by doing nothing when trying to freeing memory
>> block which is resident in unmanaged (not HEAP) region.
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>
>Not a fan. The specific bug you're seeing here is because of hostboot
>being broken on FSP machines and reserving memory that it shouldn't.
>This patch papers over the problem by making any invalid free()
>silently fail which is something we should avoid.
>

Well, It's not to fail sliently. The memory block resident in
bss/data can't simply free'd. I agree it's reasonable to avoid freeing
the memory chunks, which in unmanaged from source. If you're going
to do that, this patch can be dropped sliently. On the other hand,
I also found the last skiboot can't boot and it's related to get_hb_reserved_me().
Everything becomes fine when it's simply commented out (or skipped). It seems the
memory blocks (in device-tree) that should have reserved by kernel, aren't
reserved properly. It leads to data/memory corruption eventually. I'm
thinking the correct thing to do is mering the memory region with the
overlapped one in mem_region.c.

Cheers,
Gavin

>> ---
>>  core/mem_region.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/core/mem_region.c b/core/mem_region.c
>> index a314558..bfb5ac9 100644
>> --- a/core/mem_region.c
>> +++ b/core/mem_region.c
>> @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location)
>>         if (!mem)
>>                 return;
>>
>> -       /* Your memory is in the region, right? */
>> -       assert(mem >= region_start(region) + sizeof(*hdr));
>> -       assert(mem < region_start(region) + region->len);
>> +       /* Give up if it is in unmanaged and invalid regions */
>> +       if (mem < (region_start(region) + sizeof(*hdr)) ||
>> +           mem >= (region_start(region) + region->len))
>> +               return;
>>
>>         /* Grab header. */
>>         hdr = mem - sizeof(*hdr);
>> --
>> 2.7.4
>>
>> _______________________________________________
>> Skiboot mailing list
>> Skiboot@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/skiboot
>
Oliver O'Halloran May 26, 2017, 6:57 a.m. | #3
On Fri, May 26, 2017 at 4:30 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote:
>>On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>>> When adding memory regions to global list, the overlapped one
>>> (including it's descriptor) is purged. However, the descriptor might
>>> be in BSS (not HEAP) section. It will trigger the exception caused
>>> by assert() in mem_free() as below backtrace indicates. The memory
>>> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0"
>>> separately. The former descriptor is resident in BSS section while
>>> the later one is in HEAP:
>>>
>>> CPU 0050 Backtrace:
>>>  S: 0000000031d43a00 R: 00000000300136d0   .backtrace+0x2c
>>>  S: 0000000031d43a90 R: 00000000300191ec   ._abort+0x4c
>>>  S: 0000000031d43b10 R: 0000000030019268   .assert_fail+0x34
>>>  S: 0000000031d43b90 R: 0000000030015d08   .mem_free+0x64
>>>  S: 0000000031d43c20 R: 0000000030017080   .__free+0x38
>>>  S: 0000000031d43cb0 R: 0000000030015518   .add_region+0x230
>>>  S: 0000000031d43d60 R: 0000000030016854   .mem_region_init+0x2c8
>>>  S: 0000000031d43e30 R: 0000000030014994   .main_cpu_entry+0x3ec
>>>  S: 0000000031d43f00 R: 0000000030002648   boot_entry+0x198
>>>
>>> This fixes the issue by doing nothing when trying to freeing memory
>>> block which is resident in unmanaged (not HEAP) region.
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>
>>Not a fan. The specific bug you're seeing here is because of hostboot
>>being broken on FSP machines and reserving memory that it shouldn't.
>>This patch papers over the problem by making any invalid free()
>>silently fail which is something we should avoid.
>>
>
> Well, It's not to fail sliently. The memory block resident in
> bss/data can't simply free'd. I agree it's reasonable to avoid freeing
> the memory chunks, which in unmanaged from source. If you're going
> to do that, this patch can be dropped sliently. On the other hand,
> I also found the last skiboot can't boot and it's related to get_hb_reserved_me().
> Everything becomes fine when it's simply commented out (or skipped). It seems the
> memory blocks (in device-tree) that should have reserved by kernel, aren't
> reserved properly. It leads to data/memory corruption eventually. I'm
> thinking the correct thing to do is mering the memory region with the
> overlapped one in mem_region.c.

In this situation merging the overlapping reservations is fine, but we
can't rely on that being true in general. We've had problems in the
past with reservations for the OCC being overlapped with those for the
hostboot runtime services. In those situations I would prefer we just
crashed early rather than silently corrupting state.

>
> Cheers,
> Gavin
>
>>> ---
>>>  core/mem_region.c | 7 ++++---
>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/core/mem_region.c b/core/mem_region.c
>>> index a314558..bfb5ac9 100644
>>> --- a/core/mem_region.c
>>> +++ b/core/mem_region.c
>>> @@ -469,9 +469,10 @@ void mem_free(struct mem_region *region, void *mem, const char *location)
>>>         if (!mem)
>>>                 return;
>>>
>>> -       /* Your memory is in the region, right? */
>>> -       assert(mem >= region_start(region) + sizeof(*hdr));
>>> -       assert(mem < region_start(region) + region->len);
>>> +       /* Give up if it is in unmanaged and invalid regions */
>>> +       if (mem < (region_start(region) + sizeof(*hdr)) ||
>>> +           mem >= (region_start(region) + region->len))
>>> +               return;
>>>
>>>         /* Grab header. */
>>>         hdr = mem - sizeof(*hdr);
>>> --
>>> 2.7.4
>>>
>>> _______________________________________________
>>> Skiboot mailing list
>>> Skiboot@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/skiboot
>>
>
Gavin Shan May 29, 2017, 1:13 a.m. | #4
On Fri, May 26, 2017 at 04:57:19PM +1000, Oliver O'Halloran wrote:
>On Fri, May 26, 2017 at 4:30 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>> On Fri, May 26, 2017 at 04:12:24PM +1000, Oliver O'Halloran wrote:
>>>On Fri, May 26, 2017 at 3:21 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>>>> When adding memory regions to global list, the overlapped one
>>>> (including it's descriptor) is purged. However, the descriptor might
>>>> be in BSS (not HEAP) section. It will trigger the exception caused
>>>> by assert() in mem_free() as below backtrace indicates. The memory
>>>> regions causing this are "ibm,os-reserve" and "hostboot-reserve-0"
>>>> separately. The former descriptor is resident in BSS section while
>>>> the later one is in HEAP:
>>>>
>>>> CPU 0050 Backtrace:
>>>>  S: 0000000031d43a00 R: 00000000300136d0   .backtrace+0x2c
>>>>  S: 0000000031d43a90 R: 00000000300191ec   ._abort+0x4c
>>>>  S: 0000000031d43b10 R: 0000000030019268   .assert_fail+0x34
>>>>  S: 0000000031d43b90 R: 0000000030015d08   .mem_free+0x64
>>>>  S: 0000000031d43c20 R: 0000000030017080   .__free+0x38
>>>>  S: 0000000031d43cb0 R: 0000000030015518   .add_region+0x230
>>>>  S: 0000000031d43d60 R: 0000000030016854   .mem_region_init+0x2c8
>>>>  S: 0000000031d43e30 R: 0000000030014994   .main_cpu_entry+0x3ec
>>>>  S: 0000000031d43f00 R: 0000000030002648   boot_entry+0x198
>>>>
>>>> This fixes the issue by doing nothing when trying to freeing memory
>>>> block which is resident in unmanaged (not HEAP) region.
>>>>
>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>
>>>Not a fan. The specific bug you're seeing here is because of hostboot
>>>being broken on FSP machines and reserving memory that it shouldn't.
>>>This patch papers over the problem by making any invalid free()
>>>silently fail which is something we should avoid.
>>>
>>
>> Well, It's not to fail sliently. The memory block resident in
>> bss/data can't simply free'd. I agree it's reasonable to avoid freeing
>> the memory chunks, which in unmanaged from source. If you're going
>> to do that, this patch can be dropped sliently. On the other hand,
>> I also found the last skiboot can't boot and it's related to get_hb_reserved_me().
>> Everything becomes fine when it's simply commented out (or skipped). It seems the
>> memory blocks (in device-tree) that should have reserved by kernel, aren't
>> reserved properly. It leads to data/memory corruption eventually. I'm
>> thinking the correct thing to do is mering the memory region with the
>> overlapped one in mem_region.c.
>
>In this situation merging the overlapping reservations is fine, but we
>can't rely on that being true in general. We've had problems in the
>past with reservations for the OCC being overlapped with those for the
>hostboot runtime services. In those situations I would prefer we just
>crashed early rather than silently corrupting state.
>

Yes, It makes sense. This patch can be dropped.

Cheers,
Gavin

Patch

diff --git a/core/mem_region.c b/core/mem_region.c
index a314558..bfb5ac9 100644
--- a/core/mem_region.c
+++ b/core/mem_region.c
@@ -469,9 +469,10 @@  void mem_free(struct mem_region *region, void *mem, const char *location)
 	if (!mem)
 		return;
 
-	/* Your memory is in the region, right? */
-	assert(mem >= region_start(region) + sizeof(*hdr));
-	assert(mem < region_start(region) + region->len);
+	/* Give up if it is in unmanaged and invalid regions */
+	if (mem < (region_start(region) + sizeof(*hdr)) ||
+	    mem >= (region_start(region) + region->len))
+		return;
 
 	/* Grab header. */
 	hdr = mem - sizeof(*hdr);