Patchwork linux-user: Align mmap memory to the target page size.

Submitter Richard Henderson
Date Jan. 15, 2010, 12:38 a.m.
Message ID <20100115011604.DCB1CB93@are.twiddle.net>
Permalink /patch/42939/
State New

Comments

Richard Henderson - Jan. 15, 2010, 12:38 a.m.
Previously, mmap_find_vma could return addresses not properly aligned
to the target page size.  This of course led to all sorts of odd
problems down the road.

The trivial fix, to simply reject the unaligned address and continue
searching the address space by increments of one page, is not a good
idea when there's a 64-bit address space involved.  The kernel may well
continue to return the last available address which we've already
rejected while we search upward from e.g. 2**42 toward 2**64.

This patch uses a more complex search algorithm that takes the result
of the previous allocation into account.  We normally search upward,
but when we notice 2 consecutive attempts returning the same result we
start searching downward instead.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 linux-user/main.c |    7 +---
 linux-user/mmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 64 insertions(+), 14 deletions(-)
Richard Henderson - Jan. 15, 2010, 3:43 p.m.
On 01/14/2010 04:38 PM, Richard Henderson wrote:
> Previously, mmap_find_vma could return addresses not properly aligned
> to the target page size.  This of course led to all sorts of odd
> problems down the road.
>
> The trivial fix, to simply reject the unaligned address and continue
> searching the address space by increments of one page, is not a good
> idea when there's a 64-bit address space involved.  The kernel may well
> continue to return the last available address which we've already
> rejected while we search upward from e.g. 2**42 from 2**64.
>
> This patch uses a more complex search algorithm that takes the result
> of the previous allocation into account.  We normally search upward,
> but notice 2 consecutive results and start searching downward instead.

I've failed to take guest_base into account properly; my testing 
happened to have guest_base = 0.  New patch to follow.


r~
Riku Voipio - Jan. 25, 2010, 1:35 p.m.
On Fri, Jan 15, 2010 at 07:43:48AM -0800, Richard Henderson wrote:
>> This patch uses a more complex search algorithm that takes the result
>> of the previous allocation into account.  We normally search upward,
>> but notice 2 consecutive results and start searching downward instead.

> I've failed to take guest_base into account properly; my testing  
> happened to have guest_base = 0.  New patch to follow.

Did you have time to look at this again?
Richard Henderson - Jan. 25, 2010, 4:38 p.m.
On 01/25/2010 05:35 AM, Riku Voipio wrote:
> On Fri, Jan 15, 2010 at 07:43:48AM -0800, Richard Henderson wrote:
>>> This patch uses a more complex search algorithm that takes the result
>>> of the previous allocation into account.  We normally search upward,
>>> but notice 2 consecutive results and start searching downward instead.
>
>> I've failed to take guest_base into account properly; my testing
>> happened to have guest_base = 0.  New patch to follow.
>
> Did you have time to look at this again?

Some.  There is an additional problem.

The target's pages are recorded in the physical page mapping via 
page_set_flags, whose input address is bounded by 
TARGET_PHYS_ADDR_SPACE_BITS, which is local to exec.c.  Doubly 
unfortunately, page_set_flags silently discards pages that it considers
to lie outside the target's address space.

It's fairly easy to get the x86-64 kernel to return a vma outside the 
range permitted by any of the existing TARGET_PHYS_ADDR_SPACE_BITS
settings.  This works fine inside the TBs, but causes any other syscall
to return -EFAULT.

We have a check

    /* If address fits target address space we've found what we need */
    if ((unsigned long)ptr + size - 1 <= (abi_ulong)-1)
        break;

which kind of works for 32-bit targets.  (You'll note, of course, that
the comparison is against PTR, which is a host address; the correct test
would have been h2g_valid, as can be seen by running with
-B 0x100000000.)  However, it
does nothing for 64-bit targets and the artificial 2**42 virtual address 
space limit we impose on most of them.

I talked with pbrook a bit about this on irc, and there seems to be no 
simple solution (like exporting TARGET_PHYS_ADDR_SPACE_BITS, possibly 
renamed as TARGET_ADDR_SPACE_BITS, in cpu.h) that would be acceptable to 
him.

Given the number of patches I've already submitted that aren't being 
reviewed, I'm unlikely to develop the momentum to totally rewrite qemu's 
page table support.  Particularly without a clue as to what sort of 
solution might be acceptable.  (Something like Sparc64's hashed page 
tables perhaps, which support a full 64-bit virtual address space?)


r~

Patch

diff --git a/linux-user/main.c b/linux-user/main.c
index a0d8ce7..7db9fc3 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -2725,12 +2725,9 @@  int main(int argc, char **argv, char **envp)
     /*
      * Read in mmap_min_addr kernel parameter.  This value is used
      * When loading the ELF image to determine whether guest_base
-     * is needed.
-     *
-     * When user has explicitly set the quest base, we skip this
-     * test.
+     * is needed.  It is also used in mmap_find_vma.
      */
-    if (!have_guest_base) {
+    {
         FILE *fp;
 
         if ((fp = fopen("/proc/sys/vm/mmap_min_addr", "r")) != NULL) {
diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 144fb7c..b92fdc4 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -281,8 +281,9 @@  unsigned long last_brk;
  */
 abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size)
 {
-    void *ptr;
+    void *ptr, *prev;
     abi_ulong addr;
+    int wrapped, repeat;
 
     size = HOST_PAGE_ALIGN(size);
     start &= qemu_host_page_mask;
@@ -292,8 +293,11 @@  abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size)
         start = mmap_next_start;
 
     addr = start;
+    wrapped = (start == 0);
+    repeat = 0;
+    prev = 0;
 
-    for(;;) {
+    for (;; prev = ptr) {
         /*
          * Reserve needed memory area to avoid a race.
          * It should be discarded using:
@@ -305,20 +309,69 @@  abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size)
                    MAP_ANONYMOUS|MAP_PRIVATE|MAP_NORESERVE, -1, 0);
 
         /* ENOMEM, if host address space has no memory */
-        if (ptr == MAP_FAILED)
+        if (ptr == MAP_FAILED) {
             return (abi_ulong)-1;
+        }
 
-        /* If address fits target address space we've found what we need */
-        if ((unsigned long)ptr + size - 1 <= (abi_ulong)-1)
+        /* Count the number of sequential returns of the same address.
+           This is used to modify the search algorithm below.  */
+        repeat = (ptr == prev ? repeat + 1 : 0);
+
+        if ((unsigned long)ptr & ~TARGET_PAGE_MASK) {
+            /* The address is not properly aligned for the target.  */
+            switch (repeat) {
+            case 0:
+                /* Assume the result that the kernel gave us is the
+                   first with enough free space, so start again at the
+                   next higher target page.  */
+                addr = TARGET_PAGE_ALIGN((unsigned long)ptr);
+                break;
+            case 1:
+                /* Sometimes the kernel decides to perform the allocation
+                   at the top end of memory instead.  Notice this via
+                   sequential allocations that result in the same address.  */
+                /* ??? This can be exacerbated by a successful allocation
+                   at the top of memory on one round, and storing that
+                   result in mmap_next_start.  The next allocation is sure
+                   to start at an address that's going to fail.  */
+                addr = (unsigned long)ptr & TARGET_PAGE_MASK;
+                break;
+            case 2:
+                /* Start over at low memory.  */
+                addr = 0;
+                break;
+            default:
+                /* Fail.  This unaligned block must be the only one left.  */
+                addr = -1;
+                break;
+            }
+        } else if ((unsigned long)ptr + size - 1 <= (abi_ulong)-1) {
             break;
+        } else {
+            /* Since the result the kernel gave didn't fit, start
+               again at low memory.  If any repetition, fail.  */
+            addr = (repeat ? -1 : 0);
+        }
 
-        /* Unmap and try again with new page */
+        /* Unmap and try again.  */
         munmap(ptr, size);
-        addr += qemu_host_page_size;
 
-        /* ENOMEM if we check whole of target address space */
-        if (addr == start)
+        /* ENOMEM if we checked the whole of the target address space.  */
+        if (addr == -1) {
             return (abi_ulong)-1;
+        } else if (addr == 0) {
+            if (wrapped) {
+                return (abi_ulong)-1;
+            }
+            wrapped = 1;
+            /* Don't actually use 0 when wrapping, instead indicate
+               that we'd truly like an allocation in low memory.  */
+            addr = (mmap_min_addr > TARGET_PAGE_SIZE
+                    ? TARGET_PAGE_ALIGN(mmap_min_addr)
+                    : TARGET_PAGE_SIZE);
+        } else if (wrapped && addr >= start) {
+            return (abi_ulong)-1;
+        }
     }
 
     /* Update default start address */