
[Xen-devel] increase maxmem before calling xc_domain_populate_physmap

Message ID 547C6E12.50502@terremark.com
State New

Commit Message

Don Slutz Dec. 1, 2014, 1:33 p.m. UTC
On 11/27/14 05:48, Stefano Stabellini wrote:
> On Wed, 26 Nov 2014, Don Slutz wrote:
>> On 11/26/14 13:17, Stefano Stabellini wrote:
>>> On Tue, 25 Nov 2014, Andrew Cooper wrote:
>>>> On 25/11/14 17:45, Stefano Stabellini wrote:
>>>>> Increase maxmem before calling xc_domain_populate_physmap_exact to avoid
>>>>> the risk of running out of guest memory. This way we can also avoid
>>>>> complex memory calculations in libxl at domain construction time.
>>>>>
>>>>> This patch fixes an abort() when assigning more than 4 NICs to a VM.
>>>>>
>>>>> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>>>>>
>>>>> diff --git a/xen-hvm.c b/xen-hvm.c
>>>>> index 5c69a8d..38e08c3 100644
>>>>> --- a/xen-hvm.c
>>>>> +++ b/xen-hvm.c
>>>>> @@ -218,6 +218,7 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
>>>>> size, MemoryRegion *mr)
>>>>>        unsigned long nr_pfn;
>>>>>        xen_pfn_t *pfn_list;
>>>>>        int i;
>>>>> +    xc_dominfo_t info;
>>>>>          if (runstate_check(RUN_STATE_INMIGRATE)) {
>>>>>            /* RAM already populated in Xen */
>>>>> @@ -240,6 +241,13 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
>>>>> size, MemoryRegion *mr)
>>>>>            pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
>>>>>        }
>>>>>    +    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
>>>> xc_domain_getinfo()'s interface is mad, and provides no guarantee that
>>>> it returns the information for the domain you requested.  It also won't
>>>> return -1 on error.  The correct error handling is:
>>>>
>>>> (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) || (info.domid !=
>>>> xen_domid)
>>> It might be wiser to switch to xc_domain_getinfolist
>> Either needs the same tests, since both return a vector of info.
> Right
>
>
>>>> ~Andrew
>>>>
>>>>> +        hw_error("xc_domain_getinfo failed");
>>>>> +    }
>>>>> +    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>>>>> +                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
>> There are two big issues and 1 minor one with this.
>> 1) You will allocate the videoram again.
>> 2) You will never use the 1 MB already allocated for option ROMs.
>>
>> And the minor one is that you can increase maxmem more than is needed.
> I don't understand: are you aware that setmaxmem doesn't allocate any
> memory, just raises the maximum amount of memory allowed for the domain
> to have?

Yes.

> But you are right that we would raise the limit more than it could be,
> specifically the videoram would get accounted for twice and we wouldn't
> need LIBXL_MAXMEM_CONSTANT. I guess we would have to write a patch for
> that.
>
>
>
>> Here is a better if:
>>
>> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
>> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
>> +    free_pages = max_pages - info.nr_pages;
>> +    need_pages = nr_pfn - free_pages;
>> +    if ((free_pages < nr_pfn) &&
>> +       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>> +                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
> That's an interesting idea, but I am not sure if it is safe in all
> configurations.
>
> It could make QEMU work better with older libxl and avoid increasing
> maxmem more than necessary.
> On the other hand I guess it could break things when PoD is used, or in
> general when the user purposely sets maxmem on the vm config file.
>

Works fine in both claim modes and with PoD used (maxmem > memory).  I do
not know how to test with tmem.  I do not see how it would be worse than
the current code, which does not auto-increase.  I.e. even without a Xen
change, I think something like this could be done.


>> My testing shows 32 free pages that I am not sure where they come from.  But
>> the code above is passing with my 8 e1000 NICs.
> I think that raising maxmem a bit higher than necessary is not too bad.
> If we really care about it, we could lower the limit after QEMU's
> initialization is completed.

Ok.  I did find the 32: it is VGA_HOLE_SIZE.  So here is what I have,
which includes a lot of extra printfs.


--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -67,6 +67,7 @@ static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t *shared_page, int vcpu)
 #endif

 #define BUFFER_IO_MAX_DELAY  100
+#define VGA_HOLE_SIZE (0x20)

 typedef struct XenPhysmap {
     hwaddr start_addr;
@@ -219,6 +220,11 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size, MemoryRegion *mr)
     xen_pfn_t *pfn_list;
     int i;
     xc_dominfo_t info;
+    unsigned long max_pages, free_pages, real_free;
+    long need_pages;
+    uint64_t tot_pages, pod_cache_pages, pod_entries;
+
+    trace_xen_ram_alloc(ram_addr, size, mr->name);

     if (runstate_check(RUN_STATE_INMIGRATE)) {
         /* RAM already populated in Xen */
@@ -232,13 +238,6 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size, MemoryRegion *mr)
         return;
     }

-    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
-            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
-            " mr.name=%s\n",
-            __func__, size, (long)(size>>10), ram_addr, mr->name);
-
-    trace_xen_ram_alloc(ram_addr, size);
-
     nr_pfn = size >> TARGET_PAGE_BITS;
     pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);

@@ -246,11 +245,38 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size, MemoryRegion *mr)
         pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
     }

-    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
+    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||
+       (info.domid != xen_domid)) {
         hw_error("xc_domain_getinfo failed");
     }
-    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
-                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
+    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
+    free_pages = max_pages - info.nr_pages;
+    real_free = free_pages;
+    if (free_pages > VGA_HOLE_SIZE) {
+        free_pages -= VGA_HOLE_SIZE;
+    } else {
+        free_pages = 0;
+    }
+    need_pages = nr_pfn - free_pages;
+    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
+            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
+            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
+            __func__, size, (long)(size>>10), ram_addr,
+           max_pages, free_pages, nr_pfn, need_pages,
+            mr->name);
+    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
+                                 &pod_cache_pages, &pod_entries) >= 0) {
+        unsigned long populated = tot_pages - pod_cache_pages;
+        long delta_tot = tot_pages - info.nr_pages;
+
+        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld nop=%ld free=%ld\n",
+                __func__, populated, (long)tot_pages, delta_tot,
+                (long)pod_cache_pages, (long)pod_entries,
+               (long)info.nr_outstanding_pages, real_free);
+    }
+    if ((free_pages < nr_pfn) &&
+       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
+                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
         hw_error("xc_domain_setmaxmem failed");
     }
     if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0, 0, pfn_list)) {


Note: I already had a trace_xen_ram_alloc, here is the delta in trace-events
(which has others not yet sent):


diff --git a/trace-events b/trace-events
index eaeecd5..a58fc11 100644
--- a/trace-events
+++ b/trace-events
@@ -908,7 +908,7 @@ pvscsi_tx_rings_ppn(const char* label, uint64_t ppn) "%s page: %"PRIx64""
 pvscsi_tx_rings_num_pages(const char* label, uint32_t num) "Number of %s pages: %u"

 # xen-all.c
-xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: %#lx, size %#lx"
+xen_ram_alloc(unsigned long ram_addr, unsigned long size, const char* mr_name) "requested: %#lx size %#lx mr->name=%s"
 xen_client_set_memory(uint64_t start_addr, unsigned long size, bool log_dirty) "%#"PRIx64" size %#lx, log_dirty %i"
 handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df, uint32_t data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d"
 handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d"


    -Don Slutz

>
>>     -Don Slutz
>>
>>
>>>>> +        hw_error("xc_domain_setmaxmem failed");
>>>>> +    }
>>>>>        if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0,
>>>>> 0, pfn_list)) {
>>>>>            hw_error("xen: failed to populate ram at " RAM_ADDR_FMT,
>>>>> ram_addr);
>>>>>        }
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xen.org
>>>>> http://lists.xen.org/xen-devel

Comments

Stefano Stabellini Dec. 1, 2014, 3:37 p.m. UTC | #1
On Mon, 1 Dec 2014, Don Slutz wrote:
> On 11/27/14 05:48, Stefano Stabellini wrote:
> > On Wed, 26 Nov 2014, Don Slutz wrote:
> > > On 11/26/14 13:17, Stefano Stabellini wrote:
> > > > On Tue, 25 Nov 2014, Andrew Cooper wrote:
> > > > > On 25/11/14 17:45, Stefano Stabellini wrote:
> > > > > > Increase maxmem before calling xc_domain_populate_physmap_exact to
> > > > > > avoid
> > > > > > the risk of running out of guest memory. This way we can also avoid
> > > > > > complex memory calculations in libxl at domain construction time.
> > > > > > 
> > > > > > This patch fixes an abort() when assigning more than 4 NICs to a VM.
> > > > > > 
> > > > > > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > > > > 
> > > > > > diff --git a/xen-hvm.c b/xen-hvm.c
> > > > > > index 5c69a8d..38e08c3 100644
> > > > > > --- a/xen-hvm.c
> > > > > > +++ b/xen-hvm.c
> > > > > > @@ -218,6 +218,7 @@ void xen_ram_alloc(ram_addr_t ram_addr,
> > > > > > ram_addr_t
> > > > > > size, MemoryRegion *mr)
> > > > > >        unsigned long nr_pfn;
> > > > > >        xen_pfn_t *pfn_list;
> > > > > >        int i;
> > > > > > +    xc_dominfo_t info;
> > > > > >          if (runstate_check(RUN_STATE_INMIGRATE)) {
> > > > > >            /* RAM already populated in Xen */
> > > > > > @@ -240,6 +241,13 @@ void xen_ram_alloc(ram_addr_t ram_addr,
> > > > > > ram_addr_t
> > > > > > size, MemoryRegion *mr)
> > > > > >            pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
> > > > > >        }
> > > > > >    +    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
> > > > > xc_domain_getinfo()'s interface is mad, and provides no guarantee that
> > > > > it returns the information for the domain you requested.  It also
> > > > > won't
> > > > > return -1 on error.  The correct error handling is:
> > > > > 
> > > > > (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) || (info.domid
> > > > > !=
> > > > > xen_domid)
> > > > It might be wiser to switch to xc_domain_getinfolist
> > > Either needs the same tests, since both return a vector of info.
> > Right
> > 
> > 
> > > > > ~Andrew
> > > > > 
> > > > > > +        hw_error("xc_domain_getinfo failed");
> > > > > > +    }
> > > > > > +    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > > > > +                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> > > There are two big issues and 1 minor one with this.
> > > 1) You will allocate the videoram again.
> > > 2) You will never use the 1 MB already allocated for option ROMs.
> > > 
> > > And the minor one is that you can increase maxmem more than is needed.
> > I don't understand: are you aware that setmaxmem doesn't allocate any
> > memory, just raises the maximum amount of memory allowed for the domain
> > to have?
> 
> Yes.
> 
> > But you are right that we would raise the limit more than it could be,
> > specifically the videoram would get accounted for twice and we wouldn't
> > need LIBXL_MAXMEM_CONSTANT. I guess we would have to write a patch for
> > that.
> > 
> > 
> > 
> > > Here is a better if:
> > > 
> > > -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> > > +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> > > +    free_pages = max_pages - info.nr_pages;
> > > +    need_pages = nr_pfn - free_pages;
> > > +    if ((free_pages < nr_pfn) &&
> > > +       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > +                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
> > That's an interesting idea, but I am not sure if it is safe in all
> > configurations.
> > 
> > It could make QEMU work better with older libxl and avoid increasing
> > maxmem more than necessary.
> > On the other hand I guess it could break things when PoD is used, or in
> > general when the user purposely sets maxmem on the vm config file.
> > 
> 
> Works fine in both claim modes and with PoD used (maxmem > memory).  I do
> not know how to test with tmem.  I do not see how it would be worse than
> the current code, which does not auto-increase.  I.e. even without a Xen
> change, I think something like this could be done.

OK, good to know. I am OK with increasing maxmem only if it is strictly
necessary.


> 
> > > My testing shows 32 free pages that I am not sure where they come from.
> > > But the code above is passing with my 8 e1000 NICs.
> > I think that raising maxmem a bit higher than necessary is not too bad.
> > If we really care about it, we could lower the limit after QEMU's
> > initialization is completed.
> 
> Ok.  I did find the 32: it is VGA_HOLE_SIZE.  So here is what I have,
> which includes a lot of extra printfs.

In QEMU I would prefer not to assume that libxl increased maxmem for the
vga hole. I would rather call xc_domain_setmaxmem twice for the vga hole
than tie QEMU to a particular maxmem allocation scheme in libxl.

In libxl I would like to avoid increasing maxmem for anything QEMU will
allocate later, that includes rom and vga vram. I am not sure how to
make that work with older QEMU versions that don't call
xc_domain_setmaxmem by themselves yet though. Maybe we could check the
specific QEMU version in libxl and decide based on that. Or we could
export a feature flag in QEMU.



> --- a/xen-hvm.c
> +++ b/xen-hvm.c
> @@ -67,6 +67,7 @@ static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t
> *shared_page, int vcpu)
>  #endif
> 
>  #define BUFFER_IO_MAX_DELAY  100
> +#define VGA_HOLE_SIZE (0x20)
> 
>  typedef struct XenPhysmap {
>      hwaddr start_addr;
> @@ -219,6 +220,11 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
> MemoryRegion *mr)
>      xen_pfn_t *pfn_list;
>      int i;
>      xc_dominfo_t info;
> +    unsigned long max_pages, free_pages, real_free;
> +    long need_pages;
> +    uint64_t tot_pages, pod_cache_pages, pod_entries;
> +
> +    trace_xen_ram_alloc(ram_addr, size, mr->name);
> 
>      if (runstate_check(RUN_STATE_INMIGRATE)) {
>          /* RAM already populated in Xen */
> @@ -232,13 +238,6 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
> MemoryRegion *mr)
>          return;
>      }
> 
> -    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
> -            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
> -            " mr.name=%s\n",
> -            __func__, size, (long)(size>>10), ram_addr, mr->name);
> -
> -    trace_xen_ram_alloc(ram_addr, size);
> -
>      nr_pfn = size >> TARGET_PAGE_BITS;
>      pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);
> 
> @@ -246,11 +245,38 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
> MemoryRegion *mr)
>          pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
>      }
> 
> -    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
> +    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||
> +       (info.domid != xen_domid)) {
>          hw_error("xc_domain_getinfo failed");
>      }
> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> +    free_pages = max_pages - info.nr_pages;
> +    real_free = free_pages;
> +    if (free_pages > VGA_HOLE_SIZE) {
> +        free_pages -= VGA_HOLE_SIZE;
> +    } else {
> +        free_pages = 0;
> +    }
> +    need_pages = nr_pfn - free_pages;
> +    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
> +            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
> +            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
> +            __func__, size, (long)(size>>10), ram_addr,
> +           max_pages, free_pages, nr_pfn, need_pages,
> +            mr->name);
> +    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
> +                                 &pod_cache_pages, &pod_entries) >= 0) {
> +        unsigned long populated = tot_pages - pod_cache_pages;
> +        long delta_tot = tot_pages - info.nr_pages;
> +
> +        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld nop=%ld
> free=%ld\n",
> +                __func__, populated, (long)tot_pages, delta_tot,
> +                (long)pod_cache_pages, (long)pod_entries,
> +               (long)info.nr_outstanding_pages, real_free);
> +    }

What is the purpose of calling xc_domain_get_pod_target here? It doesn't
look like it is affecting the parameters passed to xc_domain_setmaxmem.


> +    if ((free_pages < nr_pfn) &&
> +       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> +                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
>          hw_error("xc_domain_setmaxmem failed");
>      }
>      if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0, 0,
> pfn_list)) {
> 
> 
> Note: I already had a trace_xen_ram_alloc, here is the delta in trace-events
> (which
> has others not yet sent):
> 
> 
> 
> diff --git a/trace-events b/trace-events
> index eaeecd5..a58fc11 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -908,7 +908,7 @@ pvscsi_tx_rings_ppn(const char* label, uint64_t ppn) "%s
> page: %"PRIx64""
>  pvscsi_tx_rings_num_pages(const char* label, uint32_t num) "Number of %s
> pages: %u"
> 
>  # xen-all.c
> -xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: %#lx,
> size %#lx"
> +xen_ram_alloc(unsigned long ram_addr, unsigned long size, const char*
> mr_name) "requested: %#lx size %#lx mr->name=%s"
>  xen_client_set_memory(uint64_t start_addr, unsigned long size, bool
> log_dirty) "%#"PRIx64" size %#lx, log_dirty %i"
>  handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df, uint32_t
> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
> "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d
> size=%d"
>  handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t
> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
> "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d
> size=%d"
> 
> 
>    -Don Slutz
> 
> > 
> > >     -Don Slutz
> > > 
> > > 
> > > > > > +        hw_error("xc_domain_setmaxmem failed");
> > > > > > +    }
> > > > > >        if (xc_domain_populate_physmap_exact(xen_xc, xen_domid,
> > > > > > nr_pfn, 0,
> > > > > > 0, pfn_list)) {
> > > > > >            hw_error("xen: failed to populate ram at " RAM_ADDR_FMT,
> > > > > > ram_addr);
> > > > > >        }
> > > > > > 
> > > > > > _______________________________________________
> > > > > > Xen-devel mailing list
> > > > > > Xen-devel@lists.xen.org
> > > > > > http://lists.xen.org/xen-devel
>
Don Slutz Dec. 1, 2014, 11:16 p.m. UTC | #2
On 12/01/14 10:37, Stefano Stabellini wrote:
> On Mon, 1 Dec 2014, Don Slutz wrote:
>> On 11/27/14 05:48, Stefano Stabellini wrote:

[...]

>>
>> Works fine in both claim modes and with PoD used (maxmem > memory).  I do
>> not know how to test with tmem.  I do not see how it would be worse than
>> the current code, which does not auto-increase.  I.e. even without a Xen
>> change, I think something like this could be done.
> OK, good to know. I am OK with increasing maxmem only if it is strictly
> necessary.
>
>
>>>> My testing shows 32 free pages that I am not sure where they come from.
>>>> But the code above is passing with my 8 e1000 NICs.
>>> I think that raising maxmem a bit higher than necessary is not too bad.
>>> If we really care about it, we could lower the limit after QEMU's
>>> initialization is completed.
>> Ok.  I did find the 32: it is VGA_HOLE_SIZE.  So here is what I have,
>> which includes a lot of extra printfs.
> In QEMU I would prefer not to assume that libxl increased maxmem for the
> vga hole. I would rather call xc_domain_setmaxmem twice for the vga hole
> than tie QEMU to a particular maxmem allocation scheme in libxl.

Ok.  The area we are talking about is 0x000a0000 to 0x000c0000.
It is in libxc (xc_hvm_build_x86), not libxl.  I have no issue with a
name change to something like QEMU_EXTRA_FREE_PAGES.

My testing has shown that some of these 32 pages are used outside of QEMU.
I am seeing just 23 free pages using a standalone program to display
the same info after a CentOS 6.3 guest is done booting.

> In libxl I would like to avoid increasing maxmem for anything QEMU will
> allocate later, that includes rom and vga vram. I am not sure how to
> make that work with older QEMU versions that don't call
> xc_domain_setmaxmem by themselves yet though. Maybe we could check the
> specific QEMU version in libxl and decide based on that. Or we could
> export a feature flag in QEMU.

Yes, it would be nice to adjust libxl to not increase maxmem.  However,
since videoram is included in memory (and maxmem), making the change
related to vram is a bigger issue.

The ROM change is much simpler.

Currently I do not know of a way to do different things based on the
QEMU version and/or features (this includes getting the QEMU version
in libxl).

I have been going with:
1) Change QEMU first.
2) Wait for an upstream version of QEMU with this.
3) Change Xen to optionally use the feature in the latest QEMU.

>
>
>> --- a/xen-hvm.c
>> +++ b/xen-hvm.c
>> @@ -67,6 +67,7 @@ static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t
>> *shared_page, int vcpu)
>>   #endif
>>
>>   #define BUFFER_IO_MAX_DELAY  100
>> +#define VGA_HOLE_SIZE (0x20)
>>
>>   typedef struct XenPhysmap {
>>       hwaddr start_addr;
>> @@ -219,6 +220,11 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
>> MemoryRegion *mr)
>>       xen_pfn_t *pfn_list;
>>       int i;
>>       xc_dominfo_t info;
>> +    unsigned long max_pages, free_pages, real_free;
>> +    long need_pages;
>> +    uint64_t tot_pages, pod_cache_pages, pod_entries;
>> +
>> +    trace_xen_ram_alloc(ram_addr, size, mr->name);
>>
>>       if (runstate_check(RUN_STATE_INMIGRATE)) {
>>           /* RAM already populated in Xen */
>> @@ -232,13 +238,6 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
>> MemoryRegion *mr)
>>           return;
>>       }
>>
>> -    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
>> -            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
>> -            " mr.name=%s\n",
>> -            __func__, size, (long)(size>>10), ram_addr, mr->name);
>> -
>> -    trace_xen_ram_alloc(ram_addr, size);
>> -
>>       nr_pfn = size >> TARGET_PAGE_BITS;
>>       pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);
>>
>> @@ -246,11 +245,38 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t size,
>> MemoryRegion *mr)
>>           pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
>>       }
>>
>> -    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
>> +    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||
>> +       (info.domid != xen_domid)) {
>>           hw_error("xc_domain_getinfo failed");
>>       }
>> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
>> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
>> +    free_pages = max_pages - info.nr_pages;
>> +    real_free = free_pages;
>> +    if (free_pages > VGA_HOLE_SIZE) {
>> +        free_pages -= VGA_HOLE_SIZE;
>> +    } else {
>> +        free_pages = 0;
>> +    }
>> +    need_pages = nr_pfn - free_pages;
>> +    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
>> +            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
>> +            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
>> +            __func__, size, (long)(size>>10), ram_addr,
>> +           max_pages, free_pages, nr_pfn, need_pages,
>> +            mr->name);
>> +    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
>> +                                 &pod_cache_pages, &pod_entries) >= 0) {
>> +        unsigned long populated = tot_pages - pod_cache_pages;
>> +        long delta_tot = tot_pages - info.nr_pages;
>> +
>> +        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld nop=%ld
>> free=%ld\n",
>> +                __func__, populated, (long)tot_pages, delta_tot,
>> +                (long)pod_cache_pages, (long)pod_entries,
>> +               (long)info.nr_outstanding_pages, real_free);
>> +    }
> What is the purpose of calling xc_domain_get_pod_target here? It doesn't
> look like it is affecting the parameters passed to xc_domain_setmaxmem.

This was just to test and see if anything was needed for PoD mode.
I did not see anything.

Did you want me to make a patch to send to the list that has my proposed
changes?

I do think that what I have would be fine to do, since most of the time
the new code does nothing with the current Xen (and older versions) until
more than 4 NICs are configured for the guest.

It would be good to have the change:

[PATCH] libxl_set_memory_target: retain the same maxmem offset on top of 
the current target

(once working) in Xen before using a QEMU with this change.

Not sure I would push for 2.2 however.

    -Don Slutz

>
>> +    if ((free_pages < nr_pfn) &&
>> +       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>> +                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
>>           hw_error("xc_domain_setmaxmem failed");
>>       }
>>       if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0, 0,
>> pfn_list)) {
>>
>>
>> Note: I already had a trace_xen_ram_alloc, here is the delta in trace-events
>> (which
>> has others not yet sent):
>>
>>
>>
>> diff --git a/trace-events b/trace-events
>> index eaeecd5..a58fc11 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -908,7 +908,7 @@ pvscsi_tx_rings_ppn(const char* label, uint64_t ppn) "%s
>> page: %"PRIx64""
>>   pvscsi_tx_rings_num_pages(const char* label, uint32_t num) "Number of %s
>> pages: %u"
>>
>>   # xen-all.c
>> -xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: %#lx,
>> size %#lx"
>> +xen_ram_alloc(unsigned long ram_addr, unsigned long size, const char*
>> mr_name) "requested: %#lx size %#lx mr->name=%s"
>>   xen_client_set_memory(uint64_t start_addr, unsigned long size, bool
>> log_dirty) "%#"PRIx64" size %#lx, log_dirty %i"
>>   handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df, uint32_t
>> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
>> "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d
>> size=%d"
>>   handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t
>> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
>> "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d
>> size=%d"
>>
>>
>>     -Don Slutz
>>
>>>>      -Don Slutz
>>>>
>>>>
>>>>>>> +        hw_error("xc_domain_setmaxmem failed");
>>>>>>> +    }
>>>>>>>         if (xc_domain_populate_physmap_exact(xen_xc, xen_domid,
>>>>>>> nr_pfn, 0,
>>>>>>> 0, pfn_list)) {
>>>>>>>             hw_error("xen: failed to populate ram at " RAM_ADDR_FMT,
>>>>>>> ram_addr);
>>>>>>>         }
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Xen-devel mailing list
>>>>>>> Xen-devel@lists.xen.org
>>>>>>> http://lists.xen.org/xen-devel
Stefano Stabellini Dec. 2, 2014, 12:26 p.m. UTC | #3
On Mon, 1 Dec 2014, Don Slutz wrote:
> On 12/01/14 10:37, Stefano Stabellini wrote:
> > On Mon, 1 Dec 2014, Don Slutz wrote:
> > > On 11/27/14 05:48, Stefano Stabellini wrote:
> 
> [...]
> 
> > > 
> > > Works fine in both claim modes and with PoD used (maxmem > memory).  I do
> > > not know how to test with tmem.  I do not see how it would be worse than
> > > the current code, which does not auto-increase.  I.e. even without a Xen
> > > change, I think something like this could be done.
> > OK, good to know. I am OK with increasing maxmem only if it is strictly
> > necessary.
> > 
> > 
> > > > > My testing shows 32 free pages that I am not sure where they come from.
> > > > > But the code above is passing with my 8 e1000 NICs.
> > > > I think that raising maxmem a bit higher than necessary is not too bad.
> > > > If we really care about it, we could lower the limit after QEMU's
> > > > initialization is completed.
> > > Ok.  I did find the 32: it is VGA_HOLE_SIZE.  So here is what I have,
> > > which includes a lot of extra printfs.
> > In QEMU I would prefer not to assume that libxl increased maxmem for the
> > vga hole. I would rather call xc_domain_setmaxmem twice for the vga hole
> > than tie QEMU to a particular maxmem allocation scheme in libxl.
> 
> Ok.  The area we are talking about is 0x000a0000 to 0x000c0000.
> It is in libxc (xc_hvm_build_x86), not libxl.  I have no issue with a
> name change to something like QEMU_EXTRA_FREE_PAGES.

Sorry, I was thinking about the relocated videoram region, the one
allocated by xen_add_to_physmap in QEMU.

Regarding 0x000a0000-0x000c0000, according to this comment:

    /*
     * Allocate memory for HVM guest, skipping VGA hole 0xA0000-0xC0000.
     *

xc_hvm_build_x86 shouldn't be allocating memory for it.
In fact if I remember correctly 0xa0000-0xc0000 is trapped and emulated.


> My testing has shown that some of these 32 pages are used outside of QEMU.
> I am seeing just 23 free pages using a standalone program to display
> the same info after a CentOS 6.3 guest is done booting.
>
> > In libxl I would like to avoid increasing maxmem for anything QEMU will
> > allocate later, that includes rom and vga vram. I am not sure how to
> > make that work with older QEMU versions that don't call
> > xc_domain_setmaxmem by themselves yet though. Maybe we could check the
> > specific QEMU version in libxl and decide based on that. Or we could
> > export a feature flag in QEMU.
> 
> Yes, it would be nice to adjust libxl to not increase maxmem.  However,
> since videoram is included in memory (and maxmem), making the change
> related to vram is a bigger issue.
> 
> The ROM change is much simpler.
> 
> Currently I do not know of a way to do different things based on the
> QEMU version and/or features (this includes getting the QEMU version
> in libxl).
> 
> I have been going with:
> 1) Change QEMU first.
> 2) Wait for an upstream version of QEMU with this.
> 3) Change Xen to optionally use the feature in the latest QEMU.

Let's try to take this approach for the videoram too.


> > 
> > > --- a/xen-hvm.c
> > > +++ b/xen-hvm.c
> > > @@ -67,6 +67,7 @@ static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t
> > > *shared_page, int vcpu)
> > >   #endif
> > > 
> > >   #define BUFFER_IO_MAX_DELAY  100
> > > +#define VGA_HOLE_SIZE (0x20)
> > > 
> > >   typedef struct XenPhysmap {
> > >       hwaddr start_addr;
> > > @@ -219,6 +220,11 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
> > > size,
> > > MemoryRegion *mr)
> > >       xen_pfn_t *pfn_list;
> > >       int i;
> > >       xc_dominfo_t info;
> > > +    unsigned long max_pages, free_pages, real_free;
> > > +    long need_pages;
> > > +    uint64_t tot_pages, pod_cache_pages, pod_entries;
> > > +
> > > +    trace_xen_ram_alloc(ram_addr, size, mr->name);
> > > 
> > >       if (runstate_check(RUN_STATE_INMIGRATE)) {
> > >           /* RAM already populated in Xen */
> > > @@ -232,13 +238,6 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
> > > size,
> > > MemoryRegion *mr)
> > >           return;
> > >       }
> > > 
> > > -    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
> > > -            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
> > > -            " mr.name=%s\n",
> > > -            __func__, size, (long)(size>>10), ram_addr, mr->name);
> > > -
> > > -    trace_xen_ram_alloc(ram_addr, size);
> > > -
> > >       nr_pfn = size >> TARGET_PAGE_BITS;
> > >       pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);
> > > 
> > > @@ -246,11 +245,38 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
> > > size,
> > > MemoryRegion *mr)
> > >           pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
> > >       }
> > > 
> > > -    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
> > > +    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||

I think it's best to call xc_domain_getinfolist.


> > > +       (info.domid != xen_domid)) {
> > >           hw_error("xc_domain_getinfo failed");
> > >       }
> > > -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> > > +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> > > +    free_pages = max_pages - info.nr_pages;
> > > +    real_free = free_pages;
> > > +    if (free_pages > VGA_HOLE_SIZE) {
> > > +        free_pages -= VGA_HOLE_SIZE;
> > > +    } else {
> > > +        free_pages = 0;
> > > +    }

I don't think we need to subtract VGA_HOLE_SIZE.


> > > +    need_pages = nr_pfn - free_pages;
> > > +    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
> > > +            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
> > > +            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
> > > +            __func__, size, (long)(size>>10), ram_addr,
> > > +           max_pages, free_pages, nr_pfn, need_pages,
> > > +            mr->name);
> > > +    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
> > > +                                 &pod_cache_pages, &pod_entries) >= 0) {
> > > +        unsigned long populated = tot_pages - pod_cache_pages;
> > > +        long delta_tot = tot_pages - info.nr_pages;
> > > +
> > > +        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld
> > > nop=%ld
> > > free=%ld\n",
> > > +                __func__, populated, (long)tot_pages, delta_tot,
> > > +                (long)pod_cache_pages, (long)pod_entries,
> > > +               (long)info.nr_outstanding_pages, real_free);
> > > +    }
> > What is the purpose of calling xc_domain_get_pod_target here? It doesn't
> > look like it is affecting the parameters passed to xc_domain_setmaxmem.
> 
> This was just to test and see if anything was needed for PoD mode.
> I did not see anything.
> 
> Did you want me to make a patch to send to the list that has my proposed
> changes?

Yep, that's fine.


> I do think that what I have would be fine to do, since most of the time
> the new code does nothing with the current Xen (and older versions) until
> more than 4 NICs are configured for the guest.
> 
> It would be good to have the change:
> 
> [PATCH] libxl_set_memory_target: retain the same maxmem offset on top of the
> current target
> 
> (once working) in Xen before using a QEMU with this change.
>
> Not sure I would push for 2.2 however.

I wouldn't.


>    -Don Slutz
> 
> > 
> > > +    if ((free_pages < nr_pfn) &&
> > > +       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > +                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
> > >           hw_error("xc_domain_setmaxmem failed");
> > >       }
> > >       if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0,
> > > 0,
> > > pfn_list)) {
> > > 
> > > 
> > > Note: I already had a trace_xen_ram_alloc, here is the delta in
> > > trace-events
> > > (which
> > > has others not yet sent):
> > > 
> > > 
> > > 
> > > diff --git a/trace-events b/trace-events
> > > index eaeecd5..a58fc11 100644
> > > --- a/trace-events
> > > +++ b/trace-events
> > > @@ -908,7 +908,7 @@ pvscsi_tx_rings_ppn(const char* label, uint64_t ppn)
> > > "%s
> > > page: %"PRIx64""
> > >   pvscsi_tx_rings_num_pages(const char* label, uint32_t num) "Number of %s
> > > pages: %u"
> > > 
> > >   # xen-all.c
> > > -xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested:
> > > %#lx,
> > > size %#lx"
> > > +xen_ram_alloc(unsigned long ram_addr, unsigned long size, const char*
> > > mr_name) "requested: %#lx size %#lx mr->name=%s"
> > >   xen_client_set_memory(uint64_t start_addr, unsigned long size, bool
> > > log_dirty) "%#"PRIx64" size %#lx, log_dirty %i"
> > >   handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df,
> > > uint32_t
> > > data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
> > > "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64"
> > > count=%d
> > > size=%d"
> > >   handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t
> > > data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size)
> > > "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d
> > > size=%d"
> > > 
> > > 
> > >     -Don Slutz
> > > 
> > > > >      -Don Slutz
> > > > > 
> > > > > 
> > > > > > > > +        hw_error("xc_domain_setmaxmem failed");
> > > > > > > > +    }
> > > > > > > >         if (xc_domain_populate_physmap_exact(xen_xc, xen_domid,
> > > > > > > > nr_pfn, 0,
> > > > > > > > 0, pfn_list)) {
> > > > > > > >             hw_error("xen: failed to populate ram at "
> > > > > > > > RAM_ADDR_FMT,
> > > > > > > > ram_addr);
> > > > > > > >         }
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > Xen-devel mailing list
> > > > > > > > Xen-devel@lists.xen.org
> > > > > > > > http://lists.xen.org/xen-devel
>
Don Slutz Dec. 2, 2014, 8:23 p.m. UTC | #4
On 12/02/14 07:26, Stefano Stabellini wrote:
> On Mon, 1 Dec 2014, Don Slutz wrote:
>> On 12/01/14 10:37, Stefano Stabellini wrote:
>>> On Mon, 1 Dec 2014, Don Slutz wrote:
>>>> On 11/27/14 05:48, Stefano Stabellini wrote:
>> [...]
>>
>>>> Works fine in both claim modes and with PoD used (maxmem > memory).  Do
>>>> not know how to test with tmem.  I do not see how it would be worse then
>>>> current
>>>> code that does not auto increase.  I.E. even without a xen change, I think
>>>> something
>>>> like this could be done.
>>> OK, good to know. I am OK with increasing maxmem only if it is strictly
>>> necessary.
>>>
>>>
>>>>>> My testing shows a free 32 pages that I am not sure where they come
>>>>>> from.
>>>>>> But
>>>>>> the code about is passing my 8 nics of e1000.
>>>>> I think that raising maxmem a bit higher than necessary is not too bad.
>>>>> If we really care about it, we could lower the limit after QEMU's
>>>>> initialization is completed.
>>>> Ok.  I did find the 32 it is VGA_HOLE_SIZE.  So here is what I have which
>>>> includes
>>>> a lot of extra printf.
>>> In QEMU I would prefer not to assume that libxl increased maxmem for the
>>> vga hole. I would rather call xc_domain_setmaxmem twice for the vga hole
>>> than tie QEMU to a particular maxmem allocation scheme in libxl.
>> Ok.  The area we are talking about is 0x000a0000 to 0x000c0000.
>> It is in libxc (xc_hvm_build_x86), not libxl.   I have no issue with a name
>> change to
>> some thing like QEMU_EXTRA_FREE_PAGES.
> Sorry, I was thinking about the relocated videoram region, the one
> allocated by xen_add_to_physmap in QEMU.
>
> Regarding 0x000a0000-0x000c0000, according to this comment:
>
>      /*
>       * Allocate memory for HVM guest, skipping VGA hole 0xA0000-0xC0000.
>       *
>
> xc_hvm_build_x86 shouldn't be allocating memory for it.
> In fact if I remember correctly 0xa0000-0xc0000 is trapped and emulated.
>

Clearly I am not writing clear enough statements.

xc_hvm_build_x86 is not allocating memory for it.  libxl__build_pre() sets
maxmem via xc_domain_setmaxmem().  Then xc_hvm_build_x86 allocates
memory, skipping the VGA hole 0xA0000-0xC0000.  So the difference between
maxmem (max_pages, xlinfo->max_memkb) and tot_pages
(xlinfo->current_memkb) is videoram + LIBXL_MAXMEM_CONSTANT + 32
(i.e. what VGA_HOLE_SIZE is defined as).


>> My testing has shows that some of these 32 pages are used outside of QEMU.
>> I am seeing just 23 free pages using a standalone program to display
>> the same info after a CentOS 6.3 guest is done booting.
>>
>>> In libxl I would like to avoid increasing mamxem for anything QEMU will
>>> allocate later, that includes rom and vga vram. I am not sure how to
>>> make that work with older QEMU versions that don't call
>>> xc_domain_setmaxmem by themselves yet though. Maybe we could check the
>>> specific QEMU version in libxl and decide based on that. Or we could
>>> export a feature flag in QEMU.
>> Yes, it would be nice to adjust libxl to not increase maxmem. However since
>> videoram is included in memory (and maxmem), making the change related
>> to vram is a bigger issue.
>>
>> the rom change is much simpler.
>>
>> Currently I do not know of a way to do different things based on the QEMU
>> version
>> and/or features (this includes getting the QEMU version in libxl).
>>
>> I have been going with:
>> 1) change QEMU 1st.
>> 2) Wait for an upstream version of QEMU with this.
>> 3) change xen to optionally use a feature in the latest QEMU.
> Let's try to take this approach for the videoram too

Ok.

>
>>>> --- a/xen-hvm.c
>>>> +++ b/xen-hvm.c
>>>> @@ -67,6 +67,7 @@ static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t
>>>> *shared_page, int vcpu)
>>>>    #endif
>>>>
>>>>    #define BUFFER_IO_MAX_DELAY  100
>>>> +#define VGA_HOLE_SIZE (0x20)
>>>>
>>>>    typedef struct XenPhysmap {
>>>>        hwaddr start_addr;
>>>> @@ -219,6 +220,11 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
>>>> size,
>>>> MemoryRegion *mr)
>>>>        xen_pfn_t *pfn_list;
>>>>        int i;
>>>>        xc_dominfo_t info;
>>>> +    unsigned long max_pages, free_pages, real_free;
>>>> +    long need_pages;
>>>> +    uint64_t tot_pages, pod_cache_pages, pod_entries;
>>>> +
>>>> +    trace_xen_ram_alloc(ram_addr, size, mr->name);
>>>>
>>>>        if (runstate_check(RUN_STATE_INMIGRATE)) {
>>>>            /* RAM already populated in Xen */
>>>> @@ -232,13 +238,6 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
>>>> size,
>>>> MemoryRegion *mr)
>>>>            return;
>>>>        }
>>>>
>>>> -    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
>>>> -            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
>>>> -            " mr.name=%s\n",
>>>> -            __func__, size, (long)(size>>10), ram_addr, mr->name);
>>>> -
>>>> -    trace_xen_ram_alloc(ram_addr, size);
>>>> -
>>>>        nr_pfn = size >> TARGET_PAGE_BITS;
>>>>        pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);
>>>>
>>>> @@ -246,11 +245,38 @@ void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t
>>>> size,
>>>> MemoryRegion *mr)
>>>>            pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
>>>>        }
>>>>
>>>> -    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
>>>> +    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||
> I think it's best to call xc_domain_getinfolist.
>

Will switch.

>>>> +       (info.domid != xen_domid)) {
>>>>            hw_error("xc_domain_getinfo failed");
>>>>        }
>>>> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>>>> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
>>>> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
>>>> +    free_pages = max_pages - info.nr_pages;
>>>> +    real_free = free_pages;
>>>> +    if (free_pages > VGA_HOLE_SIZE) {
>>>> +        free_pages -= VGA_HOLE_SIZE;
>>>> +    } else {
>>>> +        free_pages = 0;
>>>> +    }
> I don't think we need to subtract VGA_HOLE_SIZE.

If you do not use some slack (like 32 here), xen does report:


(d5) [2014-11-29 17:07:21] Loaded SeaBIOS
(d5) [2014-11-29 17:07:21] Creating MP tables ...
(d5) [2014-11-29 17:07:21] Loading ACPI ...
(XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for 
domain 5: 1057417 > 1057416
(XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0 
extent: id=5 memflags=0 (0 of 1)
(d5) [2014-11-29 17:07:21] vm86 TSS at 00098c00
(d5) [2014-11-29 17:07:21] BIOS map:


However, while QEMU startup ends with 32 "free" pages (maxmem - curmem),
this quickly changes to 23.  I.e. there are 7 more places that act like a
call to xc_domain_populate_physmap_exact() but do not log errors if there
are no free pages.  And, just to make sure I do not fully understand what
is happening here, after the message above the domain appears to work
fine and ends up with 8 "free" pages (code I do not know about ends up
releasing populated pages).

So my current thinking is to have QEMU leave a few (9 to 32 (64?)) pages
"free".


>
>>>> +    need_pages = nr_pfn - free_pages;
>>>> +    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
>>>> +            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
>>>> +            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
>>>> +            __func__, size, (long)(size>>10), ram_addr,
>>>> +           max_pages, free_pages, nr_pfn, need_pages,
>>>> +            mr->name);
>>>> +    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
>>>> +                                 &pod_cache_pages, &pod_entries) >= 0) {
>>>> +        unsigned long populated = tot_pages - pod_cache_pages;
>>>> +        long delta_tot = tot_pages - info.nr_pages;
>>>> +
>>>> +        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld
>>>> nop=%ld
>>>> free=%ld\n",
>>>> +                __func__, populated, (long)tot_pages, delta_tot,
>>>> +                (long)pod_cache_pages, (long)pod_entries,
>>>> +               (long)info.nr_outstanding_pages, real_free);
>>>> +    }
>>> What is the purpose of calling xc_domain_get_pod_target here? It doesn't
>>> look like is affecting the parameters passed to xc_domain_setmaxmem.
>> This was just to test and see if anything was needed for PoD mode.
>> I did not see anything.
>>
>> Did you want me to make a patch to send to the list that has my proposed
>> changes?
> Yep, that's fine.
>
>
>> I do think that what I have would be fine to do since most of the time the
>> new code does nothing with the current xen (and older versions), until
>> more then 4 nics are configured for xen.
>>
>> It would be good to have the change:
>>
>> [PATCH] libxl_set_memory_target: retain the same maxmem offset on top of the
>> current target
>>
>> (When working) in xen before using a QEMU with this change.
>>
>> Not sure I would push for 2.2 however.
> I wouldn't.
>

Will mark as for 2.3

    -Don Slutz
Wei Liu Dec. 3, 2014, 10:54 a.m. UTC | #5
On Tue, Dec 02, 2014 at 03:23:29PM -0500, Don Slutz wrote:
[...]
> >>>>           hw_error("xc_domain_getinfo failed");
> >>>>       }
> >>>>-    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> >>>>-                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> >>>>+    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> >>>>+    free_pages = max_pages - info.nr_pages;
> >>>>+    real_free = free_pages;
> >>>>+    if (free_pages > VGA_HOLE_SIZE) {
> >>>>+        free_pages -= VGA_HOLE_SIZE;
> >>>>+    } else {
> >>>>+        free_pages = 0;
> >>>>+    }
> >I don't think we need to subtract VGA_HOLE_SIZE.
> 
> If you do not use some slack (like 32 here), xen does report:
> 
> 
> (d5) [2014-11-29 17:07:21] Loaded SeaBIOS
> (d5) [2014-11-29 17:07:21] Creating MP tables ...
> (d5) [2014-11-29 17:07:21] Loading ACPI ...
> (XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for domain
> 5: 1057417 > 1057416
> (XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0

This message is a bit of a red herring.

It's hvmloader trying to populate ram for firmware data. The actual
amount of extra pages needed depends on the firmware.

In any case it's safe to disallow hvmloader from doing so; it will just
relocate some pages from RAM (hence shrinking *mem_end).

> extent: id=5 memflags=0 (0 of 1)
> (d5) [2014-11-29 17:07:21] vm86 TSS at 00098c00
> (d5) [2014-11-29 17:07:21] BIOS map:
> 
> 
> However while QEMU startup ends with 32 "free" pages (maxmem - curmem)
> this quickly changes to 23.  I.E. there are 7 more places that act like a
> call
> to xc_domain_populate_physmap_exact() but do not log errors if there is
> no free pages.  And just to make sure I do not fully understand what is
> happening here, after the message above, the domain appears to work
> fine and ends up with 8 "free" pages (code I do not know about ends up
> releasing populated pages).
> 
> So my current thinking is to have QEMU leave a few (9 to 32 (64?)) pages
> "free".
> 

Unless we know beforehand how many pages hvmloader needs, this number is
an estimate at best.

Wei.
Stefano Stabellini Dec. 3, 2014, 12:20 p.m. UTC | #6
On Wed, 3 Dec 2014, Wei Liu wrote:
> On Tue, Dec 02, 2014 at 03:23:29PM -0500, Don Slutz wrote:
> [...]
> > >>>>           hw_error("xc_domain_getinfo failed");
> > >>>>       }
> > >>>>-    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > >>>>-                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
> > >>>>+    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> > >>>>+    free_pages = max_pages - info.nr_pages;
> > >>>>+    real_free = free_pages;
> > >>>>+    if (free_pages > VGA_HOLE_SIZE) {
> > >>>>+        free_pages -= VGA_HOLE_SIZE;
> > >>>>+    } else {
> > >>>>+        free_pages = 0;
> > >>>>+    }
> > >I don't think we need to subtract VGA_HOLE_SIZE.
> > 
> > If you do not use some slack (like 32 here), xen does report:
> > 
> > 
> > (d5) [2014-11-29 17:07:21] Loaded SeaBIOS
> > (d5) [2014-11-29 17:07:21] Creating MP tables ...
> > (d5) [2014-11-29 17:07:21] Loading ACPI ...
> > (XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for domain
> > 5: 1057417 > 1057416
> > (XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0
> 
> This message is a bit red herring.
> 
> It's hvmloader trying to populate ram for firmware data. The actual
> amount of extra pages needed depends on the firmware.
> 
> In any case it's safe to disallow hvmloader from doing so, it will just
> relocate some pages from ram (hence shrinking *mem_end).

That looks like a better solution


> > extent: id=5 memflags=0 (0 of 1)
> > (d5) [2014-11-29 17:07:21] vm86 TSS at 00098c00
> > (d5) [2014-11-29 17:07:21] BIOS map:
> > 
> > 
> > However while QEMU startup ends with 32 "free" pages (maxmem - curmem)
> > this quickly changes to 23.  I.E. there are 7 more places that act like a
> > call
> > to xc_domain_populate_physmap_exact() but do not log errors if there is
> > no free pages.  And just to make sure I do not fully understand what is
> > happening here, after the message above, the domain appears to work
> > fine and ends up with 8 "free" pages (code I do not know about ends up
> > releasing populated pages).
> > 
> > So my current thinking is to have QEMU leave a few (9 to 32 (64?)) pages
> > "free".
> > 
> 
> Unless we know before hand how many pages hvmloader needs this number is
> estimation at best.
 
Right. It would be nice to get rid of any estimation by:
- making hvmloader use normal ram
- making qemu increase maxmem
- removing all the estimation from libxl
Don Slutz Dec. 3, 2014, 1:39 p.m. UTC | #7
On 12/03/14 07:20, Stefano Stabellini wrote:
> On Wed, 3 Dec 2014, Wei Liu wrote:
>> On Tue, Dec 02, 2014 at 03:23:29PM -0500, Don Slutz wrote:
>> [...]
>>>>>>>            hw_error("xc_domain_getinfo failed");
>>>>>>>        }
>>>>>>> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>>>>>>> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
>>>>>>> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
>>>>>>> +    free_pages = max_pages - info.nr_pages;
>>>>>>> +    real_free = free_pages;
>>>>>>> +    if (free_pages > VGA_HOLE_SIZE) {
>>>>>>> +        free_pages -= VGA_HOLE_SIZE;
>>>>>>> +    } else {
>>>>>>> +        free_pages = 0;
>>>>>>> +    }
>>>> I don't think we need to subtract VGA_HOLE_SIZE.
>>> If you do not use some slack (like 32 here), xen does report:
>>>
>>>
>>> (d5) [2014-11-29 17:07:21] Loaded SeaBIOS
>>> (d5) [2014-11-29 17:07:21] Creating MP tables ...
>>> (d5) [2014-11-29 17:07:21] Loading ACPI ...
>>> (XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for domain
>>> 5: 1057417 > 1057416
>>> (XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0
>> This message is a bit red herring.
>>
>> It's hvmloader trying to populate ram for firmware data. The actual
>> amount of extra pages needed depends on the firmware.
>>
>> In any case it's safe to disallow hvmloader from doing so, it will just
>> relocate some pages from ram (hence shrinking *mem_end).
> That looks like a better solution
>

I went with a "leave some slack" approach so that the error message above
is not output.

When a change to hvmloader is done so that the message does not appear
during normal usage, the extra pages in QEMU can be dropped.


>>> extent: id=5 memflags=0 (0 of 1)
>>> (d5) [2014-11-29 17:07:21] vm86 TSS at 00098c00
>>> (d5) [2014-11-29 17:07:21] BIOS map:
>>>
>>>
>>> However while QEMU startup ends with 32 "free" pages (maxmem - curmem)
>>> this quickly changes to 23.  I.E. there are 7 more places that act like a
>>> call
>>> to xc_domain_populate_physmap_exact() but do not log errors if there is
>>> no free pages.  And just to make sure I do not fully understand what is
>>> happening here, after the message above, the domain appears to work
>>> fine and ends up with 8 "free" pages (code I do not know about ends up
>>> releasing populated pages).
>>>
>>> So my current thinking is to have QEMU leave a few (9 to 32 (64?)) pages
>>> "free".
>>>
>> Unless we know before hand how many pages hvmloader needs this number is
>> estimation at best.
>   
> Right. It would be nice to get rid of any estimations by:
> - making hvmloader use normal ram
> - making qemu increase maxmem
> - removing all the estimation from libxl

Sounds like a plan for 4.6

     -Don Slutz
Stefano Stabellini Dec. 3, 2014, 2:50 p.m. UTC | #8
On Wed, 3 Dec 2014, Don Slutz wrote:
> On 12/03/14 07:20, Stefano Stabellini wrote:
> > On Wed, 3 Dec 2014, Wei Liu wrote:
> > > On Tue, Dec 02, 2014 at 03:23:29PM -0500, Don Slutz wrote:
> > > [...]
> > > > > > > >            hw_error("xc_domain_getinfo failed");
> > > > > > > >        }
> > > > > > > > -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
> > > > > > > > -                            (nr_pfn * XC_PAGE_SIZE / 1024)) <
> > > > > > > > 0) {
> > > > > > > > +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
> > > > > > > > +    free_pages = max_pages - info.nr_pages;
> > > > > > > > +    real_free = free_pages;
> > > > > > > > +    if (free_pages > VGA_HOLE_SIZE) {
> > > > > > > > +        free_pages -= VGA_HOLE_SIZE;
> > > > > > > > +    } else {
> > > > > > > > +        free_pages = 0;
> > > > > > > > +    }
> > > > > I don't think we need to subtract VGA_HOLE_SIZE.
> > > > If you do not use some slack (like 32 here), xen does report:
> > > > 
> > > > 
> > > > (d5) [2014-11-29 17:07:21] Loaded SeaBIOS
> > > > (d5) [2014-11-29 17:07:21] Creating MP tables ...
> > > > (d5) [2014-11-29 17:07:21] Loading ACPI ...
> > > > (XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for
> > > > domain
> > > > 5: 1057417 > 1057416
> > > > (XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0
> > > This message is a bit red herring.
> > > 
> > > It's hvmloader trying to populate ram for firmware data. The actual
> > > amount of extra pages needed depends on the firmware.
> > > 
> > > In any case it's safe to disallow hvmloader from doing so, it will just
> > > relocate some pages from ram (hence shrinking *mem_end).
> > That looks like a better solution
> > 
> 
> I went with a "leave some slack" so that the error message above is not
> output.
> 
> When a change to hvmloader is done so that the message does not appear during
> normal usage, the extra pages in QEMU can be dropped.

Although those messages look like fatal errors, they are not. It is the
normal way for hvmloader to operate: first it tries to allocate extra
memory until it gets an error, then it continues with normal memory,
see:

void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
{
    static int over_allocated;
    struct xen_add_to_physmap xatp;
    struct xen_memory_reservation xmr;

    for ( ; nr_mfns-- != 0; mfn++ )
    {
        /* Try to allocate a brand new page in the reserved area. */
        if ( !over_allocated )
        {
            xmr.domid = DOMID_SELF;
            xmr.mem_flags = 0;
            xmr.extent_order = 0;
            xmr.nr_extents = 1;
            set_xen_guest_handle(xmr.extent_start, &mfn);
            if ( hypercall_memory_op(XENMEM_populate_physmap, &xmr) == 1 )
                continue;
            over_allocated = 1;
        }

        /* Otherwise, relocate a page from the ordinary RAM map. */

I think there is really nothing to change there. Maybe we want to make
those warnings less scary.
Don Slutz Dec. 4, 2014, 4:26 p.m. UTC | #9
On 12/03/14 09:50, Stefano Stabellini wrote:
> On Wed, 3 Dec 2014, Don Slutz wrote:
>> On 12/03/14 07:20, Stefano Stabellini wrote:
>>> On Wed, 3 Dec 2014, Wei Liu wrote:
>>>> On Tue, Dec 02, 2014 at 03:23:29PM -0500, Don Slutz wrote:
>>>> [...]
>>>>>>>>>             hw_error("xc_domain_getinfo failed");
>>>>>>>>>         }
>>>>>>>>> -    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
>>>>>>>>> -                            (nr_pfn * XC_PAGE_SIZE / 1024)) <
>>>>>>>>> 0) {
>>>>>>>>> +    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
>>>>>>>>> +    free_pages = max_pages - info.nr_pages;
>>>>>>>>> +    real_free = free_pages;
>>>>>>>>> +    if (free_pages > VGA_HOLE_SIZE) {
>>>>>>>>> +        free_pages -= VGA_HOLE_SIZE;
>>>>>>>>> +    } else {
>>>>>>>>> +        free_pages = 0;
>>>>>>>>> +    }
>>>>>> I don't think we need to subtract VGA_HOLE_SIZE.
>>>>> If you do not use some slack (like 32 here), xen does report:
>>>>>
>>>>>
>>>>> (d5) [2014-11-29 17:07:21] Loaded SeaBIOS
>>>>> (d5) [2014-11-29 17:07:21] Creating MP tables ...
>>>>> (d5) [2014-11-29 17:07:21] Loading ACPI ...
>>>>> (XEN) [2014-11-29 17:07:21] page_alloc.c:1568:d5 Over-allocation for
>>>>> domain
>>>>> 5: 1057417 > 1057416
>>>>> (XEN) [2014-11-29 17:07:21] memory.c:158:d5 Could not allocate order=0
>>>> This message is a bit red herring.
>>>>
>>>> It's hvmloader trying to populate ram for firmware data. The actual
>>>> amount of extra pages needed depends on the firmware.
>>>>
>>>> In any case it's safe to disallow hvmloader from doing so, it will just
>>>> relocate some pages from ram (hence shrinking *mem_end).
>>> That looks like a better solution
>>>
>> I went with a "leave some slack" so that the error message above is not
>> output.
>>
>> When a change to hvmloader is done so that the message does not appear during
>> normal usage, the extra pages in QEMU can be dropped.
> Although those messages look like fatal errors, they are not. It is
> normal way for hvmloader to operate: firstly it tries to allocate extra
> memory until it gets an error, then it continues with normal memory,
> see:
>
> void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
> {
>      static int over_allocated;
>      struct xen_add_to_physmap xatp;
>      struct xen_memory_reservation xmr;
>
>      for ( ; nr_mfns-- != 0; mfn++ )
>      {
>          /* Try to allocate a brand new page in the reserved area. */
>          if ( !over_allocated )
>          {
>              xmr.domid = DOMID_SELF;
>              xmr.mem_flags = 0;
>              xmr.extent_order = 0;
>              xmr.nr_extents = 1;
>              set_xen_guest_handle(xmr.extent_start, &mfn);
>              if ( hypercall_memory_op(XENMEM_populate_physmap, &xmr) == 1 )
>                  continue;
>              over_allocated = 1;
>          }
>
>          /* Otherwise, relocate a page from the ordinary RAM map. */
>
> I think there is really nothing to change there. Maybe we want to make
> those warnings less scary.

It was not so much that hvmloader is the one to change (though having it
check for room first might be good), but more that a change to Xen would
be good (like changing the wording, or maybe only outputting the message
in debug=y builds, etc.).

Maybe a new XENMEMF to suppress the message?


    -Don Slutz
Wei Liu Dec. 4, 2014, 4:38 p.m. UTC | #10
On Thu, Dec 04, 2014 at 11:26:58AM -0500, Don Slutz wrote:
[...]
> >those warnings less scary.
> 
> It was not so much that hvmloader is the one to change (but having it check
> for room first might be good), but more that a change to xen would be good
> (like changing the wording or maybe only output in debug=y builds, etc.).
> 
> Maybe a new XENMEMF to suppress the message?
> 

I don't think it should work like that. That message is perfectly
legitimate -- a domain is asking for more memory than it should and Xen
rightfully rejects that. Having a guest-controllable way to suppress it
is a bad idea.

Wei.

> 
>    -Don Slutz
> 
> 
> 
>

Patch

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -67,6 +67,7 @@  static inline ioreq_t *xen_vcpu_ioreq(shared_iopage_t 
*shared_page, int vcpu)
  #endif

  #define BUFFER_IO_MAX_DELAY  100
+#define VGA_HOLE_SIZE (0x20)

  typedef struct XenPhysmap {
      hwaddr start_addr;
@@ -219,6 +220,11 @@  void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t 
size, MemoryRegion *mr)
      xen_pfn_t *pfn_list;
      int i;
      xc_dominfo_t info;
+    unsigned long max_pages, free_pages, real_free;
+    long need_pages;
+    uint64_t tot_pages, pod_cache_pages, pod_entries;
+
+    trace_xen_ram_alloc(ram_addr, size, mr->name);

      if (runstate_check(RUN_STATE_INMIGRATE)) {
          /* RAM already populated in Xen */
@@ -232,13 +238,6 @@  void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t 
size, MemoryRegion *mr)
          return;
      }

-    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
-            " bytes (%ld Kib) of ram at "RAM_ADDR_FMT
-            " mr.name=%s\n",
-            __func__, size, (long)(size>>10), ram_addr, mr->name);
-
-    trace_xen_ram_alloc(ram_addr, size);
-
      nr_pfn = size >> TARGET_PAGE_BITS;
      pfn_list = g_malloc(sizeof (*pfn_list) * nr_pfn);

@@ -246,11 +245,38 @@  void xen_ram_alloc(ram_addr_t ram_addr, ram_addr_t 
size, MemoryRegion *mr)
          pfn_list[i] = (ram_addr >> TARGET_PAGE_BITS) + i;
      }

-    if (xc_domain_getinfo(xen_xc, xen_domid, 1, &info) < 0) {
+    if ((xc_domain_getinfo(xen_xc, xen_domid, 1, &info) != 1) ||
+       (info.domid != xen_domid)) {
          hw_error("xc_domain_getinfo failed");
      }
-    if (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
-                            (nr_pfn * XC_PAGE_SIZE / 1024)) < 0) {
+    max_pages = info.max_memkb * 1024 / XC_PAGE_SIZE;
+    free_pages = max_pages - info.nr_pages;
+    real_free = free_pages;
+    if (free_pages > VGA_HOLE_SIZE) {
+        free_pages -= VGA_HOLE_SIZE;
+    } else {
+        free_pages = 0;
+    }
+    need_pages = nr_pfn - free_pages;
+    fprintf(stderr, "%s: alloc "RAM_ADDR_FMT
+            " bytes (%ld KiB) of ram at "RAM_ADDR_FMT
+            " mp=%ld fp=%ld nr=%ld need=%ld mr.name=%s\n",
+            __func__, size, (long)(size>>10), ram_addr,
+           max_pages, free_pages, nr_pfn, need_pages,
+            mr->name);
+    if (xc_domain_get_pod_target(xen_xc, xen_domid, &tot_pages,
+                                 &pod_cache_pages, &pod_entries) >= 0) {
+        unsigned long populated = tot_pages - pod_cache_pages;
+        long delta_tot = tot_pages - info.nr_pages;
+
+        fprintf(stderr, "%s: PoD pop=%ld tot=%ld(%ld) cnt=%ld ent=%ld 
nop=%ld free=%ld\n",
+                __func__, populated, (long)tot_pages, delta_tot,
+                (long)pod_cache_pages, (long)pod_entries,
+               (long)info.nr_outstanding_pages, real_free);
+    }
+    if ((free_pages < nr_pfn) &&
+       (xc_domain_setmaxmem(xen_xc, xen_domid, info.max_memkb +
+                            (need_pages * XC_PAGE_SIZE / 1024)) < 0)) {
          hw_error("xc_domain_setmaxmem failed");
      }
      if (xc_domain_populate_physmap_exact(xen_xc, xen_domid, nr_pfn, 0, 
0, pfn_list)) {


Note: I already had a trace_xen_ram_alloc, here is the delta in 
trace-events (which
has others not yet sent):



diff --git a/trace-events b/trace-events
index eaeecd5..a58fc11 100644
--- a/trace-events
+++ b/trace-events
@@ -908,7 +908,7 @@  pvscsi_tx_rings_ppn(const char* label, uint64_t ppn) 
"%s page: %"PRIx64""
  pvscsi_tx_rings_num_pages(const char* label, uint32_t num) "Number of 
%s pages: %u"

  # xen-all.c
-xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: 
%#lx, size %#lx"
+xen_ram_alloc(unsigned long ram_addr, unsigned long size, const char*