Patchwork RFC: Cache LTO streamer mappings

Submitter Andi Kleen
Date Oct. 9, 2011, 5:11 p.m.
Message ID <1318180303-684-1-git-send-email-andi@firstfloor.org>
Permalink /patch/118605/
State New

Comments

Andi Kleen - Oct. 9, 2011, 5:11 p.m.
From: Andi Kleen <ak@linux.intel.com>

Currently the LTO file mappings use a simple one-off cache.
This doesn't match the access pattern very well.

This patch adds a real LRU of LTO mappings with a total limit.
Each file is completely mapped now instead of only specific
sections. This addresses one of the FIXME comments in LTO.

The limit is 256GB on 64bit and 256MB on 32bit. The limit
can be temporarily exceeded by a single file. The whole
file has to fit into the address space now. This may increase
the address space requirements a bit.

I originally wrote this in an attempt to minimize fragmentation
of the virtual memory map, but it didn't make too much difference
for that because it was all caused by GGC.

Also on my fairly large builds it didn't make a measurable
compile time difference, probably because it was shadowed
by other much slower passes. That is why I'm just sending it as an
RFC. It certainly complicates the code somewhat.

Maybe if people have other LTO builds they could try if it makes a
difference for them.

Diego Novillo - Oct. 12, 2011, 12:14 p.m.
On Sun, Oct 9, 2011 at 13:11, Andi Kleen <andi@firstfloor.org> wrote:

> Is it still a good idea?

Given that you found no speedups and it introduces added complexity, I
think it's best if we revisit the idea later.  I never found bytecode
reading to be a bottleneck in LTO, but perhaps Jan can comment on what
the experience is with Mozilla builds.


Diego.
Jan Hubicka - Oct. 12, 2011, 12:25 p.m.
> On Sun, Oct 9, 2011 at 13:11, Andi Kleen <andi@firstfloor.org> wrote:
> 
> > Is it still a good idea?
> 
> Given that you found no speedups and it introduces added complexity, I
> think it's best if we revisit the idea later.  I never found bytecode
> reading to be a bottleneck in LTO, but perhaps Jan can comment on what
> the experience is with Mozilla builds.

WPA is currently about 1/3 of reading&type merging, 1/3 of streaming out and
1/3 of inlining.  inlining is relatively easy to cure, so yes, streaming
performance is important.  The very basic streaming primitives actually still
show up at the top of the profile along with hashing and type comparing code.  I will post
some updated oprofiles into Mozilla PR.

Honestly I think we won't get any great speedups unless we work on reducing the amount of
unnecessary info we pickle/unpickle.

Honza
> 
> 
> Diego.
Diego Novillo - Oct. 12, 2011, 12:43 p.m.
On 11-10-12 08:25 , Jan Hubicka wrote:

> WPA is currently about 1/3 of reading&type merging, 1/3 of streaming out and
> 1/3 of inlining.  inlining is relatively easy to cure, so yes, streaming
> performance is important.  The very basic streaming primitives actually still
> show up at the top of the profile along with hashing and type comparing code.  I will post
> some updated oprofiles into Mozilla PR.

OK, thanks.  My numbers are from very early LTO development.

> Honestly I think we won't get any great speedups unless we work on reducing the amount of
> unnecessary info we pickle/unpickle.

That's what I was leaning towards.  Optimizing the basic access patterns 
may not buy us as much as just reducing the amount of clutter we have to 
deal with.  It may make sense, however, as a subsequent optimization.


Diego.
Jan Hubicka - Oct. 12, 2011, 2:43 p.m.
> On 11-10-12 08:25 , Jan Hubicka wrote:
>
>> WPA is currently about 1/3 of reading&type merging, 1/3 of streaming out and
>> 1/3 of inlining.  inlining is relatively easy to cure, so yes, streaming
>> performance is important.  The very basic streaming primitives actually still
>> show up at the top of the profile along with hashing and type comparing code.  I will post
>> some updated oprofiles into Mozilla PR.
>
> OK, thanks.  My numbers are from very early LTO development.

Yeah, the problem is minor on small projects and C projects. C++ tends to carry
a lot of context with it - both in the files streamed from compilation to WPA
(a lot of types and such) as well as into individual ltrans units.

We still need to stream in and out about 2GB from WPA to ltrans (combined sizes
of ltrans0 to ltrans31) and since we are at 3 minutes of compilation now,
seconds actually count.

>
>> Honestly I think we won't get any great speedups unless we work on reducing the amount of
>> unnecessary info we pickle/unpickle.
>
> That's what I was leaning towards.  Optimizing the basic access patterns  
> may not buy us as much as just reducing the amount of clutter we have to  
> deal with.  It may make sense, however, as a subsequent optimization.

I will give this patch a try on Mozilla to see if I can report some positive numbers.
Obviously keeping the basic I/O efficient is also important.

Honza
>
>
> Diego.

Patch


Is it still a good idea?

Passes a full LTO bootstrap plus test suite on x86-64-linux.

gcc/lto/:

2011-10-05   Andi Kleen <ak@linux.intel.com>

	* lto.c (list_head, mapped_file): Add.
	(page_mask): Rename to page_size.
	(MB, GB, max_mapped, cur_mapped, mapped_lru, list_add, list_del): Add.
	(mf_lru_enforce_limit, mf_hashtable, mf_lru_finish_cache, mf_eq): Add.
	(mf_hash, mf_lookup_or_create): Add.
	(lto_read_section_data): Split into two ifdef versions.
	Implement version using LRU cache. Add more error checks.
	(mf_lru_finish_cache): Add dummy in ifdef.
	(free_section_data): Rewrite for LRU.
	(read_cgraph_and_symbols): Call mf_lru_finish_cache.
---
 gcc/lto/lto.c |  292 ++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 238 insertions(+), 54 deletions(-)

diff --git a/gcc/lto/lto.c b/gcc/lto/lto.c
index a77eeb4..29dc3b8 100644
--- a/gcc/lto/lto.c
+++ b/gcc/lto/lto.c
@@ -1141,6 +1141,30 @@  lto_file_finalize (struct lto_file_decl_data *file_data, lto_file *file)
   lto_free_section_data (file_data, LTO_section_decls, NULL, data, len);
 }
 
+/* A list node or head */
+struct list_head
+  {
+    struct list_head *next;  /* Next node */
+    struct list_head *prev;  /* Prev node */
+  };
+
+/* Cache of mapped files */
+struct mapped_file
+  {
+    struct list_head lru; /* LRU list. Must be first. */
+    char *map;		/* Mapping of the file */
+    size_t size;	/* Size of mapping (rounded up) */
+    int refcnt; 	/* Number of users */
+    const char *filename;	/* File name */
+  };
+
+struct lwstate
+{
+  lto_file *file;
+  struct lto_file_decl_data **file_data;
+  int *count;
+};
+
 /* Finalize FILE_DATA in FILE and increase COUNT. */
 
 static int 
@@ -1200,65 +1224,213 @@  lto_file_read (lto_file *file, FILE *resolution_file, int *count)
 #endif
 
 #if LTO_MMAP_IO
+
 /* Page size of machine is used for mmap and munmap calls.  */
-static size_t page_mask;
-#endif
+static size_t page_size;
+
+#define MB ((unsigned HOST_WIDE_INT) 1 << 20)
+#define GB ((unsigned HOST_WIDE_INT) 1 << 30)
+
+/* Limit of mapped files */
+static unsigned HOST_WIDE_INT max_mapped = sizeof(void *) > 4 ? 256*GB : 256*MB;
+
+/* Total size of currently mapped files */
+static unsigned HOST_WIDE_INT cur_mapped; 
+
+/* LRU of mapped files */
+static struct list_head mapped_lru = { &mapped_lru, &mapped_lru };
+
+/* Add NODE into list HEAD */
+
+static void 
+list_add(struct list_head *node, struct list_head *head)
+{
+  struct list_head *prev = head;
+  struct list_head *next = head->next;
+
+  next->prev = node;
+  node->next = next;
+  node->prev = prev;
+  prev->next = node;
+}
+
+/* Remove NODE from list. */
+
+static void 
+list_del(struct list_head *node)
+{
+  struct list_head *prev = node->prev;
+  struct list_head *next = node->next;
+   
+  if (!next && !prev)
+    return;
+  next->prev = prev;
+  prev->next = next;
+  node->next = NULL;
+  node->prev = NULL;
+}
+
+/* Enforce the global LRU limit MAX when the commitment changes by INCREMENT. */
+
+static void
+mf_lru_enforce_limit (unsigned HOST_WIDE_INT increment, unsigned HOST_WIDE_INT max)
+{
+  struct mapped_file *mf;
+  unsigned HOST_WIDE_INT new_mapped = cur_mapped + increment;
+  struct list_head *node, *prev;
+
+  for (node = mapped_lru.prev; new_mapped > max && node != &mapped_lru; node = prev) 
+    {
+      prev = node->prev;
+      mf = (struct mapped_file *) node;
+      if (mf->refcnt > 0)
+        continue;
+      munmap (mf->map, mf->size);
+      mf->map = NULL;
+      new_mapped -= mf->size;
+      list_del (node);
+    }
+
+  cur_mapped = new_mapped;
+}
+
+/* Hash table of mapped_files */
+static htab_t mf_hashtable;
+
+/* Free all mappings in the hash table. */
+
+static void
+mf_lru_finish_cache (void)
+{
+  mf_lru_enforce_limit (0, 0);
+  gcc_assert (mapped_lru.next == mapped_lru.prev);
+  htab_delete (mf_hashtable);
+  mf_hashtable = NULL;
+}
+
+/* Compare hash table entries A and B. */
+
+static int 
+mf_eq (const void *a, const void *b)
+{
+  const struct mapped_file *as = (const struct mapped_file *)a;
+  const struct mapped_file *bs = (const struct mapped_file *)b;
+
+  return !strcmp (as->filename, bs->filename);
+}
+
+/* Hash symbol A. */
+
+static hashval_t 
+mf_hash (const void *a)
+{
+  const struct mapped_file *as = (const struct mapped_file *)a;
 
+  return htab_hash_string (as->filename);
+}
+
+/* Look up mapped_file of FILE_DATA. */
+
+static struct mapped_file *
+mf_lookup_or_create (struct lto_file_decl_data *file_data)
+{
+  struct mapped_file *mf, search;
+  void **slot;
+
+  if (mf_hashtable == NULL)
+    mf_hashtable = htab_create (31, mf_hash, mf_eq, NULL);
+  
+  search.filename = file_data->file_name;
+  slot = htab_find_slot (mf_hashtable, &search, INSERT);
+  mf = (struct mapped_file *)*slot;
+  if (mf == NULL)
+    {
+      mf = XCNEW (struct mapped_file);
+      mf->filename = file_data->file_name;
+      *slot = mf;
+    }
+  
+  return mf;
+}
+  
 /* Get the section data of length LEN from FILENAME starting at
    OFFSET.  The data segment must be freed by the caller when the
-   caller is finished.  Returns NULL if all was not well.  */
+   caller is finished.  Returns NULL if all was not well.  
+  
+   mmap version. Keep a cache of mapped files up to a limit. */
 
 static char *
 lto_read_section_data (struct lto_file_decl_data *file_data,
 		       intptr_t offset, size_t len)
 {
-  char *result;
-  static int fd = -1;
-  static char *fd_name;
-#if LTO_MMAP_IO
-  intptr_t computed_len;
-  intptr_t computed_offset;
-  intptr_t diff;
-#endif
+  unsigned HOST_WIDE_INT size;
+  struct mapped_file *mf;
+  int fd;
+  struct stat st;
+  char *map;
 
-  /* Keep a single-entry file-descriptor cache.  The last file we
-     touched will get closed at exit.
-     ???  Eventually we want to add a more sophisticated larger cache
-     or rather fix function body streaming to not stream them in
-     practically random order.  */
-  if (fd != -1
-      && filename_cmp (fd_name, file_data->file_name) != 0)
-    {
-      free (fd_name);
-      close (fd);
-      fd = -1;
-    }
-  if (fd == -1)
+  mf = mf_lookup_or_create (file_data);
+      
+  /* Map file if needed */
+  if (mf->map == NULL)
     {
+      gcc_assert (mf->refcnt == 0);
+
       fd = open (file_data->file_name, O_RDONLY|O_BINARY);
-      if (fd == -1)
-	return NULL;
-      fd_name = xstrdup (file_data->file_name);
-    }
+      if (fd < 0)
+        return NULL;
 
-#if LTO_MMAP_IO
-  if (!page_mask)
-    {
-      size_t page_size = sysconf (_SC_PAGE_SIZE);
-      page_mask = ~(page_size - 1);
+      if (fstat (fd, &st) < 0)
+        {
+          close (fd);
+          return NULL;
+        }
+
+      if (!page_size)
+        page_size = sysconf (_SC_PAGE_SIZE);
+      size = (st.st_size + page_size - 1) & ~(page_size - 1);
+
+      mf_lru_enforce_limit (size, max_mapped);
+
+      map = (char *) mmap (NULL, size, PROT_READ, MAP_PRIVATE,
+		           fd, 0);
+      close (fd);
+      if (map == MAP_FAILED)
+        {
+	  internal_error ("unable to map file %s", file_data->file_name);
+          return NULL;
+        }
+
+      mf->map = map;
+      mf->size = size;
     }
 
-  computed_offset = offset & page_mask;
-  diff = offset - computed_offset;
-  computed_len = len + diff;
+  mf->refcnt++;
+
+  /* Move mapping to front of LRU */
+  list_del (&mf->lru);
+  list_add (&mf->lru, &mapped_lru);
 
-  result = (char *) mmap (NULL, computed_len, PROT_READ, MAP_PRIVATE,
-			  fd, computed_offset);
-  if (result == MAP_FAILED)
-    return NULL;
+  if ((unsigned HOST_WIDE_INT)offset + len > mf->size)
+    internal_error ("out of bounds LTO file access in %s", 
+		    file_data->file_name);
+  return mf->map + offset;
+}
 
-  return result + diff;
 #else
+
+/* Get the section data of length LEN from FILENAME starting at
+   OFFSET.  The data segment must be freed by the caller when the
+   caller is finished.  Returns NULL if all was not well.  
+
+   Dumb fallback malloc version without caching. */
+
+static char *
+lto_read_section_data (struct lto_file_decl_data *file_data,
+		       intptr_t offset, size_t len)
+{
+  char *result;
+
   result = (char *) xmalloc (len);
   if (lseek (fd, offset, SEEK_SET) != offset
       || read (fd, result, len) != (ssize_t) len)
@@ -1276,9 +1448,16 @@  lto_read_section_data (struct lto_file_decl_data *file_data,
   fd = -1;
 #endif
   return result;
-#endif
 }    
 
+/* Free all mappings in the LRU. */
+
+static void
+mf_lru_finish_cache (void)
+{
+}
+#endif
+
 
 /* Get the section data from FILE_DATA of SECTION_TYPE with NAME.
    NAME will be NULL unless the section type is for a function
@@ -1317,20 +1496,22 @@  static void
 free_section_data (struct lto_file_decl_data *file_data ATTRIBUTE_UNUSED,
 		   enum lto_section_type section_type ATTRIBUTE_UNUSED,
 		   const char *name ATTRIBUTE_UNUSED,
-		   const char *offset, size_t len ATTRIBUTE_UNUSED)
+		   const char *offset ATTRIBUTE_UNUSED, 
+		   size_t len ATTRIBUTE_UNUSED)
 {
 #if LTO_MMAP_IO
-  intptr_t computed_len;
-  intptr_t computed_offset;
-  intptr_t diff;
-#endif
-
-#if LTO_MMAP_IO
-  computed_offset = ((intptr_t) offset) & page_mask;
-  diff = (intptr_t) offset - computed_offset;
-  computed_len = len + diff;
-
-  munmap ((caddr_t) computed_offset, computed_len);
+  struct mapped_file *mf, search;
+
+  search.filename = file_data->file_name;
+  mf = (struct mapped_file *) htab_find (mf_hashtable, &search);
+  gcc_assert (mf != NULL);
+  gcc_assert (mf->map != NULL);
+  gcc_assert (mf->refcnt > 0);
+  gcc_assert (offset >= mf->map && offset + len <= mf->map + mf->size);
+
+  mf->refcnt--;
+  /* The actual freeing is done later when we're all done or run out of 
+     space. */
 #else
   free (CONST_CAST(char *, offset));
 #endif
@@ -2608,6 +2789,9 @@  read_cgraph_and_symbols (unsigned nfiles, const char **fnames)
   else
     ipa_read_summaries ();
 
+  /* Unmap all mapped files */
+  mf_lru_finish_cache ();
+
   /* Finally merge the cgraph according to the decl merging decisions.  */
   timevar_push (TV_IPA_LTO_CGRAPH_MERGE);
   if (cgraph_dump_file)