
[5/5] Add single-threaded path to _int_malloc

Message ID DB6PR0801MB20538FB1206136EFA224D622834B0@DB6PR0801MB2053.eurprd08.prod.outlook.com
State New
Series: Add single-threaded fast path to malloc

Commit Message

Wilco Dijkstra Oct. 12, 2017, 9:35 a.m. UTC
This patch adds a single-threaded fast path to _int_malloc.
Since the goal is to use a single test to skip multiple atomic
operations, some of the fastbin allocation code has to be
duplicated.
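
To make the trade-off concrete, here is a minimal, self-contained sketch of
the idea (simplified stand-ins, not glibc's actual code): a single-threaded
check lets the hot path pop a fastbin chunk with a plain load and store
instead of the compare-and-swap loop that REMOVE_FB performs.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a fastbin: a LIFO list of free chunks linked through fd.
   "single_threaded" stands in for glibc's SINGLE_THREAD_P macro.  */
struct chunk { struct chunk *fd; };

static _Atomic (struct chunk *) fastbin_head;
static bool single_threaded = true;

static struct chunk *
fastbin_pop (void)
{
  struct chunk *victim;

  if (single_threaded)
    {
      /* Fast path: plain load and store, no atomic read-modify-write.  */
      victim = atomic_load_explicit (&fastbin_head, memory_order_relaxed);
      if (victim != NULL)
        atomic_store_explicit (&fastbin_head, victim->fd,
                               memory_order_relaxed);
      return victim;
    }

  /* Multi-threaded path: a CAS loop, roughly what REMOVE_FB does
     (ignoring the ABA and arena-locking details of the real code).  */
  victim = atomic_load_explicit (&fastbin_head, memory_order_acquire);
  while (victim != NULL
         && !atomic_compare_exchange_weak (&fastbin_head, &victim,
                                           victim->fd))
    ;
  return victim;
}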

Passes GLIBC tests, OK for commit?

ChangeLog:
2017-10-11  Wilco Dijkstra  <wdijkstr@arm.com>

	* malloc/malloc.c (_int_malloc): Add SINGLE_THREAD_P path.
--

Comments

Wilco Dijkstra Oct. 20, 2017, 2:51 p.m. UTC | #1
DJ Delorie wrote:

> +      if (SINGLE_THREAD_P)

> Would a suitable __glibc_unlikely make sense here?  I can't convince
> myself that either of these two cases would be rare enough to move to
> cold sections, but I posit the question anyway...

How do you mean? SINGLE_THREAD_P is indeed the more likely case, but it's
not so skewed that it makes sense to treat the multithreaded case as cold...
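
(For reference, these are only branch-prediction hints; simplified from
glibc's misc/sys/cdefs.h, where the real definitions sit behind a GCC
version check:)

#define __glibc_unlikely(cond)  __builtin_expect ((cond), 0)
#define __glibc_likely(cond)    __builtin_expect ((cond), 1)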

> +           void *p = chunk2mem (victim);
> +           alloc_perturb (p, bytes);
> +           return p;
> +         }

> The overall duplication of code looks... undesirable... but given
> there's a macro in the middle we're changing, I don't see a good way to
> factor it out, unless the SINGLE_THREAD_P test is so cheap we can
> instead rewrite the REMOVE_FB() macro to use it instead of duplicating
> code.  Could/did you consider that?  If the benchmarks show no
> significant difference, I'd prefer cleaner code over an insignificant
> speedup.  Might even call SINGLE_THREAD_P once and store the result in a
> register, to let the compiler optimize it better.

See the new version below, which avoids the duplication. It's slightly
slower, but not significantly, so it seems best to do it this way for now.
However, moving items from the fast bin to the tcache could be done much
more efficiently in the future.
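
One possible shape of such a batch move, sketched on a toy singly linked
list rather than the real malloc_chunk/tcache_entry types (it ignores the
tcache counts, the chunk/user-pointer conversion and the order in which
entries end up in the bin):

#include <stddef.h>

/* Toy model, not glibc code: both the fast bin and the tcache bin are
   LIFO lists linked through the same "next" field.  */
struct entry { struct entry *next; };

/* Move up to LIMIT entries from *FASTBIN to *TCACHE_BIN by relinking
   the nodes once, instead of popping and pushing them one at a time.  */
static size_t
splice_into_tcache (struct entry **fastbin, struct entry **tcache_bin,
                    size_t limit)
{
  struct entry *first = *fastbin;
  if (first == NULL || limit == 0)
    return 0;

  /* Find the last node of the run that will be moved.  */
  struct entry *last = first;
  size_t moved = 1;
  while (moved < limit && last->next != NULL)
    {
      last = last->next;
      moved++;
    }

  /* Detach the run from the fast bin and prepend it to the tcache bin.  */
  *fastbin = last->next;
  last->next = *tcache_bin;
  *tcache_bin = first;
  return moved;
}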

> Remember, "fast at any cost" is not the goal.  "Fast while still
> maintainable" is ;-)

However, those typically go together: removing needless complexity results
in performance gains. The code should be restructured to use more inline
function calls. We could also use the ifunc mechanism to decide between
different variants, including, for example, the single-thread optimization.
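
As a hypothetical sketch of that direction (my_malloc and the two variants
are made-up names, and glibc's malloc does not use an ifunc today), GCC's
ifunc attribute lets a resolver pick the implementation once, at relocation
time:

#include <stdlib.h>

typedef void *(*malloc_fn) (size_t);

/* Two made-up variants; real ones would differ in what they check.  */
static void *malloc_single_thread (size_t n) { return malloc (n); }
static void *malloc_multi_thread (size_t n) { return malloc (n); }

/* Placeholder for a decision that must be known at load time; the
   resolver runs before the program can create any threads.  */
static int use_single = 1;

static malloc_fn
resolve_malloc (void)
{
  return use_single ? malloc_single_thread : malloc_multi_thread;
}

void *my_malloc (size_t n) __attribute__ ((ifunc ("resolve_malloc")));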

Wilco


ChangeLog:
2017-10-11  Wilco Dijkstra  <wdijkstr@arm.com>

	* malloc/malloc.c (_int_malloc): Add SINGLE_THREAD_P path.

diff --git a/malloc/malloc.c b/malloc/malloc.c
index ca5cfff3a1b1882ae608219fdec973b7f13cbb21..74eb961baba44aa6f11d6bb5e6408cfaaa7c3c47 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3569,37 +3569,50 @@ _int_malloc (mstate av, size_t bytes)
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
-      mchunkptr pp = *fb;
-      REMOVE_FB (fb, victim, pp);
-      if (victim != 0)
-        {
-          if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
-	    malloc_printerr ("malloc(): memory corruption (fast)");
-          check_remalloced_chunk (av, victim, nb);
-#if USE_TCACHE
-	  /* While we're here, if we see other chunks of the same size,
-	     stash them in the tcache.  */
-	  size_t tc_idx = csize2tidx (nb);
-	  if (tcache && tc_idx < mp_.tcache_bins)
-	    {
-	      mchunkptr tc_victim;
+      mchunkptr pp;
+      victim = *fb;
 
-	      /* While bin not empty and tcache not full, copy chunks over.  */
-	      while (tcache->counts[tc_idx] < mp_.tcache_count
-		     && (pp = *fb) != NULL)
+      if (victim != NULL)
+	{
+	  if (SINGLE_THREAD_P)
+	    *fb = victim->fd;
+	  else
+	    REMOVE_FB (fb, pp, victim);
+	  if (__glibc_likely (victim != NULL))
+	    {
+	      size_t victim_idx = fastbin_index (chunksize (victim));
+	      if (__builtin_expect (victim_idx != idx, 0))
+		malloc_printerr ("malloc(): memory corruption (fast)");
+	      check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	      /* While we're here, if we see other chunks of the same size,
+		 stash them in the tcache.  */
+	      size_t tc_idx = csize2tidx (nb);
+	      if (tcache && tc_idx < mp_.tcache_bins)
 		{
-		  REMOVE_FB (fb, tc_victim, pp);
-		  if (tc_victim != 0)
+		  mchunkptr tc_victim;
+
+		  /* While bin not empty and tcache not full, copy chunks.  */
+		  while (tcache->counts[tc_idx] < mp_.tcache_count
+			 && (tc_victim = *fb) != NULL)
 		    {
+		      if (SINGLE_THREAD_P)
+			*fb = tc_victim->fd;
+		      else
+			{
+			  REMOVE_FB (fb, pp, tc_victim);
+			  if (__glibc_unlikely (tc_victim == NULL))
+			    break;
+			}
 		      tcache_put (tc_victim, tc_idx);
-	            }
+		    }
 		}
-	    }
 #endif
-          void *p = chunk2mem (victim);
-          alloc_perturb (p, bytes);
-          return p;
-        }
+	      void *p = chunk2mem (victim);
+	      alloc_perturb (p, bytes);
+	      return p;
+	    }
+	}
     }
 
   /*
Wilco Dijkstra Oct. 24, 2017, 1:49 p.m. UTC | #2
DJ Delorie wrote:
>
> Looks good to me.

OK it's all committed now. Thanks for the reviews!

> > We could use the ifunc mechanism to decide between different variants - 
> > including for example for the single-thread optimization.
>
> I suspect ifuncs are chosen before we know if the program will spawn any
> threads, though.

I was thinking of dynamically updating the ifunc, for example if we want to trace,
hook or enable more expensive checking. That way there is no need to check
for all special cases on every call, and additional code can be enabled or
disabled at runtime by swapping to a different variant of malloc/free.

Wilco
Szabolcs Nagy Oct. 24, 2017, 2:14 p.m. UTC | #3
On 24/10/17 14:49, Wilco Dijkstra wrote:
> DJ Delorie wrote:
>>
>> Looks good to me.
> 
> OK it's all committed now. Thanks for the reviews!
> 
>>> We could use the ifunc mechanism to decide between different variants - 
>>> including for example for the single-thread optimization.
>>
>> I suspect ifuncs are chosen before we know if the program will spawn any
>> threads, though.
> 
> I was thinking of dynamically updating the ifunc, for example if we want to trace,
> hook or enable more expensive checking. That way there is no need to check
> for all special cases on every call, and additional code can be enabled or
> disabled at runtime by swapping to a different variant of malloc/free.
> 

The address of malloc must be constant during the lifetime of a process,
since it can be saved in a variable that is later used as a function
pointer.

You would need an additional indirection for this to work
(e.g. implement it as void *malloc(size_t n){ return fptr(n); },
where fptr is a global that's updated at thread creation time),
at which point it doesn't have much to do with ifunc.
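
A minimal sketch of that extra indirection (hypothetical names; this is
not how glibc is actually structured):

#include <stdlib.h>

/* Placeholder variants; only their addresses matter for the sketch.  */
static void *malloc_single_thread (size_t n) { return malloc (n); }
static void *malloc_multi_thread (size_t n) { return malloc (n); }

/* The exported function keeps one fixed address for the whole process;
   only the global pointer behind it is ever swapped.  */
static void *(*malloc_impl) (size_t) = malloc_single_thread;

void *
my_malloc (size_t n)
{
  return malloc_impl (n);
}

/* Called from the thread-creation path, before a second thread can run,
   so no synchronization is shown here.  */
void
switch_to_multi_thread_malloc (void)
{
  malloc_impl = malloc_multi_thread;
}
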
Wilco Dijkstra Oct. 24, 2017, 2:42 p.m. UTC | #4
Szabolcs Nagy wrote:

> the address of malloc must be const during the lifetime of a
> process, since it can be saved in a variable that can be later
> used as func ptr.

The application always gets the address of the fixed PLT entry, including
in a statically linked binary, so calling via a function pointer will still
go through a PLT indirection. I'm not sure what happens inside a dynamic
library itself, but that would work fine if it gets the address of the same
PLT entry (if not, the addresses would already be different today).

Wilco

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index f4f44400d120188c4d0bece996380e04b35c8fac..11fb93fbc983d5895ea05a9c9dc262d2d9d8923c 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3565,37 +3565,77 @@  _int_malloc (mstate av, size_t bytes)
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
-      mchunkptr pp = *fb;
-      REMOVE_FB (fb, victim, pp);
-      if (victim != 0)
-        {
-          if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
-	    malloc_printerr ("malloc(): memory corruption (fast)");
-          check_remalloced_chunk (av, victim, nb);
-#if USE_TCACHE
-	  /* While we're here, if we see other chunks of the same size,
-	     stash them in the tcache.  */
-	  size_t tc_idx = csize2tidx (nb);
-	  if (tcache && tc_idx < mp_.tcache_bins)
-	    {
-	      mchunkptr tc_victim;
+      mchunkptr next;
 
-	      /* While bin not empty and tcache not full, copy chunks over.  */
-	      while (tcache->counts[tc_idx] < mp_.tcache_count
-		     && (pp = *fb) != NULL)
+      if (SINGLE_THREAD_P)
+	{
+	  victim = *fb;
+	  if (victim != NULL)
+	    {
+	      size_t victim_idx = fastbin_index (chunksize (victim));
+	      next = victim->fd;
+	      if (__builtin_expect (victim_idx != idx, 0))
+		malloc_printerr ("malloc(): memory corruption (fast)");
+	      check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	      /* While we're here, if we see other chunks of the same size,
+		 stash them in the tcache.  */
+	      size_t tc_idx = csize2tidx (nb);
+	      if (next != NULL && tcache && tc_idx < mp_.tcache_bins)
 		{
-		  REMOVE_FB (fb, tc_victim, pp);
-		  if (tc_victim != 0)
+		  mchunkptr tc_victim;
+
+		  /* While bin not empty and tcache not full, copy chunks.  */
+		  while (tcache->counts[tc_idx] < mp_.tcache_count)
 		    {
+		      tc_victim = next;
+		      next = tc_victim->fd;
 		      tcache_put (tc_victim, tc_idx);
-	            }
+		      if (next == NULL)
+			break;
+		    }
 		}
-	    }
 #endif
-          void *p = chunk2mem (victim);
-          alloc_perturb (p, bytes);
-          return p;
+	      *fb = next;
+	      void *p = chunk2mem (victim);
+	      alloc_perturb (p, bytes);
+	      return p;
+	    }
         }
+      else
+	{
+	  next = *fb;
+	  REMOVE_FB (fb, victim, next);
+	  if (victim != 0)
+	    {
+	      size_t victim_idx = fastbin_index (chunksize (victim));
+	      if (__builtin_expect (victim_idx != idx, 0))
+		malloc_printerr ("malloc(): memory corruption (fast)");
+	      check_remalloced_chunk (av, victim, nb);
+#if USE_TCACHE
+	      /* While we're here, if we see other chunks of the same size,
+		 stash them in the tcache.  */
+	      size_t tc_idx = csize2tidx (nb);
+	      if (tcache && tc_idx < mp_.tcache_bins)
+		{
+		  mchunkptr tc_victim;
+
+		  /* While bin not empty and tcache not full, copy chunks.  */
+		  while (tcache->counts[tc_idx] < mp_.tcache_count
+			 && (next = *fb) != NULL)
+		    {
+		      REMOVE_FB (fb, tc_victim, next);
+		      if (tc_victim == NULL)
+			break;
+		      tcache_put (tc_victim, tc_idx);
+		    }
+		}
+#endif
+	      void *p = chunk2mem (victim);
+	      alloc_perturb (p, bytes);
+	      return p;
+	    }
+	}
     }
 
   /*