Patchwork [v2,net-next] inet: Get critical word in first 64bit of cache line

Submitter Eric Dumazet
Date Feb. 2, 2013, 3:03 p.m.
Message ID <1359817435.30177.70.camel@edumazet-glaptop>
Permalink /patch/217667/
State Changes Requested
Delegated to: David Miller

Comments

Eric Dumazet - Feb. 2, 2013, 3:03 p.m.
From: Ma Ling <ling.ma.program@gmail.com>

In order to reduce memory latency when a last-level cache miss occurs,
modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
Early Restart (ER) to get data ASAP. For CWF, if the critical word is the
first member in the cache line, memory feeds the CPU with the critical
word, then fills the other data in the cache line one by one; otherwise,
after the critical word, it costs more cycles to fill the remaining
cache line. For Early Restart, the CPU restarts as soon as the critical
word in the cache line arrives.

The hash value is the critical word, so in this patch we place it as the
first member in the cache line (the sock address is cache-line aligned),
which is also good for Early Restart platforms.

[edumazet: respin on net-next after commit ce43b03e8889]

Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
---
 include/net/sock.h |   26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
saeed bishara - Feb. 3, 2013, 9 p.m.
On Sat, Feb 2, 2013 at 5:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when a last-level cache miss occurs,
> modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> Early Restart (ER) to get data ASAP. For CWF, if the critical word is the
> first member in the cache line, memory feeds the CPU with the critical
> word, then fills the other data in the cache line one by one; otherwise,
> after the critical word, it costs more cycles to fill the remaining
> cache line. For Early Restart, the CPU restarts as soon as the critical
> word in the cache line arrives.
>
> The hash value is the critical word, so in this patch we place it as the
> first member in the cache line (the sock address is cache-line aligned),
> which is also good for Early Restart platforms.
I think the description of this patch doesn't make sense. The purpose
of the CWF hardware feature is to release software from having to move
the critical word to be the first member of the cache line.
That of course depends on how you define CWF, but at least
according to http://lwn.net/Articles/252125/ and
https://github.com/jamie-allen/cpu_caches/blob/master/preso/presentation.md
CWF means the hardware will do the job.
So I think the patch may be useful (1) for systems that don't have
CWF, or (2) if CWF does not totally eliminate the additional latency.
This is of course a prediction, as you see.

saeed
David Miller - Feb. 3, 2013, 9:08 p.m.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Feb 2013 07:03:55 -0800

> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when a last-level cache miss occurs,
> modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> Early Restart (ER) to get data ASAP. For CWF, if the critical word is the
> first member in the cache line, memory feeds the CPU with the critical
> word, then fills the other data in the cache line one by one; otherwise,
> after the critical word, it costs more cycles to fill the remaining
> cache line. For Early Restart, the CPU restarts as soon as the critical
> word in the cache line arrives.
>
> The hash value is the critical word, so in this patch we place it as the
> first member in the cache line (the sock address is cache-line aligned),
> which is also good for Early Restart platforms.
> 
> [edumazet: respin on net-next after commit ce43b03e8889]
> 
> Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I completely agree with the other response to this patch in that
the description is bogus.

If CWF is implemented in the cpu, it should exactly relieve us from
having to move things around in structures so carefully like this.

Either the patch should be completely dropped (modern cpus don't
need this) or the commit message changed to reflect reality.

It really makes a terrible impression upon me when the patch says
something which in fact is 180 degrees from reality.
Eric Dumazet - Feb. 4, 2013, 12:18 a.m.
On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 02 Feb 2013 07:03:55 -0800
>
> > From: Ma Ling <ling.ma.program@gmail.com>
> >
> > In order to reduce memory latency when a last-level cache miss occurs,
> > modern CPUs, e.g. x86 and ARM, introduced Critical Word First (CWF) or
> > Early Restart (ER) to get data ASAP. For CWF, if the critical word is the
> > first member in the cache line, memory feeds the CPU with the critical
> > word, then fills the other data in the cache line one by one; otherwise,
> > after the critical word, it costs more cycles to fill the remaining
> > cache line. For Early Restart, the CPU restarts as soon as the critical
> > word in the cache line arrives.
> >
> > The hash value is the critical word, so in this patch we place it as the
> > first member in the cache line (the sock address is cache-line aligned),
> > which is also good for Early Restart platforms.
> > 
> > [edumazet: respin on net-next after commit ce43b03e8889]
> > 
> > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> I completely agree with the other response to this patch in that
> the description is bogus.
> 
> If CWF is implemented in the cpu, it should exactly relieve us from
> having to move things around in structures so carefully like this.
> 
> Either the patch should be completely dropped (modern cpus don't
> need this) or the commit message changed to reflect reality.
> 
> It really makes a terrible impression upon me when the patch says
> something which in fact is 180 degrees from reality.

Hmm. 

Maybe the changelog is misleading, or maybe the performance gains I
get from this patch are an artifact of old/bad hardware, or
something else.



(Intel(R) Xeon(R) CPU           X5660  @ 2.80GHz)
# ./cwf
looking-up aligned time 108712072, 
looking-up unaligned time 113268256
looking-up aligned time 108677032, 
looking-up unaligned time 113297636


(Intel(R) Xeon(R) CPU           X5679  @ 3.20GHz)
# ./cwf
looking-up aligned time 139193589, 
looking-up unaligned time 144307821
looking-up aligned time 139136787, 
looking-up unaligned time 144277752

My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
# ./cwf
looking-up aligned time 84869203, 
looking-up unaligned time 86843462
looking-up aligned time 84253003, 
looking-up unaligned time 86227675
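
To put the cycle counts above in perspective, the gap can be expressed as a relative slowdown of the "unaligned" run over the "aligned" one. The sketch below is only an illustration: the helper names are hypothetical, and it uses the first pair of timings reported for each machine. By that measure the difference is about 4.2% on the X5660, 3.7% on the X5679 and 2.3% on the i5-2520M.

```c
#include <assert.h>
#include <stdio.h>

/* Relative slowdown, in percent, of the "unaligned" run over the
 * "aligned" one. Hypothetical helper, not part of the posted program.
 */
double slowdown_percent(unsigned long long aligned,
			unsigned long long unaligned)
{
	assert(aligned > 0);
	return 100.0 * ((double)unaligned - (double)aligned) / (double)aligned;
}

/* Prints the slowdown for the first pair of timings of each machine. */
void print_slowdowns(void)
{
	printf("X5660:    %.2f%%\n",
	       slowdown_percent(108712072ULL, 113268256ULL));
	printf("X5679:    %.2f%%\n",
	       slowdown_percent(139193589ULL, 144307821ULL));
	printf("i5-2520M: %.2f%%\n",
	       slowdown_percent(84869203ULL, 86843462ULL));
}
```

Calling print_slowdowns() from a small main() prints the three percentages; the timings are rdtsc cycle counts, so only the ratio is comparable across machines.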

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHELINE_SZ 64L

#define BIGBUFFER_SZ (64<<20)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
  asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
  (Var) = _hi << 32 | _lo; })

#define repeat_times  20

char *bufzap;

static void zap_cache(void)
{
	memset(bufzap, 2, BIGBUFFER_SZ);
	memset(bufzap, 3, BIGBUFFER_SZ);
	memset(bufzap, 4, BIGBUFFER_SZ);
}

static char *init_buf(void)
{
	void *res;

	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
		fprintf(stderr, "posix_memalign() failed\n");
		exit(1);
	}

	memset(res, 1, BIGBUFFER_SZ);
	return res;
}

unsigned long total;

static unsigned long random_access(void *buffer,
				   unsigned int off1,
				   unsigned int off2,
				   unsigned int off3)
{
	int i;
	unsigned int n;
	unsigned long sum = 0;
	unsigned long *ptr;

	srandom(7777);
	for (i = 0; i < 1000000; i++) {
		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
		ptr = buffer + n*CACHELINE_SZ;
		if (ptr[off1])
			sum++;
		if (ptr[off2])
			sum++;
//		if (ptr[off3])
//			sum++;
	}
	total += sum;
	return sum;
}

static unsigned long test_lookup_time(void *buf,
				      unsigned int off1,
				      unsigned int off2,
				      unsigned int off3)
{
	unsigned long i, start, end, best_time = ~0UL;

	for (i = 0; i < repeat_times; i++) {
		zap_cache();
		HP_TIMING_NOW(start);
		random_access(buf, off1, off2, off3);
		HP_TIMING_NOW(end);
		if (best_time > (end - start))
			best_time = (end - start);
	}

	return best_time;
}

int main(int argc, char *argv[])
{
	char *buf;
	unsigned long aligned_time, unaligned_time;

	buf = init_buf();
	bufzap = init_buf();

	/* Offsets index unsigned long words: "aligned" reads word 0 of each
	 * cache line first, "unaligned" reads word 4 first.
	 */
	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);

	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n", aligned_time, unaligned_time);

	/* Second pass, to check that the numbers are stable. */
	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);

	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n", aligned_time, unaligned_time);

	return 0;
}


Eric Dumazet - Feb. 4, 2013, 12:25 a.m.
On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
> [...]
> 	srandom(7777);
> 	for (i = 0; i < 1000000; i++) {
> 		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
> 		ptr = buffer + n*CACHELINE_SZ;
> 		if (ptr[off1])
> 			sum++;
> 		if (ptr[off2])
> 			sum++;
> //		if (ptr[off3])
> //			sum++;

Hmm, I don't know why I left these two lines commented out...

Of course, results are a bit different with the comments removed:

looking-up aligned time 113601316, 
looking-up unaligned time 115964760
looking-up aligned time 113698636, 
looking-up unaligned time 115986072

More testing is probably needed.


Ling Ma - Feb. 4, 2013, 2:53 a.m.
I attached my test program (it forces all CPU loads to issue one by one
and avoids CPU hardware prefetch; build with cc -o test-cwf test-cwf.c)
and cpu-info. The result from ./test-cwf is below:
looking-up aligned time 157000272, looking-up unaligned time 162652724
If I am wrong, please correct me.

Thanks
Ling


processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 18
initial apicid	: 18
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 20
initial apicid	: 20
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 9
cpu cores	: 4
apicid		: 19
initial apicid	: 19
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping	: 2
microcode	: 0x10
cpu MHz		: 2400.153
cache size	: 12288 KB
physical id	: 0
siblings	: 8
core id		: 10
cpu cores	: 4
apicid		: 21
initial apicid	: 21
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4800.30
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:
Eric Dumazet - Feb. 4, 2013, 3:11 a.m.
On Mon, 2013-02-04 at 10:53 +0800, Ling Ma wrote:
> I attached my test program (it forces all CPU loads to issue one by one
> and avoids CPU hardware prefetch; build with cc -o test-cwf test-cwf.c)
> and cpu-info. The result from ./test-cwf is below:
> looking-up aligned time 157000272, looking-up unaligned time 162652724
> If I am wrong, please correct me.

I have no idea why you use assembly code.

unsigned long lookingup_memmory(char *access, int num)
{
        __asm__("sub $1, %rsi");
        __asm__("xor %rax, %rax");
        __asm__("1:");
        __asm__("mov (%rdi), %r8");
        __asm__("add %r8, %rax");
        __asm__("mov %r8, %rdi");
        __asm__("sub $1, %rsi");
        __asm__("jae 1b");
}

Your program is really hard to read.
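
For readers trying to follow the quoted loop, here is one possible reading of it in plain C. It is a sketch only, assuming the System V x86-64 calling convention (access in %rdi, num in %rsi); the name chase_pointers and the exact types are hypothetical and are not taken from the attached program. Each load returns the address used by the next load, which is presumably how the test forces loads to issue one by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C rendering of the quoted assembly loop. Every load
 * depends on the value returned by the previous one, so the CPU cannot
 * issue the next load before the previous one completes.
 */
uintptr_t chase_pointers(void **access, long num)
{
	uintptr_t sum = 0;

	assert(access != NULL);
	while (num-- > 0) {
		void **next = *access;	/* mov (%rdi), %r8 */

		sum += (uintptr_t)next;	/* add %r8, %rax   */
		access = next;		/* mov %r8, %rdi   */
	}
	return sum;
}
```

The buffer must be pre-filled so that each visited slot holds the address of another valid slot; how the attached program builds that chain is not shown in the thread.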



Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index a340ab4..efabd9a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -131,12 +131,12 @@  typedef __u64 __bitwise __addrpair;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
- *	@skc_daddr: Foreign IPv4 addr
- *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_dport: placeholder for inet_dport/tw_dport
  *	@skc_num: placeholder for inet_num/tw_num
+ *	@skc_daddr: Foreign IPv4 addr
+ *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
@@ -153,18 +153,10 @@  typedef __u64 __bitwise __addrpair;
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
+ *	Order of first fields is critical for __inet_lookup_established() :
+ *	skc_hash, skc_portpair, skc_addrpair
  */
 struct sock_common {
-	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
-	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
-	 */
-	union {
-		__addrpair	skc_addrpair;
-		struct {
-			__be32	skc_daddr;
-			__be32	skc_rcv_saddr;
-		};
-	};
 	union  {
 		unsigned int	skc_hash;
 		__u16		skc_u16hashes[2];
@@ -178,6 +170,16 @@  struct sock_common {
 		};
 	};
 
+	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
+	 */
+	union {
+		__addrpair	skc_addrpair;
+		struct {
+			__be32	skc_daddr;
+			__be32	skc_rcv_saddr;
+		};
+	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse:4;
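
As a rough user-space illustration of the field order the patch aims for (skc_hash, skc_portpair, skc_addrpair), the mock below checks the resulting offsets at compile time. It is not the kernel definition: the struct name, the fixed-width types standing in for __be32/__u16/__addrpair, and the shape of the skc_portpair union are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of the first fields of struct sock_common after the
 * patch; the types are stand-ins, not the kernel ones.
 */
struct mock_sock_common {
	union {
		uint32_t	skc_hash;
		uint16_t	skc_u16hashes[2];
	};
	union {
		uint32_t	skc_portpair;	/* assumed layout */
		struct {
			uint16_t	skc_dport;
			uint16_t	skc_num;
		};
	};
	union {
		uint64_t	skc_addrpair;
		struct {
			uint32_t	skc_daddr;
			uint32_t	skc_rcv_saddr;
		};
	};
};

/* The hash is the first word of the (cache-line aligned) structure. */
static_assert(offsetof(struct mock_sock_common, skc_hash) == 0,
	      "skc_hash must come first");
static_assert(offsetof(struct mock_sock_common, skc_portpair) == 4,
	      "skc_portpair follows the hash");
/* The address pair stays on an 8-byte boundary, as INET_MATCH() needs. */
static_assert(offsetof(struct mock_sock_common, skc_addrpair) % 8 == 0,
	      "skc_addrpair must be 8-byte aligned");

/* Returns 1 when all three lookup keys fit in the first 16 bytes. */
int lookup_keys_share_first_16_bytes(void)
{
	return offsetof(struct mock_sock_common, skc_addrpair) +
	       sizeof(uint64_t) <= 16;
}
```

With this order the three keys compared during lookup sit in the first 16 bytes of the structure, with the hash in the first 64-bit word, which is what the subject line of the patch describes.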