Message ID | 1359817435.30177.70.camel@edumazet-glaptop |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Sat, Feb 2, 2013 at 5:03 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first
> member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU
> will restart until critical word in cache line reaches.
>
> Hash value is critical word, so in this patch we place it as first
> member in cache line (sock address is cache-line aligned), and it is
> also good for Early Restart platform as well .

I think the description of this patch doesn't make sense. The purpose of
the CWF hardware feature is to relieve software from having to place the
critical word first in the cache line. That of course depends on how you
define CWF, but at least according to http://lwn.net/Articles/252125/ and
https://github.com/jamie-allen/cpu_caches/blob/master/preso/presentation.md
CWF means the hardware will do the job.

So I think the patch may be useful (1) for systems that don't have CWF, or
(2) if CWF does not totally eliminate the additional latency. This is of
course only a prediction, as you can see.

saeed
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
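As an illustration of point (2) in the message above, here is a toy model in C of a critical-word-first line fill. It is only a sketch under stated assumptions: a 64-byte line delivered as eight 8-byte beats, in wrap-around order starting at the requested word. Real memory controllers and CPUs may order beats differently, and the names `WORDS_PER_LINE` and `arrival_beat` are hypothetical, not taken from the patch or from any hardware manual.

```c
#include <assert.h>

/* Assumption: a 64-byte cache line is delivered as 8 beats of 8 bytes. */
#define WORDS_PER_LINE 8u

/*
 * Toy model of a critical-word-first fill with wrap-around ordering:
 * the beat holding word `critical` arrives first (beat 0), followed by
 * the next words of the line, wrapping at the end of the line.
 * Returns the beat at which word `wanted` arrives.
 */
static unsigned int arrival_beat(unsigned int critical, unsigned int wanted)
{
	assert(critical < WORDS_PER_LINE && wanted < WORDS_PER_LINE);
	return (wanted + WORDS_PER_LINE - critical) % WORDS_PER_LINE;
}
```

Under this model, a miss on word 0 followed by a read of word 2 finds word 2 at beat 2, while a miss on word 4 followed by a read of word 2 has to wait until beat 6. Whether any given CPU behaves like this is exactly what the thread is debating.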
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 02 Feb 2013 07:03:55 -0800

> From: Ma Ling <ling.ma.program@gmail.com>
>
> In order to reduce memory latency when last level cache miss occurs,
> modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> Early Restart(ER) to get data ASAP. For CWF if critical word is first
> member
> in cache line, memory feed CPU with critical word, then fill others
> data in cache line one by one, otherwise after critical word it must
> cost more cycle to fill the remaining cache line. For Early First CPU
> will restart until critical word in cache line reaches.
>
> Hash value is critical word, so in this patch we place it as first
> member in cache line (sock address is cache-line aligned), and it is
> also good for Early Restart platform as well .
>
> [edumazet: respin on net-next after commit ce43b03e8889]
>
> Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I completely agree with the other response to this patch in that
the description is bogus.

If CWF is implemented in the cpu, it should exactly relieve us from
having to move things around in structures so carefully like this.

Either the patch should be completely dropped (modern cpus don't
need this) or the commit message changed to reflect reality.

It really makes a terrible impression upon me when the patch says
something which in fact is 180 degrees from reality.
On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sat, 02 Feb 2013 07:03:55 -0800
>
> > From: Ma Ling <ling.ma.program@gmail.com>
> >
> > In order to reduce memory latency when last level cache miss occurs,
> > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> > Early Restart(ER) to get data ASAP. For CWF if critical word is first
> > member
> > in cache line, memory feed CPU with critical word, then fill others
> > data in cache line one by one, otherwise after critical word it must
> > cost more cycle to fill the remaining cache line. For Early First CPU
> > will restart until critical word in cache line reaches.
> >
> > Hash value is critical word, so in this patch we place it as first
> > member in cache line (sock address is cache-line aligned), and it is
> > also good for Early Restart platform as well .
> >
> > [edumazet: respin on net-next after commit ce43b03e8889]
> >
> > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>
> I completely agree with the other response to this patch in that
> the description is bogus.
>
> If CWF is implemented in the cpu, it should exactly relieve us from
> having to move things around in structures so carefully like this.
>
> Either the patch should be completely dropped (modern cpus don't
> need this) or the commit message changed to reflect reality.
>
> It really makes a terrible impression upon me when the patch says
> something which in fact is 180 degrees from reality.

Hmm.

Maybe the changelog is misleading, or maybe all the performance gains I
have from this patch are probably some artifact or old/bad hardware, or
something else.

(Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
# ./cwf
looking-up aligned time 108712072,
looking-up unaligned time 113268256
looking-up aligned time 108677032,
looking-up unaligned time 113297636

(Intel(R) Xeon(R) CPU X5679 @ 3.20GHz)
# ./cwf
looking-up aligned time 139193589,
looking-up unaligned time 144307821
looking-up aligned time 139136787,
looking-up unaligned time 144277752

My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
# ./cwf
looking-up aligned time 84869203,
looking-up unaligned time 86843462
looking-up aligned time 84253003,
looking-up unaligned time 86227675

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

#define CACHELINE_SZ 64L

#define BIGBUFFER_SZ (64<<20)

# define HP_TIMING_NOW(Var) \
 ({ unsigned long long _hi, _lo; \
    asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
    (Var) = _hi << 32 | _lo; })

#define repeat_times 20

char *bufzap;

static void zap_cache(void)
{
	memset(bufzap, 2, BIGBUFFER_SZ);
	memset(bufzap, 3, BIGBUFFER_SZ);
	memset(bufzap, 4, BIGBUFFER_SZ);
}

static char *init_buf(void)
{
	void *res;

	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
		fprintf(stderr, "malloc() failed");
		exit(1);
	}

	memset(res, 1, BIGBUFFER_SZ);
	return res;
}

unsigned long total;

static unsigned long random_access(void *buffer,
				   unsigned int off1,
				   unsigned int off2,
				   unsigned int off3)
{
	int i;
	unsigned int n;
	unsigned long sum = 0;
	unsigned long *ptr;

	srandom(7777);
	for (i = 0; i < 1000000; i++) {
		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
		ptr = buffer + n*CACHELINE_SZ;
		if (ptr[off1])
			sum++;
		if (ptr[off2])
			sum++;
//		if (ptr[off3])
//			sum++;
	}
	total += sum;
	return sum;
}

static unsigned long test_lookup_time(void *buf,
				      unsigned int off1,
				      unsigned int off2,
				      unsigned int off3)
{
	unsigned long i, start, end, best_time = ~0;

	for (i = 0; i < repeat_times; i++) {
		zap_cache();
		HP_TIMING_NOW(start);
		random_access(buf, off1, off2, off3);
		HP_TIMING_NOW(end);
		if (best_time > (end - start))
			best_time = (end - start);
	}

	return best_time;
}

int main(int argc, char *argv[])
{
	char *buf;
	unsigned long aligned_time, unaligned_time;

	buf = init_buf();
	bufzap = init_buf();

	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);
	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n",
	       aligned_time, unaligned_time);

	aligned_time = test_lookup_time(buf, 0, 2, 4);
	unaligned_time = test_lookup_time(buf, 4, 2, 0);
	printf("looking-up aligned time %lu, \nlooking-up unaligned time %lu\n",
	       aligned_time, unaligned_time);
}
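To put the cycle counts reported above in proportion, the following sketch computes the relative gap between the unaligned and aligned timings. The helper names `relative_gap` and `print_reported_gaps` are made up for this note; the inputs are the first run quoted for each machine.

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of the unaligned time that the aligned access pattern saves. */
static double relative_gap(unsigned long aligned, unsigned long unaligned)
{
	assert(unaligned > 0 && aligned <= unaligned);
	return (double)(unaligned - aligned) / (double)unaligned;
}

/* Print the gap for the first run reported on each machine. */
static void print_reported_gaps(void)
{
	printf("X5660:    %.2f%%\n",
	       100.0 * relative_gap(108712072UL, 113268256UL));
	printf("X5679:    %.2f%%\n",
	       100.0 * relative_gap(139193589UL, 144307821UL));
	printf("i5-2520M: %.2f%%\n",
	       100.0 * relative_gap(84869203UL, 86843462UL));
}
```

Calling `print_reported_gaps()` from a `main()` shows gaps of roughly 4%, 3.5% and 2.3% respectively, so the effect is small but repeatable across the two runs shown per machine.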
On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
> On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Sat, 02 Feb 2013 07:03:55 -0800
> >
> > > From: Ma Ling <ling.ma.program@gmail.com>
> > >
> > > In order to reduce memory latency when last level cache miss occurs,
> > > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
> > > Early Restart(ER) to get data ASAP. For CWF if critical word is first
> > > member
> > > in cache line, memory feed CPU with critical word, then fill others
> > > data in cache line one by one, otherwise after critical word it must
> > > cost more cycle to fill the remaining cache line. For Early First CPU
> > > will restart until critical word in cache line reaches.
> > >
> > > Hash value is critical word, so in this patch we place it as first
> > > member in cache line (sock address is cache-line aligned), and it is
> > > also good for Early Restart platform as well .
> > >
> > > [edumazet: respin on net-next after commit ce43b03e8889]
> > >
> > > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> >
> > I completely agree with the other response to this patch in that
> > the description is bogus.
> >
> > If CWF is implemented in the cpu, it should exactly relieve us from
> > having to move things around in structures so carefully like this.
> >
> > Either the patch should be completely dropped (modern cpus don't
> > need this) or the commit message changed to reflect reality.
> >
> > It really makes a terrible impression upon me when the patch says
> > something which in fact is 180 degrees from reality.
>
> Hmm.
>
> Maybe the changelog is misleading, or maybe all the performance gains I
> have from this patch are probably some artifact or old/bad hardware, or
> something else.
>
> (Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
> # ./cwf
> looking-up aligned time 108712072,
> looking-up unaligned time 113268256
> looking-up aligned time 108677032,
> looking-up unaligned time 113297636
>
> (Intel(R) Xeon(R) CPU X5679 @ 3.20GHz)
> # ./cwf
> looking-up aligned time 139193589,
> looking-up unaligned time 144307821
> looking-up aligned time 139136787,
> looking-up unaligned time 144277752
>
> My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> # ./cwf
> looking-up aligned time 84869203,
> looking-up unaligned time 86843462
> looking-up aligned time 84253003,
> looking-up unaligned time 86227675
>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define CACHELINE_SZ 64L
>
> #define BIGBUFFER_SZ (64<<20)
>
> # define HP_TIMING_NOW(Var) \
>  ({ unsigned long long _hi, _lo; \
>     asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \
>     (Var) = _hi << 32 | _lo; })
>
> #define repeat_times 20
>
> char *bufzap;
>
> static void zap_cache(void)
> {
> 	memset(bufzap, 2, BIGBUFFER_SZ);
> 	memset(bufzap, 3, BIGBUFFER_SZ);
> 	memset(bufzap, 4, BIGBUFFER_SZ);
> }
>
> static char *init_buf(void)
> {
> 	void *res;
>
> 	if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) {
> 		fprintf(stderr, "malloc() failed");
> 		exit(1);
> 	}
>
> 	memset(res, 1, BIGBUFFER_SZ);
> 	return res;
> }
>
> unsigned long total;
>
> static unsigned long random_access(void *buffer,
> 				   unsigned int off1,
> 				   unsigned int off2,
> 				   unsigned int off3)
> {
> 	int i;
> 	unsigned int n;
> 	unsigned long sum = 0;
> 	unsigned long *ptr;
>
> 	srandom(7777);
> 	for (i = 0; i < 1000000; i++) {
> 		n = random() % (BIGBUFFER_SZ/CACHELINE_SZ);
> 		ptr = buffer + n*CACHELINE_SZ;
> 		if (ptr[off1])
> 			sum++;
> 		if (ptr[off2])
> 			sum++;
> //		if (ptr[off3])
> //			sum++;

Hmm, I don't know why I left a comment on these two lines...

Of course, results are a bit different removing the comments :

looking-up aligned time 113601316,
looking-up unaligned time 115964760
looking-up aligned time 113698636,
looking-up unaligned time 115986072

More testing is probably needed.
I attached my test program and cpu-info. The program forces all CPU loads
to issue one by one and avoids CPU hardware prefetch (build with:
cc -o test-cwf test-cwf.c). The result from ./test-cwf is as below:

looking-up aligned time 157000272, looking-up unaligned time 162652724

If I am wrong, please correct me.

Thanks
Ling

2013/2/4, Eric Dumazet <eric.dumazet@gmail.com>:
> On Sun, 2013-02-03 at 16:18 -0800, Eric Dumazet wrote:
>> On Sun, 2013-02-03 at 16:08 -0500, David Miller wrote:
>> > From: Eric Dumazet <eric.dumazet@gmail.com>
>> > Date: Sat, 02 Feb 2013 07:03:55 -0800
>> >
>> > > From: Ma Ling <ling.ma.program@gmail.com>
>> > >
>> > > In order to reduce memory latency when last level cache miss occurs,
>> > > modern CPUs i.e. x86 and arm introduced Critical Word First(CWF) or
>> > > Early Restart(ER) to get data ASAP. For CWF if critical word is first
>> > > member
>> > > in cache line, memory feed CPU with critical word, then fill others
>> > > data in cache line one by one, otherwise after critical word it must
>> > > cost more cycle to fill the remaining cache line. For Early First CPU
>> > > will restart until critical word in cache line reaches.
>> > >
>> > > Hash value is critical word, so in this patch we place it as first
>> > > member in cache line (sock address is cache-line aligned), and it is
>> > > also good for Early Restart platform as well .
>> > >
>> > > [edumazet: respin on net-next after commit ce43b03e8889]
>> > >
>> > > Signed-off-by: Ma Ling <ling.ma.program@gmail.com>
>> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> >
>> > I completely agree with the other response to this patch in that
>> > the description is bogus.
>> >
>> > If CWF is implemented in the cpu, it should exactly relieve us from
>> > having to move things around in structures so carefully like this.
>> >
>> > Either the patch should be completely dropped (modern cpus don't
>> > need this) or the commit message changed to reflect reality.
>> > >> > It really makes a terrible impression upon me when the patch says >> > something which in fact is 180 degrees from reality. >> >> Hmm. >> >> Maybe the changelog is misleading, or maybe all the performance gains I >> have from this patch are probably some artifact or old/bad hardware, or >> something else. >> >> >> >> (Intel(R) Xeon(R) CPU X5660 @ 2.80GHz) >> # ./cwf >> looking-up aligned time 108712072, >> looking-up unaligned time 113268256 >> looking-up aligned time 108677032, >> looking-up unaligned time 113297636 >> >> >> (Intel(R) Xeon(R) CPU X5679 @ 3.20GHz) >> # ./cwf >> looking-up aligned time 139193589, >> looking-up unaligned time 144307821 >> looking-up aligned time 139136787, >> looking-up unaligned time 144277752 >> >> My laptop : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz >> # ./cwf >> looking-up aligned time 84869203, >> looking-up unaligned time 86843462 >> looking-up aligned time 84253003, >> looking-up unaligned time 86227675 >> >> #include <stdio.h> >> #include <string.h> >> #include <stdlib.h> >> #include <unistd.h> >> >> #define CACHELINE_SZ 64L >> >> #define BIGBUFFER_SZ (64<<20) >> >> # define HP_TIMING_NOW(Var) \ >> ({ unsigned long long _hi, _lo; \ >> asm volatile ("rdtsc" : "=a" (_lo), "=d" (_hi)); \ >> (Var) = _hi << 32 | _lo; }) >> >> #define repeat_times 20 >> >> char *bufzap; >> >> static void zap_cache(void) >> { >> memset(bufzap, 2, BIGBUFFER_SZ); >> memset(bufzap, 3, BIGBUFFER_SZ); >> memset(bufzap, 4, BIGBUFFER_SZ); >> } >> >> static char *init_buf(void) >> { >> void *res; >> >> if (posix_memalign(&res, CACHELINE_SZ, BIGBUFFER_SZ)) { >> fprintf(stderr, "malloc() failed"); >> exit(1); >> } >> >> memset(res, 1, BIGBUFFER_SZ); >> return res; >> } >> >> unsigned long total; >> >> static unsigned long random_access(void *buffer, >> unsigned int off1, >> unsigned int off2, >> unsigned int off3) >> { >> int i; >> unsigned int n; >> unsigned long sum = 0; >> unsigned long *ptr; >> >> srandom(7777); >> for (i = 0; i < 1000000; i++) 
{ >> n = random() % (BIGBUFFER_SZ/CACHELINE_SZ); >> ptr = buffer + n*CACHELINE_SZ; >> if (ptr[off1]) >> sum++; >> if (ptr[off2]) >> sum++; >> // if (ptr[off3]) >> // sum++; > > Hmm, I don't know why I left a comment on these two lines... > > Of course, results are a bit different removing the comments : > > looking-up aligned time 113601316, > looking-up unaligned time 115964760 > looking-up aligned time 113698636, > looking-up unaligned time 115986072 > > More testing is probably needed. > > > processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm 
pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 9 cpu cores : 4 apicid : 18 initial apicid : 18 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 10 cpu cores : 4 apicid : 20 initial apicid : 20 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family 
: 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 9 cpu cores : 4 apicid : 19 initial apicid : 19 fpu : yes fpu_exception : yes cpuid level : 11 wp : 
yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping : 2 microcode : 0x10 cpu MHz : 2400.153 cache size : 12288 KB physical id : 0 siblings : 8 core id : 10 cpu cores : 4 apicid : 21 initial apicid : 21 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm arat epb dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 4800.30 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management:
On Mon, 2013-02-04 at 10:53 +0800, Ling Ma wrote:
> I attached my test program(we force all cpu loads issue one by one ,
> and avoid cpu hardwre prefetch cc -o test-cwf test-cwf.c.) and
> cpu-info, the result from ./test-cwf indicates as below:
> looking-up aligned time 157000272, looking-up unaligned time 162652724
> If I was wrong please correct me.

I have no idea why you use assembly code.

unsigned long lookingup_memmory(char *access, int num)
{
	__asm__("sub $1, %rsi");
	__asm__("xor %rax, %rax");
	__asm__("1:");
	__asm__("mov (%rdi), %r8");
	__asm__("add %r8, %rax");
	__asm__("mov %r8, %rdi");
	__asm__("sub $1, %rsi");
	__asm__("jae 1b");
}

Your program is really hard to read.
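For readers following along: the inline assembly quoted above appears to implement a dependent-load ("pointer chasing") loop, in which each load returns the address used by the next load. Under that reading, a plain C sketch of the same idea might look like the following. The names `chase` and `make_ring` are hypothetical and are not from the attached test program, and a compiler is free to schedule this code differently from the hand-written version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Follow a chain of pointers for `num` steps. Each slot stores the
 * address of the next slot, so every load depends on the previous one
 * and the loads cannot overlap.
 * Returns the sum of the loaded values, which is what the assembly
 * version accumulates in %rax.
 */
static unsigned long chase(const void *start, long num)
{
	const uintptr_t *p = start;
	unsigned long sum = 0;
	long i;

	for (i = 0; i < num; i++) {
		uintptr_t next = *p;		/* mov (%rdi), %r8 */

		sum += next;			/* add %r8, %rax   */
		p = (const uintptr_t *)next;	/* mov %r8, %rdi   */
	}
	return sum;
}

/* Link `n` slots, spaced `stride` words apart, into a ring for chase(). */
static void make_ring(uintptr_t *buf, size_t n, size_t stride)
{
	size_t i;

	assert(n > 0 && stride > 0);
	for (i = 0; i < n; i++)
		buf[i * stride] = (uintptr_t)&buf[((i + 1) % n) * stride];
}
```

With a stride of 8 words (64 bytes when `uintptr_t` is 8 bytes) each slot sits in its own cache line, so the slot offset within the line can be varied to compare first-word and mid-line accesses. A real measurement would also need a randomized ring order to defeat hardware prefetch, which this sketch does not attempt.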
diff --git a/include/net/sock.h b/include/net/sock.h
index a340ab4..efabd9a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -131,12 +131,12 @@ typedef __u64 __bitwise __addrpair;
 
 /**
  *	struct sock_common - minimal network layer representation of sockets
- *	@skc_daddr: Foreign IPv4 addr
- *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_dport: placeholder for inet_dport/tw_dport
  *	@skc_num: placeholder for inet_num/tw_num
+ *	@skc_daddr: Foreign IPv4 addr
+ *	@skc_rcv_saddr: Bound local IPv4 addr
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
@@ -153,18 +153,10 @@ typedef __u64 __bitwise __addrpair;
  *
  *	This is the minimal network layer representation of sockets, the header
  *	for struct sock and struct inet_timewait_sock.
+ *	Order of first fields is critical for __inet_lookup_established() :
+ *	skc_hash, skc_portpair, skc_addrpair
  */
 struct sock_common {
-	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
-	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
-	 */
-	union {
-		__addrpair	skc_addrpair;
-		struct {
-			__be32	skc_daddr;
-			__be32	skc_rcv_saddr;
-		};
-	};
 	union {
 		unsigned int	skc_hash;
 		__u16		skc_u16hashes[2];
@@ -178,6 +170,16 @@ struct sock_common {
 		};
 	};
 
+	/* skc_daddr and skc_rcv_saddr must be grouped on a 8 bytes aligned
+	 * address on 64bit arches : cf INET_MATCH() and INET_TW_MATCH()
+	 */
+	union {
+		__addrpair	skc_addrpair;
+		struct {
+			__be32	skc_daddr;
+			__be32	skc_rcv_saddr;
+		};
+	};
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
 	unsigned char		skc_reuse:4;
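The field order the patch aims for can be checked at compile time. Below is a userspace sketch, not the kernel definition: `struct sock_common_sketch` is a made-up stand-in that uses fixed-width types in place of `__be32`, `__u16`, `__addrpair` and `__portpair`, and it models only the three leading fields named in the new comment (`skc_hash`, `skc_portpair`, `skc_addrpair`). It relies on C11 anonymous unions and `static_assert`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the start of struct sock_common. */
struct sock_common_sketch {
	union {
		unsigned int	skc_hash;
		uint16_t	skc_u16hashes[2];
	};
	union {
		uint32_t	skc_portpair;	/* stands in for __portpair */
		struct {
			uint16_t	skc_dport;
			uint16_t	skc_num;
		};
	};
	union {
		uint64_t	skc_addrpair;	/* stands in for __addrpair */
		struct {
			uint32_t	skc_daddr;
			uint32_t	skc_rcv_saddr;
		};
	};
};

/* The hash is the first word of the (cache-line aligned) structure. */
static_assert(offsetof(struct sock_common_sketch, skc_hash) == 0,
	      "skc_hash must be the first member");
/* The port pair follows directly after the 4-byte hash. */
static_assert(offsetof(struct sock_common_sketch, skc_portpair) == 4,
	      "skc_portpair must follow skc_hash");
/* The address pair must stay on an 8-byte boundary for the 64-bit compare. */
static_assert(offsetof(struct sock_common_sketch, skc_addrpair) % 8 == 0,
	      "skc_addrpair must be 8-byte aligned");
```

If a later change reordered these fields, the file would simply fail to compile, which is one cheap way to keep such an ordering constraint from regressing silently.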