
[RFC,net-next] ipv6: Use destination address determined by IPVS

Message ID alpine.LFD.2.03.1310180153250.4554@ssi.bg
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Julian Anastasov Oct. 17, 2013, 11:03 p.m. UTC
Hello,

	Here is a solution that should work not only for IPVS.
If the change looks correct I'll send it in a separate message.

[PATCH net] ipv6: always prefer rt6i_gateway if present

From: Julian Anastasov <ja@ssi.bg>

In v3.9, commit 6fd6ce2056de2709 ("ipv6: Do not depend on rt->n in
ip6_finish_output2().") changed the behaviour of ip6_finish_output2()
such that the recently introduced rt6_nexthop() is used
instead of an assigned neighbor.

As rt6_nexthop() prefers rt6i_gateway only for gatewayed
routes, this causes a problem for users like IPVS, xt_TEE and
RAW(hdrincl) that want to use a routing address different
from the destination address.

Fix it by considering the rt6i_gateway address in all
cases, so that traffic routed to an address on the local
subnet is not wrongly diverted to the destination address.

Thanks to Simon Horman and Phil Oester for spotting the
problematic commit.

Reported-by: Phil Oester <kernel@linuxace.com>
Reported-by: Mark Brooks <mark@loadbalancer.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

Please review for possible side effects when using
rt6i_gateway without RTF_GATEWAY!

 include/net/ip6_route.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Hannes Frederic Sowa Oct. 18, 2013, 2:10 a.m. UTC | #1
On Fri, Oct 18, 2013 at 02:03:27AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> 	Here is a solution that should work not only for IPVS.
> If the change looks correct I'll send it in a separate message.
> 
> [PATCH net] ipv6: always prefer rt6i_gateway if present
> 
> From: Julian Anastasov <ja@ssi.bg>
> 
> In v3.9, commit 6fd6ce2056de2709 ("ipv6: Do not depend on rt->n in
> ip6_finish_output2().") changed the behaviour of ip6_finish_output2()
> such that the recently introduced rt6_nexthop() is used
> instead of an assigned neighbor.
> 
> As rt6_nexthop() prefers rt6i_gateway only for gatewayed
> routes this causes a problem for users like IPVS, xt_TEE and
> RAW(hdrincl) if they want to use different address for routing
> compared to the destination address.
> 
> Fix it by considering the rt6i_gateway address in all
> cases, so that traffic routed to address on local subnet is
> not wrongly diverted to the destination address.
> 
> Thanks to Simon Horman and Phil Oester for spotting the
> problematic commit.

I played around with your patch and tested xt_TEE. I added a TEE rule to
mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:

[  101.126649] ------------[ cut here ]------------
[  101.128436] BUG: unable to handle kernel paging request at fffffffb8a2fda88
[  101.129421] IP: [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
[  101.129421] PGD 1c0f067 PUD 0
[  101.129421] Thread overran stack, or stack corrupted
[  101.129421] Oops: 0000 [#1] SMP
[  101.129421] Modules linked in: xt_TEE nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables joydev serio_raw virtio_balloon i2c_piix4 i2c_core nfsd auth_rpcgss nfs_acl lockd sunrpc virtio_blk virtio_net ata_generic pata_acpi
[  101.129421] CPU: 1 PID: 1190 Comm: ping6 Not tainted 3.12.0-rc3+ #2
[  101.129421] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  101.129421] task: ffff8800c30a4590 ti: ffff8801193ce000 task.ti: ffff8801193ce000
[  101.129421] RIP: 0010:[<ffffffff810c9737>]  [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
[  101.129421] RSP: 0018:ffff88011b403d60  EFLAGS: 00010002
[  101.129421] RAX: 000000000000e5d8 RBX: 0000000000244314 RCX: ffffffff810b87bd
[  101.129421] RDX: ffffffff81c4dce0 RSI: ffffffff81c47e00 RDI: 0000000000000046
[  101.129421] RBP: ffff88011b403d88 R08: ffff8800c30a5308 R09: 0000000000000001
[  101.129421] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800c30a4590
[  101.129421] R13: 00000000810b87bd R14: ffff880117d5b450 R15: 0000000000000000
[  101.129421] FS:  00007f79260bc740(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  101.129421] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.129421] CR2: fffffffb8a2fda88 CR3: 0000000117e40000 CR4: 00000000000006e0
[  101.129421] Stack:
[  101.129421]  ffffffff810c96a5 ffff8800c30a45f8 ffff8800c8d1da40 ffff8800c30a4590
[  101.129421]  0000000000244314 ffff88011b403dc8 ffffffff810bce9c 000000003264e993
[  101.129421]  ffff8800c30a45f8 0000000000000001 ffff8800c8d1da40 ffff88011b5d5240
[  101.129421] Call Trace:
[  101.129421]  <IRQ>
[  101.129421]  [<ffffffff810c96a5>] ? cpuacct_charge+0x5/0x200
[  101.129421]  [<ffffffff810bce9c>] update_curr+0xcc/0x210
[  101.129421]  [<ffffffff810beb19>] task_tick_fair+0x2b9/0x680
[  101.129421]  [<ffffffff810b8948>] ? sched_clock_cpu+0xa8/0x100
[  101.129421]  [<ffffffff810b3564>] scheduler_tick+0x64/0xe0
[  101.129421]  [<ffffffff810866a6>] update_process_times+0x66/0x80
[  101.129421]  [<ffffffff810e7a15>] tick_sched_handle.isra.15+0x25/0x60
[  101.129421]  [<ffffffff810e7a91>] tick_sched_timer+0x41/0x60
[  101.129421]  [<ffffffff810a5146>] __run_hrtimer+0x86/0x490
[  101.129421]  [<ffffffff810e7a50>] ? tick_sched_handle.isra.15+0x60/0x60
[  101.129421]  [<ffffffff810a5fc7>] hrtimer_interrupt+0xf7/0x240
[  101.129421]  [<ffffffff8104a877>] local_apic_timer_interrupt+0x37/0x60
[  101.129421]  [<ffffffff8173dcbf>] smp_apic_timer_interrupt+0x3f/0x60
[  101.129421]  [<ffffffff8173c6b2>] apic_timer_interrupt+0x72/0x80
[  101.129421]  <EOI>
[  101.129421]  [<ffffffff810d5bdd>] ? vprintk_emit+0x1dd/0x5e0
[  101.129421]  [<ffffffff8107afe4>] ? __local_bh_disable+0x94/0xa0
[  101.129421]  [<ffffffff81723dbf>] printk+0x67/0x69
[  101.129421]  [<ffffffff810ed07c>] ? trace_hardirqs_on_caller+0xac/0x1c0
[  101.129421]  [<ffffffff8107afe4>] ? __local_bh_disable+0x94/0xa0
[  101.129421]  [<ffffffff81075394>] warn_slowpath_common+0x34/0xa0
[  101.129421]  [<ffffffff81600309>] ? dst_alloc+0x139/0x170
[  101.129421]  [<ffffffff810754ba>] warn_slowpath_null+0x1a/0x20
[  101.129421]  [<ffffffff8107afe4>] __local_bh_disable+0x94/0xa0
[  101.129421]  [<ffffffff8107b007>] local_bh_disable+0x17/0x20
[  101.129421]  [<ffffffff81600309>] dst_alloc+0x139/0x170
[  101.129421]  [<ffffffff816d4e47>] icmp6_dst_alloc+0xf7/0x430
[  101.129421]  [<ffffffff816d4d55>] ? icmp6_dst_alloc+0x5/0x430
[  101.129421]  [<ffffffff816dde26>] ndisc_send_skb+0x3e6/0x540
[  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
[  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
[  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
[  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
[  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d496a>] ip6_pol_route_output+0x2a/0x30
[  101.129421]  [<ffffffff8170516e>] fib6_rule_action+0xbe/0x200
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816142e5>] fib_rules_lookup+0x145/0x390
[  101.129421]  [<ffffffff816141a5>] ? fib_rules_lookup+0x5/0x390
[  101.129421]  [<ffffffff81705464>] fib6_rule_lookup+0x44/0x80
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d1263>] ip6_route_output+0x73/0xb0
[  101.129421]  [<ffffffffa01df54f>] tee_tg6+0x15f/0x228 [xt_TEE]
[  101.129421]  [<ffffffffa012413f>] ip6t_do_table+0x26f/0x435 [ip6_tables]
[  101.129421]  [<ffffffff8107b09a>] ? local_bh_enable+0x8a/0xf0
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffffa01c10e8>] ip6table_mangle_hook+0xe8/0x170 [ip6table_mangle]
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162ab5a>] nf_iterate+0xca/0x180
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162acf4>] nf_hook_slow+0xe4/0x290
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff816dddb2>] ndisc_send_skb+0x372/0x540
[  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
[  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
[  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
[  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
[  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d496a>] ip6_pol_route_output+0x2a/0x30
[  101.129421]  [<ffffffff8170516e>] fib6_rule_action+0xbe/0x200
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816142e5>] fib_rules_lookup+0x145/0x390
[  101.129421]  [<ffffffff816141a5>] ? fib_rules_lookup+0x5/0x390
[  101.129421]  [<ffffffff81705464>] fib6_rule_lookup+0x44/0x80
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d1263>] ip6_route_output+0x73/0xb0
[  101.129421]  [<ffffffffa01df54f>] tee_tg6+0x15f/0x228 [xt_TEE]
[  101.129421]  [<ffffffffa012413f>] ip6t_do_table+0x26f/0x435 [ip6_tables]
[  101.129421]  [<ffffffff8107b09a>] ? local_bh_enable+0x8a/0xf0
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffffa01c10e8>] ip6table_mangle_hook+0xe8/0x170 [ip6table_mangle]
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162ab5a>] nf_iterate+0xca/0x180
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162acf4>] nf_hook_slow+0xe4/0x290
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff816dddb2>] ndisc_send_skb+0x372/0x540
[  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
[  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
[  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
[  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
[  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d496a>] ip6_pol_route_output+0x2a/0x30
[  101.129421]  [<ffffffff8170516e>] fib6_rule_action+0xbe/0x200
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816142e5>] fib_rules_lookup+0x145/0x390
[  101.129421]  [<ffffffff816141a5>] ? fib_rules_lookup+0x5/0x390
[  101.129421]  [<ffffffff81705464>] fib6_rule_lookup+0x44/0x80
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d1263>] ip6_route_output+0x73/0xb0
[  101.129421]  [<ffffffffa01df54f>] tee_tg6+0x15f/0x228 [xt_TEE]
[  101.129421]  [<ffffffffa012413f>] ip6t_do_table+0x26f/0x435 [ip6_tables]
[  101.129421]  [<ffffffff8107b09a>] ? local_bh_enable+0x8a/0xf0
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffffa01c10e8>] ip6table_mangle_hook+0xe8/0x170 [ip6table_mangle]
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162ab5a>] nf_iterate+0xca/0x180
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162acf4>] nf_hook_slow+0xe4/0x290
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff816dddb2>] ndisc_send_skb+0x372/0x540
[  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
[  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
[  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
[  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
[  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d496a>] ip6_pol_route_output+0x2a/0x30
[  101.129421]  [<ffffffff8170516e>] fib6_rule_action+0xbe/0x200
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816142e5>] fib_rules_lookup+0x145/0x390
[  101.129421]  [<ffffffff816141a5>] ? fib_rules_lookup+0x5/0x390
[  101.129421]  [<ffffffff81705464>] fib6_rule_lookup+0x44/0x80
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d1263>] ip6_route_output+0x73/0xb0
[  101.129421]  [<ffffffffa01df54f>] tee_tg6+0x15f/0x228 [xt_TEE]
[  101.129421]  [<ffffffffa012413f>] ip6t_do_table+0x26f/0x435 [ip6_tables]
[  101.129421]  [<ffffffff8107b09a>] ? local_bh_enable+0x8a/0xf0
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffffa01c10e8>] ip6table_mangle_hook+0xe8/0x170 [ip6table_mangle]
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162ab5a>] nf_iterate+0xca/0x180
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff8162acf4>] nf_hook_slow+0xe4/0x290
[  101.129421]  [<ffffffff816dd6e0>] ? ndisc_alloc_skb+0x110/0x110
[  101.129421]  [<ffffffff816dddb2>] ndisc_send_skb+0x372/0x540
[  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
[  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
[  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
[  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
[  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d496a>] ip6_pol_route_output+0x2a/0x30
[  101.129421]  [<ffffffff8170516e>] fib6_rule_action+0xbe/0x200
[  101.129421]  [<ffffffff810ea057>] ? __lock_is_held+0x57/0x80
[  101.129421]  [<ffffffff816142e5>] fib_rules_lookup+0x145/0x390
[  101.129421]  [<ffffffff816141a5>] ? fib_rules_lookup+0x5/0x390
[  101.129421]  [<ffffffff81705464>] fib6_rule_lookup+0x44/0x80
[  101.129421]  [<ffffffff816d4940>] ? ip6_pol_route_input+0x30/0x30
[  101.129421]  [<ffffffff816d1263>] ip6_route_output+0x73/0xb0
[  101.129421]  [<ffffffff816bfd13>] ip6_dst_lookup_tail+0x373/0x4a0
[  101.129421]  [<ffffffff8105768f>] ? kvm_clock_read+0x2f/0x50
[  101.129421]  [<ffffffff81021b89>] ? sched_clock+0x9/0x10
[  101.129421]  [<ffffffff810b87bd>] ? sched_clock_local+0x1d/0x80
[  101.129421]  [<ffffffff816bfe93>] ip6_dst_lookup_flow+0x33/0x90
[  101.129421]  [<ffffffff816f94bd>] ip6_datagram_connect+0x16d/0x510
[  101.129421]  [<ffffffff810ed0cd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  101.129421]  [<ffffffff810ed19d>] ? trace_hardirqs_on+0xd/0x10
[  101.129421]  [<ffffffff81678dde>] inet_dgram_connect+0x2e/0x80
[  101.129421]  [<ffffffff815da25a>] SYSC_connect+0xea/0x130
[  101.129421]  [<ffffffff81732818>] ? retint_swapgs+0x13/0x1b
[  101.129421]  [<ffffffff810ed0cd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  101.129421]  [<ffffffff8137b67e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  101.129421]  [<ffffffff815db75e>] SyS_connect+0xe/0x10
[  101.129421]  [<ffffffff8173ba29>] system_call_fastpath+0x16/0x1b
[  101.129421] Code: 00 00 4d 8b b4 24 e0 13 00 00 e8 e5 3d fd ff 85 c0 74 09 80 3d d0 4a c5 00 00 74 78 49 8b 56 48 49 63 cd 90 48 8b 82 b8 00 00 00 <48> 03 04 cd a0 9c d3 81 48 01 18 48 8b 52 40 48 85 d2 75 e5 e8
[  101.129421] RIP  [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
[  101.129421]  RSP <ffff88011b403d60>
[  101.129421] CR2: fffffffb8a2fda88
[  101.129421] ---[ end trace 68ce790414632b74 ]---
[  101.129421] Kernel panic - not syncing: Fatal exception in interrupt



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Julian Anastasov Oct. 18, 2013, 6:33 a.m. UTC | #2
Hello,

On Fri, 18 Oct 2013, Hannes Frederic Sowa wrote:

> I played around with your patch and tested xt_TEE. I added a TEE rule to
> mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:

	Maybe a side effect of using some multicast address?
Could you test it with an address check like this one?

	if (rt->rt6i_flags & RTF_GATEWAY ||
	    ipv6_addr_type(&rt->rt6i_gateway) & IPV6_ADDR_UNICAST)

> [  101.129421]  [<ffffffff816d4e47>] icmp6_dst_alloc+0xf7/0x430
> [  101.129421]  [<ffffffff816d4d55>] ? icmp6_dst_alloc+0x5/0x430
> [  101.129421]  [<ffffffff816dde26>] ndisc_send_skb+0x3e6/0x540
> [  101.129421]  [<ffffffff816df81c>] ndisc_send_ns+0xdc/0x140
> [  101.129421]  [<ffffffff816d394b>] rt6_probe+0x31b/0x380
> [  101.129421]  [<ffffffff816d3690>] ? rt6_probe+0x60/0x380
> [  101.129421]  [<ffffffff816d1b93>] ? rt6_score_route+0xa3/0x380
> [  101.129421]  [<ffffffff816d471c>] ip6_pol_route.isra.47+0x2fc/0x4f0

Regards

--
Julian Anastasov <ja@ssi.bg>
Hannes Frederic Sowa Oct. 18, 2013, 4:33 p.m. UTC | #3
On Fri, Oct 18, 2013 at 09:33:13AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Fri, 18 Oct 2013, Hannes Frederic Sowa wrote:
> 
> > I played around with your patch and tested xt_TEE. I added a TEE rule to
> > mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:
> 
> 	May be a side effect of using some multicast address?
> Can you test it with such address check?:

I don't think that is the reason.

> 
> 	if (rt->rt6i_flags & RTF_GATEWAY ||
> 	    ipv6_addr_type(&rt->rt6i_gateway) & IPV6_ADDR_UNICAST)

Just checked, I had the same panic.

Greetings,

  Hannes

Julian Anastasov Oct. 19, 2013, 4:37 p.m. UTC | #4
Hello,

On Fri, 18 Oct 2013, Hannes Frederic Sowa wrote:

> I played around with your patch and tested xt_TEE. I added a TEE rule to
> mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:
> 
> [  101.126649] ------------[ cut here ]------------
> [  101.128436] BUG: unable to handle kernel paging request at fffffffb8a2fda88
> [  101.129421] IP: [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
> [  101.129421] PGD 1c0f067 PUD 0
> [  101.129421] Thread overran stack, or stack corrupted

	A problem with the process stack? Maybe some packet
loop happens? I cannot reproduce such a problem in my
virtual setup; I tested TEE too, with careful packet
matching and 1 CPU. Should I assume that you don't see such
an oops with the same TEE rule when the patch is not applied?

> [  101.129421] Oops: 0000 [#1] SMP

	You don't appear to have PREEMPT in the line above.
I'm not sure, when preemption is enabled, whether tee_tg6()
has a problem with its anti-loop measures (tee_active).
Is preemption possible in the OUTPUT hook, i.e. can we change
the CPU while playing with tee_active and as a result touch
a different CPU's flag?

Regards

--
Julian Anastasov <ja@ssi.bg>
Hannes Frederic Sowa Oct. 19, 2013, 6:34 p.m. UTC | #5
On Sat, Oct 19, 2013 at 07:37:10PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Fri, 18 Oct 2013, Hannes Frederic Sowa wrote:
> 
> > I played around with your patch and tested xt_TEE. I added a TEE rule to
> > mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:
> > 
> > [  101.126649] ------------[ cut here ]------------
> > [  101.128436] BUG: unable to handle kernel paging request at fffffffb8a2fda88
> > [  101.129421] IP: [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
> > [  101.129421] PGD 1c0f067 PUD 0
> > [  101.129421] Thread overran stack, or stack corrupted
> 
> 	Problem with process stack? May be some packet loop
> happens? Because I can not reproduce such problem in my
> virtual setup, I tested TEE too, with careful packet
> matching and 1 CPU. Should I assume that you don't have such
> oops when the patch is not applied, with the same TEE rule?

Oh, sorry, you are right. It happens with an unpatched net-next kernel, too.

I inserted the TEE rule in mangle/OUTPUT and had only one route (ip -6 r a
default via fe80::1 dev eth0), which at the time of the panic was actually
not reachable.
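
That setup amounts to roughly the following (a sketch; the interface name
and the TEE/ping addresses are placeholders, and both the default gateway
and the TEE target are assumed unreachable):

```shell
# Single default route via a link-local gateway that does not answer
ip -6 route add default via fe80::1 dev eth0

# Duplicate outgoing ICMPv6 in mangle/OUTPUT to another (unreachable) box
ip6tables -t mangle -A OUTPUT -p icmpv6 -j TEE --gateway 2001:db8::2

# Ping anything routed via the dead gateway; rt6_probe then sends NS
# packets that re-enter the TEE rule until the stack overruns
ping6 2001:db8::1
```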

> > [  101.129421] Oops: 0000 [#1] SMP
> 
> 	You don't appear to have PREEMPT in above line.
> I'm not sure when preemption is enabled if tee_tg6() does
> not have a problem with its anti-loop measures (tee_active).
> Is preemption possible in OUTPUT hook, i.e. can we change
> the CPU while playing with tee_active and as result change
> different flag?

Hm, maybe. I don't have too much insight into the netfilter stack and
the differences between the OUTPUT and FORWARD paths, but I plan to
investigate. ;)

Anyway, just wanted to let you know that unpatched kernels are affected,
too. I will have a closer look later.

Greetings,

  Hannes

Hannes Frederic Sowa Oct. 19, 2013, 10:36 p.m. UTC | #6
On Sat, Oct 19, 2013 at 08:34:33PM +0200, Hannes Frederic Sowa wrote:
> On Sat, Oct 19, 2013 at 07:37:10PM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Fri, 18 Oct 2013, Hannes Frederic Sowa wrote:
> > 
> > > I played around with your patch and tested xt_TEE. I added a TEE rule to
> > > mangle/OUTPUT and pinged. This happened; I have not yet analyzed it:
> > > 
> > > [  101.126649] ------------[ cut here ]------------
> > > [  101.128436] BUG: unable to handle kernel paging request at fffffffb8a2fda88
> > > [  101.129421] IP: [<ffffffff810c9737>] cpuacct_charge+0x97/0x200
> > > [  101.129421] PGD 1c0f067 PUD 0
> > > [  101.129421] Thread overran stack, or stack corrupted
> > 
> > 	Problem with process stack? May be some packet loop
> > happens? Because I can not reproduce such problem in my
> > virtual setup, I tested TEE too, with careful packet
> > matching and 1 CPU. Should I assume that you don't have such
> > oops when the patch is not applied, with the same TEE rule?
> 
> Oh, sorry, you are right. It happens with an unpatched net-next kernel, too.
> 
> I inserted the TEE rule in mangle/OUTPUT and had only one route (ip -6 r a
> default via fe80::1 dev eth0), which at the time of the panic was actually
> not reachable.
> 
> > > [  101.129421] Oops: 0000 [#1] SMP
> > 
> > 	You don't appear to have PREEMPT in above line.
> > I'm not sure when preemption is enabled if tee_tg6() does
> > not have a problem with its anti-loop measures (tee_active).
> > Is preemption possible in OUTPUT hook, i.e. can we change
> > the CPU while playing with tee_active and as result change
> > different flag?
> 
> Hm, maybe. I don't have too much insight into netfilter stack and
> what are the differences between OUTPUT and FORWARD path but plan to
> investigate. ;)

It seems tables are processed with BH disabled, so there is no preemption
while recursing. So I guess the use of tee_active is safe for breaking the
loop here.

The reason I exhaust the stack space is that we can actually send out
packets while looking up routes (rt6_probe). The unreachability of the
default gateway and of the teed-to box does the rest.

We need to change the route lookup for the duplicated packet in xt_TEE
so that it does not cause ndisc probes to be generated.

The more I review the patch, the more I think it is OK. But we could
actually try to just always return rt6i_gateway, as we should always be
handed a cloned rt6_info where the gateway is already filled in, no?

Greetings,

  Hannes

Hannes Frederic Sowa Oct. 19, 2013, 10:39 p.m. UTC | #7
On Sun, Oct 20, 2013 at 12:36:57AM +0200, Hannes Frederic Sowa wrote:
> The more I review the patch the more I think it is ok. But we could actually
> try to just always return rt6i_gateway, as we should always be handed a cloned
> rt6_info where the gateway is already filled in, no?

Ergh, this seems to break loopback then.
Julian Anastasov Oct. 20, 2013, 6:39 a.m. UTC | #8
Hello,

On Sat, 19 Oct 2013, Hannes Frederic Sowa wrote:

> On Sat, Oct 19, 2013 at 07:37:10PM +0300, Julian Anastasov wrote:
> > 
> > 	Problem with process stack? May be some packet loop
> > happens? Because I can not reproduce such problem in my
> > virtual setup, I tested TEE too, with careful packet
> > matching and 1 CPU. Should I assume that you don't have such
> > oops when the patch is not applied, with the same TEE rule?
> 
> Oh, sorry, you are right. It happens with an unpatched net-next kernel, too.
> 
> I inserted the TEE rule in mangle/OUTPUT and had only one route (ip -6 r a
> default via fe80::1 dev eth0), which at the time of the panic was actually
> not reachable.

	Thanks for the confirmation! I'll try later
to reproduce such a problem with TEE; it would be interesting
to know the real reason for this loop.

Regards

--
Julian Anastasov <ja@ssi.bg>
Hannes Frederic Sowa Oct. 20, 2013, 6:47 a.m. UTC | #9
On Sun, Oct 20, 2013 at 09:39:26AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Sat, 19 Oct 2013, Hannes Frederic Sowa wrote:
> 
> > On Sat, Oct 19, 2013 at 07:37:10PM +0300, Julian Anastasov wrote:
> > > 
> > > 	Problem with process stack? May be some packet loop
> > > happens? Because I can not reproduce such problem in my
> > > virtual setup, I tested TEE too, with careful packet
> > > matching and 1 CPU. Should I assume that you don't have such
> > > oops when the patch is not applied, with the same TEE rule?
> > 
> > Oh, sorry, you are right. It happens with an unpatched net-next kernel, too.
> > 
> > I inserted the TEE rule in mangle/OUTPUT and had only one route (ip -6 r a
> > default via fe80::1 dev eth0), which at the time of the panic was actually
> > not reachable.
> 
> 	Thanks for the confirmation! I'll try later
> to reproduce such problem with TEE, it is interesting
> to know the real reason for this loop.

Yup, I have a patch in the works which defers the ndisc_send_ns call in
rt6_probe to a workqueue, out of the call stack. This should solve
the problem.

I still wonder if there is a nicer API than creating a new structure and
passing the arguments to the deferred function via container_of
dereferencing (and I don't think async calls are usable there).

Greetings,

  Hannes

Julian Anastasov Oct. 20, 2013, 7:11 a.m. UTC | #10
Hello,

On Sun, 20 Oct 2013, Hannes Frederic Sowa wrote:

> > Hm, maybe. I don't have too much insight into netfilter stack and
> > what are the differences between OUTPUT and FORWARD path but plan to
> > investigate. ;)
> 
> It seems tables are processed with bh disabled, so no preemption while
> recursing. So I guess the use of tee_active is safe for breaking the
> tie here.

	Maybe; I'll check it again. For now I see only
rcu_read_lock() in nf_hook_slow(), which is preemptible.
Looking at rcu_preempt_note_context_switch, many levels of
RCU locks are preemptible too.

> The reason I exhaust stack space is that we can actually send out packets
> while looking up routes (rt6_probe). The nonreachability of the default
> gateway and the to-teed-to box does the rest.

	In my test I used a link route to the local subnet,
with --gateway pointing to an IP that is not present. I'll
try other variants.

> We need to change the route lookup of the duplicated packet in xt_tee to not
> cause ndisc probes to be generated.
> 
> The more I review the patch the more I think it is ok. But we could actually
> try to just always return rt6i_gateway, as we should always be handed a cloned
> rt6_info where the gateway is already filled in, no?

	Yes, this patch is OK, and after spending the whole
Saturday on it I'm preparing a new patch that will convert
rt6_nexthop() to return just rt6i_gateway, without daddr.
This can happen after filling rt6i_gateway in all places.

	As for your concern about loopback, I don't see a
problem: local/anycast routes will have rt6i_gateway set to
the IP; they are simple DST_HOST routes. I'm preparing the
patches now and will post them in the following hours.

Regards

--
Julian Anastasov <ja@ssi.bg>

Patch

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index f525e70..481404a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -196,7 +196,7 @@  static inline int ip6_skb_dst_mtu(struct sk_buff *skb)
 
 static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt, struct in6_addr *dest)
 {
-	if (rt->rt6i_flags & RTF_GATEWAY)
+	if (rt->rt6i_flags & RTF_GATEWAY || !ipv6_addr_any(&rt->rt6i_gateway))
 		return &rt->rt6i_gateway;
 	return dest;
 }