gateway icmp redirect handling problem (3.0.36-3.0.23)

Simon Roscic

2012-07-20 22:52:24 UTC

Hello,

I'm experiencing the following problem with kernel versions 3.0.36
(down to 3.0.23):

on our network we all have one default gateway, it's 10.1.1.254, but
there are some networks for which we have another gateway, and for
these networks the default gateway sends an icmp redirect.

let's assume my test machine has ip 10.1.20.79, netmask 255.255.0.0,
and my default gateway is 10.1.1.254. i now ping the following ip:
10.109.98.11. my default gateway (10.1.1.254) now sends me an icmp
redirect to another gateway (10.1.1.1) ... and now everything works as
expected, i get the replies from 10.109.98.11, but not for long: after
approx. 60 seconds i only get "ping: sendmsg: Network is down".
(exact same problem with all other tcp/udp protocols, but i used ping
for the tests because it also prints the redirect messages to the
console)
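
The addressing in this setup can be double-checked with a small sketch.
It uses only the addresses given in this post and Python's standard
`ipaddress` module; the helper name `on_link` is made up for
illustration. It shows that both gateways sit on the test machine's own
network while the ping target does not.

```python
import ipaddress

# Addresses taken from the post; netmask 255.255.0.0 is a /16.
host = ipaddress.ip_interface("10.1.20.79/255.255.0.0")
default_gw = ipaddress.ip_address("10.1.1.254")
other_gw = ipaddress.ip_address("10.1.1.1")
target = ipaddress.ip_address("10.109.98.11")


def on_link(addr):
    """Return True if addr lies inside the host's directly connected network."""
    return addr in host.network


print("network:", host.network)
print("default gateway on-link:", on_link(default_gw))
print("other gateway on-link:", on_link(other_gw))
print("target on-link:", on_link(target))
```

Both gateways are reachable directly on the same segment and the target
is not, so the first packet goes to the default gateway and the better
next hop is on the same network.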

so let's have a closer look:

not ok - kernel versions 3.0.36 down to 3.0.23:
-----------------------------------------------

test-simon:~ # ping 10.109.98.11
..
64 bytes from 10.109.98.11: icmp_seq=62 ttl=60 time=12.1 ms
64 bytes from 10.109.98.11: icmp_seq=63 ttl=60 time=11.6 ms
ping: sendmsg: Network is down
ping: sendmsg: Network is down
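
A rough way to pin down when the failure starts is to scan the ping
output for the last reply before the first error. The sketch below is
only an illustration: the function name is made up, and the sample text
is the output shown above.

```python
import re

# Sample output copied from the failing run above.
PING_OUTPUT = """\
64 bytes from 10.109.98.11: icmp_seq=62 ttl=60 time=12.1 ms
64 bytes from 10.109.98.11: icmp_seq=63 ttl=60 time=11.6 ms
ping: sendmsg: Network is down
ping: sendmsg: Network is down
"""


def last_reply_before_failure(output):
    """Return the icmp_seq of the last reply seen before the first
    "Network is down" error, or None if no such error occurs."""
    last_seq = None
    for line in output.splitlines():
        if "sendmsg: Network is down" in line:
            return last_seq
        match = re.search(r"icmp_seq=(\d+)", line)
        if match and " bytes from " in line:
            last_seq = int(match.group(1))
    return None


print(last_reply_before_failure(PING_OUTPUT))
```

With ping's default interval of one second, the sequence number of the
last reply roughly matches the number of seconds the connection worked.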

when looking at "ip neigh" the "ping: sendmsg: Network is down" message
appears in the exact moment when the arp entry for the default gateway
(10.1.1.254) gets removed from the arp cache:

ping "OK"
test-simon:~ # ip neigh
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE
10.1.1.254 dev eth0 lladdr 00:1a:64:8f:23:64 STALE

ping "dead"
test-simon:~ # ip neigh
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE

so it seems that when the default gateway is removed from the arp cache
something goes wrong in the kernel route handling. i don't know the
internals of the linux route handling, now i need your help, any ideas
what's going wrong?
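
To catch the moment the entry disappears without watching the console,
two "ip neigh" snapshots can be compared. This is a minimal sketch that
assumes the output format shown above; the function names are invented
for illustration.

```python
def parse_ip_neigh(text):
    """Map each neighbour IP to its state (the last field of the line)."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        # Only lines of the form "<ip> dev <ifname> ..." are considered.
        if len(fields) >= 2 and fields[1] == "dev":
            entries[fields[0]] = fields[-1]
    return entries


def vanished(before_text, after_text):
    """Return neighbours present in the first snapshot but not the second."""
    before = parse_ip_neigh(before_text)
    after = parse_ip_neigh(after_text)
    return sorted(set(before) - set(after))


# The two snapshots from this post: ping "OK" and ping "dead".
SNAPSHOT_OK = """\
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE
10.1.1.254 dev eth0 lladdr 00:1a:64:8f:23:64 STALE
"""
SNAPSHOT_DEAD = """\
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE
"""

print(vanished(SNAPSHOT_OK, SNAPSHOT_DEAD))
```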

i did a lot of tests: the problem i described first appears in kernel
version 3.0.23. in the changelog of 3.0.23 i found the following two
commits:
(http://www.kernel.org/pub/linux/kernel/v3.0/ChangeLog-3.0.23)

commit 42ab5316ddcaa0de23e88e8a3d363c767b9ab0b3
Author: Eric Dumazet <***@gmail.com>
Date: Fri Nov 18 15:24:32 2011 -0500
ipv4: fix redirect handling

commit bebee22bcbf0026f92141990972bd5863ef9b69c
Author: Flavio Leitner <***@redhat.com>
Date: Mon Oct 24 02:56:38 2011 -0400
route: fix ICMP redirect validation

i then took the net/ipv4/route.c file from kernel 3.0.22 and replaced
the version in 3.0.23 with it. this reverts the two patches mentioned
above (if i haven't overlooked something); after that the problem
disappears.
so those two patches surely fixed some problem, but for kernel versions
3.0.23-3.0.36 they broke the gateway icmp redirect handling as
described here.

i did some further tests with different kernel versions:
3.5-rc6: OK
3.4.4: OK
3.2.22: OK
3.0.1 - 3.0.22: OK
3.0.23 - 3.0.36: not OK
2.6.35.13: OK
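
The table above can be turned into a small lookup, for example to check
a fleet of machines against it. This is only a sketch: the names are
made up, the ranges are exactly the ones tested here, and anything
outside them is reported as "untested" rather than guessed. The
release candidate 3.5-rc6 is left out because it is not a plain dotted
version.

```python
# Results exactly as listed in the table; ranges are inclusive.
TESTED = [
    ((2, 6, 35, 13), (2, 6, 35, 13), "OK"),
    ((3, 0, 1), (3, 0, 22), "OK"),
    ((3, 0, 23), (3, 0, 36), "not OK"),
    ((3, 2, 22), (3, 2, 22), "OK"),
    ((3, 4, 4), (3, 4, 4), "OK"),
]


def parse_version(version):
    """Turn a dotted version such as "3.0.34" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def lookup(version):
    """Return the tested result for a version, or "untested"."""
    v = parse_version(version)
    for low, high, result in TESTED:
        if low <= v <= high:
            return result
    return "untested"


for version in ("3.0.22", "3.0.23", "3.0.34", "3.2.22"):
    print(version, lookup(version))
```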

now lets have a closer look at a kernel version which works:
------------------------------------------------------------

this is from 3.5-rc6, but 3.4.4, 3.2.22 and 2.6.35.13 also behave
exactly this way. 3.0.1-3.0.22 behave slightly differently, see note
below.

test-simon:~ # ping 10.109.98.11
PING 10.109.98.11 (10.109.98.11) 56(84) bytes of data.
64 bytes from 10.109.98.11: icmp_seq=1 ttl=60 time=15.2 ms
From 10.1.1.254: icmp_seq=2 Redirect Host(New nexthop: 10.1.1.1)
..

test-simon:~ # ip neigh
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE
10.1.1.254 dev eth0 lladdr 00:1a:64:8f:23:64 STALE

and after approx 60 or so seconds:

test-simon:~ # ip neigh
10.1.1.1 dev eth0 lladdr 00:00:0c:9f:f0:64 REACHABLE

and ping (and everything else) is as expected still working.

note:
-----

on 3.0.1-3.0.22:

i see lots of icmp redirects sent from the default gateway (10.1.1.254)
to my test machine, while running tcpdump on the default gateway
(10.1.1.254) i see every ping packet also arriving there and also some
icmp redirect messages going out to my test machine.
but everything works so i think my test machine is correctly talking to
the destination using the other gateway (10.1.1.1).
i also sniffed a windows 7 client pc, it looks the same there, so
possibly no problem, but i mention this because kernel versions 3.5-rc6,
3.4.4, 3.2.22 and 2.6.35.13 act differently (see below).

on 3.0.23-3.0.36:

i see lots of icmp redirects sent from the default gateway (10.1.1.254)
to my test machine. while running tcpdump on the default gateway
(10.1.1.254) i see up to 20 ping packets arriving there and also up to
17 icmp redirect messages going out to my test machine. after the 20th
ping packet i don't see further ping packets arriving at the default
gateway, so i think my test machine is then only talking to the other
gateway (10.1.1.1).
..
17:48:41.643952 IP 10.1.1.254 > 10.1.20.79: ICMP redirect 10.109.98.11 to host 10.1.1.1, length 92
..
17:48:44.649008 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 30733, seq 20, length 64
17:48:44.649018 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 30733, seq 20, length 64

on 3.5-rc6, 3.4.4, 3.2.22 and 2.6.35.13:

here it looks different, and for me this is the expected behavior, or
at least the behavior i have seen from lots of linux machines on my
network. i see 1-2 icmp redirects sent from the default gateway
(10.1.1.254) to my test machine, while running tcpdump on the default
gateway (10.1.1.254) i only see up to 2 ping packets arriving then
nothing, so then my test machine seems to only talk to the other gateway
(10.1.1.1).

17:50:58.995894 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 1, length 64
17:50:58.995914 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 1, length 64
17:50:59.997260 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 2, length 64
17:50:59.997277 IP 10.1.1.254 > 10.1.20.79: ICMP redirect 10.109.98.11 to host 10.1.1.1, length 92
17:50:59.997287 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 2, length 64
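
Counting echo requests and redirects by hand gets tedious on longer
captures, so here is a small sketch that tallies them. It assumes
tcpdump text output with one packet per line, as in the sample from
the 3.5-rc6 capture; the regular expression and function name are
illustrative only.

```python
import re
from collections import Counter

# Sample lines from the 3.5-rc6 capture, one packet per line.
CAPTURE = """\
17:50:58.995894 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 1, length 64
17:50:58.995914 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 1, length 64
17:50:59.997260 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 2, length 64
17:50:59.997277 IP 10.1.1.254 > 10.1.20.79: ICMP redirect 10.109.98.11 to host 10.1.1.1, length 92
17:50:59.997287 IP 10.1.20.79 > 10.109.98.11: ICMP echo request, id 10766, seq 2, length 64
"""

# timestamp, "IP", source, ">", destination, then the ICMP message type.
LINE = re.compile(r"^\S+ IP (\S+) > (\S+): ICMP (echo request|redirect)")


def summarize(capture):
    """Count echo requests and redirects in tcpdump text output."""
    counts = Counter()
    for line in capture.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(3)] += 1
    return counts


print(dict(summarize(CAPTURE)))
```

Running the same function over a capture from an affected kernel makes
the difference in the number of redirects easy to compare.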

..

(before someone asks why i "must" use kernel 3.0.x ... because these are
SLES 11 SP2 VMs and they currently ship kernel 3.0.34)

i hope i described the problem in a way that the kernel network
stack maintainers can understand. please contact me if you
have further questions, and please CC me as i am not subscribed to
linux-kernel. this message is already on linux-netdev; if you wish you
can also CC your answer there.

kind regards,
Simon Roscic.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Rune Darrud

2012-08-28 17:50:23 UTC

See inline answer below.

Post by Simon Roscic
3.5-rc6: OK
3.4.4: OK
3.2.22: OK
3.0.1 - 3.0.22: OK
3.0.23 - 3.0.36: not OK
2.6.35.13: OK

Let me add that kernel 3.0.38 on SLES 11 SP2 shows the same problem.
Restarting the network resolves it temporarily, for a few hours. After the
upgrade from 2.6.3x to 3.0.38 via zypper the machine ran fine for a few hours
before the problem appeared, so this is not a good situation.

Post by Simon Roscic
.....
(before someone asks why i "must" use kernel 3.0.x ... because these are
SLES 11 SP2 VMs and they currently ship kernel 3.0.34)

Going to raise an SR with Novell about this.

Post by Simon Roscic
i hope i described the problem in a way so that the kernel network
stack maintainers can understand the problem, please contact me if you
have further questions, and please CC me as i am not subscribed to
linux-kernel. this message is already on linux-netdev, if you wish you
can CC your answer also there.
kind regards,
Simon Roscic.

Best regards,
Rune "TheFlyingCorpse" Darrud

s***@accedian.com

2019-01-24 22:05:49 UTC

Did you end up figuring out a way to fix this? I am in the same dilemma. I tried to follow this thread on netdev and looked through the commits, but could not track down the direct fix. Any help would be appreciated :)

s***@gmail.com

2019-01-24 22:06:45 UTC
