LinuxSir.cn
On ip_dst_cache overflow...

Posted 2004-11-24 10:47:16

I've run into the ip_dst_cache overflow problem, especially when virus traffic is rampant. The ip_dst_cache count in /proc/slabinfo keeps growing and never goes down, even when nobody is using the Linux router at all.

I searched many posts online and found no discussion among Chinese users; it's only foreigners who have hit this problem.

I posted on ChinaUnix and got no replies. LinuxSir is more specialized, so I hope an expert here can help. Thanks.

A related thread, for example:
http://www.ussg.iu.edu/hypermail/linux/kernel/0311.1/1070.html

If there is any news, please mail newjintao@yahoo.com.cn.
I'll keep watching this thread. Thanks...

Hello

We have about 25 systems that receive data via a PCI DVB card from satellite.
The data is received through multiple multicast streams by some closed-source
software. On all systems we notice that the free memory decreases until, in
most cases, the systems are no longer reachable via network. They then
constantly print out: dst cache overflow. But I have also noticed that some
systems lock up hard; I assume this is because we just increase the
route-cache limit in /proc/sys/net/ipv4/route/max_size to some very
large value.

I also know that the German Telekom and Eumetsat have the same problems
and always have to reboot their systems. I also have reports from Austria
and expect many more systems in Europe are affected.

To get more information I have set up 3 systems with different kernels and
hardware and noticed that over time ip_dst_cache and skbuff_head_cache
in /proc/slabinfo always increase. They never go down. Also one or more of
the size-x values always increase depending on the kernel and DVB card
being used. Here are some more slabinfo details and the hardware being used:

System1 : PIII 450MHz, 256MB ram, Kernel 2.4.23-pre9, pent@value DVB card
System2 : PII 350MHz, 384MB ram, Kernel 2.4.21, pent@value DVB card
System3 : P4 2.4GHz with HT enabled, 1 GB ram (high mem enabled),
Kernel 2.4.23-rc1 and libata patch, Nova-S DVB card

Now the slabinfo data every 24 hours:

System1:

ip_dst_cache 647 672 160 27 28 1
ip_dst_cache 7444 7464 160 311 311 1
ip_dst_cache 14339 14352 160 598 598 1
ip_dst_cache 21106 21120 160 880 880 1
ip_dst_cache 28101 28104 160 1171 1171 1

skbuff_head_cache 796 1008 160 41 42 1
skbuff_head_cache 7588 7824 160 326 326 1
skbuff_head_cache 14482 14688 160 612 612 1
skbuff_head_cache 21258 21480 160 895 895 1
skbuff_head_cache 28255 28416 160 1184 1184 1

size-2048 685 968 2048 343 484 1
size-2048 7483 7676 2048 3742 3838 1
size-2048 14376 14398 2048 7188 7199 1
size-2048 21146 21216 2048 10573 10608 1
size-2048 28142 28292 2048 14071 14146 1

System2:

ip_dst_cache 9 48 160 1 2 1
ip_dst_cache 7437 7464 160 311 311 1
ip_dst_cache 15161 15168 160 632 632 1
ip_dst_cache 18831 18840 160 785 785 1

skbuff_head_cache 14 24 160 1 1 1
skbuff_head_cache 11482 12168 160 500 507 1
skbuff_head_cache 23312 23904 160 996 996 1
skbuff_head_cache 28900 29640 160 1235 1235 1

size-128 611 660 128 21 22 1
size-128 11987 12210 128 402 407 1
size-128 23800 23970 128 798 799 1
size-128 29445 29670 128 983 989 1


Slabinfo for every 12 hours and CONFIG_DEBUG_SLAB set:

System3:

ip_dst_cache 576 576 160 24 24 1 : 576 576 24 0 0 : 252 126 : 1946 48 1426 0
ip_dst_cache 17760 17760 160 740 740 1 : 17760 17760 740 0 0 : 252 126 : 46553 1480 29557 0
ip_dst_cache 35376 35376 160 1474 1474 1 : 35376 36403 1474 0 0 : 252 126 : 94140 3014 60309 0
ip_dst_cache 51624 51624 160 2151 2151 1 : 51624 53444 2151 0 0 : 252 126 : 138864 4431 89547 0

skbuff_head_cache 1311 1311 168 57 57 1 : 1311 79557 57 0 0 : 252 126 : 82108 735 81114 621
skbuff_head_cache 18492 18492 168 804 804 1 : 18492 3300792 804 0 0 : 252 126 : 3320868 27658 3303434 26050
skbuff_head_cache 36133 36133 168 1571 1571 1 : 36133 6652585 1583 12 0 : 252 126 : 6684139 55715 6649977 52420
skbuff_head_cache 52371 52371 168 2277 2277 1 : 52371 9913620 2294 17 0 : 252 126 : 9957116 82923 9907545 78097

size-8192 540 540 8192 540 540 2 : 540 3196 540 0 0 : 0 0 : 0 0 0 0
size-8192 17736 17738 8192 17736 17738 2 : 17738 23194 17738 0 0 : 0 0 : 0 0 0 0
size-8192 35367 35367 8192 35367 35367 2 : 35367 43715 35374 7 0 : 0 0 : 0 0 0 0
size-8192 51596 51598 8192 51596 51598 2 : 51598 62824 51611 13 0 : 0 0 : 0 0 0 0

size-2048 452 512 2048 240 256 1 : 512 75002 256 0 0 : 60 30 : 140293 2995 140145 2485
size-2048 454 514 2048 238 257 1 : 514 3029044 257 0 0 : 60 30 : 5130850 101465 5130703 100953
size-2048 456 486 2048 241 243 1 : 530 6113873 593 350 0 : 60 30 : 10457205 204975 10457530 203655
size-2048 454 484 2048 239 242 1 : 542 9104228 1042 800 0 : 60 30 : 15398297 305608 15399447 303014

size-128 2016 2268 136 78 81 1 : 2268 9125 81 0 0 : 252 126 : 23644 195 22128 56
size-128 19096 19096 136 682 682 1 : 19096 26457 682 0 0 : 252 126 : 131136 1401 113018 58
size-128 36708 36708 136 1311 1311 1 : 36708 59707 1317 6 0 : 252 126 : 255889 2833 220918 144
size-128 52920 52920 136 1890 1890 1 : 52920 81855 1911 21 0 : 252 126 : 370264 4135 319786 153

size-64 7844 7844 72 148 148 1 : 7844 7931 148 0 0 : 252 126 : 15660 253 9102 0
size-64 18497 18497 72 349 349 1 : 18497 18584 349 0 0 : 252 126 : 110763 655 93784 0
size-64 24963 24963 72 471 471 1 : 24963 32458 471 0 0 : 252 126 : 209402 1008 186275 0
size-64 34503 34503 72 651 651 1 : 34503 48900 651 0 0 : 252 126 : 305026 1613 272674 0

There is much more data available; the full slabinfo was taken every
hour for each system. Additionally, with the help of Jörn Engel I managed
to set up System1 with the gcov kernel patch and have all data available on
an hourly basis until the system reached "dst cache overflow". I have
tried very hard to evaluate this data myself, but find that the Linux
network code is way beyond my C programming knowledge.

Another thing I noticed is that as memory usage increases, the systems
become slower when you log in and work on them.

Has anyone any suggestion of what else I can do to narrow down the problem?

What I am also not sure about is whether it is correct to assume the bug is
in the IPv4 multicast implementation, or could it still be a driver problem?
But I assume two completely different drivers make that very unlikely.

Please, can someone help me find the bug? I am willing to do any tests
or provide more information.

Thanks,
Holger

PS: Please cc me, since I am not on the list.
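The daily samples quoted above can be reduced to a growth rate with a short script. This is only a sketch; it assumes the 2.4 slabinfo layout shown in the dumps (cache name in column 1, active object count in column 2), and the `growth` helper name is my own.

```shell
growth() {
    # growth CACHE FILE: print the delta in active objects between each
    # consecutive sample line for the named slab cache.
    awk -v name="$1" '$1 == name {
        if (seen) print $2 - prev
        prev = $2; seen = 1
    }' "$2"
}
```

Run against System1's ip_dst_cache samples, this shows a gain of roughly 6,800-7,000 entries per day, i.e. a steady leak rather than a burst.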

The symptom: once ip_dst_cache reaches the maximum set in /proc/sys/net/ipv4/route/max_size, the kernel keeps printing "dst cache overflow" to the console...
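For anyone hitting the same wall, a minimal watchdog sketch. The procfs paths are the standard 2.4 entries; the 80% threshold and the helper names are arbitrary illustrative choices, not anything from the thread.

```shell
#!/bin/sh
# Sketch: warn before the route cache reaches max_size, at which point
# the kernel starts logging "dst cache overflow".

slab_objs() {
    # slab_objs CACHE [FILE]: print the active-object count (column 2)
    # for the named slab cache; FILE defaults to /proc/slabinfo.
    awk -v cache="$1" '$1 == cache { print $2; exit }' "${2:-/proc/slabinfo}"
}

check_dst_cache() {
    # check_dst_cache USED MAX: warn once usage passes 80% of the limit.
    used=$1; max=$2
    if [ "$used" -gt $(( max * 8 / 10 )) ]; then
        echo "WARN: route cache at ${used}/${max}"
        # On a live router one could then force a flush (this clears only
        # unreferenced entries, so a true leak will still grow back):
        #   echo 1 > /proc/sys/net/ipv4/route/flush
    else
        echo "OK: route cache at ${used}/${max}"
    fi
}

# Example wiring on a live 2.4 system (requires root):
#   check_dst_cache "$(slab_objs ip_dst_cache)" \
#                   "$(cat /proc/sys/net/ipv4/route/max_size)"
```

Note that flushing only buys time: if dst entries are leaked with their reference counts held, as in this thread, the cache will climb right back.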
Posted 2004-11-25 10:59:27

I haven't run into this problem. Let's file it under unsolved.
OP | Posted 2004-11-25 14:27:29

Thanks, moderator... At first glance the cause looks like a problem inside the kernel...
A friend is reading the source to figure out what it is...
I'll post the answer here once we have one...
Posted 2004-11-25 22:55:35

Overflow?
Did someone exploit you???
OP | Posted 2004-11-26 08:39:25

No, we weren't exploited; the problem appears on its own...
During testing I connected directly to the box and ran hping2, and the same problem showed up...
The problem has been located... it's in the kernel...
If you have time, take a look at ip_dst_cache in route.c.
Not solved yet; I'll post the solution once we have it...

btw: I forgot one more possibility — the NIC driver...
OP | Posted 2004-12-21 15:10:20

Still not solved...

So frustrating...
OP | Posted 2004-12-27 17:59:49

The problem is solved, and the fix is now being tested... It turned out to be a bug in the network interface driver supplied by Intel. Thanks, moderator...
Posted 2005-9-13 10:19:09

I've run into this myself, and Google results suggest it's kernel-related.
Once I used my laptop for dial-up, with iptables doing NAT. A machine on the office LAN was flooding packets like mad and completely silenced my laptop, cutting off Internet access for every machine on the LAN. The laptop behaved exactly as if it had crashed. I pulled the network cable immediately, and only after three minutes did it slowly come back to life and become usable again.
Platform: RedHat AS-4.0
Posted 2005-9-15 00:01:01

There's no good fix. For now we just inspect the iptables conntrack entries, work out the traffic signature, and then use iptables to drop the junk packets. That's how my company's gateway handles it — rather reactive.
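A hypothetical sketch of that workflow: /proc/net/ip_conntrack is the real 2.4 conntrack table, but the `top_talkers` helper name is my own, and which address counts as "junk" is a judgment call.

```shell
# Count conntrack entries per originating source address. Each
# ip_conntrack line carries two src= fields (original and reply
# direction); only the first per line is taken.

top_talkers() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^src=/) { sub(/^src=/, "", $i); print $i; next }
    }' "$1" | sort | uniq -c | sort -rn
}

# On a live gateway (requires root):
#   top_talkers /proc/net/ip_conntrack | head
# and once the flooding host is identified:
#   iptables -I FORWARD -s <offending-ip> -j DROP
```

Dropping early in FORWARD at least keeps the flood from populating the route cache through the NAT box, though as the poster says, it is a reactive measure.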
Posted 2005-10-11 09:25:40

Has this been solved?
https://www.redhat.com/archives/ ... March/msg05797.html
is worth a look. Most 2.4 kernels probably have this problem.