A small server problem today (March 22, 2004)

Related Technology 2004-7-27 18:35

When I came in this morning, the virtual hosting server was down. I reset it and had a look around; nothing seemed seriously wrong...

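This server does not normally go down, so I checked the logs and found an error message from the early hours: "ENOMEM in do_get_write_access retrying". I headed straight for the bug hunters' playground at http://bugzilla.redhat.com/bugzilla and, sure enough, found some information: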

system lockup with message "ENOMEM in do_get_write_access retrying"

Description of problem:
System locked up when using 'rdist' to copy many files onto an ext3 file system.

Console login did not respond, as well as ssh login. We had to power the system
off to restart it.

Additional messages seen were:

ENOMEM in new_handle, retrying.
ENOMEM in journal_get_undo_access_Rsmp_df5dec49, retrying.

The system is running kernel 2.4.18-18.7.xbigmem.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Use the rdist utility to send a large number of files to the machine with the
new kernel.

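That is the official word, and after reading it I felt reassured: as long as it was not some expert dropping by to teach me a lesson, I was fine. (The last time I set up a server, there was the well-known login vulnerability in the Solaris telnet service. I meant to patch it the next day and then forgot. An expert apparently happened to pass by and came in through that hole. A kind soul, though: they looked around, left nothing behind, and turned telnet off for me for safety's sake. When I found out, I was happy for half a day (teary-eyed); I had run into a good person~~)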

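Red Hat's information was still too thin, though. Since the machine with the problem is a Dell server, I also went to Dell's official site to look through the bug reports (http://lists.us.dell.com/mailman/listinfo/linux-poweredge). There really are a lot of helpful people out there; I found some more material: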

The system takes over (summary)
Norman Gaywood norm@turing.une.edu.au
Tue Dec 3 17:43:00 2002

About a week ago I posted a message saying that I was experiencing a
system slowdown to the point of not working while doing a large rsync
copy over the network. After much playing around I have eliminated many
things such as incorrect software raid setup, bad disks, bad SCSI
controller, network, etc. Towards the end of the thread people were
pointing at the linux VM code and suspecting that to be the cause.

The details of my system are at the end of this message.

I can now say that I can trigger this problem in about 30-40 minutes. At
the end of that time, kswapd will start to get a larger % of CPU and
the system load will be around 2-3. The system will feel sluggish at an
interactive shell and it will take several seconds before a command like
top would start to display. If I let it go for another 30 minutes the
system is unusable where it could take 10 minutes or more to do simple
commands. If I let it go for several hours after that, the following
messages can appear on the console depending on the type of copy:

ENOMEM in journal_get_undo_access_Rsmp_df5dec49, retrying.

or

ENOMEM in do_get_write_access, retrying.

The problem can be triggered by almost any type of copy command. In
particular, this command can trigger it:

tar cf /dev/tape .

for . large enough. Unfortunately this was how I was intending to backup
the system.

"Large enough" is several gigabytes. It also seems to depend on how much
memory is used. In particular, how much memory is used by cache.

Can it be stopped? Yes. Stephan Wonczak suggested that I should put the
system under some memory pressure while doing the copy. The program he
supplied used about 750 megabytes just to use some memory. I tried
running this at 10 second intervals while doing a copy but it did not
help. Since the system has 16 Gig of memory, I tried to give it some
real memory pressure and ran 7 processes that used 1.8G each like this:

#!/bin/sh
# Seconds to wait between rounds of memory pressure
SLEEP=600
COUNT=20

# COUNT is never decremented, so this loop runs until it is killed
while [ `expr $COUNT - 1` != 0 ]
do
    date
    # 2000 by 1_000_000 seems to be a 1.8G process
    # Six copies run in the background, the seventh in the foreground
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }' &
    perl -e '$i=2000;while ($i--){ $a[$i]="x"x1_000_000; }'
    sleep $SLEEP
done

This brought the cache down to about 3-4 Gig after it ran. With this
running the system performed the copy with no problems!

There is a suggestion that I may not see this problem when the system is
under real load. Since I am only setting up the system at the moment there
are no users giving the system something to do. The copy is the only real
work during these tests. I find it difficult to say "she'll be right",
(as we do in Aus) and throw the system into production hoping that it
will just work.

So what do I do now? I have what I believe is a trigger for a VM problem
in linux. Anyone have some patches for me to try?

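After reading this I more or less understood what was going on, and I had something to tell my boss. As for how to fix it... this fellow keeps asking "So what do I do now?" too. OK... that is exactly what I want to ask... (faint)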
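
The quoted post ties the problem to how much memory the page cache holds, and the symptom shows up as ENOMEM lines in the kernel log. A rough way to keep an eye on both might look like the sketch below. The function names `cached_mib` and `count_enomem` are made up for illustration, and the default paths `/proc/meminfo` and `/var/log/messages` are assumptions that may not match every system.

```sh
#!/bin/sh
# Sketch only: function names and default file paths are assumptions.

# Print the page cache size in MiB, read from a meminfo-style file
# (the "Cached:" line is reported in kB).
cached_mib() {
    meminfo=${1:-/proc/meminfo}
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "$meminfo"
}

# Count log lines that contain the journal "ENOMEM in ..." retry messages.
count_enomem() {
    logfile=${1:-/var/log/messages}
    grep -c 'ENOMEM in ' "$logfile"
}
```

Calling `cached_mib` and `count_enomem` from cron, or in a loop while a large copy is running, would show whether the cache keeps growing before the messages appear. This only observes the problem; it does not fix it.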


一种心情 一种生活 (One Mood, One Life)
