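In the top output below, taken during normal web serving with no other processes running, systemd, systemd-logind, systemd-journal and dbus-daemon are using a combined 10.7% of the quad-core CPU, while systemd alone is consuming 19% of the system's 16 GB of memory. This is not normal behavior, and after searching I have not found anyone else reporting this problem. What could be causing this resource drain? Any suggestions would be appreciated.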
Output from top while otherwise idle (apart from web serving):
top - 08:51:31 up 16 days, 13:43,  2 users,  load average: 1.84, 1.39, 1.07
Tasks: 297 total,   2 running, 295 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.6 us,  3.6 sy,  0.0 ni, 90.6 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16212992 total,  2466564 free,  4275764 used,  9470664 buff/cache
KiB Swap:  4194300 total,  4070740 free,   123560 used. 10707392 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  743 dbus      20   0   27104   1856   1152 S   3.3  0.0 304:27.19 dbus-daemon
    1 root      20   0 3247784 2.920g   1800 S   3.0 18.9 287:41.35 systemd
  737 root      20   0   27416   2524   1304 S   2.7  0.0 225:32.66 systemd-logind
  736 root      20   0  434760   3756   3076 S   2.0  0.0 172:26.53 NetworkManager
  548 root      20   0   82276  34652  34516 S   1.7  0.2 160:20.16 systemd-journal
  770 polkitd   20   0  522920   2956   2248 S   1.7  0.0 120:06.11 polkitd
  716 root      16  -4  116744   1368   1312 S   1.3  0.0  93:26.54 auditd
 3778 nginx     20   0  446488  14688   6564 S   1.3  0.1   2:18.80 php-fpm
 3847 nginx     20   0  446316  14588   6548 S   1.3  0.1   2:19.29 php-fpm
 7000 nginx     20   0  446132  14400   6544 S   1.3  0.1   1:22.77 php-fpm
14862 nginx     20   0  446304  14600   6580 S   1.3  0.1   1:32.25 php-fpm
30333 nginx     20   0  446292  14468   6528 S   1.3  0.1   1:40.78 php-fpm
  740 root      20   0  784980  20112  19696 S   1.0  0.1  76:12.69 rsyslogd
 3521 nginx     20   0  446188  14848   6748 S   1.0  0.1   2:20.00 php-fpm
 3687 nginx     20   0  446036  14688   6764 S   1.0  0.1   2:20.45 php-fpm
 3689 nginx     20   0  446408  14604   6552 S   1.0  0.1   2:19.75 php-fpm
 3774 nginx     20   0  446288  14568   6552 S   1.0  0.1   2:19.68 php-fpm
 3836 nginx     20   0  447416  15572   6564 S   1.0  0.1   2:21.06 php-fpm
 4861 nginx     20   0  446260  14576   6540 S   1.0  0.1   2:18.94 php-fpm
 4862 nginx     20   0  446508  15084   6764 S   1.0  0.1   2:20.71 php-fpm
13538 nginx     20   0  447204  15452   6572 S   1.0  0.1   1:32.33 php-fpm
15530 nginx     20   0  446292  14520   6528 S   1.0  0.1   1:32.55 php-fpm
28468 nginx     20   0  446356  14672   6568 S   1.0  0.1   1:42.21 php-fpm
29564 nginx     20   0  446292  14536   6548 S   1.0  0.1   1:41.11 php-fpm
30851 nginx     20   0  445956  14568   6748 S   1.0  0.1   1:49.66 php-fpm
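Since the main concern is PID 1's memory, one way to follow systemd's %MEM figure over time, instead of sampling top by hand, is to read it straight from /proc. This is a minimal bash sketch, assuming a Linux /proc layout in which VmRSS (in /proc/<pid>/status) and MemTotal (in /proc/meminfo) are both reported in kB; the function name mem_pct and its optional file arguments are hypothetical, added only so it can be pointed at saved copies of those files.

```shell
# Hypothetical helper: print a process's resident memory (VmRSS) and that
# value as a percentage of MemTotal, roughly what top shows as RES and %MEM.
# Both arguments are optional so the function can be tried on saved copies:
#   $1 - a /proc/<pid>/status file (default /proc/1/status, i.e. systemd)
#   $2 - a /proc/meminfo file      (default /proc/meminfo)
mem_pct() {
  local status=${1:-/proc/1/status} meminfo=${2:-/proc/meminfo}
  local rss total
  # VmRSS and MemTotal are both reported in kB, so no unit conversion is needed.
  rss=$(awk '/^VmRSS:/ {print $2}' "$status")
  total=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
  awk -v r="$rss" -v t="$total" 'BEGIN { printf "%d kB %.1f%%\n", r, r * 100 / t }'
}
```

For example, leaving `while sleep 3600; do echo "$(date -Is) $(mem_pct)"; done >> /tmp/pid1-mem.log` running in a spare terminal would record hourly samples that can later be lined up against the journal.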
Edit 2-14-16
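I may have found something relevant in the output of "sudo journalctl" (see below). Every few hours there are long runs of lines, lasting for hours, about SSH connections from one of my other production servers. These are rsync processes transferring files from that remote server to the server in question. This turns out to explain the CPU usage of systemd, NetworkManager and systemd-journal.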
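However, it does not explain the memory leak, which is the biggest problem. Since I originally wrote this post a few days ago, systemd's memory usage has grown from 18.9% to 21.4% of system memory.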
The logs below have been edited to replace the server's actual domain name and IP address.
Feb 14 10:02:13 hostname.domain.com systemd-logind[737]: New session 6467482 of user tropicg9.
Feb 14 10:02:13 hostname.domain.com systemd[1]: Started Session 6467482 of user tropicg9.
Feb 14 10:02:13 hostname.domain.com systemd[1]: Starting Session 6467482 of user tropicg9.
Feb 14 10:02:13 hostname.domain.com sshd[9665]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:13 hostname.domain.com sshd[9667]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:13 hostname.domain.com sshd[9665]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:13 hostname.domain.com systemd-logind[737]: Removed session 6467482.
Feb 14 10:02:14 hostname.domain.com sshd[9728]: Accepted publickey for tropicg9 from 1.2.3.4 port 45289 ssh2: RSA 0b:
Feb 14 10:02:14 hostname.domain.com systemd-logind[737]: New session 6467483 of user tropicg9.
Feb 14 10:02:14 hostname.domain.com systemd[1]: Started Session 6467483 of user tropicg9.
Feb 14 10:02:14 hostname.domain.com systemd[1]: Starting Session 6467483 of user tropicg9.
Feb 14 10:02:14 hostname.domain.com sshd[9728]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:14 hostname.domain.com sshd[9735]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:14 hostname.domain.com sshd[9728]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:14 hostname.domain.com systemd-logind[737]: Removed session 6467483.
Feb 14 10:02:15 hostname.domain.com sshd[9876]: Accepted publickey for tropicg9 from 1.2.3.4 port 45290 ssh2: RSA 0b:
Feb 14 10:02:15 hostname.domain.com systemd-logind[737]: New session 6467484 of user tropicg9.
Feb 14 10:02:15 hostname.domain.com systemd[1]: Started Session 6467484 of user tropicg9.
Feb 14 10:02:15 hostname.domain.com systemd[1]: Starting Session 6467484 of user tropicg9.
Feb 14 10:02:15 hostname.domain.com sshd[9876]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:15 hostname.domain.com sshd[9883]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:15 hostname.domain.com sshd[9876]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:15 hostname.domain.com systemd-logind[737]: Removed session 6467484.
Feb 14 10:02:20 hostname.domain.com sshd[10333]: Accepted publickey for tropicg9 from 1.2.3.4 port 45291 ssh2: RSA 0b
Feb 14 10:02:20 hostname.domain.com systemd-logind[737]: New session 6467485 of user tropicg9.
Feb 14 10:02:20 hostname.domain.com systemd[1]: Started Session 6467485 of user tropicg9.
Feb 14 10:02:20 hostname.domain.com systemd[1]: Starting Session 6467485 of user tropicg9.
Feb 14 10:02:20 hostname.domain.com sshd[10333]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:20 hostname.domain.com sshd[10342]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:20 hostname.domain.com sshd[10333]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:20 hostname.domain.com systemd-logind[737]: Removed session 6467485.
Feb 14 10:02:21 hostname.domain.com sshd[10450]: Accepted publickey for tropicg9 from 1.2.3.4 port 45292 ssh2: RSA 0b
Feb 14 10:02:21 hostname.domain.com systemd-logind[737]: New session 6467486 of user tropicg9.
Feb 14 10:02:21 hostname.domain.com systemd[1]: Started Session 6467486 of user tropicg9.
Feb 14 10:02:21 hostname.domain.com systemd[1]: Starting Session 6467486 of user tropicg9.
Feb 14 10:02:21 hostname.domain.com sshd[10450]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:21 hostname.domain.com sshd[10457]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:21 hostname.domain.com sshd[10450]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:21 hostname.domain.com systemd-logind[737]: Removed session 6467486.
Feb 14 10:02:22 hostname.domain.com sshd[10473]: Accepted publickey for tropicg9 from 1.2.3.4 port 45293 ssh2: RSA 0b
Feb 14 10:02:22 hostname.domain.com systemd-logind[737]: New session 6467487 of user tropicg9.
Feb 14 10:02:22 hostname.domain.com systemd[1]: Started Session 6467487 of user tropicg9.
Feb 14 10:02:22 hostname.domain.com systemd[1]: Starting Session 6467487 of user tropicg9.
Feb 14 10:02:22 hostname.domain.com sshd[10473]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:22 hostname.domain.com sshd[10475]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:22 hostname.domain.com sshd[10473]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:22 hostname.domain.com systemd-logind[737]: Removed session 6467487.
Feb 14 10:02:23 hostname.domain.com sshd[10484]: Accepted publickey for tropicg9 from 1.2.3.4 port 45294 ssh2: RSA 0b
Feb 14 10:02:23 hostname.domain.com systemd-logind[737]: New session 6467488 of user tropicg9.
Feb 14 10:02:23 hostname.domain.com systemd[1]: Started Session 6467488 of user tropicg9.
Feb 14 10:02:23 hostname.domain.com systemd[1]: Starting Session 6467488 of user tropicg9.
Feb 14 10:02:23 hostname.domain.com sshd[10484]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:23 hostname.domain.com sshd[10486]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:23 hostname.domain.com sshd[10484]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:23 hostname.domain.com systemd-logind[737]: Removed session 6467488.
Feb 14 10:02:39 hostname.domain.com sshd[10654]: Accepted publickey for tropicg9 from 1.2.3.4 port 45295 ssh2: RSA 0b
Feb 14 10:02:39 hostname.domain.com systemd[1]: Started Session 6467489 of user tropicg9.
Feb 14 10:02:39 hostname.domain.com systemd-logind[737]: New session 6467489 of user tropicg9.
Feb 14 10:02:39 hostname.domain.com systemd[1]: Starting Session 6467489 of user tropicg9.
Feb 14 10:02:39 hostname.domain.com sshd[10654]: pam_unix(sshd:session): session opened for user tropicg9 by (uid=0)
Feb 14 10:02:39 hostname.domain.com sshd[10656]: Received disconnect from 1.2.3.4: 11: disconnected by user
Feb 14 10:02:39 hostname.domain.com sshd[10654]: pam_unix(sshd:session): session closed for user tropicg9
Feb 14 10:02:39 hostname.domain.com systemd-logind[737]: Removed session 6467489.
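To see how dense these bursts are, the "New session" lines can be bucketed by hour. This is a small bash/awk sketch, assuming journalctl's default short output format shown above (month, day, HH:MM:SS, host, then the identifier); the function name sessions_per_hour is hypothetical.

```shell
# Hypothetical helper: count systemd-logind "New session" entries per hour in
# short-format journal output read from stdin, with lines such as
#   Feb 14 10:02:13 host systemd-logind[737]: New session 6467482 of user tropicg9.
# Prints "<count> <month> <day> <HH>:00", one line per hour, in input order.
sessions_per_hour() {
  # 1. Keep logind "New session" lines and reduce each to "<month> <day> <HH>:00".
  # 2. uniq -c counts runs of equal hours; journal output is chronological,
  #    so entries from the same hour are adjacent.
  # 3. The last awk strips the padding that uniq -c adds.
  awk '/systemd-logind\[[0-9]+\]: New session / {
         split($3, t, ":")
         print $1, $2, t[1] ":00"
       }' | uniq -c | awk '{ print $1, $2, $3, $4 }'
}
```

For example, `sudo journalctl -u systemd-logind --since today | sessions_per_hour`. Hours with hundreds of sessions would be expected to coincide with the rsync transfers.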
Update 2-16-16
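Here is the output of systemd-cgtop, which shows resource usage for the active control groups. All of the significant resource usage appears under the root ("/") path. This does not seem to narrow things down, but perhaps the information will be helpful.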
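There are only 86 scope files and related directories under /run/systemd/system/, the oldest up to 6 days old. There was an issue in which these files were orphaned during SSH connections, producing thousands of entries and high CPU load, but that is not happening here.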
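A quick way to repeat that count, and to see how many of the files are stale, is sketched below in bash. It assumes the transient scope units are plain files named *.scope directly under /run/systemd/system with their drop-in directories alongside; the function name scope_stats and its day threshold are hypothetical.

```shell
# Hypothetical helper: count *.scope unit files in a directory (default
# /run/systemd/system) and how many were last modified more than N days ago
# (default 1, using find's -mtime +N rounding). The matching *.scope.d
# drop-in directories are excluded by -type f.
scope_stats() {
  local dir=${1:-/run/systemd/system} days=${2:-1}
  local total old
  total=$(find "$dir" -maxdepth 1 -type f -name '*.scope' | wc -l)
  old=$(find "$dir" -maxdepth 1 -type f -name '*.scope' -mtime +"$days" | wc -l)
  # $((...)) normalises any padding that some wc implementations add.
  echo "total=$((total)) older_than_${days}d=$((old))"
}
```

Comparing the total with `loginctl list-sessions --no-legend | wc -l` would show whether scope files are accumulating beyond the live sessions, which is the orphaning symptom mentioned above.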
Path                                                 Tasks  %CPU  Memory  Input/s Output/s
/                                                      296  30.5   11.3G   657.8K   893.0K
/system.slice/NetworkManager.service                     1     -       -        -        -
/system.slice/auditd.service                             1     -       -        -        -
/system.slice/crond.service                              1     -       -        -        -
/system.slice/dbus.service                               1     -       -        -        -
/system.slice/irqbalance.service                         1     -       -        -        -
/system.slice/lvm2-lvmetad.service                       1     -       -        -        -
/system.slice/mariadb.service                            2     -       -        -        -
/system.slice/nginx.service                             10     -       -        -        -
/system.slice/php-fpm.service                          101     -       -        -        -
/system.slice/polkit.service                             1     -       -        -        -
/system.slice/postfix.service                            3     -       -        -        -
/system.slice/rsyslog.service                            1     -       -        -        -
/system.slice/smartd.service                             1     -       -        -        -
/system.slice/sshd.service                               2     -       -        -        -
/system.slice/system-getty.slice/getty@tty1.service      1     -       -        -        -
/system.slice/systemd-journald.service                   1     -       -        -        -
/system.slice/systemd-logind.service                     1     -       -        -        -
/system.slice/systemd-udevd.service                      1     -       -        -        -
/system.slice/tuned.service                              1     -       -        -        -
/system.slice/wpa_supplicant.service                     1     -       -        -        -
/user.slice/user-1000.slice/session-7170741.scope        4     -       -        -        -
Temporarily freeing system memory
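It appears that running "systemctl daemon-reexec" frees all of the memory allocated to the PID 1 process. However, the leak continues afterwards. A temporary workaround is to set up a daily cron job that frees the memory this way, but that does not fix the leak. I have filed a bug with Red Hat, since this is the stable version of systemd for CentOS 7.x. Hopefully the leak can be found and plugged.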
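The daily cron workaround could look something like the following /etc/cron.d entry. This is only a sketch: the file name /etc/cron.d/systemd-reexec and the 04:00 schedule are arbitrary, hypothetical choices, and it assumes systemctl lives at /usr/bin/systemctl as on CentOS 7. It merely resets PID 1's memory once a day; it does not fix the leak.

```
# /etc/cron.d/systemd-reexec (hypothetical file name)
# Re-execute systemd (PID 1) once a day at 04:00 to release the leaked memory.
# Fields: minute hour day-of-month month day-of-week user command
0 4 * * * root /usr/bin/systemctl daemon-reexec
```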
Solution
yum install strace
strace -ff -p 1
This is a quick and dirty way to diagnose the memory leak.
The strace of the systemd process should look something like this:
recvmsg(23, {msg_name(0)=NULL, msg_iov(1)=[{"WATCHDOG=1", 4096}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=620, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 10
open("/proc/620/cgroup", O_RDONLY|O_CLOEXEC) = 20
fstat(20, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcfd734e000
read(20, "10:cpuset:/\n9:perf_event:/\n8:hug"..., 1024) = 164
close(20) = 0
munmap(0x7fcfd734e000, 4096) = 0
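It allocates memory, does something, and then frees that memory. Check the system call trace of systemd: you should find the place where a call fails to complete and the allocated memory is never freed. I suspect a problem with a pseudo-filesystem mount or an incorrect SELinux setup, which prevents systemd from completing the call.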
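Doing that by eye is tedious in a long trace, so one rough way to narrow it down is to pair each successful mmap with a later munmap of the same address and print whatever is left over. This is a minimal bash/awk sketch, assuming one trace file per process (strace -ff -o) so calls are not split into unfinished/resumed lines; the function name unmatched_mmaps is hypothetical. It only sees mmap-level allocations: memory that malloc takes from the brk heap, and partially unmapped or remapped regions, are not tracked, so an empty result does not rule out a leak.

```shell
# Hypothetical helper: list mmap() calls from strace output whose returned
# address is never passed to munmap() later in the same trace.
# Reads the files given as arguments (or stdin).
unmatched_mmaps() {
  awk '
    # Successful mmap: the line ends with "= 0x<address>".
    /mmap\(/ && / = 0x[0-9a-f]+$/ {
      live[$NF] = $0
      next
    }
    # munmap(0x<address>, <length>): forget that address.
    /munmap\(0x[0-9a-f]+/ {
      match($0, /munmap\(0x[0-9a-f]+/)
      delete live[substr($0, RSTART + 7, RLENGTH - 7)]
    }
    # Whatever is still recorded was mapped but never unmapped.
    END { for (addr in live) print live[addr] }
  ' "$@"
}
```

On the affected server this might be run as `timeout 60 strace -ff -tt -e trace=mmap,munmap -o /tmp/pid1 -p 1` followed by `unmatched_mmaps /tmp/pid1.1`; with -ff and -o, strace writes one file per traced process named <prefix>.<pid>, so PID 1's trace lands in /tmp/pid1.1.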