一次 macOS 频繁死机重启问题排查
死机问题
最近一周几乎每天早上起来打开电脑看到的都是系统死机重启之后的报告,查看详细日志长下面这样。这个日志里面显示的原因是内存泄露导致 panic
重启,里面关键信息是 pid 34100: bash
,也就是说死机和 ID 为 34100 的进程有关。可惜是一个 bash
进程,给排查增加了很大的难度。因为很多程序会调用 bash
执行命令。
panic(cpu 3 caller 0xfffffe002f6cb630): zalloc[3]: zone map exhausted while allocating from zone [data.kalloc.1024], likely due to memory leak in zone [data.kalloc.1024] (16G, 17743216 elements allocated) @zalloc.c:4491
Debugger message: panic
Memory ID: 0x6
OS release type: User
OS version: 23B81
Kernel version: Darwin Kernel Version 23.1.0: Mon Oct 9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
Fileset Kernelcache UUID: 9B88AE46569577FF0B82A48EAE38CDC6
Kernel UUID: 88BB97DD-3D4F-3EB9-8DDE-FFC364B55886
<...省略...>
Compressor Info: 9% of compressed pages limit (OK) and 22% of segments limit (OK) with 0 swapfiles and OK swap space
Panicked task 0xfffffe2cbe024b38: 0 pages, 1 threads: pid 34100: bash
Panicked thread: 0xfffffe27fb245828, backtrace: 0xfffffe10107befb0, tid: 491454956
lr: 0xfffffe002f6555e0 fp: 0xfffffe10107bf040
lr: 0xfffffe002f7a036c fp: 0xfffffe10107bf0b0
lr: 0xfffffe002f79e7e8 fp: 0xfffffe10107bf1a0
lr: 0xfffffe002f603aa4 fp: 0xfffffe10107bf1b0
lr: 0xfffffe002f654ebc fp: 0xfffffe10107bf560
lr: 0xfffffe002fe67260 fp: 0xfffffe10107bf580
lr: 0xfffffe002f6cb630 fp: 0xfffffe10107bf740
lr: 0xfffffe002f6cebec fp: 0xfffffe10107bf7c0
lr: 0xfffffe002f667a70 fp: 0xfffffe10107bf7f0
lr: 0xfffffe002fd002f8 fp: 0xfffffe10107bf810
lr: 0xfffffe0030b6c484 fp: 0xfffffe10107bf830
lr: 0xfffffe0030b6c594 fp: 0xfffffe10107bf860
lr: 0xfffffe0030b78ba8 fp: 0xfffffe10107bfb30
lr: 0xfffffe002fe41ad8 fp: 0xfffffe10107bfc00
lr: 0xfffffe002fb35b0c fp: 0xfffffe10107bfcb0
lr: 0xfffffe002fb36600 fp: 0xfffffe10107bfd60
lr: 0xfffffe002fb36270 fp: 0xfffffe10107bfd90
lr: 0xfffffe002fc7b8c8 fp: 0xfffffe10107bfe20
lr: 0xfffffe002f79e928 fp: 0xfffffe10107bff10
lr: 0xfffffe002f603aa4 fp: 0xfffffe10107bff20
Kernel Extensions in backtrace:
com.apple.driver.AppleMobileFileIntegrity(1.0.5)[B562E2F2-059F-3F75-9A50-57B08E283880]@0xfffffe0030b6a820->0xfffffe0030b9a8eb
dependency: com.apple.iokit.CoreAnalyticsFamily(1)[0F366549-E9FC-3B68-9765-ED75460DC0B0]@0xfffffe00312ac1a0->0xfffffe00312b3f8b
dependency: com.apple.kec.corecrypto(14.0)[4EAE32F2-4DA7-3078-AE3D-6E4CBCE4739B]@0xfffffe0032770f00->0xfffffe00327c124b
dependency: com.apple.kext.CoreTrust(1)[1984A0D2-7977-3FF6-9231-06D409A3EB3F]@0xfffffe0031375950->0xfffffe003137d873
dependency: com.apple.security.AppleImage4(5.0.0)[DB08D649-F4D7-3D40-BCC9-BD9008ECAB95]@0xfffffe00309c34a0->0xfffffe00309dc147
<...省略...>
分析问题
死机启动之后会在 system.log
留下一条 bootlog 的打印,从下面的打印可以看出死机大部分发生在凌晨 3-7 点,之后 8-9 点是我发现死机之后主动重启机器产生的日志。这也意味着很难在我使用期间定位问题。
➜ ~ gzcat /var/log/system.log.* | grep bootlog | sort
Nov 23 04:49:50 phyng-m1 bootlog[0]: BOOT_TIME 1700686190 609710
Nov 23 14:47:38 phyng-m1 bootlog[0]: BOOT_TIME 1700722058 676269
Nov 24 11:00:01 phyng-m1 bootlog[0]: BOOT_TIME 1700794801 268615
Nov 24 11:13:26 phyng-m1 bootlog[0]: BOOT_TIME 1700795606 374267
Nov 25 07:03:20 phyng-m1 bootlog[0]: BOOT_TIME 1700867000 25451
Nov 25 08:44:54 phyng-m1 bootlog[0]: BOOT_TIME 1700873094 45593
Nov 26 09:06:12 phyng-m1 bootlog[0]: BOOT_TIME 1700960772 438934
Nov 26 15:37:09 phyng-m1 bootlog[0]: BOOT_TIME 1700984229 825958
Nov 28 03:59:16 phyng-m1 bootlog[0]: BOOT_TIME 1701115156 973815
Nov 28 08:40:42 phyng-m1 bootlog[0]: BOOT_TIME 1701132042 35736
Nov 29 03:54:00 phyng-m1 bootlog[0]: BOOT_TIME 1701201240 539764
Nov 29 09:46:38 phyng-m1 bootlog[0]: BOOT_TIME 1701222398 722656
增加进程监控日志
死机日志中会带上进程 ID 和名称,如果每隔一段时间记录一下当前系统全部进程,下次死机之后就能根据进程日志分析了。
下面是 ~/dev/servers/phyng-m1/scripts/watch_ps.sh
的监控代码,每隔 3 秒保存一遍系统日志,每 100MB 压缩为一个文件。
#!/usr/bin/env bash
LOG_FILE="/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log"
MAX_SIZE=104857600 # 设定日志文件的最大大小(例如:100MB)
function watch_ps() {
LOG="watch_ps() called at $(date)"
echo "$LOG"
echo "$LOG" >> $LOG_FILE
ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args >> $LOG_FILE
}
function check_compress_log() {
# 如果日志文件存在且大小超过了设定值,则压缩它
if [ -f "$LOG_FILE" ] && [ "$(stat -f%z "$LOG_FILE")" -ge $MAX_SIZE ]; then
gzip "$LOG_FILE"
mv "$LOG_FILE.gz" "$LOG_FILE-$(date +%Y%m%d-%H%M%S).gz"
fi
}
while true; do
watch_ps
check_compress_log
sleep 3
done
运行监控程序
repeat 1000 sudo ./phyng-m1/scripts/watch_ps.sh
根据日志定位问题
终于在第二天死机之后看到了日志,看到日志之后就需要对死机的进程 PID 往上追踪,因为 bash 进程很可能有无数的父进程,在我的日志格式 ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args
中第一列为进程 PID 第二列父进程 PPID。
下面是死机之后的进程链 34027 -&gt; 34023 -&gt; 34020
➜ ~ zgrep 34027 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34027 34023 phyng 0.1 0.0 408639712 1552 S Thu Nov 30 01:44:25 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34094 34027 phyng 0.0 0.0 408639712 1536 S Thu Nov 30 01:44:25 2023 ?? (bash)
➜ ~ zgrep 34023 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34023 34020 phyng 0.1 0.0 408636640 1520 S Thu Nov 30 01:44:25 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34027 34023 phyng 0.1 0.0 408639712 1552 S Thu Nov 30 01:44:25 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34023 34019 phyng 0.1 0.0 408636640 1520 S Thu Nov 30 06:58:58 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34025 34023 phyng 0.0 0.0 408637664 1488 S Thu Nov 30 06:58:58 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
➜ ~ zgrep 34020 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-011352.gz:34020 26346 phyng 1.9 0.0 408636640 2176 S Thu Nov 30 01:08:49 2023 ?? /bin/sh /opt/homebrew/bin/byobu-status tmux_right
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-011352.gz:34118 34020 phyng 2.4 0.0 408636640 2048 S Thu Nov 30 01:08:49 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34020 34019 phyng 0.7 0.0 408627424 1888 S Thu Nov 30 01:44:25 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34023 34020 phyng 0.1 0.0 408636640 1520 S Thu Nov 30 01:44:25 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34020 33961 phyng 0.1 0.0 408636640 2064 S Thu Nov 30 06:53:18 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34042 34020 phyng 0.1 0.0 408645856 2368 S Thu Nov 30 06:53:18 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
根据日志分析关键问题出在 /opt/homebrew/bin/byobu-status tmux_right
,这个程序是用来给 byobu 刷新右下角状态的,通过下面的命令过滤发现 byobu-status
依然在执行。
➜ ~ ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args | egrep '(bash|status)'
462 1 root 0.0 0.0 408300528 11456 Ss Thu Nov 30 11:18:19 2023 ?? /System/Library/PrivateFrameworks/SystemStatusServer.framework/Support/systemstatusd
757 1 phyng 0.0 0.0 408506592 2336 S Thu Nov 30 11:18:24 2023 ?? /bin/bash /usr/local/MacGPG2/libexec/shutdown-gpg-agent
40652 33906 phyng 0.2 0.0 408636640 2224 S Thu Nov 30 11:44:27 2023 ?? /bin/sh /opt/homebrew/bin/byobu-status tmux_left
40655 33906 phyng 0.2 0.0 408636640 2160 S Thu Nov 30 11:44:27 2023 ?? /bin/sh /opt/homebrew/bin/byobu-status tmux_right
40709 40652 phyng 0.6 0.0 408636640 2032 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
40711 40655 phyng 0.6 0.0 408636640 2032 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
40728 40709 phyng 0.3 0.0 408637664 2304 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40732 40711 phyng 0.3 0.0 408645856 2336 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40768 40728 phyng 0.0 0.0 408646880 1664 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40769 40768 phyng 0.1 0.0 408627424 1856 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40771 40769 phyng 0.0 0.0 408636640 1536 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40773 40771 phyng 0.6 0.0 408640736 1584 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40774 40732 phyng 0.0 0.0 408645856 1728 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40776 40774 phyng 0.1 0.0 408627424 1856 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40781 40776 phyng 0.0 0.0 408636640 1536 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40783 40781 phyng 0.6 0.0 408639712 1552 S Thu Nov 30 11:44:27 2023 ?? bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
41693 40773 phyng 0.0 0.0 408640736 1536 S Thu Nov 30 11:44:27 2023 ?? (bash)
41694 41693 phyng 0.0 0.0 408627424 1856 S Thu Nov 30 11:44:27 2023 ?? (bash)
41699 41694 phyng 0.0 0.0 408636640 1568 S Thu Nov 30 11:44:27 2023 ?? (bash)
41700 41699 phyng 0.0 0.0 408627424 1904 S Thu Nov 30 11:44:27 2023 ?? (bash)
41701 40783 phyng 0.0 0.0 408639712 1504 S Thu Nov 30 11:44:27 2023 ?? (bash)
41702 41701 phyng 0.0 0.0 408627424 1856 S Thu Nov 30 11:44:27 2023 ?? (bash)
41703 41700 phyng 0.0 0.0 0 0 ? Thu Nov 30 11:44:27 2023 ?? (bash)
41704 41702 phyng 0.0 0.0 408636640 1568 S Thu Nov 30 11:44:27 2023 ?? (bash)
41705 41704 phyng 0.0 0.0 408627424 1840 S Thu Nov 30 11:44:27 2023 ?? (bash)
41709 41705 phyng 0.0 0.0 408636720 2304 S Thu Nov 30 11:44:27 2023 ?? (bash)
41713 15741 phyng 0.0 0.0 408495824 1168 R+ Thu Nov 30 11:44:27 2023 ttys001 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox -E --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox (bash|status)
14802 9266 phyng 0.0 0.0 408664560 3168 S+ Thu Nov 30 11:40:37 2023 ttys013 bash
进一步发现 /opt/homebrew/bin/byobu-status
会拉起 bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
不过会报错,应该是 pyenv 脚本的问题。
➜ ~ bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence: line 23: pyenv-versions: command not found
解决问题
快速处理方法:修改 /opt/homebrew/bin/byobu-status
顶部,强制指定一个可用的 Python 解释器路径,这样就避免了调用 pyenv 脚本查找 Python 解释器的问题。
# @phyng patch start
export BYOBU_PYTHON=/Users/phyng/.pyenv/versions/3.11.6/bin/python3.11
# @phyng patch end
稳定处理方法:/opt/homebrew/bin/byobu-status
在启动时会加载 ~/.byoburc
,所以编辑 vim ~/.byoburc
加入 BYOBU_PYTHON
环境变量,这样就不需要修改 byobu-status 脚本了。
可以在 byobu 启动之后运行 echo $BYOBU_PYTHON
确认是否生效。
后续
修改之后已经稳定运行一周了。这个问题后续在运行 macOS 的测试机器上也遇到了。