一次 macOS 频繁死机重启问题排查

死机问题

最近一周几乎每天早上起来打开电脑看到的都是系统死机重启之后的报告,查看详细日志长下面这样。这个日志里面显示的原因是内存泄露导致 panic 重启,里面关键信息是 pid 34100: bash ,也就是说死机和 ID 为 34100 的进程有关。可惜是一个 bash 进程,给排查增加了很大的难度。因为很多程序会调用 bash 执行命令。

panic(cpu 3 caller 0xfffffe002f6cb630): zalloc[3]: zone map exhausted while allocating from zone [data.kalloc.1024], likely due to memory leak in zone [data.kalloc.1024] (16G, 17743216 elements allocated) @zalloc.c:4491
Debugger message: panic
Memory ID: 0x6
OS release type: User
OS version: 23B81
Kernel version: Darwin Kernel Version 23.1.0: Mon Oct  9 21:27:24 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T6000
Fileset Kernelcache UUID: 9B88AE46569577FF0B82A48EAE38CDC6
Kernel UUID: 88BB97DD-3D4F-3EB9-8DDE-FFC364B55886

<...省略...>

Compressor Info: 9% of compressed pages limit (OK) and 22% of segments limit (OK) with 0 swapfiles and OK swap space
Panicked task 0xfffffe2cbe024b38: 0 pages, 1 threads: pid 34100: bash
Panicked thread: 0xfffffe27fb245828, backtrace: 0xfffffe10107befb0, tid: 491454956
      lr: 0xfffffe002f6555e0  fp: 0xfffffe10107bf040
      lr: 0xfffffe002f7a036c  fp: 0xfffffe10107bf0b0
      lr: 0xfffffe002f79e7e8  fp: 0xfffffe10107bf1a0
      lr: 0xfffffe002f603aa4  fp: 0xfffffe10107bf1b0
      lr: 0xfffffe002f654ebc  fp: 0xfffffe10107bf560
      lr: 0xfffffe002fe67260  fp: 0xfffffe10107bf580
      lr: 0xfffffe002f6cb630  fp: 0xfffffe10107bf740
      lr: 0xfffffe002f6cebec  fp: 0xfffffe10107bf7c0
      lr: 0xfffffe002f667a70  fp: 0xfffffe10107bf7f0
      lr: 0xfffffe002fd002f8  fp: 0xfffffe10107bf810
      lr: 0xfffffe0030b6c484  fp: 0xfffffe10107bf830
      lr: 0xfffffe0030b6c594  fp: 0xfffffe10107bf860
      lr: 0xfffffe0030b78ba8  fp: 0xfffffe10107bfb30
      lr: 0xfffffe002fe41ad8  fp: 0xfffffe10107bfc00
      lr: 0xfffffe002fb35b0c  fp: 0xfffffe10107bfcb0
      lr: 0xfffffe002fb36600  fp: 0xfffffe10107bfd60
      lr: 0xfffffe002fb36270  fp: 0xfffffe10107bfd90
      lr: 0xfffffe002fc7b8c8  fp: 0xfffffe10107bfe20
      lr: 0xfffffe002f79e928  fp: 0xfffffe10107bff10
      lr: 0xfffffe002f603aa4  fp: 0xfffffe10107bff20
      Kernel Extensions in backtrace:
         com.apple.driver.AppleMobileFileIntegrity(1.0.5)[B562E2F2-059F-3F75-9A50-57B08E283880]@0xfffffe0030b6a820->0xfffffe0030b9a8eb
            dependency: com.apple.iokit.CoreAnalyticsFamily(1)[0F366549-E9FC-3B68-9765-ED75460DC0B0]@0xfffffe00312ac1a0->0xfffffe00312b3f8b
            dependency: com.apple.kec.corecrypto(14.0)[4EAE32F2-4DA7-3078-AE3D-6E4CBCE4739B]@0xfffffe0032770f00->0xfffffe00327c124b
            dependency: com.apple.kext.CoreTrust(1)[1984A0D2-7977-3FF6-9231-06D409A3EB3F]@0xfffffe0031375950->0xfffffe003137d873
            dependency: com.apple.security.AppleImage4(5.0.0)[DB08D649-F4D7-3D40-BCC9-BD9008ECAB95]@0xfffffe00309c34a0->0xfffffe00309dc147

<...省略...>

分析问题

死机启动之后会在 system.log 留下一条 bootlog 的打印,从下面的打印可以看出死机大部分发生在凌晨 3-7 点,之后 8-9 点是我发现死机之后主动重启机器产生的日志。这也意味着很难在我使用期间定位问题。

  ~ gzcat /var/log/system.log.* | grep bootlog | sort
Nov 23 04:49:50 phyng-m1 bootlog[0]: BOOT_TIME 1700686190 609710
Nov 23 14:47:38 phyng-m1 bootlog[0]: BOOT_TIME 1700722058 676269
Nov 24 11:00:01 phyng-m1 bootlog[0]: BOOT_TIME 1700794801 268615
Nov 24 11:13:26 phyng-m1 bootlog[0]: BOOT_TIME 1700795606 374267
Nov 25 07:03:20 phyng-m1 bootlog[0]: BOOT_TIME 1700867000 25451
Nov 25 08:44:54 phyng-m1 bootlog[0]: BOOT_TIME 1700873094 45593
Nov 26 09:06:12 phyng-m1 bootlog[0]: BOOT_TIME 1700960772 438934
Nov 26 15:37:09 phyng-m1 bootlog[0]: BOOT_TIME 1700984229 825958
Nov 28 03:59:16 phyng-m1 bootlog[0]: BOOT_TIME 1701115156 973815
Nov 28 08:40:42 phyng-m1 bootlog[0]: BOOT_TIME 1701132042 35736
Nov 29 03:54:00 phyng-m1 bootlog[0]: BOOT_TIME 1701201240 539764
Nov 29 09:46:38 phyng-m1 bootlog[0]: BOOT_TIME 1701222398 722656

增加进程监控日志

死机日志中会带上进程 ID 和名称,如果每隔一段时间记录一下当前系统全部进程,下次死机之后就能根据进程日志分析了。

下面是 ~/dev/servers/phyng-m1/scripts/watch_ps.sh 的监控代码,每隔 3 秒保存一遍系统日志,每 100MB 压缩为一个文件。

#!/usr/bin/env bash

LOG_FILE="/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log"
MAX_SIZE=104857600  # 设定日志文件的最大大小(例如:100MB)

function watch_ps() {
    LOG="watch_ps() called at $(date)"
    echo "$LOG"
    echo "$LOG" >> $LOG_FILE
    ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args >> $LOG_FILE
}

function check_compress_log() {
    # 如果日志文件存在且大小超过了设定值,则压缩它
    if [ -f "$LOG_FILE" ] && [ "$(stat -f%z "$LOG_FILE")" -ge $MAX_SIZE ]; then
        gzip "$LOG_FILE"
        mv "$LOG_FILE.gz" "$LOG_FILE-$(date +%Y%m%d-%H%M%S).gz"
    fi
}

while true; do
    watch_ps
    check_compress_log
    sleep 3
done

运行监控程序

repeat 1000 sudo ./phyng-m1/scripts/watch_ps.sh

根据日志定位问题

终于在第二天死机之后看到了日志,看到日志之后就需要对死机的进程 PID 往上追踪,因为 bash 进程很可能有无数的父进程,在我的日志格式 ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args 中第一列为进程 PID 第二列父进程 PPID。

下面是死机之后的进程链 34027 -&gt; 34023 -&gt; 34020

  ~ zgrep 34027 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34027 34023 phyng              0.1  0.0 408639712   1552 S    Thu Nov 30 01:44:25 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34094 34027 phyng              0.0  0.0 408639712   1536 S    Thu Nov 30 01:44:25 2023     ??       (bash)  ~ zgrep 34023 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34023 34020 phyng              0.1  0.0 408636640   1520 S    Thu Nov 30 01:44:25 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34027 34023 phyng              0.1  0.0 408639712   1552 S    Thu Nov 30 01:44:25 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34023 34019 phyng              0.1  0.0 408636640   1520 S    Thu Nov 30 06:58:58 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34025 34023 phyng              0.0  0.0 408637664   1488 S    Thu Nov 30 06:58:58 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2

➜  ~ zgrep 34020 ~/dev/servers/phyng-m1/scripts/watch_ps.log-20231130*
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-011352.gz:34020 26346 phyng              1.9  0.0 408636640   2176 S    Thu Nov 30 01:08:49 2023     ??       /bin/sh /opt/homebrew/bin/byobu-status tmux_right
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-011352.gz:34118 34020 phyng              2.4  0.0 408636640   2048 S    Thu Nov 30 01:08:49 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34020 34019 phyng              0.7  0.0 408627424   1888 S    Thu Nov 30 01:44:25 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-015928.gz:34023 34020 phyng              0.1  0.0 408636640   1520 S    Thu Nov 30 01:44:25 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34020 33961 phyng              0.1  0.0 408636640   2064 S    Thu Nov 30 06:53:18 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
/Users/phyng/dev/servers/phyng-m1/scripts/watch_ps.log-20231130-070317.gz:34042 34020 phyng              0.1  0.0 408645856   2368 S    Thu Nov 30 06:53:18 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2

根据日志分析关键问题出在 /opt/homebrew/bin/byobu-status tmux_right,这个程序是用来给 byobu 刷新右下角状态的,通过下面的命令过滤发现 byobu-status 依然在执行。

  ~ ps -ewww -o pid,ppid,user,%cpu,%mem,vsz,rss,stat,lstart,tty,args | egrep '(bash|status)'
  462     1 root               0.0  0.0 408300528  11456 Ss   Thu Nov 30 11:18:19 2023     ??       /System/Library/PrivateFrameworks/SystemStatusServer.framework/Support/systemstatusd
  757     1 phyng              0.0  0.0 408506592   2336 S    Thu Nov 30 11:18:24 2023     ??       /bin/bash /usr/local/MacGPG2/libexec/shutdown-gpg-agent
40652 33906 phyng              0.2  0.0 408636640   2224 S    Thu Nov 30 11:44:27 2023     ??       /bin/sh /opt/homebrew/bin/byobu-status tmux_left
40655 33906 phyng              0.2  0.0 408636640   2160 S    Thu Nov 30 11:44:27 2023     ??       /bin/sh /opt/homebrew/bin/byobu-status tmux_right
40709 40652 phyng              0.6  0.0 408636640   2032 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
40711 40655 phyng              0.6  0.0 408636640   2032 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-exec python2 -c import snack
40728 40709 phyng              0.3  0.0 408637664   2304 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40732 40711 phyng              0.3  0.0 408645856   2336 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40768 40728 phyng              0.0  0.0 408646880   1664 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40769 40768 phyng              0.1  0.0 408627424   1856 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40771 40769 phyng              0.0  0.0 408636640   1536 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40773 40771 phyng              0.6  0.0 408640736   1584 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40774 40732 phyng              0.0  0.0 408645856   1728 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-which python2
40776 40774 phyng              0.1  0.0 408627424   1856 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40781 40776 phyng              0.0  0.0 408636640   1536 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
40783 40781 phyng              0.6  0.0 408639712   1552 S    Thu Nov 30 11:44:27 2023     ??       bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
41693 40773 phyng              0.0  0.0 408640736   1536 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41694 41693 phyng              0.0  0.0 408627424   1856 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41699 41694 phyng              0.0  0.0 408636640   1568 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41700 41699 phyng              0.0  0.0 408627424   1904 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41701 40783 phyng              0.0  0.0 408639712   1504 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41702 41701 phyng              0.0  0.0 408627424   1856 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41703 41700 phyng              0.0  0.0        0      0 ?    Thu Nov 30 11:44:27 2023     ??       (bash)
41704 41702 phyng              0.0  0.0 408636640   1568 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41705 41704 phyng              0.0  0.0 408627424   1840 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41709 41705 phyng              0.0  0.0 408636720   2304 S    Thu Nov 30 11:44:27 2023     ??       (bash)
41713 15741 phyng              0.0  0.0 408495824   1168 R+   Thu Nov 30 11:44:27 2023     ttys001  grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox -E --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox (bash|status)
14802  9266 phyng              0.0  0.0 408664560   3168 S+   Thu Nov 30 11:40:37 2023     ttys013  bash

进一步发现 /opt/homebrew/bin/byobu-status 会拉起 bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2 不过会报错,应该是 pyenv 脚本的问题。

  ~ bash /opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence python2
/opt/homebrew/Cellar/pyenv/2.3.33/libexec/pyenv-whence: line 23: pyenv-versions: command not found

解决问题

快速处理方法:修改 /opt/homebrew/bin/byobu-status 顶部,强制指定一个可用的 Python 解释器路径,这样就避免了调用 pyenv 脚本查找 Python 解释器的问题。

# @phyng patch start
export BYOBU_PYTHON=/Users/phyng/.pyenv/versions/3.11.6/bin/python3.11
# @phyng patch end

稳定处理方法:/opt/homebrew/bin/byobu-status 在启动时会加载 ~/.byoburc,所以编辑 vim ~/.byoburc 加入 BYOBU_PYTHON 环境变量,这样就不需要修改 byobu-status 脚本了。

可以在 byobu 启动之后运行 echo $BYOBU_PYTHON 确认是否生效。

后续

修改之后已经稳定运行一周了。这个问题后续在运行 macOS 的测试机器上也遇到了。