
Docker containers cannot use the 40HX NVIDIA GPU


2026-3-15 22:00:04

Host: a Dell Wyse 5070 with a 40HX GPU attached externally through a PEX8747 switch. System version 1.1.23.
Bug symptoms: with the 560 driver and nvidia-container-toolkit installed from the app store, 飞牛影视 and 飞牛相册 can use the 40HX normally, but Docker containers cannot. Immich hangs on the setup page and reports that no GPU was found; Jellyfin, with NVIDIA encode/decode configured correctly, reports a hardware error. The key error is ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1': File exists. Strangely, the official CUDA container runs fine.
Frequency: always reproducible
Contact: 飞牛 group 332 --- 谨

root@x86fn:~# nvidia-smi
Sun Mar 15 06:44:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA CMP 40HX Off | 00000000:03:00.0 Off | N/A |
| 36% 30C P8 15W / 184W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@x86fn:~# nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.18.2
commit: 9e88ed39710fd94c7e49fbb26d96492c45e574fb
root@x86fn:~# sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.
root@x86fn:~# sudo systemctl daemon-reload
root@x86fn:~# sudo systemctl restart docker
root@x86fn:~# cd /vol2/1000/docker1/单独/immich
root@x86fn:/vol2/1000/docker1/单独/immich# docker compose up -d
[+] Running 2/4
✔ Container immich_postgres Running 0.0s
✔ Container immich_redis Running 0.0s
⠸ Container immich_machine_learning Starting 0.4s
⠸ Container immich_server Starting 0.4s
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
root@x86fn:/vol2/1000/docker1/单独/immich# sudo ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.560.28.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1': File exists
root@x86fn:/vol2/1000/docker1/单独/immich# cd /
root@x86fn:/# sudo ln -s /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.560.28.03 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1': File exists
root@x86fn:/# docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.2.0-base-ubuntu22.04' locally
12.2.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Already exists
9fe5ccccae45: Pull complete
8054e9d6e8d6: Pull complete
bdddd5cb92f6: Pull complete
5324914b4472: Pull complete
Digest: sha256:ecdf8549dd5f12609e365217a64dedde26ecda26da8f3ff3f82def6749f53051
Status: Downloaded newer image for nvidia/cuda:12.2.0-base-ubuntu22.04
Sat Mar 14 23:15:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03 Driver Version: 560.28.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA CMP 40HX Off | 00000000:03:00.0 Off | N/A |
| 36% 26C P8 14W / 184W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Possible causes suggested by an AI reply
It looks like you have already done quite a bit of troubleshooting. The strangest part is that a manual docker run --gpus all succeeds, while starting Immich through Docker Compose fails with libnvidia-ml.so.1 not found.

This usually means that the GPU settings in docker-compose.yml trigger a different mount mechanism than the direct command does.

Root-cause analysis

The error nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1 usually means that the NVIDIA container runtime (nvidia-container-runtime) cannot locate the driver libraries in the host's standard paths, or that the Docker Compose GPU syntax is outdated.

Solutions

1. Refresh the host's dynamic linker cache

Even though you created the symlink by hand, the system's linker cache may not have picked it up yet. Run:

sudo ldconfig
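The "File exists" messages in the log above suggest the link named libnvidia-ml.so.1 was already present, possibly dangling (its target missing), which would also explain why loading it fails with "cannot open shared object file". A minimal sketch in a scratch directory; all file names here are illustrative, not the real driver files:

```shell
# Create a scratch dir and a dangling symlink (its target does not exist)
tmp=$(mktemp -d)
ln -s "$tmp/gone.so" "$tmp/libfoo.so.1"

# The link exists as a link, but resolving it fails
[ -L "$tmp/libfoo.so.1" ] && echo "link exists"
[ -e "$tmp/libfoo.so.1" ] || echo "target missing"

# A plain ln -s refuses to overwrite it, just like in the log above
touch "$tmp/libfoo.so.560"
ln -s "$tmp/libfoo.so.560" "$tmp/libfoo.so.1" 2>/dev/null || echo "File exists"

# ln -sfn replaces the existing link in place
ln -sfn "$tmp/libfoo.so.560" "$tmp/libfoo.so.1"
readlink "$tmp/libfoo.so.1"
rm -rf "$tmp"
```

After repairing the real link on the host, sudo ldconfig refreshes the cache so loaders can find it.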

2. Check the GPU settings in Immich's docker-compose.yml

Immich's machine-learning service (immich-machine-learning) needs a proper GPU reservation. Make sure the deploy section of that service looks like this:

services:
  immich-machine-learning:
    # ... other settings ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Note: if you declared runtime: nvidia at the service level, try removing it and use the deploy syntax above instead. Modern Docker prefers device reservations for accessing the NVIDIA driver.

3. Check the NVIDIA Container Toolkit mode

The log line Auto-detected mode as 'legacy' shows that the toolkit is falling back to its legacy mode. You can regenerate the runtime configuration and restart Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

4. Clean up completely and restart

A failed container task (shim task) can sometimes linger and make every later start fail. Run:

# stop and remove the containers
docker compose down
# start again and follow the live logs
docker compose up -d
docker compose logs -f immich_machine_learning

Further debugging (if it still fails)

If docker run succeeds but Compose still fails, confirm where your libnvidia-ml.so.1 actually lives:

find /usr/lib -name "libnvidia-ml.so.1"

If it is not under /usr/lib/x86_64-linux-gnu/ but somewhere else, you may need to edit /etc/nvidia-container-runtime/config.toml so that the ldconfig path covers that directory.
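For reference, the relevant key sits in the [nvidia-container-cli] section of that file, and the leading @ tells the toolkit to run the host's ldconfig binary; the exact value varies by distro (Ubuntu installs often point at /sbin/ldconfig.real). A sketch that writes a sample fragment to a temp file so the grep is copy-adaptable; on a real host you would grep /etc/nvidia-container-runtime/config.toml itself:

```shell
# Sample fragment mimicking the toolkit config (illustrative values)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[nvidia-container-cli]
#root = "/run/nvidia/driver"
ldconfig = "@/sbin/ldconfig"
EOF

# Show the ldconfig setting with its line number
grep -n '^ldconfig' "$cfg"
rm -f "$cfg"
```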

Would you like me to look over the GPU section of your docker-compose.yml?


2026-3-16 21:45:37 (original poster)
111111111111111


2026-3-22 16:49:33 (original poster)

Bare-metal host with a 40HX: store driver installed, Docker GPU access

System version: X86 / 1.1.23

Hardware: bare-metal B365 + G5420 + 40HX

Bug symptoms: same as in the first post (the store's 560 driver and nvidia-container-toolkit are installed; 飞牛影视 and 飞牛相册 use the 40HX normally, but Docker containers cannot. Immich hangs on the setup page reporting no GPU found; Jellyfin, with NVIDIA encode/decode configured correctly, reports a hardware error with ln: failed to create symbolic link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1': File exists; the official CUDA container works fine.)

Frequency: (intermittent / always reproducible)

Contact: 飞牛 group 332 --- 谨

Log files: identical to the logs in the first post above.

2026-3-24 15:41:42 (Administrator)

Feedback received. I will pass it along to the colleague responsible.



2026-3-25 14:08:36 (original poster)
飞牛技术同学 posted on 2026-3-24 15:41:
Feedback received. I will pass it along to the colleague responsible.

I also tried a different machine: same GPU, different configuration, and the problem persisted; after swapping in a different GPU it went away. It seems to be tied to the 40HX itself. I don't know how you made the 560 driver support the 40HX, but something at the driver level may not have been handled properly.