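Prerequisites
- This procedure may not suit every 780M iGPU user. For example, my GPU panel still does not show the correct iGPU model. That is for the fnOS (飞牛) team to fix; for the sake of stability I am not going to hack around it myself.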
- The goal is to fix the error
[ 1080.711898] amdgpu 0000:3f:00.0: [drm] ERR* NO EDID read
and to get the 780M decoding video and running small AI models properly.
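- The fnOS kernel is at least 6.1.0 and the base system is still Debian 12, so this is safe to use.
- fnOS version 27 already points its apt sources at the USTC (中科大) mirror. Do not add extra sources just for this installation; doing so causes upgrade bugs.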
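- The official firmware-amd-graphics package is at version 20250808-1, while the mirror fnOS pulls from provides 20230210-5, so /lib/firmware/amdgpu would have to be filled in by hand. In theory the open-source driver conflicts with the closed-source driver from AMD's own repository. I have not filled in the firmware yet: leaving it incomplete does not affect the system, and completing it most likely does not either. Still, be careful and back up your fnOS configuration so that you can reinstall after a crash.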
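- So far, installing this way does not interfere with fnOS updates (the driver lives in a different path).
- ROCm is huge, about 24 GB. Allocate plenty of system space in advance, at least 64 GB (or do as I did and give it the whole disk, to extend disk life and reduce errors).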
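Since the ROCm install is around 24 GB, it can help to check free space on the system partition before starting. The following is a minimal sketch; the 30 GiB threshold and the function name free_gb are illustrative assumptions, not values from fnOS or AMD.

```shell
#!/bin/sh
# Sketch: warn if the root filesystem has less free space than a threshold.
# REQUIRED_GB is an assumed margin above the ~24 GB ROCm footprint.
REQUIRED_GB=30

# Print the available space on the filesystem containing $1, in whole GiB.
free_gb() {
    df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

avail=$(free_gb /)
avail=${avail:-0}
if [ "$avail" -lt "$REQUIRED_GB" ]; then
    echo "Only ${avail} GiB free on /; extend the system partition before installing ROCm."
else
    echo "${avail} GiB free on /; enough room for ROCm."
fi
```

The check only looks at /; if ROCm ends up on a separate volume, pass that mount point to free_gb instead.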
Preparation
- Enable SSH, turn off all network proxies, and be prepared to reboot several times (about three).
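- Run
lspci -nn | grep -i radeon
and
lsmod | grep amdgpu
to confirm that the GPU is either driverless or already recognized by a driver, and note the result so you can tell whether the update made things worse.
- For reference, see AMD's official tutorial: Quick start installation guide, ROCm installation (Linux).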
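To make the "note the result" step repeatable, the current state can be written to files before anything is installed. This is a sketch only: the directory name amdgpu-before, the file names, and the snapshot helper are all made up for illustration.

```shell
#!/bin/sh
# Save the GPU/driver state before changing anything.
# SNAP_DIR and the file names are illustrative choices.
SNAP_DIR="${SNAP_DIR:-$HOME/amdgpu-before}"
mkdir -p "$SNAP_DIR"

# snapshot NAME CMD [ARGS...]: run CMD if it exists and save its output
# to $SNAP_DIR/NAME.txt; never abort the script on failure.
snapshot() {
    out="$SNAP_DIR/$1.txt"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" 2>&1 || true
    else
        echo "$1 not found" > "$out"
    fi
}

snapshot kernel uname -r
snapshot lspci lspci -nn
snapshot lsmod lsmod
snapshot firmware-pkg dpkg -s firmware-amd-graphics
echo "State saved in $SNAP_DIR"
```

After each reboot, running the same commands again and comparing against these files with diff shows what changed.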
Installation steps
Install ROCm 7.0.1
wget https://repo.radeon.com/amdgpu-install/7.0.1/ubuntu/jammy/amdgpu-install_7.0.1.70001-1_all.deb
sudo apt install ./amdgpu-install_7.0.1.70001-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # add the current user (normally the machine's admin account) to the render and video groups
sudo apt install rocm
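Once the packages are in, a quick check can confirm that the group change took effect and that ROCm sees the GPU. The sketch below assumes rocminfo is either on PATH or under /opt/rocm/bin (the usual prefix, but an assumption here); the in_group helper is hypothetical. Group membership only applies after logging in again.

```shell
#!/bin/sh
# Post-install checks; run after logging out and back in, or after a reboot.

# Succeeds if the current user is a member of group $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

for g in render video; do
    if in_group "$g"; then
        echo "ok: user is in group $g"
    else
        echo "missing: user is not in group $g (log in again or re-run usermod)"
    fi
done

# Fall back to the assumed default prefix if rocminfo is not on PATH.
ROCMINFO=$(command -v rocminfo || echo /opt/rocm/bin/rocminfo)
if [ -x "$ROCMINFO" ]; then
    "$ROCMINFO" | grep -i 'gfx' || echo "rocminfo ran but listed no gfx agent"
else
    echo "rocminfo not found; the rocm package may not be installed yet"
fi
```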
Install the official AMD Radeon driver
wget https://repo.radeon.com/amdgpu-install/7.0.1/ubuntu/jammy/amdgpu-install_7.0.1.70001-1_all.deb
sudo apt install ./amdgpu-install_7.0.1.70001-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)"
sudo apt install amdgpu-dkms # Near-total beginners who only use the web UI should skip this line; once it has run, regretting it is too late
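If you did install amdgpu-dkms, the following sketch reports whether DKMS has registered an amdgpu module and whether headers for the running kernel are present. The function name check_amdgpu_dkms is hypothetical; dkms status and the /lib/modules/<version>/build link are standard on Debian.

```shell
#!/bin/sh
# Report whether DKMS knows about an amdgpu module.
check_amdgpu_dkms() {
    if ! command -v dkms >/dev/null 2>&1; then
        echo "dkms not installed"
        return 0
    fi
    dkms status | grep -i amdgpu || echo "no amdgpu module registered with dkms"
}
check_amdgpu_dkms

# The build needs headers matching the running kernel.
if [ -d "/lib/modules/$(uname -r)/build" ]; then
    echo "headers present for $(uname -r)"
else
    echo "headers missing for $(uname -r); install linux-headers-$(uname -r)"
fi
```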
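Things to do along the way
- Installing either the driver or ROCm triggers kernel module builds. After each installation, run
sudo dpkg --get-selections | grep linux-image
to see whether a kernel is still being built or held in the background. If so, the simplest fix is to reboot, which drops any kernel that was not unloaded.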
- If you see the error
E: Sub-process /usr/bin/dpkg received a segmentation fault.
the fix is the same as above.
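- AMD's official repository is not blocked. Going through a proxy or accelerator can actually leave a package half-downloaded, after which apt can neither resume the download nor install. In that case, clear the apt cache and leftover packages, for example
sudo rm -rf /var/cache/apt/archives/*.deb
before retrying.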
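- If apt reports broken dependencies and the USTC mirror does not carry the missing packages, run
sudo apt --fix-broken install
which can pull them from the AMD repository.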
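The kernel check above can be narrowed to packages that are not in the plain "install" state, which is one way to spot a kernel that was only half removed. This is a sketch under that interpretation; the pending_kernels helper is hypothetical and only filters the output of dpkg --get-selections.

```shell
#!/bin/sh
# Read `dpkg --get-selections` output on stdin and print linux-image
# packages whose selection state is not "install" (e.g. deinstall).
pending_kernels() {
    awk '$1 ~ /^linux-image/ && $2 != "install" { print $1, $2 }'
}

if command -v dpkg >/dev/null 2>&1; then
    echo "Running kernel: $(uname -r)"
    leftover=$(dpkg --get-selections | pending_kernels)
    if [ -n "$leftover" ]; then
        echo "Kernel packages not fully removed:"
        echo "$leftover"
        echo "A reboot is the simplest way to clear them."
    else
        echo "No leftover kernel packages found."
    fi
fi
```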
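Notes on amdgpu-dkms
- This package builds the driver as a kernel module. Errors during installation generally do not affect normal GPU use, but once the automatic build succeeds the module goes into the kernel, and it checks for ROCm's system dependencies.
- amdgpu-dkms needs many AMD firmware files, which linux-firmware is expected to provide. linux-firmware is an Ubuntu package, though, and the repositories on the Debian side only offer the firmware as separate, split packages, so installing it prints warnings such as: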
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8126a-3.fw for module r8169
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8126a-2.fw for module r8169
W: Possible missing firmware /lib/firmware/amdgpu/aldebaran_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/arcturus_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/picasso_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/raven2_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/raven_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/vega20_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/vega12_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/vega10_ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/ip_discovery.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/vega10_cap.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/navi12_cap.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_13_0_12_ta.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_13_0_12_sos.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_13_0_0_ta_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_13_0_0_sos_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/aldebaran_cap.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_14_0_5_ta.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_14_0_5_toc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_14_0_3_ta_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/psp_14_0_3_sos_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_9_5_0_rlc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_9_5_0_mec.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_imu.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_0_0_imu_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_rlc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_mec.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_me.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_pfp.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_0_0_toc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_0_0_rlc_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_12_0_1_toc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_12_0_1_rlc_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_12_0_0_toc.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_12_0_1_imu_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/sdma_4_4_4.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/sdma_6_1_3.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_mes1.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_5_3_mes_2.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/gc_11_0_3_mes.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/vcn_5_0_1.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/smu_13_0_0_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/smu_14_0_3_kicker.bin for module amdgpu
W: Possible missing firmware /lib/firmware/amdgpu/dcn_3_6_dmcub.bin for module amdgpu
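- Leaving these files out does not noticeably affect the system (the update problem even went away). If you do want to fill them in, download the latest firmware from the official ROCm GitHub and cp it into place. Part of the procedure is shown here; for the rest, ask an AI assistant: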
# Download the firmware archive
wget https://github.com/RadeonOpenCom**/ROCm-Firmware/archive/refs/tags/ROCm-Firmware-30.10.1.tar.gz
# Extract it
tar -xvf ROCm-Firmware-30.10.1.tar.gz
# Copy the firmware into the system directory
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/ip_discovery.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/vega10_cap.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/navi12_cap.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/aldebaran_cap.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/psp_14_0_3_ta_kicker.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/psp_14_0_3_sos_kicker.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/gc_11_0_0_toc.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/gc_12_0_1_rlc_kicker.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/gc_12_0_1_imu_kicker.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/gc_11_0_3_mes.bin /lib/firmware/amdgpu/
sudo cp -r ROCm-Firmware-30.10.1/rocm-firmware/smu_14_0_3_kicker.bin /lib/firmware/amdgpu/
# Verify file integrity (example)
sha256sum /lib/firmware/amdgpu/ip_discovery.bin
# Expected output (placeholder only; a real SHA-256 digest has 64 hex characters): 2a0f7b2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a2a
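Rather than typing one cp per file, the list of missing files can be taken straight from the "Possible missing firmware" warnings. The sketch below assumes the warnings were saved to a log file and that the extracted archive has the layout used above; the function names and the log file name are hypothetical.

```shell
#!/bin/sh
# Print the amdgpu firmware file names named in
# "W: Possible missing firmware ..." warnings read from stdin.
missing_amdgpu_fw() {
    sed -n 's|^W: Possible missing firmware /lib/firmware/amdgpu/\([^ ]*\) for module amdgpu$|\1|p' | sort -u
}

# copy_fw SRC_DIR DEST_DIR: for each file name on stdin, copy it from
# SRC_DIR to DEST_DIR if it exists there; report the ones that do not.
copy_fw() {
    src=$1; dest=$2
    while IFS= read -r f; do
        if [ -f "$src/$f" ]; then
            cp "$src/$f" "$dest/" && echo "copied $f"
        else
            echo "not in archive: $f"
        fi
    done
}

# Usage (run as root; initramfs.log is an example file name):
#   update-initramfs -u 2>&1 | tee initramfs.log
#   missing_amdgpu_fw < initramfs.log | copy_fw ROCm-Firmware-30.10.1/rocm-firmware /lib/firmware/amdgpu
```

Warnings for other modules, such as the r8169 lines, are ignored on purpose, since those files do not belong under /lib/firmware/amdgpu.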
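Closing remarks
The steps above resolve the I/O error and the problem of Ollama being unable to drive the 780M iGPU. There is a chance, however, that your 780M will afterwards be labeled Phoenix, that is, shown as a generic iGPU: the model name no longer appears and SMI reports it incorrectly, yet compute calls work normally. This is a compatibility issue between the ROCm SMI tool and the current hardware/driver. I only settled on this as a temporary workaround because the official side ignored me for a long time and I heard nothing back after contacting their staff. If you have a better solution, please share it (I am a beginner too).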