Sit Back and Relax with Fault Awareness and Robust Instant Recovery for Large Scale AI Workloads. Fanshi Zhang and Kebe Liu, DaoCloud (PDF, 31 pages)
kebe7jun, nekomeowww

4254.197816 NVRM: GPU at PCI:0000:5d:00: GPU-f1906b9b-557a-e961-045c-9fe4be3ce012
4254.197854 NVRM: GPU Board Serial Number: 1653923026510
4254.197860 NVRM: Xid (PCI:0000:5d:00): 79, pid=, name=, GPU has fallen off the bus.
4254.197871 NVRM: GPU 0000:5d:00.0: GPU has fallen off the bus.
4254.197878 NVRM: GPU 0000:5d:00.0: GPU serial number is 1653923026510.
4254.197913 NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

14387.209961 NVRM: The NVIDIA GPU 0000:5d:00.0
NVRM: (PCI ID: 10de:2330) installed in this system has
NVRM: fallen off the bus and is not responding to commands.
14387.210134 nvidia: probe of 0000:5d:00.0 failed with error -1
14387.274303 NVRM: The NVIDIA probe routine failed for 1 device(s).
14387.274366 NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.125.06 Tue May 30 05:11:37 UTC 2023
14387.511008 nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
14387.548839 nvidia-uvm: Loaded the UVM driver, major device number 502.
14387.573380 nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.125.06 Tue May 30 04:

E ProcessGroupNCCL.cpp:481 Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kern
NET/IB: Got completion from peer 10.42.0.2 with error 5, opcode 48, len 32764, vendor err 244
terminate called after throwing an instance of std::runtime_error
what(): Rank 16 NCCL watchdog thread terminated with exception: NCCL error: remote process exited or there was a n
ncclRe