ohshitgorillas Posted October 16, 2025

**The Problem**

I'm running Emby inside Docker on a headless Ubuntu Server, and the container randomly loses access to the GPU, breaking hardware transcoding until the container restarts. The GPU literally disappears from the container's perspective: all `/dev/nvidia*` devices vanish.

**The Server**

- Emby Server: v4.9.1.80 via linuxserver/emby:latest
- OS: Ubuntu Server 24.04 (Noble)
- Docker: docker-ce 5:28.5.1-1~ubuntu.24.04~noble amd64, docker-compose 1.29.2-6ubuntu1
- GPU: NVIDIA GeForce RTX 3060 Ti
- Driver: 580.65.06 (CUDA 13.0)

docker-compose.yaml:

```yaml
services:
  emby:
    image: linuxserver/emby:latest
    container_name: emby
    network_mode: host
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Los_Angeles
      - LD_LIBRARY_PATH=/app/emby/lib:/app/emby/extra/lib
    volumes:
      - ./config:/config
      - /srv/media/music:/music
      - /srv/media/tv:/tv
      - /srv/media/movies:/movies
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu, video, compute, utility]
    restart: unless-stopped

  nginx:
    container_name: emby_nginx
    image: nginx:latest
    volumes:
      - ./nginx/log:/var/log/nginx
      - ./nginx/keys:/config/keys
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - emby
    restart: unless-stopped
    network_mode: host

  watchdog:
    build: ./watchdog
    container_name: emby_watchdog
    privileged: true
    network_mode: host
    environment:
      - EMBY_CONTAINER_NAME=emby
      - CHECK_INTERVAL=300
      - SLACK_WEBHOOK_URL=(redacted)
      - TZ=America/Los_Angeles
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./watchdog/reports:/reports
      - ./config/logs:/emby-logs:ro
    depends_on:
      - emby
    restart: unless-stopped
```

The `emby_watchdog` is a custom container, my own attempt to solve this exact problem. It checks the container's GPU access every 5 minutes; if it finds the GPU is gone, it generates a report on the failure before restarting the container to restore GPU visibility and hardware transcoding. The code for the watchdog can be found here: https://github.com/ohshitgorillas/emby_watchdog/tree/main

**The Library Path Fix**

Early in troubleshooting, I discovered that Emby's bundled ffmpeg couldn't find its own libraries. Running `ldd /app/emby/bin/ffmpeg` showed multiple "not found" errors for libav* libraries that were actually present in `/app/emby/lib/` and `/app/emby/extra/lib/`.

Adding `LD_LIBRARY_PATH=/app/emby/lib:/app/emby/extra/lib` to the environment variables was **critical** and dramatically improved stability. Before this fix, I had GPU failures multiple times per day, sometimes within minutes of a restart. After it, the next failure didn't hit until earlier today, after roughly 2.5 days (60 hours of uptime). This suggests that the constant ffmpeg library load failures may have been causing GPU resource leaks or corruption. However, the problem still occurs occasionally.
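For anyone who wants to run the same check on their own setup, it's nothing exotic; something along these lines (assuming your container is named `emby`, as in the compose file above) shows the problem and confirms whether the fix is active:

```bash
# List Emby's bundled ffmpeg libraries that fail to resolve.
# Before the LD_LIBRARY_PATH fix, several libav* entries showed "not found".
docker exec emby ldd /app/emby/bin/ffmpeg | grep "not found"

# Confirm the library path override is actually set inside the container.
docker exec emby env | grep LD_LIBRARY_PATH
```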
**What the Watchdog Has Revealed**

Honestly... less than I was hoping for. Here's an example report:

```
======================================================================
EMBY GPU WATCHDOG FAILURE REPORT
======================================================================
FAILURE DETECTED: 2025-10-16 08:52:09

TIMING INFORMATION:
Container started: 2025-10-14T03:17:26.630591472Z
Container uptime at failure: 60 hours 34 minutes

---[ GPU ACCESS FROM CONTAINER ]---
nvidia-smi output:
Exit code: 255
Failed to initialize NVML: Unknown Error

/dev/nvidia* devices:
ls: cannot access '/dev/nvidia*': No such file or directory

/proc/driver/nvidia/version:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  580.65.06  Sun Jul 27 07:14:19 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)

/dev/nvidia-caps:
ls: cannot access '/dev/nvidia-caps': No such file or directory

NVIDIA environment variables:
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility

---[ HOST GPU STATUS ]---
Host nvidia-smi:
/bin/sh: nvidia-smi: not found

Host /dev/nvidia* devices:
crw-rw-rw- 1 root root 195, 254 Oct 11 01:29 /dev/nvidia-modeset
crw-rw-rw- 1 root root 511,   0 Oct 11 01:29 /dev/nvidia-uvm
crw-rw-rw- 1 root root 511,   1 Oct 11 01:29 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Oct 11 01:29 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 11 01:29 /dev/nvidiactl

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root   80 Oct 11 01:29 .
drwxr-xr-x 16 root root 4620 Oct 11 01:29 ..
cr-------- 1 root root 236, 1 Oct 11 01:29 nvidia-cap1
cr--r--r-- 1 root root 236, 2 Oct 11 01:29 nvidia-cap2

nvidia kernel modules:
nvidia_uvm           2097152  4
nvidia_drm            139264  0
nvidia_modeset       1564672  2 nvidia_drm
nvidia             103985152  50 nvidia_uvm,nvidia_modeset
video                  77824  1 nvidia_modeset

nvidia-persistenced status:
/bin/sh: systemctl: not found

---[ CONTAINER CONFIGURATION ]---
Runtime: runc
Status: running
Network mode: host
Device mappings: None
NVIDIA environment variables:
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility

---[ EMBY LOGS ANALYSIS ]---
Recent GPU-related entries (last 1000 lines):
Command line: /app/emby/system/EmbyServer.dll -programdata /config -ffdetect /app/emby/bin/ffdetect -ffmpeg /app/emby/bin/ffmpeg -ffprobe /app/emby/bin/ffprobe -restartexitcode 3

Last Hardware Detection:
Hardware detection not found in recent logs

Last CodecList:
CodecList not found in recent logs

CUDA Errors Found:
No CUDA errors found in logs

======================================================================
END REPORT
======================================================================
```

Okay, there are a few bugs in here: the host section isn't invoking the `nvidia-smi` or `systemctl` commands correctly. For reference, here's what they show when run directly on the host:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 48%   44C    P2             46W / 200W  |     245MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                       GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A         1365824      C   /usr/bin/hqplayerd                     204MiB |
+-----------------------------------------------------------------------------------------+
```

```
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
             └─override.conf
     Active: active (running) since Wed 2025-10-08 00:02:41 PDT; 1 week 1 day ago
   Main PID: 543475 (nvidia-persiste)
      Tasks: 1 (limit: 19011)
     Memory: 528.0K (peak: 1.5M)
        CPU: 173ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─543475 /usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose

Oct 08 00:02:41 obsidiana systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Verbose syslog connection opened
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Now running with user ID 122 and group ID 129
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Started (543475)
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - registered
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - persistence mode enabled.
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - NUMA memory onlined.
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Local RPC services initialized
Oct 08 00:02:41 obsidiana systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.
```

where `override.conf` contains:

```
[Install]
WantedBy=multi-user.target

[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose
```

Running `nvidia-smi` inside the Emby container when it can access the GPU yields the same output, except that it can't see `hqplayerd`:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 48%   44C    P2             45W / 200W  |     245MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                       GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Here are the key observations:
1. **Container loses device access**: all `/dev/nvidia*` devices disappear from the container.
2. **Host GPU is fine**: all device nodes exist and work properly on the host.
3. **Docker shows no device mappings**: `docker inspect` reports "Device mappings: None" when the failure occurs (see the quick check below).
4. **But `/proc/driver/nvidia/version` remains readable**: suggesting partial NVIDIA runtime access persists.
5. **No errors in Emby logs**: no CUDA errors, ffmpeg errors, or obvious triggers.

I would appreciate any help, insights, or suggestions from the community in getting this issue fixed. I really don't want to switch to running on bare metal.
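If anyone wants to reproduce the checks by hand, this is roughly what the watchdog looks at each cycle; the exact commands below are illustrative rather than lifted from the repo, and they assume the container name `emby`:

```bash
# Can the container still talk to the GPU? On failure this returns exit code 255
# with "Failed to initialize NVML: Unknown Error".
docker exec emby nvidia-smi

# Are the /dev/nvidia* device nodes still present inside the container?
docker exec emby sh -c 'ls -l /dev/nvidia*'

# What Docker itself reports for the container's device and GPU requests.
docker inspect emby --format '{{json .HostConfig.Devices}}'
docker inspect emby --format '{{json .HostConfig.DeviceRequests}}'
```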
darkassassin07 Posted October 17, 2025

Any containers I run that require GPU access have these lines:

```yaml
    runtime: nvidia                 # Expose NVIDIA GPUs
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    devices:
      - /dev/dri:/dev/dri           # VAAPI/NVDEC/NVENC render nodes
```

One of them (Tdarr) contains a `deploy:` segment similar to yours, but my Emby container does not; it has only the above.

If I remember right (it's been a while), the primary thing was ensuring NVIDIA's runtime was installed and set up: Nvidia Container Toolkit (the How-To-Geek guide is a little simpler to read, at least on mobile).
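From memory, on Ubuntu the toolkit setup boils down to something like the following once NVIDIA's apt repository has been added per their install guide; double-check against the current docs before running it:

```bash
# Install the NVIDIA Container Toolkit (assumes NVIDIA's apt repo is already configured)
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the nvidia runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: the GPU should be visible inside a throwaway container
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```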
Luke Posted October 17, 2025

Hi there, please attach the Emby server log from when the problem occurred: How to Report a Problem

Thanks!