Emby in Docker loses GPU access


ohshitgorillas
Posted

The Problem

I'm running Emby inside Docker on a headless Ubuntu Server, and the container randomly loses access to the GPU, breaking hardware transcoding until the container restarts. The GPU literally disappears from the container's perspective - all /dev/nvidia* devices vanish.

 

The Server

  • Emby Server: v4.9.1.80 via linuxserver/emby:latest
  • OS: Ubuntu Server v24.04-noble
  • Docker:
    docker-ce/noble,now 5:28.5.1-1~ubuntu.24.04~noble amd64 [installed]
    docker-compose/noble,noble,now 1.29.2-6ubuntu1 all [installed]
  • GPU: NVIDIA GeForce RTX 3060 Ti
  • Driver: 580.65.06 (CUDA 13.0)

docker-compose.yaml:

services:
  emby:
    image: linuxserver/emby:latest
    container_name: emby
    network_mode: host
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Los_Angeles
      - LD_LIBRARY_PATH=/app/emby/lib:/app/emby/extra/lib
    volumes:
      - ./config:/config
      - /srv/media/music:/music
      - /srv/media/tv:/tv
      - /srv/media/movies:/movies
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu, video, compute, utility]
    restart: unless-stopped

  nginx:
    container_name: emby_nginx
    image: nginx:latest
    volumes:
      - ./nginx/log:/var/log/nginx
      - ./nginx/keys:/config/keys
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - emby
    restart: unless-stopped
    network_mode: host

  watchdog:
    build: ./watchdog
    container_name: emby_watchdog
    privileged: true
    network_mode: host
    environment:
      - EMBY_CONTAINER_NAME=emby
      - CHECK_INTERVAL=300
      - SLACK_WEBHOOK_URL=(redacted)
      - TZ=America/Los_Angeles
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./watchdog/reports:/reports
      - ./config/logs:/emby-logs:ro
    depends_on:
      - emby
    restart: unless-stopped

The "emby_watchdog" is a custom container which represents an attempt by me to solve this exact problem. It checks for GPU access every 5 min; if it finds the GPU is gone, it generates a report on the issue before restarting the container to restore GPU visibility and hardware transcoding. The code for the watchdog can be found here: https://github.com/ohshitgorillas/emby_watchdog/tree/main

 

The Library Path Fix

Early in troubleshooting, I discovered that Emby's bundled ffmpeg couldn't find its own libraries. Running `ldd /app/emby/bin/ffmpeg` showed multiple "not found" errors for libav* libraries that were actually present in `/app/emby/lib/` and `/app/emby/extra/lib/`. Adding `LD_LIBRARY_PATH=/app/emby/lib:/app/emby/extra/lib` to the environment variables was **critical** and dramatically improved stability: before the fix, I had GPU failures multiple times per day (sometimes within minutes of a restart); after it, the container ran for 2.5 days (about 60 hours of uptime) before the first failure, which happened earlier today. This suggests that the constant ffmpeg library load failures were perhaps causing GPU resource leaks or corruption. However, the problem still occurs occasionally.
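
For anyone who wants to check for the same symptom, the comparison is easy to run from the host. The paths below are what I see in the linuxserver/emby image and may differ in other releases:

    # Without the override, this lists several libav* libraries as "not found"
    docker exec emby ldd /app/emby/bin/ffmpeg | grep "not found"

    # With the extra search path, the same check should come back empty
    docker exec -e LD_LIBRARY_PATH=/app/emby/lib:/app/emby/extra/lib \
        emby ldd /app/emby/bin/ffmpeg | grep "not found"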

 

What the Watchdog Has Revealed

Honestly... less than I was hoping for. Here's an example report:

======================================================================
EMBY GPU WATCHDOG FAILURE REPORT
======================================================================

FAILURE DETECTED: 2025-10-16 08:52:09

TIMING INFORMATION:
Container started: 2025-10-14T03:17:26.630591472Z
Container uptime at failure: 60 hours 34 minutes

---[ GPU ACCESS FROM CONTAINER ]---

nvidia-smi output:
Exit code: 255
Failed to initialize NVML: Unknown Error


/dev/nvidia* devices:
ls: cannot access '/dev/nvidia*': No such file or directory


/proc/driver/nvidia/version:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  580.65.06  Sun Jul 27 07:14:19 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)


/dev/nvidia-caps:
ls: cannot access '/dev/nvidia-caps': No such file or directory


NVIDIA environment variables:
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility

---[ HOST GPU STATUS ]---

Host nvidia-smi:
/bin/sh: nvidia-smi: not found


Host /dev/nvidia* devices:
crw-rw-rw-    1 root     root      195, 254 Oct 11 01:29 /dev/nvidia-modeset
crw-rw-rw-    1 root     root      511,   0 Oct 11 01:29 /dev/nvidia-uvm
crw-rw-rw-    1 root     root      511,   1 Oct 11 01:29 /dev/nvidia-uvm-tools
crw-rw-rw-    1 root     root      195,   0 Oct 11 01:29 /dev/nvidia0
crw-rw-rw-    1 root     root      195, 255 Oct 11 01:29 /dev/nvidiactl

/dev/nvidia-caps:
total 0
drwxr-xr-x    2 root     root            80 Oct 11 01:29 .
drwxr-xr-x   16 root     root          4620 Oct 11 01:29 ..
cr--------    1 root     root      236,   1 Oct 11 01:29 nvidia-cap1
cr--r--r--    1 root     root      236,   2 Oct 11 01:29 nvidia-cap2


nvidia kernel modules:
nvidia_uvm           2097152  4
nvidia_drm            139264  0
nvidia_modeset       1564672  2 nvidia_drm
nvidia              103985152 50 nvidia_uvm,nvidia_modeset
video                  77824  1 nvidia_modeset


nvidia-persistenced status:
/bin/sh: systemctl: not found


---[ CONTAINER CONFIGURATION ]---

Runtime: runc
Status: running
Network mode: host

Device mappings:
None

NVIDIA environment variables:
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility

---[ EMBY LOGS ANALYSIS ]---

Recent GPU-related entries (last 1000 lines):
Command line: /app/emby/system/EmbyServer.dll -programdata /config -ffdetect /app/emby/bin/ffdetect -ffmpeg /app/emby/bin/ffmpeg -ffprobe /app/emby/bin/ffprobe -restartexitcode 3

Last Hardware Detection:
Hardware detection not found in recent logs

Last CodecList:
CodecList not found in recent logs

CUDA Errors Found:
No CUDA errors found in logs

======================================================================
END REPORT
======================================================================

Okay, there are a few bugs in here: the host section isn't actually able to run `nvidia-smi` or `systemctl` from inside the watchdog container. For reference, here's what those commands show when run on the host itself:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 48%   44C    P2             46W /  200W |     245MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1365824      C   /usr/bin/hqplayerd                      204MiB |
+-----------------------------------------------------------------------------------------+
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/nvidia-persistenced.service.d
             └─override.conf
     Active: active (running) since Wed 2025-10-08 00:02:41 PDT; 1 week 1 day ago
   Main PID: 543475 (nvidia-persiste)
      Tasks: 1 (limit: 19011)
     Memory: 528.0K (peak: 1.5M)
        CPU: 173ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─543475 /usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose

Oct 08 00:02:41 obsidiana systemd[1]: Starting nvidia-persistenced.service - NVIDIA Persistence Daemon...
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Verbose syslog connection opened
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Now running with user ID 122 and group ID 129
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Started (543475)
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - registered
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - persistence mode enabled.
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: device 0000:01:00.0 - NUMA memory onlined.
Oct 08 00:02:41 obsidiana nvidia-persistenced[543475]: Local RPC services initialized
Oct 08 00:02:41 obsidiana systemd[1]: Started nvidia-persistenced.service - NVIDIA Persistence Daemon.

where `override.conf` contains:

[Install]
WantedBy=multi-user.target

[Service]
ExecStart=
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --verbose
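
As an aside, I think the reason the watchdog's host section comes up empty is that it runs those commands inside its own container. One possible fix (untested) is to give the watchdog `pid: host` in the compose file and use `nsenter` to run the commands in the host's namespaces; `nsenter` comes from util-linux, so it would also have to exist in the watchdog image:

    # Requires privileged: true (already set) plus pid: host, so that the
    # host's PID 1 is visible from inside the watchdog container.
    nsenter -t 1 -m -u -i -n nvidia-smi
    nsenter -t 1 -m -u -i -n systemctl status nvidia-persistenced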

Running `nvidia-smi` inside the Emby container when it can access the GPU yields the same output, except that it can't see `hqplayerd`:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
| 48%   44C    P2             45W /  200W |     245MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

 

Here are the key observations:

1. **Container loses device access**: all `/dev/nvidia*` devices disappear from the container

2. **Host GPU is fine**: all device nodes exist and work properly on the host

3. **Docker shows no device mappings**: `docker inspect` reports "Device mappings: None" when it fails (see the check below)

4. **But `/proc/driver/nvidia/version` remains readable**: suggesting that partial NVIDIA runtime access persists

5. **No errors in Emby logs**: no CUDA errors, ffmpeg errors, or obvious triggers
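
To reproduce observation 3 on the host, something like this shows what Docker thinks the container currently has; it's not the exact code the watchdog runs, just an illustration:

    # Explicit device mappings (the --device / "devices:" mechanism)
    docker inspect --format '{{json .HostConfig.Devices}}' emby

    # GPU reservations made via the "deploy:" stanza / --gpus flag
    docker inspect --format '{{json .HostConfig.DeviceRequests}}' emby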

 

I would appreciate any help, insights, suggestions, etc. from the community in getting this issue fixed. I really don't want to switch to running on bare metal. 

darkassassin07
Posted

Any containers I run that require gpu access have these lines:

    runtime: nvidia # Expose NVIDIA GPUs
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    devices:
      - /dev/dri:/dev/dri # VAAPI/NVDEC/NVENC render nodes

One of them (Tdarr) contains a 'deploy:' segment similar to yours, but my Emby container does not; it has only the above.

 

 

If I remember right (it's been a while), the primary thing was ensuring NVIDIA's runtime was installed and set up: NVIDIA Container Toolkit (the How-To-Geek guide is a little simpler to read, at least on mobile).
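
From memory, the host-side setup boiled down to something like this once the nvidia-container-toolkit package was installed (check NVIDIA's docs for the current repo/package steps):

    # Register the NVIDIA runtime with Docker and restart the daemon
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker

    # Quick sanity check that a container can see the GPU
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi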
