
M1/M2 GPU acceleration support



sfatula
Posted
On 4/17/2026 at 10:14 PM, PowerCC said:

 

Emby’s gradual rollout of Apple Silicon GPU acceleration is because macOS relies on a different video pipeline (VideoToolbox) that needs to be properly wired into ffmpeg. Since Emby uses ffmpeg for transcoding, the real work lies in making that integration stable on Apple’s tightly controlled hardware, not in the language or CPU translation layer.

As an independent technical user with no affiliation to Emby, it’s obvious Luke and the team are putting in serious effort while also supporting a wide range of platforms. Hardware acceleration on something like a Celeron-based NAS didn’t happen overnight either—it took time to mature. This will get there as well, and when it does, it should deliver the same level of stability Emby users already expect.

What is the ffmpeg limitation? I transcode all the time from command line (well, tdarr) using hardware acceleration on an M1 studio. It has never failed me yet. Is there some specific ffmpeg feature missing that I have not encountered yet perhaps? It's very fast. Has handled every video I've thrown at it, maybe a few hundred by now. 

kalanihelekunihi
Posted

Jellyfin had no issues adding support in 10.8 back in 2022. Plex had no issues also adding support in 2022.
ffmpeg added support for hardware accelerated transcoding for macOS on ARM64 in version 4.4, back in 2021.

Emby refusing to add it simply feels like spite at this point.

Posted

Saying “FFmpeg added Apple Silicon support in 4.4, so Emby should’ve just flipped a switch” skips over how messy this actually is in practice.

Yes, FFmpeg has supported VideoToolbox on ARM64 since 4.4. And yes, Jellyfin and Plex Media Server both shipped something in 2022.

But “support exists” ≠ “full, stable, end-to-end pipeline works.”

On Apple Silicon, FFmpeg is going through VideoToolbox, which is basically a black box. You don’t get the same pipeline control you do with Intel Quick Sync Video. That’s why even today you still see:

* CPU fallback for subtitles
* inconsistent HDR tone mapping paths
* filters breaking zero-copy pipelines
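The fallback is easy to illustrate on the command line. This is a sketch, not Emby's actual invocation — file names and bitrates are made up, and the flags are the standard ffmpeg VideoToolbox options:

```shell
# Hypothetical invocations (file names/bitrates invented for illustration).
# With -hwaccel_output_format videotoolbox, decoded frames can stay in GPU
# memory all the way into the encoder:
GPU_GRAPH='ffmpeg -hwaccel videotoolbox -hwaccel_output_format videotoolbox \
  -i in.mkv -c:v hevc_videotoolbox -b:v 4M out.mkv'

# Add subtitle burn-in and the graph must round-trip through system memory,
# because `subtitles` is a software-only filter:
SW_GRAPH='ffmpeg -hwaccel videotoolbox -hwaccel_output_format videotoolbox \
  -i in.mkv -vf "hwdownload,format=nv12,subtitles=subs.srt" \
  -c:v hevc_videotoolbox -b:v 4M out.mkv'

echo "$SW_GRAPH"
```

The second graph still "uses hardware acceleration" in the sense that decode and encode are on the GPU — but every frame crosses the GPU/CPU boundary twice, which is exactly the cost the bullets above describe.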

Jellyfin moved fast here, but they also accepted those tradeoffs. Plex did too, just behind a paywall and with similar caveats. That’s not “problem solved,” it’s “good enough for most users.”

Emby has always leaned more conservative on the server side. And yes, part of that is architectural—being largely C# means tighter coupling to their own pipeline and more work integrating low-level FFmpeg/VideoToolbox behavior cleanly, especially on a newer platform.

Also, stepping back: if we’re talking about FFmpeg pipelines, Apple Silicon still isn’t on par with Quick Sync anyway. With QSV you can actually keep decode → scale → tone map → encode on GPU with predictable behavior. With VideoToolbox you’re often bouncing between GPU and CPU depending on the workload.

So framing this as “Emby refusing out of spite” doesn’t really hold up. It’s more like:

* Apple’s API model is less flexible for FFmpeg-style pipelines
* Early implementations across *all* servers were partial
* Emby prioritized stability over shipping a half-baked path

You can argue they’re behind—and they are—but it’s not as simple as “everyone else did it in 2022 so they’re just being stubborn.”
 

Posted
On 4/26/2026 at 8:12 PM, sfatula said:

What is the ffmpeg limitation? I transcode all the time from command line (well, tdarr) using hardware acceleration on an M1 studio. It has never failed me yet. Is there some specific ffmpeg feature missing that I have not encountered yet perhaps? It's very fast. Has handled every video I've thrown at it, maybe a few hundred by now. 

You’re not wrong—what you’re describing is exactly the “happy path,” and Apple Silicon + FFmpeg with VideoToolbox is genuinely very fast and very reliable for straightforward transcoding.

If you’re doing Tdarr-style batch jobs or CLI transcodes with relatively clean inputs (H.264/HEVC, simple scaling, no complex filter chains), you’re mostly staying inside the hardware-accelerated path, so it’s not surprising you’ve had zero issues.

Where the distinction comes in is that a media server like Emby is not doing “one job at a time transcoding.” It’s doing dynamic, real-time pipeline construction based on client behavior.

That’s where FFmpeg + VideoToolbox behaves differently than people expect.

The “limitation” isn’t that FFmpeg can’t use hardware acceleration on Apple Silicon—it absolutely can. The issue is that hardware acceleration in FFmpeg is not end-to-end guaranteed across the full filter graph, especially in server-driven scenarios.

For example, in real playback situations you can hit cases like:

  • subtitle burn-in forcing CPU paths
  • HDR → SDR tone mapping breaking hardware chains
  • certain scaling + filter combinations causing hwdownload/hwupload transitions
  • mixed codec/container edge cases depending on client profile switching mid-stream

So even though encode/decode is fast (what you’re seeing in Tdarr), the pipeline integrity under dynamic conditions is what differs.

That’s also where Intel Quick Sync tends to behave more predictably in FFmpeg-based servers, because more of the filter pipeline has mature hardware-accelerated equivalents, so fewer stages fall back to CPU mid-graph.
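For contrast, here's what that predictability looks like as a QSV command sketch (Intel/Linux; flags from ffmpeg's QSV documentation, file names invented). Decode, scale, and encode all pass GPU surfaces through `vpp_qsv`, so nothing falls back mid-graph:

```shell
# Hypothetical full-GPU Quick Sync pipeline: h264 decode -> scale -> hevc
# encode, with frames staying in GPU memory throughout (hwaccel_output_format
# qsv keeps surfaces on-device; vpp_qsv is the hardware scaler).
QSV_GRAPH='ffmpeg -hwaccel qsv -hwaccel_output_format qsv \
  -c:v h264_qsv -i in.mkv \
  -vf "vpp_qsv=w=1280:h=720" \
  -c:v hevc_qsv -b:v 4M out.mkv'
echo "$QSV_GRAPH"
```

The equivalent VideoToolbox graph only holds together if every filter in the chain has a VT/Metal implementation — which is the gap the rest of this thread digs into.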

Posted (edited)

Hi! I have something practical for fellow Premiere subscribers on Apple Silicon while we wait for the official build.

First — for anyone wondering whether Emby is actually working on this: **yes, they are, and there's strong evidence inside the binary itself.** While building the plugin I had to look at the existing VideoToolbox classes in `Emby.Server.MediaEncoding.dll` and `Emby.Ffmpeg.Lib.dll`, and what I found was that almost the entire HEVC VideoToolbox infrastructure is already in place in the shipped DLLs:

  • A complete `hevc_videotoolbox` ffmpeg encoder wrapper class (with profile/option definitions for `main`, `main10`, `allow_sw`, `realtime`, `prio_speed`, etc.) lives in `Emby.Ffmpeg.Lib.dll`.
  • A `VideoToolboxDeviceInfo` codec device-info type, profile/level lists, and the relevant base classes (`VideoEncoderHevcBase`, `VideoDecoderH264Base`, `VideoDecoderHevcBase`) are all present in `Emby.Server.MediaEncoding.dll`.
  • The H.264 VideoToolbox encoder is fully implemented and registered (Premiere-gated) — it just lacks an HEVC sibling, and there are no VT decoders registered.

In other words, the team has clearly built ~90% of the HEVC VT encoder and the decoder hookups already; they just aren't wired into the `ICodecProvider` graph yet. This is also consistent with Luke from the Emby team confirming on the M3/M4 thread (April 2025) that the work is in active development. So this post isn't a workaround for something forgotten; it's a stop-gap until the first-party implementation ships, and it's deliberately written to plug into the *existing* Emby classes so a future official build replaces it cleanly.

@PowerCC's architectural notes earlier in this thread also turned out to be exactly right — I hit every one of them. Filter graph limitations and HDR tone-mapping forcing CPU paths are real and unavoidable from a plugin alone. But the basic codec gaps are addressable purely through Emby's standard `ICodecProvider` plugin API, without modifying any Emby DLLs.

What I built

A single 11 KB plugin DLL that registers the VideoToolbox codecs the official osx-arm64 build doesn't:

  • VideoToolbox **H.265 (HEVC) encoder** — `hevc_videotoolbox`
  • VideoToolbox **H.264 decoder** with `-hwaccel videotoolbox`
  • VideoToolbox **H.265 decoder** with `-hwaccel videotoolbox`

The plugin **does not patch any Emby DLLs**, **does not modify the EmbyServer.app bundle**, and **does not bypass any license checks** — it's purely additive, using only public Emby API surfaces (`ICodecProvider`, `VideoEncoderHevcBase`, `VideoDecoderH264Base`, `VideoDecoderHevcBase`).

It's intended for Emby Premiere subscribers running 4.10.x on Apple Silicon. The H.264 VT encoder is still gated by Premiere in the official build, and this plugin doesn't change that gate — it just adds the codecs that aren't registered at all (HEVC encoder, decoders).

Results — Mac mini M4, Emby 4.10.0.10 Beta

  • 1080p H.264 → 720p HEVC at 3.6 Mbps: **153 fps, ~25% of one core**
  • Three concurrent 1080p VC1 → 1080p HEVC transcodes: **~70% total of one core**
  • `VTEncoderXPCService` and `VTDecoderXPCService` daemons spawned — confirmed real GPU offload via Activity Monitor
  • HEVC encoder selects `-profile:v main`/`main10` based on bit depth, sets `-allow_sw 0` to prevent silent SW fallback
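If anyone wants to confirm the offload on their own machine, here's a quick sketch — the XPC service names are as observed in Activity Monitor on macOS; run it while a transcode is playing:

```shell
# Check whether the VideoToolbox XPC helper daemons are alive during a
# transcode (macOS). pgrep exits 0 only if a matching process exists.
check_vt_offload() {
  pgrep -x VTEncoderXPCService >/dev/null && echo "encode offload active"
  pgrep -x VTDecoderXPCService >/dev/null && echo "decode offload active"
  return 0
}
check_vt_offload
```

If neither line prints during playback, the session is running on CPU regardless of what the server UI says.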

Limitations (architectural, can't fix from a plugin)

  • **HDR → SDR transcoding stays on CPU.** The bundled ffmpeg has no `scale_vt` / `tonemap_videotoolbox` filters, so any 4K HDR → 1080p SDR pipeline forces a per-frame GPU↔CPU roundtrip plus software downscale + tone map. Verified via `ffmpeg -filters` — only software `scale`/`tonemap` are compiled in. SDR→SDR transcodes that don't change pixel format or dimensions stay fully GPU.
  • **VC1 decode stays software** — Apple silicon has no hardware VC1 decoder.
  • **No HW icon for partial-HW transcodes.** Emby's UI flags HW only when decode + encode + filter graph all stay GPU; software filters (subtitle burn, scale, tone map) correctly result in no badge.
  • **AV1 decode not registered** even though M3+ supports it (could be added; I haven't tested).
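The `ffmpeg -filters` check above generalizes to any binary. A small sketch — the Emby ffmpeg path is my guess at where the bundle keeps it, and note that `scale_vt` only landed in upstream ffmpeg 6.0, so a 5.x-era build won't have it regardless of flags:

```shell
# Probe an ffmpeg binary for VideoToolbox-backed filters.
# Prints "none" if the binary is missing or exposes no such filters.
list_vt_filters() {
  "$1" -hide_banner -filters 2>/dev/null | grep -E 'scale_vt|videotoolbox' \
    || echo "none"
}
# Hypothetical location of Emby's bundled ffmpeg on macOS:
list_vt_filters /Applications/EmbyServer.app/Contents/MacOS/ffmpeg
```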

Install

The DLL and the C# source file are both attached to this post — either grab the prebuilt or build from the source yourself.

1. Download `EmbyHwAddon.dll` (attached).
2. Copy it to your Emby plugins directory: `<config>/plugins/`
3. Restart Emby Server (quit from tray, relaunch).
4. Server → Transcoding → set **Hardware acceleration when available** to **Advanced** → tick the new **VideoToolbox H.264 (Decoder)**, **VideoToolbox H.265 (Decoder)**, and **VideoToolbox H.265** boxes. Save.

Verification

After playing a transcode, your latest `ffmpeg-transcode-*.txt` log should show:
 

>>>>>>  Selected Codecs
Decoder VideoToolbox H.265 (Decoder)
Encoder VideoToolbox H.265

And the actual ffmpeg command line should include:

-c:v hevc -hwaccel:v videotoolbox  ...  -c:v hevc_videotoolbox -profile:v main10 -allow_sw 0
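To save digging through the log directory by hand, here's a sketch that pulls those lines out of the newest transcode log — the filename pattern is from above, but the default log directory is an assumption; point it at wherever your server keeps its logs:

```shell
# Print the "Selected Codecs" block from the most recent transcode log.
# Default directory is hypothetical; pass your own as the first argument.
show_selected_codecs() {
  dir="${1:-$HOME/.config/emby-server/logs}"
  latest=$(ls -t "$dir"/ffmpeg-transcode-*.txt 2>/dev/null | head -n1)
  if [ -n "$latest" ]; then
    grep -A2 'Selected Codecs' "$latest"
  else
    echo "no transcode logs found in $dir"
  fi
}
show_selected_codecs
```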

Build from source

The plugin is a single 224-line C# file (`HevcVideoToolbox.cs`, also attached to this post). You don't have to trust the binary — build it yourself with .NET 6 SDK.

Steps:

mkdir lib
# copy these DLLs from your EmbyServer.app/Contents/MacOS/ into lib/
cp /Applications/EmbyServer.app/Contents/MacOS/{Emby.Server.MediaEncoding,Emby.Ffmpeg,Emby.Ffmpeg.Lib,Emby.Ffmpeg.Base,Emby.Media.Model,MediaBrowser.Model,MediaBrowser.Controller,MediaBrowser.Common,Emby.Web.GenericEdit}.dll lib/

dotnet build EmbyHwAddon -c Release
# Output: EmbyHwAddon/bin/Release/net6.0/EmbyHwAddon.dll

A note for the Emby team

If anyone on the Emby side wants to use the registration shape from this plugin as a reference for the official implementation, please do — that's the easiest path to "this becomes obsolete because Emby ships it natively," which is the goal. Happy to discuss in this thread.

Hopefully this is useful to other Premiere subscribers waiting on the official build. License: MIT.
 

EmbyHwAddon.dll EmbyHwAddon.csproj HevcVideoToolbox.cs

Edited by 0xBEB
Posted

UPDATE — added VP9 and AV1 hardware decoders.

Especially relevant for anyone on an M3 or M4 Mac, since those chips have hardware AV1 decode that the previous version of this plugin didn't expose.

What's new in this version:

  • VideoToolbox VP9 decoder with -hwaccel videotoolbox (works on all Apple silicon)
  • VideoToolbox AV1 decoder with -hwaccel videotoolbox (M3 and later)

The HEVC encoder and the H.264/H.265 decoders from the previous post are unchanged. The plugin remains a single drop-in DLL — no changes to install instructions; just replace EmbyHwAddon.dll in your <config>/plugins/ folder, restart, and tick the new entries in Server > Transcoding > Advanced.

Notes for AV1:

  • AV1 is registered unconditionally. On an M1 or M2 the entry will appear in the codec list but VideoToolbox does not have hardware AV1 decode there, so it will simply fall back to software for AV1 inputs. If that bothers you, leave the AV1 box unchecked.
  • On M3+, you should see "Decoder VideoToolbox AV1 (Decoder)" in your ffmpeg-transcode-*.txt logs and -c:v:0 av1 -hwaccel:v:0 videotoolbox in the actual ffmpeg command.

EmbyHwAddon.dll EmbyHwAddon.csproj HevcVideoToolbox.cs

Posted

@0xBEB

This is seriously great work — thanks for putting the time into this and sharing both the plugin and the source. This is exactly the kind of practical validation this thread needed.

Also appreciate you calling out my earlier points — and honestly your findings line up almost perfectly with what I was trying to describe from the architecture side.

What you’ve shown pretty clearly is that FFmpeg + VideoToolbox on Apple Silicon is already very capable once the codecs are actually registered. The performance numbers you’re getting (and the VT services spinning up) make that hard to argue against.

At the same time, the limitations you listed are really the other half of the story:

  • HDR → SDR staying on CPU
  • lack of VideoToolbox-backed filters forcing GPU↔CPU transitions
  • subtitle burn-in breaking the hardware path
  • Emby correctly not flagging partial pipelines as “HW”

So even with GPU decode + encode working, the full pipeline is still hybrid in a lot of real playback scenarios. That distinction is easy to miss if you’re mostly comparing against CLI or batch workflows where things stay on the happy path.

Your plugin actually highlights both sides of the discussion at once:

  • the codec side is largely there
  • but the filter graph + pipeline behavior is what determines whether HW acceleration really holds end-to-end

Which also explains why different servers made different tradeoffs here — some shipped earlier with partial hardware paths, while Emby seems to be aiming for something more consistent across the full pipeline.

Out of curiosity (and maybe for @Luke and the Emby team): based on what you saw in the binaries, does this now look more like a filter/FFmpeg build limitation (missing scale_vt / tone-mapping equivalents), or is the harder part still how the server wires and adapts the pipeline dynamically?

Either way, this is a really clean stop-gap and a great data point for where things actually stand today.

Posted

Thanks @PowerCC, that's nice to hear. And honestly that's the right question to be asking.

Short answer from poking around the binaries: it's both, but the filter side is clearly the bigger blocker.

The codec side is mostly there — there's even a VideoToolboxProvider class in the DLLs that would have registered the decoders, it just never gets wired up by the composition root. And the HEVC decoder class in there is actually broken (inherits from the H.264 base by mistake, uses a codec name that doesn't exist in ffmpeg, returns null from CreateFfmpegDecoder). Looks like someone scaffolded it, hit a snag, and it's been sitting half-finished. Probably a day or two of cleanup for someone with full source access.

The filter side is the real work. Emby has ~320 video-filter wrapper classes — scale_cuda, scale_qsv, scale_vaapi, tonemap_cuda, tonemap_vaapi, etc. — and zero of them are VideoToolbox or Metal. The bundled ffmpeg is also built with --disable-metal, so even the underlying Apple filter primitives aren't there. So getting HDR→SDR fully on the GPU needs three things stacked: ffmpeg rebuilt with Metal, new C# filter wrappers, and pipeline-builder logic that knows when to swap them in without breaking subtitle burn-in or dimensional changes.

That third one is exactly the "dynamic pipeline" thing you were describing. So my read matches yours — the codecs are the easy part, the filter graph is where the actual engineering lives.
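By the way, the `--disable-metal` claim is checkable on any build — `ffmpeg -version` prints the full configure line. A sketch (the bundled-ffmpeg path is my assumption about the app layout):

```shell
# Print an ffmpeg binary's configure flags; --disable-metal will appear
# in this line if it was set at build time.
show_build_flags() {
  "$1" -hide_banner -version 2>/dev/null | grep '^configuration:' \
    || echo "no configuration line"
}
# Hypothetical location of Emby's bundled ffmpeg on macOS:
show_build_flags /Applications/EmbyServer.app/Contents/MacOS/ffmpeg
```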

Posted

Quick follow-up that directly answers the codec-vs-pipeline question with hard numbers.

I grabbed jellyfin-ffmpeg 7.1.3 (official portable macOS arm64, built April 2026) and ran it on the same M4 Mac mini against Emby's bundled ffmpeg. Same hardware, only the ffmpeg build differs.

Emby's is 5.1 fork built with --disable-metal. Jellyfin's is 7.1.3 with Metal on.

Filters Jellyfin has and Emby doesn't:

- scale_vt (GPU resize)
- tonemap_videotoolbox (Metal HDR to SDR)
- overlay_videotoolbox (subtitle burn-in on GPU)
- bwdif_videotoolbox / yadif_videotoolbox (GPU deinterlace)
- tonemapx (SIMD-vectorized CPU tonemap, way faster than plain "tonemap")
- the full opencl filter suite

Encoders are identical (h264/hevc/prores videotoolbox).

Concrete impact on my Hobbit case (4K HDR HEVC -> 1080p SDR HEVC, ~500% CPU on Emby):

Emby graph (forced):
  hevc(VT) -> hwdownload -> libswscale -> tonemap(SW) -> hevc_videotoolbox

Jellyfin graph (possible):
  hevc(VT) -> tonemap_videotoolbox -> scale_vt -> hevc_videotoolbox

That's the difference between ~50% and ~500% CPU on the same chip.
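For reference, the GPU-resident graph above would look roughly like this as a command line. This is not a tested invocation — the filter names exist in jellyfin-ffmpeg 7.x, but the option spellings and file names here are assumptions:

```shell
# Hypothetical jellyfin-ffmpeg invocation for the GPU-resident graph:
# VT decode -> Metal tone map -> VT scale -> VT encode, no hwdownload.
JF_GRAPH='ffmpeg -hwaccel videotoolbox -hwaccel_output_format videotoolbox \
  -i hobbit_4k_hdr.mkv \
  -vf "tonemap_videotoolbox,scale_vt=w=1920:h=1080" \
  -c:v hevc_videotoolbox -b:v 8M out.mkv'
echo "$JF_GRAPH"
```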

So to sharpen @PowerCC's question: the immediate root cause is the ffmpeg build. --disable-metal is a one-flag change, and Jellyfin proves the filters compile fine on macOS arm64 in current ffmpeg (same machine, same OS). The C# wrapper layer (zero VT/Metal entries among 320 filter classes) is downstream — even if the pipeline builder knew to emit scale_vt, the binary couldn't execute it today.

Posted
On 5/1/2026 at 8:04 AM, PowerCC said:

With QSV you can actually keep decode → scale → tone map → encode on GPU with predictable behavior. With VideoToolbox you’re often bouncing between GPU and CPU depending on the workload.

Took apart Plex Transcoder on the same M4 to check this. Their actual command for a 4K HDR Hobbit transcode is

-hwaccel videotoolbox -filter_complex [0:0]hwupload,scale_videotoolbox=w=1280:h=720:format=p010le -codec:0 hevc_videotoolbox -profile:0 main10

CPU sits at 2–54% of one core, no GPU↔CPU bounce. Plex even ships a custom scale_videotoolbox filter in their ffmpeg fork that isn't in upstream, plus they embed their own Metal compute kernel for yadif at build time. Jellyfin does the same on the Metal side with tonemap_videotoolbox and overlay_videotoolbox. So the "VT forces bouncing" framing doesn't really hold — the bouncing only happens on Emby because the bundled ffmpeg is built --disable-metal and the C# filter layer has zero VT wrappers. Both other servers have proven the pipelines stay coherent on Apple silicon.

kalanihelekunihi
Posted

Community has done more engineering effort on hardware acceleration in the past few days than Emby has done in the past 5 years.

Tazintosh
Posted

Love the actual discussion, but clearly out of my league :D
How does this translate on M1?

PowerCC
Posted
8 hours ago, 0xBEB said:

Took apart Plex Transcoder on the same M4 to check this. Their actual command for a 4K HDR Hobbit transcode is

-hwaccel videotoolbox -filter_complex [0:0]hwupload,scale_videotoolbox=w=1280:h=720:format=p010le -codec:0 hevc_videotoolbox -profile:0 main10

CPU sits at 2–54% of one core, no GPU↔CPU bounce. Plex even ships a custom scale_videotoolbox filter in their ffmpeg fork that isn't in upstream, plus they embed their own Metal compute kernel for yadif at build time. Jellyfin does the same on the Metal side with tonemap_videotoolbox and overlay_videotoolbox. So the "VT forces bouncing" framing doesn't really hold — the bouncing only happens on Emby because the bundled ffmpeg is built --disable-metal and the C# filter layer has zero VT wrappers. Both other servers have proven the pipelines stay coherent on Apple silicon.

I think this is mixing up what’s possible with what’s guaranteed.

With FFmpeg + VideoToolbox, you absolutely can keep decode → scale → tone map → encode fully on the GPU. And yeah, Plex and Jellyfin have clearly done the work (custom filters, Metal kernels, tightly controlled graphs) to make that happen on Apple silicon.

But that doesn’t mean you get a GPU-only pipeline just because the FFmpeg build supports it.

The real deciding factor is the filter graph, not the build flags. As soon as you introduce a filter that doesn’t have a VideoToolbox/Metal path, FFmpeg will quietly insert a hwdownload/hwupload and you’re back on the CPU. That includes pretty normal things like certain subtitle paths, overlays, or some tone mapping cases depending on format.

The example command works because it’s very constrained:

  • hwupload keeps frames on the GPU
  • scale_videotoolbox is GPU-native
  • p010le plays nicely with the rest of the pipeline

Change any of that and it’s no longer guaranteed to stay on-GPU.

Also, Plex/Jellyfin showing this works doesn’t generalize to every server. Emby falls back more often—not because VideoToolbox “forces bouncing,” but because:

  • Different FFmpeg build/flags
  • Less complete VT/Metal mapping in the filter layer
  • Less aggressive hardware-aware graph construction

So I’d frame it like this:

  • VideoToolbox doesn’t inherently force GPU↔CPU bouncing
  • Avoiding it requires intentional graph design + compatible filters + server-side integration
  • It’s not something you get “for free” from the FFmpeg build

That’s also where the difference vs Intel Quick Sync Video shows up—QSV tends to be more predictable in generic FFmpeg usage, while VT can be just as efficient but is more sensitive to how the pipeline is put together.
