A lot of this is conjecture based on what I know about vaapi despite pretty much never using it here. I'm unable to test any of it right now and can't boot to Linux due to a disc problem, though, so any changes would require someone else to do basic testing to confirm. I also don't have a hardware HEVC decoder, so...yeah.
If the decoder output uses a hardware surface format (like vaapi), software filters can't touch it, so just changing hwaccel_output_format would break the software scale/overlay/whatever filters unless you add a hwdownload ahead of them. That gives you the easiest option: put hwdownload before the scale (it may need a format specified as well). Whatever you choose, you just need hwupload/hwdownload filters in the right places. Changing hwaccel_output_format to vaapi doesn't really buy you anything by itself, since you'd have to download/convert the frames from the surfaces immediately before any software filtering. That's not necessarily bad, but it's a change with no real effect...ultimately, the decoder output has to end up in system memory, and it shouldn't matter whether that happens as step 1 (setting the output format to yuv420p or whatever) or step 2 (the hwdownload filter), since nothing is done between those two steps.
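To make the step 1 vs. step 2 point concrete, here's a rough, untested sketch of the two variants side by side. The input.mkv name, the scale=1280:-2 stand-in filter, the 12-frame cap, and format=nv12 are all my assumptions (nv12 suits 8-bit sources; a 10-bit source would probably want p010le instead). It only prints the two commands unless you set RUN=1.

```shell
#!/bin/sh
# Sketch: two ways to get decoded frames into system memory for software filters.
INPUT="${INPUT:-input.mkv}"                 # hypothetical test file
DEVICE="${DEVICE:-/dev/dri/renderD128}"     # render node used elsewhere in this thread
SW_FILTER="scale=1280:-2"                   # stand-in for any software filter

# Step-1 style: no -hwaccel_output_format, so the decoder copies frames back itself.
VF_A="$SW_FILTER"
# Step-2 style: keep vaapi surfaces, then download explicitly before filtering.
# nv12 is an assumption for 8-bit sources; 10-bit would likely need p010le.
VF_B="hwdownload,format=nv12,$SW_FILTER"

run() {
    # Print the command; only execute it when RUN=1.
    printf 'ffmpeg'; printf ' %s' "$@"; printf '\n'
    if [ "${RUN:-0}" = 1 ]; then ffmpeg "$@"; fi
}

run -hide_banner -hwaccel vaapi -hwaccel_device "$DEVICE" \
    -i "$INPUT" -vf "$VF_A" -frames:v 12 -f null -
run -hide_banner -hwaccel vaapi -hwaccel_device "$DEVICE" -hwaccel_output_format vaapi \
    -i "$INPUT" -vf "$VF_B" -frames:v 12 -f null -
```

If both variants behave the same (both work or both fail the same way), that backs up the idea that where the download happens doesn't matter.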
The real question in this case is why the decoder throws an error when you try to copy a frame to system memory as yuv420p. It's entirely possible that there is no conversion path from a 10-bit hardware surface format to 8-bit pixel formats (that could be an ffmpeg or a driver limitation). It's also possible that there's a decoding problem causing the output to be malformed (see the side note at the bottom). If the former is true, then hwdownload may not work either, but it's worth trying as an alternative in case it does something different on the back end than hwaccel_output_format yuv420p would. Another option would be to use something like nv12 rather than yuv420p, since it's more "surface-friendly". If the latter is true, however, that's a serious pain to work around (I'd hang a sign stating "abandon hope all ye who enter" on the issue and walk away since there's nothing Emby can do to fix that kind of issue, lol).
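If the 10-bit-to-8-bit theory is right, one thing to try is downloading in a format that matches the source bit depth and converting afterwards. A minimal sketch of picking that format is below. download_format is a hypothetical helper of mine (not an ffmpeg feature), input.mkv is a placeholder, and I'm assuming the driver can transfer p010le for 10-bit surfaces, which I can't confirm here. It only handles 8-bit and 10-bit sources.

```shell
#!/bin/sh
# Sketch: choose a hwdownload format that matches the source bit depth.
# download_format is a hypothetical helper, not part of ffmpeg.
download_format() {
    case "$1" in
        *p10le|*p10be|p010*) echo p010le ;;  # 10-bit: P010 is the 10-bit counterpart of NV12
        *)                   echo nv12 ;;    # 8-bit (anything else also falls through to here)
    esac
}

INPUT="${INPUT:-input.mkv}"   # hypothetical test file

# Only probe when ffprobe and the file are actually available.
if command -v ffprobe >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    SRC_FMT=$(ffprobe -v error -select_streams v:0 \
        -show_entries stream=pix_fmt -of default=noprint_wrappers=1:nokey=1 "$INPUT")
    echo "source: $SRC_FMT -> hwdownload,format=$(download_format "$SRC_FMT")"
fi
```

For a 10-bit source this would suggest hwdownload,format=p010le, with a second format=yuv420p (or nv12) after it if the later filters need 8-bit.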
Since you seem to be okay with playing with ffmpeg a tad, give this a try to see what happens:
/usr/bin/ffmpeg -ss 00:14:24 -hwaccel vaapi -hwaccel_output_format vaapi -vaapi_device /dev/dri/renderD128 -i file:"input.mkv" -map 0:0 -codec:v:0 h264_vaapi -filter_complex "[0:3]scale=3840:2160:force_original_aspect_ratio=decrease[sub];hwdownload,format=yuv420p[vid];[vid][sub]overlay,format=nv12|vaapi,hwupload" -hide_banner -f null -
Warning: there might be a typo or mistake in there since I can't test it here. It should do roughly the same thing as the original command, but with a surface format for the decoder output and the download done via a filter instead. It only operates on the video stream (no audio), since there's no need to add in the rest of the options just to test hardware copy-back and filter operations. If it does work, you can interrupt it after a dozen frames or so are processed (no need to encode the whole thing). It won't write a file since that's not important either, but if it works and you want to see the results, just replace the -f null and trailing hyphen with a file name. If it doesn't work, I have a few other things to try.
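If you'd rather not babysit it, here's a small wrapper sketch around that same command that stops by itself after a dozen frames and optionally writes a file. The FRAMES, OUT, and RUN variables are hypothetical knobs I made up for this; -frames:v is a standard ffmpeg output option. The filter graph is copied as-is from the command above, so any mistake in it carries over. By default it only prints the command.

```shell
#!/bin/sh
# Sketch only: the test command from above, capped at a dozen frames.
# FRAMES, OUT and RUN are hypothetical knobs added for illustration.
INPUT="${INPUT:-input.mkv}"
FRAMES="${FRAMES:-12}"
OUT="${OUT:-}"                  # empty = discard the output (-f null -)

FILTER='[0:3]scale=3840:2160:force_original_aspect_ratio=decrease[sub];hwdownload,format=yuv420p[vid];[vid][sub]overlay,format=nv12|vaapi,hwupload'

# Build the argument list; -frames:v stops after $FRAMES video frames.
set -- -ss 00:14:24 -hwaccel vaapi -hwaccel_output_format vaapi \
    -vaapi_device /dev/dri/renderD128 -i "file:$INPUT" \
    -map 0:0 -codec:v:0 h264_vaapi -filter_complex "$FILTER" \
    -frames:v "$FRAMES" -hide_banner

if [ -n "$OUT" ]; then
    set -- "$@" "$OUT"          # keep the result for inspection
else
    set -- "$@" -f null -       # throw the result away
fi

# Print by default; set RUN=1 to actually invoke ffmpeg.
if [ "${RUN:-0}" = 1 ]; then
    ffmpeg "$@"
else
    printf 'ffmpeg'; printf ' %s' "$@"; printf '\n'
fi
```

Something like RUN=1 OUT=test.mkv sh script.sh would then leave a short clip to look at (script.sh and test.mkv are just example names).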
Side note: This looks a tad like the result of something that I've been wrestling with even with software decoding: seeking in some HEVC files causes decoding errors because ffmpeg marks all CRA frames as keyframes even though they may contain slices that need info from prior units. Working around this requires a LOT of pre-parsing gymnastics or modifying ffmpeg to restrict keyframe detection to only two NAL unit types, neither of which is especially fun. I thought about submitting a patch for it, but the behaviour isn't actually incorrect (decoders are expected to just discard the GOP/frame with missing slices, but it can be problematic when seeking/segmenting files for encoding - a different use case than the spec expects). Other tools seem to have the same pseudo-issue, so even using something like mkvmerge to seek/segment ends up seeking to a frame that can't be decoded cleanly. I wouldn't doubt that some hardware decoders may cough a bit in these scenarios too, although they may just discard the frame/GOP and move along.
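For anyone who wants to check whether a seek point lands on one of these, here's a tiny sketch of the classification I mean. The numbers are the HEVC nal_unit_type values (16-18 BLA, 19-20 IDR, 21 CRA). Both helper names are hypothetical, and treating only the two IDR types as safe seek targets is my assumption about which restriction would help, not something ffmpeg does today.

```shell
#!/bin/sh
# Sketch: classify HEVC random-access NAL unit types.
# nal_name and safe_seek_point are hypothetical helpers.
nal_name() {
    case "$1" in
        16|17|18) echo BLA ;;    # BLA_W_LP, BLA_W_RADL, BLA_N_LP
        19|20)    echo IDR ;;    # IDR_W_RADL, IDR_N_LP
        21)       echo CRA ;;    # CRA_NUT
        *)        echo other ;;
    esac
}

# Assumption: only IDR pictures are treated as safe seek targets here.
safe_seek_point() {
    [ "$(nal_name "$1")" = IDR ]
}

for t in 19 20 21; do
    if safe_seek_point "$t"; then
        echo "type $t ($(nal_name "$t")): safe seek target"
    else
        echo "type $t ($(nal_name "$t")): flagged as keyframe, may not decode cleanly after a seek"
    fi
done
```

The type numbers themselves should be visible by running the stream through ffmpeg's trace_headers bitstream filter (something along the lines of -c copy -bsf:v trace_headers -f null -) and looking for nal_unit_type in the log, though I haven't been able to verify the exact output here.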
Second side note: WHY IS ANYONE TRANSCODING HDR FILES TO H.264??? Killin' me over here....