Jump to content

ExtractionThreadCount not honored in 4.7.x?


jasonb1

Recommended Posts

jasonb1

I had the following in my encoding.xml in 4.6.7.0 (Windows Server):

  <ExtractionThreadCount>8</ExtractionThreadCount>

And it worked causing the ffmpeg.exe command line to have "-threads 8" on it for the chapter image extraction allowing it to be 8 times faster at that task which is a huge win.

Now, in 4.7.1.0, I still have the setting in the encoding.xml and I see in the detection log that it is able to read this from the xml file. I've even looked at the "source code" via a .Net decompiler and it looks like the old code where it appears to pick up the value from the setting file and use that for building the command line, etc. What I see now in the logs is always "-threads -1" for the 'quick-extract-imageseries' process.

I guess this means it's now going through some other new code path (or something if not that)...

 - Jason

embyserver-63789292800.txt hardware_detection-63789525624.txt

Edited by jasonb1
Link to comment
Share on other sites

 

19 hours ago, jasonb1 said:

-threads -1

it's 1, not -1.

19 hours ago, jasonb1 said:

And it worked causing the ffmpeg.exe command line to have "-threads 8" on it for the chapter image extraction allowing it to be 8 times faster at that task which is a huge win.

I think that you're mixing up things you might not be aware of. We have different ways for extracting chapter images or thumbnail images:

There's "extract-imageseries" and "quick-extract-imageseries". Which one is used depends on a number of details, but in most cases it is quick-extract-imageseries.
In earlier versions, quick-extract has been used in fewer cases (where we weren't sure enough that it would work always).

extract-imageseries

The normal extraction works by decoding the whole complete video and saving out the desired frames at each target time. For this method, we're using the ExtractionThreadCount value, which can in fact accelerate the extraction process - but it is still a long and expensive operation - even with 8 threads.

quick-extract-imageseries

This is our own method which works in a way that ffmpeg is not capable to do (the regular ffmpeg): It does not decode all frames of a video, it skips to key frames near to the desired target times and when the keyframe is not too far off, it uses the keyframe, otherwise it decodes a few frames to approach the desired target times.

I don't want to go deep into details, just the bottom line: This extraction procedure does not benefit from more decoding threads, and it's not just that, but it's even not allowed to use multiple decoding threads for this method, because it would lead to incorrect results.

Nonetheless, this method (with 1 thread) is much faster than the old extraction with 8 or even more threads. And it drastically saves CPU usage at the same time.

 

Such simple calculations like 8 threads > 8 times faster are almost never valid at all. Sometimes you can get close to that, but often not. Not every operation is well parallelizable. And then it depends on the details. For example, lets assume we want to parallelize the quick extraction:

More threads for decoding won't help, because we're extracting key frames, which are full frames and don't require much CPU processing (like all the other frames between two key frames). So how could be parallelize?

Well, we could for example start 4 ffmpeg processes. Assuming the video has 16 chapters, the first process should extract chapters 1,2,3,4, then second process 5,6,7,8 and so on..

But one of the details you need to know is that container formats are different and that affects the way how you can seek through a video(media) file. MKV files for example have seek indexes at the start or at the end or both. In that case, seeking goes fast and the parallelization I described could work and in fact accelerate the extraction. But with other containers, it can be required to parse through large portions of the file to seek to the desired points. And with the described parallelization, 4 processes would need to parse through the first quarter, 3 processes need to parse through the first and second quarter, 2 processes through first, second and third quarter, etc.
That means a lot for work would be duplicated, file IO would be multiplied and when the file is no a remote location, this would also multiply network IO and the end result would be:

The parallelized extraction would be much slower than the single one going once through the file from start to end.

Finally, we DON'T want to put users' systems under full load for doing image extraction. This is something that should happen in the background without affecting general server operation and specifically not Emby Server operations. We have just little means to control external processes like ffmpeg and we don't want to affect users' playback experience (e.g. after adding new files to the library) by ongoing image extractions.

Edited by softworkz
Link to comment
Share on other sites

jasonb1

I'm talking about the 10 second image extraction which clearly benefited from using more threads as I have evidence from previous version as I saw it going much faster and finishing sooner (using ~45% of my CPU with 8 threads on my particular hardware instead of the measly ~5% now). I have a powerful server machine and I want it to take the resources (that's my decision therefore should be optional if it's not the default that you want). I'm a C++ concurrency developer for 17+ years so I know about these issues. For example (a different example that I haven't complained about yet) when you guys do scans you call ffmpeg.exe once per file serially for extraction/probing, etc. and that frankly is "not ideal" when I have a very powerful machine where you could be doing multiple at once. I can understand in general you might not want to enable this for everyone, but people like me with certain usage patterns can benefit from more concurrency than you currently have.

The current extraction for the 10 second thing is clearly slower and the nearly the same speed as it was in previous versions when only using one thread, so the changes you've made here haven't improved anything and have taken away a performance tweak that was previously possible.

Please reconsider adding something back for use power users (both in scenario and hardware capabilities).

Thanks.

 - Jason

Link to comment
Share on other sites

1 minute ago, jasonb1 said:

The current extraction for the 10 second thing is clearly slower and the nearly the same speed as it was in previous versions when only using one thread, so the changes you've made here haven't improved anything and have taken away a performance tweak that was previously possible.

Please show both ffmpeg command to compare.

Thanks

Link to comment
Share on other sites

4 minutes ago, jasonb1 said:

(using ~45% of my CPU with 8 threads on my particular hardware instead of the measly ~5% now)

That doesn't mean it's faster with 45% CPU.

It's the whole point of the new method that it doesn't need to do the same amount of work.

Edited by softworkz
Link to comment
Share on other sites

jasonb1

It was for sure faster before with the "-threads 8" on the command line compared with the default "-threads 1" (this is for the bif extraction of course) and that's something I not only measured by seeing how fast the detail page was updated but also I measured manually running such an ffmpeg extraction on the command line with the different threads params and it was indeed "8x" faster with 8 threads (I don't have the exact ratio right now but it was extremely significant). I'm not making this up and you could easily test this with any video file with the old ffmpeg.exe to confirm.

I guess that brings up something weird then. Sometimes I was initially thinking it was similarly fast in 4.7.x but then sometimes it seems really slow. I think you might then have an issue with certain video files where it's still not able to be as fast as you'd like with this new algorithm.

However, I even if you have something new that is supposedly faster even single threaded, one can clearly encode those thumbnail frames concurrently to make it faster. Also I can guarantee that you can write out the temp .jpg files to disk faster if done concurrently. I assume that's what the old ffmpeg.exe was doing with that threads command line (could be bad assumption, but how else was it able to get massive perf gain?).

Link to comment
Share on other sites

It seems you are not aware of the dimensions I'm talking about, so please look at this:

Classic Extraction - 1 Thread

ffmpeg -threads 1 -i "U:\TestMedia\Movies\Big Buck Bunny\Big Buck's Bunny.mp4" -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 -vsync cfr C:\tmp\out2\xmg_%05d.jpg

205 s

Classic Extraction - 8 Threads

ffmpeg -threads 8 -i "U:\TestMedia\Movies\Big Buck Bunny\Big Buck's Bunny.mp4" -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 -vsync cfr C:\tmp\out2\xmg_%05d.jpg

91 s

ffmpeg -threads 16 -i "U:\TestMedia\Movies\Big Buck Bunny\Big Buck's Bunny.mp4" -an -sn -threads 0 -vf "fps=fps=1/10,scale=min(iw\,320):trunc(ow/dar/2)*2" -f image2 -vsync cfr C:\tmp\out2\xmg_%05d.jpg

74 s

Quick Extraction - 1 Thread

ffmpeg -threads 1 -skip_interval 10 -i "U:\TestMedia\Movies\Big Buck Bunny\Big Buck's Bunny.mp4" -copyts -s 320x180 -f image2 -r 0.1 -vsync cfr C:\tmp\out1\xmg_%05d.jpg

4.7 s

Quick Extraction - 8 Threads

ffmpeg -threads 8 -skip_interval 10 -i "U:\TestMedia\Movies\Big Buck Bunny\Big Buck's Bunny.mp4" -copyts -s 320x180 -f image2 -r 0.1 -vsync cfr C:\tmp\out1\xmg_%05d.jpg

4.7 s - but sometimes incorrect results

 

If you want to reproduce, it's the BigBuckBunny video with H.264, 4k, 60fps

  • Thanks 1
Link to comment
Share on other sites

jasonb1

I'm sorry, but i just looked closer at the log I attached and this was one where the 'quick-extract-imageseries' process were not slow (seemingly a few seconds each). I'll have to dig through some logs to see if there's still a slow one in there. Hmm...

Link to comment
Share on other sites

18 minutes ago, jasonb1 said:

However, I even if you have something new that is supposedly faster even single threaded, one can clearly encode those thumbnail frames concurrently to make it faster.

You can't because you get them only one-by-one. But the mjpeg decoder works multi-threaded, mean each single frame can be encoded multi-threaded.

The thread parameter before the input affects the decoder, not the encoder. The default for the encoder is threads 0 which means auto-selection depending on CPU cores.

 

Edited by softworkz
Link to comment
Share on other sites

You're welcome to do so.

I've worked hard to implement the quick image extraction in our ffmpeg. None of our competition can extract images as fast as we can do and as you can see above: increasing the threads value doesn't improve this (and even can cause incorrect results).

When you see classic extractions (which you can recognize from the "fps..." filter) which don't show your configured ExtractionThreadCount value, then that would be in fact a regression bug.

Link to comment
Share on other sites

rbjtech

A timely thread - @Cheesegeezer and I are just constructing an HDR>SDR Chapters/Thumbnail generator plugin based on the 'quick-image-series' emby ffmpeg syntax (with tonemapping added into the mix). 

I've been using this as a script for some time now with perfect results - but did wonder why threads was set to 1 - now I know ;)

  • Like 1
Link to comment
Share on other sites

jasonb1

Sorry for my initially poor report and analysis.

After finding a slow one again, it seems that it's just that certain videos are really fast and certain random ones are super slow. Basically a few seconds for a ~1GB file vs. ~1min for a ~400MB file. Both files were just encoded with the same parameters with NVENC (approximately 720p, 2500mbps, preserve original FPS). I don't think I can send you the video so maybe try watching out for something like this in your own collection.

Attached are the fast and slow logs from manually running the same commands the server did if that helps. The FPS is different on the slow one (is that a clue?).

slow.txt fast.txt

Link to comment
Share on other sites

Hm, I can't tell without looking at the file. What I can tell is that it all depends on the key frame distribution in the files.

If you don't want to send them, you could use EDD file creation with the following steps?

  • Prerequisites
    • Update to Emby Server 4.7.2
    • Install the Diagnostics Plugin from the catalog
    • Reboot
  • EDD Generation
    • Go to "Diagnostic Options" on the server dashboard
    • Scroll to the bottom where it says: Diagnostic Data Generation
    • Check all three boxes
    • Save
    • Play both files for a moment like 10s and stop
  • You will see two files in the log folder with an .edd extension
    It's a zip actually, you can extract and check its contents, but please don't modify.

You can send these via PM if you prefer.

Link to comment
Share on other sites

jasonb1

I was able to reproduce the problem with a 24fps sample clip I found on the Internet here:

https://s3.amazonaws.com/senkorasic.com/test-media/video/caminandes-llamigos/caminandes_llamigos_1080p.mp4

I re-encoded with MediaCoder to h.264 with its NVENC GPU support, 1280x720, 2500Kbps Average Bitrate, don't modify Frame Rate.

Something about NVENC is causing the problem in these files as it seems to be the same pattern for my files that are 24-25fps. If I force the NVENC re-encode to 29.97fps the problem disappears. Or, if I don't use NVENC, just the the software h.264 through ffmpeg (still in MediaCoder), also no problem.

I guess I'll just force the frame rate change on my encodings to fix the problem since I can't really tell the difference.

Attached is the broken (extremely significant extraction time compared to the original file) re-encode for you to look at in case it's a bug in the extraction code.

Edited by jasonb1
Link to comment
Share on other sites

jasonb1

The video in the link is the original I found and it extracts "nearly instantly", yes. The re-encoded one (the attachment) takes much longer (several seconds). In my actual examples on my server it's a couple of seconds vs. a minute for the problematic files, so I figured it's all relative to the size/duration of the file and that this example is good enough to show the differences. Also, the ffmpeg output for the attached file extraction is similar to my "slow.txt" output. In the "fast.txt," output there's very few of those "elapsed=" lines interspersed with the "Write output:" lines. In the "slow.txt", there's a an "elapsed=" line for almost every "Write output:" line. That's the same here for this small example, just fewer so I figured it was enough to figure out why there's a slowdown.

Now, I was using a command line that I pulled from a log for one of my real examples and modified (switched the input filename and output location for the .jpgs to a scratch location), so hopefully that doesn't throw off the results. If that's an issue, I guess I need to try with this sample original file and and re-encoded version of it actually in the server and check the times for each extraction in the log instead of just eyeballing it.

Link to comment
Share on other sites

This is how to runs on my side (250ms):

PS C:\Users\admin> C:\tmp\ff47\ffmpeg.exe -loglevel +timing -threads 1 -skip_interval 10 -i "C:\Users\admin\Downloads\caminandes_llamigos_720p_nvenc.mp4" -s 320x180 -f image2 -r 0.1 -vsync cfr C:\tmp\out1\xmg_%05d.jpg
23:04:31.852 ffmpeg version 5.0.0-emby_2022_05_07-g463c71b3b3+1847 Copyright (c) 2000-2022 the FFmpeg developers and softworkz for Emby LLC
23:04:31.853   built with gcc 10.3.0 (Rev5, Built by MSYS2 project)
23:04:31.854 Execution Date: 2022-06-05 23:04:31
23:04:31.870 Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'C:\Users\admin\Downloads\caminandes_llamigos_720p_nvenc.mp4':
23:04:31.871   Metadata:
23:04:31.871     major_brand     : M4V
23:04:31.872     minor_version   : 1
23:04:31.872     compatible_brands: isomavc1M4A M4V mp42
23:04:31.872     creation_time   : 2022-06-05T17:49:52.000000Z
23:04:31.873   Duration: 00:02:30.16, start: 0.000000, bitrate: 2410 kb/s
23:04:31.873   Stream #0:0[0x1](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1280x720 [SAR 8001:8000 DAR 889:500], 2342 kb/s, Level 31, 24 fps, 24 tbr, 24k tbn (default)
23:04:31.873     Metadata:
23:04:31.873       creation_time   : 2022-06-05T17:49:52.000000Z
23:04:31.873       handler_name    : 264#video:fps=24:par=8001:8000@GPAC0.7.0-rev0-gbd5c9af-master
23:04:31.873       vendor_id       : [0][0][0][0]
23:04:31.873   Stream #0:1[0x2](und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 67 kb/s (default)
23:04:31.873     Metadata:
23:04:31.874       creation_time   : 2022-06-05T17:49:45.000000Z
23:04:31.874       handler_name    : Sound Media Handler
23:04:31.874       vendor_id       : [0][0][0][0]
23:04:31.874 Stream mapping:
23:04:31.874   Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native))
23:04:31.874 Press [q] to stop, [?] for help
23:04:31.881 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98ca80e80] deprecated pixel format used, make sure you did set range correctly
23:04:31.882 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cfd7940] deprecated pixel format used, make sure you did set range correctly
23:04:31.882 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cffb340] deprecated pixel format used, make sure you did set range correctly
23:04:31.883 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98d029580] deprecated pixel format used, make sure you did set range correctly
23:04:31.883 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98d0662c0] deprecated pixel format used, make sure you did set range correctly
23:04:31.883 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98ce0d040] deprecated pixel format used, make sure you did set range correctly
23:04:31.884 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98ce34900] deprecated pixel format used, make sure you did set range correctly
23:04:31.884 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98ce69600] deprecated pixel format used, make sure you did set range correctly
23:04:31.884 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98ce9eac0] deprecated pixel format used, make sure you did set range correctly
23:04:31.885 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cc1d2c0] deprecated pixel format used, make sure you did set range correctly
23:04:31.885 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cfc55c0] deprecated pixel format used, make sure you did set range correctly
23:04:31.886 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cd0a700] deprecated pixel format used, make sure you did set range correctly
23:04:31.886 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cd28ac0] deprecated pixel format used, make sure you did set range correctly
23:04:31.887 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cd55a80] deprecated pixel format used, make sure you did set range correctly
23:04:31.887 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cdb0040] deprecated pixel format used, make sure you did set range correctly
23:04:31.887 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98cde5040] deprecated pixel format used, make sure you did set range correctly
23:04:31.888 [swscaler @ 000001d98ccc9b00] [swscaler @ 000001d98c72e6c0] deprecated pixel format used, make sure you did set range correctly
23:04:31.889 Output #0, image2, to 'C:\tmp\out1\xmg_%05d.jpg':
23:04:31.889   Metadata:
23:04:31.889     major_brand     : M4V
23:04:31.889     minor_version   : 1
23:04:31.889     compatible_brands: isomavc1M4A M4V mp42
23:04:31.889     encoder         : Lavf59.17.100
23:04:31.889   Stream #0:0(und): Video: mjpeg, yuvj420p(pc, progressive), 320x180 [SAR 8001:8000 DAR 889:500], q=2-31, 200 kb/s, 0.10 fps, 0.10 tbn (default)
23:04:31.889     Metadata:
23:04:31.889       creation_time   : 2022-06-05T17:49:52.000000Z
23:04:31.889       handler_name    : 264#video:fps=24:par=8001:8000@GPAC0.7.0-rev0-gbd5c9af-master
23:04:31.889       vendor_id       : [0][0][0][0]
23:04:31.890       encoder         : Lavc59.21.100 mjpeg
23:04:31.890     Side data:
23:04:31.890       cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
23:04:31.890 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0 rebased:0
23:04:31.891 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.0416667 rebased:0.041666/A throttle=off speed=   0x
23:04:31.893 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.0833333 rebased:0.083333
23:04:31.895 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.125 rebased:0.125
23:04:31.896 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.166667 rebased:0.166666
23:04:31.898 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.208333 rebased:0.208333
23:04:31.899 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.25 rebased:0.25
23:04:31.901 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.291667 rebased:0.291666
23:04:31.903 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.333333 rebased:0.333333
23:04:31.904 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.375 rebased:0.375
23:04:31.906 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.416667 rebased:0.416666
23:04:31.907 Skip-Option - SKIP   output: skip_pos:0 input_pts_time:0.458333 rebased:0.458333
23:04:31.909 Skip-Option - Write output: pkt_pts_time:0 pkt_dts_time:0 input_pts_time:0.5
23:04:31.922 Skip-Option - Write output: pkt_pts_time:10 pkt_dts_time:10 input_pts_time:10.4167
23:04:31.934 Skip-Option - Write output: pkt_pts_time:20 pkt_dts_time:20 input_pts_time:20.8333
23:04:31.945 Skip-Option - Write output: pkt_pts_time:30 pkt_dts_time:30 input_pts_time:31.25
23:04:31.955 Skip-Option - Write output: pkt_pts_time:40 pkt_dts_time:40 input_pts_time:41.6667
23:04:31.965 Skip-Option - Write output: pkt_pts_time:50 pkt_dts_time:50 input_pts_time:52.0833
23:04:31.978 Skip-Option - Write output: pkt_pts_time:60 pkt_dts_time:60 input_pts_time:62.5
23:04:31.989 Skip-Option - Write output: pkt_pts_time:70 pkt_dts_time:70 input_pts_time:72.9167
23:04:31.998 Skip-Option - Write output: pkt_pts_time:80 pkt_dts_time:80 input_pts_time:83.3333
23:04:32.006 Skip-Option - Write output: pkt_pts_time:90 pkt_dts_time:90 input_pts_time:93.75
23:04:32.013 Skip-Option - Write output: pkt_pts_time:100 pkt_dts_time:100 input_pts_time:104.167
23:04:32.027 Skip-Option - Write output: pkt_pts_time:110 pkt_dts_time:110 input_pts_time:114.583
23:04:32.039 Skip-Option - Write output: pkt_pts_time:120 pkt_dts_time:120 input_pts_time:125
23:04:32.054 Skip-Option - Write output: pkt_pts_time:130 pkt_dts_time:130 input_pts_time:135.417
23:04:32.081 Skip-Option - Write output: pkt_pts_time:140 pkt_dts_time:140 input_pts_time:145.833
23:04:32.082 23:04:32.082 elapsed=00:00:00.20 frame=   15 fps=0.0 q=1.6 Lsize=N/A time=00:02:30.00 bitrate=N/A dup=0 drop=12 throttle=off speed= 723x
23:04:32.082 video:180kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
    Last message repeated 1 times
23:04:32.083 EXIT

 

As I said: 200-250ms
Let's compare with yours.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...