Server GPU acceleration or CPU processing?

Guest asrequested

I've been looking at alternative options to using GPU hardware acceleration. My server presently has an i7 6700k. I've always favored Quick Sync, and for client use, it's great. But for those who need to transcode 'on the fly' to client apps, which is better, CPU or GPU? Presently, GPU transcoding is 'experimental' and for Live TV recording, it isn't available. In my case, I transcode all my Live TV recordings. While my recordings are ok, with two simultaneous transcodes I see my CPU get consumed. That said, sometimes I have 4 simultaneous transcodes and they all record just fine. My concern is that with the CPU so busy, what happens to the other functions that my machine needs to do? Personally, I don't like having 'just enough' resources. I like having head room.

 

Until recently, Intel Xeon CPUs were the common answer for home servers. They were very expensive, and you would often use dual processors. But now there have been developments with higher core/thread count CPUs for desktop PCs. The two main contenders are Intel and AMD. Intel has led the field for many years. Their CPUs are stable and well supported. But every once in a while, AMD will try to knock them off the throne. In a bid to do that once more, they developed the Ryzen line. These CPUs have no integrated GPU, but they have higher core/thread counts and are significantly cheaper than the equivalent Intel CPU. For those who want affordable raw processing, Ryzen seems to be the answer. But is it?

 

It really comes down to what your use requirements are. If you want a really stable Emby server for transcoding, then you'll probably want to avoid GPU acceleration and rely on straight processing with the CPU.

 

If you only have a few clients/apps in your home, you don't stream remotely and have a 'budget', then GPU transcoding may be something that will be beneficial.

 

Every hardware configuration will have varying results.

 

 

To the task at hand!

 

I am choosing to not use GPU acceleration and switch to CPU processing. I'm looking for something that will be a significant improvement over my i7 6700k but will not require selling my truck to pay for it.

 

A useful site for CPU comparison is http://cpu.userbenchmark.com/

 

I was looking at the AMD Ryzen Threadripper 1920X.

 

While that is a significant improvement (http://cpu.userbenchmark.com/Compare/Intel-Core-i7-6700K-vs-AMD-Ryzen-TR-1920X/3502vs3934), the Intel i7 7820X also offers a significant improvement over the i7 6700k, and at a lower cost than the Threadripper 1920X:

http://cpu.userbenchmark.com/Compare/Intel-Core--i7-7820X-vs-AMD-Ryzen-TR-1920X/3928vs3934

 

AMD also has a standard Ryzen line, in addition to the Threadrippers.

 

Comparing the i7 6700k to the Ryzen 7 1700X we see there are pros and cons. For multi-core processing, Ryzen wins. But for the regular processing, it loses something.

 

Comparing the i7 6700k to the i7 7820X, we see overall and significant improvement.

 

Which would you choose?

Edited by Doofus

legallink

Bang for buck I think you’ll struggle to beat a standard Ryzen like the 1700 in this use case. Transcoding is CPU intensive if you can’t offload to the GPU. The # of cores is directly related to the amount of processing power available provided that single threaded performance is comparable. As far as I know the Ryzen has a clear lead there. Threadrippers are nice but if you are talking 4 transcodes it seems like overkill. That being said 4K will also introduce processing requirements that really are still in their infancy of being addressed.

dcrdev

Depends on what you're using the machine for - technically:

 

Threadripper - ffmpeg will benefit more from multiple cores because it supports threading. Also bear in mind that the number of threads per physical core is directly related to the codec being used by ffmpeg, so h264 for example uses 1 thread per core, whereas h265 uses 2.
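For anyone who wants to try the threading point above, here is a rough sketch of how a CPU-only transcode command with an explicit thread count might be put together. `-threads` is a real ffmpeg option, but the file names and the `build_ffmpeg_cmd` helper are made up for illustration, and how many threads each encoder actually uses per core is something to measure on your own hardware rather than take from this snippet.

```python
import os


def build_ffmpeg_cmd(src, dst, codec="libx264", threads=None):
    """Return an ffmpeg argument list for a software (CPU-only) transcode.

    src, dst and this helper itself are hypothetical; only the ffmpeg
    options (-i, -c:v, -threads, -c:a) are real.
    """
    if threads is None:
        # Assumption: default to one thread per logical CPU.
        threads = os.cpu_count() or 1
    return [
        "ffmpeg",
        "-i", src,
        "-c:v", codec,             # software video encoder
        "-threads", str(threads),  # thread count handed to the encoder
        "-c:a", "copy",            # leave the audio stream untouched
        dst,
    ]


if __name__ == "__main__":
    # Print the command rather than run it, so this works without ffmpeg.
    print(" ".join(build_ffmpeg_cmd("recording.ts", "recording.mp4", threads=8)))
```

Running the same source file with a few different `threads` values and timing each run would give a better picture than any benchmark site.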

 

If you're going to be doing other stuff with your server that doesn't support threading, then Intel still has a slight lead in single core performance under most circumstances.

 

But then if you're concerned about price, then AMD is the clear winner - to be honest both are fine CPUs. If it were me then I'd choose Intel at this moment in time because Ryzen is a big leap for AMD and I don't feel that they have enough of a track record just yet to utilise in a server. For workstation/general computing I absolutely would (and probably will) do Threadripper.

Edited by dcrdev

PenkethBoy

I have never been a fan of Quick Sync/Gpu transcoding and have taken a slightly different route for now

 

I am going through all my media and transcoding it via Handbrake to more client-friendly versions (H264). I am doing this on my server, which has Emby installed. I was pleasantly surprised that Emby takes a minimal to no performance hit when Handbrake is transcoding while Emby does the same for a client or two - it just slows down the Handbrake transcodes, as they occur on lower priority threads. The server has a relatively underpowered i7-4790s but gets the job done. Yes, a more powerful CPU would run through the files quicker, but I just load up Handbrake with a hundred files or so, then leave it for two to three days and start again.
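The batch approach described above could also be scripted instead of queued by hand. This is only a sketch of the idea, not what the poster actually runs: it assumes the command-line build of Handbrake (`HandBrakeCLI`) is installed, and the file names, output folder, quality value and the `handbrake_queue` helper are all made up for illustration.

```python
from pathlib import Path


def handbrake_queue(sources, out_dir, quality=22):
    """Build one HandBrakeCLI command per source file (H.264 via x264).

    quality is a constant-quality value; 22 is an arbitrary example,
    not a recommendation from the thread.
    """
    out_dir = Path(out_dir)
    cmds = []
    for src in sorted(Path(s) for s in sources):
        dst = out_dir / (src.stem + ".mp4")
        cmds.append([
            "HandBrakeCLI",
            "-i", str(src),       # input file
            "-o", str(dst),       # output file
            "-e", "x264",         # H.264 software encoder
            "-q", str(quality),   # constant quality
        ])
    return cmds


if __name__ == "__main__":
    # Print the queue; each list could be passed to subprocess.run() one at a time.
    for cmd in handbrake_queue(["movie_b.mkv", "movie_a.avi"], "converted"):
        print(" ".join(cmd))
```

Running the commands one after another keeps a single encode on the CPU at a time, which matches the "load it up and leave it" workflow.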

 

an added benefit is that i have basically halved my storage needs by transcoding my media files :)

 

If i had not in the last year or so replaced my server and main pc (i7 6700k) i would be looking at Ryzen and TR in detail but chose the slower (and hopefully) more permanent solution without splashing the cash for now. 

 

If I had to choose a CPU today, I would get the Ryzen 1700 and overclock it when I needed the extra power that the 1800X would give. All the Ryzen 7 chips are the same spec, so I can't see the point of paying extra for a 1700X etc. One thing that apparently makes a "big" difference to Ryzen is the quality of the RAM you buy - there are noticeable improvements in scores with better high-speed RAM.

PenkethBoy

in your last link

 

for twice the power draw and almost twice the price not a fair comparison  :P

Guest asrequested

But it's a notable improvement for single core processing. As the machine will have multiple systems running on it, this is good for my needs. The Ryzen 7's only work well when multi-threading. In your link, they aren't testing the i7 7820X, so it's not much help, to me.

PenkethBoy

Curious why single thread is important when we are talking about transcoding and multiple systems work better with multi thread etc

 

No, the 7820X is not on the list, but other expensive Intel multi-thread processors are, and yes, some are faster than the Ryzens, but at the cost of extra power draw and a higher price.

 

If you want to go with the 7820X that's fine, but why ask for opinions if you are not going to consider other options - sounds like you have already made up your mind  :P

Guest asrequested

The advantage of the Ryzen is using multiple threads for one task. Other systems will only be using a single thread, possibly two. The Ryzen doesn't appear to be very good with that. For transcoding, the Ryzen is great, but for general use, not so much. The i7 7820X performs very well in both.

 

My OP is more focused on the Threadripper. I only mentioned the Ryzen 7 to show that it didn't fit my usage. Comparing it to what I already have, i7 6700k, I was showing that general use is poor. And trying to show the reason I'm not going to use it. :P

 

[Screenshot: 59d00c538c47c_Snapshot_238.jpg, comparing the i7 6700k to the Ryzen 7 1700X]

Edited by Doofus

PenkethBoy

Ok your choice - for me the comparison site you are using would not be my choice  :)

Guest asrequested

What I like about that site is that the results are of people at home testing their own machines, not a test environment. Real world results. That's more trustworthy, to me. But more importantly, it shows the results for Threadrippers and the new i7 and i9 CPUs, which are what I'm interested in. I want my server to be in excess of my needs. The other factor is that the motherboards for the Threadripper are more expensive than the Intel boards. 

Waldonnis

Be sure to consider that ffmpeg isn't exactly NUMA aware (TR is a multi-CPU configuration, after all).  If you go with Threadripper, you may have to run it in UMA "mode", which will introduce some latency.  It may not end up being very noticeable, but it's something to think about.  x265 does support thread assignment via --pools for NUMA configurations, but many other encoders do not (notably x264).
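To make the `--pools` idea a bit more concrete, here is a small sketch of how a pool string that keeps an x265 encode on a single NUMA node might be generated and passed through ffmpeg. In x265's pool syntax, "+" means "use all threads on this node" and "-" means "use none". Passing it via `-x265-params pools=...` is my assumption about how it would be wired up with ffmpeg's libx265 wrapper, so check it against your own build. The helper names and file names are hypothetical.

```python
def x265_pools_for_node(node, node_count=2):
    """Return an x265 pool string that enables only one NUMA node.

    For a two-node system, node 0 gives "+,-" and node 1 gives "-,+".
    """
    if not 0 <= node < node_count:
        raise ValueError("node must be in range(node_count)")
    return ",".join("+" if i == node else "-" for i in range(node_count))


def build_x265_cmd(src, dst, node, node_count=2):
    """Build an ffmpeg/libx265 command restricted to one NUMA node (sketch)."""
    pools = x265_pools_for_node(node, node_count)
    return [
        "ffmpeg",
        "-i", src,
        "-c:v", "libx265",
        "-x265-params", f"pools={pools}",  # assumption: forwarded to x265 as --pools
        "-c:a", "copy",
        dst,
    ]


if __name__ == "__main__":
    print(" ".join(build_x265_cmd("input.mkv", "output.mkv", node=0)))
```

Two of these running side by side, one per node, would be the "parallel encodes" case; a single encode with both nodes enabled would be the "whole package" case.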

Guest asrequested

What about the intel i7 7820X? You think that is more suited?

Waldonnis

I'm not sure, honestly. I've been hoping to see more info from actual TR owners about how much of an impact UMA has in real-life, non-gaming scenarios. I'd also be very interested in seeing x265 performance when using --pools across the entire TR package for a single encoding job compared to parallel encodes each assigned to a different NUMA node. Sadly, reviewers rarely test those scenarios and usually just stick to heavy gaming/synthetic benchmark runs and some very basic Handbrake preset testing. Those tests are just shy of useless to me for a whole host of reasons (source material choice, preset choice, etc.), so I don't have enough info to really draw any conclusions yet.

 

I'm hoping someone I know splurges and buys a TR so I can pester them to run some tests for me (I even wrote a small test suite for such things since most of my friends are non-techie gamers  :P ).

Edited by Waldonnis

Guest asrequested

Well, I'm happy you chimed in. I hadn't considered such things. Honestly, I wasn't even aware of them. But you are steering me more toward the Intel CPU. The Ryzens do appear to be more toward the gamers. Intel i7 7820X, it looks like you might have a new home :D

Waldonnis

I can't quite figure out who TR is for, personally.  I can see it being a very useful chip for the workstation crowd, but there are some design decisions that make me wonder why gamers would prefer it over even the regular Ryzen offerings (other than bragging rights/"epeen measuring").  I could put it to good use for parallel encoding, but most people I know would never benefit from a traditional multi-CPU setup, so TR makes no sense for them either.  Seems more like a "let's see if we can do this and see if folks will buy it" project from the engineers (full disclosure: I worked directly with AMD CEs in the past, and believe me, many of them had fun little experiments on the side like that).  Pricing is very attractive, at least, for the performance level.  My best guess is that it's AMD's first attempt in recent memory at trying to shake up the Intel-dominated prosumer/workstation market, but who knows.

 

Another thing I didn't mention before that I'm curious about is how compiler optimisation and instruction set processing factor into things.  (Micro-)Architecturally, AMD's approach is a bit different than Intel's even though they support the same instruction sets.  I've spotted a few discussions about AVX2 performance being somewhat disappointing, but that was a few months ago.  I don't think many (any?) current compilers produce optimised code for Ryzen quite yet, so there may be some performance gains seen in the future as kernel and compiler support matures.  It's something I haven't researched much so far, since I'm not in the position to purchase quite yet anyway, so work may be moving along in that department faster than I would expect.

 

Side note: If it were me, I'd actually wait a little longer if you can.  Intel's scrambling to "answer" Ryzen properly, even accelerating their own release timeline and lobbing out some chips that I don't quite understand the motivation or market for releasing.  AMD's certainly shaken things up, which is nice, but it's still too soon to see what Intel's real answer is going to be and how pricing will be affected.

Edited by Waldonnis

Guest asrequested

That sounds like good advice. It was going to be a little while before I was going to buy one, anyway. But it does seem as though Intel may have a better handle on this tech. Thanks a lot for your input. It's very insightful. 

Waldonnis

No problem  :)

 

From an encoding angle, I'd check out sites like doom9 and other video-centric forums frequented by the professional crowd.  There are a ton of folks in those circles that are always looking for performance gains and new tech/tools to help them with their work.  And here's to hoping that the patch/repo adding AMD's VCE support to ffmpeg gets cleaned up so it can be merged soon...that would add Windows hardware en/decoding to the discussion as well (VAAPI should already work).

Waldonnis

Had a thought this morning that may be helpful...

 

Since TR is really just two Ryzens in one package, you can probably compare Intel's single-CPU offerings to the Ryzen die used in the TR model you're considering (TR 1950X = a pair of 1800X, for example, so you could compare the 1800X to an i7 or i9).  That'll give you an idea of how a non-NUMA-aware application will perform in a NUMA configuration, as it will only use one NUMA node's worth of threads/memory per process.

 

Swapping it to UMA would certainly slow performance a bit, but will expose the whole TR package as a single CPU and unify memory...and just how much of a performance hit will be seen depends on a lot of factors that are quite technical and involved.  The increased (doubled) thread count may somewhat offset the latency introduced when encoding, though - it just wouldn't be as efficient as a single-die/bus solution with that many cores would be.  It's why I'm interested in actual real-world reports rather than reviewers' benchmarks.  I want to see just how much it "costs" in non-gaming scenarios to go with UMA vs. NUMA when using non-NUMA-aware software/tasks.

Guest asrequested

So you're saying that there may be multithread improvement but for single or quad core they would be even slower?

Waldonnis

Pretty much.  If you look at a TR motherboard, you'll see two banks of DIMM slots.  Each of those corresponds to a die in the CPU.  As such, each half of TR has fast and direct access to half of the total system memory via a memory controller located on the respective die. Each "half" of that die/memory is pretty much a NUMA node, since it's self-contained.  So, if you had a TR system with 64GB total RAM (32GB per bank), each node would have access to 32GB of RAM, and access to its RAM bank would be the fastest it can be.

 

In a basic UMA scenario, that "partitioning" is erased artificially and presented like a unified package, combining the RAM banks and dies' cores into one "virtual" CPU and memory bank (so you'd have 64GB RAM available to any given core, and double the cores available to spread threads across). The kicker is, if a core on die B tries to access info in die A's memory, it introduces latency since it has to go "farther" to get the info (it has to travel farther electrically, and has to interact with the opposing die's memory controller). AMD has made an interesting choice in its UMA implementation by making all memory access use the "other node's" memory controller/DRAM, meaning all memory access is slower than if a given core used its own die's memory bus (there's a short high-level overview of this concept at this link, complete with a few slides from AMD). The positive side is that it's predictably slower, so schedulers don't require any additional intelligence when allocating resources (e.g. a scheduler doesn't have to know how to "prefer" certain DRAM banks for memory allocation to ensure they're closer to the cores it has assigned the task to). As stated, the negative side is that all memory operations are slower. The added latency for memory access is reported to be ~20ns on average, which is actually not bad at all, but it's still added latency...and a ~30% latency increase compared to NUMA memory access (note: that doesn't mean everything will be ~30% slower, just that the access latency is higher).
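To put the "not everything gets ~30% slower" note into numbers, here is a deliberately crude back-of-the-envelope model. It assumes that only the share of run time spent waiting on memory is stretched by the higher latency and that everything else is unaffected. Both that assumption and the example fractions are mine, not measurements from AMD or from anyone in this thread.

```python
def estimated_slowdown(memory_bound_fraction, latency_increase=0.30):
    """Rough overall slowdown factor when memory latency goes up.

    memory_bound_fraction: share of run time spent stalled on memory (0..1).
    latency_increase: relative increase in access latency (0.30 = ~30%).
    Toy model: only the memory-stalled share of the run time is stretched.
    """
    if not 0.0 <= memory_bound_fraction <= 1.0:
        raise ValueError("memory_bound_fraction must be between 0 and 1")
    return 1.0 + memory_bound_fraction * latency_increase


if __name__ == "__main__":
    # The fractions below are made-up illustrations, not encoder measurements.
    for fraction in (0.05, 0.20, 0.50):
        factor = estimated_slowdown(fraction)
        print(f"{fraction:.0%} memory-bound -> about {factor:.3f}x run time")
```

In this model a job that spends a fifth of its time waiting on memory ends up roughly 6% slower, not 30%. Real encoders have caches and prefetching in play, so actual numbers could land well away from this.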

 

If you had a task that spread threads across all of the cores of a TR in UMA mode, there's no question that a single encode would run faster when spread across 32 threads in UMA vs. 16 in NUMA. The question is really how much performance is lost when doing so vs. a single-die CPU with 32 threads...or even compared to two parallel encodes each assigned to different NUMA nodes. I've seen some slides and benchmarks that suggest that the performance hit isn't really bad at all, but they were all gaming-related tests which are a completely different type of workload.
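For the "two parallel encodes, one per NUMA node" case, a Linux box could pin each job with `numactl`. The sketch below only builds the commands. `--cpunodebind` and `--membind` are real numactl options, but the file names, the choice of libx264 and the `per_node_encode_cmds` helper are assumptions for illustration, and on Windows you would need a different mechanism altogether.

```python
def per_node_encode_cmds(jobs):
    """Build one numactl-pinned ffmpeg command per NUMA node.

    jobs: list of (source, destination) pairs; job 0 is bound to node 0,
    job 1 to node 1, and so on. Assumes there are at least as many NUMA
    nodes as jobs.
    """
    cmds = []
    for node, (src, dst) in enumerate(jobs):
        cmds.append([
            "numactl",
            f"--cpunodebind={node}",  # run only on this node's cores
            f"--membind={node}",      # allocate only from this node's memory
            "ffmpeg", "-i", src, "-c:v", "libx264", "-c:a", "copy", dst,
        ])
    return cmds


if __name__ == "__main__":
    jobs = [("show_a.ts", "show_a.mp4"), ("show_b.ts", "show_b.mp4")]
    for cmd in per_node_encode_cmds(jobs):
        # Each command could be started with subprocess.Popen to run in parallel.
        print(" ".join(cmd))
```

Timing this against a single unpinned encode of the same files in UMA mode would be exactly the kind of real-world comparison that reviewers tend to skip.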

Guest asrequested

So given that the Intel CPUs are quicker with single and quad core processing, does this mean they function differently to the TR? The motherboard I'm looking at has those two banks of memory. Are they also two dies put together?

Waldonnis

I honestly don't know. I haven't looked at Skylake-X (or KL-X) in detail yet and have only read gaming-level reviews so far, which don't go into stuff like that. I do know they were talking about using a new mesh approach at one point, so it's possible that it will be more NUMA-like, but it depends on the memory access and core layout on the die (just pulled up the die pic, but need to stare at it a bit). My hunch is that it won't be a NUMA arrangement like TR is, but I can't say that for sure yet, nor how the latency works out for "far core" memory access compared to AMD's UMA (likely less, but I know next to nothing about Intel's mesh solution).

 

Looking at the die and glancing at a more technical summary of the layout, it looks like they're using an arrangement that I would call "NUMA-lite" (reminds me of a Phi in some ways), as they clearly got rid of the ring bus(es). Since it's one die, though, and smaller than TR, latency in a far core situation should be less in theory since the path from a given core to the farthest core or memory controller is electrically "shorter". Of course, there's a lot more at play here and I don't know enough about their mesh or core microarchitecture to have a solid grasp on the situation. It seems clear that the CPU will present itself as a single processor (so, basically just like UMA) to the OS. The memory arrangement appears to be a way of overcoming CPU-wide memory access latency by establishing banks of memory that are closer, but not dedicated to cores like a NUMA solution would be. This is a rather interesting architecture...I'll have to look at more info about it since it means formerly enterprise-only arrangements are finding their way to prosumer chips (and will eventually make it to some mainstream consumer chips as well).

 

If I were to compare TR to one of these things, I'd definitely seek out benchmarks/testing that focuses on TR in UMA mode vs. the KL-X and SL-X offerings...if leaving TR in NUMA mode is not supported or ideal for your applications/workloads.

Guest asrequested

Fantastic information! You seem to be confirming my suspicions that Intel is making what I consider to be better architecture. I'd love to hear what you discover about the SL-X and KL-X CPUs. I'm pretty much sold on them, but you have such good information. I would do research on this, but I likely won't understand a good portion of what I would read.
