AMD Kaveri benchmark
Final Review Update: New information disclosed at APU13 and CES 2014, corrections to initial estimations, and final benchmarks are available at the end.
Abstract: Chip-maker AMD will release a new family of APUs by the end of 2013. This new family will be named Kaveri and will replace the current Richland family of APUs. In this article I consider Kaveri performance using all the information that AMD has disclosed as well as my best guesses on the parts that remain unknown. I estimate that the top Kaveri APU will have a CPU clocked at 4 GHz with a maximum performance of 128 GFLOP and a GPU clocked at 900 MHz with 922 GFLOP, giving a total of 1050 GFLOP for the whole APU. Combining all this data, I predict that the CPU of the top Kaveri APU will be about 26% faster than the top Trinity APU and about 17% faster than the top Richland APU. With traditional software, this would put the multi-threaded performance of the CPU of the new quad-core Kaveri APU at the same level as an Intel quad-core i5 or a six-core AMD FX. The new Kaveri APU will show its real strength with HSA software, which will exploit the performance of both the CPU and the GPU. With HSA-enabled software, Kaveri has the potential to be much faster than an Intel i7 or an eight-core AMD FX. Some developers are finding accelerations of up to 500% when enabling HSA. A collection of APU and CPU benchmarks and scores is given.
Chip-maker AMD will release a new family of APUs by the end of 2013. This new family will be named Kaveri and will replace the current Richland family of APUs. Kaveri will combine two to four 28nm CPU cores, based on the Steamroller architecture plus HSA improvements, with an HSA-enabled Graphics Core Next GPU. This HSA Graphics Core Next architecture can work as a traditional graphics card (rendering graphics) or as a co-processor (computing parallel tasks). Kaveri will also introduce a uniform memory model, dubbed hUMA, that will allow both the CPU and the GPU to access a common memory pool.
Many sites report some of the specs of the upcoming Kaveri. In this article I will consider Kaveri performance using all the information that AMD has disclosed as well as my best guesses on the parts that remain unknown.
During the Hot Chips 2012 conference, AMD said that the new Steamroller modules will provide a 30% gain in IPC over the initial Bulldozer modular design. Piledriver introduced about an 8% gain over Bulldozer at the same clocks. This means that Steamroller will introduce about a 20% IPC gain over Piledriver, because 1.30 = 1.08 × 1.20 rounded to two decimal digits. We know that the shared decoder used in both Bulldozer and Piledriver introduces about a 20% penalty compared with a non-clustered core design; we know this from comparing the performance of a two-threaded workload running on two cores in the same module against the performance when each core is in a different module 1. Of course, Steamroller could be faster, as some rumors suggest; I discuss this possibility below.
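The compounding of the two IPC gains can be checked with a few lines of Python. This is only a sketch of the arithmetic: the names are mine, and the inputs are the two figures quoted above (30% for Steamroller and 8% for Piledriver, both relative to Bulldozer).

```python
# Relative IPC, normalized to Bulldozer = 1.00 (figures quoted in the text).
STEAMROLLER_VS_BULLDOZER = 1.30
PILEDRIVER_VS_BULLDOZER = 1.08


def relative_gain(new: float, old: float) -> float:
    """Fractional gain of `new` over `old`, both measured against the same baseline."""
    return new / old - 1.0


steamroller_vs_piledriver = relative_gain(
    STEAMROLLER_VS_BULLDOZER, PILEDRIVER_VS_BULLDOZER
)
print(f"Steamroller over Piledriver: {steamroller_vs_piledriver:.1%}")
```

The result is slightly above 20%, which is why the text rounds it to "about a 20% IPC gain".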
AMD has not revealed the frequencies of the new APUs. However, we know that the top Kaveri APU will have a total performance of 1050 GFLOP, rounded to the nearest integer. During the Hot Chips 2012 conference, AMD said that Steamroller will have an FPU with two 128-bit-wide FMAC units plus an MMX unit. This is the same configuration as Piledriver, except that the Steamroller FPU will be streamlined to save die space. Each FMAC unit will be capable of up to 8 FLOP per cycle using FMA4 instructions; those are single-precision (SP) computations. Since the two FMAC units of a module are shared between its two cores, this amounts to 8 FLOP per cycle per core. Therefore, the formula to obtain the maximum floating point performance of the quad-core CPU is 4 cores × 8 FLOP/cycle/core × CPU frequency (cycles/s).
AMD has revealed that the top Kaveri APU will include a GPU with 512 unified shaders. Each shader can run up to 2 FLOP per cycle of SP computations, which implies the following formula for the maximum performance of the top Kaveri GPU: 512 shaders × 2 FLOP/cycle/shader × GPU frequency (cycles/s). The following table gives three possible combinations of frequencies for the CPU and the GPU whose combined performance gives the claimed 1050 GFLOP.
| GHz CPU | GFLOP CPU | GFLOP GPU † | MHz GPU |
† The combined performance of the CPU and the GPU gives a total of 1050 GFLOP, rounded to the nearest integer.
AMD sells Radeon discrete graphics cards with core frequencies of 897, 900, and 907 MHz; of these, the Radeon HD 7750, based on the GCN architecture, has 512 shaders at 900 MHz. Selecting this frequency for the top Kaveri GPU implies a CPU clocked at 4 GHz, a frequency between those of the Trinity A10-5800K and the Richland A10-6800K. The very small down-clock from 4.1 GHz is more than compensated by the new Steamroller architecture.
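The two peak-performance formulas and the 1050 GFLOP constraint can be put together in a short Python sketch. The function names are mine and purely illustrative; the only inputs are the figures quoted above (4 cores at 8 FLOP per cycle, 512 shaders at 2 FLOP per cycle, and the three candidate GPU clocks).

```python
CPU_CORES = 4
CPU_FLOP_PER_CYCLE_PER_CORE = 8      # single precision, FMA4
GPU_SHADERS = 512
GPU_FLOP_PER_CYCLE_PER_SHADER = 2    # single precision
TOTAL_GFLOP = 1050.0                 # AMD's claimed figure for the whole APU


def cpu_gflop(ghz: float) -> float:
    """Peak SP GFLOP of the quad-core CPU at a clock given in GHz."""
    return CPU_CORES * CPU_FLOP_PER_CYCLE_PER_CORE * ghz


def gpu_gflop(mhz: float) -> float:
    """Peak SP GFLOP of the 512-shader GPU at a clock given in MHz."""
    return GPU_SHADERS * GPU_FLOP_PER_CYCLE_PER_SHADER * mhz / 1000.0


def implied_cpu_ghz(gpu_mhz: float, total: float = TOTAL_GFLOP) -> float:
    """CPU clock that would make CPU + GPU add up to `total` GFLOP."""
    return (total - gpu_gflop(gpu_mhz)) / (CPU_CORES * CPU_FLOP_PER_CYCLE_PER_CORE)


for gpu_mhz in (897, 900, 907):
    ghz = implied_cpu_ghz(gpu_mhz)
    print(f"GPU {gpu_mhz} MHz -> {gpu_gflop(gpu_mhz):.1f} GFLOP, "
          f"CPU {ghz:.2f} GHz -> {cpu_gflop(ghz):.1f} GFLOP")
```

Because the 1050 GFLOP figure is itself rounded, the implied CPU clocks are only meaningful to about one decimal place.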
Combining all this data 2, I predict that the CPU of the top Kaveri APU will be about 26% faster than the top Trinity APU and about 17% faster than the top Richland APU. This would put the multi-threaded performance of the CPU of the new quad-core Kaveri APU at about the same level as an Intel quad-core i5 or a six-core AMD FX. I estimate that the Kaveri quad-core APU will have a PassMark CPU score of about 6000 points. Next, I add a collection of CPU benchmarks with estimations of the performance of the top Kaveri APU compared to the competition.
Kaveri scores for the CPU are obtained by taking Trinity/Richland scores as a base and applying the assumed 20% gain from doubling the decoders per module, minus a correction factor of 5%. This correction factor is a safety margin that accounts for 'systematic' variations in benchmark scores caused by different compiler support for bdver2 flags. Those variations are of the order of 3% to 4%. In practice, I am assuming a worst-case scenario where the base scores would be above the average, and thus scaling them up by 15% instead of 20%. I assume that the scores obtained in this way for Kaveri are conservative and that the real silicon will probably perform better thanks to improvements from the new bdver3 flags for the Steamroller architecture and hardware improvements not considered here, such as a superior memory subsystem.
The above benchmarks use traditional software, which only uses the performance of the CPU and ignores the rest of the APU. The new Kaveri APU will show its real strength with HSA software, which will exploit the performance of both the CPU and the GPU.
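The 26% and 17% figures follow from combining the assumed 20% IPC gain with the ratio of clocks. A minimal sketch, assuming that performance scales linearly with both IPC and frequency, and assuming base clocks of 3.8 GHz for the A10-5800K and 4.1 GHz for the A10-6800K (the 3.8 GHz value is not stated in this article; the names are mine):

```python
IPC_GAIN = 1.20      # assumed Steamroller-over-Piledriver IPC ratio
KAVERI_GHZ = 4.0     # estimated clock of the top Kaveri CPU

# Base clocks in GHz; the Trinity value is an assumption not given in the text.
BASE_CLOCKS_GHZ = {
    "Trinity A10-5800K": 3.8,
    "Richland A10-6800K": 4.1,
}


def predicted_speedup(base_ghz: float,
                      new_ghz: float = KAVERI_GHZ,
                      ipc_gain: float = IPC_GAIN) -> float:
    """Fractional speedup, assuming linear scaling with IPC and clock."""
    return ipc_gain * new_ghz / base_ghz - 1.0


for name, ghz in BASE_CLOCKS_GHZ.items():
    print(f"Kaveri vs {name}: {predicted_speedup(ghz):+.0%}")
```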
AMD has shown, during the Hot Chips 2013 conference, the acceleration that ordinary applications receive when HSA is enabled. Below I show an estimation of the acceleration that Kaveri will provide in an algorithm that analyzes images to detect faces. This estimation is obtained from data disclosed by AMD for an older VLIW4 architecture GPU with six compute units at 685MHz. I corrected the observed acceleration by 8/6, which accounts for the higher number of compute units of the new Kaveri APU. I ignored further corrections arising from the fact that Kaveri will use a new GCN architecture (which is much faster at compute than VLIW4) and the new hUMA memory subsystem. Again, I assume that the HSA score obtained in this way for Kaveri is conservative and probably the real silicon will perform much better.
Many other massively parallel algorithms 3, from file compression to video encoding or game physics simulations, will receive similar accelerations. With HSA-enabled software, Kaveri has the potential to be much faster than an Intel i7 or an eight-core AMD FX 4. Some developers are finding accelerations of up to 500% when enabling HSA. During Hot Chips 2013 AMD showed a 5.8x increase for an HSA-enabled cloud server workload running on the old APU with six VLIW4 compute units at 685MHz. The new Kaveri APU would break the 6x increase easily, thanks to its eight GCN compute units clocked at around 900MHz 4.
I cannot estimate the performance of the GPU in Kaveri because crucial data is not known. The improvement in graphics performance over the Richland APU could be anywhere between 33% and 200%, because benchmarks are very sensitive to memory bandwidth and to the precise version of GCN used in the GPU (Richland uses the older VLIW4 architecture).
During the preparation of this article, AMD announced a revolutionary OS layer named MANTLE. This consists of a driver plus a low-level API for GCN graphics cards. It is highly probable that Kaveri will be fully compatible with MANTLE thanks to the GCN technology used in the integrated GPU. Being a low-level API, MANTLE is free of the overhead associated with bloated and inefficient APIs like Microsoft DirectX. During the presentation, AMD claimed up to 9x more draw calls compared to other APIs 5. The key here is weighing the importance of draw calls in the game engine. This is something I cannot evaluate: first, because it is engine and even game dependent; second, because few details are known about MANTLE at the time of writing.
Please note that the Kaveri APU has not yet been released and some of the information contained here can change at the last minute. Needless to say, any mistakes that might be found in this article are entirely my responsibility and cannot be attributed to AMD or to anyone else.
Some details of Kaveri were revealed at AMD's annual developer summit, APU13, held last year. The first surprise is that the 1050 GFLOP of performance that AMD announced for Kaveri at its 2012 Financial Analyst Day is now gone. Kaveri is finally an 856 GFLOP APU. As a consequence, the clocks for the iGPU computed above are now inaccurate. We do not need to recompute the clocks using the new 856 GFLOP figure, because the definitive clocks were disclosed at APU13: 3.7GHz for the CPU and 720MHz for the iGPU.
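As a consistency check, the final clocks can be plugged into the same peak-performance formulas used earlier in this article. This is a sketch with names of my own choosing; it assumes that the per-core and per-shader FLOP rates (8 and 2 FLOP per cycle) are unchanged in the final silicon.

```python
def apu_peak_gflop(cpu_ghz: float, gpu_mhz: float,
                   cores: int = 4, shaders: int = 512) -> tuple[float, float, float]:
    """Return (CPU, GPU, total) peak SP GFLOP for the given clocks."""
    cpu = cores * 8 * cpu_ghz               # 8 FLOP per cycle per core
    gpu = shaders * 2 * gpu_mhz / 1000.0    # 2 FLOP per cycle per shader
    return cpu, gpu, cpu + gpu


cpu, gpu, total = apu_peak_gflop(cpu_ghz=3.7, gpu_mhz=720)
print(f"CPU {cpu:.1f} + GPU {gpu:.1f} = {total:.1f} GFLOP")
```

The total rounds to the 856 GFLOP figure disclosed at APU13.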
Some have speculated that the transition from the PD-SOI process to bulk was responsible for this 18% reduction in total GFLOP, but this is not correct. If we pay attention to the Radeon HD 7750, we can observe that only the GDDR5 version has its cores clocked at 900MHz; the DDR3 version is an 800MHz card. The drop of the originally planned GDDR5 support for Kaveri is the main reason behind the 20% drop in the GPU clocks. The reduction in TDP from 100W (Trinity/Richland) to the final 95W for Kaveri can explain most of the additional reductions in clocks: from 4GHz to 3.7GHz for the CPU and from 800MHz to 720MHz for the GPU. The selection of the bulk process has a minor impact on base clocks, but will limit the overclocking possibilities of Kaveri. The Kaveri APUs will not break worldwide overclocking records as their predecessor Richland did.
The Kaveri scores estimated above assumed a 4GHz CPU; the GPU clocks were not used. However, in the above predictions I assumed that the Steamroller module would offer about 20% more IPC than Piledriver. There are additional changes in the Steamroller module not mentioned during the Hot Chips 2012 presentation. For instance, Gian Maria Forni affirms that the L2 cache in SteamrollerB is 20% faster than in Bulldozer/Piledriver.
Final silicon has just been benchmarked. In general, the measured values for the A10-7850K are in excellent agreement with the theoretical predictions. For instance, the difference between the value predicted for Kaveri at 4GHz (see above) and the value finally measured for Kaveri at 4GHz in the x264 benchmark is -7%; Kaveri is slightly slower than I expected in this test. For the C-Ray benchmark the difference is +5% (a lower score is better in this benchmark because it measures the seconds spent to finish the raytracing task); here Kaveri is slightly faster than expected. The difference is +13% for the Himeno benchmark, with Kaveri performing significantly faster than expected. This test is heavily dependent on floating point capabilities. The reduction of FP pipes, from 4 to 3, in the Steamroller architecture and the improved L2 cache probably explain the better score.
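The percentages quoted in this paragraph can be reproduced from the before and after figures. A small sketch, with a helper name of my own; all inputs are numbers already given in this article.

```python
def relative_drop(old: float, new: float) -> float:
    """Fractional reduction when going from `old` to `new`."""
    return (old - new) / old


reductions = {
    "total GFLOP (1050 -> 856)": relative_drop(1050, 856),
    "GPU clock, overall (900 -> 720 MHz)": relative_drop(900, 720),
    "GPU clock, GDDR5 -> DDR3 (900 -> 800 MHz)": relative_drop(900, 800),
    "GPU clock, TDP (800 -> 720 MHz)": relative_drop(800, 720),
    "CPU clock (4.0 -> 3.7 GHz)": relative_drop(4.0, 3.7),
    "TDP (100 -> 95 W)": relative_drop(100, 95),
}

for label, value in reductions.items():
    print(f"{label}: {value:.1%}")
```

Splitting the GPU figure in two steps shows that the overall 20% drop in GPU clock combines the memory-related step (900MHz to 800MHz) with the further step to 720MHz.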
Comparison of predicted CPU performance to measured performance of the top Kaveri APU
Kaveri performs better than I expected in some tests and worse in others, with an average deviation of 0% (average of -7%, 5%, and 1%). This is an excellent agreement between prediction and final measurements. It means that, on average, the Kaveri CPU performs as I predicted in this work. However, final Kaveri silicon is clocked at 3.7GHz, and stock scores are somewhat lower. For instance, Kaveri overclocked to 4GHz hits 94.45 in the x264 benchmark but only 90.34 at the stock frequency. Next, I present a series of benchmarks for Kaveri at stock.
In the first part of this article, above, I compared the estimations made for Kaveri to measurements of two Sandy Bridge i5 processors and one Ivy Bridge i5. This time, I have replaced the Sandy Bridge i5-2400S with the Haswell i5-4670. This way we can compare the AMD Steamroller architecture to three successive generations of Intel Core processors. Small differences between the scores shown below for the i5-2500K and the i5-3470 and the numbers used above in the first version of this article are a consequence of improvements, or an occasional regression, in new versions of the software used.
PassMark CPU scores are available for the top Kaveri APU and are in excellent agreement with the predicted 6000 points. The latest baselines at stock frequencies are 5839, 5786, and 5997, which average to 5874 points and differ from the predicted 6000 points by 2%. Again the error is insignificant.
A first version of this article, without the final benchmarks, was republished as "What to expect from kaveri: a detailed predictive analysis".
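The two x264 scores give a rough idea of how this benchmark scales with clock on Kaveri. The following sketch compares the score ratio with the clock ratio; it rests on a single pair of measurements, so it should be read as an illustration and not as a general scaling law. The variable names are mine.

```python
STOCK_GHZ, OVERCLOCK_GHZ = 3.7, 4.0
X264_STOCK, X264_OVERCLOCK = 90.34, 94.45   # scores quoted in the text

clock_ratio = OVERCLOCK_GHZ / STOCK_GHZ
score_ratio = X264_OVERCLOCK / X264_STOCK

# Fraction of the clock increase that shows up as a score increase.
scaling_efficiency = (score_ratio - 1.0) / (clock_ratio - 1.0)

print(f"clock: +{clock_ratio - 1.0:.1%}, score: +{score_ratio - 1.0:.1%}, "
      f"efficiency: {scaling_efficiency:.0%}")
```

In this test the score grows less than linearly with the CPU clock.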
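The PassMark comparison is a simple average and a relative error; a short sketch using the three baselines quoted above (the variable names are mine):

```python
PREDICTED_SCORE = 6000
baselines = [5839, 5786, 5997]   # stock-frequency PassMark CPU baselines

average = sum(baselines) / len(baselines)
relative_error = (PREDICTED_SCORE - average) / PREDICTED_SCORE

print(f"average = {average:.0f} points, "
      f"prediction off by {relative_error:.1%}")
```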
Acknowledgements: Benchmark data used in this article has been obtained by Michael Larabel from Phoronix.com. I thank both Marcus Pollice and Matthias Waldhauer for their useful insights and corrections to the first draft. I thank Matthias Waldhauer for additional corrections to the first version of this article. I also thank Gian Maria Forni for details about Steamroller L2 cache and for reporting some typos in the text.
- Strictly speaking, the 20% improvement from running two threads in two Piledriver modules cannot be completely translated to a dual-decoder module because some resources are still shared within the module, e.g. the fetcher or the L2 cache. A hypothetical Piledriver module with a double decoder would offer less than a 20% improvement. I think other improvements in the Steamroller module (fetch buffer, L1 cache, memory controller, etc.) will bring the gain up to approximately 20%. That is, I think that running two threads in a Steamroller module with a double decoder plus the other improvements should be close to running two threads in two Piledriver modules.
- I am taking 512 shaders for the GPU of the top Kaveri APU and the original claim that the new Steamroller modules will provide a 30% gain in IPC over the initial Bulldozer modules as a conservative base for estimating the performance of Kaveri. Some unnamed sources state that the Steamroller module will provide up to a 45% gain over the Bulldozer module. A recent presentation from AMD suggests that Steamroller has 3 ALUs per core, instead of the 2 ALUs per core of the Bulldozer/Piledriver design. Some supposedly leaked benchmarks suggest that Steamroller includes a new memory controller that provides more throughput, when paired with ordinary 1600MHz DDR3 memory, than Piledriver with fast 2400MHz DDR3. Whereas this improved bandwidth (if it is genuine) will be very helpful for the integrated GPU in Kaveri, the impact on the CPU should be small, probably about 2% to 6% depending on the task. Combining all those rumors, Steamroller could be up to 40% faster (approx. 35% from the module plus 4% from the memory subsystem) than Piledriver. In recent days, some rumors about a new top Kaveri APU model with a larger number of graphics cores have been hitting the web, but the number of CUs reported is even, which seems strange; moreover, it is not known if the rumors correspond to some kind of dual graphics configuration (APU + discrete graphics). Of course, AMD remains silent about all these rumors.
- The GPU of the Kaveri APU can manage up to a maximum of 20480 threads (2560 per CU, grouped in 40 wavefronts). However, each CU can execute 'only' one wavefront per cycle in sequences of four cycles. This means that the GPU can execute 2048 threads every four cycles.
- A hypothetical FX Steamroller CPU with 16 cores at 5 GHz could not execute the same task in less time than the Kaveri APU.
- We know that PCs with DirectX can show only a tenth of the performance of a console if you need a separate batch for each draw call. Microsoft introduced multi-threaded display lists in DirectX 11 to help reduce the overhead by up to a factor of two at the very best, if you are using high-performance CPU cores. The problem with this brute-force approach is that if you want each of your draw calls to be a bit different (e.g., if you are going to draw ten independent crates), then you typically cannot get over about 2000 to 3000 draw calls per frame, whereas console games can use 10000 to 20000 draw calls. MANTLE will improve the performance of both the CPU and the GPU. It will improve the GPU because it will eliminate the CPU bottleneck that arises somewhere around 2000 to 5000 draw calls. Freed from this bottleneck, the CPU can devote more resources to other tasks.
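The thread counts given in the note on the Kaveri GPU can be reproduced with a few multiplications. This sketch assumes eight compute units (as stated in the main text) and a wavefront of 64 threads, which follows from 2560 threads per CU divided by 40 wavefronts; the constant names are mine.

```python
COMPUTE_UNITS = 8
MAX_WAVEFRONTS_PER_CU = 40
THREADS_PER_WAVEFRONT = 2560 // MAX_WAVEFRONTS_PER_CU   # 64, derived from the note
WAVEFRONTS_PER_CU_PER_WINDOW = 4   # one wavefront per cycle over a four-cycle sequence

# Threads the GPU can keep resident at once.
threads_in_flight = COMPUTE_UNITS * MAX_WAVEFRONTS_PER_CU * THREADS_PER_WAVEFRONT

# Threads actually executed in each four-cycle window.
threads_per_window = (COMPUTE_UNITS * WAVEFRONTS_PER_CU_PER_WINDOW
                      * THREADS_PER_WAVEFRONT)

print(threads_in_flight, threads_per_window)
```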
Date: 2014 January 30, 11:46:53+01:00