More Opteron bus information


Following up on a discussion earlier this week, I wanted to lay out some additional performance comparisons between the IBM PowerPC 970 and the AMD Opteron.

Thanks to Dmitry for the pointer to this AnandTech article about the Opteron architecture.

Opteron memory philosophy

The Opteron's memory and processor interconnects are designed to be exploited in a multiprocessor environment without requiring extremely high-speed RAM.

A single processor has three point-to-point HyperTransport links (the same type of link that Apple uses to connect peripherals to its 970 system controller chip). Each of these connections has two 16-bit 3.2GBps unidirectional channels, for 6.4GBps of connection bandwidth between the CPU and whatever it is communicating with.
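To make the link arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The constants come straight from the figures above; the function names are mine and purely illustrative, and the per-CPU total is only a theoretical ceiling that assumes all three links are saturated in both directions at once.

```python
# Figures from the paragraph above; names are illustrative only.
HT_CHANNEL_GBPS = 3.2    # one 16-bit unidirectional channel
CHANNELS_PER_LINK = 2    # one channel in each direction
LINKS_PER_CPU = 3        # HyperTransport links on a single Opteron


def link_bandwidth_gbps(channel_gbps=HT_CHANNEL_GBPS, channels=CHANNELS_PER_LINK):
    """Aggregate bandwidth of one link, counting both directions."""
    return channel_gbps * channels


def cpu_ht_bandwidth_gbps(links=LINKS_PER_CPU):
    """Theoretical ceiling with every link saturated in both directions."""
    return link_bandwidth_gbps() * links


print(f"per link: {link_bandwidth_gbps():.1f} GBps")
print(f"per CPU:  {cpu_ht_bandwidth_gbps():.1f} GBps")
```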

The Opteron also has direct connections to dedicated DRAM (PC2700, DDR 333MHz). Because of this design, each CPU can access its own RAM at 5.3GBps at the same time as every other CPU. Hence, if all the data is in the right place, a 4-CPU system can pull 21.2GBps in aggregate... not shabby.
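The 5.3GBps figure falls out of the DDR333 transfer rate if you assume a 128-bit (dual-channel) path to memory. That width is my assumption here, not something stated above, and the helper names are made up for the sketch:

```python
DDR333_TRANSFERS_PER_SEC = 333e6  # PC2700 / DDR333: roughly 333 million transfers per second
MEMORY_BUS_BITS = 128             # assumption: two 64-bit channels per CPU


def dram_bandwidth_gbps(transfers_per_sec=DDR333_TRANSFERS_PER_SEC,
                        bus_bits=MEMORY_BUS_BITS):
    """Peak bandwidth in GBps: transfers per second times bytes per transfer."""
    return transfers_per_sec * (bus_bits / 8) / 1e9


def aggregate_gbps(cpus, per_cpu_gbps):
    """Best case: every CPU streams only from its own local RAM."""
    return cpus * per_cpu_gbps


per_cpu = round(dram_bandwidth_gbps(), 1)  # rounded the same way the text rounds it
for cpus in (1, 2, 4):
    print(f"{cpus} CPU(s): {aggregate_gbps(cpus, per_cpu):.1f} GBps")
```

The aggregate only holds when no CPU ever has to reach across HyperTransport for someone else's RAM.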

One last note: the Opteron uses a device called the XBAR switch (presumably short for crossbar switch) that routes requests from one HT link to another, to the memory controller, or to the CPU. Thus, if you need to reach memory attached to one CPU from another, you don't block that CPU unless it also needs something from the same memory, or its own traffic is traveling in the same direction on the same HT channel.

Apple's 970 architecture

Apple's architecture for the 970 is organized differently. The CPUs are connected to the System Controller chip by way of individual channels. Each of them can move 4GBps in each direction, or 8GBps theoretical.

The memory is shared between the CPUs via the System Controller chip and thus the combination of CPUs can access memory at a maximum rate of 6.4GBps.
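Here is a rough sketch of what that sharing means per CPU. It assumes the System Controller splits memory bandwidth evenly between whichever CPUs are busy, which is my simplification and not anything Apple has documented; the function name is illustrative.

```python
CPU_LINK_GBPS = 4.0   # per CPU, per direction (from the text above)
MEMORY_GBPS = 6.4     # total, shared through the System Controller


def per_cpu_read_gbps(active_cpus, cpu_link_gbps=CPU_LINK_GBPS,
                      memory_gbps=MEMORY_GBPS):
    """Read bandwidth each CPU sees, assuming an even split of memory bandwidth."""
    if active_cpus < 1:
        raise ValueError("need at least one active CPU")
    # A CPU is capped by its own link or by its share of memory, whichever is smaller.
    return min(cpu_link_gbps, memory_gbps / active_cpus)


for n in (1, 2):
    print(f"{n} active CPU(s): {per_cpu_read_gbps(n):.1f} GBps each")
```

Under that assumption a lone CPU is limited by its own 4GBps link, while two busy CPUs are limited by the shared memory bus.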

The Apple System Controller is itself a sophisticated switch that also serves as the memory controller for the PowerPC CPUs. Looking at the two architectures, you can consider the SC to be basically like the outskirts of an Opteron (memory controller + HyperTransport) with a few integrated pieces on it. In particular, it also handles AGP directly (normally, the Opteron CPUs would communicate with AGP via an AGP Tunnel over HyperTransport).

What does it mean?

Heck if I know!

Everything we're going to talk about here is pure speculation, and the software is going to play a large part in making the hardware perform at peak rates, especially on the Opteron. Because of its ability to gather data from memory at 10.6GBps when it is pulling from the local RAM of each of 2 CPUs, it theoretically smashes the Mac's 6.4GBps, even more so given that the Mac's CPUs are limited to 4GBps each for loads. However, that requires that all of the RAM accesses be local. The worst-case access for the AMD (going across the HT for RAM far away) will be 3.2GBps (per AMD's PowerPoint on the CPU), much lower than the 5.3GBps theoretical to the CPU.
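Real workloads will land somewhere between those two extremes. As a very rough way to picture it, the sketch below blends the local and remote rates by the fraction of bytes that happen to live in local RAM. The model is my own simplification: it assumes local and remote transfers don't overlap and it ignores latency entirely, so treat the numbers as illustrative only.

```python
LOCAL_GBPS = 5.3    # Opteron reading its own RAM
REMOTE_GBPS = 3.2   # Opteron reading RAM across an HT link


def effective_gbps(local_fraction, local_gbps=LOCAL_GBPS, remote_gbps=REMOTE_GBPS):
    """Effective bandwidth when `local_fraction` of the bytes come from local RAM.

    Time per byte is what gets averaged, so this is a weighted harmonic mean.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    time_per_gb = local_fraction / local_gbps + (1.0 - local_fraction) / remote_gbps
    return 1.0 / time_per_gb


for pct in (0, 25, 50, 75, 100):
    print(f"{pct:3d}% local: {effective_gbps(pct / 100):.2f} GBps per CPU")
```

The takeaway is just that the more the operating system can keep a process's data next to the CPU running it, the closer the Opteron gets to its headline number.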

However, the worst-case scenario is less likely to occur than a middle-of-the-road one. So, let's look at 2 CPUs in each machine, each reading data in and writing it out.

At this rate, the Opteron can clear 10.6GBps to its two separate sets of RAM, or about 6.4GBps if both CPUs are doing worst-case remote accesses.

The Mac, on the other hand, can move 6.4GBps to/from memory in total, shared between the two CPUs. This one looks like a win for the Opteron's local memory bus. Now, if Apple chose to put a second memory bus on the controller chip, it could move to a theoretical maximum of 12.8GBps to RAM, but that doesn't exist, so we might as well wonder what happens if the AMD folks move to PC3200 from PC2700...
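Just to line the what-ifs up side by side, here is a small sketch of the two-CPU totals, using the per-CPU figures quoted earlier (5.3GBps local, 3.2GBps remote). The PC3200 row assumes 3.2GBps per 64-bit channel (which is where the module name comes from) and two channels per CPU; the two-channel part is my assumption. Neither hypothetical row describes shipping hardware.

```python
CPUS = 2
PC3200_CHANNEL_GBPS = 3.2  # PC3200 is named for roughly 3200 MB/s per 64-bit channel
CHANNELS_PER_CPU = 2       # assumption: dual-channel memory on each Opteron

# Theoretical two-CPU totals in GBps; the hypothetical rows are pure speculation.
scenarios = {
    "Opteron, all local (PC2700)": CPUS * 5.3,
    "Opteron, all remote over HT": CPUS * 3.2,
    "Mac, single shared memory bus": 6.4,
    "Mac, hypothetical second bus": 2 * 6.4,
    "Opteron, hypothetical PC3200": CPUS * PC3200_CHANNEL_GBPS * CHANNELS_PER_CPU,
}

# Print the scenarios from fastest to slowest.
for name, gbps in sorted(scenarios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} {gbps:5.1f} GBps")
```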

For device access, Apple has handled all of this in the System Controller chip. Memory-to-AGP transfers take place at a maximum of 2.1GBps (a speed that can be easily maintained by the HT connections). AMD will also push AGP through HT and should be constrained by the AGP side, even if going across the HT to get memory. The only caveat I see here is possible memory and HT contention when the AGP (connected to CPU1) needs to use the same HT channel as CPU1 to talk to CPU2 about something. However, that should fall well under the speed of the HT for now.
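A quick sanity check on that caveat: the sketch below treats one direction of an HT link as a simple shared pipe and asks how much room is left once AGP is running flat out. The additive model and the function names are my own simplification, not a description of how the XBAR actually arbitrates traffic.

```python
HT_CHANNEL_GBPS = 3.2   # one direction of one HT link
AGP_GBPS = 2.1          # maximum memory-to-AGP rate quoted above


def ht_headroom_gbps(agp_gbps=AGP_GBPS, channel_gbps=HT_CHANNEL_GBPS):
    """Bandwidth left on the channel while AGP runs at its maximum rate."""
    return channel_gbps - agp_gbps


def is_contended(cpu_traffic_gbps, agp_gbps=AGP_GBPS, channel_gbps=HT_CHANNEL_GBPS):
    """True if AGP plus CPU-to-CPU traffic would exceed the channel."""
    return agp_gbps + cpu_traffic_gbps > channel_gbps


print(f"headroom: {ht_headroom_gbps():.1f} GBps")
print("1.0 GBps of CPU traffic contended?", is_contended(1.0))
print("1.5 GBps of CPU traffic contended?", is_contended(1.5))
```

So under this crude model there is about a gigabyte per second of slack before CPU1's own cross-CPU traffic starts fighting with the graphics card.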

Cache as Cache can

Another difference between the two CPUs is the amount of Level 2 (or L2) cache. The 970 comes in the door with 512KB of 64GBps cache. The Opteron shows up with 1024KB of cache (although I can't locate the speed).

Of course, size is one thing, speed another, and efficiency of use yet another. Again, though, the initial stats (putting aside speed since I can't find that statistic for the Opteron) appear to give the edge to the Opteron.

Conclusion

Did you really expect a conclusion? Without the machines here and real-world applications, it is very difficult to tell anything of substance. We know that the AltiVec units in the 970 provide stellar performance for certain types of operations, but we don't know how common those operations will be. Perhaps Windows will make good use of the multiple memory buses and the Opteron will squash the 970 like a grape on data access. Until we start seeing side-by-sides it is going to be hard to tell.

It's going to be an interesting fall.