# Vega hypothesis


For the last few days I have been thinking constantly about AMD’s upcoming Vega GPUs. We know little for certain so far, but AMD has actually given us a lot of data spread across different places, enough to extrapolate Vega’s performance. For this purpose I’ll use 2 supporting arguments, which I’ll outline below. If you think my reasoning is not sound, please provide adequate explanations and, if possible, reputable sources to support your objections (on this Reddit post).

Less than a month ago, I stumbled upon the article Memory bandwidth on Hacker News, in which the author talks about the effects of memory bandwidth and how it has developed over the years. The article contains a formula that gives the amount of bandwidth a single stream processor has available per clock cycle. Quite astoundingly, if we compare the numbers for AMD and Nvidia, we get very similar results.

For the entire GPU:

```
(Bytes/cycle) = (Memory bandwidth) / (Chip clock)
```

For a single stream processor (AMD) or CUDA core (Nvidia), which I’ll call work units:

```
(Bytes/Work Unit) = (Bytes/cycle) / (number of Work units)
```

Given the 2 formulas, we can derive the bandwidth for the following GPUs at base clock (data was taken from Geizhals.eu):

GPU | Work units | Chip clock (GHz) | Memory bandwidth (GiB/s) | Bytes/cycle | Bytes/cycle per work unit |
---|---|---|---|---|---|
GTX 1080TI | 3584 | 1.48 | 484 | 327 | 0.09 |
GTX 1070 | 1920 | 1.50 | 256 | 170 | 0.09 |
GTX 1060 | 1280 | 1.50 | 192 | 128 | 0.10 |
RX 480 | 2304 | 1.12 | 256 | 213 | 0.09 |
RX 470 | 2048 | 0.93 | 224 | 240 | 0.11 |
RX 460 | 896 | 0.92 | 112 | 102 | 0.11 |
Fury X | 4096 | 1.05 | 512 | 487 | 0.11 |
GTX 980TI | 2816 | 1.00 | 336 | 336 | 0.11 |

As we can see, a single work unit usually gets somewhere around 0.09 to 0.11 bytes per cycle. The higher the clock speed, the lower the ratio.
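As a rough sanity check, the two formulas can be sketched in Python. The function name and the choice of rows are mine, and the bandwidth figures are treated as 10^9 bytes per second, which is what dividing them by a clock in GHz implicitly assumes.

```python
def bytes_per_work_unit(bandwidth_gb_s, clock_ghz, work_units):
    """Return (bytes per cycle for the whole GPU, bytes per cycle per work unit)."""
    # GB/s divided by GHz cancels the 10^9 factor and leaves bytes per cycle.
    bytes_per_cycle = bandwidth_gb_s / clock_ghz
    return bytes_per_cycle, bytes_per_cycle / work_units


# Three rows taken from the table: (work units, clock in GHz, bandwidth).
gpus = {
    "GTX 1080TI": (3584, 1.48, 484),
    "GTX 1060": (1280, 1.50, 192),
    "Fury X": (4096, 1.05, 512),
}

for name, (units, clock, bandwidth) in gpus.items():
    per_cycle, per_unit = bytes_per_work_unit(bandwidth, clock, units)
    print(f"{name}: {per_cycle:.0f} bytes/cycle, {per_unit:.3f} bytes/work unit")
```

Printing three decimals instead of two makes it easier to see how close the cards really are to each other.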

Now, what do we know about Vega that AMD itself explicitly states on the Vega architecture page:

- 2x the bandwidth of HBM1, a.k.a. HBM2
- 128 32-bit operations per clock, a.k.a. single-precision floating point operations

## Argument 1

So for the first argument we have the memory bandwidth. The bandwidth is always in harmony with the remaining components of the GPU. We’ll generally never see a GPU where the memory bus or the memory clock is higher than needed.

HBM1 runs at 512GiB/s on our Fury card. Given that we know HBM2 is going to be used, we can deduce from AMD’s 2x claim that our future GPU will have a memory bandwidth of at least 1024GiB/s. Because we know how many bytes per cycle a work unit gets in a typical card across both vendors, simple arithmetic tells us that we could have up to 8192 work units at a clock speed of 1.4 GHz. The clock was chosen based on the RX 480’s boost clock and the fact that both are 14nm.

```
(1024 GiB/s) / (1.4 GHz) / (0.09 bytes/cycle) = 8126
```

8126 is a slightly odd number; rounding up a little to 8192 won’t hurt much.
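The same estimate as a small Python sketch. The function name is hypothetical, and the inputs are the assumptions from the text (1024 for the bandwidth, 1.4 GHz, 0.09 bytes per work unit per cycle). The second call uses the upper end of the observed 0.09 to 0.11 range, just to show how sensitive the result is to that ratio.

```python
def estimate_work_units(bandwidth_gb_s, clock_ghz, bytes_per_unit_per_cycle):
    """Estimate how many work units a given memory bandwidth could feed."""
    bytes_per_cycle = bandwidth_gb_s / clock_ghz
    return bytes_per_cycle / bytes_per_unit_per_cycle


# Assumed Vega inputs from the text: doubled Fury X bandwidth and a 1.4 GHz clock.
low_ratio = estimate_work_units(1024, 1.4, 0.09)
high_ratio = estimate_work_units(1024, 1.4, 0.11)

print(f"{low_ratio:.0f} work units at 0.09 bytes per work unit per cycle")
print(f"{high_ratio:.0f} work units at 0.11 bytes per work unit per cycle")
```

So the "up to 8192" figure sits at the optimistic end of the range, which is why I phrase it as an upper bound.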

Also from AMD’s Vega architecture site we know:

This NCU makes it possible to deliver higher clock speeds and higher instructions per clock resulting in up to 2X throughput increase over previous designs [2].

And the corresponding footnote [2]:

Data based on AMD Engineering design of Vega. Radeon R9 Fury X has 4 geometry engines and a peak of 4 polygons per clock. Vega is designed to handle up to 11 polygons per clock with 4 geometry engines. This represents an increase of 2.6x.

So AMD is actually telling us that Vega will definitely have a higher clock speed than a Fury X at 1.05GHz. Given that AMD will use a 14nm process for Vega and Polaris is built on the same process, we could assume that a clock speed of 1.4GHz should be doable.

Vega’s possible specs:

- Work units: 8192
- Chip clock: 1.4GHz
- Memory bandwidth: 1024 GiB/s using HBM2

You may ask yourself now: how can we deduce the expected performance from that? Well, we simply have to know how many operations are possible per clock, which clock speed we’ll use and how many work units are available.

For our possible Vega card this would be:

```
(8192 work units) * (1.4 GHz) * (128 32-bit operations/clock) = 1468006 GFlop/s
```

**Cough** **Cough** **Cough**, 1468 TFlops! Are you kidding me?!

This can’t be true, right? Something seems to be off with this calculation. Can we somehow derive the correct formula from existing data? We have our RX 480, so let’s try:

- Peak performance: 5.8 TFlops or 5800 GFlops (for 32bit operations aka. single precision performance)
- Clock speed (Boost/Base): 1266 MHz / 1120 MHz

Knowing the peak performance and the clock speed, we can calculate the number of operations an RX 480 can execute in a single cycle:

```
(5800 GFlops) / (1.12 GHz) = 5178 Flop/cycle
```

Meh, what a weird number, let’s try the advertised boost clock.

```
(5800 GFlops) / (1.26 GHz) = 4603 Flop/cycle
```

So we have 4603 floating point operations in a single cycle and 2304 work units. Hmm, sounds like a good match: that would be about 2 operations per work unit per cycle. Sounds very reasonable.
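Here is the same back-calculation as a short Python sketch (the function name is mine). It uses the RX 480 numbers listed above, with the boost clock taken as 1.266 GHz from the 1266 MHz spec rather than the rounded 1.26.

```python
def flop_per_cycle_per_unit(peak_gflops, clock_ghz, work_units):
    """Back out how many operations one work unit performs per clock cycle."""
    return peak_gflops / clock_ghz / work_units


# RX 480: 5800 GFlops peak, 2304 work units, base and boost clock in GHz.
for label, clock in (("base", 1.12), ("boost", 1.266)):
    ratio = flop_per_cycle_per_unit(5800, clock, 2304)
    print(f"{label} clock {clock} GHz: {ratio:.2f} Flop per work unit per cycle")
```

Only the boost clock lands close to a whole number, which fits the usual way of counting one fused multiply-add as 2 operations.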

From the Vega endnotes we also know:

Discrete AMD Radeon™ and FirePro™ GPUs based on the Graphics Core Next architecture consist of multiple discrete execution engines known as a Compute Unit (“CU”). Each CU contains 64 shaders (“Stream Processors”) working together.

So this means the 128 32-bit ops per clock number is actually meant for a single Compute Unit, which usually consists of 64 work units. If we update our formula to account for compute units, we get:

```
(8192 work units) / (64 work units per CU) *
(1.4 GHz) * (128 32-bit operations/clock) = 22937 GFlop/s
```

23 TFlops sounds really, really nice.
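The corrected formula as a small Python sketch. The function and parameter names are hypothetical; the defaults are the 64 work units per CU and 128 operations per CU per clock quoted above. Applying the same formula to the RX 480 is my own cross-check, since its advertised 5.8 TFlops is a known reference point.

```python
def peak_gflops(work_units, clock_ghz, units_per_cu=64, ops_per_cu_per_clock=128):
    """Peak single-precision throughput in GFlop/s for a GCN-style layout."""
    compute_units = work_units / units_per_cu
    return compute_units * clock_ghz * ops_per_cu_per_clock


# Cross-check against a known card first, then the hypothetical Vega.
print(f"RX 480 at 1.266 GHz: {peak_gflops(2304, 1.266):.0f} GFlop/s")
print(f"Hypothetical Vega at 1.4 GHz: {peak_gflops(8192, 1.4):.0f} GFlop/s")
```

If the RX 480 line comes out near its advertised peak, the per-CU reading of the 128 ops figure is at least consistent with existing hardware.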

But wait, what happens if we increase the clock speed to the boost clock of an *extremely, extremely well binned* Sapphire RX 580 with additional overclocking? So far, we only considered an average RX 480’s boost clock, but AMD loves presenting peak performance using the very best boost clock possible. Who wouldn’t?

```
(8192 work units) / (64 work units per CU) *
(1.5 GHz) * (128 32-bit operations/clock) = 24576 GFlop/s
```

24.6TFlops sounds even better! But, can we somehow verify this number?

**Thinking, thinking…**

24.6, that’s pretty close to 25. Do you remember the name MI25? This leads us to my second argument.

## Argument 2

AMD’s Radeon Instinct GPUs will also come out in Q2. Looking at the information here, we can make some assumptions:

- MI6 sounds like our RX 570 (with 5.7TFlops therefore MI6)
- MI8 sounds like our RX 580 (with 8.2TFlops therefore MI8)
- MI25 sounds like our next VEGA GPU (with 25TFlops therefore MI25)

But wait! We only have 24.6 TFlops; that doesn’t sound right. Given the advances in the RX 580 and the resulting high TDP due to the high frequencies, we can assume that the MI25 will be highly overclocked. Hence the horrendous TDP of 300W. A nice space heater for the next winter :^) .

If we say that the MI25 has around 25TFlops to justify the name MI25, we would need the card to run at a minimum of 1.5 GHz (about 1.53 GHz for a full 25 TFlops with 128 compute units). Sounds possible with decent cooling.

So here comes the bad news. Do you remember the new Radeon Pro Duo and the Ryzen 7/5 CPUs? I hate to break it to you, but AMD just loves their dual setups. I really doubt that AMD will make a single die with 128 compute units (8192 / 64). That’s humongous! A more reasonable setup would be 64 CUs per die, like the Fury X. This would result in a high end consumer configuration with around 12.5TFlops.
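To see which clock the name would actually require, the throughput formula can be solved for the clock. A minimal sketch, assuming the 128 compute units (8192 / 64) and 128 operations per CU per clock from Argument 1; the function name is mine.

```python
def required_clock_ghz(target_gflops, compute_units, ops_per_cu_per_clock=128):
    """Clock speed needed to reach a target peak throughput."""
    return target_gflops / (compute_units * ops_per_cu_per_clock)


# 25 TFlops = 25000 GFlops on the assumed 128 compute units.
clock = required_clock_ghz(25_000, 8192 // 64)
print(f"{clock:.3f} GHz needed for 25 TFlops with 128 CUs")
```

A marketing name like MI25 could of course also come from rounding up a slightly lower number, so treat this as the upper end of the required clock.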

If I recall correctly, this would fall in line with a GTX 1080 in terms of **gaming** performance. Proof: One of Linus’ videos playing Doom at 4K at 1:00 - 1:10 [Youtube Link], which was sponsored by AMD. We can also see at 1:14 that this is a single GPU setup because there is no back plate. On paper it would actually look like a GTX 1080TI; it really depends on how well the drivers are optimized (just compare the raw specifications of a GTX 1060 and an RX 480/580).

If we look at the MI25 on the Radeon Instinct site, it will probably be a dual GPU given the sheer size of the card compared to the MI6 and MI8. Also, if you look extremely closely at the left side of the card, there are 2 metal brackets for a dual slot GPU, which would indicate a bigger cooler for a dual GPU setup.

## Final thoughts

Since we can guess the performance and almost all the specs by using basic arithmetic and AMD’s usual stance on doing things, can we somehow deduce pricing? I don’t know; I’ll leave that to you guys: Strawpoll about pricing.

But if, by some miracle, AMD actually pulls off a single die with 8192 work units that runs reasonably on air cooling, take 4 in CrossFire and you’ve got yourself a nice 100TFlops gaming rig. From a space standpoint, we could actually fit a pretty large die on the PCB thanks to HBM2 (cf. http://www.amd.com/Documents/High-Bandwidth-Memory-HBM.pdf).

In closing, my main supporting arguments are the memory bandwidth and the upcoming Instinct GPUs. We could expect at least one consumer high end GPU with around 12.5TFlops and a dual GPU for server deployments at around 25 TFlops. I did not rely on any rumors. I used either official data provided or sponsored by AMD, as well as historical architecture layouts, i.e. how many “X” per “Y” we usually have. I hope this helps 🌎🚀🌕.