Hot Chips 32, XSX Architecture Deep Dive

LucasTaves · October 30, 2020, 10:51am

They do use the CUs, but there is added hardware too to make them more efficient for the job.

The CUs are able to process a single operation involving two 32-bit floating-point numbers. Without the extra hardware for integer math (where the numbers are smaller), the full CU would still be occupied by a single int operation.

The extra hardware allows it to perform up to 4 int operations at the same time, significantly increasing throughput. The CUs will still be occupied and unable to do shader work when that happens, but for a console that’s actually preferable to losing extra die space to completely separate cores.

No-1HoloLens-Fan · October 30, 2020, 11:04am

Interesting!

Looks like the RX 6000 series doesn’t have int-4 and int-8 support like the XSX and XSS.

If it did, they would have mentioned it like this

LucasTaves · October 30, 2020, 12:05pm

Even without RPM (rapid packed math), the peak int8 and int4 performance would be the same as the single-precision FP rate, so I wouldn’t read too much into it not being listed.

No-1HoloLens-Fan · October 30, 2020, 12:09pm

I didn’t understand. Why wouldn’t it matter?

LucasTaves · October 30, 2020, 12:21pm

Not that it wouldn’t matter, but the int4 and int8 hardware isn’t there to enable the GPU to process integer math, it’s there to accelerate it.

Basically each CU is big enough to handle a single operation with two 32-bit floating-point numbers (the 16 TFLOPS in this case).

Without RPM the int4 rate would also be 16 TOPS, so they could list that.

What RPM does is add a bit of hardware so the CU can process more numbers at once when the numbers are not big enough to fill the whole register. So essentially it increases the throughput (rather than enabling int math) by allowing more operations to be done at the same time.
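As a rough illustration of that packing idea (a software sketch only, not a description of how the GPU hardware is actually built): four 8-bit values can share one 32-bit word and be added in a single pass, as long as carries are stopped at the lane boundaries. The helper names `pack_u8`, `unpack_u8` and `packed_add_u8` are made up for this example.

```python
# Software sketch of "packed" 8-bit math inside one 32-bit word.
# All names here are hypothetical and only illustrate the concept.

LOW7 = 0x7F7F7F7F   # low 7 bits of each 8-bit lane
HIGH1 = 0x80808080  # top bit of each 8-bit lane


def pack_u8(values):
    """Pack four unsigned 8-bit values into one 32-bit word (lane 0 = lowest byte)."""
    assert len(values) == 4
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_u8(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add_u8(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256 on its own."""
    # Add the low 7 bits of every lane; a carry can only reach bit 7 of the same lane.
    low = (a & LOW7) + (b & LOW7)
    # Fold in the top bits with XOR so nothing spills into the neighbouring lane.
    return low ^ ((a ^ b) & HIGH1)


result = packed_add_u8(pack_u8([1, 2, 3, 4]), pack_u8([10, 20, 30, 40]))
print(unpack_u8(result))
```

One pass over the 32-bit word produces four results, which is the "more operations at the same time" part.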

For int8 you would be able to process up to 4 int operations per CU, so 16 x 4 = 64 TOPS, and for int4 up to 8, so 128 TOPS. Whereas for GPUs without RPM that would be a constant 16 in all cases.

TL;DR: the GPU is definitely able to handle int math, and their not listing it does not necessarily mean that they don’t support the acceleration for it.
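The arithmetic in that paragraph can be written out as a small sketch. It assumes the simple model from this post (a 32-bit slot that either holds one operand or several packed ones) and takes the 16 figure from above; `packed_rate_tops` is a made-up helper name, not a real API.

```python
def packed_rate_tops(base_rate, operand_bits, register_bits=32, packing=True):
    """Peak rate under the simple packing model described in the post.

    base_rate    -- rate with one operation per 32-bit slot (16 in the post)
    operand_bits -- width of each integer operand (8 for int8, 4 for int4)
    packing      -- False models a GPU without packed math support
    """
    lanes = register_bits // operand_bits if packing else 1
    return base_rate * lanes


BASE = 16  # the 16 TFLOPS FP32 figure quoted in the post

print("int8 with packing:", packed_rate_tops(BASE, 8))
print("int4 with packing:", packed_rate_tops(BASE, 4))
print("int8 without packing:", packed_rate_tops(BASE, 8, packing=False))
print("int4 without packing:", packed_rate_tops(BASE, 4, packing=False))
```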

No-1HoloLens-Fan · October 30, 2020, 12:27pm

Ok. Now I understand. But I included the image of the other GPU for a reason.

It specifically mentions int-8 for the other GPU. Why did they choose not to mention it for the RX 6000?

I think we are back in the same ‘assume it until confirmed’ situation.

Let’s see.

Outrun · October 30, 2020, 12:42pm

I am not technical. But I keep on hearing about AMD’s Infinity Cache.

Do the XSX and XSS have it on their SoCs?

Colbert · October 30, 2020, 1:26pm

AFAIK int8 would be double the FP16 throughput and int4 would be quadruple. The Hot Chips slides talk about an ML inference performance boost of 3x to 10x. You do not get that by just having the same throughput, right?

Colbert · October 30, 2020, 1:28pm

Not that I know of. Infinity Cache is an extra cache to work around the bandwidth limit of only 512GB/s max on the PC GPUs. The XSX die shot does not show the same kind of structures, and Infinity Cache is absolutely not needed for the XSS. Infinity Cache is also an extra 128MB on top of the normal caches, and that is completely absent from the XSX SoC, which sports way less cache in total, including the CPU, IIRC.

LucasTaves · October 30, 2020, 1:37pm

For FP16 yes, but relative to FP32 it would be 4x and 8x, no?

Colbert · October 30, 2020, 2:36pm

FP32 is half the throughput of FP16. Remember: FP16 is half precision and FP32 is single precision. The 12 TF is FP32, so it is 24 TF with FP16. For int8 and int4 you don’t have that type of measurement because those are not “floating point operations”.
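For what it’s worth, the two ways of counting in the last few posts (2x/4x relative to FP16 versus 4x/8x relative to FP32) land on the same numbers. A quick sketch, using the rounded 12 TF figure from this post and assuming the ratios quoted in the thread hold:

```python
FP32_TFLOPS = 12                 # rounded figure quoted in the thread
FP16_TFLOPS = 2 * FP32_TFLOPS    # half precision runs at twice the FP32 rate

# Counting relative to FP16: int8 is 2x, int4 is 4x.
tops_from_fp16 = {"int8": 2 * FP16_TFLOPS, "int4": 4 * FP16_TFLOPS}

# Counting relative to FP32: int8 is 4x, int4 is 8x.
tops_from_fp32 = {"int8": 4 * FP32_TFLOPS, "int4": 8 * FP32_TFLOPS}

print(FP16_TFLOPS, tops_from_fp16, tops_from_fp32)
print("same result:", tops_from_fp16 == tops_from_fp32)
```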

No-1HoloLens-Fan · October 30, 2020, 3:03pm

No!

No-1HoloLens-Fan · October 30, 2020, 3:06pm

That’s what Infinity Cache looks like

The Xbox Series consoles probably don’t have it.

This could be a potential secret sauce for the PS5, though. But that also has a low probability.

LucasTaves · October 30, 2020, 3:25pm

Yes, but what I meant is that I initially used the 16 TFLOPS rate of the 6800, which is FP32 (not FP16), and which would be equivalent to 64 and 128 TOPS.

Or, if they don’t support RPM for int, it would be a constant 16 TOPS in all scenarios.

LucasTaves · October 30, 2020, 3:27pm

Definitely no console will have that. At 128MB, the Infinity Cache would take up an area equal to more than 1/3 of the SX SoC die.

On a big desktop GPU that doesn’t matter as much, since you have way more headroom, but in a console I think we would have another Xbone situation.

Biggzy · October 30, 2020, 3:40pm

I think you are being too generous. I would say it is physically impossible when you see how much space the cache takes up on an 80 CU chip and consider that the PS5’s APU is significantly smaller than the Series X’s.

No-1HoloLens-Fan · October 30, 2020, 3:49pm

Yup! I was being generous.

But I also believe the PS5 has something up its sleeve, because everyone keeps telling me that the difference between the PS5 and XSX will be negligible. And I can’t imagine how that is possible unless there is some special secret sauce.

Biggzy · October 30, 2020, 4:02pm

To those people, I have already said on here that Sony has already shown their ‘secret sauce’, and it has to do with IO, audio and dynamic clocks. Sony has spoken at length about the PS5 by now; if there was anything else interesting, they would have said it months ago. This honestly reminds me a bit of the ridiculous rumors about the One having dual GPUs etc. lol.

PlayStation fans honestly just need to accept that the Series X is the more capable machine, just like Xbox fans did back in 2013.

Colbert · October 30, 2020, 5:45pm

MS said it is 97 TOPS max (int4).
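As a sanity check on that number, working backwards under the 8x (int4) and 4x (int8) packing ratios discussed earlier in the thread gives an FP32 rate close to the 12 TF figure mentioned above. This is only back-of-the-envelope arithmetic; the variable names are made up.

```python
INT4_TOPS = 97  # figure attributed to MS in this post

# Assumption from earlier in the thread: int4 packs 8 ops, int8 packs 4 ops
# into the slot that otherwise holds one FP32 operation.
implied_fp32_tflops = INT4_TOPS / 8
implied_int8_tops = implied_fp32_tflops * 4

print("implied FP32 TFLOPS:", implied_fp32_tflops)
print("implied int8 TOPS:", implied_int8_tops)
```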

LucasTaves · October 30, 2020, 5:50pm

Yeah, but this is regarding the 6800 (which is rated at 16 TFLOPS as opposed to the SX’s 12).