Microsoft is releasing DirectML APIs to the general public

Microsoft is finally making the DirectML APIs available to everyone.

Along with the APIs for using GPUs and tensor cores, they are also releasing some samples, including super resolution and a denoiser for ray tracing.


Cool, they included the SuperResolution sample.

I get 10 fps on my Ryzen 4700U upscaling the video from 540p to 1080p. That’s better than I thought.


Could you walk me through how to implement this?

What tools does it require?

Implement or run?

For running it you need to:

  • Download and install Visual Studio 2019 Community Edition (free) from Visual Studio 2019 IDE - Programming Software for Windows. The installer is a lengthy process. In the installer, select the workload “Game development with C++”.
  • Clone the Git repository: open a command line and type “git clone”. This creates a directory DirectML.
  • Open DirectML\Samples\DirectMLSuperResolution\DirectMLSuperResolution.sln; this should start Visual Studio 2019.
  • In the Visual Studio toolbar, change the configuration from “Debug” to “Release”.
  • In the “Build” menu, select “Build Solution” or press F6. If your installation is correct, there should be no build errors.
  • Run the built program with Ctrl+F5.

I think you need a rather modern GPU for this to work. I have a Vega 8 in my laptop (GCN2).


If this can make its way into the Series S, it would be massive for that console.


It’s already there. To use it you need INT4 and INT8 compute capabilities. XSS/XSX have INT4 and INT8 specifically for this exact purpose. :slight_smile:


oh right

Stupid me then


In 5 years, people will look back on this cycle and be shocked at how big a deal ML turned out to be.

VERY bold prediction for funzies: within 10 years, ML will replace rasterization of graphics in games entirely.


This is most likely why MS is marketing the Series S as a 1440p box: they plan to use this tech in the future. 1080p upscaled to 1440p is their goal, I think.


The SuperResolution example provided with DirectML here uses FP16. I’m not sure what can be modified to run at such low precision (INT4 is only 16 different values), but maybe some parts of the network can.
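
To put rough numbers on that precision gap, here’s a minimal sketch of symmetric post-training quantization (toy weights and made-up helper names, not the DirectML code):

```python
# Sketch: symmetric quantization of float weights to signed integers.
# INT4 can represent only 16 values (-8..7); INT8 has 256 (-128..127).

def quantize(weights, bits):
    """Map float weights to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.91, -0.42, 0.07, -0.88, 0.33]   # toy FP weights
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    approx = dequantize(q, scale)
    err = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"INT{bits}: q={q} max_error={err:.4f}")
```

The round-trip error grows noticeably when only 16 levels are available, which is the concern about artifacts.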

The neural network provided here is too expensive to use on a Series S. Upscaling 540p@30fps would use most of the GPU time on the console, leaving not a lot left to render the game.
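
A rough back-of-envelope supporting that, using the 10 fps / 1.4-teraflop figure reported above and assuming perfect FLOP-proportional scaling to the Series S GPU (~4 teraflops per its public spec), which is optimistic:

```python
# Back-of-envelope: does the sample network fit a 30 fps frame budget?
# Assumes perfect FLOP-proportional scaling (optimistic).

reported_fps = 10.0        # measured on the 1.4 TFLOP laptop GPU above
laptop_tflops = 1.4
series_s_tflops = 4.0      # Series S GPU, public spec

frame_ms = 1000.0 / reported_fps                        # ~100 ms per upscale
scaled_ms = frame_ms * laptop_tflops / series_s_tflops  # ~35 ms on Series S
budget_ms = 1000.0 / 30.0                               # ~33.3 ms at 30 fps

print(f"estimated upscale cost: {scaled_ms:.1f} ms vs {budget_ms:.1f} ms budget")
```

Even with nothing else running, the estimated cost already exceeds a full 30 fps frame.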

XSS/XSX also use RPM (FP16 included). And you can use INT instead of FP to get much better performance (by a factor of 2 or 4). The weights used for SS are integers, so using FP is not very efficient. Keep in mind too that the portion of the frame in which SS runs is not the same as when the rest of the rasterization or RT work is done (SS happens at the end of the frame, after that other work is already finished). We know for certain that this exact thing is why INT is available in XSS/XSX; Andrew Goosen said so explicitly. We also know that the ML SS tech is being used for texture upres too within XGS (James Gwertzman said so a while back).


The SuperResolution sample uses FP16, which my GPU supports. In the end it’s 10 fps or lower on a 1.4-teraflop GPU. I still have to run some further experiments with this, though. Maybe a user with a more capable GPU can try their luck and report their findings. What did you get?

Sure, using INT results in better performance, but it’s also much lower precision, and you want to avoid artifacts. That’s why Nvidia’s DLSS and Facebook’s super-resolution research use FP16.

Where did you get this from?

If I step through the example in the debugger, the weights all seem to be of float type.

It doesn’t really matter when the super resolution runs in the GPU pipeline. If it takes almost 100% of the GPU’s ALU to upscale 540p to 1080p at 30 fps, then there is no ALU left to rasterize the game :woman_shrugging:

I have a GTX 1650.

So, in the end, the built application can upscale a 540p video to 1080p in real time?

If yes, what kinds of videos can we run?

Perhaps a by-product of the GPU not supporting RPM for INT? Because without it, FP16 would be faster than the single rate INT would run at.

This is a nice glitch. @moderators

My post is showing that it’s posted by Craig.



With a GTX 1650 you should get markedly better results than on my laptop.

The sample upscales the video file FH3_540p60.mp4 in the folder Media\Videos\Turn10. You can just replace it with a different 540p@60fps video and it should work (I didn’t try it myself).

This has nothing to do with the GPU’s INT support. The SuperResolution network uses floating-point weights. Here, straight from the runtime:

Or maybe I didn’t understand your question correctly? Sorry, English is not my first language.

What I mean is that DirectML tries to use any hardware available, so it could be that even if they could fit them into an INT format, they stick to FP16 as that would be faster on a GPU without RPM for INT.

Nvidia’s tensor cores aren’t built for SS; they’re built for generalized ML, including training. Their DLSS solution uses both INT4 and INT8. Those were added to the tensor cores specifically for things like SS, where you end up quantizing the data set right off the bat, as the weights that get output are all integer values. Doing tensor math on integers will never require FP precision, since you can’t multiply integers and get anything other than more integers. I think you are confusing their training (which uses FP16) with how DLSS does inference.

When asked specifically about DLSS/ML in XSS/XSX, this is what Andrew Goosen said:

“We knew that many inference algorithms need only 8-bit and 4-bit integer positions for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms. So we added special hardware support for this specific scenario.”

And of course it matters in what part of the frame ML is used. You aren’t using all your GPU resources during each millisecond of the frame’s production.
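
To illustrate the integer-math point, here is a minimal sketch (made-up helper name, not the DLSS or DirectML code) of the standard quantized-inference pattern: integer multiply-accumulate throughout, with a single float rescale only at the very end:

```python
# Sketch of quantized inference: INT8 activations times INT8 weights,
# accumulated as integers (an INT32 accumulator in hardware); one float
# rescale happens only at the end.

def int8_dot(activations_q, weights_q, act_scale, w_scale):
    acc = 0
    for a, w in zip(activations_q, weights_q):
        acc += a * w                       # int * int stays int, no FP needed
    return acc * act_scale * w_scale       # dequantize once at the end

acts = [12, -3, 45, 100]    # quantized activations
wts = [7, -2, 0, 5]         # quantized (integer) weights
y = int8_dot(acts, wts, act_scale=0.05, w_scale=0.01)
print(y)
```

Everything up to the final rescale is pure integer arithmetic, which is exactly the work the INT4/INT8 paths accelerate.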


It sticks to FP16 because floating point is the specified data type. There is no INT in the example.

I can’t find any source for that. Do you have a link where I can read about its use in DLSS?

But this is not one of the inference algorithms where INT4 will be used, and I question the widespread use of INT8 with super resolution if Nvidia and Facebook engineers don’t use it.

The SuperResolution inference uses almost 100% of the GPU resources when it runs. Nothing else will overlap with it in an efficient manner.


Yeah, I’ve included more info below. I want to note that the super-res example uses FP16 since it is showing DirectML using Nvidia’s FP16-trained model. That isn’t the extent of what MS plans to do. As noted, James Gwertzman already confirmed MS is using the INT capabilities specifically for upscaling individual textures, and MS has repeated several times that super sampling is their intended use case there too. So unless you think MS is mistaken about their own decisions and research, it seems quite settled to me.

And you don’t need anything to overlap with the SS within the frame time. The rest of the work would be done by then anyhow (since you are rendering far fewer pixels in the first place before scaling).

“Microsoft has confirmed support for accelerated INT4/INT8 processing for Xbox Series X (for the record, DLSS uses INT8) but Sony has not confirmed ML support for PlayStation 5 nor a clutch of other RDNA 2 features that are present for the next generation Xbox and in PC via DirectX 12 Ultimate support on upcoming AMD products.”

Some more background on INT8 added to Turing’s architecture from Nvidia:

“Turing GPUs include a new version of the Tensor Core design that has been enhanced for inferencing. Turing Tensor Cores add new INT8 and INT4 precision modes for inferencing workloads that can tolerate quantization and don’t require FP16 precision.”

This might be worth checking out as well:

“Many inference applications benefit from reduced precision, whether it’s mixed precision for recurrent neural networks (RNNs) or INT8 for convolutional neural networks (CNNs), where applications can get 3x+ speedups.”

Note: DLSS uses a convolutional NN for training.

And this arXiv paper (presumably published elsewhere by now since it’s from April):

“Once trained, neural networks can be deployed for inference using even lower-precision formats, including floating-point, fixed-point, and integer.”

As for accuracy, quantization matters. This research is part of the basis for what Nvidia has been doing with DLSS and other ML use cases:

“We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision post-training produces classification accuracies within 2% of floating point networks for a wide variety of CNN architectures.”

“Quantization-aware training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision.”
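
A toy sketch of why the per-channel scales the paper describes help (made-up numbers and helper name, not from the paper): a small-range channel that shares one layer-wide scale with a large-range channel loses most of its precision.

```python
# Sketch: per-layer vs per-channel INT8 weight quantization (toy numbers).

def max_quant_error(weights, scale):
    """Largest round-trip error when quantizing weights with a given scale."""
    err = 0.0
    for w in weights:
        q = max(-128, min(127, round(w / scale)))
        err = max(err, abs(q * scale - w))
    return err

big_channel = [0.9, -0.7, 0.5]
small_channel = [0.009, -0.004, 0.006]

# Per-layer: one scale derived from the global max across both channels.
layer_scale = max(abs(w) for w in big_channel + small_channel) / 127
shared_err = max_quant_error(small_channel, layer_scale)

# Per-channel: the small channel gets a scale from its own max.
own_err = max_quant_error(small_channel, max(abs(w) for w in small_channel) / 127)

print(f"small channel: shared-scale err {shared_err:.6f}, own-scale err {own_err:.6f}")
```

With its own scale, the small channel’s error drops by roughly two orders of magnitude, which is the effect behind the paper’s “within 2% of floating point” result.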