
CUDA Fortran supports 16-bit floating point variables through real(2) declarations, which are available on both the host and device. Arithmetic, input/output, and implicit conversions between types are all supported. The final section discusses the performance aspects of using Tensor Cores.
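As a minimal sketch of what real(2) support looks like in practice (assuming the NVIDIA HPC Fortran compiler, where real(2) is an extension and is not part of standard Fortran):

```fortran
program half_precision
  implicit none
  real(2) :: xh          ! 16-bit floating point variable
  real(4) :: xs

  xh = 1.5_2             ! half-precision literal
  xs = xh + 0.25_2       ! arithmetic in half precision,
                         ! implicit conversion on assignment
  print *, xh, xs        ! half-precision input/output
end program half_precision
```

The same real(2) declarations can appear in device code, which is what allows half-precision multiplicand arrays to be passed to Tensor Core operations.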
With the WMMA interface, a single warp of 32 threads performs D = A*B + C, where A, B, C, and D are 16×16 (256-element) matrices. The multiplicands A and B are matrices of half-precision (16-bit) floating point values, whereas C and D are matrices of either both half-precision or both full-precision (32-bit) floating point values. Each Tensor Core actually performs a 4×4 matrix multiply, so multiple Tensor Cores are used in each WMMA operation. Before the WMMA operation can take place, the operand matrices must be loaded into registers, distributed amongst the threads in the warp. The mapping of threads to matrix elements is opaque: the WMMASubmatrix datatype (equivalent to the fragment in CUDA C) is used to represent the elements each thread holds of the matrix operated on by the warp of threads, along with other metadata. This post will first discuss the support for 16-bit floating point values using real(2) declarations, as well as the wmma CUDA Fortran module, in particular how variables of the WMMASubmatrix type are used to perform matrix multiplications. That section is followed by a series of progressively complex examples illustrating the use of the WMMA API from CUDA Fortran, starting with a simple 16×16 matrix multiply.
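A sketch of a single-warp 16×16 multiply along these lines is shown below; since the interface is a preview feature, the routine names (wmmaLoadMatrix, wmmaMatMul, wmmaStoreMatrix) and the WMMASubMatrix declaration syntax used here are assumptions based on the wmma module and may change:

```fortran
attributes(global) subroutine wmma_16x16(a, b, c)
  use wmma
  implicit none
  real(2), intent(in) :: a(16,16), b(16,16)   ! half-precision multiplicands
  real(4) :: c(16,16)                         ! full-precision result
  ! Per-warp submatrices (fragments) of A, B, and the accumulator
  WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
  WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
  WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4)    :: sc

  sc = 0.0                              ! initialize the accumulator fragment
  call wmmaLoadMatrix(sa, a(1,1), 16)   ! load A; leading dimension 16
  call wmmaLoadMatrix(sb, b(1,1), 16)   ! load B
  call wmmaMatMul(sc, sa, sb, sc)       ! sc = sa*sb + sc on Tensor Cores
  call wmmaStoreMatrix(c(1,1), sc, 16)  ! distribute result back to memory
end subroutine wmma_16x16
```

Because one warp cooperatively performs the whole operation, such a kernel would be launched with a single warp, e.g. `call wmma_16x16<<<1,32>>>(a_d, b_d, c_d)`.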

One of the defining features of recent NVIDIA GPUs, including the Tesla V100, is the introduction of Tensor Cores, which are programmable matrix multiply and accumulate units that operate on half-precision (16-bit) multiplicands. Access to programming Tensor Cores in CUDA C became available in the CUDA 9.0 release through the WMMA (Warp Matrix Multiply and Accumulate) API. This post describes a CUDA Fortran interface to this same functionality. Note that the WMMA interface is a preview feature in CUDA C and subject to change, as is the CUDA Fortran interface described in what follows.
