
DAY83: Reading Compute Capability 7.x

GPUS Lady · Published 2018-11-06 · Column: GPUS开发者

We are leading everyone through the English edition of the CUDA C Programming Guide. Today is day 83, and we are covering compute capabilities. We hope that over the next 17 days you will learn CUDA in its original form while also building the habit of reading in English.

H.6. Compute Capability 7.x

H.6.1. Architecture

A multiprocessor consists of:

  • 64 FP32 cores for single-precision arithmetic operations,
  • 32 FP64 cores for double-precision arithmetic operations,
  • 64 INT32 cores for integer math,
  • 8 mixed-precision Tensor Cores for deep learning matrix arithmetic,
  • 16 special function units for single-precision floating-point transcendental functions,
  • 4 warp schedulers.

A multiprocessor statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

A multiprocessor has:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
  • a unified data cache and shared memory with a total size of 128 KB (Volta) or 96 KB (Turing).

Shared memory is partitioned out of unified data cache, and can be configured to various sizes (See Shared Memory.) The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.

H.6.2. Independent Thread Scheduling

The Volta architecture introduces Independent Thread Scheduling among threads in a warp, enabling intra-warp synchronization patterns previously unavailable and simplifying code changes when porting CPU code. However, this can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures.

Below are code patterns of concern and suggested corrective actions for Volta-safe code.

  1. For applications using warp intrinsics (__shfl*, __any, __all, __ballot), it is necessary that developers port their code to the new, safe, synchronizing counterparts with the *_sync suffix. The new warp intrinsics take in a mask of threads that explicitly defines which lanes (threads of a warp) must participate in the warp intrinsic. See Warp Vote Functions and Warp Shuffle Functions for details. Since the intrinsics are available with CUDA 9.0+, (if necessary) code can be executed conditionally with the following preprocessor macro:
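The macro check was shown as an image in the original post; a minimal sketch of such a version guard, using the CUDART_VERSION macro and a hypothetical shuffleDown() helper, might look like this:

    #include <cuda_runtime.h>

    // Use the *_sync intrinsic when building with CUDA 9.0 or newer,
    // and fall back to the legacy intrinsic on older toolkits.
    __device__ float shuffleDown(float val, int delta) {
    #if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
        return __shfl_down_sync(0xffffffff, val, delta);  // explicit full-warp mask
    #else
        return __shfl_down(val, delta);                   // legacy, implicitly warp-synchronous
    #endif
    }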

These intrinsics are available on all architectures, not just Volta or Turing, and in most cases a single code-base will suffice for all architectures. Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask. The following code pattern is valid on Volta, but not on Pascal or earlier architectures.
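The pattern itself appeared as an image in the original post; a reconstruction consistent with the surrounding description (a full-warp mask used from inside divergent branches) could look like this:

    __global__ void exchangeHalves(float *data) {
        int tid = threadIdx.x;
        float val = data[tid];
        float swapped;

        if (tid % warpSize < 16) {
            // Valid on Volta: the lower and upper half-warps wait for each
            // other, because the mask names all 32 lanes.
            swapped = __shfl_xor_sync(0xffffffff, val, 16);
        } else {
            // Invalid on Pascal and earlier: all threads in the mask must
            // execute the same intrinsic instruction in convergence.
            swapped = __shfl_xor_sync(0xffffffff, val, 16);
        }
        data[tid] = swapped;
    }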

The replacement for __ballot(1) is __activemask(). Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path. The following invalid code example sets bit i of output to 1 when data[i] is greater than threshold. __activemask() is used in an attempt to enable cases where dataLen is not a multiple of 32.
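The invalid listing was an image in the original post; reconstructed from the description (warpLane, data, output, dataLen and threshold are assumed to be provided by the surrounding kernel), it looks roughly like this:

    // INVALID: sets bit i of output to 1 when data[i] > threshold,
    // relying on __activemask() to handle dataLen not being a multiple of 32.
    for (int i = warpLane; i < dataLen; i += warpSize) {
        unsigned active = __activemask();   // may see only a subset of the warp
        unsigned bitPack = __ballot_sync(active, data[i] > threshold);
        if (warpLane == 0)
            output[i / 32] = bitPack;
    }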

This code is invalid because CUDA does not guarantee that the warp will diverge ONLY at the loop condition. When divergence happens for other reasons, conflicting results will be computed for the same 32-bit output element by different subsets of threads in the warp. A correct code might use a non-divergent loop condition together with __ballot_sync() to safely enumerate the set of threads in the warp participating in the threshold calculation as follows.
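A corrected version (again a reconstruction of the missing listing, using the same assumed variables) keeps the loop condition non-divergent and enumerates the participating lanes with __ballot_sync():

    // VALID: every lane iterates the same number of times, and
    // __ballot_sync() computes the set of lanes that still have work to do.
    for (int i = warpLane; i - warpLane < dataLen; i += warpSize) {
        unsigned active = __ballot_sync(0xffffffff, i < dataLen);
        if (i < dataLen) {
            unsigned bitPack = __ballot_sync(active, data[i] > threshold);
            if (warpLane == 0)
                output[i / 32] = bitPack;
        }
    }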

Discovery Pattern demonstrates a valid use case for __activemask().

  2. If applications have warp-synchronous code, they will need to insert the new __syncwarp() warp-wide barrier synchronization instruction between any steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
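As an illustration (not from the original post), here is a minimal warp-level shared-memory reduction, assuming a single 32-thread block, with __syncwarp() making every exchange step explicit instead of relying on lockstep execution:

    __global__ void warpReduce(const float *in, float *out) {
        __shared__ float s_buf[32];
        int lane = threadIdx.x;          // assumes blockDim.x == 32

        s_buf[lane] = in[lane];
        __syncwarp();                    // writes visible to the whole warp

        for (int offset = 16; offset > 0; offset /= 2) {
            float other = (lane < offset) ? s_buf[lane + offset] : 0.0f;
            __syncwarp();                // all reads done before anyone writes
            if (lane < offset)
                s_buf[lane] += other;
            __syncwarp();                // all writes done before the next read
        }
        if (lane == 0)
            *out = s_buf[0];
    }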

  3. Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.

The racecheck and synccheck tools provided by cuda-memcheck can aid in locating violations of points 2 and 3.

To aid migration while implementing the above-mentioned corrective actions, developers can opt-in to the Pascal scheduling model that does not support independent thread scheduling. See Application Compatibility for details.
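The exact mechanism is described in Application Compatibility; to the best of my understanding it is the compiler flag combination that generates Pascal-level (compute_60) PTX while producing sm_70 binaries, for example:

    nvcc -arch=compute_60 -code=sm_70 my_kernel.cu -o my_app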

H.6.3. Global Memory

Global memory behaves the same way as devices of compute capability 5.x (See Global Memory).

H.6.4. Shared Memory

Similar to the Kepler architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the Volta architecture (compute capability 7.0), the unified data cache has a size of 128 KB, and the shared memory capacity can be set to 0, 8, 16, 32, 64 or 96 KB. For the Turing architecture (compute capability 7.5), the unified data cache has a size of 96 KB, and the shared memory capacity can be set to either 32 KB or 64 KB. Unlike Kepler, the driver automatically configures the shared memory capacity for each kernel to avoid shared memory occupancy bottlenecks while also allowing concurrent execution with already launched kernels where possible. In most cases, the driver's default behavior should provide optimal performance.

Because the driver is not always aware of the full workload, it is sometimes useful for applications to provide additional hints regarding the desired shared memory configuration. For example, a kernel with little or no shared memory use may request a larger carveout in order to encourage concurrent execution with later kernels that require more shared memory. The new cudaFuncSetAttribute() API allows applications to set a preferred shared memory capacity, or carveout, as a percentage of the maximum supported shared memory capacity (96 KB for Volta, and 64 KB for Turing).

cudaFuncSetAttribute() relaxes enforcement of the preferred shared capacity compared to the legacy cudaFuncSetCacheConfig() API introduced with Kepler. The legacy API treated shared memory capacities as hard requirements for kernel launch. As a result, interleaving kernels with different shared memory configurations would needlessly serialize launches behind shared memory reconfigurations. With the new API, the carveout is treated as a hint. The driver may choose a different configuration if required to execute the function or to avoid thrashing.
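The accompanying listing was an image in the original post; a sketch using an illustrative kernel name (MyCarveoutKernel) and hypothetical host wrapper, together with the real attribute enum, might look like this:

    // Device code: a placeholder kernel that uses dynamic shared memory.
    __global__ void MyCarveoutKernel(float *data) {
        extern __shared__ float buf[];
        buf[threadIdx.x] = data[threadIdx.x];
        __syncthreads();
        data[threadIdx.x] = buf[threadIdx.x];
    }

    // Host code: hint that 50% of the unified data cache should be shared memory.
    void launchWithCarveout(float *d_data, int blocks, int threads, size_t smemBytes) {
        int carveout = 50;   // percent of the maximum shared memory capacity
        // Named values also exist: cudaSharedmemCarveoutDefault (-1),
        // cudaSharedmemCarveoutMaxL1 (0), cudaSharedmemCarveoutMaxShared (100).
        cudaFuncSetAttribute(MyCarveoutKernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout,
                             carveout);
        MyCarveoutKernel<<<blocks, threads, smemBytes>>>(d_data);
    }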

Here the integer carveout specifies the shared memory carveout preference in percent of the total resource. This is only a hint, and the driver can choose a different ratio if required to execute the function or to avoid thrashing.

Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and as such they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows.
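The original listing was also an image; a sketch of the opt-in (with an illustrative kernel name and host wrapper, targeting 96 KB on Volta) could be:

    // Device code: allocations above 48 KB must use dynamic shared memory.
    __global__ void MyBigSharedKernel(float *data) {
        extern __shared__ float buffer[];
        buffer[threadIdx.x] = data[threadIdx.x];
        __syncthreads();
        data[threadIdx.x] = buffer[threadIdx.x];
    }

    // Host code: explicitly opt in before launching with more than 48 KB per block.
    void launchWithLargeShared(float *d_data, int blocks, int threads) {
        int maxBytes = 96 * 1024;   // 96 KB on Volta; use 64 * 1024 on Turing
        cudaFuncSetAttribute(MyBigSharedKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             maxBytes);
        MyBigSharedKernel<<<blocks, threads, maxBytes>>>(d_data);
    }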

Otherwise, shared memory behaves the same way as devices of compute capability 5.x (See Shared Memory).

Notes and experience sharing for this article:

This section is fairly straightforward. It mainly describes the changes in the new SM's resources and execution units, the divergent warp execution (Independent Thread Scheduling) available on 7.0 and 7.5, and, in particular, it emphasizes the enlarged shared memory capacity.

The latter two topics were already covered in earlier chapters, and there are real test runs from users on the forum (for the large shared memory capacity), so here we mainly go over the changes in resources and execution units.

First of all, these two architectures can to some extent be regarded as one and the same, much like the relationship between Maxwell's compute capabilities 5.0 and 5.2, rather than viewing Turing (7.5) as a brand-new next-generation architecture after Volta (7.0). The main reasons: setting aside Turing's newly introduced ability to execute integer and floating-point instructions concurrently, and setting aside its Tensor Core enhancements, there is essentially no fundamental difference between the two. All of the characteristic features are shared. For example, the incompatible changes introduced for 7.0 (Volta) starting with CUDA 9.2 apply equally to Turing (feel free to revisit our earlier chapters), and the enlarged shared memory is used the same way on Turing as on Volta (e.g., it requires dynamic allocation), and so on. NVIDIA even stresses in its official documentation that Turing can execute Volta cubins at nearly full efficiency without recompilation, which shows the two are largely similar. Users should, however, note these differences:

(1) Turing enhances the Tensor Cores and introduces new instructions. NVIDIA says that if Tensor Cores are used, recompilation is recommended; otherwise Tensor Core performance may not be fully exploited. (Without recompiling, pure binary code generated for Volta will still execute successfully, even in the absence of PTX.)

(2) Beyond that enhancement, Turing can execute the FP and INT units concurrently, which, normalized to the same clock frequency and the same number of SPs, gives better floating-point and integer performance. This even benefits code that is purely floating-point. Why? Because any computation, including what many users think of as pure floating-point work, still uses integer instructions. For example, when accessing memory through an array or pointer expression such as A[y][x], the address calculation needs the integer units. This second point lets Turing slot in INT work such as address calculation while the FP units are busy, improving overall performance.

Apart from these two points, Turing cuts double precision: the rate drops to 1/32, the same as 5.x and 6.x (excluding 6.0). Otherwise the two do not differ much. In fact, NVIDIA's documentation states that the recommended compilers for Turing are CUDA 9.2 and CUDA 10. Note that this is not "CUDA 10 and above only", which may surprise some people; but since Turing can execute Volta cubins at full speed without trouble as long as new features (mainly the new Tensor Core instructions) are not used, this really does work. So for people who have only just upgraded to 9.2 and have not yet caught their breath, buying a new card does not force an immediate toolchain and code upgrade; in most cases it makes no difference.

In addition, this generation (technically two generations, but I think it is better viewed as one) enlarges shared memory and changes its construction: texture, L1 and shared memory are now merged into one unit. This brings back the situation where, in most cases, you get the benefit of the texture cache without using texture at all (this does not apply to 2D and 3D CUDA Array accesses). These three units have been split and merged several times across the generations of NVIDIA GPUs; you get used to it. With the three merged, L1 is effectively shared memory now, with typical shared memory latency (on the order of 20-odd cycles) rather than the classic 80-100 cycle L1 latency of Maxwell/Pascal, at the cost of losing the possibility of L1 and shared memory accesses proceeding concurrently. Each approach has its pros and cons.

Furthermore, this generation keeps per-SM resources unchanged while halving the number of compute units per SM, so for a card with the same total number of SPs you effectively get double the resources. Everything beyond that has been covered before and is exactly the same as in earlier chapters, so it does not need repeating. Our recommendation: if you are starting a new CUDA project, start from Volta/Turing. The doubled resources and the new execution model (divergent warps) beat writing for an old architecture first and porting the code later.

As for the last two parts of this chapter, Independent Thread Scheduling (divergent execution within a warp) was described in detail in several earlier chapters (mainly the ones on warp intrinsics). Note that there is also a way to keep writing code in the old style on the new cards, as far as possible, for better compatibility; this was also covered before (compiling with compute_60 and sm_70). Existing users should take note of this. When you compile with a new card plus a new compiler (e.g., 9.2), if it detects unsafe old-style constructs it will warn you automatically and suggest this approach (we mentioned this before as well). Note that this part of the chapter contains example code pointing out the main new, error-prone incompatibilities.

The chapter also recommends using two of the other tools built into cuda-memcheck (racecheck and synccheck) to check for violations, invoked via the cuda-memcheck --tool form. Note that cuda-memcheck has four built-in tools. The default memory-access checker, memcheck, is the one we have recommended countless times for quickly locating and checking kernel problems (and which countless people keep ignoring while trying to debug by eye, to which I can only sigh). The other three built-in tools are racecheck, synccheck and initcheck; all three require the --tool command-line switch, and none of them is the default.

Finally, this chapter notes that the shared memory increase on 7.5 is smaller than on 7.0: the former maxes out at 64 KB, the latter at 96 KB, and any usage above 48 KB (the limit from compute capability 2.0 through 6.2) requires the new mechanism described in this chapter. Users should keep this in mind.
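For reference, the --tool switch mentioned above is used on the command line; for example (my_app is a placeholder for your executable):

    cuda-memcheck ./my_app                      # default memcheck (memory access errors)
    cuda-memcheck --tool racecheck ./my_app     # shared-memory data race detection
    cuda-memcheck --tool synccheck ./my_app     # invalid/divergent synchronization
    cuda-memcheck --tool initcheck ./my_app     # uninitialized device global memory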

Although 7.x has no further changes beyond what this chapter describes, it is worth emphasizing that the second-generation Tensor Cores on 7.5 are very capable: from Maxwell to Pascal (6.1), INT8 throughput improved by 4x, and on 7.5 it improved by another 4x, for a total of roughly 16x the throughput of most ordinary float or integer instructions. Users should consider taking full advantage of this.

If anything is unclear, please leave a comment below this article.

Originally published 2018-10-18. Shared from the GPUS开发者 WeChat public account.
