在开发meshoptimizer的过程中，经常会出现以下问题：“该算法可以使用SIMD吗？”

该库以性能为导向，但是SIMD并不总是提供明显的速度优势。不幸的是，SIMD会使代码的可移植性和可维护性降低。因此，在每种情况下，都必须寻求折衷方案。当性能至关重要时，您必须为SSE和NEON指令集开发和维护单独的SIMD实现。在其他情况下，您需要了解使用SIMD的效果。今天，我们将尝试使用SSEn / AVXn指令集来加速最近被添加到库中的新网格简化算法。

对于我们的基准，我们将泰国佛陀模型从600万个三角形简化为该数字的0.1％。我们将Microsoft Visual Studio 2019编译器用于目标x64体系结构。标量算法可以在单个Intel Core i7-8700K线程（约4.4 GHz）中在约210 ms内执行这种合理化。这相当于每秒约2850万个三角形。在实践中也许这足够了，但是我很好奇地探索了设备的最大功能。

在某些情况下，可以通过将网格划分为多个部分来并行化此过程，但是为此，有必要进行连通性的其他分析以保持边界，因此，现在我们将仅限于SIMD优化。

七次测量

为了了解优化的可能性，我们将使用英特尔VTune执行性能分析。运行该过程100次以确保有足够的数据。

在这里，我打开了微体系结构模式，以固定每个功能的执行时间，并查找瓶颈。我们看到合理化是使用一组函数执行的，每个函数都需要一定数量的循环。功能列表按时间排序。这里按执行顺序排列，使算法更易于理解：

rescalePositions将所有顶点的位置归一化为单个多维数据集，以准备使用computeVertexIds进行量化
computeVertexIds为给定大小的统一网格上的每个顶点计算一个30位量化标识符，其中每个轴在网格上进行量化（网格大小为10位，因此标识符为30）
假设一个网格单元中所有顶点的并集， countTriangles计算创新者将为给定网格大小创建的近似三角形数
fillVertexCells填充一个表，该表fillVertexCells所有顶点fillVertexCells到相应的单元格；具有相同ID的所有顶点对应于一个像元
fillCellQuadrics为每个像元填充fillCellQuadrics结构（ Quadric对称矩阵），以反映有关相应几何的聚合信息
fillCellRemap计算每个像元的顶点索引，选择该像元中的一个顶点，并使几何变形最小化
filterTriangles根据先前构造的vertex-cell-vertex表显示最终的三角形集；幼稚的平移可以平均产生约5％的重复三角形，因此该函数可以过滤重复的三角形。

computeVertexIds和countTriangles执行多次：该算法确定用于合并顶点的网格大小，执行加速的二进制搜索以达到目标三角形数量（在我们的示例中为6000），并计算每种网格大小在每次迭代时生成的三角形数量。其他功能一次启动。在我们的文件中，五次搜索通过给出了40 ³的目标网格大小。

VTune有用地报告说，最耗资源的函数是计算二次函数的函数：它几乎花费了21 s的总执行时间的一半。这是优化SIMD的首要目标。

一张一张

让我们看一下fillCellQuadrics的源代码，以更好地了解它的确切计算结果：

 static void fillCellQuadrics(Quadric* cell_quadrics, const unsigned int* indices, size_t index_count, const Vector3* vertex_positions, const unsigned int* vertex_cells) { for (size_t i = 0; i < index_count; i += 3) { unsigned int i0 = indices[i + 0]; unsigned int i1 = indices[i + 1]; unsigned int i2 = indices[i + 2]; unsigned int c0 = vertex_cells[i0]; unsigned int c1 = vertex_cells[i1]; unsigned int c2 = vertex_cells[i2]; bool single_cell = (c0 == c1) & (c0 == c2); float weight = single_cell ? 3.f : 1.f; Quadric Q; quadricFromTriangle(Q, vertex_positions[i0], vertex_positions[i1], vertex_positions[i2], weight); if (single_cell) { quadricAdd(cell_quadrics[c0], Q); } else { quadricAdd(cell_quadrics[c0], Q); quadricAdd(cell_quadrics[c1], Q); quadricAdd(cell_quadrics[c2], Q); } } }

该函数遍历所有三角形，为每个三角形计算一个二次曲面，并将其添加到每个单元的二次曲面。二次曲面-一个4×4对称矩阵，表示为10个浮点数：

 struct Quadric { float a00; float a10, a11; float a20, a21, a22; float b0, b1, b2, c; };

二次曲面的计算需要求解三角形的平面方程，构造二次矩阵，并使用三角形的面积对其进行加权：

 static void quadricFromPlane(Quadric& Q, float a, float b, float c, float d) { Q.a00 = a * a; Q.a10 = b * a; Q.a11 = b * b; Q.a20 = c * a; Q.a21 = c * b; Q.a22 = c * c; Q.b0 = d * a; Q.b1 = d * b; Q.b2 = d * c; Qc = d * d; } static void quadricFromTriangle(Quadric& Q, const Vector3& p0, const Vector3& p1, const Vector3& p2, float weight) { Vector3 p10 = {p1.x - p0.x, p1.y - p0.y, p1.z - p0.z}; Vector3 p20 = {p2.x - p0.x, p2.y - p0.y, p2.z - p0.z}; Vector3 normal = { p10.y * p20.z - p10.z * p20.y, p10.z * p20.x - p10.x * p20.z, p10.x * p20.y - p10.y * p20.x }; float area = normalize(normal); float distance = normal.x*p0.x + normal.y*p0.y + normal.z*p0.z; quadricFromPlane(Q, normal.x, normal.y, normal.z, -distance); quadricMul(Q, area * weight); }

看起来有很多浮点运算，因此可以使用SIMD对其进行并行化。首先，我们将每个向量表示为4宽度的SIMD向量，还将Quadric结构更改为12个浮点数，而不是10个浮点数，以便它恰好适合三个SIMD寄存器（增加大小不会影响性能），并更改字段的顺序，以便进行计算quadricFromPlane变得更加统一：

 struct Quadric { float a00, a11, a22; float pad0; float a10, a21, a20; float pad1; float b0, b1, b2, c; };

在这里，一些计算，尤其是标量积，与上证所的早期版本不太一致。幸运的是，SSE4.1中出现了标量产品的说明，这对我们非常有用。

 static void fillCellQuadrics(Quadric* cell_quadrics, const unsigned int* indices, size_t index_count, const Vector3* vertex_positions, const unsigned int* vertex_cells) { const int yzx = _MM_SHUFFLE(3, 0, 2, 1); const int zxy = _MM_SHUFFLE(3, 1, 0, 2); const int dp_xyz = 0x7f; for (size_t i = 0; i < index_count; i += 3) { unsigned int i0 = indices[i + 0]; unsigned int i1 = indices[i + 1]; unsigned int i2 = indices[i + 2]; unsigned int c0 = vertex_cells[i0]; unsigned int c1 = vertex_cells[i1]; unsigned int c2 = vertex_cells[i2]; bool single_cell = (c0 == c1) & (c0 == c2); __m128 p0 = _mm_loadu_ps(&vertex_positions[i0].x); __m128 p1 = _mm_loadu_ps(&vertex_positions[i1].x); __m128 p2 = _mm_loadu_ps(&vertex_positions[i2].x); __m128 p10 = _mm_sub_ps(p1, p0); __m128 p20 = _mm_sub_ps(p2, p0); __m128 normal = _mm_sub_ps( _mm_mul_ps( _mm_shuffle_ps(p10, p10, yzx), _mm_shuffle_ps(p20, p20, zxy)), _mm_mul_ps( _mm_shuffle_ps(p10, p10, zxy), _mm_shuffle_ps(p20, p20, yzx))); __m128 areasq = _mm_dp_ps(normal, normal, dp_xyz); // SSE4.1 __m128 area = _mm_sqrt_ps(areasq); // masks the result of the division when area==0 // scalar version does this in normalize() normal = _mm_and_ps( _mm_div_ps(normal, area), _mm_cmpneq_ps(area, _mm_setzero_ps())); __m128 distance = _mm_dp_ps(normal, p0, dp_xyz); // SSE4.1 __m128 negdistance = _mm_sub_ps(_mm_setzero_ps(), distance); __m128 normalnegdist = _mm_blend_ps(normal, negdistance, 8); __m128 Qx = _mm_mul_ps(normal, normal); __m128 Qy = _mm_mul_ps( _mm_shuffle_ps(normal, normal, _MM_SHUFFLE(3, 2, 2, 1)), _mm_shuffle_ps(normal, normal, _MM_SHUFFLE(3, 0, 1, 0))); __m128 Qz = _mm_mul_ps(negdistance, normalnegdist); if (single_cell) { area = _mm_mul_ps(area, _mm_set1_ps(3.f)); Qx = _mm_mul_ps(Qx, area); Qy = _mm_mul_ps(Qy, area); Qz = _mm_mul_ps(Qz, area); Quadric& q0 = cell_quadrics[c0]; __m128 q0x = _mm_loadu_ps(&q0.a00); __m128 q0y = _mm_loadu_ps(&q0.a10); __m128 q0z = _mm_loadu_ps(&q0.b0); _mm_storeu_ps(&q0.a00, _mm_add_ps(q0x, Qx)); _mm_storeu_ps(&q0.a10, _mm_add_ps(q0y, Qy)); _mm_storeu_ps(&q0.b0, _mm_add_ps(q0z, Qz)); } else { // omitted for brevity, repeats the if() body // three times for c0/c1/c2 } } }

这段代码没有什么特别有趣的。我们大量使用未对齐的加载/存储指令。尽管Vector3的输入可以对齐，但对于未对齐的读取似乎没有明显的损失。请注意，在函数的前半部分不使用向量，这很好-我们的向量具有三个分量，在某些情况下只有一个分量（请参阅areaq / area / distance的计算），而处理器并行执行4个运算。无论如何，让我们看看并行化是如何帮助的。

现在，一百个fillCellQuadrics开始时间是5.3 s（而不是9.8 s），这在每次操作上节省了约45毫秒-不错，但不是很好。在许多指令中，我们使用三个而不是四个分量，以及精确的乘法，这会产生相当大的延迟。如果您以前为SIMD编写过说明，那么您就会知道如何正确地处理标量积。

为此，您需要一次执行四个向量。而不是将一个完整的向量存储在一个SIMD寄存器中，我们使用三个寄存器-在一个寄存器中，我们存储x四个分量，在另一个寄存器中，我们将存储，而在第三个z 。这里一次需要四个向量：这意味着我们将同时处理四个三角形。

我们有许多带有动态索引的数组。通常，它有助于将数据传输到x / y / z分量的准备好的数组（或者更确切地说，对于输入中的8个顶点，通常使用小的SIMD寄存器，例如float x[8], y[8], z[8] ）。数据：这称为AoSoA（数组结构的数组），可以在缓存局部性和易于加载到SIMD寄存器之间取得良好的平衡），但是这里的动态索引效果不佳，因此请照常加载四个三角形的数据，并使用方便的方式对向量进行转置宏_MM_TRANSPOSE 。

从理论上讲，您需要在自己的SIMD寄存器中计算四个有限二次__m128 Q_a00每个分量（例如，我们将拥有__m128 Q_a00其中四个分量具有__m128 Q_a00有限二次__m128 Q_a00 ）。在这种情况下，二次曲面上的运算非常适合4宽SIMD指令，并且转换实际上会减慢代码的速度-因此，我们只转置初始向量，然后转置平面方程，然后运行用于计算二次曲面的相同代码，但是重复一下四次。以下是用于计算平面方程的代码（为简便起见，省略了其余部分）：

 unsigned int i00 = indices[(i + 0) * 3 + 0]; unsigned int i01 = indices[(i + 0) * 3 + 1]; unsigned int i02 = indices[(i + 0) * 3 + 2]; unsigned int i10 = indices[(i + 1) * 3 + 0]; unsigned int i11 = indices[(i + 1) * 3 + 1]; unsigned int i12 = indices[(i + 1) * 3 + 2]; unsigned int i20 = indices[(i + 2) * 3 + 0]; unsigned int i21 = indices[(i + 2) * 3 + 1]; unsigned int i22 = indices[(i + 2) * 3 + 2]; unsigned int i30 = indices[(i + 3) * 3 + 0]; unsigned int i31 = indices[(i + 3) * 3 + 1]; unsigned int i32 = indices[(i + 3) * 3 + 2]; // load first vertex of each triangle and transpose into vectors with components (pw0 isn't used later) __m128 px0 = _mm_loadu_ps(&vertex_positions[i00].x); __m128 py0 = _mm_loadu_ps(&vertex_positions[i10].x); __m128 pz0 = _mm_loadu_ps(&vertex_positions[i20].x); __m128 pw0 = _mm_loadu_ps(&vertex_positions[i30].x); _MM_TRANSPOSE4_PS(px0, py0, pz0, pw0); // load second vertex of each triangle and transpose into vectors with components __m128 px1 = _mm_loadu_ps(&vertex_positions[i01].x); __m128 py1 = _mm_loadu_ps(&vertex_positions[i11].x); __m128 pz1 = _mm_loadu_ps(&vertex_positions[i21].x); __m128 pw1 = _mm_loadu_ps(&vertex_positions[i31].x); _MM_TRANSPOSE4_PS(px1, py1, pz1, pw1); // load third vertex of each triangle and transpose into vectors with components __m128 px2 = _mm_loadu_ps(&vertex_positions[i02].x); __m128 py2 = _mm_loadu_ps(&vertex_positions[i12].x); __m128 pz2 = _mm_loadu_ps(&vertex_positions[i22].x); __m128 pw2 = _mm_loadu_ps(&vertex_positions[i32].x); _MM_TRANSPOSE4_PS(px2, py2, pz2, pw2); // p1 - p0 __m128 px10 = _mm_sub_ps(px1, px0); __m128 py10 = _mm_sub_ps(py1, py0); __m128 pz10 = _mm_sub_ps(pz1, pz0); // p2 - p0 __m128 px20 = _mm_sub_ps(px2, px0); __m128 py20 = _mm_sub_ps(py2, py0); __m128 pz20 = _mm_sub_ps(pz2, pz0); // cross(p10, p20) __m128 normalx = _mm_sub_ps( _mm_mul_ps(py10, pz20), _mm_mul_ps(pz10, py20)); __m128 normaly = _mm_sub_ps( _mm_mul_ps(pz10, px20), _mm_mul_ps(px10, pz20)); __m128 normalz = _mm_sub_ps( _mm_mul_ps(px10, py20), _mm_mul_ps(py10, px20)); // normalize; note that areasq/area now contain 4 values, not just one __m128 areasq = _mm_add_ps( _mm_mul_ps(normalx, normalx), _mm_add_ps( _mm_mul_ps(normaly, normaly), _mm_mul_ps(normalz, normalz))); __m128 area = _mm_sqrt_ps(areasq); __m128 areanz = _mm_cmpneq_ps(area, _mm_setzero_ps()); normalx = _mm_and_ps(_mm_div_ps(normalx, area), areanz); normaly = _mm_and_ps(_mm_div_ps(normaly, area), areanz); normalz = _mm_and_ps(_mm_div_ps(normalz, area), areanz); __m128 distance = _mm_add_ps( _mm_mul_ps(normalx, px0), _mm_add_ps( _mm_mul_ps(normaly, py0), _mm_mul_ps(normalz, pz0))); __m128 negdistance = _mm_sub_ps(_mm_setzero_ps(), distance); // this computes the plane equations (a, b, c, d) for each of the 4 triangles __m128 plane0 = normalx; __m128 plane1 = normaly; __m128 plane2 = normalz; __m128 plane3 = negdistance; _MM_TRANSPOSE4_PS(plane0, plane1, plane2, plane3);

代码变得更长一些：现在，我们在每次迭代中处理四个三角形，并且为此，我们不再需要SSE4.1指令。从理论上讲，应该更有效地使用SIMD单元。让我们看看它如何帮助您。

好的，没关系。该代码已经非常缓慢地加速了，尽管fillCellQuadrics函数现在的运行速度几乎是没有SIMD的原始函数的两倍，但是尚不清楚这是否可以证明复杂性的显着提高。从理论上讲，您可以使用AVX2并在每次迭代中处理8个三角形，但是在这里您将需要进一步手动旋转循环（理想情况下，所有这些代码都是使用ISPC生成的，但是我幼稚地尝试使其生成良好代码的尝试并未成功：代替加载/存储序列他持续发出收集/分散信息，这会大大降低执行速度）。让我们尝试其他事情。

AVX2 = SSE2 + SSE2

AVX2只是一组指令。它具有8个宽的浮点寄存器，一条指令可以执行8个操作；但是从本质上讲，这样的指令与在寄存器的两半执行的两条SSE2指令没有区别（据我所知，具有AVX2的第一批处理器支持以两个或多个微操作对每个指令进行解码，因此性能提升受到提取指令阶段的限制）。例如， _mm_dp_ps在两个SSE2寄存器之间执行标量积，而_mm256_dp_ps在两个AVX2寄存器的两半之间产生两个标量积，因此，每半只限于4宽。

因此，AVX2代码通常与通用的“ 8宽SIMD”有所不同，但在这里它对我们有利：不是试图通过转置4宽矢量来改善矢量化，而是返回到SIMD的第一个版本，并使用AVX2指令（而不是SSE2）将循环加倍/ SSE4。我们仍然需要加载和存储4宽向量，但是通常，我们只需使用以下几种设置_mm256将_mm256中的_mm_更改为__m256 ，将_mm256更改为__m256 ：

 unsigned int i00 = indices[(i + 0) * 3 + 0]; unsigned int i01 = indices[(i + 0) * 3 + 1]; unsigned int i02 = indices[(i + 0) * 3 + 2]; unsigned int i10 = indices[(i + 1) * 3 + 0]; unsigned int i11 = indices[(i + 1) * 3 + 1]; unsigned int i12 = indices[(i + 1) * 3 + 2]; __m256 p0 = _mm256_loadu2_m128( &vertex_positions[i10].x, &vertex_positions[i00].x); __m256 p1 = _mm256_loadu2_m128( &vertex_positions[i11].x, &vertex_positions[i01].x); __m256 p2 = _mm256_loadu2_m128( &vertex_positions[i12].x, &vertex_positions[i02].x); __m256 p10 = _mm256_sub_ps(p1, p0); __m256 p20 = _mm256_sub_ps(p2, p0); __m256 normal = _mm256_sub_ps( _mm256_mul_ps( _mm256_shuffle_ps(p10, p10, yzx), _mm256_shuffle_ps(p20, p20, zxy)), _mm256_mul_ps( _mm256_shuffle_ps(p10, p10, zxy), _mm256_shuffle_ps(p20, p20, yzx))); __m256 areasq = _mm256_dp_ps(normal, normal, dp_xyz); __m256 area = _mm256_sqrt_ps(areasq); __m256 areanz = _mm256_cmp_ps(area, _mm256_setzero_ps(), _CMP_NEQ_OQ); normal = _mm256_and_ps(_mm256_div_ps(normal, area), areanz); __m256 distance = _mm256_dp_ps(normal, p0, dp_xyz); __m256 negdistance = _mm256_sub_ps(_mm256_setzero_ps(), distance); __m256 normalnegdist = _mm256_blend_ps(normal, negdistance, 0x88); __m256 Qx = _mm256_mul_ps(normal, normal); __m256 Qy = _mm256_mul_ps( _mm256_shuffle_ps(normal, normal, _MM_SHUFFLE(3, 2, 2, 1)), _mm256_shuffle_ps(normal, normal, _MM_SHUFFLE(3, 0, 1, 0))); __m256 Qz = _mm256_mul_ps(negdistance, normalnegdist);

在这里，您可以获取接收到的Qx / Qz / Qz每128位一半，并运行用于添加二次项的相同代码。相反，我们假设如果一个三角形在一个单元格中具有三个顶点（ single_cell == true ），那么另一个三角形在另一个单元格中很有可能具有三个顶点，并且我们也使用AVX2执行二次曲面的最终聚合：

 unsigned int c00 = vertex_cells[i00]; unsigned int c01 = vertex_cells[i01]; unsigned int c02 = vertex_cells[i02]; unsigned int c10 = vertex_cells[i10]; unsigned int c11 = vertex_cells[i11]; unsigned int c12 = vertex_cells[i12]; bool single_cell = (c00 == c01) & (c00 == c02) & (c10 == c11) & (c10 == c12); if (single_cell) { area = _mm256_mul_ps(area, _mm256_set1_ps(3.f)); Qx = _mm256_mul_ps(Qx, area); Qy = _mm256_mul_ps(Qy, area); Qz = _mm256_mul_ps(Qz, area); Quadric& q00 = cell_quadrics[c00]; Quadric& q10 = cell_quadrics[c10]; __m256 q0x = _mm256_loadu2_m128(&q10.a00, &q00.a00); __m256 q0y = _mm256_loadu2_m128(&q10.a10, &q00.a10); __m256 q0z = _mm256_loadu2_m128(&q10.b0, &q00.b0); _mm256_storeu2_m128(&q10.a00, &q00.a00, _mm256_add_ps(q0x, Qx)); _mm256_storeu2_m128(&q10.a10, &q00.a10, _mm256_add_ps(q0y, Qy)); _mm256_storeu2_m128(&q10.b0, &q00.b0, _mm256_add_ps(q0z, Qz)); } else { // omitted for brevity }

所产生的代码比我们不成功的SSE2方法更简单，简明和快速：

当然，我们没有实现8倍的加速，而只有2.45倍。加载和存储操作仍然是4宽，因为由于动态索引，我们不得不使用不舒服的内存布局，并且计算对于SIMD并不是最佳的。但是现在fillCellQuadrics不再是我们配置文件管道中的瓶颈，您可以专注于其他功能。

孩子们聚集

我们在测试运行中节省了4.8秒（每次运行节省了48毫秒），现在我们的主要入侵者是countTriangles 。这似乎是一个简单的函数，但是它执行的不是一次，而是五次：

 static size_t countTriangles(const unsigned int* vertex_ids, const unsigned int* indices, size_t index_count) { size_t result = 0; for (size_t i = 0; i < index_count; i += 3) { unsigned int id0 = vertex_ids[indices[i + 0]]; unsigned int id1 = vertex_ids[indices[i + 1]]; unsigned int id2 = vertex_ids[indices[i + 2]]; result += (id0 != id1) & (id0 != id2) & (id1 != id2); } return result; }

该函数遍历所有源三角形，并通过比较顶点的标识符来计算非退化三角形的数量。除非您使用收集指令，否则目前尚不清楚如何使用SIMD对其进行并行化。

AVX2指令集向x64 SIMD添加了一个收集/分散指令系列。它们每个都使用具有4或8个值的向量寄存器-同时执行4或8个加载或保存操作。如果在此处使用收集，则可以下载三个索引，一次运行所有收集（或以4或8为一组）并比较结果。从历史上看，英特尔处理器上的收集速度相当缓慢，但让我们尝试一下。为简单起见，我们上载了8个三角形的数据，对向量进行类似于先前尝试的转置，并比较了每个向量的对应元素：

 for (size_t i = 0; i < (triangle_count & ~7); i += 8) { __m256 tri0 = _mm256_loadu2_m128( (const float*)&indices[(i + 4) * 3 + 0], (const float*)&indices[(i + 0) * 3 + 0]); __m256 tri1 = _mm256_loadu2_m128( (const float*)&indices[(i + 5) * 3 + 0], (const float*)&indices[(i + 1) * 3 + 0]); __m256 tri2 = _mm256_loadu2_m128( (const float*)&indices[(i + 6) * 3 + 0], (const float*)&indices[(i + 2) * 3 + 0]); __m256 tri3 = _mm256_loadu2_m128( (const float*)&indices[(i + 7) * 3 + 0], (const float*)&indices[(i + 3) * 3 + 0]); _MM_TRANSPOSE8_LANE4_PS(tri0, tri1, tri2, tri3); __m256i i0 = _mm256_castps_si256(tri0); __m256i i1 = _mm256_castps_si256(tri1); __m256i i2 = _mm256_castps_si256(tri2); __m256i id0 = _mm256_i32gather_epi32((int*)vertex_ids, i0, 4); __m256i id1 = _mm256_i32gather_epi32((int*)vertex_ids, i1, 4); __m256i id2 = _mm256_i32gather_epi32((int*)vertex_ids, i2, 4); __m256i deg = _mm256_or_si256( _mm256_cmpeq_epi32(id1, id2), _mm256_or_si256( _mm256_cmpeq_epi32(id0, id1), _mm256_cmpeq_epi32(id0, id2))); result += 8 - _mm_popcnt_u32(_mm256_movemask_epi8(deg)) / 4; }

AVX2中的_MM_TRANSPOSE8_LANE4_PS宏等效于_MM_TRANSPOSE4_PS ，该宏在标准标题中找不到，但很容易显示。它需要四个AVX2向量，并且彼此独立地转置两个4×4矩阵：

 #define _MM_TRANSPOSE8_LANE4_PS(row0, row1, row2, row3) \ do { \ __m256 __t0, __t1, __t2, __t3; \ __t0 = _mm256_unpacklo_ps(row0, row1); \ __t1 = _mm256_unpackhi_ps(row0, row1); \ __t2 = _mm256_unpacklo_ps(row2, row3); \ __t3 = _mm256_unpackhi_ps(row2, row3); \ row0 = _mm256_shuffle_ps(__t0, __t2, _MM_SHUFFLE(1, 0, 1, 0)); \ row1 = _mm256_shuffle_ps(__t0, __t2, _MM_SHUFFLE(3, 2, 3, 2)); \ row2 = _mm256_shuffle_ps(__t1, __t3, _MM_SHUFFLE(1, 0, 1, 0)); \ row3 = _mm256_shuffle_ps(__t1, __t3, _MM_SHUFFLE(3, 2, 3, 2)); \ } while (0)

由于SSE2 / AVX2指令集的某些功能，转置向量时必须使用浮点寄存器操作。我们有点粗心地加载数据；但这基本上没有关系，因为收集性能会限制我们：

现在countTriangles速度提高了约27％，并注意到了可怕的CPI（每条指令的周期数）：我们发出的指令少了四倍，但收集却花费了很多时间。加快整体工作的速度很好，但是，当然，性能提升会令人沮丧。我们设法超过了概要文件中的fillCellQuadrics ，这使我们进入了尚未查看的列表顶部的最后一个函数。

第6章，一切都应开始进行

最后一个函数是computeVertexIds 。在算法中，它执行了6次，因此它也代表了优化的绝佳目标。第一次，我们看到一个似乎是为了在SIMD中进行清晰优化而创建的函数：

 static void computeVertexIds(unsigned int* vertex_ids, const Vector3* vertex_positions, size_t vertex_count, int grid_size) { assert(grid_size >= 1 && grid_size <= 1024); float cell_scale = float(grid_size - 1); for (size_t i = 0; i < vertex_count; ++i) { const Vector3& v = vertex_positions[i]; int xi = int(vx * cell_scale + 0.5f); int yi = int(vy * cell_scale + 0.5f); int zi = int(vz * cell_scale + 0.5f); vertex_ids[i] = (xi << 20) | (yi << 10) | zi; } }

经过之前的优化，我们知道该怎么做：将循环展开4或8次，因为试图加速一次迭代，转置向量分量并并行运行计算没有任何意义。让我们使用AVX2进行处理，一次处理8个顶点：

 __m256 scale = _mm256_set1_ps(cell_scale); __m256 half = _mm256_set1_ps(0.5f); for (size_t i = 0; i < (vertex_count & ~7); i += 8) { __m256 vx = _mm256_loadu2_m128( &vertex_positions[i + 4].x, &vertex_positions[i + 0].x); __m256 vy = _mm256_loadu2_m128( &vertex_positions[i + 5].x, &vertex_positions[i + 1].x); __m256 vz = _mm256_loadu2_m128( &vertex_positions[i + 6].x, &vertex_positions[i + 2].x); __m256 vw = _mm256_loadu2_m128( &vertex_positions[i + 7].x, &vertex_positions[i + 3].x); _MM_TRANSPOSE8_LANE4_PS(vx, vy, vz, vw); __m256i xi = _mm256_cvttps_epi32( _mm256_add_ps(_mm256_mul_ps(vx, scale), half)); __m256i yi = _mm256_cvttps_epi32( _mm256_add_ps(_mm256_mul_ps(vy, scale), half)); __m256i zi = _mm256_cvttps_epi32( _mm256_add_ps(_mm256_mul_ps(vz, scale), half)); __m256i id = _mm256_or_si256( zi, _mm256_or_si256( _mm256_slli_epi32(xi, 20), _mm256_slli_epi32(yi, 10))); _mm256_storeu_si256((__m256i*)&vertex_ids[i], id); }

然后看一下结果：

我们加速了computeVertexIds两次。考虑到所有优化，该程序的总执行时间减少到大约120毫秒，这相当于每秒计算5000万个三角形。

似乎我们又没有达到预期的生产率增长：computeVertexIds并行化之后，它是否应该加速两倍以上？为了回答这个问题，让我们尝试看看该函数执行了多少工作。

computeVertexIds它在一个程序开始时执行六次：在二进制搜索中执行五次，在结束时计算一次，以计算用于进一步处理的最终标识符。每次，此函数处理300万个顶点，每个顶点读取12个字节，并写入4个字节。

总共有100多次创新者运行，此功能处理18亿个顶点，读取21 GB，回写7 GB。在1.46秒内处理28 GB需要19 GB / s的总线带宽。我们可以通过运行来检查内存带宽memcmp(block1, block2, 512 MB)。结果是45毫秒，即一个内核约22 GB / s（尽管AIDA64基准测试显示我的系统上的读取速度高达31 GB / s，但它使用多个内核）。实际上，我们已经接近最大可达到的内存限制，并且要进一步提高性能，将需要更紧密地打包这些顶点，以便它们占用的空间少于12个字节。

结论

我们采用了一种非常优化的算法，该算法以每秒2800万个三角形的速度简化了非常大的网格，并使用SSE和AVX指令集将其速度几乎提高了两倍，达到了每秒5000万个三角形。在此过程中，我们必须学习使用SIMD的不同方法：用于存储3-wide向量的寄存器，SoA换位，用于存储两个3-wide向量的AVX2指令，与标量指令相比，收集指令以加快数据加载，最后我们直接将AVX2应用于流处理。

SIMD通常不是优化的最佳起点：网格合理化器在不使用特定于平台的指令的情况下经历了算法优化和微优化的许多迭代。但是在某些时候，这些可能性已被穷尽，如果性能至关重要，那么SIMD是一个很棒的工具，可以在必要时使用。

我不确定这些优化中的哪一个会落入主分支meshoptimizer：最后，这只是一个实验，目的是在不大幅改变算法的情况下查看代码的超频情况。希望本文内容丰富，并能为您提供优化代码的想法。本文的最终资料在这里 ;这项工作基于meshoptimizer 99ab49的版本，而泰国佛陀模型则在Sketchfab上发布。

各种SIMD

七次测量

一张一张

AVX2 = SSE2 + SSE2

孩子们聚集

第6章，一切都应开始进行

结论

More articles: