今天，我们将把Go中的短线粘合速度提高30％。为此，我们不需要修改Go本身，所有这些都将作为第三方库实现。

根据削减，您正在等待：

比较+ ， strings.Builder和本机串联函数
转到内部行详细信息
相当多的汇编器

本文也可以作为讨论CL123256的借口：运行时，cmd / compile：专长concatstring2 。欢迎提出改进此更改列表的想法。

立即结果

与编译器的go tip （主版）版本进行了比较。您可以在Go 1.5左右的版本上获得类似的结果。对concatstrings函数的最后一个重要更改是CL3120：cmd / gc：为堆栈上未转义的字符串分配缓冲区。

 BenchmarkConcat2Operator-8 20000000 83.8 ns/op BenchmarkConcat2Builder-8 20000000 70.9 ns/op BenchmarkConcat2-8 20000000 62.1 ns/op BenchmarkConcat3Operator-8 20000000 104 ns/op BenchmarkConcat3Builder-8 20000000 89.9 ns/op BenchmarkConcat3-8 20000000 82.1 ns/op

ConcatOperator使用+ 。
ConcatBuilder使用正确预分配的strings.Builder 。
Concat使用了我们在本故事中实现的功能。

通过Benchstat比较：

 name old time/op new time/op delta Concat2-8 84.2ns ± 1% 62.7ns ± 2% -25.49% (p=0.000 n=9+10) Concat3-8 103ns ± 3% 83ns ± 4% -19.83% (p=0.000 n=10+9)

GOARCH=AMD64下的汇编程序实现稍快一些，并具有附加的优化功能，该功能存在于内置+运算符中，但在以下内容中有更多介绍：

 name old time/op new time/op delta Concat2-8 84.2ns ± 1% 57.1ns ± 3% -32.20% (p=0.000 n=9+9)

我们将把汇编器功能视为100％的性能（相对于其余考虑的实现）。

在README.md中可以看到更长行的结果。字符串越长，实现之间的差异就越不明显。

天真的串联

最简单的解决方案是使用+运算符。

该语句的语义如下：占用两行并返回包含两行串联的结果字符串。不能保证将返回新行。例如，如果存在一个空字符串和其他字符串的串联，则运行时可能会返回一个非空参数，从而避免了分配新内存并在其中复制数据的需要。

但是，从本文开头的结果可以看出，这是最慢的方法。

 func concat2operator(x, y string) string { return x + y }

绩效等级： 67.8％ 。

strings.Builder

不久前，一个新类型被添加到Go- strings.Builder中。这是bytes.Buffer的类似物，但是在调用String()方法时，不会重新分配内存，也不会复制数据。

与bytes.Buffer不同，builder没有优化小缓冲区，因此没有为字符串预先分配内存。如果不使用Grow方法，则性能将比使用bytes.Buffer差。 Go 1.11中的一些回归是由此特定功能引起的（请参阅CL113235 ）。

在我们的代码中，出于实验的纯度，我们将避免该错误。

 func concat2builder(x, y string) string { var builder strings.Builder builder.Grow(len(x) + len(y)) //      builder.WriteString(x) builder.WriteString(y) return builder.String() }

绩效等级： 80.5％ （+12.7）。

级联代码生成

如果我们看一下编译器为+运算符生成的代码，我们将看到对concatstring2 ， concatstring3函数等的调用（最多包括concatstring5 ）。

 func concat2codegen(x, y) string { return x + y } // => CALL runtime.concatstring2(SB) func concat3codegen(x, y, z) string { return x + y + z } // => CALL runtime.concatstring3(SB)

看一下runtime / string.go本身：

 func concatstring2(buf *tmpBuf, a [2]string) string { return concatstrings(buf, a[:]) } func concatstring3(buf *tmpBuf, a [3]string) string { return concatstrings(buf, a[:]) }

因此，剩下的要学习功能concatstrings 。
扰流板下面提供了完整的列表，但这是高级描述：

buf参数可以为nil 。如果该行未从其定义中“转义”，则由编译器分配此缓冲区。如果字符串的寿命比帧长，则此缓冲区将始终nil （这是最常见的情况）。但是，如果此缓冲区可用，则可以避免分配，以防结果闯入缓冲区（其大小为32字节）。
如果除一行以外的所有行均为空，则函数将返回此行。但是同时，在堆栈上选择并离开其帧的行会绕过此优化，因此调用方不会收到已经释放的内存。
此外，所有行均被复制到新存储器中。

concatstrings函数的完整列表

 // concatstrings implements a Go string concatenation x+y+z+... // The operands are passed in the slice a. // If buf != nil, the compiler has determined that the result does not // escape the calling function, so the string data can be stored in buf // if small enough. func concatstrings(buf *tmpBuf, a []string) string { idx := 0 l := 0 count := 0 for i, x := range a { n := len(x) if n == 0 { continue } if l+n < l { throw("string concatenation too long") } l += n count++ idx = i } if count == 0 { return "" } // If there is just one string and either it is not on the stack // or our result does not escape the calling frame (buf != nil), // then we can return that string directly. if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) { return a[idx] } s, b := rawstringtmp(buf, l) for _, x := range a { copy(b, x) b = b[len(x):] } return s }

在这里，我们同时看到几个可以针对特定情况进行优化的地方：

buf通常是空的。当编译器无法证明该字符串可安全放置在堆栈中时，传递一个额外的参数并在函数内部检查nil仅会产生开销。
对于len(a) == 2的特殊情况len(a) == 2我们不需要循环，因此可以简化计算。这是最常见的串联形式。

串联统计

当执行./make.bash （ ./make.bash Go编译器和stdlib）时，我们看到445个带有两个操作数的连接：

398个结果正在消失。在这种情况下，我们的专业化是有意义的。
47结果不会离开您的框架。

来自两个参数的总计89％的串联得到了优化。

对于go实用程序，我们有：

501呼叫concatstring2
194个电话concatstring3
55个电话concatstring4

适用于所有架构的版本

要实现专业化，我们需要知道Go中的行是如何表示的。二进制兼容性对我们很重要，而unsafe.Pointer可以用*byte替换而无需任何牺牲。

 type stringStruct struct { str *byte len int }

我们可以从运行时得出的第二个重要结论是：行开始其生命可变。分配一块由[]byte引用的内存，将新行的内容写入其中，并且仅在丢弃[]byte之后，并将其引用的内存存储在stringStruct 。

对于那些想了解更多细节的人，建议研究rawstringtmp和rawstring的功能。

runtime.rawstring

 // rawstring allocates storage for a new string. The returned // string and byte slice both refer to the same storage. // The storage is not zeroed. Callers should use // b to set the string contents and then drop b. func rawstring(size int) (s string, b []byte) { p := mallocgc(uintptr(size), nil, false) stringStructOf(&s).str = p stringStructOf(&s).len = size *(*slice)(unsafe.Pointer(&b)) = slice{p, size, size} return }

我们可以使用unsafe软件包的阴暗面来大致提高速度：

 func concat2(x, y string) string { length := len(x) + len(y) if length == 0 { return "" } b := make([]byte, length) copy(b, x) copy(b[len(x):], y) return goString(&b[0], length) }

我们突出显示[]byte ，我们用它来形成新行的内容。然后，我们只能通过将其带到预期的运行时表示形式来完成该行。 goString函数负责：

 func goString(ptr *byte, length int) string { s := stringStruct{str: ptr, len: length} return *(*string)(unsafe.Pointer(&s)) }

绩效评级： 91.9％ （+10.9）。

AMD64版本

不幸的是，该函数的先前版本没有优化与空字符串的连接，并且由于无法直接分配内存，我们还执行了许多不必要的计算，我们必须使用字节片。

Go汇编器有趣的功能之一是它允许您调用例如不可导出的运行时函数。我们可以从汇编代码中调用runtime·mallocgc ，即使它不是runtime包的一部分。我们将使用此属性。

我们还可以检查堆栈存储器行的所有权，这使得安全地优化作为结果的参数之一的返回是安全的。

假设使用参数concat2("", "123")调用函数。 x是一个空字符串，如果y没有分配在堆栈上，我们可以将其作为串联的结果返回。

 //; ,  x  y   stringStruct. //; CX - y.str. //; SI - y.len. maybe_return_y: //;      . MOVQ (TLS), AX //; *g CMPQ CX, (AX) JL return_y //;  y_str < g.stack.lo CMPQ CX, 8(AX) JGE return_y //;  y_str >= g.stack.hi JMP concatenate //; y  ,    return_y: MOVQ CX, ret+32(FP) //; stringStruct.len MOVQ SI, ret+40(FP) //; stringStruct.str RET

MOVQ (TLS), AX * g移至AX寄存器。以零偏移量读取将得到g.stack.lo字段，而g.stack.hi从第8个字节开始（对于64位平台）。

 type g struct { stack struct { lo uintptr // 0(AX) hi uintptr // 8(AX) } stackguard0 uintptr // 16(AX) stackguard1 uintptr // 24(AX) // ...   }

concatenate主体分配内存，用两行填充它，然后返回新行。

有评论的完整清单

 #include "textflag.h" #include "funcdata.h" TEXT ·Strings(SB), 0, $48-48 NO_LOCAL_POINTERS //    . MOVQ x+0(FP), DX MOVQ x+8(FP), DI MOVQ y+16(FP), CX MOVQ y+24(FP), SI TESTQ DI, DI JZ maybe_return_y // x -  ,   y TESTQ SI, SI JZ maybe_return_x // y -  ,   x concatenate: LEAQ (DI)(SI*1), R8 // len(x) + len(y) //     . MOVQ R8, 0(SP) MOVQ $0, 8(SP) MOVB $0, 16(SP) CALL runtime·mallocgc(SB) MOVQ 24(SP), AX //     MOVQ AX, newstr-8(SP) //  x. MOVQ x+0(FP), DX MOVQ x+8(FP), DI MOVQ AX, 0(SP) MOVQ DX, 8(SP) MOVQ DI, 16(SP) CALL runtime·memmove(SB) //  y   len(x). MOVQ x+8(FP), DI MOVQ y+16(FP), CX MOVQ y+24(FP), SI MOVQ newstr-8(SP), AX LEAQ (AX)(DI*1), BX MOVQ BX, 0(SP) MOVQ CX, 8(SP) MOVQ SI, 16(SP) CALL runtime·memmove(SB) //   . MOVQ newstr-8(SP), AX MOVQ x+8(FP), R8 ADDQ y+24(FP), R8 MOVQ AX, ret+32(FP) MOVQ R8, ret+40(FP) RET maybe_return_y: //      . MOVQ (TLS), AX // *g CMPQ CX, (AX) JL return_y //  y_ptr < stk.lo CMPQ CX, 8(AX) JGE return_y //  y_ptr >= stk.hi JMP concatenate // y  ,    return_y: MOVQ CX, ret+32(FP) MOVQ SI, ret+40(FP) RET maybe_return_x: //      . MOVQ (TLS), AX // *g CMPQ DX, (AX) JL return_x //  x_ptr < stk.lo CMPQ DX, 8(AX) JGE return_x //  x_ptr >= stk.hi JMP concatenate // x  ,    return_x: MOVQ DX, ret+32(FP) MOVQ DI, ret+40(FP) RET

如果您对这段代码中的NO_LOCAL_POINTERS的性质感兴趣，可以阅读从asm调用Go函数（“严重错误：缺少堆栈映射”）。

绩效等级： 100％ （+8.6）。

总结

所有代码均作为concat软件包提供。

世界已经准备好进行如此快速的串联了吗？谁知道

在本文的开头，提到了CL123256 。他有几种发展道路：

编译器未分配临时缓冲区的情况下的变体专用化。每种情况的增长都较少，但是它涵盖了更多的串联类型，并且实际上并没有增加代码的大小（包括机器代码和Go代码）。
针对特殊情况的更多专业知识。更高的收益，但是更多的机器代码，可能会损害指令缓存。
针对每种特殊情况和特殊记忆的大量机器代码，以glibc中的方式进行。这里主要是权宜之计。

当前提出的选项只会加速最常见的一对字符串串联（arity = 2）的情况。

如果Go不接受此更改，则可以通过以第三方库的形式实现字符串操作来实现可比的加速。不太方便，美观大方，但可以。

在Go中更快地自己动手字符串连接