golang pprof monitoring series (4): how goroutine and thread statistics work

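In earlier articles of this golang pprof monitoring series I covered go trace and how the go pprof tool collects statistics for memory, block, and mutex. Today we continue with how pprof collects statistics for goroutines and threads.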

As usual, before getting into how the statistics are collected, let's look at what information is exposed through the HTTP interface.

Exposing the profiles over HTTP
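For readers who want to reproduce the pages discussed in this section, here is a minimal sketch. It assumes the handlers come from the standard net/http/pprof package, which registers them on http.DefaultServeMux when imported. The helper name renderProfile is made up for this example, and it calls the mux in-process through httptest instead of opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"strings"
)

// renderProfile (a hypothetical helper) asks the default mux for one
// profile page and returns the body, without opening a network port.
func renderProfile(name string, debug int) string {
	url := fmt.Sprintf("/debug/pprof/%s?debug=%d", name, debug)
	req := httptest.NewRequest("GET", url, nil)
	rec := httptest.NewRecorder()
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	for _, name := range []string{"goroutine", "threadcreate"} {
		page := renderProfile(name, 1)
		// The first line has the form "<name> profile: total <count>".
		fmt.Println(strings.SplitN(page, "\n", 2)[0])
	}
}
```

A long-running service would more typically call http.ListenAndServe with a nil handler and open the same paths in a browser; the listen address is up to the service.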

What the goroutine profile output contains

That brings up a page like the one shown; let's go through the information on it piece by piece:

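With debug=2 the output looks like the screenshot above: 41 is the goroutine's id, running inside the square brackets means the goroutine's state is running, and what follows is that goroutine's stack at that moment.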

goroutine profile: total 6
1 @ 0x102ad6c60 0x102acf7f4 0x102b04de0 0x102b6e850 0x102b6e8dc 0x102b6f79c 0x102c27d04 0x102c377c8 0x102d0fc74 0x102bea72c 0x102bebec0 0x102bebf4c 0x102ca4af0 0x102ca49dc 0x102d0b084 0x102d10f30 0x102d176a4 0x102b09fc4
#	0x102b04ddf	internal/poll.runtime_pollWait+0x5f		/Users/xiongchuanhong/goproject/src/go/src/runtime/netpoll.go:303
#	0x102b6e84f	internal/poll.(*pollDesc).wait+0x8f		/Users/xiongchuanhong/goproject/src/go/src/internal/poll/fd_poll_runtime.go:84

......

goroutine profile states the type of this profile, and total 6 is the number of goroutines it recorded.

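Next comes the line below it. The 1 means that only one goroutine is executing on this stack. In fact, the count is not keyed on the stack alone: the goroutine's labels are part of it too. The stack and the labels together form a key, and the number 1 is the result of iterating over all goroutines and incrementing a counter for each identical key.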
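The counting rule can be sketched in a few lines. This is only an illustration of the idea, not the code in runtime/pprof: the record type and countByKey are hypothetical, and stacks and labels are reduced to plain strings.

```go
package main

import "fmt"

// record is a hypothetical, simplified stand-in for one goroutine's
// sample. Being a comparable struct, it can serve directly as a map key.
type record struct {
	stack  string // the call stack, flattened to a string for illustration
	labels string // the rendered label set, empty when none is set
}

// countByKey increments one counter per distinct (stack, labels) pair.
func countByKey(records []record) map[record]int {
	counts := make(map[record]int)
	for _, r := range records {
		counts[r]++
	}
	return counts
}

func main() {
	records := []record{
		{stack: "main.worker", labels: ""},
		{stack: "main.worker", labels: ""},
		{stack: "main.worker", labels: `{"name":"lanpangzi"}`},
	}
	counts := countByKey(records)
	// Same stack, different labels: two separate entries.
	fmt.Println(counts[record{"main.worker", ""}])                     // prints 2
	fmt.Println(counts[record{"main.worker", `{"name":"lanpangzi"}`}]) // prints 1
}
```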

	pprof.SetGoroutineLabels(pprof.WithLabels(context.Background(), pprof.Labels("name", "lanpangzi", "age", "18")))

The call above sets two labels on the current goroutine, name and age. With the labels set, the debug=1 page now shows them in the output.

1 @ 0x104f86c60 0x104fb7358 0x105236368 0x104f867ec 0x104fba024
# labels: {"age":"18", "name":"lanpangzi"}
#	0x104fb7357	time.Sleep+0x137	/Users/xiongchuanhong/goproject/src/go/src/runtime/time.go:193
#	0x105236367	main.main+0x437		/Users/xiongchuanhong/goproject/src/go/main/main.go:46
#	0x104f867eb	runtime.main+0x25b	/Users/xiongchuanhong/goproject/src/go/src/runtime/proc.go:255

After the number 1 comes the stack the goroutine is currently executing. That covers the output of the goroutine profile.

What the threadcreate output contains

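As usual, start with the address bar: debug=1 means the output is human-readable text. threadcreate has no special debug=2 output, and with debug=0 it likewise downloads a binary file that go tool pprof can analyze.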

The output below shows that a thread was created from the main function; inside runtime.newm, the runtime starts an OS thread.

Exposing the profiles from program code

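Having seen how these two profiles are exposed over HTTP, let's look at how to expose them from code.
Remember how "golang pprof monitoring series (2): using memory, block, mutex" exposed the memory, block, and mutex profiles from program code? goroutine and threadcreate work the same way: they are also exposed through pprof.Lookup.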

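	// goroutine profile, written as debug=1 text
	os.Remove("goroutine.out")
	f, _ := os.Create("goroutine.out")
	defer f.Close()
	err := pprof.Lookup("goroutine").WriteTo(f, 1)
	if err != nil {
		log.Fatal(err)
	}

	....

	// threadcreate profile, written as debug=1 text
	// (a separate snippet: in the same scope as the one above, use = instead of :=)
	os.Remove("threadcreate.out")
	f, _ := os.Create("threadcreate.out")
	defer f.Close()
	err := pprof.Lookup("threadcreate").WriteTo(f, 1)
	if err != nil {
		log.Fatal(err)
	}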

All it takes is changing the argument passed to pprof.Lookup to the name of the profile you want.

How the statistics are collected

Whether the output is for goroutine or threadcreate, the same function does the work: writeRuntimeProfile. The Go source quoted here is from go1.17.12.

// src/runtime/pprof/pprof.go:708
func writeRuntimeProfile(w io.Writer, debug int, name string, fetch func([]runtime.StackRecord, []unsafe.Pointer) (int, bool)) error {
	var p []runtime.StackRecord
	var labels []unsafe.Pointer
	n, ok := fetch(nil, nil)
	for {
		p = make([]runtime.StackRecord, n+10)
		labels = make([]unsafe.Pointer, n+10)
		n, ok = fetch(p, labels)
		if ok {
			p = p[0:n]
			break
		}
	}
	return printCountProfile(w, debug, name, &runtimeProfile{p, labels})
}

Let's walk through this function. It takes a fetch function as a parameter, and goroutine and threadcreate each supply a different fetch function to obtain their own data.

fetch writes the stack records into the slice p and the label information into the slice named labels.

With the stacks and the labels in hand, the next step is to print them. That is done by printCountProfile, called on the last line of the source above.

By now the output path for the goroutine and threadcreate profiles should be clear: fetch obtains the data, and printCountProfile prints it.
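The loop in writeRuntimeProfile (ask for the count, allocate a little extra, retry if the data grew in between) can be tried out with the exported runtime.GoroutineProfile, which has the same (n, ok) contract as fetch apart from the labels slice. This is a sketch of the pattern only; snapshotGoroutines is a hypothetical name.

```go
package main

import (
	"fmt"
	"runtime"
)

// snapshotGoroutines (hypothetical helper) returns one StackRecord per
// live goroutine, using the count-then-fill-with-slack loop.
func snapshotGoroutines() []runtime.StackRecord {
	// With a nil slice nothing is copied; only the count comes back.
	n, _ := runtime.GoroutineProfile(nil)
	for {
		// Leave room for goroutines created since the count was taken.
		p := make([]runtime.StackRecord, n+10)
		var ok bool
		n, ok = runtime.GoroutineProfile(p)
		if ok {
			return p[:n]
		}
		// More than n+10 goroutines by now: retry with the new count.
	}
}

func main() {
	records := snapshotGoroutines()
	fmt.Println("goroutines captured:", len(records))
}
```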

The source of the two functions that choose the fetch function is as follows:

// src/runtime/pprof/pprof.go:661
func writeThreadCreate(w io.Writer, debug int) error {
	return writeRuntimeProfile(w, debug, "threadcreate", func(p []runtime.StackRecord, _ []unsafe.Pointer) (n int, ok bool) {
		return runtime.ThreadCreateProfile(p)
	})
}

// src/runtime/pprof/pprof.go:680
func writeGoroutine(w io.Writer, debug int) error {
	if debug >= 2 {
		return writeGoroutineStacks(w)
	}
	return writeRuntimeProfile(w, debug, "goroutine", runtime_goroutineProfileWithLabels)
}

When the goroutine profile is written, runtime_goroutineProfileWithLabels is used to fetch the goroutine data, while threadcreate calls runtime.ThreadCreateProfile(p) to fetch the threadcreate data.

Implementation of the goroutine fetch function

// src/runtime/mprof.go:744
//go:linkname runtime_goroutineProfileWithLabels runtime/pprof.runtime_goroutineProfileWithLabels
func runtime_goroutineProfileWithLabels(p []StackRecord, labels []unsafe.Pointer) (n int, ok bool) {
	return goroutineProfileWithLabels(p, labels)
}

goroutineProfileWithLabels is the function that actually collects the goroutine stacks and labels.

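The logic of goroutineProfileWithLabels is fairly simple, so I will only summarize it. Internally it walks every goroutine through the global variable allgptr, which holds the addresses of all goroutines in the program. The goroutine struct g has a field named labels that holds the goroutine's labels, so the labels can be read from that field while walking the goroutines.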
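allgptr and the g struct are internal to the runtime, so user code cannot walk them directly. The closest exported route is runtime.GoroutineProfile, whose records can be turned into function names with runtime.CallersFrames. The sketch below illustrates that route; topFunctions is a hypothetical helper, and it assumes the slack of 10 is enough for the program at hand.

```go
package main

import (
	"fmt"
	"runtime"
)

// topFunctions (hypothetical helper) returns the innermost function
// name of every goroutine's recorded stack.
func topFunctions() []string {
	p := make([]runtime.StackRecord, runtime.NumGoroutine()+10)
	n, ok := runtime.GoroutineProfile(p)
	if !ok {
		return nil // more goroutines appeared than the slack allowed for
	}
	names := make([]string, 0, n)
	for _, rec := range p[:n] {
		// Stack() yields the program counters; CallersFrames symbolizes them.
		frames := runtime.CallersFrames(rec.Stack())
		frame, _ := frames.Next()
		names = append(names, frame.Function)
	}
	return names
}

func main() {
	for i, name := range topFunctions() {
		fmt.Printf("record %d: %s\n", i, name)
	}
}
```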

Implementation of the threadcreate fetch function

The source is as follows:

func ThreadCreateProfile(p []StackRecord) (n int, ok bool) {
	first := (*m)(atomic.Loadp(unsafe.Pointer(&allm)))
	for mp := first; mp != nil; mp = mp.alllink {
		n++
	}
	if n <= len(p) {
		ok = true
		i := 0
		for mp := first; mp != nil; mp = mp.alllink {
			p[i].Stack0 = mp.createstack
			i++
		}
	}
	return
}

It first loads the address held in allm. allm is a global variable that stores the head of the linked list of all m structures.

// src/runtime/runtime2.go:1092
var (
	allm       *m
	.....
)

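In Go, each time an m structure is created, an OS thread is created underneath it, so you can simply think of an m as representing a thread. The GPM model is worth studying in more depth afterwards.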

	for mp := first; mp != nil; mp = mp.alllink {
		p[i].Stack0 = mp.createstack
		i++
	}

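This part of ThreadCreateProfile walks the whole m list and copies the stack saved in each m into the parameter p, the array of stack records to be filled. In the m struct, alllink is a pointer to the next element of the list. Each time a new m is created, it is inserted at the head of the list and allm is updated to point to it.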
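The list handling is easier to see in isolation. The toy model below is not runtime code: the thread type, allThreads, register, and collect are hypothetical stand-ins for m, allm, the insertion done when an m is created, and ThreadCreateProfile. It also runs on a single goroutine, so it leaves out the atomic load that the runtime uses.

```go
package main

import "fmt"

// thread is a hypothetical stand-in for the runtime's m struct.
type thread struct {
	id      int
	alllink *thread // next element of the list, like m.alllink
}

// allThreads plays the role of the global allm: the head of the list.
var allThreads *thread

// register inserts a new element at the head and updates the head.
func register(id int) {
	allThreads = &thread{id: id, alllink: allThreads}
}

// collect has the two-pass shape of ThreadCreateProfile: count first,
// then fill out only when it is large enough.
func collect(out []int) (n int, ok bool) {
	first := allThreads
	for t := first; t != nil; t = t.alllink {
		n++
	}
	if n <= len(out) {
		ok = true
		i := 0
		for t := first; t != nil; t = t.alllink {
			out[i] = t.id
			i++
		}
	}
	return
}

func main() {
	for id := 1; id <= 3; id++ {
		register(id)
	}
	out := make([]int, 4)
	n, ok := collect(out)
	// The newest element comes first because insertion is at the head.
	fmt.Println(n, ok, out[:n]) // prints "3 true [3 2 1]"
}
```

When the slice is too small, collect still returns the count with ok set to false, which is what lets a caller such as writeRuntimeProfile size its next attempt.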

Summary

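That covers both the usage and the internals of goroutine and threadcreate. Compared with the memory and block statistics from earlier articles they are relatively simple: in short, the runtime walks a global variable, allgptr or allm, collects each goroutine's or thread's stack and labels along the way, and then prints that information.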
