Tag: perf

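Counting CPU cycles with "perf_event" in C yields different values than "perf"

I am trying to count the CPU cycles of a single process with a short C snippet. The MWE is cpucycles.c.

cpucycles.c (largely based on the man page example)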

    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>
    #include <asm/unistd.h>

    static long
    perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                    int cpu, int group_fd, unsigned long flags)
    {
        int ret;
        ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                        group_fd, flags);
        return ret;
    }

    long long
    cpu_cycles(pid_t pid, unsigned int microseconds)
    {
        struct perf_event_attr pe;
        long long count;
        int fd;

        memset(&pe, 0, sizeof(struct perf_event_attr));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(struct perf_event_attr);
        pe.config = PERF_COUNT_HW_CPU_CYCLES;
        pe.disabled = 1;
        pe.exclude_kernel = 1;
        pe.exclude_hv …

c performancecounter cpu-cycles perf

Chi*_*kus

3 votes

1 answer

1905 views

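What do the timestamps in perf mean?

I want to use "perf" to measure the actual execution time of a function. The "perf script" command prints a timestamp each time the function is called.

    Xorg  1523 [001] 25712.423702:    probe:sock_write_iter: (ffffffff95cd8b80)

The timestamp field has the form X.Y. How should I interpret this value? Is it X.Y seconds?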
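A note on reading that field, offered as a sketch rather than an authoritative answer: by default perf script prints the timestamp in seconds, with the six digits after the dot giving microseconds, so 25712.423702 is 25712 s plus 423702 µs on perf's clock. Which clock that is (commonly time since boot) is an assumption here. In practice the difference between two samples is what matters. The Python helper below is hypothetical (`parse_timestamp` is not part of perf), and the second sample line is invented purely to show the subtraction.

```python
import re
from decimal import Decimal

# Matches the "<seconds>.<fraction>:" field that follows the "[cpu]" column.
_TS_RE = re.compile(r"\]\s+(\d+\.\d+):")


def parse_timestamp(line):
    """Return the timestamp of one `perf script` line, in seconds."""
    match = _TS_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    # Decimal keeps the microsecond digits exact (no float rounding).
    return Decimal(match.group(1))


first = "Xorg  1523 [001] 25712.423702:    probe:sock_write_iter: (ffffffff95cd8b80)"
# Hypothetical second sample, made up for illustration only.
second = "Xorg  1523 [001] 25712.423950:    probe:sock_write_iter: (ffffffff95cd8b80)"

elapsed = parse_timestamp(second) - parse_timestamp(first)
elapsed_us = int(elapsed * 1_000_000)
print(f"{elapsed_us} us between the two samples")
```

To time a function this way, one would typically need a probe at entry and another at return, and subtract the paired timestamps per thread.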

linux timestamp perf

jj1*_*7jj

3 votes

1 answer

2394 views

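Linux `perf record --append` option is missing

Online man pages such as https://linux.die.net/man/1/perf-record suggest that the Linux perf command has an option that supports incremental profiling, namely perf record --append. However, the option is missing on my system, which has perf version 4.15.18. Is my perf version too new or too old for the --append option? Alternatively, if --append is unavailable, is there another way to merge or append perf results from multiple runs for incremental profiling?

This question came up while doing sampling-based profiling with LLVM. In LLVM, instrumentation-based profiling supports merging profile data from multiple runs, and I would like to know whether the same can be done with perf.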

linux perf

Bea*_*qua

2020-06-21

3 votes

1 answer

546 views

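Listing and using custom Linux kernel tracepoints

I followed the tutorial at https://www.kernel.org/doc/Documentation/trace/tracepoints.txt to create a custom tracepoint in the kernel core (i.e. not in a loadable module).

However, I do not see the tracepoint listed in the output of perf list or tplist (from the bcc tools).

So I do not know how to use the tracepoint.

Question: how do I make the tracepoint appear in the perf list/tplist output?

Thanks.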

linux linux-kernel perf tracepoint bcc-bpf

fpk*_*vdw

3 votes

1 answer

2415 views

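How to count the number of executed instructions for a process ID, including all future child threads

Some time ago I asked the question "How to count the number of executed instructions for a process id (including child processes)", and @M-Iduoad kindly provided a solution that uses pgrep to capture all child PIDs and passes them to -p in perf stat. It works great!

However, one problem I ran into concerns multithreaded applications and what happens when new threads are spawned. Since I am not a fortune teller (too bad!), I do not know the tid of newly spawned threads, so I cannot add them to the perf stat -p or -t arguments.

As an example, suppose I have a multithreaded Node.js server (deployed as a container on top of Kubernetes) with the following pstree: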

    root@node2:/home/m# pstree -p 4037791
    node(4037791)─┬─sh(4037824)───node(4037825)─┬─{node}(4037826)
                  │                             ├─{node}(4037827)
                  │                             ├─{node}(4037828)
                  │                             ├─{node}(4037829)
                  │                             ├─{node}(4037830)
                  │                             └─{node}(4037831)
                  ├─{node}(4037805)
                  ├─{node}(4037806)
                  ├─{node}(4037807)
                  ├─{node}(4037808)
                  ├─{node}(4037809)
                  ├─{node}(4037810)
                  ├─{node}(4037811)
                  ├─{node}(4037812)
                  ├─{node}(4037813)
                  └─{node}(4037814)

Of course, I can use the following perf stat command to observe its threads:

    perf stat --per-thread -e instructions,cycles,task-clock,cpu-clock,cpu-migrations,context-switches,cache-misses,duration_time -p $(pgrep --ns 4037791 | paste -s -d ",")

…
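The `pgrep ... | paste -s -d ","` part of that command only turns one PID per line into the comma-separated list that -p takes. As a minimal sketch of that step alone, assuming the PIDs arrive as plain text, the hypothetical Python helper below (`pids_to_perf_arg` is not part of perf or pgrep) does the same join and assembles an argument vector without running anything. It is still a snapshot: threads created after the list is built are not in it, which is exactly the limitation the question describes.

```python
def pids_to_perf_arg(pgrep_output):
    """Join one-PID-per-line text into the comma-separated form used with -p."""
    pids = [line.strip() for line in pgrep_output.splitlines() if line.strip()]
    for pid in pids:
        if not pid.isdigit():
            raise ValueError(f"not a PID: {pid!r}")
    return ",".join(pids)


# Sample text standing in for real pgrep output (PIDs taken from the pstree above).
sample = "4037791\n4037824\n4037825\n"
arg = pids_to_perf_arg(sample)

# The argument vector is only built and printed here, never executed.
command = ["perf", "stat", "--per-thread", "-e", "instructions,cycles", "-p", arg]
print(" ".join(command))
```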

linux performance profiling performance-testing perf

Mic*_*kan

2020-09-29

3 votes

1 answer

851 views

Difference between the mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram events

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss counts the number of demand (i.e. non-prefetch) data read accesses to DRAM. As its name suggests, offcore_response.demand_data_rd.l3_miss.local_dram counts the number of demand data reads that target DRAM. The two events therefore look equivalent (or at least nearly the same). But according to the following benchmarks, the former event occurs far less often than the latter:

1) Initializing a global array of 1000 elements in a loop in C:

     Performance counter stats for '/home/ahmad/Simple Progs/loop':

                 1,363      mem_load_uops_retired.l3_miss
                 1,543      offcore_response.demand_data_rd.l3_miss.local_dram

           0.000749574 seconds time elapsed

           0.000778000 seconds user
           0.000000000 seconds sys

2) Opening a PDF document in Evince:

     Performance counter stats for '/opt/evince-3.28.4/bin/evince':

               936,152      mem_load_uops_retired.l3_miss
             1,853,998      offcore_response.demand_data_rd.l3_miss.local_dram

           4.346408203 seconds time elapsed

           1.644826000 seconds user
           0.103411000 seconds sys

3) Running Wireshark for 5 seconds:

     Performance counter stats …

intel performancecounter memory-access perf intel-pmu

The*_*mad

2021-03-04

3 votes

1 answer

509 views

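How does the JVM collect a thread dump under the hood

Please explain how the JVM collects a thread dump under the hood.
I do not understand how it collects stack traces for threads that are off-CPU (waiting on disk I/O, the network, or an involuntary context switch).
For example, Linux perf only collects information about on-CPU threads (those consuming CPU cycles).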

java jvm jvm-hotspot jvmti perf

srg*_*321

3 votes

1 answer

354 views

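Unexpected performance drop when moving from a register to a frequently accessed variable

I am using the following example to understand how caches work: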
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t data_t;
    const int U = 10000000; // size of the array. 10 million vals ~= 40MB
    const int N = 100000000; // number of searches to perform

    int main() {
        data_t* data = (data_t*) malloc(U * sizeof(data_t));
        if (data == NULL) {
            free(data);
            printf("Error: not enough memory\n");
            exit(-1);
        }

        // fill up the array with sequential (sorted) values.
        int i;
        for (i = 0; i < U; i++) {
            data[i] = i;
        }

        printf("Allocated array of …

c assembly caching x86-64 perf

Ste*_*Mai

2023-07-27

3 votes

1 answer

138 views

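perf_event_open - how to monitor multiple events

Does anyone know how to set up the perf_event_attr structure so that perf_event_open() makes the PMU monitor multiple (types of) events?

For example, perf record -e cycles,faults ls uses two different event types (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE), but in the example on the perf_event_open man page, perf_event_attr.type can only be assigned a single value.

Any suggestions would be appreciated, thanks!

Update 2017-02-08: Thanks to @gudok for pointing me in the right direction, but the results look somewhat odd. The demo program is as follows (it measures CPU cycles and cache misses for the whole system):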

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>
    #include <linux/hw_breakpoint.h>
    #include <asm/unistd.h>
    #include <errno.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <time.h>

    struct read_format {
        uint64_t nr;
        struct {
            uint64_t value;
            uint64_t id;
        } values[];
    };

    int main(int argc, char* argv[]) {
        struct perf_event_attr pea;
        int fd1, fd2;
        uint64_t id1, id2;
        uint64_t val1, …
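For context, the `read_format` struct in that program matches the buffer layout the perf_event_open man page documents for reading a group leader opened with `PERF_FORMAT_GROUP | PERF_FORMAT_ID`: a 64-bit count `nr` followed by `nr` pairs of (`value`, `id`). The Python sketch below only illustrates how such a buffer maps back to events by ID. It does not call perf_event_open. The buffer is synthetic, the IDs and counts are invented, `decode_group_read` is a hypothetical helper, and native byte order is assumed.

```python
import struct


def decode_group_read(buf):
    """Decode {u64 nr; {u64 value; u64 id;} values[nr];} into {id: value}."""
    (nr,) = struct.unpack_from("=Q", buf, 0)
    expected = 8 + nr * 16
    if len(buf) < expected:
        raise ValueError(f"buffer too short: need {expected} bytes, got {len(buf)}")
    counts = {}
    for i in range(nr):
        # Each entry is 16 bytes: the counter value, then the event ID.
        value, event_id = struct.unpack_from("=QQ", buf, 8 + i * 16)
        counts[event_id] = value
    return counts


# Synthetic buffer: nr=2, then (value, id) pairs with invented IDs 101 and 102.
fake = struct.pack("=QQQQQ", 2, 123456, 101, 789, 102)
counts = decode_group_read(fake)
print(counts)
```

Matching on `id` rather than on position is the reason to request `PERF_FORMAT_ID`: the IDs can be obtained per file descriptor (the program's `id1` and `id2` suggest it does this) and then looked up in the decoded map.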

linux intel perf

Kan*_*son

2017-02-08

2 votes

1 answer

2553 views

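perf stat vs. perf record

I am confused about the difference between perf record and perf stat when it comes to counting events such as page faults, cache misses, and anything else from perf list. I have two questions below. The answer to "Question 1" might also help answer "Question 2", but I wrote both out explicitly in case it does not.

Question 1: My understanding is that perf stat gets a "summary" of counts, but when used with the -I option it gets the counts at the specified millisecond interval. With this option, does it sum the counts over the interval, take the average over the interval, or do something else entirely? I assume it sums them. The perf wiki says the counts are aggregated, but I guess that could mean either.

Question 2: Why doesn't perf stat -e <event1> -I 1000 sleep 5 give about the same counts as summing the per-second counts from the command perf record -e <event1> -F 1000 sleep 5?

For example, if I use "page-faults" as event1, I get the outputs listed below under each command. (I am assuming the period field is the count for the event in perf record's perf.data file.)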

PERF STAT

    perf stat -e page-faults -I 1000 sleep 5
    #           time             counts unit events
         1.000252928                 54      page-faults
         2.000498389      <not counted>      page-faults
         3.000569957      <not counted>      page-faults
         4.000659987      <not counted>      page-faults
         5.000837864                  2      page-faults

PERF RECORD

    perf record -e page-faults -F 1000 …
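To make the comparison in Question 2 concrete, the per-interval rows of the perf stat -I output can be added up. The Python sketch below does that for the output quoted in the question. `sum_interval_counts` is a hypothetical helper, and skipping `<not counted>` rows (treating them as contributing zero) is an assumption of this sketch, not something the question establishes.

```python
def sum_interval_counts(output, event):
    """Sum the counts column of `perf stat -I` rows for one event."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.endswith(event):
            continue
        fields = line.split()
        # fields[0] is the interval end time; fields[1] is the count,
        # unless the row reads "<not counted>".
        if fields[1].startswith("<"):
            continue  # assumption: a "<not counted>" row adds nothing
        total += int(fields[1].replace(",", ""))
    return total


sample = """\
#           time             counts unit events
     1.000252928                 54      page-faults
     2.000498389      <not counted>      page-faults
     3.000569957      <not counted>      page-faults
     4.000659987      <not counted>      page-faults
     5.000837864                  2      page-faults
"""

total = sum_interval_counts(sample, "page-faults")
print(f"page-faults over all intervals: {total}")
```

The total here is 54 + 2 = 56. That is the number to set against the sum of the period fields from perf record, assuming, as the question does, that period holds the event count.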

perf

Uni*_*ame

2 votes

1 answer

1284 views

Tag statistics

perf ×10

linux ×5

c ×2

intel ×2

performancecounter ×2

assembly ×1

bcc-bpf ×1

caching ×1

cpu-cycles ×1

intel-pmu ×1

java ×1

jvm ×1

jvm-hotspot ×1

jvmti ×1

linux-kernel ×1

memory-access ×1

performance ×1

performance-testing ×1

profiling ×1

timestamp ×1

tracepoint ×1

x86-64 ×1
