How to reliably reproduce curl_multi timeout while testing public proxies

Acc*_*t 10 php c bug-tracking curl curl-multi

Relevant information: issue 3602 on GitHub

I'm working on a project that gathers and tests public/free proxies. When I use the curl_multi interface to test these proxies, I sometimes get many error 28 (timeout) results. This never happens when I test each proxy individually.

The problem is that the issue cannot be reproduced reliably: it does not always show up, and it could be caused by curl or by something else.

Unfortunately, I'm not an experienced network debugger and I don't know how to investigate this at a deeper level. However, I wrote 2 C test programs (one of them was originally written by Daniel Stenberg, but I modified its output to match the format of the other program). These 2 C programs test 407 public proxies using curl:

  1. with curl_multi interface (which has the problem)

  2. with curl on many threads, one curl transfer per thread (which has no problem)

These are the 2 C programs I wrote for testing. I'm not a C developer, so please let me know about anything wrong you notice in them.

This is the original PHP class that I used for reproducing the issue a month ago.

And these are the test results of the 2 C programs. You can see that the tests done with curl_multi mostly time out, while the results from curl-threads are stable (about 50 of the 407 proxies are working).

This is a sample from the test results. Note columns 4 and 5: the curl-threads program times out ~170 times and connects successfully ~40 times. On the same 407 proxies, curl_multi makes 0 successful connections and times out ~300 times.

column(1) : #
column(2) : time(UTC)
column(3) : total execution time (seconds)
column(4) : no error 0 (how many requests result in no error CURLE_OK)
column(5) : error 28 (how many requests result in error 28 CURLE_OPERATION_TIMEDOUT)
column(6) : error 7 (how many requests result in error 7 CURLE_COULDNT_CONNECT)
column(7) : error 35 (how many requests result in error 35 CURLE_SSL_CONNECT_ERROR)
column(8) : error 56 (how many requests result in error 56 CURLE_RECV_ERROR)
column(9) : other errors (how many requests result in errors other than the above)
column(10) : program that used the curl
column(11) : cURL version

c(1)  c(2)                c(3)  c(4)  c(5)  c(6)  c(7)  c(8)  c(9)  c(10)                                     c(11)
267   2019-3-28 01:58:01  40    43    176   183   1     4     0     C (curl - threads) (Linux Fedora)         7.59.0
268   2019-3-28 01:59:01  30    0     286   110   1     10    0     C (curl-multi one thread) (Linux Fedora)  7.59.0
269   2019-3-28 02:00:01  30    46    169   181   1     8     2     C (curl - threads) (Linux Fedora)         7.59.0
270   2019-3-28 02:01:01  31    0     331   74    1     1     0     C (curl-multi one thread) (Linux Fedora)  7.59.0
271   2019-3-28 02:02:01  30    42    173   186   1     4     1     C (curl - threads) (Linux Fedora)         7.59.0
272   2019-3-28 02:03:01  30    0     277   116   1     13    0     C (curl-multi one thread) (Linux Fedora)  7.59.0
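To compare the two programs at a glance, the sample rows can be averaged per program. This is only a sketch: the function names `sample_rows` and `summarize` are made up for illustration, and the field positions assume the layout shown above, where the date and time of column 2 split into two whitespace-separated fields.

```shell
#!/bin/sh
# The six sample rows from the results above.
sample_rows() {
    cat <<'EOF'
267 2019-3-28 01:58:01 40 43 176 183 1 4 0 C (curl - threads) (Linux Fedora) 7.59.0
268 2019-3-28 01:59:01 30 0 286 110 1 10 0 C (curl-multi one thread) (Linux Fedora) 7.59.0
269 2019-3-28 02:00:01 30 46 169 181 1 8 2 C (curl - threads) (Linux Fedora) 7.59.0
270 2019-3-28 02:01:01 31 0 331 74 1 1 0 C (curl-multi one thread) (Linux Fedora) 7.59.0
271 2019-3-28 02:02:01 30 42 173 186 1 4 1 C (curl - threads) (Linux Fedora) 7.59.0
272 2019-3-28 02:03:01 30 0 277 116 1 13 0 C (curl-multi one thread) (Linux Fedora) 7.59.0
EOF
}

# Average the "no error" ($5) and "error 28" ($6) counts per program.
# Whitespace-split fields: $1 run number, $2 date, $3 time, $4 total seconds.
summarize() {
    awk '
        { kind = "" }
        /threads/ { kind = "threads" }
        /multi/   { kind = "multi" }
        kind != "" && NF >= 10 {
            runs[kind]++; ok[kind] += $5; timeouts[kind] += $6
        }
        END {
            n = split("threads multi", order, " ")
            for (i = 1; i <= n; i++) {
                k = order[i]
                if (runs[k] > 0)
                    printf "%s: runs=%d avg_ok=%.1f avg_timeout=%.1f\n", k, runs[k], ok[k] / runs[k], timeouts[k] / runs[k]
            }
        }
    '
}

sample_rows | summarize
# threads: runs=3 avg_ok=43.7 avg_timeout=172.7
# multi: runs=3 avg_ok=0.0 avg_timeout=298.0
```

The same function should work on a longer results file, as long as each row keeps this layout.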

Why does curl_multi time out on most of the connections, and only inconsistently, while curl-threads never does?

I downloaded Wireshark and used it to capture the traffic while each of the 2 C programs was running. I also filtered the traffic to the list of proxies used by the 2 C programs and saved the capture files on GitHub.

the curl-threads program (the expected behavior)

63 successful connections and 158 timeouts out of 407 proxies.

the curl_multi program (the unexpected behavior)

0 successful connections and 272 timeouts out of 407 proxies.

You can open the .pcapng files in Wireshark and see the traffic recorded on my computer for both the expected and the unexpected behavior. I filtered the traffic to the 407 proxy IPs and left Wireshark running for a little while after curl's 30-second limit, because I noticed some packets still showing up. I don't know Wireshark or this level of networking well, but I thought this could be useful.


Note on the bandwidth:

Open the .pcapng file of the curl-threads program (the normal behavior) in Wireshark and go to Statistics > Conversations. A window listing the conversations will open. (Screenshot: Wireshark Conversations window.)

I have copied the data and saved it here on GitHub. Now calculate the sum of the bytes sent from A->B and B->A.

The ENTIRE bandwidth needed to work normally is about 692.8 KB.
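The sum can be computed with a few lines of awk. This is a minimal sketch that assumes the Conversations data was saved as a comma-separated file with one header row; the function name `sum_bytes`, the file name, and the column numbers are hypothetical and depend on how the data was exported.

```shell
#!/bin/sh
# Sum two byte-count columns of a CSV file read from stdin.
# usage: sum_bytes COLUMN_A_TO_B COLUMN_B_TO_A < conversations.csv
# The header row (line 1) is skipped.
sum_bytes() {
    awk -F, -v a="$1" -v b="$2" '
        NR > 1 { total += $a + $b }
        END    { printf "%d bytes\n", total }
    '
}

# Example call (column numbers are an assumption about the export):
#   sum_bytes 6 8 < conversations.csv
```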

Jim*_*mix 1

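It seems to me that the problem is not curl itself, but that you are making too many simultaneous connections to the proxy servers. If connections are being refused, you may have been blacklisted, either permanently or for a period of time.

Check this by running curl from your current IP and collecting statistics: how many connections were established, how many were refused, and how many timed out. Repeat this several times and take the average. Then switch to another server with a different IP and check the statistics there. You should get better statistics on the first run, and they may get worse as you repeat the test on the new IP.

It may be a good idea not to use the whole proxy pool when collecting statistics, but to take a slice of it, check it from your current IP, and then repeat that check from the new IP. That way, if the cause is that you are abusing the service, you will not get yourself blacklisted on all proxies, and you will still have the next set of "untouched" proxies to test from a new IP (if that is indeed the case). Note that even if the proxies' IPs are in different locations, they can belong to the same service provider. That provider may keep a single abuse list for all of its proxy services, so if the volume of requests you make in one country is frowned upon, you may be blocked in other countries even before you connect to a proxy there.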
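The statistics for one slice could be collected roughly like this. It is only a sketch: the target URL, the slice size, the file name `proxies.txt`, and the function names `probe` and `tally` are assumptions, and the proxies are probed one after another rather than concurrently. It relies on the curl command-line tool using the same numbers as libcurl for its exit codes (0 for success, 7 for could not connect, 28 for timeout).

```shell
#!/bin/sh
# Hypothetical settings; adjust them to your own setup.
TARGET_URL=${TARGET_URL:-https://example.com/}
TIMEOUT=${TIMEOUT:-30}   # seconds, matching the 30-second limit of the tests
SLICE=${SLICE:-20}       # number of proxies taken from the top of the list

# Request TARGET_URL through one proxy and print curl's exit code.
probe() {
    curl -s -o /dev/null --max-time "$TIMEOUT" -x "$1" "$TARGET_URL"
    echo "$?"
}

# Count exit codes read from stdin, one per line.
tally() {
    awk '
        { c[$1]++ }
        END { printf "ok=%d couldnt_connect=%d timeout=%d other=%d\n", c[0], c[7], c[28], NR - c[0] - c[7] - c[28] }
    '
}

# Runs only when a proxy list (one proxy per line) is given:
#   sh probe.sh proxies.txt
if [ "$#" -ge 1 ] && [ -f "$1" ]; then
    head -n "$SLICE" "$1" | while read -r proxy; do probe "$proxy"; done | tally
fi
```

Running it several times from each IP and comparing the printed counts gives the averages described above.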

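If you still want to check whether curl is at fault, you can set up a test environment with multiple services. You can hand this test environment to the curl maintainer so that he can replicate the bug. You can use Docker to create 10, 20, or 100 proxy servers and connect to them to see whether curl has a problem.

You will need:

  1. Docker, which can be installed on Win/Mac/Linux
  2. one of the proxy images, to create the proxies as containers
  3. a network, created as described in the networking tutorial (bridge should be fine)
  4. to attach the containers to the network with --network
  5. optionally, to set an address for each proxy container with --ip
  6. to let each proxy container read its configuration and write error logs, by mounting the error log / config file / directory with --volume, so that you can read why connections were dropped
  7. all proxy containers up and running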
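The steps above can be sketched as a small generator that prints the docker commands instead of running them, so they can be reviewed first. Everything named here is an assumption: the network name, the subnet, the image name (a placeholder, not a real image), the log paths, and the function name `print_commands`. Note that --ip only works on a user-defined network that was created with a subnet.

```shell
#!/bin/sh
# Hypothetical values; replace them with your own.
NET=${NET:-proxytest}
SUBNET=${SUBNET:-172.28.0.0/16}
IMAGE=${IMAGE:-example/proxy-image}       # placeholder image name
LOGDIR=${LOGDIR:-/tmp/proxytest/logs}     # host directory for error logs
COUNT=${COUNT:-10}                        # number of proxy containers

# Print the commands that create the network and start N proxy containers.
print_commands() {
    echo "docker network create --driver bridge --subnet $SUBNET $NET"
    i=1
    while [ "$i" -le "$1" ]; do
        # Addresses start at 172.28.0.11 to stay clear of the gateway.
        echo "docker run -d --name proxy$i --network $NET --ip 172.28.0.$((i + 10)) --volume $LOGDIR/proxy$i:/var/log/proxy $IMAGE"
        i=$((i + 1))
    done
}

print_commands "$COUNT"
```

The output can be saved to a file and run as a script once the image name and the paths are filled in.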





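You can connect to the proxies running inside the containers in two ways. If you want to use curl outside these containers, you need to publish the proxies' ports from the containers to the outside world (curl, in your case) using -p.

Or

You can use another container image that has Linux + curl, for example Alpine Linux + curl, and attach it to the same network, just as you did with the proxies. If you do this, you do not need to publish (expose) the proxy ports, and you do not need to think about which port to expose for each particular proxy.

At every step you can issue the command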

docker ps -a

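to see all containers and their status.

To stop and remove all containers (not the images they were created from, only the containers themselves), in case some containers exited with errors: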

docker stop $(docker ps -aq) && docker rm $(docker ps -aq)

Or to stop and remove a specific container from the list:

docker stop <container-id>
docker rm <container-id>

To see all containers attached to the bridge network (the default):

docker network inspect bridge

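If you confirm that there really is a problem connecting to the proxies on your local machine, then the curl maintainer can replicate it.

Just put all the commands above that create the proxies, attach them to the network, and so on into a file, for example a script named replicate.sh that starts with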

#!/bin/sh

# your commands here

Save the file and issue the command

chmod +x ./replicate.sh

to make it executable.

You can run it to double-check that everything works as expected

./replicate.sh

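and send the curl maintainer the environment that replicates the problem you are experiencing.

If you do not like writing lots of commands such as docker run to start the proxies, you can use Docker Compose instead, which lets you define the whole test environment in one file.

If you run a lot of containers, you can limit resources such as the memory each container consumes, which may help when there are many proxies.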
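As a rough illustration of both points, the script below prints a Compose file with N proxy services, each with a fixed address, a mounted log directory, and a memory limit. The image name, network name, subnet, log path, the 64m limit, and the function name `write_compose` are all placeholders chosen for this sketch.

```shell
#!/bin/sh
# Print a Compose file for N proxy services on stdout.
# Hypothetical: image, network name, subnet, log path, memory limit.
write_compose() {
    echo "services:"
    i=1
    while [ "$i" -le "$1" ]; do
        cat <<EOF
  proxy$i:
    image: example/proxy-image
    mem_limit: 64m
    volumes:
      - ./logs/proxy$i:/var/log/proxy
    networks:
      proxytest:
        ipv4_address: 172.28.0.$((i + 10))
EOF
        i=$((i + 1))
    done
    cat <<EOF
networks:
  proxytest:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
EOF
}

write_compose "${COUNT:-3}"
```

Redirecting the output to docker-compose.yml and running docker compose up -d in the same directory should then start the whole environment, once a real proxy image is filled in.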