Posts by YSY*_*YSY

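Python: maximum recursion depth exceeded while calling a Python object

I have built a crawler that has to run on about 5M pages (by incrementing the URL ID) and then parse the pages that contain the information I need.

After running an algorithm over 200K of the URLs and saving the good and bad results, I found that I was wasting a lot of time. I could see that a few differences between consecutive good IDs keep recurring, and I can use them to check the next valid URL.

You can spot the recurring differences quickly (the first few "good IDs"):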

510000011 # +8
510000029 # +18
510000037 # +8
510000045 # +8
510000052 # +7
510000060 # +8
510000078 # +18
510000086 # +8
510000094 # +8
510000102 # +8
510000110 # etc'
510000128
510000136
510000144
510000151
510000169
510000177
510000185
510000193
510000201

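After crawling about 200K URLs, which gave me only 14K good results, I knew I was wasting time and needed to optimize. So I ran some statistics and built a function that checks the URLs while increasing the ID by 8\18\17\8 (the most frequently recurring differences), etc.

Here is the function: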
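As a rough illustration of the statistics step, the differences between consecutive good IDs can be counted directly. This is a minimal sketch using only the sample IDs listed above; the name `stride_counts` is made up for the example and is not from the original crawler.

```python
from collections import Counter

# The first few "good IDs" from the listing above.
good_ids = [
    510000011, 510000029, 510000037, 510000045, 510000052,
    510000060, 510000078, 510000086, 510000094, 510000102,
    510000110, 510000128, 510000136, 510000144, 510000151,
    510000169, 510000177, 510000185, 510000193, 510000201,
]


def stride_counts(ids):
    """Count how often each gap between consecutive sorted IDs occurs."""
    ordered = sorted(ids)
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))


# Most frequent gaps first, as (gap, occurrences) pairs.
print(stride_counts(good_ids).most_common())  # [(8, 13), (18, 4), (7, 2)]
```

On this small sample the gaps 8, 18 and 7 cover every step, which is why trying them in that order is cheaper than probing every ID.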

def checkNextID(ID):
    global numOfRuns, curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(3) # sleep every 10 iterations
            if isValid(ID + 8): …

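The function above is truncated, so it is not clear where the recursion happens. Assuming it calls itself for each next ID, one way to stay under Python's recursion limit is to walk the IDs in a plain loop. The sketch below is hypothetical: `walk_valid_ids`, its `strides` parameter and the `is_valid` callback are illustrative names, and the linear fallback is an assumption rather than the original logic.

```python
def walk_valid_ids(start_id, last_id, is_valid, strides=(8, 18, 7)):
    """Yield valid IDs after start_id, trying the most common gaps first.

    Uses a loop instead of recursion, so the call stack never grows.
    is_valid is a caller-supplied check (hypothetical here).
    """
    current = start_id
    while current < last_id:
        for step in strides:
            candidate = current + step
            if candidate <= last_id and is_valid(candidate):
                break
        else:
            # No known gap matched: fall back to probing one ID at a time.
            candidate = current + 1
            while candidate <= last_id and not is_valid(candidate):
                candidate += 1
            if candidate > last_id:
                return
        yield candidate
        current = candidate


# Usage with a stand-in validity check backed by a set of known good IDs.
known = {510000011, 510000029, 510000037, 510000045, 510000052}
print(list(walk_valid_ids(510000011, 510000052, known.__contains__)))
# [510000029, 510000037, 510000045, 510000052]
```

Because it is a generator, throttling (such as the `time.sleep(3)` every 10 iterations in the original) can be applied by the caller between items.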

python algorithm recursion web-crawler depth

YSY*_*YSY

34
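votes

4
answers

90K
views

"The Art of Exploitation" disassembly example isn't the same (C code)

I am following an example from the book "The Art of Exploitation" with a program written in C. The book comes with its own Linux LiveCD, but I prefer to use BT5 (32-bit).

The code example is very simple (I used it unchanged):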

#include <stdio.h>

int main()
{
  int i;
  for(i=0; i < 10; i++)       // Loop 10 times.
  {
    puts("Hello, world!\n");  // put the string to the output.
  }
  return 0;                   // Tell OS the program exited without errors.
}

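The author uses

gcc file_name.c

to compile the code. I use almost the same syntax, but with -o so that the compiled binary is saved where I want it.

He then uses the command

objdump -D loop | grep -A20 main.:

to inspect the compiled binary.

This is his output: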

reader@hacking:~/booksrc $ objdump -D a.out | grep -A20 main.:
08048374 <main>:
 8048374:       55                      push   %ebp
 8048375:       89 e5                   mov    %esp,%ebp
 8048377:       83 …

c assembly disassembly

YSY*_*YSY

2011 08-08

8
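votes

1
answers

1282
views

Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host, and a few urllib2 problems

I've written a crawler that uses urllib2 to fetch URLs.

While making requests I get some strange behavior. I tried to analyze it with Wireshark but couldn't understand the problem.

getPAGE() is responsible for fetching the URL. If the URL is fetched successfully, it returns the content of the URL (response.read()); otherwise it returns None.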

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None ,headers)
        try:
            response = urlopen(req) #fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4) …

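For readers on Python 3, where `urllib2` became `urllib.request` and `urllib.error`, the retry pattern described above might look like the sketch below. It is not the original function: the name `get_page`, the shortened User-Agent string, and the `opener` and `sleep` parameters (added so the function can be exercised without a network) are all assumptions. The explicit `ConnectionResetError` handler covers a reset that surfaces directly, for example during `read()`; on Windows that is the error reported as 10054.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def get_page(url, attempts=2, delay=4.0, opener=urlopen, sleep=time.sleep):
    """Return the response body as bytes, or None if every attempt fails."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for _ in range(attempts):
        try:
            with opener(req, timeout=8) as response:
                return response.read()
        except HTTPError as e:
            # The server answered, but with an error status.
            print("The server couldn't fulfill the request:", e.code, url)
        except URLError as e:
            # The server could not be reached at all.
            print("Failed to reach the server:", e.reason, url)
        except ConnectionResetError as e:
            # The remote host closed an established connection.
            print("Connection reset by the remote host:", e, url)
        sleep(delay)
    return None
```

Catching `HTTPError` before `URLError` matters because `HTTPError` is a subclass of `URLError`.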

python exception urllib2 errno web-crawler

YSY*_*YSY

2019 07-25

7
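votes

0
answers

8104
views

Running multiple threads simultaneously in Python - is it possible?

I'm writing a small crawler that should fetch a URL many times, and I want all the threads to run at the same time (simultaneously).

I wrote a small piece of code that should do that.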

import thread
import time  # needed for the time.sleep() calls below
from urllib2 import Request, urlopen, URLError, HTTPError


def getPAGE(FetchAddress):
    attempts = 0
    while attempts < 2:
        req = Request(FetchAddress, None)
        try:
            response = urlopen(req, timeout = 8) #fetching the url
            print "fetched url %s" % FetchAddress
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) …

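On the question itself: in CPython, threads that are blocked on network I/O release the GIL, so several fetches can be in flight at once even though only one thread executes Python bytecode at a time. A minimal Python 3 sketch with a thread pool follows; `fetch_many` is a made-up helper, and `fetch` stands for any callable that downloads one URL (such as a function like the `getPAGE` above).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(url, times, fetch, max_workers=8):
    """Call fetch(url) `times` times on a thread pool; return results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url) for _ in range(times)]
        # result() blocks until that call finishes and re-raises its exception.
        return [future.result() for future in futures]


# Usage with a stand-in fetch function that does no network access.
print(fetch_many("http://example.invalid/", 3, lambda u: len(u)))
```

The pool size caps how many requests overlap, which also keeps the crawler from opening an unbounded number of connections to the same host.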

python multithreading web-crawler gil

YSY*_*YSY

6
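votes

1
answers

9805
views

How do I read a file that may be saved as ANSI or Unicode in Python?

I have to write a script that supports reading a file that may be saved as either Unicode or ANSI (using MS Notepad).

I don't have any indication of the encoding in the file itself. How can I support both encodings? (Some generic way of reading the file without knowing the format in advance.)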
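One common approach is to look for a byte order mark first and fall back to a guess when there is none. The sketch below (Python 3) rests on assumptions: that the "Unicode" files start with a UTF-16 or UTF-8 BOM, as Notepad typically writes them, and that "ANSI" means the Windows-1252 code page, which varies by system locale. The names `decode_text` and `read_text` are illustrative.

```python
import codecs


def decode_text(raw, fallback="cp1252"):
    """Decode bytes by BOM if present, then UTF-8, then a fallback code page."""
    boms = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, encoding in boms:
        if raw.startswith(bom):
            # These codecs consume the BOM while decoding.
            return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed "ANSI" code page; bytes it does not define are replaced.
        return raw.decode(fallback, errors="replace")


def read_text(path, fallback="cp1252"):
    """Read a file in binary mode and decode it with decode_text()."""
    with open(path, "rb") as handle:
        return decode_text(handle.read(), fallback)
```

Without a BOM there is no fully reliable way to tell single-byte code pages apart, so the fallback remains a guess.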

python unicode ansi utf-8

YSY*_*YSY

2
votes

1
answers

10K
views

python md5, d.update(strParam).hexdigest() returns NoneType - why?

>>> d = md5.new()
>>> d.update('a').hexdigest()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'hexdigest'

This works:

>>> d = md5.new()
>>> d.update('a')
>>> d.hexdigest()
'0cc175b9c0f1b6a831c399e269772661'

Is there an explanation for why the shortened Python code fails?
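The traceback follows from `update()` modifying the hash object in place and returning `None`, so there is nothing to chain `.hexdigest()` onto. The sketch below shows the same behavior in Python 3 with `hashlib`, which replaces the Python 2 `md5` module, along with a one-line form that passes the data to the constructor. The helper name `md5_hex` is made up for the example.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of data in a single expression."""
    return hashlib.md5(data).hexdigest()


d = hashlib.md5()
returned = d.update(b"a")  # update() mutates d and returns None
print(returned)            # None
print(d.hexdigest())       # 0cc175b9c0f1b6a831c399e269772661
print(md5_hex(b"a"))       # 0cc175b9c0f1b6a831c399e269772661
```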

python md5

YSY*_*YSY

2011 07-02

0
votes

1
answers

4318
views

Tag statistics

python ×5

web-crawler ×3

algorithm ×1

ansi ×1

assembly ×1

c ×1

depth ×1

disassembly ×1

errno ×1

exception ×1

gil ×1

md5 ×1

multithreading ×1

recursion ×1

unicode ×1

urllib2 ×1

utf-8 ×1
