Web scraping: expanding/contracting the bounding box depending on results

Ste*_*ead 4 python recursion for-loop bounding-box

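A client wants to know where their competitors' stores are located, so I'm being quasi-evil and scraping the competitor's website.

The server accepts a bounding box (i.e., lower-left and upper-right corner coordinates) as parameters and returns the locations found within that bounding box. This part works fine, and I can successfully retrieve store locations for a given bounding box.

The problem is that only the first 10 locations in the bounding box are returned, so in a populated area a 10-degree bounding box will contain more locations than the server returns.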


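I could always use a smaller bounding box, but I'm trying to avoid unnecessary server hits while still ensuring that all stores are returned.

So I need a way to reduce the search rectangle size when 10 stores are found (since there may be more than 10 stores), search recursively with the smaller rectangle size, and then revert to the larger rectangle for the next grid cell.

I have written a function that retrieves the stores from the server, given a bounding box: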

# bounding_box is a placeholder for the lower-left and upper-right corners
stores = checkForStores(bounding_box)
if len(stores) >= 10:
    # There are too many stores. Search again with a smaller bounding box
    pass
else:
    # Everything is good - process these stores
    pass

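But I'm struggling with how to set up the appropriate bounding box for the checkForStores function.

I tried setting up the main grid cells using for loops over latitude and longitude: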

cellsize = 10
for minLat in range(-40, -10, cellsize):
    for minLng in range(110, 150, cellsize):
        maxLat = minLat + cellsize
        maxLng = minLng + cellsize

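...but I don't know how to continue the search with a smaller bounding box if 10 stores are found. I also tried using while loops, but I couldn't get either of them to work.

Thanks for any advice or pointers on where to start.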

pol*_*art 5

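Here is how it can be done using recursion. The code should be self-explanatory, but this is how it works: given a bounding box, it checks the number of stores in it, and if there are 10 or more, it splits the box into smaller ones and calls itself on each new bounding box. It does this until fewer than 10 stores are found. In that case, the found stores are simply saved in a list.

Note: since recursion is used, it is theoretically possible to exceed the maximum recursion depth. In your case, though, even if you pass in a 40 000 x 40 000 km bounding box, it takes only about 15 levels of splitting to get down to a roughly 1 x 1 km bounding box with cell_axis_reduction_factor=2: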

In [1]: import math

In [2]: math.log(40000, 2)
Out[2]: 15.287712379549449

Anyway, if that does happen, you can try increasing the cell_axis_reduction_factor value.
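As a rough illustration of how the reduction factor affects recursion depth, the sketch below counts the splits needed to shrink a box side to a target size. `levels_needed` is a hypothetical helper introduced here, not part of the code that follows, and the 40 000 km and 1 km figures are simply the ones used in the note above.

```python
def levels_needed(start_km, target_km, factor):
    """Count how many times a side of length `start_km` must be divided
    by `factor` before it is no longer than `target_km`."""
    levels = 0
    size = float(start_km)
    while size > target_km:
        size /= factor
        levels += 1
    return levels


for factor in (2, 4, 10):
    # Each split issues factor**2 requests, so a larger factor trades
    # recursion depth for more requests per level.
    print(factor, levels_needed(40000, 1, factor), factor ** 2)
```

A larger factor reaches small cells in fewer levels, but every split then costs more server requests, so it may not reduce the total number of hits.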

Also note: in Python, according to PEP 8, function names should be lowercase with underscores, so I renamed the checkForStores function to check_for_stores.

# Save visited boxes. Only for debugging purpose.
visited_boxes = []


def check_for_stores(bounding_box):
    """Function mocking the real `check_for_stores` function by returning
    a random list of "stores"
    """
    import random
    randint = random.randint(1, 12)
    print('Found {} stores for bounding box {}.'.format(randint, bounding_box))
    visited_boxes.append(bounding_box)
    return ['store'] * randint


def split_bounding_box(bounding_box, cell_axis_reduction_factor=2):
    """Returns a generator of bounding box coordinates split
    from the parent `bounding_box`

    :param bounding_box: tuple of two tuples holding the
          lower-left and upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param cell_axis_reduction_factor: number of parts each axis is
                                       divided into, meaning that in
                                       the end it will return
                                       `cell_axis_reduction_factor`**2 boxes
    :return: generator of bounding box coordinates

    """
    box_lc, box_rc = bounding_box
    box_lc_x, box_lc_y = box_lc
    box_rc_x, box_rc_y = box_rc

    cell_width = (box_rc_x - box_lc_x) / float(cell_axis_reduction_factor)
    cell_height = (box_rc_y - box_lc_y) / float(cell_axis_reduction_factor)

    for x_factor in range(cell_axis_reduction_factor):
        lc_x = box_lc_x + cell_width * x_factor
        rc_x = lc_x + cell_width

        for y_factor in range(cell_axis_reduction_factor):
            lc_y = box_lc_y + cell_height * y_factor
            rc_y = lc_y + cell_height

            yield ((lc_x, lc_y), (rc_x, rc_y))


def get_stores_in_box(bounding_box, result=None):
    """Returns list of stores found in the provided `bounding_box`.

    If 10 or more stores are found in `bounding_box`, recursively
    splits the current `bounding_box` into smaller ones and checks
    for stores in them.

    :param bounding_box: tuple of two tuples holding the
          lower-left and upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param result: list to which found stores are appended;
                   used for recursive calls
    :return: list with found stores

    """
    if result is None:
        result = []

    print('Checking for stores...')
    stores = check_for_stores(bounding_box)
    if len(stores) >= 10:
        print('Stores number is more than or equal 10. Splitting bounding box...')
        for splitted_box_coords in split_bounding_box(bounding_box):
            get_stores_in_box(splitted_box_coords, result)
    else:
        print('Stores number is less than 10. Saving results.')
        result += stores

    return result


stores = get_stores_in_box(((0, 1), (30, 20)))
print('Found {} stores in total'.format(len(stores)))
print('Visited boxes: ')
print(visited_boxes)

Here is an example of the output:

Checking for stores...
Found 10 stores for bounding box ((0, 1), (30, 20)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 4 stores for bounding box ((0.0, 1.0), (15.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((0.0, 10.5), (15.0, 20.0)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 10 stores for bounding box ((15.0, 1.0), (30.0, 10.5)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 1 stores for bounding box ((15.0, 1.0), (22.5, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 9 stores for bounding box ((15.0, 5.75), (22.5, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((22.5, 1.0), (30.0, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 1 stores for bounding box ((22.5, 5.75), (30.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 6 stores for bounding box ((15.0, 10.5), (30.0, 20.0)).
Stores number is less than 10. Saving results.
Found 29 stores in total
Visited boxes: 
[
((0, 1), (30, 20)), 
((0.0, 1.0), (15.0, 10.5)), 
((0.0, 10.5), (15.0, 20.0)), 
((15.0, 1.0), (30.0, 10.5)), 
((15.0, 1.0), (22.5, 5.75)), 
((15.0, 5.75), (22.5, 10.5)), 
((22.5, 1.0), (30.0, 5.75)), 
((22.5, 5.75), (30.0, 10.5)), 
((15.0, 10.5), (30.0, 20.0))
]
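As a quick sanity check on the sample output, the sketch below copies the boxes that reported fewer than 10 stores (the leaves of the recursion) and confirms that their areas add up to the area of the original box and that their store counts add up to the reported total. The names `leaf_boxes`, `leaf_counts`, and `area` are introduced here for illustration only. Equal areas do not by themselves prove that the boxes do not overlap, so this is only a rough check.

```python
# Leaf boxes and their store counts, copied from the sample output.
leaf_boxes = [
    ((0.0, 1.0), (15.0, 10.5)),
    ((0.0, 10.5), (15.0, 20.0)),
    ((15.0, 1.0), (22.5, 5.75)),
    ((15.0, 5.75), (22.5, 10.5)),
    ((22.5, 1.0), (30.0, 5.75)),
    ((22.5, 5.75), (30.0, 10.5)),
    ((15.0, 10.5), (30.0, 20.0)),
]
leaf_counts = [4, 4, 1, 9, 4, 1, 6]
root_box = ((0, 1), (30, 20))


def area(box):
    """Area of a box given as ((lower-left x, y), (upper-right x, y))."""
    (x0, y0), (x1, y1) = box
    return (x1 - x0) * (y1 - y0)


leaf_area = sum(area(box) for box in leaf_boxes)
total_stores = sum(leaf_counts)
print(leaf_area, area(root_box), total_stores)
```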
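Two practical concerns are not handled by the recursive version: a store lying exactly on a shared edge may be returned by two neighbouring boxes, and 10 or more stores at the same coordinates would make the splitting go on indefinitely. The following is a minimal sketch of one possible variation, not the method from the answer itself. It uses an explicit queue instead of recursion, collects results in a set to drop duplicates, and stops splitting below a minimum box size. The names `collect_stores`, `split_box`, `fake_fetch`, `FAKE_STORES`, and `min_size` are hypothetical, and it assumes that stores are hashable values and that the server treats box edges as inclusive.

```python
from collections import deque

LIMIT = 10  # the server returns at most this many stores per request


def split_box(box, factor=2):
    """Yield factor**2 equal sub-boxes of `box`."""
    (x0, y0), (x1, y1) = box
    width = (x1 - x0) / float(factor)
    height = (y1 - y0) / float(factor)
    for i in range(factor):
        for j in range(factor):
            yield ((x0 + width * i, y0 + height * j),
                   (x0 + width * (i + 1), y0 + height * (j + 1)))


def collect_stores(box, fetch, min_size=1e-6):
    """Subdivide `box` until each request returns fewer than LIMIT stores."""
    found = set()            # a set drops stores reported by two boxes
    pending = deque([box])   # explicit queue instead of recursion
    while pending:
        current = pending.popleft()
        stores = fetch(current)
        (x0, y0), (x1, y1) = current
        too_small = (x1 - x0) <= min_size or (y1 - y0) <= min_size
        if len(stores) >= LIMIT and not too_small:
            pending.extend(split_box(current))
        else:
            # If the box was too small to split, results may be incomplete.
            found.update(stores)
    return found


# Hypothetical stand-in for the real server call: 30 stores on a grid,
# truncated to the first LIMIT matches, edges treated as inclusive.
FAKE_STORES = [(x + 0.5, y + 0.5) for x in range(6) for y in range(5)]


def fake_fetch(box):
    (x0, y0), (x1, y1) = box
    inside = [s for s in FAKE_STORES
              if x0 <= s[0] <= x1 and y0 <= s[1] <= y1]
    return inside[:LIMIT]


all_stores = collect_stores(((0, 0), (6, 5)), fake_fetch)
print(len(all_stores))
```

To cover the grid from the question, each 10-degree cell could be passed to `collect_stores` in turn, with the real request function in place of `fake_fetch`.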