Extracting data from a URL result with special formatting

dat*_*oda 4 python url parsing

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

Modifying the inputs, limit and query, generates the wanted data. limit is the maximum number of terms returned, and query is the seed term.

The URL provides a text result formatted in this way:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});

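I'd like to write a Python script that takes two arguments (the limit number and the query seed), fetches the data online, parses the result, and returns a list with the new terms, ['newterm1','newterm2'] in this case.

I'd love some help, especially with fetching the URL, since I have never done this before.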

Wes*_*ley 12

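It sounds like you can break this problem up into several subproblems.

Subproblems

There are a handful of problems that need to be solved before composing the completed script:

  1. Forming the request URL: creating a configured request URL from a template
  2. Retrieving the data: actually making the request
  3. Unwrapping JSONP: the returned data appears to be JSON wrapped in a JavaScript function call
  4. Traversing the object graph: navigating through the result to find the desired bits of information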

Forming the request URL

This is just simple string formatting.

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')

Python 2 note

You will need to use the string formatting operator (%) here.

url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
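One caveat with both templates: the seed term is inserted verbatim, so a term containing a space or a character such as & would produce a malformed URL. Here is a minimal Python 3 sketch, assuming the server expects ordinary percent-encoded query parameters. build_url and BASE_URL are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

# Placeholder host taken from the question
BASE_URL = 'http://somewhere.com/relatedqueries'

def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    query = urlencode({'limit': limit, 'query': seedterm})
    return '{}?{}'.format(BASE_URL, query)

print(build_url(2, 'seed term'))
# http://somewhere.com/relatedqueries?limit=2&query=seed+term
```

urlencode encodes spaces as + and escapes reserved characters, so the term arrives at the server intact.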

Retrieving the data

You can use the built-in urllib.request module.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

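This returns a file-like object called data. You can also use a with statement here, which closes the connection for you:

with urllib.request.urlopen(url) as data:
    result = data.read()  # do processing here, inside the block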

Python 2 note

Import urllib2 instead of urllib.request.
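In Python 3 the response body comes back as bytes, so it has to be decoded before any string operations. The following is a sketch of one way to wrap the request, assuming the server either declares its charset or sends UTF-8. fetch_text and the 10-second timeout are illustrative choices, not something the service requires.

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Fetch url and return the response body as a str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the headers, falling back to UTF-8
        charset = response.headers.get_content_charset() or 'utf-8'
    return raw.decode(charset)

# A data: URL lets you try the function without a network connection
print(fetch_text('data:text/plain;charset=utf-8,hello%20world'))
# hello world
```

Network failures surface as urllib.error.URLError, which is a subclass of OSError, so callers can catch either.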

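Unwrapping JSONP

The result you pasted looks like JSONP. Assuming that the wrapping function being called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this method call out.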

# urlopen returns bytes in Python 3; UTF-8 is assumed for this server
result = data.read().decode('utf-8').strip()

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
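The fixed prefix check leaves the string untouched if the wrapper name ever differs. If that is a concern, here is a sketch that does not depend on the callback name, assuming the response is a single function call. unwrap_jsonp is a hypothetical helper.

```python
def unwrap_jsonp(text):
    """Return the argument text of a single JSONP-style call.

    Raises ValueError if text does not look like name(...) or name(...);
    """
    text = text.strip()
    start = text.find('(')
    end = text.rfind(')')
    # Only an optional semicolon may follow the closing parenthesis
    if start == -1 or end < start or text[end + 1:].strip() not in ('', ';'):
        raise ValueError('not a JSONP response')
    return text[start + 1:end]

sample = 'oo.visualization.Query.setResponse({"status": "ok"});\n'
print(unwrap_jsonp(sample))
# {"status": "ok"}
```

Raising an error on unexpected input makes a changed response format visible immediately instead of failing later in the JSON parser.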

Parsing JSON

The resulting result string is just JSON data. Parse it with the built-in json module.

import json

result_object = json.loads(result)
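One thing to check against the real response: the sample pasted in the question uses unquoted keys and single-quoted strings, which json.loads rejects because strict JSON requires double quotes. If the server really sends that form, one possible workaround is sketched below. It is a best-effort conversion, not a JavaScript parser, and loosen_to_json and parse_payload are hypothetical helpers.

```python
import json
import re

def loosen_to_json(text):
    """Best-effort conversion of a simple JavaScript object literal to JSON.

    Assumes string values contain no quotes and nothing resembling ,key:
    """
    # Quote bare identifiers used as keys: {key: ...} or ,key: ...
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Switch single-quoted strings to double-quoted ones
    return text.replace("'", '"')

def parse_payload(text):
    """Parse text as JSON, retrying with loosen_to_json if that fails."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return json.loads(loosen_to_json(text))

sample = "{table:{rows:[{c:[{v:0.5,f:'0.50'},{v:'newterm1'}]}]}}"
print(parse_payload(sample)['table']['rows'][0]['c'][1]['v'])
# newterm1
```

Strict JSON passes through the first json.loads call unchanged, so the fallback only runs when it is needed.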

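Traversing the object graph

Now you have a result_object that represents the JSON response. The object itself is a dict with keys like version, reqId, and so on. Based on your question, here is what you need to do to create your list.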

# Get the rows in the table, then get the second column's value for
# each row (list indexes start at 0, so the second column is index 1)
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
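If you would rather not hard-code the column position, the cols list can be used to look it up. This is a sketch against a hand-written stand-in for the parsed response, assuming the term column's id is 'query' as the sample in the question suggests. column_values is a hypothetical helper.

```python
# Hand-written stand-in for the parsed response, mirroring the question's sample
result_object = {
    'table': {
        'cols': [{'id': 'score'}, {'id': 'query'}],
        'rows': [
            {'c': [{'v': 0.9894380670262618, 'f': '0.99'}, {'v': 'newterm1'}]},
            {'c': [{'v': 0.9894380670262618, 'f': '0.99'}, {'v': 'newterm2'}]},
        ],
    },
}

def column_values(result_object, column_id):
    """Return the 'v' value of the column named column_id for every row."""
    cols = result_object['table']['cols']
    # Find the position of the requested column from its id
    index = [col['id'] for col in cols].index(column_id)
    return [row['c'][index]['v'] for row in result_object['table']['rows']]

print(column_values(result_object, 'query'))
# ['newterm1', 'newterm2']
```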

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # urlopen returns bytes; UTF-8 is assumed for this server
            result = data.read().decode('utf-8').strip()
    except (OSError, UnicodeDecodeError):
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))

Python 2.7 version

#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

    try:
        # The object returned by urllib2.urlopen is not a context manager
        # in Python 2, so close it explicitly
        data = urllib2.urlopen(url)
        try:
            result = data.read().strip()
        finally:
            data.close()
    except IOError:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))