How to extract text as well as hyperlink text in scrapy?

Question

I want to extract text from the following HTML:

<li>
    <a test="test" href="abc.html" id="11">Click Here</a>
    "for further reference"
</li>

I'm trying the following extract command:

response.css("article div#section-2 li::text").extract()

But it returns only the "for further reference" line. The expected output is "Click Here for further reference" as a single string. How can I do this? And how should it be modified so that it also handles the following patterns:

Text Hyperlink Text
Hyperlink Text
Text Hyperlink

Answer 1

pau*_*rth 5

There are at least a couple of ways to do this.

Let's first build a test selector that mimics your response:

>>> import scrapy
>>> response = scrapy.Selector(text="""<li>
...     <a test="test" href="abc.html" id="11">Click Here</a>
...     "for further reference"
... </li>""")

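First option: a small change to your CSS selector. Look at all text descendants, not only the text children (notice the space between li and the ::text pseudo-element):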

# this is your CSS select,
# which only gives direct children text of your selected LI
>>> response.css("li::text").extract()    
[u'\n    ', u'\n    "for further reference"\n']

# notice the extra space
#                 here
#                   |
#                   v
>>> response.css("li ::text").extract()
[u'\n    ', u'Click Here', u'\n    "for further reference"\n']

# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n    Click Here\n    "for further reference"\n'
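
The joined string above still carries the newlines and indentation from the HTML source. If you want to stay with the CSS approach, a minimal sketch in plain Python is to collapse the whitespace after joining. The helper name collapse_whitespace is hypothetical (it is not part of Scrapy), and the fragment list is copied from the output shown above:

```python
def collapse_whitespace(fragments):
    """Join text fragments and collapse every whitespace run to one space."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so re-joining with " " normalizes it.
    return " ".join("".join(fragments).split())


# Fragments as returned by response.css("li ::text").extract() above
fragments = ['\n    ', 'Click Here', '\n    "for further reference"\n']
print(collapse_whitespace(fragments))  # prints: Click Here "for further reference"
```

The fragments are joined with an empty string first so that text which is adjacent in the markup stays adjacent in the result.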

Another option is to chain your .css() call with XPath 1.0 string() or normalize-space() in a subsequent .xpath() call:

>>> response.css("li").xpath('string()').extract()
[u'\n    Click Here\n    "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']

# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'
