How to extract text as well as hyperlink text in scrapy?

Shu*_* B. 2 csv scrapy web-scraping

I want to extract text from the following HTML:

<li>
    <a test="test" href="abc.html" id="11">Click Here</a>
    "for further reference"
</li>

I'm trying the following extract command:

response.css("article div#section-2 li::text").extract()

But it returns only the "for further reference" line. The expected output is "Click Here for further reference" as a single string. How can I do this? And how should I modify it to handle the following patterns as well:

  1. Text Hyperlink Text
  2. Hyperlink Text
  3. Text Hyperlink

pau*_*rth 5

There are at least a couple of ways to do this.

Let's first build a test selector that mimics your response:

>>> import scrapy
>>> response = scrapy.Selector(text="""<li>
...     <a test="test" href="abc.html" id="11">Click Here</a>
...     "for further reference"
... </li>""")

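First option: a small change to your CSS selector. Look at all text descendants, not only the text children (note the space between li and the ::text pseudo-element):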

# this is your CSS selector,
# which only gives the text of direct children of your selected LI
>>> response.css("li::text").extract()
[u'\n    ', u'\n    "for further reference"\n']

# notice the extra space
#                 here
#                   |
#                   v
>>> response.css("li ::text").extract()
[u'\n    ', u'Click Here', u'\n    "for further reference"\n']

# using Python's join() to concatenate and build the full sentence
>>> ''.join(response.css("li ::text").extract())
u'\n    Click Here\n    "for further reference"\n'
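The joined string above still carries the newlines and indentation from the markup. A minimal sketch of one way to tidy it in plain Python, assuming you already have the list returned by "li ::text": the helper name join_text_fragments and the three fragment lists are hypothetical, made up here to mirror the three patterns in the question, and are not part of Scrapy.

```python
def join_text_fragments(fragments):
    """Concatenate text nodes, then collapse every whitespace run to one space."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so rejoining with " " tidies the text.
    return " ".join("".join(fragments).split())


# Hypothetical fragment lists, shaped like the output of `li ::text`
# for each pattern in the question (assumed data, not real responses).
text_link_text = ["\n    See ", "Click Here", "\n    for further reference\n"]
link_text = ["\n    ", "Click Here", '\n    "for further reference"\n']
text_link = ["\n    For further reference ", "Click Here", "\n"]

for fragments in (text_link_text, link_text, text_link):
    print(join_text_fragments(fragments))
```

Because the fragments are concatenated with an empty string first, two text nodes with no whitespace between them in the markup stay fused, just as they would in the rendered text.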

Another option is to chain your .css() call with the XPath 1.0 string() or normalize-space() function in a subsequent .xpath() call:

>>> response.css("li").xpath('string()').extract()
[u'\n    Click Here\n    "for further reference"\n']
>>> response.css("li").xpath('normalize-space()').extract()
[u'Click Here "for further reference"']

# calling `.extract_first()` gives you a string directly, not a list of 1 string
>>> response.css("li").xpath('normalize-space()').extract_first()
u'Click Here "for further reference"'