Parsing a tweet to extract hashtags into an array in Python

Sco*_*ott 12 python arrays

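I have been having a heck of a time taking a tweet that includes hashtags and pulling each hashtag into an array using Python. I am embarrassed to even post what I have tried so far.

For example, "I love #stackoverflow because #people are very #helpful!"

This should pull the 3 hashtags into an array.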

And*_*Dog 60

A simple regex should do the job:

>>> import re
>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> re.findall(r"#(\w+)", s)
['stackoverflow', 'people', 'helpful']

Note, however, that as suggested in other answers, this may also find non-hashtags, such as a hash location in a URL:

>>> re.findall(r"#(\w+)", "http://example.org/#comments")
['comments']
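
One way to keep a single regex while skipping URL fragments is a negative lookbehind that only accepts a `#` at the start of the string or right after whitespace. This is a minimal sketch, not part of the original answer; the names `HASHTAG_RE` and `find_hashtags` are hypothetical.

```python
import re

# (?<!\S) is a negative lookbehind: the '#' must sit at the start of the
# string or directly after a whitespace character.
HASHTAG_RE = re.compile(r"(?<!\S)#(\w+)")

def find_hashtags(text):
    """Return hashtag names (without '#') in order of appearance."""
    return HASHTAG_RE.findall(text)

print(find_hashtags("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']
print(find_hashtags("http://example.org/#comments"))
# []
```

A hashtag glued to preceding punctuation, such as "(#tag)", is also skipped by this pattern, which may or may not be what you want.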

So another simple solution would be the following (it removes duplicates as a bonus):

>>> def extract_hash_tags(s):
...    return set(part[1:] for part in s.split() if part.startswith('#'))
...
>>> extract_hash_tags("#test http://example.org/#comments #test")
set(['test'])

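  • Your simple solution captures too many characters: for example, if a comma directly follows the hashtag, it ends up included in the hashtag. (3 upvotes)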
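
The trailing-punctuation issue raised in this comment can be handled by still splitting on whitespace, but keeping only the leading run of word characters after the `#`. The following is a sketch under that assumption; `extract_hash_tags_trimmed` is a hypothetical name, and it returns a list in first-seen order rather than a set.

```python
import re

def extract_hash_tags_trimmed(s):
    """Return unique hashtag names in first-seen order, without trailing punctuation."""
    tags = []
    for part in s.split():
        # re.match anchors at the start of the token, so only tokens that
        # begin with '#' qualify; URL fragments such as '.../#comments' do not.
        match = re.match(r"#(\w+)", part)
        if match and match.group(1) not in tags:
            tags.append(match.group(1))
    return tags

print(extract_hash_tags_trimmed("#test, http://example.org/#comments #test!"))
# ['test']
```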

gho*_*g74 18

>>> s = "I love #stackoverflow because #people are very #helpful!"
>>> [i for i in s.split() if i.startswith("#")]
['#stackoverflow', '#people', '#helpful!']

  • I think this is better than the regex in the accepted answer. This way, something like http://example.com/index.html#anchor_link is not flagged as a hashtag. (3 upvotes)

kor*_*huk 7

The best Twitter hashtag regex:

import re
text = "#promovolt #1st # promovolt #123"
re.findall(r'\B#\w*[a-zA-Z]+\w*', text)

# ['#promovolt', '#1st']
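
The pattern above is terse, so here is the same expression written with `re.VERBOSE` and one comment per piece. This is only an annotated sketch of the regex shown in this answer; the names `HASHTAG` and `find_hashtags` are hypothetical.

```python
import re

HASHTAG = re.compile(r"""
    \B          # the '#' must not directly follow a word character
    \#          # literal hash sign (escaped because of re.VERBOSE)
    \w*         # optional leading word characters, digits included
    [a-zA-Z]+   # at least one ASCII letter, so '#123' is rejected
    \w*         # the rest of the tag
""", re.VERBOSE)

def find_hashtags(text):
    """Return hashtags (including '#') in order of appearance."""
    return HASHTAG.findall(text)

print(find_hashtags("#promovolt #1st # promovolt #123"))
# ['#promovolt', '#1st']
print(find_hashtags("index.html#anchor"))
# []
```

Note that `\B` only rejects a `#` preceded by a word character; a `#` that follows `/`, as in a URL ending in `/#comments`, would still match.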


Ivo*_*zel 6

AndiDog's answer will trip over links and other things, so you may want to filter them out first. After that, use this code:

import re

# Python 2 syntax: the ur'' prefix does not exist in Python 3 (use r'' there).
UTF_CHARS = ur'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = ur'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)
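
The answer does not show how to apply the compiled pattern, so here is a usage sketch. It assumes Python 3, where the `ur''` prefix is unavailable and plain raw strings are used instead (the `re` module itself interprets the `\uXXXX` escapes). The helper name `find_tags` is hypothetical.

```python
import re

# Python 3 adaptation of the pattern above: r'' instead of ur''.
UTF_CHARS = r'a-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff'
TAG_EXP = r'(^|[^0-9A-Z&/]+)(#|\uff03)([0-9A-Z_]*[A-Z_]+[%s]*)' % UTF_CHARS
TAG_REGEX = re.compile(TAG_EXP, re.UNICODE | re.IGNORECASE)

def find_tags(text):
    """Return the tag text of every match, without the hash sign."""
    # Group 1 is the context before the hash sign, group 2 the hash sign
    # itself (ASCII '#' or full-width U+FF03), group 3 the tag text.
    return [m.group(3) for m in TAG_REGEX.finditer(text)]

print(find_tags("I love #stackoverflow because #people are very #helpful!"))
# ['stackoverflow', 'people', 'helpful']
print(find_tags("http://example.org/#comments"))
# []
```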

It may look like overkill, but it has been converted from http://github.com/mzsanford/twitter-text-java. It will handle about 99% of all hashtags the same way Twitter handles them.

For more converted Twitter regexes, see: http://github.com/BonsaiDen/Atarashii/blob/master/atarashii/usr/share/pyshared/atarashii/formatter.py

Edit:
See: http://github.com/BonsaiDen/AtarashiiFormat


Gab*_*mpa 5

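Suppose you have to retrieve your #Hashtags from a sentence full of punctuation symbols. Say #stackoverflow, #people and #helpful are terminated by different symbols, and you want to retrieve them from text while avoiding repetitions: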

>>> text = "I love #stackoverflow, because #people... are very #helpful! Are they really #helpful??? Yes #people in #stackoverflow are really really #helpful!!!"

If you try set([i for i in text.split() if i.startswith("#")]) on its own, you will get:

set(['#helpful???',
 '#people',
 '#stackoverflow,',
 '#stackoverflow',
 '#helpful!!!',
 '#helpful!',
 '#people...'])

which, in my opinion, is redundant. A better solution uses a regular expression from the re module:

>>> import re
>>> set([re.sub(r"(\W+)$", "", j) for j in set([i for i in text.split() if i.startswith("#")])])
set(['#people', '#helpful', '#stackoverflow'])

Now that looks fine to me.
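
The nested comprehension can be hard to read, so here is the same idea as a small function: keep whitespace-separated tokens that start with `#`, strip trailing non-word characters, and collect the results in a set. This is a sketch; `unique_hashtags` is a hypothetical name, and the extra length check (which drops a bare `#`) is an addition not present in the one-liner.

```python
import re

def unique_hashtags(text):
    """Collect '#'-prefixed tokens with trailing punctuation removed."""
    tags = set()
    for token in text.split():
        if token.startswith("#"):
            tag = re.sub(r"\W+$", "", token)  # strip trailing non-word characters
            if len(tag) > 1:                  # skip a bare '#' or '#!!!'
                tags.add(tag)
    return tags

text = ("I love #stackoverflow, because #people... are very #helpful! "
        "Are they really #helpful??? Yes #people in #stackoverflow "
        "are really really #helpful!!!")
print(sorted(unique_hashtags(text)))
# ['#helpful', '#people', '#stackoverflow']
```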

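EDIT: UNICODE #Hashtags

Add the re.UNICODE flag if you want to strip the punctuation but still keep letters with accents, apostrophes and other Unicode-encoded characters. This may matter if the #Hashtags are not expected to be in English only... maybe this is just an Italian guy's nightmare, maybe not! ;-)

For example: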

>>> text = u"I love #stackoverflòw, because #peoplè... are very #helpfùl! Are they really #helpfùl??? Yes #peoplè in #stackoverflòw are really really #helpfùl!!!"

which is unicode-encoded as:

>>> u'I love #stackoverfl\xf2w, because #peopl\xe8... are very #helpf\xf9l! Are they really #helpf\xf9l??? Yes #peopl\xe8 in #stackoverfl\xf2w are really really #helpf\xf9l!!!'

and you can retrieve your (correctly encoded) #Hashtags this way:

>>> set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
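
To see what the flag changes, the sketch below contrasts Unicode-aware and ASCII-only matching of `\W`. It assumes Python 3, where `str` patterns are Unicode-aware by default (so `re.UNICODE` is redundant there) and `re.ASCII` reproduces the behavior the flag guards against. The helper name `strip_trailing` is hypothetical.

```python
import re

def strip_trailing(token, flags=0):
    """Remove trailing non-word characters from a token."""
    return re.sub(r"\W+$", "", token, flags=flags)

# Unicode-aware (Python 3 default): 'è' counts as a word character and is kept.
print(strip_trailing("#peoplè..."))            # #peoplè

# ASCII-only: 'è' counts as a non-word character and is stripped with the dots.
print(strip_trailing("#peoplè...", re.ASCII))  # #peopl
```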

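EDITx2: UNICODE #Hashtags and control of # repetitions

If you want to control multiple repetitions of the # symbol, as in the following (forgive me if the text example has become almost unreadable):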

>>> text = u"I love ###stackoverflòw, because ##################peoplè... are very ####helpfùl! Are they really ##helpfùl??? Yes ###peoplè in ######stackoverflòw are really really ######helpfùl!!!"
>>> u'I love ###stackoverfl\xf2w, because ##################peopl\xe8... are very ####helpf\xf9l! Are they really ##helpf\xf9l??? Yes ###peopl\xe8 in ######stackoverfl\xf2w are really really ######helpf\xf9l!!!'

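then you should replace these multiple occurrences with a single #. One possible solution is to add another nested set() comprehension in which sub() replaces any run of more than one # with a single #: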

>>> set([re.sub(r"#+", "#", k) for k in set([re.sub(r"(\W+)$", "", j, flags = re.UNICODE) for j in set([i for i in text.split() if i.startswith("#")])])])
set([u'#stackoverfl\xf2w', u'#peopl\xe8', u'#helpf\xf9l'])
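
The three nested steps (filter tokens, strip trailing punctuation, collapse repeated `#`) can also be written as one function. This is a sketch assuming Python 3; `normalize_hashtags` is a hypothetical name, and unlike the one-liner it collapses only the leading run of `#` and skips tokens that consist of nothing but `#` or punctuation.

```python
import re

def normalize_hashtags(text):
    """Return unique hashtags with one leading '#' and no trailing punctuation."""
    tags = set()
    for token in text.split():
        if not token.startswith("#"):
            continue
        tag = re.sub(r"\W+$", "", token)  # drop trailing punctuation
        tag = re.sub(r"^#+", "#", tag)    # collapse the leading run of '#'
        if len(tag) > 1:                  # ignore tokens with no tag text
            tags.add(tag)
    return tags

text = ("I love ###stackoverflòw, because ##################peoplè... "
        "are very ####helpfùl! Are they really ##helpfùl???")
print(sorted(normalize_hashtags(text)))
# ['#helpfùl', '#peoplè', '#stackoverflòw']
```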