Pei*_* Li 12 python regex twitter
I need to preprocess tweets using Python. What regular expressions would remove all hashtags, @user mentions, and links from a tweet, respectively?
For example,
original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
processed tweet: I really love that shirt at Macy
original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn
I only need the meaningful words in each tweet. I don't need usernames, links, or punctuation.
Abh*_*jit 24
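The following example is a close approximation. Unfortunately, there is no right way to do this with a regular expression alone. The regex below strips URLs (any scheme, not just http), punctuation, user names, and any other non-alphanumeric characters, and it separates the remaining words with single spaces. If you want to parse tweets the way you intend, you need more intelligence in the system: some kind of self-learning algorithm, given that there is no standard format for tweet feeds.
Here is what I propose.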
' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
Here are the results on your examples:
>>> import re
>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>
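The pattern used above packs three alternatives into one line. As a minimal sketch, it can be spelled out with `re.VERBOSE` and wrapped in a helper; the names `TWEET_NOISE` and `strip_noise` are hypothetical and not part of the answer, and `\/\/` is written as the equivalent `//`.

```python
import re

# Same three alternatives as the one-liner, written as a raw string so that
# \w, \S and \t reach the regex engine unchanged.
TWEET_NOISE = re.compile(
    r"""
    (@[A-Za-z0-9]+)        # an @username mention
    | ([^0-9A-Za-z \t])    # one character that is not a letter, digit, space or tab
    | (\w+://\S+)          # a URL with any scheme, up to the next whitespace
    """,
    re.VERBOSE,
)

def strip_noise(text):
    """Replace every match with a space, then collapse runs of whitespace."""
    return " ".join(TWEET_NOISE.sub(" ", text).split())

print(strip_noise("@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"))
# I really love that shirt at Macy
```

Compiling the pattern once avoids repeating the long literal on every call, and the per-alternative comments make it easier to see which branch is responsible for a given removal.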
Here are a few examples where it is not perfect:
>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>
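The two variants shown above (keeping or dropping the hashtag word) could be folded into one helper with a switch. This is only a sketch: `clean_tweet` and `drop_hashtags` are hypothetical names, and the underscore added to the mention class is a deviation from the answer's pattern, made on the assumption that handles may contain underscores.

```python
import re

def clean_tweet(text, drop_hashtags=False):
    """Strip mentions, URLs and punctuation; optionally drop hashtags too."""
    # Characters that introduce an entity to be removed as a whole.
    prefixes = "@#" if drop_hashtags else "@"
    pattern = re.compile(
        r"([" + prefixes + r"][A-Za-z0-9_]+)"  # mention (and hashtag, if requested)
        r"|([^0-9A-Za-z \t])"                   # any other non-alphanumeric character
        r"|(\w+://\S+)"                         # URL with any scheme
    )
    return " ".join(pattern.sub(" ", text).split())

x = "RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind"
print(clean_tweet(x))
# RT Gemini recharges through regular contact with people of like mind
print(clean_tweet(x, drop_hashtags=True))
# RT recharges through regular contact with people of like mind
```

Neither setting removes the leading `RT`, which is one of the cases where a single regex falls short.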
A bit late, but this solution handles punctuation errors such as #hashtag1,#hashtag2 (no space in between), and it is very simple to implement.
import re
import string

def strip_links(text):
    # Match http(s) URLs; a raw string keeps the regex escapes intact
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        # link[0] is the outermost group, i.e. the full URL
        text = text.replace(link[0], ', ')
    return text

def strip_all_entities(text):
    entity_prefixes = ['@', '#']
    # Turn every punctuation character except @ and # into a space
    for separator in string.punctuation:
        if separator not in entity_prefixes:
            text = text.replace(separator, ' ')
    # Keep only the words that do not start with @ or #
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)

tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))

# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
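Note that the code above drops the hashtag word entirely ("Macy" disappears), whereas the question's expected output keeps it. A minimal sketch of a variant that removes only the `#` sign follows; `keep_hashtag_words` and the simplified `URL` pattern are assumptions for illustration, not part of the answer.

```python
import re
import string

# Simplified URL pattern (an assumption): http(s) scheme up to the next whitespace.
URL = re.compile(r"https?://\S+")

# Map every punctuation character except @ and # to a space.
PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation if c not in "@#"})

def keep_hashtag_words(text):
    text = URL.sub(" ", text)
    words = []
    for word in text.translate(PUNCT_TO_SPACE).split():
        if word.startswith("@"):
            continue                 # drop mentions entirely
        word = word.lstrip("#")      # keep the hashtag text, drop the sign
        if word:
            words.append(word)
    return " ".join(words)

print(keep_hashtag_words("@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"))
# I really love that shirt at Macy
print(keep_hashtag_words("#hashtag1,#hashtag2 are fun"))
# hashtag1 hashtag2 are fun
```

Because punctuation is turned into spaces before splitting, adjacent hashtags such as `#hashtag1,#hashtag2` are still separated correctly.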