Using regex in R to get Twitter @Username

Che*_*off 4 regex twitter r

How can I use a regular expression in R to extract Twitter usernames from a string of text?

I tried:

library(stringr)

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?

Ben*_*Ben 7

Here is one approach that works in R:

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

If you want to use @Jerry's answer in R:

regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)" 

However, both approaches include the parentheses that you don't want.
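The parenthesis survives because grep() only selects which space-separated tokens contain a match, and each selected token is returned whole. A minimal sketch of that behaviour in base R follows; the names `tokens` and `hits` are illustrative and not taken from the answer.

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
tokens <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"

# grepl() reports, per token, whether the pattern matches anywhere inside it
hits <- grepl(regex, tokens, perl = TRUE)

# Subsetting keeps each matching token in full, punctuation included
tokens[hits]

# The matched substring still starts with "(" because the leading
# group (^|[^@\\w]) consumes the character in front of the @
regmatches(tokens, regexpr(regex, tokens, perl = TRUE))
```

So the pattern is doing its filtering job; what is missing is a step that strips or avoids the surrounding characters.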

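Update: this returns the usernames without parentheses or any other kind of punctuation (except underscores, since they are allowed in usernames):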

theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users

[1] "@foobar" "@fo_o"   "@bar"
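As an alternative sketch that skips the split-and-clean steps, a negative lookbehind can check what precedes the @ without consuming it, so each match contains only the handle. This is not part of the original answer. It assumes a PCRE pattern (perl = TRUE), carries the 15-character limit over from the regex above, and uses the illustrative names `pattern` and `handles`.

```r
theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'

# (?<![\\w@])  : the @ must not be preceded by a word character or another @
# \\w{1,15}\\b : 1 to 15 letters, digits or underscores, ending at a word boundary
pattern <- "(?<![\\w@])@\\w{1,15}\\b"

# gregexpr() finds every match in the string; regmatches() extracts the text
handles <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
handles
```

On the example string this gives @foobar, @fo_o and @bar, and the email address is skipped because its @ follows a word character.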