Tag: grepl

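Using regex to extract URLs into a new data frame column

I want to use a regular expression to extract all URLs from the text in a data frame into a new column. I have some older code that I used to extract keywords, so I would like to adapt it for a regex. I want to save the regex as a string variable and apply it here: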

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that the pattern is a regular expression, but R does not like the way I am trying to save the regex:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

I would like the result to look like this:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013
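
One possible approach, sketched in base R: grepl only reports whether a pattern matches, so extracting the matched text needs something like gregexpr() together with regmatches(). The pattern stored in url_regex is an assumption made for this example (it treats any run of non-space characters after http:// or https:// as the URL) and is not the pattern from the question.

```r
# Sample data mirroring the question
data <- data.frame(
  Content = c("a house a home https://www.foo.com",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  date = c("12/31/2013", "5/4/2013", "1/4/2013"),
  stringsAsFactors = FALSE
)

# Assumed pattern: http or https, "://", then any run of non-space characters
url_regex <- "https?://[^[:space:]]+"

# gregexpr() locates every match; regmatches() pulls out the matched text
matches <- regmatches(data$Content, gregexpr(url_regex, data$Content))

# Join multiple URLs per row with commas; rows without a match become ""
data$ContentURL <- vapply(matches, paste, character(1), collapse = ",")

print(data)
```

Because the regex lives in a plain string variable, it can be swapped for a stricter URL pattern without touching the rest of the code.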


regex r grepl

lmc*_*ane

2014 10-22

Score: 4 · Answers: 1 · Views: 5412

Splitting a character string into parts

I have the following character string:

  l <- "mod, range1 = seq(-m, n, 0.1), range2 = seq(-2, 2, 0.1), range3 = seq(-2, 2, 0.1)"

Using a regular expression in R, I would like to split l into the following structure:

[1] "mod"                      "range1 = seq(-m, n, 0.1)"
[3] "range2 = seq(-2, 2, 0.1)" "range3 = seq(-2, 2, 0.1)"

Unfortunately, I have not found a proper way to solve this. Does anyone know how to get such a split?
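
A minimal sketch of one way to do this: split on commas that are not inside parentheses, using a negative lookahead with perl = TRUE. It assumes the parentheses are balanced and nested at most one level deep, which holds for the string in the question but not in general.

```r
l <- "mod, range1 = seq(-m, n, 0.1), range2 = seq(-2, 2, 0.1), range3 = seq(-2, 2, 0.1)"

# Split at a comma (plus trailing spaces) unless a ")" follows
# before any other parenthesis, i.e. unless the comma sits inside "(...)"
parts <- strsplit(l, ",\\s*(?![^()]*\\))", perl = TRUE)[[1]]

print(parts)
```

For deeper nesting, a small hand-written scanner that counts open parentheses would be more robust than a single regex.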

r substr gsub grepl

And*_*d_R

lucky-day

Score: 4 · Answers: 1 · Views: 118

R - Find all vector elements that contain all strings/patterns - str_detect grep

Sample data

files.in.path = c("a.4.0. name 2015 - NY.RDS", 
                  "b.4.0. name 2016 - CA.RDS", 
                  "c.4.0. name 2015 - PA.RDS")
strings.to.find = c("4.0", "PA")

I want a logical vector showing which elements contain all of the elements of strings.to.find. Desired result:

FALSE FALSE TRUE

This code finds the elements that contain any of strings.to.find, i.e., it uses the OR operator:

str_detect(files.in.path, str_c(strings.to.find, collapse="|")) # OR operator
 TRUE TRUE TRUE

This code attempts to use an AND operator, but it does not work:

str_detect(files.in.path, str_c(strings.to.find, collapse="&")) # AND operator
FALSE FALSE FALSE

This works in a few lines, and I could write a for loop to do the same for a large number of strings.to.find:

det.1 = str_detect(files.in.path,      "4.0"  )   
det.2 = str_detect(files.in.path,      "PA"  )   
det.all = det.1 & det.2
 FALSE FALSE  TRUE

But is there a better way, one that does not involve writing code that depends on strings.to.find?
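
One way to generalize the det.1 & det.2 idea, sketched in base R: run grepl once per search string and combine the resulting logical vectors with Reduce and `&`. Note that this sketch uses fixed = TRUE, which is an assumption on my part: it treats "4.0" as literal text, whereas the original str_detect calls treat the dot as a regex wildcard.

```r
files.in.path <- c("a.4.0. name 2015 - NY.RDS",
                   "b.4.0. name 2016 - CA.RDS",
                   "c.4.0. name 2015 - PA.RDS")
strings.to.find <- c("4.0", "PA")

# One logical vector per search string (literal matching, not regex)
hits <- lapply(strings.to.find, grepl, x = files.in.path, fixed = TRUE)

# Element-wise AND across all of them, for any number of search strings
det.all <- Reduce(`&`, hits)

print(det.all)
```

Replacing `&` with `|` gives the OR version without building a combined pattern.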

r and-operator stringr grepl

LWR*_*RMS

2016 09-12

Score: 4 · Answers: 2 · Views: 5716

R: dplyr::mutate with an ifelse containing grepl() produces unexpected results

What is wrong with this ifelse statement?

df <- data.frame(var1=c('ABC','CAB', 'AB'))
dplyr::mutate(df, var2=ifelse(grepl('^AB',var1), 'AB-starter', var1))

It gives:

  var1       var2
1  ABC AB-starter
2  CAB          3
3   AB AB-starter

I want (using mutate and an ifelse statement) the second element of var2 to hold the value of var1 (i.e., wherever var1 does not start with "AB"):

  var1       var2
1  ABC AB-starter
2  CAB        CAB
3   AB AB-starter
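
A likely explanation is that var1 is a factor (data.frame() converted strings to factors by default before R 4.0.0), so ifelse returns the underlying integer code, here 3, instead of the label. The sketch below reproduces that setup explicitly with stringsAsFactors = TRUE and converts the column with as.character(). It uses base R's transform() to stay self-contained; the same ifelse expression should work inside dplyr::mutate.

```r
# Force factor conversion to reproduce the situation in the question
df <- data.frame(var1 = c("ABC", "CAB", "AB"), stringsAsFactors = TRUE)

# Convert the factor to character before it reaches ifelse()
df <- transform(
  df,
  var2 = ifelse(grepl("^AB", var1), "AB-starter", as.character(var1))
)

print(df)
```

Creating the data frame with stringsAsFactors = FALSE avoids the problem at the source.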


if-statement r dplyr grepl

use*_*672

lucky-day

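Score: 4 · Answers: 1 · Views: 3736

POSIX character classes not working in base R regex

I am having some trouble matching a pattern against a text string in R.

I am trying to get TRUE from grepl when the text looks like "lettersornumbersorspaces y lettersornumbersorspaces".

I am using the following regex: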

([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+

When I use the regex as follows to extract the "address", it works as expected.

regex <- "([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+"
address <- str_extract(fulltext, regex)

I can see that address is the text I need. Now, if I use grepl as follows to get TRUE:

grepl("([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+", address,ignore.case = TRUE)

FALSE is returned. How is that possible? I am using the same regex to get TRUE. I have tried changing the arguments to grepl, but none of them seem relevant.

Example text: "26 de Marzo y Pareyra de la Luz"

Thanks!!
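
A probable cause: in base R regular expressions, POSIX classes such as [:alnum:] are only recognized inside a bracket expression, so they have to be written as [[:alnum:]]. Written bare, [:alnum:] is read as an ordinary character set containing ":", "a", "l", "n", "u" and "m". The regex engine behind str_extract appears to be more lenient about this, which would explain the difference. The rewritten pattern below is my reformulation, not the asker's code.

```r
address <- "26 de Marzo y Pareyra de la Luz"

# POSIX classes wrapped in an outer bracket expression;
# [[:alnum:][:blank:]] means "one alphanumeric or blank character"
regex <- "[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+"

result <- grepl(regex, address)
print(result)
```

The same bracketed pattern should also be accepted by str_extract, so a single string can serve both calls.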

regex r pattern-matching grepl

M.P*_*ico

2017 09-06

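Score: 4 · Answers: 1 · Views: 570

grepl on two vectors

I want to apply grepl to two vectors to see whether each element of the first vector occurs in the corresponding element of the second vector. For example: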

grepl(c("bc","23","a2"),c("abcd","1234","zzzz"))

Since bc is in abcd, 23 is in 1234, and a2 is not in zzzz, I would expect TRUE TRUE FALSE. Instead, I get:

[1]  TRUE FALSE FALSE
Warning message:
In grepl(c("bc", "23", "a2"), c("abcd", "1234", "zzzz")) :
argument 'pattern' has length > 1 and only the first element will be used
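
As the warning says, grepl is not vectorized over its pattern argument. A common workaround, sketched here, is to pair the two vectors element by element with mapply. The variable names patterns and texts are introduced for this example, and fixed = TRUE is an assumption that the patterns are literal substrings rather than regexes.

```r
patterns <- c("bc", "23", "a2")
texts    <- c("abcd", "1234", "zzzz")

# Call grepl(patterns[i], texts[i], fixed = TRUE) for each i
res <- mapply(grepl, patterns, texts,
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(res)
```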


r grepl

Mah*_*oud

lucky-day

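Score: 4 · Answers: 2 · Views: 70

In R, find the column that contains a string for each row

I must be thinking in the wrong search terms, because I cannot believe my problem is unique, yet I have found only one similar question.

I have some rather unwieldy data from the World Bank: flat files that represent a database. The data has one project per row, but each project has multiple characteristics, which sit in columns with names like "SECTOR.1", with their own attributes in other columns named like "SECTOR.1.PCT", and so on.

From this, I am trying to extract the data related to a particular type of SECTOR, but I still need all of the other project information.

I have been able to take a few steps in the right direction thanks to another question I found on SO: Find the index of the column in a data frame that contains a string as a value

A minimal reproducible example, based on the comments on that question, follows: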

> df <- data.frame(col1 = c(letters[1:4],"c"), 
...                  col2 = 1:5, 
...                  col3 = c("a","c","l","c","l"), 
...                  col4= letters[3:7])
> df
  col1 col2 col3 col4
1    a    1    a    c
2    b    2    c    d
3    c    3    l    e
4    d    4    c    f
5    c    5    l    g

My desired output is something like:

1 col4
2 col3
3 col1
4 col3
5 col1

I know I could use ifelse, but that does not seem very elegant. Granted, since this is something I will only do once (for this project), the risk of typos is small. For example:

> df$hasc <- ifelse(grepl("c",df$col1), "col1",
...                         ifelse(grepl("c",df$col2), …
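
Instead of nesting ifelse calls, one option is to build a logical matrix with one column per data frame column and then take the first matching column in each row. This is a sketch under the assumption that "the first column containing the string" is what is wanted when several columns match. The helper names hits, first_hit and hasc are introduced here.

```r
df <- data.frame(col1 = c(letters[1:4], "c"),
                 col2 = 1:5,
                 col3 = c("a", "c", "l", "c", "l"),
                 col4 = letters[3:7],
                 stringsAsFactors = FALSE)

# Logical matrix: does the value in each cell contain "c"?
hits <- sapply(df, function(col) grepl("c", col, fixed = TRUE))

# Index of the first TRUE in each row (NA when a row has no match)
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Translate column indices into column names
hasc <- colnames(hits)[first_hit]

print(data.frame(row = seq_along(hasc), hasc = hasc))
```

Restricting the search to particular columns only requires subsetting df before the sapply call.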


r grepl

jes*_*ssi

2017 05-23

Score: 3 · Answers: 1 · Views: 4146

R: filter rows based on multiple partial strings applied to multiple columns

Sample dataset:

diag01 <- as.factor(c("S7211","J47","J47","K729","M2445","Z509","Z488","R13","L893","N318","L0311","S510","A047","D649"))
diag02 <- as.factor(c("K590","D761","J961","T501","M8580","R268","T831","G8240","B9688","G550","E162","T8902","E86","I849"))
diag03 <- as.factor(c("F058","M0820","E877","E86","G712","R32","A408","E888","G8220","C794","T68","L0310","M1094","D469"))
diag04 <- as.factor(c("E86","C845","R790","I420","G4732","R600","L893","R509","T913","C795","M8412","G8212","L891","L0311"))
diag05 <- as.factor(c("R001","N289","E876","E871","H659","R4589","N508","B99","I209","C773","T921","Q070","H919","L033"))
diag06 <- as.factor(c("I951","E877","S7240","I500","H901","E119","Z223","K590","I959","C509","G819","F719","Z290","R13"))

df <- data.frame(diag01, diag02, diag03, diag04, diag05, diag06)

I want to filter whole rows that have a partial string match anywhere in a given list of columns (e.g. diag01, diag02, and so on). I can do this on a single column, for example:

junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", diag02))

But I need to apply it to multiple columns (the original dataset has 216 columns and >1,000,000 rows). Among other options, I have tried:

junk <- filter(df, grepl(pattern="^E11|^E16|^E86|^E87|^E88", df[,c(1:6)]))
junk <- apply(df, 1, function(r) any(r %in% grepl(pattern="^E11|^E16|^E86|^E87|^E88")))

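I need the whole row and, ideally, I want to restrict the filter condition to a given list of columns, because values in other columns may start with the stated partial strings.

I have tried hard to find a solution, but clearly my knowledge of R is lacking.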
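
A base R sketch of one possible approach: apply grepl to each selected column, then keep the rows where at least one column matched. To stay short it uses a reduced sample (the first four values of diag01 to diag03 from the question) and the vector cols, which is a name introduced here for the list of columns to search. I have not checked how this performs on a dataset with over a million rows.

```r
# Reduced sample: first four values of three of the diag columns
df <- data.frame(
  diag01 = c("S7211", "J47", "J47", "K729"),
  diag02 = c("K590", "D761", "J961", "T501"),
  diag03 = c("F058", "M0820", "E877", "E86"),
  stringsAsFactors = FALSE
)

pattern <- "^(E11|E16|E86|E87|E88)"
cols <- c("diag01", "diag02", "diag03")  # columns the filter may look at

# Logical matrix: one column per searched column, one row per record
hits <- sapply(df[cols], function(col) grepl(pattern, col))

# Keep whole rows where any searched column matched
junk <- df[rowSums(hits) > 0, ]

print(junk)
```

Columns left out of cols are never inspected, so values elsewhere that happen to start with the same codes do not affect the result.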

r filter dplyr grepl

Ler*_*one

2017 09-15

Score: 3 · Answers: 1 · Views: 1526

How to remove all elements that match a pattern from a vector?

ncvars = c("prate", "arate", "wpd", "Atm1", "Atm2", "area", "fC", "bas__1", "bas__asssaa", "bas__Clow", "bas__g2333e", "baser__arge", "bas__Aow", "bas__Aass")

Now I want to remove all elements that

are named exactly area
match the string bas__

How can I do this?

My attempt:

patterns <- c("bas__", "area")
ncvars %>%
  filter(.,grepl(paste(patterns, collapse="|")))
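
dplyr's filter() works on data frames, not on plain vectors, so for a character vector a logical index built with grepl is usually enough. In this sketch "^area$" is anchored so that only the exact name area is dropped, while "bas__" is left unanchored, which is my reading of "match the string bas__".

```r
ncvars <- c("prate", "arate", "wpd", "Atm1", "Atm2", "area", "fC",
            "bas__1", "bas__asssaa", "bas__Clow", "bas__g2333e",
            "baser__arge", "bas__Aow", "bas__Aass")

# TRUE for elements to drop: exactly "area", or containing "bas__"
drop <- grepl("^area$|bas__", ncvars)

kept <- ncvars[!drop]
print(kept)
```

Note that "baser__arge" is kept, because it does not contain the substring "bas__".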


r subset grepl

max*_*oku

2021 05-07

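Score: 3 · Answers: 1 · Views: 2967

Check whether a column contains the value of another column

Is there a way in R to check whether the value in one column contains the value in another column?

In the example below, I am trying to see whether the value in col2 is contained in the value in col1 (independently for each row), but I get a warning message: "argument 'pattern' has length > 1 and only the first element will be used".

The Flag column should show "Yes" for the first and last rows and "No" for the second and third rows. Any ideas on how to solve this would be appreciated.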

col1 <- c("R.S.U.L.C","S.I.W","P.U.E","A.E.N")
col2 <- c("R","U","I","N")

df2 <- data.frame(col1,col2)

df2$Flag <- ifelse(grepl(df2$col2,df2$col1),"Yes","No")
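
The warning arises because grepl uses only the first element of its pattern argument. A row-wise check can be sketched with mapply, which calls grepl once per row. The intermediate vector found is a name introduced here, and fixed = TRUE assumes col2 holds literal text rather than regex patterns.

```r
col1 <- c("R.S.U.L.C", "S.I.W", "P.U.E", "A.E.N")
col2 <- c("R", "U", "I", "N")

df2 <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Row by row: is col2[i] a substring of col1[i]?
found <- mapply(grepl, df2$col2, df2$col1,
                MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

df2$Flag <- ifelse(found, "Yes", "No")
print(df2)
```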


r contains dataframe grepl

Mat*_*ett

2023 02-01

Score: 3 · Answers: 1 · Views: 7656