Posts by leo*_*oce

Non-greedy gsub

I have a log dataset:

V1  duration  id  startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512  7771    1   2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360????[=]C<=>360.cn 7771    1   2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501    7771    1   2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804  7771    1   2012-05-07_12-29-51 211

I'm trying to extract fields (timepoint, process, pid, url, etc.) from the first column. At first I tried:

df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)

It returned something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so then I tried:

df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)

That worked, but it won't work for text fields such as the process name, so I searched for "regex minimal match" and found the term non-greedy. I tried again:

df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1) …
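An alternative that avoids one non-greedy regex per field is to split each record on the literal [=] separator and then on <=>, building a key-value lookup. A minimal sketch, assuming df$V1 holds the raw strings shown above; parse_line and get_key are just illustrative helper names, and the keys (T, P, U, ...) follow the sample rows:

parse_line <- function(x) {
  # split "T<=>161[=]P<=>explorer.exe[=]..." into "key<=>value" chunks
  pairs <- strsplit(strsplit(x, "[=]", fixed = TRUE)[[1]], "<=>", fixed = TRUE)
  vals <- vapply(pairs, `[`, character(1), 2)
  names(vals) <- vapply(pairs, `[`, character(1), 1)
  vals
}

fields <- lapply(df$V1, parse_line)
get_key <- function(key) {
  # return NA for rows that do not contain the requested key
  vapply(fields, function(f) if (key %in% names(f)) f[[key]] else NA_character_, character(1))
}
df$timepoint <- get_key("T")
df$process   <- get_key("P")
df$url       <- get_key("U")   # NA on rows without a U<=> field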

regex r gsub

5 votes · 1 answer · 1112 views

Reading a comma-separated txt file in R when one column contains commas

I have some logs of users' browsing behaviour. They come from a data collector who apparently separated the variables with commas, but some URLs contain commas themselves, so I can't read the txt file into R.

20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank

The URLs above should be:

http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1

https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp

How can I tell R that each line really has 10 variables and keep the commas inside the URLs? Thanks!

df <- read.table('2009.txt', sep= ',', quote= '', comment.char= '', stringsAsFactors= F)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 130 did not have 10 elements
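One possible workaround is to read the raw lines and rebuild exactly 10 fields per line, gluing the extra commas back into the URL. A minimal sketch, assuming only the 6th field (the URL) can contain commas and that the file is '2009.txt' as in the question (variable names are illustrative):

lines <- readLines("2009.txt")
parts <- strsplit(lines, ",", fixed = TRUE)
stopifnot(all(lengths(parts) >= 10))   # every line must split into at least 10 pieces
rows <- lapply(parts, function(p) {
  n <- length(p)
  c(p[1:5],                              # id, timestamp, duration, domain, host
    paste(p[6:(n - 4)], collapse = ","), # URL, with its internal commas restored
    p[(n - 3):n])                        # referrer plus the three category fields
})
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)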

csv url r

5 votes · 1 answer · 893 views

Building a word co-occurrence edge list in R

I have a large set of sentences, and I want to build an undirected edge list of word co-occurrences and count the frequency of each edge. I looked at the tm package but didn't find a similar function. Is there a package or script I can use? Thanks a lot!

Note: a word does not co-occur with itself. A word that appears two or more times co-occurs with the other words in the same sentence only once.

DF:

sentence_id text
1           a b c d e
2           a b b e
3           b c d
4           a e
5           a
6           a a a

OUTPUT

word1 word2 freq
a     b     2
a     c     1
a     d     1
a     e     3
b     c     2
b     d     2
b     e     2
c     d     2
c     e     1
d     e     1
Run Code Online (Sandbox Code Playgroud)

r text-mining network-analysis

5 votes · 1 answer · 2121 views

R 'aggregate' runs out of memory

I have a dataset about microblogs (600 MB with 5,038,720 observations), and I'm trying to figure out how many tweets a user posted within one hour (counting distinct mid values). Here is what the dataset looks like:

head(mydata)

       uid              mid    year month date hour min sec
1738914174 3342412291119279 2011     8    3   21   4  12
1738914174 3342413045470746 2011     8    3   21   7  12
1738914174 3342823219232783 2011     8    5    0  17   5
1738914174 3343095924467484 2011     8    5   18  20  43
1738914174 3343131303394795 2011     8    5   20  41  18
1738914174 3343386263030889 2011     8    6   13  34  25

Here is my code:

count <- function(x) {
  length(unique(na.omit(x)))
}
attach(mydata)
hourPost <- aggregate(mid, by=list(uid, hour), FUN=count)

It hung there for about half an hour, and I found that all of the physical memory (24 GB) was used up and it started using virtual memory. Any idea why this small task consumes so much time and memory, and how can I improve it? Thanks in advance!
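With roughly five million rows and many (uid, hour) groups, aggregate() with a user-defined function copies a lot of data, which is likely where the time and memory go. A common alternative is data.table; a minimal sketch of the same per-(uid, hour) count of distinct mids, assuming mydata is the data frame shown above (n_tweets is just an illustrative column name):

library(data.table)
dt <- as.data.table(mydata)
hourPost <- dt[, .(n_tweets = uniqueN(na.omit(mid))), by = .(uid, hour)]

This also avoids attach(), so the grouping columns are taken directly from the table rather than from the search path.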

memory aggregate r

3 votes · 1 answer · 438 views

MySQL float column with empty values: ERROR 1265 (01000): Data truncated for column

I'm trying to store the GDELT dataset in a MySQL database (MySQL 8.0 on RHEL 7), but it returns ERROR 1265 (01000) because one of the float columns contains empty values:

CREATE TABLE event (
    GlobalEventID INT NOT NULL,
    Day INT NOT NULL,
    MonthYear MEDIUMINT NOT NULL,
    Year SMALLINT NOT NULL,
    FractionDate FLOAT NOT NULL,
    Actor1Code TINYTEXT NULL,
    Actor1Name TINYTEXT NULL,
    Actor1CountryCode TINYTEXT NULL,
    Actor1KnownGroupCode TINYTEXT NULL,
    Actor1EthnicCode TINYTEXT NULL,
    Actor1Religion1Code TINYTEXT NULL,
    Actor1Religion2Code TINYTEXT NULL,
    Actor1Type1Code TINYTEXT NULL,
    Actor1Type2Code TINYTEXT NULL,
    Actor1Type3Code TINYTEXT NULL,
    Actor2Code TINYTEXT NULL,
    Actor2Name TINYTEXT NULL,
    Actor2CountryCode TINYTEXT NULL,
    Actor2KnownGroupCode TINYTEXT NULL,
    Actor2EthnicCode TINYTEXT NULL,
    Actor2Religion1Code TINYTEXT …

mysql gdelt

1 vote · 1 answer · 215 views

Tag statistics

r ×4

aggregate ×1

csv ×1

gdelt ×1

gsub ×1

memory ×1

mysql ×1

network-analysis ×1

regex ×1

text-mining ×1

url ×1