Tag: data-manipulation

R: Create a data frame from a rolling window

Suppose I have a data frame with the following structure:

DF <- data.frame(x = 0:4, y = 5:9)
> DF
  x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9

What is the most efficient way to transform `DF` into a data frame with the following structure:

w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9

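where w identifies a rolling window of length 2 over the data frame `DF`. The window length should be arbitrary; for example, a length of 3 yields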

w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 …
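
A minimal sketch of the transformation described above, written in plain Python rather than R. The function name `rolling_windows` and the tuple representation of rows are assumptions made for illustration. It emits one copy of each row per full window that contains it, tagged with a 1-based window index w.

```python
def rolling_windows(rows, width):
    """Return (w, *row) tuples for every full window of `width` consecutive rows."""
    out = []
    # The last full window starts at index len(rows) - width.
    for start in range(len(rows) - width + 1):
        for row in rows[start:start + width]:
            out.append((start + 1, *row))  # w is 1-based, as in the question
    return out


# Same values as DF in the question: x = 0..4, y = 5..9
DF = list(zip(range(0, 5), range(5, 10)))

for w, x, y in rolling_windows(DF, 2):
    print(w, x, y)
```

Calling `rolling_windows(DF, 3)` gives the length-3 layout; only complete windows are produced, so a frame shorter than the window yields nothing.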

r data-manipulation data-management rolling-computation

6 votes, 1 answer, 1353 views

Create a variable capturing the most frequent occurrence by group

Define:

df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)

such that

> df1
  id v1
1  1  a
2  1  b
3  1  b
4  2  c
5  2  c
6  2  c

I want to create a third variable, freq, that contains the most frequent observation of v1 within each id, such that

> df2
  id v1 freq
1  1  a    b
2  1  b    b
3  1  b    b
4  2  c    c
5  2  c    c
6  2  c    c
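
The grouping logic can be sketched in plain Python (not R), assuming rows are (id, v1) pairs. The helper name `add_group_mode` is hypothetical. Ties are resolved in favour of the value seen first, which is a choice of this sketch and is not specified in the question.

```python
from collections import Counter, defaultdict


def add_group_mode(rows):
    """Append to each (id, v1) row the most frequent v1 within its id."""
    counts = defaultdict(Counter)
    for gid, value in rows:
        counts[gid][value] += 1
    # most_common(1) returns the top value; equal counts keep first-seen order.
    mode = {gid: c.most_common(1)[0][0] for gid, c in counts.items()}
    return [(gid, value, mode[gid]) for gid, value in rows]


df1 = [(1, "a"), (1, "b"), (1, "b"), (2, "c"), (2, "c"), (2, "c")]
df2 = add_group_mode(df1)
for row in df2:
    print(*row)
```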

r frequency count data-manipulation data-management

6 votes, 1 answer, 2756 views

Select the nth element from a multidimensional JSON array using jq

How can I use jq to transform this array of arrays:

[
  [
    "sequence",
    "int"
  ],
  [
    "time",
    "string"
  ],
  ...
]

into an array containing the first (index 0) element of each subarray? That is, producing output like this:

[
    "sequence",
    "time",
    ...
]
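
For comparison, the same selection in plain Python, using only the two subarrays shown in the question (the elided `...` entries are left out). In jq itself, the filter `map(.[0])` expresses the same idea without `reduce`.

```python
import json

# Only the two subarrays visible in the question.
raw = '[["sequence", "int"], ["time", "string"]]'

pairs = json.loads(raw)
firsts = [pair[0] for pair in pairs]  # index 0 of every subarray

print(json.dumps(firsts))
```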

I was thinking of using something like `reduce xx as $item (...)`, but I have not managed to come up with anything useful.

command-line json data-manipulation jq

6 votes, 1 answer, 2199 views

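Create new binary variables from a single string of levels recorded for each observation

I have been playing with the Kaggle West Nile virus competition data as a way to practise fitting spatio-temporal GAMs. The first few rows of the weather data (lightly processed from the original CSV) are shown below (the dput() output for the first 20 rows is at the end of the question).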

> head(weather)
  Station       Date Tmax Tmin Tavg Depart DewPoint WetBulb Heat Cool Sunrise
1       1 2007-05-01   83   50   67     14       51      56    0    2     448
2       2 2007-05-01   84   52   68     NA       51      57    0    3      NA
3       1 2007-05-02   59   42   51     -3       42      47   14    0     447
4       2 2007-05-02   60   43   52     NA       42      47   13    0      NA
5       1 2007-05-03   66   46   56      2       40      48    9    0 …

r data-manipulation data-processing

6 votes, 1 answer, 200 views

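Search and replace a string in a very large file

I would prefer a shell command for this task. I have a very, very large file, about 2.8 GB, whose content is JSON. Everything is on a single line, and I have been told it holds at least 1.5 million records.

I have to prepare the file for consumption. Each record must be on its own line. Sample: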

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, use the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email 
Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

The end result should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Commands tried:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

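The commands above work perfectly well on small files, but not on the 2.8 GB file I have to process. sed quit midway after 10 minutes for no apparent reason, without having done anything. awk failed with a segmentation fault (core dumped) after many hours. I also tried Perl's search and replace and got an "Out of memory" error.

Any help or ideas would be great!

Additional information about my machine:

  • More than 105 GB of disk space. …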
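
One way to avoid loading the whole single-line file into memory is to stream it in fixed-size chunks. The following is a minimal Python sketch rather than a shell command. It assumes, as the sed and awk attempts above do, that the literal byte sequence `,{"RecordId"` occurs only at record boundaries. The function name `split_records` and its parameters are hypothetical.

```python
import io


def split_records(src, dst, marker=b',{"RecordId"', chunk_size=1 << 20):
    """Copy src to dst, inserting a newline after the comma in each marker.

    Memory use stays near chunk_size regardless of the file size.
    """
    replacement = b",\n" + marker[1:]
    keep = len(marker) - 1  # longest marker prefix that can straddle two chunks
    tail = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf = (tail + chunk).replace(marker, replacement)
        # Hold back the last few bytes in case a marker is split across chunks.
        dst.write(buf[:-keep])
        tail = buf[-keep:]
    dst.write(tail)


# Small in-memory demonstration; real files would be opened in binary mode.
data = b'{"Alphabet":[{"RecordId":"1"},{"RecordId":"2"},{"RecordId":"3"}]}'
out = io.BytesIO()
split_records(io.BytesIO(data), out, chunk_size=7)
print(out.getvalue().decode())
```

For the real file, `src` and `dst` would be objects returned by `open(path, "rb")` and `open(path, "wb")`. The tiny `chunk_size` here only exercises the boundary handling.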

perl awk json data-manipulation large-files

6 votes, 1 answer, 598 views

Split a pandas dataframe column by digit

I have a pandas dataframe with two columns, key and value; the value always consists of an 8-digit number

>df1
key value
10  10000100
20  10000000
30  10100000
40  11110000

Now I need to take the value column and split it into its individual digits, so that my result is a new dataframe

>df_res
key 0 1 2 3 4 5 6 7
10  1 0 0 0 0 1 0 0
20  1 0 0 0 0 0 0 0
30  1 0 1 0 0 0 0 0
40  1 1 1 1 0 0 0 0

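I cannot change the input data format. I think the most conventional approach would be to convert the value to a string, loop over each digit character, and put it into a list, but I am looking for something more elegant and faster. Please help.

Edit: the input is not a string; it is an integer.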
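
Since the values are integers, the digits can be peeled off arithmetically without converting to a string. This is a sketch in plain Python, not pandas. The helper name `split_digits` and the dict standing in for the dataframe are assumptions for illustration.

```python
def split_digits(value, width=8):
    """Return the `width` decimal digits of a non-negative integer, most significant first."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 10)  # strip the lowest digit
        digits.append(digit)
    return digits[::-1]


# Stand-in for df1: key -> value
df1 = {10: 10000100, 20: 10000000, 30: 10100000, 40: 11110000}

df_res = {key: split_digits(value) for key, value in df1.items()}
for key, digits in df_res.items():
    print(key, *digits)
```

The same divide-and-remainder step could in principle be applied to a whole column at once, which is where a vectorised version would gain its speed.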

python data-manipulation dataframe pandas

6 votes, 2 answers, 950 views

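dplyr's filter function: how to return every value (or "cancel" the effect of the filter)?

This may seem like a strange question, but is there a way to pass a value to filter() that basically does nothing?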

data(cars)
library(dplyr)
cars %>% filter(speed==`magic_value_that_returns_cars?`)

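And you would get back the whole data frame cars. I think this would be useful in a Shiny app where the user simply picks the value to filter on. For example, the user could select "Europe", "Africa", or "America"; behind the scenes the data frame would be filtered and a table of descriptive statistics for "Europe" returned (if the user selected "Europe"). But what if the user wants the descriptive statistics without filtering first? Is there a value that can be passed to filter() to "cancel" the filter and pass the whole data frame on to summary()?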
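
One common pattern, sketched here in plain Python rather than dplyr, is to reserve a sentinel such as `None` to mean "no filter" and skip the comparison in that case. The function `filter_rows` and the sample rows are hypothetical and only illustrate the idea.

```python
def filter_rows(rows, column, wanted=None):
    """Keep rows whose `column` equals `wanted`; None means "do not filter"."""
    if wanted is None:
        return list(rows)
    return [row for row in rows if row[column] == wanted]


# Hypothetical data standing in for the app's table.
rows = [
    {"region": "Europe", "value": 1},
    {"region": "Africa", "value": 2},
    {"region": "Europe", "value": 3},
]

print(len(filter_rows(rows, "region", "Europe")))  # filtered
print(len(filter_rows(rows, "region")))            # unfiltered: every row
```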

r data-manipulation dplyr

6 votes, 2 answers, 1295 views

Number of companies per year using dplyr or data.table

Let's say I have a data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
                 YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
                 YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))

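where YearFrom is, for example, the year a company was founded and YearTo is the year it was dissolved. If YearTo is NA, the company is still operating.

I want to count the number of companies in each year.

The table should look like this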

City    |"Year"   |"Count"
"NY"    |2001       1
"NY"    |2002       2
"NY"    |2003       3
"NY"    |2004       3
"NY"    |2005       2
"NY"    |2006       3
"NY"    |2007       3
"NY"    |2008       4
"NY"    |2009       3
"LA"    |2001       0
"LA"    |2002       1
"LA"    |2003       1
"LA"    |2004 …
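
The counting rule implied by the table can be sketched in plain Python, not dplyr or data.table. A company counts in a year when YearFrom <= year and, if YearTo is present, year < YearTo. That half-open interval is inferred from the NY rows shown (the 2003 to 2005 company no longer counts in 2005) and is an assumption. The years are stored as integers here instead of strings, and the function name is hypothetical.

```python
# (City, YearFrom, YearTo); None stands in for NA.
companies = [
    ("NY", 2001, None), ("NY", 2003, 2005), ("NY", 2002, None),
    ("NY", 2006, None), ("NY", 2008, 2009),
    ("LA", 2004, None), ("LA", 2005, 2008), ("LA", 2005, None),
    ("LA", 2002, None),
]


def companies_per_year(rows, first_year, last_year):
    """Count active companies for each (city, year) in the inclusive year range."""
    cities = list(dict.fromkeys(city for city, _, _ in rows))  # keep first-seen order
    counts = {}
    for city in cities:
        for year in range(first_year, last_year + 1):
            counts[(city, year)] = sum(
                1
                for c, start, end in rows
                if c == city and start <= year and (end is None or year < end)
            )
    return counts


counts = companies_per_year(companies, 2001, 2009)
for (city, year), n in counts.items():
    print(city, year, n)
```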

r data-manipulation dplyr data.table

6 votes, 3 answers, 232 views

Cumulative minimum by group

I want to compute the cumulative min within each group.

My current data frame:

Group <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
Target <- c(1, 0, 5, 0, 3, 5, 1, 3)
data <- data.frame(Group, Target)

My desired output:

Desired.Variable <- c(1, 0, 0, 0, 3, 3, 1, 1)
data <- data.frame(Group, Target, Desired.Variable)

Any help with this would be much appreciated!
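
The desired column is a running minimum that restarts for each group. This is a minimal sketch in plain Python rather than R, with the hypothetical helper `grouped_cummin` keeping one running minimum per group label.

```python
def grouped_cummin(groups, values):
    """Return the running minimum of `values`, tracked separately per group."""
    running = {}
    out = []
    for group, value in zip(groups, values):
        # The first value of a group starts its running minimum.
        running[group] = value if group not in running else min(running[group], value)
        out.append(running[group])
    return out


Group = ["A", "A", "A", "A", "B", "B", "B", "B"]
Target = [1, 0, 5, 0, 3, 5, 1, 3]

print(grouped_cummin(Group, Target))
```

Because the minimum is keyed by group label, the rows of a group do not need to be contiguous.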

r data-manipulation

6 votes, 1 answer, 493 views

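Is there a workaround for finding an optimal threshold to filter raw features based on a correlation matrix in R?

I intend to extract highly correlated features by measuring their Pearson correlation, and I obtained a correlation matrix by doing so. However, to filter the highly correlated features I picked the correlation coefficient cutoff arbitrarily, and I do not know what the optimal threshold is. I am thinking of first quantifying the positively and negatively correlated features, and then using those figures to set the filtering threshold. Can anyone point out how to quantify positively and negatively correlated features from a correlation matrix? Is there an efficient way to choose an optimal threshold for filtering highly correlated features?

Reproducible data

Here is the reproducible data I used, where the rows are samples and the columns are raw features: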

> dput(my_df)
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01", 
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01", 
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6, 
30.6), `1_at` = c(6.06221469449721, 5.8755020052495, 6.12613148162098, 
6.1345548976595, 6.28953417729806, 6.08561779473768, 6.25857984382111, 
6.22016811759586, 6.22269236303877, 6.11986885253451), `10_at` = c(3.79648446367096, 
3.45024474095539, 3.62841140410044, 3.51232455992681, 3.56819306931016, 
3.54911765491621, 3.59024881523945, 3.69553021972333, 3.61860245801661, 
3.74019994293802), `100_at` = c(5.84933778267459, 6.55052475296263, 
6.42187743053935, 6.15489279092855, 6.34807354206396, 6.11780116002087, 
6.24635169763079, 6.25479583503303, 6.16095987926232, 6.26979789563404
), `1000_at` = c(3.5677794435745, 3.31613364795286, 3.43245075704917, 
3.63813996294905, 3.39904385276621, 3.54214650423219, 3.51532853598111, 
3.50451431462302, 3.38965905673286, 3.54646930636612), `10000_at` …

r data-manipulation feature-extraction correlation

6 votes, 1 answer, 147 views