Suppose I have a data frame with the following structure:
DF <- data.frame(x = 0:4, y = 5:9)
> DF
x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9
What is the most efficient way to convert DF into a data frame with the following structure:
w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9
where w indexes a rolling window of length 2 over the data frame DF. The window length should be arbitrary, i.e. a length of 3 would yield:
w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 …
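For reference, a possible sketch (not from the question; assuming plain base R and that the window length is held in a variable window_len) builds each window slice, tags it with its index w, and stacks the slices:

window_len <- 2                                    # any window length
slices <- lapply(seq_len(nrow(DF) - window_len + 1), function(i) {
  cbind(w = i, DF[i:(i + window_len - 1), ])       # tag the slice with its window index
})
result <- do.call(rbind, slices)                   # stack the slices into one data frame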
Given:
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = as.character(c("a", "b", "b", rep("c", 3)))
)
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I would like to create a third variable, freq, containing the most frequent observation of v1 by id:
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
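A possible sketch (mine, not the asker's; assuming dplyr is acceptable, since it is tagged here) computes the per-id mode of v1 and recycles it within each group:

library(dplyr)
df2 <- df1 %>%
  group_by(id) %>%
  mutate(freq = names(which.max(table(v1)))) %>%   # most frequent v1 value within each id
  ungroup()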
How can I use jq to convert this array of arrays:
[
[
"sequence",
"int"
],
[
"time",
"string"
],
...
]
into an array containing the first (0th) element of each sub-array? That is, producing output like this:
[
"sequence",
"time",
...
]
I was thinking of using reduce xx as $item (...), but I haven't managed to come up with anything useful.
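For what it's worth, a minimal sketch (assuming the input is exactly the array of arrays shown above): map each sub-array to its first element, no reduce needed.

jq 'map(.[0])' input.json        # equivalent to: jq '[ .[] | .[0] ]' input.json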
I've been fiddling with the Kaggle West-Nile virus competition data as a way to practise fitting spatio-temporal GAMs. The first few rows of the weather data (lightly processed from the original CSV) are below, plus dput() output of the first 20 rows at the end of the question.
> head(weather)
Station Date Tmax Tmin Tavg Depart DewPoint WetBulb Heat Cool Sunrise
1 1 2007-05-01 83 50 67 14 51 56 0 2 448
2 2 2007-05-01 84 52 68 NA 51 57 0 3 NA
3 1 2007-05-02 59 42 51 -3 42 47 14 0 447
4 2 2007-05-02 60 43 52 NA 42 47 13 0 NA
5 1 2007-05-03 66 46 56 2 40 48 9 0 …
A shell command is my preferred way to get this done. I have a very, very large file, about 2.8 GB, whose content is JSON. Everything is on a single line, and I've been told it contains at least 1.5 million records.
I have to prepare the file for consumption; each record must stand on its own line. Sample:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}
Or, with content like the following...
{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}
The end result should be:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Commands attempted:
sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat
The commands attempted work perfectly on small files, but not on the 2.8 GB file I have to process. For no apparent reason, sed quit midway after 10 minutes without having done anything; awk errored out with a segmentation fault (core dumped) after many hours; and a Perl search-and-replace finished with an "Out of memory" message.
Any help or ideas would be great!
Additional information about my machine:
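Not the asker's solution, but a hedged sketch that stays with shell tools: GNU awk can use the record marker itself as the record separator (RS), so the file is streamed record by record instead of being held as one 2.8 GB line in memory. This assumes GNU awk (a multi-character RS is treated as a regex) and that every record begins with ,{"RecordId" as in the first sample; the marker would need adjusting for the Accounts/Customer layout.

gawk 'BEGIN { RS = ",[{]\"RecordId\""; ORS = "" }   # records are split at every ,{"RecordId" marker
      NR > 1 { printf ",\n{\"RecordId\"" }          # put the marker back, preceded by a newline
      { print }' sample.dat > sample_split.dat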
I have a pandas DataFrame with two columns, key and value, where value always consists of 8 digits:
>df1
key value
10 10000100
20 10000000
30 10100000
40 11110000
Now I need to take the value column and split it into its constituent digits, so that my result is a new DataFrame:
>df_res
key 0 1 2 3 4 5 6 7
10 1 0 0 0 0 1 0 0
20 1 0 0 0 0 0 0 0
30 1 0 1 0 0 0 0 0
40 1 1 1 1 0 0 0 0
I can't change the input data format. I suppose the most conventional approach would be to convert the value to a string, loop over each digit character and put it into a list, but I'm looking for something more elegant and faster. Please help.
Edit: the input is not a string, it is an integer.
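A possible sketch (mine; assuming, as stated in the edit, that value is an integer of at most 8 digits): convert once to zero-padded strings and let pandas expand the characters into one column per digit position.

import pandas as pd

df1 = pd.DataFrame({'key': [10, 20, 30, 40],
                    'value': [10000100, 10000000, 10100000, 11110000]})

# zero-pad to 8 characters, then split into one integer column per position
digits = (df1['value'].astype(str).str.zfill(8)
          .apply(lambda s: pd.Series([int(c) for c in s])))
df_res = pd.concat([df1['key'], digits], axis=1)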
This may seem like a strange question, but is there a value you can pass to filter() that essentially does nothing?
data(cars)
library(dplyr)
cars %>% filter(speed==`magic_value_that_returns_cars?`)
...and you would get the entire cars data frame back. I think this would be useful in a Shiny app where the user simply selects the value he wants to filter on; for example, the user could choose "Europe", "Africa" or "America", the data frame would be filtered behind the scenes, and a table of descriptive statistics for "Europe" would be returned (if the user chose "Europe"). But what if the user wants the descriptive statistics without filtering first? Is there a value that can be passed to filter() to "cancel" the filter and pass the whole data frame on to summary()?
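For what it's worth, a minimal sketch of the idea (assuming current dplyr behaviour, where a length-one logical condition is recycled across all rows):

library(dplyr)
data(cars)

cars %>% filter(TRUE)    # TRUE keeps every row, i.e. a "no-op" filter

# so a hypothetical user selection could be handled as, e.g.:
selected <- "All"        # placeholder for the value coming from the UI
cars %>% filter(if (selected == "All") TRUE else speed == as.numeric(selected))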
Let's say I have the data frame:
df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))
where YearFrom is, for example, the year a company was founded and YearTo the year it was dissolved. If YearTo is NA, the company is still operating.
I want to count the number of companies in each year.
The table should look like this:
City | Year | Count
"NY" | 2001 | 1
"NY" | 2002 | 2
"NY" | 2003 | 3
"NY" | 2004 | 3
"NY" | 2005 | 2
"NY" | 2006 | 3
"NY" | 2007 | 3
"NY" | 2008 | 4
"NY" | 2009 | 3
"LA" | 2001 | 0
"LA" | 2002 | 1
"LA" | 2003 | 1
"LA" | 2004 …
I would like to compute the cumulative min within each group.
My current data frame:
Group <- c('A', 'A', 'A','A', 'B', 'B', 'B', 'B')
Target <- c(1, 0, 5, 0, 3, 5, 1, 3)
data <- data.frame(Group, Target)
My desired output:
Desired.Variable <- c(1, 0, 0, 0, 3, 3, 1, 1)
data <- data.frame(Group, Target, Desired.Variable)
Any help with this would be greatly appreciated!
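For reference, a minimal sketch with base R's ave(), which applies cummin within each group:

data$Desired.Variable <- ave(data$Target, data$Group, FUN = cummin)

(A dplyr equivalent would be mutate(Desired.Variable = cummin(Target)) after group_by(Group).)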
I plan to extract highly correlated features by measuring their Pearson correlation, from which I obtain a correlation matrix. However, to filter the highly correlated features I chose the correlation-coefficient cut-off arbitrarily, and I don't know the best threshold for this filtering. I am thinking of first quantifying the positively and negatively correlated features and using that as a sound basis for setting the threshold. Can anyone point out how I can quantify the positively and negatively correlated features from the correlation matrix? And is there an efficient way to choose the best threshold for filtering highly correlated features?
Reproducible data
Here is the reproducible data I am using, where the rows are samples and the columns are the original features:
> dput(my_df)
structure(list(SampleID = c("Tarca_001_P1A01", "Tarca_013_P1B01",
"Tarca_025_P1C01", "Tarca_037_P1D01", "Tarca_049_P1E01", "Tarca_061_P1F01",
"Tarca_051_P1E03", "Tarca_063_P1F03", "Tarca_075_P1G03", "Tarca_087_P1H03"
), GA = c(11, 15.3, 21.7, 26.7, 31.3, 32.1, 19.7, 23.6, 27.6,
30.6), `1_at` = c(6.06221469449721, 5.8755020052495, 6.12613148162098,
6.1345548976595, 6.28953417729806, 6.08561779473768, 6.25857984382111,
6.22016811759586, 6.22269236303877, 6.11986885253451), `10_at` = c(3.79648446367096,
3.45024474095539, 3.62841140410044, 3.51232455992681, 3.56819306931016,
3.54911765491621, 3.59024881523945, 3.69553021972333, 3.61860245801661,
3.74019994293802), `100_at` = c(5.84933778267459, 6.55052475296263,
6.42187743053935, 6.15489279092855, 6.34807354206396, 6.11780116002087,
6.24635169763079, 6.25479583503303, 6.16095987926232, 6.26979789563404
), `1000_at` = c(3.5677794435745, 3.31613364795286, 3.43245075704917,
3.63813996294905, 3.39904385276621, 3.54214650423219, 3.51532853598111,
3.50451431462302, 3.38965905673286, 3.54646930636612), `10000_at` …
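Not an answer to the threshold question itself, but a small sketch (assuming my_df is the frame above and that only its numeric columns should enter the correlation) of how positively and negatively correlated feature pairs could be counted at a given cut-off:

num_cols <- my_df[, sapply(my_df, is.numeric)]    # drop SampleID, keep the numeric features
cor_mat  <- cor(num_cols, method = "pearson")

cutoff    <- 0.8                                  # arbitrary; choosing this value is the open question
pair_cors <- cor_mat[upper.tri(cor_mat)]          # each feature pair counted once
c(positive = sum(pair_cors >=  cutoff),
  negative = sum(pair_cors <= -cutoff))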