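If I want to deparse a function's argument into an error or warning message, something strange happens when the argument is converted to a data.table inside the function: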
library(data.table)
e <- data.frame(x = 1:10)
### something strange is happening
foo <- function(u) {
  u <- data.table(u)
  warning(deparse(substitute(u)), " is not a data.table")
  u
}
foo(e)
## foo(e)
## x
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
## 10: 10
## Warning message:
## In foo(e) :
## structure(list(x = 1:10), .Names = "x", row.names = c(NA, -10L), class …

I have two data.tables, dtp and dtab:
require(data.table)
set.seed(1)
dtp <- data.table(pid = gl(3, 3, labels = c("du", "i", "nouana")),
                  year = gl(3, 1, 9, labels = c("2007", "2010", "2012")),
                  val = rnorm(9), key = c("pid", "year"))
dtab <- data.table(pid = factor(c("i", "nouana")),
                   year = factor(c("2010", "2000")),
                   abn = sample(1:5, 2, replace = TRUE),
                   key = c("pid", "year"))
dtp
## pid year val
## 1: du 2007 -0.6264538
## 2: du 2010 0.1836433
## 3: du 2012 -0.8356286
## 4: i 2007 …

First of all, thanks for implementing shift() in data.table 1.9.6. When I have many distinct groups, shift() is, contrary to expectation, slower than my old code:
library(data.table)
library(microbenchmark)
set.seed(1)
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
                 value = rnorm(5000))
microbenchmark(dt194 = mg[, l1 := c(value[-1], NA), by = .(id)],
               dt196 = mg[, l2 := shift(value, n = 1, type = "lead"),
                          by = .(id)])
## Unit: milliseconds
## expr min lq mean median uq max eval
## dt194 4.93735 5.236034 5.718654 5.623736 5.74395 9.555922 100
## dt196 83.92612 87.530404 91.700317 90.953947 91.43783 257.473242 100
The detailed script is here: https://github.com/nachti/datatable_test/blob/master/leadtest.R
Am I misusing shift()?
Edit: avoiding …
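Before comparing timings, it can help to confirm that the two expressions really compute the same lead. The following is a minimal sketch that reuses the `mg` table and the `l1`/`l2` column names from the benchmark above; the variable `same_lead` is introduced here only for illustration.

```r
library(data.table)

set.seed(1)
mg <- data.table(expand.grid(year = 2012:2016, id = 1:1000),
                 value = rnorm(5000))

# Old approach: drop the first value of each group and pad with NA at the end.
mg[, l1 := c(value[-1], NA), by = .(id)]

# shift() approach: lead by one row within each group.
mg[, l2 := shift(value, n = 1, type = "lead"), by = .(id)]

# Both columns should agree, including the trailing NA in every group.
same_lead <- isTRUE(all.equal(mg$l1, mg$l2))
same_lead
```

If the two columns agree, the benchmark compares equivalent work, so any timing gap comes from how each expression is evaluated per group rather than from a difference in results.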
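After I managed to connect to our (new) cluster using sparklyr with the yarn-client method, I can now only show the tables from the default schema. How can I connect to scheme.table? With DBI it works, e.g. with the following line: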
dbGetQuery(sc, "SELECT * FROM scheme.table LIMIT 10")
In HUE I can show all tables from all schemas.
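One option that may work here is `dbplyr::in_schema()`, which builds a schema-qualified table name that `dplyr::tbl()` accepts. The sketch below has not been run against a YARN cluster: `sc` is assumed to be an open sparklyr connection, `"scheme"` and `"table"` are the placeholder names from the query above, and `read_schema_table` is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical helper: returns a lazy reference to <schema>.<table>.
# No rows are pulled into R until collect() is called on the result.
read_schema_table <- function(sc, schema, table) {
  tbl(sc, dbplyr::in_schema(schema, table))
}

# Usage, assuming `sc` is an existing connection created with spark_connect():
# scheme_tbl <- read_schema_table(sc, "scheme", "table")
# head(scheme_tbl, 10)
```

Because the result is a lazy table reference, the usual dplyr verbs can be chained on it and are translated to SQL on the cluster side.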
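Suppose there are two tables or table references in Spark that you want to compare, e.g. to make sure a backup worked correctly. Is it possible to do that remotely in Spark? Copying all the data to R with collect() is not practical.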
library(sparklyr)
library(dplyr)
library(DBI)
##### create spark connection here
# sc <- spark_connect(<yourcodehere>)
spark_connection(sc)
spark_context(sc)
trees1_tbl <- sdf_copy_to(sc, trees, "trees1")
trees2_tbl <- sdf_copy_to(sc, trees, "trees2")
identical(trees1_tbl, trees2_tbl) # FALSE
identical(collect(trees1_tbl), collect(trees2_tbl)) # TRUE
setequal(trees1_tbl, trees2_tbl) # FALSE
setequal(collect(trees1_tbl), collect(trees2_tbl)) # TRUE
spark_disconnect(sc)
It would be nice if dplyr::setequal() could be used directly.
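One way to keep the comparison on the remote side is to express it with dplyr verbs that return only row counts, so nothing but a few single-row results is collected. The sketch below is a hedged illustration, not a tested Spark recipe: `same_rows` is a hypothetical helper, and it assumes the backend translates `setdiff()` and `summarise(n = n())` to SQL. It is demonstrated on local data frames so that it runs without a cluster.

```r
library(dplyr)

# Hypothetical helper: TRUE when `a` and `b` have the same number of rows
# and neither contains a distinct row that is missing from the other.
# Only row counts are collected, never the tables themselves.
same_rows <- function(a, b) {
  count_rows <- function(x) {
    x %>% summarise(n = n()) %>% collect() %>% pull(n)
  }
  count_rows(a) == count_rows(b) &&
    count_rows(dplyr::setdiff(a, b)) == 0 &&
    count_rows(dplyr::setdiff(b, a)) == 0
}

# Local demonstration with plain data frames; with Spark, the table
# references (e.g. trees1_tbl and trees2_tbl) would be passed instead.
trees_changed <- trees
trees_changed$Height[1] <- 0

same_rows(trees, trees)
same_rows(trees, trees_changed)
```

Note that `setdiff()` has set semantics, so duplicate rows are only covered by the total row count; two tables with the same distinct rows and the same size but different duplicate multiplicities would still pass this check.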
I have a latin1-encoded CSV file with nested quotes:
Ort;Straße;Bezeichnung
Vienna;Testgasse 1;"Ministerium ""Pestalozzi"""
Graz;Teststraße 3;HS
Salzburg;Beispielstraße 9;"NMS ""Die Schlauen"""
Vienna;Wolfgang-Straße 7;"Wirtshaus ""Wien III"""
Using fread from data.table 1.9.6 gives a wrong special character (ß) in the header, whereas every ß in the rows below is correct, and the nested quotes stay as "":
dat <- fread("latin1quotedat.csv", encoding = "Latin-1")
dat # wrong header, wrong quotes
        Ort         Stra\xdfe                Bezeichnung
1:   Vienna       Testgasse 1 Ministerium ""Pestalozzi""
2:     Graz      Teststraße 3                         HS
3: Salzburg  Beispielstraße 9       NMS ""Die Schlauen""
4:   Vienna Wolfgang-Straße 7     Wirtshaus ""Wien III""
Using read.csv2 from base R, everything is as expected:
dat1 <- read.csv2("latin1quotedat.csv", encoding = "latin1")
dat1 # ok
Ort Straße Bezeichnung
1 Vienna Testgasse 1 Ministerium "Pestalozzi" …
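To make the base R behaviour reproducible without the original file, the sample rows can be written out as latin1 bytes and read back with the same `read.csv2()` call. This is a minimal sketch: the temporary path stands in for `latin1quotedat.csv`, and `csv_lines` and `path` are names introduced here only for illustration.

```r
# The sample rows from the file; \u00df is the letter ß.
csv_lines <- c(
  "Ort;Stra\u00dfe;Bezeichnung",
  "Vienna;Testgasse 1;\"Ministerium \"\"Pestalozzi\"\"\"",
  "Graz;Teststra\u00dfe 3;HS",
  "Salzburg;Beispielstra\u00dfe 9;\"NMS \"\"Die Schlauen\"\"\"",
  "Vienna;Wolfgang-Stra\u00dfe 7;\"Wirtshaus \"\"Wien III\"\"\""
)

# Convert to latin1 and write the bytes unchanged to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(iconv(csv_lines, from = "UTF-8", to = "latin1"), path,
           useBytes = TRUE)

# Same call as above; doubled quotes inside a quoted field are read
# as a single embedded quote.
dat1 <- read.csv2(path, encoding = "latin1", stringsAsFactors = FALSE)
dat1$Bezeichnung
```

The same temporary file can be passed to `fread()` to check whether a given data.table version still shows the header and quoting behaviour described above.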