R arrow schema update

Question

I am trying to read multiple .csv files with arrow::open_dataset(), but it throws an error because the column types are inconsistent across files.

I found this question, which is closely related to my problem, but I am trying a slightly different approach.

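I want to run arrow on one sample CSV file to take advantage of its automatic type detection, because working out the types of all the columns by hand is very time-consuming.
Then I take the resulting schema and correct the few columns that cause problems.
Finally, I use the updated schema to read all the files.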

Here is my approach:

data = read_csv_arrow('data.csv.gz', as_data_frame = FALSE) # has more than 30 columns
sch = data$schema
print(sch)


Schema
trade_id: int64
secid: int64
side: int64
...
nonstd: int64
flags: string


I want to change the type of the 'trade_id' column from int64 to string and leave the other columns unchanged.

How can I update the schema?

I am using R arrow, but I guess answers written for pyarrow may also apply.

Answer 1

thi*_*nic 5

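There are a few different ways to do this: you can extract the code for the schema and edit it by hand, or you can save the schema as a variable and update it programmatically.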

library(arrow)


# set up an arrow table
cars_table <- arrow_table(mtcars)

# view the schema
sch <- cars_table$schema

# print the code that makes up the schema - you could now copy this and edit it
sch$code()
#> schema(mpg = float64(), cyl = float64(), disp = float64(), hp = float64(), 
#>     drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
#>     am = float64(), gear = float64(), carb = float64())

# look at an individual element in the schema
sch[[2]]
#> Field
#> cyl: double

# update this element
sch[[2]] <- Field$create("cylinders", int32())
sch[[2]]
#> Field
#> cylinders: int32

sch$code()
#> schema(mpg = float64(), cylinders = int32(), disp = float64(), hp = float64(), 
#>     drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
#>     am = float64(), gear = float64(), carb = float64())

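Applied to the workflow in the question, the same idea might look like the sketch below. It is an illustration under stated assumptions, not a tested recipe for the asker's data: the two small CSV files, the directory name `trades_example`, and the contents of the columns are made up for the example, and `skip = 1` is passed on the assumption that, once an explicit schema is supplied, the header row of each file has to be skipped explicitly.

```r
library(arrow)

# Hypothetical sample data: two small CSV files whose trade_id column
# would be inferred differently (integers in one file, text in the other).
csv_dir <- file.path(tempdir(), "trades_example")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(trade_id = 1:2, secid = 10:11),
          file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(trade_id = c("A3", "A4"), secid = 12:13),
          file.path(csv_dir, "part2.csv"), row.names = FALSE)

# Step 1: let arrow infer the types from one sample file.
sample_tab <- read_csv_arrow(file.path(csv_dir, "part1.csv"),
                             as_data_frame = FALSE)
sch <- sample_tab$schema

# Step 2: replace only the problematic field, located by position
# as in the answer above; all other fields keep their inferred types.
idx <- match("trade_id", names(sch))
sch[[idx]] <- Field$create("trade_id", string())

# Step 3: read every file with the corrected schema.
# Assumption: skip = 1 drops the header row now that a schema is given.
ds <- open_dataset(csv_dir, format = "csv", schema = sch, skip = 1)
trades <- as.data.frame(Scanner$create(ds)$ToTable())
```

Only the one field is rebuilt, so a file with more than 30 columns would not need its schema typed out by hand.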

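Thank you very much for your answer. I tried many things, including unify_schemas, but this is the solution that fits my use case. I would just add that, besides indexing by position, we can also refer to a field by name, which is probably more reliable, so sch[["cyl"]] in this case. (2 upvotes)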
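A minimal sketch of the name-based variant mentioned in this comment, reusing the mtcars table from the answer. Keeping the field name "cyl" and changing only its type is a choice made for this illustration; the answer above renames the field as well.

```r
library(arrow)

# Same starting point as in the answer: the schema of the mtcars table.
sch <- arrow_table(mtcars)$schema

# Look the field up by name instead of by position.
sch[["cyl"]]

# Replace it by name; the field name stays "cyl", only the type changes.
sch[["cyl"]] <- Field$create("cyl", int32())

# The remaining fields and their order are untouched.
names(sch)
```

Referring to the field by name avoids depending on the column order of the sample file.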
