Find/replace or map using a lookup table in R

Question

Complete R novice here. Please be gentle.

I have a column in a data frame whose numeric values represent ethnicity (UK census data).

# create example data
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8)
df = data.frame(id, ethnicode)

I can do a mapping (or find/replace) to create a column with human-readable values (or edit the existing column):

library(plyr)  # mapvalues() comes from the plyr package

# map values one-to-one from numeric to string
df$ethnicity <- mapvalues(df$ethnicode,
                          from = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
                          to = c("Other", "Black", "Asian", "Mixed",
                                 "WhiteOther", "WhiteIrish", "WhiteUK",
                                 "WhiteTotal", "All"))

Of everything I have tried, this seems to be the fastest (about 20 seconds for 9 million rows, versus over a minute for some methods).

What I can't seem to find (or understand from what I've read) is how to refer to a lookup table.

# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0)
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
              "WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = data.frame(ethnicode, ethnicity)

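The point is that if I want to change the human-readable strings, or do anything else to the process, I'd rather do it once in the lookup table than in several places across a script. It would also be good if I could do it more efficiently (in under 20 seconds for 9 million rows).

I also want to be able to easily make sure that "8" still equals 'Other' (or whatever the equivalent is), that "0" still equals 'All', and so on, which is visually harder with a longer list using the method above.

Thanks in advance.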

Answer 1

Kar*_* W. 5

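You can use a named vector for this. However, you need to convert ethnicode to character, because a named vector is indexed by name only when the index is a character vector.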

df = data.frame(
    id = c(1, 2, 3, 4, 5, 6, 7, 8, 9), 
    ethnicode = as.character(c(0, 1, 2, 3, 4, 5, 6, 7, 8)), 
    stringsAsFactors=FALSE
)

# create lookup table
ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0) 
ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther", 
           "WhiteIrish", "WhiteUK", "WhiteTotal", "All")
lookup = setNames(ethnicity, as.character(ethnicode))

Then you can do

df <- transform(df, ethnicity=lookup[ethnicode], stringsAsFactors=FALSE)

and you're done.
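If you would rather keep the lookup as a data frame (as in the question) and leave the codes numeric, base R's match() can do the same job. The following is only a minimal sketch under those assumptions, not a benchmarked solution; the stopifnot() checks at the end are one possible way to confirm that key pairs such as 8 = "Other" still hold after the table is edited.

```r
# Lookup table kept in one place, as a data frame like the one in the question
lookup <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# Example data with numeric codes (no character conversion needed)
df <- data.frame(id = 1:9, ethnicode = c(0, 1, 2, 3, 4, 5, 6, 7, 8))

# match() returns, for each code in df, its row position in the lookup table;
# codes that are missing from the lookup table give NA
df$ethnicity <- lookup$ethnicity[match(df$ethnicode, lookup$ethnicode)]

# Sanity checks: key pairs still hold and no code appears twice in the lookup
stopifnot(lookup$ethnicity[lookup$ethnicode == 8] == "Other",
          lookup$ethnicity[lookup$ethnicode == 0] == "All",
          !anyDuplicated(lookup$ethnicode))
```

Because match() is vectorised, the mapping is a single pass over the column, and changing a label only means editing the lookup data frame.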

To process 9 million rows, I suggest using a database such as SQLite or MonetDB. For SQLite, the following code may help:

library(RSQLite)

dbname <- "big_data_mapping.db" # db to create
csvname <- "data/big_data_mapping.csv" # large dataset

ethn_codes = data.frame(
    ethnicode= c(8, 7, 6, 5, 4, 3, 2, 1, 0), 
    ethnicity= c("Other", "Black", "Asian", "Mixed", "WhiteOther", "WhiteIrish", "WhiteUK", "WhiteTotal", "All")
)

# build db
con <- dbConnect(SQLite(), dbname)
dbWriteTable(con, name="main", value=csvname, overwrite=TRUE)
dbWriteTable(con, name="ethn_codes", ethn_codes, overwrite=TRUE)

# join the tables
dat <- dbGetQuery(con, "SELECT main.id, ethn_codes.ethnicity FROM main JOIN ethn_codes ON main.ethnicode=ethn_codes.ethnicode")

# finish
dbDisconnect(con)
#file.remove(dbname)
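For comparison, the same inner join can be sketched in base R with merge(), assuming the data fits in memory. Here `main` is a small hypothetical stand-in for the table loaded from the CSV file; it is not the real dataset.

```r
# Hypothetical stand-in for the large table read from the CSV file
main <- data.frame(id = 1:5, ethnicode = c(2, 8, 0, 2, 5))

# Same lookup table as in the SQLite example
ethn_codes <- data.frame(
  ethnicode = c(8, 7, 6, 5, 4, 3, 2, 1, 0),
  ethnicity = c("Other", "Black", "Asian", "Mixed", "WhiteOther",
                "WhiteIrish", "WhiteUK", "WhiteTotal", "All"),
  stringsAsFactors = FALSE
)

# Inner join on the shared column, like the SQL JOIN ... ON clause
dat <- merge(main, ethn_codes, by = "ethnicode")

# merge() sorts by the key by default, so restore the original order by id
# and keep only the columns selected in the SQL query
dat <- dat[order(dat$id), c("id", "ethnicity")]
```

Like the SQL JOIN, merge() with default arguments drops rows whose code has no match in the lookup table; pass all.x = TRUE to keep them with NA instead.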

MonetDB is said to be better suited to the kinds of tasks you would typically do with R, so it is definitely worth a look.
