Ani*_*iko 71
R has two built-in constants that may help: state.abb with the abbreviations and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
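One thing worth knowing about the match() approach is that it is exact and case-sensitive, and it returns NA for anything that is not in state.name. A minimal sketch, assuming the input may contain lower-case names and non-state entries (the vector x below is made up for illustration):

```r
# Hypothetical input: mixed case, plus one name that is not among the 50 states
x <- c("New York", "virginia", "District of Columbia", "Idaho")

# Normalise case on both sides, because match() compares strings exactly
idx <- match(tolower(x), tolower(state.name))
abb <- state.abb[idx]
abb
# [1] "NY" "VA" NA   "ID"

# Inputs for which no abbreviation was found
unmatched <- x[is.na(abb)]
unmatched
# [1] "District of Columbia"
```

Checking the NA positions like this is a cheap way to spot names that need a bigger lookup table or some cleaning first.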
G. *_*eck 33
1) grep the full name out of state.name and use the resulting index to subset state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or use which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names, and index it by full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one also works if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
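A further difference between the approaches, sketched here as an aside: grep does substring (regex) matching, so a name that is contained in another state's name returns more than one abbreviation, whereas the named-vector lookup is exact. The variable names below are only for illustration.

```r
# Substring matching: "Virginia" also matches "West Virginia"
loose <- state.abb[grep("Virginia", state.name)]
loose
# [1] "VA" "WV"

# Anchoring the pattern restricts grep to the whole name
anchored <- state.abb[grep("^Virginia$", state.name)]
anchored
# [1] "VA"

# The named-vector lookup is exact and vectorised; unknown names give NA
lookup <- setNames(state.abb, state.name)
exact <- unname(lookup[c("Virginia", "West Virginia", "Puerto Rico")])
exact
# [1] "VA" "WV" NA
```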
小智 6
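I found that the built-in state.name and state.abb cover only the 50 states. I got a bigger table (including DC and so on) online (e.g. this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load the state names and abbreviations from this file instead of using the built-ins. The rest is very similar to @Aniko's answer.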
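library(stringdist)
# set the working directory to the folder containing States.csv first
# (setwd() needs a path argument; calling it empty is an error)
# input data
data = c("NY", "New York", "NewYork")
data = toupper(data)
# load state names and abbreviations (columns State and Abb)
State.data = read.csv('States.csv')
State = toupper(State.data$State)
Stateabb = as.vector(State.data$Abb)
# match data against the state names; a misspelling of one character is allowed
# (idx holds the position of the matched name, or NA if nothing is close enough)
idx = amatch(data, State, maxDist=1)
data[ !is.na(idx) ] = Stateabb[ na.omit( idx ) ]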
The difference between match and amatch is in how they compute the distance from one word to another. See pp. 25-26 of http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
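To make "a distance of 1" concrete, here is a rough base-R sketch of the same idea that needs neither stringdist nor the CSV file. It is not the code from this answer: it swaps in utils::adist (generalized Levenshtein distance, which can differ from amatch's default metric on transpositions) and appends DC to the built-in vectors by hand as a stand-in for the larger table.

```r
# Reference table: the 50 built-in states plus DC (added by hand as an assumption)
full <- toupper(c(state.name, "District of Columbia"))
abbr <- c(state.abb, "DC")

x <- c("NY", "NEW YORK", "NEWYORK", "DISTRICT OF COLUMBIA")

# adist() returns a matrix of edit distances: one row per input, one column per name
d <- adist(x, full)

# For each input take the closest name, but accept it only within distance 1
best <- apply(d, 1, which.min)
ok <- d[cbind(seq_along(x), best)] <= 1

result <- x
result[ok] <- abbr[best[ok]]
result
# [1] "NY" "NY" "NY" "DC"
```

"NEWYORK" is one insertion away from "NEW YORK", so it is accepted. The input "NY" is left untouched because no full name is within distance 1 of it.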
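Old post, I know, but I wanted to throw mine in. I learned on the tidyverse, so for better or worse I avoid base R wherever I can. I also wanted one that includes DC, so first I built the crosswalk: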
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
bind_cols(tibble(abb = state.abb)) %>%
bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
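For anyone who wants to try the join end to end, here is a small self-contained sketch. The my_data tibble and its visits column are invented for illustration, and only dplyr is loaded (it re-exports tibble()); the crosswalk is built in a slightly more compact way than above.

```r
library(dplyr)

# Crosswalk: the 50 built-in states plus DC
st_crosswalk <- tibble(state = state.name, abb = state.abb) %>%
  bind_rows(tibble(state = "District of Columbia", abb = "DC"))

# Hypothetical data with a `state` column holding full names
my_data <- tibble(
  state  = c("New York", "District of Columbia", "Puerto Rico"),
  visits = c(10, 4, 7)
)

# left_join keeps every row of my_data; names missing from the crosswalk get NA
joined <- left_join(my_data, st_crosswalk, by = "state")
joined$abb
# [1] "NY" "DC" NA
```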