I have downloaded the street abbreviations from USPS. Here is the data:
dput(usps_streets)
structure(list(common_abbrev = c("allee", "alley", "ally", "aly",
"anex", "annex", "annx", "anx", "arc", "arcade", "av", "ave",
"aven", "avenu", "avenue", "avn", "avnue", "bayoo", "bayou",
"bch", "beach", "bend", "bnd", "blf", "bluf", "bluff", "bluffs",
"bot", "btm", "bottm", "bottom", "blvd", "boul", "boulevard",
"boulv", "br", "brnch", "branch", "brdge", "brg", "bridge", "brk",
"brook", "brooks", "burg", "burgs", "byp", "bypa", "bypas", "bypass",
"byps", "camp", "cp", "cmp", "canyn", "canyon", "cnyn", "cape",
"cpe", "causeway", "causwa", "cswy", "cen", "cent", "center",
"centr", "centre", "cnter", "cntr", "ctr", "centers", "cir",
"circ", "circl", …
I have downloaded a list of all cities, towns, and so on in the United States from the Census Bureau. Here is a random sample:
dput(somewhere)
structure(list(state = structure(c(30L, 31L, 5L, 31L, 24L, 36L,
13L, 21L, 6L, 10L, 31L, 28L, 10L, 5L, 5L, 8L, 23L, 11L, 34L,
19L, 29L, 4L, 24L, 13L, 21L, 31L, 2L, 3L, 29L, 24L, 1L, 13L,
15L, 10L, 11L, 33L, 35L, 8L, 11L, 12L, 36L, 28L, 9L, 31L, 8L,
14L, 11L, 12L, 36L, 13L, 8L, 5L, 29L, 8L, 7L, 23L, 25L, 39L,
16L, 28L, 10L, 29L, 26L, 8L, 32L, 40L, 28L, 23L, 37L, 31L, 18L,
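5L, 1L, 31L, 18L, 13L, …
I am trying to create a column in a very large data frame (about 2.2 million rows) that computes the cumulative sum of 1s within each factor level and resets when a new factor level is reached. Below is some basic data that resembles my own.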
itemcode <- c('a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4', 'a5', 'a6', 'a6', 'a6', 'a6')
goodp <- c(0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1)
df <- data.frame(itemcode, goodp)
I would like the output variable cum.goodp to look like this:
cum.goodp <- c(0, 1, 2, 0, 1, 1, 2, 0, 0, 1, 1, 1, 2, 0, 1)
I know there is a lot out there using the canonical split-apply-combine approach, which is conceptually intuitive, but I tried the following:
k <- transform(df, cum.goodp = goodp*ave(goodp, c(0L, cumsum(diff(goodp != 0)), FUN = seq_along, by = itemcode)))
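When I try to run this code, it is very slow. I gather that transform is part of the reason (the 'by' does not help either). The itemcode variable has more than 70K distinct values, so this really ought to be vectorized. Is there a way to vectorize it using cumsum? If not, any help would be truly appreciated. Many thanks.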
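One vectorized way to think about this, sketched here in Python with NumPy rather than R (the same steps should map onto R's cumsum and cummax): take a single global cumulative sum, then subtract the total that had accumulated just before the most recent reset. This is only a sketch under two assumptions: rows with the same itemcode are contiguous, and the count restarts at every 0 as well as at every new itemcode, which is what the desired cum.goodp implies. The function name cumsum_with_reset is made up for illustration.

```python
import numpy as np

def cumsum_with_reset(itemcode, goodp):
    """Running count of 1s that restarts at each 0 and at each new itemcode.

    Assumes rows sharing an itemcode are contiguous (illustrative sketch).
    """
    codes = np.asarray(itemcode)
    g = np.asarray(goodp, dtype=np.int64)
    if g.size == 0:
        return g

    # True wherever a new itemcode begins; the first row always starts one.
    new_code = np.concatenate(([True], codes[1:] != codes[:-1]))
    reset = new_code | (g == 0)

    total = np.cumsum(g)
    # Total accumulated *before* each reset row, 0 on all other rows.
    base_at_reset = np.where(reset, total - g, 0)
    # total - g never decreases, so a running maximum forward-fills the
    # most recent base onto the rows that follow each reset.
    base = np.maximum.accumulate(base_at_reset)
    return total - base

itemcode = ['a1', 'a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a3', 'a4', 'a4',
            'a5', 'a6', 'a6', 'a6', 'a6']
goodp = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1]
print(cumsum_with_reset(itemcode, goodp).tolist())
```

Nothing here loops over the 70K item codes: every step is a single pass over the whole column, which is the property the question is after.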
I have a list of cities and related information that I have put in a data frame, as follows:
library(plyr)
library(dplyr)
library(ggmap)
library(Imap)
cities <- c("washington, dc", "wilmington, de", "amarillo, tx",
"denver, co", "needham, ma", "philadelphia, pa",
"doylestown, pa", "galveston, tx", "tuscaloosa, al",
"hollywood, fl"
)
id <- c(156952, 154222, 785695, 154423, 971453, 149888, 1356987,
178946, 169944, 136421)
month <- c(201811, 201811, 201912, 201912, 202005, 202005,
202005, 202106, 202106, 202106 )
category<- c("home", "work", "home", "home", "home", "work",
"cell", "home", "work", "cell")
places <- data.frame(cities, id, category, month)
Using the Imap and ggmap packages, I can retrieve the longitude and latitude of each city:
lat <- geocode(location = places$cities, …
I have a column in a data frame that contains city and state names:
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac <- as.data.frame(ac)
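I want to search for the values of ac$ac in a column of another data frame, d$description, and return the value of the id column if there is a match.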
dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L,
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L,
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block …
I have a series of 9 URLs that I want to scrape data from:
http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0
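The offset= at the end of the link goes from 0 to 900 (in steps of 100) as the page changes, up to the last page. I would like to loop through each page and scrape each table, then use rbind to stack each df on top of the previous one in order. I have been using rvest and would like to use lapply, since I am better with that than with for loops.
The question is similar to this one (Harvest (rvest) multiple HTML pages from a list of urls) but different, because I do not want to have to copy all the links into a vector before running the program. I would like a general solution for looping over multiple pages and harvesting the data, creating a data frame each time.
The following works for the first page: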
library(rvest)
library(stringr)
library(tidyr)
site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=0'
webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
But I want to repeat this over all the pages without having to paste the urls into a vector. I tried the following and it did not work:
jump <- seq(0, 900, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&round_min=&round_max=&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id&order_by_asc=&offset=', jump,'.htm', sep="")
webpage <- read_html(site)
draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
So there should be one data frame per page; I imagine it would be easier to put them in a list and then use rbind to stack them.
Any help would be greatly appreciated!
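Two things in the attempt above look like likely culprits: paste() appends '.htm' after the offset, so the query ends in offset=0.htm instead of offset=0, and the whole vector of URLs is handed to read_html in one call, whereas it appears to expect a single URL. The general shape of a fix is to generate the URLs from the offsets, fetch each page on its own, and stack the results. Here is a minimal sketch of that shape in Python rather than R (in R the loop would be an lapply followed by do.call(rbind, ...)). The names page_urls, scrape_all and fake_fetch are hypothetical, and fake_fetch is only a stand-in so the example runs without touching the network.

```python
# Query string copied from the question, ending in "offset=" so that the
# page offset can be appended directly (no ".htm" suffix).
BASE_URL = (
    "http://www.basketball-reference.com/play-index/draft_finder.cgi"
    "?request=1&year_min=2001&year_max=2014&round_min=&round_max="
    "&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0"
    "&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y"
    "&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val="
    "&c2stat=&c2comp=&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp="
    "&c4val=&order_by=year_id&order_by_asc=&offset="
)

def page_urls(first=0, last=900, step=100):
    """Build one URL per page offset instead of pasting them by hand."""
    return [f"{BASE_URL}{offset}" for offset in range(first, last + 1, step)]

def scrape_all(urls, fetch_table):
    """Fetch every page one at a time and stack the rows in page order.

    fetch_table is supplied by the caller: it takes ONE url and returns
    that page's table as a list of row dicts.
    """
    rows = []
    for url in urls:
        rows.extend(fetch_table(url))
    return rows

def fake_fetch(url):
    # Stand-in for a real scraper (assumption for illustration only):
    # returns a single row recording which offset was requested.
    return [{"source_offset": url.rsplit("=", 1)[1]}]

urls = page_urls()
stacked = scrape_all(urls, fake_fetch)
print(len(urls), stacked[0], stacked[-1])
```

Swapping fake_fetch for a function that actually downloads and parses one page leaves the rest unchanged, which is what makes the approach general.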
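I have a data frame with a column that contains state names. The names are a mix of official abbreviations, partial spellings, and full state names.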
import pandas as pd

d = pd.DataFrame(['fla', 'fl', 'del', 'ohio', 'calif', 'ca', 'del', 'texas', 'miss', 'tx', 'new mex'],
                 columns = ["state"])
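There is a Python dictionary of state abbreviations and names here: https://code.activestate.com/recipes/577305-python-dictionary-of-us-states-and-territories/
I want to go through the data frame d, find the best match in the dict, and replace the value in d['state']. I don't think I want to use replace, because I want to replace "whole words" rather than substrings. Desired result: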
d = ['fl', 'fl', 'de', 'oh', 'ca', 'ca', 'de', 'tx', 'ms', 'tx', 'nm']
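After loading the dictionary directly into my console and calling it states_dict, I tried the following (based on this question about mapping US state names to the two-letter abbreviations given separately in a dictionary):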
d['state'] = d['state'].map(states_dict)
It produces nan for every entry in my data frame d.
Any help would be much appreciated.
Thanks.
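A likely reason for the nan values: map with a dict looks each value up as an exact key, and the linked recipe appears to key the dictionary by upper-case two-letter code ('FL' -> 'Florida'), so lower-case or partial values such as 'fla' or 'ohio' never match. Below is a minimal sketch of one way to normalize the column. It assumes states_dict has that code-to-name shape and shows only a small hypothetical excerpt of it. ALIASES and normalize_state are made-up names, and the alias table is something you would maintain by hand for spellings that no rule can guess safely ('miss' is a prefix of both Mississippi and Missouri, and 'fla' is not a prefix of Florida at all).

```python
# Hypothetical excerpt, in the ASSUMED shape of states_dict (code -> name).
states_dict = {
    "CA": "California", "DE": "Delaware", "FL": "Florida",
    "MO": "Missouri", "MS": "Mississippi", "NM": "New Mexico",
    "OH": "Ohio", "TX": "Texas",
}

# Hand-maintained aliases (assumption): traditional abbreviations that are
# ambiguous or not a prefix of the full name.
ALIASES = {"fla": "fl", "miss": "ms"}

# Lower-cased full name -> lower-cased code, and the set of valid codes.
NAME_TO_CODE = {name.lower(): code.lower() for code, name in states_dict.items()}
CODES = set(NAME_TO_CODE.values())

def normalize_state(raw):
    """Return the two-letter code for raw, or None if there is no safe match."""
    key = raw.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    if key in CODES:            # already an official code
        return key
    if key in NAME_TO_CODE:     # full state name
        return NAME_TO_CODE[key]
    # Partial spelling: accept it only if it is a prefix of exactly one name.
    hits = {code for name, code in NAME_TO_CODE.items() if name.startswith(key)}
    return hits.pop() if len(hits) == 1 else None

values = ['fla', 'fl', 'del', 'ohio', 'calif', 'ca', 'del', 'texas',
          'miss', 'tx', 'new mex']
print([normalize_state(v) for v in values])
```

With pandas, the same function can be passed straight to map, for example d['state'] = d['state'].map(normalize_state). Because matching is done on the whole value, this avoids the substring problem that made replace unattractive.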
I have a data frame, df, that I used to generate a plot of two series, as follows:
from pandas import DataFrame
import matplotlib.pyplot as plt

year = [2002, 2002, 2002, 2002]
month = ['Jan', 'Feb', 'Mar', 'Apr']
column1 = [3.3, 3.0, 3.1, 3.2, 2.9]
column2 = [7.0, 7.1, 7.3, 6.9, 7.3]
Dataset = list(zip(year, month, column1, column2))
df = DataFrame(data = Dataset, columns = ['year', 'month', 'column1', 'column2'])
df['column1'].plot(legend = True, label = 'column1')
df['column2'].plot(legend = True, label = 'column2', title = \
"Figure 1", style = '--', linewidth = 2.5)
This produces the following:
My data frame also has a column, df['year'], containing the values I would like to run along the x-axis. I tried the following:
plt.xticks(df['year'])
But the following happened:
Is there a way to use the column df['year'] and have its values as …