I have a very simple CSV containing the following data, compressed in a tar.gz file. I need to read it into a data frame using pandas.read_csv.
A B
0 1 4
1 2 5
2 3 6
import pandas as pd
pd.read_csv("sample.tar.gz",compression='gzip')
However, I get the error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
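A likely cause of the error above is that `compression='gzip'` only undoes the gzip layer, so the parser is handed the raw tar archive (headers and NUL padding included) rather than the CSV itself. The following is a minimal sketch, not a definitive fix: the helper name `read_csv_from_tar_gz`, its `member_name` parameter, and the comma-separated sample archive built in memory are all assumptions made for illustration.

```python
import io
import os
import tarfile

import pandas as pd


def read_csv_from_tar_gz(archive, member_name=None, **read_csv_kwargs):
    """Read one CSV member of a .tar.gz archive into a DataFrame.

    `archive` is a path or a binary file object. `member_name` is a
    hypothetical parameter: when omitted, the first regular file is used.
    """
    if isinstance(archive, (str, os.PathLike)):
        tar = tarfile.open(name=archive, mode="r:gz")
    else:
        tar = tarfile.open(fileobj=archive, mode="r:gz")
    with tar:
        if member_name is None:
            # Assumption: the archive holds a single data file.
            member = next(m for m in tar.getmembers() if m.isfile())
        else:
            member = tar.getmember(member_name)
        # extractfile() returns a binary file object that read_csv can consume.
        with tar.extractfile(member) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


# Build a small stand-in archive in memory (illustrative data, comma-separated).
csv_bytes = b"A,B\n1,4\n2,5\n3,6\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as out:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(csv_bytes)
    out.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)

df = read_csv_from_tar_gz(buffer)
```

With a real file, the call would look like `read_csv_from_tar_gz("sample.tar.gz")`, passing `sep` or other `read_csv` options as keyword arguments if the file is not comma-separated.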
Below is a collection of read_csv commands and the different errors I get:
pd.read_csv("sample.tar.gz",compression='gzip', engine='python')
Error: line contains NULL byte
pd.read_csv("sample.tar.gz",compression='gzip', header=0)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 440, saw 2
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ")
CParserError: Error tokenizing data. C error: Expected 2 fields in line 94, saw 14
pd.read_csv("sample.tar.gz",compression='gzip', header=0, sep=" ", engine='python')
Error: …
Here is my toy data frame.
df <- tibble::tribble(
~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
"A", "C", 1L, 5L, "AA", "AB", 1L,
"A", "C", 2L, 5L, "BB", "AC", 2L,
"A", "D", 1L, 7L, "AA", "BC", 2L,
"A", "D", 2L, 3L, "BB", "CC", 1L,
"B", "C", 1L, 8L, "AA", "AB", 1L,
"B", "C", 2L, 6L, "BB", "AC", 2L,
"B", "D", 1L, 9L, "AA", "BC", 2L,
"B", "D", 2L, 6L, "BB", "CC", 1L)
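How can I get the combinations of the minimum number of variables that uniquely identify the observations in the data frame, i.e. which variables can together form a primary key?
My approach is to find combinations of variables whose number of distinct values equals the number of observations in the data frame. In this case, those are the combinations that give me 8 observations. I tried a few at random and found some that work.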
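The distinct-count check described above can be automated by trying combinations in order of increasing size and stopping at the first size that yields a key. This is a brute-force sketch in Python rather than R, under the assumption that the data fits in memory; the function name `minimal_keys` is hypothetical, and the records simply restate the toy data frame.

```python
from itertools import combinations

columns = ["var1", "var2", "var3", "var4", "var5", "var6", "var7"]
records = [
    ("A", "C", 1, 5, "AA", "AB", 1),
    ("A", "C", 2, 5, "BB", "AC", 2),
    ("A", "D", 1, 7, "AA", "BC", 2),
    ("A", "D", 2, 3, "BB", "CC", 1),
    ("B", "C", 1, 8, "AA", "AB", 1),
    ("B", "C", 2, 6, "BB", "AC", 2),
    ("B", "D", 1, 9, "AA", "BC", 2),
    ("B", "D", 2, 6, "BB", "CC", 1),
]


def minimal_keys(columns, records):
    """Return all smallest column combinations that identify every row."""
    n_rows = len(records)
    for size in range(1, len(columns) + 1):
        keys = []
        for positions in combinations(range(len(columns)), size):
            # A combination is a key when its distinct value tuples
            # are as numerous as the rows.
            distinct = {tuple(rec[i] for i in positions) for rec in records}
            if len(distinct) == n_rows:
                keys.append(tuple(columns[i] for i in positions))
        if keys:
            # Stop at the first size that works: these are the minimal keys.
            return keys
    return []


keys = minimal_keys(columns, records)
```

For the toy data this finds three two-variable keys: (var1, var6), (var4, var6) and (var4, var7). The search is exponential in the number of columns, so it is only practical for narrow tables.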
df %>% distinct(var1, var2, var3)
df %>% distinct(var1, var2, …
I have data similar to df3. To reproduce the data, run the following:
vec1 <- c("A", "B")
vec2 <- c("A", "B", "C")
df1 <- tibble::tribble(
~A, ~B,
"X", 4L,
"X", 9L,
"Y", 5L,
"Y", 2L,
"Y", 8L,
"Y", 2L) %>%
group_by(A) %>%
nest()
df2 <- tibble::tribble(
~A, ~C,
"X", vec1,
"Y", vec2)
df3 <- df1 %>% left_join(df2, by = "A")
I need to filter the nested data using something like the following:
df4 <- df3 %>% filter(when C==vec1, B (part of nested data now) < 5
when C==vec2, B (part of nested data now) >4)
Or perhaps something like this:
df4 <- df3 %>% map(.$data, ~filter((identical(.$C, vec1) …
Here is the mtcars data in a MonetDBLite database file.
library(MonetDBLite)
library(tidyverse)
library(DBI)
dbdir <- getwd()
con <- dbConnect(MonetDBLite::MonetDBLite(), dbdir)
dbWriteTable(conn = con, name = "mtcars_1", value = mtcars)
data_mt <- con %>% tbl("mtcars_1")
I would like to use dplyr's mutate to create a new variable and add (commit!) it to the database table. Something like:
data_mt %>% select(mpg, cyl) %>% mutate(var = mpg/cyl) %>% dbCommit(con)
The desired result should be the same as when doing this:
dbSendQuery(con, "ALTER TABLE mtcars_1 ADD COLUMN var DOUBLE PRECISION")
dbSendQuery(con, "UPDATE mtcars_1 SET var=mpg/cyl")
So how can this be done?
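For reference, the two SQL statements above follow an "add the column, fill it, commit" pattern that can be tried against any SQL backend. The sketch below is not a dplyr or MonetDBLite solution: it swaps in Python's built-in `sqlite3` module and a three-row stand-in table with illustrative values, purely to show the pattern end to end.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MonetDBLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mtcars_1 (mpg REAL, cyl REAL)")
conn.executemany(
    "INSERT INTO mtcars_1 (mpg, cyl) VALUES (?, ?)",
    [(21.0, 6), (22.8, 4), (18.7, 8)],  # illustrative rows only
)

# Same two steps as in the question: add the column, then fill it.
conn.execute("ALTER TABLE mtcars_1 ADD COLUMN var REAL")
conn.execute("UPDATE mtcars_1 SET var = mpg / cyl")
conn.commit()  # persist the change

rows = conn.execute("SELECT mpg, cyl, var FROM mtcars_1").fetchall()
```

After the commit, every row carries the computed `var` value alongside `mpg` and `cyl`.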
Here is my toy data:
df <- tibble::tribble(
~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
"A", "C", 1L, 5L, "AA", "AB", 1L,
"A", "C", 2L, 5L, "BB", "AC", 2L,
"A", "D", 1L, 7L, "AA", "BC", 2L,
"A", "D", 2L, 3L, "BB", "CC", 1L,
"B", "C", 1L, 8L, "AA", "AB", 1L,
"B", "C", 2L, 6L, "BB", "AC", 2L,
"B", "D", 1L, 9L, "AA", "BC", 2L,
"B", "D", 2L, 6L, "BB", "CC", 1L)
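My original question, at the following link /sf/answers/3717723971/, was:
How can I get the combinations of the minimum number of variables that uniquely identify the observations in the data frame, i.e. which variables can together form a primary key? The following answer/code works perfectly; many thanks to thelatemail.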
nms <- unlist(lapply(seq_len(length(df)), combn, x=names(df), simplify=FALSE), …
Here is my toy dataset:
df <- tribble(
~x, ~y, ~z,
7, NA, 4,
8, 2, NA,
NA, NA, NA,
NA, 4, 6)
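For each variable, I want a data frame of NA counts: the number of NAs between the first and last occurrence of a number in the column, and the number of NAs between the first occurrence of a number and the last row. For this example, the desired solution is: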
desired_df <- tribble(~vars, ~na_count_between_1st_last_num, ~na_count_between_1st_num_last_row,
"x", 0, 2,
"y", 1, 1,
"z", 2, 2)
How can I get the desired output?
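The two counts can be computed per column by locating the first and last non-missing positions and counting missing values inside the two ranges. This is a sketch in plain Python rather than R, assuming `None` stands in for NA; the function name `na_counts` and the dictionary layout are hypothetical.

```python
data = {
    "x": [7, 8, None, None],
    "y": [None, 2, None, 4],
    "z": [4, None, None, 6],
}


def na_counts(values):
    """Count missing values from the first number to the last number,
    and from the first number to the last row."""
    present = [i for i, v in enumerate(values) if v is not None]
    if not present:
        # Assumption: a column with no numbers contributes zero to both counts.
        return 0, 0
    first, last = present[0], present[-1]
    between_numbers = sum(v is None for v in values[first:last + 1])
    to_last_row = sum(v is None for v in values[first:])
    return between_numbers, to_last_row


result = [(name, *na_counts(values)) for name, values in data.items()]
```

For the toy dataset, `result` matches the rows of `desired_df`: ("x", 0, 2), ("y", 1, 1) and ("z", 2, 2).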
How do I create a list from another list using Python? If I have a list:
input = ['a/b', 'g', 'c/d', 'h', 'e/f']
How do I create a list containing only the letters that follow the slash "/", i.e.
desired_output = ['b','d','f']
Code would be very helpful.
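One way to do this is a list comprehension that keeps only the items containing a slash and takes the part after it. In this sketch the list is named `items` instead of `input`, which is an assumption made to avoid shadowing Python's built-in `input` function.

```python
items = ['a/b', 'g', 'c/d', 'h', 'e/f']

# Keep entries that contain "/", then take the text after the first slash.
desired_output = [item.split('/', 1)[1] for item in items if '/' in item]
```

Here `desired_output` is `['b', 'd', 'f']`. If an entry could contain several slashes, `split('/', 1)` keeps everything after the first one.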
I want to read a file f into a data frame in chunks. Here is part of the code I use.
for i in range(0, maxline, chunksize):
    df = pandas.read_csv(f,sep=',', nrows=chunksize, skiprows=i)
    df.to_sql(member, engine, if_exists='append',index= False, index_label=None, chunksize=chunksize)
I get the error:
pandas.io.common.EmptyDataError: No columns to parse from file
The code only works when chunksize >= maxline (the total number of lines in file f). However, in my case, chunksize <= maxline.
Please suggest a fix.
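One possible fix is to let `read_csv` do the chunking itself through its `chunksize` parameter, which returns an iterator of data frames and reads the header only once. If `f` is an already-open file handle, the original loop may fail simply because the first call consumes the handle, though that is a guess. The sketch below makes several assumptions: the helper name `load_csv_in_chunks` is hypothetical, an in-memory SQLite connection stands in for `engine`, and the CSV text is made-up sample data.

```python
import io
import sqlite3

import pandas as pd


def load_csv_in_chunks(source, table, conn, chunksize):
    """Append a CSV to a SQL table chunk by chunk; return rows written."""
    total = 0
    # With chunksize set, read_csv yields DataFrames of up to that many rows.
    for chunk in pd.read_csv(source, sep=",", chunksize=chunksize):
        chunk.to_sql(table, conn, if_exists="append", index=False)
        total += len(chunk)
    return total


# Made-up sample: a header plus 10 data rows, loaded 4 rows at a time.
csv_text = "a,b\n" + "\n".join(f"{i},{i * i}" for i in range(10)) + "\n"
conn = sqlite3.connect(":memory:")
rows_loaded = load_csv_in_chunks(io.StringIO(csv_text), "member", conn, chunksize=4)
row_count = conn.execute("SELECT COUNT(*) FROM member").fetchone()[0]
```

This also removes the need to know `maxline` in advance, since the iterator stops when the file is exhausted.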
I have a data frame of the following kind:
df <- tibble::tribble(~x,
c("A", "B"),
c("A", "B", "C"),
c("A", "B", "C", "D"),
c("A", "B"))
and these vectors:
vec1 <- c("A", "B")
vec2 <- c("A", "B", "C")
vec3 <- c("A", "B", "C", "D")
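I want to mutate a variable y that shows which row holds which vector. I tried the following, but I get an empty y variable along with the warning: "longer object length is not a multiple of shorter object length"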
df_new <- df %>%
mutate(y = case_when(x == vec1 ~ "vec1",
x == vec2 ~ "vec2",
x == vec3 ~ "vec3"))
The desired output is:
df_new <- tibble::tribble(~x, ~y,
c("A", "B"), "vec1",
c("A", "B", "C"), "vec2",
c("A", "B", "C", "D"), "vec3",
c("A", "B"), "vec1")
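The underlying idea is to compare each row's whole vector against each candidate vector, rather than comparing element by element, which is what triggers the length warning. The following sketch shows that idea in plain Python instead of dplyr; the function name `label_rows` and the dictionary of named vectors are assumptions made for illustration.

```python
rows = [
    ["A", "B"],
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "B"],
]

named_vectors = {
    "vec1": ["A", "B"],
    "vec2": ["A", "B", "C"],
    "vec3": ["A", "B", "C", "D"],
}


def label_rows(rows, named_vectors):
    """Return, for each row, the name of the vector it is identical to."""
    # Tuples are hashable, so whole vectors can serve as dictionary keys.
    lookup = {tuple(vec): name for name, vec in named_vectors.items()}
    # Rows that match no vector get None.
    return [lookup.get(tuple(row)) for row in rows]


y = label_rows(rows, named_vectors)
```

For the data above, `y` is `["vec1", "vec2", "vec3", "vec1"]`, matching the desired output.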
Here is the data:
library(tidyverse)
data <- tibble::tribble(
~var1, ~var2, ~var3, ~var4, ~var5,
"a", "d", "g", "hello", 1L,
"a", "d", "h", "hello", 2L,
"b", "e", "h", "k", 4L,
"b", "e", "h", "k", 7L,
"c", "f", "i", "hello", 3L,
"c", "f", "i", "hello", 4L
)
and the vectors I want to use:
filter_var <- c("hello")
groupby_vars1 <- c("var1", "var2", "var3")
groupby_vars2 <- c("var1", "var2")
joinby_vars1 <- c("var1", "var2")
joinby_vars2 <- c("var1", "var2", "var3")
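The 2nd and 5th vectors are identical, as are the 3rd and 4th, but please assume they are different and keep them as separate vectors.
Now I want to create a generic function that takes the data and these vectors and returns the result.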
my_fun <- function(data, filter_var, groupby_vars1,groupby_vars2, joinby_vars1, joinby_vars2) {
data2 <- data %>% filter(var4 == filter_var) …