Posts by ana*_*chy

Combining two string columns in pandas with a different condition on each column

I have two columns in pandas, with data that looks like this.

code fx         category
AXD  AXDG.R     cat1
AXF  AXDG_e.FE  cat1 
333  333.R      cat1
....


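There are other categories, but I am only interested in cat1.

I want to combine everything in the code column with everything from the . onward in the fx column, and replace the code column with the new combination, without affecting the other rows.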

code    fx         category
AXD.R   AXDG.R     cat1
AXF.FE  AXDG_e.FE  cat1
333.R   333.R      cat1
.....


Here is my code. I think I have to use a regex, but I am not sure how to combine the columns this way.

df.loc[df['category']== 'cat1', 'code'] = df[df['category'] == 'cat1']['code'].str.replace(r'[a-z](?=\.)', '', regex=True).str.replace(r'_?(?=\.)','', regex=True).str.replace(r'G(?=\.)', '', regex=True)


I also don't know how to select the second column. Any help would be greatly appreciated.
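One possible approach, as a minimal sketch rather than a definitive answer: build a boolean mask for cat1, pull the last dot and whatever follows it out of fx with `str.extract`, and append that suffix to code for the masked rows only. The fourth row (`ZZZ`, `cat2`) is a made-up example added to show that rows in other categories are left alone; the `fillna("")` is an assumption about how to treat fx values that contain no dot.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical non-cat1 row.
df = pd.DataFrame({
    "code": ["AXD", "AXF", "333", "ZZZ"],
    "fx": ["AXDG.R", "AXDG_e.FE", "333.R", "ZZZ.Q"],
    "category": ["cat1", "cat1", "cat1", "cat2"],
})

# Rows to modify.
mask = df["category"] == "cat1"

# Capture the last "." and everything after it; rows without a dot get "".
suffix = df.loc[mask, "fx"].str.extract(r"(\.[^.]*)$", expand=False).fillna("")

# Append the suffix to code for the masked rows only.
df.loc[mask, "code"] = df.loc[mask, "code"] + suffix

print(df)
```

Because both sides of the assignment are restricted by the same mask, the addition aligns on the index and rows outside cat1 keep their original code value.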

python dataframe pandas

ana*_*chy

2021 12-20

6
Score

1
Answers

514
Views

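Inserting a row above the pandas column headers to store a title in the first cell of an Excel sheet

I have multiple dataframes that look like this; the data itself is irrelevant.

I want it to look like this, with a title inserted above the column headers.

I want to combine them into multiple tabs in a single Excel file.

Is it possible to add another row above the column headers and insert a title in the first cell before saving the file to Excel?

This is what I am currently doing.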

with pd.ExcelWriter('merged_file.xlsx', engine='xlsxwriter') as writer:
    for filename in os.listdir(directory):
        if filename.endswith('xlsx'):
            print(filename)
            # Each branch presumably builds df with a different function (not shown).
            if 'brands' in filename:
                ...  # some function
            elif 'share' in filename:
                ...  # some function
            else:
                ...  # some function
            df.to_excel(writer, sheet_name=f'{filename[:-5]}', index=True, index_label=True)
# The with block closes the writer, so no explicit writer.close() is needed.


But the sheet_name is too long, which is why I want to add the title above the column headers instead.

I tried this code,

columns = df.columns
columns = list(zip([f'{filename[:-5]}'] * len(df.columns), columns))             
columns = pd.MultiIndex.from_tuples(columns) 
df2 = pd.DataFrame(df,index=df.index,columns=columns) 
df2.to_excel(writer,sheet_name=f'{filename[0:3]}',index=True,index_label=True)


But it ended up looking like this, with all the data gone,

It should look like this
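A likely reason the data disappears, shown as a small sketch: passing an existing DataFrame to `pd.DataFrame(..., columns=...)` selects columns by label, so labels that do not exist in the original (here, the new two-level tuples) come back empty. Assigning the new MultiIndex to `.columns` relabels the existing columns instead. The frame contents and the `title` value below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one of the dataframes in the question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
title = "brands_report"  # hypothetical title (the long file name in the question)

# Two-level header: the title on top, the original column names below.
new_columns = pd.MultiIndex.from_tuples([(title, col) for col in df.columns])

# Reproduces the problem: the constructor looks up the new labels in df,
# finds none of them, and fills every cell with NaN.
broken = pd.DataFrame(df, index=df.index, columns=new_columns)

# Relabels the existing columns, so the values are kept.
fixed = df.copy()
fixed.columns = new_columns

print(broken)
print(fixed)
```

Writing `fixed` with `to_excel(..., index=True)` should then put the title in a header row above the original column names. Another option, if a MultiIndex header is not wanted, is to pass `startrow=1` to `to_excel` and write the title into the first cell through the engine's worksheet object.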

python excel pandas

ana*_*chy

2022 12-19

4
Score

1
Answers

8104
Views

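Understanding the asterisk operator in Python when it appears before a function call inside parentheses

I know the asterisk is used to unpack values such as system arguments, or to unpack a list into variables.

But I had not seen this syntax before, as used in this asyncio example.

I was reading this article, https://realpython.com/async-io-python/#the-10000-foot-view-of-async-io, but I don't understand what the asterisk operator is doing in this context.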

#!/usr/bin/env python3
# rand.py

import asyncio
import random

# ANSI colors
c = (
    "\033[0m",   # End of color
    "\033[36m",  # Cyan
    "\033[91m",  # Red
    "\033[35m",  # Magenta
)

async def makerandom(idx: int, threshold: int = 6) -> int:
    print(c[idx + 1] + f"Initiated makerandom({idx}).")
    i = random.randint(0, 10)
    while i <= threshold:
        print(c[idx + 1] + f"makerandom({idx}) == {i} too low; retrying.")
        await asyncio.sleep(idx + 1)
        i = random.randint(0, 10)
    print(c[idx + …

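The listing above is cut off before the line in question, so the following is only a sketch under the assumption that the construct is a call of the form `asyncio.gather(*(... for ... in ...))`. In that form the inner parentheses hold a generator expression, and the asterisk unpacks the items it yields into separate positional arguments. The coroutine `double` is a hypothetical stand-in, not taken from the article.

```python
import asyncio


async def double(x: int) -> int:
    # Hypothetical coroutine standing in for makerandom().
    await asyncio.sleep(0)
    return 2 * x


async def main() -> list:
    # The generator expression yields three coroutine objects.
    # The leading * unpacks them, so this call is equivalent to
    # asyncio.gather(double(0), double(1), double(2)).
    return await asyncio.gather(*(double(i) for i in range(3)))


results = asyncio.run(main())
print(results)  # [0, 2, 4]
```

Without the asterisk, `gather` would receive a single generator object instead of three awaitables.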

python syntax asynchronous argument-unpacking python-asyncio

ana*_*chy

lucky-day

4
Score

1
Answers

1532
Views

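How can I efficiently merge this many csv files (about 130,000) into one large dataset using PySpark?

I posted this question earlier and got some suggestions to use PySpark.

How can this large dataset be efficiently combined into one large dataframe?

The following zip file ( https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip ) contains a folder named data with about 130,000 csv files in it. I want to merge them all into a single dataframe. I have 16 GB of RAM, but I keep running out of RAM once I have opened the first few hundred files. The total size of the files is only about 300-400 MB of data.

If you open any of the csv files, you can see that they all have the same format: the first column holds the dates and the second column holds the data series.

So now I am using PySpark, but I don't know the most efficient way to join all the files. With pandas dataframes I would simply concat the list of individual frames like this, because I want them merged on date: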

bigframe = pd.concat(listofframes,join='outer', axis=0)


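But as I mentioned, this method doesn't work, because I run out of memory very quickly.

What is the best way to do something similar with PySpark?

So far I have this (by the way, the file list below is just a list of the files I want to extract; you can ignore it)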


import os

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col

from functools import reduce
from pyspark.sql import DataFrame

listdf = []

for subdir, dirs, files in os.walk("/kaggle/input/filelist/"):
    for file in files:
        path …

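This is not a PySpark answer. As a hedged side note on the pandas attempt, one layout that tends to use far less memory than a wide outer join is a "long" table with one row per (date, series) pair, since no cell has to be padded with NaN. The sketch below builds three tiny stand-in files in a temporary folder; the file names, the `DATE`/`VALUE` headers, and the values are all made up, and the real files may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create three small stand-in csv files (hypothetical names and values).
tmp = Path(tempfile.mkdtemp())
for name, values in {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]}.items():
    pd.DataFrame(
        {"DATE": ["2020-01-01", "2020-02-01"], "VALUE": values}
    ).to_csv(tmp / f"{name}.csv", index=False)


def read_long(path: Path) -> pd.DataFrame:
    """Read one two-column file and tag each row with its series name."""
    frame = pd.read_csv(path, parse_dates=[0])
    frame.columns = ["date", "value"]
    frame["series"] = path.stem  # file name without the extension
    return frame


# Stack all files vertically: three columns in total, however many files there are.
long_frame = pd.concat(
    [read_long(p) for p in sorted(tmp.glob("*.csv"))], ignore_index=True
)
print(long_frame)
```

If a wide, date-indexed table is still needed afterwards, it can be produced from the long frame for a chosen subset of series with `pivot`, rather than for all series at once.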

python memory bigdata apache-spark pyspark

ana*_*chy

2020 02-18

3
Score

1
Answers

10k
Views

How to delete duplicate rows from a postgresql sql table


date        | window  | points  |    actual_bool      |         previous_bool          |       creation_time        | source 
------------+---------+---------+---------------------+---------------------------------+----------------------------+--------
 2021-02-11 |     110 |     0.6 |                   0 |                               0 | 2021-02-14 09:20:57.51966  | bldgh
 2021-02-11 |     150 |     0.7 |                   1 |                               0 | 2021-02-14 09:20:57.51966  | fiata
 2021-02-11 |     110 |     0.7 |                   1 |                               0 | 2021-02-14 09:20:57.51966  | nfiws
 2021-02-11 |     150 |     0.7 |                   1 |                               0 | 2021-02-14 09:20:57.51966  | fiata
 2021-02-11 |     110 |     0.6 |                   0 |                               0 | …


sql postgresql duplicates sql-delete

ana*_*chy

2021 03-17

2
Score

1
Answers

489
Views

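Correctly using ssh and sed in a python script with os.system

I am trying to run an ssh command in a python script, using os.system with ssh and sed to add a 0 at the end of an exactly matched string on a remote server.

I have a file called nodelist on the remote server, which is a list that looks like this.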

test-node-1
test-node-2
...
test-node-11
test-node-12
test-node-13
...
test-node-21


I want to use sed to make the following modification: search for test-node-1 and, when an exact match is found, add a 0 at the end. The file must end up like this.

test-node-1 0
test-node-2
...
test-node-11
test-node-12
test-node-13
...
test-node-21


However, when I run the first command,

hostname = 'test-node-1'
function = 'nodelist'

os.system(f"ssh -i ~/.ssh/my-ssh-key username@serverlocation \"sed -i '/{hostname}/s/$/ 0/' ~/{function}.txt\"")


the result turns out like this,

test-node-1 0
test-node-2
...
test-node-11 0
test-node-12 0
test-node-13 0
...
test-node-21


I tried adding a \b to the command like this,

os.system(f"ssh -i ~/.ssh/my-ssh-key username@serverlocation \"sed -i '/\b{hostname}\b/s/$/ 0/' ~/{function}.txt\"")


That command doesn't work at all.

I have to type the node name in manually instead of using a variable, like this,

os.system(f"ssh -i …

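Two things may explain the behaviour described above; the sketch below illustrates both without contacting any server. First, the unanchored pattern `/test-node-1/` also matches `test-node-11`, `test-node-12`, and so on. Second, in an ordinary (non-raw) Python string, `\b` is the backspace character, so sed never receives a word-boundary escape. Anchoring the pattern to the whole line with `^` and `$` is one way to get an exact match. The sketch assumes the hostname contains no regex metacharacters; the list `lines` is a local stand-in for the remote file, and the final `subprocess.run` call is left commented out.

```python
import os
import re
import shlex

hostname = "test-node-1"
function = "nodelist"

# In a normal string literal "\b" is a single backspace character;
# a raw string keeps the backslash and the letter b.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# Whole-line match: only a line that is exactly the hostname is edited.
sed_script = f"/^{hostname}$/s/$/ 0/"
remote_cmd = f"sed -i {shlex.quote(sed_script)} ~/{function}.txt"

# Argument list for ssh; passing a list avoids one layer of shell quoting.
ssh_argv = [
    "ssh", "-i", os.path.expanduser("~/.ssh/my-ssh-key"),
    "username@serverlocation", remote_cmd,
]
# import subprocess; subprocess.run(ssh_argv, check=True)  # not executed here

# Local simulation of what the anchored pattern selects.
lines = ["test-node-1", "test-node-2", "test-node-11", "test-node-12"]
pattern = re.compile(f"^{re.escape(hostname)}$")
edited = [line + " 0" if pattern.search(line) else line for line in lines]
print(edited)
```

If word boundaries are preferred over line anchors, writing the escape as `\\b` (or using a raw f-string, `rf"..."`) keeps the backslash intact on the Python side.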

python regex ssh os.system sed

ana*_*chy

lucky-day

1
Score

1
Answers

255
Views

Getting a "column does not exist" error on a postgresql sql table

I have a sql table in postgresql, named test, that looks like this.

    date    |  data   |      source      
------------+---------+------------------
 2015-09-23 | 128     | aaamt
 2015-09-24 | 0       | aaamtx2
.....


I typed SELECT * FROM test where source="aaamt" but got the following error,

ERROR:  column "aaamt" does not exist
LINE 1: SELECT * FROM test where source = "aaamt";


Why am I getting this error, and how do I fix it?
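In PostgreSQL, double quotes delimit identifiers (such as column names) and single quotes delimit string constants, which is why `"aaamt"` is read as a column. The sketch below shows the single-quoted form and a parameterized form. It runs against an in-memory SQLite database purely as a stand-in so that it is self-contained; SQLite is more lenient about quoting than PostgreSQL, so it cannot reproduce the original error, and PostgreSQL drivers use their own placeholder syntax instead of `?`.

```python
import sqlite3

# In-memory stand-in for the test table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (date TEXT, data INTEGER, source TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [("2015-09-23", 128, "aaamt"), ("2015-09-24", 0, "aaamtx2")],
)

# String constant written with single quotes.
rows_literal = conn.execute(
    "SELECT * FROM test WHERE source = 'aaamt'"
).fetchall()

# Parameterized query: the driver handles the quoting of the value.
rows_param = conn.execute(
    "SELECT * FROM test WHERE source = ?", ("aaamt",)
).fetchall()

conn.close()
print(rows_literal)
print(rows_param)
```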

sql postgresql string-constant

ana*_*chy

2020 10-20

0
Score

1
Answers

4501
Views

Tag statistics

python ×5

pandas ×2

postgresql ×2

sql ×2

apache-spark ×1

argument-unpacking ×1

asynchronous ×1

bigdata ×1

dataframe ×1

duplicates ×1

excel ×1

memory ×1

os.system ×1

pyspark ×1

python-asyncio ×1

regex ×1

sed ×1

sql-delete ×1

ssh ×1

string-constant ×1

syntax ×1
