Posts by Flu*_*uxy

How do I aggregate the median and standard deviation in PySpark?

I have the following code:

from pyspark.sql import functions as func

cols = ("id","size")

result = df.groupby(*cols).agg({
    func.max("val1"),
    func.median("val2"),
    func.std("val2")
})


But it fails on the func.median("val2") line with a message saying that median cannot be found. The same happens with func.std.

python apache-spark apache-spark-sql pyspark

Flu*_*uxy

lucky-day

1
Score

1
Answers

3930
Views

Regex to extract the first part of a string

I have the following list of phrases:

[
  'This is erleada comp. recub. con película 60 mg.',
  'This is auxina e-200 uicaps. blanda 200 mg.',
  'This is ephynalsol. iny. 100 mg.',
  'This is paracethamol 100 mg.'
]


I need to get the following result:

[
  'This is erleada.',
  'This is auxina.',
  'This is ephynalsol.',
  'This is paracethamol.'
]


I wrote the following function to clean the phrases:

def clean(string):
    sub_strings = [".", "iny", "comp", "uicaps"]
    try:
        string = [string[:string.index(sub_str)].rstrip() for sub_str in sub_strings]
        return string
    except:
        return string

And I use it as follows:

for phrase in phrases:
    drug = clean(phrase)
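
A regular expression can express the expected output more directly than a list of cut-off substrings. The sketch below assumes that every phrase starts with "This is" followed by a single-word drug name; that assumption is read off the four sample phrases and may not hold for other data. Phrases that do not match are returned unchanged.

```python
import re

# Assumption: the part to keep is "This is" plus the first word after it.
PATTERN = re.compile(r"^(This is \w+)")


def clean(phrase):
    """Return 'This is <first word>.' or the phrase unchanged if it does not match."""
    match = PATTERN.match(phrase)
    return f"{match.group(1)}." if match else phrase


phrases = [
    'This is erleada comp. recub. con película 60 mg.',
    'This is auxina e-200 uicaps. blanda 200 mg.',
    'This is ephynalsol. iny. 100 mg.',
    'This is paracethamol 100 mg.',
]

drugs = [clean(phrase) for phrase in phrases]
```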


python regex

Flu*_*uxy

2021 03-07

1
Score

1
Answers

56
Views

Conditionally merging rows in a Pandas DataFrame

I have the following pandas DataFrame:

col1 col2                   col3        col4 
A    2021-03-28 01:40:00    1.381158    0.0
A    2021-03-28 01:50:00    0.480089    0.0
A    2021-03-28 03:00:00    0.000000    0.0
A    2021-03-28 03:00:00    0.111088    0.0
A    2021-03-28 03:10:00    0.000000    0.0
A    2021-03-28 03:10:00    0.000000    0.0
A    2021-03-28 03:10:00    0.151066    0.0
B    2021-03-28 03:10:00    1.231341    1.0

I need to merge the rows that share the same col1 and col2 values, keeping one value for col3.

Here is the expected output:

col1 col2                   col3        col4 
A    2021-03-28 01:40:00    1.381158    0.0
A    2021-03-28 01:50:00    0.480089    0.0
A    2021-03-28 03:00:00    0.111088    0.0
A    2021-03-28 03:10:00    0.151066    0.0
B    2021-03-28 03:10:00 …
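
In the sample output, each merged row carries the largest col3 value of its group, so one way to sketch the merge is a groupby with a max aggregation. Two things here are assumptions and not stated in the question: that "largest col3" is the intended rule, and that col4 can be taken from the first row of each group.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A"] * 7 + ["B"],
    "col2": pd.to_datetime([
        "2021-03-28 01:40:00", "2021-03-28 01:50:00",
        "2021-03-28 03:00:00", "2021-03-28 03:00:00",
        "2021-03-28 03:10:00", "2021-03-28 03:10:00",
        "2021-03-28 03:10:00", "2021-03-28 03:10:00",
    ]),
    "col3": [1.381158, 0.480089, 0.0, 0.111088, 0.0, 0.0, 0.151066, 1.231341],
    "col4": [0.0] * 7 + [1.0],
})

# One row per (col1, col2) pair.
# Assumption: keep the largest col3 and the first col4 of each group.
merged = df.groupby(["col1", "col2"], as_index=False).agg(
    col3=("col3", "max"),
    col4=("col4", "first"),
)
```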


python pandas

Flu*_*uxy

lucky-day

1
Score

1
Answers

47
Views

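How do I compute the duration between rows with the same stage value and then get the cumulative duration for each stage?

I have the following DataFrame: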

dt_datetime        stage    proc_val
2011-11-13 11:00   0        20
2011-11-13 11:10   0        21
2011-11-13 11:30   1        25
2011-11-13 11:40   2        22
2011-11-13 11:55   2        28
2011-11-13 12:00   2        29

I need to add a new column called stage_duration and get the following result:

dt_datetime        stage    proc_val   stage_duration
2011-11-13 11:00   0        20         30
2011-11-13 11:10   0        21         30
2011-11-13 11:30   1        25         10
2011-11-13 11:40   2        22         20
2011-11-13 11:55   2        28         20
2011-11-13 12:00   2        29         20

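How can I do this?

Here is my current code snippet, but it does not produce the expected result. It should compute the duration between rows with the same stage value and then get the cumulative duration for each stage, but it does not.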

df['stage_duration'] = df.groupby('stage')['dt_datetime'].diff().dt.total_seconds() / 60
df['stage_duration'] = df['stage_duration'].cumsum()

Update:

The solution should also work if the DataFrame contains multiple groups of stages; for example, see the stages starting at 2011-11-13 11:00 and …
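
Judging by the expected table, a stage seems to last from its first row until the first row of the next stage, and the final stage lasts until its own last row. The sketch below follows that reading, which is an inference from the sample and not something the question states. It labels each consecutive run of equal stage values, so a stage that occurs again later is treated as a separate run. It also assumes the rows are already sorted by dt_datetime; run_id, run_start, run_end and run_minutes are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "dt_datetime": pd.to_datetime([
        "2011-11-13 11:00", "2011-11-13 11:10", "2011-11-13 11:30",
        "2011-11-13 11:40", "2011-11-13 11:55", "2011-11-13 12:00",
    ]),
    "stage": [0, 0, 1, 2, 2, 2],
    "proc_val": [20, 21, 25, 22, 28, 29],
})

# Label consecutive runs of the same stage value (1, 1, 2, 3, 3, 3).
run_id = df["stage"].ne(df["stage"].shift()).cumsum()

# A run starts at its first timestamp and ends where the next run starts;
# the last run ends at the last timestamp in the frame.
run_start = df.groupby(run_id)["dt_datetime"].min()
run_end = run_start.shift(-1).fillna(df["dt_datetime"].max())
run_minutes = (run_end - run_start).dt.total_seconds() / 60

# Broadcast each run's duration back to its rows.
df["stage_duration"] = run_id.map(run_minutes).astype(int)
```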

python pandas

Flu*_*uxy

2023 01-26

1
Score

1
Answers

59
Views

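Cannot load a pickle file with sklearn's joblib

I trained a model on a cluster, downloaded it (pkl format) and tried to load it locally. I know that sklearn's joblib was used to save the model mymodel.pkl (but I don't know exactly which version...).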

from sklearn.externals import joblib

print(joblib.__version__)

model = joblib.load("mymodel.pkl")

Locally I use version 0.13.0 of sklearn's joblib.

This is the error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-100-d0a3c42e5c53> in <module>
      3 print(joblib.__version__)
      4 
----> 5 model = joblib.load("mymodel.pkl")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py in load(filename, mmap_mode)
    596                     return load_compatibility(fobj)
    597 
--> 598                 obj = _unpickle(fobj, filename, mmap_mode)
    599 
    600     return obj

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py in _unpickle(fobj, filename, mmap_mode)
    524     obj = None
    525     try:
--> 526         obj = unpickler.load()
    527         if unpickler.compat_mode:
    528             warnings.warn("The file '%s' has …


python pickle scikit-learn

Flu*_*uxy

2020 01-05

0
Score

1
Answers

20k
Views

How do I replace a specific substring of a string in a list of strings?

I have the following list of strings:

list_of_str = ['Notification message', 'Warning message', 'This is the |xxx - show| message.', 'Notification message is defined by |xxx - show|', 'Notification message']

How do I find the string closest to the end of the list that contains show|, and replace show| with Placeholder|?

Expected result:

list_of_str = ['Notification message', 'Warning message', 'This is the |xxx - show| message.', 'Notification message is defined by |xxx - Placeholder|', 'Notification message']
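
A minimal sketch of one way to do this: walk the list from the end, stop at the first string that contains the substring, and replace it there only. The function name replace_in_last_match and its parameter names are illustrative, and the function returns a new list instead of modifying the input.

```python
def replace_in_last_match(items, old="show|", new="Placeholder|"):
    """Replace `old` with `new` in the last item that contains `old`."""
    result = list(items)  # work on a copy
    # Scan from the tail so the first hit is the one closest to the end.
    for i in range(len(result) - 1, -1, -1):
        if old in result[i]:
            result[i] = result[i].replace(old, new)
            break
    return result


list_of_str = [
    'Notification message',
    'Warning message',
    'This is the |xxx - show| message.',
    'Notification message is defined by |xxx - show|',
    'Notification message',
]

updated = replace_in_last_match(list_of_str)
```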


python string

Flu*_*uxy

lucky-day

-1
Score

1
Answers

78
Views

TypeError: string indices must be integers after json.loads()

I have the following string:

'"{\\"values\\": [3.304000000004, 3.010000000002, 5.8220000000063]}"'

I need to convert it to JSON. If I do:

parsed = json.loads(data)
parsed["values"]

...then I get the following error:

TypeError: string indices must be integers

How can I fix this?
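
The string shown above looks like JSON that was encoded twice: the outer layer is a JSON string whose content is itself a JSON document. In that case the first json.loads call returns a str, which explains the TypeError when indexing with "values". A minimal sketch of a workaround is to decode again while the result is still a string. Fixing the producer so that it encodes only once would be the cleaner option, if that is possible.

```python
import json

data = '"{\\"values\\": [3.304000000004, 3.010000000002, 5.8220000000063]}"'

parsed = json.loads(data)
# After one pass the result is still a str holding JSON text,
# so decode a second time to reach the dictionary.
if isinstance(parsed, str):
    parsed = json.loads(parsed)

values = parsed["values"]
```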

python json

Flu*_*uxy

2020 12-17

-2
Score

1
Answers

64
Views

Tag statistics

python ×7

pandas ×2

apache-spark ×1

apache-spark-sql ×1

json ×1

pickle ×1

pyspark ×1

regex ×1

scikit-learn ×1

string ×1
