Posts by kik*_*222

Dask distributed.nanny - WARNING - Restarting worker problem

I'm using Dask and I'm a bit confused.

When I run the code below, I get the warning below over and over until the process crashes.

While it is failing, all 4 CPU cores are at 100% usage.

Can anyone advise?

distributed.nanny - WARNING - Restarting worker

Here is the code:

import pandas as pd
import dask.dataframe as dd
import numpy as np
import time
from dask.distributed import Client
client = Client()
%time dahsn = dd.read_csv("US_Accidents_Dec19.csv")
dahsn.groupby('City').count().compute()
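For what it's worth, a restarting worker usually means the workers blew their memory budget: `read_csv` has to infer dtypes, and the groupby holds whole partitions in memory. Likely mitigations (untested sketch) are an explicit per-worker limit and smaller partitions, e.g. `Client(n_workers=4, memory_limit='2GB')` and `dd.read_csv(..., blocksize='64MB', dtype=str)`. The underlying out-of-core idea can be sketched with plain pandas chunks (the CSV here is a stand-in for the real file):

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk.
csv_text = "City,Severity\nDayton,2\nDayton,3\nAustin,1\nAustin,2\nDayton,1\n"

# Read fixed-size chunks so only one chunk is in memory at a time,
# accumulating per-city counts - the same idea Dask applies per partition.
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = counts.add(chunk.groupby("City").size(), fill_value=0)

print(counts.astype(int).to_dict())  # {'Austin': 2, 'Dayton': 3}
```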

python-3.x dask-distributed

5 votes · 0 answers · 2158 views

Mapping JSON fields to Pandas dataframe columns

I have the code below, which takes some JSON input and converts it into a Pandas dataframe.

However, because the JSON doesn't have a consistent schema, nothing lines up. (If a field is missing from one entry, everything shifts to the left.)

Is there any way to write something like the following and define the mapping explicitly?

df.field1 = json.field1

If I could map the fields by name, I'd be fine :)

Thanks

import subprocess
import json
import pandas as pd

output = subprocess.check_output(command, shell=True)

# output of subprocess will be bytes, converting to string.
if isinstance(output, bytes):
    output = output.decode()

output = json.loads(output)
df = pd.DataFrame(output['apps']['app'])
df = df.loc[df['startedTime'] > starttime]
df.to_csv('yarn_output.csv')

Sample input JSON

{"apps":{"app":[{"id":"application_1589431105417_21534","user":"udsldr","name":"HIVE-61a4ee14-1d26-4c7b-bf0d-1cc2a990557d","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21534/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294649069,"finishedTime":1590294666011,"elapsedTime":16942,"amContainerLogs":"http://uds-far-dn150.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21534_01_000001/udsldr","amHostHttpAddress":"uds-far-dn150.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":144531,"vcoreSeconds":17,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21535","user":"nifildr","name":"HIVE-850812d7-9d22-4be8-a225-7b341f6ea980","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21535/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294664397,"finishedTime":1590294801090,"elapsedTime":136693,"amContainerLogs":"http://uds-far-dn129.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21535_01_000001/nifildr","amHostHttpAddress":"uds-far-dn129.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":18279340,"vcoreSeconds":4248,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21532","user":"udsldr","name":"HIVE-73e0c359-32a5-4334-89da-4a8ae2bb1037","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21532/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294622244,"finishedTime":1590294643808,"elapsedTime":21564,"amContainerLogs":"http://uds-far-dn35.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21532_01_000001/udsldr","amHostHttpAddress":"uds-far-dn35.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":182247,"vcoreSeconds":22,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21533","user":"udssupport","name":"tcs.uds.webstats","queue":"udssystem","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21533/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"","priority":0,"startedTime":1590294631138,"finishedTime":1590295670552,"elapsedTime":1039414,"amContainerLogs":"http://uds-far-dn148.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21533_01_000001/udssupport","amHostHttpAddress":"uds-far-dn148.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":4762538052,"vcoreSeconds":775756,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21530","user":"nifildr","name":"HIVE-e9a64e12-11f0-4ba8-b069-3be0ce561137","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/appl
ication_1589431105417_21530/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=3, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294606965,"finishedTime":1590295033193,"elapsedTime":426228,"amContainerLogs":"http://uds-far-dn75.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21530_01_000001/nifildr","amHostHttpAddress":"uds-far-dn75.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":114397555,"vcoreSeconds":27175,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21531","user":"nifi","name":"HIVE-a063ddd1-5bf8-47b4-8ce3-8497c93b79a5","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21531/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294613578,"finishedTime":1590294655173,"elapsedTime":41595,"amContainerLogs":"http://uds-far-dn56.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21531_01_000001/nifi","amHostHttpAddress":"uds-far-dn56.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":345792,"vcoreSeconds":42,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21528","user":"udsldr","name":"com.cardinality.LocationDB","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21528/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"5ec9f8480000f1697e683969","priority":0,"startedTime":1590294605875,"finishedTime":1590294782281,"elapsedTime":176406,"amContainerLogs":"http://uds-far-dn167.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21528_01_000001/udsldr","amHostHttpAddress":"uds-far-dn167.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":43389139,"vcoreSeconds":5239,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21529","user":"keenek1","name":"Clean DPI 
Report","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21529/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"","priority":0,"startedTime":1590294607111,"finishedTime":1590295032105,"elapsedTime":424994,"amContainerLogs":"http://uds-far-dn62.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21529_01_000001/keenek1","amHostHttpAddress":"uds-far-dn62.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":2114077299,"vcoreSeconds":344079,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21542","user":"murugaa1","name":"HIVE-a1a5aadb-254c-4289-ad22-e9c7ce5e9814","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21542/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590295275713,"finishedTime":1590295297948,"elapsedTime":22235,"amContainerLogs":"http://uds-far-dn46.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21542_01_000001/murugaa1","amHostHttpAddress":"uds-far-dn46.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":999465,"vcoreSeconds":217,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21543","user":"murugaa1","name":"HIVE-cdc8a5da-f880-4f8e-9baf-b306095b9efb","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21543/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590295277611,"finishedTime":1590295301515,"elapsedTime":23904,"amContainerLogs":"http://uds-far-dn41.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21543_01_000001/murugaa1","amHostHttpAddress":"uds-far-dn41.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":1077860,"vcoreSeconds":228,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""}]}}
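One way to pin the schema regardless of which fields each entry carries is to map fields by name with `dict.get`, which returns `None` for anything missing, and build the dataframe from those rows (`pandas.json_normalize` does much the same). A pure-Python sketch of the idea, with illustrative field values:

```python
# Entries with an inconsistent schema: the second one is missing 'user'.
apps = [
    {"id": "application_1", "user": "udsldr", "state": "FINISHED"},
    {"id": "application_2", "state": "RUNNING"},
]

# Fix the column order and fill missing fields with None instead of shifting left.
columns = ["id", "user", "state"]
rows = [{col: app.get(col) for col in columns} for app in apps]

print(rows[1])  # {'id': 'application_2', 'user': None, 'state': 'RUNNING'}
# pd.DataFrame(rows, columns=columns) would then keep every column aligned.
```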

CSV …

python python-3.x pandas

5 votes · 1 answer · 116 views

Understanding lambda function inputs for RDDs in Spark

I'm taking a course on Spark and I'm a bit confused.

So I have the code below. I understand that the first line creates the tuples (word, 1). The second line then groups by word and sums the counts.

What I don't understand is what x and y are in the second line. There's only one number going into the lambda - the count column in wordCounts (all 1s) - so why is there a y?

wordCounts = words.map(lambda x: (x, 1)) #outputs [('self', 1), ('employment', 1), ('building', 1)...
wordCounts2 = wordCounts.reduceByKey(lambda x, y: x + y) # outputs [('self', 111), ('an', 178), ('internet', 26)

Then we have this code right after it. I understand it sorts the RDD. Can you confirm my understanding that x[1] is the word and x[2] is the total? I'm guessing so, but I'm not 100% sure.

Sorry for the silly questions, but I can't find a clear explanation!

wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()
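For what it's worth, `reduceByKey` combines the values for each key pairwise: both x and y are counts (the running total and the next value), never the key. And since Python is zero-indexed, after the swap x[0] is the word and x[1] is the total. A pure-Python sketch of the same idea (the word list is illustrative, not Spark API):

```python
from functools import reduce

# The output of the map step: one (word, 1) pair per occurrence.
pairs = [("self", 1), ("self", 1), ("employment", 1), ("self", 1)]

# Group the 1s by word, as reduceByKey does behind the scenes.
grouped = {}
for word, count in pairs:
    grouped.setdefault(word, []).append(count)

# x and y are both counts: the running total and the next value to fold in.
totals = {word: reduce(lambda x, y: x + y, counts)
          for word, counts in grouped.items()}
print(totals)  # {'self': 3, 'employment': 1}

# The sort step swaps (word, total) to (total, word) so the count becomes the key.
swapped = sorted((total, word) for word, total in totals.items())
print(swapped)  # [(1, 'employment'), (3, 'self')]
```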

python lambda apache-spark pyspark

3 votes · 1 answer · 2105 views

Splitting a PySpark dataframe column at the dot

I've tried the following in Pandas and it works. I'd like to know how to do the same in PySpark.

The input is

news.bbc.co.uk

It should split it at the '.', so index should equal:

[['news', 'bbc', 'co', 'uk'], ['next', 'domain', 'name']]

index = df2.domain.str.split('.').tolist() 
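One thing to watch in Spark: `pyspark.sql.functions.split` takes a regular expression, not a literal string, so a bare `.` matches any character and the dot must be escaped. The pattern behavior is easy to check with Python's `re` module (the PySpark lines below are a sketch, not run here):

```python
import re

# An unescaped '.' matches any character; escape it to split on a literal dot.
parts = re.split(r"\.", "news.bbc.co.uk")
print(parts)  # ['news', 'bbc', 'co', 'uk']

# Equivalent PySpark sketch, assuming df2 has a 'domain' column:
# from pyspark.sql import functions as F
# df2 = df2.withColumn("index", F.split(F.col("domain"), r"\."))
```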

Does anyone know how I'd do this in Spark rather than Pandas?

Thanks

python-3.x apache-spark pyspark

2 votes · 2 answers · 1651 views

Removing stopwords from a pandas dataframe

I have the script below, and in the last line I'm trying to remove stopwords from the strings in a column called 'response'.

The problem is that instead of 'a bit annoyed' becoming 'bit annoyed', it actually drops individual letters - so 'a bit annoyed' ends up as 'bit nnoyed', because 'a' is a stopword.

Can anyone advise?

import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

path = 'Desktop/fanbase2.csv'
df = pd.read_csv(path, delimiter=',', header='infer', encoding="ISO-8859-1")
# remove punctuation
df['response'] = df.response.str.replace(r"[^\w\s]", "")
# make it all lower case
df['response'] = df.response.apply(lambda x: x.lower())
# handle strange characters in the source
df['response'] = df.response.str.replace("‰Ûª", "''")

df['response'] = df['response'].apply(lambda x: [item for item in x if item not in …
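The letter-dropping happens because iterating over a string yields characters, not words, so the filter removes every stopword character. Splitting into words first fixes it. A minimal pure-Python sketch (the stop set here is a stand-in for NLTK's `stopwords.words('english')`):

```python
stop = {"a", "the", "is"}  # stand-in for the NLTK stopword list

text = "a bit annoyed"

# Wrong: iterating a string walks over characters, so 'a' vanishes from 'annoyed' too.
chars_filtered = "".join(c for c in text if c not in stop)
print(repr(chars_filtered))  # ' bit nnoyed'

# Right: split into words first, then filter and rejoin.
words_filtered = " ".join(w for w in text.split() if w not in stop)
print(words_filtered)  # 'bit annoyed'

# The dataframe version would follow the same shape:
# df['response'] = df['response'].apply(
#     lambda x: " ".join(w for w in x.split() if w not in stop))
```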

python nltk pandas

2 votes · 1 answer · 8532 views

Extracting only the numbers from a string field in Hive

I have a really simple question in Hive. I've written the snippet below, which should return '10' from the string. It works when I test it on regexr, but in Hive it just returns an empty field.

Does anyone know what I'm doing wrong?

select REGEXP_EXTRACT('DOM_10GB_mth','/[0-9]*/g', 0)  
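A likely cause: Hive's `REGEXP_EXTRACT` takes a bare Java regex string, not a `/pattern/g` literal, so the slashes and the `g` flag become literal characters and nothing matches. The fix would presumably be `SELECT REGEXP_EXTRACT('DOM_10GB_mth', '[0-9]+', 0)`. The pattern difference can be checked with Python's `re`:

```python
import re

# '/[0-9]*/g' treats the slashes and 'g' as literal characters, so it never matches here.
assert re.search(r"/[0-9]*/g", "DOM_10GB_mth") is None

# A bare character-class pattern finds the digits.
match = re.search(r"[0-9]+", "DOM_10GB_mth")
print(match.group(0))  # '10'
```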

regex hive

2 votes · 1 answer · 9714 views

Converting a pandas dataframe to a PySpark dataframe

I have a script set up as follows.

I'm using:

1) a Spark dataframe to extract the data
2) a conversion to a pandas dataframe after the initial aggregation
3) and I want to convert back to Spark to write to HDFS

The conversion from Spark -> Pandas was simple, but I'm struggling with how to convert the pandas dataframe back to Spark.

Can you advise?

from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')

I tried the following - no errors, but no data! To confirm: df6 definitely does have data & is a pandas dataframe

df6 = df5.sort_values(['sdsf'], ascending=["true"])
sdf = spark_session.createDataFrame(df6)
sdf.show()

python-3.x pandas apache-spark-sql pyspark pyspark-sql

1 vote · 1 answer · 8362 views

RandomizedSearchCV not applying the selected parameters

I hope you can help.

I've been trying to use the randomized search function in scikit-learn to tune my random forest model.

As shown below, I've given a few options for max depth and for min samples per leaf.

# Create a base model
model = RandomForestClassifier()

# Instantiate the random search model
best = RandomizedSearchCV(model, {
'bootstrap': [True, False],
'max_depth': [80, 90, 100, 110],
'min_samples_leaf': [3, 4, 5]
}, cv=5, return_train_score=True, iid=True, n_iter = 4)

best.fit(train_features, train_labels.ravel())
print(best.best_score_)
print(best)

But when I run this, I get the output below, where max_depth and min_samples_leaf are set to values that aren't in my arrays.

What am I doing wrong here?

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            **max_depth=None**, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            **min_samples_leaf=1**, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=4, n_jobs=1,
          param_distributions={'bootstrap': [True, False], 'max_depth': [80, 90, 100, 110], …
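For what it's worth, the output printed there is the *estimator template* with scikit-learn's defaults (`max_depth=None`, `min_samples_leaf=1`), not the values the search sampled; the chosen settings live in `best_params_` and `best_estimator_`. A minimal sketch on tiny synthetic data (illustrative only; `iid` is omitted because it was deprecated and later removed from scikit-learn):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Tiny synthetic dataset, just enough rows to cross-validate.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 1, 1, 0] * 5

grid = {"max_depth": [80, 90, 100, 110], "min_samples_leaf": [3, 4, 5]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0), grid,
                            n_iter=4, cv=2, random_state=0)
search.fit(X, y)

# print(search) would echo the unfitted template (defaults); the sampled
# values are reported here instead, and always come from the grid above.
print(search.best_params_)
assert search.best_params_["max_depth"] in grid["max_depth"]
assert search.best_params_["min_samples_leaf"] in grid["min_samples_leaf"]
```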

python machine-learning random-forest scikit-learn cross-validation

1 vote · 1 answer · 3690 views

Converting a recurring decimal to a fraction in Python

I have the code below, where I convert a decimal to a fraction. It works: 0.6 becomes 3/5.

However, if I have 0.666666666666 I'd expect 2/3. How can I achieve that?

#set starting parameters
decimal = 0.6
starting_denominator = 100
starting_numerator = int(decimal * 100)

print('starting fraction: '+str(starting_numerator) + '/' + str(starting_denominator))

#find the common factors for the numerator and denominator
i = 1
numerator_factors = []
denominator_factors = []

while i < starting_numerator+1:
    if starting_numerator%i == 0:
        numerator_factors.append(i)
    i = i+1

i = 1
while i < starting_denominator+1:
    if starting_denominator%i == 0:
        denominator_factors.append(i)
    i = i+1   

print('numerator factors: '+ str(numerator_factors))
print('denominator factors: '+ str(denominator_factors))

#Find …
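For what it's worth, the standard library already handles this case: `fractions.Fraction.limit_denominator` returns the closest fraction whose denominator stays under a bound, which absorbs the float noise and maps 0.666666666666 to 2/3:

```python
from fractions import Fraction

# limit_denominator finds the closest fraction with denominator <= the given bound,
# which rounds away the imprecision of the recurring decimal.
print(Fraction(0.6).limit_denominator(100))             # 3/5
print(Fraction(0.666666666666).limit_denominator(100))  # 2/3
```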

python python-3.x

1 vote · 1 answer · 1492 views