I'm using Dask, but I'm a bit confused.
I run the commands below and get the following, over and over, until the process crashes.
When it falls over, it's using 100% of all 4 CPU cores.
Can anyone advise?
distributed.nanny - WARNING - Restarting worker
Here is the code:
import pandas as pd
import dask.dataframe as dd
import numpy as np
import time
from dask.distributed import Client
client = Client()
%time dahsn = dd.read_csv("US_Accidents_Dec19.csv")
dahsn.groupby('City').count().compute()
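For what it's worth, this is the variation I was planning to try next, on the assumption that the workers are running out of memory. The worker count, memory limit and blocksize below are just guesses on my part, not values I know to be correct:

import dask.dataframe as dd
from dask.distributed import Client

# Guessed settings: fewer workers, an explicit per-worker memory cap,
# and smaller partitions so no single chunk overwhelms a worker
client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')

dahsn = dd.read_csv("US_Accidents_Dec19.csv", blocksize="16MB")

# Same aggregation as before
print(dahsn.groupby('City').count().compute().head())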
I have the following, which takes some JSON input and converts it to a Pandas DataFrame.
However, because the JSON doesn't have a consistent schema, nothing lines up. (If a field doesn't exist in one entry, it shifts everything to the left.)
Is there any way to say something like the following and define it explicitly?
df.field1 = json.field1
If I can define them by their names, I'll be fine :)
Thanks
import subprocess
import json
import pandas as pd

# `command` and `starttime` are defined earlier (not shown here)
output = subprocess.check_output(command, shell=True)
# output of subprocess will be bytes, converting to string.
if isinstance(output, bytes):
output = output.decode()
output = json.loads(output)
df = pd.DataFrame(output['apps']['app'])
df = df.loc[df['startedTime'] > starttime]
df.to_csv('yarn_output.csv')
Sample input JSON
{"apps":{"app":[{"id":"application_1589431105417_21534","user":"udsldr","name":"HIVE-61a4ee14-1d26-4c7b-bf0d-1cc2a990557d","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21534/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294649069,"finishedTime":1590294666011,"elapsedTime":16942,"amContainerLogs":"http://uds-far-dn150.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21534_01_000001/udsldr","amHostHttpAddress":"uds-far-dn150.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":144531,"vcoreSeconds":17,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21535","user":"nifildr","name":"HIVE-850812d7-9d22-4be8-a225-7b341f6ea980","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21535/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294664397,"finishedTime":1590294801090,"elapsedTime":136693,"amContainerLogs":"http://uds-far-dn129.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21535_01_000001/nifildr","amHostHttpAddress":"uds-far-dn129.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":18279340,"vcoreSeconds":4248,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21532","user":"udsldr","name":"HIVE-73e0c359-32a5-4334-89da-4a8ae2bb1037","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21532/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294622244,"finishedTime":1590294643808,"elapsedTime":21564,"amContainerLogs":"http://uds-far-dn35.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21532_01_000001/udsldr","amHostHttpAddress":"uds-far-dn35.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":182247,"vcoreSeconds":22,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21533","user":"udssupport","name":"tcs.uds.webstats","queue":"udssystem","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21533/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"","priority":0,"startedTime":1590294631138,"finishedTime":1590295670552,"elapsedTime":1039414,"amContainerLogs":"http://uds-far-dn148.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21533_01_000001/udssupport","amHostHttpAddress":"uds-far-dn148.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":4762538052,"vcoreSeconds":775756,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21530","user":"nifildr","name":"HIVE-e9a64e12-11f0-4ba8-b069-3be0ce561137","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21530/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=3, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294606965,"finishedTime":1590295033193,"elapsedTime":426228,"amContainerLogs":"http://uds-far-dn75.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21530_01_000001/nifildr","amHostHttpAddress":"uds-far-dn75.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":114397555,"vcoreSeconds":27175,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21531","user":"nifi","name":"HIVE-a063ddd1-5bf8-47b4-8ce3-8497c93b79a5","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21531/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=0, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590294613578,"finishedTime":1590294655173,"elapsedTime":41595,"amContainerLogs":"http://uds-far-dn56.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21531_01_000001/nifi","amHostHttpAddress":"uds-far-dn56.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":345792,"vcoreSeconds":42,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21528","user":"udsldr","name":"com.cardinality.LocationDB","queue":"udsldr","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21528/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"5ec9f8480000f1697e683969","priority":0,"startedTime":1590294605875,"finishedTime":1590294782281,"elapsedTime":176406,"amContainerLogs":"http://uds-far-dn167.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21528_01_000001/udsldr","amHostHttpAddress":"uds-far-dn167.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":43389139,"vcoreSeconds":5239,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21529","user":"keenek1","name":"Clean DPI Report","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21529/","diagnostics":"","clusterId":1589431105417,"applicationType":"SPARK","applicationTags":"","priority":0,"startedTime":1590294607111,"finishedTime":1590295032105,"elapsedTime":424994,"amContainerLogs":"http://uds-far-dn62.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21529_01_000001/keenek1","amHostHttpAddress":"uds-far-dn62.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":2114077299,"vcoreSeconds":344079,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"TIME_OUT","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21542","user":"murugaa1","name":"HIVE-a1a5aadb-254c-4289-ad22-e9c7ce5e9814","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21542/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, 
killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590295275713,"finishedTime":1590295297948,"elapsedTime":22235,"amContainerLogs":"http://uds-far-dn46.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21542_01_000001/murugaa1","amHostHttpAddress":"uds-far-dn46.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":999465,"vcoreSeconds":217,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""},{"id":"application_1589431105417_21543","user":"murugaa1","name":"HIVE-cdc8a5da-f880-4f8e-9baf-b306095b9efb","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED","progress":100.0,"trackingUI":"History","trackingUrl":"http://uds-far-mn4.dab.02.net:8088/proxy/application_1589431105417_21543/","diagnostics":"Session stats:submittedDAGs=0, successfulDAGs=1, failedDAGs=0, killedDAGs=0\n","clusterId":1589431105417,"applicationType":"TEZ","applicationTags":"","priority":0,"startedTime":1590295277611,"finishedTime":1590295301515,"elapsedTime":23904,"amContainerLogs":"http://uds-far-dn41.dab.02.net:8042/node/containerlogs/container_e66_1589431105417_21543_01_000001/murugaa1","amHostHttpAddress":"uds-far-dn41.dab.02.net:8042","allocatedMB":-1,"allocatedVCores":-1,"runningContainers":-1,"memorySeconds":1077860,"vcoreSeconds":228,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"preemptedResourceMB":0,"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"logAggregationStatus":"SUCCEEDED","unmanagedApplication":false,"amNodeLabelExpression":""}]}}
CSV …
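For reference, this is the kind of explicit, name-based mapping I have in mind, written against a tiny stand-in for output['apps']['app']; the column list is just an illustrative subset, and I don't know whether this is the idiomatic way to do it:

import pandas as pd

# Tiny stand-in for output['apps']['app']; the second record is missing 'queue'
apps = [
    {'id': 'application_1', 'user': 'udsldr', 'queue': 'udsldr', 'state': 'FINISHED'},
    {'id': 'application_2', 'user': 'nifildr', 'state': 'FINISHED'},
]

# Pin the columns by name so a missing key becomes NaN instead of shifting values left
wanted_cols = ['id', 'user', 'queue', 'state']
df = pd.DataFrame(apps, columns=wanted_cols)
print(df)

# Alternatively, pandas can flatten the real records directly:
# df = pd.json_normalize(output['apps']['app'])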
I'm doing a course on Spark and I'm a bit confused.
So we have the code below. I understand that line 1 creates tuples of (word, 1). Then line 2 groups by word and sums the counts.
What I don't understand is what x and y are in line 2. We only have one number going into the lambda function, i.e. the count column in wordCounts (all 1s), so why is there a y?
wordCounts = words.map(lambda x: (x, 1)) #outputs [('self', 1), ('employment', 1), ('building', 1)...
wordCounts2 = wordCounts.reduceByKey(lambda x, y: x + y) # outputs [('self', 111), ('an', 178), ('internet', 26)
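For reference, this is how I currently picture what happens for a single key, written in plain Python; it's my mental model rather than something I've verified against Spark itself:

from functools import reduce

# All the 1s that reduceByKey would collect for the key 'self'
counts_for_self = [1, 1, 1, 1]

# The lambda is applied pairwise: x is the running total so far,
# y is the next value seen for the same key
total = reduce(lambda x, y: x + y, counts_for_self)
print(total)  # 4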
Then we have this code right after it. I understand it sorts the RDD. Can someone confirm my understanding that x[1] is the word and x[2] is the total count? I'm guessing so, but I'm not 100% sure.
Sorry for the silly questions, but I can't find a clear explanation!
wordCountsSorted = wordCounts2.map(lambda x: (x[1], x[0])).sortByKey()
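To check that I'm reading the indexing right, here is the same map step applied to a single tuple in plain Python (again, just my understanding):

pair = ('self', 111)          # (word, count) as produced by reduceByKey
swapped = (pair[1], pair[0])  # what lambda x: (x[1], x[0]) returns
print(swapped)                # (111, 'self') -- count first, then word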
I've tried the following in Pandas and it works. I'd like to know how to do the same thing in PySpark.
The input is:
news.bbc.co.uk
It should split it at the '.', so index should equal:
[['news', 'bbc', 'co', 'uk'], ['next', 'domain', 'name']]
index = df2.domain.str.split('.').tolist()
Does anyone know how I would do this in Spark rather than Pandas?
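For reference, my current guess at the PySpark equivalent, untested, assuming a DataFrame df2 with a string column called domain:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([('news.bbc.co.uk',), ('next.domain.name',)], ['domain'])

# split() takes a regex, so the dot has to be escaped
df2 = df2.withColumn('domain_parts', F.split(F.col('domain'), r'\.'))
df2.show(truncate=False)

# If I need a plain Python list of lists, like the pandas .tolist():
index = [row['domain_parts'] for row in df2.select('domain_parts').collect()]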
Thanks
I have the script below & on the last line I'm trying to remove stop words from the strings in a column called 'response'.
The problem is that instead of, say, 'a bit annoyed' becoming 'bit annoyed', it actually drops individual letters as well, so the 'a' inside words gets stripped out too, because 'a' is a stop word.
Can anyone advise?
import pandas as pd
from textblob import TextBlob
import numpy as np
import os
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
path = 'Desktop/fanbase2.csv'
df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")
#remove punctuation
df['response'] = df.response.str.replace("[^\w\s]", "")
#make it all lower case
df['response'] = df.response.apply(lambda x: x.lower())
#Handle strange character in source
df['response'] = df.response.str.replace("‰Ûª", "''")
df['response'] = df['response'].apply(lambda x: [item for item in x if item not in stop])
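For reference, this is the variation I'm thinking of trying next, on the theory that the comprehension above iterates over characters rather than words; I don't know yet whether this is the right fix:

# Split each response into words first, filter those, then join back together
df['response'] = df['response'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop)
)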
I have a really simple question in Hive. I wrote the extract below, which should return '10' from the string. It works when I test it on regexr, but in Hive it just returns a blank field.
Does anyone know what I'm doing wrong?
select REGEXP_EXTRACT('DOM_10GB_mth','/[0-9]*/g', 0)
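For comparison, this is the plain-Python version of what I expect the pattern to do; I suspect the /.../g delimiters are the problem on the Hive side, but I'm not sure:

import re

# No /.../g delimiters, and a capture group around the digits
match = re.search(r'([0-9]+)', 'DOM_10GB_mth')
print(match.group(1))  # '10'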
I have a script set up as follows.
I'm using:
1) Spark DataFrames to extract the data
2) Converting to a Pandas DataFrame after the initial aggregation
3) Wanting to convert back to Spark to write to HDFS
The conversion from Spark -> Pandas was simple, but I'm struggling with how to convert a Pandas DataFrame back to Spark.
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
import pyspark.sql.functions as sqlfunc
import pandas as pd
def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session

### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I tried the following: no errors, just no data! To confirm, df6 definitely has data & is a Pandas DataFrame.
df6 = df5.sort_values(['sdsf'], ascending=["true"])
sdf = spark_session.createDataFrame(df6)
sdf.show()
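As a sanity check, I was also going to try a minimal round trip with toy data, to see whether the problem is the conversion itself or something about df6; this is just my debugging plan:

import pandas as pd

# Tiny Pandas frame -> Spark frame -> show
toy = pd.DataFrame({'sdsf': [3, 1, 2], 'value': ['a', 'b', 'c']})
toy_sorted = toy.sort_values(['sdsf'], ascending=[True])

toy_sdf = spark_session.createDataFrame(toy_sorted)
toy_sdf.show()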
Hoping you can help.
I've been trying to use the randomised search functionality in scikit-learn to tune my random forest model.
As shown below, I've given it several options for max depth and several for min samples per leaf.
# Create a based model
model = RandomForestClassifier()
# Instantiate the random search model
best = RandomizedSearchCV(model, {
'bootstrap': [True, False],
'max_depth': [80, 90, 100, 110],
'min_samples_leaf': [3, 4, 5]
}, cv=5, return_train_score=True, iid=True, n_iter = 4)
best.fit(train_features, train_labels.ravel())
print(best.best_score_)
print(best)
But when I run this, I get the result below, where max depth and min samples per leaf are set to values that aren't in my arrays.
What am I doing wrong here?
RandomizedSearchCV(cv=5, error_score='raise',
estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
**max_depth=None**, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
**min_samples_leaf=1**, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False),
fit_params=None, iid=True, n_iter=4, n_jobs=1,
param_distributions={'bootstrap': [True, False], 'max_depth': [80, 90, 100, 110], …
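For what it's worth, this is what I'm planning to print next, on the assumption that printing the search object only shows the estimator's default parameters rather than the values the search actually chose:

# Inspect what the search actually selected, rather than the defaults shown by print(best)
print(best.best_params_)           # the chosen combination from my grids
print(best.best_score_)
print(best.cv_results_['params'])  # the 4 sampled parameter combinations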
I have the code below, where I convert a decimal into a fraction. It worked: 0.6 became 3/5.
However, if I have 0.666666666666, I'd expect 2/3. How can I achieve this?
#set starting parameters
decimal = 0.6
starting_denominator = 100
starting_numerator = int(decimal * 100)
print('starting fraction: '+str(starting_numerator) + '/' + str(starting_denominator))
#find the common factors for the numerator and denominator
i = 1
numerator_factors = []
denominator_factors = []
while i < starting_numerator+1:
    if starting_numerator%i == 0:
        numerator_factors.append(i)
    i = i+1
i = 1
while i < starting_denominator+1:
    if starting_denominator%i == 0:
        denominator_factors.append(i)
    i = i+1
print('numerator factors: '+ str(numerator_factors))
print('denominator factors: '+ str(denominator_factors))
#Find …
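For comparison, here is a minimal sketch of how I understand the standard library's fractions module would handle this; whether I'm allowed to use it for this exercise is a separate question:

from fractions import Fraction

# limit_denominator() picks the closest fraction with a small denominator,
# which turns the repeating 0.666666666666 into 2/3
print(Fraction(0.6).limit_denominator(100))             # 3/5
print(Fraction(0.666666666666).limit_denominator(100))  # 2/3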
python-3.x ×5
pandas ×3
pyspark ×3
apache-spark ×2
hive ×1
lambda ×1
nltk ×1
pyspark-sql ×1
regex ×1
scikit-learn ×1