I have the regular expression re.sub(r"(?<!\s)\}", r' }', string).
What is (?<!…)?
What does the order signify?
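As a point of reference for the question above: in Python's `re` module, `(?<!…)` is a negative lookbehind. It matches at a position only if the text immediately before that position does not match the enclosed pattern, and it consumes no characters. Because it is written before `\}`, it is checked at the position just before the brace. The sketch below wraps the original call in a hypothetical helper, `pad_closing_braces`, to show the effect.

```python
import re


def pad_closing_braces(text):
    """Insert a space before every '}' that is not already preceded by whitespace."""
    # (?<!\s) : negative lookbehind, succeeds only if the previous character
    #           is NOT whitespace (it also succeeds at the start of the string)
    # \}      : a literal closing brace, the only part that is actually replaced
    return re.sub(r"(?<!\s)\}", " }", text)


print(pad_closing_braces("{a}"))   # brace follows 'a', so a space is inserted
print(pad_closing_braces("{a }"))  # brace already follows a space, left unchanged
```

Only the brace itself is replaced; the character inspected by the lookbehind stays where it is.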
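What is the equivalent of pandas.DataFrame.set_index for a PySpark DataFrame object? Can you suggest one?
I am using Spark 2.2.0 and Python 2.7. I am using PySpark to connect to BigSQL and trying to retrieve data. Below is the code I am using: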
import cPickle as cpick
import numpy as np
import pandas as pd
import time
import sys
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_train_df = spark.read.jdbc("jdbc:db2://BigSQL URL:Port:sslConnection=true;", "Schema.Table",
                                 properties={"user": "my userid",
                                             "password": "password",
                                             'driver': 'com.ibm.db2.jcc.DB2Driver'})
spark_train_df.registerTempTable('data_table')
# query to get columns necessary to create indexes
sql = "select * FROM data_table"
train_df = spark.sql(sql)
cmr_dict = {'date': time.strftime('%a, %b %d, %Y'),
            'description': '`cmrs` contains data from data_table',
            'cmrs': train_df}
with open('cmrs.pkl', mode='wb') as fp:
    cpick.dump(cmr_dict, fp, cpick.HIGHEST_PROTOCOL) …
I have PySpark 2.0.1. I am trying to group my DataFrame and retrieve the values of all of its fields. I found that
z=data1.groupby('country').agg(F.collect_list('names'))
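gives me the country attribute and the values of the names attribute, with the resulting column headed collect_list(names). For my work, however, I have a DataFrame with about 15 columns. I will run a loop in which the groupby field changes on every iteration, and I need the output for all of the remaining fields. Could you suggest how to do this with collect_list() or any other PySpark function?
I also tried this code: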
from pyspark.sql import functions as F
fieldnames=data1.schema.names
names1= list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z=data1.groupby('names').agg(F.collect_list(names1))
z.show()
but received the error message:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
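The error above arises because `collect_list` takes a single column, not a Python list of column names. One possible approach, sketched here under assumptions, is to build one `collect_list` expression per remaining column and pass them all to `agg`. The helper names `remaining_columns` and `collect_all_by` are hypothetical and not part of PySpark.

```python
def remaining_columns(all_columns, group_col):
    """Return every column name except the one being grouped on."""
    return [c for c in all_columns if c != group_col]


def collect_all_by(df, group_col):
    """Group `df` by `group_col` and collect each remaining column into a list."""
    # Imported here so the pure-Python helper above works without a Spark install.
    from pyspark.sql import functions as F

    # collect_list accepts one column at a time, so build one expression
    # per remaining column and keep the original column name via alias().
    exprs = [F.collect_list(c).alias(c)
             for c in remaining_columns(df.columns, group_col)]
    return df.groupBy(group_col).agg(*exprs)


print(remaining_columns(['country', 'names', 'age'], 'names'))
```

Inside the loop described in the question, this could be called as `collect_all_by(data1, field).show()` for each `field` in `data1.columns`.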
I am using the SMOTE function to oversample a sparse dataset that contains about 98% 0s and 2% 1s. I used the following code:
from imblearn.over_sampling import SMOTE
import os
import pandas as pd
df_input= pd.read_csv('input_tr.csv',index_col=0)
train_X=df_input.ix[:, df_input.columns != 'row_num']
df_output=pd.read_csv("output_tr.csv",index_col=0)
train_y=df_output
sm = SMOTE(random_state=12, ratio = 1.0)
train_X_sm,train_y_sm=sm.fit_sample(train_X,train_y)
I get the following error:
line 347, in kneighbors
(train_size, n_neighbors)
ValueError: Expected n_neighbors <= n_samples, but n_samples = 4, n_neighbors = 6
Can you help me resolve this error?
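One likely reading of this error: SMOTE looks up `k_neighbors` nearest neighbours (5 by default) within a minority class, and the underlying search asks for one extra neighbour, hence `n_neighbors = 6`. A class with only 4 samples cannot supply them. A minimal sketch, assuming the labels are passed as a 1-D sequence (for example a single column of `train_y` rather than the whole DataFrame), is to cap `k_neighbors` at one less than the smallest class size. The helpers `safe_k_neighbors` and `make_smote` are hypothetical names.

```python
from collections import Counter


def safe_k_neighbors(labels, default_k=5):
    """Largest usable k: at most default_k and below the smallest class size."""
    smallest = min(Counter(labels).values())
    k = min(default_k, smallest - 1)
    if k < 1:
        raise ValueError("the smallest class needs at least 2 samples for SMOTE")
    return k


def make_smote(labels, random_state=12):
    """Build a SMOTE sampler whose k_neighbors fits the smallest class."""
    # Imported here so safe_k_neighbors works without imbalanced-learn installed.
    from imblearn.over_sampling import SMOTE
    return SMOTE(random_state=random_state, k_neighbors=safe_k_neighbors(labels))


print(safe_k_neighbors([0] * 96 + [1] * 4))
```

If a class really has only 4 rows while about 2% of the data was expected to be 1s, it is also worth checking that `train_X` and `train_y` were loaded with the intended rows and label column.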