I have the following pandas DataFrame:
df =
A B C
111-ABC 123 EEE
111-ABC 222 EEE
111-ABC 444 XXX
222-CCC 222 YYY
222-CCC 333 67T
333-DDD 123 TTT
333-DDD 123 BTB
333-DDD 444 XXX
333-DDD 555 AAA
I want to drop every group of rows (grouped by A) that does not contain 123 in column B.
The expected result looks like this (the row group 222-CCC is removed):
result =
A B C
111-ABC 123 EEE
111-ABC 222 EEE
111-ABC 444 XXX
333-DDD 123 TTT
333-DDD 123 BTB
333-DDD 444 XXX
333-DDD 555 AAA
How can I do this? I think I should start with groupby, but how do I filter out whole groups of rows rather than just individual rows?
result = df.groupby("A").... ??
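One way to approach the question above is `groupby(...).filter(...)`, which keeps or drops whole groups based on a predicate. The sketch below rebuilds the sample frame (assuming column B holds integers, which the question does not state) and also shows an equivalent boolean-mask variant using `transform`.

```python
import pandas as pd

# Sample data from the question; B is assumed to be an integer column.
df = pd.DataFrame({
    "A": ["111-ABC"] * 3 + ["222-CCC"] * 2 + ["333-DDD"] * 4,
    "B": [123, 222, 444, 222, 333, 123, 123, 444, 555],
    "C": ["EEE", "EEE", "XXX", "YYY", "67T", "TTT", "BTB", "XXX", "AAA"],
})

# Keep only the groups of A in which at least one row has B == 123.
result = df.groupby("A").filter(lambda g: (g["B"] == 123).any())

# Equivalent mask-based variant: mark each row with whether its group
# contains 123, then select the marked rows.
mask = df["B"].eq(123).groupby(df["A"]).transform("any")
result_mask = df[mask]

print(result)
```

The mask variant avoids calling a Python lambda per group, which may matter on frames with many groups.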
How can I delete a Kafka topic using the Kafka REST Proxy? I tried the following command, but it returns an error message:
curl -X DELETE XXX.XX.XXX.XX:9092/topics/test_topic
If that is not possible, how can I delete the messages and update the topic's schema?
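Two things may be worth checking here. Port 9092 is the default Kafka broker port, while the REST Proxy listens on 8082 by default, so the command above is probably talking to the broker directly. Also, topic deletion is only available where the proxy exposes the v3 admin API. The sketch below assumes such a proxy; the base URL, the cluster id, and the helper names `topic_url` and `delete_topic` are placeholders for illustration, not values from the question.

```python
import urllib.request


def topic_url(proxy_base, cluster_id, topic):
    """Build the v3 topic URL (assumes the proxy exposes the v3 admin API)."""
    return f"{proxy_base.rstrip('/')}/v3/clusters/{cluster_id}/topics/{topic}"


def delete_topic(proxy_base, cluster_id, topic):
    """Send an HTTP DELETE for the topic and return the response status code."""
    request = urllib.request.Request(
        topic_url(proxy_base, cluster_id, topic), method="DELETE"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical host and cluster id, shown only to illustrate the URL shape.
print(topic_url("http://localhost:8082", "my-cluster-id", "test_topic"))
```

The brokers must also allow topic deletion (`delete.topic.enable`) for the request to take effect.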
I get the following error stack trace when initializing a Java program:
Exception in thread "main" java.lang.VerifyError: class com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final method withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at com.fasterxml.jackson.module.scala.ser.IteratorSerializerModule$class.$init$(IteratorSerializerModule.scala:70)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:991)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:101)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) …
I have a user-defined function:
import math
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType
param1 = "A"
def calculate(type, pos):
    # The coefficients depend on param1, which is currently read as a global
    if param1 == "A":
        a, b = [0.05, -0.06]
    else:
        a, b = [0.15, -0.16]
    return a * math.pow(type, b) * max(pos, 1)
calc = udf(calculate, FloatType())
result = df.withColumn('col1', calc(col('type'), col('pos'))).groupBy('pk').sum('events')
I need to pass the parameter param1 to this udf. How can I do that?
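One common pattern for this is a closure: a factory function takes `param1` and returns the two-argument function that Spark calls per row. The sketch below is one possible arrangement, not the only one. The names `make_calculate` and `make_calc_udf` are made up for illustration, and PySpark is imported lazily so that the pure-Python part can be run without a Spark installation.

```python
import math


def make_calculate(param1):
    """Return a calculate(type_, pos) function with param1 baked in."""
    if param1 == "A":
        a, b = 0.05, -0.06
    else:
        a, b = 0.15, -0.16

    def calculate(type_, pos):
        return a * math.pow(type_, b) * max(pos, 1)

    return calculate


def make_calc_udf(param1):
    """Wrap the parameterised function as a Spark UDF (requires PySpark)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    return udf(make_calculate(param1), FloatType())


# Pure-Python check of the parameterised function.
print(make_calculate("A")(1.0, 0))
```

With Spark available, the call would look like `df.withColumn('col1', make_calc_udf("A")(col('type'), col('pos')))`.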
I have the following DataFrame df:
time_diff avg_trips_per_day
631 1.0
231 1.0
431 1.0
7031 1.0
17231 1.0
20000 20.0
21000 15.0
22000 10.0
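I want to create a histogram with time_diff on the X axis and avg_trips_per_day on the Y axis, in order to see how the values of time_diff are distributed. The Y axis should therefore not be the frequency with which X values repeat in df; it should be avg_trips_per_day. The problem is that I don't know how to put time_diff into bins so that it is treated as a continuous variable.
This is what I tried, but it puts every possible value of time_diff on the X axis.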
import matplotlib.pyplot as plt
import seaborn as sns
# Map avg_trips_per_day onto the "spring" colormap
norm = plt.Normalize(df["avg_trips_per_day"].values.min(), df["avg_trips_per_day"].values.max())
colors = plt.cm.spring(norm(df["avg_trips_per_day"]))
plt.figure(figsize=(12,8))
ax = sns.barplot(x="time_diff", y="avg_trips_per_day", data=df, palette=colors)
plt.xticks(rotation='vertical', fontsize=12)
# Light grey grid lines for major and minor ticks
ax.grid(True, which='major', color='#d3d3d3', linewidth=1.0)
ax.grid(True, which='minor', color='#d3d3d3', linewidth=0.5)
plt.show()
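A possible way to treat time_diff as continuous is to bin it with `pd.cut` and then aggregate avg_trips_per_day per bin. In the sketch below, the bin edges and the choice of the mean as the aggregate are assumptions, since the question specifies neither.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "time_diff": [631, 231, 431, 7031, 17231, 20000, 21000, 22000],
    "avg_trips_per_day": [1.0, 1.0, 1.0, 1.0, 1.0, 20.0, 15.0, 10.0],
})

# Assumed bin edges; intervals are right-closed, e.g. (15000, 20000].
bin_edges = [0, 5000, 10000, 15000, 20000, 25000]
df["time_bin"] = pd.cut(df["time_diff"], bins=bin_edges)

# One value per bin: the mean of avg_trips_per_day (NaN for empty bins).
binned = df.groupby("time_bin", observed=False)["avg_trips_per_day"].mean()

print(binned)
```

The resulting series has one row per bin, so it can be drawn with `binned.plot.bar()` or passed to `sns.barplot` after `reset_index()`.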
I have a DataFrame df that I convert into df_c with the following code:
df = pd.DataFrame(columns=["App","Feature1", "Feature2","Feature3",
"Feature4","Feature5",
"Feature6","Feature7","Feature8"],
data=[["SHA",0,0,1,1,1,0,1,0],
["LHA",1,0,1,1,0,1,1,0],
["DRA",0,0,0,0,0,0,1,0],
["FRA",1,0,1,1,1,0,1,1],
["BRU",0,0,1,0,1,0,0,0],
["PAR",0,1,1,1,1,0,1,0],
["AER",0,0,1,1,0,1,1,0],
["SHE",0,0,0,1,0,0,1,0]])
df_c = df.iloc[:, 1:].eq(1).sum().rename_axis('Feature').reset_index(name='Count')
Then I create a bar chart using matplotlib and seaborn:
plt.figure(figsize=(12,8))
ax = sns.barplot(x="Feature", y="Count", data=df_c, palette=sns.color_palette("GnBu", 10), order=df_c['Feature'])
plt.xticks(rotation='vertical')
# Light grey grid lines for major and minor ticks
ax.grid(True, which='major', color='#d3d3d3', linewidth=1.0)
ax.grid(True, which='minor', color='#d3d3d3', linewidth=0.5)
plt.show()
I want to sort the bars in ascending order from left to right. If I use order=df_c['Count'], the bars disappear.
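The `order` argument expects the category labels of the x variable (here the Feature names) in the desired sequence, not the counts themselves, which is probably why the bars disappear. A minimal sketch, assuming the frame from the question: sort df_c by Count and pass the resulting Feature column as the order.

```python
import pandas as pd

df = pd.DataFrame(
    columns=["App", "Feature1", "Feature2", "Feature3", "Feature4",
             "Feature5", "Feature6", "Feature7", "Feature8"],
    data=[["SHA", 0, 0, 1, 1, 1, 0, 1, 0],
          ["LHA", 1, 0, 1, 1, 0, 1, 1, 0],
          ["DRA", 0, 0, 0, 0, 0, 0, 1, 0],
          ["FRA", 1, 0, 1, 1, 1, 0, 1, 1],
          ["BRU", 0, 0, 1, 0, 1, 0, 0, 0],
          ["PAR", 0, 1, 1, 1, 1, 0, 1, 0],
          ["AER", 0, 0, 1, 1, 0, 1, 1, 0],
          ["SHE", 0, 0, 0, 1, 0, 0, 1, 0]],
)
df_c = (df.iloc[:, 1:].eq(1).sum()
          .rename_axis("Feature").reset_index(name="Count"))

# Sort by count (stable sort keeps ties in their original order) and
# use the feature names, not the counts, as the bar order.
df_sorted = df_c.sort_values("Count", kind="mergesort")
bar_order = df_sorted["Feature"].tolist()

print(bar_order)
```

The list can then be passed as `order=bar_order` in the `sns.barplot` call.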
I combine two PySpark DataFrames as follows:
exprs = [max(x) for x in ["col1","col2"]]
df = df1.union(df2).groupBy(['campk', 'ppk']).agg(*exprs)
But I get this error:
AssertionError: all exprs should be Column
What is wrong?
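One likely cause, assuming `max` in the snippet is the Python builtin rather than `pyspark.sql.functions.max`: the builtin applied to a string returns its largest character, so `exprs` ends up as a list of plain strings rather than Column objects. The sketch below shows that behaviour and a possible fix; the helper name `max_exprs` is made up for illustration, and PySpark is imported lazily so the first part runs without Spark.

```python
column_names = ["col1", "col2"]

# The builtin max() iterates over the characters of each name and
# returns the largest one, so these are strings, not Spark Columns.
builtin_exprs = [max(name) for name in column_names]
print(builtin_exprs)


def max_exprs(names):
    """Build Spark aggregate expressions with functions.max (requires PySpark)."""
    from pyspark.sql import functions as F

    return [F.max(name) for name in names]
```

With Spark available, the aggregation would become `df1.union(df2).groupBy(['campk', 'ppk']).agg(*max_exprs(["col1", "col2"]))`.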
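I have a problem retrieving data from S3 with the Amazon SDK. It retrieves only 1000 elements, while aws_bucket_data -> currentDataDirectory actually contains 10,000 elements. I don't use setMaxKeys(...), so the result seems strange.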
BasicAWSCredentials credentials = new BasicAWSCredentials("...", "...");
client = new AmazonS3Client(credentials);
ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
.withBucketName(aws_bucket_data)
.withPrefix(currentDataDirectory);
ObjectListing objectListing = client.listObjects(listObjectsRequest);
System.out.println(objectListing.getObjectSummaries().size());
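How can I solve this problem?
I have a DataFrame df that contains about 1 GB of data. Why does the command df.count() take a relatively long time to finish, while df.filter(...) is much faster? Is there a better way to estimate the number of entries in df that is faster than df.count()?
I want to create a smooth line chart with matplotlib and seaborn.
Here is my DataFrame df: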
hour direction hourly_avg_count
0 1 20
1 1 22
2 1 21
3 1 21
.. ... ...
24 1 15
0 2 24
1 2 28
... ... ...
The line chart should contain two lines, one for direction equal to 1 and the other for direction equal to 2. The X axis is hour and the Y axis is hourly_avg_count.
I tried this, but I can't see the lines.
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
sns.tsplot(df, time='hour', condition='direction', value='hourly_avg_count')
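One possible alternative to `tsplot` is to pivot the frame so that each direction becomes its own column, then draw one line per column. In the sketch below, the sample rows for hours 2 and 3 of direction 2 are made-up placeholders (the question elides them), and the centred rolling mean is only one assumed way of smoothing. `plot_directions` is a hypothetical helper that imports matplotlib lazily.

```python
import pandas as pd

# First rows from the question; the last two direction-2 values are
# hypothetical fillers for the rows the question elides.
df = pd.DataFrame({
    "hour": [0, 1, 2, 3, 0, 1, 2, 3],
    "direction": [1, 1, 1, 1, 2, 2, 2, 2],
    "hourly_avg_count": [20, 22, 21, 21, 24, 28, 26, 25],
})

# One column per direction, indexed by hour.
wide = df.pivot(index="hour", columns="direction", values="hourly_avg_count")

# Assumed smoothing: a centred 3-hour rolling mean.
smoothed = wide.rolling(window=3, center=True, min_periods=1).mean()


def plot_directions(frame):
    """Draw one line per direction column (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ax = frame.plot(figsize=(12, 8))
    ax.set_xlabel("hour")
    ax.set_ylabel("hourly_avg_count")
    plt.show()


print(smoothed)
```

Calling `plot_directions(smoothed)` draws the two lines; passing `wide` instead plots the unsmoothed values.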