I have installed Spark and its components locally, and I can run PySpark code in Jupyter, in IPython, and via spark-submit - but I get the following warnings:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/ayubk/spark-3.0.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/12/27 07:54:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

My code is getting stuck in the clean_up() method of MyClass():
my_class.py:
import os
import pandas as pd
import psycopg2, pymysql, pyodbc
from db_credentials_dict import db_credentials
class MyClass():
    def __init__(self, from_database, to_table_name, report_name):
        ...
    def get_sql(self):
        ...
    def get_connection(self):
        ...
    def query_to_csv(self):
        ...
    def csv_to_postgres(self):
        ...
    def extract_and_load(self):
        self.query_to_csv()
        self.csv_to_postgres()
    def get_final_sql(self):
        ...
    def postgres_to_csv(self):
        ...
    def clean_up(self):
        print('\nTruncating {}...'.format(self.to_table_name), end='')
        with self.postgres_connection.cursor() as cursor:
            cursor.execute("SELECT NOT EXISTS (SELECT 1 FROM %s)" % self.to_table_name)
            empty = cursor.fetchone()[0]
            if not empty:
                cursor.execute("TRUNCATE TABLE %s" % self.to_table_name)
                self.postgres_connection.commit()
        print('DONE')

What is the equivalent of the window function below in pandas?
COUNT(order_id) OVER(PARTITION BY city)
I can get a row_number or rank:
df['row_num'] = df.groupby('city').cumcount() + 1
but a count partitioned by city, as in the example above, is exactly what I'm looking for.
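For what it's worth, the usual pandas equivalent of a windowed aggregate is groupby(...).transform(...), which broadcasts the per-group result back to every row instead of collapsing the groups (the data here is made up; only the column names order_id and city come from the SQL above):

```python
import pandas as pd

# Hypothetical data using the column names from the SQL example
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "city": ["NY", "NY", "LA", "LA", "LA"],
})

# COUNT(order_id) OVER (PARTITION BY city):
# transform('count') returns one value per row, aligned with df's index
df["order_count"] = df.groupby("city")["order_id"].transform("count")
print(df["order_count"].tolist())  # [2, 2, 3, 3, 3]
```

Like COUNT(order_id), transform('count') counts only non-null values; use transform('size') if you want every row counted regardless of nulls.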
Query
SELECT ID, Name, Phone
FROM Table1
LEFT JOIN Table2 ON Table1.ID = Table2.ID
WHERE Table2.ID IS NULL
Question
Is Table2 omitted entirely, since nothing gets joined at all? Any help would be much appreciated.
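This pattern is a so-called anti-join: the LEFT JOIN keeps every Table1 row, filling Table2's columns with NULL where no match exists, and the WHERE clause then keeps exactly those unmatched rows. A small sketch with the standard-library sqlite3 module and made-up data (the ID is qualified as Table1.ID to avoid ambiguity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Table1 (ID INTEGER, Name TEXT, Phone TEXT)")
cur.execute("CREATE TABLE Table2 (ID INTEGER)")
cur.executemany("INSERT INTO Table1 VALUES (?, ?, ?)",
                [(1, "Ann", "555-1"), (2, "Bob", "555-2"), (3, "Cal", "555-3")])
cur.execute("INSERT INTO Table2 VALUES (2)")

# Rows of Table1 with no matching ID in Table2: the LEFT JOIN leaves
# Table2.ID as NULL for them, so WHERE Table2.ID IS NULL keeps them
rows = cur.execute("""
    SELECT Table1.ID, Name, Phone
    FROM Table1
    LEFT JOIN Table2 ON Table1.ID = Table2.ID
    WHERE Table2.ID IS NULL
""").fetchall()
print(rows)  # only IDs 1 and 3, which have no Table2 match
```

So Table2 is not omitted: it drives the filtering, even though none of its columns appear in the final result.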
How would I remove a list of substrings from a string? Essentially I have a URL column and want to avoid an overly long regex and multiple nested replace functions.
Is there a way to declare a list of substrings such as 'http', 'www.', etc., and strip them all from the column in one go?
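One common approach is to build a single alternation pattern from the list and apply it with Series.str.replace (the DataFrame, the url column name, and the to_remove list are all assumptions for illustration):

```python
import re
import pandas as pd

# Hypothetical URL column
df = pd.DataFrame({"url": ["https://www.example.com/page", "http://www.test.org/x"]})

# Order matters: longer alternatives first, so "https://" wins over "http://"
to_remove = ["https://", "http://", "www."]
pattern = "|".join(re.escape(s) for s in to_remove)

df["url_clean"] = df["url"].str.replace(pattern, "", regex=True)
print(df["url_clean"].tolist())  # ['example.com/page', 'test.org/x']
```

re.escape keeps literal characters like "." from being treated as regex metacharacters, so the list can grow without the pattern needing hand-tuning.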
Is there a way in PySpark to read a .tsv.gz file from a URL?
from pyspark.sql import SparkSession
def create_spark_session():
return SparkSession.builder.appName("wikipediaClickstream").getOrCreate()
spark = create_spark_session()
url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
# df = spark.read.csv(url, sep="\t") # doesn't work
df = spark.read.option("sep", "\t").csv(url) # doesn't work either
df.show(10)
I get the following error:
Py4JJavaError: An error occurred while calling o65.csv.
: java.lang.UnsupportedOperationException
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/var/folders/sn/4dk4tbz9735crf4npgcnlt8r0000gn/T/ipykernel_1443/4137722240.py in <module>
1 url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
2 # df = spark.read.csv(url, sep="\t")
----> 3 df = spark.read.option("sep", "\t").csv(url)
4 df.show(10)
spark.version is 3.1.2
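Spark's DataFrame readers expect a Hadoop-compatible filesystem path (local, HDFS, S3, etc.), not an http(s) URL, which is why the call fails. An untested sketch of one common workaround: have Spark fetch the file with SparkContext.addFile and then read the resulting local copy (the URL and app name are taken from the question; this assumes a working Spark installation and network access):

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikipediaClickstream").getOrCreate()

url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"

# addFile downloads the file to every node; SparkFiles.get resolves its local path
spark.sparkContext.addFile(url)
local_path = "file://" + SparkFiles.get("clickstream-jawiki-2017-11.tsv.gz")

df = spark.read.option("sep", "\t").csv(local_path)
df.show(10)
```

Downloading the file yourself (e.g. with urllib) and pointing spark.read.csv at the local path should work the same way; the .gz compression is handled transparently by the csv reader either way.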