I am trying to connect Databricks to my IDE.
I have not downloaded Spark and/or Scala on my machine, but I did install pyspark (pip install pyspark). I set the necessary environment variables and created a Hadoop folder, placed a bin folder inside it, and put a winutils.exe file inside that.
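For reference, this is roughly how those variables can be wired up in code before the session is created; C:\Hadoop is a placeholder for wherever the folder actually lives:

import os

# Point Hadoop/Spark at the folder that contains bin\winutils.exe.
# C:\Hadoop is a placeholder path; adjust it to the actual location.
os.environ["HADOOP_HOME"] = r"C:\Hadoop"
os.environ["PATH"] += os.pathsep + r"C:\Hadoop\bin"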
It was a step-by-step process that slowly but steadily resolved all of my errors, except for this last one:
import logging
from pyspark.sql import SparkSession
from pyspark import SparkConf

if __name__ == "__main__":
    # Create (or reuse) a local SparkSession and silence its logging.
    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setLogLevel("OFF")
which gives:
1/03/30 15:14:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" …

I am trying to translate the following SQL code into its pandas equivalent:
SELECT
    t.company,
    t.topic,
    t.statement
FROM
    (
        SELECT
            e.company,
            e.topic,
            e.probability,
            e.distance,
            LOWER(e.statement) AS statement,
            DENSE_RANK() OVER (PARTITION BY e.company, e.topic ORDER BY e.distance DESC) AS rank
        FROM
            esg.group_dist e
    ) t
WHERE
    t.rank = 1
    AND t.topic IN ('green energy')
ORDER BY
    company,
    topic,
    rank
So far I have gotten to:
esg_group_dist['rank'] = esg_group_dist[['company', 'topic', 'probability', 'distance', 'sentence']] \
    .sort_values(by=['distance']) \
    .groupby(['company', 'topic'])
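Here is a sketch of where I think this needs to end up, using groupby().rank(method='dense', ascending=False) to mimic the dense_rank() ... ORDER BY e.distance DESC window. I am assuming the column is named statement as in the SQL, even though my attempt above uses sentence, and I have not verified this against my data:

import pandas as pd

# dense_rank() OVER (PARTITION BY company, topic ORDER BY distance DESC):
# rank within each (company, topic) group, largest distance gets rank 1.
esg_group_dist['rank'] = (
    esg_group_dist
    .groupby(['company', 'topic'])['distance']
    .rank(method='dense', ascending=False)
)

# WHERE rank = 1 AND topic IN ('green energy'), LOWER(statement),
# then ORDER BY company, topic and project the three output columns.
result = (
    esg_group_dist[
        (esg_group_dist['rank'] == 1)
        & (esg_group_dist['topic'].isin(['green energy']))
    ]
    .assign(statement=lambda df: df['statement'].str.lower())
    .sort_values(['company', 'topic'])
    [['company', 'topic', 'statement']]
)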
I found the following SO thread, which should contain a solution, but I have not been able to apply it successfully to my use case.

Thanks!