Postgres JAR with EMR and Jupyter Notebook

Question

DBA*_*642 4 postgresql amazon-web-services amazon-emr jupyter-notebook

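I am trying to launch an EMR cluster that includes the Postgres driver JAR file so that I can load data from Postgres and analyze it with PySpark. The JAR I want to include is stored in S3. I have tried the following: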

1 - Entering the following configuration:

[
  {
    "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://example.net:5432/database",
      "connection-user": "MYUSER",
      "connection-password": "MYPASS"
    },
    "Configurations": []
  }
]

2 - Adding the JAR as a custom step (selecting the JAR from S3)

3 - Adding the JAR as a custom bootstrap action (selecting the JAR from S3)

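None of these work. I can't figure out how to use the connector from step 1 in Jupyter, and both the custom step and the custom bootstrap action fail when I launch the cluster. How can I launch an EMR cluster with the Postgres driver installed so that I can query the data from Jupyter?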

Edit:

I used the following bootstrap script to copy the JAR to my master/worker nodes:

#!/bin/bash
aws s3 cp s3://BUCKETNAME/postgresql-42.2.8.jar /mnt1/myfolder

But I still get the following error:

An error was encountered:
An error occurred while calling o90.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver

when running the following code:

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbcURL") \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("query", "select * from slm_files limit 100") \
    .load()

df.count()

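In the read above, "jdbcURL", "user" and "password" stand in for real values. As a rough sketch only, the same options could be collected in one dictionary, with the URL built in the jdbc:postgresql://host:port/database form used in the configuration from step 1. The helper names postgres_jdbc_url and jdbc_read_options are hypothetical, and the host, database, and credentials are the placeholders from the question, not real settings.

```python
def postgres_jdbc_url(host: str, port: int, database: str) -> str:
    """Return a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/database."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def jdbc_read_options(host, port, database, user, password, query):
    """Collect the options that the spark.read call sets one at a time."""
    return {
        "url": postgres_jdbc_url(host, port, database),
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
        "query": query,
    }


# Placeholder values copied from the question; replace them with real ones.
options = jdbc_read_options(
    "example.net", 5432, "database", "MYUSER", "MYPASS",
    "select * from slm_files limit 100",
)
print(options["url"])
```

In the notebook, the dictionary could then be passed as spark.read.format("jdbc").options(**options).load(). This only tidies up the call: the driver class still has to be on Spark's classpath, which is what the ClassNotFoundException is complaining about.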

Answer 1

DBA*_*642 7

Using this code in the first cell of my Jupyter notebook solved the problem for me:

%%configure -f
{ "conf":{
          "spark.jars": "s3://JAR-LOCATION/postgresql-42.2.8.jar"
         }
}

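Spark's spark.jars setting takes a comma-separated list of JARs to put on the driver and executor classpaths, so the same cell can carry more than one JAR. As a minimal sketch, assuming you want to generate the cell text rather than type it by hand, the JSON body could be built like this. The function name configure_cell_text is hypothetical, and the S3 location is the placeholder from the cell above.

```python
import json


def configure_cell_text(jar_uris):
    """Return the text of a %%configure cell that sets spark.jars.

    spark.jars is a single comma-separated string, so the URIs are joined.
    """
    conf = {"conf": {"spark.jars": ",".join(jar_uris)}}
    return "%%configure -f\n" + json.dumps(conf, indent=2)


# Placeholder S3 location, as in the answer above.
cell = configure_cell_text(["s3://JAR-LOCATION/postgresql-42.2.8.jar"])
print(cell)
```

The printed text is meant to be pasted into the first notebook cell, before any cell that uses the spark session.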

Archived:	4 years, 10 months ago
Views:	1,184
Last activity:	4 years, 9 months ago