I am running a Spark job on a Databricks cluster. The job is triggered from an Azure Data Factory pipeline every 15 minutes, and after three or four successful runs it fails with the exception "java.lang.OutOfMemoryError: GC overhead limit exceeded". There are plenty of answers for this error, but in most of those cases the job never runs at all, whereas in my case it fails only after several earlier runs succeed. My data size is less than 20 MB.
So my question is: what should I change in the cluster configuration? And if the problem is in my code, why does it succeed most of the time? Please advise and suggest a solution.
I use the following code to insert dataframe data directly into a Databricks Delta table:
eventDataFrame.write.format("delta").mode("append").option("inferSchema","true").insertInto("some delta table")
However, if the Delta table was created with a column order different from the dataframe's column order, the values get mixed up and are not written to the correct columns. How can I preserve the order? Is there a standard approach / best practice for doing this?
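A minimal PySpark sketch of one way to keep the columns aligned (the table name some_delta_table is a placeholder): insertInto writes by column position rather than by column name, so the dataframe can be reordered to match the table's schema before the write.

# Reorder the dataframe to the target table's column order before insertInto,
# because insertInto matches columns by position, not by name.
target_cols = [f.name for f in spark.table("some_delta_table").schema.fields]

(eventDataFrame
    .select(*target_cols)            # align column order with the table definition
    .write
    .format("delta")
    .mode("append")
    .insertInto("some_delta_table"))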
I am working on a fairly large project. I need to use azure-security-keyvault-secrets, so I added the following to my pom.xml file:
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-security-keyvault-secrets</artifactId>
    <version>4.0.1</version>
</dependency>
<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-identity</artifactId>
    <version>1.0.1</version>
</dependency>
When I run this sample code:
import com.azure.identity.DefaultAzureCredentialBuilder
import com.azure.security.keyvault.secrets.SecretClientBuilder

object Test {
  def main(args: Array[String]): Unit = {
    // get vault name from system env -> databricks
    val secretClient = new SecretClientBuilder()
      .vaultUrl("https://myVault.vault.azure.net/")
      .credential(new DefaultAzureCredentialBuilder().build())
      .buildClient

    val secret = secretClient.getSecret("AzureAccountName")
    println("===================== " + secret.getValue)
  }
}
I get the following error:
java.lang.NoSuchMethodError: com.fasterxml.jackson.datatype.jsr310.deser.JSR310DateTimeDeserializerBase.findFormatOverrides(Lcom/fasterxml/jackson/databind/DeserializationContext;Lcom/fasterxml/jackson/databind/BeanProperty;Ljava/lang/Class;)Lcom/fasterxml/jackson/annotation/JsonFormat$Value;
at com.fasterxml.jackson.datatype.jsr310.deser.JSR310DateTimeDeserializerBase.createContextual(JSR310DateTimeDeserializerBase.java:79)
at com.fasterxml.jackson.datatype.jsr310.deser.InstantDeserializer.createContextual(InstantDeserializer.java:241)
at com.fasterxml.jackson.databind.DeserializationContext.handleSecondaryContextualization(DeserializationContext.java:669)
at com.fasterxml.jackson.databind.DeserializationContext.findContextualValueDeserializer(DeserializationContext.java:430)
at com.fasterxml.jackson.databind.deser.std.StdDeserializer.findDeserializer(StdDeserializer.java:947)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.resolve(BeanDeserializerBase.java:439)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:296)
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244)
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142)
at com.fasterxml.jackson.databind.DeserializationContext.findContextualValueDeserializer(DeserializationContext.java:428)
at com.fasterxml.jackson.databind.deser.std.StdDeserializer.findDeserializer(StdDeserializer.java:947)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.resolve(BeanDeserializerBase.java:439)
at …

I am trying to convert a Spark dataframe to Delta format using the sample code provided in the documentation, but I always get this strange error. Could you help or point me in the right direction?
df_sdf.write.format("delta").save("/mnt/.../delta/")
The error looks like this:
org.apache.spark.SparkException: Job aborted.
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-3011941952225495> in <module>
----> 1 df_sdf.write.format("delta").save("/mnt/.../delta/")

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    737             self._jwrite.save()
    738         else:
--> 739             self._jwrite.save(path)
    740
    741     @since(1.4)

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
I tried to check whether a path exists in Databricks using Python:
try:
    dirs = dbutils.fs.ls("/my/path")
    pass
except IOError:
    print("The path does not exist")
If the path does not exist, I expect the except block to execute. Instead, the try block fails with the following error rather than dropping into the except block:
java.io.FileNotFoundException: GET ...
ErrorMessage=The specified path does not exist.
How do I catch the FileNotFoundException correctly?
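A minimal sketch of one possible workaround (assuming the notebook's built-in dbutils object): the failure surfaces as a Py4J-wrapped Java exception rather than a Python IOError, so one option is to catch a broad exception and inspect its message.

def path_exists(path):
    """Return True if the DBFS path exists, False if it does not."""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        # The Java-side FileNotFoundException arrives wrapped, so match on the message.
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise  # re-raise anything that is not a missing-path error

if not path_exists("/my/path"):
    print("The path does not exist")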
I am working on Azure Databricks. My driver node and worker nodes have the following specification: 14.0 GB memory, 4 cores, 0.75 DBU (Standard_DS3_v2).
My PySpark notebook fails with a Java heap space error. I searched online, and one suggestion was to increase the driver memory. I am trying to use the following conf call in the notebook
spark.conf.get("spark.driver.memory")
to get the driver memory, but the notebook fails with this error:
java.util.NoSuchElementException: spark.driver.memory
Any idea how to check the driver memory and change its value?
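A minimal sketch of one way to read the setting without the exception (assuming the standard Databricks spark session): both spark.conf.get and SparkConf.get accept a default value, which avoids the NoSuchElementException when the key has not been set explicitly. Changing the value itself is normally done in the cluster's Spark configuration before startup rather than from the notebook.

# Read the driver memory with a fallback default instead of raising NoSuchElementException.
print(spark.conf.get("spark.driver.memory", "not explicitly set"))

# Alternatively, inspect the SparkContext configuration.
print(spark.sparkContext.getConf().get("spark.driver.memory", "not explicitly set"))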
I am trying to provision a Databricks PAT token using a null_resource with a local-exec provisioner. Here is the code block:
resource "null_resource" "databricks_token" {
triggers = {
workspace = azurerm_databricks_workspace.databricks.id
key_vault_access = azurerm_key_vault_access_policy.terraform.id
}
provisioner "local-exec" {
command = "${path.cwd}/generate-pat-token.sh"
environment = {
RESOURCE_GROUP = var.resource_group_name
DATABRICKS_WORKSPACE_RESOURCE_ID = azurerm_databricks_workspace.databricks.id
KEY_VAULT = azurerm_key_vault.databricks_token.name
SECRET_NAME = "DATABRICKS-TOKEN"
DATABRICKS_ENDPOINT = "https://westeurope.azuredatabricks.net"
}
}
}
However, I get the following error:
2020-02-26T19:41:51.9455473Z null_resource.databricks_token: Provisioning with 'local-exec'...
2020-02-26T19:41:51.9458257Z null_resource.databricks_token (local-exec): Executing: ["/bin/sh" "-c" "/home/vsts/work/r1/a/_Infrastructure/Infrastructure/ei-project/devtest/generate-pat-token.sh"]
2020-02-26T19:41:51.9480441Z null_resource.databricks_token (local-exec): /bin/sh: 1: /home/vsts/work/r1/a/_Infrastructure/Infrastructure/ei-project/devtest/generate-pat-token.sh: Permission denied
2020-02-26T19:41:51.9481502Z
2020-02-26T19:41:52.0386092Z
2020-02-26T19:41:52.0399075Z Error: Error running command '/home/vsts/work/r1/a/_Infrastructure/Infrastructure/ei-project/devtest/generate-pat-token.sh': exit …

I am running PySpark from an Azure Machine Learning notebook, and I am trying to move a file using the dbutils module.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def get_dbutils(spark):
    try:
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
    except ImportError:
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils

dbutils = get_dbutils(spark)

dbutils.fs.cp("file:source", "dbfs:destination")
I get this error: ModuleNotFoundError: No module named 'pyspark.dbutils'. Is there a workaround for this?
Here is the error from another Azure Machine Learning notebook:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-183f003402ff> in get_dbutils(spark)
4 try:
----> 5 from pyspark.dbutils import DBUtils
6 dbutils = DBUtils(spark)
ModuleNotFoundError: No module named 'pyspark.dbutils'
During handling of the above exception, another exception occurred:
KeyError …
Following the steps documented for Structured Streaming in PySpark, I am unable to create a dataframe in PySpark to read streaming data from the Azure Event Hub I have set up.
The error message is: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.eventhubs.EventHubsSourceProvider could not be instantiated
I have installed the Maven libraries (com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12 is not available), but none of them seem to work: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.15, com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
But ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString) returns the error message:
java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
The connection string is correct, as it is also used in a console application that writes to the Azure Event Hub and works.
Could someone please point me in the right direction? The code in use is as follows:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Event Hub Namespace Name
NAMESPACE_NAME = "*myEventHub*"
KEY_NAME = "*MyPolicyName*"
KEY_VALUE = "*MySharedAccessKey*"
# The connection string to your Event Hubs Namespace
connectionString = "Endpoint=sb://{0}.servicebus.windows.net/;SharedAccessKeyName={1};SharedAccessKey={2};EntityPath=ingestion".format(NAMESPACE_NAME, KEY_NAME, KEY_VALUE)
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
# For 2.3.15 version and above, the configuration dictionary requires that connection string …
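For reference, a minimal sketch of the read pattern from the azure-eventhubs-spark documentation (assuming the notebook's built-in spark and sc objects; the connector version must match the cluster's Spark and Scala versions, and a mismatch there is a common cause of NoSuchMethodError failures like the one above):

# Encrypt the connection string (required by connector versions 2.3.15 and above),
# then create a streaming dataframe from the Event Hub.
ehConf["eventhubs.connectionString"] = (
    sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
)

df = (spark.readStream
      .format("eventhubs")
      .options(**ehConf)
      .load())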
I have installed the databricks cli tool by running the following command:

pip install databricks-cli

using the version of pip appropriate for your Python installation (if you are using Python 3, run pip3).
Then, having created a PAT (personal access token) in Databricks, I ran the following .sh bash script:
# You can run this on Windows as well, just change it to a batch file
# Note: You need the Databricks CLI installed and you need a token configured
#!/bin/bash
echo "Creating DBFS directory"
dbfs mkdirs dbfs:/databricks/packages
echo "Uploading cluster init script"
dbfs cp --overwrite python_dependencies.sh dbfs:/databricks/packages/python_dependencies.sh
echo "Listing DBFS directory"
dbfs ls dbfs:/databricks/packages
The python_dependencies.sh script:
#!/bin/bash
# Restart …