How do I use Avro on HDInsight Spark/Jupyter?

Jie*_*eng 6 azure hdinsight jupyter

I am trying to read an Avro file on an HDInsight Spark/Jupyter cluster, but I get the following error:

u'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 159, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'
The call that raises it:

df = spark.read.format("com.databricks.spark.avro").load("wasb://containername@aaa...aaa.blob.core.windows.net/...")

How can I fix this? It looks like I need to install a package, but how do I install one on HDInsight?

Tar*_*ani 5

You just need to follow this article:

https://docs.microsoft.com/zh-cn/azure/hdinsight/spark/apache-spark-jupyter-notebook-use-external-packages

For HDInsight 3.3 and HDInsight 3.4

Add the following to a notebook cell:

%%configure 
{ "packages":["com.databricks:spark-avro_2.10:0.1"] }
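The package string in these cells is a Maven coordinate of the form group:artifact:version, and for Scala libraries the artifact name normally carries the Scala binary version as a suffix (`_2.10`, `_2.11`). That suffix generally has to match the Scala version of the cluster's Spark build, which is why the coordinate differs between HDInsight versions. A minimal Python sketch that splits a coordinate into its parts; `Coordinate` and `parse_coordinate` are hypothetical names used only for illustration, not part of Spark or HDInsight:

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    group: str
    artifact: str
    scala_version: str
    version: str


def parse_coordinate(coord: str) -> Coordinate:
    """Split 'group:artifact_scalaVersion:version' into its parts.

    Assumes the artifact name ends in a Scala suffix such as '_2.11';
    this is an illustrative helper, not a Spark API.
    """
    group, artifact_id, version = coord.split(":")
    # Split on the last underscore: 'spark-avro_2.11' -> 'spark-avro', '2.11'
    artifact, _, scala_version = artifact_id.rpartition("_")
    return Coordinate(group, artifact, scala_version, version)


print(parse_coordinate("com.databricks:spark-avro_2.11:4.0.0"))
```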

For HDInsight 3.5

Add the following to a notebook cell:

%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.10:0.1" }}

For HDInsight 3.6

Add the following to a notebook cell:

%%configure
{ "conf": {"spark.jars.packages": "com.databricks:spark-avro_2.11:4.0.0" }}
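The three cases above differ only in the shape of the JSON that follows %%configure: a top-level "packages" list for 3.3/3.4, and a "spark.jars.packages" entry (a comma-separated string) under "conf" for 3.5/3.6. As a rough sketch of that mapping, the hypothetical Python helper below (`configure_body` is not part of sparkmagic or HDInsight) builds the JSON body for a given version; it assumes any version other than 3.3 or 3.4 uses the "conf" form:

```python
import json
from typing import List


def configure_body(hdinsight_version: str, coordinates: List[str]) -> str:
    """Return the JSON to paste under %%configure (illustrative only)."""
    if hdinsight_version in ("3.3", "3.4"):
        # Older clusters: packages are given as a JSON list.
        payload = {"packages": coordinates}
    else:
        # Assumption: 3.5 and later take a comma-separated string in Spark conf.
        payload = {"conf": {"spark.jars.packages": ",".join(coordinates)}}
    return json.dumps(payload)


print(configure_body("3.6", ["com.databricks:spark-avro_2.11:4.0.0"]))
```

The %%configure cell should run before the Spark session starts. Once the package has been resolved, the spark.read.format("com.databricks.spark.avro") call from the question should be able to find the data source.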