I am trying to enable the Hadoop native libraries and snappy for compression in Hadoop 2.2.0, but I always end up with:
./hadoop/bin/hadoop checknative -a
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
I compiled hadoop-2.2.0-src from scratch for x64 and put the resulting .so files into hadoop/lib/native/. I also built snappy from source and put it there. As a different approach, I installed snappy via sudo apt-get and symlinked the resulting .so to hadoop/lib/native/libsnappy.so, still with no luck.
What is going on here? Why can't Hadoop find my native libraries? Are there any logs I can check to see what went wrong during loading?
I just unpacked and installed Spark 1.6.0 into an environment with a fresh install of Hadoop 2.6.0 and Hive 0.14.
I have verified that Hive, Beeline, and MapReduce run fine on the sample jobs.
However, as soon as I run sc.textFile() in spark-shell, it returns an error:
$ spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val textFile = sc.textFile("README.md")
java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so: /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so: failed to map segment from …
I installed the following modules on an EC2 server that already had Python (3.6) and Anaconda installed:
Except for fastparquet, everything else imports fine. When I try to import fastparquet, it throws the following error:
[username@ip8 ~]$ conda -V
conda 4.2.13
[username@ip-~]$ python
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
import fastparquet
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/__init__.py", line 15, in <module>
from .core import read_thrift
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/core.py", line 11, in <module>
from .compression import decompress_data
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/compression.py", line 43, in <module> …Run Code Online (Sandbox Code Playgroud) 如何在 python 3.5 中打开 .snappy.parquet 文件?到目前为止,我使用了这个代码:
How do I open a .snappy.parquet file in Python 3.5? So far I have used this code:
import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()
However, it gives this error:
AttributeError: module 'pyarrow' has no attribute 'compat'
PS: I installed pyarrow this way:
pip install pyarrow
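For reference, a minimal sketch of the same read against a current pyarrow, importing the pyarrow.parquet submodule explicitly (a bare import pyarrow does not expose it in every version). It assumes the compat error comes from a stale pyarrow/pandas pairing, so upgrading both packages first is part of the sketch:

# Hedged sketch: assumes pyarrow and pandas were upgraded first, e.g.
#   pip install --upgrade pyarrow pandas
import pyarrow.parquet as pq  # import the submodule explicitly

filename = "/Users/T/Desktop/data.snappy.parquet"

# read_table handles the snappy codec transparently as long as pyarrow was
# built with snappy support (the PyPI wheels normally are)
df = pq.read_table(filename).to_pandas()
print(df.head())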
I have been using the latest R arrow package (arrow_2.0.0.20201106), which supports reading from and writing to AWS S3 directly (which is great).
I seem to have no problem writing and reading my own files (see below):
write_parquet(iris, "iris.parquet")
system("aws s3 mv iris.parquet s3://myawsbucket/iris.parquet")
df <- read_parquet("s3://myawsbucket/iris.parquet")
However, when I try to read one of the example R arrow files, I get the following error:
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
Error in parquet___arrow___FileReader__ReadTable1(self) :
IOError: NotImplemented: Support for codec 'snappy' not built
When I check whether the codec is available, it looks like it is not:
codec_is_available(type="snappy")
[1] FALSE
Does anyone know a way to make the snappy codec available?
Thanks, Mike
############
Thanks to @Neal's answer below. Here is the code that installed all the required dependencies for me.
Sys.setenv(ARROW_S3="ON")
Sys.setenv(NOT_CRAN="true")
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
I am building a CDC pipeline that reads the MySQL binlog through Maxwell and pushes the events into Kafka; the compression type in my Maxwell config is snappy. But on the consumer side, in my Spring project, I get this error:
org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64
at org.xerial.snappy.SnappyLoader.findNativeLibrary(SnappyLoader.java:361) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:195) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:167) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.Snappy.init(Snappy.java:69) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:46) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:435) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:466) ~[snappy-java-1.1.7.7.jar:1.1.7.7]
at java.base/java.io.DataInputStream.readByte(DataInputStream.java:271) ~[na:na]
at org.apache.kafka.common.utils.ByteUtils.readUnsignedVarint(ByteUtils.java:170) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.utils.ByteUtils.readVarint(ByteUtils.java:205) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.record.DefaultRecord.readFrom(DefaultRecord.java:296) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.record.DefaultRecordBatch$2.doReadRecord(DefaultRecordBatch.java:278) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.record.DefaultRecordBatch$StreamRecordIterator.readNext(DefaultRecordBatch.java:617) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.record.DefaultRecordBatch$RecordIterator.next(DefaultRecordBatch.java:582) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.common.record.DefaultRecordBatch$RecordIterator.next(DefaultRecordBatch.java:551) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.internals.Fetcher$CompletedFetch.nextFetchedRecord(Fetcher.java:1578) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.internals.Fetcher$CompletedFetch.fetchRecords(Fetcher.java:1613) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.internals.Fetcher$CompletedFetch.access$1700(Fetcher.java:1454) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:687) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:638) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1299) ~[kafka-clients-2.7.2.jar:na]
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1233) ~[kafka-clients-2.7.2.jar:na]
at …
I am running the following code in Hive v0.12.0. I expect the three tables to be compressed with different methods, so the size and contents of the files should differ.
--- Create table and compress it with ZLIB
create table zzz_test_szlib
  stored as orc
  tblproperties ("orc.compress"="ZLIB")
  as
select * from uk_pers_dev.orc_dib_trans limit 100000000;

--- Create table and compress it with SNAPPY
create table zzz_test_ssnap
  stored as orc
  tblproperties ("orc.compress"="SNAPPY")
  as
select * from uk_pers_dev.orc_dib_trans limit 100000000;

--- Create table and DO NOT compress it
create table zzz_test_snone
  stored as orc
  tblproperties ("orc.compress"="NONE")
  as
select * from uk_pers_dev.orc_dib_trans limit 100000000;

When I check the table metadata with DESCRIBE or through Hue, I get:
Name             Value   Value   Value
---------------- …

I compressed a file with python-snappy and put it into my HDFS store. I am now trying to read it as shown below, but I get the traceback that follows. I cannot find an example of how to read the file so that I can process it. I can read the plain-text (uncompressed) version just fine. Should I be using sc.sequenceFile? Thanks!
I first compressed the file and pushed it to hdfs
python-snappy -m snappy -c gene_regions.vcf gene_regions.vcf.snappy
hdfs dfs -put gene_regions.vcf.snappy /
I then added the following to spark-env.sh
export SPARK_EXECUTOR_MEMORY=16G
export HADOOP_HOME=/usr/local/hadoop
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$HADOOP_HOME/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$HADOOP_HOME/lib/lib/snappy-java-1.1.1.8-SNAPSHOT.jar
I then launch my spark master and slave and finally my ipython notebook where I am executing the code below.
a_file = sc.textFile("hdfs://master:54310/gene_regions.vcf.snappy")
a_file.first()
ValueError Traceback (most recent call last) in () ----> 1 a_file.first()
/home/user/Software/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.pyc in …
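The traceback is truncated above. For reference, a hedged PySpark sketch of one way to handle a file written with python-snappy's stream (framing) format, which python -m snappy -c produces and which Hadoop's SnappyCodec cannot read: pull the raw bytes with binaryFiles and decompress them with python-snappy on the executors. It assumes python-snappy is installed on every executor and that the Spark version provides sc.binaryFiles; newer python-snappy releases also offer a Hadoop-compatible stream mode, which may be the simpler route.

# Hedged sketch, not the original poster's code: assumes python-snappy is
# installed on every executor and that the file uses python-snappy's default
# stream (framing) format rather than the Hadoop snappy codec.
import io
import snappy  # python-snappy

def decompress_whole_file(path_and_bytes):
    path, raw = path_and_bytes
    out = io.BytesIO()
    # stream_decompress works on file-like objects and undoes "python -m snappy -c"
    snappy.stream_decompress(io.BytesIO(raw), out)
    return out.getvalue().decode("utf-8")

lines = (sc.binaryFiles("hdfs://master:54310/gene_regions.vcf.snappy")
           .map(decompress_whole_file)
           .flatMap(lambda text: text.splitlines()))

print(lines.first())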
I want to configure my application to use lz4 compression instead of snappy. What I did is:
session = SparkSession.builder()
.master(SPARK_MASTER) //local[1]
.appName(SPARK_APP_NAME)
.config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
.getOrCreate();
But looking at the console output, it is still using snappy in the executor:
org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
and
[Executor task launch worker-0] compress.CodecPool (CodecPool.java:getCompressor(153)) - Got brand-new compressor [.snappy]
According to this post, what I did only configures the driver, not the executors. The solution in the post is to change the spark-defaults.conf file, but I am running Spark in local mode and I do not have that file anywhere.
I need to run the application in local mode (for unit tests). The tests run fine locally on my machine, but when I submit them to the build engine (RHEL5_64), I get this error:
snappy-1.0.5-libsnappyjava.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found
I did some research, and it seems the simplest workaround is to use lz4 instead of snappy for the codec, so I tried the solution above.
I have been stuck on this problem for hours now; any help is appreciated, thanks.
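For reference, a minimal PySpark sketch of the distinction the console output points at (the question uses the Java API, but the configuration keys are the same). The assumption here is that the SNAPPY message comes from Parquet's own writer codec, which is governed by spark.sql.parquet.compression.codec, while spark.io.compression.codec only covers Spark's internal shuffle, broadcast, and RDD compression:

# Hedged sketch: a local-mode session that avoids native snappy for Parquet output.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("codec-demo")
    # internal Spark compression; the short name maps to LZ4CompressionCodec
    .config("spark.io.compression.codec", "lz4")
    # the setting that actually decides how Parquet files are written;
    # gzip (or uncompressed) sidesteps the native snappy library entirely
    .config("spark.sql.parquet.compression.codec", "gzip")
    .getOrCreate()
)

spark.range(10).write.mode("overwrite").parquet("/tmp/codec_demo_parquet")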
I have a bunch of snappy-compressed JSON files in HDFS. They are Hadoop-snappy compressed (not python-snappy, see other SO questions) and have a nested structure.
Is there a way to load them into Hive (using json_tuple)?
Can I get some resources/hints on how to load them?
Previous references (no working answers):
snappy ×10
apache-spark ×4
hadoop ×4
hive ×3
python ×2
anaconda ×1
apache-arrow ×1
apache-kafka ×1
apple-m1 ×1
cloudera ×1
compression ×1
conda ×1
fastparquet ×1
java ×1
json ×1
parquet ×1
pyspark ×1
r ×1