The Spark documentation states that the default zstd compression level is 1: https://spark.apache.org/docs/latest/configuration.html
I have set this property to different values, both in spark-defaults.conf and in code like this:
val conf = new SparkConf(false)
conf.set("spark.io.compression.zstd.level", "22")
val spark = SparkSession.builder.config(conf).getOrCreate()
..
Reading the same input several times and saving/writing it as Parquet with zstd compression does not change the output file size at all. How do I tune this compression level in Spark?
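For reference, a PySpark sketch of the same experiment (the input/output paths are placeholders; note that per the Spark configuration docs, spark.io.compression.zstd.level applies to Spark's internal I/O codec used for shuffle and broadcast data, while the Parquet output codec is configured separately via spark.sql.parquet.compression.codec):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Level for Spark's internal I/O codec (shuffle/broadcast), as in the Scala snippet above.
    .config("spark.io.compression.zstd.level", "22")
    # Codec used when writing Parquet files; configured separately from the line above.
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)

df = spark.read.parquet("/path/to/input")            # placeholder input path
df.write.mode("overwrite").parquet("/path/to/out")   # placeholder output path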
I'm still a beginner with Python, but a school project requires me to run classification algorithms on this Reddit popularity dataset. The files are huge .zst files, which can be found here: https://files.pushshift.io/reddit/submissions/ Anyway, I'm just not sure how to extract the data into a database, since in our assignments so far I've only worked with .csv datasets that I could easily load into a pandas dataframe. I came across another post and tried this code:
import zstandard as zstd  # the snippet assumes python-zstandard is imported as zstd

def transform_zst_file(self, infile):
    zst_num_bytes = 2**22          # read the decompressed stream in 4 MiB chunks
    lines_read = 0
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(infile) as reader:
        previous_line = ""
        while True:
            chunk = reader.read(zst_num_bytes)
            if not chunk:
                break
            string_data = chunk.decode('utf-8')
            lines = string_data.split("\n")
            for i, line in enumerate(lines[:-1]):
                if i == 0:
                    # glue the tail of the previous chunk onto its first line
                    line = previous_line + line
                self.appendData(line, self.type)
                lines_read += 1
                if self.max_lines_to_read and lines_read >= self.max_lines_to_read:
                    return
            previous_line = lines[-1]
But I'm not entirely sure how to get this into a pandas dataframe, or, if the file is too large, how to put only a certain percentage of the data points into the dataframe. Any help would be greatly appreciated!
Every time I try to run the following code, it just crashes my computer: …
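(Not the crashing snippet referenced just above, which was cut off here.) For what it's worth, a rough sketch of how one might stream such a file and load only the first N records into pandas, assuming each decompressed line is one JSON submission; the file name, the record cap, and max_window_size are my assumptions, not from the original post:

import io
import json

import pandas as pd
import zstandard as zstd

MAX_RECORDS = 100_000  # assumed cap so the whole file never has to fit in memory

records = []
with open("RS_2019-04.zst", "rb") as fh:  # placeholder file name
    # Newer pushshift dumps use a long zstd window, hence the large max_window_size.
    dctx = zstd.ZstdDecompressor(max_window_size=2**31)
    with dctx.stream_reader(fh) as reader:
        for i, line in enumerate(io.TextIOWrapper(reader, encoding="utf-8")):
            if i >= MAX_RECORDS:
                break
            records.append(json.loads(line))  # one submission per line

df = pd.DataFrame(records)
print(df.shape)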
I'm having trouble installing heaptrack with CMake when running cmake -DCMAKE_BUILD_TYPE=Release .. from heaptrack/build:
-- Could NOT find ZSTD (missing: ZSTD_LIBRARY ZSTD_INCLUDE_DIR)
CMake Error at 3rdparty/libbacktrace/CMakeLists.txt:160 (message):
Could not find dwarf.h, try installing the dwarf or elfutils development
package.
-- Configuring incomplete, errors occurred!
I'm trying to decompress a large number of compressed files with zstd v1.4.0 through the Windows terminal and then search them with "ag":
zstd -dc -r . | ag -z -i "term"
While it's running, it gives me the following error:
zstd : error 70 : Write error : Broken pipe (cannot write decoded block)
I've spent hours looking for a solution and tried other options of the zstd command, but couldn't resolve it.
I'm using docker-compose under Windows 10 like this:
version: '3'
services:
  mongo:
    image: mongo:4.2
    ports:
      - "27017:27017"
    restart: always
    volumes:
      - type: bind
        source: ${PWD}/mongod.conf
        target: /etc/mongod.conf
    entrypoint: ["mongod", "--bind_ip_all", "--config", "/etc/mongod.conf"]
My mongod.conf:
storage:
  wiredTiger:
    collectionConfig:
      blockCompressor: zstd
      configString: "allocation_size=64KB,internal_page_max=64KB,leaf_page_max=64KB"
When I run docker exec docker_mongo_1 df with and without zstd, I still see the same disk usage. With zstd:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 65792556 49263808 13156972 79% /data/db
Without:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 65792556 47991952 14428828 77% /data/db
(The slight variation is due to the randomness of the inserted data, but the number of documents and their sizes are within a few percent of each other.) I used mongodump from snappy and mongorestore …
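As a quick back-of-the-envelope check on the two df listings above (the numbers are copied from them; the comparison itself is mine, not the original poster's):

# "Used" 1K-blocks reported by df in the two runs above
used_with_zstd = 49_263_808
used_without_zstd = 47_991_952

diff_kb = used_with_zstd - used_without_zstd
print(diff_kb, "KB")                                    # 1,271,856 KB, i.e. about 1.2 GiB
print(f"{diff_kb / used_without_zstd:.1%} difference")  # roughly 2.7%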
I've tried the common solutions written up for this mysqlclient error:
brew install mysql-connector-c
LDFLAGS=-L/usr/local/opt/openssl/lib pip install mysqlclient
brew install zstd
and the MySQL server runs fine. But the error is still not fixed..
I'm getting a clang linker error saying library not found for -lzstd:
clang -bundle -undefined dynamic_lookup -L/usr/local/opt/readline/lib -L/usr/local/opt/readline/lib -L/Users/user/.asdf/installs/python/3.7.10/lib -L/usr/local/opt/llvm/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -L/usr/local/opt/readline/lib -L/usr/local/opt/readline/lib -L/Users/user/.asdf/installs/python/3.7.10/lib -L/usr/local/opt/llvm/lib -L/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -L/usr/local/opt/llvm/lib -I/usr/local/opt/llvm/include build/temp.macosx-11.2-x86_64-3.7/MySQLdb/_mysql.o -L/usr/local/Cellar/mysql/8.0.25_1/lib -lmysqlclient -lzstd -lresolv -o build/lib.macosx-11.2-x86_64-3.7/MySQLdb/_mysql.cpython-37m-darwin.so
ld: library not found for -lzstd
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'clang' failed with exit status 1
I'm trying to sift through a large database compressed in a .zst file. I know I could simply decompress it and then work on the resulting file, but that would take up a lot of space on my SSD and over 2 hours, so I'd like to avoid that if possible.
Normally, when I work with large files, I stream them line by line with code like
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
I know gzip has this:
with gzip.open(filename, 'rt') as f:
    for line in f:
        do_something(line)
but it doesn't seem to work for .zst, so I'm wondering whether there is any library that can decompress and stream the decompressed data in a similar way. For example:
with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)
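For illustration, a sketch of that pattern using the zstandard package (stream_reader wrapped in io.TextIOWrapper; the file name is a placeholder, and do_something stands in for the per-line processing from the snippets above):

import io
import zstandard as zstd

def do_something(line):
    pass  # placeholder for the per-line processing from the snippets above

with open("big_database.zst", "rb") as fh:  # placeholder file name
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(fh) as reader:
        # TextIOWrapper turns the binary decompressed stream into text lines,
        # so only a buffer's worth of data is held in memory at a time.
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            do_something(line)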
I have very large zstd-compressed text files.
How can I do fast searches on them?
Can I use ag (The Silver Searcher) or something similar?
I tried ag but it doesn't work; I get a "failed to load" error:
ag -z -i "term"
I've been trying to download this dataset through my Mac terminal. I know it's huge! https://zenodo.org/record/3606810
I have the tar.zst file, and when I try to decompress it (using zstd -d pol_0616-1119_labeled.tar.zst), it throws this error:
1119_labeled.tar.zst : Read error (39) : premature end
I've searched like crazy for a way to fix this. Is there something obvious I'm missing? Thanks in advance for any help.
I'm running Spark 3.0.1 on Kubernetes with user-provided Hadoop 3.2.0 and Scala 2.12.10.
Everything works fine when reading a Parquet file compressed with snappy, but when I try to read a Parquet file compressed with zstd, several tasks fail with the following error:
java.io.IOException: Decompression error: Version not supported
at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:164)
at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:120)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2781)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2797)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3274)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:934)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:396)
at org.apache.spark.MapOutputTracker$.deserializeObject$1(MapOutputTracker.scala:954)
at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:964)
at org.apache.spark.MapOutputTrackerWorker.$anonfun$getStatuses$2(MapOutputTracker.scala:856)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at org.apache.spark.MapOutputTrackerWorker.getStatuses(MapOutputTracker.scala:851)
at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:808)
at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:128)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:185)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) …
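For context, a minimal PySpark sketch of the read path that hits this failure (the path and the grouping column are placeholders I added; judging from the stack trace, the exception surfaces while tasks deserialize shuffle map statuses rather than while decoding Parquet pages):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the zstd-compressed Parquet file and run a wide transformation,
# which triggers the shuffle whose map statuses fail to decompress above.
df = spark.read.parquet("/data/zstd-compressed.parquet")  # placeholder path
df.groupBy("some_column").count().show()                  # placeholder column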