I am looking for a way to read data from multiple partitioned directories in S3 using Python.
data_folder/serial_number = 1/cur_date = 20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number = 2/cur_date = 27-12-2012/asdsdfsd0324324.snappy.parquet
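For reference, each `key = value` directory in a layout like this encodes a partition column. The following is only a minimal sketch of how those values could be recovered from a path with the standard library; `parse_partitions` is a hypothetical helper name, not part of pyarrow, and stripping the spaces around `=` is an assumption based on how the paths are written here.

```python
from pathlib import PurePosixPath


def parse_partitions(path):
    """Return {key: value} for every 'key=value' directory in a path.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    partitions = {}
    # Skip the last component, which is the data file itself.
    for part in PurePosixPath(path).parts[:-1]:
        if "=" in part:
            key, _, value = part.partition("=")
            # Strip the spaces around '=' seen in the paths above.
            partitions[key.strip()] = value.strip()
    return partitions


example = "data_folder/serial_number = 1/cur_date = 20-12-2012/abcdsd0324324.snappy.parquet"
print(parse_partitions(example))
# {'serial_number': '1', 'cur_date': '20-12-2012'}
```

The values come back as strings; converting them to integers or dates would be a separate step.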
pyarrow's ParquetDataset can read from partitions, so I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It raised the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
.format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Following the pyarrow documentation, I tried passing s3fs as the filesystem, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
This raises the following error:
Traceback (most recent call last):
File "<stdin>", …

I installed the following modules on an EC2 server that already has Python (3.6) and Anaconda installed:
Everything imports except fastparquet. When I try to import fastparquet, it throws the following error:
[username@ip8 ~]$ conda -V
conda 4.2.13
[username@ip-~]$ python
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fastparquet
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/__init__.py", line 15, in <module>
from .core import read_thrift
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/core.py", line 11, in <module>
from .compression import decompress_data
File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/compression.py", line 43, in <module> …

I am writing a data quality script with pandas that checks certain conditions on each column.
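Right now I need to find the rows in a particular column that do not contain a decimal or other real number. I can detect the value when it is an integer, but the methods I have seen so far (i.e. isdigit(), isnumeric(), isdecimal(), etc.) fail to recognize decimal numbers such as 2.5 or 0.1245.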
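To make the limitation concrete, here is a small sketch using plain Python strings; the sample values are chosen for illustration. The string methods only accept digit characters, so a decimal point or a minus sign makes all three return False.

```python
# Sample strings chosen for illustration.
samples = ["1", "2.5", "0.1245", "-1"]

# Map each string to the results of (isdigit, isnumeric, isdecimal).
results = {s: (s.isdigit(), s.isnumeric(), s.isdecimal()) for s in samples}

for value, checks in results.items():
    print(value, checks)
# 1 (True, True, True)
# 2.5 (False, False, False)
# 0.1245 (False, False, False)
# -1 (False, False, False)
```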
Here is some sample code and data:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([
[np.nan, 'foo', 0],
[1, '', 1],
[-1.387326, np.nan, 2],
[0.814772, ' baz', ' '],
["a", ' ', 4],
[" ", 'foo qux ', ' '],
], columns='A B C'.split(),dtype=str)
>>> df
           A         B  C
0        NaN       foo  0
1          1            1
2  -1.387326       NaN  2
3   0.814772       baz
4          a            4
5             foo qux
>>> df['A']
0 NaN
1 1
2 -1.387326
3 0.814772
4 …
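One possible approach, sketched here with the standard library only, is to try parsing each value with `float()` and treat failures as non-numeric. `is_number` is a hypothetical helper name, and the list below just mirrors the values of column A from the sample data. Note that `float()` is fairly permissive: it also accepts surrounding whitespace, exponent notation such as `1e5`, and underscores such as `1_000`, which may or may not be what a data quality check wants.

```python
import math


def is_number(value):
    """Return True if value is a string that parses as a finite number.

    Hypothetical helper for illustration.
    """
    if not isinstance(value, str):
        return False  # e.g. a missing value stored as float NaN
    try:
        parsed = float(value)
    except ValueError:
        return False
    # Reject the strings "nan" and "inf", which float() also accepts.
    return math.isfinite(parsed)


# Values mirroring column A of the sample data.
column_a = [float("nan"), "1", "-1.387326", "0.814772", "a", " "]
flags = [is_number(v) for v in column_a]
print(flags)
# [False, True, True, True, False, False]
```

With pandas, the same function could be applied element-wise, for example with `df['A'].map(is_number)`, to obtain a boolean mask; negating that mask would select the rows that are not numbers.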