Posts by sto*_*eld

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using python.

data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
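The directory layout above follows the key=value partitioning convention, where each directory level carries a column name and a value. As a minimal sketch of how those values can be recovered from a path, the helper below splits the directory segments on the equals sign. The function name `parse_partitions` is made up for this illustration and is not part of pyarrow.

```python
from pathlib import PurePosixPath


def parse_partitions(path):
    """Collect the key=value directory segments of a partitioned path."""
    partitions = {}
    # Skip the final component, which is the data file itself.
    for segment in PurePosixPath(path).parts[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Tolerate stray spaces around the '=' sign.
            partitions[key.strip()] = value.strip()
    return partitions


example = "data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet"
print(parse_partitions(example))
```

The values come back as strings, so a date such as 20-12-2012 would still need to be parsed separately if it is to be used as a date.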

pyarrow's ParquetDataset class can read from partitioned directories, so I tried the following code:

>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)

It threw the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
    self.metadata_path) = _make_manifest(path_or_paths, self.fs)
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
    .format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/

Following the pyarrow documentation, I tried to use s3fs as the file system, i.e.:

>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)

This throws the following error:

Traceback (most recent call last):
  File "<stdin>", …
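The second traceback is cut off, so the exact failure cannot be read from it. One detail that stands out in the call, though, is that `filesystem=s3fs` passes the s3fs module itself rather than a filesystem object. The sketch below assumes that an `s3fs.S3FileSystem()` instance is what `ParquetDataset` expects there. The wrapper name `read_partitioned_parquet` is made up for this example, and running it needs pyarrow, s3fs, pandas and valid AWS credentials.

```python
def read_partitioned_parquet(path, filesystem=None):
    """Read a partitioned parquet directory into a pandas DataFrame."""
    # Imported lazily so that this sketch can be loaded even where the
    # optional dependencies are not installed.
    import pyarrow.parquet as pq

    if filesystem is None:
        import s3fs

        # Note the call: an S3FileSystem *instance*, not the s3fs module.
        filesystem = s3fs.S3FileSystem()

    dataset = pq.ParquetDataset(path, filesystem=filesystem)
    return dataset.read().to_pandas()


# Hypothetical usage with the path from the question (needs AWS credentials):
# df = read_partitioned_parquet("s3://my_bucker/path/to/data_folder/")
```

If the read succeeds, the partition keys (serial_number and cur_date in this layout) should show up as columns next to the data stored in the files.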

python parquet arrow-python fastparquet

18
Votes
4
Answers
20k
Views

snappy error when importing fastparquet in python

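I installed the following modules on my EC2 server, which already has python (3.6) and anaconda installed:

  • snappy
  • pyarrow
  • s3fs
  • fastparquet

Everything except fastparquet imports fine. When I try to import fastparquet, it throws the following error: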

[username@ip8 ~]$ conda -V
conda 4.2.13
[username@ip-~]$ python
    Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fastparquet
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/__init__.py", line 15, in <module>
        from .core import read_thrift
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/core.py", line 11, in <module>
        from .compression import decompress_data
      File "/home/username/anaconda3/lib/python3.6/site-packages/fastparquet/compression.py", line 43, in <module> …
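Because the traceback stops inside fastparquet/compression.py, the actual error message is not visible here. One way to narrow such a problem down is to ask the running interpreter which of the listed modules it can locate at all. The sketch below uses only the standard library. It assumes that the snappy entry in the list refers to the python-snappy package, whose import name is `snappy`, and the helper name `find_missing` is made up for this example.

```python
import importlib.util

# Import names of the modules listed in the question (assumed).
MODULES = ["snappy", "pyarrow", "s3fs", "fastparquet"]


def find_missing(modules):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


print("Not found by this interpreter:", find_missing(MODULES))
```

This only checks that a module can be found. A module that is found can still fail during import, for example when a compiled library it depends on is absent, so an empty result does not rule out the error above.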

python snappy anaconda conda fastparquet

6
Votes
1
Answers
3125
Views

How to find rows where a specific column contains decimal numbers using pandas?

I am writing a data quality script using pandas that checks certain conditions on each column.

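At the moment I need to find the rows that do not contain a decimal or an actual number in a specific column. I can detect the value when it is an integer, but the methods I have seen so far (i.e. isdigit(), isnumeric(), isdecimal(), etc.) fail to correctly identify when the value is a decimal number, e.g. 2.5, 0.1245, etc.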
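The string methods mentioned above test whether every character is a digit, so a decimal point or a minus sign makes them return False. A minimal sketch of an alternative in plain Python is to try parsing the value with `float()` and then look at the fractional part. The helper names `parse_number` and `has_fraction` are made up for this illustration.

```python
import math


def parse_number(text):
    """Return the float value of a string, or None if it is not a finite number."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    # float() also accepts "nan" and "inf"; treat those as non-numbers here.
    return value if math.isfinite(value) else None


def has_fraction(text):
    """True when the string is a number with a non-zero fractional part."""
    value = parse_number(text)
    return value is not None and not value.is_integer()


for sample in ["1", "2.5", "0.1245", "-1.387326", "a", "  "]:
    print(repr(sample), sample.isdigit(), has_fraction(sample))
```

Note that this check goes by value, so a string such as "2.0" is counted as an integer. Whether that is the desired behaviour depends on the data quality rule being checked.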

Here is some sample code and data:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([
    [np.nan, 'foo', 0],
    [1, '', 1],
    [-1.387326, np.nan, 2],
    [0.814772, ' baz', ' '],
    ["a", '      ', 4],
    ["  ",  'foo qux ', '  '],
], columns='A B C'.split(), dtype=str)

>>> df
    A   B   C
0   NaN foo 0
1   1       1
2   -1.387326   NaN 2
3   0.814772    baz 
4   a       4
5       foo qux 

>>> df['A']
0          NaN
1            1
2    -1.387326
3     0.814772
4 …
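For the sample data, one possible approach (a sketch, not necessarily what the answers to this question propose) is to let `pd.to_numeric` with `errors="coerce"` turn everything that is not a number into NaN, and then test the remainder of a division by 1. The example rebuilds column A as a standalone Series so that it does not depend on how `dtype=str` treats missing values. The variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Column A from the sample data, kept as plain Python objects.
a = pd.Series([np.nan, 1, -1.387326, 0.814772, "a", "  "], name="A", dtype=object)

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(a, errors="coerce")

# Rows that hold a number at all.
is_numeric = numeric.notna()
# Rows that hold a number with a non-zero fractional part.
has_fraction = is_numeric & (numeric % 1 != 0)

print(a[has_fraction])  # rows holding decimal numbers
print(a[~is_numeric])   # rows holding no number at all
```

The two boolean masks can be reused across columns, which may suit a script that applies the same check to every column of a DataFrame.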

python data-quality pandas

5
Votes
1
Answers
4266
Views