Posts by alp*_*pal

How to import a text file on AWS S3 into pandas without writing to disk

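I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas but cannot save it first, because I am running on a Heroku server. Here is what I have so far.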

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

Update: using the following worked

file = response["Body"].read()

pd.read_csv(io.BytesIO(file), header=14, …
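
As a rough sketch of why the update works: `read()` on the response body returns plain bytes, and `io.BytesIO` wraps those bytes in an in-memory file-like object that `pd.read_csv` can consume, so nothing is written to disk. The `FakeStreamingBody` class, the `read_tsv_from_body` helper, and the sample payload below are hypothetical stand-ins so the example runs without AWS credentials; they are not part of boto3.

```python
import io

import pandas as pd


class FakeStreamingBody:
    """Hypothetical stand-in for the object stored in response["Body"]."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        # Returns the whole payload as bytes, as the update in the post relies on.
        return self._payload


def read_tsv_from_body(body, **read_csv_kwargs):
    """Load a tab-delimited payload into a DataFrame without touching disk."""
    raw = body.read()          # bytes held only in memory
    buffer = io.BytesIO(raw)   # file-like wrapper that pandas accepts
    return pd.read_csv(buffer, delimiter="\t", **read_csv_kwargs)


# Made-up sample data standing in for the S3 object.
body = FakeStreamingBody(b"name\tvalue\nalpha\t1\nbeta\t2\n")
df = read_tsv_from_body(body)
print(df)
```

With a real client, the same helper would presumably be called as `read_tsv_from_body(response["Body"], header=14, low_memory=False)`, reusing the arguments from the post.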

python heroku amazon-s3 pandas boto3

68
votes
5
answers
50k
views

How to make the word boundary \b not match on dashes

I have simplified my code down to the specific problem I am having.

import re
pattern = re.compile(r'\bword\b')
result = pattern.sub(lambda x: "match", "-word- word")

I am getting

'-match- match'

But I want

'-word- match'

Edit:

Or, for the string "word -word-"

I want

"match -word-"
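
One possible approach, sketched here under the assumption that "word" should only count when it touches neither a word character nor a dash on either side: `\b` treats `-` as a non-word character, so it reports a boundary between `-` and `w`. Replacing each `\b` with a negative lookaround over the class `[\w-]` excludes the dash as well. This is an illustration, not an answer taken from the thread.

```python
import re

# Assumption: a dash next to "word" should block the match, just like a letter would.
# (?<![\w-]) -> not preceded by a word character or a dash
# (?![\w-])  -> not followed by a word character or a dash
pattern = re.compile(r'(?<![\w-])word(?![\w-])')

first = pattern.sub("match", "-word- word")
second = pattern.sub("match", "word -word-")

print(first)   # -word- match
print(second)  # match -word-
```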

python regex

7
votes
2
answers
1736
views

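How to make GridSearchCV work with a custom transformer in a pipeline?

If I exclude my custom transformer, GridSearchCV runs fine, but with it included it raises an error. Here is a fake dataset: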

import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
                       "Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4], 
                       "Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})

class MyTransformer(TransformerMixin):

    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x

    def fit(self, x, y=None, **fit_args):
        return self

x_train = df
y_train = x_train.pop("Label")    

mapper = DataFrameMapper([
    ("Number", MinMaxScaler()),
    ("Letter", LabelBinarizer()),
    ])

pipe = …
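
The post is cut off before the pipeline and the error, so the following is only a guess at one common cause. GridSearchCV clones the estimator it is given, and cloning relies on `get_params`, which `TransformerMixin` alone does not provide but `BaseEstimator` does. The sketch below uses a hypothetical `DoubleNumber` transformer (a variant of `MyTransformer` that also avoids mutating its input), a reduced dataset, and the `sklearn.model_selection` import path from current scikit-learn releases in place of `sklearn.grid_search`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DoubleNumber(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: doubles the "Number" column."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the pipeline can chain calls.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        X["Number"] = X["Number"] * 2
        return X


# Reduced version of the fake dataset from the post (numeric column only).
X = pd.DataFrame({"Number": [1, 2, 3, 4] * 4})
y = pd.Series(["G", "G", "B", "B"] * 4)

pipe = Pipeline([
    ("double", DoubleNumber()),
    ("forest", RandomForestClassifier(random_state=0)),
])

# BaseEstimator supplies get_params/set_params, which the search needs for cloning.
grid = GridSearchCV(pipe, param_grid={"forest__n_estimators": [5, 10]}, cv=2)
grid.fit(X, y)
print(grid.best_params_)
```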

python pandas scikit-learn

5
votes
1
answer
1036
views

Tag statistics

python ×3

pandas ×2

amazon-s3 ×1

boto3 ×1

heroku ×1

regex ×1

scikit-learn ×1