Insert a Pandas DataFrame into MongoDB using PyMongo

Nyx*_*nyx 32 python mongodb pymongo python-2.7 pandas

What is the quickest way to insert a pandas DataFrame into MongoDB using PyMongo?

Attempts:

db.myCollection.insert(df.to_dict())

throws an error:

InvalidDocument: documents must have only string keys, the key was Timestamp('2013-11-23 13:31:00', tz=None)

db.myCollection.insert(df.to_json())

throws an error:

TypeError: 'str' object does not support item assignment

db.myCollection.insert({id: df.to_json()})

throws an error:

InvalidDocument: documents must have only string keys, key was <built-in function id>

df

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount    150  non-null values
price     150  non-null values
tid       150  non-null values
dtypes: float64(2), int64(1)

alk*_*lko 34

I doubt there is a method that is both the quickest and simple. If you are not worried about data conversion, you can do:

>>> import datetime
>>> import json
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
                           A
1 2013-11-23 21:14:34.118531

>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)

But if you try to load the data back, you will get:

>>> df = read_mongo(db, 'myCollection')
>>> df
                     A
0  1385241274118531000
>>> df.dtypes
A    int64
dtype: object

So you will have to convert the 'A' column back to datetimes, along with every field in your DataFrame that is not an int, float or str. For this example:

>>> df['A'] = pd.to_datetime(df['A'])
>>> df
                           A
0 2013-11-23 21:14:34.118531

  • `db.myCollection.insert(records)` should be replaced with `db.myCollection.insert_many(records)`; see the warning `//anaconda/bin/ipython:1: DeprecationWarning: insert is deprecated. Use insert_one or insert_many instead. #!/bin/bash //anaconda/bin/python.app` (6 upvotes)
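For readers on Python 3 with PyMongo 3 or later, the same JSON round trip might look roughly like the sketch below. The helper names `frame_to_json_records` and `restore_datetimes` are made up for illustration and are not part of pandas or PyMongo. The sketch assumes datetimes are written as epoch milliseconds (`date_unit="ms"`), so they come back from MongoDB as plain integers and have to be converted again, as described above. Note that `orient="records"` drops the index.

```python
import json

import pandas as pd


def frame_to_json_records(df):
    """Return a list with one plain dict per row, after a round trip through JSON.

    Hypothetical helper: datetimes are serialised as integers holding
    milliseconds since the epoch, and the index is not included.
    """
    payload = df.to_json(orient="records", date_format="epoch", date_unit="ms")
    return json.loads(payload)


def restore_datetimes(df, columns):
    """Return a copy of df with the given epoch-millisecond columns as datetimes."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], unit="ms")
    return out
```

The resulting list can be passed straight to `db.myCollection.insert_many(records)`; after reading the documents back into a DataFrame, `restore_datetimes(df, ['A'])` undoes the integer conversion.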

小智 30

Here you have the quickest way: use the insert_many method from pymongo 3 and the 'records' parameter of the to_dict method.

db.myCollection.insert_many(df.to_dict('records'))

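  • IMO this is the best idea, although I don't think the syntax works for the original use case. The basic problem is that mongo needs string keys, whereas the df has a Timestamp index. You need to use the arguments passed to `to_dict()` so that the keys in mongo are not dates. A use case I run into often is that you actually want every row of the df to be a record with an additional "date" field. (2 upvotes)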
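One way to get the "each row is a record with a date field" shape is to move the index into an ordinary column before calling `to_dict('records')`. The following is only a sketch under that assumption: the function name `frame_to_documents` and the default field name `"date"` are illustrative choices, and it assumes no existing column already uses that name (otherwise `reset_index` raises an error).

```python
import pandas as pd


def frame_to_documents(df, index_field="date"):
    """Return one dict per row, with the index stored under index_field.

    Hypothetical helper: every key in the result is a column name (a string),
    so no Timestamp ends up as a document key.
    """
    # Name the index, then turn it into a regular column.
    flat = df.rename_axis(index_field).reset_index()
    docs = flat.to_dict("records")
    for doc in docs:
        value = doc[index_field]
        # Convert pandas Timestamps to plain datetime objects for storage.
        if isinstance(value, pd.Timestamp):
            doc[index_field] = value.to_pydatetime()
    return docs
```

With the question's DataFrame this would be used as `db.myCollection.insert_many(frame_to_documents(df))`.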

Fem*_*der 9

odo can do it using:

odo(df, db.myCollection)

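  • I really like odo, but it fails terribly when the mongo URI has a non-alphanumeric username or password. I wouldn't recommend it for anything other than an unauthenticated mongo. (2 upvotes)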

Raf*_*ero 5

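I think there are cool ideas in this question. In my case I have been spending more time taking care of moving large dataframes. In those cases pandas tends to let you choose a chunk size (see, for example, pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I use in this direction.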

from pymongo import MongoClient


def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    Write a DataFrame to a MongoDB collection in chunks.

    Parameters / Input
        my_df: the DataFrame to send to MongoDB
        database_name: database name
        collection_name: collection name (to create)
        server: the server where the MongoDB database is hosted
            Example: server = 'XXX.XXX.XX.XX'
        mongodb_port: the port where the database is operating
            For example: mongodb_port = 27017
        chunk_size: the number of records that will be sent to the
            database at the same time. Default is 100.

    Output
        When finished will print "Done"

    FUTURE modifications.
    1. Write to SQL
    2. Write to csv

    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # To write
    collection.delete_many({})  # Remove every existing document from the collection
    # my_df = my_df.drop_duplicates(subset=None, keep='last')  # To avoid repetitions
    my_list = my_df.to_dict('records')

    # Insert chunks of the dataframe
    for i in range(0, len(my_list), chunk_size):
        collection.insert_many(my_list[i:i + chunk_size])  # fill the collection
        print(min(i + chunk_size, len(my_list)))  # records written so far

    print('Done')
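For very large dataframes, a possible variation is to convert only one slice of the DataFrame to dicts at a time, so the full list of records never has to sit in memory next to the DataFrame. This is just a sketch of that idea, not the function above: `iter_record_chunks` and `insert_in_chunks` are hypothetical names, and `collection` is assumed to be any object with an `insert_many` method, such as a PyMongo collection.

```python
import pandas as pd


def iter_record_chunks(df, chunk_size=100):
    """Yield lists of row dicts, converting at most chunk_size rows at a time."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size].to_dict("records")


def insert_in_chunks(collection, df, chunk_size=100):
    """Send df to collection chunk by chunk and return the number of rows sent."""
    sent = 0
    for chunk in iter_record_chunks(df, chunk_size):
        collection.insert_many(chunk)
        sent += len(chunk)
    return sent
```

Because the collection is passed in as an argument, the connection setup stays outside the helper, and an empty DataFrame simply results in no `insert_many` call.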