Python pandas exports a MongoDB collection to CSV with the columns out of order

blu*_*ndr 1 python mongodb amazon-web-services

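I have a Python script that builds a list of EC2 instances across all of our AWS accounts (about 150) and stores the results in MongoDB.

I'm using the Python pandas module to export the MongoDB collection to a CSV file. It works, except that the headers are out of order and I don't want the MongoDB index printed.

In the original version of the script (before I added the database) I used the csv module to write the file, and the headers were correct: [screenshot: CSV file output without Mongo]

I added the database both as a learning exercise and because it makes it easier to work with all of the Amazon accounts we have.

If I look at the JSON documents in the Mongo collection, all of the fields print in the correct order: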

{'_id': ObjectId('5f14f9ffa40de31278dade03'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb001', 'Instance ID': 'i-01e5e920b4d3d5dcb', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-0ce8db4e071bc7229, vol-099f6d212a91121d0, vol-0bb36e343e9c01374, vol-05610645edfd02253, vol-05adc01d70d75d649', 'Private IP': '172.31.62.168', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-62-168.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}
{'_id': ObjectId('5f14f9ffa40de31278dade05'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb002', 'Instance ID': 'i-0b7db2bcab853ef96', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-095a9dcf54ca97c0e, vol-0c8e96b71fbb7dfcf, vol-070c16c457f91c54e, vol-0dc1eaf2e826fa3a6, vol-0f0f157a8489ab939', 'Private IP': '172.31.63.131', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-63-131.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}
{'_id': ObjectId('5f14f9ffa40de31278dade07'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb003', 'Instance ID': 'i-0611acf4b6cc53b61', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-0aa28f89f6ce50577, vol-0e37ff844e8b9c47a, vol-0d54c713ae231739c, vol-0e29df46edc814619, vol-07e0c40a8913b1d31', 'Private IP': '172.31.52.44', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-52-44.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}

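But when I export from the Mongo database with pandas, the headers are out of order. The data lines up with the correct headers, but the columns are completely out of order:

[screenshot: CSV file output from Mongo]

In my code I build a dictionary with the server information, then pass that dictionary to a function that stores it in the Mongo collection: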

def list_instances(aws_account, aws_account_number, interactive, regions, show_details, instance_col):
    for region in regions:
        if 'gov' in aws_account and 'admin' not in aws_account:
            try:
                session = boto3.Session(profile_name=aws_account, region_name=region)
            except botocore.exceptions.ProfileNotFound as e:
                profile_missing_message = f"An exception has occurred: {e}"
                account_found = 'no'
                raise
        else:
            try:
                session = boto3.Session(profile_name=aws_account, region_name=region)
                account_found = 'yes'
            except botocore.exceptions.ProfileNotFound as e:
                profile_missing_message = f"An exception has occurred: {e}"
                raise
        try:
            ec2 = session.client("ec2")
        except Exception as e:
            print(f"An exception has occurred: {e}")
            continue  # no client for this region, so skip it
        message = f"  Region: {region} in {aws_account}: ({aws_account_number})  "
        banner(message)

        print(Fore.RESET)
        # Loop through the instances
        try:
            instance_list = ec2.describe_instances()
        except Exception as e:
            print(f"An exception has occurred: {e}")
            continue  # nothing to list for this region
        for reservation in instance_list["Reservations"]:
                for instance in reservation.get("Instances", []):
                    instance_count = instance_count + 1
                    launch_time = instance["LaunchTime"]
                    launch_time_friendly = launch_time.strftime("%B %d %Y")
                    tree = objectpath.Tree(instance)
                    block_devices = set(tree.execute('$..BlockDeviceMappings[\'Ebs\'][\'VolumeId\']'))
                    if block_devices:
                        block_devices = list(block_devices)
                        block_devices = str(block_devices).replace('[','').replace(']','').replace('\'','')
                    else:
                        block_devices = None
                    private_ips =  set(tree.execute('$..PrivateIpAddress'))
                    if private_ips:
                        private_ips_list = list(private_ips)
                        private_ips_list = str(private_ips_list).replace('[','').replace(']','').replace('\'','')
                    else:
                        private_ips_list = None
                    public_ips =  set(tree.execute('$..PublicIp'))
                    if len(public_ips) == 0:
                        public_ips = None
                    if public_ips:
                        public_ips_list = list(public_ips)
                        public_ips_list = str(public_ips_list).replace('[','').replace(']','').replace('\'','')
                    else:
                        public_ips_list = None
                    name = None
                    if 'Tags' in instance:
                        try:
                            tags = instance['Tags']
                            name = None
                            for tag in tags:
                                if tag["Key"] == "Name":
                                    name = tag["Value"]
                                if tag["Key"] == "Engagement" or tag["Key"] == "Engagement Code":
                                    engagement = tag["Value"]
                        except ValueError:
                            # print("Instance: %s has no tags" % instance_id)
                            raise
                    key_name = instance.get('KeyName') or None
                    vpc_id = instance.get('VpcId') or None
                    private_dns = instance.get('PrivateDnsName') or None
                    ec2info[instance['InstanceId']] = {
                        'AWS Account': aws_account,
                        'Account Number': aws_account_number,
                        'Name': name,
                        'Instance ID': instance['InstanceId'],
                        'AMI ID': instance['ImageId'],
                        'Volumes': block_devices,
                        'Private IP': private_ips_list,
                        'Public IP': public_ips_list,
                        'Private DNS': private_dns,
                        'Availability Zone': instance['Placement']['AvailabilityZone'],
                        'VPC ID': vpc_id,
                        'Type': instance['InstanceType'],
                        'Key Pair Name': key_name,
                        'State': instance['State']['Name'],
                        'Launch Date': launch_time_friendly
                    }
                    mongo_instance_dict = {'_id': '', 'AWS Account': aws_account, "Account Number": aws_account_number, 'Name': name, 'Instance ID': instance["InstanceId"], 'AMI ID': instance['ImageId'], 'Volumes': block_devices,  'Private IP': private_ips_list, 'Public IP': public_ips_list, 'Private DNS': private_dns, 'Availability Zone': instance['Placement']['AvailabilityZone'], 'VPC ID': vpc_id, 'Type': instance["InstanceType"], 'Key Pair Name': key_name, 'State': instance["State"]["Name"], 'Launch Date': launch_time_friendly}
                    insert_doc(mongo_instance_dict)
    mongo_export_to_file(interactive, aws_account)

Here is the function that inserts the dictionary into MongoDB:

def insert_doc(mydict):
    mydb, mydb_name, instance_col = set_db()
    mydict['_id'] = ObjectId()
    instance_doc = instance_col.insert_one(mydict)
    return instance_doc

Here is the function that writes the MongoDB collection to a file:

def mongo_export_to_file():
    aws_account = 'jf-master-pd'
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    mydb, mydb_name, instance_col = set_db()
    # make an API call to the MongoDB server
    cursor = instance_col.find()
    # extract the list of documents from cursor obj
    mongo_docs = list(cursor)

    # create an empty DataFrame for storing documents
    docs = pandas.DataFrame(columns=[])

    # iterate over the list of MongoDB dict documents
    for num, doc in enumerate(mongo_docs):
        # convert ObjectId() to str
        doc["_id"] = str(doc["_id"])
        # get document _id from dict
        doc_id = doc["_id"]
        # create a Series obj from the MongoDB dict
        series_obj = pandas.Series( doc, name=doc_id )
         # append the MongoDB Series obj to the DataFrame obj
        docs = docs.append(series_obj)
        # get document _id from dict
        doc_id = doc["_id"]
        # Set the output file
        output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
        output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today +'.csv')

        # export MongoDB documents to a CSV file
        docs.to_csv(output_file, ",") # CSV delimited by commas

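Here is a link to the original code directory on GitHub. The files we want are aws_ec2_list_instances.py and ec2_mongo.py.

Why are the columns and headers out of order in the MongoDB version? And how do I drop the extra ID column that Mongo adds when writing to the file from pandas?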

Old*_*Pro 5

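Pandas is a very flexible and forgiving library for managing and analyzing data. It is complete overkill if all you want to do is convert a MongoDB collection to a CSV file, given that the csv module is part of the standard library, and the way you are using it is very inefficient. Another thing to keep in mind is that, until fairly recently, neither Python nor pandas tried to preserve the order of items in a dictionary. Before CPython began preserving insertion order in version 3.6 (as an implementation detail), code was written on the assumption that the order of items in a dictionary did not matter. Only since Python 3.7 has preserving the order of dictionary entries been an official language feature.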

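The DataFrame is pandas' primary data structure and represents a two-dimensional array of data. Some things about it can be confusing, and I think you are being tripped up by the fact that both rows and columns can have named indexes. In general, when talking about data in pandas, "index" refers to the row index.

In your data, the row index would be MongoDB's _id value, which you want to discard. That's fine, but it may have led you to think that "index" means the columns.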

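A Series is generally meant to represent a single column of data. When it is initialized from a dictionary, the keys are treated as the index, that is, as row labels, not column labels. You will find that most operations between a DataFrame and a Series treat the Series as a column. But as I said, pandas is flexible, so it has DataFrame.append, which treats a Series as a row.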
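To make the row-label point concrete, here is a minimal sketch. The sample values are taken from the question's documents, and the label `row-1` is made up for illustration. A Series built from a dictionary uses the keys as its index, so it behaves as a column until it is transposed into a row:

```python
import pandas

# One MongoDB-style document, trimmed to three fields for brevity.
doc = {"Name": "usawsweb001", "Type": "t2.micro", "State": "running"}

# The dictionary keys become the Series index (row labels).
series = pandas.Series(doc, name="row-1")

# As a DataFrame, the Series is a single column named "row-1" ...
as_column = series.to_frame()

# ... and only after transposing does it become one row with three columns.
as_row = as_column.T

print(list(series.index))
print(as_column.shape, as_row.shape)
```

The transposed frame is roughly what appending a Series as a row has to produce, which is why the Series index ends up as column names.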

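The problem is that when appending a row, pandas expects the Series to add a row to the existing columns. When the Series has index labels (the keys of the original dictionary) that do not exist in the DataFrame, it adds them as new columns after the existing ones and, as you saw, it adds them in sorted order. This is actually a bug in the current version (1.0.5), which has probably gone unfixed for so long because dictionary order used to be ignored anyway, but be grateful for it, since it led you to investigate further.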
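Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop from the question no longer runs on current releases. If you still want to build the frame row by row, a rough equivalent is to turn each Series into a one-row frame and join them with pandas.concat. This is a sketch that assumes every document has the same keys; the two sample documents are abbreviated from the question:

```python
import pandas

docs = [
    {"_id": "5f14f9ffa40de31278dade03", "AWS Account": "jf-master-pd",
     "Name": "usawsweb001", "State": "running"},
    {"_id": "5f14f9ffa40de31278dade05", "AWS Account": "jf-master-pd",
     "Name": "usawsweb002", "State": "running"},
]

rows = []
for doc in docs:
    # Label each row with the document's _id, as the question's code does.
    series = pandas.Series(doc, name=doc["_id"])
    # Transpose the one-column frame so the dictionary keys become columns.
    rows.append(series.to_frame().T)

# Join all one-row frames in a single call instead of growing a frame in a loop.
frame = pandas.concat(rows)

print(list(frame.columns))
```

Because every one-row frame has the same columns, the key order of the documents carries over to the result.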

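Converting a MongoDB collection to a DataFrame by appending Series to an initially empty DataFrame is really inefficient. DataFrame is perfectly capable of reading your MongoDB collection directly, and you need to write much less code.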
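As a small illustration of that point, the sketch below builds a DataFrame straight from a list of dictionaries and writes it to an in-memory buffer instead of a file. The sample documents are abbreviated from the question, and `column_order` is a hypothetical list added here to pin the header order explicitly rather than rely on dictionary order:

```python
import io

import pandas

# Stand-ins for the documents a MongoDB cursor would return.
docs = [
    {"_id": "5f14f9ffa40de31278dade03", "AWS Account": "jf-master-pd",
     "Name": "usawsweb001", "State": "running"},
    {"_id": "5f14f9ffa40de31278dade05", "AWS Account": "jf-master-pd",
     "Name": "usawsweb002", "State": "running"},
]

# Hypothetical explicit header order; "_id" is left out on purpose.
column_order = ["AWS Account", "Name", "State"]

frame = pandas.DataFrame(docs)

buffer = io.StringIO()
# columns= selects and orders the fields; index=False drops the row labels.
frame.to_csv(buffer, columns=column_order, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing `columns=` to `to_csv` both drops `_id` and fixes the header order, so the output does not depend on how the frame was assembled.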

If you do need pandas, this is the version of mongo_export_to_file I would recommend:

import os
from datetime import datetime

import pandas


def mongo_export_to_file():
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    _, _, instance_col = set_db()  # set_db() is the helper from the question's script
    # make an API call to the MongoDB server and load every document
    mongo_docs = list(instance_col.find())

    # Convert the mongo docs to a DataFrame; columns follow the key order of the documents
    docs = pandas.DataFrame(mongo_docs)
    # Discard the Mongo ID for the documents (a no-op if the collection is empty)
    docs = docs.drop(columns=["_id"], errors="ignore")

    # compute the output file directory and name
    output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
    output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today + '.csv')

    # export MongoDB documents to a CSV file, leaving out the row "labels" (row numbers)
    docs.to_csv(output_file, sep=",", index=False)  # CSV delimited by commas

Here is the version I would use in a project that does not need pandas:

import csv
import os
from datetime import datetime


def mongo_export_to_file():
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    _, _, instance_col = set_db()  # set_db() is the helper from the question's script
    # make an API call to the MongoDB server and load every document
    # (Cursor.count() was removed in PyMongo 4, so test the list instead)
    mongo_docs = list(instance_col.find())
    if not mongo_docs:
        return

    # take the column order from the first document, minus the Mongo ID
    fieldnames = list(mongo_docs[0].keys())
    fieldnames.remove('_id')

    # compute the output file directory and name
    output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
    output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today + '.csv')
    with open(output_file, 'w', newline='') as csvfile:
        # extrasaction="ignore" skips keys such as _id that are not in fieldnames
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(mongo_docs)
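For a quick check without a database, the same DictWriter approach can be exercised against an in-memory buffer. This is only a sketch: the documents are abbreviated from the question, and the names `docs` and `buffer` are illustrative.

```python
import csv
import io

# Stand-ins for the documents a MongoDB cursor would return.
docs = [
    {"_id": "5f14f9ffa40de31278dade03", "AWS Account": "jf-master-pd",
     "Name": "usawsweb001", "Type": "t2.micro"},
    {"_id": "5f14f9ffa40de31278dade05", "AWS Account": "jf-master-pd",
     "Name": "usawsweb002", "Type": "t2.micro"},
]

# Header order comes from the first document's key order, without "_id".
fieldnames = [key for key in docs[0] if key != "_id"]

buffer = io.StringIO(newline="")
# extrasaction="ignore" makes the writer skip the "_id" key in each row.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(docs)
csv_text = buffer.getvalue()

print(csv_text)
```

Since dictionaries keep insertion order on Python 3.7 and later, the header follows the order in which the fields were stored.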