Converting a Perl script to Python: deduplicating 2 files based on hash keys

gal*_*her 1 python perl hash

I am new to Python and was wondering if someone would be kind enough to convert a fairly simple Perl script example to Python?

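The script takes 2 files and outputs only the unique lines from the second file by comparing hash keys. It also writes the duplicate lines to a file. I have found this method of deduplication to be very fast in Perl, and would like to see how Python compares.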

#! /usr/bin/perl

## Compare file1 and file2 and output only the unique lines from file2.

## Open file1.txt and store its lines in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name}=$_;
}
## Open file2.txt and store its lines in a hash.
open my $file2, '<', "file2.txt" or die $!;

while  ( <$file2> ) {
    $name = $_;
    $file2hash{$name}=$_;
}

open my $dfh, '>', "duplicate.txt";

## Compare the keys and remove the duplicates from the file2 hash
foreach ( keys %file1hash ) {
    if ( exists $file2hash{$_} ) {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}

open my $ofh, '>', "file2_clean.txt";
print  $ofh values(%file2hash) ;
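For comparison, a fairly literal Python 3 rendering of the Perl logic above might look like the sketch below. It is an illustration rather than part of the original post: the names `split_unique_and_duplicates` and `main` are made up here, and a `dict` stands in for each Perl hash. One difference worth noting is that Python dicts keep insertion order (3.7+), so the output follows the order in which lines first appear, whereas Perl's `keys` and `values` return entries in no particular order.

```python
def split_unique_and_duplicates(file1_lines, file2_lines):
    """Mimic the Perl hash logic: return (duplicates, unique_to_file2)."""
    # Like the Perl hashes, a dict keyed by the line collapses repeated lines.
    file1_hash = {line: line for line in file1_lines}
    file2_hash = {line: line for line in file2_lines}

    duplicates = []
    for key in file1_hash:
        if key in file2_hash:
            # pop() returns the value and deletes the key in one step,
            # matching the Perl "print" + "delete" pair.
            duplicates.append(file2_hash.pop(key))

    # Whatever is left in file2_hash never appeared in file1.
    return duplicates, list(file2_hash.values())


def main():
    # File names are the ones hard-coded in the Perl script.
    with open("file1.txt") as f1, open("file2.txt") as f2:
        duplicates, unique = split_unique_and_duplicates(f1, f2)
    with open("duplicate.txt", "w") as dfh:
        dfh.writelines(duplicates)
    with open("file2_clean.txt", "w") as ofh:
        ofh.writelines(unique)

# Call main() to process file1.txt and file2.txt in the current directory.
```

Keeping the comparison in a function that accepts any iterable of lines makes it easy to try on small lists before pointing it at million-line files.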

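I have tested both the Perl and Python scripts on 2 files of over 1 million lines each, and the total time was under 6 seconds. For the business purpose this is serving, the performance is outstanding!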

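I modified the script provided by Kriss, and I am very happy with both results: 1) the script's performance, and 2) how easily I could adapt the script to make it more flexible: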

#!/usr/bin/env python3

import os

filename1 = input("What is the first file name to compare? ")
filename2 = input("What is the second file name to compare? ")

# Read each file into a set of lines (trailing newlines are kept).
with open(filename1) as fh:
    file1set = set(fh)
with open(filename2) as fh:
    file2set = set(fh)

# Lines found in both files go to duplicate.txt; lines found only in the
# second file go to <filename2>_clean.txt. Both land in the current directory.
for name, results in [
    (os.path.join(os.getcwd(), "duplicate.txt"), file1set.intersection(file2set)),
    (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set.difference(file1set))]:
    with open(name, 'w') as fh:
        for line in results:
            fh.write(line)

gho*_*g74 7

If you don't care about line order, you can use sets in Python:

# Read each file into a set of lines; the original ordering is lost.
with open("file1") as f1, open("file2") as f2:
    file1 = set(f1)
    file2 = set(f2)
intersection = file1 & file2      # common lines
non_intersection = file2 - file1  # uncommon lines (in file2 but not file1)
for items in intersection:
    print(items, end="")   # each line already ends with a newline
for nitems in non_intersection:
    print(nitems, end="")

Other approaches include using the difflib and filecmp libraries.
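As a rough idea of what those two standard-library modules offer, here is a minimal sketch. The helper names `files_identical` and `added_lines` are invented for illustration. Note that both modules answer a slightly different question than the set approach: `filecmp.cmp` only says whether two files are identical, and `difflib.ndiff` produces a positional diff, so a line that exists in both files but at a different position may still be reported as added.

```python
import difflib
import filecmp


def files_identical(path1, path2):
    """Return True if the two files have identical contents."""
    # shallow=False forces a content comparison instead of trusting
    # the os.stat() signatures (type, size, modification time).
    return filecmp.cmp(path1, path2, shallow=False)


def added_lines(file1_lines, file2_lines):
    """Return the lines a positional diff reports as added in file2."""
    diff = difflib.ndiff(file1_lines, file2_lines)
    # ndiff prefixes every line with a two-character code; "+ " marks
    # lines that are present only in the second sequence.
    return [line[2:] for line in diff if line.startswith("+ ")]
```

For plain deduplication the set-based version is likely the better fit; these helpers are more useful when the position of each line matters.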

Another way, using only list comparison:

# lines in file2 common with file1
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line in data1:
            print(line)

# lines in file2 not in file1, use "not"
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line not in data1:
            print(line)
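The list-based loops above scan `data1` from the start for every line of file2, which gets slow on large inputs. A possible middle ground, sketched below under the assumption that file1 fits in memory, is to keep the loop over file2 (so its line order and repeated lines are preserved) but look lines up in a set. The function name `split_by_membership` is hypothetical.

```python
def split_by_membership(file1_lines, file2_lines):
    """Return (common, only_in_file2), both in file2's original order."""
    # Build the lookup set once; membership tests on a set take constant
    # time on average, versus a linear scan for a list.
    seen_in_file1 = {line.rstrip() for line in file1_lines}

    common, only_in_file2 = [], []
    for line in file2_lines:
        line = line.rstrip()
        if line in seen_in_file1:
            common.append(line)
        else:
            only_in_file2.append(line)
    return common, only_in_file2
```

Open file handles can be passed in directly, for example `split_by_membership(open("file1"), open("file2"))`, since the function only iterates over its arguments.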