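I'm new to Python and was wondering whether someone would be willing to convert a fairly simple Perl script example to Python?
The script takes 2 files and outputs only the unique lines from the second file by comparing hash keys. It also writes the duplicate lines to a file. I've found this de-duplication method to be very fast in Perl, and I'd like to see how Python compares.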
#! /usr/bin/perl
## Compare file1 and file2 and output only the unique lines from file2.
## Opening file1.txt and store the data in a hash.
open my $file1, '<', "file1.txt" or die $!;
while ( <$file1> ) {
    my $name = $_;
    $file1hash{$name}=$_;
}
## Opening file2.txt and store the data in a hash.
open my $file2, '<', "file2.txt" or die $!;
while ( <$file2> ) {
    $name = $_;
    $file2hash{$name}=$_;
}
open my $dfh, '>', "duplicate.txt";
## Compare the keys and remove the duplicate one in the file2 hash
foreach ( keys %file1hash ) {
    if ( exists ( $file2hash{$_} ))
    {
        print $dfh $file2hash{$_};
        delete $file2hash{$_};
    }
}
open my $ofh, '>', "file2_clean.txt";
print $ofh values(%file2hash) ;
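I've tested both the Perl and Python scripts on 2 files of over 1 million lines, and the total time was under 6 seconds. For the business purpose this serves, the performance is outstanding!
I modified the script Kriss provided, and I'm very happy with both results: 1) the performance of the script and 2) the ease with which I could modify the script to make it more flexible: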
#!/usr/bin/env python3
import os

filename1 = input("What is the first file name to compare? ")
filename2 = input("What is the second file name to compare? ")

# Read each file into a set of lines (repeats within a file collapse).
with open(filename1) as f1, open(filename2) as f2:
    file1set = set(f1)
    file2set = set(f2)

# Lines found in both files go to duplicate.txt;
# lines found only in the second file go to <filename2>_clean.txt.
for name, results in [
    (os.path.join(os.getcwd(), "duplicate.txt"), file1set.intersection(file2set)),
    (os.path.join(os.getcwd(), filename2 + "_clean.txt"), file2set.difference(file1set)),
]:
    with open(name, "w") as fh:
        for line in results:
            fh.write(line)
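To check a timing claim like the one above on your own machine, the set logic can be wrapped in a function and timed. The following is only a minimal sketch, not the script the poster actually ran: the names `split_lines` and `make_test_file` and the synthetic data are hypothetical, and the elapsed time will vary with hardware and line length.

```python
import os
import tempfile
import time


def split_lines(path1, path2):
    """Return (duplicates, unique_to_2) as sets of lines, using set operations."""
    with open(path1) as f1, open(path2) as f2:
        set1 = set(f1)
        set2 = set(f2)
    return set1 & set2, set2 - set1


def make_test_file(directory, name, start, stop):
    """Hypothetical helper: write one line per integer in [start, stop)."""
    path = os.path.join(directory, name)
    with open(path, "w") as fh:
        fh.writelines(f"line {i}\n" for i in range(start, stop))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Two synthetic files of 1,000,000 lines each, sharing half of their lines.
        p1 = make_test_file(tmp, "file1.txt", 0, 1_000_000)
        p2 = make_test_file(tmp, "file2.txt", 500_000, 1_500_000)

        start = time.perf_counter()
        duplicates, unique = split_lines(p1, p2)
        elapsed = time.perf_counter() - start

        print(f"{len(duplicates)} duplicate lines, {len(unique)} unique lines")
        print(f"elapsed: {elapsed:.2f} s")
```

Only the read-and-compare step is timed here; writing the two output files would add to the total.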
If you don't care about order, you can use sets in Python:
with open("file1") as f1, open("file2") as f2:
    file1 = set(f1)
    file2 = set(f2)
intersection = file1 & file2      # common lines
non_intersection = file2 - file1  # uncommon lines (in file2 but not file1)
for item in intersection:
    print(item, end="")  # each line already ends with a newline
for item in non_intersection:
    print(item, end="")
Other approaches include using the difflib and filecmp libraries.
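As a rough illustration of those two standard-library modules: `filecmp.cmp` only answers whether two files are identical, while `difflib.ndiff` reports line-level differences. This is a sketch using made-up sample files; the helpers `write_lines` and `added_lines` are hypothetical names. Note that `difflib` compares sequences in order and can be slow on very large inputs, so it is not a drop-in replacement for the set-based de-duplication above.

```python
import difflib
import filecmp
import os
import tempfile


def write_lines(path, lines):
    """Hypothetical helper: write the given lines to path."""
    with open(path, "w") as fh:
        fh.writelines(lines)


def added_lines(path1, path2):
    """Return the lines that ndiff marks as present in path2 but not in path1."""
    with open(path1) as f1, open(path2) as f2:
        diff = difflib.ndiff(f1.readlines(), f2.readlines())
        # ndiff prefixes added lines with "+ "; strip that two-character marker.
        return [d[2:] for d in diff if d.startswith("+ ")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        p1 = os.path.join(tmp, "file1")
        p2 = os.path.join(tmp, "file2")
        write_lines(p1, ["a\n", "b\n", "c\n"])
        write_lines(p2, ["b\n", "c\n", "d\n"])

        # shallow=False forces a comparison of file contents, not just os.stat() data.
        print(filecmp.cmp(p1, p2, shallow=False))
        print(added_lines(p1, p2))
```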
Another way, using only list comparison:
# lines in file2 common with file1
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line in data1:
            print(line)
# lines in file2 not in file1, use "not"
with open("file1") as f1:
    data1 = [line.rstrip() for line in f1]
with open("file2") as f2:
    for line in f2:
        line = line.rstrip()
        if line not in data1:
            print(line)
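The list-based loops above keep file2's original order, but `line in data1` scans the whole list for every line of file2, which becomes slow on large files. A possible middle ground, sketched here under the assumption that file1 fits in memory, is to keep the streaming loop but hold file1's lines in a set. The function name `partition_file2` and its parameters are illustrative only.

```python
def partition_file2(path1, path2, common_path, unique_path):
    """Split path2's lines into those also in path1 and those not, keeping order.

    Trailing whitespace is stripped before comparing, as in the list version.
    Returns (number_of_common_lines, number_of_unique_lines).
    """
    with open(path1) as f1:
        seen = {line.rstrip() for line in f1}  # set membership is O(1) on average

    n_common = n_unique = 0
    with open(path2) as f2, \
            open(common_path, "w") as common, \
            open(unique_path, "w") as unique:
        for line in f2:
            line = line.rstrip()
            if line in seen:
                common.write(line + "\n")
                n_common += 1
            else:
                unique.write(line + "\n")
                n_unique += 1
    return n_common, n_unique
```

It could be called as `partition_file2("file1", "file2", "duplicate.txt", "file2_clean.txt")`. Unlike the pure set approach, lines repeated within file2 are written out each time they occur.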