Comparing files with Python difflib

koo*_*gee 18 python text difflib

I am trying to use difflib to produce a diff of two text files containing tweets. Here is the code:

#!/usr/bin/env python

# difflib_test

import difflib

file1 = open('/home/saad/Code/test/new_tweets', 'r')
file2 = open('/home/saad/PTITVProgs', 'r')

diff = difflib.context_diff(file1.readlines(), file2.readlines())
delta = ''.join(diff)
print(delta)

Here is the PTITVProgs text file:

Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI

Here is the new_tweets text file:

Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
@ImranKhanPTI
Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI

Here is the diff I get from the program:

*** 
--- 
***************
*** 1,7 ****
- Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
- Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
- @ImranKhanPTI
- Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
  Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
  CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
  Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
--- 1,3 ----
***************
*** 21,24 ****
  Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
  Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
  Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI--- 17,23 ----
  Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
  Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
  Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
! Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
! Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
! Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI

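As you can quickly see by comparing the two source files (PTITVProgs and new_tweets), the difference between them is the 3 tweets for April 7th and the 3 tweets for April 3rd.

I just want the lines of new_tweets that are not in PTITVProgs to appear in the diff.

Instead it throws in a bunch of text that I don't want to see. I also don't know what the *** 1,7 **** and --- 1,3 ---- lines in the diff output stand for. What is the proper way to get only the changed lines?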

gat*_*tto 28

Just parse the output of the diff like this (change '- ' to '+ ' if you need the lines that are only in the second file):

#!/usr/bin/env python

# difflib_test

import difflib

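# Open both files; the with-block closes them when it ends.
with open('/home/saad/Code/test/new_tweets', 'r') as file1, \
     open('/home/saad/PTITVProgs', 'r') as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())
    # Keep the lines marked '- ' (only in the first file) and drop the 2-character prefix.
    delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print(delta)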
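
One detail worth checking: in the question's output the last line of one hunk is fused with the following header (#PTI--- 17,23 ----), which suggests that new_tweets does not end with a newline. Since readlines() keeps line endings, that final line then compares as different from the otherwise identical line in the other file. A minimal sketch of the same filter with line endings stripped first, using made-up in-memory strings and a hypothetical helper name (lines_only_in_first) in place of the real files:

```python
import difflib

def lines_only_in_first(a_text, b_text):
    """Return the lines that ndiff marks as present only in a_text."""
    # splitlines() drops line endings, so a missing final newline
    # cannot make two otherwise identical lines compare as different.
    a_lines = a_text.splitlines()
    b_lines = b_text.splitlines()
    return [line[2:] for line in difflib.ndiff(a_lines, b_lines)
            if line.startswith('- ')]

# Illustrative data only; 'new' deliberately has no trailing newline.
new = "tweet C\ntweet B\ntweet A"
old = "tweet B\ntweet A\ntweet Z\n"
print(lines_only_in_first(new, old))
```

With real files, passing file1.read() and file2.read() to the helper would play the same role as the readlines() calls above.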


jur*_*eza 18

The difflib library has several diff styles, each produced by its own function: unified_diff, ndiff and context_diff.
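
As a rough illustration of the line-number summaries that context_diff prints, here is a small sketch on two in-memory lists. The names a and b and the file labels are only placeholders for this example:

```python
import difflib

a = ['1', '2', '3', '4', '5']
b = ['1', '3', '4', '5', '6']

# lineterm='' because these lines carry no trailing newlines.
output = list(difflib.context_diff(a, b, fromfile='a', tofile='b', lineterm=''))
print('\n'.join(output))
```

In the result, *** 1,5 **** introduces lines 1 to 5 of the first sequence and --- 1,5 ---- introduces lines 1 to 5 of the second. Inside a hunk, '- ' marks a line found only in the first sequence, '+ ' a line found only in the second, and '! ' a line that changed. A side with nothing to report is printed as a bare header, which is why --- 1,3 ---- in the question's output has no lines under it.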

If you don't want the line-number summaries, the ndiff function gives a Differ-style delta:

import difflib

f1 = '''1
2
3
4
5'''
f2 = '''1
3
4
5
6'''

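# ndiff compares sequences of lines, so split the strings into lines first
diff = difflib.ndiff(f1.splitlines(), f2.splitlines())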

for l in diff:
    print(l)

Output:

  1
- 2
  3
  4
  5
+ 6
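
If only the changed lines matter, another option (a sketch, not the only way) is to ask unified_diff for zero lines of context through its n parameter and then skip the headers. The variable names here are illustrative:

```python
import difflib

a = ['1', '2', '3', '4', '5']
b = ['1', '3', '4', '5', '6']

# n=0 requests zero context lines; lineterm='' because the
# input lines carry no trailing newlines.
diff = list(difflib.unified_diff(a, b, fromfile='a', tofile='b',
                                 n=0, lineterm=''))

# The first two entries are the '---'/'+++' file headers. After them,
# hunk headers start with '@@' and changed lines with '+' or '-'.
changes = [line for line in diff[2:] if line.startswith(('+', '-'))]
print(changes)  # ['-2', '+6']
```

Skipping the file headers by position rather than by prefix avoids dropping a removed line whose own text happens to begin with dashes.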

Edit:

You can also parse the diff to extract only the changes, if that is what you want:

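>>> # ndiff returns a one-shot generator, so build it again here
>>> diff = difflib.ndiff(f1.splitlines(), f2.splitlines())
>>> changes = [l for l in diff if l.startswith('+ ') or l.startswith('- ')]
>>> for c in changes:
...     print(c)
...
- 2
+ 6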