How to diff two files using a Python generator

Non*_*ons 6 python generator range

I have a 100 GB file containing the numbers 1 to 1000000000000, one per line. Some lines are missing, such as 5, 11, 19919, etc. My machine has 8 GB of RAM.

How can I find the missing numbers?

My idea is to generate a second file with for i in range(1,1000000000000) and read the lines one by one using a generator. Can we use the yield statement for this?

Can you help with writing the code?

My code is below. It reads each file into a set in memory; can it be used in production?

def difference(a,b):
    with open(a,'r') as f:
        aunique=set(f.readlines())


    with open(b,'r') as f:
        bunique=set(f.readlines())

    with open('c','a+') as f:
        for line in list(bunique - aunique):
            f.write(line)

ilm*_*acs 6

If the values are in sequential order, you can simply note the previous value and see if the difference equals one:

prev = 0
with open('numbers.txt','r') as f:
    for line in f:
        value = int(line.strip())
        for i in range(prev, value-1):
            print('missing:', i+1)
        prev = value
# output numbers that are missing at the end of the file (see comment by @blhsing)
for i in range(prev, 1000000000000):
    print('missing:', i+1)

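This should work fine in Python 3, because iterating over a file object yields one line at a time, so the full file is never loaded or kept in memory. (Calling readlines(), as in the question, would instead read every line into a list.)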
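Since the question asks about yield, the same gap-detection idea can be wrapped in a generator. This is only a minimal sketch: the function name missing_numbers and the upper parameter are made up for illustration, and it assumes the input is sorted in ascending order with one integer per line.

```python
from typing import Iterable, Iterator


def missing_numbers(lines: Iterable[str], upper: int) -> Iterator[int]:
    """Yield every integer in 1..upper that does not appear in `lines`.

    Assumption: `lines` is sorted ascending and holds one integer per line.
    """
    prev = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = int(line)
        # Every number strictly between prev and value is missing.
        yield from range(prev + 1, value)
        prev = value
    # Numbers missing after the last line of the input.
    yield from range(prev + 1, upper + 1)


# Small in-memory demo with a target range of 1 to 10.
sample = ["1\n", "2\n", "4\n", "7\n"]
print(list(missing_numbers(sample, 10)))  # [3, 5, 6, 8, 9, 10]
```

Passing an open file object instead of the list (for example, the f from with open('numbers.txt') as f) should behave the same way, since the file is consumed one line at a time and only prev is kept between lines.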


blh*_*ing 5

You can iterate over all the numbers generated by range and keep comparing the number to the next number in the file. Output the number if it's missing, or read the next number for the next match:

with open('numbers') as f:
    next_number = 0
    for n in range(1000000000001):
        if n == next_number:
            next_number = int(next(f, 0))
        else:
            print(n)

Demo (assuming target numbers from 1 to 10 instead): https://repl.it/repls/WaterloggedUntimelyCoding
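To try the range-based approach locally without a large file, here is a rough sketch that uses io.StringIO in place of the real file and a target range of 1 to 10. The function name missing_by_range and the upper parameter are hypothetical, and the sketch assumes sorted input with no blank lines.

```python
import io
from typing import Iterator, TextIO


def missing_by_range(f: TextIO, upper: int) -> Iterator[int]:
    """Walk 1..upper and yield each number not matched by the next line of f.

    Assumption: f is sorted ascending, one integer per line, no blank lines.
    """
    # next(f, 0) returns 0 once the file is exhausted; 0 never matches n >= 1,
    # so every remaining number is reported as missing.
    next_number = int(next(f, 0))
    for n in range(1, upper + 1):
        if n == next_number:
            next_number = int(next(f, 0))
        else:
            yield n


# In-memory stand-in for the numbers file.
fake_file = io.StringIO("1\n2\n4\n7\n")
print(list(missing_by_range(fake_file, 10)))  # [3, 5, 6, 8, 9, 10]
```

Only one line of the file and the current counter are held at any time, so memory use does not grow with the size of the file.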