This is a simple recipe: it reads each line of one file, strips the line ending (and any other trailing whitespace), and searches a second file for the same line at any position.
If a line from the first file is missing from the second, its line number is printed to stdout.
import sys
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("file1", help="First file whose lines you want to check")
parser.add_argument("file2", help="Second file, in which you want to search for lines from first file")
args = parser.parse_args()

file1 = open(args.file1)
file2 = open(args.file2)

# print(...) with a single argument runs on both Python 2 and Python 3
print("Comparing:")
print(args.file1)
print("and")
print(args.file2)
print("")
print("Attempting to find lines in *file1* that are missing in *file2*")
print("")

# Read both files fully into memory, then release the file handles
file1array = file1.readlines()
file2a = file2.readlines()
file1.close()
file2.close()

lengthfile1array = len(file1array)
j = 0
for file1item in file1array:
    j += 1
    # Progress indicator, overwritten in place by the carriage return
    sys.stdout.write("Checking line#: %d/%d \r" % (j, lengthfile1array))
    # Scan file2 for an equal line, ignoring trailing whitespace
    for file2item in file2a:
        if file1item.rstrip() == file2item.rstrip():
            break
    else:
        # The inner loop ended without a break, so no match was found
        print("MISSING LINE FOUND at Line# " + str(j))
This recipe is useful if you have a large amount of line-by-line data, e.g. telecom network CDRs.
I wrote this in under an hour and this is NOT optimized - there may be lots of ways to improve this syntactically and performance-wise
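On the performance point: the recipe compares every line of the first file against every line of the second, so the work grows with the product of the two line counts. One possible improvement, sketched here rather than taken from the recipe, is to put the stripped lines of the second file in a set so each lookup is a hash check. The function name `find_missing_lines` and the sample data are hypothetical, and the sketch assumes both files fit in memory, as the recipe already does.

```python
def find_missing_lines(lines1, lines2):
    """Return 1-based numbers of lines in lines1 that have no match in lines2.

    Lines are compared after rstrip(), as in the recipe, so line endings
    and other trailing whitespace are ignored.
    """
    # Build the lookup set once instead of rescanning lines2 for every line
    seen = {line.rstrip() for line in lines2}
    return [n for n, line in enumerate(lines1, 1) if line.rstrip() not in seen]


if __name__ == "__main__":
    # Hypothetical sample data standing in for file1.readlines() / file2.readlines()
    first = ["alpha\n", "beta\n", "gamma\n"]
    second = ["gamma\r\n", "alpha\n"]
    for n in find_missing_lines(first, second):
        print("MISSING LINE FOUND at Line# " + str(n))
```

With the sample data above, only line 2 ("beta") is reported. The trade-off is extra memory for the set, which holds one copy of each distinct line of the second file.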
Edit: removed extraneous print statements
I had a similar thing built a couple of months ago, which used generators and regex. Sharing the code anyway...
https://github.com/rebx/crosscheck
Thanks rebs. By the way, I am a bit bummed out that ActiveState doesn't let me edit my recipes.
It does: if you are logged in, scroll down the right-hand column to "edit this recipe".
thanks david