The recipe below passes a filename and an argument to grep, returning the stdout and stderr. Each line in the stdout will have its line number prepended.
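    import subprocess

    def grep(filename, arg):
        # -n makes grep prefix each matching line with its line number.
        # stderr must be piped too; otherwise communicate() returns None for it.
        process = subprocess.Popen(['grep', '-n', arg, filename],
                                   stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        return stdout, stderr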
Reading a subset of a text file is a common task. There are many ways of approaching the problem but none as simple as reusing existing tools. Grep is an excellent candidate as it is fast, efficient, and offers great flexibility in selecting the desired content.
Actually, I beg to differ. In most cases I find using the regular expressions module simpler and more reliable than this solution:
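A minimal sketch of what an `re`-based approach along these lines might look like. The function name `grep_re` and its exact behavior are assumptions for illustration, not the commenter's original code:

```python
import re

def grep_re(filename, pattern):
    """Return the lines of `filename` that match the regular expression `pattern`."""
    with open(filename) as handle:
        # re.search matches anywhere in the line, as grep does by default.
        return [line.rstrip('\n') for line in handle if re.search(pattern, line)]
```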
...or to match grep -n:
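A sketch of a `grep -n` style variant, again under assumptions: `grep_n` is a hypothetical name, and returning `(line_number, line)` tuples is one plausible reading of the description.

```python
import re

def grep_n(filename, pattern):
    """Return (line_number, line) tuples for matching lines, numbered from 1 like `grep -n`."""
    with open(filename) as handle:
        return [(number, line.rstrip('\n'))
                for number, line in enumerate(handle, start=1)
                if re.search(pattern, line)]
```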
The advantage of doing this is that you get native Python objects in return, and the line number is separated from the line content.
I second Bard. You should write your own 'grep' function rather than calling it through subprocess. The benefits: it will be cross-platform (grep is generally only available out of the box on *nix-based systems), and your own function will be more flexible. Using optparse (or maybe argparse, which will be added to the stdlib in 2.7), you can clone grep completely, with all its flags. The biggest benefit is the knowledge you will acquire, not to mention ending up with a full-fledged 'grep' written in Python that can run on any operating system.
Thank you for your insightful comments. The primary motivation behind using grep is that I have a number of 200+ megabyte log files to process, and I was concerned about speed and memory. I wagered that using grep would be faster than looping through the file using Python's re module. It would be interesting to benchmark my recipe against Bard's on a large text file.
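One rough way to run such a comparison is a small timing helper. This is only a sketch: `time_call` is a hypothetical name, and it times a single call rather than doing a proper repeated benchmark.

```python
import timeit

def time_call(func, *args):
    """Run func(*args) once and return (elapsed_seconds, result)."""
    start = timeit.default_timer()
    result = func(*args)
    elapsed = timeit.default_timer() - start
    return elapsed, result
```

Calling `time_call(grep, path, pattern)` and then timing a pure-Python alternative on the same file gives a first impression. A single run is noisy, and whichever function runs second may benefit from the file already being in the OS cache, so the runs should be repeated and their order alternated.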
Daniel, you're absolutely correct to be afraid of gigantic files. And, if you're storing the results in a list, as Bard is, you could run into memory issues. This is a PERFECT application for generators!
(Also, you can make the regex searches a bit faster by pre-compiling them with the re.compile function.)
Here's a quick version I hacked up:
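A minimal sketch of a generator-based version with a pre-compiled pattern, as described above. The name `grep_iter` and the `(line_number, line)` tuples it yields are assumptions, not the commenter's original code:

```python
import re

def grep_iter(filename, pattern):
    """Lazily yield (line_number, line) for each line matching `pattern`."""
    regex = re.compile(pattern)  # compile once, reuse for every line
    with open(filename) as handle:
        # Iterating over the file object reads one line at a time,
        # so memory use does not grow with the size of the file.
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                yield number, line.rstrip('\n')
```

Because nothing is collected into a list, results can be consumed one at a time with `for number, line in grep_iter(path, pattern)`.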
Output should look like this: