The recipe below passes a filename and an argument to grep, returning the stdout and stderr. Each line in the stdout will have its line number prepended.

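Python

import subprocess

def grep(filename, arg):
    # -n makes grep prefix each matching line with its line number.
    # stderr must be piped as well; otherwise communicate() returns None for it.
    process = subprocess.Popen(['grep', '-n', arg, filename],
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    return stdout, stderr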

Reading a subset of a text file is a common task. There are many ways of approaching the problem but none as simple as reusing existing tools. Grep is an excellent candidate as it is fast, efficient, and offers great flexibility in selecting the desired content.
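Since the recipe returns grep's raw output, the caller still has to separate each line number from its line. A minimal sketch of that step follows, assuming a single file was searched (so each output line looks like "number:text") and that the output is UTF-8 bytes, as it is under Python 3. The helper name parse_grep_output is hypothetical and not part of the recipe.

```python
def parse_grep_output(stdout):
    """Split `grep -n` output (bytes) into (line_number, text) tuples.

    Assumes a single input file, so no filename prefix precedes the number.
    """
    results = []
    for raw in stdout.decode('utf-8', errors='replace').splitlines():
        # Split on the first colon only; the matched text may contain colons.
        number, _, text = raw.partition(':')
        results.append((int(number), text))
    return results


sample = b"3:import re\n10:    grepper = re.compile(pattern)\n"
print(parse_grep_output(sample))
# [(3, 'import re'), (10, '    grepper = re.compile(pattern)')]
```

In practice the bytes would come from the first element of the tuple returned by grep(filename, arg).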

5 comments

Bard Aase 11 years, 9 months ago

Actually, I beg to differ. In most cases I find using the regular expressions module simpler and more reliable than this solution:

import re
def grep(pattern,fileObj):
  r=[]
  for line in fileObj:
    if re.search(pattern,line):
      r.append(line)
  return r

Bard Aase 11 years, 9 months ago

...or to match grep -n:

import re
def grep(pattern,fileObj):
  r=[]
  linenumber=0
  for line in fileObj:
    linenumber +=1
    if re.search(pattern,line):
      r.append((linenumber,line))
  return r

The advantage of doing this is that you get native Python objects in return, and the line number is separated from the line content.

Shashwat Anand 11 years, 9 months ago

I second Bard. You should write your own 'grep' function rather than calling grep through subprocess. The main benefit is that it will be cross-platform (grep is normally available only on *nix-based systems), and your own function can also be made more flexible. Using optparse (or maybe argparse, which will be added to the stdlib in 2.7), you can clone grep completely, with all its flags. The biggest benefit is the knowledge you will acquire, not to mention ending up with a full-fledged 'grep' written in Python that can run on any operating system.
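As a rough illustration of that suggestion, here is a sketch of a command-line front end built on argparse. It covers only two flags (-n and -i), and the function names build_parser, search and main are made up for this example. Note that Python's re syntax is not identical to grep's default basic regular expressions, so this is an approximation rather than a true clone.

```python
import argparse
import re
import sys


def build_parser():
    """Describe a small, grep-like command line (only -n and -i are supported)."""
    parser = argparse.ArgumentParser(description='Minimal grep-like search (sketch).')
    parser.add_argument('pattern', help='regular expression (Python re syntax)')
    parser.add_argument('files', nargs='+', help='files to search')
    parser.add_argument('-n', '--line-number', action='store_true',
                        help='prefix each match with its 1-based line number')
    parser.add_argument('-i', '--ignore-case', action='store_true',
                        help='match case-insensitively')
    return parser


def search(pattern, lines, ignore_case=False):
    """Yield (1-based line number, line) for every line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and print the matching lines."""
    args = build_parser().parse_args(argv)
    for path in args.files:
        with open(path, errors='replace') as handle:
            for number, line in search(args.pattern, handle, args.ignore_case):
                prefix = '%d:' % number if args.line_number else ''
                sys.stdout.write(prefix + line)
```

Calling main() from a script's entry point reads the flags from the command line; passing a list such as main(['-n', 'error', 'app.log']) runs the same search from other code (app.log is a placeholder filename).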

Daniel Cohn (author) 11 years, 9 months ago

Thank you for your insightful comments. The primary motivation behind using grep is that I have a number of 200+ megabyte log files to process and I was concerned about speed and memory. I wagered that using grep would be faster than looping through the file using Python's re module. It would be interesting to benchmark my recipe against Bard's for a large text file.
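One way such a benchmark could be set up is sketched below. It only counts matching lines with each approach and times the call, so it measures search speed rather than memory use, and it reports no results of its own. The helper names (time_call, count_with_re, count_with_grep, benchmark) are hypothetical, the grep timing is skipped when no grep executable is found on the PATH, and the pattern is assumed to mean the same thing in both regex dialects.

```python
import re
import shutil
import subprocess
import time


def time_call(func, *args):
    """Return (elapsed seconds, result) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def count_with_re(path, pattern):
    """Count matching lines by looping over the file with a compiled regex."""
    regex = re.compile(pattern)
    with open(path, errors='replace') as handle:
        return sum(1 for line in handle if regex.search(line))


def count_with_grep(path, pattern):
    """Count matching lines with `grep -c`, which prints only the count."""
    completed = subprocess.run(['grep', '-c', '-e', pattern, path],
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return int(completed.stdout.decode().strip() or 0)


def benchmark(path, pattern):
    """Time both approaches on one file; grep is skipped if it is not installed."""
    timings = {'re': time_call(count_with_re, path, pattern)}
    if shutil.which('grep'):
        timings['grep'] = time_call(count_with_grep, path, pattern)
    return timings
```

A call such as benchmark('big.log', 'ERROR') (both arguments are placeholders) returns a dictionary mapping each approach to its elapsed time and match count. Repeating the run several times would reduce the effect of disk caching on the comparison.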

Matthew Wood 11 years, 9 months ago

Daniel, you're absolutely correct to be afraid of gigantic files. And, if you're storing the results in a list, as Bard is, you could run into memory issues. This is a PERFECT application for generators!

(Also, you can make the regex searches a bit faster by pre-compiling them with the re.compile function.)

Here's a quick version I hacked up:

#!/usr/bin/env python

import re

def grep(pattern, file_obj, include_line_nums=False):
    grepper = re.compile(pattern)
    for line_num, line in enumerate(file_obj):
        if grepper.search(line):
            if include_line_nums:
                yield (line_num, line)
            else:
                yield line

if __name__ == '__main__':
    import sys
    for elem in grep('re', file(sys.argv[0])):
        print repr(elem)
    print '%' * 30
    for elem in grep('re', file(sys.argv[0]), True):
        print repr(elem)

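Output should look like this (the line numbers are zero-based because enumerate starts counting at 0, unlike grep -n, which starts at 1):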

'import re\n'
'def grep(pattern, file_obj, include_line_nums=False):\n'
'    grepper = re.compile(pattern)\n'
'        if grepper.search(line):\n'
"    for elem in grep('re', file(sys.argv[0])):\n"
'        print repr(elem)\n'
"    for elem in grep('re', file(sys.argv[0]), True):\n"
'        print repr(elem)\n'
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
(2, 'import re\n')
(4, 'def grep(pattern, file_obj, include_line_nums=False):\n')
(5, '    grepper = re.compile(pattern)\n')
(7, '        if grepper.search(line):\n')
(15, "    for elem in grep('re', file(sys.argv[0])):\n")
(16, '        print repr(elem)\n')
(18, "    for elem in grep('re', file(sys.argv[0]), True):\n")
(19, '        print repr(elem)\n')
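The version above is written for Python 2 (it uses file() and the print statement). A possible Python 3 adaptation of the same generator idea is sketched here. The start parameter is an addition for this example, not part of Matthew's code; it defaults to 0 to keep his numbering, and passing 1 gives grep -n style numbers.

```python
import io
import re


def grep(pattern, lines, include_line_nums=False, start=0):
    """Lazily yield matching lines, optionally paired with their line numbers."""
    regex = re.compile(pattern)  # compile once, reuse for every line
    for line_num, line in enumerate(lines, start):
        if regex.search(line):
            yield (line_num, line) if include_line_nums else line


# io.StringIO stands in for an open log file here.
log = io.StringIO("alpha\nbeta error\ngamma\nerror again\n")
for hit in grep("error", log, include_line_nums=True, start=1):
    print(hit)
# (2, 'beta error\n')
# (4, 'error again\n')
```

Because the function is a generator and the file object is read one line at a time, only the current line is held in memory, which is the property that matters for the 200+ megabyte log files mentioned earlier.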