
Splits a large text file into smaller ones, based on line count. Original file is unmodified.

Resulting text files are stored in the same directory as the original file.

Useful for breaking up text-based logs or blocks of email logins into smaller parts.

Python, 42 lines
"""splits a large text file into smaller ones, based on line count

Original is left unmodified.

Resulting text files are stored in the same directory as the original file.

Useful for breaking up text-based logs or blocks of login credentials.

"""

import os

def split_file(filepath, lines_per_file=100):
    """Split the file at `filepath` into sub-files of `lines_per_file` lines.

    Each sub-file is named after the index of its first line, e.g.
    `name_0.ext`, `name_100.ext`, ... An empty input produces no sub-files.
    """
    lpf = lines_per_file
    path, filename = os.path.split(filepath)
    name, ext = os.path.splitext(filename)
    w = None  # current output file; None until the first line is read
    with open(filepath, 'r') as r:
        try:
            for i, line in enumerate(r):
                if not i % lpf:
                    # possible enhancement: don't check modulo lpf on each pass;
                    # keep a counter variable, and reset on each checkpoint lpf.
                    if w is not None:
                        w.close()
                    filename = os.path.join(path,
                                            '{}_{}{}'.format(name, i, ext))
                    w = open(filename, 'w')
                w.write(line)
        finally:
            if w is not None:
                w.close()

def test():
    """Demonstrates split_file() on a generated 10,000-line file."""
    testpath = "/tmp/test_split_file/"
    if not os.path.exists(testpath):
        os.mkdir(testpath)
    testfile = os.path.join(testpath, "test.txt")
    with open(testfile, 'w') as w:
        for i in range(1, 10001):
            w.write("email{}@myserver.net\tb4dpassw0rd{}\n".format(i, i))
    # produces test_0.txt, test_1000.txt, ..., test_9000.txt in testpath
    split_file(testfile, 1000)

This is a very simple recipe, changed to be compatible with Python 3.
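The enhancement mentioned in the recipe's code comment (not checking the modulo on every line) can be approached in more than one way. What follows is only a sketch of one option, not the author's method: it uses `itertools.islice` to pull `lines_per_file` lines at a time from the open file. The function name `split_file_chunked` and the returned list of output paths are assumptions added for illustration. Unlike the recipe, it holds one chunk of lines in memory at a time.

```python
import itertools
import os


def split_file_chunked(filepath, lines_per_file=100):
    """Hypothetical variant of split_file() that reads one chunk at a time.

    Returns the list of output paths (added here for illustration only).
    """
    path, filename = os.path.split(filepath)
    name, ext = os.path.splitext(filename)
    written = []
    with open(filepath, 'r') as r:
        start = 0  # index of the first line of the current chunk
        while True:
            # islice takes at most lines_per_file lines from the file iterator
            chunk = list(itertools.islice(r, lines_per_file))
            if not chunk:
                break  # input exhausted
            out_path = os.path.join(path, '{}_{}{}'.format(name, start, ext))
            with open(out_path, 'w') as w:
                w.writelines(chunk)
            written.append(out_path)
            start += len(chunk)
    return written
```

The output names follow the same scheme as the recipe (the suffix is the index of the chunk's first line), so the two versions should produce the same set of files for a non-empty input.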

3 comments

Rogier Steehouder 12 years, 2 months ago

You do not need to know the number of lines in the input file. Reading the input file twice is a waste.

Filenames can contain more than one period.
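A small illustration of the point about periods, using a made-up filename: splitting on every period breaks the name into more than two parts, while `os.path.splitext` splits only at the last period.

```python
import os

filename = 'server.access.log'  # made-up example name with two periods

# Splitting on every period yields three parts, not (name, extension)
parts = filename.split('.')

# os.path.splitext splits only at the last period
basename, ext = os.path.splitext(filename)

print(parts)          # ['server', 'access', 'log']
print(basename, ext)  # server.access .log
```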

File objects are iterators. With the enumerate() function, you do not need xrange() very often.

#!/usr/bin/env python3
import os

def split_file(filepath, lines=100):
    """Split a file based on a number of lines."""
    path, filename = os.path.split(filepath)
    # filename.split('.') would not work for filenames with more than one .
    basename, ext = os.path.splitext(filename)
    # open input file
    with open(filepath, 'r') as f_in:
        try:
            # open the first output file
            f_out = open(os.path.join(path, '{}_{}{}'.format(basename, 0, ext)), 'w')
            # loop over all lines in the input file, and number them
            for i, line in enumerate(f_in):
                # every time the current line number can be divided by the
                # wanted number of lines, close the output file and open a
                # new one
                if i % lines == 0:
                    f_out.close()
                    f_out = open(os.path.join(path, '{}_{}{}'.format(basename, i, ext)), 'w')
                # write the line to the output file
                f_out.write(line)
        finally:
            # close the last output file
            f_out.close()

if __name__ == '__main__':
    with open('split_file.txt', 'w') as f:
        for x in range(950):
            f.write('{}\n'.format(x))
    split_file('split_file.txt')

Andrew Yurisich (author) 12 years, 2 months ago

@Rogier are you implying that I was enumerating over the file? ;) I will certainly incorporate your knowledge into my submission once I reach a proper client (I'm on the smartphone atm).

Patrick 8 years, 2 months ago