
Creating a tar file is easy if you read the spec (you can look it up on Wikipedia). Not every kind of file is supported (it handles regular files, folders and symlinks) and it generates archives in the original tar file format (path lengths are limited to 100 chars, no extended attributes, ...). It wasn't tested very much but it was a fun hack :) ... I cheated just a little by looking at the python tarfile code from the stdlib for the checksum computation.

A tar file is very simple: it's a list of header/payload pairs, one for each entry (file|folder|symlink) you want to archive. Only regular files have a payload. The header is 512 bytes long and can be written in ascii. Numbers (attributes) need to be written in octal. The file contents themselves need to be written in chunks of 512 bytes, which means you have to fill the last chunk with zeros when the file size is not a multiple of 512 bytes.
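
For instance, the two rules about octal numbers and 512-byte chunks boil down to something like this (just a sketch; the helper names are mine and don't appear in the script below):

def octal_field(n, width):
    # numeric header fields are octal strings, NUL-padded to a fixed width
    return ('%o' % n).ljust(width, '\0')

def padded(data):
    # payloads are written in 512-byte chunks, the last one filled with zeros
    return data + '\0' * ((512 - len(data) % 512) % 512)

# octal_field(420, 8) == '644\0\0\0\0\0' and len(padded('hello')) == 512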

Use it like that:

python batar.py /tmp/foo.tar `find .` &&  tar tf /tmp/foo.tar # or xf if you want to extract it
#!/usr/bin/env python
''' Create tar archives the hard way.

The python tarfile module does this much better, but this is more of
an exercise: it's fun to see how little code is needed to create a
tarball.

Supports only the original tar format.
http://en.wikipedia.org/wiki/Tar_(file_format)
'''

import os, sys
from os.path import getsize, isfile, isdir, islink
from os import lstat

def write_header(f, fn):
    '''
    100	 name	 name of file
    8	 mode	 file mode
    8	 uid	 owner user ID
    8	 gid	 owner group ID
    12	 size	 length of file in bytes
    12	 mtime	 modify time of file
    8	 chksum	 checksum for header
    1	 link	 indicator for links
    100	 linkname	 name of linked file
    '''
    def rpad(s, size):
        L = len(s)
        return s + (size - L) * '\0'
        
    st = lstat(fn) # lstat so a symlink's own metadata is stored, not its target's
    header  = rpad(fn, 100)
    header += rpad('%o' % st.st_mode, 8)
    header += rpad('%o' % st.st_uid, 8)
    header += rpad('%o' % st.st_gid, 8)
    # symlinks and folders carry no payload, so their size field is 0
    size = getsize(fn) if isfile(fn) and not islink(fn) else 0
    header += rpad('%o' % size, 12)
    header += rpad('%o' % int(st.st_mtime), 12)
    header += 8 * '\0' # 8 zeros while the cksum is computed
    # type flag: '2' = symlink, '5' = folder, '0' = regular file
    if islink(fn): header += '2'
    elif isdir(fn): header += '5'
    else: header += '0'
    if islink(fn): header += rpad(os.readlink(fn), 100)
    else: header += 100 * '\0'

    # the checksum part is shamelessly stolen from the tarfile module
    # with a little editing: the field counts as 8 spaces (8 * 0x20 = 256)
    # while the sum of all header bytes is computed
    cksum = 256 + sum(ord(h) for h in header)
    header = rpad(header, 512)
    # write the checksum into its 8-byte field at offset 148 (i.e. 512 - 364)
    header = header[:-364] + '%06o\0' % cksum + header[-357:]

    f.write( header )

def write_body(f, fn):
    fo = open(fn, 'rb')
    data = fo.read()
    fo.close()

    f.write(data)
    # pad the last 512-byte chunk with zeros (nothing to pad if the size
    # is already a multiple of 512)
    zeros = (512 - len(data) % 512) % 512
    f.write(zeros * '\0')

def write(files, out):
    f = open(out, 'wb')

    for fn in files:
        write_header(f, fn)
        if isfile(fn) and not islink(fn): write_body(f, fn)

    f.write(1024 * '\0') # end-of-archive marker: two 512-byte zero blocks
    f.close()

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print 'Usage: batar.py <tar archive> [files | directories]'
    else:
        write(sys.argv[2:], sys.argv[1])
    

I think it's worth looking at:

  1. How to decompress such an archive (a rough reading sketch follows below).
  2. How to implement the GNU extension that allows storing path names of unlimited length.
  3. Making it faster / seeing where the time is spent (building the header with repeated string += is not a good choice).
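
For point 1, here's a rough sketch of how reading such an archive back could look (untested, and it assumes the original-format headers produced by the script above): read a 512-byte header, pull the name and the octal size out of their fixed offsets, then skip the payload rounded up to the next multiple of 512 bytes.

def list_entries(path):
    # walk the header/payload pairs of an original-format tar file
    f = open(path, 'rb')
    while True:
        header = f.read(512)
        if len(header) < 512 or header == 512 * '\0':
            break # end of archive (or one of the trailing zero blocks)
        name = header[0:100].rstrip('\0')
        size = int(header[124:136].rstrip('\0') or '0', 8)
        print name, size
        # the payload is stored in 512-byte chunks, skip it to reach the next header
        f.seek((size + 511) / 512 * 512, 1)
    f.close()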