Creating a tar file is easy if you read the spec (you can look it up on Wikipedia). Not every kind of file is supported (it supports regular files, folders and symlinks), and it generates archives in the original tar file format (path lengths are limited to 100 characters, no extended attributes, ...). It wasn't tested very much, but it was a fun hack :) ... I cheated just a little by looking at the Python tarfile code from the stdlib for the checksum computation.
A tar file is very simple: it's a sequence of header/payload pairs, one for each entry (file|folder|symlink) you want to archive, and only regular files have a payload (their contents). The header is 512 bytes long and can be written in ASCII; numeric attributes need to be written in octal. The file contents themselves need to be written in chunks of 512 bytes, which means you have to pad the last chunk with zeros when the file size is not a multiple of 512 bytes.
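For example (a tiny sketch of those two rules, the variable names are mine): a 7000-byte file has its size written as the octal string '15530' and needs 168 extra zero bytes to fill its last 512-byte chunk:

size = 7000
size_field = '%o' % size            # '15530', goes into the 12-byte size field
padding = (512 - size % 512) % 512  # 168 zero bytes after the contents
print size_field, padding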
Use it like this:
python batar.py /tmp/foo.tar `find .` && tar tf /tmp/foo.tar # or xf if you want to extract it
#!/usr/bin/env python
''' Create tar archives the hard way.

The python tarfile module does that much better, but it's more of
an exercise, and it's fun to see how little code is needed to
create a tarball.

Supports only the original tar format.
http://en.wikipedia.org/wiki/Tar_(file_format)
'''
import os, sys
from os.path import getsize, isfile, isdir, islink
from os import stat


def write_header(f, fn):
    '''
    100 name     name of file
    8   mode     file mode
    8   uid      owner user ID
    8   gid      owner group ID
    12  size     length of file in bytes
    12  mtime    modify time of file
    8   chksum   checksum for header
    1   link     indicator for links
    100 linkname name of linked file
    '''
    def rpad(s, size):
        # right-pad s with NUL bytes up to the field size
        return s + (size - len(s)) * '\0'
    header = rpad(fn, 100)
    header += rpad('%o' % stat(fn).st_mode, 8)
    header += rpad('%o' % stat(fn).st_uid, 8)
    header += rpad('%o' % stat(fn).st_gid, 8)
    # only regular files carry a payload, so only they get a real size
    size = getsize(fn) if isfile(fn) and not islink(fn) else 0
    header += rpad('%o' % size, 12)
    header += rpad('%o' % int(stat(fn).st_mtime), 12)
    header += 8 * '\0'  # 8 zeros while the cksum is computed
    if islink(fn): header += '2'
    elif isfile(fn): header += '0'
    elif isdir(fn): header += '5'
    if islink(fn): header += rpad(os.readlink(fn), 100)
    else: header += 100 * '\0'
    # the checksum part is shamelessly stolen from the tarfile module
    # with a little editing
    cksum = 256 + sum(ord(h) for h in header)
    header = rpad(header, 512)
    header = header[:-364] + '%06o\0' % cksum + header[-357:]
    f.write(header)


def write_body(f, fn):
    fo = open(fn, 'rb')
    data = fo.read()
    fo.close()
    f.write(data)
    # pad the last chunk with zeros up to a multiple of 512 bytes
    zeros = (512 - len(data) % 512) % 512
    f.write(zeros * '\0')


def write(files, out):
    f = open(out, 'wb')
    for fn in files:
        write_header(f, fn)
        if isfile(fn) and not islink(fn): write_body(f, fn)
    f.write(1024 * '\0')  # end-of-archive marker: two zero-filled blocks
    f.close()


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print 'Usage: batar.py <tar archive> [files | directories]'
    else:
        write(sys.argv[2:], sys.argv[1])
I think it's worth looking at:
- How to read such an archive back (a rough sketch of a reader follows below).
- How to implement the GNU extension that allows storing path names of unlimited length (see the @LongLink sketch below).
- How to make it faster / see where the time is spent (growing the header with a chain of string += is not a great choice; see the last sketch below).
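The first point is mostly the writer run backwards. Here is a rough, untested sketch of a reader that walks the archive 512 bytes at a time and only lists names and sizes (the offsets 124:136 come from the field table in the write_header docstring above):

import sys

def list_entries(path):
    ''' Print the name and size of each entry in a tar archive. '''
    f = open(path, 'rb')
    while True:
        header = f.read(512)
        if len(header) < 512 or header[0] == '\0':
            break  # end of file, or the trailing zero blocks
        name = header[:100].rstrip('\0')
        size = int(header[124:136].rstrip('\0 ') or '0', 8)
        print name, size
        # skip the payload, rounded up to a whole number of 512-byte chunks
        f.seek((size + 511) // 512 * 512, 1)
    f.close()

if __name__ == '__main__':
    list_entries(sys.argv[1])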
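For the second point, GNU tar stores a long name as an extra fake entry: a header whose name is ././@LongLink and whose typeflag is 'L', followed by the full path as payload, then the real header with the name truncated to 100 characters. A rough sketch, assuming a hypothetical build_header(name, size, typeflag) helper (i.e. the header-building part of write_header factored out so it can be reused):

def write_long_name(f, fn, build_header):
    ''' Sketch of the GNU @LongLink extension (untested).
    build_header(name, size, typeflag) is assumed to return a full
    512-byte header with a valid checksum. '''
    data = fn + '\0'  # GNU tar stores the trailing NUL as well
    f.write(build_header('././@LongLink', len(data), 'L'))
    pad = (512 - len(data) % 512) % 512
    f.write(data + pad * '\0')
    # ... then write the normal header with fn truncated to 100 chars,
    # followed by the file contents as before.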
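And for the last point, the usual fix for a chain of string += is to collect the pieces in a list and join them once. A sketch of what write_header could look like (it assumes rpad has been moved to module level; it also calls stat() only once per entry, where the current code calls it once per field):

def build_header_fast(fn):
    ''' Gather the header fields in a list and join them once,
    instead of growing a string with repeated concatenation. '''
    st = stat(fn)
    parts = [rpad(fn, 100),
             rpad('%o' % st.st_mode, 8),
             rpad('%o' % st.st_uid, 8),
             rpad('%o' % st.st_gid, 8)]
    # ... the remaining fields exactly as in write_header ...
    return ''.join(parts)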