An extension to Arthur de Jong's excellent webcheck tool (a website link checker, http://arthurdejong.org/webcheck) that reads the resulting webcheck.dat file and writes a CSV file.

Python, 28 lines
from __future__ import with_statement

import csv
import os

# this module is included as part of webcheck.
import serialize

FILENAME = 'my_site_links_as_csv.csv'
DATFILE = 'my_site/webcheck.dat'

if __name__ == '__main__':
    # use webcheck's serialize module to rebuild the site object from the dat file.
    with open(DATFILE, 'r') as datfile:
        site = serialize.deserialize(datfile)

    # Python 2's csv module expects the output file to be opened in binary mode.
    with open(FILENAME, 'wb') as sitecsv:
        writer = csv.writer(sitecsv)

        writer.writerow(("path", "extension", "internal", "error"))
        # the site object has a dictionary mapping each URI to a link object.
        writer.writerows(
            (uri,
             os.path.splitext(link.path)[-1],
             link.isinternal,
             ' '.join(link.linkproblems))
            for (uri, link) in site.linkMap.iteritems())

Arthur's webcheck recursively checks the links on a parent webpage and reports on dead and alive links in a pretty HTML format (http://arthurdejong.org/webcheck/demo/).

This tool takes webcheck's dat file and uses webcheck's serialize module to turn a subset of its information into a CSV file. This might be helpful if you want a CSV list of all the PDF files on a client website.
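As a rough sketch of the PDF use case, the generated file could be filtered on its extension column. The helper name pdf_paths is hypothetical and not part of the recipe or of webcheck; it only assumes the "path" and "extension" headings that the recipe writes.

```python
import csv


def pdf_paths(csvfile):
    """Return the 'path' column of every row whose extension is .pdf.

    csvfile is any iterable of CSV lines (for example an open file)
    whose first line is the header row written by the recipe.
    """
    reader = csv.DictReader(csvfile)
    # Compare case-insensitively so that '.PDF' links are included too.
    return [row["path"] for row in reader
            if row["extension"].lower() == ".pdf"]
```

Passing it the opened my_site_links_as_csv.csv file should give a plain list of PDF URLs, which can then be printed or written out again.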

The CSV file has the following headings:

path, extension, internal, error

path: The URI of a link that was found somewhere on the parent site.

extension: The file extension of the URL (.html, .pdf, ...).

internal: True if the link is an internal link.

error: A description of the link's problems, if any occurred (a 404 error, etc.).

Note that you must run webcheck first, so that its resulting .dat file exists, for this to work!
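The extension column comes from os.path.splitext applied to the link's path, so a link with no dot in its last path segment gets an empty extension. A small illustration, where extension_of is a hypothetical helper name and the paths are made-up examples:

```python
import os.path


def extension_of(url_path):
    """Return the extension the recipe would record for a URL path."""
    # splitext splits at the last dot of the final path component;
    # the second element is the extension, including the leading dot.
    return os.path.splitext(url_path)[-1]


print(extension_of("/recipes/avatar.jpg"))                   # .jpg
print(extension_of("/archive.tar.gz"))                       # .gz
print(repr(extension_of("/recipes/577602-webcheck-site-to-csv")))  # ''
```

Only the last suffix is kept, so a name such as archive.tar.gz is reported as .gz rather than .tar.gz.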

Having run this recipe, you might be treated to output that looks like this:

path,extension,internal,error
http://code.activestate.com/recipes/577602-webcheck-site-to-csv,,True,
http://code.activestate.com/recipes/avatar.jpg,.jpg,True,404 Error