
Check for broken links in a bunch of static HTML files (Python version)

7 August 2009

OK, this time we're being more Pythonic (:
This small script walks the current directory and its subdirectories looking for .html files (pass a different base directory as the first argument to change that) and tells you which local links are broken.

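#!/usr/bin/env python
# Runs on Python 2 and 3: print() is always called with a single argument.
import os
import sys

import lxml.html

# Counters shared by every checked file.
stats = dict(
    checked=0,
    broken=0,
)

def valid_link(dirpath, link):
    # Checks only local links; anything containing '://' is assumed valid.
    if '://' not in link:
        return os.path.exists(os.path.join(dirpath, link))
    return True

def check(dirpath, filename):
    fullfile = os.path.join(dirpath, filename)
    try:
        # Read bytes so lxml can work out the encoding by itself.
        with open(fullfile, 'rb') as f:
            html = lxml.html.fromstring(f.read())
    except Exception:
        # Unreadable or unparsable file (an empty one, for example):
        # report it and move on to the next file.
        print('[ERROR]')
        print('    file: %s' % fullfile)
        return

    # iterlinks() yields one tuple per link found in the document.
    for element, attribute, link, pos in html.iterlinks():
        stats['checked'] += 1
        if not valid_link(dirpath, link):
            print('[BROKEN]')
            print('    file: %s' % fullfile)
            print('    link: %s' % link)
            stats['broken'] += 1

if __name__ == '__main__':
    basedir = '.'
    if len(sys.argv) == 2:
        basedir = sys.argv[1]
    # os.walk() descends into every subdirectory of basedir.
    for dirpath, dirnames, filenames in os.walk(basedir):
        htmls = filter(
            lambda f: f.endswith('.html'),
            filenames)
        for f in htmls:
            check(dirpath, f)

    print('[STATS]')
    print('    checked: %s' % stats['checked'])
    print('     broken: %s' % stats['broken'])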
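
One limitation of valid_link is worth knowing about: a link such as mailto:someone@example.com or page.html#section contains no '://', so it is treated as a plain local path and reported as broken. Below is a minimal sketch of a more forgiving check, written for Python 3 with only the standard library. The name is_local_link_ok is hypothetical (it is not part of the script above), and skipping links that start with / is an assumption, since the script has no notion of a site root.

```python
import os
from urllib.parse import unquote, urlsplit


def is_local_link_ok(dirpath, link):
    """Return False only for a relative link whose target file is missing."""
    parts = urlsplit(link)
    # Anything with a scheme (http, mailto, javascript, ...) or a host
    # is not a local file, so it is not checked here.
    if parts.scheme or parts.netloc:
        return True
    # A bare fragment or query ("#top", "?page=2") points at the current file.
    if not parts.path:
        return True
    # Assumption: site-absolute paths cannot be resolved without knowing
    # the site root, so they are skipped rather than reported.
    if parts.path.startswith('/'):
        return True
    # Ignore the query and fragment, decode %20 and friends, then look on disk.
    return os.path.exists(os.path.join(dirpath, unquote(parts.path)))
```

It takes the same two arguments as valid_link, so it could be called from check() in its place. Whether skipping absolute paths is the right call depends on how the site is deployed.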

P.S.: lxml rocks.
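
Part of what makes lxml handy here is that iterlinks() reports every link-carrying attribute it knows about, not only the href of anchors. Here is a small sketch, assuming lxml is installed and Python 3 is used. The HTML snippet and the helper name list_links are made up for illustration.

```python
import lxml.html

# Made-up snippet, for illustration only.
SNIPPET = '<div><a href="a.html">A</a><img src="logo.png"></div>'


def list_links(markup):
    """Return (tag, attribute, link) for every link lxml finds in the markup."""
    root = lxml.html.fromstring(markup)
    # Each item from iterlinks() is (element, attribute, link, pos).
    return [(el.tag, attr, link) for el, attr, link, pos in root.iterlinks()]


for tag, attr, link in list_links(SNIPPET):
    print('%s %s=%s' % (tag, attr, link))
```

This is why the script also notices missing images, not just missing pages.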

