Check for broken links in a bunch of static HTML files (Python version)

7 August 2009

OK, this time we are more Pythonic (:
This small script walks the current directory and its subdirectories looking for .html files (you can change the base directory by passing it as the first argument to the script) and tells you which local links are broken.

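#!/usr/bin/env python3
import os
import sys
from urllib.parse import unquote, urlparse

import lxml.html

# running totals, updated by check()
stats = dict(
    checked = 0,
    broken = 0
)

def valid_link(dirpath, link):
    # checks only local links: anything with a scheme (http:, mailto:, ...)
    # or a host (//example.com/x) is assumed to be fine
    try:
        parts = urlparse(link)
    except ValueError:
        # malformed URL, e.g. an unbalanced '['
        return False
    if parts.scheme or parts.netloc:
        return True
    if not parts.path:
        # bare '#fragment' or '?query': points back at the file itself
        return True
    # drop the fragment/query and decode %20 & co. before hitting the disk
    return os.path.exists(os.path.join(dirpath, unquote(parts.path)))

def check(dirpath, filename):
    fullfile = os.path.join(dirpath, filename)
    try:
        # read bytes so lxml can honour the encoding declared in the file
        with open(fullfile, 'rb') as f:
            html = lxml.html.fromstring(f.read())
    except Exception:
        # unreadable or unparsable file (an empty one, for instance)
        print('[ERROR]')
        print('    file: %s' % fullfile)
        return

    # iterlinks() yields every link-carrying attribute: href, src, ...
    for element, attribute, link, pos in html.iterlinks():
        stats['checked'] += 1
        if not valid_link(dirpath, link):
            print('[BROKEN]')
            print('    file: %s' % fullfile)
            print('    link: %s' % link)
            print()
            stats['broken'] += 1

if __name__ == '__main__':
    basedir = '.'
    if len(sys.argv) == 2:
        basedir = sys.argv[1]
    # os.walk descends into every subdirectory of basedir
    for dirpath, dirnames, filenames in os.walk(basedir):
        htmls = [f for f in filenames if f.endswith('.html')]
        for f in htmls:
            check(dirpath, f)

    print('[STATS]')
    print('    checked: %s' % stats['checked'])
    print('     broken: %s' % stats['broken'])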
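
One thing to keep in mind: valid_link joins each local link onto the directory of the file that contains it, and os.path.join throws that directory away when the link starts with a slash. A root-relative link like /css/site.css therefore gets looked up at the root of your filesystem, not of your site. Here is a minimal sketch of one way around it, assuming the base directory you pass to the script is also the document root of the site (the post never says so, that is my assumption). The helper resolve_local is hypothetical and not part of the script above; it uses only the standard library.

```python
import os
import tempfile
from urllib.parse import unquote, urlparse


def resolve_local(basedir, dirpath, link):
    """Map a local link to a filesystem path, or return None if not local.

    Hypothetical helper, not part of the original script.
    Assumes `basedir` is the document root of the site.
    """
    parts = urlparse(link)
    if parts.scheme or parts.netloc or not parts.path:
        # remote URL, mailto:, or a bare #fragment / ?query
        return None
    # urlparse has already split off any #fragment or ?query
    path = unquote(parts.path)
    if path.startswith('/'):
        # root-relative: anchor at the site root, not the filesystem root
        return os.path.normpath(os.path.join(basedir, path.lstrip('/')))
    # plain relative link: resolve against the linking file's directory
    return os.path.normpath(os.path.join(dirpath, path))


if __name__ == '__main__':
    # Tiny demo on a throwaway site: <site>/style.css and <site>/docs/
    with tempfile.TemporaryDirectory() as site:
        docs = os.path.join(site, 'docs')
        os.makedirs(docs)
        open(os.path.join(site, 'style.css'), 'w').close()
        for link in ('/style.css', '../style.css', '/missing.css'):
            target = resolve_local(site, docs, link)
            print(link, os.path.exists(target))
```

To try it in the script you would pass basedir down to check() and test os.path.exists() on the returned path, treating None as "not a local link, skip it".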

P.S.: lxml rocks.
