May 3, 2010

Python: extract all hyperlinks from a webpage

I know there must be thousands of programs that do this, but I thought I would give it a try. It's pretty easy. I will keep editing this as I improve my regular expression.

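import urllib as ul
import BeautifulSoup as bs
import re

# Written for Python 2 with BeautifulSoup 3
myFile = ul.urlopen("http://www.sfbay.craigslist.org/roo/")
soup = bs.BeautifulSoup(myFile)
#print soup.prettify()
for anchor in soup.findAll("a"):
    #print re.match("href", anchor)
    myString = str(anchor)
    #print myString
    try:
        # span() returns the start and end offsets of the href="..." match
        [a, b] = re.search('href="[0-9,-/~.a-zA-Z://]*"{0,1}', myString).span()
        print myString[a:b]
    except AttributeError:
        # re.search returned None, so this anchor has no matching href
        print "error"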
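
To try the regular expression idea without fetching a page, here is a minimal sketch that runs the search over an HTML string. It is not the script above: the helper name extract_hrefs, the HREF_PATTERN constant, and the sample markup are all made up for illustration, and the pattern is a looser variant that captures whatever sits between the quotes instead of listing allowed characters. It is written for Python 3 and uses only the standard library.

```python
import re

# Hypothetical pattern: captures the text between the quotes of an href attribute.
# It assumes the value is quoted and contains no quote characters itself.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def extract_hrefs(html):
    """Return the href values found in an HTML string, in document order."""
    # With a single capture group, findall returns just the captured values.
    return HREF_PATTERN.findall(html)


# Made-up sample markup: two links and one anchor without an href.
sample = (
    '<p><a href="http://example.com/a.html">A</a> '
    '<a class="x" href=\'/roo/123.html\'>B</a> '
    '<a name="top">C</a></p>'
)

for link in extract_hrefs(sample):
    print(link)
```

A regular expression like this will still miss unquoted attribute values and can be fooled by text that merely looks like an href, so treat it as a quick experiment and not a robust parser.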
