Tuesday, May 22, 2012

Python mechanize, Beautiful Soup and HTML scraping

I used to rely on bash, sed, awk, curl and Beautiful Soup for HTML parsing, scraping and automating tasks. I was aware of mechanize but was not using it. Today I decided to give it a try for an automation task, and I didn't regret it. Learning it was real fun(!). Actually, I only learned a bit of it, but it helped me a lot with automating browser requests. Although I haven't used it for form handling yet, I know it has some magical power there too. I also realised that Beautiful Soup pairs naturally with mechanize: mechanize fetches and navigates pages, and Beautiful Soup parses the HTML that comes back. They make an awesome combo worth trying and using. Thanks to all the folks behind them.
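Here is a snippet:

import re
import mechanize
from bs4 import BeautifulSoup  # Python 3; bs4 is the current Beautiful Soup package

br = mechanize.Browser()
br.open("http://www.site.com")

# Collect every link on the page whose URL matches the pattern
all_links = list(br.links(url_regex="pattern"))

# Skip the first five matches and visit the rest
for page_link in all_links[5:]:
    br.follow_link(page_link)
    html = br.response().read()
    soup = BeautifulSoup(html, "html.parser")

    # First anchor on the page whose href contains "mp3" (None if there is none)
    link = soup.find("a", href=re.compile("mp3"))
    if link is not None:
        url = link["href"]
        filename = url.split("/")[-1]
        print(filename + "----" + url)
        # Download the file into the current directory
        br.retrieve(url, filename)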
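
The snippet takes the file name with a plain split("/")[-1], which works for simple links but keeps any query string and returns an empty name when the URL ends in a slash. Below is a minimal sketch of a sturdier variant using only the standard library. The helper name filename_from_url, its base_url parameter and the "download.bin" fallback are my own assumptions for illustration; none of them come from mechanize or Beautiful Soup.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit


def filename_from_url(url, base_url="", default="download.bin"):
    """Derive a local file name from a (possibly relative) URL.

    Hypothetical helper for illustration; not part of mechanize.
    """
    # Resolve relative hrefs against the page they were found on
    absolute = urljoin(base_url, url)
    # Keep only the path, which drops any ?query and #fragment
    path = urlsplit(absolute).path
    # Decode %XX escapes, then take the last path component
    name = posixpath.basename(unquote(path))
    # Fall back to a placeholder when the path ends in "/"
    return name or default


if __name__ == "__main__":
    print(filename_from_url("/media/track%2001.mp3?token=abc",
                            "http://www.site.com/page"))
```

In the loop, the result could be passed as the second argument to br.retrieve in place of the split-based name.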

Here are some helpful links:
http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
http://stockrt.github.com/p/handling-html-forms-with-python-mechanize-and-BeautifulSoup/
