Saturday, October 14, 2017

A simple web scraper

I was recently asked what the simplest web scraper in Python is that I could write. Initially I thought the Scrapy framework was the best choice, but then I realized it is too complex for scraping just one page given as a console parameter. So I decided to use the requests and bs4 packages.

The first step was to install them:

pip install requests bs4

After a first check of the parameters using sys.argv, I had to check whether the provided URL contains the http:// or https:// prefix, because the requests package expects the URL to include a scheme:

if not url.startswith(("http://", "https://")):
    url = "http://" + url
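A quick sanity check of the normalization, wrapped in a hypothetical helper for illustration (startswith is slightly stricter than a plain substring test, since a URL can contain "http" without starting with it):

```python
def normalize_url(url):
    # prepend a default scheme only when none is present
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

print(normalize_url("example.com"))          # → http://example.com
print(normalize_url("https://example.com"))  # → https://example.com
```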

Next I fetch the page and parse it with BeautifulSoup:

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')


Almost all pages today include scripts and styles, so it is best to remove them first:

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

We usually want the body of the page and not the other parts:

extractedText = soup.find_all('body')[0].get_text()

Don't forget to transform the text to ASCII to avoid any printing issues (in Python 3, encode() returns bytes, so decode back to get a string):

extractedText = extractedText.encode('ascii', errors='ignore').decode('ascii')

Finally, it is also nice to collapse the extra whitespace in the final text, especially if the input page has a lot of layout whitespace around the content:

for rmtext in ('\n', '\r', '\t', ' | ', ' , ', ' : ', ' / ', '  '):
    extractedText = extractedText.replace(rmtext, ' ')
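As an alternative to chaining replace calls, a regular expression can collapse every run of whitespace in a single pass (a sketch with made-up sample text):

```python
import re

text = "Some   text\n\twith \r scattered   whitespace"
# collapse every run of whitespace characters into a single space
cleaned = re.sub(r'\s+', ' ', text).strip()
print(cleaned)  # → Some text with scattered whitespace
```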

I uploaded the script to GitHub if someone wants to play with it or use it in any project.
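For reference, the pieces above fit together into one small script. This is my own restructuring into a helper function plus a usage guard (extract_text and the guard are additions for readability, not from the original code):

```python
import sys

import requests
from bs4 import BeautifulSoup


def extract_text(html):
    """Strip scripts/styles and return the page's visible body text."""
    soup = BeautifulSoup(html, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.find_all('body')[0].get_text()
    # drop non-ASCII characters to avoid printing issues
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # collapse common whitespace and separators
    for rmtext in ('\n', '\r', '\t', ' | ', ' , ', ' : ', ' / ', '  '):
        text = text.replace(rmtext, ' ')
    return text


if __name__ == '__main__':
    if len(sys.argv) == 2:
        url = sys.argv[1]
        if not url.startswith(("http://", "https://")):
            url = "http://" + url
        page = requests.get(url)
        print(extract_text(page.content))
```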
