Sunday, October 29, 2017

NAS or not to NAS? - 2017 Guide

As a family we have six external disks of various sizes, ranging from 320 GB up to 1 TB. I also have an old PC that I no longer use; its mechanical disk makes it too slow even for everyday tasks such as web browsing. That is where the idea of using it as network-attached storage (NAS) came from. The machine is still powerful enough to provide storage inside the home.

Before building the machine I had to verify, with a memory check tool, that the memory modules were in perfect condition. I left the memory check running for 12 hours and, thankfully, no errors came up. The next step was to verify that none of the available disks had bad or reallocated sectors. This can be done with a disk-scanning utility, either one provided by the manufacturer (such as WD) or one from the Ultimate Boot CD.

The next step was to collect disk prices from local retailers. Here is a table with the models of the NAS-targeted disks. I wanted something cheap, so I split the choices into three configurations:


1. The cheapest choice: 2x 3TB RAID 1 + 1TB for downloading stuff

I would select the Seagate IronWolf 3TB (119 Euro) and the WD Red 3TB (128 Euro), for a total of 247 Euro. The two disks will be set up in RAID 1, and downloads will go to the old 1TB disk. Alternatively, the 1TB disk will be used as a backup for critical files and the 3TB disks will also hold the downloads. The old 320 GB disk will be used as the startup disk. In the future, it will be easy to upgrade by adding a third 3TB disk, converting the array to RAID 5, and moving the 1TB disk out as an external drive.

2. The wise choice: 3x 4TB RAID 5 + 1TB for startup and download disk.

I would select the Seagate IronWolf 4TB (170 Euro), the WD Red 4TB (190 Euro; another retailer lists it at 160 Euro but is out of stock), and the HGST for NAS 4TB (195 Euro), for a total of 555 Euro. The primary OS disk should be replaced with a cheap SSD (40 Euro). With this configuration the array survives one drive failure and offers 8 TB of disk space (more than enough!)

3. The expensive choice: build a system from scratch using memory with error-correcting code (ECC). This can easily go up to 1000 Euro, as you have to buy a specialized motherboard, such as a Supermicro board.
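
The capacity and cost figures for the first two options can be double-checked with a short sketch. The helper name `usable_tb` is made up for this illustration, and the formulas are the usual ones for equal-sized disks: RAID 1 keeps one disk's worth of data, while RAID 5 gives up one disk's worth of space to parity.

```python
def usable_tb(level, disks, size_tb):
    """Usable capacity in TB for equal-sized disks (hypothetical helper)."""
    if level == 1:
        # RAID 1 mirrors everything, so only one disk's worth is usable.
        return size_tb
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # RAID 5 spends one disk's worth of space on parity.
        return (disks - 1) * size_tb
    raise ValueError("only RAID 1 and RAID 5 are handled here")


# Prices in Euro, taken from the first two options above.
cheapest = {"prices": [119, 128], "level": 1, "size_tb": 3}
wise = {"prices": [170, 190, 195], "level": 5, "size_tb": 4}

for name, cfg in (("cheapest", cheapest), ("wise", wise)):
    total = sum(cfg["prices"])
    capacity = usable_tb(cfg["level"], len(cfg["prices"]), cfg["size_tb"])
    print(f"{name}: {total} Euro, {capacity} TB usable, "
          f"{total / capacity:.1f} Euro per usable TB")
```

Measured per usable TB the second option comes out cheaper, although its up-front cost is more than double.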

Eventually I will go with the first option, as it is the cheapest and fits my local needs very well. I will continue using the external disks as a backup.

Saturday, October 14, 2017

A simple web scraper

I was recently asked what the simplest web scraper I could write in Python would be. Initially I thought that the Scrapy framework was the best choice. But then I realized it is too complex for scraping just one page whose URL is given as a console parameter. Thus, I decided to use the requests and bs4 packages.

The first step was to install them:

pip install requests bs4

After first checking the parameters with sys.argv, I had to check whether the provided URL contains the http or https prefix. The requests package expects the URL to include http or https:

# requests needs an explicit scheme, so add one when it is missing
if not url.startswith(("http://", "https://")):
    url = "http://" + url
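
The same check can be wrapped in a small function so it is easy to try on a few inputs. This is only a sketch: the name `normalize_url` and the choice of `http://` as the default scheme are assumptions for illustration. It looks only at the start of the string, because an address such as `httpbin.org` contains the letters "http" without actually having a scheme.

```python
def normalize_url(url):
    """Prepend http:// when the URL has no http/https scheme (hypothetical helper)."""
    url = url.strip()
    # str.startswith accepts a tuple, so both schemes are checked at once.
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url


print(normalize_url("example.com"))
print(normalize_url("https://example.com"))
print(normalize_url("httpbin.org/get"))
```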

Next I fetch the page and parse it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')


Almost all pages today include scripts and styles, so the best way to remove them is this:

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

We usually want the body of the page and not the other parts:

extractedText = soup.find_all('body')[0].get_text()
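
To see the parsing steps so far working together without touching the network, here is a sketch that runs them on a small inline HTML string. The sample markup and the function name `body_text` are invented for this illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <p>Hello, scraper!</p>
    <script>console.log("ignored");</script>
  </body>
</html>
"""


def body_text(html):
    """Return the visible text of <body> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.extract()
    body = soup.body
    # Fall back to the whole document when there is no <body> tag.
    return (body if body is not None else soup).get_text()


print(body_text(SAMPLE_HTML).strip())
```

The fallback is a small design choice: indexing `find_all('body')[0]` raises an IndexError on a fragment without a body tag, while `soup.body` simply returns None and lets the code decide what to do.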

Don't forget to convert the text to ASCII to avoid any printing issues (in Python 3, encode returns bytes, so decode back to a string):

extractedText = extractedText.encode('ascii', errors='ignore').decode('ascii')

Finally, it is also nice to replace line breaks, tabs, and stray separators in the final text with single spaces, especially if the input page has a lot of Unicode text:
for rmtext in ('\n','\r','\t',' | ', ' , ', ' : ', ' / ',' ', ' ', ' ', ):
    extractedText = extractedText.replace(rmtext, ' ')
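
An alternative sketch for the last two steps, swapping the chain of replace calls for a regular expression. In Python 3, encode returns bytes, so the result is decoded back to str before cleaning. `clean_text` is a hypothetical name, and unlike the loop above this version leaves separators such as ' | ' in place.

```python
import re


def clean_text(text):
    """ASCII-only text with whitespace runs collapsed (hypothetical helper)."""
    # Drop non-ASCII characters, then decode so we keep working with str.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # \s+ matches any run of spaces, tabs, and newlines.
    return re.sub(r"\s+", " ", ascii_text).strip()


print(clean_text("  Caf\u00e9\n\n\tmenu \u2013  today  "))
```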

I uploaded the script to GitHub in case someone wants to play with it or use it in a project.