2.- Tools for data visualisation: cleaning and scraping data

 

If your data is not in an XLS or CSV file, it may be published as HTML or PDF. In those cases you will need to scrape it into a spreadsheet before you can analyse it. Scraped data is often a bit messy, so you may also need to clean it afterwards.

But don’t worry: there are many tools that let you do this in a very simple way. Below you will find some resources, ordered from simple to complex.

1.- Clean the data

Data Wrangler: Wrangler is an interactive tool for data cleaning and transformation.

 

Tutorial: https://vimeo.com/19185801

 

OpenRefine: OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data. It has its own expression language, General Refine Expression Language (GREL), which accepts regular expressions (the black magic). With regular expressions you can, for example, detect unwanted characters in a text and delete them.

Tutorial:

https://www.youtube.com/watch?v=B70J_H_zAWM
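To give an idea of what "detect characters in text and delete them" means in practice, here is a minimal sketch of the same idea in Python rather than GREL. The function name `clean_cell`, the choice of characters to remove and the sample values are all assumptions made for illustration; in OpenRefine you would write a comparable expression inside a column transform.

```python
import re

def clean_cell(value: str) -> str:
    """Remove footnote markers and stray symbols from one messy cell.

    Hypothetical helper for illustration; adapt the patterns to your data.
    """
    # Drop bracketed footnote markers such as "[1]"
    without_notes = re.sub(r"\[\d+\]", "", value)
    # Drop stray asterisks and hash signs
    without_symbols = re.sub(r"[*#]", "", without_notes)
    # Collapse runs of whitespace (spaces, tabs) and trim both ends
    return re.sub(r"\s+", " ", without_symbols).strip()

# Invented sample values, not real data
messy = ["  Madrid* ", "Barcelona[1]", "Sevilla \t# 2014"]
cleaned = [clean_cell(v) for v in messy]
print(cleaned)
```

Each `re.sub` call handles one kind of mess, which keeps the patterns easy to read and to change one at a time.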

 

2.- Extract the data

Import.io: It lets you extract large amounts of data from a web page into an Excel spreadsheet.

Tutorial: https://www.youtube.com/watch?v=sG68ziyLjCw

 

Tabula: Open-source software that makes it easy to extract data tables from PDF files.

 

Tutorial: https://www.youtube.com/watch?v=of9680dgqIc

importHTML and importXML: these are two great functions of Google Spreadsheets. importHTML imports a table or list from a web page, and importXML uses an XPath query to pick elements of the DOM (HTML) and import them into a spreadsheet.

Tutorial: https://mashe.hawksey.info/2012/10/feeding-google-spreadsheets-exercises-in-import/
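If you are curious about what "getting elements of the DOM" involves, the following is a rough sketch in Python of pulling the rows of an HTML table into a list, using only the standard library. It is not how Google Spreadsheets implements these functions; the class `TableExtractor`, the helper `extract_rows` and the sample HTML are assumptions invented for this example.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def extract_rows(html: str) -> list:
    """Hypothetical helper: return the table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows

# Invented sample page, not real data
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The result is a list of rows that could then be written to a CSV file with Python's `csv` module, which is roughly the step the spreadsheet functions do for you automatically.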

 
