How to do a Data Scrape on the Titanic

How to do a Data Scrape on the Titanic

In an earlier post, I talked about the MOOC Data Journalism course that I did. Part of that course covers how to do a data scrape of information from other sites for reporting purposes.

In this post, I want to share what I learnt.

What is a Data Scrape

Scraping data is, essentially, a way of grabbing content from lists and tables on other websites.

And with this information, you can the really study it, and twist it and turn it to see what other insights you can draw out of it.

Example

How to do a Data Scrape - Titanic 1912As an example, consider the passengers on the Titanic. You might want to do some analysis on who there was on the ship, who survived, ages, etc.

By scraping the data from a reliable source, you can then put it into a spreadsheet and sorting, and grouping, etc, in a way that will give you the information that you want.

How to do a Data Scrape

There are several tools that you can use to do a data scrape.

The tool that I am going to describe is Google Sheets.

As described above, I’m going to scrape the list of Titanic passengers from Wikipedia.

The Titanic

Wikipedia has a list of the passengers that were on the Titanic.

The address of the Wikipedia page is:
http://en.wikipedia.org/wiki/List_of_Titanic_passengers#Survivors_and_victims

If you visit that link, you see a large list of everyone who was on board the Titanic on her maiden voyage. (It can be quite disheartening to read.)

Titanic Passenger List

Scraping the Data

I am going to show you how to data scrape of the passenger information so that you can put it into a spreadsheet.

  1. In your browser, go to Google Drive. (You will need to have a Google account for this.)
  2. Click on New and then select Google Sheets
    How to Scrape Data - Google Sheets menu
    The Google Sheet will be displayed.
  3. In the first cell, enter the following:
    =importhtml

    Import-html
    Google will autosuggest as you are typing.

  4. Continue typing the following
    (“https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, “Table”, 1)
  5. Press enter.
Also of interest:
Dousing the flame

Here’s the full command. You can also copy this and paste it into the spreadsheet:

=importhtml("https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, "Table", 1)

Initially, you’ll see “loading”, and then the list of passengers in First Class can be seen.

Quick Explanation of IMPORTHTML

As seen above, the command to use is IMPORTHTML

Then, between brackets, you need the following:

url the URL of the page that has the information that you want to scrape
query “Table” or “List” depending on whether the information you want is in a table, or a list.
index this is the number of the table or list that is on the web page.

 

In our case, we used:

url https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic
query “Table”
index 1

Here’s an actual example of a Google spreadsheet with the list of passengers.

And the other passengers?

As you might have noticed, the list has only the First Class passengers.

This is because the Second Class passengers and Third Class passenger are in separate tables.

So to get that data we’ll do the following:

Adding the Second Class passengers

First – let’s add an extra column so that we know which passengers are First class

  1. Go to the first empty column after the data. (In my case, it was Column H)
  2. Enter “Class” on the first row.
  3. Enter “1” on the next row.
  4. Copy that value into each cell down to the end of the table.How to do a data scrape - First Class passengers

Now let’s add the Second Class passengers

  1. Go to the first empty row at the bottom of the table.
  2. Again, enter =importHTML
  3. And follow with
    (“https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic, “Table”, 2)
    (note that the index is now “2”).
  4. Press Enter
Also of interest:
Is Microsoft a Religious Experience?

Here’s the Example table with the Second Class passengers

In the Class column (that we created above), add the number “2”

Treating the Third Class passengers Differently

You read that right. We are going to have to handle the Third Class passengers differently.

Why?

Because, if you look closely on the Wikipedia page, the table for the Third Class passengers has an extra column.

In the tables for the First and Second Class passengers, the column “Hometown” included both the town, and the country. In the table for the Third Class passengers, the “Hometown” column has the “Town”, and there is a separate column for “Home Country”.

The extra column makes it difficult to combine it with the other data.

However, there is a workaround for this. I will be covering that in a later post.

 


Want to learn more?

Below is a selection of resources that I personally feel are relevant to this blog post, and will allow you to get more in-depth knowledge. I do earn a commission if you purchase any of these, and for that I am grateful. Thank you. (Important Disclosure)

See more …