In the previous post, we successfully gathered data about the relationships between wiki pages around data science. What exactly are we going to do with this data?

Imagine this: you now have the relationships between wiki pages. What questions can we answer using this dataset? I’ll list some of my ideas below.

  1. From the data science article, which other article has the highest connection to it?
  2. What are the top articles that surround data science?
  3. Is there an article bigger than data science?
  4. Is data science the center of this network of links?

To answer these questions, we need to look at the dataset. We can view it as a table.

Note: Only the first 10 results are shown.
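As a rough sketch of what viewing the data as a table could look like, here is a small example. It assumes the dataset is a list of (source, target) pairs, where each pair means "the source article links to the target article". The rows and the `first_rows` helper are made up for illustration; they are not the actual scraped data.

```python
# Hypothetical edge list: each row means "source article links to target article".
# These rows are invented for illustration and are NOT the real scraped dataset.
edges = [
    ("Data science", "Statistics"),
    ("Data science", "Data mining"),
    ("Data mining", "Statistics"),
    ("Machine learning", "Statistics"),
    ("Machine learning", "Data mining"),
    ("Statistics", "Data science"),
]


def first_rows(rows, n=10):
    """Return at most the first n rows (hypothetical helper)."""
    return rows[:n]


# Print a simple two-column table of the first 10 rows.
print(f"{'source':<20}{'target':<20}")
for source, target in first_rows(edges):
    print(f"{source:<20}{target:<20}")
```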

We can use the table, but it slows down our understanding. Why? Because

tables are good for viewing specific data, that is, when we know exactly where to look.

In this case, we have no prior knowledge of what the dataset looks like. How about a chart? I think a bar chart would make sense, so let’s check it out.

To create a bar chart, we need to decide what goes on the x and y axes. The x axis needs a categorical variable, while the y axis can take a continuous or discrete numeric variable. With that in mind, we can set the x axis to the wiki article and the y axis to the number of times that article was linked.
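The counting step behind such a chart can be sketched as follows, again assuming the data is a list of (source, target) pairs. The rows are invented for illustration, and the text bars only stand in for a real chart; the resulting counts could be handed to any plotting library.

```python
from collections import Counter

# Hypothetical edge list (source, target); invented rows, not the real dataset.
edges = [
    ("Data science", "Statistics"),
    ("Data science", "Data mining"),
    ("Data mining", "Statistics"),
    ("Machine learning", "Statistics"),
    ("Machine learning", "Data mining"),
    ("Statistics", "Data science"),
]

# x axis: the article (categorical).
# y axis: how many times the article appears as a link target (discrete count).
link_counts = Counter(target for _, target in edges)

# Draw one text bar per article, most linked first.
for article, count in link_counts.most_common():
    print(f"{article:<16}{'#' * count} ({count})")
```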

What we see in this chart is that Statistics is the most linked article, followed by Data Mining, both higher than our seed article, Data Science. So what exactly am I going to do with that data?

Data alone may not directly help us achieve what we want, but with some creativity, data can be powerful. Imagine you are someone who wants to learn data science but doesn’t know where to start. Looking at this chart lets you see that statistics and data mining are closely related to data science.

But I can do that with Google. Of course, a quick Google search will help you with that. The point is that Google can help us with public topics, but what about sensitive information, like business or corporate data? You can run this kind of analysis without the risk of exposing that data to the public.

In the next topic, we will choose a better visualization, called a force layout.