Section Two - All data is subjective
A story about Hurricane Sandy & Beverly Hills 90210 (3 minutes)
Fundamental problems with data
Even people who are enthusiastic about (big) data are aware of its problems: privacy, storage, how to get the data, how to get the right data, computing power and so on. These are all important issues, but the fundamental issues with data are often not addressed.
In this and the next sections we will talk about these fundamental problems.
None of this means that data does not offer great opportunities. It does, but only if you are constantly aware of the potential pitfalls and understand the limitations. We will help you with this so that you are better able to assess the possibilities of data and of the technologies that use it.
And so that you never use the term data-driven again.
Let's start with data being subjective.
All data is subjective
This is probably the most important thing to understand about data. It is not neutral! It is not objective.
Often, when people talk about data you get the impression that data is "just" there. It is kind of objective. This is not true. Data is not a natural phenomenon. You have to do something to get the data. You have to measure data. You have to collect data. And this means making choices. What will you measure? What can you measure? How will you measure? When will you measure? What is within the budget?
All those choices are decisive. And far from objective.
Example (1): At Fontys University, we would really like to know how our students have developed in 4 years, but we cannot measure that. That's why we measure student satisfaction, or what jobs students get, or study time, but those are all proxies. They are choices. They are not neutral. That does not mean that it is bad, but it is very important to understand that this data is subjective.
Knowing which data you do not have is as important as knowing which data you do have. Prejudices also shape the choices made when collecting data. And because data is not a natural phenomenon, you should always take a critical look at it. Was the data collected correctly?
Example (2): If we look at the users of the Farmville game, we see a surprising number of people from Beverly Hills. That's weird. Or maybe 90210 is just the most famous zip code in America, and the one people type in when they do not want to give their own?
Data is only as neutral and objective as the questions you ask in a questionnaire, which is to say: not at all. As Kate Crawford (Harvard) puts it:
"Data is something we create, but also fantasize together" - Kate Crawford
Example (3): 20 million tweets were sent when Hurricane Sandy struck the American East Coast. You can analyze that data, but without context you will draw the wrong conclusions. After all, the hardest hit areas (Breezy Point, Coney Island) tweeted little, simply because the people there were too busy with the storm, were less interested in Twitter, or had no battery power or internet connection. In Manhattan, on the other hand, where it was not so bad, there was plenty of tweeting. The data indicated that the disaster mainly occurred in Manhattan. The reality was very different.
Another famous example is Google Flu Trends (GFT). Based on searches for flu-related symptoms, Google was able to track and predict flu epidemics as early as 2008. It became a famous story about the possibilities of big data. Later, when it gained more media attention and the amount of data increased, its predictions were off by as much as 140%, and it became the famous story about the problems with (big) data.
There is another pitfall. Often when we realize that the data is subjective and does not represent reality, the response is: we need more data. But of course the 'more data' is just as subjective.
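A small simulation can make this pitfall concrete. The sketch below does not use real Twitter data: the 50% share of badly hit residents and the two posting probabilities are made-up assumptions, chosen only to mimic a situation in which the people who are hit hardest post the least. The function name is hypothetical as well.

```python
import random

# Hypothetical assumptions, not measurements:
TRUE_SHARE = 0.5          # half of all residents are badly hit
POST_IF_BADLY_HIT = 0.1   # badly hit residents rarely manage to post
POST_IF_OTHER = 0.5       # everyone else posts five times as often


def share_badly_hit_in_posts(n_posts, seed=0):
    """Collect n_posts messages and return the share written by badly hit residents."""
    rng = random.Random(seed)
    posts_total = 0
    posts_badly_hit = 0
    while posts_total < n_posts:
        badly_hit = rng.random() < TRUE_SHARE
        post_probability = POST_IF_BADLY_HIT if badly_hit else POST_IF_OTHER
        if rng.random() < post_probability:
            posts_total += 1
            if badly_hit:
                posts_badly_hit += 1
    return posts_badly_hit / posts_total


# Collect more and more data from the same biased source.
for n in (100, 10_000, 200_000):
    print(n, round(share_badly_hit_in_posts(n), 3))
```

Under these assumptions the estimate settles near 1/6 (about 17%) as more messages are collected, while the true share stays at 50%. More data makes the estimate more stable, but not less biased.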
A final famous example - survivor bias
During World War II, researchers at the Center for Naval Analysis faced a critical problem.
Many bombers were getting shot down on runs over Germany. The naval researchers knew they needed hard data to solve this problem and went to work. After each mission, the bullet holes and damage on each bomber were painstakingly reviewed and recorded. The researchers pored over the data looking for vulnerabilities. The data began to show a clear pattern (see picture). Most damage was to the wings and body of the plane.
The solution to their problem was clear. Increase the armor on the plane's wings and body. But there was a problem.
The analysis was completely wrong.
The researchers had only looked at bombers that had returned to base. Missing from the data? Every plane that had been shot down. But the research wasn’t a wasted effort. The surviving bombers rarely had damage to the cockpit, engines, and parts of the tail. This wasn’t because those areas were better protected. In fact, they were the most vulnerable areas on the entire plane.
The researchers’ bullet hole data had created a map of the exact places where a bomber could be shot and still survive. With the new analysis in hand, crews reinforced the armor on the bombers' cockpit, engines, and tail. The result was fewer fatalities and more successful bombing missions. The analysis proved so useful that it continued to influence military plane design up through the Vietnam War.
This story is a vivid example of survivor bias. Survivor bias is when we only look at the data of those who succeed and exclude those who fail.
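The bomber story can be imitated with a short simulation. This is only a sketch under invented assumptions: every plane takes three hits, hits land on each section equally often, and the loss probabilities per section are hypothetical numbers chosen to make engine and cockpit hits the most dangerous. None of these values come from the historical analysis.

```python
import random

SECTIONS = ["wings", "body", "tail", "engine", "cockpit"]

# Hypothetical chance that a single hit in a section brings the plane down.
LOSS_PROBABILITY = {
    "wings": 0.05,
    "body": 0.05,
    "tail": 0.30,
    "engine": 0.60,
    "cockpit": 0.60,
}


def simulate(missions, hits_per_plane=3, seed=0):
    """Return (hits on all planes, hits seen on returned planes) per section."""
    rng = random.Random(seed)
    hits_all = {section: 0 for section in SECTIONS}
    hits_returned = {section: 0 for section in SECTIONS}
    for _ in range(missions):
        # Hits are spread evenly: no section is targeted more than another.
        hits = [rng.choice(SECTIONS) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every single hit.
        survived = all(rng.random() >= LOSS_PROBABILITY[s] for s in hits)
        for section in hits:
            hits_all[section] += 1
            if survived:
                hits_returned[section] += 1
    return hits_all, hits_returned


hits_all, hits_returned = simulate(10_000)
for section in SECTIONS:
    print(f"{section:8} all planes: {hits_all[section]:5}  returned planes: {hits_returned[section]:5}")
```

Across all planes the hits are spread roughly evenly, but among the returned planes the wings and body show far more holes than the engine and cockpit. Someone who only sees the returned planes would conclude the opposite of what is actually true in the simulation.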
Survivor bias is all around us, especially in the media. You read articles about entrepreneurs who risked everything financially and are now a success. But no one profiles the hundred other entrepreneurs who followed the same strategy and went bankrupt.
Takeaways from section two:
- All data is subjective;
- Always be aware of how data is collected;
- Always try to realize what data is NOT collected.