Questions are Tools

The oldest fossil evidence suggests that life on Earth started about 3.8 billion years ago. In every generation, life that survived in its local environment long enough to reproduce could pass on its code (with a few mutations), and the rest was filtered out. After billions of generations and countless mutations, one branch of life evolved into the species Homo sapiens (aka modern humans) about 200,000 years ago. About 12,000 years ago, some humans stopped living nomadically, stopped (exclusively) foraging for food, and started farming, which allowed people to produce surplus food beyond what the farmers needed. This surplus allowed people to trade, which allowed some people to dedicate their labor to crafts other than food, which allowed settlements to grow into towns with simple economies. These towns could trade stories, furs, food, mates, and knowledge with nearby towns, but it wasn't until around 3500-3000 BCE that craftspeople developed wheeled vehicles and boats, which allowed ideas and technologies to spread much further and much faster. Writing systems followed these transportation breakthroughs, letting people track trade, make contractual deals, and transmit ideas. With better farming practices, tools, medicines, etc., the global population started climbing much faster, and today, there are about 7.5 billion people on Earth making livings across an inconceivably diverse range of occupations.

Put more concisely, the world is an extremely complex system with billions of moving parts, and it's only getting more complex. To understand something new, we must be able to manage this complexity, or else we will be overwhelmed by the noise. The first step in managing that complexity is to define a good question.

A good question will have the following properties:
• It can be answered (by you, or the researcher).
• The answer(s) can be supported by evidence and data.
• Its answer(s) will be valuable to enough people to justify the cost of doing the research and collecting the data.

A good question will trim away unnecessary complexity, narrow your focus to fewer hypotheses, and guide you to the data you need to answer the question. Often, this will involve breaking massive questions down into smaller questions. 

For example, NASA was given a massive and vague question ("How do we put a person on the moon?"), and they broke that big question down into smaller big questions ("How can we make sure the astronauts survive?", "How do we land a ship on the moon?", "How do we take off from the moon?", etc.). They then broke those down into more specific questions ("How much fuel does the lunar lander need to get back up to orbit?", "How much oxygen does a human consume per minute? What's the maximum consumption rate?", "How hot will the ship's surface get when it reenters Earth's atmosphere, and what insulation materials can survive that heat?", etc.). The main question was too big and could have been answered in too many ways, but by breaking questions down, engineers could target their research. After all of the component questions were answered and integrated, we were able to answer the main question and put human footprints on the moon.

Without good, well-defined questions, we would never have gotten off the ground.

How Can I Know Anything?

In this blog, I plan to focus on the complex phenomena that are revealed through data science, but before I can investigate any of them, there's a fundamental philosophical question I have to answer:
How can I **know** anything? How can I be sure that an explanation is correct or a statement is true? 
Or, put less philosophically: How can I be sure that I'm not about to embarrass myself and waste my time by being confidently wrong?

Philosophically, it's very difficult to know anything with certainty. Philosophical skepticism holds that there's always the possibility that an unknown unknown is deceiving us, so it's impossible to be certain, and someone could always poke a potential hole in any theory. Yet from our lived experiences, it's obvious that we know things. We know that we won't fall through the floor when we get out of bed, so clearly philosophical certainty isn't a very useful standard for understanding the world. Evolution has provided a much more pragmatic way to gain knowledge, and we start using it at birth.

When a baby keeps dropping things from their highchair, they're demonstrating my favorite answer to that question. We aren't born with knowledge of gravity or object permanence, so when that baby first discovers that objects fall downward when pushed off the ledge, they observe something interesting. The baby doesn't have a theory to explain the observations, but by collecting a lot of observations, they develop intuition. Their intuition improves with new observations (e.g., both my red and blue sippy cups fell down when I let go of them), and they can make predictions (e.g., my green sippy cup will fall down when I let go of it). Sometimes, we encounter unexpected observations (e.g., the balloon went up when I let go of it) that force us to refine our intuition (e.g., sippy cups go down and balloons can go up).

We use this same learning pattern throughout our entire lives; when we recognize a familiar scenario, we can use our prior experiences to make predictions, and if we have recognized all the important aspects of the scenario, we can safely assume that our prediction will be correct. Stated more directly, we assume that reality is consistent. This is the fundamental assumption behind data science, and as long as reality remains consistent, we'll be able to predict the future if we can recognize the important parts of the present and find relevant observations from the past.
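
To make that learning loop a little more concrete, here is a minimal Python sketch of the sippy cup example. The `Intuition` class and its `observe` and `predict` methods are hypothetical names invented for this illustration, and "pick the most common past outcome" is only one simple way to turn observations into predictions; it is not meant as a real model of how babies (or data scientists) learn.

```python
from collections import Counter, defaultdict


class Intuition:
    """Toy illustration: predict what an object will do from past observations."""

    def __init__(self):
        # Maps a kind of object (e.g. "sippy cup") to a tally of observed outcomes.
        self.observations = defaultdict(Counter)

    def observe(self, kind, outcome):
        """Record one observation, such as ("sippy cup", "falls")."""
        self.observations[kind][outcome] += 1

    def predict(self, kind):
        """Return the most frequently observed outcome, or None for an unfamiliar kind."""
        if kind not in self.observations:
            return None  # no relevant past observations, so no basis for a prediction
        return self.observations[kind].most_common(1)[0][0]


baby = Intuition()
baby.observe("sippy cup", "falls")  # the red cup
baby.observe("sippy cup", "falls")  # the blue cup

print(baby.predict("sippy cup"))  # prediction for the green cup: falls
print(baby.predict("balloon"))    # unfamiliar object: None

baby.observe("balloon", "rises")  # a surprising observation refines the intuition
print(baby.predict("balloon"))    # rises
```

The prediction for the green cup is only trustworthy because we assume the next sippy cup will behave like the previous ones, which is exactly the assumption that reality is consistent.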