So, recently, I've been learning a lot about data analysis, and it's been a wild ride thanks to the GoogleXCoursera scholarship program. You will be amazed by what you discover about data, from how it is turned from its raw state into actual information to how it may be used to help business decision-makers make more educated choices.
Did you know ❓❓
According to a widely quoted 2010 estimate from Eric Schmidt, then CEO of Google, humanity generated about five exabytes of data between the dawn of civilization and 2003. By 2010 that much data was being generated every two days, and by some estimates every 40 minutes by 2021.
Data analysis is somewhat similar to backend software development, where you evaluate the logic of things and work out how an application should flow. Here, though, instead of thinking about the flow, you think about the data: does this particular data answer the question you need it to answer, and if so, how can you turn it into valuable information?
A lot of things get overlooked during this process, and that can cause your data to be biased.
Why are we talking about data bias ❓
As a data analyst, a business owner, or just a curious person, you must consider bias and fairness from the moment you begin collecting data until the point at which you present your conclusions.
An example of bias: you are asked to run a poll to find out how many students in a class would like to go fishing. You assume you asked every student, but you left out students with disabilities, for example those who use wheelchairs. That survey's data collection is systematically biased, because a whole group never had the chance to answer.
Another example is a survey of employees in your organization to see how much they enjoy working from the office rather than working remotely. Since the COVID-19 pandemic, many businesses have adopted a work-from-home or hybrid policy, yet the survey was only taken by those who were present in the office at the time. You can tell right away that the data is biased, since the sample (people at the office) does not represent the full population (all employees).
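To make the office survey example concrete, here is a minimal sketch in Python. The workforce and every count in it are invented purely for illustration; the point is only to show how surveying the people who happen to be on site can give a very different answer from surveying everyone.

```python
# Hypothetical workforce: each record is (works_on_site, prefers_office).
# All counts below are made up for illustration.
employees = (
    [(True, True)] * 30      # on site, prefer the office
    + [(True, False)] * 10   # on site, prefer remote work
    + [(False, True)] * 12   # remote, prefer the office
    + [(False, False)] * 48  # remote, prefer remote work
)


def share_preferring_office(group):
    """Return the fraction of people in `group` who prefer the office."""
    return sum(prefers for _, prefers in group) / len(group)


# Biased sample: only people who were in the office got surveyed.
office_only = [person for person in employees if person[0]]

true_share = share_preferring_office(employees)
surveyed_share = share_preferring_office(office_only)

print(f"Whole workforce:    {true_share:.0%} prefer the office")      # 42%
print(f"Office-only survey: {surveyed_share:.0%} prefer the office")  # 75%
```

With these made-up numbers the office-only survey reports 75% when the real figure is 42%. Nobody lied and nothing was miscounted; the skew comes entirely from who was asked.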
What does bias mean ❓
Before providing more context, it is essential to understand what bias is. Using the above examples, you can see how being biased can cause data to be skewed.
Simply put, bias refers to a preference for or against a person, group of people, or thing. From a data analysis perspective, data bias is a type of error that systematically skews results in one direction.
Forms of Bias
- Sample Bias
Sample bias occurs when a sample doesn't accurately reflect the entire population. It can be reduced by choosing the sample at random, so that every member of the population has an equal chance of being picked.
- Unbiased Sampling
This is the counterpart of sample bias: a sample that is representative of the population being studied.
Comment below with an example of an unbiased sample ...
- Observer Bias
This is sometimes referred to as experimenter bias or research bias. It is the tendency for different people to observe the same thing differently. Think of the classic 6-versus-9 picture, where two people looking at the same digit from opposite sides disagree about which number it is.
- Interpretation Bias
The tendency to always interpret ambiguous situations in either a positive or a negative light.
- Confirmation Bias
The tendency to seek out or interpret information in a way that reinforces pre-existing ideas. People see what they want to see and ignore other types of information/data.
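As a rough illustration of the advice about choosing a sample at random, the sketch below reuses the fishing poll idea. The class list, the group sizes, and the seed are all assumptions made for this example. It compares a convenience sample (the first ten names on the list) with a simple random sample drawn with `random.sample` from the Python standard library.

```python
import random

# Hypothetical class list: 40 students, 8 of whom use a wheelchair.
# The numbers are invented for illustration.
students = [{"id": i, "uses_wheelchair": i >= 32} for i in range(40)]


def wheelchair_share(group):
    """Return the fraction of `group` that uses a wheelchair."""
    return sum(s["uses_wheelchair"] for s in group) / len(group)


# Convenience sample: just the first 10 students on the list.
# Here that happens to exclude every wheelchair user.
convenience_sample = students[:10]

# Simple random sample: every student has the same chance of selection.
rng = random.Random(42)  # fixed seed only so reruns give the same sample
random_sample = rng.sample(students, k=10)

print("Population:        ", wheelchair_share(students))
print("Convenience sample:", wheelchair_share(convenience_sample))
print("Random sample:     ", wheelchair_share(random_sample))
```

A single small random sample can still miss the population share by chance. What random selection removes is the systematic lean: no group is excluded by the way the sample was picked.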
Explore data credibility
The more high-quality data you have, the more confidence you can have when making data-driven decisions. You can check whether your data is credible by following a methodology called ROCCC.
ROCCC is an acronym for Reliable, Original, Comprehensive, Current, and Cited.
R - With this kind of data, you can be sure that you're getting unbiased and precise information because it comes from a trusted source.
O - You will eventually work with first-party, second-party, and third-party data. To make sure second- or third-party data is trustworthy, validate it against the original source.
C - Comprehensive data has all the relevant details required to provide an answer or a solution.
C - Consider these three questions when picking a data source: Who created the dataset? Does it belong to a reputable organization? When was the information last updated?
C - The usefulness of data diminishes with time, and the finest data sources are current and relevant to the task at hand.
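One way to keep track of these checks while vetting a source is a small checklist in code. This is only a sketch: the `DataSource` class, its field names, and the one-year freshness cutoff are all hypothetical choices made for this example, not part of ROCCC itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSource:
    """Hypothetical record of what you found when vetting a data source."""
    reliable: bool       # comes from a trusted, unbiased source
    original: bool       # validated against the original source
    comprehensive: bool  # has the details needed to answer the question
    cited: bool          # creator and organization are known and reputable
    last_updated: date   # used for the "current" check

    def is_current(self, today, max_age_days=365):
        # The one-year cutoff is an arbitrary assumption; pick what fits the task.
        return (today - self.last_updated).days <= max_age_days

    def failed_checks(self, today):
        """Return the names of the ROCCC checks this source does not pass."""
        checks = {
            "reliable": self.reliable,
            "original": self.original,
            "comprehensive": self.comprehensive,
            "cited": self.cited,
            "current": self.is_current(today),
        }
        return [name for name, passed in checks.items() if not passed]


source = DataSource(True, True, True, True, last_updated=date(2020, 1, 15))
print(source.failed_checks(today=date(2023, 1, 15)))  # ['current']
```

An empty list means the source passed every check you recorded. The judgment behind each `True` or `False` is still yours to make; the code only stops you from skipping a letter.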
If your data source has all these attributes, then your data “ROCCC”, pun intended 😄.