Data science tries to create models to analyze and predict events.
To create those models we need to observe actual events and collect DATA about them.
Once data is collected we have to describe it and organize it in rational ways. This is called DESCRIPTIVE ANALYSIS.
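As a minimal sketch of descriptive analysis, the snippet below summarizes a small sample with Python's standard `statistics` module. The `ages` list is made-up illustrative data, not taken from any real study.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents (made-up values).
ages = [23, 35, 31, 42, 35, 28, 51, 35]

# Describe the data with a few common summary measures.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "min": min(ages),
    "max": max(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```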
The collected data are called the sample data. From that sample data we can estimate what happens in the whole population. For this process we use STATISTICAL INFERENCE.
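The idea of estimating a population value from a sample can be sketched as follows. The population here is simulated (an assumption made purely for illustration), which is the only reason its true mean is available for comparison. The confidence interval uses a simple normal approximation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (synthetic data).
population = [random.gauss(170, 10) for _ in range(10_000)]

# In practice we only observe a sample; here, 200 randomly chosen members.
sample = random.sample(population, 200)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval for the population mean.
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

# Known here only because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean: {sample_mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"population mean: {population_mean:.2f}")
```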
Predicting new events requires a model that explains the collected outcome data. This means that we have to analyze the outcome variable and its related factors, creating a model that will allow us to predict new events. For these purposes we use STATISTICAL INFERENCE, REGRESSION TECHNIQUES and MACHINE LEARNING.
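One of the simplest regression techniques is a straight-line fit by least squares. The sketch below uses made-up data (hypothetical hours studied and test scores) and a hypothetical helper named `predict`. It illustrates the idea only and is not a complete modelling workflow.

```python
# Hypothetical data: hours studied (factor) and test score (outcome).
hours = [1, 2, 3, 4, 5]
scores = [2.9, 5.1, 7.0, 9.2, 10.8]

mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

# Least-squares estimates for the line: score = intercept + slope * hours
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
s_xx = sum((x - mean_x) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(x):
    """Predict the outcome for a new value of the factor."""
    return intercept + slope * x


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted score for 6 hours: {predict(6):.2f}")
```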
Data
There are 2 types of data:
- Structured
- Unstructured
Structured data are organized in tables: each row is an observation and each column is a variable.
Unstructured data have no predefined tabular layout. Examples include online video, audio, free text, and social media content.
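To make the difference concrete, here is a small sketch with invented records. The structured part stores rows with the same named columns, so a column can be selected directly. The unstructured part is free text that needs processing before it can be analyzed.

```python
# Structured: every row has the same named columns (hypothetical records).
rows = [
    {"name": "Ana", "sex": "female", "age": 34},
    {"name": "Ben", "sex": "male", "age": 29},
    {"name": "Chloe", "sex": "female", "age": 41},
]
columns = list(rows[0].keys())
ages = [row["age"] for row in rows]  # selecting a column is straightforward
average_age = sum(ages) / len(ages)

# Unstructured: free text has no columns; structure must be extracted first.
review = "Great food, slow service. Would visit again."
words = review.lower().replace(",", "").replace(".", "").split()

print(columns)
print(f"average age: {average_age:.1f}")
print(f"words in review: {len(words)}")
```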
Types of Structured Data
There are two types of structured data:
- Categorical (nominal and ordinal data)
- Quantitative (interval and ratio scale)
Categorical Data
This data consists of ‘categories’ or labels. For example, the variable ‘Sex’ is categorical as we have 2 categories: male and female.
Categorical data can be nominal or ordinal.
- Nominal data: labels that have no inherent order. These labels can be text or numerical. The ‘Sex’ column (text) and the ‘Zip code’ column (numerical) are examples of nominal data. The ‘Sex’ column does not have an order; male does not come before female or vice versa. The same holds for the zip code: ‘24595’ does not rank before or after ‘92617’ in any meaningful way, even though both are written with digits.
- Ordinal data: labels that can be ordered. Like nominal data, they can be textual or numerical. For example, the values in a column that contains the rating of a restaurant are ordinal. If the rating scale runs from 1 to 5, with 1 being the worst rating and 5 being the best, there is a clear order. Another example is grade values such as A+, B-, C. These labels also have an order, as we know A+ is a better grade than C. The order is meaningful, but the distance between two labels is not necessarily equal.
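The practical difference between nominal and ordinal data can be sketched in a few lines. Nominal labels can only be counted, while ordinal labels can also be sorted once their order is made explicit. The sample values and the `GRADE_ORDER` list are assumptions chosen for illustration.

```python
from collections import Counter

# Nominal: labels can be counted, but sorting them carries no meaning.
sex = ["male", "female", "female", "male", "female"]
counts = Counter(sex)
most_common_label = counts.most_common(1)[0][0]

# Ordinal: labels have an order, which we state explicitly (worst to best).
GRADE_ORDER = ["C", "B-", "A+"]
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

grades = ["B-", "A+", "C", "A+"]
sorted_grades = sorted(grades, key=rank.__getitem__)
best_grade = max(grades, key=rank.__getitem__)

print(counts)
print(sorted_grades)
print(best_grade)
```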
Quantitative Data
Quantitative data is any numeric data that can either take on a discrete value or a continuous value.
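A discrete value is obtained by counting and can take only separate, countable values. For example, the number of students in a class, salary counted in whole currency units, or age in completed years.
A continuous value is obtained by measuring and can take any value within a range. For example, height, weight, temperature.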
- Interval scale: values measured along a scale with equal, fixed intervals. It has all the properties of ordinal data, and values can fall below zero because the zero point is arbitrary. For example, on a Celsius thermometer the values are ordered (0 degrees Celsius is lower than 10 degrees Celsius), and the difference between 10 and 20 degrees is the same as the difference between 20 and 30 degrees.
- Ratio scale: values that have all the properties of interval data (fixed scale) plus a ‘true zero’, where zero means the quantity is absent (unlike 0 degrees Celsius). There are no values below zero, and the ratio of 2 values is meaningful, so the values support multiplication and division as well as addition and subtraction. For example, weight, length, height, and area all fall under the ratio scale.
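A short sketch shows why ratios are meaningful on a ratio scale but not on an interval scale. The helper `celsius_to_kelvin` and the example values are introduced here for illustration. Kelvin is used because its zero is a true zero, whereas the Celsius zero is arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale temperature to the ratio-scale Kelvin."""
    return celsius + 273.15


c1, c2 = 10.0, 20.0
k1, k2 = celsius_to_kelvin(c1), celsius_to_kelvin(c2)

# Differences are meaningful on an interval scale: both are 10 degrees.
diff_celsius = c2 - c1
diff_kelvin = k2 - k1

# Ratios are not: 20 °C is not "twice as hot" as 10 °C.
naive_ratio = c2 / c1   # 2.0, an artifact of the arbitrary zero point
kelvin_ratio = k2 / k1  # roughly 1.035, the physically meaningful ratio

# On a ratio scale such as weight, the ratio can be read directly.
weight_ratio = 80.0 / 40.0  # 80 kg is twice as heavy as 40 kg

print(diff_celsius, round(diff_kelvin, 2))
print(naive_ratio, round(kelvin_ratio, 3))
print(weight_ratio)
```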