INTRODUCTION TO STATISTICS FOR DATA SCIENCE
Chapter 1: Introduction to Statistics
- Overview of Statistics and its applications in data science
- Basic statistical concepts: population, sample, variable, data types, and measures of central tendency and dispersion
- Sampling techniques and sampling distributions
1.1 Overview of statistics and its applications in data science
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data. It is an important tool in data science, which involves the extraction of insights and knowledge from data using various techniques and algorithms.
Statistics is used in various stages of the data science workflow, including data pre-processing, exploratory data analysis, feature engineering, model training and evaluation, and interpretation of results.
It provides a framework for making sense of data: identifying patterns and relationships, and drawing conclusions based on evidence.
Some of the applications of statistics in data science include:
- Descriptive statistics: This involves summarizing and describing the main features of a dataset, such as measures of central tendency (e.g. mean, median, mode) and measures of dispersion (e.g. variance, standard deviation), and can be used to identify outliers and anomalies.
- Inferential statistics: This involves making inferences about a population based on a sample of data. Inferential statistics can be used to test hypotheses, estimate parameters, and make predictions. Some common inferential statistical techniques include hypothesis testing, confidence intervals, and regression analysis.
- Regression analysis: This involves examining the relationship between a dependent variable and one or more independent variables. Regression analysis can be used to make predictions and identify important predictors of a target variable. There are various types of regression analysis, including linear regression, logistic regression, and time series regression.
- Time series analysis: This involves analyzing data that is collected over time, such as stock prices or weather data. Time series analysis can be used to identify trends, patterns, and seasonality in the data.
- Bayesian statistics: This involves combining prior knowledge or assumptions with new data to update the probability assigned to a hypothesis or parameter. Bayesian statistics can be used to make predictions and estimate probabilities in situations where there is uncertainty.
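The Bayesian idea of updating a belief with new data can be sketched in a few lines. This is a minimal illustration, not a full Bayesian workflow: it assumes a Beta prior on an unknown success rate and binomial data, and the function names `update_beta` and `beta_mean` as well as the counts are hypothetical, chosen only for this example.

```python
from fractions import Fraction


def update_beta(alpha, beta, successes, failures):
    """Return posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Assumed prior: Beta(1, 1), i.e. every success rate is equally plausible.
prior = (1, 1)

# Hypothetical new data: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print("prior mean:", float(beta_mean(*prior)))
print("posterior parameters:", posterior)
print("posterior mean:", float(beta_mean(*posterior)))
```

Before seeing any data the estimated rate is 0.5; after the ten observations it moves to 2/3, part of the way toward the observed rate of 0.7.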
In summary, statistics is a fundamental tool in data science, used to analyze and interpret data, make predictions, and drive insights. By applying statistical techniques to data, data scientists can uncover patterns and relationships that can inform decisions and drive innovation.
1.2 Basic statistical concepts
Population
A population is the entire group of individuals, objects, or events that we are interested in studying. It is often too difficult or expensive to study an entire population, so we usually select a sample from the population. For example, let's say we are interested in studying the average height of all women in India. The population in this case would be all women in India.
Examples:
- All registered voters in India
- All iPhone users in the world
- All students enrolled in a university
Sample
A sample is a subset of the population that we observe and collect data from. In the previous example, it would be impractical to measure the height of every woman in India, so we would instead measure the heights of a sample of women. For instance, we could measure the heights of 1,000 randomly selected women from across India and use that information to make inferences about the population. The number of observations in the sample, here 1,000, is called the sample size.
Examples:
- A random sample of 1,000 registered voters in India (sample size = 1,000)
- A convenience sample of 100 iPhone users in a certain region (sample size = 100)
- A systematic sample of 500 students enrolled in a university (sample size = 500)
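The relationship between a population and a sample can be illustrated with a small simulation. This is only a sketch: the population below is simulated with made-up parameters (it is not real height data), and the variable names are assumptions for this example. It draws a simple random sample and a systematic sample using Python's standard library.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (not real data).
population = [random.gauss(152, 6) for _ in range(100_000)]

# Simple random sample: 1,000 members chosen without replacement.
random_sample = random.sample(population, k=1_000)

# Systematic sample: every k-th member after a random starting point.
step = len(population) // 500
start = random.randrange(step)
systematic_sample = population[start::step]

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(random_sample)

print(f"population mean: {population_mean:.2f} cm")
print(f"random sample mean: {sample_mean:.2f} cm (n = {len(random_sample)})")
print(f"systematic sample size: {len(systematic_sample)}")
```

The sample mean will usually land close to the population mean even though only 1% of the population was measured, which is what makes inference from samples practical.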
Variable
A variable is any characteristic or attribute that can take on different values. For example, in a study measuring the effect of smoking on lung cancer, smoking status (smoker or non-smoker) would be a variable. In another study measuring the impact of exercise on weight loss, weight would be a variable.
Examples:
- Age of a person
- Gender of a person
- Income of a household
- Height of a tree
- Number of siblings a person has
Data types
There are two main types of data: Categorical and Numerical.
Categorical data are data that fall into categories or groups, such as gender, race, or favourite colour.
Numerical data are data that represent a quantity or measurement, such as weight, age, or income. Numerical data can be further classified as either discrete or continuous.
Discrete data are data that can only take on certain values, such as the number of children in a family. Continuous data are data that can take on any value within a range, such as height or weight.
Examples:
- Categorical: Gender, Hair colour, Political affiliation, Type of car
- Numerical (Discrete): Number of children in a family, Number of pets in a household, Number of employees in a company
- Numerical (Continuous): Height of a person, Weight of a person, Temperature, Age
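The data type of a variable determines which summaries make sense for it. The sketch below uses a few made-up survey records (the field names and values are hypothetical) to show one natural summary for each type: counting for categorical data, totalling for discrete data, and averaging for continuous data.

```python
from collections import Counter
import statistics

# Hypothetical survey records, invented for illustration.
records = [
    {"hair_colour": "black", "children": 2, "height_cm": 158.4},
    {"hair_colour": "brown", "children": 0, "height_cm": 171.2},
    {"hair_colour": "black", "children": 1, "height_cm": 165.0},
    {"hair_colour": "grey", "children": 3, "height_cm": 160.9},
]

# Categorical: values are labels, so we count them rather than average them.
hair_counts = Counter(r["hair_colour"] for r in records)

# Discrete numerical: whole-number counts that can be totalled.
total_children = sum(r["children"] for r in records)

# Continuous numerical: measurements that can take any value in a range.
mean_height = statistics.fmean([r["height_cm"] for r in records])

print(hair_counts.most_common())
print("total children:", total_children)
print(f"mean height: {mean_height:.1f} cm")
```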
Measures of central tendency
Measures of central tendency describe the typical or central value of a dataset. The most common measures of central tendency are the mean, median and mode. Here are the mathematical formulas and sample problems with solutions for each measure:
Mean/Arithmetic Average: The mean is the arithmetic average of all the values in a dataset, that is, their sum divided by the number of values. It uses every data point, so it gives a comprehensive measure of central tendency, but it is strongly influenced by outliers and skewed data.
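For n values x1, x2, ..., xn, the mean is (x1 + x2 + ... + xn) / n. The following sketch computes it directly and shows how a single extreme value pulls the mean while barely moving the median. The income figures are made up for illustration, and the helper `mean` is written out only to mirror the formula (Python's `statistics.mean` does the same job).

```python
import statistics


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical monthly incomes, in thousands (made-up numbers).
incomes = [30, 32, 35, 31, 33]
with_outlier = incomes + [500]

print("mean without outlier:", mean(incomes))
print("mean with outlier:", round(mean(with_outlier), 2))
print("median without outlier:", statistics.median(incomes))
print("median with outlier:", statistics.median(with_outlier))
```

Adding one income of 500 raises the mean from 32.2 to roughly 110, while the median only shifts from 32 to 32.5, which is why the median is often preferred for skewed data.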