Data Collection
• the process of gathering and measuring information on variables of interest in a systematic way to answer research questions, test hypotheses, or evaluate outcomes
❑Data
• Structured (labeled data)
• Unstructured (unlabeled data)
Types of data
➢Based on
▪Nature
▪Origin or Structure
▪Source
▪Time
Based on Nature
❑Quantitative (numerical)
❑Qualitative (descriptive/categorical)
❑Quantitative (numerical)
• Represents numerical information that can be measured and counted.
• Discrete Data: Can only take specific values, usually whole numbers (e.g., number of cars).
• Continuous Data: Can take any value within a range (e.g., height, weight).
❑Qualitative (descriptive/ categorical)
• Represents descriptive information that cannot be measured numerically
• Nominal Data: Categories with no inherent order or ranking.
• Ex: Eye color (brown, blue, green)
• Ordinal Data: Categories with a meaningful order or ranking, but the differences between categories are not necessarily equal or quantifiable.
• Ex: Ranking in a race (first, second, third)
Based on Origin or Structure
❑Structured Data:
• It has a predefined schema with rows and columns, making it easy to query and analyze.
• Ex: spreadsheet records
❑Semi-structured Data:
• Partially organized data with tags or markers
• Ex: JSON or XML files
❑Unstructured Data:
• Data without a predefined format or structure
• Ex: text messages, images, audios, videos
Based on Source
❑Primary data is original data collected by a researcher or organization for a specific research purpose
• surveys, interviews, observations, or experiments.
❑Secondary data refers to data that has been previously collected by someone else for a purpose other than the current research objective
• Government and public records: census data, vital statistics, official reports, research reports
Based on Time
• Cross-sectional data consists of several variables recorded at the same time or single point.
• Used for comparing multiple people, groups or items at the same time
• Ex: No. of students enrolled in various courses in this semester
• Time series data is data that is recorded over consistent intervals of time. Daily, monthly, yearly
• Used for observing trends, patterns, forecasting
• Pooled data is a combination of both time series data and cross-sectional data.
• Ex: Weekly attendance of students in the semester
Data Collection Methods
➢Primary Data Collection Methods
➢Secondary Data Collection Methods
❑ Primary Data Collection Methods
• First-hand, original data collected directly for a specific research purpose.
• Advantages: More reliable, valid, objective, and authentic than secondary data.
• Examples: Surveys, interviews, experiments, and questionnaires.
• Challenges: Can be costly, time-consuming, and complex to plan and execute.
• Nature: Exclusive data that remains unpublished until shared. It supports both qualitative and quantitative research methods.
❑ Primary Data Collection Methods:
• Questionnaire
• Interviews (structured, semi-structured, unstructured)
• Surveys
• Observations
• Focus Groups
• Schedules
• Experiments (laboratory, field, natural experiments)
• Diaries
• Poll
Secondary Data Collection Methods
• Data collected from existing sources, originally gathered for a purpose other than the current research.
• Books and Libraries: Printed/digital books via libraries/databases.
• Journals, Magazines, Newspapers, Periodicals: Academic/trade publications, including e-journals.
• Government Sources: Reports, census data, public records, etc.
• Business and Organizational Records: Financials, reports, statements, health/safety data.
• Online Databases and Internet Sources: Databases with articles, surveys, press releases.
• Personal and Unpublished Documents: Letters, biographies, personal records.
• Blogs and Weblogs: Individual or organizational blog content.
• Research Reports and Past Studies: Prior studies and findings as references.
• Social Media and Public Platforms: User-generated data from forums and social networks.
• Secondary Data Collection Methods:
• Journals, Magazines, Newspapers, and Periodicals
• Books and Libraries
• Government Sources
• Business and Organizational Records
• Online Databases and Internet Sources
• Personal and Unpublished Documents
• Blogs and Weblogs
• Research Reports and Past Studies
• Social Media and Public Platforms
• Benefits:
• Cost-effective and accessible compared to primary data.
• Useful when primary data collection isn’t feasible.
Data Preprocessing
• Critical in data science and machine learning.
• Involves cleaning, transforming, organizing raw data.
• Objective: Improve data quality
Steps in Data Preprocessing
• Data preprocessing is a multi-step process that improves the quality and usability of raw data for analysis or modeling.
• Data Profiling
• Understanding the structure, quality, and content of data.
• Data Cleaning
• Fixing or removing incorrect, corrupted, or missing data.
• Data Integration
• Combining data from multiple sources into a single dataset.
• Data Transformation
• Converting data into a suitable format (e.g., normalization, encoding).
• Data Reduction
• Reducing the volume of data while maintaining its integrity (e.g., dimensionality reduction).
• Data Discretization
• Converting continuous data into discrete buckets or intervals.
• Data Validation
• Ensuring data meets quality and consistency rules before analysis.
Forms of Data Preprocessing
• Data Cleaning: Removing noise/errors.
• Data Integration: Merging data from various sources.
• Data Reduction: Shrinking data size (e.g., fewer features).
• Data Transformation: Formatting or numerical conversion, e.g., scaling values such as 5, 87, 600 to 0.05, 0.87, 6.00 (a minimal sketch follows below).
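A minimal sketch of these transformation ideas in pandas, using hypothetical column names ("Income", "Category") chosen only for illustration:

```python
import pandas as pd

# Hypothetical data illustrating the transformation examples above
df = pd.DataFrame({"Income": [5, 87, 600], "Category": ["low", "mid", "high"]})

# Min-max normalization: rescale a numeric column to the [0, 1] range
col = df["Income"]
df["Income_minmax"] = (col - col.min()) / (col.max() - col.min())

# Simple scaling, matching the 5, 87, 600 -> 0.05, 0.87, 6.00 example (divide by 100)
df["Income_scaled"] = col / 100

# Encoding: convert a categorical column into numeric indicator columns
df = pd.get_dummies(df, columns=["Category"])
print(df)
```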
Data Profiling
• Examining and analyzing data to gather quality-related statistics.
• Helps understand data characteristics (attributes, distributions).
• Activities include:
• Surveying existing datasets
• Listing attributes
• Forming feature hypotheses
• Relating data to business concepts
• Selecting preprocessing tools
Data Cleaning
• Improves data quality by fixing:
• Missing values
• Duplicates
• Errors
• Outliers
• Techniques:
• Handling Missing Data: Drop rows, fill with mean/median, or use imputation.
• Removing Duplicates: Eliminate redundant records.
• Correcting Errors: Fix typos, standardize formats (e.g., date formats), correct invalid entries.
• Dealing with Outliers: Identify extreme values and decide to retain, adjust, or remove.
Handling Missing Data
├── Deletion
│   ├── Dropping Rows
│   └── Dropping Columns
└── Imputation
    ├── Filling with Specific Values
    │   ├── Mean Imputation
    │   ├── Median Imputation
    │   ├── Mode Imputation
    │   └── Constant Value Imputation
    ├── Statistical Imputation
    │   ├── Regression Imputation
    │   ├── KNN Imputation
    │   └── Multiple Imputation
    ├── Time-Series Imputation
    │   ├── Forward Fill
    │   ├── Backward Fill
    │   └── Interpolation
    └── Model-Based Imputation
        └── Using Machine Learning Models
Handling Missing Data
• Missing data impacts your analysis and the performance of machine learning models.
• Missing values may appear as "N/A" or "-999", or even as mis-entered data.
❑ Deletion
a) Dropping Rows
• If a row has one or more missing values, the entire row is removed from the dataset.
• can lead to a significant loss of data, especially if missingness is widespread.
Original dataset:

Customer ID | Product | Price | Quantity
1           | A       | 25    | 2
2           | B       | NaN   | 1
3           | C       | 30    | NaN
4           | A       | 25    | 3

After dropping rows with missing values:

Customer ID | Product | Price | Quantity
1           | A       | 25    | 2
4           | A       | 25    | 3
b) Dropping Columns
• Removes an entire column (variable) from the dataset if it has a high proportion of missing values
• Usually applied when the column is considered less critical for the analysis.
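A minimal sketch of both deletion strategies, assuming pandas and the example table above (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Example dataset from the table above; NaN marks missing values
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product":    ["A", "B", "C", "A"],
    "Price":      [25, np.nan, 30, 25],
    "Quantity":   [2, 1, np.nan, 3],
})

# a) Dropping rows: remove every row that contains at least one missing value
rows_dropped = df.dropna(axis=0)   # keeps CustomerID 1 and 4

# b) Dropping columns: remove every column that contains missing values
cols_dropped = df.dropna(axis=1)   # keeps CustomerID and Product

print(rows_dropped)
print(cols_dropped)
```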
❑ Imputation
• addresses missing values in a dataset by replacing them with estimated or plausible values.
• enabling more complete and reliable analysis
• Filling with specific values
• Statistical imputation
• Time series imputation
• Model based imputation
❖Filling with Specific Values
• replaces missing data with a predetermined value, such as the mean, median, mode, or a constant.
a) Mean Imputation: Replaces missing numerical values in a column with the average (mean) of the non-missing values in that same column.
• Example (Price): The mean of the non-missing "Price" values (25, 30, 25) is 26.67.

Customer ID | Product | Price | Quantity
2           | B       | 26.67 | 1
• Example (Quantity): The mean of the non-missing "Quantity" values (2, 1, 3) is 2.

Customer ID | Product | Price | Quantity
3           | C       | 30    | 2
b) Median Imputation: Replaces missing numerical values in a column with the middle value (median) of the non-missing values in that column.
• Example (Price): The median of the non-missing "Price" values (sorted as 25, 25, 30) is 25.

Customer ID | Product | Price | Quantity
2           | B       | 25    | 1
• Example (Quantity): The median of the non-missing "Quantity" values (sorted as 1, 2, 3) is 2.

Customer ID | Product | Price | Quantity
3           | C       | 30    | 2
c) Mode Imputation: Replaces missing categorical values in a column with the most frequent category (mode) in that same column.
• It is not typically used for numerical data like "Price" and "Quantity" unless those columns contained discrete, categorical-like values.
• Example (If "Product" had missing values):
Customer ID | Product | Price | Quantity
1           | A       | 25    | 2
2           | NaN     | NaN   | 1
3           | C       | 30    | NaN
4           | A       | 25    | 3
• The mode of the "Product" column is "A" (appears twice). The missing product would be filled as "A".
d) Constant Value Imputation: Replaces all missing values in a specific column with a predetermined, fixed value.
• This value is chosen based on domain knowledge or a specific assumption.
• Example (Price = 0 for unknown price): If a missing "Price" indicates it hasn't been set yet, we might fill it with 0.
• Example (Quantity = 1 for at least one sold): If a missing "Quantity" implies at least one sale occurred but wasn't recorded, we might fill it with 1.
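A minimal sketch of these four fills on the same example dataset (pandas assumed; the constant values 0 and 1 are the assumed domain defaults from the examples above):

```python
import pandas as pd
import numpy as np

# Same example dataset as in the deletion sketch above
df = pd.DataFrame({"CustomerID": [1, 2, 3, 4],
                   "Product":    ["A", "B", "C", "A"],
                   "Price":      [25, np.nan, 30, 25],
                   "Quantity":   [2, 1, np.nan, 3]})

# Mean imputation: Price -> 26.67, Quantity -> 2.0
mean_filled = df.fillna({"Price": df["Price"].mean(), "Quantity": df["Quantity"].mean()})

# Median imputation: Price -> 25, Quantity -> 2
median_filled = df.fillna({"Price": df["Price"].median(), "Quantity": df["Quantity"].median()})

# Mode imputation: intended for categorical columns such as "Product"
# (shown for illustration; this particular df has no missing Product values)
mode_filled = df.fillna({"Product": df["Product"].mode()[0]})

# Constant value imputation with assumed domain defaults: Price = 0, Quantity = 1
const_filled = df.fillna({"Price": 0, "Quantity": 1})
print(mean_filled)
```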
❖Statistical Imputation
• The correlations between variables in the dataset are used to estimate and fill in missing data using established statistical methods
• Regression Imputation
• KNN (K-Nearest Neighbours) Imputation
• Multiple Imputation
a) Regression Imputation
• Uses regression models to predict missing numerical values based on other variables in the dataset.
• It assumes a linear relationship between the variable with missing values and the other variables.
• Example (Predicting Price based on Product): To predict the missing "Price" for Product "B", we would need a larger dataset where "Price" and "Product" are both present.
• We could train a regression model to learn the relationship and then use Product "B" to predict its price.
b) KNN (K-Nearest Neighbours) Imputation:
• Imputes missing values by finding the most similar data points (neighbours) in the dataset based on the non-missing features and then using the average (for numerical) or mode (for categorical) of their values for the missing feature.
• Example (Imputing Price for CustomerID 2): To impute the missing "Price" for CustomerID 2 (Product "B", Quantity 1), we would look for other customers with similar "Product" and "Quantity" values.
• If we found customers who bought "B" with a quantity close to 1, we would use the average of their "Price" to fill the NaN.
c) Multiple Imputation:
• A method for handling missing data in which each missing value is estimated multiple times.
• This produces multiple complete datasets; each is analyzed separately, and the results are pooled into one final result.
• Example: Instead of a single value for the missing "Price" of Product "B", multiple imputation might generate several plausible prices (e.g., 22, 27, 24) based on statistical models and the variability in the observed data.
• Each of these imputed datasets would then be analyzed, and the results would be combined.
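A minimal sketch of KNN imputation with scikit-learn's KNNImputer on the numeric columns of the example table (a categorical column such as "Product" would need encoding first). Regression-style and multiple imputation follow the same fit/transform pattern, e.g., with scikit-learn's IterativeImputer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns of the example dataset (Price and Quantity each have one missing value)
num = pd.DataFrame({
    "Price":    [25, np.nan, 30, 25],
    "Quantity": [2, 1, np.nan, 3],
})

# Each missing value is replaced by the average of its k most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(num), columns=num.columns)
print(imputed)
```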
❖Time-Series Imputation
• These techniques are specifically designed for datasets where data points are ordered by time.
• Forward Fill (ffill): Fills missing values with the last valid observation.
• Backward Fill (bfill): Fills missing values with the next valid observation.
• Interpolation: Estimates missing values based on the values of adjacent data points (e.g., linear interpolation assumes a straight line between the known points).
• Example: If we had daily sales data for Product "B" and some days had missing "Quantity":

Date       | Product | Quantity
2025-05-01 | B       | 5
2025-05-02 | B       | NaN
2025-05-03 | B       | 7
• a) Forward Fill: The NaN on 2025-05-02 would be filled with 5 (the value from the previous day).
• b) Backward Fill: The NaN on 2025-05-02 would be filled with 7 (the value from the next day).
• c) Linear Interpolation: Fills the gap by finding the value halfway between the data points before and after the missing one. So the NaN on 2025-05-02 would be filled with (5+7)/2=6.
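A minimal sketch of the three time-series fills on the daily Quantity example above (pandas assumed):

```python
import pandas as pd
import numpy as np

# Daily Quantity for Product "B", matching the small table above
ts = pd.Series([5, np.nan, 7],
               index=pd.to_datetime(["2025-05-01", "2025-05-02", "2025-05-03"]))

print(ts.ffill())                        # forward fill: 2025-05-02 becomes 5
print(ts.bfill())                        # backward fill: 2025-05-02 becomes 7
print(ts.interpolate(method="linear"))   # linear interpolation: (5 + 7) / 2 = 6
```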
❖Model-Based Imputation
• Uses machine learning models to predict the missing values based on the other variables in the dataset.
• The variable with missing values is treated as the target variable, and the other variables are used as predictors.
• Example (Predicting Price using Product and Quantity):
• We could train a model (e.g., a decision tree, or a simple linear regression if the relationship is linear) on the complete rows to predict "Price" based on "Product" and "Quantity."
• Then, we would use the "Product" ("B") and "Quantity" (1) for Customer ID 2 to predict the missing "Price" (referring to the dataset in Table 2.1).
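A minimal sketch of the idea, treating Price as the target and the other columns as predictors. The linear model and the one-hot encoding are illustrative choices, and with so few rows the predicted value only demonstrates the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Product":  ["A", "B", "C", "A"],
    "Price":    [25, np.nan, 30, 25],
    "Quantity": [2, 1, np.nan, 3],
})

# Build the predictor matrix: one-hot encode Product, simple fill for the missing Quantity predictor
X = pd.get_dummies(df[["Product", "Quantity"]], columns=["Product"])
X["Quantity"] = X["Quantity"].fillna(X["Quantity"].median())

# Train on the complete rows (Price present), then predict the missing Price for CustomerID 2
has_price = df["Price"].notna()
model = LinearRegression().fit(X[has_price], df.loc[has_price, "Price"])
df.loc[~has_price, "Price"] = model.predict(X[~has_price])
print(df)
```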
Removing Duplicates
• Duplicate records can skew your analysis by over-representing certain observations.
• Identifying and removing rows that are identical across all or a subset of relevant columns is necessary.
• Exact Duplicates: Rows where all values are the same.
• Partial Duplicates: Rows that are the same across key identifying columns but might differ in other less important columns (e.g., timestamp of entry). You need to decide which columns define a "duplicate" in your context.
• Order of Keeping: You might want to keep the first or the last occurrence of a duplicate based on your data's context (e.g., the latest entry might be more relevant).
Example: dataset after removing a duplicate order (Order ID 3 has been dropped):

Order ID | Customer ID | Product | Quantity
1        | 101         | A       | 2
2        | 102         | B       | 1
4        | 103         | C       | 3
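A minimal sketch with pandas drop_duplicates, assuming (consistent with the table above) that Order 3 duplicated Order 2 across the key columns:

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID":    [1, 2, 3, 4],
    "CustomerID": [101, 102, 102, 103],
    "Product":    ["A", "B", "B", "C"],
    "Quantity":   [2, 1, 1, 3],
})

# Treat rows as duplicates when the chosen key columns match; keep the first occurrence
deduped = orders.drop_duplicates(subset=["CustomerID", "Product", "Quantity"], keep="first")
print(deduped)   # OrderID 3 is removed as a duplicate of OrderID 2
```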
Correcting Errors
• A wide range of techniques is used to fix inaccuracies and inconsistencies.
❑ Fixing Typos
• Identifying and correcting spelling errors in categorical or text data.
• Manual inspection, spell-checking libraries, and fuzzy matching algorithms (which find and suggest corrections for similar strings) are common techniques.
• Example: If a "City" column has entries like "New Yrok", "New Yorkk", and "New York", you would standardize them to "New York".
❑ Standardizing Formats
• It ensures that data in a column follows a consistent format.
• particularly important for dates, times, addresses, and phone numbers.
• use string manipulation functions, date/time parsing libraries, regular expressions etc., to identify and reformat inconsistent entries.
• Example: Converting date formats like "05/05/2025", "2025-05-05", and "May 5, 2025" to a standard format like "YYYY-MM-DD".
❑ Correcting Invalid Values
• Identifying and handling values that fall outside the expected range or violate business rules.
• Techniques include applying logical checks and using domain knowledge to define valid ranges, then correcting the values, imputing them, or flagging them for further investigation.
• Example: If an "Age" column has a value of -5 or 150, these are invalid and need to be addressed.
Dealing with Outliers
• Outliers are data points that deviate significantly from the rest of the data.
• They can arise due to genuine extreme values or errors in data collection.
▪Identifying Outliers
➢a) Visual Methods
• Box Plots:
• Shows the distribution of data and can clearly highlight values that fall outside the whiskers (lines that extend from the box to indicate the variability outside the upper and lower quartiles).
• Scatter Plots:
• Useful for identifying outliers in the relationship between two variables.
• Histograms:
• Can show unusual values at the tails of the distribution.
➢b) Statistical Methods
• Z-Score:
• Measures how many standard deviations a data point is from the mean.
• Values with a Z-score above a certain threshold (e.g., +/- 3) are often considered outliers.
• Interquartile Range (IQR):
• Quartiles are special percentiles: the 1st quartile Q1 is the 25th percentile, the 2nd quartile Q2 is the 50th percentile, and the 3rd quartile Q3 is the 75th percentile.
• For finding the quartile and percentile, the data should be sorted and ordered from the smallest to largest.
• For Quartiles, ordered data is divided into 4 equal parts.
• For Percentiles, ordered data is divided into 100 equal parts.
• IQR is the difference between the third quartile Q3 and the first quartile Q1, i.e., IQR = Q3 - Q1
• Outliers can be defined as values below Lower Bound or above Upper Bound, where, Lower Bound = Q1 - 1.5 x IQR and Upper Bound = Q3 + 1.5 x IQR.
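A minimal sketch of both statistical checks on a hypothetical sample. Note that on a tiny sample the z-score cannot reach the usual ±3 cutoff, so the IQR rule does the flagging here:

```python
import pandas as pd

# Hypothetical sample; the last value is far from the rest
s = pd.Series([10, 12, 11, 13, 12, 95])

# Z-scores: how many standard deviations each point lies from the mean
z = (s - s.mean()) / s.std()
print(z.round(2).tolist())

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)].tolist())   # 95 falls above the upper bound (15.0)
```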
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
• A clustering algorithm that can identify outliers as data points that do not belong to any dense cluster.
▪Handling Outliers
• a) Retention: If the outlier is a genuine extreme value and represents a real phenomenon, it might be important to keep it.
• b) Adjustment (Capping/Flooring): Replace outlier values with a predefined maximum or minimum value within a reasonable range. This can reduce their impact without completely removing them.
• c) Transformation: Applying mathematical transformations (e.g., logarithmic, square root) can sometimes reduce the impact of outliers by compressing the scale of the data.
• d) Removal:
• If the outlier is clearly an error or is likely to unduly influence the analysis, it might be removed.
• However, be cautious about removing too many data points.
House ID | Price
1        | 250
2        | 300
3        | 275
4        | 320
5        | 800   <- Outlier
• A box plot of the "Price" column would likely show 800 as an outlier. Depending on the context, you might keep it, cap it at a certain value, or investigate if it is an error.
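A minimal sketch of the handling options on this example, assuming pandas and reusing the IQR bounds from the previous subsection:

```python
import pandas as pd

houses = pd.DataFrame({"HouseID": [1, 2, 3, 4, 5],
                       "Price":   [250, 300, 275, 320, 800]})

q1, q3 = houses["Price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 207.5 and 387.5 for this data

# Adjustment (capping): pull the extreme value back to the upper bound (800 -> 387.5)
houses["Price_capped"] = houses["Price"].clip(lower, upper)

# Removal: keep only the rows that fall inside the bounds
filtered = houses[houses["Price"].between(lower, upper)]
print(houses)
print(filtered)
```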
Video Subtitles
Data collection is the systematic process of gathering and measuring information on variables of interest to answer research questions, test hypotheses, or evaluate outcomes. Data can be classified as structured, which has a predefined format like spreadsheets, or unstructured, which lacks a specific format like text messages or images. We classify data based on four main criteria: nature, which distinguishes between quantitative and qualitative data; origin or structure, which categorizes data as structured, semi-structured, or unstructured; source, which differentiates between primary and secondary data; and time, which considers when and how frequently data is collected.
Data types by nature are classified into quantitative and qualitative categories. Quantitative data represents numerical information that can be measured and counted. It includes discrete data, which can only take specific values like whole numbers such as the number of cars, and continuous data, which can take any value within a range like height or weight measurements. Qualitative data represents descriptive information that cannot be measured numerically. It includes nominal data, which consists of categories with no inherent order like eye colors, and ordinal data, which has categories with meaningful order or ranking like positions in a race, though the differences between categories are not necessarily equal or quantifiable.
Data can be classified using three different systems. By origin or structure, we have structured data with predefined schemas like spreadsheets, semi-structured data that is partially organized with tags like JSON or XML files, and unstructured data without predefined format like text messages, images, and videos. By source, we distinguish between primary data, which is original data collected directly for specific research purposes through surveys and interviews, and secondary data, which refers to previously collected data used for purposes other than the original research objective, such as government records and research reports. By time, we have cross-sectional data consisting of variables recorded at the same time for comparing multiple subjects, time series data recorded over consistent intervals for observing trends and forecasting, and pooled data which combines both time series and cross-sectional elements.
Data collection methods are divided into primary and secondary approaches. Primary data collection involves gathering first-hand, original data directly for a specific research purpose through methods like questionnaires, interviews, surveys, observations, focus groups, experiments, diaries, and polls. Primary data is more reliable, valid, objective, and authentic, but can be costly, time-consuming, and complex to plan and execute. Secondary data collection uses existing sources and data collected for purposes other than the current research objective. Sources include books and libraries, journals and magazines, government sources, business records, online databases, and social media platforms. Secondary data is cost-effective and accessible, making it useful when primary data collection isn't feasible, though it may not perfectly match the current research needs.
Data preprocessing is a critical process in data science and machine learning that involves cleaning, transforming, and organizing raw data to improve its quality for analysis and modeling. The preprocessing pipeline consists of seven key steps. First, data profiling examines and analyzes data to understand its structure, quality, and characteristics. Second, data cleaning fixes or removes incorrect, corrupted, or missing data. Third, data integration combines data from multiple sources into a single dataset. Fourth, data transformation converts data into suitable formats through normalization and encoding. Fifth, data reduction reduces the volume of data while maintaining its integrity through techniques like dimensionality reduction. Sixth, data discretization converts continuous data into discrete buckets or intervals. Finally, data validation ensures data meets quality and consistency rules before analysis. This systematic approach transforms messy, unreliable raw data into clean, structured datasets ready for meaningful analysis and accurate machine learning models.