Data Collection
• The process of gathering and measuring information on variables of interest in a systematic way to answer research questions, test hypotheses, or evaluate outcomes.
❑ Data
  • Structured (labeled data)
  • Unstructured (unlabeled data)

Types of Data
➢ Based on
  ▪ Nature
  ▪ Origin or Structure
  ▪ Source
  ▪ Time

Based on Nature
❑ Quantitative (numerical)
❑ Qualitative (descriptive/categorical)

❑ Quantitative (numerical)
  • Represents numerical information that can be measured and counted.
  • Discrete Data: Can only take specific values, usually whole numbers (e.g., number of cars).
  • Continuous Data: Can take any value within a range (e.g., height, weight).

❑ Qualitative (descriptive/categorical)
  • Represents descriptive information that cannot be measured numerically.
  • Nominal Data: Categories with no inherent order or ranking.
    • Ex: Eye color (brown, blue, green)
  • Ordinal Data: Categories with a meaningful order or ranking, but the differences between categories are not necessarily equal or quantifiable.
    • Ex: Ranking in a race (first, second, third)

Based on Origin or Structure
❑ Structured Data:
  • Has a predefined schema with rows and columns, making it easy to query and analyze.
  • Ex: spreadsheet records
❑ Semi-structured Data:
  • Partially organized data with tags or markers.
  • Ex: JSON or XML files
❑ Unstructured Data:
  • Data without a predefined format or structure.
  • Ex: text messages, images, audio, video

Based on Source
❑ Primary data is original data collected by a researcher or organization for a specific research purpose.
  • Ex: surveys, interviews, observations, or experiments
❑ Secondary data refers to data that has been previously collected by someone else for a purpose other than the current research objective.
  • Ex: government and public records (census data, vital statistics, official reports), research reports

Based on Time
• Cross-sectional data consists of several variables recorded at the same time or at a single point in time.
  • Used for comparing multiple people, groups, or items at the same time.
  • Ex: Number of students enrolled in various courses this semester
• Time series data is data recorded over consistent intervals of time (daily, monthly, yearly).
  • Used for observing trends and patterns and for forecasting.
• Pooled data is a combination of time series data and cross-sectional data.
  • Ex: Weekly attendance of students over the semester
Data Collection Methods
➢ Primary Data Collection Methods
➢ Secondary Data Collection Methods

❑ Primary Data Collection Methods
  • First-hand, original data collected directly for a specific research purpose.
  • Advantages: More reliable, valid, objective, and authentic than secondary data.
  • Examples: Surveys, interviews, experiments, and questionnaires.
  • Challenges: Can be costly, time-consuming, and complex to plan and execute.
  • Nature: Exclusive data that remains unpublished until shared. It supports both qualitative and quantitative research methods.
❑ Primary Data Collection Methods:
  • Questionnaires
  • Interviews (structured, semi-structured, unstructured)
  • Surveys
  • Observations
  • Focus Groups
  • Schedules
  • Experiments (laboratory, field, natural)
  • Diaries
  • Polls

❑ Secondary Data Collection Methods
  • Data collected from existing sources, originally gathered for a purpose other than the current one.
  • Books and Libraries: Printed/digital books via libraries/databases.
  • Journals, Magazines, Newspapers, and Periodicals: Academic/trade publications, including e-journals.
  • Government Sources: Reports, census data, public records, etc.
  • Business and Organizational Records: Financials, reports, statements, health/safety data.
  • Online Databases and Internet Sources: Databases with articles, surveys, press releases.
  • Personal and Unpublished Documents: Letters, biographies, personal records.
  • Blogs and Weblogs: Individual or organizational blog content.
  • Research Reports and Past Studies: Prior studies and findings used as references.
  • Social Media and Public Platforms: User-generated data from forums and social networks.
  • Benefits:
    • Cost-effective and accessible compared to primary data.
    • Useful when primary data collection isn't feasible.
Data Preprocessing
• A critical step in data science and machine learning.
• Involves cleaning, transforming, and organizing raw data.
• Objective: Improve data quality.

Steps in Data Preprocessing
• Data preprocessing is a multi-step process that improves the quality and usability of raw data for analysis or modeling.
  • Data Profiling: Understanding the structure, quality, and content of data.
  • Data Cleaning: Fixing or removing incorrect, corrupted, or missing data.
  • Data Integration: Combining data from multiple sources into a single dataset.
  • Data Transformation: Converting data into a suitable format (e.g., normalization, encoding).
  • Data Reduction: Reducing the volume of data while maintaining its integrity (e.g., dimensionality reduction).
  • Data Discretization: Converting continuous data into discrete buckets or intervals.
  • Data Validation: Ensuring data meets quality and consistency rules before analysis.

Forms of Data Preprocessing
• Data Cleaning: Removing noise/errors.
• Data Integration: Merging data from various sources.
• Data Reduction: Shrinking data size (e.g., fewer features).
• Data Transformation: Formatting/numerical conversion (e.g., scaling the values 5, 87, 600 to 0.05, 0.87, 6.00).

Data Profiling
• Examining and analyzing data to gather quality-related statistics.
• Helps understand data characteristics (attributes, distributions).
• Activities include:
  • Surveying existing datasets
  • Listing attributes
  • Forming feature hypotheses
  • Relating data to business concepts
  • Selecting preprocessing tools

Data Cleaning
• Improves data quality by fixing:
  • Missing values
  • Duplicates
  • Errors
  • Outliers
• Techniques:
  • Handling Missing Data: Drop rows, fill with mean/median, or use imputation.
  • Removing Duplicates: Eliminate redundant records.
  • Correcting Errors: Fix typos, standardize formats (e.g., date formats), correct invalid entries.
  • Dealing with Outliers: Identify extreme values and decide to retain, adjust, or remove them.

Handling Missing Data
├── Deletion
│   ├── Dropping Rows
│   └── Dropping Columns
└── Imputation
    ├── Filling with Specific Values
    │   ├── Mean Imputation
    │   ├── Median Imputation
    │   ├── Mode Imputation
    │   └── Constant Value Imputation
    ├── Statistical Imputation
    │   ├── Regression Imputation
    │   ├── KNN Imputation
    │   └── Multiple Imputation
    ├── Time-Series Imputation
    │   ├── Forward Fill
    │   ├── Backward Fill
    │   └── Interpolation
    └── Model-Based Imputation
        └── Using Machine Learning Models

Handling Missing Data
• Missing data can impact your analysis and the performance of machine learning models.
• Missing values may appear as blanks, placeholders such as "N/A" or "-999", or mis-entered data.

❑ Deletion
a) Dropping Rows
  • If a row has one or more missing values, the entire row is removed from the dataset.
  • Can lead to a significant loss of data, especially if missingness is widespread.

  Original data:
  Customer ID | Product | Price | Quantity
  1           | A       | 25    | 2
  2           | B       | NaN   | 1
  3           | C       | 30    | NaN
  4           | A       | 25    | 3

  After dropping rows with missing values:
  Customer ID | Product | Price | Quantity
  1           | A       | 25    | 2
  4           | A       | 25    | 3

b) Dropping Columns
  • Removing an entire column (variable) from the dataset if it has a high proportion of missing values or is considered less critical for the analysis.
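• A minimal pandas sketch of the deletion strategies above, using a DataFrame that mirrors the customer table shown; the column names and the 70% completeness cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Customer table from the example above, with NaN marking the missing values.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product": ["A", "B", "C", "A"],
    "Price": [25, np.nan, 30, 25],
    "Quantity": [2, 1, np.nan, 3],
})

print(df.isna().sum())            # how many values are missing per column

rows_dropped = df.dropna()        # a) drop every row that has any missing value
cols_dropped = df.dropna(axis=1)  # b) drop every column that has any missing value

# In practice, columns are usually dropped only when missingness is high, e.g.
# keep a column only if at least 70% of its values are present (illustrative cutoff).
sparse_cols_dropped = df.dropna(axis=1, thresh=int(0.7 * len(df)))
```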
❑ Imputation
• Addresses missing values in a dataset by replacing them with estimated or plausible values, enabling more complete and reliable analysis.
• Approaches:
  • Filling with specific values
  • Statistical imputation
  • Time-series imputation
  • Model-based imputation

❖ Filling with Specific Values
• Replaces missing data with a predetermined value, such as the mean, median, mode, or a constant.

a) Mean Imputation: Replaces missing numerical values in a column with the average (mean) of the non-missing values in that same column.
  • Example (Price): The mean of the non-missing "Price" values (25, 30, 25) is 26.67.
    Customer ID | Product | Price | Quantity
    2           | B       | 26.67 | 1
  • Example (Quantity): The mean of the non-missing "Quantity" values (2, 1, 3) is 2.
    Customer ID | Product | Price | Quantity
    3           | C       | 30    | 2

b) Median Imputation: Replaces missing numerical values in a column with the middle value (median) of the non-missing values in that column.
  • Example (Price): The median of the non-missing "Price" values (sorted: 25, 25, 30) is 25.
    Customer ID | Product | Price | Quantity
    2           | B       | 25    | 1
  • Example (Quantity): The median of the non-missing "Quantity" values (sorted: 1, 2, 3) is 2.
    Customer ID | Product | Price | Quantity
    3           | C       | 30    | 2

c) Mode Imputation: Replaces missing categorical values in a column with the most frequent category (mode) in that same column.
  • It is not typically used for numerical data like "Price" and "Quantity" unless those columns contain discrete, categorical-like values.
  • Example (if "Product" had missing values):
    Customer ID | Product | Price | Quantity
    1           | A       | 25    | 2
    2           | NaN     | NaN   | 1
    3           | C       | 30    | NaN
    4           | A       | 25    | 3
  • The mode of the "Product" column is "A" (appears twice), so the missing product would be filled as "A".

d) Constant Value Imputation: Replaces all missing values in a specific column with a predetermined, fixed value.
  • The value is chosen based on domain knowledge or a specific assumption.
  • Example (Price = 0 for unknown price): If a missing "Price" indicates it hasn't been set yet, we might fill it with 0.
  • Example (Quantity = 1 for at least one sold): If a missing "Quantity" implies at least one sale occurred but wasn't recorded, we might fill it with 1.
  (A short pandas sketch of these four techniques appears after the Statistical Imputation section below.)

❖ Statistical Imputation
• The correlations between variables in the dataset are used to estimate and fill in missing data using established statistical methods.
  • Regression Imputation
  • KNN (K-Nearest Neighbours) Imputation
  • Multiple Imputation

a) Regression Imputation
  • Uses regression models to predict missing numerical values based on other variables in the dataset.
  • Assumes a (typically linear) relationship between the variable with missing values and the other variables.
  • Example (Predicting Price based on Product): To predict the missing "Price" for Product "B", we would need a larger dataset where "Price" and "Product" are both present.
  • We could train a regression model to learn the relationship and then use Product "B" to predict its price.

b) KNN (K-Nearest Neighbours) Imputation
  • Imputes missing values by finding the most similar data points (neighbours) based on the non-missing features, then using the average (for numerical) or mode (for categorical) of their values for the missing feature.
  • Example (Imputing Price for Customer ID 2): To impute the missing "Price" for Customer ID 2 (Product "B", Quantity 1), we would look for other customers with similar "Product" and "Quantity" values.
  • If we found customers who bought "B" with a quantity close to 1, we would use the average of their "Price" to fill the NaN.

c) Multiple Imputation
  • Handles missing data by estimating each missing value multiple times.
  • This process produces multiple complete datasets, each of which is analyzed, and the results are pooled to create one final result.
  • Example: Instead of a single value for the missing "Price" of Product "B", multiple imputation might generate several plausible prices (e.g., 22, 27, 24) based on statistical models and the variability in the observed data.
  • Each of these imputed datasets would then be analyzed, and the results would be combined.
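• As referenced above, a minimal pandas sketch of the filling-with-specific-values techniques (mean, median, mode, constant) on the hypothetical customer table; the added *_filled column names are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product": ["A", np.nan, "C", "A"],
    "Price": [25, np.nan, 30, 25],
    "Quantity": [2, 1, np.nan, 3],
})

# a) Mean imputation: the missing Price becomes (25 + 30 + 25) / 3 = 26.67.
df["Price_mean"] = df["Price"].fillna(df["Price"].mean())

# b) Median imputation: the missing Quantity becomes the median of 1, 2, 3 = 2.
df["Quantity_median"] = df["Quantity"].fillna(df["Quantity"].median())

# c) Mode imputation for the categorical Product column: NaN becomes "A".
df["Product_mode"] = df["Product"].fillna(df["Product"].mode()[0])

# d) Constant value imputation: assume a missing Price simply has not been set yet.
df["Price_constant"] = df["Price"].fillna(0)
```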
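• A hedged scikit-learn sketch of KNN and regression-style/multiple imputation on the same hypothetical table. It assumes scikit-learn is installed; IterativeImputer is still exported through the experimental module, and repeating it with sample_posterior=True only approximates full multiple imputation.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the import below)
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.DataFrame({
    "Product": ["A", "B", "C", "A"],
    "Price": [25, np.nan, 30, 25],
    "Quantity": [2, 1, np.nan, 3],
})

# One-hot encode Product so distances and regressions are defined on numbers only.
X = pd.get_dummies(df, columns=["Product"]).astype(float)

# b) KNN imputation: each NaN is replaced by the average of its nearest neighbours.
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X), columns=X.columns)

# a) Regression imputation: IterativeImputer regresses each incomplete column
# on the other columns and predicts the missing entries.
reg_filled = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(X), columns=X.columns)

# c) Multiple imputation (approximation): draw several plausible completions by
# sampling from the posterior, analyze each, then pool the results.
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X),
        columns=X.columns,
    )
    for s in range(5)
]
```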
❖ Time-Series Imputation
• These techniques are specifically designed for datasets where data points are ordered by time.
  • Forward Fill (ffill): Fills missing values with the last valid observation.
  • Backward Fill (bfill): Fills missing values with the next valid observation.
  • Interpolation: Estimates missing values based on the values of adjacent data points (e.g., linear interpolation assumes a straight line between the known points).
• Example: Daily sales data for Product "B" with a missing "Quantity":
  Date       | Product | Quantity
  2025-05-01 | B       | 5
  2025-05-02 | B       | NaN
  2025-05-03 | B       | 7
  a) Forward Fill: The NaN on 2025-05-02 is filled with 5 (the value from the previous day).
  b) Backward Fill: The NaN on 2025-05-02 is filled with 7 (the value from the next day).
  c) Linear Interpolation: Fills the gap with the value halfway between the points before and after the missing one, so the NaN on 2025-05-02 is filled with (5 + 7) / 2 = 6.
  (Pandas sketches of these fills and of model-based imputation appear after the Removing Duplicates section below.)

❖ Model-Based Imputation
• Uses machine learning models to predict the missing values based on the other variables in the dataset.
• The variable with missing values is treated as the target variable, and the other variables are used as predictors.
• Example (Predicting Price using Product and Quantity):
  • Train a model (e.g., a decision tree, or a simple linear regression if the relationship is linear) on the complete rows to predict "Price" based on "Product" and "Quantity".
  • Then use the "Product" ("B") and "Quantity" (1) for Customer ID 2 to predict the missing "Price", referring to the original customer dataset (Table 2.1).

Removing Duplicates
• Duplicate records can skew your analysis by over-representing certain observations.
• Identifying and removing rows that are identical across all or a subset of relevant columns is necessary.
  • Exact Duplicates: Rows where all values are the same.
  • Partial Duplicates: Rows that are the same across key identifying columns but might differ in other, less important columns (e.g., timestamp of entry). You need to decide which columns define a "duplicate" in your context.
  • Order of Keeping: You might want to keep the first or the last occurrence of a duplicate based on your data's context (e.g., the latest entry might be more relevant).
• Example (result after removing a duplicate order):
  Order ID | Customer ID | Product | Quantity
  1        | 101         | A       | 2
  2        | 102         | B       | 1
  4        | 103         | C       | 3
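• As referenced above, a minimal pandas sketch of forward fill, backward fill, and linear interpolation on the three-day quantity series for Product "B".

```python
import numpy as np
import pandas as pd

# Daily Quantity for Product "B", with the gap on 2025-05-02.
qty = pd.Series(
    [5, np.nan, 7],
    index=pd.to_datetime(["2025-05-01", "2025-05-02", "2025-05-03"]),
    name="Quantity",
)

print(qty.ffill())        # a) forward fill: the gap becomes 5
print(qty.bfill())        # b) backward fill: the gap becomes 7
print(qty.interpolate())  # c) linear interpolation: the gap becomes (5 + 7) / 2 = 6
```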
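• A sketch of the model-based idea above: train a simple regression on the complete rows and predict the missing Price for Customer ID 2. It assumes Quantity has already been imputed (here set to 2 for Customer ID 3) so the predictors are complete, and the choice of LinearRegression is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product": ["A", "B", "C", "A"],
    "Price": [25, np.nan, 30, 25],
    "Quantity": [2, 1, 2, 3],  # Quantity for Customer 3 assumed already imputed
})

# Predictors: one-hot encoded Product plus Quantity. Target: Price.
X = pd.get_dummies(df[["Product", "Quantity"]], columns=["Product"]).astype(float)
known = df["Price"].notna()

# Train on the complete rows, then predict Price where it is missing.
model = LinearRegression().fit(X[known], df.loc[known, "Price"])
df.loc[~known, "Price"] = model.predict(X[~known])
```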
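• A minimal pandas sketch of duplicate removal. The "before" table is a hypothetical version of the order data in which Order 3 repeats Order 2 on the key columns; dropping it yields the table shown above.

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "CustomerID": [101, 102, 102, 103],
    "Product": ["A", "B", "B", "C"],
    "Quantity": [2, 1, 1, 3],
})

# Exact duplicates across all columns would use drop_duplicates() with no arguments.
# Here a duplicate is defined by the key columns, and the first occurrence is kept.
deduped = orders.drop_duplicates(subset=["CustomerID", "Product", "Quantity"], keep="first")

# keep="last" would instead retain the most recent occurrence of each duplicate.
```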
Correcting Errors
• A wide range of techniques is used to fix inaccuracies and inconsistencies.

❑ Fixing Typos
  • Identifying and correcting spelling errors in categorical or text data.
  • Techniques include manual inspection, spell-checking libraries, and fuzzy matching algorithms that find and suggest corrections for similar strings.
  • Example: If a "City" column has entries like "New Yrok", "New Yorkk", and "New York", you would standardize them to "New York".

❑ Standardizing Formats
  • Ensures that data in a column follows a consistent format.
  • Particularly important for dates, times, addresses, and phone numbers.
  • Use string manipulation functions, date/time parsing libraries, regular expressions, etc., to identify and reformat inconsistent entries.
  • Example: Converting date formats like "05/05/2025", "2025-05-05", and "May 5, 2025" to a standard format like "YYYY-MM-DD".

❑ Correcting Invalid Values
  • Identifying and handling values that fall outside the expected range or violate business rules.
  • Techniques include applying logical checks and using domain knowledge to define valid ranges, then correcting the values, imputing them, or flagging them for further investigation.
  • Example: If an "Age" column has a value of -5 or 150, these are invalid and need to be addressed.
  (A pandas sketch of these error-correction steps appears at the end of this section.)

Dealing with Outliers
• Outliers are data points that deviate significantly from the rest of the data.
• They can arise due to genuine extreme values or errors in data collection.

▪ Identifying Outliers
➢ a) Visual Methods
  • Box Plots: Show the distribution of the data and clearly highlight values that fall outside the whiskers (lines that extend from the box to indicate the variability outside the upper and lower quartiles).
  • Scatter Plots: Useful for identifying outliers in the relationship between two variables.
  • Histograms: Can show unusual values at the tails of the distribution.
➢ b) Statistical Methods
  • Z-Score:
    • Measures how many standard deviations a data point is from the mean: z = (x - mean) / standard deviation.
    • Values with an absolute Z-score above a certain threshold (e.g., |z| > 3) are often considered outliers.
  • Interquartile Range (IQR):
    • Quartiles are special percentiles: the 1st quartile Q1 is the 25th percentile, the 2nd quartile Q2 is the 50th percentile, and the 3rd quartile Q3 is the 75th percentile.
    • To find quartiles and percentiles, the data must be sorted from smallest to largest.
    • For quartiles, the ordered data is divided into 4 equal parts; for percentiles, into 100 equal parts.
    • IQR is the difference between the third quartile Q3 and the first quartile Q1, i.e., IQR = Q3 - Q1.
    • Outliers are values below the Lower Bound or above the Upper Bound, where Lower Bound = Q1 - 1.5 × IQR and Upper Bound = Q3 + 1.5 × IQR.
  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
    • A clustering algorithm that can identify outliers as data points that do not belong to any dense cluster.

▪ Handling Outliers
  a) Retention: If the outlier is a genuine extreme value and represents a real phenomenon, it might be important to keep it.
  b) Adjustment (Capping/Flooring): Replace outlier values with a predefined maximum or minimum value within a reasonable range. This can reduce their impact without completely removing them.
  c) Transformation: Applying mathematical transformations (e.g., logarithmic, square root) can sometimes reduce the impact of outliers by compressing the scale of the data.
  d) Removal: If the outlier is clearly an error or is likely to unduly influence the analysis, it might be removed. However, be cautious about removing too many data points.

• Example:
  House ID | Price
  1        | 250
  2        | 300
  3        | 275
  4        | 320
  5        | 800  <- outlier
• A box plot of the "Price" column would likely show 800 as an outlier. Depending on the context, you might keep it, cap it at a certain value, or investigate whether it is an error.
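• As referenced above, a hedged pandas sketch of the error-correction steps (fixing typos, standardizing date formats, handling invalid ages). The mapping of misspellings and the 0-120 age range are illustrative assumptions, and format="mixed" in pd.to_datetime assumes pandas 2.0 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["New York", "New Yrok", "New Yorkk"],
    "Date": ["05/05/2025", "2025-05-05", "May 5, 2025"],
    "Age": [34, -5, 150],
})

# Fixing typos: map known misspellings onto the canonical category.
df["City"] = df["City"].replace({"New Yrok": "New York", "New Yorkk": "New York"})

# Standardizing formats: parse the mixed date strings and re-emit them as YYYY-MM-DD.
df["Date"] = pd.to_datetime(df["Date"], format="mixed").dt.strftime("%Y-%m-%d")

# Correcting invalid values: keep ages only inside a plausible range; out-of-range
# entries become NaN so they can be corrected, imputed, or investigated later.
df["Age"] = df["Age"].where(df["Age"].between(0, 120))
```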
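• A minimal pandas sketch of the Z-score and IQR checks on the house prices above, plus capping as one way of handling the flagged value; the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

prices = pd.Series([250, 300, 275, 320, 800], name="Price")  # house prices above

# Z-score: distance from the mean in standard deviations. With only five points
# the |z| > 3 rule is not very informative, so the IQR rule is used below.
z_scores = (prices - prices.mean()) / prices.std()

# IQR rule: Q1 = 275, Q3 = 320, IQR = 45, so the bounds are 207.5 and 387.5,
# and 800 is flagged as an outlier.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

# Handling by adjustment (capping/flooring) rather than removal.
capped = prices.clip(lower=lower, upper=upper)
```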
