CHAPTER 1
Introduction to Data Science

• Introduction to Data Science
• Evolution of Data Science
• Data Science Classification

Data Science Classification
• Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that allow computers to learn from data and make predictions or decisions based on it, without being explicitly programmed for each specific task.
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning

Type | What it learns from | Example
Supervised Learning | Labeled data (with correct answers) | Predict house prices, classify emails
Unsupervised Learning | Unlabeled data | Group similar customers, detect patterns
Reinforcement Learning | Feedback from actions (rewards) | Game AI, robotics
Semi-Supervised Learning | Small labeled + large unlabeled data | Text classification with limited labels

• Supervised learning is a type of machine learning where a model is trained on labeled data, that is, data where the input (features) and the correct output (label/target) are known.
• Unsupervised learning is a type of machine learning where the model is given only input data (X), with no labeled output (y), and tries to find hidden patterns or structures in the data on its own.
• Reinforcement learning (RL) is a machine learning approach where an agent learns to make decisions in an environment to maximize a reward.
• Semi-supervised learning (SSL) is a machine learning approach that combines a small amount of labeled data with a large amount of unlabeled data to train a model.

Data Science tasks
• Classification
• Regression
• Deep learning
• Clustering
• Association analysis
• Recommendation system
• Anomaly or outlier detection
• Feature selection
• Time series forecasting
• Text mining

Classification
• Classification is the task of assigning data points to predefined categories or classes.
It is a supervised learning technique in which a model is trained on labeled data (data with known categories) to predict the class of new, unseen data.
• Think of it as teaching a machine to sort things into groups based on their characteristics.
• The model learns from labeled examples (such as emails marked "spam" or "not spam"). After training, it can decide which category a new item belongs to, for example whether a new email is spam.
• Ex: Predicting whether a patient has diabetes or not.

Algorithms
• Logistic Regression
• Decision Trees
• Random Forests
• Support Vector Machines (SVM)
• Naive Bayes
• K-Nearest Neighbors (KNN)

Regression
• Regression is a supervised learning technique used to predict a continuous numerical value based on one or more input features.
• It models the relationship between independent variables (features) and a dependent variable (target) so that future values of the dependent variable can be predicted.
• Ex: Estimating the next day's temperature from a weather dataset.

Algorithms
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Support Vector Regression (SVR)
• Decision Tree Regression
• Random Forest Regression

Deep learning
• Deep learning is a subset of machine learning, and therefore of Artificial Intelligence (AI), that helps machines learn from large datasets using multi-layered neural networks.
• It automatically finds patterns and makes predictions, and largely eliminates the need for manual feature extraction.
• Ex: Identifying objects in an image or converting speech to text.

Algorithms
• Feedforward Neural Networks (FNNs)
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
• Long Short-Term Memory Networks (LSTMs)
• Generative Adversarial Networks (GANs)
• Transformer Networks
• Autoencoders
• Deep Belief Networks (DBNs)
• Deep Q-Networks (DQNs)
• Graph Neural Networks (GNNs)

Clustering
• The task of grouping data points based on their similarity to each other is called clustering or cluster analysis.
• It is a branch of unsupervised learning, which aims at gaining insights from unlabeled data points.
• Ex: Segmenting customers into similar buying groups for targeted marketing.

Algorithms
• K-Means
• Hierarchical Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Mean Shift
• Gaussian Mixture Models (GMM)
• Spectral Clustering
• Affinity Propagation
• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

Association Analysis
• Association analysis is an unsupervised data science technique where there is no target variable to predict.
• Instead, the algorithm reviews each transaction containing a number of items (products) and extracts useful relationship patterns among the items in the form of rules.
• A typical example is Market Basket Analysis, one of the key techniques used by large retailers to show associations between items.

Algorithms
• Apriori
• Eclat
• FP-Growth (Frequent Pattern Growth)

Recommendation System
• Recommendation systems, powered by machine learning, predict user preferences to suggest relevant items, enhancing user experience and driving engagement.
• These systems analyze user data, such as purchase history, browsing activity, and reviews, to identify patterns and make personalized recommendations.
• Ex: Suggesting movies on Netflix or products on Amazon and Flipkart.

Algorithms
• Collaborative filtering
• Content-based filtering
• Hybrid approaches/recommenders

Anomaly or Outlier Detection
• Anomaly detection and outlier detection are both techniques for identifying unusual data points, but they differ slightly in their approach and interpretation.
• Anomaly detection focuses on identifying data points that deviate significantly from the expected behavior of a dataset, often indicating something potentially harmful or important.
• Outlier detection, on the other hand, specifically identifies data points that are extremely distant from the other data points within a dataset.
• Ex: Detecting fraudulent credit card transactions.

Algorithms
• Distance-based algorithms
• Statistical algorithms
• Clustering-based algorithms
• Density-based algorithms

Feature selection
• Feature selection is the process of choosing the most relevant input features (variables) from a dataset for use in building a machine learning model.
• Ex: Picking only essential health metrics to predict the risk of heart disease.

Algorithms
• Filter Methods
• Wrapper Methods
• Embedded Methods

Time series forecasting
• Time series forecasting is a method used to predict future values of a variable based on its historical data collected over time.
• Ex: Forecasting temperature, sales, or stock prices.

Algorithms
• Autoregressive Integrated Moving Average (ARIMA)
• Exponential smoothing

Text mining
• Text mining is the process of extracting meaningful information from unstructured text data using computational techniques.
• Ex: Analyzing customer reviews to detect positive or negative sentiment.

Algorithms
• Named Entity Recognition (NER)
• Extractive Summarization
• Word embedding algorithms

Data collection
• From various sources: internal datasets, APIs, web scraping, external datasets
• Interviews and surveys
• Direct observation
• Online marketing analysis
• Focus groups
• Subscription data
• Reviews

Data cleaning & preparation
• Data cleaning and preparation is the crucial process of identifying and correcting errors, inconsistencies, and inaccuracies within a dataset to ensure its quality and reliability for analysis. It involves:
• Cleaning
• Integration
• Data reduction

Exploratory Data Analysis
• EDA involves examining the data to uncover patterns, trends, and relationships. It includes:
• Descriptive statistics to summarize data (mean, median, standard deviation).
• Statistical modeling to identify relationships and trends.
• Data visualization using charts, graphs, and plots to explore data distributions and correlations.
• Machine learning for deeper pattern detection and prediction.

Feature Engineering
• Feature engineering is the process of creating, transforming, and selecting the most relevant features for modeling. It includes:
• Generating new features from existing data (e.g., extracting "season" from dates).
• Transforming features through scaling, encoding, and normalization.
• Selecting important features using methods like variance thresholding, correlation analysis, etc.

Modeling
• Modeling is a key stage in which predictive or descriptive models are built using statistical and machine learning techniques. It involves:
• Selecting the right model type (e.g., classification, regression, clustering).
• Choosing and applying suitable algorithms.
• Training the model on the dataset.
• Performing hyperparameter tuning to optimize performance and avoid overfitting (overfitting is when a model performs well on training data but poorly on new data).
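
The modeling steps above (choosing an algorithm, training it, and tuning a hyperparameter against data the model has not seen) can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended implementation: the dataset is made up, the helper names (`split_hold_out`, `knn_predict`, `accuracy`, `tune_k`) are hypothetical, and K-Nearest Neighbors is used only because it is simple enough to write without a library. One training point is deliberately given a wrong label so that the effect of overfitting is visible.

```python
import math
from collections import Counter

# Toy labeled dataset, invented for illustration: (feature vector, class label).
# The point (1.05, 1.25) sits in the "A" region but is labeled "B" (label noise).
DATA = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.2, 0.9), "A"),
    ((0.9, 1.4), "A"), ((1.1, 1.1), "A"), ((1.05, 1.25), "B"),
    ((3.0, 3.2), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B"),
    ((3.1, 3.4), "B"), ((2.9, 2.7), "B"), ((3.3, 3.0), "B"),
]


def split_hold_out(data, every=3):
    """Hold-out split: every `every`-th row is kept aside for validation."""
    train = [row for i, row in enumerate(data) if i % every != 0]
    valid = [row for i, row in enumerate(data) if i % every == 0]
    return train, valid


def knn_predict(train, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in rows)
    return correct / len(rows)


def tune_k(train, valid, candidates=(1, 3, 5)):
    """Hyperparameter tuning: score each candidate k on the validation rows."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first k with the highest score
    return best_k, scores


train_rows, valid_rows = split_hold_out(DATA)
best_k, scores = tune_k(train_rows, valid_rows)
print("validation accuracy per k:", scores)
print("chosen k:", best_k)
print("prediction for (1.0, 1.0):", knn_predict(train_rows, (1.0, 1.0), best_k))
```

On this toy data, k = 1 follows the mislabeled point and scores 0.5 on the validation rows, while k = 3 and k = 5 both score 1.0, so k = 3 is chosen. That is the overfitting trade-off in miniature: the most flexible setting memorizes noise in the training data and does worse on data it has not seen.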

Evaluation
• Evaluation is the process of assessing a model's performance using key metrics such as accuracy, precision, recall, F1 score, RMSE, and AUC-ROC.
❑ Key techniques include:
• Hold-Out: Testing the model on a separate dataset not used in training.
• Cross-Validation: Splitting data into multiple sets to train and evaluate the model more reliably.
❑ Key metrics include:
• Accuracy: Measures the overall correctness of the model, indicating the proportion of correctly classified instances.
• Precision: Indicates the proportion of correctly predicted positive cases out of all instances predicted as positive.
• Recall: Measures the proportion of correctly predicted positive cases out of all actual positive instances.
• F1-score: The harmonic mean of precision and recall, providing a balance between the two metrics.
• AUC-ROC: Area Under the Receiver Operating Characteristic curve, a measure of a model's ability to distinguish between classes.
• Mean Squared Error (MSE): A common regression metric that calculates the average squared difference between predicted and actual values. RMSE is its square root.
• Confusion Matrix: A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.
• Minimum Sum of Absolute Errors (MSAE): A fitting criterion that minimizes the sum of absolute errors (SAE), which makes it more resistant to outliers than squared-error criteria.
• R-squared (R Square): A statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the model.
• Adjusted R-squared: A modified version of R-squared that adjusts for the number of predictors in a regression model.

Deployment
• Deployment is the process of placing the trained model into a real-world environment where it can generate predictions or insights.
• It involves:
• Exporting the model in a deployable format.
• Creating APIs to connect the model with applications or systems.
• Performing integration testing to ensure smooth and accurate operation in production.

Monitoring and Maintenance
• It is the final stage of the data science life cycle. It involves:
• Continuously tracking the model's performance using key metrics.
• Retraining the model with new data to maintain accuracy.
• Setting up alerts to detect performance drops or anomalies for timely updates and improvements.
• Implementing robust security measures to protect the model and data, especially when dealing with sensitive information.

Important Roles in Data Science
• Data Scientist
• Data Analyst
• Data Engineer
• Data Architect
• Machine Learning Engineer
• Business Analyst
• Data and Analytics Manager

Data Scientist
• Works with structured and unstructured data.
• Needs knowledge of R, Python, MATLAB, SQL, statistics, and machine learning.
• Needs strong data analysis and processing knowledge.
• Framing data problems and testing hypotheses
• Identifying sources for data collection
• Data processing and integration
• Performing predictive analytics
• Automating data workflows
• Offering insights using data science tools

Data Analyst
• Concentrates on interpreting existing data to support business decisions.
• Responsible for visualizing, processing, and analyzing large datasets, often using tools like SQL, R, SAS, and Python.
• Extracting data from primary and secondary sources
• Maintaining databases
• Performing exploratory data analysis and creating reports with insights and recommendations

Data Engineer
• Focuses on building and maintaining scalable data systems for efficient processing and analysis.
• Needs hands-on experience in technologies like Hive, NoSQL, R, Ruby, Java, C++, and MATLAB.
• Designing and maintaining data management systems
• Collecting and maintaining data
• Conducting primary and secondary research
• Upgrading technologies

Data Architect
• Designing data models
• Ensuring data quality and compliance
• Developing data strategies
• Managing data infrastructure
• Supporting analytics and business intelligence
• Staying up to date with technologies
• Collaborating with other teams

Machine Learning Engineer
• Builds intelligent systems using machine learning algorithms and tools.
• Needs Java, SQL, Python, mathematics, and statistics skills.
• Designing, researching, and testing machine learning systems
• Building data pipelines
• Building, implementing, and tuning machine learning algorithms
• Training models and evaluating performance
• Deploying models into production systems

Business Analyst
• Bridges the gap between business needs and data insights.
• Needs skills in data modelling, business finance, business intelligence, and data visualization tools.
• Understanding organizational goals
• Conducting in-depth business analysis to identify problems, opportunities, and solutions
• Improving existing business processes

Data Science Toolkit
1. Python: A widely used, beginner-friendly general-purpose programming language known for its readability and simplicity, well suited to data science and automation.
2. R: An open-source language for data manipulation and visualization, popular in statistical research. Easy to learn, with extensive packages and community support.
3. Keras: A high-level deep learning library in Python that runs on TensorFlow, designed for easy and fast experimentation with neural networks.
4. Apache Spark: A powerful, general-purpose cluster computing framework that supports multiple languages (Python, Java, Scala, R) and offers tools for data processing, machine learning, and streaming.
5. Statistical Analysis System (SAS): A proprietary software suite used for statistical analysis, favored by large organizations for its reliability and powerful statistical modeling capabilities.
6. BigML: A cloud-based platform offering an interactive GUI for applying machine learning algorithms across different business functions.
7. D3.js: A JavaScript library for creating dynamic, interactive data visualizations in web browsers.
8. MATLAB: A proprietary tool for numerical computing, widely used for data analysis, neural networks, fuzzy logic, image processing, and advanced visualizations.
9. Jupyter: An open-source web application supporting multiple languages (Python, R, Julia) for writing live code, visualizations, and interactive data science reports.
10. Matplotlib: A Python library for creating static, animated, and interactive plots such as bar charts, scatter plots, and histograms.
11. Natural Language Toolkit (NLTK): A Python toolkit for natural language processing, offering tools for tasks like tokenization, stemming, tagging, parsing, and machine learning.
12. Scikit-learn: A Python library for easy implementation of machine learning algorithms, including classification, regression, clustering, data preprocessing, and dimensionality reduction.
13. TensorFlow: A powerful open-source toolkit for deep learning and machine learning, known for its scalability and ability to run on CPUs (Central Processing Units), GPUs (Graphics Processing Units), and TPUs (Tensor Processing Units).

• E-Commerce Platforms (Amazon & Flipkart)
• Social Media: Enhancing Engagement and Ad Targeting
• Entertainment and Streaming Services
• Healthcare (Hospitals, Google Health, AI Diagnosis)
• Banking and Finance (SBI, Paytm, Credit Cards)
• Education (NPTEL, Online Class Platforms)
• Transportation and Ride-Sharing (Uber, Ola, Google Maps)
• Agriculture (Smart Farming, Crop Monitoring)
• Medicine and Drug Development
• Logistics and Supply Chain
• Credit and Insurance Industry
• Fraud and Risk Detection
• Social Media Analytics
• Customer Sentiment Analysis
• Cyber Security
• Scientific Research and Innovation
• Search Engines
• Targeted Advertising
• Website Recommendations
• Sports Analytics
• Transport and Autonomous Vehicles
• Airline Route Planning
• Human Resources and Recruitment
