**Machine Learning (ML) Core Definition**

* ML enables a machine to **learn patterns from data without being explicitly programmed** with rigid rules.
* Instead of providing step-by-step instructions, we supply examples of inputs and their corresponding outputs; the ML algorithm “decodes” the relationship and builds a model.

---

**Analogy: Grocery Shopping**

1. **Traditional Programming (Rule-Based)**:
   * You receive a complete set of instructions (“drive this route, go to exactly that shop, ask for this brand, check the scale, inspect the packaging, handle payment, etc.”).
   * You simply follow each step; there’s no flexibility or “learning.”
2. **Machine Learning Approach**:
   * You’re told only WHAT you need (e.g., “Bring 1 kg of this sugar; here’s the money”).
   * You explore routes, check multiple shops, evaluate scale accuracy, examine packaging, and verify the change, all on your own.
   * You adapt based on what you observe (e.g., if the first shop is out of stock, you go to the next).
   * Over time, you “learn” which shops are reliable, which routes are fastest, etc., forming an adaptive pattern.

---

**Relationship: AI, ML, Deep Learning**

* **Artificial Intelligence (AI)**: The broad field encompassing any system that mimics human intelligence.
  * Includes both rule-based systems (explicit instructions) and learning-based systems.
* **Machine Learning (ML)**: A **subset of AI** where the system learns patterns from data instead of relying solely on hard-coded rules.
  * Anything ML can do is part of AI, but not all AI techniques are ML.
* **Deep Learning (DL)**: A **subset of ML** that uses multi-layer (deep) neural networks to learn hierarchical patterns.

---

**ML Life Cycle (Key Steps)** *(an end-to-end code sketch of these steps follows this list)*

1. **Problem Understanding**
   * Clearly define the problem’s nature and scope (e.g., predicting sales, detecting fraud).
   * A correct initial understanding prevents wasted effort (e.g., treating a flood like a fire would be disastrous).
2. **Data Collection**
   * Gather relevant, problem-specific data (e.g., academic records if addressing an academic issue).
   * Irrelevant data (e.g., sports stats for an academic problem) is discarded.
3. **Data Preprocessing**
   * Clean, normalize, and transform raw data into a usable format for modeling.
   * Examples include handling missing values, encoding categorical features, and scaling numeric values.
4. **Model Selection**
   * Choose an appropriate ML algorithm (e.g., linear regression for continuous outcomes, decision trees for classification).
   * The goal is to find the algorithm best suited to your problem.
5. **Model Training**
   * Fit the chosen algorithm on the **training dataset** (features `X_train`, labels `Y_train`).
   * The model “learns” the mapping between inputs and outputs.
6. **Model Evaluation (Testing)**
   * Evaluate the trained model on the **test dataset** (features `X_test`, labels `Y_test`) to measure performance (accuracy, error rate, loss, etc.).
   * Compare the model’s predicted outputs (`Y_pred`) to the true labels (`Y_test`).
7. **Hyperparameter Tuning**
   * If performance is suboptimal, adjust hyperparameters (e.g., learning rate, tree depth).
   * Re-train and re-evaluate until performance is satisfactory on both the training and test sets.
8. **Deployment**
   * Integrate the final model into a production environment.
   * Users can submit new inputs (e.g., via a URL or UI) and immediately receive model predictions.
9. **Monitoring & Maintenance**
   * Continuously monitor model performance on real-world data (accuracy, resource usage).
   * If performance degrades (data drift, concept drift), retrain or update the model with new data.
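
The following is a minimal, hypothetical walk-through of life-cycle steps 3–7 (preprocessing, model selection, training, evaluation, and tuning). It assumes scikit-learn, a synthetic dataset from `make_classification`, and a decision tree as the chosen model; none of these specifics come from the notes, and they are only one possible instantiation.

```python
# Illustrative life-cycle sketch; the library, dataset, and model are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data collection: a synthetic stand-in for real, problem-specific data
X, Y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Data preprocessing: scale the numeric features
X = StandardScaler().fit_transform(X)

# Horizontal split into training and testing sets (detailed later in these notes)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Model selection and training: a decision tree as one possible choice
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, Y_train)

# Model evaluation: compare predictions against the held-out ground truth
Y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(Y_test, Y_pred))

# Hyperparameter tuning: search over tree depth, then re-evaluate
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid={"max_depth": [2, 3, 5, 8]}, cv=5)
search.fit(X_train, Y_train)
print("Best params:", search.best_params_,
      "| tuned test accuracy:", accuracy_score(Y_test, search.predict(X_test)))
```

Steps 1–2 (problem understanding and data collection) and 8–9 (deployment and monitoring) have no single code equivalent, which is why the sketch starts from a dataset that is already in hand.
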
---

**Data vs. Information**

* **Data**: Unprocessed facts, figures, statistics (e.g., raw numbers, logs).
* **Information**: Processed, meaningful data that is relevant to the problem at hand.
  * Example: Raw sales records → aggregated, cleaned, and contextualized to show quarterly trends.

---

**Labeled vs. Unlabeled Data**

* **Labeled Data**: Each input (features) is paired with a known output (label).
  * Used in **Supervised Learning**.
  * Notation:
    * **X (Independent Features)**: Inputs (e.g., number of employees, number of projects).
    * **Y (Dependent/Target)**: The value to predict (e.g., annual sales).
* **Unlabeled Data**: Inputs without corresponding outputs.
  * Used in **Unsupervised Learning** (e.g., clustering similar items without predefined labels).

---

**Types of Learning**

1. **Supervised Learning**
   * Requires labeled data (X paired with Y).
   * Tasks:
     * **Regression**: Predict continuous values (e.g., sales revenue).
     * **Classification**: Predict discrete categories (e.g., spam vs. non-spam).
2. **Unsupervised Learning**
   * Uses only unlabeled data (no Y values).
   * The algorithm seeks patterns, such as clusters or groups based on similarity/dissimilarity (see the clustering sketch at the end of these notes).
   * Example: Grouping apples by color (red vs. green) without explicit labels.
   * Real-life metaphor: Students in an empty lecture hall naturally cluster by gender without assigned seats.
3. **Reinforcement Learning**
   * An **agent** takes actions in an **environment**, receiving **rewards** or **penalties** based on its actions.
   * Goal: Maximize cumulative rewards over time.
   * Examples:
     * A baby crying for milk → receives milk (reward) → learns that crying works. Eventually, crying stops working → the baby tries crawling → crawling yields milk → the new behavior is reinforced.
     * Touching a hot candle → gets burned (penalty) → learns not to touch it next time.

---

**Train/Test Split (Supervised Learning Workflow)**

* Divide the available dataset **horizontally** (by rows) into:
  1. **Training Set** (≈ 70–80 % of the data): Used to train the model (`X_train`, `Y_train`).
  2. **Testing Set** (≈ 20–30 % of the data): Completely held out during training; used to evaluate the trained model’s performance (`X_test`, `Y_test`).
* Additionally, think of a **vertical split** (by columns): separating the input feature columns (X) from the target column (Y).
* **Training Phase**: The model sees (`X_train`, `Y_train`) and “learns” the pattern.
* **Testing Phase**: Provide only `X_test` to the trained model, which outputs predictions `Y_pred`.
  * Compare `Y_pred` against the true `Y_test` (“ground truth”) to compute metrics such as accuracy, error rate, or loss.
* A code sketch of both splits follows this list.
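
Below is a minimal, hypothetical sketch of the vertical and horizontal splits using pandas and scikit-learn. The feature columns (`num_employees`, `num_projects`), the target column (`high_sales`), and the 75/25 ratio are illustrative assumptions rather than values taken from the notes.

```python
# Hypothetical illustration of the vertical (columns) and horizontal (rows) splits;
# column names and the 75/25 ratio are made up for the example.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "num_employees": [10, 25, 40, 8, 60, 33, 12, 55],
    "num_projects":  [2, 5, 9, 1, 14, 7, 3, 12],
    "high_sales":    [0, 0, 1, 0, 1, 1, 0, 1],   # target label
})

# Vertical split: feature columns (X) vs. target column (Y)
X = df[["num_employees", "num_projects"]]
Y = df["high_sales"]

# Horizontal split: 75% of the rows for training, 25% held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# Training phase: the model sees X_train together with Y_train
model = LogisticRegression()
model.fit(X_train, Y_train)

# Testing phase: only X_test goes in; predictions come out as Y_pred
Y_pred = model.predict(X_test)

# Compare predictions with the ground truth to get an accuracy score
accuracy = (Y_pred == Y_test).mean()
print("Accuracy on the held-out test set:", accuracy)
```

Computing `(Y_pred == Y_test).mean()` by hand makes the "compare predictions to ground truth" step explicit; the same number comes from `sklearn.metrics.accuracy_score`.
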

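Finally, the clustering sketch referenced in the Types of Learning section: a hypothetical unsupervised example that groups "apples" by two made-up color features using scikit-learn's `KMeans`. The notes describe clustering only conceptually, so the data and the choice of algorithm are assumptions.

```python
# Hypothetical unsupervised example: grouping "apples" by color features
# without any labels; the feature values and the use of KMeans are assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: one row per apple, columns = (redness, greenness), no Y column
apples = np.array([
    [0.90, 0.10], [0.85, 0.15], [0.80, 0.20],   # mostly red
    [0.10, 0.90], [0.20, 0.85], [0.15, 0.80],   # mostly green
])

# Ask for two clusters; the algorithm groups rows purely by similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(apples)

print("Cluster assignment per apple:", cluster_ids)
print("Cluster centres (redness, greenness):", kmeans.cluster_centers_)
```

Note that no Y column is ever supplied; the cluster IDs come purely from the similarity of the rows.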