After you define the business use case and establish the success criteria, the process of delivering an ML model to production typically involves several steps, which can be completed manually or by an automated pipeline.
The first three steps deal with data. First, data must be ingested, which means it's extracted from a raw data source. With data extraction, you retrieve the data from various sources. Those sources can be streaming, in real time, or batch. For example, you might extract data from a customer relationship management system, or CRM, to analyze customer behavior. This data might be structured, in a file format such as CSV, TXT, JSON, or XML. Or you might have an unstructured data source with images of your products or text comments from chat sessions with your customers. You might have to extract streaming data from your company's transportation vehicles that are equipped with sensors transmitting data in real time.
If the data you want to train your model on or get predictions for is structured, you might retrieve it from a data warehouse, such as BigQuery, or you can use Apache Beam's IO module. In this Dataflow example, we're loading data from BigQuery, calling predict on every record, and then writing the results back into BigQuery.
In data analysis, you analyze the data you've extracted. For example, you can use exploratory data analysis, or EDA. This involves using graphics and basic sample statistics to explore your data, such as looking for outliers or anomalies, trends, and data distributions. This step helps you identify the features that can increase the predictive power of your machine learning model.
The way changes in the distribution of your data could affect your model might not be apparent, so let's consider a scenario. In this scenario, an upstream data source encodes a categorical feature using a number, such as a product number.
One day, the product numbering convention changes, and now the customer uses a totally different mapping, with some old numbers and some new numbers. How would you know that this had happened? How would you debug your ML model? The output of your model would tell you whether there's a drop in performance, but it won't tell you why. The raw inputs themselves would appear valid because you're still getting numbers.
To recognize this change, you would need to look at changes in the distribution of your inputs. In doing so, you might find that earlier, the most commonly occurring value was four. In the new distribution, four might never occur, and the most commonly occurring value might be ten.
Depending on how you implemented your feature columns, these new values might be mapped to one component of a one-hot encoded vector or to many components. If, for example, you used a categorical column with a hash bucket, the new values would be distributed according to the hash function, and so one hash bucket might now get more and different values than before. If you used a vocabulary, the new values would map to out-of-vocabulary (OOV) buckets. But what's important is that for a given tensor, its relationship to the label before and its relationship to the label now are probably very different.
So after you've extracted and analyzed your data, the next step in the process is data preparation. Data preparation includes data transformation and feature engineering. Data transformation is the process of changing or converting the format, structure, or values of the data you've extracted into another format or structure. Most ML models require categorical data to be in a numerical format, but some models work with either numerical or categorical features, while others can handle mixed-type features.
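To make the earlier point about hash buckets and vocabularies concrete, here is a minimal Python sketch of how a new product number might be routed under each scheme. It is an illustration only, not the feature-column implementation of any particular framework: the function names, the use of MD5 as the hash, and the bucket counts are all assumptions.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Assign a category to a bucket with a stable hash.

    MD5 is an arbitrary choice for this sketch; real feature-column
    implementations may use a different hash function.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def vocab_index(value, vocab, num_oov_buckets=1):
    """Return the vocabulary position of a known value.

    Unknown values are sent to one of the out-of-vocabulary (OOV)
    buckets that sit after the last vocabulary entry.
    """
    if value in vocab:
        return vocab.index(value)
    return len(vocab) + hash_bucket(value, num_oov_buckets)


# Hypothetical vocabulary built before the numbering convention changed.
old_vocab = ["1", "2", "3", "4"]

known = vocab_index("4", old_vocab)     # position inside the vocabulary
unseen = vocab_index("10", old_vocab)   # falls into the single OOV bucket
hashed = hash_bucket("10", num_buckets=5)  # some bucket in range(5)
```

With a vocabulary, every unseen product number collapses into the OOV range, so the model can no longer tell those products apart. With hashing, the new numbers still spread over the buckets, but each bucket now holds a different mix of products than it did during training.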
For example, here are three types of preprocessing for dates using SQL in BigQuery ML: extracting the parts of the date into different columns (year, month, day, and so on); extracting the time period between the current date and the date columns in terms of years, months, days, and so on; and extracting specific features from the date (name of the weekday, weekend or not, holiday or not, and so on). Now, here is an example of the day-of-week and hour-of-day queries extracted using SQL and visualized as a table in Data Studio. Please note that for all non-numeric columns other than timestamp, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
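The three kinds of date preprocessing described above can be sketched outside of SQL as well. The following is a plain-Python analogue using only the standard library, not BigQuery ML syntax; the function name, the output keys, and the holiday set are hypothetical and chosen only for illustration.

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def date_features(d, today, holidays=frozenset()):
    """Derive the three kinds of date features from a single date.

    `holidays` is a caller-supplied set of dates (an assumption of this
    sketch; there is no built-in holiday calendar here).
    """
    return {
        # 1. Parts of the date as separate columns.
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # 2. Time period between the current date and the column.
        "days_since": (today - d).days,
        # 3. Specific features derived from the date.
        "weekday_name": WEEKDAY_NAMES[d.weekday()],
        "is_weekend": d.weekday() >= 5,
        "is_holiday": d in holidays,
    }


features = date_features(
    date(2024, 1, 6),
    today=date(2024, 1, 16),
    holidays={date(2024, 1, 1)},
)
```

A categorical output such as `weekday_name` would still need to be converted to numbers before most models could use it.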
The first step in delivering an ML model to production is data ingestion. This involves extracting data from various sources, including customer relationship management systems, real-time sensors, data warehouses like BigQuery, and structured files in formats like CSV or JSON. Data can be streamed in real time or processed in batches, depending on the use case.
Data analysis involves using exploratory data analysis or EDA to examine the extracted data. This includes looking for outliers, anomalies, trends, and data distributions using graphics and basic statistics. This step helps identify features that can increase the predictive power of machine learning models and detect potential issues in data quality.
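As one small example of the "basic statistics" side of EDA, the sketch below flags outliers with the interquartile-range rule. The IQR rule and the multiplier of 1.5 are a common convention picked for this illustration, not something the lecture prescribes, and the sample readings are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sensor readings with one obviously suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]

summary = {
    "mean": statistics.mean(readings),
    "median": statistics.median(readings),
    "outliers": iqr_outliers(readings),
}
```

Comparing the mean with the median is itself a quick check: a large gap between the two suggests a skewed distribution or a few extreme values worth inspecting.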
A critical challenge in ML production is detecting changes in data distribution. For example, if an upstream system changes how it encodes categorical features like product numbers, your model's performance may drop without obvious warning signs. The raw inputs still appear valid, but their relationship to the target variable has changed completely, requiring careful monitoring of input distributions.
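One way to monitor input distributions, sketched below under stated assumptions, is to compare the frequency of each categorical value in a baseline window against a recent window. Total variation distance is used here as the comparison measure purely as an illustrative choice; the function names, the sample data, and the alert threshold are all hypothetical.

```python
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute frequency differences (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def most_common_value(values):
    return Counter(values).most_common(1)[0][0]


# Made-up product numbers before and after the upstream encoding change.
baseline = [4, 4, 4, 2, 7]
current = [10, 10, 10, 2, 12]

drift = total_variation_distance(distribution(baseline), distribution(current))
DRIFT_THRESHOLD = 0.3  # hypothetical; would need tuning on real data
drift_detected = drift > DRIFT_THRESHOLD
```

In this toy data the most common value moves from 4 to 10 even though every input is still a valid-looking number, which is exactly the kind of change that the model's output alone would not explain.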
Data preparation is the next crucial step, involving data transformation and feature engineering. This process converts raw data into a format suitable for machine learning models. Most ML models require categorical data to be in numerical format, so we apply transformations like one-hot encoding for categories, extracting components from dates, and converting text into numerical features.
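One-hot encoding, mentioned above, can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the full list of values is available up front; the function name and the sorted category order are choices made for this example, not the behavior of any specific library.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into one-hot rows.

    Returns the ordered category list and one row per input value,
    with a single 1 in the position of that value's category.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        rows.append(row)
    return categories, rows


# Made-up categorical column.
categories, encoded = one_hot_encode(["red", "blue", "red"])
```

Each unique value becomes its own feature, so a column with many distinct values produces a wide, sparse encoding.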
BigQuery ML lets you preprocess dates with SQL. You can extract date parts like year, month, and day, calculate time periods between dates, and extract specific features like the day of the week or whether a date is a holiday. BigQuery ML automatically performs a one-hot encoding transformation for all non-numeric columns except timestamps, generating a separate feature for each unique value.