After you define the business use case and establish the success criteria, the process of delivering an ML model to production typically involves several steps, which can be completed manually or by an automated pipeline. The first three steps deal with data: extraction, analysis, and preparation.
Data must first be ingested, which means it's extracted from a raw data source. With data extraction, you retrieve the data from various sources. Those sources can be streaming (real time) or batch. For example, you might extract data from a customer relationship management system, or CRM, to analyze customer behavior. This data might be structured, such as a file in CSV, TXT, JSON, or XML format. Or you might have an unstructured data source with images of your products or text comments from chat sessions with your customers. You might have to extract streaming data from your company's transportation vehicles that are equipped with sensors transmitting data in real time. If the data you want to train your model on or get predictions for is structured, you might retrieve it from a data warehouse, such as BigQuery, or you can use Apache Beam's IO module. In this Dataflow example, we're loading data from BigQuery, calling predict on every record, and then writing the results back into BigQuery.
In data analysis, you analyze the data you've extracted. For example, you can use exploratory data analysis, or EDA. This involves using graphics and basic sample statistics to explore your data, such as looking for outliers or anomalies, trends, and data distributions. This step helps you identify the features that can increase the predictive power of your machine learning model.
How changes in the distribution of your data could affect your model might not be apparent, so let's consider a scenario. In this scenario, an upstream data source encodes a categorical feature using a number, such as a product number.
One day, the product numbering convention changes, and now the customer uses a totally different mapping with some old numbers and some new numbers. How would you know that this had happened? How would you debug your ML model? The output of your model would tell you whether there's a drop in performance, but it won't tell you why. The raw inputs themselves would appear valid because you're still getting numbers. To recognize this change, you would need to look at changes in the distribution of your inputs. In doing so, you might find that earlier, the most commonly occurring value was four. In the new distribution, four might never occur, and the most commonly occurring value might be ten.
Depending on how you implemented your feature columns, these new values might be mapped to one component of a one-hot encoded vector or to many components. If, for example, you used a categorical column with a hash bucket, the new values would be distributed according to the hash function, so one hash bucket might now get more and different values than before. If you used a vocabulary, the new values would map to out-of-vocabulary (OOV) buckets. What's important is that for a given tensor, its relationship to the label before the change and its relationship to the label now are probably very different.
So after you've extracted and analyzed your data, the next step in the process is data preparation. Data preparation includes data transformation and feature engineering. Data transformation is the process of changing or converting the format, structure, or values of the data you've extracted into another format or structure. Most ML models require categorical data to be in a numerical format. Some models work with either numerical or categorical features, while others can handle mixed-type features.
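The product-number scenario described earlier can be made concrete with a small sketch. This is only an illustration in plain Python, not how any particular framework implements feature columns: the product numbers are made up, and the helper names (`value_distribution`, `total_variation_distance`, `vocab_index`, `hash_bucket`) are hypothetical. It compares the input distribution before and after the change, then shows where the new values land under a vocabulary with an OOV bucket and under a hash bucket.

```python
from collections import Counter
import hashlib


def value_distribution(values):
    """Return each distinct value's share of the total."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def stable_hash(value):
    """Deterministic hash (Python's built-in hash() is salted for strings)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16)


def vocab_index(value, vocab, num_oov_buckets=1):
    """Known values get their vocabulary index; unseen values go to an OOV bucket."""
    if value in vocab:
        return vocab[value]
    return len(vocab) + stable_hash(value) % num_oov_buckets


def hash_bucket(value, num_buckets):
    """Every value, old or new, is spread across buckets by the hash function."""
    return stable_hash(value) % num_buckets


# Made-up product numbers before and after the upstream change.
old_values = [4, 4, 4, 7, 7, 2, 4, 9]
new_values = [10, 10, 10, 12, 7, 10, 15, 12]

old_mode = Counter(old_values).most_common(1)[0][0]
new_mode = Counter(new_values).most_common(1)[0][0]
drift = total_variation_distance(
    value_distribution(old_values), value_distribution(new_values)
)

# Vocabulary built from the values seen at training time.
vocab = {v: i for i, v in enumerate(sorted(set(old_values)))}
oov_share = sum(v not in vocab for v in new_values) / len(new_values)

print(f"most common value: {old_mode} -> {new_mode}")
print(f"total variation distance: {drift:.3f}")
print(f"share of new inputs falling into OOV: {oov_share:.3f}")
print("hash buckets for new values:", [hash_bucket(v, 5) for v in new_values])
```

Every input is still a valid number, but the large distance between the two distributions and the high share of out-of-vocabulary inputs both signal that the feature no longer means what it meant at training time.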
As an example of data transformation, here are three types of preprocessing for dates using SQL in BigQuery ML: extracting the parts of the date into different columns (year, month, day, and so on); extracting the time period between the current date and the date columns in terms of years, months, days, and so on; and extracting specific features from the date, such as the name of the weekday, whether it is a weekend, or whether it is a holiday.
Now, here is an example of the day-of-week and hour-of-day features extracted using SQL and visualized as a table in Data Studio.
Please note that for all non-numeric columns other than timestamp, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
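The three kinds of date preprocessing and the one-hot transformation described above can be sketched in plain Python. This is an illustration of the ideas only, not BigQuery ML's implementation or its SQL syntax; the function names (`date_features`, `one_hot`), the sample dates, and the fixed "current date" are all assumptions made for the example. The holiday flag is left out because it would need a holiday calendar.

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def date_features(d, today):
    """Derive the three kinds of date features for a single date value."""
    return {
        # 1. Parts of the date as separate columns.
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # 2. Time elapsed between the current date and the column value.
        "days_elapsed": (today - d).days,
        # 3. Specific features derived from the date.
        "weekday_name": WEEKDAY_NAMES[d.weekday()],  # weekday(): Monday == 0
        "is_weekend": d.weekday() >= 5,
    }


def one_hot(values):
    """One-hot encode a column: one 0/1 feature per unique value."""
    vocabulary = sorted(set(values))
    rows = [[1 if v == term else 0 for term in vocabulary] for v in values]
    return vocabulary, rows


# Made-up order dates and a fixed "current date" so the output is repeatable.
today = date(2024, 3, 20)
order_dates = [date(2024, 3, 16), date(2024, 3, 18), date(2024, 3, 16)]

features = [date_features(d, today) for d in order_dates]
vocabulary, encoded = one_hot([f["weekday_name"] for f in features])

for row in features:
    print(row)
print(vocabulary)
print(encoded)
```

The non-numeric `weekday_name` column is the one that needs encoding: each unique weekday becomes its own 0/1 feature, which is the same effect the one-hot transformation has on a non-numeric column.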
