Machine Learning Overview
In traditional programming, a developer takes a set of rules and prepares instructions for a computer to follow whenever it receives data.
For example, you could program a computer to send an alert if a gas sensor shows that CO₂ levels are over a certain threshold, but only if the building vents are open and the fans are also running. If CO₂ levels reach this threshold and these conditions don't match, the program can instruct the computer to open the vents and turn the fans on first. In this scenario, each condition represents a simple, unambiguous rule that we can define and program. This logic is the basis of the Rayven Workflow Builder, and it works well for most use cases, even for very complex activities.
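The CO₂ scenario above can be written out as explicit rules. The following is a minimal sketch in Python, not how the Rayven Workflow Builder is implemented; the threshold value, the `BuildingState` type, and the action names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold; the right value depends on your building.
CO2_THRESHOLD_PPM = 1000.0


@dataclass
class BuildingState:
    co2_ppm: float
    vents_open: bool
    fans_running: bool


def decide_actions(state: BuildingState) -> List[str]:
    """Apply the hand-written rules and return the actions they call for."""
    if state.co2_ppm <= CO2_THRESHOLD_PPM:
        return []  # CO2 is not over the threshold: nothing to do
    if state.vents_open and state.fans_running:
        return ["send_alert"]  # ventilation is already on, so alert
    # Otherwise, switch on whatever ventilation is missing first.
    actions = []
    if not state.vents_open:
        actions.append("open_vents")
    if not state.fans_running:
        actions.append("turn_fans_on")
    return actions


print(decide_actions(BuildingState(co2_ppm=1200.0, vents_open=True, fans_running=False)))
```

Every branch here is a rule that someone had to know in advance and write down, which is exactly what becomes hard when the rules are unknown or change often.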
Sometimes, however, rules are not clearly and unambiguously defined. You might want to add, remove, or change them frequently. At times, you might not explicitly understand your rules (for example, you might not know the relationship between wind strength and the distance that dust travels from a quarry). In these cases, building and maintaining a set of rules to prepare a program becomes very difficult.
Instead of presenting the computer with a program and data to obtain answers, we must turn the problem around: we present a computer with data and answers so it can obtain a program.
This reversal is the fundamental idea of Machine Learning (ML). Machine learning is sometimes confused with Artificial Intelligence (AI). Artificial intelligence is a technology that enables machines to simulate human behavior. Machine learning is a subset of AI that defines how machines learn from data without being explicitly programmed.
In our dust example, we would provide the computer information about wind strength, the dust itself, and other variables (data), as well as measurements of dust levels from various distances away from the quarry (answers). The machine learning process would then give us a model (program) for predicting dust dispersion from new data.
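To make "data and answers in, program out" concrete, here is a minimal sketch that fits a straight line by ordinary least squares. It assumes a single input variable (wind speed) and a single answer (distance the dust travelled), and the numbers are synthetic values made up for illustration, not real measurements. A real dust model would use more variables and likely a different technique.

```python
# Synthetic, made-up observations for illustration only.
wind_speed = [2.0, 4.0, 6.0, 8.0, 10.0]               # data: wind strength (m/s)
dust_distance = [110.0, 205.0, 290.0, 410.0, 495.0]   # answers: distance travelled (m)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": the parameters are derived from the data, not written by hand.
slope, intercept = fit_line(wind_speed, dust_distance)


def predict_distance(speed):
    """The learned model (the 'program') applied to new data."""
    return slope * speed + intercept


print(predict_distance(7.0))
```

Nobody wrote a rule relating wind to dust here; `predict_distance` is only as good as the observations it was fitted to.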
This technique is incredibly powerful. However, while machine learning can help to identify relationships between data and answers (correlation) for the range of data provided, it can't explain why this relationship exists or what is happening (causation). There are a few points we must consider before embarking on a machine learning project:
- When to use machine learning: Machine learning is most helpful when we are attempting to uncover relationships and patterns in data. However, we must first have collected the data, visualized it, and uncovered some hypotheses we want to explore.
- What data is required: Machine learning needs broad and deep data. Breadth refers to the choice of variables and the range they each cover. You might need to evaluate many variables to find the critical few that influence outcomes, and you will need to understand the behavior of those variables over a meaningful range. Depth refers to the amount of data you need - machine learning requires data both for training candidate models and for testing their quality. In each case, you need to be sure that your data spans a time frame long enough to capture seasonality. Remember that you must train your machine with answers, so ensure each set of inputs has a corresponding output with a meaningful label.
- Data analysis and modelling: Machine learning generally needs a human teacher to prepare data, decide what approaches to take, and determine which model outcome is best. This process of discovery is iterative. It begins with a hypothesis and may result in intermediate solutions that provide useful results as you work towards your ultimate model.
- Content expertise: Often, the relationships we want to understand are related to real-world events like the failure of a piece of equipment. Therefore, we must ensure the model makes sense in the real world. For example, shark sightings and ice cream sales both rise in summer, but this correlation doesn't mean sharks cause increased ice cream sales.
- Prediction: Machine learning models describe relationships in the data and answers we provide to the computer. What's more, they only describe what happens within the ranges of those observations. As a result, these models often don't fare well if we ask what would happen outside the training ranges. Imagine we gave our computer observations of water made between 5° C and 15° C and at standard pressure, then asked the model to predict what would happen at 100° C. Since our training data set has never experienced boiling (and without the benefit of a physicist providing content expertise), our computer is unlikely to predict the phenomenon.
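The water example in the last point can be sketched in a few lines of Python. This is a deliberately simplified illustration: it assumes 1 kg of water heated from 0 °C at standard pressure, uses an approximate specific heat, and generates its "observations" from that simplified physics rather than from real measurements. The model is a straight line fitted only to data between 5 °C and 15 °C.

```python
SPECIFIC_HEAT = 4.18  # kJ per kg per °C, approximate value for liquid water


def true_temperature(energy_kj):
    """Simplified behaviour of 1 kg of water heated from 0 °C.

    Only valid while some liquid water remains.
    """
    # Past 100 °C the extra energy goes into boiling, so the temperature stalls.
    return min(energy_kj / SPECIFIC_HEAT, 100.0)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Training observations cover only 5 °C to 15 °C.
train_energy = [SPECIFIC_HEAT * t for t in range(5, 16)]
train_temp = [true_temperature(e) for e in train_energy]
slope, intercept = fit_line(train_energy, train_temp)


def predicted_temperature(energy_kj):
    return slope * energy_kj + intercept


inside = 10 * SPECIFIC_HEAT  # within the training range
outside = 1000.0             # far beyond anything the model has seen

print("inside: ", predicted_temperature(inside), true_temperature(inside))
print("outside:", predicted_temperature(outside), true_temperature(outside))
```

Inside the training range the prediction matches closely. Outside it, the line keeps climbing well past 100 °C while the water simply boils, because nothing in the training data hinted at that behavior.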
How to develop a successful machine learning model
The Rayven platform provides an integrated workflow to create and successfully deploy machine learning models directly into your solution. You can begin with a ready-to-go model, or you can analyze your data and build a model yourself in the machine learning workbench. Alternatively, you can load a pre-existing model from elsewhere.
Whichever approach you choose, success with machine learning comes from a structured method of problem-solving:
- Hypothesis formulation: Begin with a proposed explanation based on your existing evidence. This theory will be the starting point for further investigation.
- Data collection: Consider what data may be required to explore your hypothesis. Remember that machine learning needs broad and deep data, so gather data for many variables, across wide and relevant ranges, for representative time frames. If in doubt, include the data now - the modelling process will remove irrelevant and weak variables.
- Data cleansing: Assess the quality of your data. Is the data for each variable complete, or are there strange outliers or erroneous results? Decide on a policy for dealing with these. Remember that outliers may contain valuable insights.
- Label and transform data: You will need to label the data representing answers for the machine learning process. Some of your data might also need normalizing or transforming (for example, from categorical data into numerical data).
- Identify models: Find machine learning models that will help solve the type of problem you have (for example, classification or regression). Ensure they are appropriate and fit with the real-world problem you want to solve. You might want to try multiple approaches to determine which one works best.
- Train models: Provide data to the models so they can learn the patterns and relationships in the data. Carry out feature selection and identify and manipulate hyperparameters to achieve the best performance.
- Evaluate and select a model: Evaluate the quality of each model by comparing its predictions against actual outcomes. Determine the overall best performer based on factors such as predictive ability, speed, and ease of use.
- Model deployment: Deploy your model to use live data and visualize the results in a dashboard.
- Monitor and reassess: Continually review the model's performance and reassess as data and circumstances change. Implement changes to the model or evaluate new models as technology advances.
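The middle steps of this method (cleansing, splitting data for training and testing, training candidate models, and selecting the best one) can be sketched as follows. This is a generic plain-Python illustration, not the Rayven machine learning workbench. The records are synthetic, the cleansing policy (drop incomplete records) is just one possible choice, and the two candidate models are deliberately simple.

```python
# Synthetic records for illustration only: (input variable, labelled answer).
raw = [(float(i), 3.0 * i + 10.0 + ((i * 7) % 5 - 2)) for i in range(20)]
raw[4] = (4.0, None)    # a missing answer
raw[11] = (None, 40.0)  # a missing input

# Data cleansing: the policy assumed here is to drop incomplete records.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Hold back every fourth record so models are tested on data they never saw.
train = [r for i, r in enumerate(clean) if i % 4 != 0]
test = [r for i, r in enumerate(clean) if i % 4 == 0]


def train_mean_model(records):
    """Baseline candidate: always predict the average answer."""
    mean_y = sum(y for _, y in records) / len(records)
    return lambda x: mean_y


def train_linear_model(records):
    """Second candidate: least squares line through the training records."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_absolute_error(model, records):
    """Evaluate by comparing predictions against actual outcomes."""
    return sum(abs(model(x) - y) for x, y in records) / len(records)


candidates = {"mean_baseline": train_mean_model, "linear": train_linear_model}
scores = {
    name: mean_absolute_error(trainer(train), test)
    for name, trainer in candidates.items()
}
best_name = min(scores, key=scores.get)  # lowest test error wins

print(scores)
print("selected:", best_name)
```

Scoring on held-back test records rather than the training records is what makes the comparison meaningful: a model that merely memorized its training data would look perfect on that data and still predict poorly on anything new.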
Here is an example of how this process might occur: