Sensors allow you to collect huge amounts of data, but that data is useless unless you use it to figure out what went wrong retroactively, or to predict an impending failure or malfunction. In this article, I will walk through a very simple example of using Microsoft’s Azure Machine Learning Studio to make those kinds of predictions. Bear in mind that I am not a trained data scientist; I know just enough to be dangerous.
So at this point you have saved your IoT event stream data in some kind of persistent store. In my case, I saved it to Azure Blob storage since it’s much cheaper than sending it to a document (NoSQL) database. This saves the event data as a JSON file that you can browse either through the Azure Portal or with Azure Storage Explorer. I simply downloaded the file from the Azure Portal.
You will discover, however, that Azure Machine Learning Studio does not accept JSON as a valid data input. I took the path of least resistance and simply converted the JSON-formatted file to a .csv using an online tool. The CSV file can then be easily added to Azure ML Studio as a new dataset. Once it has been added as a dataset, simply drag and drop it as the first node in your experiment. The next step is to perform a bit of preprocessing to clean up the data. In this step, I used the ‘Select Columns in Dataset’ node (found under Data Manipulation) to pick the columns I wanted; in my case: temperature, humidity, device, location, and moisturedetected. Next, I dragged a ‘Metadata Editor’ node into the flow to mark ‘moisturedetected’ as categorical (not a numeric value, even though it is represented as 0 or 100) and as a label (i.e. the value we are trying to predict). Now that the data is cleaned up, we can go ahead and start building our model.
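If you would rather not trust an online converter with your data, the same conversion is easy to script. Here is a minimal sketch; the field names are the ones used in this article, and it assumes the downloaded blob is a top-level JSON array of event objects (adjust if yours is one JSON object per line):

```python
import csv
import json

# Column names from the article; adjust to match your actual event payload.
FIELDS = ["temperature", "humidity", "device", "location", "moisturedetected"]

def json_events_to_csv(json_path, csv_path):
    """Flatten a JSON array of IoT events into a CSV that ML Studio accepts."""
    with open(json_path) as f:
        events = json.load(f)  # assumes a top-level JSON array
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for event in events:
            # Missing keys become empty cells; extra keys are dropped.
            writer.writerow({k: event.get(k) for k in FIELDS})
```

This keeps only the columns you care about, which also saves you a little work in the ‘Select Columns in Dataset’ step later.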
In order to train the model, we need to split the dataset into a training set and a scoring set. The ‘Split Data’ node allows you to specify what percentage you want to allocate to each set. In my case, I allocated 70% (i.e. 0.7) as training data and 30% as scoring data. Now drag a ‘Train Model’ node onto the canvas and connect it to the 70% output of the ‘Split Data’ node. At this point you have to decide what kind of algorithm you are going to use. Choosing an algorithm depends on the answer you are looking for. In my case, I needed a simple yes/no answer to the question ‘Is there likely to be excess moisture or a leak?’ Since this is a simple yes/no classification problem, I looked under the classifier algorithms and chose a two-class decision tree to start with. Drag the node onto the canvas and connect it to your ‘Train Model’ node. After training the model, you want to score it to determine how well it did, so drop a ‘Score Model’ node underneath the ‘Train Model’ node and connect them. At the same time, connect the 30% scoring output of the ‘Split Data’ node to the right connector of ‘Score Model’ so we can unleash our newly trained algorithm on it. Finally, in order to visualize the results in a more human-friendly manner, I dropped an ‘Evaluate Model’ node underneath ‘Score Model.’ This lets you see the results in a visual graph and determine how well the model performs. At this point, the fully fleshed-out experiment looks like this:
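For readers who think better in code than in drag-and-drop nodes, the same split/train/score flow can be sketched locally with scikit-learn. This is an illustrative analogue, not what ML Studio runs internally; the synthetic data and the toy labeling rule are my own stand-ins for the real sensor dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor data: temperature (°F) and humidity (%),
# with a made-up rule producing the binary moisture label.
rng = np.random.default_rng(0)
X = rng.uniform([60, 20], [100, 90], size=(500, 2))
y = ((X[:, 0] > 80) & (X[:, 1] > 60)).astype(int)

# 70/30 split, mirroring the 'Split Data' node.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

# 'Train Model' with a two-class decision tree.
model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)

# 'Score Model': probability of the positive (moisture detected) class
# on the held-out 30%.
scores = model.predict_proba(X_test)[:, 1]
```

The right-hand connector of ‘Score Model’ corresponds to the `X_test` path here: the 30% the model never saw during training.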
You can go ahead and run the experiment now. This submits the job, which might take several minutes to complete. When it’s all done, you can click on the output node of ‘Evaluate Model’ and select ‘Visualize.’ This will show you the ROC curve. A good general guide to the performance of your model is the AUC, or Area Under the Curve. Something in the 0.8 range is generally good. If you get results above 0.9, you are probably overfitting your model, and if you get 0.7 or below, you might want to try another algorithm.
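The AUC that ‘Evaluate Model’ reports is the same quantity you would compute from the scored labels and probabilities yourself. A small sketch with made-up scores (these numbers are illustrative, not from my experiment):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scored results for the 30% holdout: true labels and the
# predicted probability of moisture, as 'Score Model' would emit them.
y_true   = np.array([0,   0,   0,    1,   1,    1,   0,   1,    0,   1])
y_scores = np.array([0.1, 0.6, 0.35, 0.8, 0.65, 0.9, 0.5, 0.45, 0.2, 0.4])

auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")  # prints AUC = 0.84 -- in the "generally good" range
```

AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which is why 0.5 means coin-flip performance and 1.0 means perfect separation.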
You can play around with the configuration parameters of the algorithm, such as the threshold, increasing the number of trees, reducing the depth, etc., to see if you get better results. I also went ahead and tried out all 9 of the two-class classifier algorithms at their default settings to see which one gave the best results. It turns out a two-class decision jungle provided the best results for my dataset. Once I determined the best one, I deleted the other, poorer-performing algorithms and got ready to deploy my model as a web service. This is a very straightforward process: all you have to do is click on the ‘Set Up Web Service’ button, and Azure ML Studio will automatically remove nodes that are unnecessary for deployment, such as ‘Evaluate Model’ and ‘Split Data.’ It also automatically creates a trained model node for you. At this point, I also removed the ‘moisturedetected’ column from the input dataset, since it is what I am trying to predict. I also set the web service input node right before ‘Score Model’ so that the service knows what data fields to accept as input.
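The “try them all, keep the best” loop is easy to express in code. Here is a sketch comparing a few scikit-learn classifiers by AUC on synthetic data; the candidates are my own picks (a random forest stands in as a rough cousin of the decision jungle, which scikit-learn does not implement):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic sensor stand-in as before: temperature, humidity -> moisture.
rng = np.random.default_rng(1)
X = rng.uniform([60, 20], [100, 90], size=(600, 2))
y = ((X[:, 0] > 80) & (X[:, 1] > 60)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(),
}

# Train each candidate at default-ish settings and compare by AUC.
results = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    results[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

best = max(results, key=results.get)
```

Whichever candidate wins here plays the role of the decision jungle in my experiment: it is the one you keep, and the rest get deleted before deployment.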
A click on ‘Deploy Web Service’ was all it took to deploy my model for widespread usage. This provides you with a Swagger-like API definition page that contains the request URI, API key, request headers/body, and response body, as well as sample calls in C#, R, and Python, so that you can call your machine learning web service from external applications.
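Calling the service from your own code boils down to a POST with a JSON body and the API key in an Authorization header. Below is a sketch of how that request is assembled, based on the classic ML Studio request-response format; the URL and key are placeholders you would copy from your own API help page, and the column names match the inputs from this article (everything except the ‘moisturedetected’ label):

```python
import json
import urllib.request

# Placeholders: substitute the request URI and API key from your
# service's API help page.
URL = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0"
API_KEY = "<your-api-key>"

def build_request(temperature, humidity, device, location):
    """Assemble the POST request for the classic ML Studio web service."""
    body = {
        "Inputs": {
            "input1": {
                "ColumnNames": ["temperature", "humidity", "device", "location"],
                "Values": [[str(temperature), str(humidity), device, location]],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,
    }
    return urllib.request.Request(
        URL, data=json.dumps(body).encode("utf-8"), headers=headers)

# To actually call the service (requires a real URL and key):
# req = build_request(88, 69, "sensor-01", "crawlspace")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))  # scored label and probability
```

The response mirrors the request shape, with the scored label and scored probability appended as extra columns in the output values.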
The Test button let me test my newly deployed web service by entering sample input and observing the predicted response.
It looks like when the temperature is 88 °F and the humidity is 69%, the likelihood of a moisture leak is 0.68, i.e. 68% (the last number in the output series), so the model is predicting that it will happen.