Machine learning has gained a lot of attention over the past years. My colleague Ankit explained what machine learning is in this post. To summarize: machine learning is the process by which a computer can learn from past data and apply those learnings to make predictions about similar data in the future. It might seem like machine learning is hard, but we can easily build a machine learning model from a simple Excel spreadsheet, with no code!
Predicting house prices
Let’s say you are a realtor, and you have been keeping track of all real estate deals in your city. Specifically, let’s say that you have been meticulously recording every single data point (e.g., number of bedrooms, number of bathrooms, type of construction, size of garage etc., along with the selling price and the transaction date) you could get your hands on about every single house that sold.
Having gathered enough data, you would now like to make your data work for you, to help you predict the sale price of the next house on the market. You have noted the properties of the house, such as the living area, the year built, whether it has a garage or not, and more. After sales, you also have denoted the final sales price of the house.
We now have all the requirements for training a model:
1) Inputs - living area, the year built, has a garage, etc. - all columns but one in our Excel sheet
2) Output - final sale price - one column in our Excel sheet
3) Learning examples - the rows in our Excel sheet contain examples of which output the model should compute, given the other input columns.
Let's create a Machine Learning model that learns from this historical data, and can predict the most probable sale price of a new property, without any coding!
Intersect Labs is a Y-Combinator backed startup that does exactly what we need.
Under the hood, they automatically convert your Excel sheet into a computer-readable format, and apply a large variety of the most advanced machine learning algorithms on it. The software automatically trains a number of models to predict your desired outcome. Once trained, the most accurate model is identified, and you are given an overview of the accuracy and a list of the most important columns from your dataset (e.g. Living Area size). You can even see how a change of one of the inputs impacts the output.
For our example, we want to see if a renovation would impact the sales price significantly. More about that below.
Training the Model
Follow these steps to build your Machine Learning in Intersect Labs:
1) Create a new project and upload your Excel sheet in CSV format.
2) Select which columns you want to use as inputs.
We select all of them.
3) Select which column you want to predict.
We go for the Sale price, but it would be interesting to see how other columns would do!
4) Sit back and get a coffee while their advanced algorithms crunch the math and build the right machine learning model. They clean your spreadsheet too, saving you from the painstaking task of detecting and removing wrong or missing values.
4) Check out the results and the data insights
After the models have completed training, we get our result page.
We get the accuracy of the model by the Error Margin. For us, this means that on average, the sales price from this model is off from the actual values by 5.44%.
We can apply this model to make new predictions on our properties that we haven't sold yet and would like to get the most likely sales price for. We can choose to either upload the data through their API (handy if you want to apply the predictions in a framework) or we simply upload another CSV spreadsheet. As you could probably tell - this time this CSV does not need the outputs, as that is what it is going to predict, but only inputs.
We can also see the most predictive columns of the dataset. The quality of the pool is apparently the most important variable to the sales price, an interesting insight for anyone wanted to sell their house!
Simulate different outcomes
Their Outcome Simulator is a very cool feature, in which you can see how changing the input values affects the output. We select the General Living Area on the x-axis and the Sale price on the y-axis.
We now see how the Sale Price will change with the size of the General Living Area, given that the other columns are fixed. We can see how the value increases (duh!) but it reaches a plateau (interesting!). By changing the fixed column values to whatever values we like, we can create our virtual property however we like it, and get the most likely sale price out.
We could even derive our business decisions from this: Would it be worthwhile to do a renovation for example? Change the column values to how that property would then look like, and the model will compute the most likely sale price. With this cost-benefit calculation, we can compute if and when a renovation is worthwhile.
Let's run some more predictions on houses of which we know the final sale price, to see how accurate the model actually is, by creating what is called a test set. I put a part of the rows in the spreadsheet apart in a separate file and moved the sales price column to another spreadsheet as well for later validation. Make sure this part was not included in training dataset, as you don't want the model to have seen it.
Let's run the test set through the model and compare the predicted sale price to the actual sale price to see how the model performs with our own eyes. Make sure your CSV file contains all the columns you selected for training, and does not have the output column.
After uploading, click on 'Download Transformed Data', which downloads the CSV with the predictions as an added column. I added the actual sale price and looked at the comparison.
We can see that the model is really accurate most of the time, with an average error of about 4-5%. It miscalculates one sample. This is unfortunate but unavoidable for these kinds of datasets. As long as the dataset originates from human decision-making (e.g bidding), these things happen. One time the actual sales price was far higher than the predicted one. Perhaps the neighborhood is booming or there was a bidding war between two parties. The dataset doesn't give the answers in this case. However, I would find this a very handy tool that would help me in my decision-making if I was an estate agent.
A model that predicts a final sales price of $130599, off by just $6599 from the actual selling price? Yes please.
Try Intersect Labs
Intersect Labs makes it incredibly easy to build a machine learning model from your spreadsheets. No prior coding experience or mathematical background necessary! Book a call today with one of our expert machine learning engineers and see how a few minutes is all it would take for you to be on your way to adding predictive modeling to your business.