Basic Machine Learning for Time Series Data

A few people have contacted me about machine learning in a time series data set. These short of datasets require a little bit extra in terms of data processing, as you are trying to predict the outcome of a future data point, this means you have to obtain that data point, and classify it. For this brief example, we will look at stock market data: all_stocks_5yr.csv for the last 5 years. We have to note though, time series data is hard to predict, and there’s a lot of reasons why a stock will go up or down, not just it’s previous data points. This is merely used as an example, it’s not an endorsement that a simple application of a machine learning algorithm should be used to predict stocks. Again, this is mainly about how to process time series data for machine learning. First of all we import the following modules:

Screen Shot 2018-03-31 at 01.13.11.png

We then read the data, select a stock that we want to analyze, and plot it to get a feel for it. Note that I make a new data frame called split as opposed to writing over the original data frame:

Screen Shot 2018-03-31 at 01.40.41.pngNow with time series we usually consider rates of change. This is the difference in x and y between two different points. However, between all data points in a column, the x difference will be the same, so we will just focus on the change of y. Now I like converting the columns into lists and looping through as vectorization becomes tricky when you’re taking into account data points before and after the data point. Because of this, I like to make the columns first:

Screen Shot 2018-03-31 at 01.57.38.png

diff just means difference. Two and three means two of three data points back, and bin is short for binary meaning one for an increase over time, and zero of a decrease over time. This is where we have to define our machine learning question and tool for the time being. Predicting everything here is just too much, for this approach we will see if we can predict if the closing price tomorrow will be higher or lower than the opening price today. A simplistic binary outcome can be done using logistic regression. Now we can calculate our outcome and rates of change by the following loops:

Screen Shot 2018-03-31 at 01.51.25.png Notice that I had to fill the first four with none values. This is because rates of change cannot be calculated with no previous data points. We will use the dropna to get rid of these later. We now add the values to the split data frame:

Screen Shot 2018-03-31 at 02.12.58.png

We now drop all of the fields that cannot be utilized by logistic regression:

Screen Shot 2018-03-31 at 02.13.55.png

Now let’s look to see if there’s anything else that we can pull from the data. We can plot all the variables against each other using seaborn:

Screen Shot 2018-03-31 at 02.17.37.png
it’s a lot bigger with more variables but for the sake of the blog I have cropped it

You can see that there’s clustering. If negative and positive points cluster together, this could be another form of classification that could be fed into the logistic regression algorithm. If a positive point falls into the parameters of the positive cluster it could be assigned a value one in a cluster column. We can chart it by the following:

Screen Shot 2018-03-31 at 02.24.58.png

Giving the following plot:

Screen Shot 2018-03-31 at 02.25.45.png

As you can see, there’s a lot of cross-over. I went through all the clusters and nothing. This is usually the case. Clustering is a machine learning project in itself and it’s a bit of a golden gift if you stumble across clean clustering. We create a set of outcomes (y), and a set of inputs (x). We then split it into test/train data, fit the logistic regression model and test it:

Screen Shot 2018-03-31 at 02.32.39.png

….. ok so it’s better than flipping a coin. We didn’t plot a training curve or cross validate, and the number of data points is low. Considering this, I ran it a few times and the results varied a lot, which isn’t a good sign, but this post is focusing on time series. Overfitting and learning curves is a different subject for another post. But we might as well apply it to see where the errors are happening. We get the logistic coefficients by the following:

Screen Shot 2018-03-31 at 02.39.28.png

We do this defining the logistic function and apply it to the data frame:

Screen Shot 2018-03-31 at 02.40.10.png

Now that we’ve calculated the logistic function for each data point, we can compare it with the actual rate of change to see where it fails:

Screen Shot 2018-03-31 at 02.43.42.png

We can now plot the failure points in the time series by looping through. If it’s positive it gets appended to one list and if it’s negative gets appended to another list:

Screen Shot 2018-03-31 at 02.46.17.png

As you can see there’s no outright failure at a particular point. And there you have it! This is how you process time series data. Of course, there’s more you can do, but now you have the basics of time series analysis, you can keep going. Ideally the outcome of this logistic regression would be a variable in a bigger machine learning algorithm that would take into account factors such as new streams etc.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s