Data Pre-Processing


1. Data Normalization

During the initial testing stages, I used the raw open, close, adjusted close, high, low, and volume data. Some of this testing even combined data from multiple stocks in the same sector to see whether this correlated data would improve predictions. The resulting predictions had huge magnitudes of error, as shown in the sample results in Table 4 below. The predicted price changes are wildly inaccurate compared to the true changes in Apple stock prices.

Table 4: Sample Results Using Non-Normalized Inputs

This error was due to a mismatch in the scales of the input values. The volume data was generally several orders of magnitude larger than the stock prices, and when data from multiple stocks was used at once, the varied price ranges introduced further error. To correct this, I added a data normalization step that rescaled all of the inputs onto a scale from 0 to 1. Normalization brought the predicted stock prices into a much more reasonable range (for these improved results, please refer to Section VI).
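As a minimal sketch of this normalization step (the exact implementation is not shown here, and the sample values below are hypothetical), min-max normalization rescales each feature independently onto the range 0 to 1:

    import numpy as np

    def min_max_normalize(column):
        # Rescale a 1-D array of values onto the range [0, 1].
        col_min, col_max = column.min(), column.max()
        return (column - col_min) / (col_max - col_min)

    # Hypothetical inputs: prices in the hundreds, volume in the tens of millions.
    prices = np.array([150.2, 151.8, 149.5, 153.1])
    volume = np.array([9.8e7, 1.2e8, 8.5e7, 1.1e8])

    # After normalization, both features share the same 0-to-1 scale.
    print(min_max_normalize(prices))
    print(min_max_normalize(volume))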

2. Stationary and Non-Stationary Data

While data normalization significantly improved the price predictions, a large degree of error remained. Although some of the predictions were on a reasonable scale, the algorithm too often predicted the wrong direction of price movement. Further research showed the root of the issue was my use of time series data. For an example of the large error in predicted prices caused by using non-stationary data, refer to Table 5 below:

Table 5: Example Error in Predictions due to Utilization of Non-Stationary Data

Time series data often contains trends and seasonality. A trend means that the data's mean varies over time, while seasonality introduces recurring patterns that cause the data's statistical properties to change over time. Such data is classified as non-stationary. Machine learning algorithms can have difficulty making predictions on non-stationary datasets (shown in Fig. 16 below). These frameworks prefer stationary datasets, where the mean and variance are stable over time (shown in Fig. 17 below).
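One standard diagnostic for this distinction (not part of the pipeline described here, but commonly used) is the augmented Dickey-Fuller test from the statsmodels library. A sketch on synthetic data:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(0)
    prices = 100 + np.cumsum(rng.normal(0, 1, 500))  # random walk: non-stationary
    changes = np.diff(prices)                        # daily differences: stationary

    for name, series in [("prices", prices), ("changes", changes)]:
        p_value = adfuller(series)[1]  # second return value is the p-value
        print(name, "ADF p-value =", round(p_value, 4))

    # The raw price series typically fails to reject non-stationarity (large
    # p-value), while the differenced series rejects it (p-value near 0).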

Stationary data is easier to model, so I added another step to my data pre-processing: turning non-stationary data into stationary data. The time dependence and trends underlying the data can be removed with a simple difference transform, which takes the difference between sequential datapoints in the dataset. The transformed, now-stationary dataset thus consists of the daily price changes over time. This transform is also well suited to stock price prediction, since investors care more about changes in stock prices than about their absolute levels: investors want to buy a stock when the price is low and sell after the price increases, making a profit on the price difference.
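The first-order difference transform is simple to express. A minimal sketch with hypothetical closing prices:

    import numpy as np

    def difference(series):
        # First-order difference transform: x[t] - x[t-1] for each day t.
        return series[1:] - series[:-1]

    closes = np.array([150.0, 152.5, 151.0, 154.0])  # hypothetical closing prices
    print(difference(closes))  # [ 2.5 -1.5  3. ]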

Performing a difference transform on the data also prompted me to change the way my algorithm makes predictions. Instead of outputting a series of actual prices, the algorithm outputs an array of predicted price changes. The last known stock price is also stored; this value and the first predicted change are added together to get the projected next-day price. This is done iteratively (i.e., each new change is added onto the previous day's predicted price) for every day for which a price change was predicted. This new method not only decreased the percent error of the predictions, but also improved the algorithm's ability to predict the correct direction a stock price would move. Ultimately, given the nature of stock trading, predicting the correct direction of a price movement is more essential than predicting its exact magnitude.
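A sketch of this reconstruction step, assuming the model emits an array of predicted daily changes (the function and sample values are illustrative, not the actual code used):

    import numpy as np

    def reconstruct_prices(last_known_price, predicted_changes):
        # Iteratively add each predicted change onto the previous day's
        # (predicted) price to recover an absolute price path.
        prices = []
        current = last_known_price
        for change in predicted_changes:
            current += change
            prices.append(current)
        return np.array(prices)

    predicted_changes = np.array([1.2, -0.4, 0.8])  # hypothetical model output
    print(reconstruct_prices(154.0, predicted_changes))  # [155.2 154.8 155.6]

Note that this running sum is equivalent to last_known_price + np.cumsum(predicted_changes).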

The final flow diagram of the complete system, including the data pre-processing steps and the new prediction method, is shown in Fig. 18 below:

Fig. 18: Final Flow Diagram of Predictive Machine Learning Algorithm
