Design

Predicting the market with hundreds of thousands of dollars' worth of software and hardware is, up to a point, a relatively easy task, but predictive analytics is only as good as the algorithm itself. That is the reason to use a neural network and machine learning: the algorithm can adapt over time. The idea behind this is that while the market is mostly made up of institutional investors moving size, it still depends on independent investors, who trade with what we call a herd mentality and are responsible for the majority of market volatility. That volatility is exactly what investment banks and hedge funds, especially those in high-speed trading, need in order to make money. The issue, though, is that you need to predict the changes far enough ahead of when they happen. Given enough computing power or a finely tuned algorithm, someone would be able to predict within 10 seconds of real time.

But this normally requires specially designed software and hardware such as GPUs and massive amounts of RAM. The goal of this project is to develop a neural network that is cost effective, meaning it is neither processor intensive nor dependent on special software or hardware to run. Such an algorithm should be able to predict accurately within 5% of the strike price for a stock. It would also need to take in massive amounts of data and be able to digest it. That is where the NLP comes in.

By developing an updated lexicon the computer can use to understand news articles and different news feeds, the algorithm would save this information into a centralized database that all nodes of the neural network can access. The distributed nodes of the network would then be responsible for processing parts of the data, such as Twitter feeds and news feeds, while a central computer does the main processing, compiling everything that the nodes produced.
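As a toy sketch of that split, the Python below uses a process pool to stand in for the distributed nodes, with the main process acting as the central compiler; the function names and the in-memory hand-off are illustrative only, since the real design would use separate machines and a central database rather than local processes.

```python
from multiprocessing import Pool


def process_chunk(chunk):
    """Stand-in for one node's work, e.g. cleaning a batch of tweets or headlines."""
    return [item.lower().strip() for item in chunk]


def compile_results(chunks):
    """The 'central computer': farm chunks out to the nodes and combine the results."""
    with Pool() as pool:
        processed = pool.map(process_chunk, chunks)
    return [item for batch in processed for item in batch]


if __name__ == "__main__":
    feeds = [["Market Rallies ", " Fed Holds Rates"], ["Earnings Beat Estimates"]]
    print(compile_results(feeds))
```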

By creating a network like this we could squeeze more efficiency out of the market by better accounting for volatility, and it would allow hedge funds to turn over large profits on small swings in stock price. These considerations led to the system design shown in Figure 1 below, a block diagram overview of the system.

Figure 1: Block diagram of the system

As the block diagram shows, the implementation was designed to be as simple and straightforward as possible so that the system stays cost effective, meaning it is neither processor intensive nor dependent on special software or hardware to run, while still integrating with existing industry standards and programs such as Excel. This led to a system composed of three basic parts: the Information System, the Prediction Engine, and the graphing output.

The first part of the system, the Information System, is composed of three subsystems: the Financial Data feed, the Twitter feed, and the News Wire feed. These subsystems are combined into a single system whose output is a CSV file exported from Excel. This allows the data to be imported into different database systems and to be viewed easily without any specialized software.
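A minimal sketch of that combination step is shown below, assuming pandas DataFrames for the three feeds and a shared timestamp column; both the DataFrame names and the `timestamp` key are assumptions rather than the actual schema.

```python
import pandas as pd


def export_information_system(market_df, tweets_df, news_df, path="combined_feed.csv"):
    """Join the three feeds on a shared timestamp and write a single CSV file."""
    combined = (
        market_df
        .merge(tweets_df, on="timestamp", how="left")
        .merge(news_df, on="timestamp", how="left")
    )
    combined.to_csv(path, index=False)
    return combined
```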

The Financial Data Feed subsystem pulls live market data from FactSet, a financial data service provider; using their API, data for the requested days is downloaded into the DataFrame in 1-minute increments for weekdays. The Twitter feed subsystem pulls tweets directly from Twitter's API and feeds them into a Pandas DataFrame. These tweets are then cleaned, analyzed, sorted, and stored in the CSV output file. The News Data subsystem takes the RSS feeds from the top 30 news providers and stores them in a Pandas DataFrame. From there the headlines are cleaned and sorted, and a list of the top words found in the news headlines is generated. This list of top words is then used to filter tweets for relevance to current events. This first part of the system, including its subsystems, is hosted in Google Colab, a web-based virtual machine for machine learning and data analytics.
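The news-headline side of this pipeline could look roughly like the sketch below, which uses the feedparser library for the RSS pulls. The feed URLs, stop-word list, and the `text` column on the tweets DataFrame are placeholders rather than the actual configuration, and the FactSet and Twitter pulls are omitted since they require account credentials.

```python
import re
from collections import Counter

import feedparser
import pandas as pd

# Placeholder feeds; the real system pulls from the top 30 news providers.
RSS_FEEDS = [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://rss.cnn.com/rss/cnn_topstories.rss",
]

STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "for", "on", "is", "at"}


def fetch_headlines(feeds):
    """Parse each RSS feed and collect its headlines into a DataFrame."""
    rows = []
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            rows.append({"source": url, "headline": entry.get("title", "")})
    return pd.DataFrame(rows)


def top_words(headlines, n=25):
    """Return the n most common non-stop-words across all headlines."""
    counts = Counter()
    for headline in headlines:
        words = re.sub(r"[^a-z\s]", " ", headline.lower()).split()
        counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]


def filter_tweets(tweets_df, keywords):
    """Keep only tweets whose text mentions at least one top news word."""
    pattern = "|".join(keywords)
    return tweets_df[tweets_df["text"].str.lower().str.contains(pattern, na=False)]


if __name__ == "__main__":
    news_df = fetch_headlines(RSS_FEEDS)
    keywords = top_words(news_df["headline"])

    # The tweets DataFrame would normally come from the Twitter subsystem.
    tweets_df = pd.DataFrame({"text": ["Markets rally after rate decision", "what a day"]})
    filter_tweets(tweets_df, keywords).to_csv("filtered_tweets.csv", index=False)
```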

The second part of the system is the Predictive Engine, which is still under development. The Predictive Engine is where the processing of the Twitter data and market data takes place; data analytics is performed using Python and R, and the results are output to the third subsystem, the graphing engine. During the winter term I plan on investigating and trying out different machine learning systems and platforms. The ingenuity behind this system design is that we can swap out parts of the system without affecting overall performance or the final outputs.
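Because the engine is still under development, the following is only a placeholder illustration of its interface: read the combined CSV from the Information System, fit some model (a plain linear regression here, chosen arbitrarily), and write predictions back out as a CSV for the graphing system. The column names and file paths are assumptions, not the planned design.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def run_prediction_engine(in_path="combined_feed.csv", out_path="predictions.csv"):
    """Placeholder engine: predict the close from simple lagged features."""
    data = pd.read_csv(in_path)

    # Assumed columns: "close" (1-minute price) and "tweet_count" per minute.
    data["prev_close"] = data["close"].shift(1)
    data = data.dropna(subset=["prev_close", "tweet_count", "close"])

    X = data[["prev_close", "tweet_count"]]
    y = data["close"]

    model = LinearRegression().fit(X, y)
    data["predicted_close"] = model.predict(X)

    # Keep the same CSV format going out so the graphing system can load it unchanged.
    data.to_csv(out_path, index=False)
    return data
```

Keeping this interface fixed (CSV in, CSV out) is what allows the model itself to be swapped during the winter-term experiments without touching the other two parts of the system.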

The third and final part of this system is the graphing system, which is integral to the overall performance and usability of the system. A large amount of data is generated during the first two parts of the system, so viewing that data in raw form would not be useful to the user. To address this, a program called Tableau is used to load the CSV files from the Predictive Engine. Tableau then allows us to filter the data and visualize it in graphs and other forms so that it can be better understood and decisions can be made quickly and efficiently, since visuals are the best way of displaying time series data.