Machine Learning in Trading: Review of Q-Learning

Unpredictability abounds in the trading industry. For years, the edges that traders relied on to predict stock and forex moves have been fading steadily. Machine learning offers a new set of tools. Using Q-learning, we could potentially train an agent to estimate the value of trading actions for the stocks in a portfolio and to manage risk along the way. Compared with many other algorithms, Q-learning is relatively easy to understand, and its logic is intuitive. Although its use in trading is still in early development, Q-learning could have a real impact on trading and finance in the coming years.

What is Q-Learning?

Before we talk about the various aspects of Q-learning, let’s start with the basics. So, what is Q-learning? It is a reinforcement learning algorithm. Imagine a scenario: you are stuck in a maze. There are hundreds of paths to choose from, but you are not sure which one leads to the exit. You could pick one at random, but that may leave you worse off than before. Q-learning is designed for situations like this.

It learns the best action to take in a particular state or situation. In other words, Q-learning looks for a policy, a rule for choosing actions, that earns the maximum total reward. You might be wondering what the Q stands for. The Q means quality: it represents how useful a specific action is in a given state. An action’s usefulness depends on the future reward it leads to. The objective is to maximize the total reward collected over time, with sooner rewards usually counting for more than later ones. The algorithm is popular because it is simple and, given enough exploration, it can find the best of all possible actions in each situation.
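For readers who like notation, the idea of "quality" can be written down in the standard reinforcement learning form. The symbols here are assumptions introduced for illustration and do not appear in the original article: s is a state, a is an action, r is a reward, \pi is a policy, and \gamma is a discount factor that makes later rewards count for less.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right],
\qquad 0 \le \gamma < 1

Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)
```

Q-learning tries to estimate Q^{*}, the value of each action under the best possible policy.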


Characteristics Of Q-Learning Models

Q-learning is a particular type of reinforcement learning, so it shares the characteristics of reinforcement learning. It also has some features of its own. So, let’s dive in.

1. An input and output system

Q-learning models have an input and output system. The input is the agent’s current state, and the output is the action it chooses, which ultimately determines the final result.

2. Rewards

In Q-learning, the environment gives the agent a reward when it performs a specific task. The agent’s only objective is to collect as much reward as possible. The mechanism is simple: to judge whether the agent’s decisions are good or bad, we add up the rewards it earns.

3. Environment

The environment is the situation the agent operates in. It can be a tricky path or a task with lots of possibilities. The agent interacts with the environment and tries to solve the task using different approaches. The most common example of a Q-learning environment is a maze.

4. Finite Number of Possible States

This is a distinct characteristic of the Q-learning model. The agent must always be in one of a fixed number of possible states. It can’t be in a state outside that set.

5. Finite Number of Possible Actions

The same applies to actions. As we’ve mentioned, the agent tries out different actions to maximize its rewards, and it must always choose from a fixed number of possible actions. It can’t select an action outside that set. Think of it like this: you have created a maze and put a robot at the starting line. There are several paths the robot can take to reach the end. But since the goal is simply to go from point A to point B, the robot could in principle do something else, such as breaking through the walls of the maze to reach the end point. Because that is not one of the possible actions, it doesn’t count.
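To make the finite states and finite actions concrete, here is a minimal sketch of a maze environment in Python. The 3x3 layout, the names `GRID`, `STATES`, `ACTIONS`, and `step`, and the rule that a blocked move leaves the agent in place are all assumptions made up for this illustration.

```python
# Hypothetical 3x3 maze: S = start, E = exit, # = wall, . = free cell.
GRID = [
    "S.#",
    ".#.",
    "..E",
]

# Finite action set: the agent may only pick one of these four moves.
ACTIONS = ("up", "down", "left", "right")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Finite state set: every non-wall cell, identified by (row, column).
STATES = [
    (r, c)
    for r, row in enumerate(GRID)
    for c, cell in enumerate(row)
    if cell != "#"
]


def step(state, action):
    """Return the next state. Moves into a wall or off the grid leave the agent in place."""
    if action not in MOVES:
        # "Breaking the maze" is not in the action set, so it is rejected outright.
        raise ValueError(f"unknown action: {action!r}")
    d_row, d_col = MOVES[action]
    row, col = state[0] + d_row, state[1] + d_col
    inside = 0 <= row < len(GRID) and 0 <= col < len(GRID[0])
    if not inside or GRID[row][col] == "#":
        return state
    return (row, col)
```

Whatever the agent does, it stays inside `STATES`, and anything outside `ACTIONS` raises an error instead of being counted.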


Why Is Q-Learning Different?

Q-learning stands out from other reinforcement learning methods. The main reason is its efficiency, but there are other factors too:

1. Off-Policy

Q-learning is off-policy. It doesn’t consider only the actions that the current policy would take. The algorithm also learns from actions outside the policy, such as random exploratory actions.

An on-policy algorithm, by contrast, evaluates the same policy it uses to choose actions. Q-learning also doesn’t have to run a whole trial (episode) before it can update its estimates. Visiting a single state-action pair is enough for one update, which takes less time. Methods that must finish the entire trial before updating are more computationally intensive for each piece of experience.
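The difference shows up in the target each method learns toward after a single step. The sketch below compares the Q-learning target with the target used by SARSA, a common on-policy method that the original article does not name. It is brought in here only for contrast, and the function names, the discount factor, and the sample numbers are assumptions.

```python
GAMMA = 0.9  # assumed discount factor


def q_learning_target(reward, next_q_values, gamma=GAMMA):
    """Off-policy: assume the best next action, whatever the agent actually does next."""
    return reward + gamma * max(next_q_values.values())


def sarsa_target(reward, next_q_values, next_action, gamma=GAMMA):
    """On-policy: use the action the agent really takes next."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical estimates for the next state.
next_q = {"left": 2.0, "right": 5.0}

# Suppose the agent explores and actually goes left next.
off_policy = q_learning_target(1.0, next_q)
on_policy = sarsa_target(1.0, next_q, next_action="left")
```

Even though the agent explored with "left", the Q-learning target still uses the value of "right", the best-looking action. That is what lets it learn about the greedy policy while behaving randomly.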

2. Model Free

Q-learning doesn’t require a model of the environment. It learns by interacting with the environment or the situation at hand. Model-based methods require you to build a model first, which can take extra time and effort. Q-learning doesn’t need to know the environment’s transition probabilities or reward function in advance to work.

3. Trial and Error Based Approach

In Q-learning, the agent uses trial and error. It learns by repeatedly attempting to solve the problem at hand using varied approaches. This happens across many episodes, and the agent keeps updating its policy with what it has learned. It’s an online approach that learns as it goes.


How Does Q-Learning Work?

We can explain Q-learning in several ways. Here we will use a simple example. Imagine a robot that must move from a starting point to an ending point. There are some obstacles along the path, and the agent must reach the ending point as quickly as possible. To get there, it must take many actions, but only one path leads to the endpoint in the shortest time. So, how will the agent figure it out? This is where Q-learning comes in. There are some steps.


Step 1: Creating the Q Table

To find the best action, we use a data structure called the Q-table. It stores the expected reward for each action in each situation, and the Q-learning algorithm is how we learn its values. So, we create a table with rows and columns. There are several possible actions and several possible states. Let’s think of the rows as actions and the columns as states. We start with every value set to 0.
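As a rough sketch, the table described above could be built like this in plain Python, with rows as actions and columns as states. The number of states and the helper name `q_value` are assumptions chosen for illustration.

```python
ACTIONS = ["up", "down", "left", "right"]
N_STATES = 6  # hypothetical number of states in the maze

# One row per action, one column per state, every entry starting at 0.
q_table = [[0.0 for _ in range(N_STATES)] for _ in ACTIONS]


def q_value(action, state):
    """Look up the current estimate for taking `action` in `state`."""
    return q_table[ACTIONS.index(action)][state]
```

Many implementations flip this layout and use rows for states and columns for actions. Either works as long as lookups and updates agree.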


Step 2: Choosing an action

Now the agent can choose from the available actions. The actions can be going left, right, up, or down. The agent will try each action and learn whether it is the most effective one.


Step 3: Performing the action


After choosing an action, the agent performs it. The cycle of choosing and performing actions repeats for an undefined period, until the learning process stops. As we’ve mentioned, the initial value for each action is 0. The values update as the agent performs actions. Initially, the agent explores the environment by taking random actions.


Think of it this way. You get lost in a city you’ve never been to before. All the streets are unknown to you. So, what do you do? You head in random directions and see where they lead. As you become more familiar with the city, you take specific routes that you know lead toward your destination. The same thing happens here. At first, the agent makes arbitrary decisions because it knows nothing about the environment. This is exploration. As it gathers more information, it starts to exploit what it has learned. As a result, its estimates of the Q values become more accurate and reliable.
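A common way to balance random exploration against exploiting what has been learned is an epsilon-greedy rule. The original article does not name a specific rule, so treat this as one possible sketch. The function names and the decay schedule are assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-known one.

    q_values maps each available action to its current Q estimate.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore
    return max(q_values, key=q_values.get)      # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon over time so the agent explores a lot early and less later."""
    return max(end, start * decay ** episode)
```

Early on, `decayed_epsilon` is close to 1, so nearly every move is random, like wandering an unfamiliar city. Later it settles at a small floor, so the agent mostly follows its best estimates but still occasionally tries something new.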


Step 4: Measuring the Reward

Now the fun part comes in. After each action, we measure the reward. The rewards can be anything. Let’s assume that for this example the rewards are as follows.

When the agent moves one step closer to the ending point, it gets 1 point. When it hits a mine, it loses 100 points, and the game is over. When the agent reaches the end, it gets 100 points.
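The point scheme above could be written as a small reward function. This is only a sketch: the function signature is made up, and two details are assumptions the article does not state, namely that a move that brings the agent no closer earns 0 points and that reaching the end also finishes the episode.

```python
STEP_CLOSER_REWARD = 1
MINE_REWARD = -100
GOAL_REWARD = 100


def reward(old_distance, new_distance, hit_mine, reached_goal):
    """Return (points, episode_done) for one move.

    Distances are the number of steps remaining to the ending point.
    """
    if hit_mine:
        return MINE_REWARD, True        # game over
    if reached_goal:
        return GOAL_REWARD, True        # assumed: reaching the end finishes the episode
    if new_distance < old_distance:
        return STEP_CLOSER_REWARD, False
    return 0, False                     # assumed: no points for moves that do not get closer
```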


Step 5: Evaluating the Result

After measuring the rewards and updating the values, we get a Q-table holding the results. We can compare the values for each action in each state. After evaluating them, we can figure out the shortest path by picking the highest-valued action at every step. This process is repeated until the learning process finishes.
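Putting the five steps together, here is a minimal end-to-end sketch on a tiny one-dimensional corridor instead of a full maze. The update inside the loop is the standard Q-learning rule, Q(s, a) += alpha * (reward + gamma * max Q(s', a') - Q(s, a)). The corridor layout, the hyperparameters, the epsilon-greedy exploration, and the cap on steps per episode are all assumptions made for this illustration.

```python
import random

N_POSITIONS = 5                  # corridor cells 0..4 (hypothetical layout)
MINE, START, GOAL = 0, 1, 4      # mine on the far left, exit on the far right
ACTIONS = {"left": -1, "right": +1}
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3   # assumed learning rate, discount, exploration rate


def step(pos, action):
    """Perform the action and measure the reward; returns (next_pos, reward, done)."""
    new_pos = pos + ACTIONS[action]
    if new_pos == MINE:
        return new_pos, -100, True
    if new_pos == GOAL:
        return new_pos, 100, True
    closer = abs(GOAL - new_pos) < abs(GOAL - pos)
    return new_pos, (1 if closer else 0), False


def train(episodes=2000, max_steps=50, seed=0):
    rng = random.Random(seed)
    # Step 1: Q-table with every entry at 0, keyed by (state, action).
    q = {(s, a): 0.0 for s in range(N_POSITIONS) for a in ACTIONS}
    for _ in range(episodes):
        pos = START
        for _ in range(max_steps):
            # Step 2: choose an action (explore sometimes, otherwise exploit).
            if rng.random() < EPSILON:
                action = rng.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: q[(pos, a)])
            # Steps 3 and 4: perform it and measure the reward.
            new_pos, reward, done = step(pos, action)
            # Step 5: move the estimate toward reward + discounted best next value.
            best_next = 0.0 if done else max(q[(new_pos, a)] for a in ACTIONS)
            q[(pos, action)] += ALPHA * (reward + GAMMA * best_next - q[(pos, action)])
            pos = new_pos
            if done:
                break
    return q


q = train()
# Read the learned path off the table: best action in each non-terminal cell.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, GOAL)}
```

After training, `policy` should point "right" in every cell, since that is the shortest route away from the mine and toward the exit.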


Q-Learning in Trading

Q-learning can be applied to a wide range of trading problems. In trading, there is no endpoint, nor is there a known reward function or known transition probabilities. Therefore, a model-free approach like Q-learning is a natural fit. A common application of Q-learning in trading is building an automated trading bot. The process is like the one we have discussed. We must define the states, the actions, and the rewards. Q-learning then tries to choose the best action by learning from past experience.

So, we can implement Q-learning in trading with the following factors: the agent, the environment, the state, and the reward.



Just like in the example we discussed before, there is an agent. In trading, the agent is a trader with access to a brokerage account. The trader’s goal is the same as in the example: to make trading decisions that maximize the rewards.


The environment is the trading market, where actions produce profits and losses. The trader must find the best action.


At first, the trader does not know how the market will respond in any given state. Therefore, it initially makes arbitrary decisions. As it learns, its rewards should increase and its value estimates should become more accurate.


The reward is what the trader tries to maximize, most simply the profit. But maximizing profit alone can lead to complications such as taking on high risk. That’s why we can consider risk-adjusted metrics such as the Sharpe ratio instead.
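As a rough sketch of how these pieces might look in code, the snippet below defines a small action set, squeezes continuous returns into a handful of states so the Q-table stays finite, and computes a Sharpe ratio that could serve as a risk-adjusted reward. Everything here is an assumption for illustration: the three actions, the 1% threshold, the 252 trading periods per year, and the choice to return 0.0 when the ratio is undefined.

```python
import math
import statistics

ACTIONS = ("buy", "hold", "sell")  # assumed action set for the trading agent


def discretize_return(ret, threshold=0.01):
    """Map a continuous return to one of three states so the state space stays finite."""
    if ret > threshold:
        return "up"
    if ret < -threshold:
        return "down"
    return "flat"


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns.

    risk_free_rate is per period. Returns 0.0 when the ratio is undefined
    (fewer than two returns, or no variation in the returns).
    """
    if len(returns) < 2:
        return 0.0
    excess = [r - risk_free_rate for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        return 0.0
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)
```

Two return streams with the same average profit get different scores here: the steadier one earns the higher Sharpe ratio, which nudges the agent away from strategies that only look good because they take large risks.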

Limitations of Q-Learning

While there are plenty of reasons to use Q-learning, there are some limitations too. Researchers are working on ways to reduce these limitations. So, let’s check them out.

1. Delay and Noise

Looking at the Q-learning process, you may notice something. Q-learning updates its estimate for one action using its estimate for the next one. The estimates are not independent. Assume you are stuck in traffic, with hundreds of vehicles in front of your car. Sitting there, you try to figure out how long it will take to get home, and you guess about an hour. Then you move forward a little and realize the traffic is much heavier than you thought, so you revise your guess. This is the problem with Q-learning. Each guess is built on another guess, so errors and noise carry over, and you can’t be sure of the final result until it arrives. The feedback is quite delayed.


2. Off-Policy

Being off-policy is a distinct characteristic of Q-learning, but it creates some problems too. Q-learning explores as the agent takes different actions, and the actions it takes need not match the policy it is learning. As a result, its behavior during training can be unreliable, and you can’t be sure of the outcome of any individual decision.

3. Struggles with Nonstationary Environments

Suppose you go for a walk early every morning. While walking, you stop at the main road to watch the vehicles pass by. After doing this for some time, you notice there is a traffic delay at 7 am. You could generalize and assume that the traffic delay always occurs at precisely 7 am. But is that true? You can’t be sure that it will happen every day at the exact same time. The pattern can change at any time. Q-learning struggles in such nonstationary environments. It explores the environment, but what it learned from the old pattern may no longer be accurate.
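A small experiment shows why a changing pattern is awkward for any learner that builds estimates from past rewards. The sketch below tracks a made-up reward stream that flips sign partway through, once with a plain running average and once with a constant step size. The reward stream, the step size of 0.2, and the function name `track` are assumptions for illustration.

```python
def track(rewards, step_size=None):
    """Running estimate of a reward stream.

    step_size=None uses the plain average (weight 1/n on the n-th reward).
    A constant step_size weights recent rewards more heavily.
    """
    estimate = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = step_size if step_size is not None else 1.0 / n
        estimate += alpha * (r - estimate)
    return estimate


# Hypothetical stream: an action pays +1 for 100 steps, then the pattern flips to -1.
rewards = [1.0] * 100 + [-1.0] * 20

average_estimate = track(rewards)                 # still positive: history dominates
recent_estimate = track(rewards, step_size=0.2)   # close to -1: follows the new pattern
```

The plain average still rates the action as profitable long after it stopped paying. A constant step size adapts faster, but it also reacts more to noise, so neither choice removes the problem when the market keeps changing.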

Final Thoughts

Financial markets are unpredictable and intricate. There’s no way to predict the risks, profits, or losses with certainty. Algorithms that don’t learn can’t keep up with the ever-changing nature of the trading market. That’s why Q-learning is worth experimenting with as a way to gain insights into the trading world. With more development, it may help us reduce risks and find new edges.

This paper inspired this article.
