Sometime last year, I stumbled upon a paper while I was trying to come up with a really basic way to implement a budget and expenditure planner using an RL agent.
A bit of background
A typical reinforcement learning setting is one where an agent interacts with an environment $E$ over a number of time steps $t$.
For each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from the set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. After the action, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$.
Put simply, reinforcement learning is about learning the most rewarding behaviour in an environment, so that the agent can make optimal decisions.
The above process continues until the agent reaches a terminal state, and then it restarts - with each episode building on what was learned in the previous ones. The goal of the agent is to maximise the expected return (the cumulative discounted reward) from each state $s_t$.
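To make that loop concrete, here is a minimal sketch in Python. The `env` and `policy` objects are hypothetical stand-ins (an environment with `reset()`/`step()` methods and a callable policy), not anything defined in the paper:

```python
def run_episode(env, policy, gamma=0.99):
    """Run one episode and return the discounted return from the start state."""
    state = env.reset()                          # initial state s_0
    done = False
    episode_return, discount = 0.0, 1.0
    while not done:
        action = policy(state)                   # a_t chosen by the policy pi
        state, reward, done = env.step(action)   # receive s_{t+1} and scalar reward r_t
        episode_return += discount * reward      # accumulate the discounted return
        discount *= gamma
    return episode_return
```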
Unimportant extra info
Apart from the agent and the environment, there are three other main elements in a reinforcement learning system:
- the policy: think of this as a mapping that connects a perceived state of the environment to the action to take in that state. The policy is usually the core of what the agent learns.
- the reward: the reward signal distinguishes the agent's good actions from its bad ones. The goal of the RL system is to maximise the total reward it collects.
- the value: a state's value is the total reward that the agent can expect to accumulate in the future if it starts from that state (formalised right after this list).
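In standard notation (this is textbook RL, not something specific to the paper), the value of a state $s$ under a policy $\pi$ is the expected discounted sum of future rewards when starting from $s$ and following $\pi$, with $\gamma \in [0, 1)$ as the discount factor:

$$ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right] $$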
Now, back to the paper I stumbled upon.
This paper, titled "Asynchronous Methods for Deep Reinforcement Learning", was published in 2016. I found it really rad (and, subtle mention here, a tad complicated rather than straightforward).
It proposed that asynchronously executing parallel learner agents could stabilise deep neural network training, and went on to discuss several ways to achieve this asynchronicity in deep reinforcement learning.
It worked!
What stood out for me was the problem they actually solved, because the approach works very well and is now a foundation that many modern RL solutions are built upon.
Since the early days of reinforcement learning as we know it, a number of algorithms have been proposed over the years, and many have had great runs. Yet it was initially thought - and widely believed - that combining RL algorithms with deep neural networks was fundamentally unstable.
Several approaches had been proposed to stabilise RL algorithms when combined with deep neural networks. The common idea was to use an experience replay memory to store the agent's data so that it could be batched (batching usually does save the day, but maybe not this time) or sampled at random from distinct time steps.
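As a rough illustration of what such a replay memory looks like, here is a generic DQN-style buffer sketched in Python; the class and method names are mine, not from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal experience replay memory: store transitions, sample random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        # Store one transition (s_t, a_t, r_t, s_{t+1}, done).
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw a random minibatch from distinct time steps, which breaks the
        # temporal correlation in the agent's stream of experience.
        return random.sample(self.buffer, batch_size)
```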
However, the drawbacks were significant: far more memory and computation were used per real interaction with the environment, and it required learning algorithms that do not depend on the current policy, i.e. off-policy methods that can update from data generated by an older policy.
But instead of experience replay, the authors of this paper asynchronously executed multiple agents in parallel on multiple instances of the same environment. This simple idea - applied with deep neural networks - enabled a much wider range of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms like Q-learning, to be applied robustly and effectively.
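A very rough sketch of that idea in Python might look like the following. Everything here (`make_env`, the shared `global_params` object with `act`/`update` methods, the thread count) is a hypothetical stand-in for the paper's actual Hogwild!-style asynchronous gradient updates:

```python
import threading

def worker(global_params, make_env, n_steps=10_000):
    """One actor-learner: interacts with its own environment copy and updates shared parameters."""
    env = make_env()                              # each worker gets its own instance of the same environment
    state = env.reset()
    for _ in range(n_steps):
        action = global_params.act(state)         # act using the shared (global) policy
        next_state, reward, done = env.step(action)
        global_params.update(state, action, reward, next_state, done)  # asynchronous update, no replay memory
        state = env.reset() if done else next_state

def launch_workers(global_params, make_env, n_workers=16):
    # The paper typically runs 16 actor-learner threads on a single multi-core CPU.
    threads = [threading.Thread(target=worker, args=(global_params, make_env))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    return threads
```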
In the next article of this RL series, I'll explain the actor-critic method and show a nice sample implementation using TensorFlow.