As humans we have a primary and a secondary directive: the primary says stay alive, the secondary says have an interesting life. Both depend on making good choices and then acting on them. Given the right tools and framework, machines can perform similarly.
In the domain of computational philosophy we can imagine a board such as the image at right, where the dots that make up the array vary in intensity, the whiter dots being more valuable to life than the darker. Life stretches out before us as a series of opportunities and choices to be made. Any action has a cost, but the result of the action provides a benefit. Provided we keep choosing moves that yield a net benefit, we remain healthy.
Our knowledge of the board is limited. We may observe that the arrangement of the pixels is random, but we know that even chaos clumps - that is, it is in the nature of randomness to produce dense patches of rewarding values alongside areas that are not so rewarding. Once in fertile ground it could pay to stay there and explore a bit before making other leaps into the unknown.
We do not know any characteristics of the playing area except the values at the four points to the N, E, S and W of the point we currently occupy. We cannot improve our health by remaining in one point, no matter how much value it contains, since that value declines as soon as we occupy it; our presence eventually exhausts the resources. So we have to make a move; the move costs us effort, so our health declines with the effort. We have five choices for the move: one of the four adjacent points, or some completely random unknown point. In addition it is in the nature of such domains to increase both extensively (expanding outwards at the margins) and intensively (gaining deeper granularity) as the learning progresses [1]. These new areas may have entirely different characteristics from the original values. So what you think you remember may only be partly true.
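The rules above can be sketched as a small Python class. This is a hypothetical rendering, not the author's code: the grid size, the wrap-around edges, the starting position, and the move cost of 0.3 are all assumptions made for illustration.

```python
import random

class Board:
    """A toy version of the playing area: each cell holds a value in
    [0, 1]; occupying a cell exhausts its value."""

    def __init__(self, size=50, seed=0):
        self.rng = random.Random(seed)
        self.size = size
        self.cells = [[self.rng.random() for _ in range(size)]
                      for _ in range(size)]
        self.x, self.y = size // 2, size // 2   # start in the middle

    def neighbours(self):
        """The only things we can see: the values at N, E, S and W
        (the grid wraps at the edges)."""
        n, x, y = self.size, self.x, self.y
        return {"N": self.cells[(y - 1) % n][x],
                "E": self.cells[y][(x + 1) % n],
                "S": self.cells[(y + 1) % n][x],
                "W": self.cells[y][(x - 1) % n]}

    def move(self, direction, cost=0.3):
        """Move N/E/S/W, or 'jump' to a random unknown point; collect
        the value found there, then exhaust it. Returns value - cost."""
        n = self.size
        if direction == "jump":
            self.x, self.y = self.rng.randrange(n), self.rng.randrange(n)
        else:
            dx, dy = {"N": (0, -1), "E": (1, 0),
                      "S": (0, 1), "W": (-1, 0)}[direction]
            self.x, self.y = (self.x + dx) % n, (self.y + dy) % n
        value = self.cells[self.y][self.x]
        self.cells[self.y][self.x] = 0.0   # our presence exhausts the resource
        return value - cost
```

Note that exhaustion is modelled crudely here, the value dropping to zero outright rather than declining gradually as the text describes.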
The figure at right illustrates the situation. We have arbitrarily landed on the centre square, but now the value obtained has gone to zero and we must act to maintain health. We have five choices: up, down, left, right, jump. Which is the best long-term strategy? Say the cost of any action is 0.3; if the square to the north holds 0.2, moving north has a net result of minus 0.1, so probably not a good idea unless you have great stored health and reason to believe that the squares revealed after the move will have good values. If your health is very low and the adjacent squares are all low, then there is a great incentive to take a random jump.
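The arithmetic of the example can be checked directly. The value of 0.2 for the northern square is implied by the stated net result, not given in the text:

```python
cost = 0.3            # effort spent on any move
north_value = 0.2     # value visible in the square to the north
net = north_value - cost
# net is -0.1 (to within floating point): the move loses health,
# so it only pays off with reserves in hand or optimism about
# what lies beyond the revealed squares.
```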
One decision to be made is what is the value at which you want to take a random leap? At the beginning of the game you know nothing about the board, but as the game progresses you build a history of the results of random leaps and begin to get an idea of the inherent distribution of values on the board.
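One way to pick that leap threshold adaptively, sketched below, is to keep a history of the values found after random jumps and set the criterion from their empirical distribution. The median rule and the seed are assumptions for illustration, not the author's method:

```python
import random
from statistics import median

rng = random.Random(42)   # assumed seed, for reproducibility
# Values found after past random leaps; here simulated as uniform draws.
jump_history = [rng.random() for _ in range(200)]

# Leap when even the best visible square is below the median of what
# random jumps have yielded so far.
criterion = median(jump_history)
```

Early in the game the history is short and the estimate unreliable; as it grows, the criterion converges on the board's true central value.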
A second decision is what is the best policy for choosing among the adjacent squares? One possibility is to take the maximum of the four visible values and move there. The reward is known, and if it is greater than the cost of the move then it enhances well-being.
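The two decisions combine into a single rule: go greedily to the best visible square, unless even the best falls below the leap criterion. A minimal sketch, with the function name and default criterion as assumptions:

```python
def choose_move(neighbours, criterion=0.5):
    """neighbours maps 'N', 'E', 'S', 'W' to their visible values.
    Go to the best adjacent square; if even the best is below the
    criterion, take a random jump instead."""
    best = max(neighbours, key=neighbours.get)
    return "jump" if neighbours[best] < criterion else best

print(choose_move({"N": 0.2, "E": 0.8, "S": 0.1, "W": 0.4}))  # E
print(choose_move({"N": 0.2, "E": 0.3, "S": 0.1, "W": 0.4}))  # jump
```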
In this respect the game is much like Artificial Intelligence Reinforcement Learning.
One way to programme this policy and criterion combination is to run a simulation. The picture fragment at right shows the result of running some decisions using a policy of taking the maximum of the four visible values, with a jump-to-random criterion of 0.5. The light green jagged patches represent the point of interest as it goes about finding more value using the stated policy. As we can see there were at least 8 random jumps, and the policy was only partly successful in exploiting the high-value patches once close to them. No doubt we can do better.
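A minimal, self-contained version of such a run might look like this. Only the policy and the 0.5 criterion come from the text; the grid size, step count, starting health, and health accounting are assumptions:

```python
import random

def run_episode(size=40, steps=200, cost=0.3, criterion=0.5, seed=0):
    """Walk the board greedily, jumping when the best visible value
    falls below `criterion`. Returns final health and jump count."""
    rng = random.Random(seed)
    grid = [[rng.random() for _ in range(size)] for _ in range(size)]
    x, y = size // 2, size // 2
    health, jumps = 1.0, 0
    moves = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
    for _ in range(steps):
        visible = {d: grid[(y + dy) % size][(x + dx) % size]
                   for d, (dx, dy) in moves.items()}
        best = max(visible, key=visible.get)
        if visible[best] < criterion:        # nothing worthwhile nearby
            x, y = rng.randrange(size), rng.randrange(size)
            jumps += 1
        else:
            dx, dy = moves[best]
            x, y = (x + dx) % size, (y + dy) % size
        health += grid[y][x] - cost          # collect value, pay the effort
        grid[y][x] = 0.0                     # resource exhausted
    return health, jumps

health, jumps = run_episode()
```

Plotting the visited cells from such a run would reproduce the jagged-trail picture the text describes.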
To improve the decision making we can run Monte-Carlo simulations with various values of the jump criterion. If we say 'let the criterion be 0.1 and run a thousand moves' we can get an idea of whether we can stay healthy. Do this a thousand times over and we can see how reliable the result is. Then change the criterion and repeat, until a pattern emerges of just what value for the criterion produces the best result. Even though we begin the game in total ignorance of the overall nature of the environment in which we act, repeated effort and learning allow an accumulation of experience that converges on generalizations closer and closer to the ground truth.
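Such a sweep might be sketched as follows. It is self-contained, and for brevity the trial and step counts are far smaller than the thousands described; the greedy policy and health accounting are the same assumptions as before:

```python
import random
from statistics import mean

def run_episode(criterion, size=40, steps=200, cost=0.3, seed=0):
    """One run of the game with a greedy policy and the given jump
    criterion; returns final health."""
    rng = random.Random(seed)
    grid = [[rng.random() for _ in range(size)] for _ in range(size)]
    x, y = size // 2, size // 2
    health = 1.0
    moves = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
    for _ in range(steps):
        visible = {d: grid[(y + dy) % size][(x + dx) % size]
                   for d, (dx, dy) in moves.items()}
        best = max(visible, key=visible.get)
        if visible[best] < criterion:
            x, y = rng.randrange(size), rng.randrange(size)
        else:
            dx, dy = moves[best]
            x, y = (x + dx) % size, (y + dy) % size
        health += grid[y][x] - cost
        grid[y][x] = 0.0
    return health

def sweep(criteria, trials=50):
    """Monte-Carlo: average final health over many random boards
    for each candidate jump criterion."""
    return {c: mean(run_episode(c, seed=s) for s in range(trials))
            for c in criteria}

results = sweep([0.1, 0.3, 0.5, 0.7, 0.9])
best = max(results, key=results.get)   # criterion with highest mean health
```

Averaging over many boards is what lets a pattern emerge despite our ignorance of any one board: the seed varies per trial, so each criterion is judged across many environments.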
1. [With thanks to David Ricardo for the extensive/intensive idea]