doi:10.1016/j.ins.2005.10.009
Copyright © 2005 Elsevier Inc. All rights reserved.
Adaptive stock trading with dynamic asset allocation using reinforcement learning
a School of Computer Science and Engineering, Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, Seoul 151-742, Republic of Korea
b Department of Multimedia Science, Sookmyung Women’s University, Chongpa-dong, Yongsan-gu, Seoul 140-742, Republic of Korea
c School of Computer Science and Engineering, Sungshin Women’s University, Dongsun-dong, Sungbuk-gu, Seoul 136-742, Republic of Korea
Received 4 December 2003; revised 11 October 2005; accepted 14 October 2005. Available online 12 December 2005.
Abstract
Stock trading is an important decision-making problem that involves both stock selection and asset management. Though many promising results have been reported for predicting prices, selecting stocks, and managing assets using machine-learning techniques, considering all three together is challenging because of their complexity. In this paper, we present a new stock trading method that incorporates dynamic asset allocation in a reinforcement-learning framework. The proposed asset allocation strategy, called meta policy (MP), is designed to utilize the temporal information from both stock recommendations and the ratio of the stock fund to the total asset. Local traders are constructed with pattern-based multiple predictors, and used to decide the purchase money per recommendation. The MP is formulated in the reinforcement-learning framework through a compact design of the environment and the learning agent. Experimental results using the Korean stock market show that the proposed MP method outperforms other fixed asset-allocation strategies, and reduces the risks inherent in local traders.
Keywords: Stock trading; Reinforcement learning; Multiple-predictors approach; Asset allocation
Fig. 1. Tendency of recommendations of LT_bear. The x-axis represents training days, and the y-axis represents the sum of profits, in percentage terms, induced by recommendations for a day. An upward bar means that positive sums dominate negatives for that day. A downward bar means the opposite.
Fig. 2. Funding history and traded recommendations when the purchase money per recommendation is fixed at 0.4 million Won. The upper figure shows the change in asset over a trading period. Whenever there is a candidate share, the fixed amount of asset is used to trade it. The asset volume is shown in the vertical axis in monetary units, i.e., Won. The horizontal axis represents trading days. The lower figure shows the sum of the profits this trading scheme induces for the trading period, as a percentage.
Fig. 3. Funding history and traded recommendations when the purchase money per recommendation is fixed at 40 million Won.
Fig. 4. The stock trading process with MP.
Fig. 5. The overall architecture for stock trading in a reinforcement-learning framework. The stock market corresponds to the environment, and the MP corresponds to the agent. N_{e,t} is the number of candidates the local trader LT_e retrieves. SF_t is the stock fund ratio. PMR_{e,t} corresponds to the action that is the PMR for LT_e. Buying and selling by local traders are conducted by local_trade. Reward is the profit ratio of Eq. (8).
Fig. 6. The Q-learning algorithm for meta policy. The function Action chooses the action by Eq. (7) and the function Trade simulates a trading day.
Fig. 7. The asset-allocation strategy of policy1.
Fig. 8. The asset-allocation strategy of policy2.
Fig. 9. The tendency of Q-learning. The profit ratios are estimated by the simulation in Fig. 4 for the training period and the validation period. The horizontal axis represents the estimating points after every 1000 episodes. The vertical axis shows the profit ratios as a percentage. The solid line is for the training period and the dotted line is for the validation period.
Fig. 10. A comparison of the performances of the three trading systems. The horizontal axis shows days from January 2002 to May 2003. The vertical axis shows the asset volume in Won. Lines are: dotted (KOSPI index), dashed-dotted (policy1), dashed (policy2), and solid (MPG).
Fig. 11. A comparison of the performances of the three trading systems. The test was conducted on the KOSDAQ market. Lines are: dotted (KOSDAQ index), dashed-dotted (policy1), dashed (policy2), and solid (MPG).
Fig. 12. Upper figure: asset history. Middle figure: the proportion of stock fund to asset. Lower figure: sums of positive and negative profits recommended per day. It is an imaginary summation, which assumes that all the recommendations were purchased and traded according to the local policy.
Fig. 13. A comparison of the performances when the discretization level of action is changed. The solid line shows the result of (0.5, 1.0, 3.0, 5.0), and the dotted line shows the result of (1.5, 3.0, 4.5, 6.0).
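The agent-environment loop named in the captions of Figs. 5 and 6 can be illustrated with a minimal tabular Q-learning sketch. It is not the authors' implementation: the state is assumed to be a pair of small integer buckets (number of candidates, stock fund ratio), the action is an index into the PMR levels quoted in the Fig. 13 caption, and the reward stands in for a profit ratio. The function toy_trade and its dynamics are hypothetical, invented only so the loop runs, and epsilon-greedy selection is an assumed substitute for the action rule of Eq. (7), which is not reproduced here.

```python
import random
from collections import defaultdict

# PMR levels quoted in the Fig. 13 caption; their unit is not stated there.
PMR_LEVELS = (0.5, 1.0, 3.0, 5.0)


def toy_trade(state, action, rng):
    """Hypothetical stand-in for one simulated trading day.

    state  = (candidate-count bucket, stock-fund-ratio bucket), each in 0..3
    action = index into PMR_LEVELS
    Returns (next_state, reward); the reward plays the role of a profit ratio.
    """
    _, sf_bucket = state
    # Made-up dynamics: large purchases pay off only while little is tied up in stock.
    edge = 0.02 if sf_bucket < 2 else -0.02
    reward = edge * PMR_LEVELS[action] + rng.gauss(0.0, 0.01)
    # Large purchases push the stock fund ratio up; small ones let it fall.
    if action >= 2:
        next_sf = min(3, sf_bucket + 1)
    else:
        next_sf = max(0, sf_bucket - 1)
    return (rng.randrange(4), next_sf), reward


def q_learning(episodes=2000, days=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning over the toy environment (all hyperparameters assumed)."""
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * len(PMR_LEVELS))
    for _ in range(episodes):
        state = (rng.randrange(4), 0)
        for _ in range(days):
            # Epsilon-greedy choice (assumed; the paper's Eq. (7) is not shown here).
            if rng.random() < epsilon:
                action = rng.randrange(len(PMR_LEVELS))
            else:
                values = q[state]
                action = values.index(max(values))
            next_state, reward = toy_trade(state, action, rng)
            # Standard one-step Q-learning update.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_pmr(q, state):
    """Return the PMR level with the highest learned value in the given state."""
    values = q[state]
    return PMR_LEVELS[values.index(max(values))]
```

Calling q_learning() and then greedy_pmr(q, (1, 0)) returns the PMR level the learned table prefers for that state. Swapping toy_trade for a simulator driven by real recommendation data would be the natural next step, but that part is outside what the captions specify.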
Table 1. Parameters of local policy
Table 2. Performances of local predictors
Table 3. Bit vector representation of recommendations
Table 4. Bit vector of the stock fund ratio
Table 5. Policies of the trading systems
Table 6. Partitions of the data
Table 7. The profit induced by each trading system in the KOSPI market
Table 8. Profit induced by each trading system on the KOSDAQ market
