Tuning heuristics and convergence analysis of reinforcement learning algorithm for online data-based optimal control design





Optimal Control; Reinforcement Learning; Approximate Dynamic Programming; Output Feedback; Tuning.


This paper presents a heuristic for the tuning and convergence analysis of a reinforcement learning algorithm for output-feedback control that uses only input/output data generated by a model. Convergence analysis requires adjusting the parameters of the algorithms used for data generation and solving the control problem iteratively. The proposed heuristic adjusts the data generator parameters and builds surfaces that assist the convergence and robustness analysis of the online optimal control methodology. The algorithm tested is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning through temporal difference learning in a policy iteration scheme, which determines the optimal policy from input/output data only. Within policy iteration, recursive least squares (RLS) estimates online the parameters associated with the output-feedback DLQR. After the proposed tuning heuristic was applied, the influence of the parameters could be seen clearly, which facilitated the convergence analysis.
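To make the role of RLS more concrete, the following is a minimal sketch of a generic recursive least squares estimator for a linear-in-parameters model y ≈ φᵀθ. It is not the authors' implementation. The names `rls_update` and `run_demo`, the values in `true_theta`, the initial covariance scale `p0`, and the forgetting factor are all hypothetical choices made for illustration. In the output-feedback DLQR setting, the regressor φ would presumably be built from measured input/output data, but that construction is not given in the abstract and is not reproduced here.

```python
import random


def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive least squares step for the model y ~ phi . theta."""
    n = len(theta)
    # P is symmetric, so P*phi also equals the transpose of phi^T * P.
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = forgetting + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    # Prediction error made by the current estimate on the new sample.
    error = y - sum(phi[i] * theta[i] for i in range(n))
    new_theta = [theta[i] + gain[i] * error for i in range(n)]
    # Covariance update: P <- (P - gain * phi^T P) / forgetting.
    new_P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(n)]
             for i in range(n)]
    return new_theta, new_P


def run_demo(num_samples=50, p0=1e6, forgetting=1.0, seed=0):
    """Estimate a hypothetical parameter vector from noiseless samples."""
    rng = random.Random(seed)
    true_theta = [2.0, -1.0, 0.5]  # hypothetical values, illustration only
    n = len(true_theta)
    theta = [0.0] * n
    # Large initial covariance: little confidence in the initial estimate.
    P = [[p0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(num_samples):
        phi = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = sum(a * b for a, b in zip(phi, true_theta))
        theta, P = rls_update(theta, P, phi, y, forgetting)
    return theta, true_theta


if __name__ == "__main__":
    estimate, target = run_demo()
    print("estimate:", estimate)
    print("target:  ", target)
```

Quantities such as `p0` and `forgetting` are the kind of settings a tuning heuristic might sweep to study their effect on convergence. Reading them that way is an assumption, since the abstract does not list the parameters that were actually tuned.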




How to Cite

SILVA, F. N. da; NETO, J. V. F. Tuning heuristics and convergence analysis of reinforcement learning algorithm for online data-based optimal control design. Research, Society and Development, [S. l.], v. 9, n. 2, p. e188922128, 2020. DOI: 10.33448/rsd-v9i2.2128. Available at: https://rsdjournal.org/index.php/rsd/article/view/2128. Accessed on: 22 Oct. 2021.