Tuning heuristics and convergence analysis of reinforcement learning algorithm for online data-based optimal control design

Fábio Nogueira da Silva; João Viana Fonseca Neto

doi:10.33448/rsd-v9i2.2128

Authors

Fábio Nogueira da Silva, Universidade Federal do Maranhão, https://orcid.org/0000-0001-7215-6520
João Viana Fonseca Neto, Universidade Federal do Maranhão, https://orcid.org/0000-0003-4606-7510

Keywords:

Optimal Control; Reinforcement Learning; Approximate Dynamic Programming; Output Feedback; Tuning.

Abstract

A heuristic is presented for the tuning and convergence analysis of a reinforcement learning algorithm for output-feedback control that uses only input/output data generated by a model. To carry out the convergence analysis, the parameters of the algorithms used for data generation must be tuned and the control problem solved iteratively. The proposed heuristic tunes the parameters of the data generator by building surfaces that support the convergence and robustness analysis of the online optimal control methodology. The algorithm tested is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning through temporal-difference learning in a policy iteration scheme, which determines the optimal policy using only input/output data. Within the policy iteration algorithm, RLS (Recursive Least Squares) is used to estimate online the parameters associated with the output-feedback DLQR. After the proposed tuning heuristics were applied, the influence of the parameters could be seen clearly, and the convergence analysis became easier.
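The RLS step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it shows only a generic recursive least squares update with a forgetting factor, applied to a noise-free linear regression. The names `rls_update` and `run_demo`, the forgetting factor `lam`, the initial covariance scale `p0`, the regressor signal, and the "true" parameter vector are all assumptions made for this example. In the paper's setting the regressor would be built from measured input/output data and the estimated parameters would be those of the output-feedback DLQR value function.

```python
import math


def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive least squares step for the model y = phi^T theta.

    theta: current estimate (list of floats)
    P:     current covariance matrix (list of lists, assumed symmetric)
    phi:   regressor vector for this sample
    y:     measured scalar output for this sample
    lam:   forgetting factor in (0, 1]; 1.0 means no forgetting (assumption)
    """
    n = len(theta)
    # P @ phi; because P is symmetric this also equals (phi^T P)^T
    p_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * p_phi[i] for i in range(n))
    gain = [v / denom for v in p_phi]
    # Prediction error with the current estimate
    err = y - sum(phi[i] * theta[i] for i in range(n))
    theta_new = [theta[i] + gain[i] * err for i in range(n)]
    # Covariance update: (P - gain * phi^T P) / lam
    P_new = [[(P[i][j] - gain[i] * p_phi[j]) / lam for j in range(n)]
             for i in range(n)]
    return theta_new, P_new


def run_demo(lam=0.99, p0=1e6, steps=50):
    """Estimate a hypothetical 2-parameter vector from noise-free data."""
    theta_true = [2.0, -1.0]           # illustrative values, not from the paper
    theta = [0.0, 0.0]                 # initial estimate
    P = [[p0, 0.0], [0.0, p0]]         # large initial covariance (assumption)
    for k in range(steps):
        # Two sinusoids with different frequencies keep the regressor exciting
        phi = [math.sin(0.7 * k), math.cos(1.3 * k)]
        y = sum(p * t for p, t in zip(phi, theta_true))
        theta, P = rls_update(theta, P, phi, y, lam)
    return theta


if __name__ == "__main__":
    print(run_demo())
```

The forgetting factor and the initial covariance scale are typical RLS tuning parameters. Whether they coincide with the data-generator parameters tuned in the paper is not stated in the abstract, so they should be read as placeholders.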

