Tuning heuristic and convergence analysis of the reinforcement learning algorithm for online data-driven optimal control design

Authors

F. N. da Silva; J. V. F. Neto

DOI:

https://doi.org/10.33448/rsd-v9i2.2128

Keywords:

Optimal Control; Reinforcement Learning; Approximate Dynamic Programming; Output Feedback; Tuning.

Abstract

A heuristic is presented for the tuning and convergence analysis of a reinforcement learning algorithm for output-feedback control that uses only input/output data generated by a model. To support the convergence analysis, the parameters of the algorithms used for data generation must be tuned and the control problem solved iteratively. A heuristic is proposed for tuning the data-generator parameters by building surfaces that aid the analysis of convergence and robustness of the online optimal control methodology. The algorithm under test is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning via temporal-difference learning in a policy-iteration scheme, which determines the optimal policy using only input/output data. Within the policy-iteration algorithm, recursive least squares (RLS) is used to estimate online the parameters associated with the output-feedback DLQR. After applying the proposed tuning heuristics, the influence of the parameters became clearly visible and the convergence analysis was made easier.
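The article's estimator itself is not reproduced here; as a minimal sketch of the general technique it names, recursive least squares with a forgetting factor for online parameter estimation (the regressor, target, and all numerical values below are hypothetical illustrations, not the paper's DLQR setup):

```python
import numpy as np

def rls_update(theta, P, phi, target, lam=0.99):
    """One recursive-least-squares step with forgetting factor lam:
    refine parameter estimate theta and covariance P from a new
    regressor vector phi and its scalar target."""
    phi = phi.reshape(-1, 1)
    K = P @ phi / (lam + phi.T @ P @ phi)   # gain vector
    theta = theta + (K * (target - phi.T @ theta)).ravel()
    P = (P - K @ phi.T @ P) / lam           # covariance update
    return theta, P

# Illustrative use: recover w_true in y = w_true . x from streaming data.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
theta = np.zeros(3)
P = np.eye(3) * 1000.0                      # large initial covariance (weak prior)
for _ in range(200):
    x = rng.normal(size=3)
    theta, P = rls_update(theta, P, x, w_true @ x)
print(np.round(theta, 3))
```

In the policy-iteration scheme the abstract describes, an update of this shape would run at every time step, with the regressor built from measured input/output data rather than from a known model.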


Published

01/01/2020

How to Cite

SILVA, F. N. da; NETO, J. V. F. Heurística de ajuste y análisis de convergencia del algoritmo de aprendizaje por refuerzo para el proyecto de control óptimo basado en datos online. Research, Society and Development, [S. l.], v. 9, n. 2, p. e188922128, 2020. DOI: 10.33448/rsd-v9i2.2128. Available at: https://rsdjournal.org/index.php/rsd/article/view/2128. Accessed: 30 June 2024.

Issue

Section

Engineering