BAIR's latest RL algorithm surpasses Google's Dreamer with a 2.8x performance gain

Pixel-based RL strikes back: BAIR proposes an algorithm that combines contrastive learning with RL, and its sample efficiency rivals that of state-based RL.

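At its core, this research answers one question: can RL that uses images as observations (pixel-based) be as effective as RL that uses coordinate state as observations? Traditionally, image-based RL has been widely regarded as data-inefficient, typically needing 100 million interaction steps to solve benchmark tasks such as Atari games.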

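The researchers introduce CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL uses contrastive learning to extract high-level features from raw pixels and performs off-policy control on top of the extracted features. On complex tasks in the DeepMind Control Suite and Atari Games, CURL outperforms prior pixel-based methods (both model-based and model-free), improving performance at the 100K interaction-step benchmarks by 2.8x and 1.6x respectively. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample efficiency and performance of methods that use state-based features.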

  • Paper: https://arxiv.org/abs/2004.04136
  • Website: https://mishalaskin.github.io/curl/
  • GitHub: https://github.com/MishaLaskin/curl

Background

CURL is a general framework that combines contrastive learning with RL. In principle, any RL algorithm, on-policy or off-policy, can be used in the CURL pipeline. For the continuous control benchmark (DM Control), the research team uses the well-known Soft Actor-Critic (SAC) (Haarnoja et al., 2018); for the discrete control benchmark (Atari), the team uses Rainbow DQN (Hessel et al., 2017). Below is a brief review of SAC, Rainbow DQN, and contrastive learning.

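Soft Actor-Critic

SAC is an off-policy RL algorithm that optimizes a stochastic policy to maximize the expected trajectory return. Like other state-of-the-art end-to-end RL algorithms, SAC is very effective at solving tasks from state observations but fails to learn efficient policies from pixels.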

Rainbow

Rainbow DQN (Hessel et al., 2017) is best summarized as a set of improvements on top of the original Nature DQN (Mnih et al., 2015). Specifically, the deep Q-network (DQN) (Mnih et al., 2015) combines the off-policy algorithm Q-learning with a convolutional neural network as the function approximator, mapping raw pixels to action-value functions.

In addition, distributional RL (Bellemare et al., 2017) proposed predicting a distribution over possible value-function bins through the C51 algorithm. Rainbow DQN combines these techniques in a single off-policy algorithm to achieve state-of-the-art sample efficiency on the Atari benchmarks. Rainbow also uses multi-step returns (Sutton et al., 1998).
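
As a rough illustration of the multi-step return mentioned above, the sketch below computes the discounted sum of n rewards plus a discounted bootstrap value. The function name, the toy rewards, and the choice of bootstrap value are assumptions for illustration (in Q-learning the bootstrap would typically be the maximum estimated action value at the state reached after n steps); this is not code from Rainbow.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Return r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * bootstrap_value."""
    g = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical 3-step example: 1 + 0.5*0 + 0.25*2 + 0.125*10
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5))  # 2.75
```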

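Contrastive learning

A key component of CURL is the ability to learn rich representations of high-dimensional data using contrastive unsupervised learning. Contrastive learning can be understood as a differentiable dictionary lookup task. Given a query q, keys K = {k_0, k_1, ...}, and an explicitly known partition of K with respect to q, P(K) = ({k+}, K \ {k+}), the goal of contrastive learning is to ensure that q matches k+ more closely than it matches any key in K \ {k+}. In contrastive learning, q, K, k+, and K \ {k+} are also called the anchor, targets, positive, and negatives, respectively.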

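CURL implementation

CURL makes minimal changes to the base RL algorithm by training the contrastive objective as an auxiliary loss during the batch update. In the experiments, the researchers train CURL together with two model-free RL algorithms: SAC for the DMControl experiments and Rainbow DQN for the Atari experiments.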

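Overview of the framework

The instance discrimination approach used by CURL is similar to SimCLR, MoCo, and CPC. Most deep RL frameworks take a stack of consecutive frames as input. The algorithm therefore performs instance discrimination across frame stacks rather than single image instances.

The researchers found that processing the targets with a momentum encoding procedure similar to MoCo works well in RL. Finally, they score the InfoNCE objective with a bilinear inner product similar to the one in CPC, which they found to work better than the unit-norm vector products used in MoCo and SimCLR. The contrastive representation is trained jointly with the RL algorithm and receives gradients from both the contrastive objective and the Q-function. The overall framework is shown in Figure 2.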

Figure 2: Schematic of the overall CURL framework.

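Discrimination objective

Choosing the positives and negatives with respect to an anchor is one of the key components of contrastive representation learning.

Unlike discriminating among image patches within the same image, discriminating between transformed image instances optimizes a simplified instance-discrimination objective with the InfoNCE loss and needs only minimal architectural changes. In the RL setting, there are two main reasons to prefer the simpler discrimination objective: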

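  • RL algorithms are brittle, so a complex discrimination objective may destabilize the RL objective.
  • RL algorithms train on dynamically generated datasets, so a complex discrimination objective may significantly increase the training time required.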

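CURL therefore uses instance discrimination rather than patch discrimination. Contrastive instance-discrimination setups such as SimCLR and MoCo can be viewed as maximizing the mutual information between an image and its augmented version.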

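Query-key pair generation

As in instance discrimination in the image setting, the anchor and positive observations are two different augmentations of the same image, while negatives come from other images. CURL relies primarily on random-crop data augmentation, in which a random square patch is cropped from the original rendered image.

The researchers apply the random augmentation across the batch but keep it consistent within each frame stack, to retain information about the temporal structure of the observation. The augmentation procedure is shown in Figure 3.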
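
The cropping scheme described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's code: frames are nested lists rather than tensors, and the names `random_crop_stack` and `make_query_and_key` are hypothetical. The point it shows is that the crop position is sampled once per frame stack, so every frame in a stack shares the same window, while the anchor and the positive get independent crops.

```python
import random

def random_crop_stack(frames, out_size, rng=random):
    """Cut one randomly placed out_size x out_size window from every frame.

    frames is a list of equally sized 2-D lists (height x width). The window
    position is sampled once per stack, so all frames share the same crop.
    """
    height, width = len(frames[0]), len(frames[0][0])
    top = rng.randint(0, height - out_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - out_size)
    return [
        [row[left:left + out_size] for row in frame[top:top + out_size]]
        for frame in frames
    ]

def make_query_and_key(frames, out_size, rng=random):
    """Two independent crops of the same stack: the anchor and its positive."""
    return random_crop_stack(frames, out_size, rng), random_crop_stack(frames, out_size, rng)

# Toy stack of 3 frames, 6 x 6 pixels; each pixel value encodes (frame, row, col).
stack = [[[100 * f + 10 * r + c for c in range(6)] for r in range(6)] for f in range(3)]
query, key = make_query_and_key(stack, out_size=4, rng=random.Random(0))
```

In the training command later in this article, the analogous sizes are 100 (`--pre_transform_image_size`) and 84 (`--image_size`).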

Figure 3: A visual illustration of using random crops to produce an anchor and its positive.

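Similarity measure

Another determining factor in the discrimination objective is the inner product used to measure agreement between query-key pairs. CURL uses the bilinear inner product sim(q, k) = q^T W k, where W is a learned parameter matrix. The team found that this similarity measure outperforms the normalized dot product used in recent state-of-the-art contrastive learning methods in computer vision, such as MoCo and SimCLR.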
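
With this bilinear similarity, the InfoNCE loss for a single query can be written as below. This is a sketch in the notation of the contrastive learning section (query q, positive k_+, key set K, learned matrix W); the leading minus sign is a convention that turns the log-probability of picking the positive into a loss to be minimized.

```latex
\mathcal{L}_q = -\log
  \frac{\exp\left(q^{\top} W k_{+}\right)}
       {\exp\left(q^{\top} W k_{+}\right)
        + \sum_{k \in K \setminus \{k_{+}\}} \exp\left(q^{\top} W k\right)}
```

Read this way, the loss is the cross-entropy of a classifier over the keys whose correct label is the positive key.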

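Momentum target encoding

The goal of contrastive learning in CURL is to train an encoder that maps high-dimensional pixels to more semantic latent states. InfoNCE is an unsupervised loss that learns encoders f_q and f_k, which map the raw anchors (queries) x_q and targets (keys) x_k to latents q = f_q(x_q) and k = f_k(x_k), to which the team applies the similarity inner product. It is common to share the same encoder between the anchor and target mappings, that is, f_q = f_k.

CURL combines frame-stack instance discrimination with momentum encoding of the targets, while RL is performed on top of the encoder features.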
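
Momentum encoding in the MoCo style keeps the key encoder as a slowly moving average of the query encoder instead of sharing weights outright. The sketch below shows that update on flat lists of floats; the function name and the momentum value are assumptions for illustration, and a real implementation would apply the same rule to each parameter tensor.

```python
def momentum_update(query_params, key_params, m=0.99):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

# Hypothetical parameters: the key encoder drifts slowly toward the query encoder.
query_params = [1.0, 2.0]
key_params = [0.0, 0.0]
for _ in range(3):
    key_params = momentum_update(query_params, key_params, m=0.9)
print(key_params)
```

A value of m close to 1 makes the targets change slowly, which is the property the momentum encoder is meant to provide.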

CURL contrastive learning pseudocode (PyTorch style)
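
The contrastive step can be sketched as follows. This is a plain-Python approximation under stated assumptions, not the authors' listing: vectors are lists of floats, the function names are hypothetical, and row i of the keys is assumed to be the positive for row i of the queries, with the other rows in the batch acting as negatives. It computes bilinear logits q_i^T W k_j and the mean cross-entropy with the diagonal as the correct class.

```python
import math

def bilinear_logits(queries, keys, W):
    """logits[i][j] = q_i^T W k_j for every query/key pair in the batch."""
    projected = [
        [sum(W[r][c] * k[c] for c in range(len(k))) for r in range(len(W))]
        for k in keys
    ]  # W k_j for each key
    return [[sum(q[r] * wk[r] for r in range(len(q))) for wk in projected] for q in queries]

def info_nce_loss(queries, keys, W):
    """Mean cross-entropy where the positive for query i is key i."""
    logits = bilinear_logits(queries, keys, W)
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max before exponentiating, for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

# Toy batch of two 2-D latents with an identity W (all values are made up).
identity = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
matched_keys = [[10.0, 0.0], [0.0, 10.0]]
print(info_nce_loss(queries, matched_keys, identity))
```

When every key is equally similar to every query, the loss equals the log of the batch size; it approaches zero as each query scores its own key far above the rest.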

Experiments

The researchers evaluate (i) sample efficiency, by measuring how many interaction steps the best-performing baseline needs to match CURL's performance at 100k interaction steps, and (ii) performance at 100k steps, by measuring the ratio of the episode return achieved by CURL to that of the best-performing baseline. In other words, "data efficiency" or "sample efficiency" refers to (i), and "performance" refers to (ii).
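
The two ratios can be made concrete with a small worked example. The function names and all numbers below are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
def performance_ratio(curl_return_at_100k, baseline_return_at_100k):
    """(ii) CURL's episode return divided by the best baseline's, both at 100k steps."""
    return curl_return_at_100k / baseline_return_at_100k

def data_efficiency_ratio(baseline_steps_to_match, curl_steps=100_000):
    """(i) Steps the baseline needs to reach CURL's 100k-step score, relative to 100k."""
    return baseline_steps_to_match / curl_steps

# Hypothetical numbers: CURL scores 500 and the baseline 250 at 100k steps,
# and the baseline needs 400k steps to reach a score of 500.
print(performance_ratio(500.0, 250.0))   # 2.0
print(data_efficiency_ratio(400_000))    # 4.0
```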

DMControl

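Key findings from the DMControl experiments:

  • CURL is the state-of-the-art image-based RL algorithm on every DMControl environment that the researchers benchmarked for sample efficiency against existing image-based baselines. On DMControl100k, CURL outperforms Dreamer (Hafner et al., 2019), a leading model-based method, by 2.8x in performance and is 9.9x more data-efficient.
  • On the majority of the 16 DMControl environments shown in Figure 7, CURL operating purely from pixels nearly matches (and sometimes surpasses) the sample efficiency of SAC operating from state. No earlier image-based RL algorithm had achieved this, whether model-based or model-free, with or without auxiliary tasks.
  • Within 500k steps, CURL solves the majority of the 16 DMControl experiments (converging close to the optimal score of 1000). In just 100k steps it is already competitive with state-of-the-art performance, and it substantially outperforms the other methods in this regime.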

Table 1. Scores achieved by CURL and the baselines on the DMControl benchmarks at 500k (DMControl500k) and 100k (DMControl100k) environment steps.

Figure 4. Performance of CURL coupled with SAC, averaged over 10 seeds, relative to the SLAC, PlaNet, Pixel SAC, and State SAC baselines.

Figure 6. The number of steps that Dreamer, the leading prior pixel-based method, needs to reach the same score that CURL achieves in 100k training steps.

Figure 7. CURL compared with state-based SAC, run with 2 seeds on each of the 16 selected DMControl environments.

Atari

Key findings from the Atari experiments:

  • CURL is the state-of-the-art pixel-based RL algorithm in data efficiency on the majority of the 26 Atari100k experiments. On average, on Atari100k CURL outperforms SimPLe by 1.6x and Efficient Rainbow DQN by 2.5x.
  • CURL achieves a median human-normalized score (HNS) of 24%, while SimPLe and Efficient Rainbow DQN achieve 13.5% and 14.7% respectively. The mean HNS is 37.3%, 39%, and 23.8% for CURL, SimPLe, and Efficient Rainbow DQN respectively.
  • CURL comes close to human efficiency on three games, JamesBond (98.4% HNS), Freeway (94.2% HNS), and Road Runner (86.5% HNS), a first for any pixel-based RL algorithm.
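
For readers unfamiliar with the human-normalized score, the sketch below uses its common definition: the agent's score rescaled so that a random policy maps to 0 and the human reference maps to 1. The article itself does not spell out the formula, so treat the definition as an assumption here; the function name and all scores are hypothetical. The second part illustrates why a median and a mean over games can differ noticeably.

```python
from statistics import mean, median

def human_normalized_score(agent_score, random_score, human_score):
    """Rescale a raw game score so that random play is 0.0 and human play is 1.0."""
    return (agent_score - random_score) / (human_score - random_score)

# Hypothetical raw scores for one game.
print(human_normalized_score(agent_score=60.0, random_score=10.0, human_score=110.0))  # 0.5

# Hypothetical per-game HNS values: one strong game pulls the mean above the median.
per_game_hns = [0.1, 0.2, 0.9]
print(median(per_game_hns), mean(per_game_hns))
```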

Table 2. Scores achieved by CURL and the baselines at 100k time steps (Atari100k). CURL achieves state-of-the-art results on 14 of the 26 environments.

Project overview

Installation

All dependencies are listed in the conda_env.yml file. They can be installed manually or with the following command:

    conda env create -f conda_env.yml

Usage

To train a CURL agent on the cartpole swingup task from image-based observations, run bash script/run.sh from the root of the repository. The run.sh file contains the following command, which can be modified to try different environments and hyperparameters.

    CUDA_VISIBLE_DEVICES=0 python train.py \
        --domain_name cartpole \
        --task_name swingup \
        --encoder_type pixel \
        --action_repeat 8 \
        --save_tb --pre_transform_image_size 100 --image_size 84 \
        --work_dir ./tmp \
        --agent curl_sac --frame_stack 3 \
        --seed -1 --critic_lr 1e-3 --actor_lr 1e-3 --eval_freq 10000 --batch_size 128 --num_train_steps 1000000

In the console, you should see output like the following:

    | train | E: 221 | S: 28000 | D: 18.1 s | R: 785.2634 | BR: 3.8815 | A_LOSS: -305.7328 | CR_LOSS: 190.9854 | CU_LOSS: 0.0000
    | train | E: 225 | S: 28500 | D: 18.6 s | R: 832.4937 | BR: 3.9644 | A_LOSS: -308.7789 | CR_LOSS: 126.0638 | CU_LOSS: 0.0000
    | train | E: 229 | S: 29000 | D: 18.8 s | R: 683.6702 | BR: 3.7384 | A_LOSS: -311.3941 | CR_LOSS: 140.2573 | CU_LOSS: 0.0000
    | train | E: 233 | S: 29500 | D: 19.6 s | R: 838.0947 | BR: 3.7254 | A_LOSS: -316.9415 | CR_LOSS: 136.5304 | CU_LOSS: 0.0000

The maximum score for cartpole swingup is about 845 points. Note that CURL solves visual cartpole in fewer than 50k steps. Training takes roughly an hour, depending on your GPU. For reference, the state-of-the-art end-to-end method D4PG needs 50M timesteps to solve the same problem.

    Log abbreviation mapping:

    train - training episode
    E - total number of episodes
    S - total number of environment steps
    D - duration in seconds to train 1 episode
    R - mean episode reward
    BR - average reward of sampled batch
    A_LOSS - average loss of actor
    CR_LOSS - average loss of critic
    CU_LOSS - average loss of the CURL encoder
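
If you want to post-process these console lines, a small parser along the following lines may help. It is a sketch that assumes the pipe-separated `NAME: value` layout of the sample output above; the function name is hypothetical and nothing here is part of the repository.

```python
def parse_log_line(line):
    """Split '| train | E: 221 | S: 28000 | ...' into a tag and a dict of floats."""
    fields = [field.strip() for field in line.strip().strip("|").split("|")]
    tag, metrics = fields[0], {}
    for field in fields[1:]:
        name, _, value = field.partition(":")
        # Keep only the number; this drops a trailing unit such as the 's' in '18.1 s'.
        metrics[name.strip()] = float(value.split()[0])
    return tag, metrics

sample = ("| train | E: 221 | S: 28000 | D: 18.1 s | R: 785.2634 | BR: 3.8815 "
          "| A_LOSS: -305.7328 | CR_LOSS: 190.9854 | CU_LOSS: 0.0000")
tag, metrics = parse_log_line(sample)
print(tag, metrics["S"], metrics["R"])
```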

All data related to a run is stored in the specified working_dir (set with the --work_dir flag). To enable model or video saving, use the --save_model or --save_video flags. For all available flags, check train.py. To visualize a run with tensorboard, run:

    tensorboard --logdir log --port 6006

Then open localhost:6006 in your browser. If that does not work, try port forwarding over ssh.

For GPU-accelerated rendering, make sure EGL is installed on your machine and set export MUJOCO_GL=egl.

  • Original article: http://news.51cto.com/art/202005/617645.htm