Pixel-based RL strikes back: BAIR proposes an algorithm that combines contrastive learning with RL, with sample efficiency that rivals state-based RL.
At its core, this research answers one question: can RL that uses images as observations (pixel-based) be as effective as RL that uses coordinate states as observations? Conventionally, pixel-based RL has been regarded as data-inefficient, typically requiring 100 million interaction steps to solve benchmark tasks such as Atari games.
The researchers introduce CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL uses contrastive learning to extract high-level features from raw pixels and performs off-policy control on top of the extracted features. On complex tasks in the DeepMind Control Suite and Atari Games, CURL outperforms prior pixel-based methods, both model-based and model-free, with performance gains of 2.8x and 1.6x respectively on the 100K interaction-step benchmarks. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample efficiency and performance of methods that use state-based features.
CURL is a general framework for combining contrastive learning with RL. In principle, any RL algorithm, on-policy or off-policy, can be used in the CURL pipeline. For the continuous control benchmark (DM Control), the team used the well-known Soft Actor-Critic (SAC) (Haarnoja et al., 2018); for the discrete control benchmark (Atari), the team used Rainbow DQN (Hessel et al., 2017). Below, we briefly review SAC, Rainbow DQN and contrastive learning.
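Soft Actor Critic
SAC is an off-policy RL algorithm that optimizes a stochastic policy to maximize the expected trajectory return. Like other SOTA end-to-end RL algorithms, SAC is very effective at solving tasks from state observations but fails to learn effective policies from pixels.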
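Rainbow
Rainbow DQN (Hessel et al., 2017) is best summarized as a set of improvements on top of the original Nature DQN (Mnih et al., 2015). Specifically, the Deep Q-Network (DQN) (Mnih et al., 2015) combines the off-policy algorithm Q-Learning with a convolutional neural network as the function approximator, mapping raw pixels to action-value functions.
In addition, distributional reinforcement learning (Bellemare et al., 2017) proposed a technique for predicting a distribution over possible value-function bins via the C51 algorithm. Rainbow DQN combines all of the above techniques in a single off-policy algorithm to achieve state-of-the-art sample efficiency on the Atari benchmark. Rainbow also uses multi-step returns (Sutton et al., 1998).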
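Contrastive learning
A key component of CURL is the ability to learn rich representations of high-dimensional data using contrastive unsupervised learning. Contrastive learning can be understood as a differentiable dictionary look-up task. Given a query q, keys K = {k_0, k_1, ...} and an explicit partition of K (with respect to q), P(K) = ({k_+}, K \ {k_+}), the goal of contrastive learning is to ensure that q matches k_+ more closely than any key in K \ {k_+}. In contrastive learning, q, K, k_+ and K \ {k_+} are also called the anchor, the targets, the positive and the negatives, respectively.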
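One common way to write this look-up objective is the InfoNCE loss, which this article names in later sections. The sketch below uses a generic score function sim(q, k); that symbol is an assumption here and stands for whatever similarity measure is chosen. Minimizing the loss pushes the score of the positive key above the scores of all other keys in K.

```latex
\mathcal{L}_q = -\log \frac{\exp\big(\mathrm{sim}(q, k_+)\big)}{\sum_{k \in K} \exp\big(\mathrm{sim}(q, k)\big)}
```

Because k_+ is itself a member of K, the fraction is the softmax probability assigned to the positive key, so the loss has the form of a cross-entropy over the dictionary.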
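CURL changes the underlying RL algorithm minimally by training the contrastive objective as an auxiliary loss during batch updates. In the experiments, the researchers trained CURL together with two model-free RL algorithms: SAC for the DMControl experiments and Rainbow DQN for the Atari experiments.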
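Overview of the overall framework
CURL uses an instance discrimination method similar to SimCLR, MoCo and CPC. Most deep RL frameworks take a sequence of stacked images as input. The algorithm therefore performs instance discrimination across stacks of frames rather than single image instances.
The researchers found that encoding the targets with a momentum encoding procedure similar to MoCo performs well in RL. Finally, they use a bilinear inner product similar to the one in CPC for the InfoNCE score function, which they found to work better than the unit norm vector products used in MoCo and SimCLR. The contrastive representation is trained jointly with the RL algorithm and receives gradients from both the contrastive objective and the Q-function. The overall framework is shown in the figure below.
Figure 2: Schematic of the overall CURL framework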
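Discrimination objective
Selecting the positives and negatives with respect to an anchor is one of the key components of contrastive representation learning.
Unlike discriminating image patches within the same image, discriminating between transformed image instances optimizes a simpler instance discrimination objective with the InfoNCE loss and requires minimal architectural adjustments. In the RL setting, there are two main reasons for preferring the simpler discrimination objective:
Therefore, CURL uses instance discrimination rather than patch discrimination. Contrastive instance discrimination setups such as SimCLR and MoCo can be viewed as maximizing the mutual information between an image and its augmented version.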
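Query-key pair generation
As in instance discrimination in the image setting, the anchor and the positive observation are two different augmentations of the same image, while the negatives come from other images. CURL relies mainly on the random crop data augmentation, in which a random square patch is cropped from the original rendered image.
The researchers apply the random augmentation across the batch but keep it consistent across the frames of each stack, so that information about the temporal structure of the observation is retained. The augmentation procedure is shown in Figure 3.
Figure 3: Visual illustration of using random crops to produce an anchor and its positive.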
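The cropping scheme described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `random_crop`, the channel layout (3 stacked RGB frames stored as 9 channels) and the 100 to 84 pixel sizes are assumptions, the sizes being taken from the `--pre_transform_image_size 100 --image_size 84` flags of the training command in this article.

```python
import numpy as np

def random_crop(frame_stacks, out_size, rng):
    """Crop one random square window per observation.

    frame_stacks has shape (batch, channels, height, width), where the
    channel axis holds the stacked frames of a single observation, so all
    frames of one stack share the same crop window.
    """
    batch, _, height, width = frame_stacks.shape
    tops = rng.integers(0, height - out_size + 1, size=batch)
    lefts = rng.integers(0, width - out_size + 1, size=batch)
    crops = [
        obs[:, top:top + out_size, left:left + out_size]
        for obs, top, left in zip(frame_stacks, tops, lefts)
    ]
    return np.stack(crops)

rng = np.random.default_rng(0)
# Hypothetical batch: 4 observations, each 3 stacked RGB frames (9 channels).
obs = rng.random((4, 9, 100, 100))
anchors = random_crop(obs, 84, rng)    # queries
positives = random_crop(obs, 84, rng)  # keys: a second crop of the same stacks
```

Calling the function twice on the same batch gives two independently cropped views of each observation, which play the roles of anchor and positive; the crops of the other observations in the batch can serve as negatives.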
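Similarity measure
Another determining factor in the discrimination objective is the inner product used to measure agreement between query-key pairs. CURL uses the bilinear inner product sim(q, k) = q^T W k, where W is a learned parameter matrix. The team found that this similarity measure outperforms the normalized dot product used in recent state-of-the-art contrastive learning methods in computer vision, such as MoCo and SimCLR.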
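To make the bilinear score concrete, here is a minimal NumPy sketch that computes q^T W k for every query-key pair in a batch and feeds the result into an InfoNCE-style cross-entropy. The function names, the batch and feature sizes, the identity matrix standing in for the learned W, and the choice to treat the other keys in the batch as negatives are all assumptions made for illustration.

```python
import numpy as np

def bilinear_logits(q, k, W):
    """Return logits[i, j] = q_i^T W k_j for a batch of queries and keys."""
    return q @ W @ k.T

def info_nce_loss(logits):
    """Cross-entropy where the positive key of query i sits in column i."""
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
batch, dim = 8, 16
q = rng.normal(size=(batch, dim))
k = q + 0.01 * rng.normal(size=(batch, dim))  # keys close to their queries
W = np.eye(dim)                                # stand-in for the learned matrix
loss = info_nce_loss(bilinear_logits(q, k, W))
```

In a real training loop W would be a trainable parameter updated together with the encoder; here it is fixed only so that the example runs on its own.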
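Momentum target encoding
The goal of contrastive learning in CURL is to train an encoder that maps high-dimensional pixels to more semantic latent states. InfoNCE is an unsupervised loss that learns encoders f_q and f_k, which map the raw anchors (queries) x_q and targets (keys) x_k to latents q = f_q(x_q) and k = f_k(x_k); the similarity measure is then applied to these latents. Typically the same encoder is shared between the anchor and target mappings, i.e. f_q = f_k.
CURL combines frame-stack instance discrimination with momentum encoding of the targets, while RL is performed on top of the encoder features.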
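The article does not spell out the momentum update, so the following is a sketch under the usual MoCo-style assumption: the key encoder is kept as an exponential moving average of the query encoder, key <- m * key + (1 - m) * query. The function name, the dictionary representation of parameters and the momentum value are hypothetical; real encoders would hold tensors rather than scalars.

```python
def momentum_update(query_params, key_params, m=0.99):
    """Move each key-encoder parameter toward the query-encoder value.

    Implements key <- m * key + (1 - m) * query, updating key_params in place.
    """
    for name, q_value in query_params.items():
        key_params[name] = m * key_params[name] + (1.0 - m) * q_value
    return key_params

# Toy scalar "parameters" for illustration only.
query = {"conv.weight": 1.0, "fc.bias": -2.0}
key = {"conv.weight": 0.0, "fc.bias": 0.0}
for _ in range(3):
    momentum_update(query, key, m=0.9)
```

With this kind of update the key encoder changes slowly, which keeps the targets stable while the query encoder is trained by gradient descent.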
The researchers evaluate (i) sample efficiency, by measuring how many interaction steps the best-performing baseline needs to match CURL's performance at 100k interaction steps, and (ii) performance at 100k steps, by measuring the ratio of the episode return achieved by CURL to that of the best-performing baseline. In other words, data efficiency or sample efficiency refers to (i), and performance refers to (ii).
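The two measures can be illustrated with a small calculation. The helper names and every number below are made up for illustration and are not results from the paper; the learning curve is assumed to be a list of (step, score) pairs sorted by step.

```python
def steps_to_match(curve, target_score):
    """Return the first step at which the baseline reaches target_score.

    curve is a list of (step, score) pairs sorted by step. Returns None if
    the baseline never reaches the target.
    """
    for step, score in curve:
        if score >= target_score:
            return step
    return None

def performance_ratio(curl_score, baseline_score):
    """Ratio of CURL's episode return to the baseline's at the same step."""
    return curl_score / baseline_score

# Made-up numbers for illustration only; not results from the paper.
baseline_curve = [(100_000, 200.0), (200_000, 450.0), (400_000, 620.0)]
curl_at_100k = 600.0

match_step = steps_to_match(baseline_curve, curl_at_100k)      # measure (i)
efficiency_gain = match_step / 100_000
ratio = performance_ratio(curl_at_100k, baseline_curve[0][1])  # measure (ii)
```

In this toy case the baseline needs four times as many steps to reach the same score, and CURL's return at 100k steps is three times the baseline's.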
DMControl
Main findings of the DMControl experiments:
Table 1. Scores achieved by CURL and baselines on DMControl at the 500k (DMControl500k) and 100k (DMControl100k) environment-step benchmarks.
Figure 4. Performance of CURL coupled with SAC, averaged over 10 seeds, relative to the SLAC, PlaNet, Pixel SAC and State SAC baselines.
Figure 6. Number of steps that Dreamer, the prior leading pixel-based method, needs to reach the score CURL achieves at 100k training steps.
Figure 7. Comparison of CURL with state-based SAC, run with 2 seeds on each of 16 selected DMControl environments.
Atari
Main findings of the Atari experiments:
Table 2. Scores achieved by CURL at the 100k time-step benchmark (Atari100k). CURL achieves SOTA on 14 of the 26 environments.
Installation
All dependencies are listed in the conda_env.yml file. They can be installed manually or with the following command:
conda env create -f conda_env.yml
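Instructions
To train a CURL agent on the cartpole swingup task from image-based observations, run bash script/run.sh from the root of this directory. The run.sh file contains the following command, which can also be modified to try different environments or hyperparameters.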
CUDA_VISIBLE_DEVICES=0 python train.py \
    --domain_name cartpole \
    --task_name swingup \
    --encoder_type pixel \
    --action_repeat 8 \
    --save_tb --pre_transform_image_size 100 --image_size 84 \
    --work_dir ./tmp \
    --agent curl_sac --frame_stack 3 \
    --seed -1 --critic_lr 1e-3 --actor_lr 1e-3 --eval_freq 10000 --batch_size 128 --num_train_steps 1000000
In the console, you should see output like the following:
| train | E: 221 | S: 28000 | D: 18.1 s | R: 785.2634 | BR: 3.8815 | A_LOSS: -305.7328 | CR_LOSS: 190.9854 | CU_LOSS: 0.0000
| train | E: 225 | S: 28500 | D: 18.6 s | R: 832.4937 | BR: 3.9644 | A_LOSS: -308.7789 | CR_LOSS: 126.0638 | CU_LOSS: 0.0000
| train | E: 229 | S: 29000 | D: 18.8 s | R: 683.6702 | BR: 3.7384 | A_LOSS: -311.3941 | CR_LOSS: 140.2573 | CU_LOSS: 0.0000
| train | E: 233 | S: 29500 | D: 19.6 s | R: 838.0947 | BR: 3.7254 | A_LOSS: -316.9415 | CR_LOSS: 136.5304 | CU_LOSS: 0.0000
The maximum score for cartpole swing up is around 845 points. Note that CURL solves visual cartpole in fewer than 50k steps. Depending on your GPU, training takes about an hour. For reference, the state-of-the-art end-to-end method D4PG requires 50M timesteps to solve the same problem.
Log abbreviation mapping:
train - training episode
E - total number of episodes
S - total number of environment steps
D - duration in seconds to train 1 episode
R - mean episode reward
BR - average reward of sampled batch
A_LOSS - average loss of actor
CR_LOSS - average loss of critic
CU_LOSS - average loss of the CURL encoder
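If you want to post-process these console lines yourself, a small parser along the following lines may help. It is a sketch based only on the sample output shown in this article; the function name `parse_log_line` is hypothetical and the repository may already log the same values in another form.

```python
def parse_log_line(line):
    """Parse '| train | E: 221 | S: 28000 | D: 18.1 s | ...' into a dict."""
    fields = [part.strip() for part in line.strip().strip("|").split("|")]
    record = {"phase": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        # Keep only the first token so the unit in 'D: 18.1 s' is dropped.
        record[key.strip()] = float(value.split()[0])
    return record

line = ("| train | E: 221 | S: 28000 | D: 18.1 s | R: 785.2634 | "
        "BR: 3.8815 | A_LOSS: -305.7328 | CR_LOSS: 190.9854 | CU_LOSS: 0.0000")
record = parse_log_line(line)
```

Each abbreviation from the mapping above becomes a dictionary key, so for example `record["R"]` holds the mean episode reward for that line.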
All data related to the run is stored in the specified working directory (--work_dir). To enable saving of models or videos, use --save_model or --save_video. For the full list of available flags, see train.py. To visualize a run with tensorboard:
tensorboard --logdir log --port 6006
Then open localhost:6006 in your browser. If that does not work, try port forwarding over ssh.
For GPU-accelerated rendering, make sure EGL is installed on your machine and set export MUJOCO_GL=egl.