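Most of us are familiar with the Stanford NLP library, but that library is mainly Java-based. Recently, the Stanford NLP Group, home to Christopher Manning, open-sourced a Python toolkit called Stanza, adding another major NLP player to the Python ecosystem.
The Stanford NLP Group's open-source tools are well known: a code base containing a wide range of NLP tools. The group has now released a Python version named Stanza. The library ships models for more than 60 languages and can perform NLP tasks such as named entity recognition. The release quickly drew attention from the community, and Fei-Fei Li gave the project a thumbs-up on Twitter.
The project can currently be installed directly with pip.
Project page: https://github.com/stanfordnlp/stanza
Stanza includes models for more than 60 languages, pretrained on the Universal Dependencies v2.5 datasets. These cover Simplified, Traditional, and Classical Chinese, English, French, Spanish, German, Japanese, Korean, Arabic, and others, and even less common languages such as Northern Sami.
Besides these language models, Stanza also supports named entity recognition models for dozens of languages. The complete list is as follows:
[Figure: list of languages with named entity recognition models]
According to the Stanza paper, Stanza covers several natural language processing tasks, such as tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. It also provides a Python interface to the familiar Stanford CoreNLP library, extending the functionality already available.
It is also worth noting that Stanza is built entirely on a neural network pipeline. The researchers pretrained on 112 datasets while using one and the same model architecture. They found that this single neural architecture generalizes well and performs well on all languages. The whole neural pipeline is implemented in PyTorch.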
Getting started with the neural pipeline
To run your first Stanza pipeline, enter the following in a Python interpreter:
>>> import stanza
>>> stanza.download('en')  # This downloads the English models for the neural pipeline
# IMPORTANT: The above line prompts you before downloading, which doesn't work well in a Jupyter notebook.
# To avoid a prompt when using notebooks, instead use: >>> stanza.download('en', force=True)
>>> nlp = stanza.Pipeline()  # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()
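The last command prints the words of the first sentence of the input string (or Document, as it is represented in Stanza). For each word it also prints the index of the word that governs it in the sentence's Universal Dependencies parse (its "head") and the dependency relation between the two words. The output is: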
('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
The getting started guide has more details.
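To make the head indices in that output easier to read, here is a minimal sketch in plain Python that maps each 1-based head index back to the word it points at. The tuples are copied from the output shown above. The helper name `resolve_heads` and the `"ROOT"` label for index 0 are illustrative choices, not part of Stanza.

```python
# Tuples as printed by print_dependencies() for the first sentence.
deps = [
    ("Barack", "4", "nsubj:pass"),
    ("Obama", "1", "flat"),
    ("was", "4", "aux:pass"),
    ("born", "0", "root"),
    ("in", "6", "case"),
    ("Hawaii", "4", "obl"),
    (".", "4", "punct"),
]


def resolve_heads(deps):
    """Replace each 1-based head index with the head word (hypothetical helper)."""
    words = [word for word, _, _ in deps]
    resolved = []
    for word, head, relation in deps:
        head_index = int(head)
        # Index 0 marks the root of the sentence; "ROOT" is our own label for it.
        head_word = "ROOT" if head_index == 0 else words[head_index - 1]
        resolved.append((word, relation, head_word))
    return resolved


for word, relation, head_word in resolve_heads(deps):
    print(f"{word} --{relation}--> {head_word}")
```

Read this way, "Barack" is the passive subject of "born", and "in" is a case marker attached to "Hawaii".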
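Accessing the Java Stanford CoreNLP software
Besides the neural pipeline, the package also includes an official wrapper for accessing the Java Stanford CoreNLP software from Python code.
Some initial setup is required before it can be used.
The documentation has comprehensive examples showing how to use CoreNLP through Stanza and extract annotations from it.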
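As a rough illustration of that wrapper, the sketch below defines a function that starts a CoreNLP server through Stanza's `CoreNLPClient` and collects a few token-level annotations. It assumes that Stanza is installed, that CoreNLP has been downloaded separately, and that the `CORENLP_HOME` environment variable points at its folder. The function name `annotate_with_corenlp`, the annotator list, the timeout, and the memory setting are all illustrative choices.

```python
def annotate_with_corenlp(text):
    """Return (word, POS tag, NER tag) triples for every token in `text`.

    Hypothetical helper: it needs a local CoreNLP installation, so it is
    only defined here and not called.
    """
    # Imported lazily so this file can be loaded without Stanza installed.
    from stanza.server import CoreNLPClient

    # Assumption: CORENLP_HOME is set so the client can launch the Java server.
    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
        timeout=30000,  # milliseconds
        memory="4G",
    ) as client:
        annotation = client.annotate(text)
        return [
            (token.word, token.pos, token.ner)
            for sentence in annotation.sentence
            for token in sentence.token
        ]


# Example call, once CoreNLP is set up:
#   annotate_with_corenlp("Barack Obama was born in Hawaii.")
```

Using the client as a context manager lets it shut the Java server down when the block exits.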
Neural pipeline models
Models are currently provided for all Universal Dependencies v2.5 treebanks, along with NER models for a few widely used languages.
Batching to maximize pipeline speed
To get the best speed, the pipeline must be run on batches of documents. Running a for loop over one sentence at a time is very slow. The current workaround is to concatenate the documents, separating each pair with a blank line (that is, two line breaks, \n\n). The tokenizer recognizes a blank line as a sentence break.
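The concatenation step can be sketched as follows. The helper name `join_documents` is hypothetical, and the Stanza calls are left in comments because they need the library and the English models to be installed.

```python
def join_documents(documents):
    """Join documents with a blank line (two line breaks) between them."""
    # Strip each document so stray leading or trailing newlines do not
    # introduce extra blank lines, and skip empty documents entirely.
    cleaned = (doc.strip() for doc in documents)
    return "\n\n".join(doc for doc in cleaned if doc)


documents = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.\n",
]
batched_text = join_documents(documents)
print(repr(batched_text))

# With Stanza installed and the English models downloaded, the whole batch
# could then be processed in one call instead of one call per document:
#   import stanza
#   nlp = stanza.Pipeline()
#   doc = nlp(batched_text)
```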
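Training your own neural pipelines
All neural modules in the library can be trained on your own data, including the tokenizer, the multi-word token (MWT) expander, and the POS/morphological features tagger. Training through the pipeline interface is not currently supported, so you need to clone the git repository and run training from source.
The following shows how to train the neural pipeline. The project provides various bash scripts in the scripts directory to simplify the training process. To train a model, run the following command: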
bash scripts/run_${module}.sh ${corpus} ${other_args}
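Here ${module} is one of tokenize, mwt, pos, lemma, or depparse; ${corpus} is the full name of the corpus; and ${other_args} are other arguments accepted by the training script.
For example, the following command trains a tokenizer on the UD_English-EWT corpus with a batch size of 32 and a dropout rate of 0.33: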
bash scripts/run_tokenize.sh UD_English-EWT --batch_size 32 --dropout 0.33
Note that for the dependency parser you also need to specify gold|predicted to choose the type of POS tags used in the training/dev data:
bash scripts/run_depparse.sh UD_English-EWT gold
If predicted is used, the trained tagger model is first run on the training/dev data to generate the predicted tags.
By default, model files are saved to the saved_models directory during training (this can be changed with the save_dir argument).
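For readers who drive training from Python, the command pattern above can be assembled programmatically. This is only a sketch: the helper `build_train_command` is hypothetical and simply reproduces the `bash scripts/run_${module}.sh ${corpus} ${other_args}` layout; it does not call into Stanza.

```python
# Module names as listed in the text above.
MODULES = ("tokenize", "mwt", "pos", "lemma", "depparse")


def build_train_command(module, corpus, *other_args):
    """Build the argument list for one of the training scripts (hypothetical helper)."""
    if module not in MODULES:
        raise ValueError(f"unknown module: {module!r}")
    # Extra arguments are converted to strings so numbers can be passed directly.
    return ["bash", f"scripts/run_{module}.sh", corpus, *map(str, other_args)]


tokenize_cmd = build_train_command(
    "tokenize", "UD_English-EWT", "--batch_size", 32, "--dropout", 0.33
)
depparse_cmd = build_train_command("depparse", "UD_English-EWT", "gold")

print(" ".join(tokenize_cmd))
print(" ".join(depparse_cmd))

# From the root of a clone of the repository, a command could then be launched with:
#   import subprocess
#   subprocess.run(tokenize_cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when it is handed to `subprocess.run`.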
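The Stanza paper lays out the architecture of the whole code base. As the figure shows, it takes raw text as input and directly outputs structured results.
Architecture of Stanza's neural network pipeline. Besides the neural pipeline, Stanza also has a Python client interface for interacting with the Java Stanford CoreNLP.
The paper also compares Stanza with existing NLP toolkits such as spaCy. According to this comparison, Stanza currently covers the largest number of languages among libraries that reach SOTA performance and are fully neural.
Comparison with existing NLP libraries.
Finally, the researchers compared Stanza's performance on NLP tasks with existing baselines and found that Stanza exceeds the previous SOTA in most cases.
Comparison with existing baseline performance. Stanza achieves SOTA results on multiple tasks across multiple languages.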