
Automating Text Processing in One Line of Python Code

PanChuang AI (磐创AI)
Published 2021-09-03 16:34:44


Shared by PanChuang AI (磐创AI)

Author | Satyam Kumar

Translated by | VK   Source | Towards Data Science

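Natural language processing (NLP) is a subfield of artificial intelligence concerned with the interaction between computers and natural language. It revolves around training data science models that can understand natural language and carry out tasks on it.

A typical NLP project follows a pipeline to train a model. The pipeline steps include text cleaning, tokenization, stemming, and encoding into numeric vectors, followed by model training.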
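
The pipeline stages listed above can be sketched with the standard library alone. This is a toy illustration, not part of the article's library: the helper names (`clean_text`, `tokenize`, `stem`, `encode`) are hypothetical, the stemmer only strips a few English suffixes, and the encoding is a plain bag-of-words count.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase the text and blank out everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def tokenize(text):
    """Split the cleaned text on whitespace."""
    return text.split()

def stem(token):
    """Toy stemmer: strip one common English suffix (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def encode(tokens, vocabulary):
    """Encode a token list as a bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["Cleaning text is FUN!!", "Models need cleaned texts."]
stemmed = [[stem(t) for t in tokenize(clean_text(d))] for d in docs]
vocabulary = sorted({t for doc in stemmed for t in doc})
vectors = [encode(doc, vocabulary) for doc in stemmed]

print(vocabulary)  # ['clean', 'fun', 'is', 'model', 'need', 'text']
print(vectors)     # [[1, 1, 1, 0, 0, 1], [1, 0, 0, 1, 1, 1]]
```

The vectors are what a model would be trained on; everything before that point is preprocessing, which is the part the rest of the article focuses on.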

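Datasets for NLP tasks consist of text, mostly taken from the internet. In most cases the text used for NLP modeling is dirty and needs to be cleaned at an early stage of data processing. Data scientists spend most of their time on data preprocessing, which includes cleaning text data.

In this article we discuss an interesting library, CleanText, which simplifies the cleaning of text data and speeds up the data preprocessing workflow.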

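What is CleanText

CleanText is an open-source Python library for cleaning text data scraped from the web or social media. It lets developers create a normalized representation of text. CleanText uses ftfy, unidecode, and various other hard-coded rules (including regular expressions) to convert corrupted or dirty input text into clean text that can be processed further to train NLP models.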

Installation:

The CleanText library can be installed from PyPI with the following command:

pip install clean-text

Once installed, the library can be imported as follows:

from cleantext import clean

Usage:

The CleanText library provides a single function, clean, which accepts a range of parameters that can be tuned to control the cleaning. clean can perform 11 types of cleaning, including:

Unicode:

s1 = 'Zürich'
clean(s1, fix_unicode=True)

# Output: zurich
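
To give a feel for the kind of damage that Unicode fixing targets, here is a minimal standard-library sketch of repairing text that was decoded with the wrong codec. It is not ftfy's algorithm (ftfy detects and repairs many more cases heuristically); `repair_mojibake` is a hypothetical helper, and the Latin-1/UTF-8 pairing is an assumption chosen for the demonstration.

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the right one."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is not possible, so the text was probably fine already.
        return text

# Simulate the damage: UTF-8 bytes read as if they were Latin-1.
broken = "Zürich".encode("utf-8").decode("latin-1")
fixed = repair_mojibake(broken)

print(broken)  # ZÃ¼rich
print(fixed)   # Zürich
```

Text that is already correct is returned unchanged here, because re-decoding its Latin-1 bytes as UTF-8 fails and the helper falls back to the input.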

ASCII:

Converts the text to its closest ASCII representation.

s2 = "ko\u017eu\u0161\u010dek"
clean(s2, to_ascii=True)

# Output: kozuscek
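
A rough standard-library approximation of ASCII conversion is to decompose accented characters and drop the combining marks. This is only a sketch of the idea: `to_ascii_sketch` is a hypothetical helper, and unlike unidecode it simply drops characters that have no decomposition.

```python
import unicodedata

def to_ascii_sketch(text):
    """Split accented letters into base letter + combining mark, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii_sketch("ko\u017eu\u0161\u010dek"))  # kozuscek
print(to_ascii_sketch("Zürich"))                   # Zurich
print(to_ascii_sketch("straße"))                   # strae (the ß is lost)
```

The last line shows why a transliteration table is preferable to plain normalization: a character such as ß has no decomposition, so this sketch discards it instead of mapping it to "ss".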

Lower:

Converts the text to lowercase.

s3 = "My Name is SATYAM"
clean(s3, lower=True)

# Output: my name is satyam

Replace URLs, emails, and phone numbers:

Replaces all URLs, email addresses, or phone numbers in the text with a special token.

s4 = "https://www.Google.com and https://www.Bing.com are popular search engines. You can mail me at satkr7@gmail.com. If not replied call me at 9876543210"

# Each no_* flag switches a replacement on; the matching replace_with_* sets its token
clean(s4, no_urls=True, replace_with_url="URL",
      no_emails=True, replace_with_email="EMAIL",
      no_phone_numbers=True, replace_with_phone_number="PHONE")

# Output: url and url are popular search engines. you can mail me at email. if not replied call me at phone
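
The token replacement described above boils down to pattern substitution. The sketch below uses deliberately simplified regular expressions; the patterns and the `replace_contacts` helper are illustrative assumptions, not the rules CleanText ships, and they would miss many real-world formats (for example, a URL directly followed by a period would swallow the period).

```python
import re

# Simplified, illustrative patterns only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{10}\b")  # assumes plain 10-digit numbers

def replace_contacts(text):
    """Replace URLs, email addresses, and 10-digit phone numbers with tokens."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

sample = "Visit https://www.example.com or mail me at user@example.com. Call 9876543210"
print(replace_contacts(sample))
# Visit <URL> or mail me at <EMAIL>. Call <PHONE>
```

Replacing these items with tokens rather than deleting them keeps the information that "a URL was here", which a downstream model can still use.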

Replace currency:

Replaces all currency symbols in the text with a special token.

s5 = "I want ₹ 40"

clean(s5, no_currency_symbols=True)
# Output: i want <cur> 40

clean(s5, no_currency_symbols=True, replace_with_currency_symbol="Rupees")
# Output: i want rupees 40
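
One way to find currency symbols without listing them by hand is the Unicode general category "Sc" (currency symbol). The sketch below takes that approach; it is an assumption made for illustration, not necessarily how CleanText detects currencies, and `replace_currency` is a hypothetical helper.

```python
import unicodedata

def replace_currency(text, token="<CUR>"):
    """Replace every character in Unicode category Sc (currency symbol) with a token."""
    return "".join(
        token if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )

print(replace_currency("I want ₹ 40"))        # I want <CUR> 40
print(replace_currency("5 € or 3 $", "CUR"))  # 5 CUR or 3 CUR
```

This catches single-character symbols only; currency codes or multi-letter abbreviations would need an explicit list.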

Remove digits:

Replaces all digits with a special token, or removes them.

s7 = 'abc123def456ghi789zero0'

# By default each digit is replaced with "0"
clean(s7, no_digits=True)
# Output: abc000def000ghi000zero0

# An empty replacement removes the digits entirely
clean(s7, no_digits=True, replace_with_digit="")
# Output: abcdefghizero
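
The two behaviors shown above, masking each digit and removing digits, can be reproduced with a single regular-expression substitution. This is a standard-library sketch with a hypothetical helper name, `mask_digits`, not the library's own code.

```python
import re

def mask_digits(text, replacement="0"):
    """Replace every individual digit with `replacement` (use "" to delete digits)."""
    return re.sub(r"\d", replacement, text)

s = "abc123def456ghi789zero0"
print(mask_digits(s))      # abc000def000ghi000zero0
print(mask_digits(s, ""))  # abcdefghizero
```

Masking keeps the length and shape of the original string, which can matter when token positions are used later; deletion gives shorter, cleaner tokens.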

Replace punctuation:

Removes all punctuation in the text, or replaces it with a special token.

s6 = "40,000 is greater than 30,000."
clean(s6, no_punct=True)

# Output: 40000 is greater than 30000
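
Punctuation can be identified through Unicode general categories: every category that starts with "P" is a punctuation class. The sketch below strips characters on that basis. It is a standard-library illustration with a hypothetical helper, `strip_punct`, and makes no claim about the library's exact rules.

```python
import unicodedata

def strip_punct(text, replacement=""):
    """Replace every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        replacement if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )

print(strip_punct("40,000 is greater than 30,000."))  # 40000 is greater than 30000
print(strip_punct("<url> costs $5!"))                 # <url> costs $5
```

Note that `<`, `>`, and `$` survive in the second call: Unicode classifies them as symbols rather than punctuation, so a category-based filter leaves angle-bracket tokens such as `<url>` intact.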
×éºÏËùÓвÎÊý£º

ÎÒÃÇÒѾ­·Ö±ðÌÖÂÛÁËÉÏÊöËùÓвÎÊý¡£ÏÖÔÚ£¬ÈÃÎÒÃÇÔÚCleanº¯ÊýÖÐ×éºÏËùÓÐÕâЩº¯Êý£¬ÎªÊ¾ÀýÎı¾µ÷ÓÃËü£¬²¢¹Û²ì¸É¾»µÄÎı¾½á¹û¡£

´úÂëÓïÑÔ£ºjavascript
¸´ÖÆ
from cleantext import clean

text = """
Z¨¹rich has a famous website https://www.zuerich.com/ 
WHICH ACCEPTS 40,000 € and adding a random string, :
abc123def456ghi789zero0 for this demo. Also remove punctions ,. 
my phone number is 9876543210 and mail me at satkr7@gmail.com.' 
     """

clean_text = clean(s8, 
      fix_unicode=True, 
      to_ascii=True, 
      lower=True, 
      no_line_breaks=True,
      no_urls=True, 
      no_numbers=True, 
      no_digits=True, 
      no_currency_symbols=True, 
      no_punct=True, 
      replace_with_punct="", 
      replace_with_url="<URL>", 
      replace_with_number="<NUMBER>", 
      replace_with_digit="", 
      replace_with_currency_symbol="<CUR>",
      lang='en')

print(clean_text)

# Output: zurich has a famous website <url> which accepts <number> <cur> and adding a random string abcdefghizero for this demo also remove punctions my phone number is <number> and mail me at satkrgmailcom
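
In the output above, standalone numbers such as 40,000 became `<number>`, while digits embedded in a word were simply removed. That result is consistent with whole numbers being replaced before individual digits are stripped. The sketch below illustrates why the order matters; the patterns and function names are simplified assumptions for the demonstration, not the library's implementation.

```python
import re

# Simplified stand-ins: a "number" is a digit run (optionally with thousands
# separators or a decimal part) that does not start inside a word.
NUMBER_RE = re.compile(r"(?<![\w.,])\d+(?:,\d{3})*(?:\.\d+)?\b")
DIGIT_RE = re.compile(r"\d")

def numbers_then_digits(text):
    """Replace whole numbers first, then delete any leftover digits."""
    text = NUMBER_RE.sub("<number>", text)
    return DIGIT_RE.sub("", text)

def digits_then_numbers(text):
    """Delete digits first; nothing is left for the number pattern to match."""
    text = DIGIT_RE.sub("", text)
    return NUMBER_RE.sub("<number>", text)

sample = "accepts 40,000 and abc123def456"
print(numbers_then_digits(sample))  # accepts <number> and abcdef
print(digits_then_numbers(sample))  # accepts , and abcdef
```

When chaining hand-written cleaning steps, the broader pattern generally has to run before the narrower one that would destroy its matches.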

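So, with a single line of Python code, you can clean dirty text data and move on to further preprocessing.

Conclusion

CleanText is an efficient library that processes or cleans scraped, dirty data and produces normalized, clean text output in a single line of code. Developers only need to adjust the parameters to their needs. It simplifies the data scientist's work, since they no longer have to write many lines of complex regular-expression code to clean text.

CleanText works not only on English input text but also on German; just set lang='de'.

The CleanText library covers only a limited set of text-cleaning parameters and leaves room for improvement. Even so, developers can use it for some of the cleaning tasks and hand-code the rest.

Read the article below to learn about AutoNLP, an automated NLP library.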

https://medium.com/swlh/autonlp-sentiment-analysis-in-5-lines-of-python-code-7b2cd2c1e8ab

References:

[1] Clean-Text Repository: https://github.com/jfilter/clean-text

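This article is part of the Tencent Cloud self-media sharing program and is shared from the PanChuang AI (磐创AI) WeChat official account.
Originally published: 2021-08-26. In case of copyright infringement, contact cloudcommunity@tencent.com for removal.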
