
Advanced Pandas Tutorial: Working with Text Data

Published: 2021-06-23 00:00


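Introduction

Before pandas 1.0, the only way to store text data was the object dtype. Version 1.0 added a dedicated data type, StringDtype. This article walks through how pandas handles text data.

Creating text data

First, a common way to build a Series from text. The examples assume that pandas has been imported as pd and NumPy as np.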

In [1]: pd.Series(["a", "b", "c"])
Out[1]: 
0    a
1    b
2    c
dtype: object

To use the new StringDtype, pass it explicitly, either by name or as an instance:

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string
In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or convert an existing Series with astype:

In [4]: s = pd.Series(["a", "b", "c"])
In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object
In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string
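As a quick check of the two routes described above, the following sketch builds the same data once with an explicit dtype and once through astype. The variable names and the None entry are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Build the same data once with an explicit StringDtype and once by conversion.
explicit = pd.Series(["a", "b", None], dtype="string")
converted = pd.Series(["a", "b", None]).astype("string")

# Both routes end up with the dedicated string dtype.
print(explicit.dtype)
print(converted.dtype)

# The missing entry is reported as missing in both Series.
print(explicit.isna().tolist())
print(converted.isna().tolist())
```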

String methods

Through the .str accessor, strings can be converted to upper or lower case and their lengths can be computed:

In [24]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"],
   ....:               dtype="string")
   ....: 
In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string
In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string
In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64
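One practical use of these methods is filtering on the computed lengths. The sketch below uses a shortened, hypothetical version of the Series above and relies on pandas treating missing entries in a nullable boolean mask as False.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Aaba", "Baca", np.nan, "dog"], dtype="string")

lowered = names.str.lower()   # element-wise lower-casing
lengths = names.str.len()     # length of each element, missing stays missing

print(lowered.tolist())
print(lengths.tolist())

# Keep only the names longer than three characters.
# The missing entry does not satisfy the condition and is dropped.
long_names = names[lengths > 3]
print(long_names.tolist())
```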

You can also strip leading and trailing whitespace:

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

String operations on columns

Because column labels are strings, df.columns supports the same .str methods:

In [32]: df = pd.DataFrame(np.random.randn(3, 2),
   ....:                   columns=[" Column A ", " Column B "], index=range(3))
   ....: 
In [33]: df
Out[33]: 
    Column A    Column B 
0   0.469112   -0.282863
1  -1.509059   -1.135632
2   1.212112   -0.173215

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
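These methods can be chained to clean column labels in one step. The following is a minimal sketch, assuming the labels carry stray surrounding whitespace; the zero-filled frame is only a placeholder.

```python
import numpy as np
import pandas as pd

# Placeholder data; only the column labels matter here.
df = pd.DataFrame(np.zeros((2, 2)), columns=[" Column A ", " Column B "])

# Trim whitespace, lower-case, then replace inner spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(df.columns))
```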

Splitting and replacing strings

split breaks each string into a list of substrings.

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
In [39]: s2.str.split("_")
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

To access an element of the resulting lists, use get or [] on .str:

In [40]: s2.str.split("_").str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object
In [41]: s2.str.split("_").str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

Pass expand=True to spread the split pieces across multiple columns:

In [42]: s2.str.split("_", expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

You can limit the number of splits with n:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
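The n argument counts splits from the left. The rsplit method listed in the summary at the end of this article works from the right instead. A small sketch comparing the two, using a hypothetical Series without missing values:

```python
import pandas as pd

parts = pd.Series(["a_b_c", "c_d_e", "f_g_h"], dtype="string")

left = parts.str.split("_", n=1, expand=True)    # split once, from the left
right = parts.str.rsplit("_", n=1, expand=True)  # split once, from the right

print(left)
print(right)
```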

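replace substitutes text and can use regular expressions. Since pandas 2.0 the pattern is treated as a literal string unless regex=True is passed:

s3.str.replace("^.a|dog", "XX-XX", case=False, regex=True)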
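The article does not show how s3 is built, so the sketch below uses hypothetical sample data to make the call runnable. It passes regex=True explicitly so that the pattern is interpreted as a regular expression regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample data; the original s3 is not shown in the article.
s3 = pd.Series(["A", "Aaba", "Baca", "CABA", "dog", "cat"], dtype="string")

# "^.a" matches any first character followed by "a"; "dog" matches anywhere.
# case=False makes the match case-insensitive.
replaced = s3.str.replace("^.a|dog", "XX-XX", case=False, regex=True)
print(replaced.tolist())
```

The single-character "A" is left alone because the first alternative needs two characters to match.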

Concatenating strings

Use cat to concatenate the strings of a Series:

In [64]: s = pd.Series(["a", "b", "c", "d"], dtype="string")
In [65]: s.str.cat(sep=",")
Out[65]: 'a,b,c,d'
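cat also has to deal with missing values, and it can combine a Series with another list-like element by element. A brief sketch with illustrative data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "d"], dtype="string")

joined_skip = s.str.cat(sep=",")              # missing values are skipped
joined_fill = s.str.cat(sep=",", na_rep="-")  # or replaced through na_rep

# Passing another list-like concatenates element by element instead.
pairs = s.str.cat(["A", "B", "C", "D"], na_rep="-")

print(joined_skip)
print(joined_fill)
print(pairs.tolist())
```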

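Indexing with .str

When a Series holds strings, you can index into .str with [] to get the character at a given position in each element. Elements that are too short yield a missing value. For example: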

In [99]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan,
   ....:                "CABA", "dog", "cat"],
   ....:               dtype="string")
   ....: 
In [100]: s.str[0]
Out[100]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string
In [101]: s.str[1]
Out[101]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string
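The same accessor accepts slices, and get is the method form of single-position indexing. A short sketch with a hypothetical three-element Series:

```python
import pandas as pd

s = pd.Series(["A", "Aaba", "dog"], dtype="string")

first_two = s.str[:2]   # slice each element
second = s.str.get(1)   # same result as s.str[1]

print(first_two.tolist())
# "A" has no second character, so that position is missing.
print(second.isna().tolist())
```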

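extract

extract pulls the capture groups of a regular expression out of each string. It takes an expand parameter, which defaulted to False before version 0.23. With expand=False, extract returns a Series, an Index, or a DataFrame, depending on the input and the pattern. With expand=True, it always returns a DataFrame. Since version 0.23 the default is True.

extract takes a regular expression as its pattern.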

In [102]: pd.Series(["a1", "b2", "c3"],
   .....:           dtype="string").str.extract(r"([ab])(\d)", expand=False)
   .....: 
Out[102]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

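The example above breaks each string in the Series apart according to the regular expression: the first group captures a letter and the second a digit. Elements that do not match, such as c3, produce missing values.

Note that only the parts of the pattern inside capture groups are extracted.

The following extracts only the digit: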

In [106]: pd.Series(["a1", "b2", "c3"],
   .....:           dtype="string").str.extract(r"[ab](\d)", expand=False)
   .....: 
Out[106]: 
0       1
1       2
2    <NA>
dtype: string

You can also name the resulting columns by using named groups:

In [103]: pd.Series(["a1", "b2", "c3"],
   .....:           dtype="string").str.extract(r"(?P<letter>[ab])(?P<digit>\d)",
   .....:                                       expand=False)
   .....: 
Out[103]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>
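To make the effect of expand concrete, the sketch below runs a single-group pattern both ways and then a named two-group pattern with the default setting. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1", "b2", "c3"], dtype="string")

# One capture group: expand decides between a DataFrame and a Series.
as_frame = s.str.extract(r"[ab](\d)", expand=True)
as_series = s.str.extract(r"[ab](\d)", expand=False)

# Named groups become column names; the default expand gives a DataFrame.
named = s.str.extract(r"(?P<letter>[ab])(?P<digit>\d)")

print(type(as_frame).__name__, type(as_series).__name__)
print(list(named.columns))
```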

extractall

extractall is similar to extract. The difference is that extract returns only the first match in each string, while extractall returns every match. For example:

In [112]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],
   .....:               dtype="string")
   .....: 
In [113]: s
Out[113]: 
A    a1a2
B      b1
C      c1
dtype: string
In [114]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"
In [115]: s.str.extract(two_groups, expand=True)
Out[115]: 
  letter digit
A      a     1
B      b     1
C      c     1

extract stops after matching a1.

In [116]: s.str.extractall(two_groups)
Out[116]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

extractall goes on to match a2 after a1, and records the position of each match in an extra index level named match.
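Because extractall returns one row per match, grouping on the outer index level gives the number of matches per original element. A minimal sketch reusing the data from the example above; per_row is an illustrative name:

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r"(?P<letter>[a-z])(?P<digit>[0-9])"

matches = s.str.extractall(two_groups)

# The outer level is the original index, the inner level is named "match".
print(list(matches.index.names))

# Count how many matches each original row produced.
per_row = matches.groupby(level=0).size()
print(per_row.to_dict())
```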

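contains and match

contains and match test whether each element contains or begins with a match for a pattern. fullmatch requires the entire string to match.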

In [126]: pattern = r"[0-9][a-z]"
In [127]: pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],
   .....:           dtype="string").str.contains(pattern)
   .....: 
Out[127]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

In [128]: pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],
   .....:           dtype="string").str.match(pattern)
   .....: 
Out[128]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [129]: pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],
   .....:           dtype="string").str.fullmatch(pattern)
   .....: 
Out[129]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean
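The three methods correspond to re.search, re.match and re.fullmatch from the standard library. The sketch below checks that correspondence. The pattern r"[0-9][a-z]" is an assumption chosen because it is consistent with the outputs shown above, and with_gap is a hypothetical Series used to show the na argument.

```python
import re
import pandas as pd

pattern = r"[0-9][a-z]"  # assumed pattern, consistent with the outputs above
values = ["1", "2", "3a", "3b", "03c", "4dx"]
s = pd.Series(values, dtype="string")

# Reference results computed with the re module.
by_search = [bool(re.search(pattern, v)) for v in values]
by_match = [bool(re.match(pattern, v)) for v in values]
by_fullmatch = [bool(re.fullmatch(pattern, v)) for v in values]

print(s.str.contains(pattern).tolist() == by_search)
print(s.str.match(pattern).tolist() == by_match)
print(s.str.fullmatch(pattern).tolist() == by_fullmatch)

# na=False turns missing entries into False so the mask can filter rows.
with_gap = pd.Series(["3a", None, "x"], dtype="string")
kept = with_gap[with_gap.str.contains(pattern, na=False)]
print(kept.tolist())
```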

Summary of String methods

Finally, here is a summary of the methods available through .str:

Method            Description
cat()             Concatenate strings
split()           Split strings on delimiter
rsplit()          Split strings on delimiter working from the end of the string
get()             Index into each element (retrieve i-th element)
join()            Join strings in each element of the Series with passed separator
get_dummies()     Split strings on the delimiter returning DataFrame of dummy variables
contains()        Return boolean array if each string contains pattern/regex
replace()         Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
repeat()          Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad()             Add whitespace to left, right, or both sides of strings
center()          Equivalent to str.center
ljust()           Equivalent to str.ljust
rjust()           Equivalent to str.rjust
zfill()           Equivalent to str.zfill
wrap()            Split long strings into lines with length less than a given width
slice()           Slice each string in the Series
slice_replace()   Replace slice in each string with passed value
count()           Count occurrences of pattern
startswith()      Equivalent to str.startswith(pat) for each element
endswith()        Equivalent to str.endswith(pat) for each element
findall()         Compute list of all occurrences of pattern/regex for each string
match()           Call re.match on each element, returning a boolean
extract()         Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
extractall()      Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
len()             Compute string lengths
strip()           Equivalent to str.strip
rstrip()          Equivalent to str.rstrip
lstrip()          Equivalent to str.lstrip
partition()       Equivalent to str.partition
rpartition()      Equivalent to str.rpartition
lower()           Equivalent to str.lower
casefold()        Equivalent to str.casefold
upper()           Equivalent to str.upper
find()            Equivalent to str.find
rfind()           Equivalent to str.rfind
index()           Equivalent to str.index
rindex()          Equivalent to str.rindex
capitalize()      Equivalent to str.capitalize
swapcase()        Equivalent to str.swapcase
normalize()       Return Unicode normal form. Equivalent to unicodedata.normalize
translate()       Equivalent to str.translate
isalnum()         Equivalent to str.isalnum
isalpha()         Equivalent to str.isalpha
isdigit()         Equivalent to str.isdigit
isspace()         Equivalent to str.isspace
islower()         Equivalent to str.islower
isupper()         Equivalent to str.isupper
istitle()         Equivalent to str.istitle
isnumeric()       Equivalent to str.isnumeric
isdecimal()       Equivalent to str.isdecimal

This article is also available at http://www.flydean.com/06-python-pandas-text/


Reposted from: https://developer.aliyun.com/article/784844