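In data processing, pandas uses NaN to represent values that are missing or could not be parsed. This gives every cell a representation, but arithmetic that involves NaN does not produce a meaningful number.
This article explains how pandas handles NaN data.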
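An example of NaN

As mentioned above, missing data is represented as NaN. Let's look at a concrete example.
First we build a DataFrame (the sessions below assume numpy and pandas have been imported as np and pd):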

    In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
       ...:                   columns=['one', 'two', 'three'])
       ...:

    In [2]: df['four'] = 'bar'

    In [3]: df['five'] = df['one'] > 0

    In [4]: df
    Out[4]:
            one       two     three four   five
    a  0.469112 -0.282863 -1.509059  bar   True
    c -1.135632  1.212112 -0.173215  bar  False
    e  0.119209 -1.044236 -0.861849  bar   True
    f -2.104569 -0.494929  1.071804  bar  False
    h  0.721555 -0.706771 -1.039575  bar   True

The DataFrame above only has the index labels a, c, e, f and h. Let's reindex it with a fuller set of labels:

    In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

    In [6]: df2
    Out[6]:
            one       two     three four   five
    a  0.469112 -0.282863 -1.509059  bar   True
    b       NaN       NaN       NaN  NaN    NaN
    c -1.135632  1.212112 -0.173215  bar  False
    d       NaN       NaN       NaN  NaN    NaN
    e  0.119209 -1.044236 -0.861849  bar   True
    f -2.104569 -0.494929  1.071804  bar  False
    g       NaN       NaN       NaN  NaN    NaN
    h  0.721555 -0.706771 -1.039575  bar   True

Every label that has no data in the original frame (b, d and g here) becomes a row of NaN.
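When a sensible default is known in advance, reindex can fill the new labels directly instead of leaving NaN. The following is a minimal sketch; the small frame and the fill value 0 are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical frame with a sparse index, similar in spirit to df above.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0]}, index=["a", "c", "e"])

# Default behaviour: labels that did not exist before are filled with NaN.
with_nan = df.reindex(["a", "b", "c", "d", "e"])

# fill_value supplies a value for the newly introduced labels.
with_zero = df.reindex(["a", "b", "c", "d", "e"], fill_value=0)

print(with_nan["one"].isna().sum())   # number of new, empty rows
print(with_zero.loc["b", "one"])      # value placed at a new label
```

Whether a fill value is appropriate depends on the data: 0 is a reasonable default for counts, but it would silently distort measurements such as temperatures.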
To detect missing values, use the isna() or notna() methods.

    In [7]: df2['one']
    Out[7]:
    a    0.469112
    b         NaN
    c   -1.135632
    d         NaN
    e    0.119209
    f   -2.104569
    g         NaN
    h    0.721555
    Name: one, dtype: float64

    In [8]: pd.isna(df2['one'])
    Out[8]:
    a    False
    b     True
    c    False
    d     True
    e    False
    f    False
    g     True
    h    False
    Name: one, dtype: bool

    In [9]: df2['four'].notna()
    Out[9]:
    a     True
    b    False
    c     True
    d    False
    e     True
    f     True
    g    False
    h     True
    Name: four, dtype: bool

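The boolean results of isna() and notna() are usually combined with other operations rather than read directly. A minimal sketch, assuming a small hypothetical frame with the same column names: summing the mask counts missing values per column, and indexing with the mask keeps only the complete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both a numeric and a text column.
df = pd.DataFrame(
    {"one": [0.5, np.nan, -1.1, np.nan], "four": ["bar", None, "bar", "bar"]},
    index=["a", "b", "c", "d"],
)

# isna() returns a boolean frame; True counts as 1 when summed.
missing_per_column = df.isna().sum()

# notna() used as a boolean mask keeps the rows where "one" has a value.
rows_with_one = df[df["one"].notna()]

print(missing_per_column)
print(rows_with_one.index.tolist())
```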
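Note that in Python, None compares equal to itself:

    In [11]: None == None  # noqa: E711
    Out[11]: True

np.nan, however, does not compare equal to itself, which is why missing values must be detected with isna() rather than ==:

    In [12]: np.nan == np.nan
    Out[12]: False

Missing values in integer types
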
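NaN is a float, so integer data that contains it is normally upcast to float64. To keep an integer dtype, request the nullable integer type explicitly; the missing entry is then shown as <NA>:

    In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
    Out[14]:
    0       1
    1       2
    2    <NA>
    3       4
    dtype: Int64

Missing values in datetimes
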
Missing values in datetime columns are represented by NaT:

    In [15]: df2 = df.copy()

    In [16]: df2['timestamp'] = pd.Timestamp('20120101')

    In [17]: df2
    Out[17]:
            one       two     three four   five  timestamp
    a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
    c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
    e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
    f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
    h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

    In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

    In [19]: df2
    Out[19]:
            one       two     three four   five  timestamp
    a       NaN -0.282863 -1.509059  bar   True        NaT
    c       NaN  1.212112 -0.173215  bar  False        NaT
    e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
    f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
    h       NaN -0.706771 -1.039575  bar   True        NaT

    In [20]: df2.dtypes.value_counts()
    Out[20]:
    float64           3
    datetime64[ns]    1
    bool              1
    object            1
    dtype: int64

Converting between None and np.nan

For numeric data, assigning None converts the value to NaN (and an integer Series is upcast to float64):

    In [21]: s = pd.Series([1, 2, 3])

    In [22]: s.loc[0] = None

    In [23]: s
    Out[23]:
    0    NaN
    1    2.0
    2    3.0
    dtype: float64

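The upcast to float64 can be avoided with the nullable integer dtype introduced earlier. The sketch below compares the two behaviours on the same hypothetical input; the variable names are illustrative.

```python
import pandas as pd

# Default inference: None in numeric data becomes NaN and the whole
# Series is stored as float64.
as_float = pd.Series([1, None, 3])

# Nullable "Int64" dtype: the gap is stored as pd.NA and the remaining
# values stay integers.
as_int = pd.Series([1, None, 3], dtype="Int64")

print(as_float.dtype)
print(as_int.dtype)
print(as_int.isna().tolist())
```

Both variants report the gap through isna(), so code that detects missing values this way works with either representation.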
For object dtype, an assigned None is kept as is, and np.nan is stored as NaN:

    In [24]: s = pd.Series(['a', 'b', 'c'])

    In [25]: s.loc[0] = None

    In [26]: s.loc[1] = np.nan

    In [27]: s
    Out[27]:
    0    None
    1     NaN
    2       c
    dtype: object

Calculations with missing values

Arithmetic on a missing value yields a missing value. This also applies to labels that exist in only one operand, such as column three below:

    In [28]: a
    Out[28]:
            one       two
    a       NaN -0.282863
    c       NaN  1.212112
    e  0.119209 -1.044236
    f -2.104569 -0.494929
    h -2.104569 -0.706771

    In [29]: b
    Out[29]:
            one       two     three
    a       NaN -0.282863 -1.509059
    c       NaN  1.212112 -0.173215
    e  0.119209 -1.044236 -0.861849
    f -2.104569 -0.494929  1.071804
    h       NaN -0.706771 -1.039575

    In [30]: a + b
    Out[30]:
            one  three       two
    a       NaN    NaN -0.565727
    c       NaN    NaN  2.424224
    e  0.238417    NaN -2.088472
    f -4.209138    NaN -0.989859
    h       NaN    NaN -1.413542

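If a value that is missing on one side should count as zero rather than poison the result, the add method accepts a fill_value argument. A minimal sketch with two small hypothetical frames (the names a and b mirror the session above, but the data is made up):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"one": [np.nan, 1.0], "two": [2.0, 3.0]}, index=["a", "e"])
b = pd.DataFrame(
    {"one": [np.nan, 1.0], "two": [2.0, 3.0], "three": [4.0, 5.0]},
    index=["a", "e"],
)

# Plain addition: any missing operand yields NaN, so "three" is all NaN.
plain = a + b

# fill_value=0 substitutes 0 when only one side is missing.
# A cell that is missing on both sides still stays NaN.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```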
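Descriptive statistics behave differently: they skip NaN by default. For sum this is the same as treating NaN as 0, while mean averages only the values that are present (row a below is the mean of two values, not three):

    In [31]: df
    Out[31]:
            one       two     three
    a       NaN -0.282863 -1.509059
    c       NaN  1.212112 -0.173215
    e  0.119209 -1.044236 -0.861849
    f -2.104569 -0.494929  1.071804
    h       NaN -0.706771 -1.039575

    In [32]: df['one'].sum()
    Out[32]: -1.9853605075978744

    In [33]: df.mean(1)
    Out[33]:
    a   -0.895961
    c    0.519449
    e   -0.595625
    f   -0.509232
    h   -0.873173
    dtype: float64
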
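The skipping behaviour can be controlled per call. The sketch below uses a small hypothetical Series to contrast the default with skipna=False, and shows min_count, which guards against an all-NaN input silently summing to 0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                      # NaN is skipped: 1.0 + 3.0
average = s.mean()                   # mean of the two present values
strict_total = s.sum(skipna=False)   # NaN propagates into the result

all_missing = pd.Series([np.nan, np.nan])
empty_default = all_missing.sum()             # sums to 0.0 by default
empty_strict = all_missing.sum(min_count=1)   # NaN: fewer than 1 valid value

print(total, average, strict_total, empty_default, empty_strict)
```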
cumsum and cumprod also ignore NaN by default: the running total skips missing entries, while the result keeps NaN at those positions. Pass skipna=False to make a NaN propagate to every later row instead:

    In [34]: df.cumsum()
    Out[34]:
            one       two     three
    a       NaN -0.282863 -1.509059
    c       NaN  0.929249 -1.682273
    e  0.119209 -0.114987 -2.544122
    f -1.985361 -0.609917 -1.472318
    h       NaN -1.316688 -2.511893

    In [35]: df.cumsum(skipna=False)
    Out[35]:
       one       two     three
    a  NaN -0.282863 -1.509059
    c  NaN  0.929249 -1.682273
    e  NaN -0.114987 -2.544122
    f  NaN -0.609917 -1.472318
    h  NaN -1.316688 -2.511893

Filling NaN with fillna

In data analysis, NaN values usually have to be dealt with before further processing. One option is to fill them with fillna.
The following fills with a constant:

    In [42]: df2
    Out[42]:
            one       two     three four   five  timestamp
    a       NaN -0.282863 -1.509059  bar   True        NaT
    c       NaN  1.212112 -0.173215  bar  False        NaT
    e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
    f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
    h       NaN -0.706771 -1.039575  bar   True        NaT

    In [43]: df2.fillna(0)
    Out[43]:
            one       two     three four   five            timestamp
    a  0.000000 -0.282863 -1.509059  bar   True                    0
    c  0.000000  1.212112 -0.173215  bar  False                    0
    e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
    f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
    h  0.000000 -0.706771 -1.039575  bar   True                    0

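As the timestamp column above shows, a single constant is applied to every column, which is not always wanted. fillna also accepts a dict that maps column names to fill values. A minimal sketch with a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [np.nan, 0.1, np.nan], "two": [1.0, np.nan, 3.0]},
    index=["a", "c", "e"],
)

# A dict maps column name -> fill value; columns not listed are left alone.
filled = df.fillna({"one": 0.0})

print(filled)
```

Here column two keeps its NaN because it does not appear in the dict.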
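You can also specify a fill method instead of a value. For example, pad carries the last valid value forward (recent pandas versions, 2.1 and later, deprecate the method argument in favor of df.ffill() and df.bfill()). Note that the leading NaN values in column one stay NaN because nothing precedes them:

    In [45]: df
    Out[45]:
            one       two     three
    a       NaN -0.282863 -1.509059
    c       NaN  1.212112 -0.173215
    e  0.119209 -1.044236 -0.861849
    f -2.104569 -0.494929  1.071804
    h       NaN -0.706771 -1.039575

    In [46]: df.fillna(method='pad')
    Out[46]:
            one       two     three
    a       NaN -0.282863 -1.509059
    c       NaN  1.212112 -0.173215
    e  0.119209 -1.044236 -0.861849
    f -2.104569 -0.494929  1.071804
    h -2.104569 -0.706771 -1.039575
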
The limit argument caps how many consecutive NaN values are filled:

    In [48]: df.fillna(method='pad', limit=1)

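The same fills can be written with the dedicated ffill and bfill methods, which also accept limit. The sketch below uses a hypothetical Series with a two-element gap so that the effect of limit=1 is visible.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# With limit=1 only the first NaN of each consecutive run is filled.
forward_once = s.ffill(limit=1)

# Backward fill: each gap takes the next valid value; a trailing NaN
# has no later value to copy, so it stays NaN.
backward = s.bfill()

print(forward.tolist())
print(forward_once.tolist())
print(backward.tolist())
```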
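Summary of fill methods

    Method              Description
    pad / ffill         Fill values forward
    bfill / backfill    Fill values backward

You can also fill with a pandas object, for example the Series of column means: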

    In [53]: dff
    Out[53]:
              A         B         C
    0  0.271860 -0.424972  0.567020
    1  0.276232 -1.087401 -0.673690
    2  0.113648 -1.478427  0.524988
    3       NaN  0.577046 -1.715002
    4       NaN       NaN -1.157892
    5 -1.344312       NaN       NaN
    6 -0.109050  1.643563       NaN
    7  0.357021 -0.674600       NaN
    8 -0.968914 -1.294524  0.413738
    9  0.276662 -0.472035 -0.013960

    In [54]: dff.fillna(dff.mean())
    Out[54]:
              A         B         C
    0  0.271860 -0.424972  0.567020
    1  0.276232 -1.087401 -0.673690
    2  0.113648 -1.478427  0.524988
    3 -0.140857  0.577046 -1.715002
    4 -0.140857 -0.401419 -1.157892
    5 -1.344312 -0.401419 -0.293543
    6 -0.109050  1.643563 -0.293543
    7  0.357021 -0.674600 -0.293543
    8 -0.968914 -1.294524  0.413738
    9  0.276662 -0.472035 -0.013960

    In [55]: dff.fillna(dff.mean()['B':'C'])
    Out[55]:
              A         B         C
    0  0.271860 -0.424972  0.567020
    1  0.276232 -1.087401 -0.673690
    2  0.113648 -1.478427  0.524988
    3       NaN  0.577046 -1.715002
    4       NaN -0.401419 -1.157892
    5 -1.344312 -0.401419 -0.293543
    6 -0.109050  1.643563 -0.293543
    7  0.357021 -0.674600 -0.293543
    8 -0.968914 -1.294524  0.413738
    9  0.276662 -0.472035 -0.013960

Filling every column with its mean, as in In [54], is equivalent to:

    In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')

Dropping data that contains NA with dropna

Besides filling with fillna, you can use dropna to remove the rows (axis=0) or columns (axis=1) that contain NA values. In the frame below every row has a NaN in column one, so dropping rows leaves nothing:

    In [57]: df
    Out[57]:
       one       two     three
    a  NaN -0.282863 -1.509059
    c  NaN  1.212112 -0.173215
    e  NaN  0.000000  0.000000
    f  NaN  0.000000  0.000000
    h  NaN -0.706771 -1.039575

    In [58]: df.dropna(axis=0)
    Out[58]:
    Empty DataFrame
    Columns: [one, two, three]
    Index: []

    In [59]: df.dropna(axis=1)
    Out[59]:
            two     three
    a -0.282863 -1.509059
    c  1.212112 -0.173215
    e  0.000000  0.000000
    f  0.000000  0.000000
    h -0.706771 -1.039575

    In [60]: df['one'].dropna()
    Out[60]: Series([], Name: one, dtype: float64)

Interpolation

To keep a series smooth across gaps, missing values can be estimated by interpolation. interpolate() is simple to use and performs linear interpolation by default:

    In [61]: ts
    Out[61]:
    2000-01-31    0.469112
    2000-02-29         NaN
    2000-03-31         NaN
    2000-04-28         NaN
    2000-05-31         NaN
                    ...
    2007-12-31   -6.950267
    2008-01-31   -7.904475
    2008-02-29   -6.441779
    2008-03-31   -8.184940
    2008-04-30   -9.011531
    Freq: BM, Length: 100, dtype: float64

    In [64]: ts.interpolate()
    Out[64]:
    2000-01-31    0.469112
    2000-02-29    0.434469
    2000-03-31    0.399826
    2000-04-28    0.365184
    2000-05-31    0.330541
                    ...
    2007-12-31   -6.950267
    2008-01-31   -7.904475
    2008-02-29   -6.441779
    2008-03-31   -8.184940
    2008-04-30   -9.011531
    Freq: BM, Length: 100, dtype: float64

interpolate also accepts a method argument. For example, method='time' weights the estimate by the actual time intervals in the index instead of treating the rows as equally spaced:

    In [67]: ts2
    Out[67]:
    2000-01-31    0.469112
    2000-02-29         NaN
    2002-07-31   -5.785037
    2005-01-31         NaN
    2008-04-30   -9.011531
    dtype: float64

    In [68]: ts2.interpolate()
    Out[68]:
    2000-01-31    0.469112
    2000-02-29   -2.657962
    2002-07-31   -5.785037
    2005-01-31   -7.398284
    2008-04-30   -9.011531
    dtype: float64

    In [69]: ts2.interpolate(method='time')
    Out[69]:
    2000-01-31    0.469112
    2000-02-29    0.270241
    2002-07-31   -5.785037
    2005-01-31   -7.190866
    2008-04-30   -9.011531
    dtype: float64

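The difference between the two methods is easier to verify by hand on round numbers. In the sketch below (hypothetical dates and values), the gap lies one day after the first observation and nine days before the last, so time-based interpolation should land a tenth of the way between them, while the default lands halfway.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-11"])
ts = pd.Series([0.0, np.nan, 10.0], index=idx)

# Default (linear): rows are treated as equally spaced, so the gap
# becomes the midpoint of its neighbours.
linear = ts.interpolate()

# method="time": the gap is 1 day into a 10-day interval.
by_time = ts.interpolate(method="time")

print(linear.iloc[1])
print(by_time.iloc[1])
```

method="time" requires a DatetimeIndex; for a plain numeric index the analogous options are "index" and "values".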
With a float index, method='values' interpolates according to the numeric index values rather than the row positions:

    In [70]: ser
    Out[70]:
    0.0      0.0
    1.0      NaN
    10.0    10.0
    dtype: float64

    In [71]: ser.interpolate()
    Out[71]:
    0.0      0.0
    1.0      5.0
    10.0    10.0
    dtype: float64

    In [72]: ser.interpolate(method='values')
    Out[72]:
    0.0      0.0
    1.0      1.0
    10.0    10.0
    dtype: float64

Besides Series, interpolate also works on DataFrames, column by column:

    In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
       ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
       ....:

    In [74]: df
    Out[74]:
         A      B
    0  1.0   0.25
    1  2.1    NaN
    2  NaN    NaN
    3  4.7   4.00
    4  5.6  12.20
    5  6.8  14.40

    In [75]: df.interpolate()
    Out[75]:
         A      B
    0  1.0   0.25
    1  2.1   1.50
    2  3.4   2.75
    3  4.7   4.00
    4  5.6  12.20
    5  6.8  14.40

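interpolate also accepts a limit argument, which caps how many consecutive NaN values are filled (ser here is a different, nine-element Series whose definition is not shown):

    In [95]: ser.interpolate(limit=1)
    Out[95]:
    0     NaN
    1     NaN
    2     5.0
    3     7.0
    4     NaN
    5     NaN
    6    13.0
    7    13.0
    8     NaN
    dtype: float64

Replacing values with replace
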
replace can substitute a single value, or map a list of values onto another list element by element:

    In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

    In [103]: ser.replace(0, 5)
    Out[103]:
    0    5.0
    1    1.0
    2    2.0
    3    3.0
    4    4.0
    dtype: float64

    In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
    Out[104]:
    0    4.0
    1    3.0
    2    2.0
    3    1.0
    4    0.0
    dtype: float64

In a DataFrame, a dict specifies which value to replace in each column:

    In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

    In [107]: df.replace({'a': 0, 'b': 5}, 100)
    Out[107]:
         a    b
    0  100  100
    1    1    6
    2    2    7
    3    3    8
    4    4    9

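In missing-data work, replace is often used in the other direction: turning a sentinel value into NaN so that the tools described earlier apply. The sketch below assumes a hypothetical dataset in which -999 marks "no measurement"; the sentinel and column names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 stands for a missing measurement.
df = pd.DataFrame({"a": [0, -999, 2], "b": [5, 6, -999]})

# Replace the sentinel everywhere with NaN.
cleaned = df.replace(-999, np.nan)

# A dict restricts the replacement to the listed columns: only "a" here.
only_a = df.replace({"a": -999}, np.nan)

print(cleaned.isna().sum().sum())
print(only_a["b"].tolist())
```

Because NaN is a float, the affected integer columns are stored as float64 after the replacement.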
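Instead of supplying a replacement value, replace can fill the matched values from their neighbors. This is a fill (pad copies the previous value forward), not an interpolation, and the method argument of replace is deprecated in pandas 2.1 and later:

    In [108]: ser.replace([1, 2, 3], method='pad')
    Out[108]:
    0    0.0
    1    0.0
    2    0.0
    3    0.0
    4    4.0
    dtype: float64
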
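On pandas versions where the method argument of replace is unavailable, the same result can be obtained in two explicit steps: mask the values to be replaced, then forward fill. This is one possible sketch, not the only way to write it.

```python
import pandas as pd

ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Step 1: turn the values to be replaced into NaN.
masked = ser.mask(ser.isin([1, 2, 3]))

# Step 2: forward fill, so each masked entry takes the last kept value.
result = masked.ffill()

print(result.tolist())
```

Splitting the operation this way also makes it easy to swap in bfill() or interpolate() for the second step.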
This article is also available at http://www.flydean.com/07-python-pandas-missingdata/