This post introduces three common Pandas operations on data types: to_numeric, to_datetime, and select_dtypes.
import pandas as pd
import numpy as np
Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
pandas.to_numeric(arg, # scalar, list, tuple, 1-d array, or Series
errors='raise', # 'ignore', 'raise', 'coerce'; the default is 'raise'
downcast=None)
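The three values of errors:
'raise': invalid parsing raises an exception (the default)
'coerce': invalid parsing is set to NaN
'ignore': invalid parsing returns the input unchanged
Using downcast ('integer', 'signed', 'unsigned', or 'float'), which shrinks the result to the smallest numeric dtype that can hold the values: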
s = pd.Series(["2.0", '1', -3, 5.0]) # numeric or numeric-like values
s
0 2.0
1 1
2 -3
3 5.0
dtype: object
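The dtype is object because the Series mixes strings and numbers; object is not a string-only type. Convert it to a numeric dtype:
# 1. Without downcast, the result here is float64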
pd.to_numeric(s)
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float64
# 2. downcast="integer": the smallest integer type that fits
pd.to_numeric(s, downcast="integer")
0 2
1 1
2 -3
3 5
dtype: int8
# 3. downcast="signed": same result as "integer"
pd.to_numeric(s, downcast="signed")
0 2
1 1
2 -3
3 5
dtype: int8
# 4. downcast="unsigned": -3 is negative, so no unsigned type fits and the dtype stays float64
pd.to_numeric(s, downcast="unsigned")
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float64
# 5. downcast="float": the smallest float type, float32
pd.to_numeric(s, downcast="float")
0 2.0
1 1.0
2 -3.0
3 5.0
dtype: float32
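The examples above all end up as int8 because the values are tiny. As a rough sketch of how the chosen dtype follows the value range, the snippet below uses a few illustrative Series (these values are assumptions for demonstration and do not come from the original post):

```python
import pandas as pd

# Illustrative values chosen to cross the int8 / int16 / int32 boundaries.
small = pd.Series([1, 2, 100])         # fits in int8 (-128..127)
medium = pd.Series([1, 2, 300])        # needs int16
large = pd.Series([1, 2, 70000])       # needs int32
non_negative = pd.Series([1, 2, 200])  # fits in uint8 (0..255)

small_dtype = pd.to_numeric(small, downcast="integer").dtype
medium_dtype = pd.to_numeric(medium, downcast="integer").dtype
large_dtype = pd.to_numeric(large, downcast="integer").dtype
unsigned_dtype = pd.to_numeric(non_negative, downcast="unsigned").dtype

print(small_dtype, medium_dtype, large_dtype, unsigned_dtype)
# int8 int16 int32 uint8
```

Because the result depends on the data, the same call may yield a different dtype on another dataset.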
s1 = pd.Series(["2.0", 'pandas', -3, 5.0]) # numbers plus a non-numeric string
s1
0 2.0
1 pandas
2 -3
3 5.0
dtype: object
# pd.to_numeric(s1) # raises an exception by default
# errors="ignore" returns the input unchanged (deprecated since pandas 2.2)
pd.to_numeric(s1, errors="ignore")
0 2.0
1 pandas
2 -3
3 5.0
dtype: object
# pd.to_numeric(s1, errors="raise") # invalid parsing raises an exception
# Invalid values are set to NaN
pd.to_numeric(s1, errors="coerce")
0 2.0
1 NaN
2 -3.0
3 5.0
dtype: float64
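After coercing, it is often useful to know which entries failed to parse. A minimal sketch, reusing the s1 values from above and comparing the missing-value masks before and after conversion (the variable names converted and failed are illustrative):

```python
import pandas as pd

s1 = pd.Series(["2.0", "pandas", -3, 5.0])
converted = pd.to_numeric(s1, errors="coerce")

# Entries that are NaN after conversion but were not missing originally
failed = s1[converted.isna() & s1.notna()]
print(failed.tolist())
# ['pandas']
```

Inspecting the failures first can help decide whether replacing them with a default value is appropriate.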
# Invalid values are set to NaN, and the result is downcast to float32
pd.to_numeric(s1, errors="coerce", downcast="float")
0 2.0
1 NaN
2 -3.0
3 5.0
dtype: float32
# Invalid values are set to NaN, then replaced with 0
pd.to_numeric(s1, errors="coerce").fillna(0)
0 2.0
1 0.0
2 -3.0
3 5.0
dtype: float64
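The same conversion can be applied to several DataFrame columns at once. The sketch below assumes a hypothetical DataFrame named raw with columns price and qty (neither appears in the original post) and passes errors="coerce" through DataFrame.apply:

```python
import pandas as pd

# Hypothetical data: both columns arrive as text
raw = pd.DataFrame({"price": ["1.5", "2", "n/a"], "qty": ["3", "4", "5"]})

# apply calls pd.to_numeric on each column and forwards errors="coerce"
clean = raw.apply(pd.to_numeric, errors="coerce")
print(clean)
```

Here "n/a" becomes a missing value in price, while qty converts cleanly to integers.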
s2 = pd.Series([1,2.0,3.0], dtype="float64")
s2
0 1.0
1 2.0
2 3.0
dtype: float64
s3 = pd.to_numeric(s2, downcast="float")
s3
0 1.0
1 2.0
2 3.0
dtype: float32
s4 = pd.to_numeric(s2, downcast="integer")
s4
0 1
1 2
2 3
dtype: int8
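The downcast to int8 above works because every float in s2 is a whole number. As a small sketch of the other case (the fractional Series is an illustrative assumption), a value with a fractional part prevents the integer downcast and the data is left as float64:

```python
import pandas as pd

whole = pd.Series([1.0, 2.0, 3.0])
fractional = pd.Series([1.0, 2.5, 3.0])

whole_result = pd.to_numeric(whole, downcast="integer")
fractional_result = pd.to_numeric(fractional, downcast="integer")

print(whole_result.dtype)       # int8
print(fractional_result.dtype)  # float64, values unchanged
```

In other words, downcast does not round or truncate; it only changes the dtype when no information is lost.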
One benefit of type conversion is lower memory usage. Compare the memory used by the three numeric dtypes above:
print("memory of float64: ", s2.memory_usage())
print("memory of float32: ", s3.memory_usage())
print("memory of int8: ", s4.memory_usage())
memory of float64: 152
memory of float32: 140
memory of int8: 131
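The three totals above differ by only a few bytes because, for a three-element Series, the index accounts for most of the reported size. A sketch on a larger Series (n = 1000 is an arbitrary choice) with index=False makes the per-value cost easier to see:

```python
import numpy as np
import pandas as pd

n = 1000
as_float64 = pd.Series(np.ones(n), dtype="float64")
as_float32 = pd.to_numeric(as_float64, downcast="float")
as_int8 = pd.to_numeric(as_float64, downcast="integer")

# index=False excludes the index so only the values are counted
for s in (as_float64, as_float32, as_int8):
    print(s.dtype, s.memory_usage(index=False))
# float64 8000
# float32 4000
# int8 1000
```

The values take 8, 4, and 1 bytes each, so the savings grow with the length of the data.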
Another way to convert: astype
s2
0 1.0
1 2.0
2 3.0
dtype: float64
s2.astype("float32")
0 1.0
1 2.0
2 3.0
dtype: float32
s2.astype("int64")
0 1
1 2
2 3
dtype: int64
s2.astype("int32")
0 1
1 2
2 3
dtype: int32
s2.astype("category")
0 1.0
1 2.0
2 3.0
dtype: category
Categories (3, float64): [1.0, 2.0, 3.0]
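Unlike to_numeric with downcast, astype converts unconditionally, which has two consequences worth knowing. The sketch below uses illustrative Series (not from the original post) to show that a float-to-integer cast drops the fractional part, and that missing values cannot be cast to a plain integer dtype, although the nullable "Int64" dtype accepts them:

```python
import numpy as np
import pandas as pd

with_fraction = pd.Series([1.7, 2.2, -3.9])
truncated = with_fraction.astype("int64")  # fractional part is dropped, not rounded
print(truncated.tolist())
# [1, 2, -3]

with_missing = pd.Series([1.0, np.nan, 3.0])
try:
    with_missing.astype("int64")
    cast_failed = False
except ValueError:
    # NaN has no representation in a plain integer dtype
    cast_failed = True
print(cast_failed)
# True

# The nullable integer dtype (capital "I") keeps the missing value as <NA>
nullable = with_missing.astype("Int64")
print(nullable.isna().tolist())
# [False, True, False]
```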
pandas.to_datetime(arg,
errors='raise',
dayfirst=False,
yearfirst=False,
utc=None,
format=None,
exact=True,
unit=None,
infer_datetime_format=False,
origin='unix',
cache=True)
df = pd.DataFrame({"Year":[2022,2021,2022],
"Month":[1,3,5],
"Day":["10","12","28"]
})
df
| | Year | Month | Day |
|---|---|---|---|
| 0 | 2022 | 1 | 10 |
| 1 | 2021 | 3 | 12 |
| 2 | 2022 | 5 | 28 |
df.dtypes
Year int64
Month int64
Day object
dtype: object
Adding the columns directly raises an error, because strings and numbers cannot be added together.
# df["Date"] = df["Year"] + df["Month"] + df["Day"]
df["Date"] = pd.to_datetime(df)
df
| | Year | Month | Day | Date |
|---|---|---|---|---|
| 0 | 2022 | 1 | 10 | 2022-01-10 |
| 1 | 2021 | 3 | 12 | 2021-03-12 |
| 2 | 2022 | 5 | 28 | 2022-05-28 |
df.dtypes
Year int64
Month int64
Day object
Date datetime64[ns]
dtype: object
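Passing a DataFrame works here because pandas assembles dates from columns named year, month, and day, and matches those names case-insensitively. When the columns are named differently, one option is to rename them first. A minimal sketch, assuming a hypothetical frame with columns yy, mm, and dd:

```python
import pandas as pd

# Hypothetical frame whose column names are not the ones pandas looks for
parts = pd.DataFrame({"yy": [2022, 2021], "mm": [1, 3], "dd": [10, 12]})

renamed = parts.rename(columns={"yy": "year", "mm": "month", "dd": "day"})
dates = pd.to_datetime(renamed[["year", "month", "day"]])
print(dates.tolist())
# [Timestamp('2022-01-10 00:00:00'), Timestamp('2021-03-12 00:00:00')]
```

Selecting only the date-part columns also avoids errors from unrelated columns in the frame.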
pd.to_datetime("10/2/21") # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10-2-21") # default: month first
Timestamp('2021-10-02 00:00:00')
pd.to_datetime("10/2/21",dayfirst=True)
Timestamp('2021-02-10 00:00:00')
pd.to_datetime("10/2/21",yearfirst=True)
Timestamp('2010-02-21 00:00:00')
pd.to_datetime("22-01-21",dayfirst=True)
Timestamp('2021-01-22 00:00:00')
pd.to_datetime("22-01-21",yearfirst=True)
Timestamp('2022-01-21 00:00:00')
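Since dayfirst and yearfirst only guide the parser, an explicit format string is a less ambiguous alternative when the layout is known. A short sketch using the same "10/2/21" string with three different strftime-style formats (the variable names are illustrative):

```python
import pandas as pd

text = "10/2/21"
month_first = pd.to_datetime(text, format="%m/%d/%y")
day_first = pd.to_datetime(text, format="%d/%m/%y")
year_first = pd.to_datetime(text, format="%y/%m/%d")

print(month_first.date(), day_first.date(), year_first.date())
# 2021-10-02 2021-02-10 2010-02-21
```

These match the results of the default, dayfirst=True, and yearfirst=True calls above.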
pd.to_datetime('20220107', format='%Y%m%d', errors='ignore')
Timestamp('2022-01-07 00:00:00')
pd.to_datetime('20220107112347', errors='ignore')
Timestamp('2022-01-07 11:23:47')
pd.to_datetime('20220107112233', format='%Y%m%d%H%M%S')
Timestamp('2022-01-07 11:22:33')
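Two other parameters from the signature, errors and unit, can be sketched as follows. With errors="coerce", strings that do not parse become NaT instead of raising, and unit="s" interprets a number as seconds since the Unix epoch. The sample values are illustrative assumptions:

```python
import pandas as pd

raw = pd.Series(["2022-01-07", "not a date"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().tolist())
# [False, True]

# 1641513600 seconds after 1970-01-01 is midnight on 2022-01-07 (UTC)
from_epoch = pd.to_datetime(1641513600, unit="s")
print(from_epoch)
# 2022-01-07 00:00:00
```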
Select the columns of a given dtype
df.dtypes
Year int64
Month int64
Day object
Date datetime64[ns]
dtype: object
df.select_dtypes(include=["int"])
| | Year | Month |
|---|---|---|
| 0 | 2022 | 1 |
| 1 | 2021 | 3 |
| 2 | 2022 | 5 |
df.select_dtypes(include=["object"])
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
df.select_dtypes(include=["O"]) # same result as above
| | Day |
|---|---|
| 0 | 10 |
| 1 | 12 |
| 2 | 28 |
# Exclude object columns
df.select_dtypes(exclude=["object"])
| | Year | Month | Date |
|---|---|---|---|
| 0 | 2022 | 1 | 2022-01-10 |
| 1 | 2021 | 3 | 2021-03-12 |
| 2 | 2022 | 5 | 2022-05-28 |
# Exclude object and int columns
df.select_dtypes(exclude=["object","int"])
| | Date |
|---|---|
| 0 | 2022-01-10 |
| 1 | 2021-03-12 |
| 2 | 2022-05-28 |
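Besides concrete names such as "int" and "object", select_dtypes also accepts the broader selectors "number" and "datetime", which can be handy because "int" may map to a different integer width on some platforms. A minimal sketch on a small frame built for illustration (the Ratio column is an assumption, not part of the original df):

```python
import pandas as pd

frame = pd.DataFrame({
    "Year": [2022, 2021],
    "Ratio": [0.5, 0.25],  # hypothetical float column
    "Day": ["10", "12"],
    "Date": pd.to_datetime(["2022-01-10", "2021-03-12"]),
})

# "number" covers all integer and float columns
numeric_cols = frame.select_dtypes(include="number").columns.tolist()
# "datetime" covers datetime64 columns
datetime_cols = frame.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numeric nor datetime
other_cols = frame.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(numeric_cols, datetime_cols, other_cols)
# ['Year', 'Ratio'] ['Date'] ['Day']
```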