【数据准备和特征工程】数据清理

原创

LoneRanger

发布于 2022-02-02 15:37:24

8400

发布于 2022-02-02 15:37:24

1.基本概念

import pandas as pd df = pd.read_csv("test.csv") df.sample(10)

获取前几行数据
```python

data.head()

获取数据维度信息

df.shape

获取数据表属性的相关信息

```python

data.info()

获取数据表属性类型信息

```python

data.head()

### 2.转换数据类型

```python

import pandas as pd

df = pd.DataFrame([{'col1':'a', 'col2':'1'},

{'col1':'b', 'col2':'2'}])

df.dtypes

Object转换为数值型

```python

df'col2-int' = df'col2'.astype(int)

含有不是数字的Object类型转换为数值型

```python

#此时由于含有不是数字的值，所以就算忽略报错后转换后的类型还是Object

s = pd.Series('1', '2', '4.7', 'pandas', '10')

s.astype(float,errors='ignore')

```python

#这种方法可以将其转换为数值型，Pandas则变为Nan

pd.to_numeric(s, errors='coerce')

转换为日期类型

```python

#可以将三列数据Month、Day、Year转换为日期类型的数据

pd.to_datetime(df['Month', 'Day', 'Year'])

最终的代码

```python

import pandas as pd

import numpy as np

def convert_money(value):

new_value = value.replace(",","").replace("$","")

return float(new_value)

df2 = pd.read_csv("/home/aistudio/data/data20506/sales_types.csv",

dtype = {'Customer Number': 'int'},

converters = {'2016': convert_money,

'2017': convert_money,

'Percent Growth': lambda x: float(x.replace("%", "")) / 100,

'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),

'Active': lambda x: np.where(x =='Y', 1, 0),

})

df2'Date' = pd.to_datetime(df['Month', 'Day', 'Year'])

df2

### 3.处理重复数据

```python

import pandas as pd

d = {'Name':'Newton', 'Galilei', 'Einstein', 'Feynman', 'Newton', 'Maxwell', 'Galilei',

'Age':26, 30, 28, 28, 26, 39, 40,

'Score':90, 80, 90, 100, 90, 70, 90}

df = pd.DataFrame(d,columns='Name','Age','Score')

df.duplicated()函数使用

```python

Consider dataset containing ramen rating.

df = pd.DataFrame({

... 'brand': 'Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie',

... 'style': 'cup', 'cup', 'cup', 'pack', 'pack',

... 'rating': 4, 4, 3.5, 15, 5

... })

df

brand style  rating

0 Yum Yum cup 4.0

1 Yum Yum cup 4.0

2 Indomie cup 3.5

3 Indomie pack 15.0

4 Indomie pack 5.0

By default, for each set of duplicated values, the first occurrence

is set on False and all others on True.

df.duplicated()

0 False

1 True

2 False

3 False

4 False

dtype: bool

By using 'last', the last occurrence of each set of duplicated values

is set on False and all others on True.

df.duplicated(keep='last')

0 True

1 False

2 False

3 False

4 False

dtype: bool

By setting keep on False, all duplicates are True.

df.duplicated(keep=False)

0 True

1 True

2 False

3 False

4 False

dtype: bool

To find duplicates on specific column(s), use subset.

df.duplicated(subset='brand')

0 False

1 True

2 False

3 True

4 True

dtype: bool

删除重复的

```python

#保留第一个

df.drop_duplicates()

#保留最后一个重复的元素

df.drop_duplicates('Age', keep='last')

### 4.处理缺失数据

#### a.检查缺失数据

```python

#方法一

#isna()函数，若为空则为False,若不为空则为True

df = pd.DataFrame({"one":1, 2, np.nan, "two":np.nan, 3, 4})

df.isna() #返回m行n列，每个元素的值都会返回（True,False)

df.isna().any() #只返回1列，只要有一个是False就整个属性的值就为False

#方法二

#可以统计缺失率

(data.shape0 - data.count()) / data.shape0

#### b.直接删除缺失数据

Dropna()函数

```python

Examples

df = pd.DataFrame({"name": 'Alfred', 'Batman', 'Catwoman',

... "toy": np.nan, 'Batmobile', 'Bullwhip',

... "born": [pd.NaT, pd.Timestamp("1940-04-25"),

... pd.NaT]})

df

   name        toy       born

0 Alfred NaN NaT

1 Batman Batmobile 1940-04-25

2 Catwoman Bullwhip NaT

Drop the rows where at least one element is missing.

df.dropna()

 name        toy       born

1 Batman Batmobile 1940-04-25

Drop the columns where at least one element is missing.

df.dropna(axis='columns')

   name

0 Alfred

1 Batman

2 Catwoman

Drop the rows where all elements are missing.

df.dropna(how='all')

   name        toy       born

0 Alfred NaN NaT

1 Batman Batmobile 1940-04-25

2 Catwoman Bullwhip NaT

Keep only the rows with at least 2 non-NA values.

df.dropna(thresh=2)

   name        toy       born

1 Batman Batmobile 1940-04-25

2 Catwoman Bullwhip NaT

Define in which columns to look for missing values.

df.dropna(subset='name', 'born')

   name        toy       born

1 Batman Batmobile 1940-04-25

Keep the DataFrame with valid entries in the same variable.

在同一个变量改变

df.dropna(inplace=True)
df

 name        toy       born

1 Batman Batmobile 1940-04-25

#### c.用指定值填补缺失数据

```python

df = pd.DataFrame({'ColA':1, np.nan, np.nan, 4, 5, 6, 7, 'ColB':1, 1, 1, 1, 2, 2, 2})

df'ColA'.fillna(method='ffill')#以前面一个值填充

df'ColA'.fillna(method='bfill')#以后面一个值填充

调用sklearn.impute中的SimpleImputer来填补缺失数据

```python

from sklearn.impute import SimpleImpute

#以均值填充空值

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

col_values = imp_mean.fit_transform(pdf2'Height-na'.values.reshape((-1, 1)))

#以常数填充空值

df = pd.DataFrame({"name": "Google", "Huawei", "Facebook", "Alibaba","price": 100, -1, -1, 90})

imp = SimpleImputer(missing_values=-1, strategy='constant', fill_value=110) # ⑤

imp.fit_transform(df'price'.values.reshape((-1, 1)))

#### d.根据规律填补缺失值

利用sklearn.linear\_model的LinearRegression来回归未知的数据

```python

df = pd.DataFrame({"one":np.random.randint(1, 100, 10),

"two": 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,

"three":5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41})

from sklearn.linear_model import LinearRegression # ⑥

df_train = df.dropna() #训练集

df_test = df[df'three'.isnull()] #测试集

regr = LinearRegression()

regr.fit(df_train'two'.values.reshape(-1, 1), df_train'three'.values.reshape(-1, 1)) # ⑦

df_three_pred = regr.predict(df_test'two'.values.reshape(-1, 1))

将所得数值填补到原数据集中

df.loc(df.three.isnull()), 'three' = df_three_pred

使用sklearn.ensemble的RandomForestRegressor来进行预测

```python

import pandas as pd

train_data = pd.read_csv("/home/aistudio/data/data20507/train.csv")

train_data.info()

df = train_data['Age','Fare', 'Parch', 'SibSp', 'Pclass'] #可能跟年龄有关的特征

known_age = df[df'Age'.notnull()].values

unknown_age = df[df'Age'.isnull()].values

y = known_age:, 0

X = known_age:, 1:

from sklearn.ensemble import RandomForestRegressor # ⑩

rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1) # ○11

rfr.fit(X, y)

pred_age = rfr.predict(unknown_age:, 1:) # ○13

pred_age.mean()

train_data.loc(train_data.Age.isnull()), 'Age' = pred_age

train_data.isna().any()

还可以利用KNN来填补缺失值

```python

%matplotlib inline

import seaborn as sns

from sklearn.datasets import load_iris # 引入鸢尾花数据集

import numpy as np

iris = load_iris()

X = iris.data

制造含有缺失值的数据集

rng = np.random.RandomState(0)

X_missing = X.copy()

mask = np.abs(X:, 2 - rng.normal(loc=5.5, scale=0.7, size=X.shape0)) < 0.6

X_missingmask, 3 = np.nan # X_missing是包含了缺失值的数据集

from missingpy import KNNImputer # 引入KNN填充缺失值的模型

imputer = KNNImputer(n_neighbors=3, weights="uniform")

X_imputed = imputer.fit_transform(X_missing)

#填补之前的数据分布

sns.distplot(X.reshape((-1, 1)))

#填补缺失数据后的分布

sns.distplot(X_imputed.reshape((-1, 1)))

### 5.离群数据

#### 5.1 通过可视化

```python

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.datasets import load_boston

boston = load_boston()

x = boston.data

y = boston.target

columns = boston.feature_names

#create the dataframe

boston_df = pd.DataFrame(x)

boston_df.columns = columns

boston_df.head()

####  Box-plot

```python

import seaborn as sns

sns.boxplot(x=boston_df'INDUS');

![[箱线图.png]]

```python

sns.boxplot(x=boston_df'PTRATIO');

sns.boxplot(x=boston_df'DIS');

#### Scatter-plot

```python

fig, ax = plt.subplots(figsize=(16,8))

ax.scatter(boston_df'INDUS', boston_df'TAX')

ax.set_xlabel('Proportion of non-retail business acres per town')

ax.set_ylabel('Full-value property-tax rate per $10,000')

plt.show();

#### 5.2 通过数学计算

#### Z-Score

[Z-Score](https://en.wikipedia.org/wiki/Standard\_score)是指观测点或数据的值超过观测值或测量值平均值的标准差的符号数。

```python

from scipy import stats

import numpy as np

z = np.abs(stats.zscore(boston_df))

print(z)

threshold = 3

print(np.where(z > 3))

#### IQR score

[四分位数范围(IQR)](https://en.wikipedia.org/wiki/Interquartile\_range)，又称平均数或50%平均数，或技术上称为H-spread，是衡量统计学分散度的指标，等于75%和25%之间的差值，或上四分位数和下四分位数之间的差值，IQR=Q3 - Q1。[百度百科解释](https://baike.baidu.com/item/%E5%9B%9B%E5%88%86%E4%BD%8D%E5%B7%AE/8362429?fr=aladdin)

![[IQR.png]]

```python

Q1 = boston_df.quantile(0.25)

Q3 = boston_df.quantile(0.75)

IQR = Q3 - Q1

print(IQR)

既然我们现在有了IQR估计，那么就可以选择离群值。下面的代码将产生带有真值和假值的结果。带有False的数据点表示这些值是有效的，而True则表示有释放。

```python

print((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR)))

5.3 处理离群数据

识别异常值--分析有无异常值的结果模型--做出结论。
如果你确定数值是错误的，就修正它。
如果离群值不在利益分配范围内，则删除。
考虑到数据的差距，使用抗离群值的统计工具，例如，稳健回归（用另一种参数估计方法）Robust_regression。
离群值的常见原因是两种分布的混合，可能是两个不同的子人群，也可能表明 "测量正确 "和 "测量误差"；这通常是由混合模型来建模。 (Mixture model).