22. Data cleaning in Pandas: removing duplicates

In Pandas, the duplicated function reports which rows repeat an earlier row, and the drop_duplicates function removes those repeated rows.

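import pandas as pd

# Three distinct rows repeated four times each, giving 12 rows in total.
col = ["apple", "pearl", "watermelon"] * 4
pri = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": col, "price": pri})
print(df)
# Boolean Series: True marks a row that repeats an earlier one.
print(df.duplicated())
# Returns a new DataFrame with the first occurrence of each row; df is unchanged.
print(df.drop_duplicates())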

Output:

         fruit  price
0        apple   2.50
1        pearl   3.00
2   watermelon   2.75
3        apple   2.50
4        pearl   3.00
5   watermelon   2.75
6        apple   2.50
7        pearl   3.00
8   watermelon   2.75
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
dtype: bool
        fruit  price
0       apple   2.50
1       pearl   3.00
2  watermelon   2.75
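
Because duplicated returns a boolean Series, it can also be used to count or inspect the repeated rows before dropping them. The following is a minimal sketch built on the same sample data; the names fruit, price, mask and n_duplicates are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as above: three distinct rows repeated four times.
fruit = ["apple", "pearl", "watermelon"] * 4
price = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": fruit, "price": price})

# True marks a row that repeats an earlier row.
mask = df.duplicated()

# Summing the boolean mask counts the repeated rows.
n_duplicates = int(mask.sum())
print(n_duplicates)  # 9

# Boolean indexing shows exactly the rows drop_duplicates() would remove.
print(df[mask])
```

Checking the count first is a cheap way to confirm that drop_duplicates will remove what you expect.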

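By default drop_duplicates returns a new DataFrame and leaves the original untouched. To modify the DataFrame itself, pass inplace=True. To keep the last occurrence of each duplicated row instead of the first, pass keep="last".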

import pandas as pd

col = ["apple", "pearl", "watermelon"] * 4
pri = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": col, "price": pri})
print(df)
print(df.duplicated())
# Default keep="first": rows 0, 1 and 2 survive.
print(df.drop_duplicates())
# keep="last": rows 9, 10 and 11 survive instead.
print(df.drop_duplicates(keep="last"))

Output:

         fruit  price
0        apple   2.50
1        pearl   3.00
2   watermelon   2.75
3        apple   2.50
4        pearl   3.00
5   watermelon   2.75
6        apple   2.50
7        pearl   3.00
8   watermelon   2.75
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
dtype: bool
        fruit  price
0       apple   2.50
1       pearl   3.00
2  watermelon   2.75
         fruit  price
9        apple   2.50
10       pearl   3.00
11  watermelon   2.75
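
The listing above does not exercise inplace=True, so here is a small sketch of how it differs from the default behaviour. It reuses the same sample data; the names fruit, price, deduped and result are illustrative assumptions, not part of the original example.

```python
import pandas as pd

fruit = ["apple", "pearl", "watermelon"] * 4
price = [2.50, 3.00, 2.75] * 4
df = pd.DataFrame({"fruit": fruit, "price": price})

# Default call: a new DataFrame is returned and df keeps all 12 rows.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 12 3

# With inplace=True the call modifies df itself and returns None.
result = df.drop_duplicates(keep="last", inplace=True)
print(result)             # None
print(df.index.tolist())  # [9, 10, 11]
```

Note that the surviving rows keep their original index labels; if a fresh 0-based index is wanted, df.reset_index(drop=True) can be applied afterwards.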