28. Pandas的数据重塑-pivot与pivot_table函数

pandas.pivot的重点在于reshape, 通俗理解就是合并同类项,所以在行与列的交叉点值的索引应该是唯一值,如果不是唯一值,则会报,即原始数据集中存在重复条目,此时pivot函数无法确定数据透视表中的数值即会报错ValueError: Index contains duplicate entries, cannot reshape。尽管如此,pivot()方法可以对数据进行行列互换,或者进行透视转换,在有些场合下分析数据时非常方便。

  • pivot函数,DataFrame对象的pivot()方法可以接收三个参数,分别是index、columns和values,其中index用来指定转换后DataFrame对象的纵向索引,columns用来指定转换后DataFrame对象的横向索引或者列名,values用来指定转换后DataFrame对象的值。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
idx += [104, 104, 105, 106]
name = ["apple","pearl","orange", "apple","orange","pearl","apple","pearl","orange"]
name += ["apple","pearl", "apple", "orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
price += [5.30, 4.00, 5.25, 7.50]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print df0
print ""
df1 = df0.pivot(index = "supplier", columns = "fruit", values = "price")
print df1

程序执行结果:

     fruit  price  supplier
0    apple   5.20       101
1    pearl   3.50       101
2   orange   7.30       101
3    apple   5.00       102
4   orange   7.50       102
5    pearl   7.30       102
6    apple   5.20       103
7    pearl   3.70       103
8   orange   7.30       103
9    apple   5.30       104
10   pearl   4.00       104
11   apple   5.25       105
12  orange   7.50       106

fruit     apple  orange  pearl
supplier                      
101        5.20     7.3    3.5
102        5.00     7.5    7.3
103        5.20     7.3    3.7
104        5.30     NaN    4.0
105        5.25     NaN    NaN
106         NaN     7.5    NaN
  • pivot_table函数,在调用pivot方法前需要保证数据集中不存在重复条目,否则我们需要调用另外一个方法:pivot_table,函数会对重复条目进行column to aggregate列聚合即求均值。
import pandas as pd
idx = [101,101,101,102,103,103,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print df0
#df1 = df0.pivot(index = "supplier", columns = "fruit", values = "price")
#print df1
print ""
df2 = df0.pivot_table(index = "supplier", columns = "fruit", values = "price")
print df2

程序执行结果:

    fruit  price  supplier
0   apple      1       101
1   pearl      2       101
2  orange      3       101
3   apple      4       102
4   pearl      5       103
5  orange      6       103
6   apple      7       103
7   pearl      8       103
8  orange      9       103

fruit     apple  orange  pearl
supplier                      
101           1     3.0    2.0
102           4     NaN    NaN
103           7     7.5    6.5

由于103 orange103 pearl存在2项,所以pivot函数不能用,用pivot_table函数会对103 orange103 pearl聚合处理用均值作为其值。例如103 orange的值应为7.5 = (9.0 + 6.0 ) / 2