16. Creating Categorical Data in Pandas

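The previous chapter introduced the basic meaning of Categorical Data; this chapter looks in more detail at how to create and use this data type. First, the difference between Categorical Data and categories is worth restating. Categorical Data consists of two parts, categories and codes. The categories are the finite set of unique labels. The codes are the integers that are actually stored, one per value, each giving the position of that value's label within categories.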
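As a minimal sketch of the categories/codes split described above (written for Python 3 and a recent pandas, unlike the Python 2 listings in this chapter; the variable names are illustrative only), indexing the categories with the codes rebuilds the original values:

```python
import numpy as np
import pandas as pd

values = ["apple", "pearl", "orange", "apple", "orange"]
cat = pd.Categorical(values)

# categories: the unique labels (sorted when no explicit order is given)
categories = list(cat.categories)
# codes: one small integer per value, pointing into categories
codes = cat.codes.tolist()

# Looking each code up in the categories recovers the original values
rebuilt = list(np.asarray(cat.categories)[cat.codes])

print(categories)  # ['apple', 'orange', 'pearl']
print(codes)       # [0, 2, 1, 0, 1]
print(rebuilt == values)
```

Only the short codes array grows with the data; the labels themselves are stored once in categories.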

16.1 Creating Categorical Data

Pandas offers several ways to create Categorical Data: a column of an existing DataFrame can be converted to Categorical Data, a Categorical can be created directly, and some functions return Categorical Data as their result.
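One function that returns Categorical Data is pd.cut, which bins numeric values. The sketch below is written for Python 3; the bin edges and the labels "low", "mid" and "high" are hypothetical values chosen only for illustration and do not come from this chapter's examples.

```python
import pandas as pd

price = [5.20, 3.50, 7.30, 5.00, 7.50]

# Hypothetical bins: (0, 4] -> low, (4, 6] -> mid, (6, 10] -> high
bands = pd.cut(price, bins=[0, 4, 6, 10], labels=["low", "mid", "high"])

print(type(bands).__name__)    # Categorical
print(list(bands))             # ['mid', 'low', 'high', 'mid', 'high']
print(list(bands.categories))  # ['low', 'mid', 'high']
```

When the input is a plain list or array, pd.cut returns a Categorical; when the input is a Series, it returns a Series of category dtype.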

1). Create with astype('category'), which converts a column of a DataFrame directly to Categorical Data.

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
name = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({ "fruit": name, "price" : price}, index = idx)
# convert the fruit column from plain strings to the category dtype
df['fruit'] = df['fruit'].astype('category')
print df,"\n"
print "df.price.values\n", df.price.values,"\n"
print "df.fruit.values\n", df.fruit.values, "\n"

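This is the example used in the previous chapter: it converts the fruit column of the DataFrame df from a Series of plain strings (object dtype) into Categorical Data, i.e. the category dtype. The program prints: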

    fruit  price
1   apple    5.2
2   pearl    3.5
3  orange    7.3
5   apple    5.0
6  orange    7.5
7  orange    7.3
9   apple    5.2
4   pearl    3.7
8  orange    7.3 

df.price.values
[5.2 3.5 7.3 5.  7.5 7.3 5.2 3.7 7.3] 

df.fruit.values
[apple, pearl, orange, apple, orange, orange, apple, pearl, orange]
Categories (3, object): [apple, orange, pearl] 

2). Create a Categorical directly with pandas.Categorical


import pandas as pd
val = ["apple","pearl","orange", "apple", "orange"]
cat = pd.Categorical(val)
print "type is",type(cat)
print "*" * 20
print "categorical data:\n",cat
print "*" * 20
print cat.categories
print cat.codes

Output:

type is <class 'pandas.core.categorical.Categorical'>
********************
categorical data:
[apple, pearl, orange, apple, orange]
Categories (3, object): [apple, orange, pearl]
********************
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1]

val is a Python list, while cat is a Categorical. It has the attributes categories and codes, which hold the set of labels and the integer encoding used to store the data.
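By default the categories are inferred from the data and sorted. As a hedged sketch (Python 3, recent pandas), the categories and ordered parameters of pd.Categorical let the caller fix the label set and its order instead. The extra label "banana" and the chosen order are assumptions made up for this illustration.

```python
import pandas as pd

val = ["apple", "pearl", "orange", "apple", "orange"]

# Explicit label set: custom order, plus "banana", which never occurs in val
cat = pd.Categorical(
    val,
    categories=["pearl", "orange", "apple", "banana"],
    ordered=True,
)

print(list(cat.categories))  # ['pearl', 'orange', 'apple', 'banana']
print(cat.codes.tolist())    # [2, 0, 1, 2, 1]
print(cat.min(), cat.max())  # pearl apple
```

The codes now follow the given order rather than alphabetical order, and the unused category is kept as part of the type.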

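3). Create Categorical Data from categories and codes. The categories must be unique and finite; each code must be an integer index into the categories (or -1 for a missing value), and the codes may appear in any order and any number of times.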

import pandas as pd
val = ["apple","pearl","orange", "apple", "orange"]
cat = pd.Categorical(val)
print "type is",type(cat)
print "*" * 20
print "categorical data:\n",cat
print "*" * 20
print cat.categories
print cat.codes
print "*" * 20
codes = pd.Series([0,1, 0,2,1,0,2,0])
print "create categorical data:"
print cat.take(codes)
print pd.Categorical.take(cat, codes)
print cat.from_codes(codes, cat.categories)

Output:

type is <class 'pandas.core.categorical.Categorical'>
********************
categorical data:
[apple, pearl, orange, apple, orange]
Categories (3, object): [apple, orange, pearl]
********************
Index([u'apple', u'orange', u'pearl'], dtype='object')
[0 2 1 0 1]
********************
create categorical data:
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
[apple, orange, apple, pearl, orange, apple, pearl, apple]
Categories (3, object): [apple, orange, pearl]

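The variable cat is a Categorical created from the list val, and it has the attributes categories and codes. The rest of the program uses cat and its categories to generate further Categorical objects in three ways. Note that the first two calls and the third print different results for the same list of integers.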

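  • Calling take on a Categorical instance. cat.take(codes) treats the integers in codes as positions in cat's own values and returns the elements found at those positions. It does not look the integers up in cat.categories.
print cat.take(codes)

Result:

[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]

Position 1 of cat holds pearl and position 2 holds orange, which is why 1 yields pearl here.
  • Calling take through the pd.Categorical class. This is the same method called on the class, so it takes two arguments: the Categorical instance cat and the list of positions.
print pd.Categorical.take(cat, codes)

Result:

[apple, pearl, apple, orange, pearl, apple, orange, apple]
Categories (3, object): [apple, orange, pearl]
  • Calling from_codes. This function takes the codes and the categories and maps each code to the category at that index, so with the categories [apple, orange, pearl] the code 1 yields orange and 2 yields pearl. from_codes is a class method, so calling it as pd.Categorical.from_codes(codes, cat.categories) gives the same result.
print cat.from_codes(codes, cat.categories)

Result:

[apple, orange, apple, pearl, orange, apple, pearl, apple]
Categories (3, object): [apple, orange, pearl]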
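The difference between take and from_codes can be checked side by side. The following is a small sketch for Python 3 and a recent pandas (the names taken, decoded and with_missing are illustrative): take reads the integers as positions in the existing values, while from_codes reads them as indices into the categories, with -1 standing for a missing value.

```python
import pandas as pd

cat = pd.Categorical(["apple", "pearl", "orange", "apple", "orange"])
idx = [0, 1, 0, 2, 1, 0, 2, 0]

# take: idx are positions in cat's values (apple, pearl, orange, apple, orange)
taken = cat.take(idx)

# from_codes: idx are indices into the categories (apple, orange, pearl)
decoded = pd.Categorical.from_codes(idx, categories=cat.categories)

# A code of -1 marks a missing value
with_missing = pd.Categorical.from_codes([0, -1, 2], categories=cat.categories)

print(list(taken))    # ['apple', 'pearl', 'apple', 'orange', 'pearl', 'apple', 'orange', 'apple']
print(list(decoded))  # ['apple', 'orange', 'apple', 'pearl', 'orange', 'apple', 'pearl', 'apple']
print(with_missing.isna().tolist())  # [False, True, False]
```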

16.2 Inserting Categorical Data into a DataFrame

Categorical Data created with pandas.Categorical can be inserted into a DataFrame as a column.

import pandas as pd
idx = [1,2,3,5,6,7,9,4,8]
fruit = ["apple","pearl","orange", "apple","orange","orange","apple","pearl","orange"]
price = [5.20,3.50,7.30,5.00,7.50,7.30,5.20,3.70,7.30]
df = pd.DataFrame({"price" : price}, index = idx)
print df
cat = pd.Categorical(fruit)
df["fruit"] = cat
print df
print cat.codes
print cat.categories

Output:

   price
1    5.2
2    3.5
3    7.3
5    5.0
6    7.5
7    7.3
9    5.2
4    3.7
8    7.3
   price   fruit
1    5.2   apple
2    3.5   pearl
3    7.3  orange
5    5.0   apple
6    7.5  orange
7    7.3  orange
9    5.2   apple
4    3.7   pearl
8    7.3  orange
[0 2 1 0 1 1 0 2 1]
Index([u'apple', u'orange', u'pearl'], dtype='object')

Of course, it is also possible to create the DataFrame first and then convert a column with astype('category').
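The two routes can be compared with a short sketch (Python 3, recent pandas; df_a and df_b are names introduced here for illustration). For a Series of category dtype, the categories and codes are reached through the .cat accessor.

```python
import pandas as pd

idx = [1, 2, 3, 5, 6, 7, 9, 4, 8]
fruit = ["apple", "pearl", "orange", "apple", "orange",
         "orange", "apple", "pearl", "orange"]
price = [5.20, 3.50, 7.30, 5.00, 7.50, 7.30, 5.20, 3.70, 7.30]

# Route A: insert a ready-made Categorical as a new column
df_a = pd.DataFrame({"price": price}, index=idx)
df_a["fruit"] = pd.Categorical(fruit)

# Route B: build the column from plain strings, then convert it
df_b = pd.DataFrame({"price": price, "fruit": fruit}, index=idx)
df_b["fruit"] = df_b["fruit"].astype("category")

print(df_a["fruit"].dtype, df_b["fruit"].dtype)  # category category
print(df_a["fruit"].cat.codes.tolist())          # [0, 2, 1, 0, 1, 1, 0, 2, 1]
print(df_b["fruit"].cat.codes.tolist())          # [0, 2, 1, 0, 1, 1, 0, 2, 1]
```

Both routes end with the same categories and codes; the only visible difference is the column order of the resulting DataFrame.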