Python Scientific Computing: Pandas Dissected

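Topics covered in this article:

Multidimensional arrays in NumPy vs. multidimensional data tables in pandas

The one-dimensional data table: Series

Five built-in functions of Series

The two-dimensional data table: DataFrame

Indexing and slicing data tables

Summary: indexing and slicing (the key part, be sure to read it!)

Advanced indexing: boolean indexing and functions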

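Multidimensional arrays in NumPy vs. multidimensional data tables in pandas

The data structures in pandas are "multidimensional data tables", and they are easiest to learn by analogy with the "multidimensional arrays" of NumPy. The 1-, 2-, and 3-dimensional data tables are called Series, DataFrame, and Panel, and they correspond to 1-, 2-, and 3-dimensional arrays. Note that Panel was deprecated and has since been removed from pandas, so the rest of this article uses only Series and DataFrame.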

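Comparing the data structures of NumPy (np) and pandas (pd) dimension by dimension, it is easy to see that

pd data table = np array + description

where

Series = 1darray + index
DataFrame = 2darray + index + columns
Panel = 3darray + index + columns + item

The point is that the labels describe what each value in the underlying array means.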
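To make the decomposition concrete, here is a minimal sketch that builds a Series and a DataFrame from plain NumPy arrays and then takes them apart again. The labels ("a" to "d", "r0", "c0", and so on) are made up for illustration and do not come from the article.

```python
import numpy as np
import pandas as pd

# Series = 1-D array + index (the labels here are hypothetical)
values_1d = np.array([27.2, 27.65, 27.70, 28.0])
s = pd.Series(values_1d, index=["a", "b", "c", "d"])

# DataFrame = 2-D array + index + columns (labels again hypothetical)
values_2d = np.arange(6).reshape(3, 2)
df = pd.DataFrame(values_2d, index=["r0", "r1", "r2"], columns=["c0", "c1"])

# to_numpy() gives back the bare array, without any labels
print(s.to_numpy())
print(df.to_numpy())

# The "description" part lives in index and columns
print(s.index.tolist())
print(df.index.tolist(), df.columns.tolist())
```

The arrays carry the numbers, while index and columns carry the names that make those numbers readable.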

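The one-dimensional data table: Series

import pandas as pd

# Four prices labeled with four consecutive calendar days
arr = pd.Series([27.2, 27.65, 27.70, 28], index=pd.date_range('20190401', periods=4))
print(arr)         # values together with their date labels
print(arr.values)  # the underlying NumPy array
print(arr.index)   # the DatetimeIndex

# Output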
2019-04-01    27.20
2019-04-02    27.65
2019-04-03    27.70
2019-04-04    28.00
Freq: D, dtype: float64
[27.2  27.65 27.7  28.  ]
DatetimeIndex(['2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04'], dtype='datetime64[ns]', freq='D')
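Because the index holds dates, an element can be fetched by its date label as well as by its position. The following is a small sketch on the same Series; the variable names by_label, by_pos, and window are illustrative only.

```python
import pandas as pd

arr = pd.Series([27.2, 27.65, 27.70, 28],
                index=pd.date_range('20190401', periods=4))

# Label-based lookup: the label is a timestamp taken from the index
by_label = arr.loc[pd.Timestamp('2019-04-02')]

# Position-based lookup: the second element (positions start at 0)
by_pos = arr.iloc[1]

# Label-based slices include both endpoints
window = arr.loc['2019-04-02':'2019-04-03']

print(by_label, by_pos)
print(window)
```

Both lookups reach the same element; only the way it is addressed differs.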

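Five built-in functions of Series

Besides a list, a NumPy array can also be used to create a Series. The example below includes a missing value, np.nan, and looks at five more attributes and built-in functions of a Series:

len: the number of elements in s
shape: the shape of s (as a tuple)
count: the number of non-NaN elements in s
unique: the distinct elements of s (NumPy arrays have no such method; the closest equivalent is the function np.unique)
value_counts: how many times each non-NaN element occurs in s (not available as a NumPy array method)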

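import numpy as np
import pandas as pd

s = pd.Series(np.array([27.2, 27.65, 27.70, 28, 28, np.nan]))
print('The length is', len(s))
print('The shape is', s.shape)
print('The count is', s.count())  # NaN is not counted
print(s.unique())                 # NaN is kept as a distinct value
print(s.value_counts())           # NaN is dropped by default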

The length is 6
The shape is (6,)
The count is 5
array([27.2 , 27.65, 27.7 , 28. , nan])
28.00 2
27.70 1
27.65 1
27.20 1
dtype: int64
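The relationship between len, count, and the missing values can be checked directly. The sketch below is not from the original article; it uses isna, nunique, and the dropna parameter of value_counts, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([27.2, 27.65, 27.70, 28, 28, np.nan]))

n_missing = s.isna().sum()                       # number of NaN entries
n_distinct = s.nunique()                         # distinct values, NaN excluded by default
counts_with_nan = s.value_counts(dropna=False)   # keep NaN as its own row

# len counts every element, count skips NaN
print(len(s) == s.count() + n_missing)
print(n_distinct)
print(counts_with_nan)
```

Passing dropna=False is useful when the number of missing entries matters as much as the values themselves.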

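The two-dimensional data table: DataFrame

Creating one from a dictionary, with a custom index

The dictionary keys become the columns of the DataFrame
The dictionary values become the values of the DataFrame
The index of the DataFrame is supplied separately (without it, pandas falls back to 0, 1, 2, ...)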

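import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
# Columns: 行业 = industry, 价格 = price, 交易量 = volume, 雇员 = employees
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)  # named df to match the indexing examples further down
df.name = '美股'        # ad hoc attribute ("US stocks"); pandas itself does not track it
df.index.name = '代号'  # name of the index ("ticker")
df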
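A quick way to see how the dictionary was mapped onto the table is to inspect its pieces. The sketch below keeps only the first three tickers and two of the columns to stay short; default_df is a hypothetical name introduced to show what happens when no index is passed.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL']
data = {'行业': ['电商', '电商', '科技'],    # industry
        '价格': [176.92, 25.95, 172.97]}    # price
df = pd.DataFrame(data, index=symbol)
df.index.name = '代号'                       # ticker

print(df.columns.tolist())   # the dictionary keys
print(df.index.tolist())     # the custom index
print(df.shape)              # (rows, columns)
print(df.dtypes)             # one dtype per column

# Without index=..., pandas uses a default 0, 1, 2, ... index
default_df = pd.DataFrame(data)
print(default_df.index.tolist())
```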

Indexing and slicing data tables

A DataFrame can be indexed or sliced by label (label-based) or by position (position-based), whereas a NumPy array can only be indexed or sliced by position.

[Figure: summary of indexing a single element]
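As a minimal sketch of single-element indexing, the example below rebuilds the stock table from the previous section and reads one cell in two ways. The names price_by_label and price_by_pos are illustrative.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

price_by_label = df.at['AAPL', '价格']   # row label, column label
price_by_pos = df.iat[2, 1]              # row position 2, column position 1

print(price_by_label, price_by_pos)
```

Both calls return the price of AAPL as a scalar.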

[Figure: summary of slicing a single column]

[Figure: summary of slicing multiple columns]
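Column selection can be sketched as follows, again on the stock table from the previous section. The variable names are illustrative, and the choice of the 价格 (price) and 交易量 (volume) columns is arbitrary.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

one_col = df['价格']                    # a single name returns a Series
one_col_frame = df[['价格']]            # a list of names returns a DataFrame

by_label = df.loc[:, '价格':'交易量']   # label slice, both endpoints included
by_pos = df.iloc[:, 1:3]                # position slice, end excluded

print(type(one_col).__name__, type(one_col_frame).__name__)
print(by_label.columns.tolist())
```

The label slice and the position slice select the same two columns, which shows the inclusive versus exclusive endpoint rule.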

[Figure: summary of slicing a single index (row)]

[Figure: summary of slicing multiple index values (rows)]
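Row selection follows the same pattern. The sketch below uses the stock table from the previous section; the variable names and the choice of rows are illustrative.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

one_row = df.loc['JD']            # a single label returns a Series
one_row_frame = df.loc[['JD']]    # a list of labels returns a DataFrame

by_label = df.loc['JD':'MS']      # label slice, both endpoints included
by_pos = df.iloc[1:4]             # position slice, end excluded

print(one_row)
print(by_label.index.tolist())
```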

[Figure: summary of slicing index and columns together]
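Rows and columns can be restricted in a single call. This is a minimal sketch on the stock table from the previous section, with an arbitrary choice of rows and columns.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

# Rows first, columns second, separated by a comma
by_label = df.loc['JD':'MS', ['价格', '雇员']]
by_pos = df.iloc[1:4, [1, 3]]

print(by_label)
```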

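Summary: the least error-prone way to index and slice

Indexing and slicing data tables: there are many ways to index or slice a DataFrame. The easiest to remember and the least error-prone are the label-based at and loc and the position-based iat and iloc. Specifically, use at and iat to index a single element, and loc and iloc to slice. The versions with an i are position-based; the versions without an i are label-based. (All of them take the form [index, columns].)

In practice, though, indexing and slicing are often done without at and loc.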
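The figure that originally illustrated this point is not available, so the following is only a guess at the kind of shorthand meant: the plain [] operator and attribute access, both of which pandas supports. It uses the stock table from the previous section.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

col_a = df['价格']             # a name inside [] selects a column
col_b = df.价格                # attribute access to the same column

rows_by_label = df['JD':'MS']  # a slice inside [] selects rows; label endpoints included
rows_by_pos = df[1:4]          # an integer slice selects rows by position

print(col_a.equals(col_b))
print(rows_by_label.index.tolist())
```

The shorthand is convenient, but the same [] switches between columns and rows depending on what is passed in, which is why loc and iloc are the safer habit.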

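Advanced indexing

Beyond the methods above, boolean indexing and function-based indexing are also available:

Boolean indexing:

To filter out companies with fewer than 100,000 employees (the 雇员 column), we can use loc with a boolean index.

print(df.雇员 >= 100000)
df.loc[df.雇员 >= 100000, :]  # rows that fail the condition are dropped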

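Now for a less common example: suppose we want to find all columns whose values are of integer type.

print(df.dtypes == 'int64')
df.loc[:, df.dtypes == 'int64']  # columns that fail the condition are dropped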
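The two boolean filters can be run end to end as in the sketch below, which rebuilds the stock table first. The names mask, int_cols, big, small, and ints are illustrative, and the code assumes that the integer columns are stored as int64.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

mask = df['雇员'] >= 100000        # one True/False per row
big = df.loc[mask, :]              # keep rows where the mask is True
small = df.loc[~mask, :]           # ~ negates the mask

int_cols = df.dtypes == 'int64'    # one True/False per column (assumes int64 storage)
ints = df.loc[:, int_cols]         # keep columns where the mask is True

print(big.index.tolist())
print(small.index.tolist())
print(ints.columns.tolist())
```

A row mask goes in the first slot of loc and a column mask in the second, so both can also be combined as df.loc[mask, int_cols].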

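Calling a function:

To find all companies whose volume (the 交易量 column) is above the average volume, we can use loc with an anonymous function (here x stands for df).

df.loc[lambda x: x.交易量 > x.交易量.mean(), :]  # rows that fail the condition are dropped

Building on that, add one more condition: the price (the 价格 column) must be above 100 (x still stands for df).

df.loc[lambda x: (x.交易量 > x.交易量.mean())
                 & (x.价格 > 100), :]

Finally, look at the stocks priced above 100 (note that here x stands for df.价格).

df.价格.loc[lambda x: x > 100]  # filters on a single column only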

Output
代号
BABA 176.92
AAPL 172.97
GS 196.00
Name: 价格, dtype: float64
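One situation where a callable is especially handy is a method chain, where the intermediate table has no variable name to refer to. The sketch below is an illustration rather than part of the original article; it rebuilds the stock table, and the names finance_over_100 and prices_over_100 are hypothetical.

```python
import pandas as pd

symbol = ['BABA', 'JD', 'AAPL', 'MS', 'GS', 'WMT']
data = {'行业': ['电商', '电商', '科技', '金融', '金融', '零售'],
        '价格': [176.92, 25.95, 172.97, 41.79, 196.00, 99.55],
        '交易量': [16175610, 27113291, 18913154, 10132145, 2626634, 8086946],
        '雇员': [101550, 175336, 100000, 60348, 36600, 2200000]}
df = pd.DataFrame(data, index=symbol)

finance_over_100 = (
    df[df['行业'] == '金融']               # step 1: finance companies only
      .loc[lambda x: x['价格'] > 100]      # step 2: x is the filtered table, not df
)

# Same filter as the last example in the text, written with [] column access
prices_over_100 = df['价格'].loc[lambda x: x > 100]

print(finance_over_100.index.tolist())
print(prices_over_100.index.tolist())
```

In the chain, the lambda receives whatever the previous step produced, so the condition is evaluated on the already filtered rows.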
