The Pandas Library

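Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools.
Pandas is built on NumPy and is often used together with NumPy and Matplotlib.
Import convention: import pandas as pd
Pandas provides two data types that form its foundation: Series (one-dimensional) and DataFrame (two-dimensional; higher-dimensional data can be represented with hierarchical indexes).
These types support basic operations, arithmetic operations, feature operations (extracting features of the data), and relational operations (mining relationships within the data).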

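Series = index + one-dimensional data

Series index

Automatic index or custom index (index=[ ]): pd.Series([ ], index=[ ]). The index= keyword can be omitted when the index is passed as the second positional argument.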

import pandas as pd

# Series with a custom index
a = pd.Series([9,8,7,6],index=['1','2','3','4'])
a

1    9
2    8
3    7
4    6
dtype: int64

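Creating a Series

From a Python list: the index must have as many entries as the list has elements
From a scalar: the number of index entries determines the size of the Series
From a Python dictionary: the keys become the index; an explicit index selects entries from the dictionary
From an ndarray: both the data and the index can be created from ndarray objects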

# Python list: the index has as many entries as the list has elements
a = pd.Series([6,7,8])
a

0    6
1    7
2    8
dtype: int64

# Scalar: the number of index entries determines the size of the Series
a = pd.Series(25,index=['a','b','c'])
a

a    25
b    25
c    25
dtype: int64

# Python dictionary: the keys become the index; an explicit index selects entries from the dictionary
a = pd.Series({'a':8,'b':7})
a

a    8
b    7
dtype: int64
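
The dictionary example above does not pass an explicit index, so the "selection" behaviour is not visible there. The following is a minimal sketch of it; the variable names `data` and `s` are illustrative and not part of the original notes.

```python
import pandas as pd

# The dictionary supplies the data; `index` selects keys and fixes their order.
data = {'a': 8, 'b': 7}
s = pd.Series(data, index=['c', 'a', 'b', 'd'])

# 'c' and 'd' are not keys of the dictionary, so those entries are NaN.
# Because NaN is a float, the dtype of the whole Series becomes float64.
print(s)
```

Labels that are requested but missing from the dictionary are kept and filled with NaN rather than raising an error.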

import numpy as np
# ndarray: both the data and the index can be created from ndarray objects
a = pd.Series(np.arange(5),index=np.zeros(5,int))
a

0    0
0    1
0    2
0    3
0    4
dtype: int32
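
Note that the index built with np.zeros gives every row the same label, 0. A small sketch of what that implies for lookups, assuming the same Series as above (the names `unique`, `rows_for_zero`, and `third_value` are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.Series(np.arange(5), index=np.zeros(5, int))

# Index labels do not have to be unique; here all five labels are 0.
unique = a.index.is_unique

# A label lookup returns every row that carries the label.
rows_for_zero = a.loc[0]

# Positional access still addresses a single row.
third_value = a.iloc[2]

print(unique, len(rows_for_zero), third_value)
```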

Basic operations on a Series

A Series consists of two parts, b.index and b.values, and supports indexing, slicing, and arithmetic.

b = pd.Series([9,8,7,6],index=['a','b','c','d'])

# Get the index
b.index

Index(['a', 'b', 'c', 'd'], dtype='object')

# Get the data
b.values

array([9, 8, 7, 6], dtype=int64)

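# Automatic (positional) index; positions are counted from 0.
# Recent pandas versions deprecate integer positions inside b[...] when the
# index holds labels; b.iloc[1] is the explicit positional form.
b[1]

# Custom (label) index
b['b']

# The automatic and custom indexes coexist, but they cannot be mixed in one selection
b[[0,1,2,0]]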

a    9
b    8
c    7
a    9
dtype: int64

b[['a','b','c','d']]

a    9
b    8
c    7
d    6
dtype: int64

# Mixing the two kinds of index raises an error
b[['a','b',0]]

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\AppData\Local\Temp\ipykernel_15204\316410750.py in <module>
      1 # Mixing the two kinds of index raises an error
----> 2 b[['a','b',0]]


D:\anaconda\envs\python37\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    964             return self._get_values(key)
    965 
--> 966         return self._get_with(key)
    967 
    968     def _get_with(self, key):


D:\anaconda\envs\python37\lib\site-packages\pandas\core\series.py in _get_with(self, key)
   1004 
   1005         # handle the dup indexing case GH#4246
-> 1006         return self.loc[key]
   1007 
   1008     def _get_values_tuple(self, key):


D:\anaconda\envs\python37\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):


D:\anaconda\envs\python37\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1151                     raise ValueError("Cannot index with multidimensional key")
   1152 
-> 1153                 return self._getitem_iterable(key, axis=axis)
   1154 
   1155             # nested tuple slicing


D:\anaconda\envs\python37\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
   1091 
   1092         # A collection of keys
-> 1093         keyarr, indexer = self._get_listlike_indexer(key, axis)
   1094         return self.obj._reindex_with_indexers(
   1095             {axis: [keyarr, indexer]}, copy=True, allow_dups=True


D:\anaconda\envs\python37\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(


D:\anaconda\envs\python37\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 1377             raise KeyError(f"{not_found} not in index")
   1378 
   1379 


KeyError: '[0] not in index'
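
One way to avoid this ambiguity is to state the intent explicitly with .loc (labels) and .iloc (integer positions). This is a minimal sketch rather than the approach used in the original notes; the result names are illustrative.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Label-based selection: every key must be an index label.
by_label = b.loc[['a', 'b']]

# Position-based selection: every key is an integer position counted from 0.
by_position = b.iloc[[0, 1, 2, 0]]

# A single element by position.
second = b.iloc[1]

print(by_label)
print(by_position)
print(second)
```

Because each accessor accepts only one kind of key, a mixed list such as ['a', 'b', 0] is rejected for a clear reason instead of depending on how b[...] guesses.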

# Slicing
b = pd.Series([9,8,7,6],index=['a','b','c','d'])
b[:3]

a    9
b    8
c    7
dtype: int64

# Boolean filtering: keep the values above the median
b[b > b.median()]

a    9
b    8
dtype: int64
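
The filter works in two steps: the comparison produces a boolean Series, and that Series is then used as a mask. A short sketch that makes the intermediate step visible (the names `mask`, `selected`, and `combined` are illustrative):

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Step 1: the comparison is evaluated element by element.
mask = b > b.median()          # the median of 9, 8, 7, 6 is 7.5

# Step 2: only the rows where the mask is True are kept.
selected = b[mask]

# Conditions are combined with & and |, each wrapped in parentheses.
combined = b[(b > 6) & (b < 9)]

print(mask)
print(selected)
print(combined)
```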

# Arithmetic: NumPy functions are applied element by element and the index is kept
np.exp(b)

a    8103.083928
b    2980.957987
c    1096.633158
d     403.428793
dtype: float64

b['c']

# The in operator tests the index labels, not the values
'c' in b

True

7 in b

False
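
Since `in` looks only at the index, checking whether a value is present needs a different expression. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

label_present = 'c' in b            # 'c' is an index label
value_in_index = 7 in b             # 7 is a value, not a label
value_present = 7 in b.values       # search the underlying array instead
value_present_isin = bool(b.isin([7]).any())   # the same check using pandas

print(label_present, value_in_index, value_present, value_present_isin)
```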

# get() returns the value for a label, or the given default if the label is missing
b.get('c', 100)

7

b.get('f',100)

100

b

a    9
b    8
c    7
d    6
dtype: int64

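Alignment of Series

A Series carries an index, so arithmetic between Series is aligned on the index labels, which is more precise and less error-prone; NumPy arithmetic is based on dimensions, that is, on element positions.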
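
The difference can be sketched with two small Series whose indexes only partly overlap. The Series `x` and `y` and the result names are made up for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# pandas pairs values by label. Labels present on only one side give NaN.
aligned = x + y

# NumPy pairs values by position and ignores the labels entirely.
positional = x.to_numpy() + y.to_numpy()

# add() with fill_value treats a label missing on one side as 0.
filled = x.add(y, fill_value=0)

print(aligned)
print(positional)
print(filled)
```

Here `aligned` has the labels a, b, c, d with values only for b and c, while `positional` silently adds 'a' to 'b', 'b' to 'c', and so on.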

The name attribute of a Series

Both a Series object and its index can have a name, stored in the .name attribute: b.name for the Series and b.index.name for the index.

b.name = 'object'
b.index.name = 'index'
b

index
a    9
b    8
c    7
d    6
Name: object, dtype: int64

Modifying a Series

A Series object can be modified at any time, and the change takes effect immediately.

b = pd.Series([8,9,6,3],['a','b','c','d'])
# Change the value of b['a']
b['a'] = 15
b

a    15
b     9
c     6
d     3
dtype: int64
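
Assignment is not limited to one existing label. The sketch below, which uses .loc as an explicit label-based accessor, also updates several labels at once and adds a new one; the label 'e' is invented for illustration.

```python
import pandas as pd

b = pd.Series([8, 9, 6, 3], index=['a', 'b', 'c', 'd'])

b.loc['a'] = 15            # change a single value
b.loc[['b', 'c']] = 20     # change several values at once
b.loc['e'] = 1             # assigning to a new label appends an entry

print(b)
```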

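DataFrame = row and column indexes + two-dimensional data

A DataFrame is a two-dimensional labeled array. Its basic operations resemble those of a Series, and values are accessed through the row and column indexes.

A DataFrame is a tabular data type in which each column can hold a different value type. It has both a row index and a column index. It is usually used for two-dimensional data, and it can also represent higher-dimensional data (for example, through hierarchical indexes).

Creating a DataFrame

From a two-dimensional ndarray
From a dictionary of one-dimensional ndarrays, lists, dictionaries, tuples, or Series
From a Series
From another DataFrame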

# Create from a two-dimensional ndarray
d = pd.DataFrame(np.arange(10).reshape(2,5))
d

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9

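# Create from two one-dimensional Series organized as a dictionary. The keys become the column labels (left to right); the Series indexes are aligned into the row labels (top to bottom), and missing entries become NaN.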
dt = {
    'one': pd.Series([1,2,3],['a','b','c']),
    'two': pd.Series([8,9,9,3],['a','b','c','d'])
}
d = pd.DataFrame(dt)
d

	one	two
a	1.0	8
b	2.0	9
c	3.0	9
d	NaN	3

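# Create from a dictionary of lists. Only the data has to be given; pandas fills in the indexes automatically (the focus is on the data)
d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(d1)  # the default index is the integers 0, 1, 2, 3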
d.index=['a','b','c','d']
d

	one	two
a	1	9
b	2	8
c	3	7
d	4	6
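
The last two creation routes in the list, from a Series and from another DataFrame, are not shown above. A minimal sketch with illustrative names:

```python
import pandas as pd

# A named Series becomes a one-column DataFrame; the name is the column label.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='one')
from_series = pd.DataFrame(s)

# Series.to_frame() is an equivalent spelling.
same = s.to_frame()

# An existing DataFrame can be passed to the constructor; copy=True asks for
# an independent copy of the data.
from_frame = pd.DataFrame(from_series, copy=True)

print(from_series)
print(from_frame)
```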

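# Columns: 城市 (city), 环比 (month-over-month index), 同比 (year-over-year index), 定基 (fixed-base index)
dl = {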
'城市':['北京','上海','广州','深圳','沈阳'],
'环比':[101.5,101.2,101.3,102.0,100.1],
'同比':[120.7,127.3,119.4,140.9,101.4],
'定基':[121.4,127.8,120.0,145.5,101.6]}

d = pd.DataFrame(dl,index=['c1','c2','c3','c4','c5'])
d

	城市	环比	同比	定基
c1	北京	101.5	120.7	121.4
c2	上海	101.2	127.3	127.8
c3	广州	101.3	119.4	120.0
c4	深圳	102.0	140.9	145.5
c5	沈阳	100.1	101.4	101.6

d.index

Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

d.columns

Index(['城市', '环比', '同比', '定基'], dtype='object')

d.values

array([['北京', 101.5, 120.7, 121.4],
       ['上海', 101.2, 127.3, 127.8],
       ['广州', 101.3, 119.4, 120.0],
       ['深圳', 102.0, 140.9, 145.5],
       ['沈阳', 100.1, 101.4, 101.6]], dtype=object)

# Get a column
d['同比']

c1    120.7
c2    127.3
c3    119.4
c4    140.9
c5    101.4
Name: 同比, dtype: float64

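# Get a row
# .ix has been removed from pandas; use .loc (labels) or .iloc (integer positions)
# .loc selects by row label. With the default integer index the integers themselves are the labels, so .loc accepts them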
d.loc['c2']

城市       上海
环比    101.2
同比    127.3
定基    127.8
Name: c2, dtype: object
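
To make the distinction between label-based and position-based row access concrete, here is a small sketch that reuses the one/two table from earlier. The variable names are illustrative.

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]},
                 index=['a', 'b', 'c', 'd'])

row_by_label = d.loc['b']       # row whose label is 'b'
row_by_position = d.iloc[1]     # second row, counted from 0

# Label slices include the end label; position slices exclude the end position.
rows_by_label = d.loc['a':'c']  # rows a, b, c
rows_by_position = d.iloc[0:2]  # rows a, b

# With the default integer index, the integers are the labels,
# which is why .loc accepts a number in that case.
d0 = pd.DataFrame({'one': [1, 2, 3, 4]})
value = d0.loc[1, 'one']

print(row_by_label, rows_by_label, rows_by_position, value, sep='\n')
```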

# Get a single value (select the column first, then the row label)
d['同比']['c2']

127.3
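
Chained indexing like this is fine for reading a value. For a single cell, and especially for assignment, addressing the row and column in one step with .loc or .at is usually preferable. A sketch on a two-row subset of the table above (the new value 128.0 is made up for illustration):

```python
import pandas as pd

d = pd.DataFrame({'城市': ['北京', '上海'], '同比': [120.7, 127.3]},
                 index=['c1', 'c2'])

chained = d['同比']['c2']        # column first, then row
direct = d.loc['c2', '同比']     # row label and column label in one step
fast = d.at['c2', '同比']        # scalar-only accessor

# Assigning through .loc modifies the DataFrame itself.
d.loc['c2', '同比'] = 128.0

print(chained, direct, fast, d.at['c2', '同比'])
```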