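pandas is indispensable for data analysis in Python.

As the workhorse of Python data analysis, pandas is known for being fast and efficient.

To make data easier to work with, pandas defines two data types of its own: Series and DataFrame.

The examples below assume the following imports (NumPy is needed for the np calls):

import numpy as np
import pandas as pd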

Series 

A Series can be thought of as a single column of data.
In [4]: s = pd.Series(np.random.randn(4), name='daily returns')

In [5]: s
Out[5]:
0    0.430271
1    0.617328
2   -0.265421
3   -0.836113
Name: daily returns
The index starts at zero, just like a list.
A Series is built on NumPy's array structure, so it supports similar operations.
In [6]: s * 100
Out[6]:
0    43.027108
1    61.732829
2   -26.542104
3   -83.611339
Name: daily returns

In [7]: np.abs(s)
Out[7]:
0    0.430271
1    0.617328
2    0.265421
3    0.836113
Name: daily returns
A Series also offers higher-level features, for example:
In [8]: s.describe()
Out[8]:
count    4.000000
mean    -0.013484
std      0.667092
min     -0.836113
25%     -0.408094
50%      0.082425
75%      0.477035
max      0.617328
describe() returns a summary of descriptive statistics.
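Each entry of such a summary can also be computed on its own. The following is a minimal sketch using a small made-up Series; the values 1 to 4 are illustrative and are not the random returns from the session above.

```python
import pandas as pd

# Illustrative values, not the random returns used above.
s = pd.Series([1.0, 2.0, 3.0, 4.0], name='example')

summary = s.describe()

# The same numbers obtained through individual methods.
mean = s.mean()
std = s.std()            # sample standard deviation (ddof=1)
q25 = s.quantile(0.25)   # linear interpolation between data points

print(summary)
print(mean, std, q25)
```

The summary itself is a Series, so single statistics can be read by label, e.g. summary['mean'].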
The index can also take richer forms than plain integers.
In [9]: s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']

In [10]: s
Out[10]:
AMZN    0.430271
AAPL    0.617328
MSFT   -0.265421
GOOG   -0.836113
Name: daily returns
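Seen this way, a Series also resembles a dictionary, with the restriction that all of its values share one type (floats in this case).
A Series supports several dictionary-like operations as well: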
In [11]: s['AMZN']
Out[11]: 0.43027108469945924

In [12]: s['AMZN'] = 0

In [13]: s
Out[13]:
AMZN    0.000000
AAPL    0.617328
MSFT   -0.265421
GOOG   -0.836113
Name: daily returns

In [14]: 'AAPL' in s
Out[14]: True
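The dictionary analogy also works in the other direction: a Series can be built from a dict. The sketch below uses hypothetical return values chosen for illustration, and shows where a Series goes beyond a dict.

```python
import pandas as pd

# Hypothetical values chosen for illustration; not from the session above.
returns = pd.Series({'AMZN': 0.5, 'AAPL': -0.25, 'MSFT': 1.0}, name='daily returns')

# Label lookup and membership tests work like a dict.
aapl = returns['AAPL']
has_goog = 'GOOG' in returns

# .get returns a default instead of raising KeyError for a missing label.
goog = returns.get('GOOG', 0.0)

# Unlike a dict, arithmetic applies to every value at once.
scaled = returns * 100

print(aapl, has_goog, goog)
print(scaled)
```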

DataFrame

If a Series is one column of data, a DataFrame is several columns.

Reading a CSV file into a DataFrame is straightforward. Suppose we have the following CSV file, test_pwt.csv:

"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
The read_csv() function reads the file, and its contents become a DataFrame.

In [28]: df = pd.read_csv('data/test_pwt.csv')

In [29]: type(df)
Out[29]: pandas.core.frame.DataFrame

In [30]: df
Out[30]:
         country country isocode  year          POP       XRAT           tcgdp         cc         cg
0      Argentina             ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
1      Australia             AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
2          India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3         Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4         Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
5   South Africa             ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
6  United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
7        Uruguay             URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068
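To try this without creating a file on disk, the same text can be fed to read_csv() through an in-memory buffer. This is a sketch under the assumption that only the header and the first three rows shown above are needed; io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Header and first three rows copied from test_pwt.csv as shown above.
csv_text = '''"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
'''

# read_csv accepts any file-like object, not only a path.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)   # quoted numbers are still parsed as numeric columns
```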
Slicing a DataFrame by row position returns another DataFrame:

In [13]: df[2:5]
Out[13]:
  country country isocode  year          POP       XRAT           tcgdp         cc         cg
2   India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3  Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4  Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
Columns of a DataFrame are usually selected by name:

In [14]: df[['country', 'tcgdp']]
Out[14]:
         country           tcgdp
0      Argentina   295072.218690
1      Australia   541804.652100
2          India  1728144.374800
3         Israel   129253.894230
4         Malawi     5026.221784
5   South Africa   227242.369490
6  United States  9898700.000000
7        Uruguay    25255.961693
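To select specific rows and specific columns at the same time, use .loc (the .ix indexer from older pandas versions has been removed). With .loc, the slice 2:5 refers to index labels and includes both endpoints:

In [21]: df.loc[2:5, ['country', 'tcgdp']]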
Out[21]:
        country           tcgdp
2         India  1728144.374800
3        Israel   129253.894230
4        Malawi     5026.221784
5  South Africa   227242.369490
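The four rows in that result, compared with the three rows of df[2:5] earlier, come from the difference between label-based and position-based slicing. The sketch below rebuilds two columns from the table above and contrasts .loc with .iloc.

```python
import pandas as pd

# Two columns for the eight countries, values taken from the table above.
df = pd.DataFrame({
    'country': ['Argentina', 'Australia', 'India', 'Israel',
                'Malawi', 'South Africa', 'United States', 'Uruguay'],
    'tcgdp': [295072.218690, 541804.652100, 1728144.374800, 129253.894230,
              5026.221784, 227242.369490, 9898700.000000, 25255.961693],
})

by_label = df.loc[2:5, ['country', 'tcgdp']]   # labels 2, 3, 4 and 5
by_position = df.iloc[2:5, [0, 1]]             # positions 2, 3 and 4

print(len(by_label), len(by_position))
```

With the default integer index, labels and positions coincide, so the only visible difference is whether the stop value is included.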
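The pop() method removes one column from a DataFrame and returns it as a Series. (The output below lists only POP and tcgdp, so df was evidently reduced to the columns country, POP and tcgdp beforehand, for example with df = df[['country', 'POP', 'tcgdp']].)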

In [34]: countries = df.pop('country')

In [35]: type(countries)
Out[35]: pandas.core.series.Series

In [36]: countries
Out[36]:
0        Argentina
1        Australia
2            India
3           Israel
4           Malawi
5     South Africa
6    United States
7          Uruguay
Name: country

In [37]: df
Out[37]:
           POP           tcgdp
0    37335.653   295072.218690
1    19053.186   541804.652100
2  1006300.297  1728144.374800
3     6114.570   129253.894230
4    11801.505     5026.221784
5    45064.098   227242.369490
6   282171.957  9898700.000000
7     3219.793    25255.961693

In [38]: df.index = countries

In [39]: df
Out[39]:
                       POP           tcgdp
country
Argentina        37335.653   295072.218690
Australia        19053.186   541804.652100
India          1006300.297  1728144.374800
Israel            6114.570   129253.894230
Malawi           11801.505     5026.221784
South Africa     45064.098   227242.369490
United States   282171.957  9898700.000000
Uruguay           3219.793    25255.961693
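The pop-then-assign steps can also be written as a single call to set_index(). This is an alternative sketch, not the approach used in the session above; the frame is rebuilt from the first three rows of the data.

```python
import pandas as pd

# First three rows of the reduced table shown above.
df = pd.DataFrame({
    'country': ['Argentina', 'Australia', 'India'],
    'POP': [37335.653, 19053.186, 1006300.297],
    'tcgdp': [295072.218690, 541804.652100, 1728144.374800],
})

# set_index moves the column into the index and returns a new frame;
# the original df keeps its 'country' column.
indexed = df.set_index('country')

print(indexed)
```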
Renaming the columns of a DataFrame:

In [40]: df.columns = 'population', 'total GDP'

In [41]: df
Out[41]:
                population       total GDP
country
Argentina        37335.653   295072.218690
Australia        19053.186   541804.652100
India          1006300.297  1728144.374800
Israel            6114.570   129253.894230
Malawi           11801.505     5026.221784
South Africa     45064.098   227242.369490
United States   282171.957  9898700.000000
Uruguay           3219.793    25255.961693
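Assigning to df.columns replaces every name by position. When only some columns should change, or when the order is uncertain, rename() with an explicit mapping is a safer option. A short sketch on two rows of the data:

```python
import pandas as pd

# Two rows of the table above, indexed by country.
df = pd.DataFrame(
    {'POP': [37335.653, 19053.186], 'tcgdp': [295072.218690, 541804.652100]},
    index=pd.Index(['Argentina', 'Australia'], name='country'),
)

# rename maps old names to new ones and returns a new frame.
renamed = df.rename(columns={'POP': 'population', 'tcgdp': 'total GDP'})

print(renamed.columns.tolist())
```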
Applying arithmetic to a whole column:

In [66]: df['population'] = df['population'] * 1e3

In [67]: df
Out[67]:
                population       total GDP
country
Argentina        37335653    295072.218690
Australia        19053186    541804.652100
India          1006300297   1728144.374800
Israel            6114570    129253.894230
Malawi           11801505      5026.221784
South Africa     45064098    227242.369490
United States   282171957   9898700.000000
Uruguay           3219793     25255.961693
Creating a new column from existing data:

In [74]: df['GDP percap'] = df['total GDP'] * 1e6 / df['population']

In [75]: df
Out[75]:
               population       total GDP    GDP percap
country
Argentina        37335653   295072.218690   7903.229085
Australia        19053186   541804.652100  28436.433261
India          1006300297  1728144.374800   1717.324719
Israel            6114570   129253.894230  21138.672749
Malawi           11801505     5026.221784    425.896679
South Africa     45064098   227242.369490   5042.647686
United States   282171957  9898700.000000  35080.381854
Uruguay           3219793    25255.961693   7843.970620
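The two scaling factors can be checked on a couple of rows. The sketch assumes, as the factors 1e3 and 1e6 suggest, that POP is recorded in thousands of people and tcgdp in millions; the source does not state the units explicitly.

```python
import pandas as pd

# Two rows from the original table (assumed units: thousands, millions).
df = pd.DataFrame(
    {'POP': [37335.653, 282171.957], 'tcgdp': [295072.218690, 9898700.0]},
    index=pd.Index(['Argentina', 'United States'], name='country'),
)

# Convert population to persons, then divide GDP (converted to units) by it.
result = df.assign(population=df['POP'] * 1e3)
result['GDP percap'] = result['tcgdp'] * 1e6 / result['population']

print(result[['population', 'GDP percap']])
```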
DataFrame has built-in plotting based on matplotlib:

In [76]: df['GDP percap'].plot(kind='bar')
Out[76]: <matplotlib.axes.AxesSubplot at 0x2f22ed0>

In [77]: import matplotlib.pyplot as plt

In [78]: plt.show()

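Sorting (current pandas sorts by column values with sort_values(); sort_index() no longer accepts by=):

In [83]: df = df.sort_values(by='GDP percap', ascending=False)  # sort by GDP percap, descending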

In [84]: df
Out[84]:
               population       total GDP    GDP percap
country
United States   282171957  9898700.000000  35080.381854
Australia        19053186   541804.652100  28436.433261
Israel            6114570   129253.894230  21138.672749
Argentina        37335653   295072.218690   7903.229085
Uruguay           3219793    25255.961693   7843.970620
South Africa     45064098   227242.369490   5042.647686
India          1006300297  1728144.374800   1717.324719
Malawi           11801505     5026.221784    425.896679
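The ranking can be reproduced on the GDP per capita column alone. The sketch below uses the values from the table above and also shows nlargest(), which is convenient when only the top few entries matter.

```python
import pandas as pd

# GDP per capita values copied from the table above.
percap = pd.Series({
    'Argentina': 7903.229085,
    'Australia': 28436.433261,
    'India': 1717.324719,
    'Israel': 21138.672749,
    'Malawi': 425.896679,
    'South Africa': 5042.647686,
    'United States': 35080.381854,
    'Uruguay': 7843.970620,
}, name='GDP percap')

ranked = percap.sort_values(ascending=False)   # full descending order
top3 = percap.nlargest(3)                      # only the three largest

print(ranked.index.tolist())
print(top3)
```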
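Using online data

With the urllib library, online data can be fetched automatically and handed to pandas, so there is no need to download the file by hand. On Python 3 the urllib2 module is replaced by urllib.request. Because header=None is passed, the file's own header row (DATE, VALUE) is read as data, which is why it appears in the output below and why describe() reports string-style statistics.

import urllib.request

url = 'http://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv'
source = urllib.request.urlopen(url)
data = pd.read_csv(source, index_col=0, parse_dates=True, header=None)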
In [71]: type(data)
Out[71]: pandas.core.frame.DataFrame

In [72]: data.head()  # A useful method to get a quick look at a data frame
Out[72]:
                1
0
DATE        VALUE
1948-01-01    3.4
1948-02-01    3.8
1948-03-01    4.0
1948-04-01    3.9

In [73]: data.describe()  # Your output might differ slightly
Out[73]:
          1
count   786
unique   81
top     5.4
freq     31
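The DATE/VALUE row in head() and the count/unique/top/freq summary both indicate that the values were read as strings, a consequence of header=None. The sketch below assumes the file starts with a DATE,VALUE header line, as the output suggests, and uses the four rows shown above in an in-memory buffer instead of the network.

```python
import io
import pandas as pd

# Header plus the four data rows visible in the head() output above.
csv_text = '''DATE,VALUE
1948-01-01,3.4
1948-02-01,3.8
1948-03-01,4.0
1948-04-01,3.9
'''

# Without header=None, the first line supplies the column names,
# the index is parsed as dates, and VALUE becomes a float column.
data = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(data.dtypes)
print(data.describe())
```

With a numeric VALUE column, describe() reports mean, std and quantiles instead of unique/top/freq.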
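Some data sets can also be fetched directly through DataReader. It used to live in pandas.io.data and is now provided by the separate pandas-datareader package. The same series as above can be obtained like this:

In [77]: import pandas_datareader.data as web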

In [78]: import datetime as dt  # Standard Python date / time library

In [79]: start, end = dt.datetime(2006, 1, 1), dt.datetime(2012, 12, 31)

In [80]: data = web.DataReader('UNRATE', 'fred', start, end)

In [81]: type(data)
Out[81]: pandas.core.frame.DataFrame

In [82]: data.plot()
Out[82]: <matplotlib.axes.AxesSubplot at 0xcf79390>

In [83]: import matplotlib.pyplot as plt

In [84]: plt.show()