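For data analysis in Python, pandas is indispensable.
pandas is a widely used Python data analysis library, known for being fast and efficient.
To make data easier to work with, pandas defines its own data types: Series and DataFrame.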
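pandas is conventionally imported as shown below. The examples in this post also use NumPy, so it is imported as well:
import numpy as np
import pandas as pd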
Series
A Series can be thought of as holding a single column of data.

In [4]: s = pd.Series(np.random.randn(4), name='daily returns')

In [5]: s
Out[5]:
0    0.430271
1    0.617328
2   -0.265421
3   -0.836113
Name: daily returns

The index starts at zero, as with a Python list.
A Series is built on NumPy's array structure, so it supports similar operations.

In [6]: s * 100
Out[6]:
0    43.027108
1    61.732829
2   -26.542104
3   -83.611339
Name: daily returns

In [7]: np.abs(s)
Out[7]:
0    0.430271
1    0.617328
2    0.265421
3    0.836113
Name: daily returns

A Series also offers higher-level features. For example:

In [8]: s.describe()
Out[8]:
count    4.000000
mean    -0.013484
std      0.667092
min     -0.836113
25%     -0.408094
50%      0.082425
75%      0.477035
max      0.617328

This returns descriptive summary statistics for the data.
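
The array-style operations and describe() shown above can be reproduced without random data. The following is a minimal sketch that reuses the four values printed in the session, so the results are repeatable; it is an illustration, not part of the original session.

```python
import numpy as np
import pandas as pd

# The four values printed in the session above, used here instead of random draws.
s = pd.Series([0.430271, 0.617328, -0.265421, -0.836113], name='daily returns')

scaled = s * 100        # element-wise arithmetic, as with a NumPy array
magnitudes = np.abs(s)  # NumPy ufuncs accept a Series and return a Series
summary = s.describe()  # count, mean, std, min, quartiles, max

print(scaled)
print(magnitudes)
print(summary)
```

Both the arithmetic and the NumPy call keep the original index and return a Series rather than a bare array.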
The index can also take richer forms than plain integers.

In [9]: s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']

In [10]: s
Out[10]:
AMZN    0.430271
AAPL    0.617328
MSFT   -0.265421
GOOG   -0.836113
Name: daily returns

Seen this way, a Series also resembles a dictionary, except that its values are expected to share a single type.
A Series supports some dictionary-like operations as well:

In [11]: s['AMZN']
Out[11]: 0.43027108469945924

In [12]: s['AMZN'] = 0

In [13]: s
Out[13]:
AMZN    0.000000
AAPL    0.617328
MSFT   -0.265421
GOOG   -0.836113
Name: daily returns

In [14]: 'AAPL' in s
Out[14]: True

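
The dictionary analogy above also works in the other direction: a Series can be built from a dict. The sketch below reuses the values from the session; the label 'IBM' is a hypothetical key chosen only to show a failed lookup.

```python
import pandas as pd

# Build the Series directly from a dict; keys become the index labels.
s = pd.Series(
    {'AMZN': 0.430271, 'AAPL': 0.617328, 'MSFT': -0.265421, 'GOOG': -0.836113},
    name='daily returns',
)

has_aapl = 'AAPL' in s   # membership is tested against the index labels
missing = s.get('IBM')   # hypothetical label; get() returns None when it is absent
s['AMZN'] = 0            # assignment by label, as with a dict

print(has_aapl, missing)
print(s)
```

Unlike a dict, the Series keeps one dtype for all values, so the assigned 0 is stored as the float 0.0.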
DataFrame
Reading a CSV file into a DataFrame is straightforward. Suppose we have the following CSV file, test_pwt.csv:

"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"

The read_csv() function reads the file in a single call, and its contents become a DataFrame.

In [28]: df = pd.read_csv('data/test_pwt.csv')

In [29]: type(df)
Out[29]: pandas.core.frame.DataFrame

In [30]: df
Out[30]:
         country country isocode  year          POP       XRAT           tcgdp         cc         cg
0      Argentina             ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
1      Australia             AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
2          India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3         Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4         Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
5   South Africa             ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
6  United States             USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
7        Uruguay             URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068

A DataFrame can be sliced by row position, and the result is still a DataFrame:

In [13]: df[2:5]
Out[13]:
  country country isocode  year          POP       XRAT           tcgdp         cc         cg
2   India             IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3  Israel             ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4  Malawi             MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954

To select columns of a DataFrame, index it with a list of column names:

In [14]: df[['country', 'tcgdp']]
Out[14]:
         country           tcgdp
0      Argentina   295072.218690
1      Australia   541804.652100
2          India  1728144.374800
3         Israel   129253.894230
4         Malawi     5026.221784
5   South Africa   227242.369490
6  United States  9898700.000000
7        Uruguay    25255.961693

To select specific rows and specific columns at the same time:
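
In [21]: df.loc[2:5, ['country', 'tcgdp']]
Out[21]:
        country           tcgdp
2         India  1728144.374800
3        Israel   129253.894230
4        Malawi     5026.221784
5  South Africa   227242.369490

Older pandas versions used df.ix for this; it was removed in pandas 1.0, and df.loc is the label-based replacement. Note that a label slice with loc includes its end point, so row 5 appears in the result.

The pop() method removes one column from a DataFrame and returns it: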

The output below shows only the POP and tcgdp columns, which suggests that df had been narrowed beforehand, for example with df = df[['country', 'POP', 'tcgdp']]; that step is not shown in this session.

In [34]: countries = df.pop('country')

In [35]: type(countries)
Out[35]: pandas.core.series.Series

In [36]: countries
Out[36]:
0        Argentina
1        Australia
2            India
3           Israel
4           Malawi
5     South Africa
6    United States
7          Uruguay
Name: country

In [37]: df
Out[37]:
           POP           tcgdp
0    37335.653   295072.218690
1    19053.186   541804.652100
2  1006300.297  1728144.374800
3     6114.570   129253.894230
4    11801.505     5026.221784
5    45064.098   227242.369490
6   282171.957  9898700.000000
7     3219.793    25255.961693

In [38]: df.index = countries

In [39]: df
Out[39]:
                       POP           tcgdp
country
Argentina        37335.653   295072.218690
Australia        19053.186   541804.652100
India          1006300.297  1728144.374800
Israel            6114.570   129253.894230
Malawi           11801.505     5026.221784
South Africa     45064.098   227242.369490
United States   282171.957  9898700.000000
Uruguay           3219.793    25255.961693

To rename the columns of a DataFrame:

In [40]: df.columns = 'population', 'total GDP'

In [41]: df
Out[41]:
                population       total GDP
country
Argentina        37335.653   295072.218690
Australia        19053.186   541804.652100
India          1006300.297  1728144.374800
Israel            6114.570   129253.894230
Malawi           11801.505     5026.221784
South Africa     45064.098   227242.369490
United States   282171.957  9898700.000000
Uruguay           3219.793    25255.961693

To apply arithmetic to a whole column:

In [66]: df['population'] = df['population'] * 1e3

In [67]: df
Out[67]:
               population       total GDP
country
Argentina        37335653   295072.218690
Australia        19053186   541804.652100
India          1006300297  1728144.374800
Israel            6114570   129253.894230
Malawi           11801505     5026.221784
South Africa     45064098   227242.369490
United States   282171957  9898700.000000
Uruguay           3219793    25255.961693

To create a new column from existing data:

In [74]: df['GDP percap'] = df['total GDP'] * 1e6 / df['population']

In [75]: df
Out[75]:
               population       total GDP    GDP percap
country
Argentina        37335653   295072.218690   7903.229085
Australia        19053186   541804.652100  28436.433261
India          1006300297  1728144.374800   1717.324719
Israel            6114570   129253.894230  21138.672749
Malawi           11801505     5026.221784    425.896679
South Africa     45064098   227242.369490   5042.647686
United States   282171957  9898700.000000  35080.381854
Uruguay           3219793    25255.961693   7843.970620

A DataFrame has built-in plotting based on matplotlib:

In [76]: df['GDP percap'].plot(kind='bar')
Out[76]: <matplotlib.axes.AxesSubplot at 0x2f22ed0>

In [77]: import matplotlib.pyplot as plt

In [78]: plt.show()

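
The DataFrame steps above (selecting with loc, moving a column into the index with pop(), renaming columns, and deriving a new column) can be sketched without the CSV file on disk. The example below is an illustration under assumptions: it reads three rows copied from test_pwt.csv out of an in-memory string, and it takes POP to be in thousands and tcgdp in millions, as the scaling factors in the session imply.

```python
import io
import pandas as pd

# Three rows copied from test_pwt.csv, limited to the columns used here.
csv_text = """country,POP,tcgdp
Argentina,37335.653,295072.21869
India,1006300.297,1728144.3748
Malawi,11801.505,5026.2217836
"""
df = pd.read_csv(io.StringIO(csv_text))

# Label-based selection: rows 0 through 1 (end point included) and two columns.
subset = df.loc[0:1, ['country', 'tcgdp']]

# Move the country column into the index, then rename the remaining columns.
df.index = df.pop('country')
df.columns = ['population', 'total GDP']

# Assumed units: POP in thousands, tcgdp in millions.
df['population'] = df['population'] * 1e3
df['GDP percap'] = df['total GDP'] * 1e6 / df['population']

print(subset)
print(df)
```

Reading from io.StringIO works because read_csv() accepts any file-like object, not only a path.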
Sorting:

In [83]: df = df.sort_values(by='GDP percap', ascending=False)  # sort by GDP percap, descending

In [84]: df
Out[84]:
               population       total GDP    GDP percap
country
United States   282171957  9898700.000000  35080.381854
Australia        19053186   541804.652100  28436.433261
Israel            6114570   129253.894230  21138.672749
Argentina        37335653   295072.218690   7903.229085
Uruguay           3219793    25255.961693   7843.970620
South Africa     45064098   227242.369490   5042.647686
India          1006300297  1728144.374800   1717.324719
Malawi           11801505     5026.221784    425.896679

Older pandas versions wrote this as df.sort_index(by=...); current versions use sort_values() for sorting by column values.

Using online data
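pandas can read online data fetched with the standard library's urlopen (urllib2 in Python 2, urllib.request in Python 3), so the user does not have to download the file by hand.
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen
url = 'http://research.stlouisfed.org/fred2/series/UNRATE/downloaddata/UNRATE.csv'
source = urlopen(url)  # file-like object holding the CSV response
data = pd.read_csv(source, index_col=0, parse_dates=True, header=None)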

In [71]: type(data)
Out[71]: pandas.core.frame.DataFrame

In [72]: data.head()  # A useful method to get a quick look at a data frame
Out[72]:
                1
0
DATE        VALUE
1948-01-01    3.4
1948-02-01    3.8
1948-03-01    4.0
1948-04-01    3.9

In [73]: data.describe()  # Your output might differ slightly
Out[73]:
          1
count   786
unique   81
top     5.4
freq     31

Some online data can also be retrieved through a pandas data-reader interface, without calling urlopen yourself. The same series as above can be obtained this way:

In [77]: import pandas_datareader.data as web  # formerly pandas.io.data, now the separate pandas-datareader package

In [78]: import datetime as dt  # Standard Python date / time library

In [79]: start, end = dt.datetime(2006, 1, 1), dt.datetime(2012, 12, 31)

In [80]: data = web.DataReader('UNRATE', 'fred', start, end)

In [81]: type(data)
Out[81]: pandas.core.frame.DataFrame

In [82]: data.plot()
Out[82]: <matplotlib.axes.AxesSubplot at 0xcf79390>

In [83]: import matplotlib.pyplot as plt

In [84]: plt.show()
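
Returning to the read_csv() call on the downloaded file: with header=None, the DATE,VALUE line is read as a data row, which is why describe() reported count, unique, top and freq instead of numeric statistics. The sketch below illustrates the difference without network access. It assumes the file has the two-column layout seen in the head() output and stands in for the download with an in-memory string holding the first four rows.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: the header plus the four rows shown by head().
csv_text = """DATE,VALUE
1948-01-01,3.4
1948-02-01,3.8
1948-03-01,4.0
1948-04-01,3.9
"""

# header=None treats "DATE,VALUE" as data, so the value column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0, header=None)

# Using the first line as the header gives a float column and a datetime index.
data = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(raw.describe())
print(data.describe())
```

With the header row recognised, describe() returns mean, std and quartiles, and the index can be used for date-based selection.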