文章目录

pandas
- 1、Series
- - 1）Series的创建

Excel的主要作用是保存数据，进行数据分析

Pandas是线上服务类型，数据分析和数据处理(在机器学习中数据处理)

在统计学理论支撑下诞生的，帮助相关的业务部分部门需要监控、定位、分析、解决问题，帮助企业高效决策，提高经验的效率，从而提高利润，发挥价值，规避分析。

pandas

数据分析三剑客

numpy数值计算
pandas数据分析
matplotlib+seaborn数据可视化

tableau + power bi + Excel

# pandas继承了numpy
# pandas中有两组数据类型，一个是Series，另一个是DataFrame
import pandas as pd
import numpy as np

1、Series

Series是一种类似与一维数组的对象，由下面两个部分组成：

values：一组数据（ndarray类型）
key：相关的数据索引标签

1）Series的创建

两种创建方式：

(1) 由列表或numpy数组创建

默认索引为0到N-1的整数型索引

s1 = pd.Series([1,2,3,4])
s1
#0 1
#1 2
#2 3
#3 4
#dtype: int64

通过设置index参数指定索引

s2 = pd.Series(data=[1,2,3,4],index=list('abcd'))
s2
#a 1
#b 2
#c 3
#d 4
#dtype: int64

name参数

s3 = pd.Series(data=[1,2,3,4],index=list('abcd'),name='demo')
s3
#a 1
#b 2
#c 3
#d 4
#Name: demo, dtype: int64

copy属性

对于ndarray来说，直接可以引用地址

arr = np.array([1,2,3,4,5])
ser = pd.Series(data=arr,copy=True)
ser
#0 1
#1 2
#2 3
#3 4
#4 5
#dtype: int32
arr[0]=100
ser
#0 1
#1 2
#2 3
#3 4
#4 5
#dtype: int32

(2) 由字典创建

#set类型不支持
dict_ = dict(a=1,b=2,c=3)
pd.Series(dict_)
#a 1
#b 2
#c 3
#dtype: int64

2）Series的索引和切片

可以使用中括号取单个索引（此时返回的是元素类型），或者中括号里一个列表取多个索引（此时返回的仍然是一个Series类型）。分为显示索引和隐式索引：

(0)常规索引的方式

S = pd.Series(dict(a=1,b=2,c=3,d=4))
S[:2],S[:'c'],S.c
#(a 1
# b 2
# dtype: int64, a 1
# b 2
# c 3
# dtype: int64, 3)

索引的类型有两种：

枚举型索引:特征索引是连续数值
关联型索引:特征索引都是离散字符类型

(1) 显式索引：

使用index中的关联类型作为索引值
使用.loc[]（推荐）
可以理解为pandas是ndarray的升级版,但是Series也可是dict的升级版

注意，此时是闭区间

S.loc['b']
#2

(2) 隐式索引：

使用整数作为索引值
使用.iloc[]（推荐）
注意，此时是半闭区间

S.iloc[2]
#3

常规切片

S
#a 1
#b 2
#c 3
#d 4
#dtype: int64
S[1:-1]
#b 2
#c 3
#dtype: int64
S['a':'d']
#a 1
#b 2
#c 3
#d 4
#dtype: int64

显式切片

S.loc['a':'d']
#a 1
#b 2
#c 3
#d 4
#dtype: int64

隐式切片

S.iloc[0:-1]
#a 1
#b 2
#c 3
#dtype: int64

3）Series的基本概念

可以把Series看成一个定长的有序字典

可以通过ndim,shape，size，index,values等得到series的属性

S.ndim,S.shape,S.size,S.dtype
#(1, (4,), 4, dtype('int64'))
S.index
#Index(['a', 'b', 'c', 'd'], dtype='object')
S.keys()
#Index(['a', 'b', 'c', 'd'], dtype='object')
S.values
#array([1, 2, 3, 4], dtype=int64)
S.nbytes
#32

可以通过head(),tail()快速查看Series对象的样式

共同都有一个参数n，默认值为5

S = pd.Series(data=np.random.randint(0,10,10000))
#Linux 当中 head -n xxx.txt 读取前几行
S.head(n=3)
#0 4
#1 9
#2 8
#dtype: int32

使用pandas读取CSV文件

#filepath_or_buffer = 路径
#sep=',' CSV的分割符
city = pd.read_csv('500_Cities__Local_Data_for_Better_Health.csv')
city.head()

当索引没有对应的值时，可能出现缺失数据显示NaN（not a number）的情况

np.array([None,1,2,3])
#array([None, 1, 2, 3], dtype=object)
S=pd.Series([None,1,2,3])

可以使用pd.isnull()，pd.notnull()，或自带isnull(),notnull()函数检测缺失数据

#mysql where demo is not null
index = S.notnull()
index
#0 False
#1 True
#2 True
#3 True
#dtype: bool
S[index]
#1 1.0
#2 2.0
#3 3.0
#dtype: float64

【数据挖掘重要笔记day11】pandas和Series的获取、Series的基本使用