Pandas Text Data Types

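Pandas has two dtypes for text data: object and string.
Before pandas 1.0, text could only be stored as object; the dedicated string dtype was introduced in pandas 1.0.
If a column mixes text with numbers, it defaults to object.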

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A':['a','b','c','d'],
    'B':['aa','bb','cc',np.nan],
    'C':[1,2,3,4],
    'D':[11,12,13,np.nan]
})

# object is the default dtype for text data
# .dtypes shows the dtype of each column
df.dtypes

A     object
B     object
C      int64
D    float64
dtype: object
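
The object fallback for mixed columns can be checked directly. The following is a minimal sketch with made-up values, not taken from the original data:

```python
import pandas as pd

# Illustrative values: one string, one integer, one float in the same column
mixed = pd.Series(['a', 1, 2.5])
print(mixed.dtype)

# An object column stores plain Python objects, so each element keeps its own type
element_types = mixed.map(type).tolist()
print(element_types)
```

Because every element keeps its own Python type, string methods on such a column have to handle non-string values one by one.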

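How to set the dtype to string

There are two cases: setting it at creation time and converting after creation.

# When creating a Series or DataFrame, specify the string dtype via dtype
# Method 1: dtype='string'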
pd.Series(["a","b","c"],dtype='string')

0    a
1    b
2    c
dtype: string

# Method 2: dtype=pd.StringDtype()
pd.Series(["a","b","c"],dtype=pd.StringDtype())

0    a
1    b
2    c
dtype: string

# How to convert an existing Series or DataFrame to string?
s=pd.Series(["a","b","c"])
s

0    a
1    b
2    c
dtype: object

# Method 1: convert to string with astype
s.astype('string')

0    a
1    b
2    c
dtype: string
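
astype also accepts a dict on a DataFrame, which is one way to convert only selected columns. A small sketch, assuming hypothetical column names `name` and `n`:

```python
import pandas as pd

# Hypothetical frame: one text column and one integer column
frame = pd.DataFrame({'name': ['a', 'b', 'c'], 'n': [1, 2, 3]})

# Convert only 'name'; columns not listed in the dict keep their dtype
converted = frame.astype({'name': 'string'})
print(converted.dtypes)
```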

# Method 2: let df.convert_dtypes() pick the best dtypes automatically
df.dtypes

A     object
B     object
C      int64
D    float64
dtype: object

dfn = df.convert_dtypes()
dfn.dtypes

A    string
B    string
C     Int64
D     Int64
dtype: object

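Differences between string and object

Difference 1:

When counting with s.str.count(), a string Series returns <NA> for None, and the result dtype is Int64.

An object Series returns NaN for None, and the result dtype is float64.

After None is removed with dropna(), both results are integer typed: Int64 for string and int64 for object.

In other words, the result for a string Series is always Int64, whether or not it contains None.

For an object Series, the result is float64 when it contains None and int64 when it does not.

# string dtype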
s= pd.Series(['a',None,'b'],dtype='string')

s.str.count('a')

0       1
1    <NA>
2       0
dtype: Int64

s.dropna().str.count('a')

0    1
2    0
dtype: Int64

# object dtype
s= pd.Series(['a',None,'b'],dtype='object')

s.str.count('a')

0    1.0
1    NaN
2    0.0
dtype: float64

s.dropna().str.count('a')

0    1
2    0
dtype: int64
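
The four cases above can be compared side by side. This is a minimal sketch; the dictionary name `result_dtypes` is only for illustration:

```python
import pandas as pd

values = ['a', None, 'b']
result_dtypes = {}

for dtype in ('string', 'object'):
    s = pd.Series(values, dtype=dtype)
    # dtype of the count result with the missing value present, then after dropna()
    with_missing = str(s.str.count('a').dtype)
    without_missing = str(s.dropna().str.count('a').dtype)
    result_dtypes[dtype] = (with_missing, without_missing)

print(result_dtypes)
```

The string dtype gives the same result dtype in both situations, which makes downstream code easier to reason about.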

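Difference 2:

When checking strings with str.isdigit(), a string Series returns dtype boolean, with <NA> for None.
An object Series also returns Boolean values, but the dtype is object and None stays None.

# string dtype
# str.isdigit() checks whether every character in each string is a digit; a space or any other character makes it return False.
# It is True only when the string is non-empty and consists entirely of digits; otherwise it is False.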
s= pd.Series(['a',None,'b','1'],dtype='string')
s.str.isdigit()

0    False
1     <NA>
2    False
3     True
dtype: boolean

# object dtype
s= pd.Series(['a',None,'b','1'],dtype='object')
s.str.isdigit()

0    False
1     None
2    False
3     True
dtype: object
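
A practical consequence of the boolean result is that it can serve as a row filter once the missing entries are resolved. A minimal sketch, assuming missing values should be treated as "not a digit":

```python
import pandas as pd

s = pd.Series(['a', None, 'b', '1'], dtype='string')

# Replace <NA> with False so that missing entries are excluded explicitly
mask = s.str.isdigit().fillna(False)
digits_only = s[mask]
print(digits_only.tolist())
```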

String methods

Case conversion

s = pd.Series(
    ['A','B','Aaba','Baca', np.nan, 'cat','This is apple','1',"我勒个去"],
    dtype = 'string'
)

# convert to lowercase
s.str.lower()

0                a
1                b
2             aaba
3             baca
4             <NA>
5              cat
6    this is apple
7                1
8             我勒个去
dtype: string

# convert to uppercase
s.str.upper()

0                A
1                B
2             AABA
3             BACA
4             <NA>
5              CAT
6    THIS IS APPLE
7                1
8             我勒个去
dtype: string

# capitalize the first letter of every word
s.str.title()

0                A
1                B
2             Aaba
3             Baca
4             <NA>
5              Cat
6    This Is Apple
7                1
8             我勒个去
dtype: string

# capitalize only the first letter
s.str.capitalize()

0                A
1                B
2             Aaba
3             Baca
4             <NA>
5              Cat
6    This is apple
7                1
8             我勒个去
dtype: string

# swap upper and lower case
s.str.swapcase()

0                a
1                b
2             aABA
3             bACA
4             <NA>
5              CAT
6    tHIS IS APPLE
7                1
8             我勒个去
dtype: string

# convert to lowercase, with more aggressive folding that also covers other languages
s.str.casefold()

0                a
1                b
2             aaba
3             baca
4             <NA>
5              cat
6    this is apple
7                1
8             我勒个去
dtype: string
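
On the sample data lower() and casefold() give the same result, so the difference is easy to miss. A small sketch with German input (illustrative values, not from the original data) shows where they diverge:

```python
import pandas as pd

# Two spellings of the same word; 'ß' only folds to 'ss' under casefold()
s = pd.Series(['Straße', 'STRASSE'], dtype='string')

lowered = s.str.lower().tolist()
folded = s.str.casefold().tolist()
print(lowered)
print(folded)
```

casefold() is therefore the safer choice when the goal is case-insensitive comparison rather than display.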

Text alignment

# center within width 10, fill character '-'
s.str.center(10,fillchar='-')

0       ----A-----
1       ----B-----
2       ---Aaba---
3       ---Baca---
4             <NA>
5       ---cat----
6    This is apple
7       ----1-----
8       ---我勒个去---
dtype: string

# left-justify within width 10, fill character '!'
s.str.ljust(10,fillchar='!')

0       A!!!!!!!!!
1       B!!!!!!!!!
2       Aaba!!!!!!
3       Baca!!!!!!
4             <NA>
5       cat!!!!!!!
6    This is apple
7       1!!!!!!!!!
8       我勒个去!!!!!!
dtype: string

# right-justify within width 10, fill character '+'
s.str.rjust(10,fillchar='+')

0       +++++++++A
1       +++++++++B
2       ++++++Aaba
3       ++++++Baca
4             <NA>
5       +++++++cat
6    This is apple
7       +++++++++1
8       ++++++我勒个去
dtype: string

# pad on the left to width 10, fill character '☆'
s.str.pad(width=10,side='left',fillchar='☆')

0       ☆☆☆☆☆☆☆☆☆A
1       ☆☆☆☆☆☆☆☆☆B
2       ☆☆☆☆☆☆Aaba
3       ☆☆☆☆☆☆Baca
4             <NA>
5       ☆☆☆☆☆☆☆cat
6    This is apple
7       ☆☆☆☆☆☆☆☆☆1
8       ☆☆☆☆☆☆我勒个去
dtype: string

# pad with leading zeros to width 3; longer strings are left unchanged
s.str.zfill(3)

0              00A
1              00B
2             Aaba
3             Baca
4             <NA>
5              cat
6    This is apple
7              001
8             我勒个去
dtype: string

Counting and encoding

# count occurrences of the given letter
s.str.count('a')

0       0
1       0
2       2
3       2
4    <NA>
5       1
6       1
7       0
8       0
dtype: Int64

# length of each string
s.str.len()

0       1
1       1
2       4
3       4
4    <NA>
5       3
6      13
7       1
8       4
dtype: Int64

# encode
s.str.encode('utf-8')

0                                                 b'A'
1                                                 b'B'
2                                              b'Aaba'
3                                              b'Baca'
4                                                 <NA>
5                                               b'cat'
6                                     b'This is apple'
7                                                 b'1'
8    b'\xe6\x88\x91\xe5\x8b\x92\xe4\xb8\xaa\xe5\x8e...
dtype: object

# decode
s.str.encode('utf-8').str.decode('utf-8')

0                A
1                B
2             Aaba
3             Baca
4             <NA>
5              cat
6    This is apple
7                1
8             我勒个去
dtype: object
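
str.len() counts characters, while the encoded form counts bytes. The sketch below compares the two for an ASCII string and a Chinese string; using map(len) on the encoded Series is a choice made here for illustration:

```python
import pandas as pd

s = pd.Series(['cat', '我勒个去'], dtype='string')

# Number of characters in each string
char_lengths = s.str.len().tolist()

# Number of bytes after UTF-8 encoding; each of these Chinese characters takes 3 bytes
byte_lengths = s.str.encode('utf-8').map(len).tolist()

print(char_lengths)
print(byte_lengths)
```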

Format checks

# whether all characters are alphabetic
s.str.isalpha()

0     True
1     True
2     True
3     True
4     <NA>
5     True
6    False
7    False
8     True
dtype: boolean

# whether all characters are numeric
s.str.isnumeric()

0    False
1    False
2    False
3    False
4     <NA>
5    False
6    False
7     True
8    False
dtype: boolean

# whether all characters are letters or digits
s.str.isalnum()

0     True
1     True
2     True
3     True
4     <NA>
5     True
6    False
7     True
8     True
dtype: boolean

# whether all characters are digits
s.str.isdigit()

0    False
1    False
2    False
3    False
4     <NA>
5    False
6    False
7     True
8    False
dtype: boolean

# whether the string contains only decimal characters
s.str.isdecimal()

0    False
1    False
2    False
3    False
4     <NA>
5    False
6    False
7     True
8    False
dtype: boolean
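
On the sample Series, isnumeric(), isdigit() and isdecimal() return identical results, so the following sketch uses three illustrative characters where they differ. It uses object dtype, on the assumption that object-backed strings follow the semantics of Python's built-in str methods:

```python
import pandas as pd

# Illustrative inputs: an ASCII digit, a superscript two, and a vulgar fraction
s = pd.Series(['5', '²', '½'], dtype='object')

checks = pd.DataFrame({
    'isdecimal': s.str.isdecimal(),  # strictest: only characters usable in base-10 numbers
    'isdigit': s.str.isdigit(),      # also accepts digit-like characters such as superscripts
    'isnumeric': s.str.isnumeric(),  # broadest: also accepts fractions and similar
})
print(checks)
```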

# whether all characters are whitespace
s.str.isspace()

0    False
1    False
2    False
3    False
4     <NA>
5    False
6    False
7    False
8    False
dtype: boolean

# whether all cased characters are lowercase
s.str.islower()

0    False
1    False
2    False
3    False
4     <NA>
5     True
6    False
7    False
8    False
dtype: boolean

# whether all cased characters are uppercase
s.str.isupper()

0     True
1     True
2    False
3    False
4     <NA>
5    False
6    False
7    False
8    False
dtype: boolean

# whether the string is in title case
s.str.istitle()

0     True
1     True
2     True
3     True
4     <NA>
5    False
6    False
7    False
8    False
dtype: boolean

Advanced text operations

Splitting text

# Method 1: split() splits each string and returns lists
s=  pd.Series(['a_b_c','c_d_e', np.nan, 'f_g_h'],
             dtype='string')

s.str.split('_')

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

# Method 2: use get or [] to pick an element from each split list
s.str.split("_").str.get(1)

0       b
1       d
2    <NA>
3       g
dtype: object

s.str.split('_').str[0]

0       a
1       c
2    <NA>
3       f
dtype: object

# expand the split lists into columns
s.str.split('_',expand=True)

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

# limit the number of splits; splitting starts from the left by default
s.str.split('_',n=1)

0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object

s.str.split('_',expand=True,n=1)

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

# rsplit splits from the right
s.str.rsplit('_',expand=True,n=1)

      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

# the separator can be a regular expression
s = pd.Series(["六便士和月亮及地球"])
s.str.split(r'和|及', expand=True)

     0   1   2
0  六便士  月亮  地球
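
A common use of n together with expand=True is to turn "key=value" text into two named columns. The sketch below uses made-up input, and the column names `key` and `value` are assumptions for illustration:

```python
import pandas as pd

# Illustrative input; the second value itself contains the separator
s = pd.Series(['host=localhost', 'path=a=b'], dtype='string')

# n=1 splits only at the first '=', so the rest of the value stays intact
parts = s.str.split('=', n=1, expand=True)
parts.columns = ['key', 'value']
print(parts)
```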

Replacing text

s = pd.Series(['us $66.8', 'us $56.8', 'us $112.9'],
             dtype='string')

# replace(): the first argument is the pattern to replace, the second is the replacement
s.str.replace(r'us \$','',regex=True)

0     66.8
1     56.8
2    112.9
dtype: string

# with regex=False the pattern is replaced literally
s.str.replace('us $','',regex=False)

0     66.8
1     56.8
2    112.9
dtype: string
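
After the prefix is removed the values are still text. One way to continue, sketched here under the assumption that every remaining value is a valid number, is to cast to float:

```python
import pandas as pd

s = pd.Series(['us $66.8', 'us $56.8', 'us $112.9'], dtype='string')

# Strip the literal prefix, then cast the remaining text to floating point
prices = s.str.replace('us $', '', regex=False).astype('float64')
total = prices.sum()
print(prices.dtype, total)
```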

# replacement with a regular expression
## reverse each run of lowercase letters
s = pd.Series(['foo 123', 'bar baz', np.nan], dtype= 'string')
s

0    foo 123
1    bar baz
2       <NA>
dtype: string

# the pattern to be replaced
pat = r'[a-z]+'
# the replacement
## on a match object, group() returns the text captured by a group; () defines a group, and group(0) is the entire match
def repl(m):
    return m.group(0)[::-1]

# note that 'bar baz' becomes 'rab zab', not 'zab rab'
s.str.replace(pat, repl, regex=True)

0    oof 123
1    rab zab
2       <NA>
dtype: string

# illustrate the example above
import re
re.match(pat,'foo 123')[0]

'foo'

re.match(pat,'foo 123')[0][::-1]

'oof'

repl(re.match(pat,'foo 123'))

'oof'
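
re.match only looks at the first match, whereas str.replace with a callable applies the function to every match. The standard-library re.sub behaves the same way, so it makes a closer analogy. A minimal sketch using the same pattern and function:

```python
import re

pat = r'[a-z]+'

def repl(m):
    # m.group(0) is the whole matched run of lowercase letters; reverse it
    return m.group(0)[::-1]

# Each match is reversed in place, so word order is preserved
reversed_words = re.sub(pat, repl, 'bar baz')
reversed_first = re.sub(pat, repl, 'foo 123')
print(reversed_words)
print(reversed_first)
```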

# keep the selected part and replace the rest
s = pd.Series(['ax','bxy','cxyz'])
s

0      ax
1     bxy
2    cxyz
dtype: object

# keep the first character and replace everything after it with T
s.str.slice_replace(1,repl='T')

0    aT
1    bT
2    cT
dtype: object

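# replace a specified range
# replace from index 1 up to (but not including) index 3 with T
# note that string indexes start at 0
s.str.slice_replace(start=1, stop=3, repl='T')

0     aT
1     bT
2    cTz
dtype: object

# repeat the existing text
# str.repeat(2) repeats it twice
s.str.repeat(2)

0        axax
1      bxybxy
2    cxyzcxyz
dtype: object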

Concatenating text

s = pd.Series(['a','b',np.nan,'d'],
             dtype='string')
s

0       a
1       b
2    <NA>
3       d
dtype: string

# str.cat() joins the contents of a Series, ignoring missing values
s.str.cat()

'abd'

# join with a separator
s.str.cat(sep='-')

'a-b-d'

# join with a separator and a placeholder for missing values
s.str.cat(sep='-', na_rep='空值')

'a-b-空值-d'

# concatenate a Series with a list of the same length
# because s has a missing value, the result has one too
s.str.cat(['A','B','C','D'], sep='+')

0     a+A
1     b+B
2    <NA>
3     d+D
dtype: string

s.str.cat(['A','B','C','D'], sep='+', na_rep='-')

0    a+A
1    b+B
2    -+C
3    d+D
dtype: string

# concatenate a Series with a DataFrame of the same length (matching index)
t = pd.Series(['A','B','C','D'],dtype='string')

d = pd.concat([t,s],axis=1)
d

   0     1
0  A     a
1  B     b
2  C  <NA>
3  D     d

s.str.cat(d,na_rep='-')

0    aAa
1    bBb
2    -C-
3    dDd
dtype: string

# index alignment
u = pd.Series(['b','d','a','c'], index=[1,3,0,2], dtype='string')
u

1    b
3    d
0    a
2    c
dtype: string

s

0       a
1       b
2    <NA>
3       d
dtype: string

s.str.cat(u,na_rep='空值')

0     aa
1     bb
2    空值c
3     dd
dtype: string

# with index alignment, the join parameter sets how the indexes are aligned: the default is 'left'; 'outer', 'inner' and 'right' are also available
v= pd.Series(['z','a','b','d','e'], index = [-1,0,1,3,4], dtype = 'string')
v

-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

s

0       a
1       b
2    <NA>
3       d
dtype: string

s.str.cat(v,join='right')

-1    <NA>
 0      aa
 1      bb
 3      dd
 4    <NA>
dtype: string
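
The same method is often used to combine two columns of a DataFrame. A minimal sketch, with hypothetical column names `first` and `last` and '?' as an arbitrary placeholder for missing values:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing first name
people = pd.DataFrame({
    'first': ['Ada', 'Alan', np.nan],
    'last': ['Lovelace', 'Turing', 'Hopper'],
}).astype('string')

# Both columns share the same index, so rows are combined position by position
full_names = people['first'].str.cat(people['last'], sep=' ', na_rep='?')
print(full_names.tolist())
```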

Matching text

Searching text

# str.findall() returns all matches found
s = pd.Series(['中国','中国山东','中国河南','中国河北','中国辽宁','莫斯科'])
s

0      中国
1    中国山东
2    中国河南
3    中国河北
4    中国辽宁
5     莫斯科
dtype: object

s.str.findall('中国')

0    [中国]
1    [中国]
2    [中国]
3    [中国]
4    [中国]
5      []
dtype: object

s.str.find('中国')

0    0
1    0
2    0
3    0
4    0
5   -1
dtype: int64

# str.find() returns the position of the first match (-1 means not found)
s.str.find('国')

0    1
1    1
2    1
3    1
4    1
5   -1
dtype: int64

Containment checks

s

0      中国
1    中国山东
2    中国河南
3    中国河北
4    中国辽宁
5     莫斯科
dtype: object

# str.contains() is common in data filtering; it returns Boolean values
s.str.contains('中国')

0     True
1     True
2     True
3     True
4     True
5    False
dtype: bool

s.str.contains('中国|莫')

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool
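
Since str.contains() treats the pattern as a regular expression by default (which is why '中国|莫' acts as an "or"), characters such as '$' need regex=False to be matched literally. The sketch below uses made-up data and also passes na=False so that missing values count as "no match":

```python
import pandas as pd

# Illustrative data with one missing value
s = pd.Series(['us $5', 'eur 7', None], dtype='object')

# As a regex, '$' matches the end of every string, so all non-missing rows match
regex_mask = s.str.contains('$', na=False)

# With regex=False, '$' is matched as a literal character
literal_mask = s.str.contains('$', regex=False, na=False)

filtered = s[literal_mask]
print(regex_mask.tolist())
print(literal_mask.tolist())
print(filtered.tolist())
```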

Extracting text

# str.extract() uses a regular expression to pull matching data out of the text into separate columns
s =pd.Series(['a1e3','b2f6','c3o7','a3c4'],
            dtype='string')
s

0    a1e3
1    b2f6
2    c3o7
3    a3c4
dtype: string

# with expand=False, the result is a Series if there is a single capture group, otherwise a DataFrame
# str.extract() returns whatever is captured inside the ()
# extract the digits
s.str.extract(r'([a-z]+(\d)[a-z]+(\d))',expand=False)

      0  1  2
0  a1e3  1  3
1  b2f6  2  6
2  c3o7  3  7
3  a3c4  3  4

s.str.extract(r'[a-z]+(\d)[a-z]+(\d)',expand=True)

   0  1
0  1  3
1  2  6
2  3  7
3  3  4

# extract the digits
s.str.extract(r'[a-z]+(\d)[a-z]+(\d)',expand=False)

   0  1
0  1  3
1  2  6
2  3  7
3  3  4

# extract the letters
s.str.extract(r'([a-z]+)\d+([a-z]+)\d+',expand=False)

   0  1
0  a  e
1  b  f
2  c  o
3  a  c

# name the extracted columns
s.str.extract(r'(?P<ab字母列>[ab])(?P<数字列>\d)',
             expand=False)

  ab字母列   数字列
0     a     1
1     b     2
2  <NA>  <NA>
3     a     3

# extract all matches: every part of the text that matches the pattern is returned, in a DataFrame with a MultiIndex
s = pd.Series(['a1a2','b1','c1'],
             index=['A','B','C'],
             dtype='string')
s

A    a1a2
B      b1
C      c1
dtype: string

two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

s.str.extract(two_groups, expand=True)

  letter digit
A      a     1
B      b     1
C      c     1

s.str.extractall(two_groups)

        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1
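
The MultiIndex returned by extractall() can be used to summarize or select matches per original row. A minimal sketch on the same data; the variable names are only for illustration:

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1', 'c1'], index=['A', 'B', 'C'], dtype='string')
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

matches = s.str.extractall(two_groups)

# Level 0 of the MultiIndex is the original label: count the matches per label
match_counts = matches.groupby(level=0).size().to_dict()

# Select only the first match (match == 0) of each row
first_matches = matches.xs(0, level='match')

print(match_counts)
print(first_matches)
```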

# extract dummy (indicator) variables
s = pd.Series(['a','a|b',np.nan,'a|c'],
             dtype='string')
s

0       a
1     a|b
2    <NA>
3     a|c
dtype: string

s.str.get_dummies(sep='|')

   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1
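
Because the indicator columns are numeric, they can be aggregated directly, for example to count how often each tag occurs. A small sketch on the same data; the cast to int is an assumption made here so the result does not depend on the default dtype of the indicator columns:

```python
import pandas as pd
import numpy as np

s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype='string')

# One indicator column per distinct tag; the missing row becomes all zeros
dummies = s.str.get_dummies(sep='|').astype(int)

# Column sums give the number of rows that contain each tag
tag_counts = dummies.sum().to_dict()
print(tag_counts)
```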
