Pandas Basics 2

Pandas Series

First Steps with Pandas Series

import pandas as pd
titanic = pd.read_csv("titanic.csv")
titanic.info()

titanic["age"]

Selecting a single column from a DataFrame returns a Series, so a DataFrame can be thought of as a collection of Series.

type(titanic["age"])
pandas.core.series.Series

As the code below shows, a column can be selected with either square brackets or dot notation.

titanic["age"].equals(titanic.age)
True
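
The two access styles are not always interchangeable. The sketch below uses a small made-up DataFrame (the column names and values are assumptions, not taken from titanic.csv) to show that dot notation returns a DataFrame method when a column name collides with one, while square brackets always return the column.

```python
import pandas as pd

# Hypothetical data; "count" deliberately collides with DataFrame.count().
df = pd.DataFrame({"age": [22.0, 38.0], "count": [1, 2]})

# For an ordinary column name both styles return the same Series.
same = df["age"].equals(df.age)

# Square brackets return the column as a Series.
bracket_is_series = isinstance(df["count"], pd.Series)

# Dot access finds the DataFrame.count method, not the column.
dot_is_method = callable(df.count)

print(same, bracket_is_series, dot_is_method)
```

Square brackets are therefore the safer default when column names are not known in advance.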

Let's pick one column and look at the attributes of that Series.

age = titanic["age"]

dtype, short for "data type", is the type of the values stored in this column.

age.dtype
dtype('float64')

shape is the shape of the column (Series). The result shows a one-dimensional array of length 891.

age.shape
(891,)

len() returns the length of the Series.

len(age)
891

index describes how the Series is labeled. The result shows labels that start at 0, stop before 891 (so the last label is 890), and increase in steps of 1.

age.index
RangeIndex(start=0, stop=891, step=1)
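
The same attributes can be checked without the CSV file. This is a minimal sketch on a made-up four-element Series (the values are assumptions); it also shows that the stop value of a RangeIndex is exclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the age column; the values are made up.
age_demo = pd.Series([22.0, 38.0, np.nan, 35.0], name="age")

print(age_demo.dtype)        # data type of the values
print(age_demo.shape)        # one-element tuple: the length
print(len(age_demo))         # same length as shape[0]
print(age_demo.index)        # RangeIndex with an exclusive stop
print(list(age_demo.index))  # the actual labels
```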

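In older pandas versions (before 1.4), calling age.info() directly raises an error, because age is a Series rather than a DataFrame and only DataFrame had the info() method. Converting age to a DataFrame with to_frame() first works in every version.

age.info()  # raises AttributeError on pandas older than 1.4
age.to_frame().info()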
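
A quick sketch of what to_frame() produces, using a made-up Series (an assumption for illustration): the result is a one-column DataFrame whose column name is the name of the Series.

```python
import pandas as pd

# Hypothetical Series standing in for the age column.
age_demo = pd.Series([22.0, None, 35.0], name="age")

frame = age_demo.to_frame()

print(type(frame).__name__)  # class name of the converted object
print(frame.shape)           # rows x columns
print(list(frame.columns))   # column name comes from Series.name
frame.info()                 # DataFrame.info() is available in all versions
```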

Analyzing Numerical Series

Next, we use age as an example to analyze a numerical Series.

age.describe()

count() returns the number of non-missing values in the Series.

age.count()
714

size is the total number of elements, missing values included. It matches len(age).

age.size
891
len(age)
891

sum() adds up the column. With skipna = False, the result becomes nan as soon as the column contains a missing value.

age.sum(skipna = False)
nan

To skip missing values and add up the remaining numbers, set skipna = True (this is the default).

age.sum(skipna = True)
21205.17
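
The difference between count(), size and the two skipna settings is easy to verify on a small made-up Series (the values are assumptions chosen so the sums are exact):

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
s = pd.Series([10.0, np.nan, 5.0, 2.5])

n_valid = s.count()                 # non-missing values only
n_total = s.size                    # all elements, missing included
total_skip = s.sum()                # skipna=True is the default
total_strict = s.sum(skipna=False)  # NaN propagates into the result

print(n_valid, n_total, total_skip, total_strict)
```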

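unique() returns the distinct values of the Series (NaN included) with duplicates removed. nunique() counts the distinct values; it ignores NaN by default, so dropna = False is needed to match len(age.unique()).

age.unique()

len(age.unique())
89
age.nunique(dropna = False)
89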
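
How NaN is counted by unique() and nunique() can be seen on a made-up Series (values are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical values: two distinct numbers plus repeated NaN.
s = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan])

n_unique_array = len(s.unique())     # NaN appears once in the array
n_without_nan = s.nunique()          # dropna=True is the default
n_with_nan = s.nunique(dropna=False)

print(n_unique_array, n_without_nan, n_with_nan)
```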

value_counts() sorts by frequency by default (sort = True), so calling value_counts() without arguments already returns the sorted result.

age.value_counts()

age.value_counts(sort = False)

By default dropna is True, so missing values are not counted.

age.value_counts(dropna = True)

age.value_counts(dropna = False)

By default the result is in descending order, from the most to the least frequent value.

age.value_counts(ascending = False)

age.value_counts(ascending = True)

Compare normalize = False with normalize = True. When set to True, the result shows the share of each value instead of its count.

age.value_counts(sort = True, dropna = True, ascending = False, normalize = False)

age.value_counts(sort = True, dropna = True, ascending = False, normalize = True)

age.value_counts(sort = True, dropna = False, ascending = False, normalize = True)

Setting bins groups the values of the Series into equal-width intervals and counts the values in each interval. Below are the results for 5 and for 10 bins.

age.value_counts(sort = True, dropna = True, ascending= False, normalize = False, bins = 5)

age.value_counts(sort = True, dropna = True, ascending= False, normalize = True, bins = 10)
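
The effect of normalize, dropna and bins is easier to follow on a tiny made-up Series (the values are assumptions). Note how dropna = False changes the denominator of the shares.

```python
import pandas as pd

# Hypothetical values: six numbers and one missing entry (stored as NaN).
s = pd.Series([1, 1, 2, 3, 3, 3, None])

counts = s.value_counts()                               # absolute counts, most frequent first
shares = s.value_counts(normalize=True)                 # shares of the 6 non-missing values
shares_na = s.value_counts(normalize=True, dropna=False)  # shares of all 7 entries
binned = s.value_counts(bins=2)                         # two equal-width intervals

print(counts)
print(shares)
print(shares_na)
print(binned)
```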

Analyzing non-numerical Series

What if the data in a Series is non-numerical? Let's look at how it behaves.

import pandas as pd
summer = pd.read_csv("summer.csv")
summer.info()

athlete = summer["Athlete"]

The athlete Series is non-numerical.

athlete.head()

type(athlete)
pandas.core.series.Series

dtype('O') stands for the object type.

athlete.dtype
dtype('O')
athlete.shape
(31165,)

describe() gives a short summary. For a non-numerical Series it reports count, unique, top and freq.

athlete.describe()

athlete.size
31165
athlete.count()
31165

min() returns the first value in string sort order.

athlete.min()
'AABYE, Edgar'
athlete.unique()
array(['HAJOS, Alfred', 'HERSCHMANN, Otto', 'DRIVAS, Dimitrios', ...,
       'TOTROV, Rustam', 'ALEKSANYAN, Artur', 'LIDBERG, Jimmy'],
      dtype=object)
len(athlete.unique())
22762
athlete.nunique(dropna= False)
22762
athlete.value_counts()

athlete.value_counts(sort = True, ascending=True)

athlete.value_counts(sort = True, ascending=False, normalize = True).head()
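
The same calls can be tried on a small made-up Series of names (the data is an assumption; dtype=object is passed explicitly so the dtype matches the one shown above regardless of pandas version):

```python
import pandas as pd

# Hypothetical athlete names; one name appears twice.
names = pd.Series(
    ["PHELPS, Michael", "AABYE, Edgar", "PHELPS, Michael"],
    dtype=object,
)

first_alphabetical = names.min()  # smallest string in sort order
summary = names.describe()        # count, unique, top, freq

print(names.dtype)
print(first_alphabetical)
print(summary)
```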

Creating Pandas Series (Part 1)

Now let's see how to create a pandas Series. A Series can be selected as a column from a DataFrame, read as a single column from a CSV file, or created from scratch.

import pandas as pd

from DataFrame

summer = pd.read_csv("summer.csv")
summer["Athlete"]
summer.Athlete
summer.iloc[0]  # a single row is also returned as a Series

Importing from CSV

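# The squeeze = True argument of read_csv was removed in pandas 2.0;
# squeezing the one-column DataFrame gives the same Series.
pd.read_csv("summer.csv", usecols = ["Athlete"]).squeeze("columns")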
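
A self-contained sketch of reading one CSV column as a Series. The CSV text is made up and read from memory with io.StringIO (an assumption, so the example runs without summer.csv).

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for summer.csv.
csv_text = "Athlete,Year\nHAJOS,1896\nLEWIS,1984\n"

# usecols keeps a single column, but the result is still a DataFrame.
one_column = pd.read_csv(io.StringIO(csv_text), usecols=["Athlete"])

# squeeze("columns") turns the one-column DataFrame into a Series.
athlete_demo = one_column.squeeze("columns")

print(type(one_column).__name__, type(athlete_demo).__name__)
print(athlete_demo)
```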

Creating from scratch with pd.Series()

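When creating a Series directly, the index must have exactly one label per value; one label too many or too few raises an error. The name parameter sets the name of the Series.

pd.Series([10,25,6,36,2])
#pd.Series([10,25,6,36,2], index=["Mon","Tue","Wed","Thu", "Fri", "Sat"])  # ValueError: 5 values, 6 labels
pd.Series([10,25,6,36,2], index=["Mon","Tue","Wed","Thu", "Fri"], name = "Sales")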
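
The length check can be observed by catching the exception. This is a minimal sketch that reuses the values from the lines above; the variable names are assumptions.

```python
import pandas as pd

values = [10, 25, 6, 36, 2]

# Six labels for five values: pandas refuses to build the Series.
try:
    pd.Series(values, index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"])
    error = None
except ValueError as exc:
    error = exc

# Matching lengths work, and name labels the Series.
sales_demo = pd.Series(values, index=["Mon", "Tue", "Wed", "Thu", "Fri"], name="Sales")

print(type(error).__name__)
print(sales_demo.name)
```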

Creating Pandas Series (Part 2)

Besides importing from a file or creating one directly as above, a Series can also be built from a NumPy array, a list, or a dictionary.

from NumPy Array

import pandas as pd
import numpy as np
sales = np.array([10,25,6,36,2])
sales
pd.Series(sales)

from List

sales = [10,25,6,36,2]
pd.Series(sales)

from Dictionary

Note that when a dictionary is converted to a Series, the keys become the index. If an index is passed explicitly, only labels that match a key in dic receive a value; every other label gets NaN.

dic = {"Mon":10, "Tue":25, "Wed":6, "Thu": 36, "Fri": 2}
dic
sales = pd.Series(dic)
sales

pd.Series(dic, index = ["Fri", "Sat", "Sun", "Mon", "Tue", "Wed"])

pd.Series(dic, index = [1,2,3,4,5])
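
The matching between dictionary keys and an explicit index can be checked on a shortened dictionary (a sketch with assumed names):

```python
import pandas as pd

dic_demo = {"Mon": 10, "Tue": 25}

# Labels are looked up in the dictionary; "Sat" has no key, so it becomes NaN.
partial = pd.Series(dic_demo, index=["Tue", "Sat", "Mon"])

# None of these labels is a key, so every value is NaN.
no_match = pd.Series(dic_demo, index=[1, 2])

print(partial)
print(no_match)
```

Because NaN is a float, the values in partial are stored as float64 even though the dictionary holds integers.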

Indexing and Slicing

import pandas as pd
titanic = pd.read_csv("titanic.csv")
age = titanic.age
age

age[0]
22.0
age[2]
26.0
age.iloc[-1]
32.0
age[890]
32.0
age[[3,4]]

age.loc[:3]  # label-based slice: label 3 is included

summer = pd.read_csv("summer.csv", index_col = "Athlete")
event = summer.Event
event.head()

event.index

event.iloc[0]  # event[0] relies on positional fallback, which newer pandas versions deprecate for non-integer indexes
'100M Freestyle'
event.iloc[-1]
'Wg 96 KG'
event.iloc[:3]

event["DRIVAS, Dimitrios"]
'100M Freestyle For Sailors'
event[:"DRIVAS, Dimitrios"]

event.loc["PHELPS, Michael"]

event.loc["PHELPS, Michael"].equals(event["PHELPS, Michael"])
True

The following slice does not work: "PHELPS, Michael" labels many rows, so pandas cannot tell at which row the slice should end.

#event[:"PHELPS, Michael"]
event.loc[["PHELPS, Michael", "LEWIS, Carl"]]

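Selecting a list of labels only works if every label exists in the index. Donald Duck is not in the data, so this raises an error.

#event[["PHELPS, Michael", "DUCK, Donald"]]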
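
Both failure cases can be reproduced on a made-up Series with repeated labels (the events listed are assumptions, not rows from summer.csv). The isin() filter at the end is one possible way to select only the labels that exist; it is a suggestion, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical miniature of the event Series; labels repeat on purpose.
event_demo = pd.Series(
    ["100M Butterfly", "100M", "Long Jump", "200M Medley"],
    index=["PHELPS, Michael", "LEWIS, Carl", "LEWIS, Carl", "PHELPS, Michael"],
)

# A repeated label returns every matching row as a Series.
phelps_rows = event_demo.loc["PHELPS, Michael"]

# Slicing up to a repeated label in an unsorted index is ambiguous.
try:
    event_demo.loc[:"PHELPS, Michael"]
    slice_failed = False
except KeyError:
    slice_failed = True

# A list containing a label that is not in the index raises KeyError.
try:
    event_demo.loc[["PHELPS, Michael", "DUCK, Donald"]]
    missing_failed = False
except KeyError:
    missing_failed = True

# Keep only the labels that exist by filtering with isin().
wanted = ["PHELPS, Michael", "DUCK, Donald"]
existing_rows = event_demo[event_demo.index.isin(wanted)]

print(len(phelps_rows), slice_failed, missing_failed, len(existing_rows))
```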

Sorting and introduction to the inplace-parameter

inplace = False is the default. Setting it to True modifies the original object instead of creating a new one; in that case the method returns None.

import pandas as pd
dic = {1:10, 3:25, 2:6, 4:36, 5:2, 6:0, 7:None}
sales = pd.Series(dic)
sales

sales.sort_index()

sales.sort_index(ascending = True, inplace= True)
sales

sales.sort_values(inplace=False)

sales.sort_values(ascending=False, na_position="last", inplace= True)
sales

dic = {"Mon":10, "Tue":25, "Wed":6, "Thu": 36, "Fri": 2}
dic
#{'Mon': 10, 'Tue': 25, 'Wed': 6, 'Thu': 36, 'Fri': 2}
sales = pd.Series(dic)
sales

sales.sort_index(ascending=False)
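
A small sketch of what inplace changes, on a shortened version of the dictionary above (the variable names are assumptions):

```python
import pandas as pd

sales_demo = pd.Series({1: 10, 3: 25, 2: 6})

# Default (inplace=False): a new sorted Series is returned.
sorted_copy = sales_demo.sort_index()
order_before = list(sales_demo.index)  # original order is untouched

# inplace=True: the original is reordered and the call returns None.
result = sales_demo.sort_index(inplace=True)
order_after = list(sales_demo.index)

print(order_before, order_after, result)
```

Because the in-place call returns None, writing `sales_demo = sales_demo.sort_index(inplace=True)` would replace the Series with None.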

nlargest() and nsmallest()

In pandas we often care about the top few values, such as the best-selling products or stores. head(), covered earlier, shows the first rows of a DataFrame, so seeing the largest or smallest rows requires sorting first. max() and min() return the largest or smallest value, but only a single value.

nlargest() returns the n largest values in one call, with no separate sorting step. Likewise, nsmallest() returns the n smallest values.

import pandas as pd
titanic = pd.read_csv("titanic.csv")
titanic.head()

age = titanic.age
age.sort_values(ascending=False).head(3)

age.sort_values(ascending=True).iloc[:3]

age.nlargest(n = 3).index[0]
age.nsmallest(n = 3).index[0]
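
On a small made-up Series (the values are assumptions), nlargest() gives the same rows as sorting in descending order and taking the head, and missing values are left out:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
age_demo = pd.Series([22.0, 80.0, np.nan, 0.42, 74.0])

top2 = age_demo.nlargest(2)
top2_via_sort = age_demo.sort_values(ascending=False).head(2)
bottom2 = age_demo.nsmallest(2)

print(top2)
print(bottom2)
print(top2.equals(top2_via_sort))
```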

idxmin() and idxmax()

titanic.age.idxmax()
630
titanic.age.idxmin()
803
titanic.loc[630]

titanic.loc[titanic.age.idxmin()]

dic = {"Mon":10,"Tue":25, "Wed":6, "Thu":36, "Fri":2, "Sat":0, "Sun":None}
sales = pd.Series(dic)
sales.sort_values(ascending=True).index[0]
'Sat'
sales.idxmin()
'Sat'
sales.sort_values(ascending=False).index[0]
'Thu'
sales.idxmax()
'Thu'
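
One detail the example above does not show is what happens with ties. In this sketch (made-up values) two days share the maximum, and idxmax() returns the label of the first one; the missing value is skipped.

```python
import pandas as pd

# Hypothetical sales: "Tue" and "Wed" tie for the maximum, "Sun" is missing.
sales_demo = pd.Series({"Mon": 10, "Tue": 36, "Wed": 36, "Sun": None})

best_day = sales_demo.idxmax()   # label of the first maximum
worst_day = sales_demo.idxmin()  # the missing value is ignored
best_value = sales_demo.loc[best_day]

print(best_day, worst_day, best_value)
```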

Manipulating Series

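import pandas as pd
sales = pd.Series([10,25,6,36,2,0,None,5], index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon"])
sales

sales["Sun"] = 0  # replace the missing value by label
sales

sales.iloc[3] = 30  # replace the fourth value by position
sales

sales_EUR = (sales/1.1).round(2)  # element-wise arithmetic returns a new Series
sales_EUR

sales = (sales/1.1).round(2)  # reassigning overwrites the original
sales

sales["Mon"] = 0  # the label "Mon" appears twice, so both rows are set to 0
sales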
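
The difference between label-based and position-based assignment on a repeated label, as a minimal sketch with shortened made-up data:

```python
import pandas as pd

# Hypothetical sales with the label "Mon" used twice.
sales_demo = pd.Series([10, 25, 5], index=["Mon", "Tue", "Mon"])

# Assigning by label changes every row that carries the label.
sales_demo["Mon"] = 0
after_label = list(sales_demo)

# Assigning by position changes exactly one row.
sales_demo.iloc[0] = 7
after_position = list(sales_demo)

print(after_label, after_position)
```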
