Python骚包的操作，用SQL方式的去玩Pandas-白红宇

Pandas是一个非常方便的数据处理、数据分析的类库，在这篇文章中，我将Pandas进行了一个系统的梳理。

但不可否认的是，不是所有的程序员都会Python，也不是所有的Pythoner都会使用Pandas。

不过好消息是，借助于pandassql,你可以使用SQL来操作DataFrame。

# 导入相关库import numpy as npimport pandas as pdfrom pandasql import sqldf, load_meat, load_births复制代码

基础

pandasql 中的主要函数是 sqldf，它接收两个参数：

一个sql查询语句

一组会话/环境变量（locals() 或 globals()）

为了方便起见，我们可以定义一个函数来方便我们调用。

pysqldf = lambda sql: sqldf(sql, globals())复制代码

接下来我们导入一些数据。

meat = load_meat()meat.head()复制代码

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1944-01-01	751.0	85.0	1280.0	89.0	NaN	NaN	NaN
1	1944-02-01	713.0	77.0	1169.0	72.0	NaN	NaN	NaN
2	1944-03-01	741.0	90.0	1128.0	75.0	NaN	NaN	NaN
3	1944-04-01	650.0	89.0	978.0	66.0	NaN	NaN	NaN
4	1944-05-01	681.0	106.0	1029.0	78.0	NaN	NaN	NaN

births = load_births()births.head()复制代码

	date	births
0	1975-01-01	265775
1	1975-02-01	241045
2	1975-03-01	268849
3	1975-04-01	247455
4	1975-05-01	254545

查询

pandassql 使用的语法是 SQLite 的语法。任何 DataFrame 都会被 pandassql 自动检测到，你可以将它们作为来查询。

限定条数

先来看下如何去限定数据条数。这里来获取下前两条数据。

sql = "select * from births limit 2"pysqldf(sql)复制代码

	date	births
0	1975-01-01 00:00:00.000000	265775
1	1975-02-01 00:00:00.000000	241045

除了可以限定从头开始的前N条数据外，我们还可以设置偏移量。这里来获取下从第二行开始的前两条数据。

sql = "select * from births limit 2 offset 2"pysqldf(sql)复制代码

	date	births
0	1975-03-01 00:00:00.000000	268849
1	1975-04-01 00:00:00.000000	247455

限定字段

既然是SQL，我们当然可以限定查询时的所需字段了。这里我们限定只获取指定的births字段。

sql = "select births from births limit 2"pysqldf(sql)复制代码

	births
0	265775
1	241045

排序

排序功能也是非常常见的，pandassql 完美支持。这里我们按照 date 降序，births 升序来排。

sql = "select * from births order by date desc, births asc limit 2"pysqldf(sql)复制代码

	date	births
0	2012-12-01 00:00:00.000000	340995
1	2012-11-01 00:00:00.000000	320195

限定查询条件

我们可以指定 where 来查询满足要求的数据。这里我们筛选出 turkey 不为空并且 date 在 1974-12-31 之后的数据。

sql = """select *from meatwhere turkey not nulland date > '1974-12-31'limit 5"""pysqldf(sql)复制代码

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1975-01-01 00:00:00.000000	2106.0	59.0	1114.0	36.0	646.2	None	64.9
1	1975-02-01 00:00:00.000000	1845.0	50.0	954.0	31.0	570.2	None	47.1
2	1975-03-01 00:00:00.000000	1891.0	57.0	976.0	35.0	616.6	None	54.4
3	1975-04-01 00:00:00.000000	1895.0	60.0	1100.0	34.0	688.3	None	68.7
4	1975-05-01 00:00:00.000000	1849.0	59.0	934.0	31.0	690.1	None	81.9

聚合

数据分析时，聚合必不可少，pandassql 当然也支持了。这里我们按照年份来分组，然后对 births 求和、求均值、求最大值以及求最小值。

sql = """select strftime('%Y', date) as year,sum(births),avg(births),max(births),min(births)from birthsgroup bystrftime('%Y', date)limit 3"""pysqldf(sql)复制代码

	year	sum(births)	avg(births)	max(births)	min(births)
0	1975	3136965	261413.750000	281300	241045
1	1976	6304156	262673.166667	286496	236551
2	1979	3333279	277773.250000	302805	249898

关联

关联也是非常常见的操作。这里我们根据字段 date 作为条件来关联 meat 和 births 这两个DataFrame。

sql = """selectm.date, b.births, m.beeffrom meat minner join births bon m.date = b.dateorder bym.datelimit 5;"""pysqldf(sql)复制代码

	date	births	beef
0	1975-01-01 00:00:00.000000	265775	2106.0
1	1975-02-01 00:00:00.000000	241045	1845.0
2	1975-03-01 00:00:00.000000	268849	1891.0
3	1975-04-01 00:00:00.000000	247455	1895.0
4	1975-05-01 00:00:00.000000	254545	1849.0

以上是我列举的一些常用功能，除了这些之外，pandassql 还支持更多的一些操作，这些操作都是基于 SQLite 的语法来完成的，感兴趣的话可以自己研究。

想要学习更多关于人工智能的知识，请关注公众号：AI派