Pandas has a DataFrame method, query()
, that lets you filter out subsets of data using a logical expression query. Since the expression is a string, it reads a little more naturally than subsetting dataframes using nested brackets. The expression is evaluated in a columnar fashion.
Pandas uses numexpr
as the default engine to evaluate the expression, so the filtering process is usually accelerated for a large datasets, when compared to subsetting by slicing (ie brackets). One caveat though… at the time of this writing, this engine is incompatible with complex string comparisons that involve substring searches (there is a workaround, but it’s stated to have lower performance). Otherwise, numeric and arithmetic expressions work well.
import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 8, 0, 0, 4, 4, 6, 2, 6],
'num_wings': [2, 0, 0, 0, 0, 0, 2, 4, 2, 2],
'num_specimen_seen': [5, 12, 11, 8, 25, 5, 1, 4, 44, 10],
'family':['raptor','canine','arachnid','fish','gastropod','ursidae','cryptid','odanta','chiroptera','diptera']},
index=['falcon', 'dog', 'spider', 'carp', 'snail', 'bear', 'griffin', 'dragonfly', 'bat', 'fly'])
print(df.head(10))
num_legs num_wings num_specimen_seen family
falcon 2 2 5 raptor
dog 4 0 12 canine
spider 8 0 11 arachnid
carp 0 0 8 fish
snail 0 0 25 gastropod
bear 4 0 5 ursidae
griffin 4 2 1 cryptid
dragonfly 6 4 4 odanta
bat 2 2 44 chiroptera
fly 6 2 10 diptera
print(df.query('num_wings>=2 and num_legs>=4'))
num_legs num_wings num_specimen_seen family
griffin 4 2 1 cryptid
dragonfly 6 4 4 odanta
fly 6 2 10 diptera
The above and below produce the same results, but it can be argued that the query()
method is easier to read and type out.
print(df[(df['num_wings']>=2) & (df['num_legs']>=4)])
num_legs num_wings num_specimen_seen family
griffin 4 2 1 cryptid
dragonfly 6 4 4 odanta
fly 6 2 10 diptera
Basic string comparisons do work. Here, the dataframe is filtered out for “United States” entries.
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
print(gapminder.info())
print(gapminder.head(3))
gapminder.query('country=="United States"').head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country 1704 non-null object
year 1704 non-null int64
pop 1704 non-null float64
continent 1704 non-null object
lifeExp 1704 non-null float64
gdpPercap 1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
None
country year pop continent lifeExp gdpPercap
0 Afghanistan 1952 8425333.0 Asia 28.801 779.445314
1 Afghanistan 1957 9240934.0 Asia 30.332 820.853030
2 Afghanistan 1962 10267083.0 Asia 31.997 853.100710
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
That’s it for now.