Ask Pandas - bobbys-dev blog

Oct 9, 2020

Pandas has a DataFrame method, query(), that lets you filter out subsets of data using a logical expression query. Since the expression is a string, it reads a little more naturally than subsetting dataframes using nested brackets. The expression is evaluated in a columnar fashion.

Pandas uses numexpr as the default engine to evaluate the expression, so the filtering process is usually accelerated for a large datasets, when compared to subsetting by slicing (ie brackets). One caveat though… at the time of this writing, this engine is incompatible with complex string comparisons that involve substring searches (there is a workaround, but it’s stated to have lower performance). Otherwise, numeric and arithmetic expressions work well.

import pandas as pd
df = pd.DataFrame({'num_legs': [2, 4, 8, 0, 0, 4, 4, 6, 2, 6],
                   'num_wings': [2, 0, 0, 0, 0, 0, 2, 4, 2, 2],
                   'num_specimen_seen': [5, 12, 11, 8, 25, 5, 1, 4, 44, 10],
                  'family':['raptor','canine','arachnid','fish','gastropod','ursidae','cryptid','odanta','chiroptera','diptera']},
                  index=['falcon', 'dog', 'spider', 'carp', 'snail', 'bear', 'griffin', 'dragonfly', 'bat', 'fly'])
print(df.head(10))

           num_legs  num_wings  num_specimen_seen      family
falcon            2          2                  5      raptor
dog               4          0                 12      canine
spider            8          0                 11    arachnid
carp              0          0                  8        fish
snail             0          0                 25   gastropod
bear              4          0                  5     ursidae
griffin           4          2                  1     cryptid
dragonfly         6          4                  4      odanta
bat               2          2                 44  chiroptera
fly               6          2                 10     diptera

print(df.query('num_wings>=2 and num_legs>=4'))

           num_legs  num_wings  num_specimen_seen   family
griffin           4          2                  1  cryptid
dragonfly         6          4                  4   odanta
fly               6          2                 10  diptera

The above and below produce the same results, but it can be argued that the query() method is easier to read and type out.

print(df[(df['num_wings']>=2) & (df['num_legs']>=4)])

           num_legs  num_wings  num_specimen_seen   family
griffin           4          2                  1  cryptid
dragonfly         6          4                  4   odanta
fly               6          2                 10  diptera

Basic string comparisons do work. Here, the dataframe is filtered out for “United States” entries.

data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
print(gapminder.info())
print(gapminder.head(3))
gapminder.query('country=="United States"').head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
None
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

That’s it for now.

pandas