# Pandas study note
## Index, subsetting
Understanding how the index works in pandas [1]
Understanding `.loc`, `.iloc`, and `[]`

`[]`: Primarily selects subsets of columns, but can select rows as well. Cannot simultaneously select rows and columns. When given a name or list of names, it selects columns by default, e.g. `df['name']`; when given a slice of indices or names, it selects rows, e.g. `df[2:3]`.
Cons:

- It is not as optimised as `.loc` and `.iloc`.
- It is more implicit, so pandas has to figure out things like "are we indexing rows or columns?" behind the scenes.
- It is not the recommended way to select data from your DataFrame in production environments.
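A minimal runnable sketch of both `[]` behaviours, using a made-up toy `df` (not from the note):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'score': [1, 2, 3, 4]})

col = df['name']   # a column name selects a column (returns a Series)
rows = df[2:3]     # a slice selects rows (here: only the row at position 2)
```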
`.loc`

Label-based indexing on columns and/or rows; selects subsets of rows and columns by label only: `df.loc[[row_indexer], [col_indexer]]`

- Selection is inclusive: it slices from the start label up to and including the stop label.
- Integers are accepted as labels, but they will be treated as labels, not as positions in the DataFrame!
- Can use a boolean mask to filter and select rows.

```python
# selecting rows based on their index labels (could also be numerical)
df.loc['row_label']
df.loc[[row_labels]]
df.loc[1:10]  # for numeric row labels, the result is inclusive

# selecting cols based on their names (labels)
df.loc[:, 'col_name']
df.loc[:, [col_names]]
df.loc[:, 'col1':'col4']  # the result is inclusive
```
`.iloc`

Works on position (integer location; you must use integers when selecting by integer location); selects subsets of rows and columns by integer location only. Contrary to `.loc`, `.iloc` selection is exclusive, meaning that the stop position is not included in the final result.

```python
# indexers can take the form start:stop:step for both rows and cols
df.iloc[row_index, col_index]

# selecting rows based on their integer position
df.iloc[3]
df.iloc[[2, 4, 6]]
df.iloc[:3]

# selecting both rows and cols
df.iloc[1:5, 1:3]
```
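The inclusive/exclusive difference in one runnable sketch (toy `df`, hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})  # default integer labels 0..4

loc_slice = df.loc[1:3, 'x']    # labels 1, 2 and 3: the stop label is included
iloc_slice = df.iloc[1:3]['x']  # positions 1 and 2: the stop position is excluded
```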
`at` & `iat`

- `.at` gets scalar values; it's a very fast `.loc`.
- `.iat` gets scalar values; it's a very fast `.iloc`.
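A short sketch of both scalar accessors on a toy frame (labels `r1`/`r2` are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

by_label = df.at['r2', 'b']  # label-based scalar access, like a fast .loc
by_pos = df.iat[1, 1]        # position-based scalar access, like a fast .iloc
```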
`df.where()`: Replaces every row that doesn't match the filter with `NA`.

```python
low_fat = df['diet'] == 'low fat'
df_lf = df.where(low_fat)
```
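A runnable version with toy data (the `diet`/`pulse` columns echo the note's example but the values are invented), showing that `where` preserves the frame's shape and masks non-matching rows with NaN:

```python
import pandas as pd

df = pd.DataFrame({'diet': ['low fat', 'no fat', 'low fat'],
                   'pulse': [85, 90, 88]})

low_fat = df['diet'] == 'low fat'
df_lf = df.where(low_fat)  # non-matching rows become NaN; shape is preserved
```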
`df.query()`: quickly select rows based on explicit SQL-like filters: `df.query("pulse < 85")`
`df.get()`: select a column, dictionary-style: `df.get('diet')`
`.idxmin`, `.idxmax`

Return the index of the minimum/maximum value along a specified axis of a pandas DataFrame:

`df.idxmax(axis=0, skipna=True)`

They return the first occurrence of the minimum/maximum value.
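A sketch on a toy column with a tied maximum, to show the first-occurrence behaviour (data is made up):

```python
import pandas as pd

df = pd.DataFrame({'pulse': [85, 97, 97, 62]}, index=['a', 'b', 'c', 'd'])

max_label = df['pulse'].idxmax()  # first occurrence of the max (97)
min_label = df['pulse'].idxmin()  # label of the min (62)
per_col = df.idxmax(axis=0, skipna=True)  # Series: index of the max per column
```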
Subsetting: use a boolean mask in the `.loc` row indexer to select a subset of the data.
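A minimal sketch of boolean-mask row indexing combined with column labels in `.loc` (toy data):

```python
import pandas as pd

df = pd.DataFrame({'diet': ['low fat', 'no fat', 'low fat'],
                   'pulse': [85, 90, 88]})

# boolean mask as the row indexer, labels as the column indexer
subset = df.loc[df['pulse'] > 86, ['diet']]
```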
## Data manipulation
`duplicated()`, `drop_duplicates()`

- `df.duplicated(subset, keep)`: returns a Series of bool denoting duplicate rows.
- `df.drop_duplicates(subset=col_labels, keep={'first', 'last', False}, inplace=...)`: drops the duplicate rows.
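A quick sketch of both methods on a toy frame with one duplicated row (data invented):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'sales': [1, 1, 2]})

dup_mask = df.duplicated()                  # marks the second 'NY' row as a duplicate
deduped = df.drop_duplicates(keep='first')  # keeps the first of each duplicate set
```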
Get the number of rows, columns, and all elements (size) of a DataFrame

- `len(df)` returns the number of rows of a df
- `len(df.columns)` returns the number of cols
- `df.shape` returns both the number of rows and cols (as a tuple)
- `df.size` returns the number of elements
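All four on one toy 3×2 frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows = len(df)          # number of rows
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, cols) tuple
n_elems = df.size         # rows * cols
```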
`sum()` with the `level=` option, or `groupby()`

The `level=` option in `sum()` indicates which index level(s) the sum is performed over; it's similar to a sum after `groupby` over some col(s), in which the groupby var(s) become the index(es) of the resulting DataFrame.
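A sketch with toy data. Note that `df.sum(level=...)` has been deprecated in recent pandas; `groupby(level=...).sum()` is the equivalent spelling used here:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA'],
                   'sales': [10, 20, 5]}).set_index('city')

# summing over an index level; 'city' becomes the index of the result
totals = df.groupby(level='city').sum()
```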
`nlargest(n)`

`df.nlargest(n, columns, keep='first')`: Return the first n rows ordered by columns in descending order. This method is equivalent to `df.sort_values(columns, ascending=False).head(n)`, but more performant.
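A sketch checking the stated equivalence on a toy column (`pop` is a made-up name):

```python
import pandas as pd

df = pd.DataFrame({'pop': [30, 10, 50, 40]}, index=['a', 'b', 'c', 'd'])

top2 = df.nlargest(2, 'pop')
equiv = df.sort_values('pop', ascending=False).head(2)  # same rows, same order
```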
`df.transform(func, axis, *args, **kwargs)`: something like the `mutate` function in `dplyr`. Calls `func` on self, producing a DataFrame with the same axis shape as self.

- Can be used to transform the values of cols/features.
- Can be used with `groupby`, applying an aggregate function to the resulting data frame.
- Can be used to filter data, applying the rule to the values created by `transform`: `df[df.groupby('city')['sales'].transform('sum') > 40]`

Handling missing values at the group level (note the column selection after `groupby`, so `transform` works on a single Series):

```python
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
```
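The filtering pattern above as a runnable sketch (toy `city`/`sales` data, values invented):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'sales': [30, 20, 5, 10]})

# transform('sum') broadcasts each city's total back onto its rows,
# so the boolean mask aligns with the original frame
big = df[df.groupby('city')['sales'].transform('sum') > 40]
```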
`df.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')`: Provide rolling window calculations.
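A minimal rolling-window sketch on a toy Series, showing the effect of `min_periods`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

means = s.rolling(window=3).mean()              # NaN until a full window exists
sums = s.rolling(window=3, min_periods=1).sum() # partial windows are allowed
```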
Understanding `range`, `np.arange`, `np.linspace`

`range` vs `np.arange`
- The main difference between `range` and `np.arange` is that `range()` returns a lazy sequence object rather than a list, while `np.arange()` gives a NumPy array consisting of evenly spaced values within a given interval.
- The `range()` function generates a sequence of integer values lying within a certain range.
- `range()` is a built-in function, whereas `arange()` is a `numpy` library function.
- The `range()` function is more convenient when you need to iterate over values in a for loop; the `np.arange()` function is more useful when you are working with arrays and need to generate an array based on a specific sequence.
`np.arange` is similar to `range` but more performant.

- `np.arange(start, stop, stepsize)` lets you define the step size and infers the number of steps.
- It excludes the stop value, unless floating-point rounding error makes it do otherwise.
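A side-by-side sketch of the two, including a float step (which only `np.arange` supports):

```python
import numpy as np

ints = list(range(0, 10, 3))        # built-in: lazy sequence of Python ints
arr = np.arange(0, 10, 3)           # NumPy: ndarray with the same values
floats = np.arange(0.0, 1.0, 0.25)  # float steps allowed; stop is excluded
```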
`np.linspace(start, stop, size)` lets you define how many values you get, including the specified min and max values; it infers the step size.

You can exclude the `stop` value using `endpoint=False`.
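A sketch of both `linspace` modes; with `endpoint=False` the inferred step shrinks from 0.25 to 0.2:

```python
import numpy as np

incl = np.linspace(0.0, 1.0, 5)                  # stop included; step inferred as 0.25
excl = np.linspace(0.0, 1.0, 5, endpoint=False)  # stop excluded; step inferred as 0.2
```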
## Time series data
`df.shift()`

- Works with time series data.
- Shifts the index (a date or datetime variable) by the desired number of periods, with an optional time `freq`.
- The `freq` argument takes the same kind of frequency strings as `date_range()`.
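A sketch of both behaviours on a toy daily series (dates and values invented): without `freq` the values shift against a fixed index; with `freq` the index itself shifts and no values are lost.

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=4, freq='D')
s = pd.Series([1, 2, 3, 4], index=idx)

shifted_vals = s.shift(1)           # values move down one period; first becomes NaN
shifted_idx = s.shift(1, freq='D')  # the index moves one day; all values kept
```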
`df.interpolate()`

Fills NaN values using an interpolation method (linear by default).
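A minimal sketch of the default (linear) interpolation on a toy Series with an internal gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
filled = s.interpolate()  # linear by default: the gap becomes 2.0 and 3.0
```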