# Pandas study note

## Index, subsetting
Understanding how indexing works in pandas [1].

### Understanding `[]`, `.loc`, `.iloc`

`[]`: primarily selects subsets of columns, but can select rows as well; it cannot simultaneously select rows and columns. When given a name or a list of names, the default is to select columns, e.g. `df['name']`; when given a slice of indices/names, it selects rows, e.g. `df[2:3]`.

Cons:

- It is not as optimised as `.loc` and `.iloc`.
- It is more implicit, so pandas has to figure out things like "are we indexing rows or columns?" behind the scenes.
- It is not the recommended way to select data from your DataFrame in production environments.
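A quick runnable sketch of the `[]` behaviour described above (a toy DataFrame; the column names are made up for illustration):

```python
import pandas as pd

# hypothetical toy data
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'score': [1, 2, 3, 4]})

col = df['name']               # a name selects a column -> Series
cols = df[['name', 'score']]   # a list of names selects columns -> DataFrame
rows = df[2:3]                 # a slice selects rows -> the single row at position 2
```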
`.loc`: label-based indexing on rows and/or columns; selects subsets of rows and columns by label only: `df.loc[row_indexer, col_indexer]`. Selection is inclusive: it slices from the start label up to and including the stop label.

- Integers are accepted as labels, but they are treated as labels, not as positions in the DataFrame!
- A boolean mask can be used to filter and select rows.
```python
# selecting rows based on their index labels (could also be numerical)
df.loc['row_label']
df.loc[[row_label1, row_label2]]
df.loc[1:10]              # for numeric row labels, the result is inclusive

# selecting cols based on their names (labels)
df.loc[:, 'col_name']
df.loc[:, [col_name1, col_name2]]
df.loc[:, 'col1':'col4']  # result is inclusive
```
`.iloc`: works on position (integer location; you must use integers when selecting by integer location); selects subsets of rows and columns by integer location only. Contrary to `.loc`, `.iloc` selection is exclusive, meaning that the stop position is not included in the final result.
```python
# the indexer can take the form start:stop:step for both rows and cols
df.iloc[row_index, col_index]

# selecting rows based on their integer position
df.iloc[3]
df.iloc[[2, 4, 6]]
df.iloc[:3]

# selecting both rows and cols
df.iloc[1:5, 1:3]
```
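The inclusive/exclusive contrast between `.loc` and `.iloc` in a runnable form (toy DataFrame with a default integer index, so labels and positions coincide):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})  # index labels 0..4

incl = df.loc[1:3]   # label-based: includes both endpoints -> rows 1, 2, 3
excl = df.iloc[1:3]  # position-based: excludes the stop   -> rows 1, 2
```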
`.at` & `.iat`:

- `.at` gets scalar values; it is a very fast, label-based `.loc`.
- `.iat` gets scalar values; it is a very fast, position-based `.iloc`.
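A minimal sketch of the two scalar accessors (toy DataFrame; the labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'pulse': [72, 85, 90]}, index=['a', 'b', 'c'])

by_label = df.at['b', 'pulse']  # scalar lookup by row/column label
by_pos = df.iat[1, 0]           # scalar lookup by integer position
```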
`df.where()`: replaces every row that doesn't match the filter with `NA`:

```python
low_fat = df['diet'] == 'low fat'
df_lf = df.where(low_fat)
```
`df.query()`: quickly selects rows based on explicit SQL-like filters: `df.query("pulse < 85")`.
`df.get()`: selects a column, dictionary-style: `df.get('diet')`.
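The three methods above, sketched together on a toy DataFrame (the `diet`/`pulse` column names simply mirror the snippets above and are not a real dataset):

```python
import pandas as pd

df = pd.DataFrame({'diet': ['low fat', 'no fat', 'low fat'],
                   'pulse': [85, 90, 80]})

masked = df.where(df['diet'] == 'low fat')  # non-matching rows become NaN
fast = df.query("pulse < 85")               # SQL-like row filter
col = df.get('diet')                        # dict-style column access
missing = df.get('weight', 'n/a')           # default when the column is absent
```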
`.idxmin`, `.idxmax`: return the index of the minimum/maximum value along a specified axis of a pandas DataFrame: `df.idxmax(axis=0, skipna=True)`. They return the first occurrence of the minimum/maximum value.
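A small sketch of both methods, including the first-occurrence rule (toy data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 5], 'b': [9, 2, 0]}, index=['r1', 'r2', 'r3'])

top = df.idxmax(axis=0, skipna=True)  # index label of each column's max
low = df.idxmin(axis=0)               # index label of each column's min
# column 'a' has two maxima (5 at r2 and r3); idxmax reports the first: r2
```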
Subsetting: use a boolean mask in the `.loc` row indexer to select a subset of the data.
## Data manipulation
`duplicated()`, `drop_duplicates()`:

- `df.duplicated(subset, keep)`: returns a Series of booleans denoting duplicate rows.
- `df.drop_duplicates(subset=col_labels, keep={'first', 'last', False}, inplace=...)`: drops the duplicate rows.
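A runnable sketch of both methods (toy data; `city`/`sales` are made-up column names):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'sales': [1, 1, 2]})

dup_mask = df.duplicated()                  # True for the second identical row
deduped = df.drop_duplicates(keep='first')  # keep the first of each duplicate set
by_city = df.drop_duplicates(subset='city', keep='last')  # dedupe on one column
```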
Get the number of rows, columns, and elements (size) of a DataFrame:

- `len(df)` returns the number of rows of a df
- `len(df.columns)` returns the number of columns
- `df.shape` returns both the number of rows and columns
- `df.size` returns the number of elements
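The four size accessors above on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows = len(df)          # 3
n_cols = len(df.columns)  # 2
shape = df.shape          # (rows, cols)
n_elems = df.size         # rows * cols
```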
`sum()` with the `level=` option, or `groupby()`: the `level=` option in `sum()` indicates which index level(s) the sum is performed over; it is similar to summing after a `groupby` over some column(s), where the groupby variable(s) become the index(es) of the resulting DataFrame. (Note: `sum(level=...)` is deprecated in recent pandas versions in favour of `groupby(level=...).sum()`.)
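A sketch of the two equivalent routes on hypothetical city/sales data; in both cases the grouping variable ends up as the index of the result:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'sales': [10, 20, 5]})

# groupby over a column: 'city' becomes the index of the result
totals = df.groupby('city')['sales'].sum()

# the index-level route: set the index first, then group over that level
totals2 = df.set_index('city')['sales'].groupby(level=0).sum()
```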
`nlargest(n)`: `df.nlargest(n, columns, keep='first')` returns the first n rows ordered by `columns` in descending order. This method is equivalent to `df.sort_values(columns, ascending=False).head(n)`, but more performant.
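The stated equivalence, checked on toy data:

```python
import pandas as pd

df = pd.DataFrame({'pop': [30, 10, 50, 20]}, index=['a', 'b', 'c', 'd'])

top2 = df.nlargest(2, columns='pop')
same = df.sort_values('pop', ascending=False).head(2)  # equivalent, slower
```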
`df.transform(func, axis, *args, **kwargs)`: something like the `mutate` function in `dplyr`. Calls `func` on self, producing a DataFrame with the same axis shape as self.

- Can be used to transform the values of columns/features.
- Can be used with `groupby`, applying an aggregate function to the resulting data frame.
- Can be used to filter data, applying the rule to the values created by `transform`: `df[df.groupby('city')['sales'].transform('sum') > 40]`
- Handling missing values at the group level:

```python
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))
```
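Both group-level patterns above, made runnable on hypothetical city/sales data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'sales': [30, 20, 5, np.nan]})

# filter: transform broadcasts each group's sum back to every row,
# so the boolean mask aligns with df (NY total 50 passes, LA total 5 fails)
big = df[df.groupby('city')['sales'].transform('sum') > 40]

# fill missing values with the mean of that row's own group
df['filled'] = df.groupby('city')['sales'].transform(lambda x: x.fillna(x.mean()))
```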
`df.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')`: provides rolling window calculations.
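A minimal sketch of rolling windows, including the effect of `min_periods` (toy Series):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

roll = s.rolling(window=3).mean()            # first two entries are NaN
partial = s.rolling(window=3, min_periods=1).mean()  # partial windows allowed
```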
## Understanding `range`, `np.arange`, `np.linspace`

### `range` vs `np.arange`

- The main difference between `range` and `np.arange` is that `range()` returns a lazy sequence (a `range` object) instead of a list, while `np.arange()` gives a NumPy array consisting of evenly spaced values within a given interval.
- The `range()` function generates a sequence of integer values lying within a certain range.
- `range()` is a built-in function, whereas `arange()` is a NumPy library function.
- The `range()` function is more convenient when you need to iterate values in a for loop. The `np.arange()` function is more useful when you are working with arrays and need to generate an array based on a specific sequence.
- `np.arange` is similar to `range` but more performant. `np.arange(start, stop, step)` lets you define the step size and infers the number of steps.
  - It excludes the stop value, unless rounding error makes it do otherwise.
- `np.linspace(start, stop, num)` lets you define how many values you get, including the specified min and max values; it infers the step size.
  - You can exclude the `stop` value using `endpoint=False`.
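The `arange`/`linspace` contrast in a runnable form (integer endpoints chosen to avoid the rounding-error caveat):

```python
import numpy as np

a = np.arange(0, 10, 2)        # step given, stop (10) excluded: 0, 2, 4, 6, 8
l = np.linspace(0, 10, 5)      # count given, stop included: 0, 2.5, 5, 7.5, 10
l_open = np.linspace(0, 10, 5, endpoint=False)  # stop excluded: 0, 2, 4, 6, 8
```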
## Time series data
`df.shift()`? Works with time series data.

- Shifts the index (a date or datetime variable) by the desired number of periods, with an optional time `freq`.
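A sketch of the two shift modes on a toy daily series: without `freq` the values move and a `NaN` appears; with `freq` the index moves instead:

```python
import pandas as pd

idx = pd.date_range('2024-01-01', periods=4, freq='D')
s = pd.Series([1, 2, 3, 4], index=idx)

lagged = s.shift(1)             # values shift down one period; first becomes NaN
moved = s.shift(1, freq='D')    # the index shifts by one day; values untouched
```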
The `freq` argument in `date_range()`.

`df.interpolate()`?
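As a starting point for the `df.interpolate()` question above, a minimal sketch of linear interpolation filling gaps in a daily series (toy data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0],
              index=pd.date_range('2024-01-01', periods=4, freq='D'))

filled = s.interpolate(method='linear')  # fills the interior gap linearly
```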