Udemy: Python for Data Science and Machine Learning Bootcamp#
Study material stored @ OneDrive/Documents/Study Documents/Online courses
Section 5. Numpy arrays#
- numpy array; vector vs matrix
- `np.arange`: takes a step size vs `np.linspace`: takes the number of points as input
- `np.eye(n)`: creates an n x n diagonal matrix, i.e. an identity matrix
- `np.random`
  - `np.random.rand`: uniform(0, 1)
  - `np.random.randn`: \(N(0,1)\)
  - `np.random.randint`: random integers between the low and high inputs
- `reshape` method of an array; `shape` attribute of an array
- `.max()`/`.min()`: return the value; `.argmax()`/`.argmin()`: return the (integer) index
  - Compared with the `.argX()` functions, pandas' `idxmax()`/`idxmin()`, which return the index label, may be preferred
- `array.dtype`
- numpy array indexing
  - array slicing => not a copy of the sliced part but just a view (pointer) into the original array
  - to make a real copy, use the `array.copy()` method
  - `np.array()` can be used to create an array instance
  - indexing a 2-d array: if only one index is provided, it refers to the row number, e.g. `array[0]` -> first row
  - to get a single element, double brackets, i.e. `array[row][col]`, work; BUT a single bracket with a comma, `array[row, col]`, is preferred
- Universal functions: ufunc
- broadcasting
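A minimal sketch tying the above together (all array names here are illustrative):

```python
import numpy as np

# Creation
a = np.arange(0, 10, 2)           # step size: [0 2 4 6 8]
b = np.linspace(0, 10, 5)         # number of points: [0. 2.5 5. 7.5 10.]
I = np.eye(3)                     # 3 x 3 identity matrix
u = np.random.rand(4)             # uniform(0, 1)
z = np.random.randn(4)            # N(0, 1)
k = np.random.randint(1, 100, 5)  # integers in [1, 100)

# Slicing gives a view, not a copy
arr = np.arange(5)
view = arr[1:3]
view[:] = 99                      # modifies arr as well
backup = arr.copy()               # a real, independent copy

# 2-d indexing and broadcasting
m = np.arange(9).reshape(3, 3)
print(m[0])                       # first row
print(m[1, 2])                    # single element; preferred over m[1][2]
print(m + 10)                     # the scalar is broadcast across the array
```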
Section 6. Pandas#
DataFrames - Part 1
- `df.drop(axis=0/1, inplace=True)`; without `inplace=True`, pandas just returns a copy of the DF after dropping, and the original DF is unchanged
- Indexing a DF:
  - `df.loc["row index name"]`
  - `df.iloc[row numeric index number]`
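A quick sketch of dropping and label- vs position-based indexing (the toy DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["A", "B", "C"], columns=["W", "X", "Y"])

df.drop("Y", axis=1)                # returns a copy without column Y; df unchanged
df.drop("C", axis=0, inplace=True)  # modifies df itself

print(df.loc["A"])                  # row by index name
print(df.iloc[0])                   # row by numeric position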
DataFrames - Part 2
- :star: Python's `and` keyword works only for scalar boolean objects, not for arrays of Booleans; should use `&` for arrays
- reset index: `df.reset_index()`, original index => a column; not in place by default
- set index: `df.set_index(column name)`
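A short sketch of boolean masking and index resets, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "state": ["NY", "CA", "NY", "TX"]})

# `and` raises on boolean Series; use & (| for "or") with parentheses
print(df[(df["x"] > 1) & (df["state"] == "NY")])

df2 = df.reset_index()       # old index becomes a column; df itself unchanged
df3 = df.set_index("state")  # the column becomes the new index
```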
DataFrames - Part 3
- `pd.MultiIndex.from_tuples(list of tuples)`
- `df.xs(index value, level=index name)`: cross-section method
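A sketch of a two-level index and cross-sections (index names are hypothetical):

```python
import numpy as np
import pandas as pd

tuples = [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)]
idx = pd.MultiIndex.from_tuples(tuples, names=["group", "num"])
df = pd.DataFrame(np.random.randn(4, 2), index=idx, columns=["A", "B"])

print(df.loc["G1"])           # slice on the outer level
print(df.xs(1, level="num"))  # cross-section on the inner level
```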
Missing data
- `df.dropna()` (`axis=0` by default): drops any row containing a `nan`
  - `thresh=int`: keep only rows/columns with at least that many non-`nan` values
- `df.fillna(value=...)`
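A small demo of the three missing-data tools, with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [5, np.nan, np.nan],
                   "C": [1, 2, 3]})

df.dropna()                       # drops any row containing a nan
df.dropna(axis=1)                 # drops any column containing a nan
df.dropna(thresh=2)               # keeps rows with at least 2 non-nan values
df.fillna(value=df["A"].mean())   # e.g. fill with a column mean
```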
Groupby
- `df.groupby(column name).mean()`
- :star: `df.groupby().describe()`
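A groupby sketch on a toy sales frame:

```python
import pandas as pd

df = pd.DataFrame({"Company": ["GOOG", "GOOG", "MSFT", "MSFT"],
                   "Sales": [200, 120, 340, 124]})

print(df.groupby("Company").mean())      # one row per company
print(df.groupby("Company").describe())  # count/mean/std/quartiles per company
```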
Merging, joining, and concatenating
- `pd.concat([df1, df2, df3, ...])`, `axis=0` by default => stacks the rows; if columns/indexes don't align, this will lead to some `nan`
- `pd.merge(how="left/right/inner, etc.", on=column name)`
- `df.join()`: similar to merge but uses the index rather than a column for matching
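A sketch contrasting the three combinators (frames are made up):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "A": [1, 2]})
right = pd.DataFrame({"key": ["K0", "K2"], "B": [3, 4]})

pd.concat([left, right])                       # stacks rows; mismatches -> nan
pd.concat([left, right], axis=1)               # side by side, aligned on index
pd.merge(left, right, how="inner", on="key")   # SQL-style join on a column

# join matches on the index instead of a column
left.set_index("key").join(right.set_index("key"), how="left")
```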
Operations
- find unique values: `df[col].unique()`
- find frequency of each unique value in a column: `df[col].value_counts()`
- :star: apply method: `df[col].apply(fun)`
- `df.sort_values(by=col_name)`
- `df.isnull()`
- pivot table: `df.pivot_table(values=, index=, columns=)`
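One toy frame exercising each operation above:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4],
                   "col2": [444, 555, 666, 444],
                   "col3": ["abc", "def", "ghi", "xyz"]})

print(df["col2"].unique())                # distinct values
print(df["col2"].value_counts())          # frequency of each value
print(df["col1"].apply(lambda x: x * 2))  # element-wise function
print(df.sort_values(by="col2"))
print(df.isnull())                        # boolean mask of missing values
df.pivot_table(values="col1", index="col3", columns="col2")
```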
Data I/O
- .csv: `pd.read_csv()`, `df.to_csv("name", index=False)`
- Excel: needs the `xlrd` module; `pd.read_excel("file.xlsx", sheet_name=" ")`, `df.to_excel("", sheet_name=" ")`
- HTML: needs `lxml`/`html5lib`/`BeautifulSoup4`; `pd.read_html("xxx.html")`
- SQL: `sqlalchemy` => `create_engine`
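An I/O round-trip sketch; the file names and connection string are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("example.csv")
df.to_csv("output.csv", index=False)       # index=False avoids an extra column

xl = pd.read_excel("file.xlsx", sheet_name="Sheet1")
xl.to_excel("out.xlsx", sheet_name="Sheet1")

tables = pd.read_html("page.html")         # returns a list of DataFrames

engine = create_engine("sqlite:///:memory:")  # in-memory SQLite for testing
df.to_sql("data", engine)
back = pd.read_sql("data", con=engine)
```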
Section 8. Python for data visualization - matplotlib#
42-44. matplotlib Parts 1 - 3
- `%matplotlib inline`: used in a jupyter nb; otherwise need to call `plt.show()` every time
- Functional method: `plt.plot(x, y)`
- OO method:

```python
fig = plt.figure()                         # figure -> canvas
axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # axes-level plot (rect argument is required)
axes.plot(x, y)
```

- Note: `plt.subplot()` \(\ne\) `plt.subplots()`; the second creates an array of axes objects, as if doing multiple `fig.add_axes()` calls
  - `fig, axes = plt.subplots()`
- `plt.tight_layout()`: displays multiple subplots with better spacing
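A minimal OO-style figure with two subplots (the data is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

fig, axes = plt.subplots(nrows=1, ncols=2)  # array of axes objects
axes[0].plot(x, y)
axes[0].set_title("y = x^2")
axes[1].plot(y, x)
axes[1].set_title("inverse")

plt.tight_layout()   # prevents subplot overlap
plt.show()           # needed when not using %matplotlib inline
```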
Section 9. Python for data visualization - seaborn#
- `distplot`: histogram (+ kde)
- `jointplot`: scatter plot of two variables
- `pairplot`: scatter plot matrix
- `rugplot`: draws a dash for every data point -> building block of `kdeplot`
- Categorical plots
  - `barplot`: x = cat, y = continuous
  - `countplot`: x = cat, y = count of occurrences
  - `boxplot`: x = cat, y = continuous
  - `violinplot`: `hue`, `split = True`
  - `stripplot`: `jitter = True`
  - `swarmplot`: `stripplot` + `violinplot`
  - `factorplot`: can do all of the above by specifying `kind = ...`
Matrix plot
- needs the data in matrix form
- `heatmap` (matrix-form df)
- `clustermap`: clustered version of `heatmap`
Grids
- `g = sns.PairGrid(df)`
- `g.map_diag` / `g.map_upper` / `g.map_lower`
- `FacetGrid`: creates subgroups for plotting
Regression plot
- `lmplot()`: scatter plot with regression line
Style & color
- `sns.set_style()`
- `sns.despine()`, inputs: top, right, bottom, left
- `sns.set_context("poster", font_scale = x)`
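A few of the plots above on seaborn's built-in tips dataset (note: newer seaborn versions replace `distplot` with `histplot`/`displot`):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.set_style("whitegrid")
sns.distplot(tips["total_bill"])                   # histogram + kde
sns.jointplot(x="total_bill", y="tip", data=tips)
sns.boxplot(x="day", y="total_bill", data=tips)
sns.violinplot(x="day", y="total_bill", data=tips,
               hue="sex", split=True)
sns.despine()        # removes top and right spines by default
plt.show()
```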
Section 10. Python for data visualization - Pandas built-in data visualization#
Pandas built-in data visualization
- `df['col'].plot(kind="", ...)`
- `df['col'].plot.hist()`
- `df.plot.<kind>()`, e.g. `df.plot.area()`, `df.plot.scatter()`
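The calling styles side by side, on random data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])

df["a"].plot(kind="hist", bins=30)   # string-kind style
df["a"].plot.hist(bins=30)           # attribute style, same result
df.plot.scatter(x="a", y="b")        # df.plot.<kind>() in general
```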
Section 11. Python for data visualization - plotly & cufflinks#
plotly & cufflinks
- Cufflinks: a toolbox that links plotly & pandas
- plotly is free to use for all functions, but you need to pay if you'd like to save plots online
- setup:

```python
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf           # makes df.iplot() available on DataFrames

init_notebook_mode(connected=True)
cf.go_offline()                  # use plotly/cufflinks without an online account
```

- use plotly: `df.iplot()`
Section 13. Data capstone project#
- `.groupby().unstack()`: pivots the inner index level into columns
- `.groupby().reset_index()` -> flat frame that can feed a `FacetGrid()`
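A sketch of what `unstack` does after a two-key groupby (toy data):

```python
import pandas as pd

df = pd.DataFrame({"day": ["Mon", "Mon", "Tue", "Tue"],
                   "hour": [9, 10, 9, 10],
                   "calls": [5, 3, 7, 1]})

counts = df.groupby(["day", "hour"])["calls"].sum()
print(counts.unstack())      # inner level ('hour') becomes the columns
print(counts.reset_index())  # flat frame, ready for sns.FacetGrid
```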
Section 18. K-nearest neighbors#
- In KNN, all variables need to be on the same scale, otherwise some variables may dominate the distance calculation
- Find the scikit-learn cheatsheet (multiple versions exist)
- Tuning parameters
  - `n_neighbors`
  - distance metric
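A scaled KNN pipeline sketch on generated data (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=2, random_state=42)

# scale first so no feature dominates the distance
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors: the main tuning knob
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```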
Section 19. Decision trees & random forest#
Section 20. SVM#
- `from sklearn.svm import SVC`
- Grid search:

```python
from sklearn.model_selection import GridSearchCV

# param_grid: dict of hyperparameter values to try, e.g. {"C": [...], "gamma": [...]}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
```
Section 21. K means clustering (unsupervised)#
- Finding K: the “elbow” method, using SSE: the sum of squared distances between each member of a cluster and its centroid
- In `sklearn.datasets`, can use `make_blobs` to generate fake cluster data
- After fitting kmeans to data, can retrieve the centers from `kmeans.cluster_centers_` and the cluster labels from `kmeans.labels_`
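An elbow-method sketch on `make_blobs` data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=101)

# SSE (inertia_) for a range of K; look for the "elbow"
sse = [KMeans(n_clusters=k, random_state=101).fit(X).inertia_
       for k in range(1, 10)]
plt.plot(range(1, 10), sse, marker="o")

kmeans = KMeans(n_clusters=4, random_state=101).fit(X)
print(kmeans.cluster_centers_)   # note the trailing underscore
print(kmeans.labels_)
```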
Section 22. PCA#
PCA with python
- need to standardize the variables before conducting PCA:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
```
- Load PCA: `from sklearn.decomposition import PCA`
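A scale-then-project sketch; the breast cancer data here stands in for any numeric feature matrix:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_breast_cancer().data

scaled = StandardScaler().fit_transform(X)   # standardize first

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)       # rows projected onto 2 PCs
print(pca.explained_variance_ratio_)
```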
Section 23. Recommendation system#
- Content based: uses attributes of the items and recommends based on the similarity between them
- Collaborative filtering (CF): e.g. Amazon; based on knowledge of users' attitudes towards items, the “wisdom of the crowd”
  - more commonly used, produces better results
  - able to do feature learning on its own
- CF subtypes
  - Memory-based collaborative filtering
  - Model-based collaborative filtering: SVD
- Pandas: `df.corrwith(series)`: correlation between the columns of a df and another column
  - It seems this method only uses the correlation between ratings to check the similarity between two movies
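A corrwith sketch on a hypothetical user-by-movie rating matrix:

```python
import pandas as pd

ratings = pd.DataFrame({"Movie A": [5, 4, 3, 2],
                        "Movie B": [4, 5, 3, 1],
                        "Movie C": [1, 2, 4, 5]})

# correlation of every movie's ratings with Movie A's ratings
similar = ratings.corrwith(ratings["Movie A"])
print(similar.sort_values(ascending=False))
```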
Section 25. Big data and spark with python#
- Local vs distributed system: distributed means multiple machines connected in a network
- Hadoop: a way to distribute very large files across multiple machines
  - Hadoop Distributed File System (HDFS)
  - Hadoop also uses MapReduce, which allows computations on that data
  - HDFS uses blocks of data with a size of 128 MB by default; each of these blocks is replicated 3 times; the blocks are distributed in a way that supports fault tolerance
- MapReduce: a way of splitting a computation task across a distributed set of files (such as HDFS); it consists of a Job Tracker and multiple Task Trackers
- Spark can be thought of as a flexible alternative to MapReduce:
  - MapReduce requires files to be stored in HDFS, Spark doesn't
  - Spark can perform operations up to 100x faster than MapReduce
- Core idea of Spark: resilient distributed dataset (RDD), four main features
  - Distributed collection of data
  - Fault-tolerant
  - Parallel operation - partitioned
  - Ability to use many data sources
- RDDs are immutable, lazily evaluated, and cacheable
- There are two types of RDD operations (see the sketch after the Transformations & Actions notes below):
  - Transformations
    - `RDD.filter`
    - `RDD.map`: ~ `pd.apply()`
    - `RDD.flatMap`
  - Actions
    - `first`: return the first element of the RDD
    - `collect`: return all the elements of the RDD
    - `count`: return the number of elements
    - `take`: return an array with the first n elements of the RDD
    - `reduce()`: aggregate RDD elements using a function that returns a single element
    - `reduceByKey()`: aggregate Pair RDD elements using a function that returns a Pair RDD; similar to a groupby operation
AWS EC2: virtual computer lab
- log in to EC2 using SSH: `ssh -i xx.pem ubuntu@<public DNS>`
PySpark setup
- `source .bashrc`  # set to Anaconda Python
Intro. to Spark & Python
Notebook magic command:
```
%%writefile example.txt
text ...
```

- anything in the cell after the magic line is written into “example.txt”

```python
from pyspark import SparkContext

sc = SparkContext()
```

- `sc` has many different methods
RDD Transformations & Actions
- Transformation => an RDD object
- Action => a local object
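A sketch of the transformation/action split (assumes a local pyspark installation):

```python
from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: they only build a new RDD
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = rdd.map(lambda x: x * 2)

# Actions run the computation and return local objects
print(rdd.first())                      # 1
print(evens.collect())                  # [2, 4]
print(rdd.count())                      # 5
print(rdd.take(3))                      # [1, 2, 3]
print(rdd.reduce(lambda a, b: a + b))   # 15

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())  # [('a', 3), ('b', 3)]
```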
Section 26. Neural Nets and Deep Learning#
- Perceptron: “feed-forward” model, i.e. inputs are sent into the neuron, are processed, and result in an output
  - receive inputs
  - weight inputs
  - sum inputs
  - generate output
- TensorFlow
  - Basic idea: create data flow graphs, which have nodes and edges. The array (data) passed along from layer of nodes to layer of nodes is known as a Tensor
  - Two ways to use TF:
    - Customizable Graph Session
    - Scikit-learn-type interface with `tf.contrib.learn`
TensorFlow basics
- Object/Data is called a “Tensor”
- `tf.Session()` => `sess.run()` method
- Placeholder: inserts a placeholder for a tensor that will always be fed
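A minimal placeholder/session sketch in the TF 1.x API used by the course:

```python
import tensorflow as tf  # TF 1.x

x = tf.placeholder(tf.float32)   # must always be fed at run time
y = tf.placeholder(tf.float32)
add = x + y                      # a node in the data flow graph

with tf.Session() as sess:
    print(sess.run(add, feed_dict={x: 3.0, y: 4.0}))  # 7.0
```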
TF estimators
- Steps
  - Read in data (normalize if necessary)
  - Train/test split the data
  - Create estimator feature columns
  - Create estimator input function
  - Train the estimator model
  - Predict with a new test input function
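A sketch of those steps with the TF 1.x estimator API; the dataset and every hyperparameter here are illustrative:

```python
import tensorflow as tf  # TF 1.x
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1-2. Read in data, normalize, train/test split
X, y = make_blobs(n_samples=500, centers=2, random_state=42)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 3. Feature columns
feat_cols = [tf.feature_column.numeric_column("x", shape=[2])]

# 4. Input function
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train}, y=y_train, batch_size=20, num_epochs=5, shuffle=True)

# 5. Train the estimator model
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)
model.train(input_fn=input_fn, steps=200)

# 6. Predict with a new test input function
pred_fn = tf.estimator.inputs.numpy_input_fn(x={"x": X_test}, shuffle=False)
preds = list(model.predict(input_fn=pred_fn))
```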