Udemy: Python for Data Science and Machine Learning Bootcamp#

Study material stored @ OneDrive/Documents/Study Documents/Online courses


Section 5. Numpy arrays#

  1. numpy array, vector vs matrix

    • np.arange(start, stop, step): uses step size vs np.linspace(start, stop, num): uses the number of points as input

    • np.eye(n): creates an n x n identity matrix

    • np.random

      • np.random.rand: uniform(0, 1)

      • np.random.randn: \(N(0,1)\)

      • np.random.randint(low, high): random integers from low (inclusive) to high (exclusive)

    • reshape() method of an array returns the data in a new shape; shape is an attribute of an array

    • .max()/.min(): return the value; .argmax()/.argmin(): return the positional index

      • For pandas Series, idxmax()/idxmin(), which return the index label, may be preferred over the .argX() functions

    • array.dtype: the data type of the array's elements (see the sketch below)
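
    • A minimal sketch of the basics above (output values in comments are illustrative):

      import numpy as np

      np.arange(0, 10, 2)        # step size: array([0, 2, 4, 6, 8])
      np.linspace(0, 10, 5)      # number of points: array([ 0. ,  2.5,  5. ,  7.5, 10. ])
      np.eye(3)                  # 3 x 3 identity matrix

      np.random.rand(2, 3)       # 2 x 3 array, uniform on [0, 1)
      np.random.randn(5)         # 5 draws from N(0, 1)
      np.random.randint(1, 100, 10)  # 10 integers in [1, 100)

      arr = np.arange(9)
      mat = arr.reshape(3, 3)    # reshape returns the data in shape (3, 3)
      mat.shape                  # (3, 3) -- an attribute, not a method
      arr.max(), arr.argmax()    # (8, 8): value vs positional index
      arr.dtype                  # dtype('int64') on most platforms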

    1. numpy array indexing

    • array slicing => a view: not a copy of the sliced part but just a pointer into the original array, so modifying the slice modifies the original

      • to make a real copy, use the array.copy() method

    • np.array() can be used to create an array instance

    • indexing of 2-d array: if only one index is provided -> refers to the row number, e.g. array[0] -> first row

      • to get a single element, double brackets, i.e. array[row][col], work; BUT a single bracket with a comma, array[row, col], is preferred

    • Universal functions: ufuncs, e.g. np.sqrt(), applied element-wise

    • broadcasting: operations between arrays of different shapes stretch the smaller array across the larger one (see the sketch below)
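
    • A minimal sketch of slicing, indexing, and broadcasting:

      import numpy as np

      arr = np.arange(10)
      sl = arr[0:5]          # a view, not a copy
      sl[:] = 99             # also changes arr[0:5]
      real_copy = arr.copy() # an independent copy

      mat = np.array([[1, 2, 3], [4, 5, 6]])
      mat[0]                 # first row: array([1, 2, 3])
      mat[0, 1]              # single element, preferred over mat[0][1]

      np.sqrt(mat)           # ufunc: applied element-wise
      mat + np.array([10, 20, 30])  # broadcasting: the 1-d array is stretched across both rows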

Section 6. Pandas#

  1. DataFrames - Part 1

    • df.drop(axis=0/1, inplace=True); without inplace=True, pandas just returns a copy of the DataFrame after dropping, and the original DataFrame is unchanged

    • Indexing a DF (see the sketch below):

      • df.loc["row index name"]: label-based

      • df.iloc[row integer position]: position-based
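
    • A minimal sketch of dropping and indexing (the example DataFrame is illustrative):

      import numpy as np
      import pandas as pd

      df = pd.DataFrame(np.arange(12).reshape(3, 4),
                        index=["a", "b", "c"], columns=["w", "x", "y", "z"])

      df.drop("w", axis=1)                # returns a copy without column "w"
      df.drop("a", axis=0, inplace=True)  # drops row "a" in place

      df.loc["b"]         # row by label
      df.iloc[0]          # row by integer position (now row "b")
      df.loc["b", "x"]    # single value by labels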

  2. DataFrames - Part 2

    • :star: Python's and keyword works only for scalar boolean objects, not for arrays of booleans; use & (and |) for arrays

    • reset index: df.reset_index(), the original index becomes a column; not in place by default

    • set index: df.set_index(column name); see the sketch below
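
    • A minimal sketch of boolean selection and index resetting (the example DataFrame is illustrative):

      import pandas as pd

      df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

      df[(df["a"] > 1) & (df["b"] < 6)]  # & instead of `and` for boolean arrays

      df2 = df.reset_index()   # the old index becomes a column; df is unchanged
      df3 = df.set_index("a")  # column "a" becomes the new index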

  3. DataFrames - Part 3

    • pd.MultiIndex.from_tuples(list of tuples)

    • df.xs(index value, level=index name): cross-section method (see the sketch below)
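
    • A minimal MultiIndex sketch (group and column names are illustrative):

      import numpy as np
      import pandas as pd

      outside = ["G1", "G1", "G2", "G2"]
      inside = [1, 2, 1, 2]
      hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)),
                                             names=["Group", "Num"])
      df = pd.DataFrame(np.random.randn(4, 2), index=hier_index, columns=["A", "B"])

      df.loc["G1"]           # select by the outer level
      df.xs(1, level="Num")  # cross-section: rows with Num == 1 across both groups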

  4. Missing data

    • df.dropna() (default axis=0): drops any row containing NaN

      • thresh=int: keep only rows/columns with at least that many non-NaN values

    • df.fillna(value=...); see the sketch below
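
    • A minimal missing-data sketch (the example DataFrame is illustrative):

      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"A": [1, 2, np.nan], "B": [5, np.nan, np.nan], "C": [1, 2, 3]})

      df.dropna()          # drops rows containing any NaN
      df.dropna(axis=1)    # drops columns containing any NaN
      df.dropna(thresh=2)  # keeps rows with at least 2 non-NaN values
      df.fillna(value=0)   # replaces NaN with 0
      df["A"].fillna(value=df["A"].mean())  # replaces NaN with the column mean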

  5. Groupby

    • df.groupby(column name).mean()

    • :star: df.groupby().describe() (see the sketch below)
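
    • A minimal groupby sketch (company names and sales figures are illustrative):

      import pandas as pd

      df = pd.DataFrame({"Company": ["GOOG", "GOOG", "MSFT", "MSFT"],
                         "Sales": [200, 120, 340, 124]})

      df.groupby("Company").mean()      # mean Sales per company
      df.groupby("Company").describe()  # count, mean, std, quartiles per company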

  6. Merging, joining, and concatenating

    • pd.concat([df1, df2, df3, ...]), axis=0 by default => stacks the rows

      • if columns/indices don't align, the result will contain NaN

    • pd.merge(df1, df2, how="left/right/inner/outer", on=column name)

    • df1.join(df2): similar to merge but uses the index rather than a column for matching (see the sketch below)
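
    • A minimal sketch of concat, merge, and join (keys and values are illustrative):

      import pandas as pd

      left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
      right = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})

      pd.concat([left, right])                      # stacks rows (axis=0)
      pd.concat([left, right], axis=1)              # joins columns side by side
      pd.merge(left, right, how="inner", on="key")  # SQL-style join on a column
      left.set_index("key").join(right.set_index("key"))  # join matches on the index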

  7. Operations

    • find unique values: df[col].unique()

    • find frequency of each unique value in a column: df[col].value_counts()

    • :star: apply method: df[col].apply(fun)

    • df.sort_values(by = col_name)

    • df.isnull()

    • pivot table: df.pivot_table(values=, index=, columns=); see the sketch below
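
    • A minimal sketch of these operations (the example DataFrames are illustrative):

      import pandas as pd

      df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [444, 555, 666, 444]})

      df["col2"].unique()                # array of unique values
      df["col2"].value_counts()          # frequency of each unique value
      df["col1"].apply(lambda x: x * 2)  # apply a function element-wise
      df.sort_values(by="col2")          # sort rows by a column
      df.isnull()                        # boolean mask of missing values

      data = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                           "B": ["one", "two", "one", "two"],
                           "D": [1, 3, 2, 5]})
      data.pivot_table(values="D", index="A", columns="B")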

  8. Data I/O

    1. .csv: pd.read_csv(), df.to_csv("name.csv", index=False)

    2. Excel: xlrd module

      • pd.read_excel("file.xlsx", sheet_name=" ")

      • df.to_excel("file.xlsx", sheet_name=" ")

    3. HTML: lxml/html5lib/BeautifulSoup4

      • pd.read_html("xxx.html") => a list of DataFrames

    4. SQL: sqlalchemy => create_engine (see the sketch below)
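
    • A minimal I/O sketch (file names and the in-memory SQLite engine are illustrative):

      import pandas as pd
      from sqlalchemy import create_engine

      df = pd.read_csv("example.csv")          # hypothetical file
      df.to_csv("my_output.csv", index=False)  # index=False avoids writing the index as a column

      engine = create_engine("sqlite:///:memory:")  # temporary in-memory database
      df.to_sql("my_table", engine)
      sql_df = pd.read_sql("my_table", con=engine)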

Section 8. Python for data visualization - matplotlib#

42-44. matplotlib Parts 1 - 3

  • %matplotlib inline: used in a Jupyter notebook; otherwise plt.show() is needed every time

  • Functional method: plt.plot(x, y)

  • OO method:

    import matplotlib.pyplot as plt

    fig = plt.figure()  # figure -> canvas
    axes = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # rect: [left, bottom, width, height]
    axes.plot(x, y)  # x, y defined elsewhere
    
  • Note: plt.subplot() \(\ne\) plt.subplots(); the latter returns a figure plus an array of axes objects, equivalent to calling fig.add_axes() multiple times

    fig, axes = plt.subplots()

  • plt.tight_layout(): better displays multiple subplots (see the sketch below)
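
  • A minimal subplots sketch (the data is illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 5, 11)
    y = x ** 2

    fig, axes = plt.subplots(nrows=1, ncols=2)  # figure + array of axes objects
    axes[0].plot(x, y)
    axes[0].set_title("y = x^2")
    axes[1].plot(y, x)
    axes[1].set_title("inverse")
    plt.tight_layout()  # prevents subplot overlap
    plt.show()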

Section 9. Python for data visualization - seaborn#

  1. distplot: histogram (optionally with a KDE overlay)

    jointplot: scatter plot of two variables with marginal distributions

    pairplot: scatter plot matrix across column pairs

    rugplot: a dash for each data point; building block of kdeplot, the density plot

  2. Categorical plot

    • barplot: x = cat, y = continuous

    • countplot: x = cat, y = count of occurrences

    • boxplot: x = cat, y = continuous

    • violinplot: hue, split = True

    • stripplot: jitter = True

    • swarmplot: stripplot + violinplot

    • factorplot: can do all of the above by specifying kind = ... (renamed catplot in newer seaborn versions)

  3. Matrix plot

    • matrix form of the data

    • heatmap (matrix-df)

    • clustermap: clustered version of heatmap

  4. Grids

    • g = sns.PairGrid(df)

    • g.map_diag() / g.map_upper() / g.map_lower()

    • FacetGrid: creates subgroups for plotting

  5. Regression plot

    • lmplot(): scatter plot with regression line

  6. Style & color

    • sns.set_style()

    • sns.despine(), inputs: top, right, bottom, left

    • sns.set_context("poster", font_scale=x); a combined sketch of this section follows below
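
    • A combined sketch using seaborn's built-in tips dataset (downloaded and cached on first use):

      import seaborn as sns
      import matplotlib.pyplot as plt

      tips = sns.load_dataset("tips")

      sns.barplot(x="sex", y="total_bill", data=tips)  # categorical x, continuous y
      sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)  # matrix plot
      sns.lmplot(x="total_bill", y="tip", data=tips)   # scatter plot with regression line

      g = sns.PairGrid(tips)
      g.map_diag(plt.hist)
      g.map_upper(plt.scatter)
      g.map_lower(sns.kdeplot)

      sns.set_style("whitegrid")
      sns.despine()  # removes the top and right spines by default
      plt.show()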

Section 10. Python for data visualization - Pandas built-in data visualization#

  1. Pandas built-in data visualization

    • df['col'].plot(kind="", ...)

    • df['col'].plot.hist()

    • df.plot.<kind>(), e.g. df.plot.area(), df.plot.scatter(x=, y=) (see the sketch below)
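
    • A minimal sketch of the built-in plotting (the random data is illustrative):

      import numpy as np
      import pandas as pd

      df = pd.DataFrame(np.random.randn(100, 2), columns=["a", "b"])

      df["a"].plot(kind="hist", bins=30)  # kind as a string
      df["a"].plot.hist(bins=30)          # equivalent method form
      df.plot.scatter(x="a", y="b")       # kind as an attribute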

Section 13. Data capstone project#

  • .groupby().unstack(): pivots a level of the grouped (multi-)index into columns

  • .groupby().reset_index() -> FacetGrid()

Section 18. K-nearest neighbors#

  • In KNN, all variables need to be on the same scale, otherwise some variables may dominate the distance calculation (see the sketch after this list)

  • Find scikit-learn cheatsheet (multiple)

  • Tuning parameters

    • n_neighbors (K)

    • Distance metric
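
  • A minimal sketch of scaling + KNN (the iris data and n_neighbors=5 are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # scale first so that no variable dominates the distance calculation
    X_scaled = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3)

    knn = KNeighborsClassifier(n_neighbors=5)  # tuning parameter: n_neighbors
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)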

Section 19. Decision trees & random forest#

Section 20. SVM#

    • from sklearn.svm import SVC

    • Grid search (see the sketch below):

      • from sklearn.model_selection import GridSearchCV

      • grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)

      • grid.fit(X_train, y_train)
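
    • A minimal grid-search sketch (the breast cancer data and the parameter grid are illustrative):

      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split, GridSearchCV
      from sklearn.svm import SVC

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

      param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
      grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
      grid.fit(X_train, y_train)          # refit=True retrains on the best parameters
      print(grid.best_params_)
      predictions = grid.predict(X_test)  # uses the refit best estimator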

Section 21. K means clustering (unsupervised)#

  • Finding K:

    • “elbow” method, using SSE: the sum of the squared distances between each member of a cluster and its centroid

    • In sklearn.datasets, make_blobs can be used to generate fake cluster data

    • After fitting KMeans to the data, retrieve the centers from kmeans.cluster_centers_ and the cluster labels from kmeans.labels_ (see the sketch below)
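
    • A minimal K-means sketch (the blob parameters are illustrative):

      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans

      # fake cluster data: 200 samples around 4 centers
      X, y_true = make_blobs(n_samples=200, centers=4, cluster_std=1.8)

      kmeans = KMeans(n_clusters=4)
      kmeans.fit(X)
      print(kmeans.cluster_centers_)  # coordinates of the 4 centroids
      print(kmeans.labels_)           # cluster assignment of each sample
      print(kmeans.inertia_)          # SSE, the quantity tracked by the elbow method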

Section 22. PCA#

  1. PCA with python

    • need to standardize the variables before conducting PCA:

      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      scaler.fit(df)
      scaled_df = scaler.transform(df)
      
    • Load PCA (a fuller sketch follows below)

      from sklearn.decomposition import PCA
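
    • A fuller sketch continuing from the scaled data above (n_components=2 is illustrative):

      from sklearn.decomposition import PCA

      pca = PCA(n_components=2)         # keep the first two principal components
      pca.fit(scaled_df)
      x_pca = pca.transform(scaled_df)  # projected data, shape (n_samples, 2)
      pca.components_                   # loadings of each component
      pca.explained_variance_ratio_     # variance explained per component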
      

Section 23. Recommendation system#

  • Content based: uses attributes of the items and recommends based on the similarity between them

  • Collaborative filtering (CF): e.g. Amazon; based on knowledge of users' attitudes toward items, i.e. “wisdom of the crowd”

    • more commonly used, produces better results

    • able to do feature learning on its own

  • CF subtypes

    • memory-based collaborative filtering

    • Model-based collaborative filtering: SVD

  • Pandas: df.corrwith(other): pairwise correlation between the columns of two DataFrames (or a DataFrame and a Series)

  • The method here seems to only use correlation between ratings to check the similarity between 2 movies (see the sketch below)
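
  • A minimal sketch of that correlation-based similarity idea (the ratings matrix and movie names are illustrative):

    import pandas as pd

    # hypothetical user x movie ratings matrix (NaN = not rated)
    ratings = pd.DataFrame({"Movie A": [5, 4, None, 2],
                            "Movie B": [4, 5, 3, 1],
                            "Movie C": [1, 2, 5, None]})

    target = ratings["Movie A"]
    similar = ratings.corrwith(target)    # correlation of every movie's ratings with Movie A's
    similar.sort_values(ascending=False)  # most similar movies first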

Section 25. Big data and spark with python#

  • Local vs distributed system:

    • distributed means multiple machines connected in a network

  • Hadoop: a way to distribute very large files across multiple machines

    • Hadoop Distributed File System (HDFS)

    • Hadoop also uses MapReduce, that allows computations on that data

    • HDFS uses blocks of data with a size of 128 MB by default; each of these blocks is replicated 3 times; the blocks are distributed in a way to support fault tolerance

  • MapReduce: a way of splitting a computation task across a distributed set of files (such as HDFS); it consists of a Job Tracker and multiple Task Trackers

  • Spark can be thought of as a flexible alternative to MapReduce:

    • MapReduce requires files to be stored in HDFS; Spark doesn't

    • Spark can perform operations up to 100x faster than MapReduce

  • Core idea of Spark: resilient distributed dataset (RDD), four main features

    • Distributed collection of data

    • Fault-tolerant

    • Parallel operation - partitioned

    • Ability to use many data sources

  • RDDs are immutable, lazily evaluated, and cacheable

  • There are two types of RDD operations:

    1. Transformations

      • RDD.filter

      • RDD.map: ~ pd.apply()

      • RDD.flatMap

    2. Actions

      • First: return the first element of RDD

      • Collect: return all the elements of the RDD

      • Count: return the number of elements

      • Take: return an array with the first n elements of the RDD

  • Reduce(): aggregate RDD elements using a function that returns a single element

  • ReduceByKey(): aggregate Pair RDD elements using a function that returns a Pair RDD

    • similar to a groupby operation

  • AWS EC2: virtual computer lab

    • login to EC2 using SSH

      ssh -i xx.pem ubuntu@<public DNS>
      
    • PySpark setup

      • source .bashrc # set to Anaconda Python

  1. Intro. to Spark & Python

    • Notebook magic command:

      %%writefile example.txt
      text ...
      

      everything below the magic command is written into “example.txt”

      from pyspark import SparkContext
      SC = SparkContext()
      
      • SC has many different methods

  2. RDD Transformations & Actions

    • Transformation => an RDD object

    • Action => a local object (see the sketch below)
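
    • A minimal RDD sketch (assumes the example.txt written above and a fresh session, since only one SparkContext can run at a time):

      from pyspark import SparkContext

      sc = SparkContext()
      rdd = sc.textFile("example.txt")  # RDD of lines

      words = rdd.flatMap(lambda line: line.split())   # transformation => RDD
      long_words = words.filter(lambda w: len(w) > 3)  # transformation => RDD

      long_words.first()    # action => local object (first element)
      long_words.collect()  # action => local list of all elements
      long_words.count()    # action => local integer

      pairs = words.map(lambda w: (w, 1))
      pairs.reduceByKey(lambda a, b: a + b).collect()  # word counts, like a groupby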

Section 26. Neural Nets and Deep Learning#

  • Perceptron: “feed-forward” model, i.e. inputs are sent into the neuron, are processed, and result in an output

    1. receive inputs

    2. weight inputs

    3. sum inputs

    4. generate output

  1. TensorFlow

    • Basic idea: create data flow graphs, which have nodes and edges. The array (data) passed along from layer of nodes to layer of nodes is known as a Tensor

    • Two ways to use TF:

      • Customizable Graph Session

      • Scikit-Learn type interface with contrib.learn

  2. TensorFlow basics

  • Object/Data is called “Tensor”

  • tf.Session() => sess.run() method to evaluate tensors

  • Placeholder: inserts a placeholder for a tensor that will always be fed (see the sketch below)
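
  • A minimal TF 1.x sketch of sessions and placeholders (the constants are illustrative):

    import tensorflow as tf  # TF 1.x API, as used in the course

    a = tf.constant(5)
    b = tf.placeholder(tf.int32)  # value supplied at run time
    add = a + b

    with tf.Session() as sess:
        result = sess.run(add, feed_dict={b: 3})  # placeholders must be fed
    print(result)  # 8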

  1. TF estimators

    • Steps (see the sketch after this list)

      1. Read in Data (normalize if necessary)

      2. Train/test split the data

      3. Create estimator feature columns

      4. Create the estimator input function

      5. Train estimator model

      6. Predict with new test input function
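
    • A hedged sketch of those steps with the TF 1.x Estimator API (the toy DataFrame, feature name, and hyperparameters are illustrative):

      import pandas as pd
      import tensorflow as tf
      from sklearn.model_selection import train_test_split

      # 1. read in data (toy DataFrame with one numeric feature and a 0/1 label)
      df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": [0, 0, 1, 1]})

      # 2. train/test split
      X_train, X_test, y_train, y_test = train_test_split(df[["x"]], df["label"], test_size=0.5)

      # 3. feature columns
      feat_cols = [tf.feature_column.numeric_column("x")]

      # 4. input functions
      input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=2,
                                                     num_epochs=None, shuffle=True)
      pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, shuffle=False)

      # 5. train the estimator model
      model = tf.estimator.LinearClassifier(feature_columns=feat_cols)
      model.train(input_fn=input_fn, steps=100)

      # 6. predict with the test input function
      predictions = list(model.predict(input_fn=pred_fn))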