DS3: Python for Data Science

Synopsis

A recurring topic for several years now, data science brings together statistics, machine learning, computer science and domain expertise. Machine learning methods are characterized by algorithms that solve problems starting from data.

This training gives an overview of the diversity of machine learning methods, for both supervised learning (the values of the target variable are known and can be compared with the model's results) and unsupervised learning (the values of the target variable are not known a priori).
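The difference between the two settings can be sketched in a few lines. This is only an illustration, assuming scikit-learn and numpy are installed; the toy data, the choice of logistic regression and K-means, and all variable names are assumptions made for this example, not part of the course material.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated groups of 2D points.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
# Known target values; they are used only in the supervised case.
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised learning: the model is fitted on (X, y), and its predictions
# can be compared with the known targets.
clf = LogisticRegression().fit(X, y)
accuracy = (clf.predict(X) == y).mean()

# Unsupervised learning: only X is given; the algorithm proposes groups
# without ever seeing y.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print("share of correct predictions (supervised):", accuracy)
print("cluster assignments (unsupervised):", clusters)
```

Note that the cluster numbers returned by K-means are arbitrary labels: they identify groups, not the original classes.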

The training is designed around Python; the fundamentals of the language and some specialized libraries (pandas, scikit-learn) are its core building blocks.

Goals

Thanks to this training, you will develop the following skills:

  • Know how to use Python in a data analysis project
  • Know the main machine learning problems and the main models for each of them (which model for which context and with which dataset?)
  • Master the pandas library for data analysis and scikit-learn for machine learning model implementation

Duration

3 days

Prerequisites

  • Strong foundation in statistics and probability
  • Knowledge of the Python programming language

See also DS1: Introduction to Data Science and DS2: Python for scientific computing

Program

This program is indicative; it can be adapted to your specific needs.

  • Theoretical foundations

    • Types of statistical variables
    • Basic notions in statistics (mean, standard deviation, correlation, …)
    • Common probability distributions (Gaussian, uniform, Poisson, exponential, …)
    • Refresher on matrix computation

  • Working environment configuration

    • Setting up Python, IPython and Jupyter Notebook
    • Presentation of package management tools (pip, conda) and installation of Python data analysis libraries (numpy, pandas, matplotlib, seaborn)
    • First program and test of the machine configuration

  • Using Python data science libraries

    • Build a data pipeline with Luigi
    • Scientific computing with numpy
    • Dataset handling with pandas
    • Data visualization with matplotlib and seaborn

  • Machine learning algorithms with scikit-learn

    • Regression (linear regression, polynomial regression, Gaussian regression, XGBoost, …)
    • Classification (logistic regression, SVM, decision trees, …)
    • Clustering (K-means, DBSCAN, hierarchical clustering, …)
    • Dimensionality reduction (Principal Component Analysis)

  • Analysis of a "real" dataset

    • Reading from and writing to a CSV file
    • Elementary statistics and feature interpretation
    • Data handling with pandas
    • Machine learning model design with scikit-learn
    • Data visualization
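To give an idea of what the last part of the program looks like in practice, here is a minimal sketch of such a workflow, assuming pandas and scikit-learn are installed. The dataset is made up (and deliberately linear), the column names are hypothetical, and an in-memory buffer stands in for a CSV file on disk; a real session would work on an actual dataset and would likely use other models.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV content; in practice pd.read_csv would receive a file path.
csv_text = """surface,rooms,price
30,1,150
45,2,210
60,3,270
80,3,350
100,4,430
"""

# Reading from a CSV source.
df = pd.read_csv(io.StringIO(csv_text))

# Elementary statistics and a first look at how two features relate.
summary = df.describe()
corr = df["surface"].corr(df["price"])

# Data handling with pandas: derive a new column from existing ones.
df["price_per_m2"] = df["price"] / df["surface"]

# A simple scikit-learn model: predict the price from the surface.
model = LinearRegression().fit(df[["surface"]], df["price"])
df["predicted_price"] = model.predict(df[["surface"]])

# Writing the enriched table back to CSV (here into a text buffer).
out = io.StringIO()
df.to_csv(out, index=False)

print(summary)
print("correlation between surface and price:", corr)
print(out.getvalue())
```

The visualization step is left out to keep the sketch free of display dependencies; with matplotlib installed, `df.plot.scatter(x="surface", y="price")` would be one way to plot the data.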

Upcoming courses in Paris:

Contact us: most trainings are held on-site at your office, and dates can be adapted to your needs.