Basic Data Science with scikit-learn

Planetary Examples

VO and Planetary Mapping Workshop 1-3 July 2019

Data Science

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

https://en.wikipedia.org/wiki/Data_science

Focus on data with respect to methods

Data Science approaches

  • Statistics
  • 'Classical' Machine Learning
  • Deep Learning

Does 'classical' Machine Learning still make sense?

Yes (particularly for vector data)

  • Data discovery tool
  • Better interpretability
  • Useful for pre-processing of DL applications
  • No mandatory GPU

Scikit-learn

https://scikit-learn.org/

  • Open source, commercially usable - BSD license
  • Python
  • Built on NumPy, SciPy, and matplotlib

Pandas

https://pandas.pydata.org/

A data manipulation library

Seaborn

https://seaborn.pydata.org/

A data visualisation library

Resources

Learning Problem Categories

  • Supervised Learning
    • Classification (discrete output values)
    • Regression (continuous output values)
  • Unsupervised Learning
    • Clustering (discover groups)
    • Manifold Learning (discover the 'true' dimensionality of a distribution)
    • ...
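As a rough illustration of the two main categories, the sketch below runs a classifier (supervised, uses the labels) and a clustering algorithm (unsupervised, ignores the labels) on the same synthetic data. The dataset from make_blobs and the choice of KNeighborsClassifier and KMeans are assumptions made for this example only; they are not taken from the workshop material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D points drawn around 3 centres (illustrative data only).
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: the classifier is trained with the known labels y.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
predicted = clf.predict(X)  # one discrete label per sample

# Unsupervised: the clustering algorithm only sees X and discovers groups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
clusters = km.labels_  # one cluster index per sample

print(predicted.shape, clusters.shape)
```

The cluster indices returned by KMeans are arbitrary, so they need not match the numbering of the original labels even when the groups coincide.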

Data structure

Data as tables or catalogs, i.e. vector information

Rows are catalog entries, granules, features (in GIS nomenclature), samples

Columns are observables, attributes (in GIS nomenclature), input parameters, features (in scikit-learn glossary)

Predictions are labels, real values, targets

X = input_data[n_samples,n_attributes]
Y = target_data[n_samples,n_targets]
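The layout above can be sketched with a small pandas table converted to NumPy arrays. The catalog, its column names, and its values are invented for illustration; only the shapes matter here.

```python
import pandas as pd

# Hypothetical catalog: each row is one sample, each column one attribute.
# All names and values are made up for this sketch.
catalog = pd.DataFrame({
    "diameter_km": [1.2, 5.4, 20.1, 3.3],
    "depth_km": [0.2, 0.9, 2.5, 0.6],
    "latitude_deg": [10.0, -45.2, 30.7, 5.1],
    "is_fresh": [1, 0, 0, 1],  # target column (label)
})

# X holds the input attributes, one row per sample.
X = catalog[["diameter_km", "depth_km", "latitude_deg"]].to_numpy()
# y holds the target; with a single target it is a 1-D array.
y = catalog["is_fresh"].to_numpy()

print(X.shape)  # (n_samples, n_attributes)
print(y.shape)  # (n_samples,)
```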

API structure

Estimators (algorithm classes) share common basic methods

  • fit: takes some samples X (and targets y if the model is supervised), validates the input data, then estimates and stores model attributes from the estimated parameters and the provided data.
  • predict: makes a prediction for each sample, taking X as input (in a classifier or regressor).
  • transform: if the estimator is a transformer, transforms the input, usually only X, into a transformed space.
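A minimal sketch of these three methods, assuming StandardScaler as the transformer and LinearRegression as the predictor (both chosen for illustration), on synthetic noise-free data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 samples, 2 attributes, exactly linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

# Transformer: fit learns per-column mean and scale, transform applies them.
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit estimates the coefficients, predict returns one value per sample.
model = LinearRegression()
model.fit(X_scaled, y)
y_pred = model.predict(X_scaled)

# Fitted attributes are stored with a trailing underscore.
print(scaler.mean_, model.coef_)
```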

pipeline: a chain of transforms followed by a final estimator.
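A pipeline can be sketched as follows. The step names "scale" and "clf", the choice of LogisticRegression, and the synthetic dataset are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit runs fit/transform on each transformer in turn, then fits the final estimator.
pipe.fit(X, y)
# predict applies the fitted transforms, then calls predict on the final estimator.
labels = pipe.predict(X)

print(labels[:10])
```

Because the pipeline itself exposes fit and predict, it can be used wherever a single estimator is expected.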