Speeding up machine learning development with MLflow

He works on the AI platform team at LinkedIn.

# Agenda:
 - unique challenges
 - ml platform tour
 - intro of MLflow
   - demo

# Development Process Overview
   paper: "Hidden Technical Debt in Machine Learning Systems" (paper from Google)

   we've seen this picture at Unbabel: 90% of the work for ML is infra and tooling, 10% is ML code
   configs, data collection, data verification, hw resource mgmt, serving infra, monitoring, etc.

   This is the story of Software 2.0
   Instead of data driven -> Model driven.

   Sw-devel vs Model-devel
   Goal:
    sw: meet requirements
    model: optimize biz metrics (for us MQM)
   Quality:
    sw: depends on code
    model: depends on data, code, algo, parameters
   Tool:
    sw: standard sw stack
    ...
    lost slide.


    Development dimensions are also different:
    - there is an experimentation dimension
     this is scientific endeavor

     the experimentation dimension is what we currently do as a normal practice.
     this is the work we did for example, to get generic tickets models for transformers for EN-DE.

     Model training is done offline, and then inferencing is done online.

     As your data scales, training cost and time increase polynomially.
     you need tooling for your training systems.

     ML Platform Tour:
      you need a whole process and culture.

    Separation of concerns:
    scientists and ML platform.

    Something common to both sw-based and model-based development is the ABILITY TO ITERATE QUICKLY

# Some platforms that exist out there.
    - FBLearner
    - Michelangelo
    - Google TFX
    -

    common functionalities:
     - feature infra
     - model training
     - model management
     - model running
     - ... lost slide
     - model monitoring


    "major pain points associated with ML project dramatically changes as the scale of the project increases"

    the cloud ones provide pretrained models


    ML Services:
    a layer above infra, that sits on top of k8s, pytorch, gpus, caffe, tensorflow, etc..
    ML ID, experimentation, training management, monitoring


# MLFLOW
    Open Source ML platform.

    principles:
     - open
     - ease of use
     - extensible
     - scalable
    (too many principles, trying to cover everything..)

    manage the ML lifecycle, including experimentation, reproducibility and ..

    What does it give you:
    - tracking
    - project
    - model
    - model registry

    Tracking
        record and query experiments, code, configs, results, etc..
        .. pic of an Excel sheet used to track experiments.. (it's the wild west..)

    .. so it includes:
     - kv parameters
     - metrics
     - artifacts
     - source code (git url + commit hash..)
     - version
     - tags and notes
    ... connects to notebooks, apps, cloudjobs etc..  w/ APIs

    it holds the artifacts and metadata

    (now he is demoing the UI for running experiments for a project)
    now he is showing the code used for this; it is plain Python,
    with a few changes to run the experiment inside a `with mlflow.start_run(...) as run: ...` block
    and some extra statements to log details into the run, like `mlflow.log_param(...)`, `mlflow.log_metric(...)` or `mlflow.log_artifact(filepath)`
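
Not from the talk: a minimal sketch of what that instrumented code can look like, using MLflow's tracking calls against a local directory instead of a tracking server. The function name `run_experiment`, the experiment name and the fake `rmse` formula are made up for illustration; the import is guarded only so the snippet still loads where MLflow is not installed.

```python
import tempfile
from pathlib import Path

try:
    import mlflow  # third-party: pip install mlflow
except ImportError:  # keep the sketch importable without MLflow
    mlflow = None


def run_experiment(alpha, tracking_dir):
    """Log one toy 'training' run and return its run ID."""
    if mlflow is None:
        raise RuntimeError("mlflow is not installed")

    # Store runs in a local directory rather than on a tracking server.
    mlflow.set_tracking_uri(Path(tracking_dir).resolve().as_uri())
    mlflow.set_experiment("talk-notes-demo")  # hypothetical experiment name

    with mlflow.start_run() as run:
        mlflow.log_param("alpha", alpha)    # key-value parameter
        rmse = 1.0 / (1.0 + alpha)          # stand-in for a real evaluation
        mlflow.log_metric("rmse", rmse)     # metric
        mlflow.set_tag("note", "toy run")   # tag

        # Any file can be attached to the run as an artifact.
        with tempfile.TemporaryDirectory() as scratch:
            report = Path(scratch) / "report.txt"
            report.write_text(f"alpha={alpha} rmse={rmse}\n")
            mlflow.log_artifact(str(report))
        return run.info.run_id
```

Calling `run_experiment(0.5, "./mlruns-demo")` a few times with different `alpha` values should give a set of runs that can be compared in the UI (for a local directory, `mlflow ui --backend-store-uri ./mlruns-demo`).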

    Models
        general format to describe the model
        this helps with the problem of deploying the model

        You can just use an API call to register the models that were created.
        metadata is in yaml.

        this metadata includes a "closure" of the code used to create this model, so this code can be used to run it.
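
Not from the talk: a toy, stdlib-only sketch of the idea of a self-describing model directory, where a small metadata file says which code knows how to load and run the stored model. This is not MLflow's implementation. MLflow keeps this metadata in a YAML file named `MLmodel` that lists "flavors"; JSON is used here only because the standard library has no YAML writer, and `save_model`, `load_model`, `LOADERS` and the file names are all hypothetical.

```python
import json
from pathlib import Path


def _load_linear(params):
    """Build a predict function from stored linear-model parameters."""
    w, b = params["w"], params["b"]
    return lambda xs: [w * x + b for x in xs]


# Maps a "flavor" name found in the metadata to the code that can run it.
LOADERS = {"linear": _load_linear}


def save_model(model_dir, flavor, params):
    """Write the model parameters plus a metadata file describing them."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "params.json").write_text(json.dumps(params))
    metadata = {"flavor": flavor, "data": "params.json"}
    (model_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))


def load_model(model_dir):
    """Read the metadata, pick the matching loader, return a predict function."""
    model_dir = Path(model_dir)
    metadata = json.loads((model_dir / "metadata.json").read_text())
    params = json.loads((model_dir / metadata["data"]).read_text())
    return LOADERS[metadata["flavor"]](params)
```

The point of the design is that whoever deploys the model only needs `load_model(path)`; which framework produced the model is recorded in the metadata rather than in the deployment code.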

    Model Registry
        this is to track sharing, versioning, approval
        centralized activity logs and comments.
        can be integrated with CI/CD
        model administration and management.
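
Not from the talk: a toy in-memory sketch of what "versioning, approval and a centralized activity log" can mean in practice. `ToyRegistry` and its methods are hypothetical and are not MLflow's API. The stage names mirror the ones MLflow's registry has used, and the rule that promoting a version to Production archives the previous one is an assumption made for the example.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry (not MLflow's API)."""

    STAGES = ("None", "Staging", "Production", "Archived")

    def __init__(self):
        self._versions = {}     # model name -> list of version dicts
        self.activity_log = []  # centralized, append-only history

    def _log(self, message):
        self.activity_log.append(message)

    def register(self, name, source):
        """Add a new version of `name`; versions count up from 1."""
        versions = self._versions.setdefault(name, [])
        entry = {"version": len(versions) + 1, "source": source,
                 "stage": "None", "comments": []}
        versions.append(entry)
        self._log(f"{name} v{entry['version']} registered from {source}")
        return entry["version"]

    def get(self, name, version):
        """Return the stored record for one version."""
        return self._versions[name][version - 1]

    def transition(self, name, version, stage, comment=""):
        """Move a version to `stage`, recording an optional approval comment."""
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            # Assumed rule: only one Production version at a time.
            for other in self._versions[name]:
                if other["stage"] == "Production" and other["version"] != version:
                    other["stage"] = "Archived"
                    self._log(f"{name} v{other['version']} -> Archived")
        entry = self.get(name, version)
        entry["stage"] = stage
        if comment:
            entry["comments"].append(comment)
        self._log(f"{name} v{version} -> {stage}")

    def current(self, name, stage="Production"):
        """Return the newest version number in `stage`, or None."""
        for entry in reversed(self._versions.get(name, [])):
            if entry["stage"] == stage:
                return entry["version"]
        return None
```

A CI/CD job could, for example, call `register(...)` after training and `transition(..., "Production")` only once evaluation checks pass, while serving code asks `current(name)` for the version to load.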

    Model Serving
        It also has a command-line tool to serve a given model (`mlflow models serve`)
        this is to make it very easy to run models

    This talk wasn't very interesting; I still don't understand how it is "scalable" or how to extend it