A Quickstart Tutorial for MLDev

Contents

Installation instructions

Please check the project README.md for installation instructions.

Hello World!

Example for version v0.3 and above

Let us start with a simple Hello World! example.

MLDev uses YAML to describe experiments. We are going to write a simple YAML file that prints out Hello World!, the current time and then exits.

Step 1. Create a pipeline

After you install MLDev, create an empty folder hello_world in your home directory. Create an empty experiment.yml file in the hello_world folder.

Each experiment contains at least one pipeline. A pipeline is a sequence of steps called stages. A stage is a unit of computation that can be repeated as a whole. Stages take inputs from disk and produce outputs, which are in turn persisted to disk or other storage.

In order to run an experiment, we create a pipeline of type GenericPipeline with a single stage. This basic stage does not take any inputs and does not produce any output files.

run_hello_world: !GenericPipeline
  runs:
    - !BasicStage
      name: hello_world

The stage name hello_world is used in log messages when the stage runs.

Step 2. Add a script to the stage hello_world

Then we add two bash commands to the script: field of the stage. Here are the annotated contents of the experiment file.

pipeline: !GenericPipeline
  # the pipeline sets the sequence of steps 
  runs:
    - !BasicStage
      # each step is a stage
      # which takes inputs from disk and
      # puts outputs back to disk
      name: hello_world
      script:
        # this is our bash script
        - echo Hello World!
        # note that we do not need to escape double quotes
        # if we use text block like this 
        - >
          echo "Time is $(date)"

In this example we use the YAML folded block prefix ">" that folds all similarly indented lines after it into a single line.
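To see the difference between block styles, here is a small standalone YAML sketch (not part of the experiment file). The ">" prefix folds the indented lines into one line, while the "|" prefix, which appears later in this tutorial, keeps the line breaks.

```yaml
# '>' folds the indented lines into a single line:
folded: >
  echo
  "Time is $(date)"
# parsed value: echo "Time is $(date)"

# '|' keeps the line breaks, so this is two commands:
literal: |
  echo first line
  echo second line
```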

Step 3. Setup and run the experiment

applies to v0.3

Let us configure our experiment and set up a virtual environment for our experiment. Generally, MLDev uses python venv to manage code dependencies. But in this experiment we do not have any python code yet, so initialization is simple.

Switch to the experiment folder hello_world and run the init command.

$ mldev init -p venv .

At this time, you should have a folder hello_world with a file experiment.yml and a venv folder in it.

$ ls .
experiment.yml
venv

In order to run the experiment, use the following command.

$ mldev run

You may see many lines starting with INFO:; ignore them for now. The results of the stage are the last two lines, which will look something like this:

$ mldev run
INFO: ...
...
Hello World!
Time is Tue Apr 10 12:19:30 CET 2021

Hello World completed!

Congrats, you have run your first MLDev pipeline.

In this tutorial we have also used the following MLDev features:

  • experiment setup, pipelines and stages

  • configuration management using venv

  • running the experiments

Simple classification task on template-default

Example for version v0.3 and above

Please install MLDev as described in the previous example, if you have not done so already.

In this tutorial, we will see how MLDev uses templates to pre-configure experiments and manages changes using Git.

These MLDev features make starting a new experiment easier. We use the template-default template and create a pipeline of three stages:

  • prepare, that initializes the data

  • train, that trains the model

  • predict, that runs and evaluates the model

Step 1. Get the template

Getting a template for your experiment is easy. Use the mldev init command with the -t parameter, specify the name or URL of the template, and set the destination folder.

$ mldev init -t template-default ./template-default

This will download the template and configure a new local Git repository for the experiment. If Git user and email are not configured, MLDev will ask you to provide them.

After the MLDev command completes, you should have the template-default folder with at least the following files and folders from the template:

models
src
results
venv
LICENSE
README.md
requirements.txt

Here README.md describes in detail the classification task we are going to solve. The src folder contains the source code of the experiment. The venv folder contains the virtual environment the experiment will be run in. It was created and configured for you by MLDev.

Step 2. Prepare the data

Let us create a new experiment file experiment-new.yml and put our new pipeline in it.

We create a new pipeline named run_prepare and add a basic stage that takes the source code ./src as an input dependency and outputs results to the data folder. If no files in the input dependencies have been updated since the previous run, MLDev may skip the stage and reuse its outputs.

Here is the full source code for the pipeline. Note that we are using YAML anchors to reuse parts of the pipeline later in this tutorial.

run_prepare: !GenericPipeline
  runs:
    # we can use anchors '&' to reuse the stage later
    - !BasicStage &prepare_stage
      name: prepare
      # here we specify input dependencies - files or folders
      inputs: [ ./src/ ]

      # here are outputs - the ./data folder
      # note the anchor
      outputs: &input_data [ ./data/ ]

      # a bash script for the stage - just run src/prepare.py
      script:
        - python3 src/prepare.py
        - echo

In order to run the pipeline, switch to the experiment folder and run:

$ mldev run -f experiment-new.yml run_prepare

After the command completes, check that the ./data folder contains the following files:

$ ls ./data
X_dev.pickle   X_test.pickle   X_train.pickle
y_dev.pickle   y_test.pickle   y_train.pickle

If you would like to re-run the stage even if the input dependencies did not change, add the --force-run option like this:

$ mldev run -f experiment-new.yml --force-run run_prepare

Step 3. Two-stage pipeline

Now it is time to make a two-stage pipeline. We add a train stage to the experiment-new.yml file.

Of course, we would like to reuse the stage from our previous pipeline, so we refer to it with the *prepare_stage alias. We also reuse the outputs of that stage via the *input_data alias.

Here is the full definition of the pipeline

run_prepare_train: !GenericPipeline
  runs:
    # We use an anchor to reuse the 'prepare' stage
    - *prepare_stage

    # Add a second stage to the pipeline and set an anchor
    - !BasicStage &train_stage
      name: train

      # We can use params attribute of the stage to add any needed parameters
      # They can be used in computable expressions
      # See below how to do it
      params:
        num_iters: 10

      # Here we use another anchor to add data dependency on the previous stage
      inputs: *input_data
      outputs: &model_data [ ./models/default/model.pickle ]
      
      script:
        # We use a computable expression to get num_iters parameter 
        # This works similar to BASH variables, but uses python
        # Environment variables are available using dollar sign $ without braces
        # I.e. $PATH instead of ${PATH}
        - python3 src/train.py --n ${self.params.num_iters}
        - echo Current PATH= $PATH
        - echo

Computable expressions are a major feature of MLDev. These expressions can be used to get a parameter value from the running experiment context (the runtime representation of the dynamic YAML document) and use it in scripts or other places.

Expressions are computed at run-time, on demand. In our example, the self variable inside the expression refers to the currently running stage. If you reuse the expression in another stage, self will refer to that new stage.
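To illustrate, here is a hypothetical sketch (not part of this tutorial's experiment file): the same script, reused through a YAML anchor, prints different values because self is bound to whichever stage is currently running.

```yaml
run_self_demo: !GenericPipeline
  runs:
    - !BasicStage
      name: fast
      params:
        num_iters: 5
      # set an anchor so the second stage can reuse this script verbatim
      script: &shared_script
        # self is the stage being run, so the value differs per stage
        - echo num_iters is ${self.params.num_iters}
    - !BasicStage
      name: slow
      params:
        num_iters: 50
      # prints "num_iters is 50" even though the script text is shared
      script: *shared_script
```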

Run this new pipeline with the command

$ mldev run -f experiment-new.yml run_prepare_train

Then check, that the outputs we set in the train stage are created successfully.

$ ls ./models/default
model.pickle

Here default is the name of the trial (run_name), which is specific to our experiment source code. Run python ./src/train.py --help for more details.

Step 4. Variables and expressions

There is no need to create a new pipeline each time, of course. You can add stages to your existing pipelines whenever needed. Be aware, though, that this will require re-running the modified pipelines.

Let us finally build the full pipeline. Add the following to the experiment-new.yml file.

run_predict: !GenericPipeline
  runs:

    # We reuse past stages
    - *prepare_stage
    - *train_stage

    # And add a new stage
    - !BasicStage
      name: predict
  
      # note the use of an anchor
      inputs: *model_data
      outputs: [ ./results/ ]

      # We can add more environment variables 
      # They will be available to the script at 'prepare time'
      # path() is an MLDev function to compose a full path from 
      # a relative path or link
      env:
        MLDEV_MODEL_PATH: "${path(self.inputs[0])}"
      script:
        # Variables can be set in the 
        # MLDev own config in .mldev/config.yaml in section environ
        # For example, PYTHON_INTERPRETER
        - $PYTHON_INTERPRETER src/predict.py

        # Here are two examples
        # (1) In the first line we read path from the environment at run-time
        # (2) In the second line we read the value using 
        # a computable expression at prepare-stage pass
        # We avoid escaping the colon by using a multiline block
        - |
          echo From the environment: $MLDEV_MODEL_PATH
          echo From the stage params: ${self.env.MLDEV_MODEL_PATH}
          echo

Here is the explanation of the difference between (1) and (2). MLDev runs stages in two passes. During the first pass, the "prepare-stage" pass, the stages in the pipeline are asked to check their parameters and prepare to be run. This pass occurs before anything gets executed, so in case of an error no damage is done.

At the second, "run-stage" pass, the stages are executed using their validated parameters, in the same sequence as in the "prepare-stage" pass.

Run the new pipeline as usual:

$ mldev run -f experiment-new.yml run_predict

In the output find the following lines:

...
INFO:mldev:Loading experiment-new.yml
INFO:mldev:Running run_predict from experiment-new.yml
INFO:mldev:GenericPipeline Preparing: prepare
INFO:mldev:GenericPipeline Preparing: train
INFO:mldev:GenericPipeline Preparing: predict
INFO:mldev:GenericPipeline Running: prepare
INFO:mldev:Unchanged (prepare)
INFO:mldev:GenericPipeline Running: train
INFO:mldev:Unchanged (train)
INFO:mldev:GenericPipeline Running: predict
...

The first line identifies the experiment being run. The second specifies the pipeline from the experiment. The third to fifth lines inform you about the stages that are prepared to run. Then the stages are run; prepare and train are skipped because nothing changed in their input dependencies.

After the command completes, check that the output folder contains the results.

$ ls ./results/default
dev_report.csv  report.csv  test_report.csv  train_report.csv

Task completed!

That's it! You have completed the second tutorial on MLDev.

In this tutorial we used the following MLDev features:

  • experiment templates

  • creating multi-stage pipelines with inputs and outputs

  • stage execution and lifecycle

  • computable expressions

  • environment variables for scripts

Using the Collaboration Tool

Please watch the tutorial at the Collaboration Tool page.

How to get help

Feel free to ask a question on t.me/mldev_betatest or submit a question/suggestion here