User Guide

Quickstart

This is a summary of a complete typical workflow.

Define the original dataset with DSWrapper.

from iquaflow.datasets import DSWrapper

ds_wrapper = DSWrapper(data_path=data_path)

Define the modifications intended for each experiment. In this case JPG Modifiers with quality from 10 to 90.

from iquaflow.datasets import DSModifier_jpg

ds_modifiers_list = [DSModifier_jpg(params={'quality': i}) for i in [10,30,50,70,90] ]

Define the model execution method. In this case the training method is a Python script, so we define a PythonScriptTaskExecution. Hyperparameter variations can also be added; here the number of epochs and the learning rate are varied. The tool then loops through all possible combinations of the variations. The user can also set the number of repetitions.

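from iquaflow.experiments import ExperimentSetup
from iquaflow.experiments.task_execution import PythonScriptTaskExecution

# The task wraps the training script that is executed for every run
task = PythonScriptTaskExecution(model_script_path="./path_to_script.py")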

experiment = ExperimentSetup(
   experiment_name="MyFirstExperiment",
   task_instance=task,
   ref_dsw_train=ds_wrapper,
   ds_modifiers_list=ds_modifiers_list,
   repetitions=5,
   extra_train_params={
      'epochs':[10,15,20],
      'lr':[1e-5,1e-6,1e-7]
   }
)

experiment.execute()
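
The loop over combinations described above can be pictured with itertools.product. This is only an illustrative sketch, not iquaflow's internal code. It assumes that every modifier is crossed with every hyperparameter combination and that each combination is repeated `repetitions` times. The helper enumerate_runs and the modifier names (jpg10_modifier, ...) are hypothetical.

```python
from itertools import product


def enumerate_runs(modifier_names, extra_train_params, repetitions=1):
    """List one dict per run: modifier x hyperparameter combination x repetition.

    Hypothetical helper for illustration; it does not call iquaflow.
    """
    keys = sorted(extra_train_params)
    value_lists = [extra_train_params[k] for k in keys]
    runs = []
    for modifier, values, rep in product(
        modifier_names, product(*value_lists), range(repetitions)
    ):
        run = {"modifier": modifier, "repetition": rep}
        run.update(dict(zip(keys, values)))
        runs.append(run)
    return runs


runs = enumerate_runs(
    modifier_names=[f"jpg{q}_modifier" for q in [10, 30, 50, 70, 90]],  # assumed names
    extra_train_params={"epochs": [10, 15, 20], "lr": [1e-5, 1e-6, 1e-7]},
    repetitions=5,
)
print(len(runs))  # 5 modifiers * 3 epochs * 3 lr * 5 repetitions = 225
```

Under these assumptions the Quickstart setup amounts to 225 runs, which is worth keeping in mind when choosing the number of repetitions.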

The information from the executed experiment can be collected in a json. A dataframe suitable for visualization tools (see next step) can also be extracted with ExperimentInfo.

from iquaflow.experiments import ExperimentInfo

experiment_info = ExperimentInfo(experiment_name)

runs = experiment_info.runs

df = experiment_info.get_df(
   ds_params=["modifier"],
   metrics=['rmse','epochs','lr'],
   dropna = True,
   fields_to_float_lst = ['rmse','lr'],
   fields_to_int_lst = ['epochs']
)

Visualizations can be made from the tool. In this case the root mean square error is plotted against the learning rate, with one curve per number of epochs (legend) in a shared chart.

import os

from iquaflow.experiments import ExperimentVisual

ev = ExperimentVisual(
   df,
   os.path.join(data_path, "mod-rmse-lr-epoch.png")
)

ev.visualize(
   xvar="lr",
   yvar="rmse",
   legend_var="epochs",
   title="rmse - lr"
)

Conventions

In iquaflow, conventions are preferred over configuration.

Dataset Formats

iquaflow understands a dataset as a folder containing a sub-folder with images and the ground truth in json format. Datasets that do not follow this format should be converted before running experiments.
For detection or segmentation tasks, the preferred formats are:
Json in COCO format.
GeoJson with the minimum required fields (“image_filename”, “class_id”, “geometry”).
A folder named maskes with images corresponding to the segmentation annotations.
iquaflow primarily works with COCO json ground truth, which is adopted by most datasets and models in the field. If the dataset is in another format, the user can convert it to COCO (see https://blog.roboflow.ai/how-to-convert-annotations-from-voc-xml-to-coco-json/). Otherwise, iquaflow cannot perform sanity or statistics checks.
For other kinds of tasks, such as image generation, it is only necessary to have the ground truth in json format. Alternatively, iquaflow can recognize a dataset without any ground truth file.
When the dataset is modified, iquaflow creates a modified copy of the dataset in its parent folder. As a convention, iquaflow appends to the name of the original dataset a “#” followed by the name of the modification. For example, modifying ds_coco_dataset with jpg85_modifier produces ds_coco_dataset#jpg85_modifier.
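
The naming convention can be sketched with pathlib. The helper modified_dataset_path is hypothetical and only mirrors the rule stated above (same parent folder, original name plus “#” plus modification name); it is not the function iquaflow uses internally.

```python
from pathlib import Path


def modified_dataset_path(original_path, modifier_name):
    """Return the sibling folder '<dataset>#<modifier_name>' of a dataset."""
    original = Path(original_path)
    return original.parent / f"{original.name}#{modifier_name}"


print(modified_dataset_path("test_datasets/ds_coco_dataset", "jpg85_modifier"))
```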

Training script

The training script must accept these arguments:

outputpath
trainds
valds (opt)
testds (opt)
mlfuri (opt)
mlfexpid (opt)
mlfrunid (opt)
other hyperparameters (opt)

The arguments marked with (opt) are optional. The arguments starting with mlf are passed when the mlflow_monitoring flag is activated in ExperimentSetup.
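
A training script could declare these arguments with argparse along the following lines. This is a minimal sketch: the "--" flag spelling follows the SageMaker example later in this guide for trainds, valds and outputpath, and is assumed by analogy for the other arguments. The hyperparameters epochs and lr and their defaults are placeholders.

```python
import argparse


def build_parser():
    """Parser for the arguments that a training script receives by convention."""
    parser = argparse.ArgumentParser(description="Training script skeleton")
    # Mandatory arguments
    parser.add_argument("--outputpath", type=str, required=True, help="temporary output folder")
    parser.add_argument("--trainds", type=str, required=True, help="training dataset path")
    # Optional datasets
    parser.add_argument("--valds", type=str, default=None, help="validation dataset path")
    parser.add_argument("--testds", type=str, default=None, help="test dataset path")
    # Only passed when mlflow_monitoring is activated
    parser.add_argument("--mlfuri", type=str, default=None, help="mlflow tracking URI")
    parser.add_argument("--mlfexpid", type=str, default=None, help="mlflow experiment id")
    parser.add_argument("--mlfrunid", type=str, default=None, help="mlflow run id")
    # Other hyperparameters (placeholder names and defaults)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-5)
    return parser


# Parse an explicit argument list, as if the script had been launched with it
args = build_parser().parse_args(["--outputpath", "./output", "--trainds", "./dataset"])
print(args.trainds, args.valds, args.epochs)
```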

Output Formats

The packaged model can write the following files in the temporary output folder so that they are parsed as experiment parameters and metrics:

results.json: Json whose keys are the names of the parameters or metrics and whose values are either a single number or an array with the sequence of values of that metric.

{
 "train_f1": 0.83,
 "val_f1": 0.78,
 "test_f1": 0.79,
 "train_focal_loss": [1.34, 1.29, 1.24, ..., 0.01],
 "val_focal_loss": [1.34, 1.29, 1.24, ..., 0.01]
}
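
From inside the training script, such a file could be produced with the standard json module. The helper write_results is hypothetical, and the temporary folder below only stands in for the outputpath argument that the script receives.

```python
import json
import os
import tempfile


def write_results(outputpath, metrics):
    """Write metrics (scalars or lists of numbers) to <outputpath>/results.json."""
    os.makedirs(outputpath, exist_ok=True)
    results_file = os.path.join(outputpath, "results.json")
    with open(results_file, "w") as f:
        json.dump(metrics, f, indent=1)
    return results_file


# Throwaway folder standing in for the outputpath argument
demo_dir = tempfile.mkdtemp()
results_file = write_results(
    demo_dir,
    {"train_f1": 0.83, "val_f1": 0.78, "train_focal_loss": [1.34, 1.29, 1.24]},
)
print(results_file)
```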

output.json: Output of the model, stored in a folder named output. Keeping it avoids re-running experiments in the future when a new metric has to be tested on former experiments. The format of this json file depends on the task of the DL model.
Bounding Box Detection: output.json is a COCO format json containing one element per detection made in the dataset. Each element looks as shown below.

{
    "image_id": 85,
    "iscrowd": 0,
    "bbox": [
        522.5372924804688,
        474.1499938964844,
        28.968505859375,
        27.19696044921875
    ],
    "area": 2427.050960971974,
    "category_id": 1,
    "id": 1,
    "score": 0.9709288477897644
}
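
A packaged model could check and store its detections as sketched below. The helper write_detections is hypothetical. It assumes the file lives at output/output.json inside the temporary output folder, and it only checks the four keys of the standard COCO results format (image_id, category_id, bbox as [x, y, width, height], score).

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"image_id", "category_id", "bbox", "score"}


def write_detections(outputpath, detections):
    """Validate detection dicts and write them to <outputpath>/output/output.json."""
    for det in detections:
        missing = REQUIRED_KEYS - det.keys()
        if missing:
            raise ValueError(f"detection is missing keys: {sorted(missing)}")
        if len(det["bbox"]) != 4:
            raise ValueError("bbox must be [x, y, width, height]")
    output_dir = os.path.join(outputpath, "output")
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "output.json")
    with open(output_file, "w") as f:
        json.dump(detections, f)
    return output_file


# Throwaway folder standing in for the outputpath argument
output_file = write_detections(
    tempfile.mkdtemp(),
    [{"image_id": 85, "category_id": 1, "bbox": [522.5, 474.1, 29.0, 27.2], "score": 0.97}],
)
```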

Image generation: The json may contain the relative paths to the generated images. Imagine the packaged model is a super-resolution model that generates five super-resolution images. The package may store these five images in a folder named generated_sr_image inside the temporary output folder. Hence the output.json should be as follows:

[
  "generated_sr_image/image_1.png",
  "generated_sr_image/image_2.png",
  "generated_sr_image/image_3.png",
  "generated_sr_image/image_4.png",
  "generated_sr_image/image_5.png"
]

1) Pre-Processing

Sanity check and statistics

SanityCheck and DsStats are the classes that perform the sanity check and the statistics of image datasets and ground truth. They are standalone classes; that is, they can work either by providing the folder path of the images and ground truth or with the DSWrapper class.

Sanity check

The SanityCheck module performs sanity checks on image datasets and ground truth. It can work either as a standalone class or with the DSWrapper class. It removes all corrupted samples following the logic set by the argument flags. The new sanitized dataset is located in the output_path attribute of the SanityCheck instance. A usage example:

from iquaflow.sanity import SanityCheck

sc = SanityCheck(data_path, output_folder)
sc.check_annotations()

Some relevant tasks performed are:

Find duplicates in the coco json images list.
Check that the image format is a valid image file format.
Check the integrity of each coco annotation.
Fix height and width in the coco json images list.
In geojson annotations, remove all rows containing a NaN value or an empty geometry in any of the required field columns.
In geojson annotations, try to fix geometries with buffer = 0 and remove the persistently invalid geometries.
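
As an illustration of the first task, duplicates in a COCO images list can be found with a counter. This sketch assumes that two entries are duplicates when they share the same file_name; SanityCheck may apply a different criterion.

```python
from collections import Counter


def find_duplicate_images(coco):
    """Return the file names that appear more than once in coco['images']."""
    counts = Counter(img["file_name"] for img in coco["images"])
    return sorted(name for name, n in counts.items() if n > 1)


coco = {
    "images": [
        {"id": 1, "file_name": "a.png"},
        {"id": 2, "file_name": "b.png"},
        {"id": 3, "file_name": "a.png"},
    ]
}
print(find_duplicate_images(coco))  # ['a.png']
```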

Note the difference between missing, empty and invalid geometries in a geojson:

Missing geometries: The attribute geometry is empty or unknown. Most libraries load it as None in Python. These values are typically propagated in operations (for example in calculations of the area or of the intersection), or ignored in reductions such as unary_union.
Empty geometries: The coordinates are empty despite having a geometry type defined. This can happen as a result of an intersection between two polygons that have no overlap.
Invalid geometries: Problematic features such as the edges of a polygon intersecting themselves. This can happen due to a mistake by the annotator. The tool attempts to fix invalid geometries with the buffer=0 functionality before removing them. In future releases an additional argument to simplify geometries will be offered.
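
The first two cases can be told apart on the raw GeoJSON feature, as in the hypothetical helper below. Detecting invalid geometries (for example self-intersections) needs a geometry library and is deliberately left out, so the status 'present' does not imply that the geometry is valid.

```python
def geometry_status(feature):
    """Classify a GeoJSON feature's geometry as 'missing', 'empty' or 'present'."""
    geometry = feature.get("geometry")
    if geometry is None:
        return "missing"
    # GeometryCollection stores its members under 'geometries' instead of 'coordinates'
    content = geometry.get("coordinates") or geometry.get("geometries")
    if not content:
        return "empty"
    return "present"


features = [
    {"type": "Feature", "geometry": None},
    {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": []}},
    {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.0, 2.0]}},
]
print([geometry_status(f) for f in features])  # ['missing', 'empty', 'present']
```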

Statistics and exploration

Several statistics can be calculated from the datasets and summarized in visualizations. The resulting parameters can be exported as json and the plots as images. The default location is a subfolder named stats within the dataset. The module DsStats computes statistics on image datasets and annotations. It can work either as a standalone class or with the DSWrapper class. A usage example:

from iquaflow.ds_stats import DsStats

dss = DsStats(data_path, output_folder)
stats = dss.perform_stats(show_plots = True)

Statistics performed are:

Average image height and width
Class tags histogram
Image and bounding box aspect ratio and area histograms
Best fitting bounding box and rotated bounding box
Height, width and angle of the bounding box and rotated bounding box
Compactness, centroid and area of the polygon
Min, mean and max of a dataframe field
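
To make a couple of these quantities concrete, the sketch below derives area and aspect ratio summaries from COCO bounding boxes. It is not the DsStats implementation. It assumes boxes in [x, y, width, height] order with non-zero height, and defines the aspect ratio as width divided by height.

```python
from statistics import mean


def summarize(values):
    """Return min, mean and max of a list of numbers."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def bbox_stats(coco):
    """Summarize area and aspect ratio of the boxes in coco['annotations']."""
    boxes = [ann["bbox"] for ann in coco["annotations"]]
    areas = [w * h for _, _, w, h in boxes]
    ratios = [w / h for _, _, w, h in boxes]
    return {"area": summarize(areas), "aspect_ratio": summarize(ratios)}


coco = {"annotations": [{"bbox": [0, 0, 2, 1]}, {"bbox": [5, 5, 4, 4]}]}
stats = bbox_stats(coco)
print(stats["area"], stats["aspect_ratio"])
```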

There are also two interactive exploratory tools, one to visualize the annotations and another for the images. These are:

notebook_annots_summary
notebook_imgs_preview

Usage example:

from iquaflow.ds_stats import DsStats

DsStats.notebook_annots_summary(
    df,
    export_html_filename=html_filename,
    fields_to_include=["image_filename", "class_id", "area"],
    show_inline=True,
)

from iquaflow.ds_stats import DsStats

DsStats.notebook_imgs_preview(
    data_path=data_path,
    sample=100,
    size=100,
)

They can be used inline in notebooks or exported as interactive html.

See a notebook with Statistics examples

Dataset

DSWrapper is the class that iquaflow uses for identifying datasets. Basically, the dataset is defined by a folder that contains a single sub-folder with the images and a json that describes the annotations. It is preferred that the ground truth json is in COCO format or geojson so it can be used with the rest of the tools.

Once the dataset follows this structure, it is as simple as providing the location path to the DSWrapper:

from iquaflow.datasets import DSWrapper
ds_wrapper = DSWrapper(data_path="[path_to_the_dataset]")

Internally iquaflow parses the structure, helping the experiment tools to understand how the dataset is organized.

Afterwards the user can access the main parsed dataset paths:

ds_wrapper.parent_folder # Path of the folder containing the dataset
ds_wrapper.data_path # Root path of the dataset
ds_wrapper.data_input # Path of the folder that contains the images
ds_wrapper.json_annotations # Path to the json annotations. COCO annotations are preferred
ds_wrapper.geojson_annotations # Path to the geojson annotations
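
The kind of parsing involved can be pictured with pathlib. The function describe_dataset below is a hypothetical stand-in, not the DSWrapper implementation. It assumes exactly one sub-folder (the images) and takes the first *.json and *.geojson files found in the dataset root as annotations.

```python
import tempfile
from pathlib import Path


def describe_dataset(data_path):
    """Return the main paths of a dataset folder laid out by the convention."""
    root = Path(data_path)
    subfolders = [p for p in root.iterdir() if p.is_dir()]
    if len(subfolders) != 1:
        raise ValueError(f"expected exactly one image sub-folder, found {len(subfolders)}")
    json_files = sorted(root.glob("*.json"))
    geojson_files = sorted(root.glob("*.geojson"))
    return {
        "parent_folder": root.parent,
        "data_path": root,
        "data_input": subfolders[0],
        "json_annotations": json_files[0] if json_files else None,
        "geojson_annotations": geojson_files[0] if geojson_files else None,
    }


# Build a tiny throwaway dataset to show the result
demo_root = Path(tempfile.mkdtemp()) / "ds_coco_dataset"
(demo_root / "images").mkdir(parents=True)
(demo_root / "annotations.json").write_text("{}")
info = describe_dataset(demo_root)
print(info["data_input"].name, info["json_annotations"].name)
```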

Furthermore, DSWrapper contains an editable dictionary that describes the dataset. Initially this dictionary contains the key ds_name, which is the name of the dataset. The user can populate this dictionary with any key/value parameter. Afterwards, this dictionary is populated and changed automatically by the DSModifier classes and it is used for experiment logging.

ds_wrapper.params # Contains meta-information of the dataset. Initially {"ds_name":"[name_of_the_dataset]"}

Modifiers

Modifiers take a dataset D and process it to obtain a dataset D’ with some image/data processing applied (degradation, compression, enhancement…).

Using an existing modifier to run an experiment:

Just import the desired modifier and run it

from iquaflow.datasets import DSModifier_jpg

img_path = "test_datasets/ds_coco_dataset/images"
jpg85 = DSModifier_jpg(params={"quality": 85})
jpg85.modify(data_input=img_path)

After running, a test_datasets/ds_coco_dataset#jpg85_modifier/images/ folder should be created with the modified images.

Adding a new modifier tool:

modifier_jpg.py is a good guide on how to implement a new modifier: inherit from DSModifier_dir and write the internal _mod_img() member function.

2) Experiment

TaskExecution

iquaflow automates experiments while giving the user a flexible way of logging experiment information without knowing any specific logging tool: the user only needs to create a json file with the parameters to be tracked by iquaflow. Alternatively, the user can track any kind of file generated by the experiment by saving it in a temporary path (provided to the packaged model by iquaflow), or store the raw results in a json for future computations. TaskExecution is the generic class that provides the mandatory and optional arguments to the packaged model when it is launched, and it is also responsible for translating all the experiment information to the mlflow tracking server. Hence, the user does not need to understand MLFlow; iquaflow uses MLFlow internally to organize the experiments.

PythonScriptTaskExecution

This particular class extends from TaskExecution and knows how to execute a model that is encapsulated in a python script. In order to use it just instantiate the class with the path to the python script.

from iquaflow.experiments.task_execution import PythonScriptTaskExecution

task = PythonScriptTaskExecution(model_script_path="./path_to_script.py")

The user can also execute the task directly, but this is not recommended, since iquaflow performs the executions internally once the whole experiment is defined. In order to execute the run, the user must provide the experiment name, the name of the run and the training dataset path or training DSWrapper. Optionally, the user can provide a validation dataset path or DSWrapper and a Python dictionary with model hyper-parameters (that will be used when executing the package).

task.train_val(
    experiment_name="name of the experiment",
    run_name="test_run",
    train_ds=ds_wrapper_train,
    val_ds=ds_wrapper_validation,
    mlargs={"lr": 1e-6},
)

SageMakerTaskExecution

The application can run in SageMaker by passing a SageMakerEstimatorFactory as an argument of the TaskExecution, in which case it becomes a SageMakerTaskExecution. The following example shows how to define it.

from sagemaker.pytorch import PyTorch
from iquaflow.experiments.task_execution import SageMakerEstimatorFactory, SageMakerTaskExecution

sage_estimator_factory = SageMakerEstimatorFactory(
   PyTorch,
   {
       "entry_point": "train.py",
       "source_dir": "yolov5",
       "role":role,
       "framework_version": "1.8.1",
       "py_version": "py3",
       "instance_count": 1,
       "instance_type": "ml.g4dn.xlarge"
   }
)

task = SageMakerTaskExecution( sage_estimator_factory )

Then, in your training script, you might want to connect the script arguments that are defined by convention in iquaflow (see Conventions) to SageMaker environment variables to take full advantage of the SageMaker tools. As an example:

import argparse
import os

parser = argparse.ArgumentParser()

# Define some defaults
trainds_default     = (os.environ["SM_CHANNEL_TRAINDS"] if "SM_CHANNEL_TRAINDS" in os.environ else "")
valds_default      = (os.environ["SM_CHANNEL_VALDS"] if "SM_CHANNEL_VALDS" in os.environ else "")
outputpath_default = (os.environ["SM_OUTPUT_DATA_DIR"] if "SM_OUTPUT_DATA_DIR" in os.environ else "./output")

# IQF arguments
parser.add_argument("--trainds", default=trainds_default, type=str, help="training dataset path")
parser.add_argument("--valds", default=valds_default, type=str, help="validation dataset path")
parser.add_argument("--outputpath", default=outputpath_default, type=str, help="path output")

Also, for these approaches you might want iquaflow to upload the modified datasets (by iquaflow-modifiers) to a bucket on the fly. To do so, indicate the bucket_name in the cloud_options within ExperimentSetup.

ExperimentSetup

iquaflow allows formulating experiments that take the modified training dataset as reference. To this end, the package provides tools to automate this kind of experiment, which is composed of:

A reference dataset.
A list of dataset modifiers.
An encapsulated machine learning model.

The first two components are covered by DSWrapper and DSModifier respectively. The last one requires a TaskExecution.

Having defined all the components, the user can perform an iquaflow experiment by using ExperimentSetup. The user must define the name of the experiment, the reference dataset, the list of dataset modifiers and the packaged model, as follows:

experiment = ExperimentSetup(
   experiment_name="experimentA",
   task_instance=PythonScriptTaskExecution(model_script_path="./path_to_script.py"),
   ref_dsw_train=DSWrapper(data_path="path_to_dataset"),
   ds_modifiers_list=[ DSModifier_jpg(params={'quality': i}) for i in [10,30,50,70,90] ]
)

Then execute the experiment with:

experiment.execute()

Additional options

repetitions

Each combination of parameters and modifiers results in a run. Scripts might contain randomness (e.g. random partitions). For those cases you might want to average several executions to obtain a relevant statistic or to study the variability. To do so, set the number of repetitions to a value greater than 1.
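
Once the repeated runs are collected, they can be averaged per combination. The sketch below uses only the standard library and a hypothetical helper aggregate_repetitions; the rmse values are made up for illustration and are not real results.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_repetitions(runs, group_keys, metric):
    """Group runs by `group_keys` and return mean, sample std and count of `metric`."""
    groups = defaultdict(list)
    for run in runs:
        groups[tuple(run[k] for k in group_keys)].append(run[metric])
    return {
        key: {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
            "n": len(values),
        }
        for key, values in groups.items()
    }


# Illustrative values only
runs = [
    {"modifier": "jpg10_modifier", "rmse": 1.0},
    {"modifier": "jpg10_modifier", "rmse": 3.0},
    {"modifier": "jpg90_modifier", "rmse": 0.5},
]
summary = aggregate_repetitions(runs, group_keys=["modifier"], metric="rmse")
print(summary[("jpg10_modifier",)])
```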

mlflow_monitoring

This allows real-time monitoring of the training scripts. When turned on, iquaflow passes these additional arguments to the training script:

- mlfuri
- mlfexpid
- mlfrunid

Thus, the user is responsible for adding these arguments to the training script when required. The user can then activate the current experiment and run in the script with a snippet such as:

import mlflow

mlflow.set_tracking_uri(args.mlfuri)

mlflow.start_run(
    run_id=args.mlfrunid,
    experiment_id=args.mlfexpid
)

cloud_options

cloud_options is a dictionary of options useful for indicating endpoints such as:

- bucket_name – str. If set, modified data (by iquaflow-modifiers) will be uploaded to the bucket.
- tracking_uri – str. trackingURI for mlflow. default is local to the ./mlflow folder
- registry_uri – str. registryURI for mlflow. default is local to the ./mlflow folder

Indicating the bucket is useful for SageMakerTaskExecution instances.

3) Results

Experiment Info

This object allows the user to manage the experiment information. It simplifies the access to MLFlow and allows applying new metrics to previously executed experiments. Basic usage example:

from iquaflow.experiments import ExperimentInfo

experiment_info = ExperimentInfo(experiment_name)
runs = experiment_info.get_mlflow_run_info() # runs is a python dict

These are the main methods:

get_mlflow_run_info > Gathers the experiment information in a Python dictionary.
apply_metric_per_run > Applies a new metric to previously executed experiments.
get_df > Retrieves a selection of data in a suitable format so that it can be used as an input in the Visualization module.

The Metrics and Visualization sections (just below) contain examples of how to use the last two methods.

Metrics

The module metrics contains functionalities to estimate metrics in your experiments. BBDetectionMetrics is an available metric that can be applied between the bounding boxes of the ground truth and of the predicted elements. They must be in COCO format (see COCO detection and COCO data). When this metric is applied, the metrics from COCOeval (see COCO detection) are estimated.

SNRMetric

Signal-to-noise ratio is defined as the ratio of the power of a signal to the power of background noise. This metric is designed for L0 - L1 images. There are currently two approaches to estimate it:

- Homogeneous blocks (HB) - default option, faster and less problematic.
- Homogeneous areas (HA) - usually more accurate.

from iquaflow.metrics import (
       SNRMetric,
       snr_function_from_array,
       snr_function_from_fn
)
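
The definition itself (power of the signal over power of the noise) can be written in a few lines. This is the textbook ratio on already separated signal and noise samples, shown only to fix ideas; it is not the HB or HA estimator used by SNRMetric, and the function names snr and snr_db are hypothetical.

```python
import math


def snr(signal, noise):
    """Ratio of mean signal power to mean noise power (linear scale)."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    return signal_power / noise_power


def snr_db(signal, noise):
    """Same ratio expressed in decibels."""
    return 10.0 * math.log10(snr(signal, noise))


print(snr([2, 2, 2, 2], [1, -1, 1, -1]))  # 4.0
```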

SharpnessMetric

SharpnessMetric covers RER, FWHM and MTF. In general the MTF can be ignored because it is the most complex and the least reliable: with just a bit of noise in the data the metric changes a lot, and in this case the images are noisy.

from iquaflow.metrics import SharpnessMetric

RER - It measures the slope of the edge response (transition). The lower the metric, the blurrier the image. Taking the derivative of the normalized edge response produces the Line Spread Function (LSF). The LSF is a 1-D representation of the system PSF. The width of the LSF at half its height (the 50% point) is called the full width at half maximum (FWHM).

The FWHM (Full Width at Half Maximum) measures the level of blur. It has three measurements depending on the direction:

- FWHM_X - Horizontal direction of the image.
- FWHM_Y - Vertical direction of the image.
- FWHM_other - The rest, grouped together. This one is less reliable because it depends on the content of the image, such as how many angles there are.

The Fourier Transform of the LSF produces the Modulation Transfer Function (MTF). MTF is determined across all spatial frequencies, but can be evaluated at a single spatial frequency, such as the Nyquist frequency. The value of the MTF at Nyquist provides a measure of resolvable contrast at the highest ‘alias-free’ spatial frequency.
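
The chain from edge response to LSF to FWHM can be followed on a toy 1-D profile. The sketch below is not the SharpnessMetric implementation. It assumes a normalized edge response sampled at unit pixel spacing, approximates the derivative by finite differences, and measures the width at half maximum by linear interpolation between samples.

```python
def line_spread_function(edge_response):
    """Finite-difference derivative of an edge response (unit pixel spacing)."""
    return [b - a for a, b in zip(edge_response, edge_response[1:])]


def fwhm(lsf):
    """Width of the LSF at half its peak, interpolating linearly between samples."""
    peak = max(lsf)
    half = peak / 2.0
    p = lsf.index(peak)
    last = len(lsf) - 1
    # Walk left from the peak while samples stay at or above half maximum
    i = p
    while i > 0 and lsf[i - 1] >= half:
        i -= 1
    left = float(i) if i == 0 else (i - 1) + (half - lsf[i - 1]) / (lsf[i] - lsf[i - 1])
    # Walk right from the peak in the same way
    j = p
    while j < last and lsf[j + 1] >= half:
        j += 1
    right = float(j) if j == last else j + (lsf[j] - half) / (lsf[j] - lsf[j + 1])
    return right - left


edge_response = [0.0, 0.0, 0.25, 0.75, 1.0, 1.0]
lsf = line_spread_function(edge_response)
print(lsf)        # [0.0, 0.25, 0.5, 0.25, 0.0]
print(fwhm(lsf))  # 2.0
```

A sharper edge concentrates the LSF in fewer samples and therefore gives a smaller FWHM.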

BBDetectionMetrics

from iquaflow.metrics import BBDetectionMetrics

This estimates object detection metrics (Recall, mAP, etc.) over a dataset that has its predictions in COCO inference format (see Conventions).

Custom metrics

Custom metrics can be created by inheriting from the class Metric:

from typing import Any

from iquaflow.metrics import Metric

class CustomMetric(Metric):
    def __init__(self) -> None:
        # Names of the metrics returned by apply() (placeholder name)
        self.metric_names = ["my_metric"]

    def apply(self, predictions: str, gt_path: str) -> Any:
        # Your custom code here. It should compute one value per metric name.
        stats = [0.0]  # placeholder value
        # Then return a dictionary of names and values for each metric
        return {k: v for k, v in zip(self.metric_names, stats)}

To apply a metric to an executed experiment:

from iquaflow.experiments import ExperimentInfo

experiment_info = ExperimentInfo(experiment_name)
my_custom_metric = CustomMetric()
experiment_info.apply_metric_per_run( my_custom_metric, json_annotations_name )

Visualization

Apart from the visualization tools explained in the Sanity check and Statistics section, there are also tools for plotting the results. On one hand there is the mlflow service, which is launched with mlflow ui --host 0.0.0.0 and then accessed in the browser at http://ip_address_of_your_mlflow_server:5000. The Tracking UI lets you visualize, search and compare runs, as well as download run artifacts or metadata for analysis in other tools. If you log runs to a local mlruns directory, run mlflow ui in the directory above it, and it loads the corresponding runs. The UI contains the following key features:

- Experiment-based run listing and comparison
- Searching for runs by parameter or metric value
- Visualizing run metrics
- Downloading run results

On the other hand there is the ExperimentVisual class. It offers plotting utilities both inline and saved to files. It takes as input a dataframe extracted from an ExperimentInfo. See some examples in Visual Notebooks. See also the code examples in the Quickstart section.

Development

Package Overview

The python package structure of this toolbox is based on cookiecutter. This library provides a standard workflow for developing production level packages. The tools used are:

1. setuptools for packaging
2. versioneer for versioning
3. GitLab CI for continuous integration
4. tox for managing test environments
5. pytest for tests
6. sphinx for documentation
7. black, flake8 and isort for style checks
8. mypy for type checks

More information can be found in:

1. https://packaging.python.org/tutorials/packaging-projects/
2. https://python-packaging.readthedocs.io/en/latest/minimal.html
3. https://www.learnpython.org/en/Modules_and_Packages

Environment installation

This repository does not require any specific Python environment. In our case we use Python 3.7, so we recommend creating a new environment with Python 3.7 and pip. The file setup.py allows installing iquaflow as a Python package via pip. Once you have created your new environment, you only need to clone the repository locally:

git clone https://github.com/satellogic/iquaflow

and then run the following command to install iquaflow as a soft link in the environment:

python -m pip install -e .

Dependencies are defined in setup.cfg under the install_requires tag. To add a new dependency, first install the package in your local environment and then add it to setup.cfg with its corresponding version.

Documentation

We use Sphinx to automatically update our documentation. This keeps the package documentation up to date as new code is added (as long as the code is commented). The documentation and Sphinx configuration can be found inside /doc.

Under the /doc folder type in console

make html

Sphinx will generate under /doc/build/html the desired html documentation. You can also use tox:

tox -e docs

More information about Sphinx can be found here.

Continuous integration

In our project we use tox. This tool manages multiple environments in order to automatically validate code. More information about tox can be found here.

For quality check you only need to run:

tox -e check

For automatic code reformat:

tox -e reformat

To execute all tests for the first time, use:

tox -r -e py36

Alternatively, if it is not the first time, it is not necessary to recreate the tox environment:

tox -e py36

Note: CI terminology for python can be found here.

Test

Unit tests are performed using pytest. All tests are included in the test folder located in the repository main folder. Once you have created a new test module, e.g. test_new_module, that includes Python assertions, simply type pytest in the console, or:

pytest <module name>

to run the tests.

We strongly recommend using “test_” as the prefix of every test you create.
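
A test module following this convention could look like the sketch below. The file name test_new_module.py and the toy function scale_quality are hypothetical and not part of iquaflow; in practice the function under test would be imported from the package.

```python
# Contents of a hypothetical test file, e.g. test_new_module.py


def scale_quality(quality, factor):
    """Toy function under test: scale a JPEG quality value and clamp it to 1..100."""
    return max(1, min(100, round(quality * factor)))


def test_scale_quality_clamps_to_valid_range():
    assert scale_quality(90, 2) == 100
    assert scale_quality(10, 0.01) == 1


def test_scale_quality_scales_value():
    assert scale_quality(50, 1.5) == 75
```

pytest collects every function whose name starts with test_ and reports each failed assertion separately.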

You can also run the tests manually using tox (recommended); use the -r parameter to create the tox environment for the first time:

tox -e py36

More information can be found in https://docs.python-guide.org/writing/tests/

Initial development process

Below we describe the usual steps when developing from scratch:

Setup python environment:

conda create -n iqt-env python=3.6

Clone repository:

git clone https://github.com/satellogic/iquaflow

Create branch:

git checkout -b <new_branch_name>

Install soft link via:

python -m pip install -e .

Create a test that defines the module functionality.
Solve the test by adding package functionality.
If a new branch is pulled, use tox -r to recreate the tox environments.
Reformat code:

python -m pip install tox
tox -e reformat

Check code and solve:

tox -e check

Run tests:

tox -e py36

Push to remote branch.
Create MR and assign reviewer.
Refreshing local repository for running tests (after pip install -e .):

tox -r -e py36