User Guide
Quickstart
This is a summary of a complete typical workflow.
Define the original dataset with DSWrapper.
from iquaflow.datasets import DSWrapper
ds_wrapper = DSWrapper(data_path=data_path)
Define the modifications intended for each experiment. In this case JPG Modifiers with quality from 10 to 90.
from iquaflow.datasets import DSModifier_jpg
ds_modifiers_list = [DSModifier_jpg(params={'quality': i}) for i in [10,30,50,70,90] ]
Define the model execution method. In this case the training method is a python script, so we define a PythonScriptTaskExecution. Additionally, hyperparameter variations can be added. In this case the number of epochs and the learning rate are varied. The tool will then loop through all possible combinations of the variations. The user can also set the number of repetitions.
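from iquaflow.experiments import ExperimentSetup
from iquaflow.experiments.task_execution import PythonScriptTaskExecution
# Path to the user training script (placeholder)
task = PythonScriptTaskExecution(model_script_path="./path_to_script.py")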
experiment = ExperimentSetup(
experiment_name="MyFirstExperiment",
task_instance=task,
ref_dsw_train=ds_wrapper,
ds_modifiers_list=ds_modifiers_list,
repetitions=5,
extra_train_params={
'epochs':[10,15,20],
'lr':[1e-5,1e-6,1e-7]
}
)
experiment.execute()
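To make the size of this experiment concrete, the sketch below enumerates the combinations with the standard library only. It is not iquaflow code: it assumes that only the listed modifiers are executed and that every repetition of every modifier/hyperparameter combination counts as one run.

```python
from itertools import product

# Values taken from the quickstart example above
qualities = [10, 30, 50, 70, 90]  # one JPG modifier per quality
extra_train_params = {
    "epochs": [10, 15, 20],
    "lr": [1e-5, 1e-6, 1e-7],
}
repetitions = 5

# Cartesian product of modifiers, hyperparameter values and repetitions
names = list(extra_train_params)
runs = [
    {"quality": q, "repetition": rep, **dict(zip(names, values))}
    for q, values, rep in product(
        qualities,
        product(*extra_train_params.values()),
        range(repetitions),
    )
]

# 5 modifiers x 3 epochs x 3 learning rates x 5 repetitions
print(len(runs))
```

Under these assumptions the quickstart launches 225 training executions, which is worth keeping in mind before adding more variations.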
The information from the executed experiment can be collected in a json. A dataframe suitable for visualization tools (see next step) can also be extracted from it.
from iquaflow.experiments import ExperimentInfo
experiment_info = ExperimentInfo(experiment_name)
runs = experiment_info.runs
df = experiment_info.get_df(
ds_params=["modifier"],
metrics=['rmse','epochs','lr'],
dropna = True,
fields_to_float_lst = ['rmse','lr'],
fields_to_int_lst = ['epochs']
)
Visualizations can be made from the tool. In this case plots of root mean square error against learning rate variations are shown, with one curve for each number of epochs (legend) in a shared chart.
import os
from iquaflow.experiments import ExperimentVisual
ev = ExperimentVisual(
df,
os.path.join(data_path, "mod-rmse-lr-epoch.png")
)
ev.visualize(
xvar="lr",
yvar="rmse",
legend_var="epochs",
title="rmse - lr"
)
Conventions
In iquaflow, conventions are preferred over configurations.
Dataset Formats
iquaflow understands a dataset as a folder containing a sub-folder with images and ground truth in json format. Datasets that do not follow this format should be changed in order to perform experiments.
In case of detection or segmentation tasks, the preferred formats are:
Json in COCO format.
GeoJson with the minimum required fields (“image_filename”, “class_id”, “geometry”).
A folder named maskes with images corresponding to the segmentation annotations.
iquaflow primarily works with COCO json ground truth, which is adopted by most of the datasets and models of the field. If the dataset is in another format, the user can transform it to COCO (see https://blog.roboflow.ai/how-to-convert-annotations-from-voc-xml-to-coco-json/). Otherwise, iquaflow cannot perform sanity or statistics checks.
For other kinds of tasks, such as image generation, it is only necessary to have the ground truth in json format. Alternatively, iquaflow can recognize a dataset without any ground truth file.
When the dataset is modified, iquaflow creates a modified copy of the dataset in its parent folder. As a convention, iquaflow names the copy after the original dataset, followed by a “#” and the name of the modification (for example, ds_coco_dataset#jpg85_modifier).
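The naming convention can be sketched with the standard library. The helper below is hypothetical (it is not part of iquaflow) and only illustrates how the sibling folder name is composed. The modifier name jpg85_modifier is the one used in the Modifiers section of this guide.

```python
from pathlib import Path


def modified_dataset_path(data_path: str, modifier_name: str) -> Path:
    """Return the sibling folder '<dataset>#<modifier_name>' (hypothetical helper)."""
    original = Path(data_path)
    # The modified copy lives next to the original, in the same parent folder
    return original.parent / f"{original.name}#{modifier_name}"


print(modified_dataset_path("test_datasets/ds_coco_dataset", "jpg85_modifier").as_posix())
```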
Training script
The training script requires these arguments:
outputpath
trainds
valds (opt)
testds (opt)
mlfuri (opt)
mlfexpid (opt)
mlfrunid (opt)
other hyperparameters (opt)
The arguments marked with (opt) are optional. Arguments starting with mlf are used when the flag mlflow_monitoring is activated in the ExperimentSetup.
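A minimal sketch of a training script entry point that accepts these arguments could look as follows. It assumes the arguments arrive as `--name value` flags, as in the SageMaker example later in this guide. The hyperparameter names `epochs` and `lr` and all default values are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Training script skeleton")
    # Required by convention
    parser.add_argument("--outputpath", type=str, required=True, help="temporary output folder")
    parser.add_argument("--trainds", type=str, required=True, help="training dataset path")
    # Optional datasets
    parser.add_argument("--valds", type=str, default=None, help="validation dataset path")
    parser.add_argument("--testds", type=str, default=None, help="test dataset path")
    # Only passed when mlflow_monitoring is activated
    parser.add_argument("--mlfuri", type=str, default=None, help="mlflow tracking URI")
    parser.add_argument("--mlfexpid", type=str, default=None, help="mlflow experiment id")
    parser.add_argument("--mlfrunid", type=str, default=None, help="mlflow run id")
    # Example hyperparameters (hypothetical names and defaults)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-5)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv
args = build_parser().parse_args(
    ["--outputpath", "./output", "--trainds", "./train", "--lr", "1e-6"]
)
print(args.trainds, args.valds, args.lr)
```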
Output Formats
The packaged model can write the following files to the temporary output folder so that they are parsed as experiment parameters and metrics:
results.json: Json whose keys are the names of the parameters or metrics. Each value is either a single number or an array holding the sequence of values of that parameter.
{
    "train_f1": 0.83,
    "val_f1": 0.78,
    "test_f1": 0.79,
    "train_focal_loss": [1.34, 1.29, 1.24, ..., 0.01],
    "val_focal_loss": [1.34, 1.29, 1.24, ..., 0.01]
}
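As an illustration, a training script could produce this file with the standard json module. This is a sketch under the assumption that results.json sits directly inside the folder received as outputpath. The metric values are made up, and the temporary directory only stands in for that folder.

```python
import json
import tempfile
from pathlib import Path


def write_results(outputpath: str, results: dict) -> Path:
    """Write results.json inside the output folder and return its path."""
    out_dir = Path(outputpath)
    out_dir.mkdir(parents=True, exist_ok=True)
    results_file = out_dir / "results.json"
    with results_file.open("w") as f:
        json.dump(results, f, indent=2)
    return results_file


# Illustrative values only: scalars for final metrics, lists for sequences
results = {
    "train_f1": 0.83,
    "val_f1": 0.78,
    "train_focal_loss": [1.34, 1.29, 1.24],
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_results(tmp, results)
    loaded = json.loads(path.read_text())

print(loaded["val_f1"])
```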
output.json: Output of the model, stored in a folder named output. Keeping it makes it possible to compute a new metric for former experiments in the future without reproducing them. The format of this json file depends on the task of the DL model.
Bounding Box Detection: output.json is a COCO format json containing as many elements as detections have been made in the dataset. Each of these elements looks as shown below.
{
    "image_id": 85,
    "iscrowd": 0,
    "bbox": [
        522.5372924804688,
        474.1499938964844,
        28.968505859375,
        27.19696044921875
    ],
    "area": 2427.050960971974,
    "category_id": 1,
    "id": 1,
    "score": 0.9709288477897644
}
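Many detectors return boxes as corner coordinates, whereas COCO stores them as [x, y, width, height]. The sketch below converts hypothetical predictions into detection entries. It fills only image_id, category_id, bbox and score. Whether iquaflow also needs the remaining fields of the example above (iscrowd, area, id) is not stated here, so treat this as a starting point.

```python
import json
from typing import Dict, List, Sequence


def to_coco_detection(image_id: int, category_id: int,
                      box_xyxy: Sequence[float], score: float) -> Dict:
    """Build one detection entry; COCO boxes are [x, y, width, height]."""
    x_min, y_min, x_max, y_max = box_xyxy
    return {
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x_min, y_min, x_max - x_min, y_max - y_min],
        "score": score,
    }


# Hypothetical predictions: (image_id, category_id, corner box, score)
predictions = [
    (85, 1, (10.0, 20.0, 40.0, 60.0), 0.97),
    (85, 2, (5.0, 5.0, 15.0, 25.0), 0.55),
]
detections: List[Dict] = [to_coco_detection(*p) for p in predictions]
print(json.dumps(detections[0]))
```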
Image generation: The json may contain the relative paths to the generated images. Imagine the packaged model is a Super Resolution model that generates five super resolution images. The package may store these five images in a folder named generated_sr_image inside the temporary output folder. Hence the output.json should be as follows:
{
[
"generated_sr_image/image_1.png",
"generated_sr_image/image_2.png",
"generated_sr_image/image_3.png",
"generated_sr_image/image_4.png",
"generated_sr_image/image_5.png",
]
}
1) Pre-Processing
Sanity check and statistics
SanityCheck and DsStats are the classes that perform the sanity check and the statistics of image datasets and ground truth. They are standalone classes, that is to say, they can work by providing the folder path of the images and ground truth, or they can work with the DSWrapper class.
Sanity check
The SanityCheck module performs sanity checks on image datasets and ground truth. It can either work as a standalone class or with the DSWrapper class. It removes all corrupted samples following the logic in the argument flags. The new sanitized dataset is located in the output_path attribute of the SanityCheck instance. A usage example:
from iquaflow.sanity import SanityCheck
sc = SanityCheck(data_path, output_folder)
sc.check_annotations()
Some relevant tasks performed are:
Finding duplicates in coco json images list
Check if the image format is a valid image file format.
Check integrity of one coco annotation.
Fix height and width in coco json images list
In geojson annotations, remove all rows containing a NaN value or an empty geometry in any of the required field columns.
In geojson annotations, try to fix geometries with buffer = 0 and remove the persistent invalid geometries.
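To give an idea of what the first of these checks involves, here is a standard-library sketch that finds repeated file names in the images list of a COCO json. It is not the SanityCheck implementation, only an illustration of the idea on a made-up annotation dictionary.

```python
from collections import Counter


def find_duplicate_images(coco: dict) -> list:
    """Return the file names that appear more than once in coco['images']."""
    counts = Counter(img["file_name"] for img in coco.get("images", []))
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up annotations with one repeated image entry
coco = {
    "images": [
        {"id": 1, "file_name": "a.png"},
        {"id": 2, "file_name": "b.png"},
        {"id": 3, "file_name": "a.png"},
    ]
}
print(find_duplicate_images(coco))
```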
Note the difference between missing, empty and invalid geometries in a geojson:
Missing geometries: This is when the attribute geometry is empty or unknown. Most libraries load it as None in python. These values are typically propagated in operations (for example in calculations of the area or of the intersection), or ignored in reductions such as unary_union.
Empty geometries: This happens when the coordinates are empty despite having a geometry type defined. This can happen as a result of an intersection between two polygons that have no overlap.
Invalid geometry: Problematic features such as edges of a polygon intersecting themselves. This could have happened due to a mistake from the annotator. For invalid geometries, the tool will also attempt to fix them with the buffer=0 functionality prior to removing them. In future releases an additional argument to simplify geometries will be offered.
Statistics and exploration
Several statistics can be calculated from the datasets and summarized in visualizations. The resulting calculated parameters can be exported as json and the plots as images. The default location is a subfolder named stats within the dataset. The module DsStats performs statistics on image datasets and annotations. It can either work as a standalone class or with the DSWrapper class. A usage example:
from iquaflow.ds_stats import DsStats
dss = DsStats(data_path, output_folder)
stats = dss.perform_stats(show_plots = True)
Statistics performed are:
Average height and width of the images
Class tags histogram
Image and bounding box aspect ratio and area histograms
Calculates the best fitting bounding box and rotated bounding box
Height, width and angle of the bounding box and rotated bounding box
Compactness, centroid and area of the polygon
min, mean and max from a dataframe field
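A few of these statistics are simple enough to sketch with the standard library. The example below computes the average image size and the min, mean and max of the aspect ratio from the images list of a COCO json. It is an independent illustration with made-up sizes, not the DsStats implementation.

```python
from statistics import mean


def image_size_stats(coco: dict) -> dict:
    """Average width/height and aspect-ratio summary of coco['images']."""
    images = coco["images"]
    widths = [img["width"] for img in images]
    heights = [img["height"] for img in images]
    ratios = [w / h for w, h in zip(widths, heights)]
    return {
        "mean_width": mean(widths),
        "mean_height": mean(heights),
        "aspect_ratio": {"min": min(ratios), "mean": mean(ratios), "max": max(ratios)},
    }


# Made-up image sizes
coco = {
    "images": [
        {"id": 1, "width": 600, "height": 400},
        {"id": 2, "width": 800, "height": 600},
        {"id": 3, "width": 1000, "height": 500},
    ]
}
stats = image_size_stats(coco)
print(stats["mean_width"], stats["mean_height"])
```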
There are also two interactive exploratory tools: one to visualize the annotations and another for the images. These are:
notebook_annots_summary
notebook_imgs_preview
Usage example:
from iquaflow.ds_stats import DsStats
DsStats.notebook_annots_summary(
df,
export_html_filename=html_filename,
fields_to_include=["image_filename", "class_id", "area"],
show_inline=True,
)
from iquaflow.ds_stats import DsStats
DsStats.notebook_imgs_preview(
data_path=data_path,
sample=100,
size=100,
)
They can be used inline in notebooks or exported as interactive html.
Dataset
DSWrapper is the class that iquaflow uses for identifying datasets. Basically, the dataset is defined by a folder that contains a single sub-folder with the images and a json that describes the annotations. It is preferred that the ground truth json is in COCO format or geojson so it can be used with the rest of the tools.
Once the dataset follows this structure, it is as simple as providing its location path to DSWrapper:
from iquaflow.datasets import DSWrapper
ds_wrapper = DSWrapper(data_path="[path_to_the_dataset]")
Internally, iquaflow parses the structure, helping the experiment tools to understand how the dataset is organized.
Afterwards, the user can access the main parsed dataset paths:
ds_wrapper.parent_folder  # Path of the folder containing the dataset
ds_wrapper.data_path  # Root path of the dataset
ds_wrapper.data_input  # Path of the folder that contains the images
ds_wrapper.json_annotations  # Path to the json annotations. Preferred COCO annotations
ds_wrapper.geojson_annotations  # Path to the geojson annotations
Furthermore, DSWrapper contains an editable dictionary that describes the dataset. Initially this dictionary contains the key ds_name, which is the name of the dataset. The user can populate this dictionary with any key/value parameter. Afterwards, this dictionary will be populated and changed automatically by DSModifier classes and it will be used for logging experiments.
ds_wrapper.params  # Contains meta-information of the dataset. Initially {"ds_name": "[name_of_the_dataset]"}
Modifiers
Modifiers take a dataset D and process it to obtain a dataset D’ with some image/data processing (degradation, compression, enhancement…).
Using an existing modifier to run an experiment:
Just import the desired modifier and run it
from iquaflow.datasets import DSModifier_jpg
img_path = "test_datasets/ds_coco_dataset/images"
jpg85 = DSModifier_jpg(params={"quality": 85})
jpg85.modify(data_input=img_path)
After running, a test_datasets/ds_coco_dataset#jpg85_modifier/images/ folder should be created with the modified images.
Adding a new modifier tool:
In modifier_jpg.py you have a good guide on how to implement a new modifier, inheriting from DSModifier_dir and writing the internal _mod_img() member function.
2) Experiment
TaskExecution
iquaflow automates experiments while giving the user a flexible way of logging experiment information without knowing any specific logging tool. The user only needs to create a json file with the parameters to be tracked by iquaflow. Alternatively, the user can track any kind of file generated by the experiment by saving it in a temporary path (provided to the packaged model by iquaflow), or store the raw results in a json for future computations. TaskExecution is the generic class that provides the mandatory and optional arguments to the packaged model when it is launched. It is also responsible for translating all the experiment information to the mlflow tracking server. Hence, the user does not need to understand MLFlow: iquaflow uses it internally to organize the experiments.
PythonScriptTaskExecution
This particular class extends from TaskExecution and knows how to execute a model that is encapsulated in a python script. In order to use it just instantiate the class with the path to the python script.
from iquaflow.experiments.task_execution import PythonScriptTaskExecution
task = PythonScriptTaskExecution(model_script_path="./path_to_script.py")
The user can also execute the task directly, but this is not recommended since iquaflow performs the executions internally once the whole experiment is defined. In order to execute the run, the user must provide the experiment name, the name of the run and the training dataset path or training DSWrapper. Optionally, the user can provide a validation dataset path or DSWrapper and a python dictionary with model hyper-parameters (that will be used when executing the package).
task.train_val(
experiment_name="name of the experiment",
run_name="test_run",
train_ds=ds_wrapper_train,
val_ds=ds_wrapper_validation,
mlargs={"lr": 1e-6},
)
SageMakerTaskExecution
Our application can run in sagemaker by passing a SageMakerEstimatorFactory as an argument of our TaskExecution. In which case it becomes a SageMakerTaskExecution. See an example on how to define it.
from sagemaker.pytorch import PyTorch
from iquaflow.experiments.task_execution import SageMakerEstimatorFactory, SageMakerTaskExecution
sage_estimator_factory = SageMakerEstimatorFactory(
PyTorch,
{
"entry_point": "train.py",
"source_dir": "yolov5",
"role":role,
"framework_version": "1.8.1",
"py_version": "py3",
"instance_count": 1,
"instance_type": "ml.g4dn.xlarge"
}
)
task = SageMakerTaskExecution( sage_estimator_factory )
Then in your training script, you might want to connect the argument script variables that are defined by convention in iquaflow (see Conventions) to SageMaker environmental variables to take full advantage of the SageMaker tools. As an example:
import argparse
import os
parser = argparse.ArgumentParser()
# Define some defaults
trainds_default = (os.environ["SM_CHANNEL_TRAINDS"] if "SM_CHANNEL_TRAINDS" in os.environ else "")
valds_default = (os.environ["SM_CHANNEL_VALDS"] if "SM_CHANNEL_VALDS" in os.environ else "")
outputpath_default = (os.environ["SM_OUTPUT_DATA_DIR"] if "SM_OUTPUT_DATA_DIR" in os.environ else "./output")
# IQF arguments
parser.add_argument("--trainds", default=trainds_default, type=str, help="training dataset path")
parser.add_argument("--valds", default=valds_default, type=str, help="validation dataset path")
parser.add_argument("--outputpath", default=outputpath_default, type=str, help="path output")
Also, for these approaches you might want iquaflow to upload the modified datasets (by iquaflow-modifiers) to a bucket on the fly. To do so, indicate the bucket_name in the cloud_options within ExperimentSetup.
ExperimentSetup
iquaflow allows formulating experiments that take modified versions of the training dataset as reference. To this end, the package provides tools to automate this kind of experiment, which is composed of:
A reference dataset.
A list of dataset modifiers.
An encapsulated machine learning model.
The first two components are covered by DSWrapper and DSModifier respectively. The last one requires a TaskExecution.
Having defined all the components, the user is able to perform an iquaflow experiment by using ExperimentSetup. The user must define the name of the experiment, the reference dataset, the list of dataset modifiers and the packaged model, as follows:
experiment = ExperimentSetup(
experiment_name="experimentA",
task_instance=PythonScriptTaskExecution(model_script_path="./path_to_script.py"),
ref_dsw_train=DSWrapper(data_path="path_to_dataset"),
ds_modifiers_list=[ DSModifier_jpg(params={'quality': i}) for i in [10,30,50,70,90] ]
)
And then just execute the training by
experiment.execute()
Additional options
repetitions
Each combination of parameters and modifiers results in a run. Scripts might contain randomness (e.g. random partitions). For those cases you might want to average out several executions to have a relevant statistic or to study the variability. To do so, set the number of repetitions to a value greater than 1.
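Once the repetitions have been executed, their metrics can be summarized per configuration. The sketch below uses only the standard library and made-up run records. In practice the records would come from the experiment information (for instance the dataframe returned by get_df), whose exact layout is not assumed here.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up run records: one entry per repetition of each configuration
runs = [
    {"modifier": "jpg10", "rmse": 0.42},
    {"modifier": "jpg10", "rmse": 0.40},
    {"modifier": "jpg10", "rmse": 0.44},
    {"modifier": "jpg90", "rmse": 0.21},
    {"modifier": "jpg90", "rmse": 0.19},
    {"modifier": "jpg90", "rmse": 0.20},
]

# Group the metric values by configuration
grouped = defaultdict(list)
for run in runs:
    grouped[run["modifier"]].append(run["rmse"])

# Mean and sample standard deviation across repetitions
summary = {
    name: {"mean": mean(values), "std": stdev(values), "n": len(values)}
    for name, values in grouped.items()
}
for name, s in summary.items():
    print(name, round(s["mean"], 3), round(s["std"], 3), s["n"])
```

The standard deviation gives a first indication of how much of the difference between configurations could be explained by randomness alone.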
mlflow_monitoring
This allows real-time monitoring of the training scripts. When turned on, iquaflow will pass these additional arguments to the training script:
- mlfuri
- mlfexpid
- mlfrunid
Thus, the user is responsible for adding these arguments to the training script when required. Then the user can activate the current experiment and run in the script with a snippet such as:
import mlflow
mlflow.set_tracking_uri(args.mlfuri)
mlflow.start_run(
run_id=args.mlfrunid,
experiment_id=args.mlfexpid
)
cloud options
cloud_options is a dictionary of options useful for indicating endpoints such as:
- bucket_name – str. If set, modified data (by iquaflow-modifiers) will be uploaded to the bucket.
- tracking_uri – str. trackingURI for mlflow. default is local to the ./mlflow folder
- registry_uri – str. registryURI for mlflow. default is local to the ./mlflow folder
Indicating the bucket is useful for SageMakerTaskExecution instances.
3) Results
Experiment Info
This object allows the user to manage the experiment information. It simplifies the access to MLFlow and allows applying new metrics to previously executed experiments. Basic usage example:
from iquaflow.experiments import ExperimentInfo
experiment_info = ExperimentInfo(experiment_name)
runs = experiment_info.get_mlflow_run_info() # runs is a python dict
These are the main methods:
get_mlflow_run_info: gathers the experiment information in a python dictionary.
apply_metric_per_run: applies a new metric to previously executed experiments.
get_df: retrieves a selection of data in a suitable format so that it can be used as an input in the Visualization module.
In the section Metrics and Visualization (just below) there are examples on how to use the last two methods.
Metrics
The module metrics contains functionalities to estimate metrics in your experiments. BBDetectionMetrics is an available metric that can be applied between bounding boxes of ground truth and predicted elements. They must be in COCO format (see COCO detection and COCO data). When this metric is applied, the metrics from COCOeval (see COCO detection) are estimated.
SNRMetric
Signal-to-noise ratio is defined as the ratio of the power of a signal to the power of background noise. This metric is designed for L0 - L1 images. There are currently two approaches to estimate it:
- Homogeneous blocks (HB) - default option, faster and less problematic.
- Homogeneous areas (HA) - usually more accurate.
from iquaflow.metrics import (
SNRMetric,
snr_function_from_array,
snr_function_from_fn
)
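The definition of signal-to-noise ratio given above can be written down directly. The sketch below is a generic illustration that assumes separate signal and noise samples are already available. It does not reproduce the HB or HA estimators, which have to infer both quantities from the image content. The sample values are made up.

```python
import math
from typing import Sequence


def power(samples: Sequence[float]) -> float:
    """Mean squared value of the samples."""
    return sum(x * x for x in samples) / len(samples)


def snr(signal: Sequence[float], noise: Sequence[float]) -> float:
    """Ratio of signal power to noise power."""
    return power(signal) / power(noise)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """The same ratio expressed in decibels."""
    return 10.0 * math.log10(snr(signal, noise))


signal = [2.0, -2.0, 2.0, -2.0]  # power 4.0
noise = [0.2, -0.2, 0.2, -0.2]   # power close to 0.04
print(snr(signal, noise))     # approximately 100
print(snr_db(signal, noise))  # approximately 20 dB
```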
SharpnessMetric
SharpnessMetric estimates RER, FWHM and MTF. In general the MTF can be ignored because it is the most complex and the least reliable of the three: with just a bit of noise in the data the metric changes a lot, and in this case the images are noisy.
from iquaflow.metrics import SharpnessMetric
RER (Relative Edge Response) - It measures the slope of the edge response (transition). The lower the metric, the blurrier the image is. Taking the derivative of the normalized Edge Response produces the Line Spread Function (LSF). The LSF is a 1-D representation of the system PSF. The width of the LSF at half the height (the 50% point) is called the full-width at half maximum (FWHM).
The FWHM (Full Width at Half Maximum) measures the level of blur. It has three measurements depending on the direction:
- FWHM_X - Horizontal direction of the image
- FWHM_Y - Vertical direction of the image
- FWHM_other - The rest, grouped together. This one is less reliable because it depends on the content of the image, such as how many angles there are.
The Fourier Transform of the LSF produces the Modulation Transfer Function (MTF). MTF is determined across all spatial frequencies, but can be evaluated at a single spatial frequency, such as the Nyquist frequency. The value of the MTF at Nyquist provides a measure of resolvable contrast at the highest ‘alias-free’ spatial frequency.
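The chain from edge response to LSF, FWHM and MTF described above can be illustrated on a synthetic edge. The sketch below uses only the standard library and assumes an ideal step edge blurred by a Gaussian of known sigma, so the result can be compared with the analytical FWHM of a Gaussian (about 2.3548 times sigma). It does not show how SharpnessMetric locates and measures edges in real images.

```python
import cmath
import math


def edge_response(n=201, step=0.1, sigma=1.0):
    """Synthetic normalized edge response of a Gaussian-blurred step edge."""
    xs = [(i - n // 2) * step for i in range(n)]
    esf = [0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0)))) for x in xs]
    return xs, esf


def line_spread(xs, esf):
    """LSF as the central-difference derivative of the edge response."""
    lsf = [
        (esf[i + 1] - esf[i - 1]) / (xs[i + 1] - xs[i - 1])
        for i in range(1, len(xs) - 1)
    ]
    return xs[1:-1], lsf


def fwhm(xs, ys):
    """Width of a single-peaked curve at half its maximum (linear interpolation)."""
    half = max(ys) / 2.0
    above = [i for i, y in enumerate(ys) if y >= half]
    lo, hi = above[0], above[-1]
    left = xs[lo - 1] + (half - ys[lo - 1]) * (xs[lo] - xs[lo - 1]) / (ys[lo] - ys[lo - 1])
    right = xs[hi] + (half - ys[hi]) * (xs[hi + 1] - xs[hi]) / (ys[hi + 1] - ys[hi])
    return right - left


def mtf(lsf):
    """Magnitude of the DFT of the LSF, normalized to 1 at zero frequency."""
    n = len(lsf)
    spectrum = [
        abs(sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(lsf)))
        for k in range(n // 2 + 1)
    ]
    return [s / spectrum[0] for s in spectrum]


sigma = 1.0
xs, esf = edge_response(sigma=sigma)
lsf_x, lsf = line_spread(xs, esf)
width = fwhm(lsf_x, lsf)
mtf_values = mtf(lsf)

print(round(width, 2))   # close to 2.3548 * sigma for a Gaussian LSF
print(mtf_values[0])     # 1.0 by construction
```

A larger sigma (more blur) gives a wider LSF, a larger FWHM and an MTF that falls off faster with spatial frequency.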
BBDetectionMetrics
from iquaflow.metrics import BBDetectionMetrics
This estimates object detection metrics (Recall, mAP, etc.) over a dataset that has its predictions in COCO-inference format (see Conventions).
Custom metrics
Custom metrics can be created by inheriting from the class Metric:
from typing import Any
from iquaflow.metrics import Metric

class CustomMetric(Metric):
    def __init__(self) -> None:
        # Names of the metrics reported by this class (placeholder name)
        self.metric_names = ["my_metric"]

    def apply(self, predictions: str, gt_path: str) -> Any:
        # Your custom code here: compute one value per metric name
        stats = [0.0]  # placeholder values
        # Then return a dictionary of names and values for each metric
        return {k: v for k, v in zip(self.metric_names, stats)}
To apply a metric to an executed experiment do:
from iquaflow.experiments import ExperimentInfo
experiment_info = ExperimentInfo(experiment_name)
my_custom_metric = CustomMetric()
experiment_info.apply_metric_per_run( my_custom_metric, json_annotations_name )
Visualization
Apart from the visualization tools explained in the Sanity check and statistics section, there are also tools for plotting the results. On one hand there is the mlflow service, which is launched by mlflow ui --host 0.0.0.0 and then accessed in the browser at http://ip_address_of_your_mlflow_server:5000. The Tracking UI lets you visualize, search and compare runs, as well as download run artifacts or metadata for analysis in other tools. If you log runs to a local mlruns directory, run mlflow ui in the directory above it, and it loads the corresponding runs. The UI contains the following key features:
- Experiment-based run listing and comparison
- Searching for runs by parameter or metric value
- Visualizing run metrics
- Downloading run results
On the other hand there is the ExperimentVisual class. It offers plotting utilities both inline and saved to files. It takes as input a dataframe extracted from an ExperimentInfo. See some examples in Visual Notebooks. See also the code examples in the Quickstart section above.
Development
Package Overview
The python package structure of this toolbox is based on cookiecutter. This library provides a standard workflow for developing production level packages. The tools that will be used are:
1. setuptools for packaging
2. versioneer for versioning
3. GitLab CI for continuous integration
4. tox for managing test environments
5. pytest for tests
6. sphinx for documentation
7. black, flake8 and isort for style checks
8. mypy for type checks
More information can be found in:
1. https://packaging.python.org/tutorials/packaging-projects/
2. https://python-packaging.readthedocs.io/en/latest/minimal.html
3. https://www.learnpython.org/en/Modules_and_Packages
Environment installation
This repository does not require any specific python environment. In our case we use Python 3.7. Hence, we recommend creating a new environment with Python 3.7 and pip. The file setup.py allows installing iquaflow as a python package via pip. Once you have created your new environment, you only need to clone the repository locally:
git clone https://github.com/satellogic/iquaflow
and then run the following command to install iquaflow as a soft link in the environment:
python -m pip install -e .
Dependencies are defined in setup.cfg under the install_requires tag. To add a new dependency, first install the package in your local environment and then add it to setup.cfg with its corresponding version.
Documentation
We use Sphinx to automatically update our documentation. This keeps the package documentation up to date as new code is added (as long as the code is commented). The documentation and Sphinx configuration can be found inside /doc.
Under the /doc folder type in console
make html
Sphinx will generate under /doc/build/html the desired html documentation. You can also use tox:
tox -e docs
More information about Sphinx can be found here.
Continuous integration
In our project we use tox. This tool allows managing multiple environments in order to automatically validate code. More information about tox can be found here.
For quality check you only need to run:
tox -e check
For automatic code reformat:
tox -e reformat
To execute all tests for the first time, use:
tox -r -e py36
Alternatively, if it is not the first time, it is not necessary to recreate the tox environment:
tox -e py36
Note: CI terminology for python can be found here.
Test
Unit tests are performed using PyTest. All tests are included in the test folder located in the repository main folder. Once you have created a new test module, e.g. test_new_module, that includes python assertions, simply type pytest in the console, or:
pytest <module name>
to run the tests.
We strongly recommend using “test_” as the prefix of every test you create.
You can also run the tests manually using tox (recommended). Use the -r parameter to create the tox environment for the first time:
tox -e py36
More information can be found in https://docs.python-guide.org/writing/tests/
Initial development process
Below we describe usual steps when developing from scratch:
Setup python environment:
conda create -n iqt-env python=3.6
Clone repository:
git clone https://github.com/satellogic/iquaflow
Create branch:
git checkout -b <new_branch_name>
Install soft link via:
python -m pip install -e .
Create test that defines modules functionality.
Solve the test by adding package functionality.
If new branch pulled use tox -r to recreate tox environments.
Reformat code:
python -m pip install tox
tox -e reformat
Check code and solve:
tox -e check
Run tests:
tox -e py36
Push to remote branch.
Create MR and assign reviewer.
Refreshing local repository for running tests (after pip install -e .):
tox -r -e py36
Sphinx