Batch processing systems offer an efficient way to run highly scalable ML training. Together with the infrastructure, we provide the expertise needed to scale machine learning and data analytics on such systems for your experiment.
SERVICE DESCRIPTION
As one of the largest scientific institutions in Europe and the only German university of excellence with national large-scale research facilities, KIT combines a long university tradition with program-oriented cutting-edge research. KIT's cluster systems are the backbone for many international data and HPC operations.
Within EUHubs4Data, experiments can scale their data analysis and machine learning tasks on a modern research cluster optimized for model training, hyperparameter optimization, and simulation-based machine learning algorithms.
The cluster consists of
- 520 "Thin" nodes (40 2.1 GHz Intel cores / >=96 GB memory per node),
- 352 "HPC Broadwell" nodes (28 2.0 GHz cores / 128 GB memory each),
- 8 "Fat" nodes (80 cores / 3 TB memory each).
To support machine learning, the following GPU nodes are available (more accelerators are currently being integrated):
- 4 nodes with 4 NVIDIA V100 GPU accelerators,
- 8 nodes with 8 NVIDIA V100 GPU accelerators.
The systems support deployment via SLURM schedulers, with Docker, Podman, and Singularity available for containerization. For interactive use, a JupyterHub environment provides access to compute nodes. Tight integration with GitLab-based continuous integration is also supported.
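As an illustration of how a containerized training job might be described for a SLURM scheduler, the following minimal Python sketch renders a batch script as text. The job name, the image file `model.sif`, the command `python train.py`, and the requested resources are hypothetical placeholders, and the directives a given cluster actually expects (partitions, accounts, GPU resource names) may differ, so treat this as a sketch under those assumptions rather than a site-specific template.

```python
def render_sbatch(job_name, image, command, gpus=4, hours=2):
    """Render a SLURM batch script that runs a command inside a
    Singularity container on one GPU node.

    All argument values are illustrative assumptions; adapt them to
    the resources granted for your experiment.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",                  # a single node
        f"#SBATCH --gres=gpu:{gpus}",         # number of GPUs on that node
        f"#SBATCH --time={hours:02d}:00:00",  # wall-clock limit (HH:MM:SS)
        "",
        # --nv makes the host's NVIDIA driver and GPUs visible in the container
        f"singularity exec --nv {image} {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: train a PoC model inside a container image.
script = render_sbatch("train-poc", "model.sif", "python train.py")
print(script)
```

The printed text could be saved to a file and handed to `sbatch`. Generating it from Python keeps the resource request next to the experiment code, which can be convenient when jobs are launched from a continuous integration pipeline.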
For machine learning with TensorFlow, PyTorch, and scikit-learn, Dask can be used as a parallelization layer. KIT supports porting ML architectures to this cluster system and will provide cluster time for experiments.
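To give a rough idea of what Dask as a parallelization layer can look like, the sketch below evaluates a small hyperparameter grid with `dask.delayed`. The function `train_and_score`, its toy scoring formula, and the grid values are hypothetical stand-ins for a real training routine. The sketch runs on local threads and assumes Dask is installed; on a cluster, the same task graph would typically be sent to a distributed scheduler instead.

```python
import itertools

import dask
from dask import delayed


def train_and_score(learning_rate, depth):
    """Hypothetical stand-in for a real training run.

    Returns a toy score that peaks at learning_rate=0.01 and depth=4,
    so the example stays deterministic and fast.
    """
    score = -((learning_rate - 0.01) ** 2) - (depth - 4) ** 2
    return {"learning_rate": learning_rate, "depth": depth, "score": score}


# Illustrative grid: 3 learning rates x 2 depths = 6 configurations.
grid = list(itertools.product([0.001, 0.01, 0.1], [2, 4]))

# Build one lazy task per configuration; nothing is executed yet.
tasks = [delayed(train_and_score)(lr, depth) for lr, depth in grid]

# Execute all tasks in parallel, here using the local threaded scheduler.
results = dask.compute(*tasks, scheduler="threads")

best = max(results, key=lambda r: r["score"])
print(best["learning_rate"], best["depth"])
```

Because each configuration is an independent task, this pattern scales with the number of available workers without changes to the training function itself.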
SPECIAL ACCESS CONDITIONS
Conditions and requirements for participation in an experiment within the Open Calls:
By participating in an EUHubs4Data Open Call, you are initially only applying for funding that originates from the European Commission and is awarded by the coordinator exclusively in its own name through a sub-grant agreement. This sub-grant agreement does not establish a contract with KIT, either through your application or through a possible positive funding decision.
KIT will therefore, also in your own interest, conclude a separate written agreement with you at the start of the experiment (based on our sample cooperation agreement).
If you decide to propose the participation of KIT and SDIL infrastructure in your experiment, you must respect the following conditions. We provide this information in advance to ensure maximum transparency: please contact us if you have any questions. In the unlikely event that you are unable to conduct your experiment with our participation, we will attempt to assist you in selecting alternative services before the experiment begins.
Please note that, contrary to the name "service", the above description is not a genuine commercial offer but a list of contributions made exclusively as part of a genuine collaboration on equal terms.
For genuine commercial offerings related to the above topics, please feel free to contact us at any time outside of the Open Calls.
PREREQUISITES
An existing proof-of-concept (PoC) ML architecture with training data (e.g., from a previous task within an experiment).
CASE EXAMPLES
We are currently working with a local object detection startup ( https://www.kimoknow.de/ ) to set up an HPC-driven data generation and image classification toolchain that allows detection of real-world objects based on CAD drawings: https://www.ff4eurohpc.eu/en/experiments/2021070910334695/aiplatform_for_automated_training_of_object_detection_models_based_on_cad_data