The Talent500 Blog

Streamlining Autonomous Machine Development with NVIDIA OSMO

The development of autonomous machines is a complex, iterative process that involves generating and gathering data, training models, and deploying them through intricate multi-stage workflows across diverse computing resources. This process typically requires collaboration among multiple teams, each needing access to shared and varied compute environments. Moreover, teams often aim to scale certain workloads to the cloud while keeping others on-premises, necessitating DevOps expertise.

NVIDIA OSMO: A Unified Platform

At the recent GTC event, NVIDIA unveiled OSMO, a cloud-native workflow orchestration platform designed to provide a comprehensive solution for scheduling and managing various autonomous machine workloads across heterogeneous shared compute environments. The platform supports a range of workloads, including:

  • Synthetic Data Generation (SDG)
  • Deep Neural Network (DNN) Training and Validation
  • Reinforcement Learning
  • Robot (Re)Simulation in Software-in-the-Loop (SIL) or Hardware-in-the-Loop (HIL)
  • Perception Evaluation on Simulated or Real Data

Deploying Workloads Across Diverse Compute Resources

OSMO simplifies the deployment and orchestration of multi-stage workloads on Kubernetes clusters. This includes managing shared, heterogeneous, multi-node compute resources spanning aarch64 and x86-64 architectures, which keeps workloads portable across different systems.

Users can define multi-stage, multi-node tasks in YAML, streamlining the end-to-end development pipeline from synthetic data generation to model validation. OSMO also integrates into existing Continuous Integration/Continuous Deployment (CI/CD) pipelines, allowing dynamic scheduling of tasks for nightly regression testing, benchmarking, and model validation.

The platform adheres to open standards such as OpenID Connect (OIDC) for authentication and implements best practices for credential and dataset security, including one-click key rotation. For compliance purposes, teams can version the data used during development and trace the lineage of everything that went into model training, which is crucial for reproducibility.
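To make the idea of a multi-stage pipeline concrete, the stages described above can be pictured as a dependency graph that a scheduler walks in order. The following is a minimal sketch in plain Python, not OSMO's YAML schema or API; the stage names and the `WORKFLOW` structure are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages (not OSMO syntax): each stage lists the stages
# that must finish before it can start.
WORKFLOW = {
    "sdg": [],                          # synthetic data generation
    "train": ["sdg"],                   # DNN training on the generated data
    "validate": ["train"],              # model validation
    "regression_benchmark": ["train"],  # nightly benchmark, as in a CI/CD run
}


def execution_order(workflow):
    """Return stage names in an order that respects every dependency."""
    graph = {stage: set(needs) for stage, needs in workflow.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(WORKFLOW)
print(order)
```

Stages with no dependency on each other, such as `validate` and `regression_benchmark` here, could in principle run in parallel on different nodes; the sketch only shows one valid ordering.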

Orchestrating On-Premises and Cloud Workloads

Synthetic data generation particularly benefits from distributed environments. It often begins on-premises with smaller batches but scales up to cloud resources as the need for larger volumes arises. OSMO employs elastic resource provisioning to significantly reduce costs associated with offline batch processes like SDG, enabling efficient data generation at scale.
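The on-premises-first, cloud-on-demand pattern can be sketched as a simple capacity calculation. This is an illustrative assumption about how such a split might be planned, not a description of OSMO's provisioning logic; the function name, parameters, and numbers are all hypothetical.

```python
import math


def plan_sdg_run(frames, onprem_capacity, frames_per_cloud_node):
    """Split an SDG batch between on-premises capacity and cloud nodes.

    All three parameters are hypothetical planning inputs:
    frames                - total frames to generate in this run
    onprem_capacity       - frames the on-premises cluster can absorb
    frames_per_cloud_node - frames one cloud node handles in the same window
    """
    if frames_per_cloud_node <= 0:
        raise ValueError("frames_per_cloud_node must be positive")
    onprem_frames = min(frames, onprem_capacity)
    cloud_frames = frames - onprem_frames
    # Request cloud nodes only for the overflow, rounded up to whole nodes.
    cloud_nodes = math.ceil(cloud_frames / frames_per_cloud_node)
    return {
        "onprem_frames": onprem_frames,
        "cloud_frames": cloud_frames,
        "cloud_nodes": cloud_nodes,
    }


small_run = plan_sdg_run(500, onprem_capacity=1000, frames_per_cloud_node=400)
large_run = plan_sdg_run(2500, onprem_capacity=1000, frames_per_cloud_node=400)
print(small_run)
print(large_run)
```

The small run fits entirely on-premises and requests no cloud nodes, while the large run sends only its overflow to the cloud, which is the cost-saving behavior the paragraph above describes.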

Efficient Testing with SIL and HIL

OSMO also supports software-in-the-loop (SIL) robot testing by simulating multi-sensor and multi-robot scenarios or a variety of testing conditions. These scenarios are ideally suited for cloud environments where compute resources are readily available. OSMO’s capability to schedule and manage workloads across distributed settings ensures that SIL testing is conducted efficiently by leveraging the scalability of cloud resources.

Conversely, hardware-in-the-loop (HIL) testing necessitates on-premises deployment due to specific hardware requirements. Heterogeneous compute is essential for HIL testing as it involves simulation and debugging tasks that require x86 architecture while running software targeted at aarch64. This setup provides accurate performance insights and hardware features that would otherwise be unavailable. Running HIL directly on the actual hardware also eliminates the need for expensive emulators.
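The placement constraints described above (SIL anywhere with capacity, HIL on-premises with both x86 and aarch64 machines) can be sketched as a small matching routine. This is a simplified greedy illustration under assumed data structures, not how OSMO or Kubernetes actually schedules work; the node names, task names, and the `needs_onprem` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    arch: str       # "x86_64" or "aarch64"
    location: str   # "onprem" or "cloud"


@dataclass(frozen=True)
class Task:
    name: str
    arch: str
    needs_onprem: bool = False  # True for HIL tasks tied to lab hardware


def assign(tasks, nodes):
    """Greedily give each task the first free node that satisfies it."""
    free = list(nodes)
    placement = {}
    for task in tasks:
        for node in free:
            arch_ok = node.arch == task.arch
            site_ok = node.location == "onprem" or not task.needs_onprem
            if arch_ok and site_ok:
                placement[task.name] = node.name
                free.remove(node)
                break
        else:
            raise RuntimeError(f"no free node satisfies {task.name}")
    return placement


nodes = [
    Node("cloud-x86-1", "x86_64", "cloud"),
    Node("lab-x86-1", "x86_64", "onprem"),
    Node("lab-arm-1", "aarch64", "onprem"),
]
tasks = [
    Task("hil_simulation", "x86_64", needs_onprem=True),
    Task("hil_robot_stack", "aarch64", needs_onprem=True),
    Task("sil_multi_robot", "x86_64"),
]
placement = assign(tasks, nodes)
print(placement)
```

In this example the HIL simulation and the robot software land on the two lab machines of the matching architecture, and the SIL job takes the cloud node.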

Concurrent Generation and Training of Foundation Models

OSMO facilitates the concurrent operation of GR00T, a foundation model that requires NVIDIA DGX systems for model training alongside OVX systems for live reinforcement learning. This workload involves an iterative loop where models are generated and trained continuously. By managing and scheduling workloads across distributed environments, OSMO enables seamless coordination between DGX and OVX systems for effective model development.

Tracking Data Lineage

Data lineage management is critical for model auditing and ensuring traceability throughout the development lifecycle. With OSMO, users can monitor the lineage of data from its origin to the trained model, fostering transparency and accountability in the process.

Managing extensive datasets and creating collections is straightforward with OSMO. The platform allows efficient organization and categorization of data, accommodating collections of real data, synthetic data, or a combination of both. This flexibility provides users with enhanced control over the datasets utilized for model training and evaluation.
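Tracing a trained model back to its source data amounts to walking a graph of "derived from" links. The sketch below shows that idea with a plain dictionary; it is an assumption about how lineage could be represented, not OSMO's data model, and the artifact names and version tags are hypothetical.

```python
from collections import deque

# Hypothetical lineage records: each versioned artifact lists the
# artifacts it was derived from. Origin datasets have no parents.
LINEAGE = {
    "real_sensor_logs:v3": [],
    "synthetic_scenes:v7": [],
    "training_collection:v2": ["real_sensor_logs:v3", "synthetic_scenes:v7"],
    "perception_model:v1": ["training_collection:v2"],
}


def trace_sources(artifact, lineage):
    """Return every upstream artifact, nearest ancestors first."""
    seen = []
    queue = deque(lineage[artifact])
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue  # an artifact reachable by two paths is listed once
        seen.append(parent)
        queue.extend(lineage[parent])
    return seen


sources = trace_sources("perception_model:v1", LINEAGE)
print(sources)
```

Here the model traces back through a mixed collection to one real and one synthetic dataset, mirroring the kind of combined collection described above.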

Conclusion

NVIDIA OSMO represents a significant advancement in orchestrating AI-enabled robotics development workloads. By providing a unified platform that streamlines complex workflows across heterogeneous computing environments, it lets teams generate data, train models, run tests, and manage datasets efficiently, while maintaining compliance and traceability throughout the process.


prachi kothiyal