Welcome to the third installment of Solution Spotlight, a series of interviews with subject matter experts who have led the delivery of some of EPAM SolutionsHub’s open source solutions, software products and accelerators. For this piece, Delivery Manager Dmytro Liaskovskyi and Systems Engineering Team Leader Volodymyr Veres shared more about DataLab, EPAM’s open-source, self-service, fail-safe exploratory environment for collaborative data science workflows. Here’s what they had to say:
Tell us more about DataLab
DataLab is a multi-cloud orchestrator for the provisioning of secured analytical environments in major cloud providers’ ecosystems. It is a self-service web console used to create and manage exploratory environments. It allows teams to spin up analytical environments with just a single click of a mouse. Once established, the environment can be managed autonomously by an analytical team, leveraging its simple and easy-to-use web interface.
Key features include:
- Similar user experience whether implemented with Amazon Web Services (AWS), Google Cloud Platform (GCP) or Microsoft Azure
- Automatically configurable exploratory environment integrated with enterprise security
- Support provided by a private network perimeter with limited internet access
- Unified single sign-on experience while working with DataLab and on-premises software
- Integration with best-of-breed open source analytical tools, such as Jupyter, JupyterLab, Zeppelin, RStudio, Jupyter/RStudio + TensorFlow and Deep Learning
- Extended computational resources through the cloud provider’s managed services or based on standalone Apache Spark clusters
- Project-level collaboration environment across multiple clouds
- Aggregated billing report with cost allocation across cloud providers
- Cost-saving capabilities, like scheduling by time or by inactivity period and usage of spot instances
- Project-level resource management
- Centralized library management
- Project-centric access between different cloud providers
- Multi-account deployment support
Why did EPAM choose to open source DataLab?
Originally, we created DataLab to accelerate our development processes at EPAM and as a solution for our clients. Given that DataLab is a cloud-based solution that can help data scientists around the world work more efficiently and effectively, we decided that it would be a great solution to share more broadly.
What makes DataLab unique?
The most important aspect of DataLab is that it’s open source and therefore free for all types of usage. It can also be deployed on several of the largest cloud providers, including AWS, Azure and GCP. A DataLab instance deployed on a single cloud can launch exploratory environments and spin up computational power on other clouds, thus enabling a multi-cloud journey.
DataLab can also integrate with various identity providers supporting the OAuth and SAML 2.0 protocols, and has cool built-in enterprise features for infrastructure cost monitoring, efficient resource usage and bucket browsing.
Who uses it?
- A manufacturing company leveraged DataLab for its data quality, data exploration and analytics work. The company’s data scientists work with data sources that have been transferred to the cloud in order to find new insights and help the implementation team define requirements for data engineering, decreasing time to deployment.
- A retail company uses DataLab as an image recognition framework to enable automated restocking of inventory.
- A travel company created a recommendation engine using DataLab, allowing end users to find more relevant accommodations faster and at a lower cost.
- An investment company leverages DataLab as an AWS-based analytics platform so that their data scientists can easily gather multi-tenant data analytics. This enables data scientists to easily provision work environments with integrated data sources utilizing Apache Spark based on Elasticsearch, Apache HBase and Neo4j.
Currently, EPAM’s Internal Analytics group actively uses DataLab for various use cases, including:
- Employee versus positions matching
- Employee attrition score
- Employee star score
- Automated language assessment
- Employee productivity modeling
Who contributes to it?
At present, all contributors work at EPAM. We submitted DataLab to the Apache incubator process to broaden its appeal – we welcome anyone interested in contributing!
Can I contribute to DataLab?
Yes! Please review the contribution rules here. To speak to someone on the development team, please email us at [email protected].
What are the future plans for DataLab?
We plan to add several new features, including:
- Upgrade DataLab software to Ubuntu 18.x
- Add support for Kubernetes
- Ability to upload datasets to any of AWS, GCP and Azure Blob storage using DataLab’s bucket browser (planned release 2.3.1 in July)
- Support for Spark 3.x.x
- Localization support in DataLab
- Support for auditing capabilities in DataLab
- Support for Spot instances for Standalone Spark clusters to save costs on computational resources
- TensorFlow-RStudio as a new DataLab template
Anyone can view and vote for new features here. New users can also share ideas for the project using Issues on the GitHub repo.
How does EPAM support DataLab?
EPAM has a core DataLab team who are the main contributors. We have a frequent release cycle, which you can learn more about here.
What if I need to customize DataLab to make it applicable for my organization’s needs?
Our development team will be more than happy to help. Contact us via email at [email protected].
What should I do if I want to use DataLab or learn more?
You can get an overview of DataLab here. To learn more about its main features, visit the user guide. To deploy DataLab, please refer to this guide. You can also contact us directly with any questions at [email protected].