
Solution Spotlight #2: Apache Big Data Projects

February 03, 2020 | 3 min read

In this article

  • Tell us about the Apache projects that EPAM is working on.

  • How did we get involved in this area? Why did EPAM choose to become contributors?

  • Can you tell us a little more about some of the most popular Apache projects?

  • What does Apache offer that other open source platforms don’t?

  • What is the future of big data?

  • What if I need extra features to make these solutions right for me? Will EPAM do that or do I need to depend on the open source community?

  • What should I do if I want to use these projects or just learn more?

Welcome to the second installment of Solution Spotlight, a series where we conduct Q&A sessions with the people behind the open source solutions, software products and accelerators featured here on EPAM SolutionsHub. For this piece, we talked to EPAMers Nikita Glashchenko, Software Engineer, and Stanislav Fedyakov, Senior Delivery Manager. Here’s what they had to say about their experience with Apache open source projects and EPAM’s involvement:


Tell us about the Apache projects that EPAM is working on.

EPAM’s team contributes regularly to the big data frameworks and products that we use in our daily work to build big data solutions for our customers. Most of our contributions go to Apache projects, including:

  • Spark

  • Kafka

  • Ignite

  • Flink

  • Calcite

  • Beam

  • NiFi

  • Zeppelin

How did we get involved in this area? Why did EPAM choose to become contributors?

EPAM decided to invest in developing these frameworks and products because we use them in building big data solutions for our customers. Because we understand how these specific technologies work from the inside, we are able to quickly fix bugs that our customers encounter and develop new releases that our customers need in a timely manner. Being active contributors, specifically to Apache projects, is often required by our customers who value our company’s commitment to open source. Our big data team regularly monitors what open source tools are being used on the market, which in turn influences our level of contribution to certain projects. It’s important to mention that EPAM does not own or sponsor any of these projects, frameworks and products. Our team fixes bugs and helps develop features required by our data practice or customers.

Can you tell us a little more about some of the most popular Apache projects?

There are three Apache tools that stand out the most:

1. Apache Kafka is a stream-processing platform that provides a unified, high-throughput, low-latency solution for handling real-time data feeds. Here are some examples of how this platform is used in the market (a brief illustrative sketch follows these examples):

  • The New York Times uses Apache Kafka and the Kafka Streams API to store and distribute published content in real time to various applications and systems that make it available to readers.

  • The Wikimedia Foundation uses Apache Kafka as a log transport for analytics data from production webservers and applications. This data is consumed into Hadoop using Camus.

  • Netflix employs Apache Kafka for real-time monitoring and event-processing pipelines.
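
For readers who haven’t worked with Kafka before, here is a minimal sketch of what producing and consuming a real-time feed looks like, using the community kafka-python client. The broker address, topic name and message fields are illustrative assumptions, not details of the deployments above.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python
import json

# Publish an event to a (hypothetical) topic as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("published-content", {"id": 42, "headline": "New article"})
producer.flush()

# A downstream system subscribes to the same topic and reacts to each event.
consumer = KafkaConsumer(
    "published-content",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # stop after one message for this example
```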

2. Apache Spark is a distributed cluster-computing framework. Multiple companies have used it to automate large-scale data processing, including (a brief illustrative sketch follows these examples):

  • Uber built a continuous ETL pipeline on Apache Spark that converts raw, unstructured event data into structured data.

  • By applying Apache Spark and its Machine Learning Library (MLlib), Yahoo provides advanced, personalized recommendations to its users.
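
To make the ETL example more concrete, here is a minimal PySpark sketch of that pattern: read raw, semi-structured events, shape them into typed columns, and write the result out for analytics. The paths, column names and schema are hypothetical, chosen only for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-etl").getOrCreate()

# Raw, semi-structured event data (hypothetical path and schema).
raw = spark.read.json("s3://example-bucket/raw-events/")

# Convert it into structured, analytics-ready columns.
structured = (
    raw.withColumn("event_time", F.to_timestamp("timestamp"))
       .select("user_id", "event_type", "event_time")
       .filter(F.col("event_type").isNotNull())
)

structured.write.mode("overwrite").parquet("s3://example-bucket/structured-events/")
```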

3. Apache Flink is a stream-processing framework that executes arbitrary dataflow programs in a data-parallel and pipelined manner. Various organizations use it to reduce inefficiencies in their data processing, including (a brief illustrative sketch follows these examples):

  • Uber’s internal SQL-based streaming analytics platform is built on Apache Flink.

  • Powered by Apache Flink, eBay’s monitoring platform evaluates thousands of customizable alert rules on metrics and log streams.
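
And here is a minimal PyFlink sketch of a streaming dataflow in the same spirit as the alerting use case: events flow through a pipeline of parallel transformations, and a simple rule flags the ones that need attention. The metric values and threshold are made up for illustration; a production job would read from a source such as a Kafka topic.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# In production this source would be a Kafka topic or a log stream;
# here a small in-memory collection of (service, latency_ms) pairs stands in.
metrics = env.from_collection([("service-a", 120), ("service-b", 530), ("service-c", 80)])

alerts = (
    metrics
    .filter(lambda m: m[1] > 500)  # a simple, customizable alert rule
    .map(lambda m: "ALERT: {} latency {} ms".format(m[0], m[1]))
)
alerts.print()

env.execute("alert-pipeline")
```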

What does Apache offer that other open source platforms don’t?

The Apache community is huge. In fact, it’s the world’s largest open source foundation, with 7,000 Apache code committers and over 200 million lines of code in stewardship. Additionally, Apache does not have restrictive licenses – anyone can become a contributor. Unlike some other open source platforms, Apache provides a clear process for contribution, modifications and new proposals.

What is the future of big data?

The overwhelming majority of big data tools and frameworks will continue to be developed through the open source community. We don’t expect this to change any time soon as open source provides the opportunity for businesses to build highly customizable solutions to address more of their customers’ needs.

What if I need extra features to make these solutions right for me? Will EPAM do that or do I need to depend on the open source community?

EPAM contributes to many open source technologies, frameworks and tools. With our deep, inside knowledge of how these solutions are built, our team knows the best way to make modifications, add new features or enhance the functionality.

What should I do if I want to use these projects or just learn more?

If you want to learn more about the projects we contribute to, please visit apache.org. If you’d like to contribute with us, please join us here.
