Spark

Data scientists have to learn Spark. Apache Spark is an open-source framework that focuses on interactive queries, machine learning (ML) and real-time workloads.

Spark does not have its own storage system. It holds data in Resilient Distributed Datasets (RDDs) spread across partitions, and runs analytics on top of other storage systems such as Amazon S3, Amazon Redshift, Cassandra, Couchbase and others.

Spark is an open-source, general-purpose distributed data processing engine, suitable for Big Data workloads and useful to any developer or data scientist dealing with Big Data.

It utilizes in-memory caching and optimised query execution for fast analytic queries.

It provides APIs in Java, Scala, Python and R.

It supports reusing code across multiple workloads: batch processing, interactive queries, real-time analytics, ML and graph processing.

It is among the most popular distributed data processing frameworks.

Spark SQL provides a programming abstraction called DataFrames and also acts as a distributed SQL query engine.

The purpose was to create a new framework, optimised for fast iterative processing such as ML and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce.

Spark started as a research project at AMPLab, UC Berkeley. Apache Spark was created by PhD students as a unified analytics engine with many libraries for Big Data processing.

Apache Spark VS. Apache Hadoop

Hadoop is also an open-source framework, with the Hadoop Distributed File System (HDFS) as its storage layer.

There is YARN as a way of managing compute resources shared by different applications.

Its execution engine is an implementation of the MapReduce programming model; alternative execution engines that run on Hadoop include Spark, Tez and Presto.

A Hadoop MapReduce job is a sequential, multi-step process. At each step, it reads data from the cluster, performs operations, and writes results back to HDFS. Each step requires a disk read and write, so jobs are slower because of the latency of disk I/O.

Benefits of Spark

Thanks to in-memory caching and optimised query execution, Spark is fast.

It is developer friendly, with a variety of languages for building applications: Java, Scala, R and Python. These APIs make development easy.

It has the ability to run multiple workloads: interactive queries, real-time analytics, ML and graph processing. One application can combine these workloads seamlessly.

Spark Workloads

Spark Core is the foundation of the platform. On top of it sit Spark SQL for interactive queries, Spark Streaming for real-time analytics, MLlib for machine learning, and GraphX for graph processing.

How Spark Scores over MapReduce?

Spark is an answer to the shortcomings of MapReduce. It does its processing in memory, reduces the number of steps in a job, and reuses data across multiple parallel operations.

In Spark, only a single read-process-write step is needed, which makes execution faster. It reuses data via an in-memory cache to accelerate ML algorithms that repeatedly call a function on the same dataset.

Data reuse is enabled by DataFrames: collections of objects that can be cached in memory and reused across multiple Spark operations.

This lowers latency. Spark is several times faster than MapReduce for ML and interactive analytics.

Use Cases

Spark is used in financial services such as banking to assess customer churn and to develop new financial products. In investment banking, it is used to analyse stock prices to predict future trends.

It is used in healthcare to provide comprehensive patient care, to help frontline workers in patient interaction, and to predict future trends.

It is used in manufacturing to reduce downtime of IoT devices through predictive maintenance.

It is used in retail to attract and retain customers through personalised services and offers.

Spark and Cloud

Spark is an ideal workload for the cloud, which provides performance, scalability, reliability, availability and economies of scale.

Amazon EMR is one popular place to deploy Apache Spark in the cloud.

Spark and Hadoop

Spark and Hadoop work better together, but Spark can also run in stand-alone mode. In that case you need a resource manager such as YARN or Mesos.
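The choice of resource manager is made at submission time. A sketch of the `spark-submit` invocations (the host names and `my_app.py` below are placeholders, not real endpoints):

```shell
# Stand-alone Spark cluster (Spark's own master process)
spark-submit --master spark://master-host:7077 my_app.py

# Hadoop YARN as the resource manager
spark-submit --master yarn --deploy-mode cluster my_app.py

# Apache Mesos as the resource manager
spark-submit --master mesos://mesos-host:5050 my_app.py
```

The application code itself is unchanged across the three; only the `--master` URL tells Spark where to acquire executors.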

Hadoop is not a prerequisite for learning Spark, which is an independent project. Spark became popular because it runs on top of HDFS alongside other Hadoop components.

The Spark Core data processing engine works together with the libraries for SQL, ML, graph computation and stream processing, and these can be combined in one application. ML is an iterative process that requires fast processing; Spark's in-memory data processing makes this possible. Data scientists leverage the speed, ease and integration of Spark's unified platform.

Learning Spark is not difficult if you have a basic understanding of Python or any programming language. It provides APIs in Java, Python, Scala and R.
