Tohuwabohu exorcism

Big Data

What is MapReduce?

MapReduce is a programming PATTERN (map, then reduce); it is the PROCESSING part of Hadoop.
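
To make the pattern concrete, here is a minimal single-process sketch of a word count in plain Python. It only illustrates the map, shuffle and reduce steps; the function names are made up for this note, and it does not use Hadoop's API or run on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster the map and reduce calls would run on many nodes;
    # here they run one after another in a single process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The map step looks at one line at a time and the reduce step at one key at a time, which is what makes both steps easy to spread over many machines.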

https://www.quora.com/What-is-the-relationship-between-MapReduce-and-Hadoop

Hadoop

Hadoop is a framework for storing and processing huge data sets.

Hadoop is essentially a DISTRIBUTED DATA infrastructure: It distributes massive data collections across multiple nodes within a cluster.
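
A toy sketch of what "distributing a data collection across nodes" can look like, in plain Python. The node names, the block size and the round-robin placement are assumptions made up for illustration; this is not how Hadoop actually decides where blocks go.

```python
def distribute_blocks(records, nodes, block_size):
    """Split records into fixed-size blocks and hand the blocks out round-robin."""
    placement = {node: [] for node in nodes}
    blocks = [records[i:i + block_size] for i in range(0, len(records), block_size)]
    for index, block in enumerate(blocks):
        node = nodes[index % len(nodes)]  # hypothetical placement rule
        placement[node].append(block)
    return placement


placement = distribute_blocks(list(range(10)), ["node-a", "node-b", "node-c"], block_size=3)
for node, blocks in placement.items():
    print(node, blocks)
```

Each node ends up holding only part of the data, so no single machine has to store or scan the whole collection.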

Basically, Hadoop can be divided into two parts: PROCESSING (MapReduce) and STORAGE (HDFS).

https://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html

What is a Resilient Distributed Dataset (RDD)?

An RDD is the primary data abstraction in Apache Spark.

Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph, Spark can recompute partitions that are missing or damaged due to node failures.

Distributed with data residing on multiple nodes in a cluster.

Dataset is a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent records of the data you work with).
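
The three properties above can be sketched with a toy class in plain Python. ToyRDD and its methods are invented for this note and are not Spark's API; the sketch also assumes that the original source data is still available after a failure. It shows the core idea only: every dataset remembers its parent and the transformation that produced it, so a lost partition can be rebuilt.

```python
class ToyRDD:
    """Immutable, partitioned dataset that remembers how it was derived."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self._source = source_partitions  # only set for the root dataset
        self._parent = parent             # lineage: the dataset this one came from
        self._fn = fn                     # lineage: per-partition transformation
        self._cache = {}                  # computed partitions, may get lost

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        # Transformations never change this dataset; they return a new one.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, index):
        """Return partition `index`, rebuilding it from lineage if it is missing."""
        if index not in self._cache:
            if self._parent is None:
                data = tuple(self._source[index])
            else:
                data = tuple(self._fn(self._parent.compute(index)))
            self._cache[index] = data
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that destroys one computed partition."""
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


numbers = ToyRDD(source_partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose_partition(1)   # pretend the node holding partition 1 died
after = even_squares.collect()   # partition 1 is recomputed from its lineage
print(before, after)
```

Nothing is copied to a backup here; the recipe (lineage) is enough to get the lost partition back.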

What is Apache Spark?

Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries and graph processing.

One of Spark's primary distinctions is its use of RDDs or Resilient Distributed Datasets. RDDs are great for pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a unique form of fault tolerance based on lineage information.

If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered).

https://stackoverflow.com/questions/24119897/apache-spark-vs-apache-storm

Presto

Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata.

Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq.

Hadoop vs. Spark? What are the differences?

Spark can run on top of a Hadoop cluster. Spark can replace MapReduce as the processing engine.

Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes.

Hadoop is essentially a DISTRIBUTED DATA infrastructure: It distributes massive data collections across multiple nodes within a cluster.

Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage. Spark only competes with the MapReduce part of Hadoop. Spark is generally a lot faster than MapReduce, largely because it keeps intermediate data in memory.

https://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html

What is Apache Storm?

Storm is a competitor of Spark.

Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
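
A minimal sketch of what "processing an unbounded stream" means, using plain Python generators. The sensor_stream source and the event layout are hypothetical, this is not Storm's API (Storm builds topologies out of spouts and bolts), and the sketch leaves out the reliability part entirely.

```python
import itertools
from collections import Counter


def sensor_stream():
    """Hypothetical unbounded source: yields events forever."""
    for i in itertools.count():
        yield {"sensor": f"s{i % 3}", "value": i}


def running_counts(events):
    """Handle each event as it arrives and emit an updated count right away."""
    counts = Counter()
    for event in events:
        counts[event["sensor"]] += 1
        yield event["sensor"], counts[event["sensor"]]


# The stream never ends, so the consumer decides how much of it to take.
first_five = list(itertools.islice(running_counts(sensor_stream()), 5))
print(first_five)
```

Unlike a batch job, there is no "end of input" to wait for: results are produced per event while the data keeps coming.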

Apache Storm is NOT a database.

Storm vs Spark?

They do practically the same thing: data processing.

Multi-language support: Storm is better (e.g. R).

Data sources: Spark is better (e.g. S3).
