Differences

This shows you the differences between two versions of the page.

--- bigdata [2020/05/04 14:31] – skipidar
+++ bigdata [2023/01/14 15:36] (current) – skipidar
@@ Line 1: / Line 1: @@
 ==== BigData ====
+{{https://s3.eu-central-1.amazonaws.com/alf-digital-wiki-pics/sharex/BgjRo5Jfcq.png}}
+=== What is MapReduce? ===
+MapReduce is a programming PATTERN; it covers the PROCESSING part of Hadoop.
+https://www.quora.com/What-is-the-relationship-between-MapReduce-and-Hadoop
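To make the pattern concrete, here is a minimal single-process sketch of a word count in the map / shuffle / reduce style. It is plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration. On a real cluster each phase would run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Hypothetical driver: runs the three phases one after another in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is distributed"])
print(counts)
```

The map and reduce functions never share state, which is what lets a framework spread them over many machines.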
+=== Hadoop ===
+Hadoop is a framework for storing and processing huge data sets.
+Hadoop is essentially a DISTRIBUTED DATA infrastructure: it distributes massive data collections across multiple nodes within a cluster.
+Hadoop can be divided into two parts: PROCESSING (MapReduce) and STORAGE (HDFS).
+https://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
+=== What is a Resilient Distributed Dataset (RDD)? ===
+An RDD is the primary data abstraction in Apache Spark.
+It represents an immutable, partitioned collection of elements that can be operated on in parallel.
+Resilient: fault-tolerant. Using the RDD lineage graph, Spark can recompute partitions that are missing or damaged because of node failures.
+Distributed: the data resides on multiple nodes in a cluster.
+Dataset: a collection of partitioned data holding primitive values or compound values, e.g. tuples or other objects (that represent the records of the data you work with).
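The lineage idea can be sketched in a few lines of plain Python. `ToyRDD` below is a hypothetical, single-process stand-in and not Spark's API: it only remembers its parent and the function applied to it, so a lost partition can be rebuilt by replaying that function on the parent's partition.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, lineage-aware."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._parent = parent  # lineage: where this data set came from
        self._fn = fn          # lineage: how each element was derived
        if parent is None:
            # Source data is frozen into tuples, so it cannot be modified.
            self._source = tuple(tuple(p) for p in partitions)
            self._num_partitions = len(self._source)
        else:
            self._source = None
            self._num_partitions = parent._num_partitions
        self._cache = {}       # computed partitions, may get "lost"

    def map(self, fn):
        # A transformation returns a new ToyRDD; the original is untouched.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        if self._parent is None:
            return self._source[index]
        if index not in self._cache:
            # Recompute from lineage: apply fn to the parent's partition.
            parent_data = self._parent.compute(index)
            self._cache[index] = tuple(self._fn(x) for x in parent_data)
        return self._cache[index]

    def lose_partition(self, index):
        # Simulates a node failure that drops one computed partition.
        self._cache.pop(index, None)

    def collect(self):
        return [x for i in range(self._num_partitions) for x in self.compute(i)]


numbers = ToyRDD(partitions=[[1, 2], [3, 4]])
squares = numbers.map(lambda x: x * x)
first = squares.collect()
squares.lose_partition(1)      # partition 1 is gone ...
recovered = squares.collect()  # ... and is rebuilt from the lineage
```

Because the data is immutable, replaying the same function on the same parent partition always yields the same result, which is why lineage is enough for recovery in this sketch.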
+=== What is Apache Spark? ===
+Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries and graph processing.
+One of Spark's primary distinctions is its use of RDDs or Resilient Distributed Datasets. RDDs are great for pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a unique form of fault tolerance based on lineage information.
+If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered).
+https://stackoverflow.com/questions/24119897/apache-spark-vs-apache-storm
@@ Line 10: / Line 63: @@
 Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. You’ll find it used by many well-known companies like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq.
+=== Hadoop vs. Spark? What are the differences? ===
+Spark can run on top of the Hadoop Cluster.
+Spark can replace MapReduce as the processing engine.
+Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes.
+Hadoop is essentially a DISTRIBUTED DATA infrastructure: It distributes massive data collections across multiple nodes within a cluster.
+Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.
+Spark only competes with the MapReduce part of Hadoop.
+Spark is generally a lot faster than MapReduce, mainly because it keeps intermediate data in memory instead of writing it to disk between steps.
+https://www.infoworld.com/article/3014440/big-data/five-things-you-need-to-know-about-hadoop-v-apache-spark.html
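One commonly cited reason for the speed difference is that chained MapReduce jobs persist their intermediate results to disk, while Spark can keep them in memory. The toy sketch below only illustrates that difference by counting disk round trips; it uses neither Hadoop nor Spark, and `step`, `run_via_disk` and `run_in_memory` are hypothetical names.

```python
import json
import os
import tempfile


def step(values):
    """One processing step of a hypothetical iterative job."""
    return [v + 1 for v in values]


def run_via_disk(values, iterations, workdir):
    """MapReduce-style chaining: every step writes its output, the next reads it back."""
    path = os.path.join(workdir, "step.json")
    round_trips = 0
    for _ in range(iterations):
        with open(path, "w") as f:
            json.dump(step(values), f)
        with open(path) as f:
            values = json.load(f)
        round_trips += 1
    return values, round_trips


def run_in_memory(values, iterations):
    """Spark-style chaining: intermediate results never leave memory."""
    for _ in range(iterations):
        values = step(values)
    return values, 0


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = run_via_disk([1, 2, 3], 3, workdir)
memory_result, memory_trips = run_in_memory([1, 2, 3], 3)
```

Both variants produce the same result; the disk variant pays one write and one read per iteration, which adds up quickly for iterative jobs such as machine learning.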
+=== What is Apache Storm? ===
+Storm is a competitor of Spark.
+Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
+Apache Storm is NOT a database.
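What processing an unbounded stream means can be sketched with Python generators. This is not Storm's API; `sensor_stream` and `running_totals` are hypothetical names, and the source simply never ends. Each event is handled as it arrives and an updated result is emitted immediately, instead of waiting for a complete batch.

```python
from itertools import count, islice


def sensor_stream():
    """Hypothetical unbounded source: yields readings forever."""
    for i in count():
        yield {"sensor": f"s{i % 2}", "value": i}


def running_totals(stream):
    """Process one event at a time and emit the updated total right away."""
    totals = {}
    for event in stream:
        key = event["sensor"]
        totals[key] = totals.get(key, 0) + event["value"]
        yield key, totals[key]


# The stream has no end, so the demo only looks at the first five results.
first_five = list(islice(running_totals(sensor_stream()), 5))
print(first_five)
```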
+=== Storm vs Spark? ===
+Both do practically the same thing: they process data.
+multi-language support - Storm is better (e.g. R)
+data sources - Spark is better (e.g. S3)