bigdata
bigdata [2020/05/04 14:31] – skipidar | bigdata [2023/01/14 15:36] (current) – skipidar | ||
---|---|---|---|
Line 1: | Line 1: | ||
==== BigData ==== | ==== BigData ==== | ||
+ | |||
+ | {{https:// | ||
+ | |||
+ | |||
+ | === What is MapReduce? === | ||
+ | |||
+ | MapReduce is a programming PATTERN; it covers the PROCESSING part of Hadoop. | ||
+ | |||
+ | https:// | ||
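
To make the pattern concrete, here is a minimal single-process sketch of the three MapReduce phases (map, shuffle, reduce) applied to a word count. It is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and the sample input are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster each line (or block of lines) would be mapped on a
    # different node; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is distributed"])
print(counts)
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both phases could in principle be spread across many machines, which is the point of the pattern.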
+ | |||
+ | |||
+ | |||
+ | === Hadoop === | ||
+ | |||
+ | Hadoop is a framework for storing and processing huge data sets. | ||
+ | |||
+ | Hadoop is essentially a DISTRIBUTED DATA infrastructure: | ||
+ | |||
+ | Basically, Hadoop can be divided into two parts: PROCESSING and storage. | ||
+ | |||
+ | |||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is a Resilient Distributed Dataset (RDD)? === | ||
+ | |||
+ | An RDD is the primary data abstraction in Apache Spark. | ||
+ | |||
+ | It represents an immutable, partitioned collection of elements that can be operated on in parallel. | ||
+ | |||
+ | |||
+ | |||
+ | Resilient: fault-tolerant thanks to the RDD lineage graph, so partitions that go missing or get damaged by node failures can be recomputed. | ||
+ | |||
+ | Distributed: the data resides on multiple nodes in a cluster. | ||
+ | |||
+ | Dataset: a collection of partitioned data holding primitive values or compound values, e.g. tuples or other objects (that represent records of the data you work with). | ||
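
The three properties above can be illustrated with a toy, single-process model. This is a sketch only: `ToyRDD` and its methods are hypothetical names invented here and are not the Spark API. The source data is held in memory, standing in for the durable storage a real cluster would re-read from.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not the real Spark class."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> tuple of elements
        self.parent = parent     # lineage: the dataset this one was derived from
        self._cache = {}         # materialized partitions, which may get lost

    @classmethod
    def from_list(cls, data, num_partitions):
        data = tuple(data)  # immutable copy of the source data
        # Partition i gets every num_partitions-th element, starting at i.
        return cls(num_partitions, lambda i: data[i::num_partitions])

    def map(self, fn):
        # A transformation never changes this dataset; it returns a new one
        # that remembers how to derive each partition from its parent.
        return ToyRDD(
            self.num_partitions,
            lambda i: tuple(fn(x) for x in self.partition(i)),
            parent=self,
        )

    def partition(self, i):
        # Recompute the partition from the lineage if it is not materialized.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that wipes one materialized partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(range(6), num_partitions=2)
squares = numbers.map(lambda x: x * x)
before = squares.collect()
squares.lose_partition(1)   # "node failure"
after = squares.collect()   # partition 1 is rebuilt from its parent
print(before, after)
```

The lost partition is never restored from a backup copy; it is recomputed by replaying the `map` step on the parent partition, which is the idea behind lineage-based fault tolerance.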
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is Apache Spark? === | ||
+ | |||
+ | Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries, and graph processing. | ||
+ | |||
+ | One of Spark' | ||
+ | |||
+ | If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered). | ||
+ | |||
+ | https:// | ||
Line 10: | Line 63: | ||
Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, | Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === Hadoop vs. Spark? What are the differences? | ||
+ | |||
+ | Spark can run on top of a Hadoop cluster. | ||
+ | Spark can serve as a replacement for MapReduce. | ||
+ | |||
+ | Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. | ||
+ | |||
+ | Hadoop is essentially a DISTRIBUTED DATA infrastructure: | ||
+ | |||
+ | Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; | ||
+ | Spark only competes with the MapReduce part of Hadoop. | ||
+ | Spark is generally a lot faster than MapReduce. | ||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is Apache Storm? === | ||
+ | |||
+ | Storm is a competitor of Spark. | ||
+ | |||
+ | Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. | ||
+ | |||
+ | Apache Storm is NOT a database. | ||
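
The difference between processing an unbounded stream and running a batch job can be sketched with Python generators. This does not use Storm itself; `event_source` and `running_count` are hypothetical names, and the endless word stream is made-up sample data.

```python
import itertools
from collections import Counter


def event_source():
    """Hypothetical unbounded stream: yields words forever."""
    yield from itertools.cycle(["spark", "storm", "storm"])


def running_count(stream):
    """Handle each event as it arrives and emit the updated count at once."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# The stream never ends, so the demo only looks at the first five results.
first_five = list(itertools.islice(running_count(event_source()), 5))
print(first_five)
```

A batch job would wait for the complete input before producing a result; here a result is available after every single event, and only in-flight state (the counters) is held, not the data itself.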
+ | |||
+ | |||
+ | |||
+ | === Storm vs Spark? === | ||
+ | |||
+ | |||
+ | They do practically the same thing: process data. | ||
+ | |||
+ | multi-language support - Storm is better (e.g. R) | ||
+ | data sources - Spark is better (e.g. S3) | ||
+ | |||
+ |
bigdata.1588602673.txt.gz · Last modified: (external edit)