bigdata
Next revision | Previous revision | ||
bigdata [2020/05/04 14:24] – created skipidar | bigdata [2023/01/14 15:36] (current) – skipidar | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ===== BigData ===== | + | ==== BigData ==== |
+ | |||
+ | {{https:// | ||
+ | |||
+ | |||
+ | === What is MapReduce? === | ||
+ | |||
+ | MapReduce is a programming PATTERN; it covers the PROCESSING part of Hadoop. | ||
+ | |||
+ | https:// | ||
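The map, shuffle and reduce steps behind the pattern can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical, and a real cluster would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here it is a simple loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

Because every map call only sees its own record and every reduce call only sees one key, both phases can be distributed without the workers sharing state.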
+ | |||
+ | |||
+ | |||
+ | === Hadoop === | ||
+ | |||
+ | Hadoop is a framework for storing and processing huge data sets. | ||
+ | |||
+ | Hadoop is essentially a DISTRIBUTED DATA infrastructure: | ||
+ | |||
+ | Basically, Hadoop can be divided into two parts: PROCESSING and storage. | ||
+ | |||
+ | |||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is a Resilient Distributed Dataset (RDD)? === | ||
+ | |||
+ | A Resilient Distributed Dataset (RDD) is the primary data abstraction in Apache Spark. | ||
+ | |||
+ | It represents an immutable, partitioned collection of elements that can be operated on in parallel. | ||
+ | |||
+ | |||
+ | |||
+ | Resilient: fault-tolerant with the help of the RDD lineage graph, and so able to recompute partitions that are missing or damaged due to node failures. | ||
+ | |||
+ | Distributed: the data resides on multiple nodes in a cluster. | ||
+ | |||
+ | Dataset: a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects (that represent the records of the data you work with). | ||
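The three properties can be illustrated with a toy class in plain Python. This is only a sketch under stated assumptions, not Spark's API: `ToyRDD`, `lose_partition` and `get_partition` are hypothetical names, everything runs in one process, and the toy computes eagerly whereas Spark evaluates transformations lazily.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only partitions plus a lineage record."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = [tuple(p) for p in partitions]  # tuples are read-only
        self._parent = parent  # lineage: the dataset this one was derived from
        self._fn = fn          # lineage: the per-element function that was applied

    def map(self, fn):
        # A transformation never mutates; it returns a new dataset that remembers its parent.
        new_parts = [tuple(fn(x) for x in part) for part in self._partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)

    def lose_partition(self, index):
        # Simulates a node failure by dropping one partition (hook for this demo only).
        self._partitions[index] = None

    def get_partition(self, index):
        if self._partitions[index] is None:
            if self._parent is None:
                raise RuntimeError("source partition lost and no lineage to recompute from")
            # Recompute the missing partition from the parent via the recorded function.
            parent_part = self._parent.get_partition(index)
            self._partitions[index] = tuple(self._fn(x) for x in parent_part)
        return self._partitions[index]

    def collect(self):
        return [x for i in range(len(self._partitions)) for x in self.get_partition(i)]


base = ToyRDD([[1, 2], [3, 4]])        # two partitions
squared = base.map(lambda x: x * x)
squared.lose_partition(1)              # pretend the node holding partition 1 died
result = squared.collect()
print(result)
```

Only the lost partition is recomputed; the surviving partition is reused as it is, which is the point of tracking lineage per partition.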
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is Apache Spark? === | ||
+ | |||
+ | Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive queries and graph processing. | ||
+ | |||
+ | One of Spark' | ||
+ | |||
+ | If you are interested in, for example, executing a Hadoop MapReduce job much faster, Spark is a great option (although memory requirements must be considered). | ||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === Presto === | ||
+ | |||
+ | Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the ground up for fast analytic queries against data of any size. It supports both non-relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase, and relational data sources such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata. | ||
+ | |||
+ | Presto can query data where it is stored, without needing to move data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === Hadoop vs. Spark? What are the differences? === | ||
+ | |||
+ | Spark can run on top of a Hadoop cluster. | ||
+ | Spark can replace MapReduce. | ||
+ | |||
+ | Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. | ||
+ | |||
+ | Hadoop is essentially a DISTRIBUTED DATA infrastructure: | ||
+ | |||
+ | Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; | ||
+ | Spark only competes with the MapReduce part of Hadoop. | ||
+ | Spark is generally a lot faster than MapReduce. | ||
+ | |||
+ | https:// | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | === What is Apache Storm? === | ||
+ | |||
+ | Storm is a competitor of Spark. | ||
+ | |||
+ | Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. | ||
+ | |||
+ | Apache Storm is NOT a database. | ||
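Processing an unbounded stream, one event at a time, can be sketched with Python generators. This is not Storm's API; it only borrows Storm's terms "spout" (a stream source) and "bolt" (a processing step). The names `sensor_spout` and `counting_bolt` and the generated events are hypothetical.

```python
import itertools
from collections import Counter


def sensor_spout():
    """Unbounded source: yields events forever (made-up demo data)."""
    for i in itertools.count():
        yield "error" if i % 3 == 0 else "ok"


def counting_bolt(events):
    """Handles one event at a time and emits the running counts after each one."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield dict(counts)


stream = counting_bolt(sensor_spout())
# The stream never ends, so the demo only looks at the first six results.
first_six = list(itertools.islice(stream, 6))
latest = first_six[-1]
print(latest)
```

Unlike a batch job, there is no final result here: each incoming event produces an updated answer, and the consumer decides when to stop reading.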
+ | |||
+ | |||
+ | |||
+ | === Storm vs Spark? === | ||
+ | |||
+ | |||
+ | They do practically the same thing: processing data. | ||
+ | |||
+ | multi-language support - Storm is better (like R) | ||
+ | data sources - Spark is better (like S3) | ||
bigdata.1588602257.txt.gz · Last modified: (external edit)