====== Big Data ======

==== Capabilities of Big Data and directions to work in: offensive, defensive ====

See https://medium.com/@willemkoenders/offensive-vs-defensive-data-strategy-do-you-really-need-to-choose-c04f0387dbc3

{{https://s3.eu-central-1.amazonaws.com/alf-digital-wiki-pics/sharex/JmrwzIff4L.png}}

==== Analytics on AWS ====

https://docs.aws.amazon.com/whitepapers/latest/derive-insights-from-aws-modern-data/why-use-aws-for-modern-data-analytics.html

{{https://s3.eu-central-1.amazonaws.com/alf-digital-wiki-pics/sharex/chrome_uU24jKWRjo.png?550x230}}

=== Video about Big Data ===

A nice overview of the UI of the available Big Data tools:

{{youtube>HeJ15SpR66w?medium}}

https://www.youtube.com/watch?v=HeJ15SpR66w

Here is the title picture again:

{{https://s3.eu-central-1.amazonaws.com/alf-digital-wiki-pics/sharex/h3FTl7CReU.png?300x150}}

  * Query S3 data with Athena
  * Glue Job to **transform** data
  * Query S3 data with Redshift Spectrum
  * Query Postgres data with Redshift Federated Query
  * Query data in S3, RDS (Postgres) and Redshift together
  * Create Materialized View with Federated Query in Redshift

=== Structure of a data pipeline ===

An even better overview of tools for data pipelines:

{{youtube>tykcCf-Zz1M?medium}}

https://www.youtube.com/watch?v=tykcCf-Zz1M

{{https://s3.eu-central-1.amazonaws.com/alf-digital-wiki-pics/sharex/ZVH0Q9dQMC.png?600x300}}

  * Data Source >
  * Data Ingestion >
  * Raw **Storage** >
  * Business rules transformation, consolidation (Glue, EMR)
  * Processed Zone

===== Comparison of services for moving big data =====

As generated by ChatGPT.

^ Parameter ^ AWS Kinesis Firehose ^ AWS Glue Service ^ AWS EMR ^ AWS Athena ^ Apache Flink ^
| Purpose | Real-time data ingestion and transformation for data streams. | ETL and data preparation for analytics and warehousing. | Managed big data processing with Hadoop and Spark. | Serverless SQL query service for data in Amazon S3. | Stream processing for real-time data applications. |
| Pricing Model | Pay-as-you-go | DPU-based | Instance-based | Per amount of data scanned by queries | Infrastructure costs |
| Data Processing and Integration | Real-time data streaming and transformation | ETL, data preparation | Big data processing | SQL query service | Stream processing |
| Data Sources | AWS services, cloud apps | Databases, data lakes, APIs | Various sources | Amazon S3 | Multiple sources |
| Integration and Output | AWS services, S3, Redshift, Elasticsearch, etc. | AWS services, data warehouses | Various AWS services | Amazon S3, export | Multiple data sinks |
| Data Catalog and Metadata Management | None | AWS Glue Data Catalog | Integration with AWS Glue | AWS Glue Data Catalog | External tools may be required |

**Pricing Model**:

  * **AWS Kinesis Firehose**: Charges are based on the volume of data ingested; storage at the destination is billed separately by the destination service. It is a straightforward pay-as-you-go model.
  * **AWS Glue Service**: Charges are based on the Data Processing Units (DPUs) consumed while ETL jobs run (billed per DPU-hour), which makes the pricing granular and job-specific.
  * **AWS EMR**: EMR follows an instance-based pricing model: you pay for the EC2 instances you use, plus an EMR charge on top of them. You can choose On-Demand, Reserved, or Spot Instances for cost optimization.
  * **AWS Athena**: Athena is a serverless query service that charges for the amount of data scanned by each query.
  * **Apache Flink**: Flink is open source, so the software itself has no direct cost, but you pay for the infrastructure it runs on.

**Data Processing and Integration**:

  * **AWS Kinesis Firehose**: Designed for real-time data streaming, with optional transformation of the data before it is loaded into data stores or analytics tools.
  * **AWS Glue Service**: An ETL (Extract, Transform, Load) service for preparing and transforming data for analytics, data warehousing, or other data stores.
  * **AWS EMR**: A managed Hadoop and Spark service for big data processing and analysis.
  * **AWS Athena**: A serverless query service for analyzing data stored in Amazon S3, which makes it suitable for ad-hoc SQL queries.
  * **Apache Flink**: A stream processing framework that handles real-time data processing and complex event-driven applications.

**Data Sources**:

  * **AWS Kinesis Firehose**: Primarily ingests data from AWS services such as Kinesis Data Streams and from cloud applications.
  * **AWS Glue Service**: Connects to a wide range of data sources, including databases, data lakes, and APIs.
  * **AWS EMR**: Processes data from various sources, including S3, HDFS, and real-time data from Kinesis or Kafka.
  * **AWS Athena**: Queries data in Amazon S3.
  * **Apache Flink**: Processes data from various sources, including Kafka, Kinesis, or custom connectors.

**Integration and Output**:

  * **AWS Kinesis Firehose**: Integrates with various AWS services and can load data into Amazon S3, Redshift, Elasticsearch (now Amazon OpenSearch Service), and more.
  * **AWS Glue Service**: Can load data into a variety of AWS services and data warehouses, and you can define custom ETL jobs.
  * **AWS EMR**: Processes data and stores the results in various AWS services or your preferred storage.
  * **AWS Athena**: Mainly used for querying data in S3; query results are written back to S3 and can be exported in different formats.
  * **Apache Flink**: Offers flexible processing and output options and integrates with different data sinks.

**Data Catalog and Metadata Management**:

  * **AWS Kinesis Firehose**: Does not provide built-in metadata management or a data catalog.
  * **AWS Glue Service**: Includes a data catalog that tracks metadata and provides a centralized repository for schema information.
  * **AWS EMR**: Does not include a data catalog itself, but can be integrated with services like AWS Glue for metadata management.
  * **AWS Athena**: Uses the AWS Glue Data Catalog for metadata management.
  * **Apache Flink**: Does not offer native metadata management; external tools may be required.

In summary, these services cover different aspects of data processing and analytics, and their suitability depends on your specific use case and requirements. Consider your data sources, processing needs, and desired output formats when choosing a service or a combination of services for your architecture.
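
To make the pricing models compared above more concrete, here is a minimal Python sketch of how each bill scales with usage. It is an illustration under stated assumptions, not a pricing calculator: the ''Rates'' class, the function names, and the numbers in the usage example are hypothetical, and real prices must be taken from the current AWS pricing pages. Apache Flink is not listed separately because its cost is the cost of whatever infrastructure it runs on, which follows the same instance-hour shape as EMR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices supplied by the caller (hypothetical container, no defaults)."""
    firehose_per_gb_ingested: float
    glue_per_dpu_hour: float
    emr_per_instance_hour: float  # assumption: EC2 price and EMR charge combined
    athena_per_tb_scanned: float


def firehose_cost(gb_ingested: float, rates: Rates) -> float:
    """Pay-as-you-go: cost grows with the volume of data ingested."""
    return gb_ingested * rates.firehose_per_gb_ingested


def glue_cost(dpus: int, hours: float, rates: Rates) -> float:
    """DPU-based: cost grows with DPUs multiplied by job runtime."""
    return dpus * hours * rates.glue_per_dpu_hour


def emr_cost(instances: int, hours: float, rates: Rates) -> float:
    """Instance-based: cost grows with cluster size multiplied by uptime."""
    return instances * hours * rates.emr_per_instance_hour


def athena_cost(tb_scanned: float, rates: Rates) -> float:
    """Scan-based: cost grows with the data scanned by the queries."""
    return tb_scanned * rates.athena_per_tb_scanned


if __name__ == "__main__":
    # Placeholder rates for illustration only, NOT real AWS prices.
    example = Rates(
        firehose_per_gb_ingested=0.5,
        glue_per_dpu_hour=0.5,
        emr_per_instance_hour=0.25,
        athena_per_tb_scanned=4.0,
    )
    print("Firehose:", firehose_cost(10, example))
    print("Glue:    ", glue_cost(10, 2, example))
    print("EMR:     ", emr_cost(4, 3, example))
    print("Athena:  ", athena_cost(0.5, example))
```

The point of the sketch is the cost driver of each model: Firehose and Athena scale with data volume, whereas Glue and EMR scale with compute capacity multiplied by runtime, so an idle EMR cluster still costs money while an unused Athena setup does not.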