Introduction to Big Data: Big Data - Beyond the Hype, Big Data Skills and Sources of Big Data, Big Data Adoption, Research and Changing Nature of Data Repositories, Data Sharing and Reuse Practices and Their Implications for Repository Data Curation
Hadoop: Introduction to Big Data programming with Hadoop, The Hadoop ecosystem and stack, The Hadoop Distributed File System (HDFS), Components of Hadoop, Design of HDFS, Java interfaces to HDFS, Architecture overview, Development Environment, Hadoop distribution and basic commands, Eclipse development, The HDFS command line and web interfaces, The HDFS Java API (lab), Analyzing the Data with Hadoop, Scaling Out, Hadoop event stream processing, complex event processing, MapReduce Introduction, Developing a MapReduce Application, How MapReduce Works, Anatomy of a MapReduce Job Run, Failures, Job Scheduling, Shuffle and Sort, Task execution, MapReduce Types and Formats, MapReduce Features, Real-World MapReduce
Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification, Cluster Setup and Installation, Hadoop Configuration, Security in Hadoop, Administering Hadoop, HDFS – Monitoring & Maintenance, Hadoop benchmarks
Apache Airflow: Introduction to Data warehousing and Data lakes, Designing Data warehousing for an ETL Data Pipeline, Designing Data Lakes for an ETL Data Pipeline, ETL vs ELT
Introduction to Hive, Programming with Hive: Data warehouse system for Hadoop, Optimizing with Combiners and Partitioners (lab), Bucketing, more common algorithms: sorting, indexing and searching (lab), Relational manipulation: map-side and reduce-side joins (lab), evolution, purpose and use, Case Studies on Ingestion and warehousing
HBase: Overview, comparison and architecture, Java client API, CRUD operations and security
Apache Spark: APIs for large-scale data processing: Overview, Linking with Spark, Initializing Spark, Resilient Distributed Datasets (RDDs), External Datasets, RDD Operations, Passing Functions to Spark, Job optimization, Working with Key-Value Pairs, Shuffle operations, RDD Persistence, Removing Data, Shared Variables, EDA using PySpark, Deploying to a Cluster, Spark Streaming, Spark MLlib and ML APIs, Spark DataFrames/Spark SQL, Integration of Spark and Kafka, Setting up Kafka Producer and Consumer, Kafka Connect API, MapReduce, Connecting DBs with Spark