Introduction to Big Data: Big Data - Beyond The Hype, Big Data Skills And Sources Of Big Data, Big Data Adoption, Research And Changing Nature Of Data Repositories, Data Sharing And Reuse Practices And Their Implications For Repository Data Curation, Overlooked And Overrated Data Sharing, Data Curation Services In Action, Open Exit: Reaching The End Of The Data Life Cycle, The Current State Of Meta-Repositories For Data, Curation Of Scientific Data At Risk Of Loss: Data Rescue And Dissemination
Hadoop: Introduction to Big Data programming with Hadoop, The ecosystem and stack, The Hadoop Distributed File System (HDFS), Components of Hadoop, Design of HDFS, Java interfaces to HDFS, Architecture overview, Development Environment, Hadoop distribution and basic commands, Eclipse development, The HDFS command line and web interfaces, The HDFS Java API (lab), Analyzing the Data with Hadoop, Scaling Out, Hadoop event stream processing, complex event processing, MapReduce Introduction, Developing a MapReduce Application, How MapReduce Works, Anatomy of a MapReduce Job Run, Failures, Job Scheduling, Shuffle and Sort, Task execution, MapReduce Types and Formats, MapReduce Features, Real-World MapReduce
Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification, Cluster Setup and Installation, Hadoop Configuration, Security in Hadoop, Administering Hadoop, HDFS – Monitoring & Maintenance, Hadoop benchmarks
Apache Airflow: Introduction to Data Warehousing and Data Lakes, Designing Data Warehousing for an ETL Data Pipeline, Designing Data Lakes for an ETL Data Pipeline, ETL vs ELT
Introduction & Programming with Hive: Data warehouse system for Hadoop, Optimizing with Combiners and Partitioners (lab), Bucketing, more common algorithms: sorting, indexing and searching (lab), Relational manipulation: map-side and reduce-side joins (lab), evolution, purpose and use, Case Studies on Ingestion and Warehousing
HBase: Overview, comparison and architecture, Java client API, CRUD operations and security
Apache Spark APIs for large-scale data processing: Overview, Linking with Spark, Initializing Spark, Resilient Distributed Datasets (RDDs), External Datasets, RDD Operations, Passing Functions to Spark, Job optimization, Working with Key-Value Pairs, Shuffle operations, RDD Persistence, Removing Data, Shared Variables, EDA using PySpark, Deploying to a Cluster, Spark Streaming, Spark MLlib and ML APIs, Spark DataFrames/Spark SQL, Integration of Spark and Kafka, Setting up Kafka Producer and Consumer, Kafka Connect API, MapReduce, Connecting Databases with Spark