Introduction to Big Data: Big Data - Beyond The Hype, Big Data Skills And
Sources Of Big Data, Big Data Adoption, Research And Changing Nature Of Data
Repositories, Data Sharing And Reuse Practices And Their Implications For
Repository Data Curation, Overlooked And Overrated Data Sharing, Data Curation
Services In Action, Open Exit: Reaching The End Of The Data Life Cycle, The
Current State Of Meta-Repositories For Data, Curation Of Scientific Data At
Risk Of Loss: Data Rescue And Dissemination
Hadoop: Introduction to Big Data programming with Hadoop, The ecosystem and stack, The
Hadoop Distributed File System (HDFS), Components of Hadoop, Design of HDFS,
Java interfaces to HDFS, Architecture overview, Development Environment, Hadoop
distribution and basic commands, Eclipse development, The HDFS command line and
web interfaces, The HDFS Java API (lab), Analyzing the Data with Hadoop,
Scaling Out, Hadoop event stream processing, complex event processing,
MapReduce Introduction, Developing a MapReduce Application, How MapReduce
Works, Anatomy of a MapReduce Job Run, Failures, Job Scheduling,
Shuffle and Sort, Task execution, MapReduce Types and Formats, MapReduce
Features, Real-World MapReduce
Hadoop Environment: Setting up a Hadoop Cluster, Cluster specification,
Cluster Setup and Installation, Hadoop Configuration, Security in Hadoop,
Administering Hadoop, HDFS Monitoring & Maintenance, Hadoop benchmarks
Apache Airflow: Introduction to Data Warehousing and Data Lakes, Designing
a Data Warehouse for an ETL Data Pipeline, Designing Data Lakes for an ETL
Data Pipeline, ETL vs ELT
Introduction to Hive, Programming with Hive: Data warehouse system for
Hadoop, Optimizing with Combiners and Partitioners (lab), Bucketing, more
common algorithms: sorting, indexing and searching (lab), Relational
manipulation: map-side and reduce-side joins (lab), evolution, purpose and
use, Case Studies on Ingestion and Warehousing
HBase: Overview, comparison and architecture, Java client API, CRUD
operations and security
Apache Spark APIs for large-scale data processing: Overview, Linking with
Spark, Initializing Spark,
Resilient Distributed Datasets (RDDs), External Datasets, RDD Operations,
Passing Functions to Spark, Job optimization, Working with Key-Value Pairs,
Shuffle operations, RDD Persistence, Removing Data, Shared Variables, EDA using
PySpark, Deploying to a Cluster, Spark Streaming, Spark MLlib and ML APIs, Spark
DataFrames/Spark SQL, Integration of Spark and Kafka, Setting up Kafka
Producer and Consumer, Kafka Connect API, MapReduce, Connecting Databases with Spark