Practical Linux Command Line for Data Engineers and Analysts
Learn to navigate Linux systems and perform essential tasks for Hadoop and Spark analytics

The advent of Linux-based analytics systems using Apache Hadoop and Spark provides scalable tools for insight and learning. As with any Unix-based platform, all essential operations can be performed from the command line, and in many situations they are performed most efficiently that way. Although "pointing and clicking" in a GUI is often preferred, graphical interfaces can be restrictive and limit functionality. A good working knowledge of the Linux command line allows many key operations to be streamlined and easily executed, and its commands and features can markedly improve the throughput of today's data analyst.
What you'll learn and how you can apply it
- Understand why the command line is still important
- Learn how to access a Linux server using the command line from Windows and Mac computers
- Understand the basic Linux filesystem layout and navigate its contents
- Learn the essential commands and tools used in a modern scalable analytics environment
- Understand the basic vi text editor commands so you can view and edit files
- Learn about ways to move data to/from Linux and Hadoop/Spark systems
- Learn how to run Hadoop and Spark applications from the command line
- Learn how to create simple scripts to automate many processes
This training course is for you because...
- You are interested in learning only the essential and useful aspects of the Linux command line.
- You want to learn how to connect to and perform useful tasks on almost any Linux server.
- You are especially interested in Hadoop/Spark clusters.
- You want hands-on experience so you can try all of the commands and examples during and after the course; a Linux Hadoop virtual machine (including a single-server instance of Hadoop/Spark and other tools) is provided.
Prerequisites
- A basic understanding of computer/server operation (processors, memory, disks, networking)
Setup Instructions
- To run the class examples, a Linux Hadoop Minimal Virtual Machine (VM) is available. The VM is a full Linux installation that can run on your laptop/desktop using VirtualBox (freely available). The VM is approximately 3.3 GB in size and can be downloaded here: https://tinyurl.com/ya69odu7.
- Installation notes are available here: https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-Install.3.txt.
- If you wish to follow along, install and test the VM at least one day before the class.
About your instructor
Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written instructional documents covering many aspects of Linux HPC (high-performance computing) and Hadoop computing. He is currently editor of the ClusterMonkey.net website and was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also an active writer and consultant to the HPC/analytics industry. His recent video tutorials and books include the Hadoop and Spark Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Segment 1: Introduction and Course Goals (10 mins)
- How to get the most out of this course
- It's 2019: why do we still need the Linux/Unix command line?
- Advantages and disadvantages of the command line
- Working with the command line in Windows, Mac, and Linux
- Safe communication using Secure Shell (SSH)
Segment 2: The Linux Hadoop Minimal Virtual Machine and the Text Terminal (15 mins)
- Using Oracle VirtualBox
- Starting the Virtual Machine
- Connecting to the VM using SSH (see the example below)
- The Linux filesystem layout
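For example, once the VM is running in VirtualBox, you connect to it from a terminal (or an SSH client such as PuTTY on Windows). The user name, address, and port below are placeholders; the actual values are given in the VM installation notes:

    # Connect to the VM over SSH (user, address, and port are examples only)
    ssh -p 2222 hands-on@127.0.0.1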
Segment 3: Basic Linux Commands (35 mins)
- What is a *nix shell?
- Basic Linux commands
- Basic shell commands
- Input/Output and pipes
- File permissions
- Process management
- Commands to access system information
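To give a flavor of this segment, a short session might look like the following (notes.txt is just an example file name):

    ls -l /etc | head -5      # list files with permissions; a pipe shows only the first 5 lines
    sort /etc/hosts | uniq    # combine commands: sort a file, then remove duplicate lines
    chmod 644 notes.txt       # set file permissions: owner read/write, everyone else read-only
    ps aux | grep sshd        # examine running processes
    df -h                     # display filesystem usage in human-readable units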
Questions (10 mins)
Break (5 mins)
Segment 4: Editing/Viewing Text Files with vi (the Visual Editor) (20 mins)
- Basic modes and navigation
- Insert/delete, copy/paste
- Search/Replace
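For reference, a few of the vi commands covered in this segment (press Esc to return to command mode before using them):

    i               enter insert mode (start typing text)
    dd              delete (cut) the current line
    yy              yank (copy) the current line
    p               paste the cut or copied line
    /pattern        search forward for "pattern"
    :%s/old/new/g   replace every "old" with "new" in the file
    :wq             write (save) the file and quit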
Segment 5: Moving Data to/from Your Local File System (15 mins)
- Compressing and archiving using tar and zip
- Secure copy (scp)
- Web get (wget)
- Data integrity
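As a sketch of what this segment covers (host and file names are placeholders):

    tar czvf results.tgz results/          # compress and archive a directory
    scp results.tgz user@remote-host:/tmp  # securely copy the archive to another machine
    wget https://example.com/data.csv      # pull a file from the web
    md5sum results.tgz                     # print a checksum to verify data integrity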
Segment 6: Moving Data into Hadoop HDFS (15 mins)
- What is Hadoop HDFS, and why is it different?
- Your local filesystem is not Hadoop HDFS
- Using HDFS wrapper commands
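The HDFS wrapper commands deliberately mirror their Linux counterparts, but they operate on the separate HDFS namespace. A minimal sketch (the paths are examples only):

    hdfs dfs -mkdir -p /user/hands-on/data        # make a directory in HDFS
    hdfs dfs -put data.csv /user/hands-on/data    # copy a local file into HDFS
    hdfs dfs -ls /user/hands-on/data              # list HDFS files (invisible to plain "ls")
    hdfs dfs -get /user/hands-on/data/data.csv .  # copy a file back to the local filesystem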
Segment 7: Bash Scripting Basics (20 mins)
- Creating a bash script using the following:
- Bash variables
- If-then tests
- Control structures
- Input and output
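A minimal sketch of the kind of script built in this segment; the script name and logic are illustrative, not course materials:

    #!/bin/bash
    # count-lines.sh: report the number of lines in each file given as an argument
    for f in "$@" ; do                      # control structure: loop over the arguments
        if [ -f "$f" ] ; then               # if-then test: is this a regular file?
            lines=$(wc -l < "$f")           # bash variable holding command output
            echo "$f has $lines lines"      # output to stdout
        else
            echo "skipping $f (not a regular file)" >&2   # output to stderr
        fi
    done

Run it with, for example: bash count-lines.sh /etc/hosts /etc/passwd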
Questions (10 mins)
Break (5 mins)
Segment 8: Running Command Line Analytics Tools (20 mins)
- Running/Observing a Hive job
- Running/Observing a PySpark job
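Both tools can be driven entirely from the command line. For instance (the table name and script below are placeholders, not files shipped with the VM):

    hive -e 'SELECT COUNT(*) FROM web_logs;'       # run a Hive query from the shell
    spark-submit --master local[2] wordcount.py    # submit a PySpark application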
Segment 9: Course Wrap-up and Additional Resources (5 mins)
- Remaining questions