Data science at the command line: Obtain, scrub, explore, and model data with Unix power tools. By Jeroen Janssens

By:

Janssens, Jeroen

Material type: Text

Publication details: Mumbai: Shroff Publishers & Distributors, c2021.

Edition: 2

Description: i-xxii + 257 p.

ISBN:

9789391043308

Subject(s):

Electronic data processing. Database management. Information science

DDC classification:

005.7 JAN-D


Summary: This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data.


Holdings
Item type	Current library	Home library	Call number	Status	Date due	Barcode
Book	Dept. of Computational Biology and Bioinformatics Processing Center	Dept. of Computational Biology and Bioinformatics	005.7 JAN-D	Available		DCB4020

This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

Obtain data from websites, APIs, databases, and spreadsheets
Perform scrub operations on text, CSV, HTML, XML, and JSON files
Explore data, compute descriptive statistics, and create visualizations
Manage your data science workflow
Create your own tools from one-liners and existing Python or R code
Parallelize and distribute data-intensive pipelines
Model data with dimensionality reduction, regression, and classification algorithms
Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark

Contents:

1. Introduction

Data Science Is OSEMN
Obtaining Data
Scrubbing Data
Exploring Data
Modeling Data
Interpreting Data
Intermezzo Chapters
What Is the Command Line?
Why Data Science at the Command Line?
The Command Line Is Agile
The Command Line Is Augmenting
The Command Line Is Scalable
The Command Line Is Extensible
The Command Line Is Ubiquitous
Summary
For Further Exploration

2. Getting Started

Getting the Data
Installing the Docker Image
Essential Unix Concepts
The Environment
Executing a Command-Line Tool
Five Types of Command-Line Tools
Combining Command-Line Tools
Redirecting Input and Output
Working with Files and Directories
Managing Output
Help!
Summary
For Further Exploration

3. Obtaining Data

Overview
Copying Local Files to the Docker Container
Downloading from the Internet
Introducing curl
Saving
Other Protocols
Following Redirects
Decompressing Files
Converting Microsoft Excel Spreadsheets to CSV
Querying Relational Databases
Calling Web APIs
Authentication
Streaming APIs
Summary
For Further Exploration

4. Creating Command-Line Tools

Overview
Converting One-Liners into Shell Scripts
Step 1: Create a File
Step 2: Give Permission to Execute
Step 3: Define a Shebang
Step 4: Remove the Fixed Input
Step 5: Add Arguments
Step 6: Extend Your PATH
Creating Command-Line Tools with Python and R
Porting the Shell Script
Processing Streaming Data from Standard Input
Summary
For Further Exploration

5. Scrubbing Data

Overview
Transformations, Transformations Everywhere
Plain Text
Filtering Lines
Extracting Values
Replacing and Deleting Values
CSV
Bodies and Headers and Columns, Oh My!
Performing SQL Queries on CSV
Extracting and Reordering Columns
Filtering Rows
Merging Columns
Combining Multiple CSV Files
Working with XML/HTML and JSON
Summary
For Further Exploration

6. Project Management with Make

Overview
Introducing Make
Running Tasks
Building, for Real
Adding Dependencies
Summary
For Further Exploration

7. Exploring Data

Overview
Inspecting Data and Its Properties
Header or Not, Here I Come
Inspect All the Data
Feature Names and Data Types
Unique Identifiers, Continuous Variables, and Factors
Computing Descriptive Statistics
Column Statistics
R One-Liners on the Shell
Creating Visualizations
Displaying Images from the Command Line
Plotting in a Rush
Creating Bar Charts
Creating Histograms
Creating Density Plots
Happy Little Accidents
Creating Scatter Plots
Creating Trend Lines
Creating Box Plots
Adding Labels
Going Beyond Basic Plots
Summary
For Further Exploration

8. Parallel Pipelines

Overview
Serial Processing
Looping Over Numbers
Looping Over Lines
Looping Over Files
Parallel Processing
Introducing GNU Parallel
Specifying Input
Controlling the Number of Concurrent Jobs
Logging and Output
Creating Parallel Tools
Distributed Processing
Get List of Running AWS EC2 Instances
Running Commands on Remote Machines
Distributing Local Data Among Remote Machines
Processing Files on Remote Machines
Summary
For Further Exploration

9. Modeling Data

Overview
More Wine, Please!
Dimensionality Reduction with Tapkee
Introducing Tapkee
Linear and Nonlinear Mappings
Regression with Vowpal Wabbit
Preparing the Data
Training the Model
Testing the Model
Classification with SciKit-Learn Laboratory
Preparing the Data
Running the Experiment
Parsing the Results
Summary
For Further Exploration

10. Polyglot Data Science

Overview
Jupyter
Python
R
RStudio
Apache Spark
