Bash shell scripting and data science

In summary, Bash shell scripting helps in data science by allowing you to automate everything. It is especially useful if you have a master computer and a bunch of slaves.
  • #1
EngWiPy
Hello,

I've noticed that a number of employers ask for knowledge of Linux and shell scripting for data science positions. How does Bash shell scripting help in the field of data science, or, more generally, why learn it?

Thanks
 
  • #2
It allows you to automate everything. If you need to analyze data, you need to write a script to do it for you. You also often need to do things like backups, installations, maintenance, crons...

I could destroy my computer, do a fresh Linux install, then run ./install_mycompany_dev_env.php, and literally everything else would be installed automatically, with all my repositories checked out and everything customized the way I like it.

It's especially useful if you have a master computer and a bunch of slaves. Often I'll use a slave once, then destroy it and just create a fresh one because it's so easy.

I don't necessarily use Bash, though. Personally, I do almost everything with PHP, but sometimes I have to use Bash or Expect, so you should learn all three.
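To make the "backups" item concrete, here is a minimal sketch of a backup helper, not the poster's actual script. The function name backup_dir and the example paths are hypothetical; the only tools assumed are tar, date, and mktemp-style standard utilities.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive one directory into a date-stamped tarball.
set -eu

backup_dir() {
    local src="$1"      # directory to back up
    local dest="$2"     # directory that receives the archive
    local stamp archive
    stamp="$(date +%Y-%m-%d)"
    archive="$dest/$(basename "$src")-$stamp.tar.gz"

    mkdir -p "$dest"
    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
    echo "$archive"     # print the archive path so callers can capture it
}

# Example call (placeholder paths):
# backup_dir "$HOME/projects/analysis" "$HOME/backups"
```

Once something like this works by hand, the same call can be scheduled with cron so the backup happens without anyone remembering to run it.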
 
  • #3
If you are using the free R language (old but quite popular in all of biology or any field of statistics), you'll need some shell scripting familiarity. I was surprised to find that R books are not under programming in the book stores, but instead are under statistics. Still, must-learn for any big data analysis these days.
 
  • #4
harborsparrow said:
If you are using the free R language (old but quite popular in all of biology or any field of statistics), you'll need some shell scripting familiarity. I was surprised to find that R books are not under programming in the book stores, but instead are under statistics. Still, must-learn for any big data analysis these days.

I have used R but didn't use shell. Why do I need it with R? Interesting that R is not under programming but statistics, maybe because it was developed particularly for statistical analysis.
 
  • #5
If you are learning Linux or Unix, one of the first things you learn is the shell and its scripting language. There are many shells: bash, sh, csh, zsh. Shells, pipelines, and Unix commands are the glue that binds Linux together.
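As a small illustration of the "glue" idea, the sketch below chains standard Unix commands into a frequency table for one CSV column. The sample data and the function name top_values are made up for the example, and it assumes a simple CSV with no quoted commas.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical sample data; a real file would come from elsewhere.
data="$(mktemp)"
cat > "$data" <<'EOF'
id,species
1,setosa
2,virginica
3,setosa
4,versicolor
5,setosa
6,virginica
EOF

# top_values FILE COLUMN: frequency table for one CSV column, most common first.
top_values() {
    tail -n +2 "$1" |      # skip the header row
        cut -d, -f"$2" |   # keep only the requested column
        sort |             # group identical values together
        uniq -c |          # count each group
        sort -rn           # largest count first
}

top_values "$data" 2
```

Each command does one small job, and the pipe passes its output to the next; that composition is what most shell-based data work looks like in practice.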
 
  • #6
S_David said:
Why do I need it with R?

When working with R, you will need to do things at the command line, such as tell R to import data from a spreadsheet (and do something with it). R commands are a shell-script-like language. It is all command line, at least, the free versions are.
 
  • #7
harborsparrow said:
Still, must-learn for any big data analysis these days.
I respectfully disagree... Of course people can try to get accustomed to it (it will probably overtake Java, and it has a good growth rate), but the leader is (and will be for at least the next few years) Python.
https://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
I guess R gains its popularity from people outside the "programming-community".
 
  • #8
I think we’ll start to see a lot more language mixing in the future. I often use PHP for data analysis. I use it because clustering is really easy and it’s easy to talk to the operating system. The downside is that it’s very slow at certain types of tasks and is single-threaded. However, it can talk to any library with a standard C binding, so I write my data analysis in C++ (or assembly) to take advantage of low-level speed enhancements and threading, and use PHP as the processing manager.

I expect this type of mixing will become more and more common. This is because IDEs are getting smarter and can handle projects that have multiple languages now. I used to have custom shell scripts to compile stuff in various languages but now I have all of that managed for me.
 
  • #9
harborsparrow said:
When working with R, you will need to do things at the command line, such as tell R to import data from a spreadsheet (and do something with it). R commands are a shell-script-like language. It is all command line, at least, the free versions are.

You can run R for free inside Jupyter Notebooks. I've done it.

There's a big push in analytics to do more and more basic work (and share said work) in Jupyter notebooks, with some recently released cloud-integrated products from Amazon and Google. The R crowd is not well integrated with the rest of the programming world and generally doesn't use Jupyter notebooks, from what I've seen. (RStudio, etc., is in many ways better, but it doesn't play well with other languages and further isolates users of the language.) The comment that you won't find R books in the programming section is an almost poetic encapsulation of this.
- - - -
If you look at the big products / platforms backed by Google, Facebook etc. (think TensorFlow, PyTorch, and others), you see Python everywhere. Possibly C++. I haven't seen an official R release. This gives you a sense of what the industry pushes.

You can use R if you want but the idea that you must know it seems like attribute substitution (i.e. someone conflating their personal feelings with the facts on the ground). It has very powerful visualization libraries that people like in particular. If you don't like it, I wouldn't worry too much about it.
 
  • #10
I have worked with a lot of shell languages and they have gotten better over time. They allow you to automate a series of commands that you might otherwise spend days entering by hand and then would have to review several times to make sure you didn't make one mistake. You must have some sort of shell language to do that. Bash is a good one on unix/linux machines. Shell scripts are also a good way to record a tricky series of operations so that you can do them later or pass them on to others. You can nurse a process through by hand, finally get it right, and then use the history file to make an edited script of the sequence of commands that finally worked.

I switched to the Perl language for most of my scripting and never looked back. So did many Unix system administrators (I am not an administrator). In one language, it allows you to easily run shell commands or utilities, capture and parse the output of the commands, process the data, and print reports of the results. I have used it in programs that produced and processed literally hundreds of thousands of data files. Some of the processes took several days, running around the clock. That being said about Perl, if you work a lot with others who prefer Bash or with existing legacy Bash code, then you need to know the Bash shell well.
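The same run, capture, parse, report pattern can be sketched in Bash itself. This is only an illustration, not the poster's code: the function name summarize_dir is hypothetical, and it assumes every CSV has one header row and ends with a newline.

```bash
#!/usr/bin/env bash
set -eu

# summarize_dir DIR: print "<file> <rows>" for every CSV in DIR, then a total.
summarize_dir() {
    local dir="$1" total=0 rows file
    for file in "$dir"/*.csv; do
        [ -e "$file" ] || continue      # no matches: the glob stays literal
        # Capture the command's output, then parse it (data rows = lines - header).
        rows="$(wc -l < "$file")"
        rows=$(( rows - 1 ))
        printf '%s %d\n' "$(basename "$file")" "$rows"
        total=$(( total + rows ))
    done
    printf 'TOTAL %d\n' "$total"
}

# Example call (placeholder path):
# summarize_dir "$HOME/data/exports"
```

The loop body is the part you would otherwise type by hand for every file; putting it in a script is what makes a run over thousands of files repeatable.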
 

What is Bash shell scripting and how is it used in data science?

Bash is a Unix shell and command language; a Bash script is a file of shell commands that automates tasks on a Unix or Linux operating system. In data science it is commonly used to move, filter, and reshape large datasets with command-line tools, and to automate data processing and analysis pipelines.

What are the benefits of using Bash shell scripting in data science?

Using Bash shell scripting in data science offers several benefits, such as increased efficiency, reproducibility, and scalability. It allows for the automation of repetitive tasks, making it easier to work with large datasets. Additionally, scripts can be easily shared and reproduced, ensuring consistent results.

What are some common tools and libraries used in Bash shell scripting for data science?

Common command-line tools used from Bash scripts for data science include awk, sed, and grep for text processing, and jq for working with JSON data. Pandas, NumPy, and SciPy are Python libraries rather than Bash tools; a Bash script typically uses them indirectly, by invoking a Python program that imports them.
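A short sketch of the three text-processing tools on a tiny, made-up CSV follows. The function name column_mean and the sample rows are hypothetical, and the example assumes plain comma-separated fields with no quoting.

```bash
#!/usr/bin/env bash
set -eu

# column_mean FILE COLUMN: mean of a numeric CSV column, ignoring the header.
column_mean() {
    awk -F, -v col="$2" '
        NR > 1 && $col != "" { sum += $col; n++ }   # skip header and empty cells
        END { if (n > 0) printf "%.2f\n", sum / n }
    ' "$1"
}

# Hypothetical sample data.
data="$(mktemp)"
printf 'name,score\nann,90\nbob,\ncat,75\nANN,60\n' > "$data"

column_mean "$data" 2            # awk: aggregate a column
grep -ci '^ann,' "$data"         # grep: count rows whose name is "ann", any case
sed 's/,$/,NA/' "$data"          # sed: mark an empty trailing cell as NA
```

awk handles the arithmetic, grep the filtering, and sed the in-stream substitution; none of them loads the whole file into memory, which is why they scale to large inputs.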

Can Bash shell scripting be used for machine learning and other advanced data science techniques?

While Bash shell scripting is not typically used for complex machine learning tasks, it can be used for data preprocessing and cleaning tasks, as well as for automating the execution of machine learning algorithms. Additionally, Bash can be used in conjunction with other programming languages or tools for more advanced data science techniques.
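The preprocessing role described above might look like the following sketch. The function name clean_csv and the cleaning rules (strip carriage returns, drop malformed or incomplete rows, drop exact duplicates) are assumptions chosen for illustration, not a standard recipe.

```bash
#!/usr/bin/env bash
set -eu

# clean_csv FILE: write a cleaned copy of FILE to standard output.
clean_csv() {
    tr -d '\r' < "$1" |      # strip Windows carriage returns
        awk -F, '
            NR == 1 { ncol = NF; print; next }                 # keep header, remember width
            NF != ncol { next }                                # drop malformed rows
            { for (i = 1; i <= NF; i++) if ($i == "") next }   # drop rows with empty cells
            !seen[$0]++                                        # print each remaining row once
        '
}

# Example call (placeholder file names):
# clean_csv raw.csv > clean.csv
```

A driver script would then hand the cleaned file to whatever program does the modeling, for example a Python training script, so Bash orchestrates the steps while the heavy computation happens elsewhere.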

Are there any drawbacks to using Bash shell scripting in data science?

One potential drawback to using Bash shell scripting in data science is that it may not be as powerful or versatile as other programming languages, such as Python or R. It also requires some knowledge of Unix or Linux command line tools and can be less intuitive for those unfamiliar with these systems. However, for data manipulation and automation tasks, Bash can be a useful and efficient tool for data scientists.
