Bash shell scripting and data science

AI Thread Summary
Bash shell scripting is essential for data science as it automates tasks like data analysis, backups, and environment setups, making processes more efficient. Familiarity with shell scripting is particularly beneficial when using programming languages like R, which often requires command-line interactions for data import and manipulation. While Python is currently the dominant language in data science, R remains popular for its powerful visualization capabilities, despite its limited integration with other programming environments. Many users advocate for a mix of languages, including PHP and Perl, to leverage their strengths in data processing and automation. Overall, mastering shell scripting enhances productivity and facilitates collaboration in data science projects.
EngWiPy
Hello,

I've noticed that a number of employers ask for knowledge of Linux and shell scripting for data science positions. How does bash shell scripting help in the field of data science, or more generally, why learn it?

Thanks
 
It allows you to automate everything. If you need to analyze data, you need to write a script to do it for you. You also often need to do things like backups, installations, maintenance, cron jobs, and so on.
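As a rough illustration of the backup case mentioned above, here is a minimal sketch of a bash function that archives a directory under a timestamped name. The function name `backup_dir` and the paths in the usage note are made up for the example; nothing here comes from the poster's actual setup.

```shell
# backup_dir SRC DEST: archive the directory SRC into DEST as a
# timestamped .tar.gz file and print the path of the new archive.
# (Hypothetical helper, written only for illustration.)
backup_dir() {
    local src="$1" dest="$2"
    local stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="${dest}/$(basename "$src")-${stamp}.tar.gz"
    mkdir -p "$dest"
    # -C switches to the parent directory so the archive stores a relative path.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
    printf '%s\n' "$archive"
}
```

A call such as `backup_dir "$HOME/projects/analysis" /mnt/backups` (both paths are placeholders) could be saved in a script and scheduled with a crontab entry like `0 2 * * * /home/me/bin/backup.sh` to run every night at 02:00.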

I could wipe my computer, do a fresh Linux install, then run ./install_mycompany_dev_env.php, and literally everything else would be installed automatically, all my repositories checked out, and the system customized the way I like it.

It's especially useful if you have a master computer and a bunch of slaves. Often I'll use a slave once, then destroy it and just create a fresh one because it's so easy.

I don't necessarily use bash, though. Personally, I do almost everything with PHP, but sometimes I have to use bash or expect, so you should learn all three.
 
If you are using the free R language (old but quite popular in all of biology or any field of statistics), you'll need some shell scripting familiarity. I was surprised to find that R books are not under programming in the book stores, but instead are under statistics. Still, must-learn for any big data analysis these days.
 
harborsparrow said:
If you are using the free R language (old but quite popular in all of biology or any field of statistics), you'll need some shell scripting familiarity. I was surprised to find that R books are not under programming in the book stores, but instead are under statistics. Still, must-learn for any big data analysis these days.

I have used R but never needed the shell. Why do I need it with R? It's interesting that R is shelved under statistics rather than programming, maybe because it was developed specifically for statistical analysis.
 
If you are learning Linux or Unix, one of the first things you learn is shells and scripting languages. There are many shells: bash, sh, csh, and Z shell, among others. Shells, pipelines, and Unix commands are the glue that binds Linux together.
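To make the "glue" idea concrete, here is a small sketch of a pipeline that chains standard Unix commands to answer a typical data question: which values occur most often in one column of a CSV file. The function name `top_values` is invented for the example, and it assumes a simple comma-separated file with a header row and no quoted commas.

```shell
# top_values FILE COLUMN [N]: print the N most frequent values (default 5)
# in the given 1-based column of a simple comma-separated file.
# Assumes a header row and no commas inside quoted fields.
top_values() {
    local file="$1" column="$2" n="${3:-5}"
    tail -n +2 "$file" |        # skip the header row
        cut -d, -f"$column" |   # keep only the requested column
        sort |                  # bring identical values together
        uniq -c |               # count each run of identical values
        sort -rn |              # highest count first
        head -n "$n"            # keep the top N
}
```

Each command does one small job and the pipe passes its output to the next. For example, `top_values measurements.csv 3 10` (the file name is a placeholder) would list the ten most common values in the third column, each preceded by its count.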
 
S_David said:
Why do I need it with R?

When working with R, you will need to do things at the command line, such as tell R to import data from a spreadsheet (and do something with it). R commands are a shell-script-like language. It is all command line, at least, the free versions are.
 
harborsparrow said:
Still, must-learn for any big data analysis these days.
I respectfully disagree. Of course people can try to get accustomed to it (it will probably overtake Java, and it has a good growth rate), but the main leader is (and will be for at least the next few years) Python.
https://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
I guess R gains its popularity from people outside the programming community.
 
I think we’ll start to see a lot more language mixing in the future. I often use PHP for data analysis. I use it because clustering is really easy and it’s easy to talk to the operating system. The downside is that it’s very slow at certain types of tasks and is single-threaded. However, it can talk to any library with a standard C binding, so I write my data analysis in C++ (or assembly) to take advantage of low-level speed enhancements and threading, and use PHP as the processing manager.

I expect this type of mixing will become more and more common. This is because IDEs are getting smarter and can handle projects that have multiple languages now. I used to have custom shell scripts to compile stuff in various languages but now I have all of that managed for me.
 
harborsparrow said:
When working with R, you will need to do things at the command line, such as tell R to import data from a spreadsheet (and do something with it). R commands are a shell-script-like language. It is all command line, at least, the free versions are.

You can run R for free inside Jupyter Notebooks. I've done it.

There's a big push in analytics to do more and more basic work (and share said work) in Jupyter notebooks, with some recently released cloud-integrated products from Amazon and Google. The R crowd is not well integrated with the rest of the programming world and generally doesn't use Jupyter notebooks, from what I've seen. (RStudio, etc. is in many ways better, but it doesn't play well with other languages and further isolates users of the language.) The comment that you won't find R books in the programming section is an almost poetic encapsulation of this.
- - - -
If you look at the big products / platforms backed by Google, Facebook etc. (think TensorFlow, PyTorch, and others), you see Python everywhere. Possibly C++. I haven't seen an official R release. This gives you a sense of what the industry pushes.

You can use R if you want but the idea that you must know it seems like attribute substitution (i.e. someone conflating their personal feelings with the facts on the ground). It has very powerful visualization libraries that people like in particular. If you don't like it, I wouldn't worry too much about it.
 
I have worked with a lot of shell languages and they have gotten better over time. They allow you to automate a series of commands that you might otherwise spend days entering by hand and then would have to review several times to make sure you didn't make one mistake. You must have some sort of shell language to do that. Bash is a good one on unix/linux machines. Shell scripts are also a good way to record a tricky series of operations so that you can do them later or pass them on to others. You can nurse a process through by hand, finally get it right, and then use the history file to make an edited script of the sequence of commands that finally worked.
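As one possible sketch of replacing hand-typed, repeated commands with a script, the function below runs the same step on every CSV file in a directory. The per-file step (counting data rows) is only a stand-in for whatever command sequence finally worked at the prompt, and the name `summarize_all` is hypothetical.

```shell
# summarize_all INDIR OUTDIR: for every .csv file in INDIR, write the
# number of data rows (header excluded) to OUTDIR/<name>.count.
# The row count is a placeholder for a real per-file analysis step.
summarize_all() {
    local indir="$1" outdir="$2"
    local file name rows
    mkdir -p "$outdir"
    for file in "$indir"/*.csv; do
        # If nothing matches, the glob stays literal; skip that case.
        [ -e "$file" ] || continue
        name="$(basename "$file" .csv)"
        rows="$(tail -n +2 "$file" | wc -l | tr -d '[:space:]')"
        printf '%s\n' "$rows" > "$outdir/$name.count"
        echo "processed $file ($rows rows)"
    done
}
```

Once the loop is written down, running it over three files or three hundred thousand is the same single command, and the script doubles as a record of exactly what was done.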

I switched to the Perl language for most of my scripting and never looked back. So did many Unix system administrators (I am not an administrator). In one language, it allows you to easily run shell commands or utilities, capture and parse the output of the commands, process the data, and print reports of the results. I have used it in programs that produced and processed literally hundreds of thousands of data files. Some of the processes took several days, running around the clock. That being said about Perl, if you work a lot with others who prefer bash, or with existing legacy bash code, then you need to know the bash shell well.
 