Job Skills Becoming a Data Scientist

  1. Feb 10, 2017 #1
    Hello all,

    I'm trying to come up with a detailed plan on how to make a transition from academia to become a data scientist in the industry. My academic background is: B.Sc in Computer Engineering, and M.Sc and PhD (and ~ 2 years postdoc experience) in Electrical Engineering/Wireless Communications.

    I have been reading about R, Python, and Machine Learning, but it seems that I'm not getting anywhere. I learn the theory/syntax and understand it, and apply the examples given, but that's it. I cannot put this on my resumé. Also, not everything in R and Python is related to data science.

    So, I need some help from the folks in the industry on what I should read (specific topics in R, Python, SQL, etc.; pointers to books and materials would be great), and on what I should do after reading the materials to prove I have some knowledge, since I don't have practical experience. Some suggested having a GitHub profile, but I'm not sure what to put in it. I guess I need to do something that hasn't been done, but I wonder how to know what that might be (excuse me if this sounds naive).

    I'm short on budget right now, and I need to do this by myself. I can do it, but I need some guidance.

    Last edited: Feb 10, 2017
  3. Feb 10, 2017 #2


    Education Advisor

    Have you thought about looking at the following? I knew of someone who finished his PhD in probability theory at the University of Toronto (without much relevant industry experience), completed this program, and is now working as a data scientist.

  4. Feb 10, 2017 #3
    I'm not in the industry, but have been looking into it for a time. I do know that if you already have a PhD (in just about anything remotely quantitative!) there are a lot of post-doc Data Science training programs out there that try to get you up to speed quickly.

    A review and some questions about it here: https://datasciencebootcamps.com/2016/07/25/data-science-bootcamp-review-insight-data-science/

    I believe there are others. I don't know what their value is, but I wanted to pass it along.

    Some say it's about having projects under your belt, a "portfolio," GitHub repository, Kaggle competitions, etc. I don't know if this is the reality or hype. There are places that offer projects for people to work on, like Data Science for Social Good.

    Again, this is advice I've been given - not advice I'm able to follow just yet. I'm finishing up my masters and then off I go.

    -Dave K
  5. Feb 10, 2017 #4
    If it's interesting, is sufficiently advanced, and you can talk to it well, I think it can really help.

    Gotta get them to it though. Craft that resume, OP.
  6. Feb 13, 2017 #5
    Thanks all. I found something in the sites you included that I could use for my self-study. I would have liked it more if I could attend these programs personally, but I'm not in the USA. I'm still considering the online option, though.

    I think I have another problem, which is that my resumé probably isn't written in a professional way. I heard that academics are terrible at selling their skills, which I think is true. My question is then: how do I write my resumé so that my skills come across as closely related to data science? I mean, I state in it that I have a good background in statistics and some knowledge of programming, but since my experience isn't directly related to large data sets, I think most hiring managers discard my resumé from the beginning as unrelated. How can I present my skills as relevant to the job in the way employers expect?
  7. Feb 13, 2017 #6


    Education Advisor

    I write often about what it takes to become a data scientist and problems academics face when making the transition. Read this:


    Secondly, GitHub and Kaggle do help. It isn't uncommon for top level companies to REQUIRE a GitHub portfolio or a Kaggle profile. I don't require it for my interviews but I sure as hell do like to see it.

    Thirdly, not many people in general have experience with big data. What's more important is being familiar with the concepts of how to process big data.

    Fourth, you don't need much experience with Python, R, SQL, etc. What you do need is experience with common libraries (scikit-learn, pandas, ggplot2, seaborn, matplotlib).

    Fifth, talk to the importance of visualization. Being able to code using D3.js and visualize meaningful data in a dynamic way makes you 10 times more attractive to me than someone who cannot.

    Lastly, you need to be familiar with Spark. If you can speak to how to use Spark 2.0 and the DataFrame pipeline, that's a relatively rare but useful skill. A lot of people know Spark is the future/present, but not enough people are good enough at it yet.

    Pro-tip if you want to be an overachiever, learn Scala and write your own custom Spark jobs :).

    And if you want to be even more pro, write Java MapReduce :p.
  8. Feb 13, 2017 #7
    Yes, you made some comments on my thread before about experience.

    Thanks for the more specific details. I have been reading a book called Machine Learning with R. In the book there are examples of large data sets and I follow through them. So, I'm learning both machine learning concepts/algorithms and R at the same time, and on large data sets. I'm also reading about Python, but I need to be more specific there. I know I need to learn SQL, but I still don't know much about it. I wasn't aware of Spark and JavaScript. I will read about them as well.

    But again, after reading all this stuff, what should I do to prove that I know it? How do I show in my resumé that I'm familiar with the concepts of how to manipulate data, for example? Also, what should I put on GitHub or Kaggle? I mean, do I need to work on some large projects, or would it be fine to make some extensions to something that already exists?
  9. Feb 14, 2017 #8


    Education Advisor

    As far as GitHub and Kaggle, I think there are two things that are important. With GitHub, I think it's important to contribute to open source projects, even in a small way. I would look for a very small bug that's open and see if it's something you can fix. Why does this matter? First, it'll show that you can read other people's code and contribute to it! Secondly, it'll show you know how to interact with a version control tool. Sounds minor but it's rather useful and somewhat rare for a data scientist. Even if all you do is improve a print statement, that's significant.

    As far as Kaggle, what's important is that you document your work and allow someone to read it. There are two types of data science interviews. The first type is the "let me ask you what conditions must hold in a cell means model for blah blah blah to be true". I hate those and find them to be useless. I believe memorizing things about statistics is pointless; you can always look things up. The second type, and probably the most common, is "detail your model building process". What Kaggle will allow you to do is provide a better answer to that because now you can draw from experience. Also, you can use an IPython notebook or any markdown file to document your process and allow an interviewer to follow along. Plus that shows you know how to document your work, code, and visualize data.

    As far as SQL, there isn't much to know. There's no need to ever use bind variables or convoluted super joins. The most you probably need is something like this:

    SELECT t1.potato_id, t1.cost FROM table1 t1, (SELECT potato_id, location FROM table2 WHERE location = 'NY') t2 WHERE t1.potato_id = t2.potato_id;

    Basically, basic query, subquery, joins, and where clauses. Nothing special.
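
    A minimal sketch of that kind of query (basic query, subquery, join, WHERE clause), run through Python's built-in sqlite3 module. The table contents and the helper name potatoes_in are made up for illustration; only the table and column names come from the query above.

```python
import sqlite3


def potatoes_in(location):
    """Return (potato_id, cost) rows for potatoes stored at `location`."""
    # In-memory database with invented sample rows (illustration only).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (potato_id INTEGER, cost REAL);
        CREATE TABLE table2 (potato_id INTEGER, location TEXT);
        INSERT INTO table1 VALUES (1, 0.50), (2, 0.75), (3, 1.20);
        INSERT INTO table2 VALUES (1, 'NY'), (2, 'CA'), (3, 'NY');
    """)
    # The subquery filters table2 by location; the join matches on potato_id.
    rows = conn.execute(
        """
        SELECT t1.potato_id, t1.cost
        FROM table1 AS t1
        JOIN (SELECT potato_id, location
              FROM table2
              WHERE location = ?) AS t2
          ON t1.potato_id = t2.potato_id
        ORDER BY t1.potato_id
        """,
        (location,),
    ).fetchall()
    conn.close()
    return rows


print(potatoes_in("NY"))  # [(1, 0.5), (3, 1.2)]
```

    The location is passed as a query parameter here only because that is the usual way to feed a value into a query from Python; a hard-coded 'NY' would return the same rows.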

    Lastly, be careful when you say large dataset; it's easy to get into pissing matches about that. When people ask whether you have worked with large data, just say "The largest I've worked on is so and so". While 500 GB sounds large to some people, I process 1-20 terabytes daily. There's nothing you can really do to prepare for handling that.

    As far as your resume: first and foremost, emphasize your work and the times you've had to explain it. Be prepared to explain it in layman's terms and not get caught up in the jargon. Second, add a section about Kaggle or even personal projects. For example, "Used Random Forest in scikit-learn to predict survivors in Titanic data. Transformed data using DictVectorizer. Validated model by method A using xyz. Obtained result abc." Obviously use a better example than Titanic. Honestly, if your resume had a link to a pretty GitHub with some markdown that explained your work, I'm sure that'll appeal to a lot of managers.

    I wouldn't be too worried about JavaScript; that's a bonus skill. It'll be super impressive if you built a D3 demo from scratch yourself that showed something you thought was interesting, but that's simply a way to get more money and not a skill that'll land you a core data science job.
  10. Feb 14, 2017 #9
    @MarneMath What about this boom of MOOCs promising a glamorous career in data science? Anyone taking them seriously as a credential? I would take them to learn the skills but I do not know if they are worth putting on a resume.

    -Dave K
  11. Feb 14, 2017 #10


    Education Advisor

    MOOCs are useful for internal transfer. If you work at my company as an engineer and want to transfer to data science, then it's useful to show that you are a self-starter and willing to learn on your own. On the other hand, if you are an engineer from an outside company and apply with just MOOCs, then I don't really care. The reason for this is that internal technical talent probably has connections, insights, and experience working with the data we are processing. That expertise is generally useful to have on a team.
  12. Feb 14, 2017 #11
    I would like to thank you very much for all of these insights and information, and for the tone in which you wrote them. I think I now have a better sense of direction about what to do. I wasn't sure what managers are looking for in a resumé and, honestly, I was suspicious about the significance of GitHub and Kaggle because all jobs emphasize real and practical experience, and only a small fraction require them (usually smaller companies). Yes, for now I just need the basics to land a job. Thanks again.
  13. Feb 14, 2017 #12
    I'd like to echo this thank you. There is a lot of hype surrounding this field right now and I am having trouble sorting out myth vs. reality.

    -Dave K
  14. Feb 17, 2017 #13
    Actually, speaking of hype, I'm curious about what MarneMath thinks about claims of a bubble in "data science" (I hate the term). I was at Carnegie Mellon for a year, arguably the Mecca of data science, and encountered a lot of glassy eyed, cultish individuals who seemed pathologically optimistic.

    Will Deep Neural Nets replace HR at long last? Will boosted everything discover quantum gravity? Can a company replace all of its programmers with an infinite stack of meta-machine learning gizmos, each of which is trained on increasingly higher order parameters and whose outputs are fed to the next, lowlier element in the line?
  15. Feb 17, 2017 #14


    Education Advisor

    I've never been one to engage in futurist talk about what could or can be. In general, my training is in statistics and computer science, but more so statistics. I neither know nor care whether "the" singularity will occur. I'm simply concerned with solving one problem at a time.
  16. Feb 17, 2017 #15
    You mean this place?


    Glowing ones walking on binary paths?


    Oh and here's a word cloud

  17. Apr 27, 2017 #16
    I just went through a nearly 6 month job hunt for a data scientist position and failed. I'm now back in academia as a junior researcher in genomics.

    A major obstacle for me was the small number of opportunities in my city, which I don't want to leave. Thus I was restricted to the one industry in my town looking for data scientists: banking. My lack of banking/finance background was the deal breaker. It really depends on the other candidates. A lot of employees who interviewed me liked me and said my lack of domain knowledge was not an issue (they themselves didn't have it upon hire). I have three years of data scientist experience from a now defunct tech startup, not to mention my PhD work in machine learning and bioinformatics, so I had plenty of data projects to talk about to sell my skillset. But in all banks I applied to, someone else with more banking/finance experience got the position.

    That said, that's one data point from me that could be an outlier. I live in Honolulu. There are few tech opportunities, so they all will probably end up being competitive here. Not to mention my job search was going on during and after the January blizzard in the USA, which meant the jobs I applied for had a ton of out-of-state applicants wanting to live at the beach. So I faced even more competition than usual :(

    S_David, you should include in your plan getting whatever domain knowledge you need for whatever industry you see yourself working in. It should help even if my experience is an outlier.
    Last edited: Apr 27, 2017
  18. Apr 28, 2017 #17
    Yes, I noticed that. Namely, companies prefer candidates to have background experience in their line of work besides data manipulation, like in the medical field and financing. However, some telecommunication companies (in which I have theoretical experience) need data scientists, and I applied for them stating that I have background in telecommunication, but to no avail.
  19. Apr 28, 2017 #18


    Education Advisor

    There are a couple of points to consider here:

    1. In your particular case, insisting on living in Honolulu (as you yourself suggested) is the primary reason why you failed to find a job in the data science field. From what I've read, the job market in Hawaii is tough outside of the fields of tourism, construction and healthcare. So it doesn't surprise me that you had little success finding work in the data science field. Many of the most lucrative jobs in that area tend to be clustered in the following regions in the US:

    (a) West Coast -- Seattle, Washington or the surrounding suburban areas; California (in particular the San Francisco-Bay area)
    (b) Northeastern US -- New York City; the state of New Jersey; Boston, Massachusetts or the surrounding suburban areas
    (c) Washington DC or the surrounding suburban areas in Maryland and Virginia
    (d) The Raleigh-Durham research park in North Carolina

    2. In my opinion, obtaining domain knowledge coming from academia is next to impossible. Employers who are seeking this are really saying they don't want to hire someone fresh from school, but someone with experience. To break in, one has to expand one's geography when looking for work (see point #1 above).
  20. Apr 28, 2017 #19


    Education Advisor

    In your case, as I stated earlier, seeking domain knowledge would be next to impossible given your background. My suggestion (which I had already advised you to pursue in this thread) would be to pursue an internship program to help you to gain first-hand knowledge of tools/methodologies in data science (I don't know why you are resisting doing this).

    The other suggestions include the following:

    1. Set up a GitHub account where you can upload your code.

    2. Participate in Kaggle competitions -- employers will often look kindly on experience participating in these.

  21. Apr 28, 2017 #20
    I should mention that I did expand to remote job options, which the OP should also consider once ready to do the job search again. I'm not really sure what's the best place to find such positions. I paid for a few months of FlexJobs.

    OP, sorry for hijacking your thread here, but since I did already write something about my situation, I may as well seek advice in case I decide I don't like my current situation and start job searching again.

    In Hawaii, I know it was my lack of domain knowledge that kept me from being competitive for all the positions I applied for. The managers said so for two of them. For another, I know because the candidate who got the job is a LinkedIn connection, so I know that person spent a few years at a federal reserve bank. For the remote jobs, I'm not sure what made my application look unattractive. I really only got one interview that I recall, from a startup data science consulting firm, and that went well enough that they wanted to evaluate me as a contractor first (their MO apparently), sent me an NDA that I signed, and said they would contact me once they got details from a client on the project they intended for me. Never happened. Radio silence after that. My seeking work remotely was even more of a failure than seeking locally.

    - My background: PhD CS. Publications in machine learning and bioinformatics (mostly from dissertation). Three years data scientist experience in a tech startup environment.
    - Current job: Junior researcher in genomics facility in a med school
    - Networking is huge. Knowing people in common. Sharing LinkedIn connections. I'm fine in Hawaii for this, I suppose, but outside? Not so much.
    - My GitHub appears empty if someone searches for it. All my stuff is proprietary and in private repos. Though code for one of my pubs is on GitHub under a university account.
    - Kaggle still useful in my situation?
    - I ought to be getting more pubs with my current job. Probably won't help?

    What should I be considering doing to increase my competitiveness in my next round of job hunting for a data scientist position? My current job is purely funded by grant money, which could disappear in a year or two...
    Last edited: Apr 28, 2017