How can I transition from academia to become a data scientist in the industry?

EngWiPy · Feb 10, 2017

Hello all,

I'm trying to come up with a detailed plan on how to make a transition from academia to become a data scientist in the industry. My academic background is: B.Sc in Computer Engineering, and M.Sc and PhD (and ~ 2 years postdoc experience) in Electrical Engineering/Wireless Communications.

I have been reading about R, Python, and Machine Learning, but it seems that I'm not getting anywhere. I learn the theory/syntax and understand it, and apply the examples given, but that's it. I cannot put this on my resumé. Also, not everything in R and Python is related to data science.

So, I need some help from the folks in the industry on what I should read (specific topics in R, Python, SQL, etc.; pointers to books and materials would be great), and on what I should do after reading the materials to prove I have some knowledge, since I don't have practical experience. Some suggested having a GitHub profile, but I'm not sure what to put in it. I guess I need to do something that hasn't been done, but I wonder how to know what that might be (excuse me if this sounds naive).

I'm short on budget right now, and I need to do this by myself. I can do it, but I need some guidance.

Thanks

StatGuy2000 · Feb 10, 2017

Have you thought about looking at the following? I knew of someone who finished his PhD in probability theory at the University of Toronto (without much relevant industry experience), completed this program, and is now working as a data scientist.

https://www.thedataincubator.com/

dkotschessaa · Feb 10, 2017

I'm not in the industry, but have been looking into it for a time. I do know that if you already have a PhD (in just about anything remotely quantitative!) there are a lot of post-doc Data Science training programs out there that try to get you up to speed quickly.

e.g.
http://insightdatascience.com/
A review and some questions about it here: https://datasciencebootcamps.com/2016/07/25/data-science-bootcamp-review-insight-data-science/

I believe there are others. I don't know what their value is, but I wanted to pass it along.

Some say it's about having projects under your belt, a "portfolio," GitHub repository, Kaggle competitions, etc. I don't know if this is the reality or hype. There are places that offer projects for people to work on, like Data Science for Social Good.

Again, this is advice I've been given - not advice I'm able to follow just yet. I'm finishing up my masters and then off I go.

-Dave K

Locrian · Feb 10, 2017

dkotschessaa said:

Some say it's about having projects under your belt, a "portfolio," GitHub repository, Kaggle competitions, etc. I don't know if this is the reality or hype.

If it's interesting, is sufficiently advanced, and you can talk to it well, I think it can really help.

Gotta get them to it though. Craft that resume, OP.

EngWiPy · Feb 13, 2017

Thanks all. I found something in the sites you included that I could use for my self-study. I would have liked it more if I could attend these programs personally, but I'm not in the USA. I'm still considering the online option, though.

I think I have another problem, which is that my resumé probably isn't written in a professional way. I heard that academics are terrible at selling their skills, which I think is true. My question is then: how do I write my resumé so that my skills come across as closely related to data science? I mean, I state in it that I have a good background in statistics and some knowledge of programming, but since my experience isn't directly related to large data sets, I think most hiring managers discard my resumé from the beginning as unrelated. How can I present my skills as relevant to the job, the way employers want me to?

MarneMath · Feb 13, 2017

I write often about what it takes to become a data scientist and problems academics face when making the transition. Read this:

https://www.physicsforums.com/threads/the-experience-dilemma.890909/#post-5604808

Secondly, GitHub and Kaggle do help. It isn't uncommon for top-level companies to REQUIRE a GitHub portfolio or a Kaggle profile. I don't require it for my interviews but I sure as hell do like to see it.

Thirdly, not many people in general have experience with big data. What's more important is being familiar with the concepts of how to process big data.

Fourth, you don't need much experience with Python, R, SQL, etc. What you do need is experience with common libraries (scikit-learn, pandas, ggplot2, seaborn, matplotlib).

Fifth, talk to the importance of visualization. Being able to code using D3.js and visualize meaningful data in a dynamic way makes you 10 times more attractive to me than someone who cannot.

Lastly, you need to be familiar with Spark. If you can speak to how to use Spark 2.0 and its DataFrame pipelines, that's a relatively rare but useful skill. A lot of people know Spark is the future/present, but not enough people are good enough at it yet.

Pro-tip if you want to be an overachiever: learn Scala and write your own custom Spark jobs :).

And if you want to be even more pro, write Java MapReduce :p.

EngWiPy · Feb 13, 2017

Yes, you made some comments on my thread before about experience.

Thanks for the more specific details. I have been reading a book called Machine Learning with R. In the book there are examples of large data sets and I follow through them. So, I'm learning both machine learning concepts/algorithms and R at the same time, and on large data sets. I'm also reading about Python, but I need to be more specific there. I know I need to learn SQL, but I still don't know much about it. I wasn't aware of Spark and JavaScript. I will read about them as well.

But again, after reading all this stuff, what should I do to prove that I know it? How do I state in my resumé that I'm familiar with the concepts of how to manipulate data, for example? Also, what should I put on GitHub or Kaggle? I mean, do I need to work on some large projects, or would it be fine to make some extensions to something that already exists?

MarneMath · Feb 14, 2017

As far as GitHub and Kaggle, I think there are two things that are important. With GitHub, I think it's important to contribute to open source projects, even in a small way. I would look for a very small open bug and see if it's something you can fix. Why does this matter? First, it'll show that you can read other people's code and contribute to it! Secondly, it'll show you know how to interact with a version control tool. Sounds minor but it's rather useful and somewhat rare for a data scientist. Even if all you do is improve a print statement, that's significant.

As far as Kaggle, what's important is that you document your work and allow someone to read it. There are two types of data science interviews. The first type is the "let me ask you what conditions most occur in a cell means model for blah blah blah to be true". I hate those and find them to be useless. I believe memorizing things about statistics is pointless; you can always look things up. The second type, and probably the most common, is "detail your model building process". What Kaggle will allow you to do is provide a better answer to that because now you can draw from experience. Also, you can use an IPython notebook or any markdown file to document your process and allow an interviewer to follow along. Plus that shows you know how to document your work, code, and visualize data.

As far as SQL, there isn't much to know. There's no need to ever use bind variables or convoluted super joins. The most you probably need is something like this:

SELECT t1.potato_id, t1.cost FROM table1 t1, (SELECT potato_id, location FROM table2 WHERE location = 'NY') t2 WHERE t1.potato_id = t2.potato_id;

Basically, basic query, subquery, joins, and where clauses. Nothing special.
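For anyone who wants to try that kind of query without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The table names follow the example above, but the rows are invented purely for illustration, and the comma join is written as an explicit JOIN, which is an equivalent form.

```python
import sqlite3

# In-memory database with hypothetical rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (potato_id INTEGER, cost REAL);
    CREATE TABLE table2 (potato_id INTEGER, location TEXT);
    INSERT INTO table1 VALUES (1, 0.50), (2, 0.75), (3, 1.20);
    INSERT INTO table2 VALUES (1, 'NY'), (2, 'CA'), (3, 'NY');
""")

# Basic query + subquery + join + WHERE clause.
# The subquery gets an alias (ny) so the outer query can refer to its columns.
query = """
    SELECT t1.potato_id, t1.cost
    FROM table1 AS t1
    JOIN (SELECT potato_id, location
          FROM table2
          WHERE location = 'NY') AS ny
      ON t1.potato_id = ny.potato_id
    ORDER BY t1.potato_id
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # only the potatoes located in NY
```

Running the query from a script like this is also one way to pull data programmatically rather than through a SQL client.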

Lastly, be careful when you say large dataset, it's easy to get to pissing matches about that. When people ask have you worked with large data just say "The largest I've worked on is so and so". While 500gb sounds large to some people, I process 1-20 terabytes daily. There's nothing you can really do to really prepare you for how to handle that.

As far as your resume goes: first and foremost, emphasize your work and the times you've had to explain it. Be prepared to explain it in layman's terms and not get caught up in the jargon. Second, add a section about Kaggle or even personal projects. For example, "Used Random Forest in scikit-learn to predict survivors in the Titanic data. Transformed the data using DictVectorizer. Validated the model by method A using xyz. Obtained result abc." Obviously use a better example than Titanic. Honestly, if your resume had a link to a pretty GitHub with some markdown that explained your work, I'm sure that'd appeal to a lot of managers.

I wouldn't be too worried about JavaScript; that's a bonus skill. It'd be super impressive if you built a D3 demo from scratch yourself that showed something you thought was interesting, but that's simply a way to get more money and not a skill that'll land you a core data science job.

dkotschessaa · Feb 14, 2017

@MarneMath What about this boom of MOOCs promising a glamorous career in data science? Anyone taking them seriously as a credential? I would take them to learn the skills but I do not know if they are worth putting on a resume.

-Dave K

MarneMath · Feb 14, 2017

MOOCs are useful for internal transfers. If you work at my company as an engineer and want to transfer to data science, then a MOOC is useful to show that you are a self-starter and willing to learn on your own. On the other hand, if you are an engineer from an outside company and apply with just a MOOC, then I don't really care. The reason for this is that internal technical talent probably has connections, insights, and experience working with the data we are processing. That expertise is generally useful to have on a team.

EngWiPy · Feb 14, 2017

MarneMath said:

As far as GitHub and Kaggle, I think there are two things that are important. With GitHub, I think it's important to contribute to open source projects, even in a small way. I would look for a very small open bug and see if it's something you can fix. Why does this matter? First, it'll show that you can read other people's code and contribute to it! Secondly, it'll show you know how to interact with a version control tool. Sounds minor but it's rather useful and somewhat rare for a data scientist. Even if all you do is improve a print statement, that's significant.

As far as Kaggle, what's important is that you document your work and allow someone to read it. There are two types of data science interviews. The first type is the "let me ask you what conditions most occur in a cell means model for blah blah blah to be true". I hate those and find them to be useless. I believe memorizing things about statistics is pointless; you can always look things up. The second type, and probably the most common, is "detail your model building process". What Kaggle will allow you to do is provide a better answer to that because now you can draw from experience. Also, you can use an IPython notebook or any markdown file to document your process and allow an interviewer to follow along. Plus that shows you know how to document your work, code, and visualize data.

As far as SQL, there isn't much to know. There's no need to ever use bind variables or convoluted super joins. The most you probably need is something like this:

SELECT t1.potato_id, t1.cost FROM table1 t1, (SELECT potato_id, location FROM table2 WHERE location = 'NY') t2 WHERE t1.potato_id = t2.potato_id;

Basically, basic query, subquery, joins, and where clauses. Nothing special.

Lastly, be careful when you say large dataset, it's easy to get to pissing matches about that. When people ask have you worked with large data just say "The largest I've worked on is so and so". While 500gb sounds large to some people, I process 1-20 terabytes daily. There's nothing you can really do to really prepare you for how to handle that.

As far as your resume goes: first and foremost, emphasize your work and the times you've had to explain it. Be prepared to explain it in layman's terms and not get caught up in the jargon. Second, add a section about Kaggle or even personal projects. For example, "Used Random Forest in scikit-learn to predict survivors in the Titanic data. Transformed the data using DictVectorizer. Validated the model by method A using xyz. Obtained result abc." Obviously use a better example than Titanic. Honestly, if your resume had a link to a pretty GitHub with some markdown that explained your work, I'm sure that'd appeal to a lot of managers.

I wouldn't be too worried about JavaScript; that's a bonus skill. It'd be super impressive if you built a D3 demo from scratch yourself that showed something you thought was interesting, but that's simply a way to get more money and not a skill that'll land you a core data science job.

I would like to thank you very much for all of these insights and information, and for the tone in which you wrote it. I think I now have more of a sense of direction about what to do. I wasn't sure what managers are looking for in a resumé and, honestly, I was suspicious about the significance of GitHub and Kaggle because all jobs emphasize real, practical experience, and only a small fraction require them (usually smaller companies). Yes, for now I just need the basics to land a job. Thanks again

dkotschessaa · Feb 14, 2017

S_David said:

I would like to thank you very much for all of these insights and information, and for the tone in which you wrote it. I think I now have more of a sense of direction about what to do. I wasn't sure what managers are looking for in a resumé and, honestly, I was suspicious about the significance of GitHub and Kaggle because all jobs emphasize real, practical experience, and only a small fraction require them (usually smaller companies). Yes, for now I just need the basics to land a job. Thanks again

I'd like to echo this thank you. There is a lot of hype surrounding this field right now and I am having trouble sorting out myth vs. reality.

-Dave K

Crass_Oscillator · Feb 17, 2017

Actually, speaking of hype, I'm curious about what MarneMath thinks about claims of a bubble in "data science" (I hate the term). I was at Carnegie Mellon for a year, arguably the Mecca of data science, and encountered a lot of glassy eyed, cultish individuals who seemed pathologically optimistic.

Will Deep Neural Nets replace HR at long last? Will boosted everything discover quantum gravity? Can a company replace all of its programmers with an infinite stack of meta-machine learning gizmos, each of which is trained on increasingly higher order parameters and whose outputs are fed to the next, lowlier element in the line?

MarneMath · Feb 17, 2017

I've never been one to engage in futurist talk about what could and can be. In general, my training is in statistics and computer science, but more so statistics. I neither know nor care whether "the" singularity will occur. I'm simply concerned with solving one problem at a time.

dkotschessaa · Feb 17, 2017

Crass_Oscillator said:

Actually, speaking of hype, I'm curious about what MarneMath thinks about claims of a bubble in "data science" (I hate the term). I was at Carnegie Mellon for a year, arguably the Mecca of data science

You mean this place?

and encountered a lot of glassy eyed, cultish individuals who seemed pathologically optimistic.

Glowing ones walking on binary paths?

Oh and here's a word cloud

onoturtle · Apr 27, 2017

I just went through a nearly 6 month job hunt for a data scientist position and failed. I'm now back in academia as a junior researcher in genomics.

A major obstacle for me was the small number of opportunities in my city, which I don't want to leave. Thus I was restricted to the one industry in my town looking for data scientists: banking. My lack of banking/finance background was the deal breaker. It really depends on the other candidates. A lot of the employees who interviewed me liked me and said my lack of domain knowledge was not an issue (they themselves didn't have it upon hire). I have three years of data scientist experience from a now-defunct tech startup, not to mention my PhD work in machine learning and bioinformatics, so I had plenty of data projects to talk about to sell my skillset. But at all the banks I applied to, someone else with more banking/finance experience got the position.

That said, that's one data point that could be an outlier. I live in Honolulu. There are few tech opportunities here, so they will probably all end up being competitive. Not to mention my job search was going on during and after the January blizzard in the USA, which meant the jobs I applied for had a ton of out-of-state applicants wanting to live at the beach. So I faced even more competition than usual :(

S_David you should be including in your plan to get whatever domain knowledge you need for whatever industry you see yourself working for. It should help even if my experience is an outlier.

EngWiPy · Apr 28, 2017

Yes, I noticed that. Namely, companies prefer candidates to have background experience in their line of work besides data manipulation, as in the medical field and finance. However, some telecommunication companies (a field in which I have theoretical experience) need data scientists, and I applied to them stating that I have a background in telecommunications, but to no avail.

StatGuy2000 · Apr 28, 2017

onoturtle said:

I just went through a nearly 6 month job hunt for a data scientist position and failed. I'm now back in academia as a junior researcher in genomics.

A major obstacle for me was the small number of opportunities in my city, which I don't want to leave. Thus I was restricted to the one industry in my town looking for data scientists: banking. My lack of banking/finance background was the deal breaker. It really depends on the other candidates. A lot of the employees who interviewed me liked me and said my lack of domain knowledge was not an issue (they themselves didn't have it upon hire). I have three years of data scientist experience from a now-defunct tech startup, not to mention my PhD work in machine learning and bioinformatics, so I had plenty of data projects to talk about to sell my skillset. But at all the banks I applied to, someone else with more banking/finance experience got the position.

That said, that's one data point that could be an outlier. I live in Honolulu. There are few tech opportunities here, so they will probably all end up being competitive. Not to mention my job search was going on during and after the January blizzard in the USA, which meant the jobs I applied for had a ton of out-of-state applicants wanting to live at the beach. So I faced even more competition than usual :(

S_David you should be including in your plan to get whatever domain knowledge you need for whatever industry you see yourself working for. It should help even if my experience is an outlier.

There are a couple of points to consider here:

1. In your particular case, insisting on living in Honolulu (as you yourself suggested) is the primary reason why you failed to find a job in the data science field. From what I've read, the job market in Hawaii is tough outside of the fields of tourism, construction and healthcare. So it doesn't surprise me that you had little success finding work in the data science field. Many of the most lucrative jobs in that area tend to be clustered in the following regions in the US:

(a) West Coast -- Seattle, Washington or the surrounding suburban areas; California (in particular the San Francisco-Bay area)
(b) Northeastern US -- New York City; the state of New Jersey; Boston, Massachusetts or the surrounding suburban areas
(c) Washington DC or the surrounding suburban areas in Maryland and Virginia
(d) The Raleigh-Durham research park in North Carolina

2. In my opinion, obtaining domain knowledge coming from academia is next to impossible. Employers who are seeking this are really saying they don't want to hire someone fresh from school, but someone with experience. To break in, one has to expand one's geography when looking for work (see point #1 above).

StatGuy2000 · Apr 28, 2017

S_David said:

Yes, I noticed that. Namely, companies prefer candidates to have background experience in their line of work besides data manipulation, as in the medical field and finance. However, some telecommunication companies (a field in which I have theoretical experience) need data scientists, and I applied to them stating that I have a background in telecommunications, but to no avail.

In your case, as I stated earlier, seeking domain knowledge would be next to impossible given your background. My suggestion (which I had already advised you to pursue in this thread) would be to pursue an internship program to help you to gain first-hand knowledge of tools/methodologies in data science (I don't know why you are resisting doing this).

The other suggestions include the following:

1. Set up a Github account where you can upload your code.

2. Participate in Kaggle competitions -- employers will often look kindly on experience participating in these.

https://www.kaggle.com/competitions

onoturtle · Apr 28, 2017

I should mention that I did expand to remote job options, which the OP should also consider once ready to do the job search again. I'm not really sure what's the best place to find such positions. I paid for a few months of FlexJobs.

OP, sorry for hijacking your thread here, but since I did already write something about my situation, I may as well seek advice in case I decide I don't like my current situation and start job searching again.

In Hawaii, I know it was my lack of domain knowledge that made me uncompetitive for all the positions I applied for. The managers said so for two of them. For another, I know because the candidate who got the job is a LinkedIn connection, so I know that person spent a few years at a federal reserve bank. For the remote jobs, I'm not sure what made my application unattractive. I recall getting only one interview, from a startup data science consulting firm, and that went well enough that they wanted to evaluate me as a contractor first (their MO apparently), sent me an NDA that I signed, and said they'd contact me once they got details from a client on the project they intended for me. Never happened. Radio silence after that. My seeking work remotely was even more of a failure than seeking locally.

- My background: PhD CS. Publications in machine learning and bioinformatics (mostly from dissertation). Three years data scientist experience in a tech startup environment.
- Current job: Junior researcher in genomics facility in a med school
- Networking is huge. Knowing people in common. Sharing LinkedIn connections. I'm fine in Hawaii for this, I suppose, but outside? Not so much.
- My GitHub appears empty if someone searches for it. All my stuff is proprietary and in private repos. Though code for one of my pubs is on GitHub under a university account.
- Kaggle still useful in my situation?
- I ought to be getting more pubs with my current job. Probably won't help?

What should I be considering doing to increase my competitiveness in my next round of job hunting for a data scientist position? My current job is purely funded by grant money, which could disappear in a year or two...

WWGD · Apr 28, 2017

MarneMath said:

<Snip> What Kaggle will allow you to do is provide a better answer to that because now you can draw from experience. Also, you can use an ipython notebook or any markdown file to document your process and allow an interviewer to follow again. Plus that shows you know how to document your work, code, and visualize data.

As far as SQL, there isn't much to know. There's no need to ever use bind variables or convoluted super joins. The most you probably need is something like this:

SELECT t1.potato_id, t1.cost FROM table1 t1, (SELECT potato_id, location FROM table2 WHERE location = 'NY') t2 WHERE t1.potato_id = t2.potato_id;

Basically, basic query, subquery, joins, and where clauses. Nothing special.

Lastly, be careful when you say large dataset, it's easy to get to pissing matches about that. When people ask have you worked with large data just say "The largest I've worked on is so and so". While 500gb sounds large to some people, I process 1-20 terabytes daily. There's nothing you can really do to really prepare you for how to handle that.

As far as your resume goes: first and foremost, emphasize your work and the times you've had to explain it. Be prepared to explain it in layman's terms and not get caught up in the jargon. Second, add a section about Kaggle or even personal projects. For example, "Used Random Forest in scikit-learn to predict survivors in the Titanic data. Transformed the data using DictVectorizer. Validated the model by method A using xyz. Obtained result abc." Obviously use a better example than Titanic. Honestly, if your resume had a link to a pretty GitHub with some markdown that explained your work, I'm sure that'd appeal to a lot of managers.

I wouldn't be too worried about JavaScript; that's a bonus skill. It'd be super impressive if you built a D3 demo from scratch yourself that showed something you thought was interesting, but that's simply a way to get more money and not a skill that'll land you a core data science job.

A couple of followups, please:
1) Would it help if you had advanced knowledge of SQL?
2) How do you deal with large datasets: Hadoop/MapReduce?
3) Re: Machine learning: is the learning part the iterations you do after partitioning into training and validation data?

WWGD · Apr 28, 2017

StatGuy2000 said:

There are a couple of points to consider here:

1. In your particular case, insisting on living in Honolulu (as you yourself suggested) is the primary reason why you failed to find a job in the data science field. From what I've read, the job market in Hawaii is tough outside of the fields of tourism, construction and healthcare. So it doesn't surprise me that you had little success finding work in the data science field. Many of the most lucrative jobs in that area tend to be clustered in the following regions in the US:

(a) West Coast -- Seattle, Washington or the surrounding suburban areas; California (in particular the San Francisco-Bay area)
(b) Northeastern US -- New York City; the state of New Jersey; Boston, Massachusetts or the surrounding suburban areas
(c) Washington DC or the surrounding suburban areas in Maryland and Virginia
(d) The Raleigh-Durham research park in North Carolina

2. In my opinion, obtaining domain knowledge coming from academia is next to impossible. Employers who are seeking this are really saying they don't want to hire someone fresh from school, but someone with experience. To break in, one has to expand one's geography when looking for work (see point #1 above).

I was actually surprised to see some of the people I knew go into data science fresh out of a math PhD, and one, who specialized in logic and had no knowledge of programming, was named VP. I don't know how to make sense of it. As for places like NYC, I wonder if the great number of opportunities is canceled out by the large number of people seeking those opportunities. Is it better to have 300 opportunities for 290 people than 40 for 30 people, for example?

dkotschessaa · Apr 29, 2017

WWGD said:

I was actually surprised to see some of the people I knew go into data science fresh out of a math PhD, and one, who specialized in logic and had no knowledge of programming, was named VP. I don't know how to make sense of it. As for places like NYC, I wonder if the great number of opportunities is canceled out by the large number of people seeking those opportunities. Is it better to have 300 opportunities for 290 people than 40 for 30 people, for example?

I know people in math (not stats or comp sci) programs now who are starting to do more data science related research, e.g. topological data analysis. There are probably other areas but that's the one I know about.

I'm not sure if this is the case in the scenario you've described, but IMO the latest generation of math PhDs is just a different breed, probably having more basic computer literacy than in prior years. And they will tell you they don't know how to program, when in fact they probably have used several languages. What they really mean is "I am not a programmer."

-Dave K

MarneMath · Apr 29, 2017

WWGD said:

A couple of followups, please:
1) Would it help if you had advanced knowledge of SQL?
2) How do you deal with large datasets: Hadoop/MapReduce?
3) Re: Machine learning: is the learning part the iterations you do after partitioning into training and validation data?

1) It depends. For companies that are still making the shift to "big data", advanced SQL is useful, but to what extent is debatable. What's probably important is knowledge of partitions, indexes, joins, subqueries, and a programmatic way to get this data on your own (i.e. outside of a SQL tool).
2) When I first started many years ago, Hadoop and Java MapReduce were basically all that existed. Eventually Pig and Hive came along to make life a bit easier. Finally, we have Spark; if you don't know it in today's era, you're basically going to be left behind.
3) I'm not sure what you mean by your third question. Partitioning data into training and validation sets is just good practice; machine learning is just what some people call statistical learning theory.
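To make the partitioning step in point 3 concrete, here is a minimal standard-library sketch of a shuffled train/validation split. The function name, the 80/20 ratio, and the stand-in data are assumptions chosen for illustration; in practice a library helper would usually do this.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

data = list(range(100))  # stand-in for 100 labelled examples
train, validation = train_validation_split(data, validation_fraction=0.2, seed=42)
print(len(train), len(validation))
```

The model is fitted on the training part only, and the validation part is held back to check how well it does on data it has not seen.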

WWGD · Apr 29, 2017

MarneMath said:

1) It depends. For companies that are still making the shift to "big data", advanced SQL is useful, but to what extent is debatable. What's probably important is knowledge of partitions, indexes, joins, subqueries, and a programmatic way to get this data on your own (i.e. outside of a SQL tool).
2) When I first started many years ago, Hadoop and Java MapReduce were basically all that existed. Eventually Pig and Hive came along to make life a bit easier. Finally, we have Spark; if you don't know it in today's era, you're basically going to be left behind.
3) I'm not sure what you mean by your third question. Partitioning data into training and validation sets is just good practice; machine learning is just what some people call statistical learning theory.

Thanks, Marne. By 3 I mean that Machine Learning, AFAIK, is about having machines learn "on their own". I don't see how this comes about when using the mentioned algorithms. In what way is the computer doing any self-learning? But I will look up Stat Learning Theory.

MarneMath · Apr 29, 2017

In general there are three types of learning: "Supervised", "Unsupervised" and "Reinforcement". The learning part of an algorithm is basically how the algorithm makes a decision. For example, in logistic regression, the learning portion occurs whenever you attempt to find the MLE with regularization via some algorithm. Thus the learning parameters essentially become the regularization parameters. Every algorithm has its own way of "learning". Support vector machines maximize distances while looking for a symmetry. Random forests vote. Linear regressions use OLS. It sounds odd, but once you break away the jargon it basically comes down to decision theory and optimization of parameters.
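As a rough illustration of "learning as optimization", here is a small sketch of logistic regression fitted by plain gradient descent on a regularized negative log-likelihood. Everything in it is an assumption made for illustration: the toy data, the single feature, the step size, and the fixed L2 strength `l2`, which is set by hand here rather than tuned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, l2=0.1, learning_rate=0.1, n_steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient descent on the average
    negative log-likelihood plus the penalty (l2 / 2) * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of the log-loss w.r.t. the logit
            grad_w += error * x
            grad_b += error
        grad_w = grad_w / n + l2 * w        # the penalty applies to w only, not b
        grad_b = grad_b / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Invented toy data: negative x tends to class 0, positive x to class 1.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(w, b)
print(sigmoid(w * 1.0 + b) > 0.5)  # is x = 1.0 classified as class 1?
```

The "learning" is just the loop that nudges w and b downhill on the penalized loss; nothing more mysterious than that happens inside.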

StatGuy2000 · Apr 30, 2017

dkotschessaa said:

I know people in math (not stats or comp sci) programs now who are starting to do more data science related research, e.g. topological data analysis. There are probably other areas but that's the one I know about.

I'm not sure if this is the case in the scenario you've described, but IMO the latest generation of math PhDs is just a different breed, probably having more basic computer literacy than in prior years. And they will tell you they don't know how to program, when in fact they probably have used several languages. What they really mean is "I am not a programmer."

-Dave K

My understanding is that there are people in statistics departments who are also involved in research in topological data analysis, often working in close conjunction with people in math departments (in cases where the two departments are separate). Other areas where mathematicians do data science-related research include information geometry (the application of differential geometry to the space of probability distributions), the theory of random matrices, and multiscale geometric analysis (e.g. wavelets and generalizations such as curvelets).

There are probably many other research areas within math that could be tied to data science/statistics/machine learning that I'm not familiar with. Perhaps there are people here on PF who may be familiar with those.

atyy · Apr 30, 2017

MarneMath said:

It sounds odd, but once you break away the jargon it basically comes down to decision theory and optimization of parameters.

At least for supervised learning, I think the only differences from traditional statistics are that the number of parameters is larger, and optimization is nonconvex.

WWGD · Apr 30, 2017

MarneMath said:

In general there are three types of learning: "supervised", "unsupervised" and "reinforcement". The learning part of an algorithm is basically how the algorithm makes a decision. For example, in logistic regression, the learning portion occurs whenever you attempt to find the MLE with regularization via some algorithm. Thus the learning parameters essentially become the regularization parameters. Every algorithm has its own way of "learning". Support vector machines maximize distances while looking for a symmetry. Random forests vote. Linear regression uses OLS. It sounds odd, but once you break away the jargon it basically comes down to decision theory and optimization of parameters.

By optimization of parameters you mean Maximum Likelihood?

MarneMath · Apr 30, 2017

atyy said:

At least for supervised learning, I think the only differences from traditional statistics are that the number of parameters is larger, and optimization is nonconvex.

I think in the broader sense you're more or less correct. I would also argue that the goals of traditional statisticians and "machine learners" tend to be rather different.

WWGD said:

By optimization of parameters you mean Maximum Likelihood?

If relevant for the model then yes.

StatGuy2000 · Apr 30, 2017

atyy said:

At least for supervised learning, I think the only differences from traditional statistics are that the number of parameters is larger, and optimization is nonconvex.

From what I understand, there has been much fertile research within statistics on problems with large numbers of parameters, dimensionality reduction, and sparse signal detection (e.g. recent research on higher criticism; see the following link: https://arxiv.org/pdf/1411.1437.pdf).

Researchers working in machine learning also work on many of these same problems, so increasingly there is a blurring of disciplines between the machine learning and statistics communities. As a matter of fact, it is not uncommon for researchers to be cross-listed between the statistics and computer science departments (where such departments exist separately). My alma mater, for example, has 3 faculty members cross-listed as such.
