A Path To Become A Data Scientist

Ask anybody in the field of engineering and they will have heard the terms Big Data, Data Science, and Machine Learning. There is so much material available online that you might feel a little overwhelmed just by looking at it. You will see people competing on Kaggle, publishing papers, writing blogs, and gathering over 10,000 followers on Twitter, and you may feel that you will never be able to reach that level.

The fact is that many companies have realised that they can use the data they have gathered to get much deeper insights into various questions, and that has created a huge demand for Data Scientists, Machine Learning Researchers and Big Data Developers. There is also a lot of buzz around Deep Learning and Artificial Intelligence, but just not enough time to learn all this cool stuff!

Seeing this huge demand, many universities and online course providers have started their own series on these topics. So where should an absolute beginner start? The choice matters, because if you take a course and do not enjoy it, you may give up on this fantastic and exciting field.

I was in that position. I tried many courses, completed a few and gave up on a few (some halfway, some after just a day). One mistake I made was enrolling in many courses at the same time and then failing to complete them, as you do not have enough time when you are working full time.

I had some advantage in the sense that I did my Ph.D. in communication and signal processing, and there is a lot of overlap between signal processing and machine learning algorithms. With that in mind, I am going to tell you what I did to get the foundation I needed to get started in Data Science, Machine Learning and Big Data. To learn these topics, you need a basic understanding of linear algebra, probability and random processes, or at least a willingness to learn them, as it makes life easier (just a little!). If you are trying to become a Data Scientist while working full time, then I would say give yourself one year and study part time. I can promise you that you will be in a much better position after one year of hard work. My suggestions are below:

Machine Learning by Andrew Ng

This course must be the first step in your journey towards becoming a Data Scientist. I cannot stress this enough. My experience says that not many people can teach a complex topic with such ease. In fact, this course must have produced many data scientists.

The course uses Matlab and, if I remember correctly, students can get a free license for the duration of the course. Alternatively, you can use Octave. Do not worry if you have not used Matlab before, as there will be a class on Matlab and it is very easy to learn. Prof. Ng works through some mathematical derivations so that you can see what is going on instead of using the algorithms blindly. The coding exercises are also very good and will really test your skills. You do not need to buy any book for this course. Another good part is that you get several attempts at the quizzes and coding exercises.

I suggest that you complete this course even if you do not like it (which is highly unlikely) just so that you know what you will be getting into.

You should do two more things while you are taking this course, as they will prepare you for the next two courses. First, read the book An Introduction to Statistical Learning. This book provides a theoretical treatment of the subject from a statistical point of view. It is available free from this link. You do not have to master the book, but look up the topics you learn in the course so that you get a little more theoretical understanding. Second, register on Kaggle and work through the tutorials, e.g. predicting the survival of Titanic passengers.

The Analytics Edge by MIT

This course is offered by MIT through edX. Again, although it requires 10 to 15 hours per week, it is a very easy course to follow. The instructors are very good and their teaching style is clear. The course uses R as its programming language, and anyone with prior programming experience will pick it up quickly.

The structure of the course is unique: the concepts are covered first, then real examples are shown using the techniques, and then a recitation class shows how each technique is used in R to make predictions. The examples come from Baseball, Twitter, D2HawkEye, the US Supreme Court Judgements etc.

The homework assignments are also good and sometimes challenging, which is a good thing as they really test your skills. The best part of the course is the Kaggle competition, which counts for a large part of the course. You will be tested on the techniques taught in the course. And in the end, there will be a final exam.

I highly recommend this course as it gives you another perspective and you will get to learn and use R as a language. What I like about this course is that you only have limited attempts at the quizzes and R programming exercises. That way you have to be very sure when you submit an answer.

Please continue reading the book I suggested above in the first point, as it will help you get a deeper understanding of the subject. It would be even better if you could master the concepts in the book to a certain degree.

Introduction to Statistical Learning

This course is offered by Stanford University and it is based on the book An Introduction to Statistical Learning (now you know why I suggested it in the first point). Actually, this book is a cut-down version of a much bigger and better book, Elements of Statistical Learning, available free on this link.

The course is taught by Trevor Hastie and Rob Tibshirani, and I think they offer it each year around January/February. The videos are fun to watch but require a lot of attention. The quizzes are very challenging and you only get one attempt! So you had better be sure when you submit an answer.

Next Move

If you complete the above three courses, then I can assure you that you will be in a much better position and more confident in dealing with Data Science projects. I strongly encourage you to start your own side project while learning and also to start using Git for version control. If you want to learn Git, check out the video tutorial by Derek Banas on YouTube.

Now that you have a strong foundation of Data Science or Machine Learning, you can choose from the following courses:

  1. Probabilistic Graphical Models by Daphne Koller
  2. Deep Learning by Nando De Freitas

You may instead want to learn how to implement these algorithms at large scale using a proper Big Data architecture. For that, you should know SQL or NoSQL. You may also want to learn about Big Data frameworks such as Hadoop and Spark. I have recently signed up for a specialisation on Spark offered by the University of California, Berkeley through edX. The course will begin in mid-June 2016 and looks good, so I will try it out and post a review.

If not, you may want to learn Python so that you can implement machine-learning algorithms. The machine learning/data science community is divided on which language to choose (R/Python/Matlab). It really depends on the individual but I would say not many people will prefer Matlab as it is not free and many things are way too complicated in Matlab. I have used Matlab for wireless communication and signal processing but for Data Science and Machine Learning, I would use either R or Python.
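To give a feel for what implementing a machine-learning algorithm in Python can look like, here is a minimal sketch of fitting a straight line with batch gradient descent, the kind of algorithm an introductory course typically starts with. The function name fit_line, the learning rate, the number of steps and the toy data are all my own assumptions for illustration; they are not taken from any of the courses above.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error.

    Hypothetical helper for illustration only; the default learning rate
    and step count are assumptions tuned for the toy data below.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every training example
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step downhill
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for this example)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(w, b)  # should be close to 2 and 1
```

In practice you would use a library rather than hand-written loops, but writing the update rule yourself once is a good way to see what is going on instead of using the algorithm blindly.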

As you can see, this is a never-ending story and there are so many things to learn and improve. The path that I have suggested may or may not work for you. But at least you will get an idea of how exciting this field is.

Also, try to interact with people on various forums, either within the courses or outside them, such as Kaggle and StackOverflow. See what others are doing, follow the experts on Twitter, follow data-heavy companies and try to attend local meet-ups. The only way to learn continuously is to be amongst people who are already doing it on a day-to-day basis.

I did my schooling at a non-English-medium school, but I took English as a subject. My English teacher told me that 50% of the task is accomplished the moment you set your eyes on a target, and the remaining 50% is just pure hard work and dedication. I strongly believe in what he said and I am confident that you will also become a great Data Scientist one day.

Good luck!