When I first started with data, I was trying to teach myself R by following the Coursera course ‘Data Science’ by Johns Hopkins University. Apparently, R would be the best data science language of the future, the reasoning being that it is free and statistics-based, so it lends itself naturally to machine learning.
I actually think Python is better because it’s easy to use, deep learning APIs are usually written in Python, and cloud computing usually uses Python as well. If you look online, Python now sits at the top of the language rankings, followed by JavaScript.
At the start it was difficult, but it gradually got easier the more I practiced and used StackOverflow. I started using tidyr and ignored base R because tidyr has better syntax; I found base R too verbose.
However, when I began trying out deep learning, I discovered that the big deep learning players (e.g. TensorFlow, PyTorch) were designed primarily for Python users, so I had to learn a new language, which meant another learning curve.
What all the data languages had in common was that you still needed to clean data to suit the language. For example, R and Python use different object types when creating a sparse data frame. Learning how to conduct machine learning can be cumbersome when switching between languages.
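As a minimal sketch of that difference: in Python the sparse object typically comes from scipy, whereas in R it would be a dgCMatrix from the Matrix package (the column name below is invented for illustration).

```python
import pandas as pd
from scipy.sparse import csr_matrix

# A one-hot encoded frame is mostly zeros, so it stores well sparsely.
df = pd.get_dummies(pd.DataFrame({"suburb": ["A", "B", "A", "C"]}), dtype=int)

# Python's sparse object is a scipy matrix; the rough R equivalent
# would be a dgCMatrix built with Matrix::sparse.model.matrix.
sparse_data = csr_matrix(df.values)
print(sparse_data.shape)
```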
Given that my day job rarely requires machine learning, I’ve actually resorted to using Excel’s PowerQuery for most of my data manipulation because it’s easier than coding. Only occasionally do I use R or Python for more complex work. As a matter of fact, in PowerBI you can clean data and run machine learning without needing to learn how to code. You can even use Google Sheets for data cleaning and Weka for machine learning as a completely free option.
Ultimately, machine learning isn’t that hard. Really, all you need to do is import data, get rid of excess data columns, fill in the nulls, one-hot encode, split the data into train and test sets, train your algorithm of choice on the training set and finally predict on the test set. Okay, maybe I lied, but trust me, it’s not too hard.
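Sketched out in Python, those steps look roughly like this (the file name, column names and choice of algorithm are all placeholders, not a recipe from the original project):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Import data (hypothetical file and columns).
df = pd.read_csv("data.csv")

# Get rid of excess columns.
df = df.drop(columns=["unused_id"])

# Fill in the nulls.
df = df.fillna(df.median(numeric_only=True))

# One-hot encode the categorical columns.
df = pd.get_dummies(df)

# Split into train and test sets.
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train on the algorithm of choice, then predict on the test set.
model = RandomForestClassifier().fit(X_train, y_train)
predictions = model.predict(X_test)
```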
But what I found hardest about machine learning was understanding which data types were meant to be used with which algorithms. Admittedly, this is my fault: I just don’t read the algorithm’s ‘how to’ page as carefully as I should. Often, I find these pages a bit too wordy.
Here are some examples of machine learning algorithms and the data types that suit them best. K-means works best with only numeric data. XGBoost works best when everything is in factors (in R) or categories (in Python). Finally, TensorFlow requires everything to be turned into some sort of number, which is then scaled across all columns before use.
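A small Python sketch of those preparations (my own illustration; the column names are invented, and enable_categorical is the XGBoost option I believe matches the ‘categories’ point above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

df = pd.DataFrame({"rooms": [3, 2, 4], "suburb": ["A", "B", "A"]})

# K-means / TensorFlow: everything numeric, then scaled across columns.
numeric = pd.get_dummies(df, dtype=float)
scaled = StandardScaler().fit_transform(numeric)

# XGBoost in Python: mark categoricals with the 'category' dtype
# (the R equivalent is a factor) and let the model handle them.
df["suburb"] = df["suburb"].astype("category")
model = xgb.XGBRegressor(enable_categorical=True, tree_method="hist")
```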
Really, the root problem was just my poor understanding of the language I was using. Honestly, I was so intent on getting machine learning to work that I ignored the fundamentals.
Since my day job seldom requires machine learning, I don’t run into this problem in the same way. But when I’m making graphs or trying to model data, the issue arises again.
My first data science project was to build an algorithm to predict house prices accurately. For my data set, I used land size, number of rooms, number of bathrooms, suburb and whatever other amenities I could find, and put all of those variables into an XGBoost model. I thought I did a pretty good job.
But in reality, the XGBoost model wasn’t that great. It turns out that if I took the longitude and latitude coordinates of previously sold homes and put them into a clustering model, the results would be just as good. Really, all machine learning did here was sort houses by location and price each one according to its current state (original, renovated or new). Upon reflection, I could have predicted house prices just as well, if not better, by looking up the prices of recently sold homes on a particular street rather than using machine learning.
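A rough sketch of that clustering alternative (the file and column names are my own placeholders): group homes by coordinates, then use each cluster’s median sale price as the estimate.

```python
import pandas as pd
from sklearn.cluster import KMeans

homes = pd.read_csv("sold_homes.csv")  # hypothetical sales data

# Cluster purely on location.
coords = homes[["latitude", "longitude"]]
homes["area"] = KMeans(n_clusters=50, n_init=10).fit_predict(coords)

# Price a house roughly by what its neighbours sold for.
price_guide = homes.groupby("area")["sold_price"].median()
```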
It took me a while to realise what machine learning does well. It is not predicting things; rather, it is good at copying how the world currently thinks about things and automating that.
At work, I have really only used machine learning once, and that was to automatically fill out a spreadsheet using Naive Bayes. My job was to guess outputs from various variables and I was too lazy to do that manually.
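As a hedged sketch of that kind of task (the file, the column names and the choice of GaussianNB are all my assumptions, and the features are assumed numeric): train on the rows already filled in by hand, then predict the rest.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

sheet = pd.read_excel("worksheet.xlsx")  # hypothetical spreadsheet
features = ["var1", "var2", "var3"]      # invented column names

# Train on the rows that already have an output filled in.
labelled = sheet.dropna(subset=["output"])
model = GaussianNB().fit(labelled[features], labelled["output"])

# Fill in the missing outputs automatically.
missing = sheet["output"].isna()
sheet.loc[missing, "output"] = model.predict(sheet.loc[missing, features])
```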
When self-teaching data science, the majority of my time was spent copying and pasting error logs into Google and seeing what answers I could find on StackOverflow. Most of the time, I was lucky and found the answer directly. Other times, I had to do my own troubleshooting. For example, why couldn’t I load data into my XGBoost model? Simply because I hadn’t made the data frame sparse. These were small things that were explicit in the machine learning ‘how to’ guides, but my knowledge was so fragile at the time that I didn’t really understand what I was doing and needed several StackOverflow answers to tell me.
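In Python, the fix boils down to something like this (a sketch, since the original code isn’t shown): convert the frame to a scipy sparse matrix before handing it to XGBoost.

```python
import pandas as pd
import xgboost as xgb
from scipy.sparse import csr_matrix

# Hypothetical file; one-hot encoding makes the frame mostly zeros.
df = pd.get_dummies(pd.read_csv("train.csv"), dtype=float)
y = df.pop("target")

# XGBoost's DMatrix happily accepts a scipy sparse matrix.
dtrain = xgb.DMatrix(csr_matrix(df.values), label=y)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain)
```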
Today, things are more intuitive and I don’t use the platform as much, but it still has its purposes. For instance, when I want to do something exotic like grouping by ID and some other variable and slicing the first instance of each group.
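In pandas, that ‘exotic’ operation looks something like this (my own illustration with invented names):

```python
import pandas as pd

df = pd.DataFrame({
    "id":       [1, 1, 2, 2],
    "variable": ["a", "b", "a", "a"],
    "value":    [10, 20, 30, 40],
})

# Group by ID and another variable, then keep each group's first row.
first_rows = df.groupby(["id", "variable"], as_index=False).first()
```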
As cloud computing became more prominent, I realised that I no longer had to run machine learning on a local machine; instead I could spin up a cloud instance to do it for me. Fortunately, because of my troubles learning Python for TensorFlow and scikit-learn, I didn’t have a horrendous time learning cloud computing.
Nevertheless, coming from a data background and trying to do some software programming did drive me insane at times. For instance, to have a machine learning algorithm run on demand, you first have to write a script that loads data from storage, trains on the training set, predicts on the test set, writes the output within the instance and finally exports the data back to storage. Furthermore, this needs to be exposed as a web service (e.g. a RESTful API). Once figured out, it wasn’t too hard, but it took a while to get there.
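Here is a minimal sketch of that pattern using Flask (my choice for illustration, not necessarily what was used originally; the storage paths, columns and model are placeholders, and reading s3:// paths assumes the s3fs package is installed):

```python
import pandas as pd
from flask import Flask, jsonify
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

@app.route("/score", methods=["POST"])
def score():
    # Load data from storage.
    train = pd.read_csv("s3://my-bucket/train.csv")
    test = pd.read_csv("s3://my-bucket/test.csv")

    # Train on the training set, then predict on the test set.
    features = train.drop(columns=["y"])
    model = LogisticRegression().fit(features, train["y"])
    test["prediction"] = model.predict(test[features.columns])

    # Write the output and export it back to storage.
    test.to_csv("s3://my-bucket/output.csv", index=False)
    return jsonify({"rows_scored": int(len(test))})

if __name__ == "__main__":
    app.run()
```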
Upon reflection, to be a modern data scientist, it would be easier to learn basic cloud computing and data storage skills before moving on to machine learning algorithms. The reason, at least for me, is that progress is limited more by coding skills than by understanding of machine learning models.
Source: https://medium.datadriveninvestor.com/after-4-years-of-data-science-heres-what-i-learnt-b179e7719559