You can get a lot from the basics when processing data. Of course, machine learning and advanced analytics are fun, exciting, and stimulating, but there are benefits to simply looking at the data by category. In this quick tutorial, we look at ramen ratings [link] in order to see which brand we want to invest in. Of course, more analysis should always be done before investing in any company; this is just a pretext for analyzing data by category. First of all, we load the data, excluding the ‘Unrated’ ratings:
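The loading step could look something like the sketch below. The column names (Brand, Variety, Style, Country, Stars), the sample rows, and the file name in the comment are assumptions for illustration; a tiny in-memory CSV stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for the ratings file. The column names and rows are
# made up for illustration and are not taken from the real dataset.
sample_csv = io.StringIO(
    "Brand,Variety,Style,Country,Stars\n"
    "Indomie,Mi Goreng,Pack,Indonesia,5\n"
    "Nissin,Chicken Cup,Cup,USA,3.5\n"
    "Acme,Mystery Noodles,Bowl,Japan,Unrated\n"
)

# In practice this would be something like pd.read_csv("ramen-ratings.csv")
# (hypothetical file name).
df = pd.read_csv(sample_csv)

# Drop the rows without a rating, then turn the remaining ratings into numbers.
df = df[df["Stars"] != "Unrated"].copy()
df["Stars"] = pd.to_numeric(df["Stars"])

print(df)
```

Because of the ‘Unrated’ entries, the Stars column is read in as text, which is why it is converted to numbers after the filter.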
Now let us have a look at the number of reviews for each category:
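One way to get this overview, assuming the same hypothetical column names, is to count the distinct values in each categorical column. This is a guess at the kind of check meant here, not the original listing.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Nissin"],
    "Variety": ["Mi Goreng", "Soto Mie", "Chicken Cup"],
    "Style": ["Pack", "Pack", "Cup"],
    "Country": ["Indonesia", "Indonesia", "USA"],
    "Stars": [5.0, 4.5, 3.5],
})

# Number of distinct values in each categorical column.
distinct = {column: df[column].nunique()
            for column in ["Brand", "Variety", "Style", "Country"]}
print("reviews:", len(df))
print(distinct)

# Number of reviews for each value of a single column.
print(df["Style"].value_counts())
```

A column with nearly as many distinct values as there are reviews leaves very few reviews in each group.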
There are just over 2500 reviews in the dataset, so we can’t really infer anything from the variety. However, we can do some analysis on the rest. There will be a lot of counting, so we might as well create a column of ones called pointer:
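A minimal sketch of the pointer column, using a made-up frame with assumed column names:

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Nissin"],
    "Stars": [5.0, 4.5, 3.5],
})

# Every review gets a 1, so summing pointer over any group counts its reviews.
df["pointer"] = 1

print(df)
```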
Now we can start doing some analysis. What I like to do is use some extra memory and code to avoid confusion. This is a one-off analysis, not something that will be running in a system, so you will save time by focusing on clear, idiot-proof code as opposed to efficient code. I create a new data frame with the pointer and the thing that I am counting, group by the field that I am counting, and sum the pointer column. I then drop all duplicates of the thing that I am counting, drop the pointer, and merge the result with the main data frame on the id of the thing that I was counting, passing 'left' to the how parameter. This fills every row in the main data frame with the count for the id being merged on. You can convert this into a function if you want; however, in this post the code is repeated so the reader can see which variables stay the same and which ones change. First of all, we count how many ratings there are per brand:
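The steps described above can be sketched as follows. The frame, the column names, and the brand_count name are assumptions for illustration; the original listing may have differed in detail.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Nissin"],
    "Stars": [5.0, 4.5, 3.5],
})
df["pointer"] = 1

# New data frame with the pointer and the thing being counted.
brand_count = df[["Brand", "pointer"]].copy()

# Group by brand and sum the pointer: every row now carries its brand's total.
brand_count["brand_count"] = (
    brand_count.groupby("Brand")["pointer"].transform("sum")
)

# Keep one row per brand and drop the pointer.
brand_count = brand_count.drop_duplicates("Brand").drop(columns="pointer")

# Left merge: every review in the main frame gets its brand's count.
df = df.merge(brand_count, on="Brand", how="left")

print(df)
```

Using transform("sum") keeps one row per review, which is why the duplicates are dropped before the merge.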
We then calculate the total number of stars per brand:
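The same pattern, summing the ratings instead of the pointer, might look like this. The brand_stars name and the sample rows are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Nissin"],
    "Stars": [5.0, 4.5, 3.5],
})

# Same steps as the count, but the column being summed is Stars.
brand_stars = df[["Brand", "Stars"]].copy()
brand_stars["brand_stars"] = (
    brand_stars.groupby("Brand")["Stars"].transform("sum")
)
brand_stars = brand_stars.drop_duplicates("Brand").drop(columns="Stars")

# Left merge the totals back onto the main frame.
df = df.merge(brand_stars, on="Brand", how="left")

print(df)
```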
We then define a basic average function and apply it to the main data frame, getting an average for every brand:
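A sketch of that step, assuming the count and total columns are named brand_count and brand_stars (both hypothetical names):

```python
import pandas as pd

# Made-up frame that already carries the per-brand count and star total.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Nissin"],
    "Stars": [5.0, 4.5, 3.5],
    "brand_count": [2, 2, 1],
    "brand_stars": [9.5, 9.5, 3.5],
})


def average(row):
    """Total stars divided by the number of reviews for this row's brand."""
    return row["brand_stars"] / row["brand_count"]


# axis=1 passes each row to the function.
df["brand_average"] = df.apply(average, axis=1)

print(df)
```

Dividing the two columns directly, df["brand_stars"] / df["brand_count"], gives the same result; the explicit function just keeps the step easy to read.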
We now have a brand average, and we want to see which brand has the highest one. However, there might be some brands that only have one review (I checked, and that is actually the case). Therefore, we exclude any brands that have a count below 50, sort the rest in descending order of average, and have a look at the top 5 brands using the head() function:
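The filter, sort, and head() steps could be sketched like this. The brand names and numbers are placeholders, not values from the dataset, and the column names are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only.
df = pd.DataFrame({
    "Brand": ["Brand A", "Brand A", "Brand B", "Brand C"],
    "brand_count": [120, 120, 60, 1],
    "brand_average": [4.1, 4.1, 3.8, 5.0],
})

MIN_REVIEWS = 50  # brands with fewer reviews than this are excluded

top_brands = (
    df[df["brand_count"] >= MIN_REVIEWS]
    .drop_duplicates("Brand")  # one row per brand
    .sort_values("brand_average", ascending=False)
)

print(top_brands[["Brand", "brand_count", "brand_average"]].head())
```

In this toy frame, Brand C has a perfect average from a single review, which is exactly the kind of row the threshold is meant to remove.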
We can see that “Indomie” has the highest average rating. So looking at them is a good start. We can do the same process for the style of ramen:
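Applied to the style, the same process might look like the sketch below. Only the grouping column and the output names change; the rows and names are again assumptions.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Style": ["Pack", "Pack", "Cup", "Can"],
    "Stars": [4.0, 5.0, 3.0, 3.5],
})
df["pointer"] = 1

# Count and total stars per style, using the same steps as for the brand.
style_totals = df[["Style", "pointer", "Stars"]].copy()
style_totals["style_count"] = (
    style_totals.groupby("Style")["pointer"].transform("sum")
)
style_totals["style_stars"] = (
    style_totals.groupby("Style")["Stars"].transform("sum")
)
style_totals = (
    style_totals.drop_duplicates("Style").drop(columns=["pointer", "Stars"])
)

df = df.merge(style_totals, on="Style", how="left")
df["style_average"] = df["style_stars"] / df["style_count"]

print(df)
```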
We then make sure we’re not taking into account the styles with just one review (again, I checked, and there were two of them! Pringles, with the style of a can, somehow made it into the dataset!).
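A sketch of that filter, with placeholder numbers and assumed column names:

```python
import pandas as pd

# Placeholder values for illustration only.
df = pd.DataFrame({
    "Style": ["Pack", "Pack", "Cup", "Cup", "Can"],
    "style_count": [2, 2, 2, 2, 1],
    "style_average": [4.5, 4.5, 3.0, 3.0, 3.5],
})

# Keep only styles with more than one review, one row per style.
style_ranking = (
    df[df["style_count"] > 1]
    .drop_duplicates("Style")
    .sort_values("style_average", ascending=False)
)

print(style_ranking[["Style", "style_count", "style_average"]])
```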
We can see that pack is the most successful style of ramen. With a few lines of code, we can look at our prospective investment and see what styles the brand is making:
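One way to do this, assuming the same column names, is to filter on the brand and count its styles. The rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Brand": ["Indomie", "Indomie", "Indomie", "Nissin"],
    "Style": ["Pack", "Pack", "Cup", "Cup"],
})

# Number of reviews per style for a single brand.
indomie_styles = df.loc[df["Brand"] == "Indomie", "Style"].value_counts()

print(indomie_styles)
```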
Hmm, it looks like Indomie is already mainly focusing on the style that has the highest average rating. Country is also in the dataset. You can look at the average rating by country using the same method, and see where Indomie could expand to.
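If you do want a function for the country, a sketch could look like the one below. The helper name add_group_average is hypothetical, and it uses groupby(...).agg as a shortcut instead of the pointer column, so it is a compact variation rather than the exact method used above.

```python
import pandas as pd


def add_group_average(frame, key, value="Stars"):
    """Add <key>_count, <key>_stars and <key>_average columns to frame.

    Hypothetical helper: a compact variation on the pointer method.
    """
    prefix = key.lower()
    totals = frame.groupby(key)[value].agg(["count", "sum"]).reset_index()
    totals.columns = [key, f"{prefix}_count", f"{prefix}_stars"]
    totals[f"{prefix}_average"] = (
        totals[f"{prefix}_stars"] / totals[f"{prefix}_count"]
    )
    # Left merge so every review keeps its row and gains the group columns.
    return frame.merge(totals, on=key, how="left")


# Made-up rows for illustration; the column names are assumptions.
df = pd.DataFrame({
    "Country": ["Indonesia", "Indonesia", "Japan"],
    "Stars": [5.0, 4.0, 3.5],
})

df = add_group_average(df, "Country")
print(df)
```

The same call with "Brand" or "Style" as the key would reproduce the earlier columns.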
I help clinicians get to grips with coding and tech. I also code for a financial tech firm.