The Kaggle Data Science portal is a respected website for anyone interested in deepening their understanding of the subject. It also hosts one of the best and most interesting collections of problems, competitions and resources for practicing Data Science and Machine Learning. The site offers varied competitions, and for anyone equipped with a good dataset and a working framework, the only limit to success is the need to sharpen the required skills. Consequently, it is one of the best resources for acquiring Data Science and Machine Learning skills from a pragmatic perspective.
Today I decided to reproduce an interview from the Kaggle Blog about a competition addressing business value, the Red Hat Business Value competition, won by Darius Barušauskas. Data Science, when done properly, has the potential to bring considerable business value to almost any enterprise. The enthusiastic practitioner, or anyone with a passion for the subject, knows this. But who is in a better position to tell us about it than someone who has won a Kaggle competition, and precisely one addressing business value? Hardly anyone other than Darius Barušauskas. Highlights:
What was your background prior to entering this challenge?
I have been on Kaggle for a year now and it has been a very exciting time of my life. In my years working in data analytics I have obtained many useful data mining and ML skills, which have flourished in the Kaggle competitions I’ve participated in.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
The problem itself was not new to me – I have made several new clients’ potential detection models in my work; they were designed differently compared to Red Hat’s problem, but such experience helped to make useful feature transformations in this competition.
What made you decide to enter this competition?
I aimed for a solo gold medal to achieve my Grandmaster’s title – it took me only a year! I am very happy that I decided to dedicate all my spare time to this and that I was able to make my goals come true: I got my top 10 overall rank, a nice win and a hefty reward.
This competition was a tight race. How did you approach it differently from past competitions?
I have always preferred working in a team. As this was a dedicated solo run, there were times when it was hard to concentrate and easy to procrastinate, and I had to look for moral support from my Kaggle friends. Thank you, guys!
Let’s get technical
What was your most important insight into the data?
The presence of leakage transformed the original problem into two sub-problems, which I tackled simultaneously:
a) Interpolating outcome values for companies with some leakage information
b) Predicting outcome values for companies not affected by leakage
I chose to turn the leakage into several features so that my ML models could directly predict the points in time where the value changes, in contrast to the many competitors who used ad hoc rules.
The data itself presented several ways to tackle the problem, given Red Hat’s client company-user-activity relation. I chose a top-down modelling approach: create robust company-level models first, then incorporate them into activity-level models using information about each company’s users.
The main principle of my company-level models was to take the first observation in time as a reference point for each company, then aggregate the activities having the same outcome value and build ML models on that subset of the data (with similar model versions taking the last observation as the reference point). Having robust predictions for the first and last observations translated well into capturing if and when a company’s value changed over time. These models were critical for my solution to work, so I dedicated 90% of my time to them.
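As a rough illustration of what turning leakage into features could look like: the interview does not list the actual features, so everything below is an assumption. The sketch supposes that a company has known outcomes on some days and builds features for an unlabelled day from its nearest labelled neighbours in time. The function name `leak_features` and all feature names are hypothetical.

```python
from bisect import bisect_left
from datetime import date


def leak_features(known, query_day):
    """Build neighbour-based features for one unlabelled day of one company.

    known: list of (day, outcome) pairs sorted by day (hypothetical input shape).
    query_day: the day we want features for.
    """
    days = [d for d, _ in known]
    i = bisect_left(days, query_day)
    prev = known[i - 1] if i > 0 else None       # nearest labelled day before
    nxt = known[i] if i < len(known) else None   # nearest labelled day at or after
    return {
        "prev_outcome": prev[1] if prev else None,
        "next_outcome": nxt[1] if nxt else None,
        "days_since_prev": (query_day - prev[0]).days if prev else None,
        "days_to_next": (nxt[0] - query_day).days if nxt else None,
        # When both neighbours agree, interpolation is probably safe;
        # when they differ, the value changed somewhere in between.
        "neighbours_agree": (prev[1] == nxt[1]) if prev and nxt else None,
    }


known = [(date(2022, 1, 1), 1), (date(2022, 1, 11), 0)]
features = leak_features(known, date(2022, 1, 4))
```

A model fed with features of this kind can learn where between two labelled days the outcome is likely to flip, instead of relying on a fixed rule such as "copy the nearest neighbour".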
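One possible reading of the company-level aggregation, as a minimal sketch: take each company’s first activity as the reference point and aggregate the initial run of activities that share its outcome. The interview does not say whether "same outcome" means this initial run or all matching activities, so that choice is an assumption, as are the field names and the function `first_segment_rows`.

```python
from collections import defaultdict


def first_segment_rows(activities):
    """Return one training row per company, anchored at its first observation.

    activities: iterable of dicts with keys "company", "day", "outcome"
    (a hypothetical input shape; "day" only needs to be sortable).
    """
    by_company = defaultdict(list)
    for act in activities:
        by_company[act["company"]].append(act)

    rows = []
    for company, acts in by_company.items():
        acts = sorted(acts, key=lambda a: a["day"])
        reference = acts[0]["outcome"]  # outcome at the first observation
        run = []
        for act in acts:
            if act["outcome"] != reference:
                break  # the value changed; stop aggregating here
            run.append(act)
        rows.append({
            "company": company,
            "first_outcome": reference,
            "n_activities": len(run),
            "first_day": run[0]["day"],
            "last_day_same": run[-1]["day"],
            "changed_later": len(run) < len(acts),
        })
    return rows


activities = [
    {"company": "A", "day": 1, "outcome": 1},
    {"company": "A", "day": 2, "outcome": 1},
    {"company": "A", "day": 3, "outcome": 0},
    {"company": "B", "day": 1, "outcome": 0},
]
rows = {r["company"]: r for r in first_segment_rows(activities)}
```

Sorting in descending order of day would give the analogous version that uses the last observation as the reference point.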
What preprocessing and supervised learning methods did you use?
My solution had a simple 4+2 model structure: 4 company-level XGBoost models incorporated into 2 activity-level XGBoost models. The first activity-level model was CV optimized (and had very poor public LB performance), and the other was selected for giving the best public LB score; a combination of these strategies provided a huge score uplift in my final submission.
Other methods did not work as well as XGBoost. Given the presence of the leak, I did not want my solution to be complicated, so I just stuck with XGBoost. Microsoft’s brand new LightGBM would have produced even better results, so if the competition had been held a month or two later, I would probably have preferred LightGBM.
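The two-level structure and the final combination can be sketched roughly as follows. This is not the winning code: the interview does not say how company-level predictions were fed into the activity-level models or how the two activity-level models were combined. The sketch assumes the company scores are joined onto activity rows as extra features and that the final predictions are a weighted average. `attach_company_scores`, `blend` and `weight_cv` are hypothetical names.

```python
def attach_company_scores(activity_rows, company_scores):
    """Add every company-level model score to each activity row of that company."""
    enriched = []
    for row in activity_rows:
        scores = company_scores.get(row["company"], {})
        extra = {f"company_{name}": score for name, score in scores.items()}
        enriched.append({**row, **extra})
    return enriched


def blend(p_cv, p_lb, weight_cv=0.5):
    """Weighted average of a CV-optimized model and an LB-selected model."""
    if len(p_cv) != len(p_lb):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= weight_cv <= 1.0:
        raise ValueError("weight_cv must be between 0 and 1")
    return [weight_cv * a + (1.0 - weight_cv) * b for a, b in zip(p_cv, p_lb)]


activity_rows = [{"company": "A", "activity_type": 3}]
company_scores = {"A": {"first": 0.9, "last": 0.2}}
enriched = attach_company_scores(activity_rows, company_scores)
blended = blend([0.2, 0.8], [0.4, 0.6])
```

In a real pipeline the company scores would be out-of-fold predictions, so that the activity-level models never see a score produced by a model trained on the same row.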
Words of wisdom:
What have you taken away from this competition?
- Leave no stone unturned when it comes to testing silly ideas.
- A combination of cross-validation and public LB overfitting can yield surprisingly good results. I did not expect that.
- Competing solo at high ranks is very tough.
Do you have any advice for those just getting started in data science?
- Try running simple Kaggle kernels written by others and try to understand what is going on. Asking questions and receiving answers is the fastest way to know how things are done.
- Try to acquire technical skills first – try as many methods as you can, create your own code templates for running and making predictions on any given dataset.
- Learn how to do proper cross-validation and understand why it is important.
- Don’t let XGBoost be the only tool in your toolbox.
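On the cross-validation advice above: one aspect that matters for data like this, where many rows belong to the same company, is keeping each company entirely inside either the training or the validation fold. The interview does not describe the validation scheme actually used, so the following is only a minimal standard-library sketch of grouped k-fold splitting; `group_kfold` is a hypothetical helper, not a library function.

```python
def group_kfold(groups, n_splits=5):
    """Yield (train_idx, valid_idx) pairs so that no group spans both sides.

    groups: one group label per row, e.g. the company each activity belongs to.
    """
    unique = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    # Assign groups to folds round-robin; a real setup might balance fold sizes.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        valid = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, valid


groups = ["a", "a", "b", "c", "c", "d"]
folds = list(group_kfold(groups, n_splits=2))
```

If rows of the same company landed on both sides of a split, the validation score would look better than the score on genuinely unseen companies.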
Just for fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
Given a list of personal names and login names, predict any risk-related event using only external public internet data, especially social networks.
What is your dream job?
Developing data science models to improve the quality of everyone’s daily life.
Darius Barušauskas has a BSc and an MSc in Econometrics (Vilnius University, Lithuania). He specializes in credit and other risk modelling (5+ years of experience) and has created many different models for the financial, telco and utilities sectors. R and SQL guru.
Featured Image: Predicting Red Hat Business Value