Challenge Summary *under construction*
Next step: JMLR Special issue on Large Scale Learning
Submission to the special issue is open to everyone, provided that you follow the challenge setting so that comparisons remain fair. See Data sets and performance measures.
Results of the Challenge were announced during the LSL Workshop at ICML 2008
Additional best student and best newcomer awards
The re-evaluation is still ongoing, but a leading group has emerged, which allows us to declare the winners of the challenge, each receiving prize money or a travel grant. As the best methods perform very similarly (almost identical scores), we do not rank the winners.
The overall score of each participant is computed as the average rank of their submission with respect to six scalar measures that aggregate predictive accuracy and computational effort across various dataset sizes. For more details see Assessment of the submissions and the discussion below.
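The rank-based aggregation can be sketched in a few lines. This is a minimal illustration, not the challenge's actual scoring pipeline: the function name, the submission names, and the measure values are all made up, and ties are broken by insertion order rather than shared.

```python
def overall_scores(measures):
    """measures: dict mapping submission name -> list of six scalar
    measures (lower is better). Returns each submission's average rank."""
    names = list(measures)
    n_measures = len(next(iter(measures.values())))
    ranks = {name: [] for name in names}
    for j in range(n_measures):
        # rank all submissions on the j-th measure (1 = best)
        order = sorted(names, key=lambda n: measures[n][j])
        for rank, name in enumerate(order, start=1):
            ranks[name].append(rank)
    return {name: sum(r) / len(r) for name, r in ranks.items()}

# Toy example: one accurate-but-slow and one fast-but-rough submission
# (first three measures: error-like, last three: time-like).
subs = {
    "fast-but-rough": [0.10, 0.09, 0.12, 1.0, 2.0, 1.5],
    "slow-but-sharp": [0.02, 0.02, 0.03, 9.0, 8.0, 7.0],
}
print(overall_scores(subs))  # both average to 1.5: accuracy/speed trade-off
```

Note how the two toy submissions end up with the same average rank: the scheme deliberately trades accuracy against computational effort.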
What, who, how...
Interestingly, most participants used linear algorithms, variants of L1/L2-SVM solvers:
- SGD-QN (Antoine Bordes)
- Newton (Olivier Chapelle)
- Dual coord. descent (Hsiang-Fu Yu)
- Triple Jump SVM (Han-Shen Huang and Chun-Nan Hsu)
- Interior point (Kristian Woodsend)
Among the non-linear methods are:
- Parallel decision tree (Yossi Richter)
- Averaging of selective naive Bayes (Marc Boulle)
- Averaging of RBF-kernel SVM (Jochen Garcke)
While non-linear methods usually achieve better predictive accuracy than linear ones, they are often significantly slower; all the more so because, given the low data dimensionality, 100k examples were often enough for linear algorithms to converge.
This table displays the best performing methods on the challenge datasets.
| Dataset | Best known method | Best known score | Best submission score | Best submission |
|---------|-------------------|------------------|-----------------------|-----------------|
| alpha   | quadratic | 0.009774 | 0.0854 | AV SVM - garcke |
| beta    | quadratic | 0.304563 | 0.4577 | AV SVM single - garcke |
| gamma   | semi-quad | 0.009418 | 0.0116 | Avg NB Classif. - MB |
| delta   | semi-quad | 0.071005 | 0.0801 | Avg NB Classif. - MB |
| epsilon | linear    | 0.033645 | 0.0341 | IPM SVM 2 - kristian |
| zeta    | linear    | 0.011361 | 0.0116 | SDM SVM L1 - yahoo |
| dna     | WD-SVM    | 0.4558   | 0.8030 | SDM SVM L2 - yahoo |
| ocr     | n.n.      | n.n.     | 0.1579 | SDM SVM L2 - yahoo |
| fd      | n.n.      | n.n.     | 0.2314 | Avg. NB Classif.CR - MB |
| webspam | n.n.      | n.n.     | 0.0004 | SDM SVM L2 - yahoo |
Generally, people used the (simple-minded) data representation we suggested in our svmlight-format convert.py script. No one even tried quadratic features. Thus there is often a huge gap between what is achievable and what was achieved (as we could see on the artificial datasets, where we knew the optimal Bayes classifier).
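For illustration, the untried explicit quadratic feature expansion takes only a few lines. This is a hypothetical dense-vector sketch (the challenge data were distributed in sparse svmlight format, which would need a sparse variant); it lets a purely linear solver fit a quadratic decision boundary.

```python
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Append all degree-2 products x[i]*x[j] (i <= j) to a dense
    feature vector, so a linear model becomes quadratic in x."""
    return list(x) + [x[i] * x[j]
                      for i, j in combinations_with_replacement(range(len(x)), 2)]

print(quadratic_features([2.0, 3.0]))  # → [2.0, 3.0, 4.0, 6.0, 9.0]
```

The expanded dimension grows as d(d+3)/2, which is exactly why this is only attractive on the low-dimensional datasets.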
The programming languages used are C/C++, Java, and Matlab. The Wild track and the Alpha synthetic dataset have been the most attractive, with respectively 263 entries and 72 submissions. The most active participants so far are Kristian Woodsend (Interior Point SVM: linear, parallel, wild track), Antoine Bordes (SGD-QN, LaRank), and Olivier Chapelle and Sathiya Keerthi (SDM-L1/L2, Newton). The Parallel track unfortunately did not attract much interest (see below).
Workshop and Discussion
The Pascal workshop on Large Scale Learning took place after ICML 2008, in Helsinki. All 9 talks and the discussion have been recorded and will be available at Pascal video lectures. The discussion focused on two main points: aggregating the scores and what next.
Scores, criteria, cheating...
The overall results were summarized by three figures: time vs. aoPRC; dataset size vs. training time; and dataset size vs. aoPRC. These reflect the main questions of the end-user: how long does it take to reach a given performance; how does the computational effort increase with dataset size; and how much data is needed to reach a given accuracy. The second figure (dataset size vs. training time) is used to compute the empirical computational complexity, which raises a question: smart stopping criteria could be used to reach a sub-linear complexity... is this fair?
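The empirical computational complexity mentioned above is typically estimated as the slope of a least-squares fit in log-log space: if time ≈ c · size^a, the exponent a falls out of a linear regression on the logarithms. A small sketch with made-up timings (the function name and data are illustrative, not from the challenge):

```python
import math

def complexity_exponent(sizes, times):
    """Fit time ~ c * size**a by least squares in log-log space
    and return the exponent a (the empirical complexity)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den  # slope of the log-log regression line

# Training time that grows tenfold when the dataset grows tenfold
# gives an exponent of ~1, i.e. empirically linear complexity.
sizes = [10_000, 100_000, 1_000_000]
times = [0.5, 5.0, 50.0]
print(round(complexity_exponent(sizes, times), 3))  # → 1.0
```

An algorithm with a smart stopping criterion would show an exponent below 1 here, which is exactly the fairness question raised above.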
The current aggregation of the ranks is detrimental to methods that are accurate but slow; worse, the time needed to tune the algorithm parameters is not accounted for, which is especially unfair to parameterless methods (e.g., Bayesian ones)... How should the model-selection cost be accounted for?
Worst of all, by submitting a fast but useless method, one could scramble the ranks... and destroy the ordering. Clearly, algorithm assessment is a multi-objective problem, and there is no way to come up with a single fair total order. It is best to look at the plots and find your optimal trade-off, given the amount of data and the computational budget you have.
Everybody seemed interested in having a second round, even though, as could have been expected, nobody was willing to take it on. Shall we? We don't know yet... On one hand, the infrastructure is up and running; on the other hand, it is difficult to obtain LARGE public datasets and to adjust the criteria accordingly...
Some points raised during the discussion:
- LSL *must* go parallel; reserve a few nodes of the PASCAL cluster for the challenge and consider fixed time budgets of, e.g., 10, 100, and 1000 seconds;
- People might submit code to be run on the PASCAL cluster (no ambiguity about the computational effort; no need to distribute huge datasets...);
- The cost of model selection must definitely be accounted for;
- The aggregation would proceed by normalizing each of the six current scores and averaging the normalized scores (preventing the scrambling-based cheating...).
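The proposed normalization-based aggregation might look like the sketch below. This assumes min-max normalization per measure; the exact scheme was left open at the workshop, and all names and values here are made up.

```python
def normalized_scores(measures):
    """measures: dict mapping submission name -> list of scalar measures
    (lower is better). Returns each submission's average normalized score
    in [0, 1], using min-max normalization per measure."""
    names = list(measures)
    n_measures = len(next(iter(measures.values())))
    out = {name: 0.0 for name in names}
    for j in range(n_measures):
        col = [measures[name][j] for name in names]
        lo, hi = min(col), max(col)
        for name in names:
            # a constant measure contributes 0 for everyone
            out[name] += (measures[name][j] - lo) / (hi - lo) if hi > lo else 0.0
    return {name: s / n_measures for name, s in out.items()}

# Toy example: accuracy-like first measure, time-like second measure.
subs = {"slow-accurate": [0.01, 900.0], "fast-rough": [0.08, 5.0]}
print(normalized_scores(subs))  # both 0.5
```

Unlike average ranks, these scores depend on the actual measure values, so a deliberately bad extra submission can compress the scale but cannot reorder the serious contenders among themselves the way rank shifts can.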
Help and ideas are welcome. Do you want to help organize a LS-very-L Challenge in 2009-2010? Please contact us.
How the LSL challenge evolved
After an initial proposal was presented at the NIPS*07 large scale learning workshop, where it received little encouragement, a more polished (and less ambitious...) version was presented during the PASCAL meeting in Bled in January 2008; the challenge started at the end of February.
Then we had a seemingly endless period without any submissions at all, followed by a number of submissions that performed worse than random... until, at last, Olivier Chapelle managed to beat Vojtech's baseline SVM.
In the following period, a race went on between Olivier and Kristian Woodsend with his incredibly fast Interior Point SVM. In the end, Olivier won the race with his Newton SVM, which remained top-ranked until the end of the challenge.
A long time passed before other submissions appeared. Marc Boulle's Bayesian submission ranked high in accuracy but received a low overall rank due to its computational effort...
At that point, a critical post from John Langford drew some more attention to the challenge (thanks, John), and we accordingly introduced the parallel track, which in the end interested very few people (!?).
Finally, the number of submissions totaled 44, with 49 participants registered.