With the exceptional increase in computing power, storage capacity, and network bandwidth over the past decades, ever-growing datasets are collected in fields such as bioinformatics (splice sites, gene boundaries, etc.), IT security (network traffic), or text classification (spam vs. non-spam), to name but a few. While this growth in data size makes computational methods the only viable way of dealing with the data, it also poses new challenges to machine learning (ML) methods.
This PASCAL Challenge is concerned with the scalability and efficiency of existing ML approaches with respect to computational, memory, or communication resources, e.g. constraints resulting from high algorithmic complexity, from the size or dimensionality of the dataset, or from the trade-off between distributed computation and communication costs.
Many comparisons are indeed presented in the literature; however, these usually assess only a few algorithms on a few datasets, and they typically involve different evaluation criteria, model parameters, and stopping conditions. As a result it is difficult to determine how a method behaves and compares with others in terms of test error, training time, and memory requirements, which are the practically relevant criteria.
We are therefore organizing a competition designed to be fair and to enable a direct comparison of current large-scale classifiers, aimed at answering the question "Which learning method is the most accurate given limited resources?" To this end we provide a generic evaluation framework tailored to the specifics of the competing methods. We also provide a wide range of datasets, each with specific properties, and propose to evaluate the methods via performance figures displaying training time vs. test error, dataset size vs. test error, and dataset size vs. training time.
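To illustrate the kind of evaluation described above, the following is a minimal sketch of how per-run results could be turned into the three performance curves. The record format and function name are assumptions for illustration only, not the challenge's actual evaluation tooling.

```python
# Hypothetical sketch: derive the three challenge curves
# (training time vs. test error, dataset size vs. test error,
# dataset size vs. training time) from per-run measurements.

def performance_curves(runs):
    """runs: list of dicts with keys 'size', 'time', 'error'.
    Returns the three (x, y) series, each sorted by its x-axis value."""
    return {
        "time_vs_error": sorted((r["time"], r["error"]) for r in runs),
        "size_vs_error": sorted((r["size"], r["error"]) for r in runs),
        "size_vs_time": sorted((r["size"], r["time"]) for r in runs),
    }

# Example: one method trained on three (hypothetical) dataset sizes.
runs = [
    {"size": 10_000, "time": 2.0, "error": 0.12},
    {"size": 100_000, "time": 25.0, "error": 0.08},
    {"size": 1_000_000, "time": 310.0, "error": 0.05},
]
curves = performance_curves(runs)
```

Each series can then be plotted directly; comparing such curves across methods makes the accuracy-vs-resources trade-off visible at a glance.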
Call for Participation
For questions feel free to contact Soeren.Sonnenburg at ml.tu-berlin.de.
The LSL challenge will take place from February to June 2008. Final results will be presented in an ICML'08 workshop (Helsinki, July 9th).
All participants are required to provide the executable or source code of their method so that the challenge organizers can re-run it under the same computing environment (single-CPU Linux machine, 32 GB RAM).
All participants will provide an abstract (up to 4 pages) describing their method. Extended versions of these submissions will be reviewed for publication in a JMLR special topic on large scale learning.
- 27 February 2008: Challenge announcement
- 12 June 2008: Test sets available
- 26 June 2008: End of the competition (test set submissions, abstract, and program for re-evaluation due)
- 27 June - 8 July 2008: Re-calibration of all entries
- 9 July 2008: ICML'08 Large Scale Learning workshop
- 10 July 2008 - ...: Re-run of the top ten methods
- Soeren Sonnenburg, TU Berlin, Berlin, Germany
- Vojtech Franc, Czech Technical University, Prague, Czech Republic
- Elad Yom-Tov, IBM Haifa Research Lab, Haifa, Israel
- Michele Sebag, LRI, Orsay, France
The organizers gratefully acknowledge support from the PASCAL Network of Excellence (http://www.pascal-network.org/). Furthermore, we thank Konrad Rieck, Marc Toussaint, Klaus-Robert Mueller, Gunnar Raetsch, Alexander Zien, and the participants of the NIPS Workshop on Efficient Machine Learning for datasets, comments, discussions, and moral support.