Driven by cheap commodity storage, fast data networks, rich structured models, and the increasing desire to catalog and share our collective experiences in real-time, the scale of many important learning problems has grown well beyond the capacity of traditional sequential systems. These “Big Learning” problems arise in many domains including bioinformatics, astronomy, recommendation systems, social networks, computer vision, web search and online advertising.
Simultaneously, parallelism has emerged as a dominant computational paradigm in devices ranging from energy-efficient mobile processors, to desktop supercomputers in the form of GPUs, to massively scalable cloud computing services. The Big Learning setting has attracted intense interest across industry and academia, with active research spanning diverse fields ranging from machine learning and databases to large-scale distributed systems and programming languages. However, because the Big Learning setting is being studied by experts from these various communities, there is a need for a common venue to discuss recent progress, to identify pressing new challenges, and to exchange new ideas.
This workshop aims to:
* Bring together parallel and distributed system builders in industry and academia, machine learning experts, and end users to identify the key challenges, opportunities, and myths of Big Learning. What REALLY changes from the traditional learning setting when faced with terabytes or petabytes of data?
* Solicit practical case studies, demos, benchmarks, lessons-learned presentations, and position papers.
* Showcase recent and ongoing progress toward parallel ML algorithms.
* Provide a forum for exchange regarding tools, software, and systems that address the Big Learning problem.
* Educate the researchers and practitioners across communities on state-of-the-art solutions and their limitations, particularly focusing on key criteria for selecting task- and domain-appropriate platforms and algorithms.
Focal points for discussions and solicited submissions include but are not limited to:
1. Case studies of practical applications that operate on large data sets or computationally intensive models; typical data and workflow patterns; machine learning challenges and lessons learned.
2. Insights about the end users for large-scale learning: who are they, what are their needs, what expertise is required of them?
3. Common data characteristics: is it more typical for data to appear in streams or in batches? What are the applications that demand online or real-time learning, and how can the engineering challenges for deploying autonomously adaptive systems be overcome? Which analytic and learning problems are more appropriate for (or even require) analysis in the cloud, and when is “desktop” learning on sub-sampled or compressed data sufficient?
4. Choices in data storage and management, e.g., trade-offs between classical RDBMS and NoSQL platforms from data analysis and machine learning perspectives.
5. The feasibility of alternative structured data stores: object databases, graph databases, and streams.
6. Suitability of different distributed system platforms and programming paradigms: Hadoop, DryadLINQ, EC2, Azure, etc.
7. Applicability of different learning and analysis techniques: when are prediction models that require large-scale training needed, and when does simpler data analysis (e.g., summary statistics) suffice?
8. Computationally intensive learning and inference: Big Learning doesn't just mean Big Data; it can also mean massive models or structured prediction tasks.
9. Labeling and supervision: scenarios for large-scale label availability and appropriate learning approaches; making use of diverse labeling strategies (curated vs. noisy/crowd-sourced/feedback-based labeling).
10. Real-world deployment issues: initial prototyping requires quickly-implemented-and-expandable solutions, along with the ability to easily incorporate new features/data sources.
11. Practicality of high-performance hardware for large-scale learning (e.g., GPUs, FPGAs, ASICs). GPUs vs. CPUs: programming strategies, performance opportunities, and trade-offs.
12. Unifying the disparate data structures and software libraries that have emerged in the GP-GPU community.
13. Evaluation methodology and trade-offs between machine learning metrics (predictive accuracy), computational performance (throughput, latency, speedup), and engineering complexity and cost.
14. Principled methods for dealing with huge numbers of features. As the number of data points grows, often so does the number of features, along with their dependence structure. Does Big Learning require, for example, better methods for multiple hypothesis testing than FDR control?
15. Determining when an answer is good enough. How can we efficiently estimate confidence intervals over Big Data?
The target audience includes industry and academic researchers from the various subfields relevant to large-scale machine learning, with a strong bias toward either position talks that aim to induce discussion or accessible overviews of the state of the art. We will solicit submissions in the form of short, long, and position papers as well as demo proposals. Papers that focus on emerging applications or deployment case studies will be particularly encouraged, and demos of operational toolkits and platforms will be considered for inclusion in the primary program of the workshop.