Feature discovery is a challenging aspect of data science and knowledge discovery. An online interactive space where data scientists can build on each other’s feature ideas can significantly simplify and expedite the process. Feature Factory is an online platform where ALFA@CSAIL presents a prediction problem for which features are sought. For each prediction problem, the group provides downloadable mock data so users can test their scripts before submitting them. Feature Factory seeks three kinds of contributions: ideas for new features, feature extraction code, and comments on existing ones.
Once you submit feature extraction code, it is validated on our online mock dataset and you are notified of the result immediately. After validation, our team will execute the code on the real dataset to generate the feature and insert it into a number of machine learning models, including decision trees, neural networks, support vector machines, logistic regression, Gaussian processes, and time series models. The submitted features are then ranked against one another.
Current Focus Problem: Predict Student Stopouts on Massive Open Online Courses
In this problem, our goal is to predict when a student will stop engaging with the course. A student is assumed to have stopped out of a course when he or she stops attempting problems or homework. We have data captured from students' online behavior, which includes clickstream data, online forum interactions, and problem submissions. We have a comprehensive data schema, called MOOCdb, which captures student activity data on a MOOC platform. The data schema is documented here. A small mock dataset that follows the data schema can be downloaded in two formats: sql or csv.
We solicit participants for three distinct activities:
1. Propose a new feature by clicking on Add an idea
An example of a possible feature for this problem is:
Amount of time student spent on the course
Below you can see a number of features already developed and extracted.
2. Write an SQL script for your idea or for an already existing idea
Below you can see a list of feature ideas. For some of them, extraction has not yet been performed.
3. Comment on an existing idea
Ideas develop when they are refined, so please feel free to comment on or like the existing ideas and scripts.
Total number of weekly interactions for each student = # of observations + # of submissions + # of collaborations
The feature is whether the current week's total number of interactions is significantly less (2 sigma?) than the average of the previous weeks.
Theory: if students fall behind for some reason, they are likely to give up.
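The weekly-interaction idea above can be sketched as follows. This is only an illustration in Python, not a submission-ready script (the site asks for SQL against the MOOCdb schema). It assumes the per-week counts of observations, submissions, and collaborations have already been extracted for one student; the function names and the `n_sigma` parameter are hypothetical, and the 2-sigma threshold is simply the value floated in the idea.

```python
from statistics import mean, stdev


def weekly_totals(observations, submissions, collaborations):
    """Add the three per-week counts into one interaction total per week."""
    return [o + s + c for o, s, c in zip(observations, submissions, collaborations)]


def activity_drop_flag(totals, n_sigma=2.0):
    """Return True if the most recent week's total is more than n_sigma
    sample standard deviations below the mean of the earlier weeks.

    `totals` is a list of weekly interaction totals for one student,
    oldest week first. The threshold n_sigma is an assumption.
    """
    *previous, current = totals
    if len(previous) < 2:
        # Not enough history to estimate a spread, so do not flag.
        return False
    mu = mean(previous)
    sigma = stdev(previous)
    # If every earlier week is identical, sigma is 0 and any decrease is flagged.
    return current < mu - n_sigma * sigma
```

For example, `activity_drop_flag(weekly_totals([10, 12, 11, 2], [3, 4, 3, 0], [1, 0, 1, 0]))` compares the last week against the first three. With only a few weeks of history the spread estimate is noisy, so the flag becomes more meaningful later in the course.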
It looks like the people who are most likely to drop out are those who couldn't figure out the solutions to the questions online even after making multiple attempts, or those who had to make multiple attempts to get the correct answer. In the 'submit' tab of the Google spreadsheet, we can see that Emily and Jamie made multiple submissions (up to 9 attempts) and still did not get the correct answer. I think this tells us that these students are unable to find good resources easily.
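One possible way to turn the repeated-attempts observation into features is sketched below in Python for illustration. It assumes submissions are available as `(student, problem, is_correct)` tuples in chronological order. That layout, the function name, and the two feature names are hypothetical and not taken from the MOOCdb schema.

```python
from collections import defaultdict


def attempt_features(submissions):
    """Compute per-student attempt features.

    `submissions` is an iterable of (student, problem, is_correct) tuples,
    assumed to be in chronological order. Returns a dict mapping each
    student to:
      max_attempts      - most attempts made on any single problem, counted
                          up to and including the first correct answer
      unsolved_problems - number of attempted problems never answered correctly
    """
    attempts = defaultdict(int)
    solved = set()
    for student, problem, is_correct in submissions:
        key = (student, problem)
        if key in solved:
            continue  # ignore resubmissions after the first correct answer
        attempts[key] += 1
        if is_correct:
            solved.add(key)

    features = defaultdict(lambda: {"max_attempts": 0, "unsolved_problems": 0})
    for key, count in attempts.items():
        student = key[0]
        entry = features[student]
        entry["max_attempts"] = max(entry["max_attempts"], count)
        if key not in solved:
            entry["unsolved_problems"] += 1
    return dict(features)
```

Keeping the two numbers separate lets a model distinguish students who eventually succeed after many tries from those who never reach a correct answer.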
Number of Submissions: This feature on its own gives you a good idea of a user's activity. You could determine the likelihood of a person dropping out based on sharp increases or decreases in activity. However, the more weeks you analyze for an individual, the better your prediction will be.