Feature discovery is a challenging aspect of data science and knowledge discovery. Creating an online interactive space where data scientists can benefit from each other’s ideas on various features can significantly simplify and expedite the process. Feature Factory is an online platform where ALFA@CSAIL will present a prediction problem for which features are sought. For the prediction problem, the group will provide downloadable mock data so users can test their scripts before submitting them. Feature Factory seeks three kinds of contributions: ideas for new features, feature extraction code, and comments on existing ideas.

Once you submit feature extraction code, it is validated on our online mock dataset and you are notified of the result immediately. After validation, our team will execute the code on the real dataset to generate the feature and insert it into a number of machine learning models, including decision trees, neural networks, support vector machines, logistic regression, Gaussian processes, and time series models. The submitted features will then be ranked against one another.


Current Focus Problem: Predict Student Stopouts on Massive Open Online Courses

In this problem, our goal is to predict when a student will stop engaging with the course. A student is assumed to have stopped out of a course when s/he stops attempting problems or homework. We have data captured from students' online behavior, which includes click stream data, their online forum interactions, and their submissions for problems. We have a comprehensive data schema, called MOOCdb, that captures student activity data on a MOOC platform. The data schema is documented here. A small mock dataset in the form of the data schema can be downloaded in two formats: SQL or CSV.
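
The stopout definition above can be turned into weekly prediction labels. The following is a minimal sketch of one possible reading, not the labeling procedure used by the group: a student counts as stopped out in every week after the last week in which s/he submitted anything. The function names, the 7-day week boundaries, and the assumption that all timestamps fall on or after the course start are illustrative choices that do not come from this page.

```python
from datetime import datetime


def week_index(timestamp, course_start):
    """Zero-based course week containing `timestamp` (assumes 7-day weeks)."""
    return (timestamp - course_start).days // 7


def stopout_labels(submission_times, course_start, n_weeks):
    """Return one label per course week: 1 if the student has stopped out, else 0.

    Assumption: a student has stopped out in week w when there is no
    submission in week w or in any later week.
    """
    if submission_times:
        last_active = max(week_index(t, course_start) for t in submission_times)
    else:
        # A student who never submitted is treated as stopped out from week 0.
        last_active = -1
    return [1 if week > last_active else 0 for week in range(n_weeks)]


if __name__ == "__main__":
    start = datetime(2013, 1, 7)
    times = [datetime(2013, 1, 7), datetime(2013, 1, 15), datetime(2013, 1, 22)]
    print(stopout_labels(times, start, n_weeks=5))
```

Under this reading, a week with no submissions that is followed by a later submission is still labeled as active, because the student has not yet stopped for good.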

We solicit participants for three distinct activities:

1. Propose a new feature by clicking on Add an idea
An example of a possible feature for this problem is: amount of time a student spent on the course
Below you can see a number of features already developed and extracted.

2. Write an SQL script for your idea or for an already existing idea
Below you can see a list of feature ideas. For some of them, extraction has not yet been performed.

3. Comment on existing ideas
Ideas develop when they are refined, so please feel free to comment on or like the existing ideas and scripts.
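
To give a sense of what a feature extraction script for activity 2 might look like, here is a minimal sketch for the idea "Number of distinct problems attempted", computed per student and per week. It is wrapped in Python with an in-memory SQLite database so it can run on its own. The table name `submissions`, its three columns, the sample rows, and the SQLite date arithmetic are all assumptions made for illustration; consult the MOOCdb schema documentation and the mock dataset for the actual names and the SQL dialect expected by the platform.

```python
import sqlite3

# Hypothetical feature query: distinct problems attempted per student per week.
# Table and column names are placeholders, not confirmed MOOCdb names.
FEATURE_SQL = """
SELECT user_id,
       CAST((julianday(submission_timestamp) - julianday(:course_start)) / 7
            AS INTEGER) AS week_index,
       COUNT(DISTINCT problem_id) AS distinct_problems_attempted
FROM submissions
GROUP BY user_id, week_index
ORDER BY user_id, week_index
"""


def build_mock_db():
    """Create a tiny in-memory table with made-up rows for testing the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions ("
        "user_id INTEGER, problem_id INTEGER, submission_timestamp TEXT)"
    )
    rows = [
        (1, 10, "2013-01-07 09:00:00"),
        (1, 10, "2013-01-08 09:30:00"),  # second attempt at the same problem
        (1, 11, "2013-01-09 10:00:00"),
        (1, 12, "2013-01-15 12:00:00"),  # falls in the second week
        (2, 10, "2013-01-10 08:00:00"),
    ]
    conn.executemany("INSERT INTO submissions VALUES (?, ?, ?)", rows)
    return conn


def extract_feature(conn, course_start):
    """Run the feature query and return (user_id, week_index, value) rows."""
    return conn.execute(FEATURE_SQL, {"course_start": course_start}).fetchall()


if __name__ == "__main__":
    for row in extract_feature(build_mock_db(), "2013-01-07 00:00:00"):
        print(row)
```

Running a script against a small hand-built table like this, where the expected counts are easy to verify by eye, is a cheap check before trying it on the downloadable mock dataset.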




Existing ideas and scripts


Total time spent on each resource during the week
by Franck code comment 0 | like 1
Average time (in days) the student takes to react when a new resource is posted. This is intended to capture how fast a student reacts to new content.
by Josep Marc Mingot code comment 1 | like 1
Average time between problem submission time and problem due date
by Rob Miller code comment 0 | like 1
Number of forum posts
by Franck code comment 0 | like 0
Number of Wiki edits by week
by Franck code comment 0 | like 0
Average length of forum posts
by Franck code comment 0 | like 0
Number of distinct problems attempted
by Franck code comment 0 | like 0
Number of distinct problems correct
by Franck code comment 0 | like 0
Average number of attempts to problems
by Franck code comment 0 | like 0
Total time spent on all resources during the week per number of correct problems
by Franck code comment 0 | like 0
Duration of longest observed event
by Franck code comment 0 | like 0
Standard deviation of the hours at which the user produces events and collaborations. This is intended to capture how regular a student is in her schedule while doing a MOOC.
by Josep Marc Mingot code comment 0 | like 0
Total number of weekly interactions for each student = # of observations + # of submissions + # of collaborations. The feature is whether the current week's total number of interactions is significantly less (2 sigma?) than the average of previous weeks. Theory: if students fall behind for some reason, they are likely to give up.
by Chris Terman code comment 0 | like 0
Number of questions posted with no response.
by Chris Terman code comment 0 | like 0
It looks like the people who are most likely to drop out are those who couldn't figure out the solutions to the questions online even after making multiple attempts, or those who had to make multiple attempts in order to get the correct answer. In the 'submit' tab of the Google spreadsheet, we can see that Emily and Jamie made multiple submissions (up to 9 attempts) and still did not get the correct answer. I think this suggests that the students are unable to find good resources easily.
by Carolyn Chang code comment 0 | like 0
Number of submissions: this feature gives you a good idea of a user's activity on its own. You could determine the likelihood of a person dropping out based on sharp increases or decreases in activity. However, the more weeks you analyze for an individual, the better your prediction will be.
by Michelle Johnson code comment 0 | like 0
Total time spent watching videos and doing problems
by Megan O'Leary code comment 0 | like 0
Total time spent on the course in the previous week
by Rob Miller code comment 0 | like 0
Number of sessions (contiguous blocks of time) during the previous week when the user accesses resources
by Rob Miller code comment 1 | like 0
Average number of attempts per problem in the previous week
by Rob Miller code comment 0 | like 0
Number of distinct problems attempted in the previous week
by Rob Miller code comment 0 | like 0
Number of forum responses in the past week
by Rob Miller code comment 0 | like 0
Fraction of observed resource time spent on each day of the week (7 variables for Mon-Sun that add up to 1.0), computed both for the previous week alone and averaged over all weeks so far.
by Rob Miller code comment 0 | like 0
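
As an illustration of how one of the ideas above could be made concrete, the following sketch implements the weekly-interactions idea: it flags a week when the student's total interactions fall more than two standard deviations below the average of the preceding weeks. The function names, the use of the population standard deviation, and the rule that at least two earlier weeks are needed before flagging are assumptions added here; the original idea leaves these details open (note its "2 sigma?").

```python
from statistics import mean, pstdev


def weekly_interactions(observations, submissions, collaborations):
    """Sum the three per-week counts into one total per week."""
    return [o + s + c for o, s, c in zip(observations, submissions, collaborations)]


def interaction_drop_flags(weekly_totals, n_sigma=2.0, min_history=2):
    """Flag each week whose total is below mean - n_sigma * std of earlier weeks.

    Assumption: weeks with fewer than `min_history` earlier weeks are never
    flagged, since a standard deviation from one point is not informative.
    """
    flags = []
    for week, total in enumerate(weekly_totals):
        history = weekly_totals[:week]
        if len(history) < min_history:
            flags.append(False)
            continue
        threshold = mean(history) - n_sigma * pstdev(history)
        flags.append(total < threshold)
    return flags


if __name__ == "__main__":
    totals = weekly_interactions(
        observations=[15, 18, 16, 16, 4],
        submissions=[4, 5, 5, 4, 1],
        collaborations=[1, 1, 1, 1, 0],
    )
    print(totals)
    print(interaction_drop_flags(totals))
```

A student with perfectly constant activity has a standard deviation of zero, so any decrease at all would be flagged; a minimum absolute drop could be added if that proves too sensitive.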







© 2013 MIT CSAIL ALFA Lab