Data Science Final Project

Model Refreshing

We recently re-examined our fake detection models and found some problems, so in this post we refresh them.

The first problem concerns data preprocessing. Our original idea was that normalizing the numeric data to mean=0 and std=1 would make training an SVM with an RBF kernel easier. However, we mistakenly preprocessed the fake and true data separately, which made the two classes look much more different from each other than they really are and artificially improved the classification results. This time, we first tried preprocessing them together and then trained models without any preprocessing. After comparing the results, we abandoned preprocessing of the numeric data. We also changed the judge score as follows:

judge_score = (cross_valid_score * 0.7 + recall * 0.3)
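As a minimal sketch, the formula above can be written as a small Python function. The function and argument names simply mirror the formula; the value range of each input (0 to 1) is an assumption.

```python
def judge_score(cross_valid_score: float, recall: float) -> float:
    """Combine cross-validation score and recall with the 0.7 / 0.3 weights above.

    Both inputs are assumed to lie in [0, 1].
    """
    return cross_valid_score * 0.7 + recall * 0.3
```

Because the weights sum to 1, the result stays in the same range as its inputs, and recall can shift the score by at most 0.3.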

The relationship between class weight and judge score is shown in figure 1. We pick the best weight for each model based on its judge score, as shown in the following table.

Features      Model                 Best Weight   judge_score
numeric       SVM (RBF)             12            0.849
description   Logistic Regression   10            0.796
description   Linear SVM            10            0.802
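Picking the best weight amounts to a simple sweep: evaluate the judge score for each candidate class weight and keep the maximum. The sketch below assumes a hypothetical `evaluate` callable that trains a model with the given weight and returns its judge score; the toy scores are made-up placeholders, not our results.

```python
def best_class_weight(candidate_weights, evaluate):
    """Return (best_weight, best_score) over the candidate class weights.

    `evaluate` is a hypothetical callable: weight -> judge score.
    """
    scores = {w: evaluate(w) for w in candidate_weights}
    best_weight = max(scores, key=scores.get)
    return best_weight, scores[best_weight]


# Made-up placeholder scores, only to show the call pattern.
toy_scores = {1: 0.70, 5: 0.78, 10: 0.80, 12: 0.81, 15: 0.79}
weight, score = best_class_weight(toy_scores.keys(), toy_scores.get)
```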

Statistical Analysis

UFOs may appear for different lengths of time: some are transient, while others last for hours, if not days. Can we predict their duration from the weather conditions? To answer this question, we applied machine-learning algorithms to look for a relationship. Using each column of our weather data as an independent feature, we used DecisionTree and Naive Bayes to predict the outcome label, the duration. The results, however, show little connection between these features and the duration of a UFO appearance. The following figure gives a table of the accuracy of each feature against the label (duration), and the accuracy is mostly unsatisfactory. Based on our weather and UFO data (>90000 records), we cannot effectively predict duration from the weather features.

We did, nevertheless, find that most UFO appearances last no longer than 5 minutes, and almost 40% end within 60 seconds. This gives us a sense of how long a UFO appearance typically lasts. A log-transformed histogram is also provided to show the distribution of appearance durations.
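The two summaries above (the share of sightings under a time limit, and a log-scale histogram) can be sketched in plain Python as follows. The helper names and the toy durations are hypothetical, and the power-of-ten binning is one possible choice, not necessarily the one used for our figure.

```python
import math
from collections import Counter


def share_within(durations_sec, limit_sec):
    """Fraction of sightings whose duration is at most limit_sec seconds."""
    return sum(1 for d in durations_sec if d <= limit_sec) / len(durations_sec)


def log10_histogram(durations_sec):
    """Count durations per power-of-ten bin: bin k covers [10**k, 10**(k+1))."""
    return Counter(math.floor(math.log10(d)) for d in durations_sec if d > 0)


# Hypothetical durations in seconds, for illustration only.
toy_durations = [5, 30, 45, 90, 120, 300, 3600, 7200]
under_a_minute = share_within(toy_durations, 60)
histogram = log10_histogram(toy_durations)
```

Binning on a log scale keeps the many short sightings and the few very long ones readable on the same axis.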

Creating Web Application

In this post, we start to design and implement our web application for UFO reports. To begin with, our website has three functions:

  1. Let users report their UFO sightings and give them our assessment of each report, i.e., whether it is fake or not.
  2. Show the statistical results of previous UFO reports in a way that users can interact with.
  3. Update our my_ufo.db based on new UFO reports.
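For the third function, a minimal sketch of storing a new report with Python's built-in sqlite3 module might look like the following. The table name `reports` and its columns are assumptions made for illustration; the real schema of my_ufo.db may differ.

```python
import sqlite3


def save_report(conn, description, duration_sec, prob_true):
    """Insert one report and return its row id (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "description TEXT NOT NULL, "
        "duration_sec REAL, "
        "prob_true REAL)"
    )
    cur = conn.execute(
        "INSERT INTO reports (description, duration_sec, prob_true) "
        "VALUES (?, ?, ?)",
        (description, duration_sec, prob_true),
    )
    conn.commit()
    return cur.lastrowid


# In-memory database for the example; pass "my_ufo.db" to persist to disk.
conn = sqlite3.connect(":memory:")
row_id = save_report(conn, "bright light over the lake", 45.0, 0.72)
```

The `?` placeholders let sqlite3 handle quoting, which matters here because the description text comes straight from a user form.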

Currently, each model outputs a probability for a new input report, and the models vote to determine whether the report is fake, with judge_score as the weight of each classifier. The structure of our web application is shown in figure 2. We use Node.js as the server that connects the front end and the back end. We have already implemented the fake detection part, as shown in figure 3.
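One way to read the weighted vote described above is as a weighted average of the per-model probabilities. The sketch below assumes that reading; the function name, the 0.5 decision threshold, and the example probabilities are hypothetical, while the weights are the judge scores from our table.

```python
def weighted_vote(probs_true, weights, threshold=0.5):
    """Combine per-model probabilities that a report is true.

    Returns (combined_probability, is_true). The threshold is an assumption.
    """
    if len(probs_true) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = sum(p * w for p, w in zip(probs_true, weights)) / total
    return combined, combined >= threshold


# Weights: judge scores of SVM (RBF), Logistic Regression, Linear SVM.
judge_scores = [0.849, 0.796, 0.802]
# Hypothetical model outputs for one new report.
combined, is_true = weighted_vote([0.9, 0.4, 0.6], judge_scores)
```

Dividing by the sum of the weights keeps the combined value a valid probability, so it can be shown to the user directly.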

On the home page, users may choose to view our statistical results or report a UFO sighting:

If the user chooses 'report', we present a form:

After the user submits the form, we show the probability that the report is true: