I have attached ufo_sightings_large.csv.

  • In this assignment, you will investigate UFO data over the last century to gain some insight.
  • Please use all the techniques we have learned in class to preprocess/clean the dataset ufo_sightings_large.csv.
  • After the dataset is preprocessed, please split it into training sets and test sets.
  • Fit KNN to the training sets.
  • Print the score of KNN on the test sets.


1. Import the dataset “ufo_sightings_large.csv” into pandas (5 points)
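
As a rough illustration of the import step, the sketch below reads CSV text with pd.read_csv. The two-row sample is made up so the snippet runs without the attached file; only the date, type, length_of_time, and desc column names come from this assignment, and the real file may contain more columns.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for ufo_sightings_large.csv
sample_csv = io.StringIO(
    "date,type,length_of_time,desc\n"
    "2011-11-03 19:21:00,light,about 5 minutes,bright light over the lake\n"
    "2004-10-03 19:05:00,triangle,2 minutes,three lights in a triangle\n"
)

# With the real file you would pass its path instead:
#   ufo = pd.read_csv("ufo_sightings_large.csv")
ufo = pd.read_csv(sample_csv)

# Quick sanity checks: number of rows/columns and the first few rows
print(ufo.shape)
print(ufo.head())
```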


2. Checking column types & converting column types (10 points)

Take a look at the UFO dataset’s column types using the dtypes attribute, then convert each column to its proper type. For example, the date column can be converted to the datetime type, which will make our feature engineering efforts easier later on.
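
A minimal sketch of the type check and conversion on a toy frame. The date column is named in this assignment; the seconds column and its text storage are assumptions made for illustration only.

```python
import pandas as pd

# Toy data: both columns start out as strings (object dtype)
ufo = pd.DataFrame({
    "date": ["2011-11-03 19:21:00", "2004-10-03 19:05:00"],
    "seconds": ["300.0", "120.0"],  # hypothetical column stored as text
})
print(ufo.dtypes)

# Convert the date strings to datetime64 and the numeric strings to float
ufo["date"] = pd.to_datetime(ufo["date"])
ufo["seconds"] = ufo["seconds"].astype(float)
print(ufo.dtypes)
```

Once date is a datetime column, parts such as ufo["date"].dt.year or ufo["date"].dt.month are available for feature engineering.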


3. Dropping missing data (10 points)

Let’s remove some of the rows where certain columns have missing values.
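
One possible way to do this is dropna with a subset of columns, sketched below on toy data. Which columns to check is a judgment call for you to make; the state column here is hypothetical.

```python
import pandas as pd

# Toy data with one missing value in each of three rows
ufo = pd.DataFrame({
    "length_of_time": ["5 minutes", None, "2 minutes", "1 hour"],
    "state": ["wa", "tx", None, "ca"],  # hypothetical column
    "type": ["light", "circle", "disk", None],
})

# Count the missing values per column before deciding what to drop
print(ufo.isna().sum())

# Keep only the rows where none of the listed columns is missing
ufo_no_missing = ufo.dropna(subset=["length_of_time", "state", "type"])
print(ufo_no_missing.shape)
```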


4. Extracting numbers from strings (10 points)

The length_of_time column in the UFO dataset is a text field that has the number of minutes within the string. Here, you’ll extract that number from that text field using regular expressions.
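
A minimal sketch of the extraction using a regular expression through pandas' str.extract. The sample strings and the new minutes column name are assumptions. The pattern simply takes the first run of digits, so an entry such as "1 hour" would yield 1 and may need separate handling.

```python
import pandas as pd

ufo = pd.DataFrame({
    "length_of_time": ["about 5 minutes", "2 minutes", "10 min", "a few minutes"],
})

# Capture the first run of digits in each string; rows with no digits become NaN
extracted = ufo["length_of_time"].str.extract(r"(\d+)", expand=False)

# Convert the captured text to numbers ("minutes" is a hypothetical column name)
ufo["minutes"] = pd.to_numeric(extracted)
print(ufo)
```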


5. Identifying features for standardization (10 points)

In this section, you’ll investigate the variance of columns in the UFO dataset to determine which features should be standardized. You can log-normalize the high-variance column.
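
The sketch below compares variances with DataFrame.var and applies np.log to the column with the largest variance. The column names and values are made up. Note that np.log requires strictly positive values, so zeros would need handling first.

```python
import numpy as np
import pandas as pd

ufo = pd.DataFrame({
    "seconds": [60.0, 300.0, 7200.0, 120.0],  # hypothetical column
    "minutes": [1.0, 5.0, 120.0, 2.0],        # hypothetical column
})

# Compare the variance of each numeric column
print(ufo.var())

# Log-normalize the column with the largest variance
high_var_col = ufo.var().idxmax()
ufo[high_var_col + "_log"] = np.log(ufo[high_var_col])

# The variance after the log transform is much smaller
print(ufo[high_var_col + "_log"].var())
```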


6. Encoding categorical variables (20 points)

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled with scikit-learn. You’ll do that transformation here, using both binary and one-hot encoding methods.
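
Both encodings can be sketched as follows. The country and state columns and their values are assumptions for illustration; choose the columns that actually need encoding in your dataset.

```python
import pandas as pd

ufo = pd.DataFrame({
    "country": ["us", "ca", "us", "us"],  # hypothetical column
    "state": ["wa", "bc", "tx", "wa"],    # hypothetical column
})

# Binary encoding: 1 when the country is "us", 0 otherwise
ufo["country_enc"] = (ufo["country"] == "us").astype(int)

# One-hot encoding: one indicator column per distinct state value
state_dummies = pd.get_dummies(ufo["state"], prefix="state")
ufo = pd.concat([ufo, state_dummies], axis=1)
print(ufo.head())
```

Binary encoding suits a column with two categories of interest; one-hot encoding suits a column with several unordered categories.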


7. Text vectorization (10 points)

Let’s transform the desc column in the UFO dataset into TF-IDF vectors, since there’s likely something we can learn from this field.
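
A minimal sketch of the vectorization with scikit-learn's TfidfVectorizer, using default settings and three made-up descriptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

ufo = pd.DataFrame({"desc": [
    "bright light over the lake",
    "three lights in a triangle",
    "bright orange orb moving fast",
]})

# Learn the vocabulary and compute one tf-idf row per description
vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(ufo["desc"])

# The result is a sparse matrix: (number of rows, size of vocabulary)
print(desc_tfidf.shape)
```

On the full dataset the vocabulary can be large, so you may want to keep only the highest-weighted words as features.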


8. Selecting the ideal dataset (10 points)

Let’s get rid of some of the unnecessary features.
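
One way to do this is to drop the raw columns that have already been replaced by engineered versions. The sketch below assumes hypothetical column names; which features count as unnecessary depends on your earlier steps.

```python
import pandas as pd

ufo = pd.DataFrame({
    "city": ["seattle", "austin"],                      # hypothetical column
    "country": ["us", "us"],                            # hypothetical column
    "country_enc": [1, 1],                              # hypothetical encoded column
    "length_of_time": ["about 5 minutes", "2 minutes"],
    "minutes": [5.0, 2.0],                              # hypothetical extracted column
    "type": ["light", "triangle"],
})

# Drop raw text columns whose information is already captured elsewhere
to_drop = ["city", "country", "length_of_time"]
ufo_dropped = ufo.drop(columns=to_drop)
print(ufo_dropped.columns.tolist())
```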


9. Split X and y using train_test_split, setting stratify=y (5 points)

In [9]:

# Features: every column except the target
X = ufo.drop(columns=["type"])
# Target: the type column, cast to string
y = ufo["type"].astype(str)
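
The split itself might look like the sketch below, shown on a small made-up frame so it runs on its own. The feature column names, the default 75/25 split, and random_state=42 are assumptions, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed dataset (hypothetical feature columns)
ufo = pd.DataFrame({
    "minutes": [1.0, 5.0, 2.0, 10.0, 3.0, 8.0, 4.0, 6.0],
    "country_enc": [1, 0, 1, 1, 0, 1, 1, 0],
    "type": ["light", "circle"] * 4,
})

X = ufo.drop(columns=["type"])
y = ufo["type"].astype(str)

# stratify=y keeps the class proportions of y the same in both splits
train_X, test_X, train_y, test_y = train_test_split(
    X, y, stratify=y, random_state=42
)
print(train_X.shape, test_X.shape)
```

Stratifying matters when some values of type are rare: without it, a rare class could end up entirely in one of the two splits.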


10. Fit knn to the training sets and print the score of knn on the test sets (5 points)

In [1]:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# train_X, test_X, train_y, test_y are assumed to come from the train_test_split call in step 9
# Fit knn to the training sets
knn.fit(train_X, train_y)
# Print the score of knn on the test sets
print(knn.score(test_X, test_y))
Assignment Data Preprocessing – UFO Sighting Data Exploration