NUFORC Reports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import lib.scraping as scraping
import lib.postprocessing_and_plotting as post
import lib.sentiment as sentiment
import lib.cleaning as clean
import lib.nuforc_analysis as nuforc
import lib.image_scraping as img_scrape
import lib.maps as maps
import lib.image_processing as img_proc
from ipywidgets import interact, IntSlider, fixed

data_folder = "data/"
file_url_list = data_folder + 'full_list_of_urls.json'
report_file = 'all_sightings.csv'
[nltk_data] Downloading package stopwords to /Users/kiru/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Data

Scraping from NUFORC

The National UFO Reporting Center (NUFORC) lists reports split by occurrence date, posting date, shape, or location. These sightings have been collected since 1974. NUFORC provides a hotline and a web form to submit UFO encounters. According to several sources, the reports are checked for hoaxes and fakes. The sightings also include articles from newspapers and historical accounts.

We decided to scrape the data from their web page using the requests and BeautifulSoup libraries.

The first step is to fetch all the possible shapes; then we can fetch all the report URLs for each shape.
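
As a rough illustration of this two-step approach, a minimal version could look like the sketch below; the index URL and parsing details are assumptions, and the real logic lives in lib/scraping.py.

In [ ]:
# Hypothetical sketch of the two-step scraping approach (not the actual lib/scraping.py code).
import requests
from bs4 import BeautifulSoup

BASE = "http://www.nuforc.org/webreports/"  # assumed index location

def fetch_shape_urls_sketch():
    """Collect the per-shape index pages linked from the shape overview page."""
    soup = BeautifulSoup(requests.get(BASE + "ndxshape.html").text, "html.parser")
    return [BASE + a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("ndxs")]

def fetch_all_sighting_urls_sketch(shape_urls):
    """Collect every individual report URL listed on the shape index pages."""
    urls = []
    for shape_url in shape_urls:
        soup = BeautifulSoup(requests.get(shape_url).text, "html.parser")
        urls += [BASE + a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].endswith(".html")]
    return urls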

In [28]:
all_shapes = scraping.fetch_shape_urls()
all_reporting_urls = scraping.fetch_all_sighting_urls(all_shapes)

Let's see how many shapes and reportings there are:

In [29]:
print("Number of shapes:", len(all_shapes))
print("Nubmer of sighting:", len(all_reporting_urls))
Number of shapes: 31
Number of sightings: 117722

We don't want to constantly send requests to the server, therefore we store the URLs in a JSON file.

In [ ]:
import json  # needed to read the stored URL list back in

scraping.store_to_json(all_reporting_urls, file_url_list)

# import href list of each report
with open(file_url_list) as train_file:
    href_data = json.load(train_file)
urls_to_read = href_data['url']['0']

Now we are ready to fetch all the reports and convert them to a Pandas DataFrame.

In [ ]:
%%time
df_reports = scraping.build_report_dataframe(urls_to_read)
df_reports.to_csv(data_folder + report_file)

The whole scraping process took more than 1 hour.

Loading data

Of course we do not want to scrape every time, so we load the stored file instead. The initial data contains the following columns:

  • Occurred
  • Reported
  • Posted
  • Location
  • Shape
  • Duration
  • Summary
In [30]:
df_reports = pd.read_csv(data_folder + report_file)
df_reports.sample(2).head()
Out[30]:
Unnamed: 0 Duration Location Occurred Posted Reported Shape Summary url
68075 55386 30 Toronto (Canada), ON 9/10/2014 01:00 (Entered as : 10/09/14 1:00) 9/10/2014 9/9/2014 10:09:25 PM 22:09 Light blinking white light very high in the sky, not... http://www.nuforc.org/webreports/113/S113416.html
56419 44896 3 seconds Brook Park, OH 2/23/2017 05:45 (Entered as : 02/23/17 05:45) 3/10/2017 2/24/2017 6:45:24 PM 18:45 Flash While heading south on I-71 a flash caught my ... http://www.nuforc.org/webreports/132/S132916.html

Cleaning

Some reports were not fetched correctly; we have to investigate further why these are missing.
Let's see how many were not fetched:

In [31]:
not_found_reports = df_reports[pd.isnull(df_reports.Summary)]
print("Not found reports", not_found_reports.count()['url'])
Not found reports 106

A

  • The function below does post-processing on the scraped data, such as cleaning whitespace, renaming the columns, and ordering the columns, and applies filters so that we only retain rows which contain summary data and sufficient information for the states.
  • In our dataset we noticed several empty strings. We decided to replace these empty strings with NaN for easier parsing.

B

  • Split all NUFORC notes of each summary into a separate column
  • Split all links of each summary into a separate column

C

  • Separate MADAR reports into a separate DataFrame (a rough sketch of these cleaning steps follows this list)
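
To illustrate what such a cleaning step might look like, here is a rough sketch; the column names and regular expressions are assumptions, and the actual logic lives in lib/cleaning.py.

In [ ]:
import numpy as np

def clean_data_sketch(df):
    """Illustrative version of the cleaning steps A-C above (not the real clean.clean_data)."""
    df = df.copy()
    # A: strip whitespace and turn empty summaries into NaN
    df["Summary"] = df["Summary"].str.strip().replace("", np.nan)
    # B: pull NUFORC notes (assumed to sit in double parentheses) and links into their own columns
    df["nuforc_note"] = df["Summary"].str.findall(r"\(\(NUFORC Note:.*?\)\)")
    df["link"] = df["Summary"].str.findall(r"https?://\S+")
    # C: MADAR node reports go into their own DataFrame
    is_madar = df["Summary"].str.contains("MADAR", na=False)
    return df[~is_madar], df[is_madar]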
In [32]:
print("cleaning..")
df_reports, df_madar_reports = clean.clean_data(df_reports)
print("..done")
cleaning..
..done

After cleaning the data, our structure looks like this:

In [33]:
df_reports.sample(2)
Out[33]:
Duration Location State Occurred Posted Reported Shape Summary nuforc_note link
url
http://www.nuforc.org/webreports/080/S80349.html 12.5 London (Canada) ON 2006-08-11 23:00:00 3/10/2011 2011-03-05 15:42:00 Circle I saw a yellow ufo moving in the sky heading k... [] []
http://www.nuforc.org/webreports/105/S105740.html 3.0 Cleveland TN 2013-12-29 19:00:00 1/10/2014 2013-12-30 08:08:00 Sphere At 7:00 pm we saw five reddish orange spheres ... [] []

Analysis

Time of occurrence

Can the reports be explained?

A report's submission splits into three parts: 'Occurred', when the UFO sighting was noticed; 'Reported', when the sighting was reported to NUFORC; and 'Posted', when the actual report was published online.

The report with the oldest occurrence time is from 1762 which is a report from a London newspaper.

As our datetime handling only deals with years starting from 1900, and reports that old would have very little reliability anyway, we will be looking only into reports from 1900 onwards.
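
As a small sketch of this restriction (assuming the 'Occurred' column can be coerced to datetimes, with unparseable or out-of-range values becoming NaT):

In [ ]:
# Sketch: keep only reports that occurred from 1900 onwards.
occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
df_reports_from_1900 = df_reports[occurred >= "1900-01-01"]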

Our goal is to understand if we can explain the results through specific events, like rocket launches, seasonal weather analysis, etc.

To get a better understanding, we should look at the distribution over the year. Thus we show the number of reported occurrences per month over our time period.

In [34]:
post.peaks_over_months(df_reports)

This plot shows the number of reported occurrences per month from 1900-2018. We can see a very clear trend of an increase in submitted reports during the summer months. This can be explained quite logically: during warmer periods people tend to spend more time outside and thus have a higher probability of sighting a UFO.

But we also see a distinct spike in July, so we will investigate this month further to see if there is something more than just the summer trend. We plot the reports from July over the time frame 1900 to 2018.

In [7]:
post.get_plot_july(df_reports)

A bias of memory

Now, here things start to become very interesting. We can quite obviously see an increase on the 4th of July, as a lot of sightings are likely connected with the Independence Day fireworks in the US.

However, another very clear increase is on the 15th of July, and after a Google query no specific event in the month of July was found. Thus we investigate the 15th across the years. What is more interesting is that the increase on the 15th day of the month can be seen across multiple months.

In [35]:
post.plot_for_each_month(post.get_data_for_months(df_reports))

We can see a very clear trend across all months: on the 15th day of the month there is a sudden increase in reported occurrences of UFOs. Also, in most months there is a peak on the first day of the month, and in some on the last one. In some months there are also other peaks, which most of the time can be explained by holidays such as the 4th of July or New Year's Eve, when fireworks are very prevalent.

To get a better understanding of the peaks that occur in the middle of the month, we will plot the time difference between the Occurrence and Reporting of the sighting over days of the month.
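
A hedged sketch of how such a per-day-of-month comparison could be computed (the real version is post.occurance_report_difference together with post.plot_occur_report_difference):

In [ ]:
# Sketch: mean delay between occurrence and reporting, grouped by the day of the month
# on which the sighting occurred. Assumes both columns can be coerced to datetimes.
occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
reported = pd.to_datetime(df_reports["Reported"], errors="coerce")
delay_days = (reported - occurred).dt.days

delay_days.groupby(occurred.dt.day).mean().plot.bar(
    figsize=(15, 4), title="Mean reporting delay per day of month")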

In [9]:
post.plot_occur_report_difference(post.occurance_report_difference(df_reports))

We see some suspicious dates where the mean is higher than for the other ones; these days fall on 'round' numbers, so we propose that these dates are artificially inflated. In other words, reporters who submit their reports after a longer time have a bias towards round dates, which has been shown in previous research as well [1].

To test this further we will perform a Kolmogorov-Smirnov test on these dates to see if their empirical distribution matches the distribution of the other dates.

Thus we split our data into two subsets: one which consists of the 'suspicious' days (the 1st day of the month and days divisible by 5), and one with the other days.
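
In essence this boils down to something like the following sketch with scipy's two-sample KS test (the exact split and test live in lib/postprocessing_and_plotting.py):

In [ ]:
# Sketch: two-sample Kolmogorov-Smirnov test on reporting delays,
# split into 'suspicious' occurrence days (the 1st, or days divisible by 5) vs. the rest.
from scipy import stats

occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
reported = pd.to_datetime(df_reports["Reported"], errors="coerce")
delay_days = (reported - occurred).dt.days

suspicious = (occurred.dt.day == 1) | (occurred.dt.day % 5 == 0)
statistic, p_value = stats.ks_2samp(delay_days[suspicious].dropna(),
                                    delay_days[~suspicious].dropna())
print("KS statistic:", statistic, "p-value:", p_value)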

In [10]:
data_all_days, data_suspicious_days = post.get_distribution_data(post.occurance_report_difference(df_reports))
post.kolmogorov_smirnov(data_all_days, data_suspicious_days)
Kolmogorov_Smirnov Statistic 0.30862854553990543 and p-value 0.0
Out[10]:
True

Conclusions from statistics

With the Kolmogorov-Smirnov test we check whether these two datasets come from the same empirical distribution. As the p-value is below 0.05 we can reject the null hypothesis, meaning that they are not of the same distribution. This further means that reporters have a certain bias towards round dates when a long time has passed between the occurrence of the sighting and the reporting itself.

Other spikes

The previous analysis only covers spikes which arise from the bias of the reporters. Thus, it is not an explanation per se of the reports, but of the reason why they aggregate at these specific dates.

To look more in depth into special cases and other peaks, we plot the intensity of occurrences over the period 2004-2007 for all the months.

In [36]:
interact(post.spike_finder, year = IntSlider(min=2004,max=2007,step=1,value=2004),
                                              month = IntSlider(min=1,max=12,step=1,value=1),
                                                df = fixed(df_reports))
Out[36]:
<function lib.postprocessing_and_plotting.spike_finder(year, month, df)>

From 2004 to 2007

After some manual searching for the peaks we found explanations for some of them, while others were reported in the news as UFOs as well.

Final Conclusions about time

By manually checking the days of the peaks we could find reasonable explanations for most of them, such as rocket launches, space station fueling missions, meteorites, etc. At the same time, some peaks did not show any apparent correlation with events that could be explained by earth/space weather or known human-made flying objects. For further analysis, one could track the locations of all the satellites in Earth's orbit (https://in-the-sky.org/satmap_worldmap.php) and cross-check that with reports whose descriptions might match. For an even deeper analysis, one could cross-check all of the dates with astronomical calendars to possibly explain even more reports.

But in the end it would be impossible to explain all of the sightings, because oftentimes reporters who want to believe in something will make far-fetched correlations just to reinforce their beliefs.

Maps - Location of the Occurrence

In [42]:
from IPython.display import HTML
HTML(filename='maps.html')
Out[42]:

MAPS

Layers

Airbase Markers

We wanted to see whether the locations of US military airbases are somehow correlated with the sightings.

Population Density
  • yellow representing non-metropolitan areas
  • green representing rural areas
UFO sightings per normalized county population

After cleaning the Location data, which at first was not only related to states and counties, we were able to merge county population data with the reports per county. This results in the county population layer, which shows that the likelihood of UFO sightings is higher in coastal regions.
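
A hedged sketch of the merge behind this layer, assuming a hypothetical county population table df_county_population with 'county' and 'population' columns (the real merging and folium layer construction live in lib/map_countie.py):

In [ ]:
# Sketch: sightings per county, normalized by county population.
# 'df_county_population' (columns: 'county', 'population') is an assumed input table,
# and we pretend the cleaned 'Location' column already holds county names.
sightings_per_county = (df_reports.groupby("Location").size()
                                  .rename("sightings").reset_index()
                                  .rename(columns={"Location": "county"}))

df_counties = sightings_per_county.merge(df_county_population, on="county", how="inner")
df_counties["sightings_per_100k"] = 1e5 * df_counties["sightings"] / df_counties["population"]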

In [1]:
from lib.map_countie import *
In [3]:
get_countie_map_()
Out[3]:
In [2]:
get_density_map_()
Out[2]:

Population Density

The map below shows the population density of the US.

  • yellow representing non-metropolitan areas
  • green representing rural areas

If we compare the density map with the normalized UFO-sightings-per-capita map, we can tell that there is no direct relation.

In [ ]:
map_plot.get_density_map_()

The map below shows the distribution of reports according to the different states.

In [ ]:
maps.sightings_per_state(df_reports)

UFO Reports per Capita

Sightings per 100k people per state
In [ ]:
maps.sightings_per_state_normalized(df_reports)

Image Scraping

We found many links and YouTube references in our summaries, so we extracted them and requested the content. We merged all images and YouTube thumbnails into one big picture, sorted by dominant color using k-means. We can now informally inspect whether the pictures were taken by night or by day, or whether they mostly show the light of the UFO.
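
As a sketch of the dominant-colour step with scikit-learn's KMeans (the real pipeline is in lib/image_processing.py; the image paths here are illustrative):

In [ ]:
# Sketch: estimate each image's dominant colour via k-means over its pixels,
# then sort the images by the hue of that colour. Paths are illustrative.
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans
from colorsys import rgb_to_hsv

def dominant_color(path, k=3):
    pixels = np.asarray(Image.open(path).convert("RGB").resize((64, 64))).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_)             # pixels per cluster
    return km.cluster_centers_[counts.argmax()]  # centre of the biggest cluster

paths = glob.glob("img/**/*.jpg", recursive=True)
paths_sorted = sorted(paths, key=lambda p: rgb_to_hsv(*(dominant_color(p) / 255.0))[0])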

In [ ]:
img_scrape.scrape_any_images(df_reports, ".jpg")

Scrape YouTube thumbnails and video titles -> to folder /img/yt/

In [ ]:
img_scrape.scrape_yt_images(df_reports)
In [ ]:
big_image, df_img = img_proc.plot_sorted_images()
df_img.head()

[Output image: big_picture]

Sentiment Analysis

Getting sentiment

We use LIWC to analyse the sentiment of each report. Since LIWC does not provide an API, we have to export all reports as text files first:

In [ ]:
sentiment.export_alll_files(df_reports)

Now we have to run the LIWC sentiment analysis on all the files we generated. This produces a new results file which we can then use.
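
A rough sketch of what merging the LIWC output back might look like; the results file name, its 'Filename' column, and the '<row index>.txt' export convention are all assumptions, not LIWC's documented interface:

In [ ]:
# Sketch: join the external LIWC results back onto the reports.
# Assumes each report was exported as '<row index>.txt' and that LIWC wrote
# 'data/LIWC2015_results.csv' with a 'Filename' column -- both are assumptions.
df_liwc = pd.read_csv(data_folder + "LIWC2015_results.csv")
df_liwc["report_id"] = df_liwc["Filename"].str.replace(".txt", "", regex=False).astype(int)

df_sentiment_sketch = (df_reports.reset_index(drop=True)
                                 .merge(df_liwc, left_index=True,
                                        right_on="report_id", how="left"))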

In [16]:
df_report_sentiment = sentiment.add_sentiment_data(df_reports.reset_index())
df_report_sentiment.head(1)
Out[16]:
Duration Location State Occurred Posted Reported Shape Summary nuforc_note link ... Comma Colon SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
url
(http://www.nuforc.org/webreports/086/S86976.html, http://www.nuforc.org/webreports/086/S86976.html) 12.5 Terre Haute IN 2012-02-01 20:45:00 2/3/2012 2012-02-02 20:46:00 jets investigate strange lighti was on my pati... [] [] ... 2.45 0.0 0.0 0.0 0.49 0.0 0.98 0.0 0.0 0.49

1 rows × 104 columns

The LIWC2015 paper explains the emotion categories. The following columns are interesting for us:

  • family
  • friend
  • posemo ( positive emotion )
  • negemo ( negative emotion )
  • anger
  • sad
  • anx (anxiety)
  • hear
  • feel
  • reward
  • achieve
  • risk
  • tentat ( tentative )
  • certain

Word count

Before we analyze the sentiment, let's analyze the length of the reports:

In [17]:
df_report_sentiment.WC.describe(percentiles=[0.25, .5, .75, .9, .95]).to_frame()
Out[17]:
WC
count 117232.000000
mean 189.948504
std 192.333251
min 1.000000
25% 79.000000
50% 140.000000
75% 236.000000
90% 384.000000
95% 515.000000
max 11656.000000

We are not interested in the largest 5% of the reports, therefore let's plot the distribution of the rest:

In [39]:
df_not_long_reports = df_report_sentiment[df_report_sentiment.WC < 1000]
df_not_long_reports.WC.plot.hist(bins=100, figsize=(15, 4), color='#751947')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23f4df60>
In [41]:
df_not_long_reports.WC.describe().to_frame()
Out[41]:
WC
count 116388.000000
mean 180.906734
std 151.019725
min 1.000000
25% 78.000000
50% 139.000000
75% 233.000000
max 998.000000

We can see that the word count distribution is clearly right-skewed: most reports fall well below the mean, with a long tail of much longer reports.

Sentiment and Shape

We want to answer the question if there is a correlation between shape and emotion. In order to do so, we have to find out which emotions are more prevalent in the reports. The following boxplot shows the occurrence of the emotions we are interested in.

In [25]:
sentiment.emotions_boxplot(sentiment.get_emotion_data(df_report_sentiment), showfliers=False)

Out of the emotions we were initially interested in, we focus on a subset consisting of:

  • hear
  • negemo
  • posemo
  • feel
  • reward
  • certain

We can see that both positive and negative emotions are rather commonly found in the reports, as well as the 'tentative' and 'certain' cognitive processes. This is why we look into these emotions together with the senses 'hear' and 'feel', and how they show up for the most popular shapes.
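
Underneath, this is essentially a group-by over shapes; a minimal sketch (assuming the LIWC category columns carry the lowercase names listed above; the real radar plots come from lib/sentiment.py):

In [ ]:
# Sketch: mean LIWC scores of the selected categories per shape,
# restricted to the most frequently reported shapes.
emotion_cols = ["hear", "negemo", "posemo", "feel", "reward", "certain"]
top_shapes = df_report_sentiment["Shape"].value_counts().head(8).index

(df_report_sentiment[df_report_sentiment["Shape"].isin(top_shapes)]
    .groupby("Shape")[emotion_cols]
    .mean())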

In [26]:
sentiment.plot_sentiment_nets(sentiment.get_emotion_data(df_report_sentiment))

Contrary to our initial thought, the shapes don't seem to convey any meaningful information, nor does there seem to be any correlation between shape and the different emotions.

Conclusion

This was an overall very interesting dataset with many unexpected twists and turns. After a lengthy scraping and cleaning process and an in-depth analysis of the many different sides of the dataset, we came to some interesting conclusions. Firstly, even though we could not find a single blanket explanation that debunks most of the reports, we could give plausible explanations for a lot of them.

Our map analysis gave us interesting insights into the distribution of the reports but, alas, did not give us enough information to correlate the reports with military bases, rocket launch sites, or rural versus urban regions.

Finally, our sentiment analysis showed us promising results connected with the emotions around the different shapes, and we could assess that reporters mostly felt strong certainty about and trust in their reporting, which is an interesting insight.

Miscellaneous

We made word clouds for each year starting at 1964 (the first year with more than 1000 reports) and found the most popular shapes for these years. Over all the years the most popular shape was 'light', so we decided not to include it in our further analysis, just like the most popular words, as they would dominate the word clouds and we could not get any interesting information out of them.
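
A minimal sketch of such a per-year word cloud with the wordcloud package, with 'light' added to the stopwords so it does not dominate (the real generation is post.generate_clipart):

In [ ]:
# Sketch: word cloud of all summaries from a given year, excluding the
# ever-dominant word 'light' so that rarer terms become visible.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def wordcloud_for_year(df, year):
    occurred = pd.to_datetime(df["Occurred"], errors="coerce")
    text = " ".join(df.loc[occurred.dt.year == year, "Summary"].dropna())
    stopwords = set(STOPWORDS) | {"light", "lights"}
    wc = WordCloud(width=800, height=400, stopwords=stopwords,
                   background_color="white").generate(text)
    plt.figure(figsize=(15, 7))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")

wordcloud_for_year(df_reports, 1998)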

In [27]:
interact(post.generate_clipart, year = IntSlider(min=1964,max=2018,step=1,value=1998),
        frequency_years = fixed(post.get_data_for_wordclouds()))
Out[27]:
<function lib.postprocessing_and_plotting.generate_clipart(year, frequency_years)>

References

[1] Huttenlocher, J., Hedges, L. V., & Bradburn, N. M. (1990). Reports of Elapsed Time: Bounding and Rounding Processes in Estimation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(2), 196-213. https://doi.org/10.1037/0278-7393.16.2.196