NUFORC Reports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import lib.scraping as scraping
import lib.postprocessing_and_plotting as post
import lib.sentiment as sentiment
import lib.cleaning as clean
import lib.nuforc_analysis as nuforc
import lib.image_scraping as img_scrape
import lib.maps as maps
import lib.image_processing as img_proc
from ipywidgets import interact, IntSlider, fixed

data_folder = "data/"
file_url_list = data_folder + 'full_list_of_urls.json'
report_file = 'all_sightings.csv'
[nltk_data] Downloading package stopwords to /Users/kiru/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Data

Scraping from NUFORC

The National UFO Reporting Center (NUFORC) lists reports split by occurrence date, posting date, shape, or location. These sightings have been collected since 1974. NUFORC provides a hotline and a web form to submit UFO encounters. According to several sources, the reports are checked for hoaxes and fakes. The sightings also include articles from newspapers and historical accounts.

We decided to scrape the data from their web page using the requests and BeautifulSoup libraries.

The first step is to fetch all the possible shapes; then we can fetch all the report URLs for each shape.
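
As a rough illustration of this two-step approach, a minimal version could look like the sketch below; the index URL and parsing details are assumptions, and the real logic lives in lib/scraping.py.

In [ ]:
# Hypothetical sketch of the two-step scraping approach (not the actual lib/scraping.py code).
import requests
from bs4 import BeautifulSoup

BASE = "http://www.nuforc.org/webreports/"  # assumed index location

def fetch_shape_urls_sketch():
    """Collect the per-shape index pages linked from the shape overview page."""
    soup = BeautifulSoup(requests.get(BASE + "ndxshape.html").text, "html.parser")
    return [BASE + a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("ndxs")]

def fetch_all_sighting_urls_sketch(shape_urls):
    """Collect every individual report URL listed on the shape index pages."""
    urls = []
    for shape_url in shape_urls:
        soup = BeautifulSoup(requests.get(shape_url).text, "html.parser")
        urls += [BASE + a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].endswith(".html")]
    return urls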

In [28]:
all_shapes = scraping.fetch_shape_urls()
all_reporting_urls = scraping.fetch_all_sighting_urls(all_shapes)

Let's see how many shapes and reportings there are:

In [29]:
print("Number of shapes:", len(all_shapes))
print("Nubmer of sighting:", len(all_reporting_urls))
Number of shapes: 31
Number of sightings: 117722

We don't want to constantly send requests to the server, therefore we store the URLs in a JSON file.

In [ ]:
import json  # needed to read the stored URL list back in

scraping.store_to_json(all_reporting_urls, file_url_list)

# import href list of each report
with open(file_url_list) as train_file:
    href_data = json.load(train_file)
urls_to_read = href_data['url']['0']

Now we are ready to fetch all the reports and convert them to a Pandas DataFrame.

In [ ]:
%%time
df_reports = scraping.build_report_dataframe(urls_to_read)
df_reports.to_csv(data_folder + report_file)

The whole scraping process took more than 1 hour.

Loading data

Of course we do not want to scrape every time, so we load the stored file instead. The initial data contains the following columns:

  • Occurred
  • Reported
  • Posted
  • Location
  • Shape
  • Duration
  • Summary
In [30]:
df_reports = pd.read_csv(data_folder + report_file)
df_reports.sample(2).head()
Out[30]:
Unnamed: 0 Duration Location Occurred Posted Reported Shape Summary url
68075 55386 30 Toronto (Canada), ON 9/10/2014 01:00 (Entered as : 10/09/14 1:00) 9/10/2014 9/9/2014 10:09:25 PM 22:09 Light blinking white light very high in the sky, not... http://www.nuforc.org/webreports/113/S113416.html
56419 44896 3 seconds Brook Park, OH 2/23/2017 05:45 (Entered as : 02/23/17 05:45) 3/10/2017 2/24/2017 6:45:24 PM 18:45 Flash While heading south on I-71 a flash caught my ... http://www.nuforc.org/webreports/132/S132916.html

Cleaning

Some reports were not fetched correctly; we have to investigate further why these are missing.
Let's see how many were not fetched:

In [31]:
not_found_reports = df_reports[pd.isnull(df_reports.Summary)]
print("Not found reports", not_found_reports.count()['url'])
Not found reports 106

A

  • The function below does post-processing on the scraped data, such as cleaning whitespace, renaming the columns, and ordering the columns, and applies filters so that we only retain rows which contain summary data and sufficient information for the states.
  • In our dataset we noticed several empty strings. We decided to replace these empty strings with NaN for easier parsing.

B

  • Split all NUFORC notes of each summary into a separate column
  • Split all links of each summary into a separate column

C

  • Separate MADAR reports into a separate DataFrame (a rough sketch of these cleaning steps follows this list)
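
To illustrate what such a cleaning step might look like, here is a rough sketch; the column names and regular expressions are assumptions, and the actual logic lives in lib/cleaning.py.

In [ ]:
import numpy as np

def clean_data_sketch(df):
    """Illustrative version of the cleaning steps A-C above (not the real clean.clean_data)."""
    df = df.copy()
    # A: strip whitespace and turn empty summaries into NaN
    df["Summary"] = df["Summary"].str.strip().replace("", np.nan)
    # B: pull NUFORC notes (assumed to sit in double parentheses) and links into their own columns
    df["nuforc_note"] = df["Summary"].str.findall(r"\(\(NUFORC Note:.*?\)\)")
    df["link"] = df["Summary"].str.findall(r"https?://\S+")
    # C: MADAR node reports go into their own DataFrame
    is_madar = df["Summary"].str.contains("MADAR", na=False)
    return df[~is_madar], df[is_madar]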
In [32]:
print("cleaning..")
df_reports, df_madar_reports = clean.clean_data(df_reports)
print("..done")
cleaning..
..done

After cleaning the data, our structure looks like this:

In [33]:
df_reports.sample(2)
Out[33]:
Duration Location State Occurred Posted Reported Shape Summary nuforc_note link
url
http://www.nuforc.org/webreports/080/S80349.html 12.5 London (Canada) ON 2006-08-11 23:00:00 3/10/2011 2011-03-05 15:42:00 Circle I saw a yellow ufo moving in the sky heading k... [] []
http://www.nuforc.org/webreports/105/S105740.html 3.0 Cleveland TN 2013-12-29 19:00:00 1/10/2014 2013-12-30 08:08:00 Sphere At 7:00 pm we saw five reddish orange spheres ... [] []

Analysis

Time of occurrence

Can the reports be explained?

A report's submission splits into three parts: 'Occurred', when the UFO sighting was noticed; 'Reported', when the sighting was reported to NUFORC; and 'Posted', when the actual report was published online.

The report with the oldest occurrence time is from 1762 which is a report from a London newspaper.

As our datetime handling only deals with years starting from 1900, and reports that old would have very little reliability anyway, we will be looking only into reports from 1900 onwards.
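
As a small sketch of this restriction (assuming the 'Occurred' column can be coerced to datetimes, with unparseable or out-of-range values becoming NaT):

In [ ]:
# Sketch: keep only reports that occurred from 1900 onwards.
occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
df_reports_from_1900 = df_reports[occurred >= "1900-01-01"]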

Our goal is to understand if we can explain the results through specific events, like rocket launches, seasonal weather analysis, etc.

To get a better understanding, we should look at the distribution over the year. Thus we show the number of reported occurrences per month over our time period.

In [34]:
post.peaks_over_months(df_reports)

This plot shows the number of reported occurrences per month from 1900-2018. We can see a very clear trend of an increase in submitted reports during the summer months. This can be explained quite logically: during warmer periods people tend to spend more time outside and thus have a higher probability of sighting a UFO.

But we also see a distinct spike in July, so we will investigate this month further to see if there is something more than just the summer trend. We plot the reports from July over the time frame 1900 to 2018.

In [7]:
post.get_plot_july(df_reports)

A bias of memory

Now, here things start to become very interesting. We can quite obviously see an increase on the 4th of July, as a lot of sightings are likely connected with the Independence Day fireworks in the US.

However, another very clear increase is on the 15th of July, and after a Google query no specific event in the month of July was found. Thus we investigate the 15th across the years. What is more interesting is that the increase on the 15th day of the month can be seen across multiple months.

In [35]:
post.plot_for_each_month(post.get_data_for_months(df_reports))

We can see a very clear trend across all months: on the 15th day of the month there is a sudden increase in reported occurrences of UFOs. Also, in most months there is a peak on the first day of the month, and in some on the last one. In some months there are also other peaks, which most of the time can be explained by holidays such as the 4th of July or New Year's Eve, when fireworks are very prevalent.

To get a better understanding of the peaks that occur in the middle of the month, we will plot the time difference between the Occurrence and Reporting of the sighting over days of the month.
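
A hedged sketch of how such a per-day-of-month comparison could be computed (the real version is post.occurance_report_difference together with post.plot_occur_report_difference):

In [ ]:
# Sketch: mean delay between occurrence and reporting, grouped by the day of the month
# on which the sighting occurred. Assumes both columns can be coerced to datetimes.
occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
reported = pd.to_datetime(df_reports["Reported"], errors="coerce")
delay_days = (reported - occurred).dt.days

delay_days.groupby(occurred.dt.day).mean().plot.bar(
    figsize=(15, 4), title="Mean reporting delay per day of month")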

In [9]:
post.plot_occur_report_difference(post.occurance_report_difference(df_reports))

We see some suspicious dates where the mean is higher than for the other ones; these days fall on 'round' numbers, so we propose that these dates are artificially inflated. In other words, reporters who submit their reports after a longer time have a bias towards round dates, which has been shown in previous research as well [1].

To test this further we will perform a Kolmogorov-Smirnov test on these dates to see if their empirical distribution matches the distribution of the other dates.

Thus we split our data into two subsets: one which consists of the 'suspicious' days (the 1st day of the month and days divisible by 5), and one with the other days.
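
In essence this boils down to something like the following sketch with scipy's two-sample KS test (the exact split and test live in lib/postprocessing_and_plotting.py):

In [ ]:
# Sketch: two-sample Kolmogorov-Smirnov test on reporting delays,
# split into 'suspicious' occurrence days (the 1st, or days divisible by 5) vs. the rest.
from scipy import stats

occurred = pd.to_datetime(df_reports["Occurred"], errors="coerce")
reported = pd.to_datetime(df_reports["Reported"], errors="coerce")
delay_days = (reported - occurred).dt.days

suspicious = (occurred.dt.day == 1) | (occurred.dt.day % 5 == 0)
statistic, p_value = stats.ks_2samp(delay_days[suspicious].dropna(),
                                    delay_days[~suspicious].dropna())
print("KS statistic:", statistic, "p-value:", p_value)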

In [10]:
data_all_days, data_suspicious_days = post.get_distribution_data(post.occurance_report_difference(df_reports))
post.kolmogorov_smirnov(data_all_days, data_suspicious_days)
Kolmogorov_Smirnov Statistic 0.30862854553990543 and p-value 0.0
Out[10]:
True

Conclusions from statistics

With the Kolmogorov-Smirnov test we check whether these two datasets come from the same empirical distribution. As the p-value is below 0.05 we can reject the null hypothesis, meaning that they are not of the same distribution. This further means that reporters have a certain bias towards round dates when a long time has passed between the occurrence of the sighting and the reporting itself.

Other spikes

The previous analysis only covers spikes which arise from the bias of the reporters. Thus, it is not an explanation per se of the reports, but of the reason why they aggregate at these specific dates.

To look more in depth into special cases and other peaks, we plot the intensity of occurrences over the period 2004-2007 for all the months.

In [36]:
interact(post.spike_finder, year = IntSlider(min=2004,max=2007,step=1,value=2004),
                                              month = IntSlider(min=1,max=12,step=1,value=1),
                                                df = fixed(df_reports))
Out[36]:
<function lib.postprocessing_and_plotting.spike_finder(year, month, df)>

From 2004 to 2007

After some manual searching for the peaks we found explanations for some of them, while others were reported in the news as UFOs as well.

Final Conclusions about time

By manually checking the days of the peaks we could find reasonable explanations for most of them, such as rocket launches, space station fueling missions, meteorites, etc. At the same time, some peaks did not show any apparent correlation with events that could be explained by earth/space weather or known human-made flying objects. For further analysis, one could track the locations of all the satellites in Earth's orbit (https://in-the-sky.org/satmap_worldmap.php) and cross-check that with reports whose descriptions might match. For an even deeper analysis, one could cross-check all of the dates with astronomical calendars to possibly explain even more reports.

But in the end it would be impossible to explain all of the sightings, because oftentimes reporters who want to believe in something will make far-fetched correlations just to reinforce their beliefs.

Maps - Location of the Occurrence

In [42]:
from IPython.display import HTML
HTML(filename='maps.html')
Out[42]:

MAPS

Layers

Airbase Markers

We wanted to see whether the locations of US military airbases are somehow correlated with the sightings.

Population Density
  • yellow representing non-metropolitan areas
  • green representing rural areas
UFO sightings per normalized county population

After cleaning the Location data, which at first was not only related to states and counties, we were able to merge county population data with the reports per county. This results in the county population layer, which shows that the likelihood of UFO sightings is higher in coastal regions.
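
A hedged sketch of the merge behind this layer, assuming a hypothetical county population table df_county_population with 'county' and 'population' columns (the real merging and folium layer construction live in lib/map_countie.py):

In [ ]:
# Sketch: sightings per county, normalized by county population.
# 'df_county_population' (columns: 'county', 'population') is an assumed input table,
# and we pretend the cleaned 'Location' column already holds county names.
sightings_per_county = (df_reports.groupby("Location").size()
                                  .rename("sightings").reset_index()
                                  .rename(columns={"Location": "county"}))

df_counties = sightings_per_county.merge(df_county_population, on="county", how="inner")
df_counties["sightings_per_100k"] = 1e5 * df_counties["sightings"] / df_counties["population"]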

In [1]:
from lib.map_countie import *
In [3]:
get_countie_map_()
Out[3]:
In [2]:
get_density_map_()
Out[2]:

Population Density

The map below shows the population density of the US.

  • yellow representing non-metropolitan areas
  • green representing rural areas

If we compare the density map with the normalized UFO-sightings-per-capita map, we can tell that there is no direct relation.

In [ ]:
map_plot.get_density_map_()

The map below shows the distribution of reports according to the different states.

In [ ]:
maps.sightings_per_state(df_reports)

UFO Reports per Capita

Sightings per 100k people per state
In [ ]:
maps.sightings_per_state_normalized(df_reports)

Image Scraping

We found many links and YouTube references in our summaries, so we extracted them and requested the content. We merged all images and YouTube thumbnails into one big picture, sorted by dominant color using k-means. We can now informally inspect whether the pictures were taken by night or by day, or whether they mostly show the light of the UFO.
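
As a sketch of the dominant-colour step with scikit-learn's KMeans (the real pipeline is in lib/image_processing.py; the image paths here are illustrative):

In [ ]:
# Sketch: estimate each image's dominant colour via k-means over its pixels,
# then sort the images by the hue of that colour. Paths are illustrative.
import glob
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans
from colorsys import rgb_to_hsv

def dominant_color(path, k=3):
    pixels = np.asarray(Image.open(path).convert("RGB").resize((64, 64))).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    counts = np.bincount(km.labels_)             # pixels per cluster
    return km.cluster_centers_[counts.argmax()]  # centre of the biggest cluster

paths = glob.glob("img/**/*.jpg", recursive=True)
paths_sorted = sorted(paths, key=lambda p: rgb_to_hsv(*(dominant_color(p) / 255.0))[0])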

In [ ]:
img_scrape.scrape_any_images(df_reports, ".jpg")

Scrape YouTube thumbnails and video titles -> to folder /img/yt/

In [ ]:
img_scrape.scrape_yt_images(df_reports)
In [ ]:
big_image, df_img = img_proc.plot_sorted_images()
df_img.head()

[Output image: big_picture]

Sentiment Analysis

Getting sentiment

We use LIWC to analyse the sentiment of each report. Since LIWC does not provide an API, we have to export all reports as text files first:

In [ ]:
sentiment.export_alll_files(df_reports)

Now we have to run the LIWC sentiment analysis on all the files we generated. This produces a new results file which we can then use.
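
A rough sketch of what merging the LIWC output back might look like; the results file name, its 'Filename' column, and the '<row index>.txt' export convention are all assumptions, not LIWC's documented interface:

In [ ]:
# Sketch: join the external LIWC results back onto the reports.
# Assumes each report was exported as '<row index>.txt' and that LIWC wrote
# 'data/LIWC2015_results.csv' with a 'Filename' column -- both are assumptions.
df_liwc = pd.read_csv(data_folder + "LIWC2015_results.csv")
df_liwc["report_id"] = df_liwc["Filename"].str.replace(".txt", "", regex=False).astype(int)

df_sentiment_sketch = (df_reports.reset_index(drop=True)
                                 .merge(df_liwc, left_index=True,
                                        right_on="report_id", how="left"))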

In [16]:
df_report_sentiment = sentiment.add_sentiment_data(df_reports.reset_index())
df_report_sentiment.head(1)
Out[16]:
Duration Location State Occurred Posted Reported Shape Summary nuforc_note link ... Comma Colon SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
url
(http://www.nuforc.org/webreports/086/S86976.html, http://www.nuforc.org/webreports/086/S86976.html) 12.5 Terre Haute IN 2012-02-01 20:45:00 2/3/2012 2012-02-02 20:46:00 jets investigate strange lighti was on my pati... [] [] ... 2.45 0.0 0.0 0.0 0.49 0.0 0.98 0.0 0.0 0.49

1 rows × 104 columns

The LIWC2015 paper explains the emotion categories. The following columns are interesting for us:

  • family
  • friend
  • posemo ( positive emotion )
  • negemo ( negative emotion )
  • anger
  • sad
  • anx (anxiety)
  • hear
  • feel
  • reward
  • achieve
  • risk
  • tentat ( tentative )
  • certain

Word count

Before we analyze the sentiment, let's analyze the length of the reports:

In [17]:
df_report_sentiment.WC.describe(percentiles=[0.25, .5, .75, .9, .95]).to_frame()
Out[17]:
WC
count 117232.000000
mean 189.948504
std 192.333251
min 1.000000
25% 79.000000
50% 140.000000
75% 236.000000
90% 384.000000
95% 515.000000
max 11656.000000

We are not interested in the largest 5% of the reports, therefore let's plot the distribution of the rest:

In [39]:
df_not_long_reports = df_report_sentiment[df_report_sentiment.WC < 1000]
df_not_long_reports.WC.plot.hist(bins=100, figsize=(15, 4), color='#751947')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a23f4df60>
In [41]:
df_not_long_reports.WC.describe().to_frame()
Out[41]:
WC
count 116388.000000
mean 180.906734
std 151.019725
min 1.000000
25% 78.000000
50% 139.000000
75% 233.000000
max 998.000000

We can see that the word count distribution is clearly right-skewed: most reports fall well below the mean, with a long tail of much longer reports.

Sentiment and Shape

We want to answer the question if there is a correlation between shape and emotion. In order to do so, we have to find out which emotions are more prevalent in the reports. The following boxplot shows the occurrence of the emotions we are interested in.

In [25]:
sentiment.emotions_boxplot(sentiment.get_emotion_data(df_report_sentiment), showfliers=False)

Out of the emotions we were initially interested in, we focus on a subset consisting of:

  • hear
  • negemo
  • posemo
  • feel
  • reward
  • certain

We can see that both positive and negative emotions are rather commonly found in the reports, as well as the 'tentative' and 'certain' cognitive processes. This is why we look into these emotions together with the senses 'hear' and 'feel', and how they show up for the most popular shapes.
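
Underneath, this is essentially a group-by over shapes; a minimal sketch (assuming the LIWC category columns carry the lowercase names listed above; the real radar plots come from lib/sentiment.py):

In [ ]:
# Sketch: mean LIWC scores of the selected categories per shape,
# restricted to the most frequently reported shapes.
emotion_cols = ["hear", "negemo", "posemo", "feel", "reward", "certain"]
top_shapes = df_report_sentiment["Shape"].value_counts().head(8).index

(df_report_sentiment[df_report_sentiment["Shape"].isin(top_shapes)]
    .groupby("Shape")[emotion_cols]
    .mean())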

In [26]:
sentiment.plot_sentiment_nets(sentiment.get_emotion_data(df_report_sentiment))

Contrary to our initial thought, the shapes don't seem to convey any meaningful information, nor does there seem to be any correlation between shape and the different emotions.

Conclusion

This was an overall very interesting dataset with many unexpected twists and turns. After a lengthy scraping and cleaning process and an in-depth analysis of the many different sides of the dataset, we came to some interesting conclusions. Firstly, even though we could not find a single blanket explanation that debunks most of the reports, we could give plausible explanations for a lot of them.

Our map analysis gave us interesting insights into the distribution of the reports but, alas, did not give us enough information to correlate the reports with military bases, rocket launch sites, or rural versus urban regions.

Finally, our sentiment analysis showed us promising results connected with the emotions around the different shapes, and we could assess that reporters mostly felt strong certainty about and trust in their reporting, which is an interesting insight.

Miscellaneous

We made word clouds for each year starting at 1964 (the first year with more than 1000 reports) and found the most popular shapes for these years. Over all the years the most popular shape was 'light', so we decided not to include it in our further analysis, just like the most popular words, as they would dominate the word clouds and we could not get any interesting information out of them.
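
A minimal sketch of such a per-year word cloud with the wordcloud package, with 'light' added to the stopwords so it does not dominate (the real generation is post.generate_clipart):

In [ ]:
# Sketch: word cloud of all summaries from a given year, excluding the
# ever-dominant word 'light' so that rarer terms become visible.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def wordcloud_for_year(df, year):
    occurred = pd.to_datetime(df["Occurred"], errors="coerce")
    text = " ".join(df.loc[occurred.dt.year == year, "Summary"].dropna())
    stopwords = set(STOPWORDS) | {"light", "lights"}
    wc = WordCloud(width=800, height=400, stopwords=stopwords,
                   background_color="white").generate(text)
    plt.figure(figsize=(15, 7))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")

wordcloud_for_year(df_reports, 1998)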

In [27]:
interact(post.generate_clipart, year = IntSlider(min=1964,max=2018,step=1,value=1998),
        frequency_years = fixed(post.get_data_for_wordclouds()))
Out[27]:
<function lib.postprocessing_and_plotting.generate_clipart(year, frequency_years)>

References

[1] Huttenlocher, J., Hedges, L. V., & Bradburn, N. M. (1990). Reports of Elapsed Time: Bounding and Rounding Processes in Estimation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(2), 196-213. https://doi.org/10.1037/0278-7393.16.2.196