This project tracks the spread of the coronavirus on social media. The objective was to analyze 1.1 billion geotagged tweets from the year 2020 to measure the popularity of COVID-related hashtags and keywords in both English and Korean.
Objectives
This project set out to track the spread of the coronavirus on social media, with three technical objectives:
0. Implement parallel code using MapReduce
1. Process multilingual text effectively
2. Efficiently handle large-scale datasets
Protocol
0: Created the map.py file to track the occurrence of specific hashtags at both the country and the language level (a sketch appears after step 5).
1: Altered the file so that, when run, it writes a .lang file and a .country file holding the respective count dictionaries.
2: Created a shell script, run_mapper.sh, to loop over each file in the dataset and run the map.py command on it (sketched below).
Note: launching each command with nohup and the & operator ensures the mappers run in parallel; & puts each job in the background, and nohup keeps it running after disconnecting from the server.
3: Used the reduce.py file to merge all of the .lang files into one combined file and all of the .country files into another (sketched below).
4: Altered the visualize.py file to meet the following requirements: the graph is saved as a .png file, the horizontal axis holds the keys, the vertical axis holds the counts from the input file, and the results are sorted in ascending order with only the top 10 keys shown (sketched below).
Note: the ‘keys’ being referenced are the hashtags being searched for. #코로나바이러스, Korean for coronavirus, is the first key, and #coronavirus is the other key used in these graphs.
5: Created an alternative_reduce.py file that outputs a line plot with one line per input hashtag, the days of the year on the horizontal axis, and the number of tweets using that hashtag on each day on the vertical axis (sketched below).
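A minimal sketch of what steps 0 and 1 might look like in map.py, assuming one JSON-encoded tweet per line and the standard Twitter field names (text, lang, place.country_code); the flag names and file layout here are illustrative assumptions, not the project's exact code.

```python
#!/usr/bin/env python3
# Sketch of map.py (steps 0-1): count hashtag occurrences in one
# input file, at both the country and the language level.
import argparse
import json
from collections import defaultdict

parser = argparse.ArgumentParser()
parser.add_argument('--input_path', required=True)
parser.add_argument('--hashtags', nargs='+',
                    default=['#coronavirus', '#코로나바이러스'])
args = parser.parse_args()

# by_country[hashtag][country_code] and by_lang[hashtag][language_code]
by_country = defaultdict(lambda: defaultdict(int))
by_lang = defaultdict(lambda: defaultdict(int))

with open(args.input_path, encoding='utf-8') as f:
    for line in f:
        tweet = json.loads(line)
        text = tweet.get('text', '').lower()
        country = (tweet.get('place') or {}).get('country_code', '')
        lang = tweet.get('lang', '')
        for hashtag in args.hashtags:
            if hashtag.lower() in text:
                by_country[hashtag][country] += 1
                by_lang[hashtag][lang] += 1

# Step 1: write one dictionary per level, named after the input file
with open(args.input_path + '.country', 'w', encoding='utf-8') as f:
    json.dump(by_country, f)
with open(args.input_path + '.lang', 'w', encoding='utf-8') as f:
    json.dump(by_lang, f)
```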
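The run_mapper.sh loop from step 2, combined with the nohup/& note, might look like this; the dataset path and filename pattern are placeholders.

```sh
#!/bin/sh
# Sketch of run_mapper.sh (step 2): launch one map.py process per
# dataset file. The path and pattern below are placeholders.
for file in /data/tweets/geoTwitter20-*; do
    # & backgrounds each mapper so they all run in parallel;
    # nohup keeps them running after logging out of the server
    nohup python3 map.py --input_path="$file" > /dev/null 2>&1 &
done
wait   # block until every background mapper has finished
```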
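The merge in step 3 could look like the sketch below, assuming the map outputs are nested count dictionaries as in the map.py sketch above; the flag names are hypothetical. The same script would be run once over the *.lang files and once over the *.country files.

```python
#!/usr/bin/env python3
# Sketch of reduce.py (step 3): merge many partial-count files
# (the per-file .lang or .country outputs of map.py) into one file.
import argparse
import json
from collections import defaultdict

parser = argparse.ArgumentParser()
parser.add_argument('--input_paths', nargs='+', required=True)
parser.add_argument('--output_path', required=True)
args = parser.parse_args()

total = defaultdict(lambda: defaultdict(int))
for path in args.input_paths:
    with open(path, encoding='utf-8') as f:
        counts = json.load(f)
    for hashtag, subcounts in counts.items():
        for key, n in subcounts.items():
            total[hashtag][key] += n   # sum the partial counts

with open(args.output_path, 'w', encoding='utf-8') as f:
    json.dump(total, f)
```

For example: python3 reduce.py --input_paths outputs/*.lang --output_path reduced.lang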
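Step 4's bar chart could be produced along these lines with matplotlib; the --key flag for selecting a hashtag and the output filename are assumptions.

```python
#!/usr/bin/env python3
# Sketch of visualize.py (step 4): bar chart of the top 10 entries
# for one hashtag in a reduced count file, sorted in ascending
# order and saved as a .png.
import argparse
import json

import matplotlib
matplotlib.use('Agg')   # render straight to a file, no display needed
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument('--input_path', required=True)
parser.add_argument('--key', required=True)   # e.g. '#coronavirus'
args = parser.parse_args()

with open(args.input_path, encoding='utf-8') as f:
    counts = json.load(f)

# keep only the 10 largest values, plotted in ascending order
items = sorted(counts[args.key].items(), key=lambda kv: kv[1])[-10:]
plt.bar([k for k, _ in items], [v for _, v in items])
plt.xlabel('key')
plt.ylabel('number of tweets')
plt.tight_layout()
plt.savefig(args.input_path + args.key + '.png')
```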
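Step 5 needs per-day totals rather than one merged dictionary; the sketch below recovers them by assuming one map output per day with the date embedded in the filename (e.g. geoTwitter20-03-15.zip.lang). That naming scheme and the flags are assumptions.

```python
#!/usr/bin/env python3
# Sketch of alternative_reduce.py (step 5): one line per hashtag,
# day of the year on the x-axis, tweets per day on the y-axis.
import argparse
import glob
import json
import re
from datetime import datetime

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument('--input_glob', default='outputs/geoTwitter20-*.lang')
parser.add_argument('--hashtags', nargs='+', required=True)
args = parser.parse_args()

# day_counts[hashtag][day_of_year] = tweets using the hashtag that day
day_counts = {hashtag: {} for hashtag in args.hashtags}
for path in sorted(glob.glob(args.input_glob)):
    m = re.search(r'20-(\d{2})-(\d{2})', path)   # month-day in filename
    if not m:
        continue
    day = datetime(2020, int(m.group(1)), int(m.group(2))).timetuple().tm_yday
    with open(path, encoding='utf-8') as f:
        counts = json.load(f)
    for hashtag in args.hashtags:
        day_counts[hashtag][day] = sum(counts.get(hashtag, {}).values())

for hashtag, series in day_counts.items():
    days = sorted(series)
    plt.plot(days, [series[d] for d in days], label=hashtag)
plt.xlabel('day of the year')
plt.ylabel('number of tweets')
plt.legend()
plt.tight_layout()
plt.savefig('alternative_reduce.png')
```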
Results
Reflection
Starting and ending this process was significantly easier than getting through the hiccups that came up while trying to get everything to run. The problem-solving process was not as straightforward as I am used to when running my python3 -m pytest commands, because there was no explicit "something is not working correctly here." At one point I forgot the additional operators that run commands in parallel, so my jobs ran slower than they needed to. I was also waking up in the middle of the night and reconnecting to the server, because I preferred to run jobs when no one else was on it. I was really dedicated to getting through this project, but I do not think I would have been as successful without collaboration with the mentors at the Quantitative and Computing Lab.