{"id":19733,"date":"2021-11-17T20:15:47","date_gmt":"2021-11-17T20:15:47","guid":{"rendered":"https:\/\/socialmedialab.ca\/web\/?p=19733"},"modified":"2024-12-11T00:49:20","modified_gmt":"2024-12-11T00:49:20","slug":"start-your-infodemic-research-with-these-newly-released-covid-19-twitter-ids-datasets-from-smlabto-covid-19-twitter-pandemic-archive","status":"publish","type":"post","link":"https:\/\/socialmedialab.ca\/web\/2021\/11\/17\/start-your-infodemic-research-with-these-newly-released-covid-19-twitter-ids-datasets-from-smlabto-covid-19-twitter-pandemic-archive\/","title":{"rendered":"Start Your #Infodemic Research with These Newly Released COVID-19 Twitter IDs Datasets from @SMLabTO&#8217;s COVID-19 Twitter Pandemic Archive"},"content":{"rendered":"<div class=\"wp-block-image\">\n<figure class=\"alignleft size-full is-resized\"><a href=\"https:\/\/stream.covid19misinfo.org\/tweet_ids\" target=\"_blank\" rel=\"https:\/\/stream.covid19misinfo.org\/tweet_ids noopener\"><img decoding=\"async\" src=\"https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Pandemic-archive-square.png\" alt=\"\" class=\"wp-image-19734\" width=\"203\" height=\"203\" srcset=\"https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Pandemic-archive-square.png 500w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Pandemic-archive-square-150x150.png 150w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Pandemic-archive-square-300x300.png 300w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Pandemic-archive-square-420x420.png 420w\" sizes=\"(max-width: 203px) 100vw, 203px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p class=\"has-text-align-left wp-block-paragraph\">Today, we are pleased to announce the formal release of a new <a href=\"https:\/\/stream.covid19misinfo.org\/tweet_ids\" target=\"_blank\" rel=\"noreferrer noopener\">COVID-19 Twitter Pandemic Archive<\/a>, a catalog of datasets containing billions of Tweet IDs for 
COVID-19 tweets and a set of data visualizations featuring high-level monthly stats about the COVID-19 conversations on Twitter. The datasets are being offered as-is for archiving and non-commercial research purposes and are free to download and reuse.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The tweets in these datasets are collected via&nbsp;Twitter\u2019s COVID-19 Streaming Endpoint (API)&nbsp;using a custom script developed by the&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/socialmedialab.ca\/web\/\" target=\"_blank\">Social Media Lab<\/a>. According to Twitter, this new streaming endpoint has no data volume or throughput limitations, and offers a real-time, full-fidelity stream of public Tweets containing the full conversation about COVID-19. (For more information about what tweets are included in this collection see&nbsp;Twitter\u2019s filtering rules&nbsp;for this endpoint.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As per Twitter\u2019s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). 
New datasets are uploaded to the web at the beginning of each month.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For each month, we prepare two data files:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>one file with Tweet IDs for all COVID-19 related tweets that we collect via the API, and<\/li>\n\n\n\n<li>a second file containing a subset of Tweet IDs for COVID-19 related tweets that also contain a vaccine-related word (i.e., words starting with vaccin*, vacin*, or vax*).<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright size-full is-resized\"><a href=\"https:\/\/github.com\/SocialMediaLab\/Tweets_Sampling_Toolkit\/blob\/main\/README.md\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Tweets-Sampling-Toolkit-Square.png\" alt=\"\" class=\"wp-image-19736\" width=\"193\" height=\"193\" srcset=\"https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Tweets-Sampling-Toolkit-Square.png 500w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Tweets-Sampling-Toolkit-Square-150x150.png 150w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Tweets-Sampling-Toolkit-Square-300x300.png 300w, https:\/\/socialmedialab.ca\/web\/wp-content\/uploads\/2021\/11\/Tweets-Sampling-Toolkit-Square-420x420.png 420w\" sizes=\"(max-width: 193px) 100vw, 193px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p class=\"wp-block-paragraph\">To rehydrate tweets from one of the datasets in the COVID-19 Twitter Pandemic Archive (or a newly created random sample of Tweet IDs&#8230; see below), you can use third-party programs such as <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/DocNow\/hydrator\/\" target=\"_blank\">Hydrator<\/a>, the Python library <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/DocNow\/twarc\/\" target=\"_blank\">Twarc<\/a>, or <a rel=\"noreferrer noopener\" href=\"https:\/\/communalytic.com\/\" 
target=\"_blank\">Communalytic Pro<\/a> (dataset limit of 10M Tweet IDs).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/SocialMediaLab\/Tweets_Sampling_Toolkit\/blob\/main\/README.md\" target=\"_blank\">Tweets Sampling Toolkit<\/a><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">As part of the release of this new research data resource, we are also releasing a companion <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/SocialMediaLab\/Tweets_Sampling_Toolkit\/blob\/main\/README.md\" target=\"_blank\">Tweets Sampling Toolkit<\/a>, which allows researchers to create smaller random sample datasets consisting of Tweet IDs derived from one of the larger datasets available in the new COVID-19 Twitter Pandemic Archive.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In addition to creating a random sample, the Tweets Sampling Toolkit can also perform <a href=\"https:\/\/www.probabilitycourse.com\/chapter1\/1_2_2_set_operations.php\" target=\"_blank\" rel=\"noreferrer noopener\">set operations<\/a> such as union, difference, and intersection to compare two or more datasets. For example, if you have previously collected your own dataset of COVID-19 related tweets using Twitter\u2019s Standard Search or Streaming API, you could compare it with one of the datasets published in the COVID-19 Twitter Pandemic Archive. This can be done using the \u201cunion\u201d function provided in the Tweets Sampling Toolkit to merge two or more datasets of Tweet IDs while excluding duplicates. Alternatively, you can use the \u201cdifference\u201d function to identify and recollect only those tweets (based on their Tweet IDs) that are not part of your original dataset. 
Finally, you can use the \u201cintersection\u201d function to locate Tweet IDs that appear in two or more datasets.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today, we are pleased to announce the formal release of a new COVID-19 Twitter Pandemic Archive, a catalog of datasets containing billions of Tweet IDs for COVID-19 tweets and a set of data visualizations featuring high-level monthly stats about the COVID-19 conversations on Twitter. The datasets are being offered as-is for archiving and non-commercial research [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":19734,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[495,41,483,494,265,554,264],"tags":[518,563,568,559,231,529,567,562,18],"class_list":["post-19733","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analytics","category-announcements","category-health","category-misinformation","category-research","category-research-tools","category-web-apps","tag-coronavirus","tag-covid-19","tag-covid-19-tweets-epidemiology","tag-data","tag-health-informatics","tag-infodemic","tag-pandemic","tag-social-media-research","tag-twitter"],"_links":{"self":[{"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/posts\/19733","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/comments?post=19733"}],"version-history":[{"count":12,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/posts\/19733\/revisions"}],"predecessor-version":[{"id":21234,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/posts\/19733\/revisions\/
21234"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/media\/19734"}],"wp:attachment":[{"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/media?parent=19733"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/categories?post=19733"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/socialmedialab.ca\/web\/wp-json\/wp\/v2\/tags?post=19733"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}