Independent Study Opportunity for INFO and MSIM Students—2 to 3 Credits—Spring of 2022
This is an independent study opportunity for INFO and MSIM students with a strong technical and programming background (object-oriented programming, MySQL experience, and ideally some PHP programming; familiarity with AWS and git is also beneficial).
Context
This opportunity is part of the ongoing twin project of bibliometric research and curation of the Digital Government Reference Library (DGRL) and the Disaster Information Reference Library (DIRL). The two libraries systematically collect peer-reviewed academic publications in two separate subject domains: (a) Digital Government and (b) Disaster Information Management. We intend to expand the data collection and curation efforts from a manual, human-expert-only process to an ICT-supported screening, collecting, and sorting process. This process creates an initial pool of preselected, narrowed-down candidate records, which human experts can then vet for ultimate eligibility, completeness, and inclusion in one of the two reference libraries.
Significance
The DGRL, with thoroughly curated reference entries to some 17,000 English-language peer-reviewed academic publications in Digital Government, represents the body of knowledge in this study domain. It has become a major source for researchers when starting a research project, reviewing a research manuscript, or analyzing research trends in the study domain. Ever since its inception in 2005, the DGRL has been updated on a semiannual basis. Over the years, the number of peer-reviewed research articles on topics surrounding Digital Government has increased from an annual rate of 400+ to over 2,000. At this volume, continuing manual curation is becoming less and less feasible. Automated and ICT-supported curation appears to be the timely and appropriate avenue to secure this highly valued resource for the Digital Government research community.
In parallel, the still smaller DIRL, which was launched only in 2018, appears to follow the trajectory of the DGRL, with a current rate of 500+ new entries per annum. While the DIRL has not yet attained the widespread recognition that the DGRL already enjoys, it seems to be only a matter of time before it has a similar impact in the domain of disaster-related information management. In both cases, the move from manual curation to automated and ICT-supported curation has become a necessity.
Automated Collection and ICT-supported Curation (ACIC)—The Project
The envisioned automated collection and ICT-supported curation (ACIC) project would be informed by and most likely follow the current manual approach. ACIC shall be programmed and documented as open-source code. http://faculty.washington.edu/jscholl/
The manual curation process unfolds like this:
“First, we pull keywords from a set list. The DGRL includes 51 keywords, while the smaller DIRL includes only 10. These keywords can be phrases, such as “digital democracy,” or include Boolean operators, such as “disaster OR crisis OR emergency AND information management.” In Google Scholar, we then use advanced search features and settings to filter results. First, under Settings, we limit the results to pages written in English. Second, we enter our keywords into the Advanced Search page and further limit the results to only return articles dated within a certain range.
Unfortunately, Google Scholar only allows for limiting the date range by year, making it difficult to narrow the results to the true window of interest, usually 6 months since the last iteration of the database. After hitting search and being navigated to the results page, we sort the results using the “sort by date” feature, which brings the newest results to the top and works backward chronologically. With our limited (manual) capacity, we look through results going back 10 pages (some 100 entries); however, this often does not bring us sufficiently back in time to the last database iteration, meaning that we are possibly missing relevant resources published in the few months after the last iteration.
Finally, we check relevant sources for relevancy and quality.
Based on an understanding of the criteria for inclusion in the database, we open potentially relevant results and determine whether the source is credible. Because the results from Google Scholar are so varied, this understanding is necessary to know to exclude something like “Travelling While Black: Essays Inspired by a Life on the Move,” but to include “Digital Democracy, Social Media, and Disinformation,” results that appear next to each other in a keyword search of “digital democracy.”
If the title seems relevant, we open the source, but if it is clearly an un-credited PDF, a dissertation, or not written in English, we do not include it. If it appears to be from a journal that we are not familiar with, we run the journal title through Ulrichsweb (available from the UW Library system: https://ulrichsweb-serialssolutions-com.offcampus.lib.washington.edu) to determine whether the journal is legitimate. If it is, we save the source to a Zotero library (https://www.zotero.org/), and if not, it is excluded.”
The open-source ACIC code, to be developed for, stored in, and executed from the Cloud, would crawl the scholar.google.com site in the same fashion as human expert curators do. The crawler finds potential new records for a given timeframe along a pre-specified list of keywords, and it deposits the full reference record in RIS format on a Cloud-based stack for further inspection. The crawler suspends operation once the hit rate for potentially eligible records has fallen to less than 1 in 10 inspected records for a given annual timeframe.
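The stopping rule could be read as a sliding window over the most recently inspected records. The following is a minimal sketch under that assumption, not a definitive design: the function name collect_until_sparse, the looks_eligible callback, and the window and min_hits parameters are hypothetical, and the results are assumed to arrive newest first from whatever retrieval step the project settles on.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, TypeVar

T = TypeVar("T")


def collect_until_sparse(
    results: Iterable[T],
    looks_eligible: Callable[[T], bool],
    window: int = 10,   # assumption: "1 in 10" means the last 10 inspected records
    min_hits: int = 1,  # assumption: stop when fewer than 1 of those 10 was a hit
) -> List[T]:
    """Keep potentially eligible records until the recent hit rate drops too low."""
    recent: Deque[bool] = deque(maxlen=window)  # outcomes of the last `window` checks
    kept: List[T] = []
    for record in results:
        hit = looks_eligible(record)
        recent.append(hit)
        if hit:
            kept.append(record)
        # Only judge the hit rate once a full window has been inspected.
        if len(recent) == window and sum(recent) < min_hits:
            break
    return kept


# Hypothetical usage: three early hits, then a long run of irrelevant titles.
titles = ["digital government study"] * 3 + ["unrelated essay"] * 12 + ["digital late hit"]
kept_titles = collect_until_sparse(titles, lambda t: "digital" in t)
```

A sliding window is only one interpretation; a cumulative hit rate per annual timeframe would be an equally plausible reading and would stop later on searches that start with many hits.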
Once the initial crawl has produced the stack of candidate records, the collected records need to be inspected for completeness. Incomplete records need to be flagged. The list of keywords found in candidate records needs to be inspected. New keywords provided by the candidate records need to be added into a monitor stack. New keywords with high frequency counts need to be flagged and considered for inclusion in the respective keyword registry. Likewise, (old) keywords that produced very low or no numbers of candidate records need to be flagged in the keyword registry and considered for future exclusion in searches.
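The keyword bookkeeping described above can be sketched with simple frequency counts. This is an illustration only: the function review_keywords, its parameters, and both thresholds are hypothetical, and the actual cut-offs for "high frequency" and "very low" would need to be set by the curators.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def review_keywords(
    registry: List[str],
    candidate_keywords: Iterable[List[str]],
    hits_per_search_keyword: Dict[str, int],
    add_threshold: int = 3,   # assumption: what counts as "high frequency"
    drop_threshold: int = 0,  # assumption: what counts as "very low or no" hits
) -> Tuple[Dict[str, int], List[str], List[str]]:
    """Return (monitor stack, keywords to consider adding, keywords to consider dropping)."""
    known = {k.lower() for k in registry}
    # Count every keyword that appears in the candidate records, case-insensitively.
    counts = Counter(kw.lower() for kws in candidate_keywords for kw in kws)
    # The monitor stack holds keywords that are not yet in the registry.
    monitor = {kw: n for kw, n in counts.items() if kw not in known}
    suggest_add = sorted(kw for kw, n in monitor.items() if n >= add_threshold)
    # Registry keywords whose searches produced (almost) no candidate records.
    suggest_drop = sorted(
        k for k in registry if hits_per_search_keyword.get(k, 0) <= drop_threshold
    )
    return monitor, suggest_add, suggest_drop


# Hypothetical usage with made-up keywords and counts.
monitor, to_add, to_drop = review_keywords(
    registry=["digital democracy", "e-government"],
    candidate_keywords=[
        ["Digital Democracy", "open data"],
        ["open data", "smart city"],
        ["open data"],
    ],
    hits_per_search_keyword={"digital democracy": 12},
)
```

The function only flags keywords; the decision to add or retire a keyword stays with the human curators, as the text requires.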
The ACIC algorithm performs further eligibility checks along the pre-specified list of criteria and eliminates records that would not qualify for inclusion in the respective reference library. The code also marks records that it found both eligible for inclusion and complete.
Since the candidate records cannot be curated fully automatically, the final list of candidates, including the incomplete records, needs to be inspected by human subject matter experts.
To that end, ACIC must provide a web-based user interface that lets the human expert inspect, add, edit, and delete records; it also needs to let the human expert inspect, add, edit, and delete keywords.
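As a rough illustration of such a screening step, the sketch below classifies a record as rejected, incomplete, or eligible. The field names, the REQUIRED_FIELDS tuple, and the two rejection rules are assumptions loosely based on the manual process quoted earlier (non-English sources and dissertations are excluded); the actual criteria list is not specified here.

```python
from typing import Dict, List, Tuple

# Assumption: the fields a record needs before it counts as complete.
REQUIRED_FIELDS = ("title", "authors", "year", "venue")


def screen(record: Dict) -> Tuple[str, List[str]]:
    """Return (status, missing_fields) for one candidate record."""
    # Assumed rule: only English-language publications qualify.
    if record.get("language", "en") != "en":
        return "rejected", []
    # Assumed rule: dissertations are excluded.
    if record.get("type") == "thesis":
        return "rejected", []
    # Empty or absent required fields make the record incomplete, not rejected,
    # so that a human expert can still complete it.
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        return "incomplete", missing
    return "eligible", []


# Hypothetical usage.
complete = {"title": "T", "authors": ["A. Author"], "year": 2021, "venue": "Example Journal"}
status, missing = screen(complete)
```

Checks that need outside information, such as verifying a journal's legitimacy, are deliberately left out of this sketch.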
ACIC must also provide an administrative function by which it can be set up and maintained. The administrative function needs to allow for the bulk export, in RIS format, of records that the human expert curators marked as eligible for inclusion.
An integral part of the project is the detailed documentation of functionality (annotated code) and an online editable user manual (for lay users).
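A bulk export along these lines could look like the following sketch. The record layout (a dictionary with a status field and bibliographic fields) is a hypothetical stand-in for whatever storage ACIC ends up using; the RIS tags themselves (TY, AU, TI, JO, PY, DO, KW, ER) are standard, with each entry starting with TY and ending with ER.

```python
from typing import Dict, Iterable, List


def to_ris(record: Dict) -> str:
    """Render one record as an RIS entry (TY first, ER last)."""
    # Assumption: records default to journal articles unless a type is given.
    lines: List[str] = [f"TY  - {record.get('type', 'JOUR')}"]
    lines += [f"AU  - {author}" for author in record.get("authors", [])]
    for tag, field in (("TI", "title"), ("JO", "journal"), ("PY", "year"), ("DO", "doi")):
        value = record.get(field)
        if value:  # skip fields the record does not have
            lines.append(f"{tag}  - {value}")
    lines += [f"KW  - {kw}" for kw in record.get("keywords", [])]
    lines.append("ER  - ")
    return "\n".join(lines)


def export_eligible(records: Iterable[Dict]) -> str:
    """Concatenate RIS entries for records the curators marked as eligible."""
    entries = [to_ris(r) for r in records if r.get("status") == "eligible"]
    return "\n\n".join(entries) + "\n"


# Hypothetical usage with made-up records.
ris_text = export_eligible([
    {"status": "eligible", "authors": ["Doe, J."], "title": "Digital Democracy",
     "journal": "Example Journal", "year": 2021},
    {"status": "rejected", "title": "Other"},
])
```

The resulting text can be imported by reference managers that read RIS, such as Zotero.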
Academic Impact and Recognition
There will be the opportunity to be a contributing author on a paper from this work. Once ACIC has been successfully developed and tested, we will write an academic report on the approach and the test results and submit it to a technical journal.
The independent study covering the programming and testing of the above extensions is worth 2 to 3 credits. It is ideally suited for students who want to work in a team of two.
Registering
To register, please contact Student Services for the Independent Study Form (INFO499 or IMT600, respectively). In the description field, you can use the contents of this announcement.