As a graduate student, I’ve found it’s difficult not to be a little obsessed with funding. I know that graduate student experiences with funding differ among disciplines and among sub-disciplines. Some graduate students have probably never had to think about funding once during their graduate lives. But in ecology / evolutionary biology, students regularly apply for small grants and even help their PIs write larger proposals to the National Science Foundation (NSF).

NSF is the major funder of basic research in the US. It’s the lifeblood of my department, and of departments in various disciplines across the US. But recently NSF funding has become tougher and tougher to acquire, with some panels reporting a tiny ~5% chance of being awarded a grant you apply for. This has only intensified the interest among graduate students and faculty in NSF grants. What makes a good grant? What other factors come into play? How can I increase my odds of being awarded a grant I need to do my research?


In the next few blog posts, I’ll be exploring a simple but surprisingly fruitful dataset on NSF grant awards. If NSF is the lifeblood of basic scientific research, surely we can learn something by studying patterns of funding over time.

I came across this dataset on Data is Plural, which you should sign up for if you don’t already subscribe. NSF publishes some information about each grant that they award: who the grant was awarded to, how much money it was worth, how many PIs were on the grant, and, happily, the abstract of the grant itself. This isn’t a whole lot of information, but its bulk (there are ~half a million records of grants) and depth (the data goes back to the 1950s) allow us to do some cool data science. Additionally, the inclusion of the abstract allows us to do some natural language processing and figure out how the language of science changes over time.

overview

This first blog post is actually going to be pretty boring: all we’re going to do is download the data from NSF’s website and parse the xml files that store the data with the awesome BeautifulSoup library for Python. We’ll extract the information we want from the xml files and save it to a csv file. Later on, for faster reading and writing, we’ll use the feather library to save the resulting dataframe, which can be read easily by both R and Python.
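As a preview of that last step, here’s a minimal sketch of what the csv-to-feather conversion could look like with pandas (the file names are placeholders, and to_feather needs the pyarrow or feather-format package installed):

import pandas as pd

# read the csv produced by the parsing step below
df = pd.read_csv('out.csv')

# write a feather file, which both R and Python can read back quickly
df.to_feather('nsf_awards.feather')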

I believe in using the best tool for the job, so I’m going to bounce back and forth between using R and Python in the following posts, depending on what language I find easier to use for a given task.

If you don’t have a nice Python3 environment set up, I suggest you save yourself a headache and a lot of time and install the Anaconda distribution of Python3, which comes with a lot of the handy libraries we’ll use.

downloading the data

The first step is simply getting the data. NSF’s website has a link for the grants associated with each year. If you click on one of the links, you get a zip file called something like 2007.zip that contains thousands of xml files when extracted. We could download each of these by hand, but there are something like 60 links on the page, so we might as well have Python do the heavy lifting for us.

I start by creating a subdirectory called data that I’ll use to dump all the xml files into. I’ll then use the requests library to download the web page:

import os
import requests
import bs4
import re
import zipfile
import io

# make a new directory called ./data
if not os.path.exists('./data'):
    os.makedirs('./data')

# change into the new directory
os.chdir('./data')

url = "https://www.nsf.gov/awardsearch/download.jsp"

# download the web page
res = requests.get(url)

# make a fuss if something went wrong
res.raise_for_status()

Now our goal is to find all the links we want and download them all. This is where BeautifulSoup shines. When I look at the links on NSF’s webpage, I see that they all look something like https://www.nsf.gov/awardsearch/download?DownloadFileName=1977&All=true, so I’m going to use a dead simple regular expression with the re module to find only the links I want: the ones that begin with download:

# find the zip file links on the page
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a", href=re.compile("download.*"))

# extract the link part, adding the base url
links_correct = []
for link in links:
    links_correct.append('https://www.nsf.gov/awardsearch/' + link['href'])

At this point, I have a list called links_correct that contains the links to download each zip file on the web page. Now it’s just a matter of iterating through the links, downloading with requests, and extracting the files with zipfile:

# exclude the first link
for i, link in enumerate(links_correct[1:]):
    print("processing link {} of {}".format(i, len(links_correct[1:])))
    # from https://stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
    res = requests.get(link)
    z = zipfile.ZipFile(io.BytesIO(res.content))
    z.extractall()

This takes some time. There are things I could definitely improve about this script if I really wanted to. For example, I could re-write this last loop so that it executes in parallel, since I don’t care about the order in which the data are downloaded. But this script didn’t take long to write and gets the job done.
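As a rough sketch of that idea (nothing later depends on it), a threaded version might look something like this, where download_and_extract is a hypothetical helper wrapping the body of the loop above:

import concurrent.futures
import io
import zipfile

import requests

def download_and_extract(link):
    # hypothetical helper: fetch one zip file and unpack it into the current directory
    res = requests.get(link)
    res.raise_for_status()
    zipfile.ZipFile(io.BytesIO(res.content)).extractall()

# the downloads are independent, so run a few at a time
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # consume the iterator so any exceptions get raised here
    list(executor.map(download_and_extract, links_correct[1:]))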

parsing the XML files

The result of running the above code is something like half a million xml files. xml files are sort of like HTML: there are tags that are associated with different parts of the text. Here’s an example of one of the files:

<?xml version="1.0" encoding="UTF-8"?>

<rootTag>
  <Award>
    <AwardTitle>Continued Operation of the Guerrero Accelerograph Network</AwardTitle>
    <AwardEffectiveDate>09/01/2001</AwardEffectiveDate>
    <AwardExpirationDate>08/31/2006</AwardExpirationDate>
    <AwardAmount>176920</AwardAmount>
    <AwardInstrument>
      <Value>Continuing grant</Value>
    </AwardInstrument>
    <Organization>
      <Code>07030000</Code>
      <Directorate>
        <LongName>Directorate For Engineering</LongName>
      </Directorate>
      <Division>
        <LongName>Div Of Civil, Mechanical, &amp; Manufact Inn</LongName>
      </Division>
    </Organization>
    <ProgramOfficer>
      <SignBlockName>Richard J. Fragaszy</SignBlockName>
    </ProgramOfficer>
    <AbstractNarration>This action is to continue support for operation of the Guerrero network, together with analysis of the data collected.&lt;br/&gt;&lt;br/&gt;The Guerrero Accelerograph Network was installed in 1985 as a 30-station network of strong motion instruments along the coast of Mexico. The purpose of the network is to recover accelerograms from an expected magnitude 8 earthquake in the Guerrero seismic gap. While this major earthquake has not yet occurred, the network has obtained accelerograms from numerous small to moderate-sized earthquakes. In the 15 years through the end of 1998, the network recorded 1768 high-quality digital accelerograms from 738 earthquakes with magnitudes less than 3 to 8.1.&lt;br/&gt;&lt;br/&gt;The Guerrero network is operated as a joint project of the Instituto de Ingenieria at the Universidad Nacional Autonoma de Mexico (UNAM) in Mexico City, and the University of Nevada at Reno. Major support is provided by UNAM, which has upgraded most of the equipment from 12-bit to 19-bit digital recorders, as well as providing the major part of the field support, data playback, and data organization.&lt;br/&gt;&lt;br/&gt;Data from the expected break on the Guerrero gap will have international significance. In the United States, the data will provide estimates of ground motion in the Pacific Northwest due to a comparable earthquake expected on the Cascadia subduction zone; also for locations affected by the subduction zone in Alaska.</AbstractNarration>
    <MinAmdLetterDate>09/06/2001</MinAmdLetterDate>
    <MaxAmdLetterDate>08/23/2005</MaxAmdLetterDate>
    ...

We see there are tags associated with the type of grant, its title, the directorate (what branch of NSF funded the grant), the program officer, the grant abstract, etc.

For our purposes, I’m going to record:

  • the amount and type of the grant
  • the title and abstract of the grant
  • when the grant was set to start and end
  • the program officer
  • the investigators and their institution
  • the directorate and division

Again, BeautifulSoup is going to make our lives a lot easier by grabbing the text associated with a given tag. We first read in the xml file, then hand it over to BeautifulSoup, making sure to tell it that we’re working with xml and not, for example, html files:

import os
import bs4

# get into the correct directory
os.chdir('./data')

# get a list of all the files
all_files = [file for file in os.listdir('.') if file.endswith(".xml")]

# loop through the files
for i, file in enumerate(all_files):
    file_name = file
    handler = open(file).read()
    soup = bs4.BeautifulSoup(handler, 'xml')

From here, we can extract the title of the grant using

title = soup.AwardTitle.text

soup.AwardTitle finds the <AwardTitle> tag, while .text extracts the text from the tag. We can follow similar logic to extract the information from the other tags we’re interested in.
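For example, nested tags like the directorate’s name can be reached by chaining attribute lookups, with .text doing the same job at the end of the chain:

# drill down through nested tags: <Organization> -> <Directorate> -> <LongName>
directorate = soup.Organization.Directorate.LongName.text

# the award amount works the same way
amount = soup.Award.AwardAmount.text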

Finally, we need to write the data to some sort of file. Because csv files are so prevalent and flexible, we’ll write the data to a single csv file, one row at a time (again, there are faster, better ways to do this). For this we can use the csv module.

In all, my code looks like this:

import os
import bs4
import csv


os.chdir('./data')

all_files = [file for file in os.listdir('.') if file.endswith(".xml")]

with open('../out.csv', 'w', newline='') as csvfile:  # newline='' lets the csv module handle line endings itself
    writer = csv.writer(csvfile)
    writer.writerow( ('file_name', 'directorate', 'division', 'title', 'institution', 'amount', 'grant_type', 'abstract', 'date_start', 'date_end', 'program_officer', 'investigators', 'roles', 'number_pis') )
    for i, file in enumerate(all_files):
        try:
            # read in file
            file_name = file
            handler = open(file).read()
            soup = bs4.BeautifulSoup(handler, 'xml')

            # record a bunch of stuff about the grant
            directorate = soup.Directorate.LongName.text
            division = soup.Division.LongName.text
            title = soup.AwardTitle.text

            institution = soup.Institution.Name.text

            amount = soup.Award.AwardAmount.text
            grant_type = soup.Award.AwardInstrument.Value.text
            abstract = soup.Award.AbstractNarration.text

            # these dates will need parsing later
            date_end = soup.AwardExpirationDate.text
            date_start = soup.AwardEffectiveDate.text

            program_officer = soup.ProgramOfficer.text

            investigators = list()
            roles = list()
            for investigator in soup.find_all("Investigator"):
                investigators.append(investigator.FirstName.text + " " + investigator.LastName.text)
                roles.append(investigator.RoleCode.text)

            number_pis = len(investigators)

            try:
                writer.writerow( (file_name, directorate, division, title, institution, amount, grant_type, abstract, date_start, date_end, program_officer, investigators, roles, number_pis) )
            except:
                writer.writerow( ('NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA') )  # one 'NA' per column
                print("problem writing the csv row")
        except:
            # this occurred three times in the whole dataset
            pass

        if i % 100 == 0:
            print("on the {}th file".format(i))

You’ll notice there are some try / except clauses here. A few of the xml files were in some way corrupted or did not contain the fields they should have contained. I simply ignore those files, of which there were about a thousand. I should go back and see what’s up with these files, but for our purposes they represent less than half a percent of all the grants, and we’re going to end up cutting down this sample size further on anyway. The try / except clauses allow the program to keep running after it encounters an exception, like when the <AbstractNarration> tag is missing.
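If I wanted to handle these cases more gracefully, I could check for a missing tag explicitly rather than relying on a bare except; a minimal sketch of that idea:

# find() returns None when a tag is missing, so we can fall back to a default
abstract_tag = soup.find("AbstractNarration")
abstract = abstract_tag.text if abstract_tag is not None else "NA"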

Running this script takes some time. It’s the sort of thing you set up before you go to lunch. But when it finishes, you’ve

  1. programmatically downloaded the data from NSF’s website and
  2. extracted the parts you care about into a flexible, useful format

take-aways

  • BeautifulSoup is a great library for working with HTML or XML files and extracting links from webpages
  • Downloading thousands of files can be accomplished using a short, simple Python script