Satoshi Village the blog of Daniel Himmelstein

Four years of fellowship: annual summaries for my NSF Graduate Research Fellowship

As a first-year graduate student at UCSF, I took a mandatory course titled Scientific Writing, which helped students apply for the National Science Foundation’s Graduate Research Fellowship. I was fortunate to receive the fellowship (Grant No. 1144247), which has funded the bulk of my PhD since its third year.

At the end of each fellowship year, fellows submit an Annual Activities Report, which includes a written Fellowship Year Summary. Below I’ve reproduced my summaries from the last four years. But first, here’s a screenshot of the prompt:

Screenshot of the Fellowship Year Summary prompt

Summary of 2012–13 dated April 1, 2013

The 2012–2013 reporting period spans my first and second years of graduate school. The first year of graduate school consisted of coursework, seminars, and rotations. My first rotation, supervised by Dr. Ryan Hernandez, focused on human evolution and population genomics. I simulated sequencing data capturing the complexity of human demographic history. During my second rotation with Dr. John Witte, I compared the prevalence of gene-gene associations with disease under additive and epistatic disease models. The project, related to my original NSF proposal, resulted in a poster presentation at the American Society of Human Genetics Annual Meeting. In my third rotation, with Dr. Sergio Baranzini, I worked on modelling biological entities using heterogeneous networks. The project consisted of parsing and integrating publicly available databases. Professor Baranzini is now my advisor. My project aims to discover new uses for existing drugs using a network-based computational link prediction technique.

As I mentioned in my Personal Statement, I think science could be improved by better communication. Communication includes a broad range of issues ranging from writing style to open-access publishing. Over the past year, I’ve stressed my own ability to communicate. Professor Hernandez requested me as the teaching assistant for a new course he was teaching called Computational Evolutionary Genomics. As part of the quarter-long course, I led weekly discussion sections. I’ve also honed my communication skills through presentations to lab members, students, and faculty.

Over the last year, I’ve begun to apply my interest in high-throughput computationally intensive problems to medicine. Through my research on drug repurposing, I not only hope to find new disease therapies, but also spread my scientific methodologies through improved scientific communication.

Summary of 2013–14 dated March 31, 2014

Intellectual Merit: Over the past year, my research has matured extensively. My project aims to identify genes associated with complex human diseases. My findings will help with disease susceptibility prediction and understanding what biological characteristics underlie pathogenic variants. The cross-disciplinary method improves on a technique first developed for social network analysis and relies on advanced machine learning and distributed computing. I applied for and received a $2,000 UCSF Graduate Student Research Fellowship to add 192 GB of low-latency memory to our HPC workstation. In the past year, I exhibited the work at the American Society of Human Genetics Annual Meeting and presented it at the Biomedical Computation at Stanford Symposium. I also discuss the project in a public YouTube video that I created.

Broader Impacts: My vision for science is one of greater accessibility and communication. I helped other students achieve proficiency in written communication as a teaching assistant for our program’s Writing Predoctoral Fellowships course. Personally, I assisted with peer-reviewing a journal manuscript and writing a grant proposal. In October, I traveled to Guatemala to participate in a Forum on Innovation in Medical Education. Hosted by a private, forward-thinking university, UFM, the forum focused on designing a new medical degree program based on embracing disruptions. Throughout the forum, I suggested ways for strong communication and accessibility to be embedded into the structure of the new curriculum. I am an outspoken advocate of open-access publishing that allows unlimited reuse with attribution. I also support developments to make science more transparent and incremental. To that end, I have developed a personal website describing my research and have begun posting material to figshare.

Summary of 2014–15 dated March 18, 2015

Intellectual Merit: My research focuses on making novel biological predictions from heterogeneous networks. Heterogeneous networks contain multiple entity and relationship types, making them an ideal approach for data integration. Applying our edge prediction algorithm to a multiscale network of pathogenesis, we successfully predicted disease-associated genes, uncovered multiple sclerosis risk genes, and highlighted the mechanisms underlying pathogenesis. This study is currently available as a preprint. In July, I gave my first talk at a major international conference, when I presented this research at the Intelligent Systems for Molecular Biology conference in Boston, supported by an NSF-funded travel fellowship. Furthermore, our study was featured in the Biomedical Computation Review, a quarterly magazine produced by Stanford.

Next, we plan to identify new uses for existing drugs using our network-based approach. We hope to bring new life to the sluggish and increasingly expensive drug discovery process. Interested parties can follow and contribute to this project in real time.

Broader Impacts: In the last year, I have taken many steps to move science closer to my vision of openness. Realizing that science moves faster and reproducibility improves when source code is open, I made 156 public contributions on GitHub. Additionally, I created a website covering our research on heterogeneous networks. We provide our predictions of disease-associated genes through an interactive browser. I also started a personal blog that aims to communicate science to a broad readership. Similarly, UCSF hosted a three-minute thesis competition, where students present their dissertation research with a general audience in mind. Although my submission was not selected as a finalist, I posted the recording online. UCSF also hosts a seminar series that accepts student speaker nominations. My nomination of Lior Pachter, professor at UC Berkeley and author of the Bits of DNA blog, was accepted, and I was fortunate to arrange his visit and discuss science with him. Finally, I performed my first solo peer review for the open access journal PLOS ONE.

At a Bay Area open access event, I met the founder of Thinklab, a nascent platform for massively collaborative open science. Our project on drug repurposing is one of the initial projects using the platform. Thinklab has funded our proposal and rewards community participants for contributing to the project. We have high hopes that the Thinklab model can accelerate science, help reproducibility, and encourage transparent research.

Summary of 2015–16 dated March 6, 2016

Intellectual Merit: For the last year, I’ve been constructing a network for drug repurposing. We will use the network to predict the probability that a given drug–disease pair is an efficacious treatment. Since the network contains multiple node and relationship types, it’s called a hetnet, a term we’re formally popularizing. Our hope is that by integrating diverse data types, the hetnet will form a foundation for powerful analyses.

We’ve migrated to using the Neo4j graph database to store and operate on our hetnets. Leveraging this mature open source technology has improved our interoperability and provided new capabilities. Our interactive tutorial showcases our project and use of the Neo4j technology.

To build our hetnet, we created several intermediate resources. We’re particularly excited about three resources that we think the community will appreciate. First, we created a catalog of medical indications, which can be used to train and evaluate computational methods. The catalog was created by integrating four existing resources. However, we recruited three physicians to review each drug–disease pair and classify it as disease-modifying or symptomatic. Second, we computed consensus signatures for LINCS L1000, which profiles transcriptional responses to perturbation. Third, we created a user-friendly webapp for retrieving Gene Ontology annotations.

Broader Impacts: To construct our hetnet, we integrated 28 public resources. Even though the resources were publicly funded, copyright and access agreements created a major licensing headache. After a 7,500+ word discussion involving lawyers and other experts, we settled on the least bad way forward.

Since undergoing this copyrightmare and realizing that a crisis looms, I’ve become an advocate for data copyright and licensing reform in science. I presented on the legal problems related to data reuse at OpenCon in Belgium and at a local Open Access Week event.

I’ve been performing real-time, open notebook science using Thinklab and GitHub. Our drug repurposing project on Thinklab has generated 60 discussions containing 346 comments by 8 project team members and 29 community reviewers. Getting feedback in the early stages has strengthened our project while helping us avoid poor uses of time.

In the last year, I reviewed for PLOS Computational Biology and the Pacific Symposium on Biocomputing; authored 906 public GitHub contributions; mentored an apprentice and an intern; and crafted a data biologist cookbook for the new students.

Finally, while waiting for my accepted paper to be published, I grew frustrated by the delays. I compared the delays at several journals in my field and blogged the findings. Nature News and The Publication Plan covered my findings. Later I performed a more comprehensive analysis on the history of publishing delays, which was also covered by Nature News and The Publication Plan. My goal with this research is to help researchers avoid excessive delays while replacing anecdote with evidence in the contemporary discussion of scientific publishing.


I’ll close with four reflections:

  1. My productivity exceeded linear growth over the four-year period. Experience has played a role, in addition to the rise of open source tools for data science.
  2. The number of hyperlinks per year has grown rapidly from 0 to 3 to 13 to 21. I attribute much of this increase to open science, which has led to more public output per unit of research time.
  3. The deadline has always been May 1, yet each year I’ve finished the summary further in advance than the year before. Am I becoming less just-in-time?
  4. I have yet to “discover new uses for existing drugs” despite aiming for this in 2013. Let’s make 2016 the year!
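
The third reflection is easy to check. Here is a minimal sketch in Python, assuming that the date on each summary above is the day I finished it and that the deadline is May 1 of that same year. The variable and function names are just for illustration.

```python
from datetime import date

# Dates taken from the four summary headings above.
# Assumption: each date is the day the summary was finished.
summary_dates = [
    date(2013, 4, 1),
    date(2014, 3, 31),
    date(2015, 3, 18),
    date(2016, 3, 6),
]


def days_before_deadline(dated: date) -> int:
    """Return the number of days between a summary's date and May 1 of the same year."""
    deadline = date(dated.year, 5, 1)
    return (deadline - dated).days


# Margin (in days) by which each summary beat the deadline, in chronological order.
margins = [days_before_deadline(d) for d in summary_dates]

for dated, margin in zip(summary_dates, margins):
    print(f"{dated.isoformat()}: {margin} days before the deadline")
```

Under those assumptions the margins are 30, 31, 44, and 56 days, so each summary was indeed finished further ahead of the deadline than the one before.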