Five principles to get undergraduates involved in real-world data science projects
By Jae Yeon Kim, computational social scientist and PhD candidate in Political Science at UC Berkeley
As a D-Lab and Data Science Education Program Fellow at the University of California, Berkeley in Spring 2020, I helped to ensure and enhance the quality of more than 40 Data Science Discovery Projects, working with community partners and undergraduate research assistants. The goal of these projects was to connect undergraduates with community impact groups, entrepreneurship ventures, and educational initiatives across UC Berkeley and provide them with hands-on and team-based research opportunities outside the classroom.
There are many challenges related to the successful implementation of projects like this. The Discovery program selects and helps to match community partners and undergraduates, all while maintaining a tight schedule and upholding high standards. The program expects the community partners and undergraduate research assistants to produce something tangible by the end of the semester and to present their work at the Data Science Showcase event. I know this from experience. Most recently, I investigated intersectional bias in hate speech and abusive language datasets with four undergraduate research assistants hired by the Discovery program. The undergraduate research assistants became co-authors of the paper, which was accepted at the Fourteenth International Conference on Web and Social Media (ICWSM), Data Challenge Workshop.
My main objective this semester was to create a general framework that the community partners and undergraduate research assistants could use to define and manage their projects effectively. In the process, I created and held workshops on project management, computational reproducibility and version control, bias and fairness in machine learning, and data communication and visualization. In this article, I have distilled the many lessons learned from these workshops into five principles. These guidelines could be useful for universities in helping their undergraduates to get involved in real-world data science projects in a systematic way.
1. Help make better decisions
The first thing undergraduates need to learn is to focus on problems, rather than tools. Tools only have value insofar as they are useful to the partners. Help undergraduates to develop a rapport with the partners, to understand their unique challenges, and to grasp the strengths and limitations of their data. For instance, if a partner is a social service organization that currently lacks basic information on the population it serves, the immediate priority should be finding efficient ways to collect, analyze, and visualize descriptive data. The value of a data science project is defined contextually: it varies with the partner’s needs and with its maturity in terms of data infrastructure and literacy. Undergraduates often skip this step when they try to learn data science by participating in online machine learning competitions, but this experience is crucial if they want to develop a perspective on how to deliver value to their partners. In the end, the goal of data science is to make data useful. Decision-makers care little about which tools (whether a simple regression model or a deep learning algorithm) were used to analyze the data. Instead, they are concerned with whether the results are valuable.
2. Be realistic
“Done” is better than “perfect.” Undergraduates lead busy lives, juggling courses, preparing for internships, and engaging in club activities. They are often unable to spend much time on the project in the second half of a semester because of upcoming final exams. For this reason, if a project team has not collected its desired data by the mid-term, that is a red flag. At UC Berkeley, a semester runs for roughly five months. To create something tangible by the end of the semester, a project team needs to define its research problem by the first month, collect its data by the second month, clean and explore the data by the third month, model and analyze the data by the fourth month, and visualize and interpret the outcomes by the fifth month. (Note that these checkpoints are heuristics that help project teams gauge their progress, not exact estimates.)
3. Set clear communication protocols
Emailing is not the same as project management. Most work emails lack clarity and structure. By contrast, project management entails defining which team members need to do what, by when, and how. It also requires organizing these conversations systematically, ideally sorted by tasks. Minimize the reasons to exchange emails.
Instead of emails, use project communication tools such as Slack. Create a workspace for the group, and then create channels for different aspects of the project (data, programming, paper, presentation, etc.). Use threads to organize discussions around particular tasks in each channel. For instance, do not email group members to ask them whether they agree with an idea; post it on a Slack channel. Unorganized information is a threat to work productivity.
Have weekly meetings (around 1 hour) to check in. Help team members build expectations regarding what they need to do before, during, and after the meeting. Before the meeting, share the agenda in a document format that allows people to comment (e.g., Google Docs). Collectively find out what additional items need to be added to the list. The meeting should involve reviewing (the first 20 minutes), discussing (40 minutes), and setting deadlines (the last 10 minutes). Review what has been done and what needs to be done. Discuss obstacles, bottlenecks, and possible solutions. (When sharing the agenda, ask your team members to think about what problems they struggle with and what solutions they have considered to solve them.) Define and distribute tasks across team members. Update the meeting agenda and share deadlines for each task with team members using both Slack and Google Calendar (or something similar).
4. Use version control
Version control, the process of systematically documenting code and data changes over time, is not optional for managing a complex data science project; it is a necessity.
Set a project structure at the beginning of the semester and ensure team members organize their files accordingly. Make sure team members do not mix raw data (which should be “read-only”) with processed data. If a project deals with repeated computational problems, encourage team members to create functions and to reuse and improve them over time. If multiple groups of team members work on the same dataset, instruct them to create an index variable, so that it is easy to combine the different processed datasets at a later stage. Set file and directory naming conventions to reduce confusion and incompatibility across different operating systems. It is generally good advice not to use capital letters, spaces, or special characters in file and directory names.
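To make the advice above concrete, here is a minimal Python sketch of how a team might set up a project skeleton, flag names that break the naming convention, and mark raw data as read-only. The directory names, the function names, and the exact naming rule (lowercase letters, digits, dots, underscores, and hyphens) are my own assumptions for illustration, not part of the Discovery program's requirements.

```python
import re
import stat
from pathlib import Path

# Hypothetical layout; adapt the directory names to your own project.
SUBDIRS = ["data/raw", "data/processed", "code", "outputs"]

# Assumed convention: lowercase letters, digits, dots, underscores, hyphens.
SAFE_NAME = re.compile(r"[a-z0-9][a-z0-9._-]*")


def create_structure(root: Path) -> None:
    """Create the agreed-upon subdirectories under the project root."""
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)


def is_safe_name(name: str) -> bool:
    """Check a single file or directory name against the convention."""
    return SAFE_NAME.fullmatch(name) is not None


def find_unsafe_names(root: Path) -> list[str]:
    """Return relative paths whose final component breaks the convention."""
    unsafe = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        # Skip hidden entries such as .git, whose contents Git names itself.
        if any(part.startswith(".") for part in rel.parts):
            continue
        if not is_safe_name(path.name):
            unsafe.append(rel.as_posix())
    return sorted(unsafe)


def protect_raw_data(raw_dir: Path) -> None:
    """Remove write permission from every file in the raw data directory."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in raw_dir.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
```

A project manager could run find_unsafe_names before each deadline and ask team members to rename anything it reports. Note that a rule this strict would also flag conventional names such as README.md, so a team may want to exempt a few files.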
Use Git and GitHub throughout the project. Adding “final” to a file name does not make a file “final.” Simply put, we do not know a research project is over until it is published. Use a Git repository as the project directory. I assume that a project team consists of a project manager and several team members. A project manager should make sure the team members push (submit) their code to the code-related subdirectory of the Git repository before each deadline. The code should include comments on major data-related decisions. It should also be tested to avoid dependency problems. Before pushing, team members should check whether they can rerun the code in a completely new session. (The environment is part of the code.) The project manager should review the code and comment on it so that the code and data quality of the project can be ensured.
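One way to approximate the “completely new session” check is to run a script in a fresh interpreter process, so that nothing left over from an interactive session can mask a missing import or variable. The sketch below assumes the analysis code is a Python script; the function name and the timeout value are hypothetical. It does not verify that the same packages are installed on another machine, which would require recording the environment separately.

```python
import subprocess
import sys
from pathlib import Path


def reruns_cleanly(script: Path, timeout_s: int = 600) -> bool:
    """Run a script in a brand-new Python process; True if it exits cleanly."""
    script = script.resolve()
    result = subprocess.run(
        [sys.executable, str(script)],
        cwd=script.parent,  # run from the script's own directory
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        # Show the error so the author can fix it before pushing.
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0
```

A team member could call reruns_cleanly on each script they changed and push only when every call returns True.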
5. Fail often, early, and systematically
Research is an entrepreneurial process. It is more likely to fail than to succeed because the goal is to discover something unknown. Don’t put all your eggs in one basket. It is unwise to invest all of one’s resources in a single, large project. Instead, invest widely. Encourage project teams to identify and pursue multiple directions to solve problems. Develop many small projects, modularize the procedures, document their progress, and frequently examine which of the projects is worth doing and is doable. Often, failing is the only true way to learn what will and will not work. Help project teams to exploit uncertainty and maximize the learning opportunities available from the messy research process.
To recap, here are the five principles for helping undergraduate students to get involved in real-world data science projects.
1. The goal is to help people make better decisions. Focus on the problems, rather than tools.
2. Be realistic. “Done” is better than “perfect.” Provide multiple checkpoints to help team members understand their progress.
3. Set clear communication protocols. Unorganized information is a threat to work productivity. Minimize reasons to exchange emails.
4. Version control is not optional. Set a project structure and use Git and GitHub throughout the project.
5. Fail often, early, and, most importantly, systematically. Develop many small projects, modularize the procedures, document their progress, and frequently examine which of the projects is worth doing and is doable.
About
Jae Yeon Kim is a computational social scientist and a PhD candidate in Political Science at UC Berkeley. He is also a D-Lab Data Science Fellow, a Data Science Education Program Fellow at UC Berkeley, and a co-organizer of the Summer Institute in Computational Social Science in the San Francisco Bay Area (BAY-SICSS).