Data science and journalism coalesce for social justice

By Rowan Walrath, opinions editor

Around 150 data scientists, journalists and innovators filled the Curry Student Center mezzanine to capacity on Wednesday for the inaugural HUBweek hackathon. The four-hour event focused on how data science and reporting could be synthesized to foster social good.

The event was hosted by HUBweek in partnership with InkHouse, Northeastern University’s School of Journalism, its College of Computer and Information Science (CCIS) and the Boston Area Research Initiative (BARI).

John Wihbey, an assistant professor of journalism and new media at Northeastern, served as the emcee for the hackathon. He joked that the event should be called “North by Northeastern,” a play on South by Southwest, a conglomerate of film, interactive media and music festivals and conferences held annually in Austin, Texas.

“Our event name is Data Science, Journalism and the Future of Justice,” Wihbey said. “In journalism – and I’m a journalism professor – we’ve been doing data since the ’60s and ’70s.”

Wihbey referred to computer-assisted reporting, a relatively outdated term that describes the use of computers to gather and analyze data to write news stories. He cited several media outlets that utilize data-driven journalism today, including the Washington Post, FiveThirtyEight and ProPublica.

Tina Cassidy, chief content officer at public relations firm InkHouse, also expounded on the need for reporters to work with experts in other fields.

“One of the themes for HUBweek this year is inclusive innovation,” Cassidy said. She talked about the tendency of journalists to create an insular community, talking almost exclusively to other journalists rather than to other professionals.

Before the hacking began, Wihbey introduced two panels. The first consisted of Randall Lane, editor of Forbes Magazine, and Igor Tulchinsky, founder and CEO of WorldQuant, a quantitative investment management firm.

Tulchinsky drove home one point: All data is important.

“When it comes down to it, no data is useless,” Tulchinsky said. He added that if data can be analyzed and simplified to a level at which people can understand it, those people will find an application for it.

Three professionals from varying backgrounds made up the second panel: Todd Wallack, a reporter on the Boston Globe’s Spotlight team who specializes in data journalism; Dan O’Brien, an assistant professor of public policy and urban affairs and criminology and criminal justice at Northeastern; and Michelle Borkin, a professor in CCIS interested in information and scientific visualization.

With a 22-year career in reporting under his belt, Wallack brought the journalistic perspective to the panel.

“It’s changed a lot to where readers expect not only to see the numbers in the story but actually see the data,” Wallack said.

Wallack added that there has also been a shift in how journalists use data. In 2001, Microsoft Excel spreadsheets were a big deal; now, newsrooms, including the Boston Globe, are hiring data scientists to work with reporters on articles and supplementary visualizations.

“The number of journalists using data has gone up, even as the number of journalists has gone down,” Wallack said.

Wallack writes in Python, a high-level, general-purpose programming language designed for concision and readability. Other data scientists, he said, use R, a programming language and software environment for statistical computing and graphics.

However, even as reporters utilize data, they are receiving some pushback from government officials who are reluctant to give access to data sets, Wallack said. In Massachusetts, police departments and courts have refused to make public data on breathalyzer tests and drunk driving cases.

O’Brien, who comes from the world of public policy, hopes to overcome this pushback by working with city officials on urban planning. He introduced the concept of the “smart city” – the idea that city planners should optimize data, make it efficient and effective and then apply it.

“The new urban science is this other angle […] trying to use data to try to turn the city into a math algorithm,” O’Brien said.

O’Brien came to Northeastern in 2014 from Harvard University, where he was the research director of BARI. The initiative collects and examines data describing the people, places and events in the Greater Boston Area that have been made available for research.

One of the three data sets used at the HUBweek hackathon, information on Boston-area homicides, was provided by BARI. The other two were provided by the City of Boston: a set of Boston Police Department Field Interrogation and Observation (FIO) data, more commonly known as stop-and-frisk data, and multiple sets of crime incident reports.

The last presenter on the panel was Borkin. As a professor, she treated the audience as her students and aimed to foster an understanding of the issues data scientists are addressing.

“Every day, 2.5 million terabytes of data are created, 90 percent of which were generated in the past two years,” Borkin said. Flipping through slides, Borkin focused on three core issues data visualization aims to solve: scalability, complexity reduction and keeping humans in the loop.

After the panelists returned to their seats, Wihbey announced that attendees could grab food, catered by Rebecca’s, and begin hacking. With hummus wraps, turkey sandwiches and roast beef on gluten-free bread in hand, the 150 innovators present got to work.

Jack Michaud, a freshman computer science major at Northeastern, chose to explore two of the data sets provided.

“I took the stop-and-frisk data set and also the crime data set,” Michaud said. “I put it into my own database and made a graph.”

When Michaud, who was working in Python, examined the graph, he discovered that the crime rate dropped significantly every February. Citing the snowstorm of February 2014, he speculated that cold weather alone may have accounted for this.
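A month-by-month tally of the kind Michaud described can be sketched in a few lines of Python. This is a minimal illustration, not Michaud’s code: the function name `incidents_per_month` and the toy dates are assumptions made up for the example and are not drawn from the City of Boston data.

```python
from collections import Counter
from datetime import date


def incidents_per_month(dates):
    """Count incidents for each (year, month) pair."""
    return Counter((d.year, d.month) for d in dates)


# Hypothetical toy records, NOT taken from the City of Boston data set.
toy_dates = [
    date(2014, 1, 5), date(2014, 1, 20), date(2014, 1, 28),
    date(2014, 2, 11),
    date(2014, 3, 2), date(2014, 3, 15),
]

monthly = incidents_per_month(toy_dates)

# The month with the fewest recorded incidents in the toy data.
lowest_month = min(monthly, key=monthly.get)
```

A tally like this only shows when a dip occurs. Whether cold weather explains it would require comparing the counts against weather records.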

Adam Wespiser, a data scientist currently working as a contractor, provided some insight into Python and R, the two main programming languages used by data scientists. R, he said, is great for analyzing and plotting data; Python is more accessible but is not coherent for data analysis.

“What I’m trying to build over the next five to 10 years is a language that’s easy to explore data but deployable,” Wespiser said. “Imagine what we could do if we could deploy quickly and then go to production.”

Aditeya Pandey, a Ph.D. student in CCIS studying under Borkin, was examining the crime incidents data set provided by the City of Boston.

“I’m doing an exploration analysis of my data, just trying to find trends in my data set,” Pandey said. Pandey also planned to bring in census and income data from BARI to scan for correlations.

Emily Hopkins, a graduate student in Northeastern’s College of Arts, Media and Design’s Media Innovation program, was collaborating with several other team members to analyze the FIO data set.

“We chose the Boston Police stop-and-frisk data set and, understanding our limitations with time and capabilities, we decided to look at the intersection of race and clothing,” Hopkins said.

Hopkins used an Excel spreadsheet to examine the data available, running searches for “hood” and “hoodie.” She and her team members found that of all the people subjected to stop-and-frisks, about 19 percent of them were wearing such clothing.
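Hopkins’ team worked in Excel, but the same keyword search could be approximated in Python. The sketch below is an assumption-laden illustration rather than the team’s method: the pattern, the function name `share_matching` and the toy clothing descriptions are all hypothetical, and the real FIO data set may store clothing in a different form.

```python
import re

# Matches "hood", "hoods", "hoodie" or "hoodies" as whole words, ignoring case.
# It does not match "neighborhood" or "hooded".
HOOD_PATTERN = re.compile(r"\bhood(ie)?s?\b", re.IGNORECASE)


def share_matching(descriptions, pattern=HOOD_PATTERN):
    """Return the fraction of descriptions that contain the pattern."""
    if not descriptions:
        return 0.0
    matches = sum(1 for text in descriptions if pattern.search(text or ""))
    return matches / len(descriptions)


# Hypothetical toy descriptions, NOT taken from the Boston Police FIO data.
toy_descriptions = [
    "black hoodie, jeans",
    "red jacket",
    "gray HOOD up",
    "",
    "blue coat",
]

share = share_matching(toy_descriptions)
```

In this toy example, two of the five descriptions match. A real analysis would also need to decide how to treat blank or missing clothing fields, which here count toward the total.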

Hackathons are not without their technical challenges. Neal Jawadekar, a data scientist at Welltok in Boston who is interested in social change, spent part of his evening discussing with his teammates whether they should analyze the crime incidents data set using a decision tree or another approach.

“The problem was, we entered [the data] into the module, and there’s so many different types of crime that it crashed the computer,” Jawadekar said.

After about an hour of work, several of the teams spent the last half-hour of the event presenting their findings with an eye to how the data sets they had been given related to larger issues of social justice.

One group examined gender bias in stop-and-frisks as it relates to race. According to the data, black women were 21 percent less likely than black men to be subjected to an FIO. White women were 29 percent more likely than white men to be subjected to one, and Latina women, 5 percent less likely than Latino men.

Michaud’s team, the same one that had noted a crime drop each February, also noted a considerable spike in crime in May 2014. The potential culprit: The Boston Bruins lost to the Montreal Canadiens 4-3.

Two teams took note of a variable on terrorism in the stop-and-frisk data set. One found that more terrorism-related stops occurred around the time of the Boston Marathon Bombing in April 2013.

Like the teams that took note of the hockey game and the bombing, the winning team combined raw data with real-world events to synthesize data science and journalism, in keeping with the theme of the hackathon. According to the stop-and-frisk data, men accounted for 88 percent of the stops, with black men making up the majority of that number. A black man was more than 22 times as likely to be stopped as a white woman. Significantly, the top eight supervisors in the BPD conducted a majority of stop-and-frisks.

The hasty research done by innovators exemplified the theme of the hackathon, the intermingling of technology, journalism and social good. Hours earlier, during the first panel, Randall Lane characterized this theme as what it was – groundbreaking.

“We’re talking about disrupting journalism,” Lane said. “We’re talking about disrupting justice.”

Photo by Paige Howell
