Heng Ji: Constructing Conversion Databases with Artificial Intelligence

Under the guidance of one of CABBI’s newest PIs, self-learning computer models will soon scan academic journal articles to help CABBI scientists refine their bioenergy research.

Heng Ji, the latest researcher to join CABBI’s Conversion Theme, is working to bring novel machine learning methods to the fields of chemistry and biology. Ji is a Professor of Computer Science and a Professor affiliated with Electrical and Computer Engineering at the University of Illinois Urbana-Champaign. She is also an Amazon Scholar.

An expert in computational linguistics, Ji is no stranger to working on natural language processing, machine learning, and information extraction. Through her work, she aims to help create more efficient ways to consume information.

At CABBI, Ji will develop self-learning computer models that will allow for automatic annotation, analysis, and comparison of scientific literature. Millions of published scientific papers may be relevant to the experts in the Conversion Theme, but it is not humanly possible for the research team to read each article and manually create a knowledge base. By quickly constructing databases with artificial intelligence (AI), Ji will help scientists perform more efficient and targeted research.

Machine learning models are common in computational linguistics and are often used to analyze news articles. However, scientific writing follows a different language structure than the average news article. The sentences are much longer and more complex, which confuses the machine learning models attempting to interpret the sentence structure. To overcome this, Ji employs a technique called semantic parsing, in which sentence sequences are converted into graphs and data that machines can understand and incorporate into databases through graph neural networks.

After constructing these knowledge bases with machine learning, Ji will link different databases into a larger unified database using a process called “graph alignment.” Many universities build scientific literature databases on similar research topics, but currently those databases do not interact with each other. As a result, the data ends up in information silos, which limits scientific collaboration.

“For example, if the University of Illinois produces a database for protein ‘A’ while another university produces a database for protein ‘B,’ information about the proteins would exist on two separate graphs,” Ji said. “But those proteins do not exist in isolation. If you use a graph neural network to plot both proteins ‘A’ and ‘B,’ then you can look at the proteins in relation to each other and get a better representation of both proteins.”
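
As a rough illustration of the idea in Ji's example, the Python sketch below merges two small databases into one graph. It is a toy under stated assumptions, not Ji's method: the article describes graph neural networks, while this sketch aligns entities by simple name matching. The function names, the normalization rule, and the protein and enzyme entries are all hypothetical.

```python
def normalize(name):
    """Map an entity name to a canonical key.

    Hypothetical rule: collapse whitespace and lowercase. A real system
    would need far more robust entity matching.
    """
    return " ".join(name.split()).lower()


def align_and_merge(*databases):
    """Merge (source, relation, target) triples from several databases.

    Entities whose normalized names match are treated as the same node,
    so facts from separate databases end up on one graph.
    """
    merged = {}
    for triples in databases:
        for source, relation, target in triples:
            s, t = normalize(source), normalize(target)
            merged.setdefault(s, set()).add((relation, t))
            merged.setdefault(t, set())  # make sure the target is a node too
    return merged


# Two made-up databases that mention the same enzyme under different spellings.
illinois_db = [("Protein A", "activates", "Enzyme X")]
other_db = [("Protein B", "inhibits", "enzyme  x")]

merged = align_and_merge(illinois_db, other_db)
for node, edges in sorted(merged.items()):
    print(node, sorted(edges))
```

Once the two spellings of the enzyme resolve to one node, proteins "A" and "B" sit on the same graph and can be examined in relation to each other.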

Linking databases will allow Ji to begin the process of pathway discovery. In biochemistry, a pathway is a series of interactions among molecules in a cell that leads to a certain product or a change in the cell. The use and discovery of these pathways are integral to the Conversion Theme’s work in engineering microbial strains that can efficiently produce diverse, high-value molecules such as biodiesel, organic acids, jet fuels, lubricants, and alcohols. From the comprehensive knowledge graph constructed by the AI, Ji can identify certain nodes of interest and determine their relationships as defined by the scientific articles in the database.
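
One simple way to picture relating two nodes of interest on a knowledge graph is a path search. The Python sketch below uses a plain breadth-first search over a tiny hand-made graph. This is an assumption made for illustration only: the article does not say which search or learning method Ji uses, and the node names are placeholders rather than real molecules.

```python
from collections import deque


def find_path(graph, start, goal):
    """Return the shortest chain of nodes from start to goal, or None.

    graph maps each node to a list of nodes it leads to.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of interactions connects the two nodes


# Placeholder graph: each edge stands for an interaction reported in some article.
toy_graph = {
    "substrate": ["intermediate 1", "side product"],
    "intermediate 1": ["intermediate 2"],
    "intermediate 2": ["target molecule"],
    "side product": [],
}

pathway = find_path(toy_graph, "substrate", "target molecule")
print(pathway)
```

In this toy graph the search skips the dead-end side product and returns the chain that runs through both intermediates.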

Each node will have a semantic representation; in other words, each node will contain data and evidence extracted from scientific literature. This includes associations between words that often appear together.
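
Associations between words that often appear together can be approximated, very roughly, by counting how often pairs of terms share a sentence. The Python sketch below does only that. It is a simplified stand-in rather than the representation Ji's models learn, and the sentences and the function name are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how many sentences each pair of distinct words shares."""
    counts = Counter()
    for sentence in sentences:
        # Sort the unique words so each pair has one canonical order.
        terms = sorted(set(sentence.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


# Invented sentences standing in for text extracted from articles.
sentences = [
    "enzyme converts substrate",
    "enzyme binds substrate",
]

counts = cooccurrence_counts(sentences)
print(counts[("enzyme", "substrate")])
```

Pairs with high counts, such as "enzyme" and "substrate" here, are candidates for an association worth recording on the corresponding nodes.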

In addition to identifying relationships via semantics, a machine learning model can also pinpoint nodes that are connected by similar scientific or technical concepts. As more research articles are added to the database, graph alignment will continue to update the representation of each node on the graph. These powerful databases will consolidate existing knowledge and help researchers identify pathways of interest.

Machine learning models can also reveal indirect associations that scientists may overlook. One group of scientific articles may describe protein interactions in a lab; meanwhile, a seemingly unrelated article may describe patient symptoms that result from these same protein interactions. Establishing these connections could substantially speed up scientific research. It may also encourage scientists to consider patterns they had not investigated before.

Ji hopes that her involvement with CABBI will encourage more computer scientists to work on interdisciplinary projects.

“The entry cost for a computer scientist to get involved (in biology or chemistry) is very high because we don’t have the domain knowledge,” she said. “I think we need more computer scientists to have patience and courage to learn and contribute.”

Ji believes that young computer scientists should consider how they can apply their expertise to solving essential, real-world problems.

“The kind of algorithms scientists are applying to today’s chemical engineering problems are literally from 20 years ago,” she said. “We already have so many advanced things going on in computer science, but few of them are applied to real, important scenarios like biology, engineering, and medicine. I think we need more conversations between disciplines.”

Ji was recently featured in an episode of The Story Collider, a podcast that shares true, personal stories about science.

To learn more about Heng Ji and her work, visit her lab’s website.

— Article by CABBI Communications Intern Lucy Nifong and Communications Specialist April Wendling