
    Data Civilizer Finds and Links Related Data Scattered Across Digital Files

    By Larry Hardesty, Massachusetts Institute of Technology, January 19, 2017

    The age of big data has seen a host of new techniques for analyzing large data sets. But before any of those techniques can be applied, the target data has to be aggregated, organized, and cleaned up.

    That turns out to be a shockingly time-consuming task. In a 2016 survey, 80 data scientists told the company CrowdFlower that, on average, they spent 80 percent of their time collecting and organizing data and only 20 percent analyzing it.

    An international team of computer scientists hopes to change that, with a new system called Data Civilizer, which automatically finds connections among many different data tables and allows users to perform database-style queries across all of them. The results of the queries can then be saved as new, orderly data sets that may draw information from dozens or even thousands of different tables.

    “Modern organizations have many thousands of data sets spread across files, spreadsheets, databases, data lakes, and other software systems,” says Sam Madden, an MIT professor of electrical engineering and computer science and faculty director of MIT’s bigdata@CSAIL initiative. “Civilizer helps analysts in these organizations quickly find data sets that contain information that is relevant to them and, more importantly, combine related data sets together to create new, unified data sets that consolidate data of interest for some analysis.”

    The researchers presented their system last week at the Conference on Innovative Data Systems Research. The lead authors on the paper are Dong Deng and Raul Castro Fernandez, both postdocs at MIT’s Computer Science and Artificial Intelligence Laboratory; Madden is one of the senior authors. They’re joined by seven other researchers, from MIT, Technical University of Berlin, Nanyang Technological University, the University of Waterloo, and the Qatar Computing Research Institute. Among them is MIT adjunct professor of electrical engineering and computer science Michael Stonebraker, who won the 2014 Turing Award, the highest honor in computer science.

    Pairs and permutations

    Data Civilizer assumes that the data it’s consolidating is arranged in tables. As Madden explains, in the database community, there’s a sizable literature on automatically converting data to tabular form, so that wasn’t the focus of the new research. Similarly, while the prototype of the system can extract tabular data from several different types of files, getting it to work with every conceivable spreadsheet or database program was not the researchers’ immediate priority. “That part is engineering,” Madden says.

    The system begins by analyzing every column of every table at its disposal. First, it produces a statistical summary of the data in each column. For numerical data, that might include a distribution of the frequency with which different values occur; the range of values; and the “cardinality” of the values, or the number of different values the column contains. For textual data, a summary would include a list of the most frequently occurring words in the column and the number of different words. Data Civilizer also keeps a master index of every word occurring in every table and the tables that contain it.
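
The kind of per-column profile described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names, the dictionary layout, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarize_numeric(values):
    """Profile a numeric column: value frequencies, range, and cardinality."""
    counts = Counter(values)
    return {
        "min": min(values),
        "max": max(values),
        "cardinality": len(counts),    # number of distinct values
        "frequencies": dict(counts),   # how often each value occurs
    }


def summarize_text(values, top_k=3):
    """Profile a text column: most frequent words and distinct word count."""
    words = Counter(w.lower() for v in values for w in v.split())
    return {
        "top_words": [w for w, _ in words.most_common(top_k)],
        "distinct_words": len(words),
    }


def build_word_index(tables):
    """Map every word to the set of tables in which it occurs.

    `tables` is a hypothetical layout: {table_name: {column_name: [values]}}.
    """
    index = defaultdict(set)
    for table_name, columns in tables.items():
        for values in columns.values():
            for value in values:
                if isinstance(value, str):
                    for word in value.lower().split():
                        index[word].add(table_name)
    return index
```

Calling `build_word_index` on a handful of tables yields a lookup from any word to the tables that mention it, which is the role the article assigns to the master index.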

    Then the system compares all of the column summaries against each other, identifying pairs of columns that appear to have commonalities — similar data ranges, similar sets of words, and the like. It assigns every pair of columns a similarity score and, on that basis, produces a map, rather like a network diagram, that traces out the connections between individual columns and between the tables that contain them.
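
A rough sketch of that pairwise comparison follows. The article does not say which similarity measure Data Civilizer uses, so Jaccard overlap of the distinct values stands in for it here; the `threshold` parameter and the `(table, column)` keys are likewise assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two value collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_column_graph(columns, threshold=0.5):
    """Score every pair of columns and link those that look related.

    `columns` maps a (table, column) key to that column's values.
    Returns an adjacency map: {column: {neighbor: similarity_score}}.
    """
    graph = {name: {} for name in columns}
    for left, right in combinations(columns, 2):
        score = jaccard(columns[left], columns[right])
        if score >= threshold:
            graph[left][right] = score
            graph[right][left] = score
    return graph


columns = {
    ("trials", "compound"): ["acetylsalicylic acid", "ibuprofen"],
    ("labels", "compound"): ["acetylsalicylic acid", "ibuprofen", "naproxen"],
    ("sales", "region"): ["EU", "US"],
}
graph = build_column_graph(columns)
```

In this toy input the two `compound` columns share two of three distinct values and end up linked, while the `region` column stays isolated. Comparing every pair grows quadratically with the number of columns, so a system working at the scale the article describes would presumably compare compact summaries rather than raw values.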

    Tracing a path

    A user can then compose a query and, on the fly, Data Civilizer will traverse the map to find related data. Suppose, for instance, a pharmaceutical company has hundreds of tables that refer to a drug by its brand name, hundreds that refer to its chemical compound, and a handful that use an in-house ID number. Now suppose that the ID number and the brand name never show up in the same table, but there’s at least one table linking the ID number and the chemical compound, and one linking the chemical compound and the brand name. With Data Civilizer, a query on the brand name will also pull up data from tables that use just the ID number.
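
The pharmaceutical example amounts to finding a path through the map. A minimal sketch, assuming the map has already been reduced to a plain adjacency list of attribute names, could use breadth-first search; the `links` data and the `find_path` helper are hypothetical, and the article does not specify which traversal Data Civilizer actually performs.

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search: shortest chain of links from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of linked tables connects the two


# Brand name and in-house ID never share a table, but each is linked
# to the chemical compound by at least one table.
links = {
    "brand_name": ["chemical_compound"],
    "chemical_compound": ["brand_name", "in_house_id"],
    "in_house_id": ["chemical_compound"],
}
path = find_path(links, "brand_name", "in_house_id")
```

Here `path` runs from the brand name through the chemical compound to the in-house ID, which is the chain that lets a brand-name query reach tables keyed only by ID number.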

    Some of the linkages identified by Data Civilizer may turn out to be spurious. But the user can discard data that don’t fit a query while keeping the rest. Once the data have been pruned, the user can save the results as their own data file.

    “Data Civilizer is an interesting technology that potentially will help data scientists address an important problem that arises due to the increasing availability of data — identifying which data sets to include in an analysis,” says Iain Wallace, a senior informatics analyst at the drug company Merck. “The larger an organization, the more acute this problem becomes.”

    “We are currently exploring how to use Civilizer as a harmonization layer on top of a variety of chemical-biology datasets,” Wallace continues. “These datasets typically link compounds, diseases, and targets together. One use case is to identify which table contains information about a specific compound and what additional information is available about that compound in other related datasets. Civilizer helps us by allowing full-text search over all the columns and then identifying related columns automatically. By using Civilizer, we should be easily able to add additional data sources and update our analysis very quickly.”

    Reference: “The Data Civilizer System” by Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani and Nan Tang, 8 January 2017, 8th Biennial Conference on Innovative Data Systems Research.