Research in every field produces data in many forms and shapes. Advances in computing technology allow these data to be stored in digital format. In addition, more and more historical records that were originally captured on paper are being digitized. The move towards digital data is ubiquitous; it introduces new ways of making data available to the public and of reusing them. Nevertheless, research data are very often not shared at all, or are shared only on a researcher's or university's web page, which makes them less discoverable.
On the one hand, many tools and data repositories (e.g., DataUp, The Dataverse Network, DataDryad, DSpace and many more) have been developed to facilitate data sharing and preservation. The disadvantages of current approaches include repository isolation and dataset isolation within a repository. Most of these repositories are created for specific research areas, and users need to invest considerable effort to make use of the datasets they are interested in.
On the other hand, both industry (Enterprise Information Integration) and academia (mostly the database community) have worked hard on the problem of data integration for the last 30 years. Many tools and systems have been developed, ranging from data warehousing to virtual data integration to peer-to-peer data integration (e.g., Pentaho and many other data warehousing and business intelligence solutions). The shortcomings of existing data integration systems include a long setup time and the additional expertise that users need in order to operate them. Moreover, these systems are designed to be used per "problem": each organization sets up and configures its own data integration tools, thus missing the idea of a global-scale integrated data repository.
Pay-As-You-Go data integration systems were developed to alleviate the long setup problem. Col*Fusion takes the Pay-As-You-Go idea even further. In addition to letting users put in more effort as they need improved services, Col*Fusion performs as much of the data integration job as it can automatically, and it also draws on all other users in the system. The key idea of our approach is therefore to apply crowdsourcing to large-scale data integration at all stages of the data integration process. Col*Fusion utilizes collective intelligence to submit data, provide feedback on generated schema mappings, provide new schema mappings, provide data concordance tables, assess data quality and reliability, and more.
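To make the idea of crowd feedback on generated schema mappings more concrete, here is a minimal Python sketch. It is not Col*Fusion's actual code or algorithm: the `MappingVote` structure, the `score_mappings` function, the column names and the `prior_weight` parameter are all hypothetical, chosen only to illustrate how an automatically generated mapping could start from a matcher's confidence and then be pulled toward the crowd's opinion as votes accumulate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingVote:
    """One user's verdict on a candidate column mapping (hypothetical structure)."""
    source_column: str  # e.g. "dataset_a.country"
    target_column: str  # e.g. "dataset_b.nation"
    user: str
    approve: bool


def score_mappings(votes, auto_confidence=None, prior_weight=2.0):
    """Blend an automatic matcher's confidence with crowd votes.

    Assumption for illustration: the automatic confidence (default 0.5)
    counts as `prior_weight` pseudo-votes, so a mapping with no feedback
    keeps its automatic score, and each real vote shifts the score toward
    the fraction of approving users.
    """
    auto_confidence = auto_confidence or {}
    approvals = defaultdict(int)
    totals = defaultdict(int)
    seen = set()
    for vote in votes:
        key = (vote.source_column, vote.target_column)
        # Count at most one vote per user per mapping.
        if (key, vote.user) in seen:
            continue
        seen.add((key, vote.user))
        totals[key] += 1
        approvals[key] += int(vote.approve)

    scores = {}
    for key in set(totals) | set(auto_confidence):
        prior = auto_confidence.get(key, 0.5)
        scores[key] = (prior * prior_weight + approvals[key]) / (prior_weight + totals[key])
    return scores


# Hypothetical usage: three users approve a mapping, one rejects it.
example_votes = [
    MappingVote("dataset_a.country", "dataset_b.nation", "alice", True),
    MappingVote("dataset_a.country", "dataset_b.nation", "bob", True),
    MappingVote("dataset_a.country", "dataset_b.nation", "carol", True),
    MappingVote("dataset_a.country", "dataset_b.nation", "dave", False),
]
example_scores = score_mappings(
    example_votes,
    auto_confidence={("dataset_a.country", "dataset_b.nation"): 0.6},
)
```

The pseudo-vote prior is one simple design choice among many; a real system might instead weight votes by user reliability, which the text above lists as something the crowd also helps assess.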
The approach to data integration in Col*Fusion is conceptually similar to peer-to-peer architectures that offer fully distributed data sharing. However, Col*Fusion requires neither special software installation nor prior knowledge of a specific data management system. It supports a simple data submission protocol implemented via a lightweight and intuitive web interface. At the same time, some users may choose to be more involved in further data curation and consolidation, similar to pay-as-you-go data integration systems.
Here are two short videos explaining what Col*Fusion is:
The Col*Fusion infrastructure enables transformative, cutting-edge research in global-scale information integration and related disciplines. In particular, we apply Col*Fusion to create a major repository of consolidated global historical data from the past several centuries. This work is conducted in conjunction with the Collaborative for Historical Information and Analysis (CHIA), which currently involves nine different research groups throughout the U.S. and Europe. Col*Fusion has also been described as part of CHIA by the CHIA director, Dr. Patrick Manning, in his book Big Data in History.
To the best of our knowledge, no existing system implements all stages of the data integration process in an advanced infrastructure based on crowdsourcing. Similar work includes Orchestra with the Q system; however, it focuses mostly on data exchange and update reconciliation for a specific domain or group of users, and it requires considerable setup effort. To some degree Google Fusion Tables is similar to Col*Fusion, but Google Fusion Tables does not seem to aim at constructing a global integrated data repository.
Regarding related work, many published papers describe algorithms, tools or systems that solve only one or several steps of the data integration process (e.g., many works focus only on schema mapping or on record linkage). You can see the related work mentioned in the papers below. I also have a three-page document on related work; maybe I will put it here one day too.
Col*Fusion is implemented as a web application and therefore, as mentioned above, requires no installation. The interface is intuitive and easy to use. In addition, the web site provides links to the Col*Fusion wiki for comprehensive help (still in progress), which can be expanded by any Col*Fusion user. For communication, Col*Fusion provides a forum.
At the time of writing this paragraph, Col*Fusion was not yet open to the general public. The release is scheduled for April 2014. However, you can still try to access it here.
Please see some screenshots below, as well as the tools, systems, libraries and technologies we use.
In this poster we introduce Col*Fusion, a novel architecture for large-scale data integration, fusion and preservation based on crowdsourcing. Col*Fusion is implemented as an easy-to-use web application and provides a uniform interface for data submission and integration. It provides all the functionality expected from a professional data archival repository, and it also addresses two main problems of current approaches, repository isolation and dataset isolation, by involving users as active participants in both the data submission and the data integration processes.
Project page last updated on Feb. 9, 2014