Wednesday, 8 March 2017

Metadata as Linked Data for Research Data Repositories




“Every man has his own cosmology and who can say that his own is right,” Einstein said (Douglas, 1956). The same holds for data semantics: a single dataset may be interpreted differently by its creators, curators, and re-users. How, then, do we build a better research data repository?

We start from the point made by Willis, Greenberg, and White (2012) that metadata for research data increases access to and reuse of the data. In addition, Stanford, Harvard, and Cornell regard linked data technologies as a promising way to gather contextual information about research resources.

Research repositories urgently need innovative solutions that provide feature-rich services for data publishing, such as visualization, validation, and reuse in different applications (Assante et al., 2016). Looking for tools that can meet these needs, we made CKAN (Comprehensive Knowledge Archive Network) our first choice, as a major solution that makes linked metadata available, citable, and validated.


Our current results from this research, available at data.odw.tw, include:

1. A Use Case for Curation, Publication & Reuse of Metadata as Linked Data.
  • 843,309 CC-licensed metadata records across 14 domains, reused from the Union Catalog of Digital Archives Taiwan.
  • 44,806,400 triples (Linked Data) encoded with the 15 Dublin Core elements and provenance information.
  • 25,913,304 triples from 832,803 records semantically refined through spatial and temporal normalization, mapping, and linking with domain knowledge (external vocabularies, ontologies, and knowledge bases).
  • The 14 domains include Archaeology, Architecture, Archives, Artifacts, Biology, Geology, Manuscript, Multimedia, NewsMedia, PaintCal, RareBook, ResearchReuse, StoneRub.
  • 80 projects and 74 agents associated with the metadata records are curated as linked data with Wikidata IDs. The agents have roles in NGO (2), Museum (5), Library (2), Government (9), Archive (1), and Academia (55).
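As a rough illustration of what a record "encoded with Dublin Core and provenance information" can look like as triples, here is a minimal Python sketch that writes one record as N-Triples. The record URI, the source URI, the field values, and the choice of prov:wasDerivedFrom for the provenance link are all assumptions made for this example; they are not taken from the actual data.odw.tw data model.

```python
# Namespaces for Dublin Core Elements 1.1 and W3C PROV-O.
DC = "http://purl.org/dc/elements/1.1/"
PROV = "http://www.w3.org/ns/prov#"


def literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_ntriples(subject, fields, source):
    """Return one N-Triples line per Dublin Core value, plus a provenance link."""
    lines = [
        f"<{subject}> <{DC}{element}> {literal(value)} ."
        for element, values in fields.items()
        for value in values
    ]
    # Assumption: provenance is expressed as a single prov:wasDerivedFrom link.
    lines.append(f"<{subject}> <{PROV}wasDerivedFrom> <{source}> .")
    return lines


# Hypothetical record: the URIs and values below are invented for illustration.
record = {
    "title": ["Stone rubbing of a stele"],
    "subject": ["StoneRub"],
    "date": ["1920"],
}
triples = record_to_ntriples(
    "http://example.org/record/1",
    record,
    "http://example.org/catalog/record/1",
)
print("\n".join(triples))
```

Each Dublin Core value becomes one triple, so a record with three values plus its provenance link yields four lines.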

2. A New Method to Manage Data for General-use & Discipline-specific Repositories.
  • For Open Science: CKAN (Comprehensive Knowledge Archive Network) is used as a major solution that makes linked metadata available, citable, and validated.
  • Availability: data are shared in multiple formats (CSV, XML, Turtle, RDF/XML, JSON-LD) that can be consumed by both humans and machines.
  • Validation and Reproducibility: each dataset is encoded with detailed provenance, and a complete mechanism for publishing articles, data, and code has been designed and implemented.
  • A flexible and adaptable ontology is provided for describing different data contexts (common knowledge or domain knowledge), event concepts (people, place, time), and objects, grouped meaningfully using different vocabularies.
  • Data visualization is enhanced and integrated through a system design for spatial and temporal mapping, filtering, and linking.
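Because the repository runs on CKAN, a client could in principle discover the available formats of a dataset through the CKAN Action API. The sketch below only builds the request URL for the package_show action and picks a download link from a response by preferred format; it does not contact the server. The dataset id, the sample response, and the assumption that data.odw.tw exposes the standard API path are hypothetical.

```python
from urllib.parse import urlencode


def package_show_url(base_url, dataset_id):
    """Build the CKAN Action API URL that describes one dataset."""
    query = urlencode({"id": dataset_id})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def pick_resource(package, preferred_formats):
    """Return the URL of the first resource matching the preferred formats."""
    by_format = {r["format"].lower(): r["url"] for r in package["resources"]}
    for fmt in preferred_formats:
        if fmt.lower() in by_format:
            return by_format[fmt.lower()]
    return None


# Hypothetical dataset id; real ids must be looked up on the site.
url = package_show_url("https://data.odw.tw/", "example-dataset")

# Hypothetical excerpt of the "result" object such a call might return.
sample_package = {
    "resources": [
        {"format": "CSV", "url": "http://example.org/files/example.csv"},
        {"format": "Turtle", "url": "http://example.org/files/example.ttl"},
    ]
}
# A machine client might prefer JSON-LD and fall back to Turtle.
chosen = pick_resource(sample_package, ["JSON-LD", "Turtle"])
print(url)
print(chosen)
```

Listing the formats in order of preference lets a linked data client take an RDF serialization when one exists, while a spreadsheet user would simply ask for CSV.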

3. Data Semantically Enriched with Vocabularies and Knowledge Bases via Adaptable Mechanisms.
  • 18 international vocabularies are used for modeling common knowledge, and 5 domain-specific vocabularies are applied for place, time, art and humanities, or biology. 3 knowledge bases (GeoNames, Wikidata, and Encyclopedia of Life) are mapped and linked.
  • The SPARQL language and endpoints provide semantic queries for data analysis over both local and external data. In addition, data from the RDF triplestore can easily be used in third-party applications.
  • Multiple DataClean Versions Mechanism: we treat data cleaning as a kind of interpretation. Refined Versions (R Versions, i.e. r1, r2, r3…) provide different contexts for the different needs of users.
  • Multiple LinkedKnowledge Bases Mechanism: more knowledge bases such as DBpedia, WorldCat, or LinkedGeoData can be linked in the future via different R Versions without sacrificing the integrity of the original Version, which is encoded with the 15 Dublin Core elements.
  • Multiple SemanticStructure Versions Mechanism: different interpretations result from the use of different vocabularies. Letting multiple R Versions with different vocabularies coexist, or transforming vocabularies via SPARQL, are the solutions.
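To make "transforming vocabularies via SPARQL" more concrete, here is a minimal Python sketch that builds a CONSTRUCT query restating a Dublin Core element as the DCMI Terms property of the same name, and encodes it as a GET request URL in the style of the SPARQL 1.1 Protocol. The endpoint address and the dc-to-dcterms mapping are assumptions chosen for illustration; the project's actual transformations may differ, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"


def construct_query(element):
    """Build a CONSTRUCT query that restates dc:<element> as dcterms:<element>."""
    return (
        f"PREFIX dc: <{DC}>\n"
        f"PREFIX dcterms: <{DCTERMS}>\n"
        f"CONSTRUCT {{ ?s dcterms:{element} ?o }}\n"
        f"WHERE {{ ?s dc:{element} ?o }}"
    )


def query_url(endpoint, query):
    """Encode the query as a SPARQL 1.1 Protocol GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical endpoint address; substitute the real one before sending a request.
ENDPOINT = "http://example.org/sparql"
query = construct_query("title")
url = query_url(ENDPOINT, query)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(query)
print(decoded == query)
```

Because CONSTRUCT returns new triples instead of modifying stored ones, the original version stays untouched and the result can be kept as a separate R Version.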
References:
  • Douglas, A. V. (1956). Forty minutes with Einstein. Journal of the Royal Astronomical Society of Canada, 50, 99 (quotation on p. 100).
  • Assante, M., Candela, L., Castelli, D., & Tani, A. (2016). Are scientific data repositories coping with research data publishing? Data Science Journal, 15. DOI: http://doi.org/10.5334/dsj-2016-006
  • Willis, C., Greenberg, J., & White, H. (2012). Analysis and synthesis of metadata goals for scientific data. Journal of the American Society for Information Science and Technology, 63(8), 1505-1520.