Improving Dataset descriptions #1083

danbri · 2016-04-06T14:25:49Z

Talking with Natasha Noy about possible improvements around dataset description. Some things to look into:

coverageStart and coverageEnd (currently, datasetTimeInterval has DateTime, not an interval, as its expected type, which I think is not correct, or at least doesn't allow us to specify the coverage interval)
timestep (dct:accrualPeriodicity)
bibliographic reference: many datasets refer to the paper that describes them
Main variables measured -- without necessarily knowing the distinction of which ones are dimensions and which ones are measures (qb:DimensionProperty and qb:MeasureProperty)

Related work

This all starts to get into the business of looking inside the dataset, which was discussed at schema.org previously - e.g. see Looking inside tables thread from Omar. Subsequently in W3C CSVW some of these ideas went standards track, in particular a templating mechanism to map tabular data into RDF.

See also the (SDMX-oriented) W3C Data cube specification.
W3C DCAT final specification (earlier drafts inspired the schema.org Dataset design)
More recently: DCAT Application Profile for data portals in Europe (DCAT-AP) - @danbri discussed briefly w/ DCAT-AP team the possibility of using a JSON-LD-based @context for DCAT-AP as an 'external extension'.
CSVW RDF examples using data cube vocabulary from @6a6d74

darobin · 2016-04-06T15:29:15Z

This is also related to #975 for versioning dependencies, particularly on datasets (discussed in more detail in https://research.science.ai/article/web-first-data-citations).

danbri · 2016-04-06T20:58:51Z

See also #1066 for a quick bugfix (spotted by Natasha too)

natashafn · 2016-04-18T17:19:01Z

A couple of follow up comments:

there is a citation property on CreativeWork that is probably the right property to use for bibliographic reference
Another property that is missing however is any description of how a dataset was created. In some cases, I would imagine this would be just a text field and in some cases, a structured provenance record. Maybe a property that could be either?

danbri · 2016-05-16T11:48:38Z

Notes from a F2F meeting on lifescience datasets

"funder(s)" was suggested; this is tracked as Describe the funding of a person/project/creative work #383
see also CERIF (http://www.eurocris.org/cerif/main-features-cerif) and FundRef (http://www.crossref.org/fundingdata/) /cc @CaroleGoble

danbri · 2016-05-16T11:49:30Z

See also http://scholarly.vernacular.io/ w.r.t. data citation /cc @darobin

trypuz · 2016-05-17T09:51:14Z

Hi!
There is something wrong with:

http://meta.schema.org
http://pending.schema.org
http://health-lifesci.schema.org

I have „The requested URL / was not found on this server”.

Best,
Robert Trypuz

danbri · 2016-05-31T18:48:16Z

Filed #1189 re datasetTimeInterval

danbri · 2016-07-15T19:08:39Z

Most of these suggestions are now implemented/committed and published on our draft webschemas.org site for review: http://webschemas.org/docs/releases.html#g1083

The corresponding pull request was #1247

I copy here some supporting notes. Of all these points, only the overlap with releasedEvent remains unexplored.

CHANGES

1.) for temporal and spatial coverage.

As of v3.0 we have:

Relating to Dataset specifically,

http://schema.org/spatial (Dataset -> Place),
"The range of spatial applicability of a dataset, e.g. for a dataset of New York weather, the state of New York."

http://schema.org/temporal (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

The temporal property is currently superseded by the awkwardly named http://schema.org/datasetTimeInterval -

http://schema.org/datasetTimeInterval (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

Relating to CreativeWork,

http://schema.org/contentLocation (CreativeWork -> Place),
"The location depicted or described in the content. For example, the location in a photograph or painting"

http://schema.org/locationCreated (CreativeWork -> Place),
"The location where the CreativeWork was created, which may not be the same as the location depicted in the CreativeWork."

Note also http://schema.org/releasedEvent which structures things a little differently, grouping place/time within an Event.

PROPOSAL:

1a. a minor detail re releasedEvent, but documenting here:
For works (most typically media broadcasts but potentially e.g. datasets too) whose publication is structured in terms of documented releases, it is reasonable to expect the release information in a http://schema.org/PublicationEvent to match direct contentLocation or spatial[Coverage] properties if the latter are present. A startDate property of the event would match http://schema.org/dateCreated of the published item.

1b.
Create spatialCoverage and temporalCoverage properties as successors to the (vaguely and/or awkwardly named) datasetTimeInterval, spatial and temporal properties.

1c.
Broaden spatialCoverage and temporalCoverage so that they apply to CreativeWork rather than just Dataset.

1d.
Update their textual definitions to accommodate their broader scope, and to address any confusion about related properties.
Proposed text:

spatialCoverage: "The spatialCoverage of a CreativeWork indicates the place(s) which are the focus of some work. It is a subproperty of contentLocation intended for more technical and specific materials. For example with a Dataset, it indicates areas that the dataset describes: a dataset of New York weather would have spatialCoverage which was the place: the state of New York."

temporalCoverage: "The temporalCoverage of a CreativeWork indicates the period that the content applies to, i.e. that it describes. In the case of a Dataset it will typically indicate the relevant time period in a precise notation (e.g. for a 2011 census dataset, the year 2011, in ISO 8601 time interval format). Other forms of content e.g. ScholarlyArticle, Book, TVSeries or TVEpisode may indicate their temporalCoverage in broader terms - textually or via well-known URL."
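To make the "precise notation" concrete, here is a minimal Python sketch of checking a temporalCoverage value written as an ISO 8601 start/end interval. It only handles the date-only `YYYY-MM-DD/YYYY-MM-DD` form, and the function name `parse_date_interval` is hypothetical, not part of schema.org or any library.

```python
from datetime import date
from typing import Tuple


def parse_date_interval(value: str) -> Tuple[date, date]:
    """Parse a date-only ISO 8601 interval such as '2011-01-01/2011-12-31'.

    Assumption: only the start/end form with calendar dates is supported;
    durations, open-ended intervals and bare years are out of scope here.
    """
    start_text, separator, end_text = value.partition("/")
    if not separator:
        raise ValueError("expected 'start/end', got %r" % value)
    start = date.fromisoformat(start_text)  # raises ValueError on bad input
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError("interval ends before it starts: %r" % value)
    return start, end


# The 2011 census example from the proposed definition, as an explicit interval.
coverage_start, coverage_end = parse_date_interval("2011-01-01/2011-12-31")
print(coverage_start, coverage_end)
```

Splitting the interval this way also recovers the coverageStart and coverageEnd values asked for at the top of this issue without needing two separate properties.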

1e.
Update RDFS assertions.

spatialCoverage subPropertyOf contentLocation.
temporal supersededBy temporalCoverage. (rather than by datasetTimeInterval as now)
datasetTimeInterval supersededBy temporalCoverage.
Add mappings,

temporalCoverage equivalentProperty http://purl.org/dc/terms/temporal
spatialCoverage equivalentProperty http://purl.org/dc/terms/spatial
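As a rough illustration of what 1b and 1e mean for publishers, the following Python sketch renames the older coverage properties in a JSON-LD Dataset description to their proposed successors. The helper `migrate_coverage` and the sample dataset are hypothetical; only the property names come from the proposal above.

```python
import json

# Older property -> successor, following the proposal in 1b and 1e.
SUPERSEDED_BY = {
    "datasetTimeInterval": "temporalCoverage",
    "temporal": "temporalCoverage",
    "spatial": "spatialCoverage",
}


def migrate_coverage(dataset: dict) -> dict:
    """Return a copy of a JSON-LD description using the successor property names."""
    migrated = {}
    for key, value in dataset.items():
        new_key = SUPERSEDED_BY.get(key, key)
        # If a successor value is already present, do not let an older
        # property overwrite it; an explicit successor always wins.
        if key != new_key and new_key in migrated:
            continue
        migrated[new_key] = value
    return migrated


# Hypothetical description using the older property names.
legacy = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "New York weather (illustrative example)",
    "spatial": {"@type": "Place", "name": "New York State"},
    "datasetTimeInterval": "2011-01-01/2011-12-31",
}

print(json.dumps(migrate_coverage(legacy), indent=2))
```

Because superseded properties remain valid schema.org terms, a migration like this is optional for existing markup; the sketch only shows how the two naming generations line up.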

akuckartz · 2016-07-15T20:25:46Z

A W3C workshop about DCAT and metadata will take place in Amsterdam 30 Nov - 1 Dec 2016.

joshsh · 2016-07-19T17:19:39Z

So we have arrived at the names spatialCoverage and temporalCoverage, after all. Agreed that they are appropriate for other CreativeWorks, and it's nice to have the explicit mapping into DCMI Terms.

danbri · 2016-07-20T14:21:40Z

Yes, I think this terminology bridges well with usage elsewhere, as well as better connecting schema.org dataset description with the approach for other kinds of CreativeWork. Does this work ok for others following along here?

danbri · 2016-08-10T14:14:02Z

Published via http://schema.org/docs/releases.html#v3.1
http://blog.schema.org/2016/08/schemaorg-update-hotels-datasets-health.html

danbri · 2016-09-14T18:03:43Z

On reflection, and after further feedback, I believe variableMeasured would be a more appropriate name for this property. I'll work on migrating unless anyone objects.

agbeltran · 2016-09-15T09:32:10Z

In addition to the change to singular, it seems that the variableMeasured property is missing PropertyValuePair in the range to comply with the definition.

Aaranged · 2016-09-23T17:26:49Z

In addition to comment from @agbeltran note that Google's use of variableMeasured extends the expected type from text to include URL.

agbeltran · 2016-11-01T13:49:31Z

@danbri should we open a new issue about the two problems with variablesMeasured reported above?

danbri · 2016-11-01T14:22:38Z

@agbeltran I believe they're fixed ok in our next release, previewable at http://webschemas.org/variableMeasured - can you confirm?

agbeltran · 2016-11-01T14:29:32Z

Thanks @danbri - I can see that it now complies with the definition as its range is Text or PropertyValue.

Maybe what remains to be fixed is the documentation at
https://developers.google.com/search/docs/data-types/datasets
which indicates Text, URL rather than Text, PropertyValue?

natashafn · 2016-11-01T17:24:40Z

@danbri: is the description actually correct about PropertyValue as range?

agbeltran · 2016-11-08T06:56:31Z

Checking this again, both properties singular and plural are live in the pending version:

http://pending.schema.org/variablesMeasured
http://pending.webschemas.org/variableMeasured

The documentation (https://developers.google.com/search/docs/data-types/datasets) refers to the singular variableMeasured, which is the one we had agreed was the better option. Right?

What is the conclusion about the range?

danbri · 2016-11-08T12:16:46Z

@agbeltran I'm sorry the site doesn't make this clear enough, but roughly: schema.org is the official site, updated in named releases several times a year; webschemas.org is the editor's working draft of the proposed next release, typically edited several times a week. In the webschemas version if you look up the obsolete plural variablesMeasured, you will find yourself directed to http://pending.webschemas.org/variablesMeasured -> http://attic.webschemas.org/variablesMeasured which is an area we have made for things that are "as good as removed", for complete transparency.

For range, yes PropertyValue should be in the range - looks like it needs adding on the Google side.
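For illustration, a small Python sketch of a Dataset description that uses both forms of the range discussed here: a plain Text value and a structured PropertyValue. The helper `variable`, the dataset name and the variable names are made up for the example; `name`, `description` and `unitText` are existing PropertyValue properties.

```python
import json


def variable(name, description=None, unit_text=None):
    """Build a variableMeasured value.

    Returns plain Text when only a name is known, otherwise a PropertyValue.
    """
    if description is None and unit_text is None:
        return name
    value = {"@type": "PropertyValue", "name": name}
    if description is not None:
        value["description"] = description
    if unit_text is not None:
        value["unitText"] = unit_text
    return value


# Hypothetical dataset mixing both forms of the range.
dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Weather station readings (illustrative example)",
    "variableMeasured": [
        variable("station_id"),
        variable(
            "air_temperature",
            description="Air temperature at two metres above ground",
            unit_text="degree Celsius",
        ),
    ],
}

print(json.dumps(dataset, indent=2))
```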

agbeltran · 2016-11-08T12:19:54Z

Thanks! (I was aware of the releases/working draft distinction but had missed the attic redirection.)

dr-shorthair · 2017-05-09T20:51:53Z

BTW - the use of the word 'Measured' also has this problem - 'Measure' usually applies to data collection activities with quantitative, but not categorical results. So variableMeasured has the risk that it implicitly excludes datasets where the 'values' are categories rather than numbers.

There are precedents from several scientific domains to use the more general term 'Observed' and 'Observation' (rather than Measured and Measurement) to allow for both categories and quantities. SSN [1] & O&M [2] use 'observedProperty' and OBOE [3] has 'ofCharacteristic'.

[1] http://w3c.github.io/sdw/ssn/
[2] https://en.wikipedia.org/wiki/Observations_and_Measurements https://dx.doi.org/10.13140/2.1.1142.3042
[3] https://dx.doi.org/10.5063/F11C1TTM

danbri · 2017-07-18T09:28:26Z

I realize I didn't reply explicitly here @dr-shorthair. I'd like to bring most of SOSA into schema.org (as discussed with SpatialWeb WG) and hope it will address the topic more thoroughly. @agbeltran any thoughts from a bioschemas/lifesci perspective?

dr-shorthair · 2018-04-04T07:32:16Z

See
https://github.com/w3c/sdw/blob/gh-pages/ssn/rdf/sosa-sdo-mapping.ttl
https://github.com/w3c/sdw/blob/gh-pages/ssn/rdf/sdo-sosa-schema.rdfa.html

thadguidry · 2018-04-04T13:39:52Z

@dr-shorthair But Simon, I would prefer we still give publishers the ability to collect both quantitative and categorical results. Doing that makes data flow tooling easier and systems have a bit more information provided to them for proper analysis by machine learning and humans. I think your stance is from a collection effort primarily. However, my stance is we should consider the data after the collection efforts, which is where value in the data is finally extracted for publishers and mankind.

@dr-shorthair Could this be anything, like say "loss of life" as a Result ? http://w3c.github.io/sdw/ssn/#SOSAResult that didn't really specify a "kind" of result and I found the description a bit lacking to determine if there were any limits of its usage.

goofballLogic · 2019-01-26T02:54:11Z

Is this prop likely to make it out of pending any time soon?

dr-shorthair · 2020-01-22T03:44:17Z

I see this was closed a long time ago, but for completeness:
@thadguidry indeed there is no intention to limit the results of observations to only quantities. I'm sorry this was not clear. In fact my comment was initially prompted by the same concern, because the word 'measure' is usually tied to quantities, while 'observe' can also have categorical, qualitative or truth (boolean) results. In SSN/SOSA there is no a priori assumption that observations generate only numbers.

However (RDF noise ahead) - sosa:hasResult is an owl:ObjectProperty with the rdfs:range sosa:Result so in RDF the result value (often a literal) must be wrapped up as a resource. To assist this I created a little RDF vocabulary to package the most common result-types - see http://catalogue.linked.data.gov.au/index.php/resource/116 or https://github.com/AGLDWG/datatype-ont
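The wrapping can be sketched in Python as below. This is only an illustration of the pattern: it uses the generic `rdf:value` property to carry the literal and does not reproduce the result-type vocabulary linked above; the helper names and the example.org property URI are hypothetical.

```python
import json

# Namespace prefixes for the JSON-LD context.
CONTEXT = {
    "sosa": "http://www.w3.org/ns/sosa/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def wrap_result(value):
    """Wrap a literal (number, category label, boolean) as a sosa:Result node.

    Assumption: rdf:value is used as a generic carrier for the literal.
    """
    return {"@type": "sosa:Result", "rdf:value": value}


def observation(observed_property, value):
    """Build a minimal sosa:Observation whose result is a resource, not a literal."""
    return {
        "@context": CONTEXT,
        "@type": "sosa:Observation",
        "sosa:observedProperty": {"@id": observed_property},
        "sosa:hasResult": wrap_result(value),
    }


# A categorical result is wrapped exactly like a quantity would be.
land_cover = observation("http://example.org/property/landCover", "forest")
print(json.dumps(land_cover, indent=2))
```

The same wrapper accepts numbers, category labels and booleans, which mirrors the point that observations are not assumed to produce only quantities.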

dr-shorthair · 2020-01-22T03:52:02Z

A bit more explanation here (if you can get to it - I'm not responsible for the permissions here) - https://bitbucket.org/terndatateam/ternplotdata-ontology/src/ef9d9f05b7a3eba915bf8c47708e40fb55d7e1f6/schema/result-types.md

mitar · 2020-01-23T18:51:52Z

So if I understand correctly, I can use variableMeasured to describe the attributes of an ML dataset in a CSV file? Maybe this is obvious to people here, but after reading the spec and reading online, I do not see exactly how to do it. How can I describe which columns the dataset has, the names of those columns, the type of each column (categorical, numeric, string), etc., in JSON-LD?

Also comment here says it is still a proposal, but this is not true anymore? Now it is accepted, no?


danbri added the labels schema.org vocab, status:needs review and status:work expected on Apr 6, 2016

danbri self-assigned this Apr 6, 2016

This was referenced May 31, 2016

DataDownload is subtype MediaObject - definition of latter needs tweaking #1190

Closed

fileFormat assumes registered mediatypes - need an idiom for obscure formats #1191

Closed

proccaserra mentioned this issue Jun 3, 2016

Review & Mapping by Biocaddie DATS [Metadata Working Group 3] #1196

Open

danbri added a commit that referenced this issue Jul 15, 2016

Added variablesMeasured proposal to pending extension. See #1083

422acd8

danbri added a commit that referenced this issue Jul 15, 2016

Changes towards Dataset improvements (and documenting these in releases page). See #1083 for context.

9a746f3

danbri added a commit that referenced this issue Jul 15, 2016

Noted that we also add a pending property: variablesMeasured.

eac8145

See #1083

danbri mentioned this issue Jul 15, 2016

Sdo datasets2 #1247

Merged

danbri added a commit that referenced this issue Jul 15, 2016

Added Dublin Core mappings.

a11ef62

See #1083. also #84

danbri mentioned this issue Jul 15, 2016

Introducing Bioschemas: promoting schema.org in the life sciences #1028

Open

danbri mentioned this issue Aug 2, 2016

Type for defining a Data Schema? #713

Closed

rajido mentioned this issue Aug 9, 2016

Identifier vs URL #1286

Open

danbri closed this as completed Aug 10, 2016

danbri added a commit that referenced this issue Oct 3, 2016

Renamed variablesMeasured to be variableMeasured, to fit our plurality pattern. See #1083

f62a9ce

ypriverol mentioned this issue Nov 1, 2016

First test of dataset in Schema.org OmicsDI/ddi-web-app#135

Closed

agbeltran mentioned this issue Nov 8, 2016

Describe the funding of a person/project/creative work #383

Open

ldodds mentioned this issue Jan 9, 2017

variablesMeasured #1471

Open

danbri mentioned this issue Apr 9, 2018

Property "variablesMeasured" WITH AN S does not exists in csv file (attached to type "CompleteDataFeed") + similar property "variableMeasured" WITHOUT ANY S attached to the same type #1881

Closed

VladimirAlexiev mentioned this issue Jul 16, 2020

size characteristics of a Dataset schemaorg/suggestions-questions-brainstorming#160

Open

danbri added a commit that referenced this issue May 15, 2023

Updated to support StatisticalVariable values.

d21f00e

/cc #1083 #2564
