Improving Dataset descriptions #1083

Closed
danbri opened this issue Apr 6, 2016 · 30 comments

@danbri
Contributor

danbri commented Apr 6, 2016

Talking with Natasha Noy about possible improvements around dataset description. Some things to look into:

  • coverageStart and coverageEnd (currently, datasetTimeInterval has DateTime, not an interval, as its expected type, which I think is not correct, or at least doesn't allow us to specify the coverage interval)
    timestep (dct:accrualPeriodicity)
  • bibliographic reference: many datasets refer to the paper that describes them
  • Main variables measured -- without necessarily knowing which ones are dimensions and which ones are measures (qb:DimensionProperty and qb:MeasureProperty)
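
To make the coverage-interval point concrete: a single ISO 8601 interval string can carry both a start and an end. The sketch below is only an illustration in Python; the Coverage type and the parse_interval helper are hypothetical names, not part of schema.org, and only the plain YYYY-MM-DD/YYYY-MM-DD form is handled.

```python
from datetime import date
from typing import NamedTuple


class Coverage(NamedTuple):
    """Hypothetical holder for the two ends of a coverage interval."""
    start: date
    end: date


def parse_interval(value: str) -> Coverage:
    """Split an ISO 8601 'start/end' interval of calendar dates.

    Durations and open-ended intervals are out of scope for this sketch.
    """
    start_text, sep, end_text = value.partition("/")
    if not sep:
        raise ValueError(f"not a start/end interval: {value!r}")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError("interval ends before it starts")
    return Coverage(start, end)


# Example value for a 2011 census dataset (illustrative only).
coverage = parse_interval("2011-01-01/2011-12-31")
print(coverage.start.isoformat(), coverage.end.isoformat())
```

A bare DateTime value has no "/" separator, so it is rejected here, which is the limitation described in the first bullet.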

Related work

This all starts to get into the business of looking inside the dataset, which was discussed at schema.org previously - e.g. see the "Looking inside tables" thread from Omar. Subsequently, in W3C CSVW, some of these ideas went standards track, in particular a templating mechanism to map tabular data into RDF.

danbri added the schema.org vocab, status:needs review and status:work expected labels Apr 6, 2016
danbri self-assigned this Apr 6, 2016
@darobin
Contributor

darobin commented Apr 6, 2016

This is also related to #975 for versioning dependencies, particularly on datasets (discussed in more detail in https://research.science.ai/article/web-first-data-citations).

@danbri
Contributor Author

danbri commented Apr 6, 2016

See also #1066 for a quick bugfix (spotted by Natasha too)

@natashafn

A couple of follow up comments:

  • there is a citation property on CreativeWork that is probably the right property to use for bibliographic reference
  • Another property that is missing however is any description of how a dataset was created. In some cases, I would imagine this would be just a text field and in some cases, a structured provenance record. Maybe a property that could be either?

@danbri
Contributor Author

danbri commented May 16, 2016

Notes from a F2F meeting on lifescience datasets

@danbri
Contributor Author

danbri commented May 16, 2016

See also http://scholarly.vernacular.io/ w.r.t. data citation /cc @darobin

@trypuz
Contributor

trypuz commented May 17, 2016

Hi!
There is something wrong with:

http://meta.schema.org
http://pending.schema.org
http://health-lifesci.schema.org

I get "The requested URL / was not found on this server".

Best,
Robert Trypuz

@danbri
Contributor Author

danbri commented May 31, 2016

Filed #1189 re datasetTimeInterval

@danbri
Contributor Author

danbri commented Jul 15, 2016

Most of these suggestions are now implemented/committed and published on our draft webschemas.org site for review: http://webschemas.org/docs/releases.html#g1083

The corresponding pull request was #1247

I copy here some supporting notes. Of all these points, only the overlap with releasedEvent remains unexplored.

CHANGES

1.) for temporal and spatial coverage.

As of v3.0 we have:

Relating to Dataset specifically,

http://schema.org/spatial (Dataset -> Place),
"The range of spatial applicability of a dataset, e.g. for a dataset of New York weather, the state of New York."

http://schema.org/temporal (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

The temporal property is superseded by the awkwardly named http://schema.org/datasetTimeInterval -

http://schema.org/datasetTimeInterval (Dataset -> DateTime),
"The range of temporal applicability of a dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)."

Relating to CreativeWork,

http://schema.org/contentLocation (CreativeWork -> Place),
"The location depicted or described in the content. For example, the location in a photograph or painting"

http://schema.org/locationCreated (CreativeWork -> Place),
"The location where the CreativeWork was created, which may not be the same as the location depicted in the CreativeWork."

Note also http://schema.org/releasedEvent which structures things a little differently, grouping place/time within an Event.

PROPOSAL:

1a. a minor detail re releasedEvent, but documenting here:
For works (most typically media broadcasts but potentially e.g. datasets too) whose publication is structured in terms of documented releases, it is reasonable to expect the release information in a http://schema.org/PublicationEvent to match direct contentLocation or spatial[Coverage] properties if the latter are present. A startDate property of the event would match http://schema.org/dateCreated of the published item.

1b.
Create spatialCoverage and temporalCoverage properties as successors to the (vaguely and/or awkwardly named) datasetTimeInterval, spatial and temporal properties.

1c.
Broaden spatialCoverage and temporalCoverage so that they apply to CreativeWork rather than just Dataset.

1d.
Update their textual definitions to accommodate their broader scope, and to address any confusion about related properties.
Proposed text:

spatialCoverage: "The spatialCoverage of a CreativeWork indicates the place(s) which are the focus of some work. It is a subproperty of contentLocation intended for more technical and specific materials. For example with a Dataset, it indicates areas that the dataset describes: a dataset of New York weather would have spatialCoverage which was the place: the state of New York."

temporalCoverage: "The temporalCoverage of a CreativeWork indicates the period that the content applies to, i.e. that it describes. In the case of a Dataset it will typically indicate the relevant time period in a precise notation (e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)). Other forms of content e.g. ScholarlyArticle, Book, TVSeries or TVEpisode may indicate their temporalCoverage in broader terms - textually or via well-known URL."
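
As a rough illustration of how the two proposed properties might look in markup, here is a minimal Python sketch that builds a JSON-LD description. The dataset name and the exact interval value are made up for the example; only the property and type names come from the proposal text.

```python
import json

# Illustrative description only: the name and interval value are assumptions.
dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "New York weather observations, 2011",
    # The place(s) that the dataset describes.
    "spatialCoverage": {"@type": "Place", "name": "New York"},
    # The period the content applies to, as an ISO 8601 time interval.
    "temporalCoverage": "2011-01-01/2011-12-31",
}

markup = json.dumps(dataset, indent=2)
print(markup)
```

Other kinds of CreativeWork could carry a looser, textual temporalCoverage value in the same position.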

1e.
Update RDFS assertions.

spatialCoverage subPropertyOf contentLocation.
temporal supersededBy temporalCoverage. (rather than by datasetTimeInterval as now)
datasetTimeInterval supersededBy temporalCoverage.
Add mappings,

temporalCoverage equivalentProperty http://purl.org/dc/terms/temporal
spatialCoverage equivalentProperty http://purl.org/dc/terms/spatial
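
One way to read the assertions above is as a small table of triples that tooling can follow when it meets an older property name. The Python sketch below is an assumption-laden illustration: current_name and migrate are hypothetical helpers, and the table lists only the assertions written out in 1e (so spatial is left untouched here).

```python
# (subject, predicate, object) triples, using schema.org local names.
ASSERTIONS = [
    ("spatialCoverage", "subPropertyOf", "contentLocation"),
    ("temporal", "supersededBy", "temporalCoverage"),
    ("datasetTimeInterval", "supersededBy", "temporalCoverage"),
    ("temporalCoverage", "equivalentProperty", "http://purl.org/dc/terms/temporal"),
    ("spatialCoverage", "equivalentProperty", "http://purl.org/dc/terms/spatial"),
]

# Lookup from a superseded property to its successor.
SUPERSEDED_BY = {s: o for s, p, o in ASSERTIONS if p == "supersededBy"}


def current_name(prop: str) -> str:
    """Follow supersededBy links until reaching a property that is not superseded."""
    seen = set()
    while prop in SUPERSEDED_BY:
        if prop in seen:
            raise ValueError(f"supersededBy cycle at {prop!r}")
        seen.add(prop)
        prop = SUPERSEDED_BY[prop]
    return prop


def migrate(description: dict) -> dict:
    """Return a copy of a flat description with superseded keys renamed.

    If an old and a new key are both present, the later one wins.
    """
    return {current_name(key): value for key, value in description.items()}


old = {"@type": "Dataset", "datasetTimeInterval": "2011", "spatial": "New York"}
print(migrate(old))
```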

@joshsh
Contributor

joshsh commented Jul 19, 2016

So we have arrived at the names spatialCoverage and temporalCoverage, after all. Agreed that they are appropriate for other CreativeWorks, and it's nice to have the explicit mapping into DCMI Terms.

@danbri
Contributor Author

danbri commented Jul 20, 2016

Yes, I think this terminology bridges well with usage elsewhere, as well as better connecting schema.org dataset description with the approach for other kinds of CreativeWork. Does this work ok for others following along here?

@danbri
Contributor Author

danbri commented Aug 10, 2016

Published via http://schema.org/docs/releases.html#v3.1
http://blog.schema.org/2016/08/schemaorg-update-hotels-datasets-health.html

@danbri danbri closed this as completed Aug 10, 2016
@danbri
Contributor Author

danbri commented Sep 14, 2016

On reflection, and after further feedback, I believe variableMeasured would be a more appropriate name than variablesMeasured for this property. I'll work on migrating unless anyone objects.

@agbeltran

agbeltran commented Sep 15, 2016

In addition to the change to singular, it seems that the variableMeasured property is missing PropertyValue in its range, which it needs in order to comply with the definition.

@Aaranged

Aaranged commented Sep 23, 2016

In addition to the comment from @agbeltran, note that Google's use of variableMeasured extends the expected type from Text to include URL.

@agbeltran

@danbri should we open a new issue about the two problems with variablesMeasured reported above?

@danbri
Contributor Author

danbri commented Nov 1, 2016

@agbeltran I believe they're fixed ok in our next release, previewable at http://webschemas.org/variableMeasured - can you confirm?

@agbeltran

Thanks @danbri - I can see that it now complies with the definition as its range is Text or PropertyValue.

Maybe what remains to be fixed is the documentation at
https://developers.google.com/search/docs/data-types/datasets
which indicates Text, URL rather than Text, PropertyValue?

@natashafn

@danbri: is the description actually correct about PropertyValue as range?

@agbeltran

Checking this again, both properties singular and plural are live in the pending version:

http://pending.schema.org/variablesMeasured
http://pending.webschemas.org/variableMeasured

The documentation (https://developers.google.com/search/docs/data-types/datasets) refers to the singular variableMeasured, which is the one we had agreed was the better option, right?

What is the conclusion about the range?

@danbri
Contributor Author

danbri commented Nov 8, 2016

@agbeltran I'm sorry the site doesn't make this clear enough, but roughly: schema.org is the official site, updated in named releases several times a year; webschemas.org is the editor's working draft of the proposed next release, typically edited several times a week. In the webschemas version, if you look up the obsolete plural variablesMeasured, you will find yourself directed to http://pending.webschemas.org/variablesMeasured -> http://attic.webschemas.org/variablesMeasured which is an area we have made for things that are "as good as removed", for complete transparency.

For range, yes PropertyValue should be in the range - looks like it needs adding on the Google side.
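
To illustrate the two forms the range allows, here is a hedged Python sketch of a Dataset whose variableMeasured mixes a plain Text value with a PropertyValue. The dataset name, the variable names and the variable_names helper are made up for the example.

```python
import json

# Illustrative description only: names and units are assumptions.
dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example station readings",
    "variableMeasured": [
        "wind speed",  # plain Text form
        {  # structured PropertyValue form
            "@type": "PropertyValue",
            "name": "air temperature",
            "unitText": "degree Celsius",
        },
    ],
}


def variable_names(description: dict) -> list:
    """Collect variable names from both the Text and PropertyValue forms."""
    values = description.get("variableMeasured", [])
    if not isinstance(values, list):
        values = [values]  # a single value is allowed as well as a list
    names = []
    for value in values:
        if isinstance(value, str):
            names.append(value)
        elif isinstance(value, dict) and "name" in value:
            names.append(value["name"])
    return names


print(json.dumps(dataset, indent=2))
print(variable_names(dataset))
```

A consumer that only expects Text, URL would skip the second entry, which is why the range matters on the Google side too.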

@agbeltran

Thanks! (I was aware of the releases/working draft but had missed the attic redirection.)

@dr-shorthair

dr-shorthair commented May 9, 2017

BTW - the use of the word 'Measured' also has this problem - 'Measure' usually applies to data collection activities with quantitative, but not categorical results. So variableMeasured has the risk that it implicitly excludes datasets where the 'values' are categories rather than numbers.

There are precedents from several scientific domains to use the more general term 'Observed' and 'Observation' (rather than Measured and Measurement) to allow for both categories and quantities. SSN [1] & O&M [2] use 'observedProperty' and OBOE [3] has 'ofCharacteristic'.

[1] http://w3c.github.io/sdw/ssn/
[2] https://en.wikipedia.org/wiki/Observations_and_Measurements https://dx.doi.org/10.13140/2.1.1142.3042
[3] https://dx.doi.org/10.5063/F11C1TTM

@danbri
Contributor Author

danbri commented Jul 18, 2017

I realize I didn't reply explicitly here @dr-shorthair. I'd like to bring most of SOSA into schema.org (as discussed with SpatialWeb WG) and hope it will address the topic more thoroughly. @agbeltran any thoughts from a bioschemas/lifesci perspective?

@thadguidry
Contributor

thadguidry commented Apr 4, 2018

@dr-shorthair But Simon, I would prefer we still give publishers the ability to collect both quantitative and categorical results. Doing that makes data flow tooling easier, and systems have a bit more information provided to them for proper analysis by machine learning and humans. I think your stance is primarily from a collection-effort perspective. However, my stance is that we should consider the data after the collection efforts, which is where the value in the data is finally extracted for publishers and mankind.

@dr-shorthair Could this be anything, like, say, "loss of life" as a Result? http://w3c.github.io/sdw/ssn/#SOSAResult didn't really specify a "kind" of result, and I found the description too thin to determine whether there are any limits on its usage.

@goofballLogic

Is this prop likely to make it out of pending any time soon?

@dr-shorthair

I see this was closed a long time ago, but for completeness:
@thadguidry indeed there is no intention to limit the results of observations to only quantities. I'm sorry this was not clear. In fact my comment was initially prompted by the same concern, because the word 'measure' is usually tied to quantities, while 'observe' can also have categorical, qualitative or truth (boolean) results. In SSN/SOSA there is no a priori assumption that observations generate only numbers.

However (RDF noise ahead) - sosa:hasResult is an owl:ObjectProperty with the rdfs:range sosa:Result so in RDF the result value (often a literal) must be wrapped up as a resource. To assist this I created a little RDF vocabulary to package the most common result-types - see http://catalogue.linked.data.gov.au/index.php/resource/116 or https://github.com/AGLDWG/datatype-ont
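
A rough sketch of what "wrapped up as a resource" means, written as plain triples in Python. The example.org IRIs and the use of rdf:value to hold the literal are assumptions made for illustration; they are not taken from the datatype-ont vocabulary linked above. SOSA also defines sosa:hasSimpleResult, which takes a literal directly, shown for contrast.

```python
SOSA = "http://www.w3.org/ns/sosa/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for the example


def wrapped_result(observation: str, result_node: str, literal: str) -> list:
    """Triples linking an observation to a resource that carries the literal."""
    return [
        (observation, SOSA + "hasResult", result_node),
        (result_node, RDF + "type", SOSA + "Result"),
        # rdf:value as the carrier of the literal is an assumption here.
        (result_node, RDF + "value", literal),
    ]


def simple_result(observation: str, literal: str) -> list:
    """Single triple using sosa:hasSimpleResult with the literal as object."""
    return [(observation, SOSA + "hasSimpleResult", literal)]


# A categorical (non-numeric) result, to show that nothing requires a number.
triples = wrapped_result(EX + "obs1", EX + "obs1-result", "woodland")
for triple in triples:
    print(triple)
print(simple_result(EX + "obs1", "woodland"))
```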

@dr-shorthair

A bit more explanation here (if you can get to it - I'm not responsible for the permissions here) - https://bitbucket.org/terndatateam/ternplotdata-ontology/src/ef9d9f05b7a3eba915bf8c47708e40fb55d7e1f6/schema/result-types.md

@mitar

mitar commented Jan 23, 2020

So if I understand correctly, I can use variableMeasured to describe the attributes of an ML dataset stored as a CSV file? Maybe this is obvious to people here, but after reading the spec and reading online, I do not see exactly how to do it. How can I describe, in JSON-LD, which columns the dataset has, what the names of those columns are, the type of each column (categorical, numeric, string), etc.?

Also, the comment here says it is still a proposal, but this is not true anymore? Now it is accepted, no?
