
Add vocabulary to indicate which sections of a document are particularly 'speakable' #1389

Closed
danbri opened this issue Oct 5, 2016 · 22 comments
Labels: guidelines docs examples (Work on our supporting materials rather than on schema definitions)

Comments

@danbri (Contributor) commented Oct 5, 2016

Use case:

"With use of text-to-speech on the rise in mainstream use-case scenarios such as smart
speakers (Amazon Echo, Google Home), multimodal interaction on smart phones and in-car systems, there is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may
vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page. "

A vocab draft:
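(The draft itself is not quoted here; for orientation, a minimal sketch of the shape the speakable markup eventually took on schema.org - the page name, selectors, and URL below are made-up examples:)

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Example page",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"]
  },
  "url": "http://example.com/"
}
```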

@danbri danbri self-assigned this Oct 5, 2016
@danbri danbri added the guidelines docs examples Work on our supporting materials rather than on schema definitions label Oct 5, 2016
@chaals (Contributor) commented Mar 3, 2017

It seems like you're identifying the "key bits of the page", presumably as an initial view of it, a bit like
<meta name="description" content="This is the most important page about speaking things"> but more directly oriented to consuming the content or interacting with it than to choosing between two or more pages.

I think that kind of summary has a fair bit of application beyond reading it out on a speech system. I like the model of being able to gather a few different pieces of the content together, but I'm wary of trying to tie it tightly to text-to-speech usage.

On the other hand, I am still thinking about this.

(The examples also seem to be a bit broken)

@LJWatson ping?

@LJWatson commented Mar 3, 2017

This seems like a useful property. When using a voice UI the interaction needs to be clutter free, or it becomes fairly horrible.

The only other use case for something like it is those tools that strip out the visual clutter of pages for better readability. I don't know whether the desirable content would be the same for both use cases though...

@danbri (Contributor, Author) commented May 23, 2017

@chaals @LJWatson - I've just posted a brief proposal to the JSON-LD group, who are working on improvements to JSON-LD. The idea would be for the cross-domain parts of this to be specified as something a JSON-LD parser might do, i.e., as @chaals says, not "tie it tightly to text-to-speech usage". Within the purely schema.org world, at least the 'xpath' and 'cssSelector' properties have nothing binding them to text-to-speech; other definitions and use cases could easily reuse them.

(edit - here's the issue I mentioned) - json-ld/json-ld.org#498

@nicolastorzec (Contributor)

I'm with Chaals regarding clarifying the goal. Is it about:

  1. Annotating the portions of a page that would be particularly appropriate for reading out loud because the publisher thinks they could be accurately rendered via TTS? This option mostly makes sense when publishers are TTS experts...
  2. Annotating which portions of a page would be worth reading out loud because the publisher thinks they are the most important information on the page? This option is more about marking up prominent information than speakable information...
  3. Providing an alternate, speakable, version of the most prominent information on the page?

Also, we should look into SSML if we want to go beyond annotating the speakable portions of a page:

  • The Speech Synthesis Markup Language is "designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications."
  • SSML is supported by Amazon Alexa, Microsoft Cortana, and the Google Assistant.
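(Not from the comment above, but for orientation: SSML controls how text is spoken rather than which text to speak. A minimal hand-written fragment:)

```xml
<speak>
  Today's headline.
  <break time="500ms"/>
  A <emphasis level="moderate">short</emphasis> spoken summary follows.
</speak>
```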

@gkellogg (Contributor) commented Jul 1, 2017

In json-ld/json-ld.org#498 (comment) I suggested that it may simply be better to combine RDFa and JSON-LD on page to address this, as RDFa allows HTML content to be referenced/extracted from the page using rdf:XMLLiteral or rdf:HTML. Adding some new kind of HTML selector as a value in JSON-LD seems like mixing domain metaphors.

I don't think any existing examples contain both JSON-LD and RDFa, but this is feasible and well-supported by existing processors.

danbri added a commit that referenced this issue Aug 1, 2017
For #1389 #1672

Intent is that these types be applicable to use cases beyond SpeakableSpecification.

They are named with "*Type" to avoid the types having the same spelling as the property.
@danbri (Contributor, Author) commented Sep 11, 2017

Some implementor feedback from Google: the "cssSelector" (and "xpath") property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath.

Note that this isn't "element" in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.

I suggest adding WebPageElement as a type that these two properties are expected on.
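(To illustrate the suggestion - a sketch; the selector value below is hypothetical:)

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPageElement",
  "cssSelector": ".article-body"
}
```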

@danbri (Contributor, Author) commented Sep 12, 2017

Ping @tmarshbing @scor @rvguha @vholland @tilid @nicolastorzec - any views?

@vholland (Contributor)

+1

@danbri (Contributor, Author) commented Sep 14, 2017

Proceeding on the basis that this is a commonsense combination of two terms with related semantics, I'm making an edit now to cssSelector, xpath, and the expected type associations of both. There might be some nuance in the details, but it doesn't make sense to have a type for parts of a page and a property for pointing into parts of a page, and fail to say how they relate!

danbri added a commit that referenced this issue Sep 14, 2017
…nt' /cc #1389

Allowed both properties to be expected on that type.
@jvandriel commented Nov 27, 2017

Finally got a moment to respond to this...

Having read the discussion, I'm still wondering what exactly this proposal is trying to resolve.

In general the part of a web page that should be 'speakable/pronounceable' is the main content of a page, which most of the time is something like Article, BlogPosting, Product, Service, Recipe, etc., all of which (even WebPage itself) IMHO have plenty of properties for devices to be able to 'speak/pronounce' the textual content that matters.

At the same time I can't help feeling that this proposal tries to bypass the WCAG accessibility guidelines, which IMO should suffice for devices (I can't imagine things like speakers needing more specific types and attributes than screen readers (and visually impaired people) do).

Am I overlooking reasons why the WCAG guidelines don't suffice here?

@anschluss80 commented Jan 20, 2018

Summary: I guess the WCAG guidelines and this 'speakable' proposal have different use cases and target audiences.

WCAG is about making the whole (main) content of a webpage accessible. The use case here is to serve the whole content to anyone, who deliberately visits a specific website.

Voice assistants, on the other hand, should keep their answer to a specific question brief - a short summary of the page topic could fit quite well. A typical use case is a search-engine query, where users won't visit the website but instead get an excerpt of the topic. Even more, users often do not know where the excerpt originates from (see example below).

An advisory from Amazon for Alexa responses: "Be brief"
https://developer.amazon.com/designing-for-voice/what-alexa-says/

And for Google: "Recommended: Less than 300 characters for each dialog turn."
https://developers.google.com/actions/assistant/responses

If you, for example, ask Amazon Alexa "Alexa, who is Chuck Norris", it will read the first sentence of the Wikipedia article on Chuck Norris, without mentioning the origin. At the time of writing, the answer is "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist, actor, film producer and screenwriter." (English Wikipedia). It's not the whole article, which is what you would get when using a screen reader.

Just my two cents ;-)
Alex.

@jvandriel commented Jan 21, 2018

OK, I see sense in the point that WCAG's use case is different from that of this proposal. But for use cases like a 'title', there is the name property which could be used by speaking devices (or headline for creative works). And as for summaries, can't description be used for that? (which in most cases is less than 300 characters)

What about 'speakable' parts of an Article - do we actually expect publishers to mark up individual <section>, <p>, <div> and <span> elements? (when there are also properties like articleBody and articleSection) It will be real fun if the rest of the markup gets published in JSON-LD; can't wait to see how WYSIWYG text editors will cope with this (educated guess: they won't).

What about something like a short Answer (or Question, Review or any other form of user-generated content for that matter) - is it expected that both the text property and speakable > SpeakableSpecification be provided? (which will more than likely contain exactly the same content)

And lastly, what about something like a Product? A <meta name="description"> can now be ±300 chars in Google's organic search, which coincidentally is more or less the same number of chars a marketer would provide Amazon for a product's description. I therefore expect that in a product's case the <meta name="description">, description and speakable > SpeakableSpecification will all contain exactly the same content (as well as the descriptions provided (in other formats) to marketplaces and price-comparison sites).

Really, I get the intention of the proposal but I don't expect much more to come of it than publishers duplicating the same content multiple times to be able to populate multiple properties (in different formats for different parties).

Now I've worked with some very large publishers in the past and can tell you that all hell breaks loose when authors have to start providing new/multiple titles and descriptions for the same article (or product) because different media require different character counts - simply because this costs time (= money) they don't have, so this isn't a trivial matter for them!

Meaning, I'm pretty sure authors (as well as business owners/stakeholders) won't be happy at all if they have to start providing 'speakable' descriptions (of a certain length) as well - especially if this also involves doing this for multiple sections of an article or web page.

And from a CMS perspective I don't expect many positives either, as this will probably lead to authors having to fill out (many) more input fields in the CMS form of an article or a product's PIM system (or, even worse, being forced to start adding and managing CSS classes of elements for the cssSelectors - a fun job for sites with (hundreds of) thousands of articles or products).

Apologies if I sound negative (especially because I do like the idea of being able to easily serve speaking devices) but I just don't see publishers handling this proposal very well mostly due to technical/resource constraints (which will lead to the duplication of content), as well as time constraints for (professional) authors (as they already have so many things to fill out).

Try looking at it from a business perspective: what's there to be gained by website owners after they've spent a ton of time and resources to make this happen? I understand the ROI for companies that produce speaking devices, but what's the ROI for those implementing this proposal on their sites? Will this lead to users reading more articles or buying more products? If not, why would publishers bother to accommodate speaking devices? These are typical questions businesses need answers to, and guess what happens if the answers aren't in their favor? Absolutely nothing, as they'll see it as a waste of precious resources.

Can't this instead be resolved by having speaking devices simply use properties that already exist (and are being used by publishers)?

@akuckartz

Can HTML annotations be used?
https://www.w3.org/TR/annotation-html/

@postphotos commented May 4, 2018

> Meaning, I'm pretty sure authors (as well as business owners/stakeholders) won't be happy at all if they have to start providing 'speakable' descriptions (of a certain length) as well - especially if this also involves doing this for multiple sections of an article or web page.

Thanks @jvandriel for describing these larger concerns - I hear you. A few thoughts I'd like to offer:

  • Many publishers already leverage excerpt for social-media tailoring and RSS feeds. I could see that being pretty close to a speakable selection from a CMS side, allowing a X,XXX-long article to make sense in <300 words... I'm thinking of Google's Featured Snippets or the Wolfram Alpha-powered knowledge answers in iOS's Siri.

  • I would be happy if I could better define what answer these platforms see, especially if I'm using this spec on a site with high domain authority and lots of relevant content for a structured-data consumer (like the two already mentioned).

  • I could see the speakable / SpeakableSpecification spec here allowing a user to better define, in limited terms, what the whole thing is about, with fewer details and more summary, and it could provide better context (just as meta descriptions or RSS excerpts already do) with a better focus on speakable-friendly vocabulary.

Thoughts?

@BigBlueHat (Contributor)

FWIW, the Web Annotation selectors encoding might provide more future-proofing, flexibility, and potential re-use throughout Schema.org (i.e. new selection systems can be added without new properties), though at the cost of being a bit more verbose.

So example 1 for SpeakableSpecification might support CSS selectors, XPaths, and fragment identifiers together, becoming:

{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Jane Doe's homepage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "selector": [
      {"@type": "CssSelector", "@value": ".headline"},
      {"@type": "XPathSelector", "@value": "//summary"},
      {"@type": "FragmentSelector", "@value": "#speakable"}
    ]
  },
  "url": "http://www.janedoe.com"
}

Essentially, the WebPage is the targeted resource and the speakable property would create the ResourceSelection as a SpeakableSpecification (in this usage).

The Selectors and States note focuses on this part of the Web Annotation Data Model.

I'd be happy to help with some mappings between the two, if there's interest.

@danbri (Contributor, Author) commented Jul 25, 2018

https://webmasters.googleblog.com/2018/07/hey-google-whats-latest-news.html explains what we've been using this for at Google. I'll find some example URLs to share here too.

@ghost commented Jul 25, 2018

I've got a question: is it possible to create two or more speakable sections from one webpage?

I only see code examples showing a single markup using the following combinations:

  • headline/name
  • summary/description

Would it be possible to create a list of speakable markups per webpage?

@ghost commented Aug 9, 2018

@danbri

  1. Does the markup need to be visible on the web page?

According to Google's Guidelines found here: https://developers.google.com/search/docs/guides/sd-policies

Don't mark up content that is not visible to readers of the page.

However, I have seen people mark up meta descriptions via an XPath where the meta-description value is not present on the webpage, yet the Google Home smart speaker still finds and reads the markup. Does this not conflict with Google's spam guidelines?

  2. Does the markup need to be the exact same XPath/CSS Selector on both desktop and AMP-HTML webpages?

According to Google's Guidelines found here: https://support.google.com/webmasters/answer/7478053?hl=en

Content mismatch "There is a difference in content between the AMP version and its canonical web page."

It suggests that Google's view is that both the AMP-HTML/mobile and desktop versions of the website should contain the exact same thing.

So would this same policy apply under Google's structured-data policy, meaning that we have to use the exact same XPath/CSS selector value when marking up both versions of the webpage?

For example:

Mobile Version:

 "speakable":
 {
  "@type": "SpeakableSpecification",
  "xpath": [
    "/html/head/title",
    "/html/body/details/summary"
    ]
  }

Desktop Version:

 "speakable":
 {
  "@type": "SpeakableSpecification",
  "xpath": [
    "/html/head/title",
    "/html/body/somethingelse/details/summary"
    ]
  }

Note the different XPaths in the two example codes above.

p.s. I could not find the answers in Google's docs or on schema.org. Thanks.
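(As a side note on how such xpath values get consumed: a reader resolves each path against the page and speaks the matched text. A rough sketch in Python - the page and paths are illustrative only, and xml.etree supports just a subset of XPath, so absolute /html/... paths like those above would need a full XPath engine such as lxml:)

```python
import json
import xml.etree.ElementTree as ET

# A tiny stand-in page (illustrative only).
PAGE = """<html>
  <head><title>Example headline</title></head>
  <body><details><summary>A short speakable summary.</summary></details></body>
</html>"""

# A SpeakableSpecification with ElementTree-compatible relative paths.
SPEC = json.loads("""
{
  "@type": "SpeakableSpecification",
  "xpath": ["./head/title", "./body/details/summary"]
}
""")

root = ET.fromstring(PAGE)
# Resolve each xpath against the page and collect the matched text.
speakable_text = [el.text for path in SPEC["xpath"] for el in root.findall(path)]
print(speakable_text)
```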

@beltofte commented Aug 13, 2018

@michalise Did you find a way to add multiple speakable sections on a single page?

@danbri (Contributor, Author) commented Aug 13, 2018

In general, search engines and other products/services/features can express more detailed restrictions than are required by Schema.org itself. I think that's what is happening here. Schema.org provides the underlying dictionary of terms, and Google says "here are some deployment patterns that we can work with". Everyone's policies and information needs are evolving, and it isn't feasible to attempt to track such things within Schema.org's definitions.

My understanding is that there is no reason to consider multiple speakable sections as intrinsically inappropriate. Whether it works in Google right now is a separate matter.
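(To illustrate: nothing in the vocabulary prevents listing several specifications on one page - a sketch, with made-up selectors:)

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "speakable": [
    {"@type": "SpeakableSpecification", "cssSelector": ".headline"},
    {"@type": "SpeakableSpecification", "xpath": "/html/body/details/summary"}
  ]
}
```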

@DamonHD commented Dec 16, 2018

@beltofte: FWIW, Google's Structured Data checker tool seems happy with the following (microdata) markup with 3 separate segments.

<span itemprop=speakable itemscope itemtype=http://schema.org/SpeakableSpecification>
  <meta itemprop=xpath content="/html/head/meta[@property='og:title']/@content">
  <meta itemprop=xpath content="/html/head/meta[@property='og:description']/@content">
</span>
<span itemprop=speakable itemscope itemtype=http://schema.org/SpeakableSpecification>
  <meta itemprop=cssSelector content=.pgintro>
</span>
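(For comparison, the same segments expressed in JSON-LD - a hand translation of the microdata above, not taken from the linked page:)

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "speakable": [
    {
      "@type": "SpeakableSpecification",
      "xpath": [
        "/html/head/meta[@property='og:title']/@content",
        "/html/head/meta[@property='og:description']/@content"
      ]
    },
    {
      "@type": "SpeakableSpecification",
      "cssSelector": ".pgintro"
    }
  ]
}
```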

I have written up a bit more here:

http://www.earth.org.uk/note-on-site-technicals-19.html#Speakable

Rgds

Damon

@danbri danbri closed this as completed in 135382d Feb 19, 2019
@Shayo33 commented Sep 25, 2019

Newbie question: when using cssSelector, does 'cssSelector' refer to the id/class of the HTML element for that section? E.g. with <title class=headline>headline</title>, will "headline" in the cssSelector (in this case) match the class and thereby point to the needed section? Is it the same for the summary? Also, why do most examples I see contain two WebPage schemas? Are the headline and summary collected via that schema rather than from the HTML code?
