Add vocabulary to indicate which sections of a document are particularly 'speakable' #1389
Comments
It seems like you're identifying the "key bits of the page", presumably as an initial view of it. I think that kind of summary has a fair bit of application beyond reading it out on a speech system. I like the model of being able to gather a few different pieces of the content together, but I'm wary of trying to tie it tightly to text-to-speech usage. On the other hand, I am still thinking about this. (The examples also seem to be a bit broken.) @LJWatson ping?
This seems like a useful property. When using a voice UI the interaction needs to be clutter-free, or it becomes fairly horrible. The only other use case for something like it is those tools that strip out the visual clutter of pages for better readability. I don't know whether the desirable content would be the same for both use cases though...
@chaals @LJWatson - I've just posted a brief proposal to the JSON-LD group, who are working on improvements to JSON-LD. The idea would be for the cross-domain parts of this to be specified as something a JSON-LD parser might do, i.e., as @chaals says, not "tie it tightly to text-to-speech usage". Within the purely schema.org world, at least the 'xpath' and 'cssSelector' properties have nothing binding them to text-to-speech; other definitions and use cases could easily reuse them. (Edit: here's the issue I mentioned - json-ld/json-ld.org#498)
I'm with Chaals regarding clarifying the goal. Is it about:
Also, we should look into SSML if we want to go beyond annotating the speakable portions of a page:
In json-ld/json-ld.org#498 (comment) I suggested that it may simply be better to combine RDFa and JSON-LD on the page to address this, as RDFa allows HTML content to be referenced/extracted from the page. I don't think any existing example contains both JSON-LD and RDFa, but this is feasible and well-supported by existing processors.
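A minimal sketch of that combination might look like the following; the page content, URL, and property choices here are invented for illustration, not taken from any existing schema.org example. The RDFa attributes mark up the visible text in place, while the JSON-LD block elsewhere on the page carries the rest:

```html
<!-- Illustrative sketch only: RDFa annotates the visible content,
     JSON-LD carries the page-level metadata. -->
<body vocab="http://schema.org/" typeof="WebPage">
  <h1 property="name">Jane Doe's homepage</h1>
  <p property="description">A short summary suitable for reading aloud.</p>
  <script type="application/ld+json">
  {
    "@context": "http://schema.org/",
    "@type": "WebPage",
    "url": "http://www.janedoe.com"
  }
  </script>
</body>
```

A consumer supporting both syntaxes could then merge the two graphs, with the RDFa supplying the actual text content.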
Some implementor feedback from Google: the "cssSelector" (and "xpath") property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector/xpath. Note that this isn't "element" in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector. I suggest adding WebPageElement as a type that these two properties are expected on.
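To illustrate the suggested pairing (the selector value and URL here are invented for this sketch; the exact expected-type wiring was still being settled at this point in the thread), a WebPageElement carrying cssSelector might look like:

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPageElement",
  "cssSelector": ".headline",
  "isPartOf": {
    "@type": "WebPage",
    "url": "http://www.example.com"
  }
}
```

Since `.headline` is a class selector, it may match several elements on the page, consistent with the note above that WebPageElement is not an "element" in the formal XML sense.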
Ping @tmarshbing @scor @rvguha @vholland @tilid @nicolastorzec - any views?
+1 |
Proceeding on the basis that this is a commonsense combination of two terms with related semantics, I'm making an edit now to cssSelector, xpath, and the expected type associations of both. There might be some nuance in the details, but it doesn't make sense to have a type for parts of a page and a property for pointing into parts of a page, yet fail to say how they relate!
…nt' /cc #1389 Allowed both properties to be expected on that type.
Finally got a moment to respond to this... Having read the discussion, I'm still wondering what exactly this proposal is trying to resolve. In general, the part of a web page that should be 'speakable/pronounceable' is the main content of the page.

At the same time I can't help feeling that this proposal tries to bypass the WCAG accessibility guidelines, which IMO should suffice for devices (I can't imagine things like smart speakers needing more specific types and attributes than screen readers (and visually impaired people) do). Am I overlooking reasons why the WCAG guidelines don't suffice here?
Summary: I guess the WCAG guidelines and this 'speakable' proposal have different use cases and target audiences. WCAG is about making the whole (main) content of a webpage accessible; the use case there is to serve the whole content to anyone who deliberately visits a specific website. Voice assistants, on the other hand, should keep their answer to a specific question brief; a short summary of the page topic could fit quite well. A typical use case is a search-engine query, where users won't visit the website but instead get an excerpt on the topic. Even more, users often do not know where the excerpt originates from (see the example below).

An advisory from Amazon for Alexa responses: "Be brief". And from Google: "Recommended: Less than 300 characters for each dialog turn."

If you, for example, ask Amazon Alexa "Alexa, who is Chuck Norris?", it will read the first sentence of the Wikipedia article on Chuck Norris, without mentioning the origin. At the time of writing, the answer is "Carlos Ray 'Chuck' Norris (born March 10, 1940) is an American martial artist, actor, film producer and screenwriter." (English Wikipedia). It's not the whole article, which is what you would get when using a screen reader. Just my two cents ;-)
OK, I see sense in the point that WCAG's use case is different from that of this proposal. But for use cases like a 'title', there is the name property, which could be used by speaking devices (or headline for creative works). And as for summaries, can't description be used for that? (In most cases it is fewer than 300 characters.)

Really, I get the intention of the proposal, but I don't expect much more to come of it than publishers duplicating the same content multiple times to be able to populate multiple properties (in different formats for different parties). Now, I've worked with some very large publishers in the past and can tell you that all hell breaks loose when authors have to start providing new/multiple titles and descriptions for the same article (or product) because different media require different character counts, simply because this costs time (= money) they don't have; this isn't a trivial matter for them!

Meaning, I'm pretty sure authors (as well as business owners/stakeholders) won't be happy at all if they have to start providing 'speakable' descriptions (of a certain length) as well, especially if this also involves doing so for multiple sections of an article or web page.

And from a CMS perspective I don't expect many positives either, as this will probably lead to authors having to fill out (many) more input fields in the CMS form of an article or a product's PIM system (or, even worse, forcing authors to start adding and managing CSS classes of elements for the cssSelectors; a fun job for sites with (hundreds of) thousands of articles or products).
Apologies if I sound negative (especially because I do like the idea of being able to easily serve speaking devices), but I just don't see publishers handling this proposal very well, mostly due to technical/resource constraints (which will lead to the duplication of content), as well as time constraints for (professional) authors (as they already have so many things to fill out).

Try looking at it from a business perspective: what's there to be gained by website owners after they've spent a ton of time and resources to make this happen? I understand the ROI for companies that produce speaking devices, but what's the ROI for those implementing this proposal on their sites? Will this lead to users reading more articles or buying more products? If not, why would publishers bother to accommodate speaking devices? These are typical questions businesses need answers to, and guess what happens if the answers aren't in their favor? Absolutely nothing, as they'll see it as a waste of precious resources.

Can't this instead be resolved by having speaking devices simply use properties that already exist (and are being used by publishers)?
Can HTML annotations be used? |
Thanks @jvandriel for describing these larger concerns - I hear you. A few thoughts I'd like to offer:
Thoughts?
FWIW, the Web Annotation selectors encoding might provide more future-proofing, flexibility, and potential re-use throughout Schema.org (i.e. new selection systems can be added without new properties), though at the expense of being a bit more verbose. So example 1 for SpeakableSpecification might support CSS selectors, XPaths, and fragment identifiers together, becoming:

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Jane Doe's homepage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "selector": [
      {"@type": "CssSelector", "@value": ".headline"},
      {"@type": "XPathSelector", "@value": "//summary"},
      {"@type": "FragmentSelector", "@value": "#speakable"}
    ]
  },
  "url": "http://www.janedoe.com"
}
```

The Selectors and States note focuses on this part of the Web Annotation Data Model. I'd be happy to help with some mappings between the two, if there's interest.
https://webmasters.googleblog.com/2018/07/hey-google-whats-latest-news.html explains what we've been using this for at Google. I'll find some example URLs to share here too. |
I've got a question: is it possible to create two or more speakable sections on the one webpage? I only see code examples showing a single markup using the following combinations:
Would it be possible to create a list of speakable markups per webpage?
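For what it's worth, since the selector properties accept multiple values, one possible shape for several speakable sections in a single JSON-LD block would be an array of selectors (the selector values and URL below are invented for illustration; whether a given consumer honours all of them is a separate question):

```json
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "name": "Example article",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary", ".key-points"]
  },
  "url": "http://www.example.com/article"
}
```

Each selector identifies one section of the page, so this sketch marks three separate speakable regions.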
According to Google's Guidelines found here: https://developers.google.com/search/docs/guides/sd-policies
However, I have seen people mark up meta descriptions via an XPath where the meta-description value is not present in the visible content of the webpage, yet the Google Home smart speaker still finds and reads the markup. Does this not conflict with Google's spam guidelines?
According to Google's Guidelines found here: https://support.google.com/webmasters/answer/7478053?hl=en
It suggests that Google's view is that both the AMP-HTML/mobile and the desktop versions of the website should contain exactly the same thing. So does this same policy apply under Google's structured data policy, meaning that we have to use the exact same XPath/CSS selector path/CSS value when marking up both versions of the webpage? For example: Mobile Version:
Desktop Version:
Note the different XPaths in the example code above. P.S. I could not find the answers in Google's docs or on schema.org. Thanks.
@michalise Did you find a way to add multiple speakable sections on a single page? |
In general, search engines and other products/services/features can express more detailed restrictions than are required by Schema.org itself. I think that's what is happening here. Schema.org provides the underlying dictionary of terms, and Google says "here are some deployment patterns that we can work with". Everyone's policies and information needs are evolving, and it isn't feasible to attempt to track such things within Schema.org's definitions. My understanding is that there is no reason to consider multiple speakable sections as intrinsically inappropriate. Whether it works in Google right now is a separate matter.
@beltofte: FWIW, Google's Structured Data checker tool seems happy with the following (microdata) markup with 3 separate segments.
I have written up a bit more here: http://www.earth.org.uk/note-on-site-technicals-19.html#Speakable Rgds Damon |
Newbie question: when using cssSelector, does it refer to the id/class of the HTML element for the section? E.g. with `<title class=headline>headline</title>`, would "headline" in the cssSelector (in this case) match the class and thereby point to the section needed? Is it the same for the summary? Also, why do most examples I see contain two WebPage schemas? Are the headline and summary collected via that schema rather than from the HTML code?
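To make the mechanics concrete (the class names and text below are invented for this sketch): cssSelector takes an ordinary CSS selector string, so `.headline` matches any element whose class attribute is "headline" in the page's HTML, while `#speakable` would match by id. The selector connects the JSON-LD to the rendered markup:

```html
<!-- Illustrative only: the selectors in the JSON-LD below match
     elements in the page by their class attributes. -->
<article>
  <h1 class="headline">Story headline</h1>
  <p class="summary">One-sentence summary of the story.</p>
</article>
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "WebPage",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"]
  }
}
</script>
```

The spoken text itself is taken from the HTML elements the selectors match, not duplicated inside the JSON-LD.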
Use case:
"With use of text-to-speech on the rise in mainstream use-case scenarios such as smart speakers (Amazon Echo, Google Home), multimodal interaction on smart phones and in-car systems, there is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page."
A vocab draft: