SemanticTagging

From RuntimeWiki

To improve machine-to-web interoperability and give end-users greater control over their personal data, one proposed solution is to enforce semantic description of all data transmitted from a user's browser, whether by form post, XHR, JSON, cookie or URL.

The strategy would work as follows. Legacy forms, cookies, XHR requests (and perhaps URLs) would require the user to confirm that they allow submission of data to the server. Web application providers could acquire special status for their forms, and bypass this confirmation, by making their data semantically compliant. This would simply involve aligning data fields (i.e. form fields) with established ontologies. For example, a credit card form containing fields for the cardholder's name, credit card number, and expiry date would carry additional markup tying each field to a "standard" (third-party) ontology maintained within a metadata registry (such as NIEM). Further semantic enforcement could be applied by requiring that the label associated with each form field match one of the labels (synonyms) associated with the corresponding concept term within the ontology.
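The credit card example can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the `registry:` concept identifiers, the synonym sets and the `FormField` structure are hypothetical and are not taken from NIEM or any existing registry.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical concept registry: concept identifier -> accepted label synonyms.
# These identifiers and labels are invented for illustration only.
CONCEPT_LABELS = {
    "registry:CardholderName": {"cardholder name", "name on card"},
    "registry:CardNumber": {"credit card number", "card number"},
    "registry:CardExpiry": {"expiry date", "expiration date"},
}


@dataclass(frozen=True)
class FormField:
    name: str                # field name sent to the server
    label: str               # visible label shown to the user
    concept: Optional[str]   # concept declared in the markup, if any


def field_is_compliant(field: FormField) -> bool:
    """A field complies if it declares a known concept and its label is a synonym."""
    if field.concept is None:
        return False
    synonyms = CONCEPT_LABELS.get(field.concept)
    if synonyms is None:
        return False
    # Compare labels case-insensitively, ignoring surrounding whitespace.
    return field.label.strip().lower() in synonyms
```

Under these assumptions, a field labelled "Name on Card" and tied to `registry:CardholderName` passes, while a field tied to `registry:CardNumber` but labelled "Favourite colour" fails the label check.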

It would be the responsibility of the browser to check that all data being submitted over the web has been semantically identified. The browser would perform this check by parsing the data (form post, XHR request, cookie, JSON?) and extracting a list of field names (or datatypes). The list of field names would then be matched against a semantics schema. The browser would construct this schema either by parsing a document fragment containing the schema or by parsing the inline semantic associations defined on the form field tags themselves. If all fields validate, the data would be transmitted without user interaction. If some or all fields fail to validate, the browser would notify the user that "unknown data" is about to be transmitted to the server.
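The check described above might look roughly like the following sketch for the form post case. It assumes a URL-encoded body and represents the semantics schema as a plain mapping from field name to concept identifier; the function names and the `registry:` identifiers are hypothetical.

```python
from typing import Dict, List
from urllib.parse import parse_qsl


def extract_field_names(form_body: str) -> List[str]:
    """Return the field names in a URL-encoded form body, in order, without repeats."""
    names: List[str] = []
    for name, _value in parse_qsl(form_body, keep_blank_values=True):
        if name not in names:
            names.append(name)
    return names


def unidentified_fields(form_body: str, schema: Dict[str, str]) -> List[str]:
    """Return the field names that have no entry in the semantics schema."""
    return [n for n in extract_field_names(form_body) if n not in schema]


def decide(form_body: str, schema: Dict[str, str]) -> str:
    """Transmit silently if every field validates, otherwise prompt the user."""
    return "prompt-user" if unidentified_fields(form_body, schema) else "transmit"
```

With a schema covering `cc_name` and `cc_num`, a body such as `cc_name=Ann&tracker=abc` would be held back because `tracker` has no semantic association.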

Ideally the browser would provide additional functionality allowing end-users to tag semantically unidentified form fields whenever they fail the schema validation step. The browser would need to be able to connect to a (configurable) search service that would look up the unidentified fields and offer candidate concepts for the user to select from. The selected concept would be cached so that subsequent data submissions to the same server/URL would not require re-validation. Optionally, a mechanism for sharing the concept selection with other users could be provided. This could be done by submitting the chosen selection along with the data in the post to the server, or by recording the match at the semantic search provider as part of the search. There is very little infrastructure in place to support this aspect of semantic identification, so it may be a more long-term objective.
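The caching step could be sketched as below. This is only one possible design: the `ConceptCache` class is hypothetical, and keying the cache on host plus path (ignoring the query string) is an assumption about what "the same server/URL" should mean.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit


class ConceptCache:
    """Remembers which concept the user chose for a field on a given server/URL."""

    def __init__(self) -> None:
        self._choices: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def _key(url: str, field_name: str) -> Tuple[str, str]:
        parts = urlsplit(url)
        # Assumption: host plus path identifies the target; the query string is ignored.
        return (parts.netloc.lower() + parts.path, field_name)

    def remember(self, url: str, field_name: str, concept: str) -> None:
        self._choices[self._key(url, field_name)] = concept

    def lookup(self, url: str, field_name: str) -> Optional[str]:
        return self._choices.get(self._key(url, field_name))


def fields_needing_tagging(url: str, field_names: List[str],
                           cache: ConceptCache) -> List[str]:
    """Return the fields the user has not yet tagged for this server/URL."""
    return [n for n in field_names if cache.lookup(url, n) is None]
```

Once the user has tagged `cc_num` on a checkout page, later submissions to that page would only prompt for fields that are still untagged.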

Some mechanism may also be necessary for internal corporate/community web applications to prevent or override semantic schema validation of submitted data. By default, users should probably still be required to enable overrides explicitly for particular URLs, subdomains or domains.
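A user-managed override list might be checked along the following lines. This is a sketch under assumptions: the override set holds either exact URLs (scheme, host and path) or bare host names, and a host entry also covers its subdomains. The function name is hypothetical.

```python
from typing import Set
from urllib.parse import urlsplit


def override_applies(url: str, overrides: Set[str]) -> bool:
    """Return True if the user has enabled an override covering this URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Exact URL override: scheme, host and path must all match.
    if f"{parts.scheme}://{host}{parts.path}" in overrides:
        return True

    # Host override: try the host itself, then each parent domain,
    # stopping before the bare top-level label.
    labels = host.split(".")
    for i in range(max(len(labels) - 1, 1)):
        if ".".join(labels[i:]) in overrides:
            return True
    return False
```

With `corp.example` in the override set, `https://intranet.corp.example/hr/form` would skip validation, while `https://evilcorp.example/` would not, because matching is done on whole domain labels rather than string suffixes.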

Requiring the deconstruction of URLs and semantic schema validation of the terms contained within a URL or query string would be difficult or impossible at this stage, because it is hard to determine which terms within a URL are datatypes and which are data. Going forward it may be desirable to put a mechanism in place that encourages semantic identification of URL content. A strategy similar to the one proposed for form data could be implemented, whereby a schema is defined for the URL. Web content providers would identify the terms used within their URLs. Eventually providers could be discouraged from using unvalidated URLs by prompting users to approve URLs that cannot be semantically validated.
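One way a provider might declare a URL schema is with a path template in which literal segments are the datatype terms and bracketed placeholders mark the data. The template syntax, the `registry:` identifiers and the function below are all assumptions made for this sketch, not an existing convention.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def match_url_schema(url: str, template: str) -> Optional[Dict[str, str]]:
    """Match a URL path against a provider-declared template.

    Returns a mapping of concept identifier -> data value, or None if the
    URL does not fit the template and so cannot be semantically validated.
    """
    path_segments = [s for s in urlsplit(url).path.split("/") if s]
    template_segments = [s for s in template.split("/") if s]
    if len(path_segments) != len(template_segments):
        return None

    data: Dict[str, str] = {}
    for actual, expected in zip(path_segments, template_segments):
        if expected.startswith("{") and expected.endswith("}"):
            # Placeholder segment: the URL carries data for this concept.
            data[expected[1:-1]] = actual
        elif actual != expected:
            # Literal segment: the datatype term must match exactly.
            return None
    return data
```

For the template `/customers/{registry:CustomerId}/orders/{registry:OrderId}`, the path `/customers/42/orders/7` yields the two concept values, while `/customers/42/invoices/7` returns `None` and would trigger a user prompt.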

One final piece of the semantic package that would be nice to see in browsers is the ability to log all data that has been submitted. Filtering tools could then be written that track where and when data was submitted and assist in managing this data in the future.
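A submission log with simple filters could be sketched as follows. The record layout and the `SubmissionLog` class are hypothetical; the sketch assumes the browser stores the concept identifiers of the submitted fields rather than the values themselves.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SubmissionRecord:
    when: datetime              # time of submission
    url: str                    # where the data was sent
    concepts: Tuple[str, ...]   # concepts of the fields that were sent


class SubmissionLog:
    def __init__(self) -> None:
        self._records: List[SubmissionRecord] = []

    def record(self, when: datetime, url: str, concepts: Iterable[str]) -> None:
        self._records.append(SubmissionRecord(when, url, tuple(concepts)))

    def where_sent(self, concept: str) -> List[str]:
        """Hosts that have received the given concept, in first-seen order."""
        hosts: List[str] = []
        for rec in self._records:
            host = urlsplit(rec.url).hostname or ""
            if concept in rec.concepts and host not in hosts:
                hosts.append(host)
        return hosts

    def since(self, cutoff: datetime) -> List[SubmissionRecord]:
        """Records of submissions made at or after the cutoff time."""
        return [rec for rec in self._records if rec.when >= cutoff]
```

A filtering tool built on such a log could answer questions like "which sites have received my card number?" via `where_sent`, or list recent submissions via `since`.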