
This file is part of TagSoup and is Copyright 2002-2008 by John Cowan.

TagSoup is licensed under the Apache License, Version 2.0. You may obtain a copy of this license at http://www.apache.org/licenses/LICENSE-2.0 . You may also have additional legal rights not granted by this license.

TagSoup is distributed in the hope that it will be useful, but unless required by applicable law or agreed to in writing, TagSoup is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied; not even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

TAGSOUP(1)                       User Commands                       TAGSOUP(1)

NAME

tagsoup - convert nasty, ugly HTML to clean XHTML

SYNOPSIS

java -jar tagsoup [ options ] [ files ]

DESCRIPTION

Rectify arbitrary HTML into clean XHTML, using a tailored description of HTML. The output will be well-formed XML, but not necessarily valid XHTML.

--files
    multiple input files should be processed into corresponding output files

--encoding=encoding
    specifies the encoding of input files

--output-encoding=encoding
    specifies the encoding of the output (if the encoding name begins with "utf", the output will not contain character entities; otherwise, all non-ASCII characters are represented as entities)

--html
    output rectified HTML rather than XML, omitting the XML declaration and any namespace declarations

--method=html
    output rectified HTML rather than XML (end-tags are omitted for empty elements, and no character escaping is done in script and style elements)

--omit-xml-declaration
    omit the XML declaration

--lexical
    output lexical features (specifically comments and any DOCTYPE declaration)

--nons
    suppress namespaces in output

--nobogons
    suppress unknown non-HTML elements in output

--nodefaults
    suppress default attribute values

--nocolons
    change explicit colons in element and attribute names to underscores

--norestart
    don't restart any restartable elements

--ignorable
    pass through ignorable whitespace (whitespace in element-only content) via SAX method handler ignorableWhitespace

--any
    treat unknown non-HTML elements as allowing any content (default)

--emptybogons
    treat unknown non-HTML elements as empty elements

--norootbogons
    don't allow unknown non-HTML elements to be root elements

--doctype-system=system-id
    force DOCTYPE declaration to be output with specified system identifier

--doctype-public=public-id
    force DOCTYPE declaration to be output with specified public identifier

--standalone=[yes|no]
    specify standalone pseudo-attribute in output XML declaration

--version=version
    specify version pseudo-attribute in output XML declaration (does not affect actual version of XML output)

--nocdata
    treat the CDATA-content elements script and style as ordinary elements (mostly for testing)

--pyx
    output PYX format rather than XML (mostly for testing)

--pyxin
    input is PYX-format HTML (mostly for testing)

--reuse
    reuse the same Parser object internally (for testing only)

--help
    output basic help

--version
    output version number

TagSoup is a parser and reformatter for nasty, ugly HTML. Its normal processing mode is to accept HTML files on the command line, or from the standard input if none are given, and output them as clean XML to

the standard output. The encoding is assumed to be the platform-local encoding on input, and is always UTF-8 on output. When the --files option is given, each input file is processed into an output file of the corresponding name, with the extension changed to xhtml. If the extension is already xhtml, it is changed to xhtml_.

TagSoup will repair, by whatever means necessary, violations of XML well-formedness. In particular, it will fix up malformed attribute names and supply missing attribute-value quotation marks. More significantly, it supplies end-tags where HTML allows them to be omitted, and sometimes where it doesn't. It will even supply start-tags where necessary; for example, if a document begins with a

tag, TagSoup will automatically prefix it with

foo: will be put into the artificial namespace urn:x-prefix:foo.
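
The prefix rule stated here can be pictured with a short Python sketch. It is an illustration only, not TagSoup's own code (TagSoup is written in Java). The function name `artificial_namespace` is hypothetical, and returning `None` for names without a prefix is an assumption of the sketch, since the text here does not say how such names are handled. The `xml` prefix is mapped to the standard XML namespace URI rather than to an artificial one.

```python
from typing import Optional

# Standard namespace URI bound to the reserved "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def artificial_namespace(qname: str) -> Optional[str]:
    """Return the namespace URI this sketch assigns to a prefixed name.

    Hypothetical helper: names without a colon yield None, because the
    sketch makes no claim about how unprefixed names are treated.
    """
    prefix, colon, _local = qname.partition(":")
    if not colon:
        return None  # no prefix present (assumption: left undecided)
    if prefix == "xml":
        return XML_NS  # reserved prefix keeps its proper namespace URI
    return "urn:x-prefix:" + prefix  # artificial namespace for any other prefix


if __name__ == "__main__":
    print(artificial_namespace("foo:bar"))    # urn:x-prefix:foo
    print(artificial_namespace("xml:space"))  # http://www.w3.org/XML/1998/namespace
```

The sketch only shows the shape of the mapping: every undeclared prefix gets its own predictable URI, so two names with the same prefix land in the same namespace.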

For the same reasons, namespace-qualified attributes like xml:space can't be returned as default values, though an explicit attribute in the xml namespace will be returned with the proper namespace URI.

AUTHOR

John Cowan

COPYRIGHT

Copyright © 2002-2008 John Cowan

TagSoup is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

TagSoup 1.2.1                      January 2008                      TAGSOUP(1)