±htmlstrip

Strips all HTML markup from data fields

Description

When an HTML table is read into CSVfox, the field data may still contain HTML markup.
This can be desirable if the fields are to be displayed on a web page again later. If the markup should be removed instead, this command does so.

Pattern

±htmlstrip ±htmlstrip=y

Stripping is applied to every data field in all rows and all columns at the time the table is read, before any column modification or data editing takes place.

[Source Example for the difference still missing]

Function

The setting ±htmlstrip makes the following changes:

All line breaks in the text that are invisible in the browser are removed, and <br>, <p>, and <div> tags are replaced with line breaks.
All <a href=...> tags and <img src=...> tags are replaced with their embedded URL, enclosed in round brackets.
All <ul> lists are replaced with their list items on separate lines, each preceded by a symbol or by an asterisk (*).
All <ol> lists are replaced with their list items on separate lines, each preceded by its ordinal number in the intended format.
All embedded <table> tables are replaced with their rows on separate lines, each line consisting of its comma-separated fields.
Finally, all remaining HTML markup is removed, and all HTML entities are replaced with their plain text characters (e.g. &Auml; is replaced with Ä, and &lt; is replaced with <).

As scripting (e.g., JavaScript) and CSS cannot be evaluated or processed, dynamic DOM objects and CSS formatting cannot be taken into account. Only some HTML attributes are used where appropriate.
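
The replacement rules listed above can be approximated in a few lines of Python. This is only an illustrative sketch built on the standard library's html.parser, not CSVfox's actual implementation. The names StripSketch and html_strip are hypothetical, the position of the bracketed URL relative to the link text is an assumption, ordered lists are always numbered in decimal, and embedded tables are not handled.

```python
from html.parser import HTMLParser
import re


class StripSketch(HTMLParser):
    """Illustrative approximation only; not CSVfox's actual implementation."""

    def __init__(self):
        # With convert_charrefs=True, entities arrive already decoded in handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.lists = []  # stack of [kind, counter] for nested <ul>/<ol>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("br", "p", "div"):
            self.parts.append("\n")
        elif tag == "a" and attrs.get("href"):
            # Assumption: the URL is emitted where the tag opened.
            self.parts.append("(" + attrs["href"] + ")")
        elif tag == "img" and attrs.get("src"):
            self.parts.append("(" + attrs["src"] + ")")
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            entry = self.lists[-1]
            entry[1] += 1
            # Assumption: asterisk bullets and decimal numbering only.
            bullet = "* " if entry[0] == "ul" else f"{entry[1]}. "
            self.parts.append("\n" + bullet)
        # Any other tag is simply dropped.

    def handle_endtag(self, tag):
        if tag in ("p", "div"):
            self.parts.append("\n")
        elif tag in ("ul", "ol") and self.lists:
            self.lists.pop()
            self.parts.append("\n")

    def handle_data(self, data):
        # Line breaks in the source are invisible in a browser: collapse to spaces.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_strip(field: str) -> str:
    """Return a plain-text approximation of one HTML data field."""
    parser = StripSketch()
    parser.feed(field)
    parser.close()
    # Trim spaces around line breaks and drop empty lines.
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_strip("<ul><li>red</li><li>green</li></ul>"))
```

Calling html_strip on each field of a row mirrors the per-field behaviour described above. A real implementation would also need to honour list type attributes and flatten embedded tables.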

Usage Examples

csvfox https://example.com/infile.csv +filetype=html +htmlstrip (...): Reads the first HTML table from the website and strips all HTML markup from its data fields.

Stripping only a subset of columns

Pattern

±htmlstrip[Column],[Column] ±htmlstrip[Column],[Column]=y