±htmlstrip
Strips all HTML markup from the data fields when read
Description
When an HTML table is read into CSVfox, the field data may still contain HTML markup.
This can be desirable if the fields are meant to be displayed on a web page again later. If the markup should be removed instead, this command does so.
Pattern
±htmlstrip
±htmlstrip=y
Function
The setting ±htmlstrip makes the following changes:
- All line breaks in the text that are invisible in the browser are removed, and <br>, <p>, and <div> tags are replaced with line breaks instead.
- All <a href=...> tags and <img src=...> tags are replaced with their embedded URL, enclosed in round brackets.
- All <ul> lists are replaced with their list items on separate lines, each prefixed with a symbol or an asterisk (*).
- All <ol> lists are replaced with their list items on separate lines, each prefixed with its ordinal number in the intended format.
- All embedded <table> tables are replaced with their rows on separate lines, each line consisting of the row's comma-separated fields.
- Finally, all remaining HTML markup is removed, and all HTML entities are replaced with their plain text characters (e.g. &Auml; is replaced with Ä, and &lt; is replaced with <).
As CSS cannot be evaluated and processed, any CSS formatting (e.g. the list format defined there) cannot be taken into account. Only the HTML attributes are used where appropriate.
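The steps listed above can be approximated in a few lines of Python. This is only an illustrative sketch built on the standard library's html.parser, not CSVfox's actual implementation. The names HtmlStripSketch and html_strip are hypothetical, and several details are assumptions: source line breaks are collapsed to a single space, the URL of a link is placed in front of the link text, <ul> items always get an asterisk, and <ol> items are always numbered 1., 2., 3. regardless of any HTML attributes.

```python
import re
from html.parser import HTMLParser


class HtmlStripSketch(HTMLParser):
    """Approximates the documented steps; not CSVfox's actual implementation."""

    def __init__(self):
        # convert_charrefs=True: entities such as &Auml; reach handle_data decoded.
        super().__init__(convert_charrefs=True)
        self.out = []           # collected output fragments
        self.lists = []         # per open list: None for <ul>, next number for <ol>
        self.first_cell = True  # True until the first cell of the current row

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("br", "p", "div"):
            self.out.append("\n")
        elif tag == "a" and attrs.get("href"):
            self.out.append("(" + attrs["href"] + ") ")
        elif tag == "img" and attrs.get("src"):
            self.out.append("(" + attrs["src"] + ")")
        elif tag == "ul":
            self.lists.append(None)
        elif tag == "ol":
            self.lists.append(1)
        elif tag == "li":
            if self.lists and self.lists[-1] is not None:
                self.out.append("\n%d. " % self.lists[-1])
                self.lists[-1] += 1
            else:
                self.out.append("\n* ")
        elif tag == "tr":
            self.out.append("\n")
            self.first_cell = True
        elif tag in ("td", "th"):
            if not self.first_cell:
                self.out.append(",")
            self.first_cell = False

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            if self.lists:
                self.lists.pop()
            self.out.append("\n")
        elif tag == "table":
            self.out.append("\n")

    def handle_data(self, data):
        # Assumption: source line breaks and whitespace runs become one space.
        self.out.append(re.sub(r"\s+", " ", data))


def html_strip(field):
    """Return the plain-text form of one field (hypothetical helper)."""
    parser = HtmlStripSketch()
    parser.feed(field)
    parser.close()  # flushes any text still buffered by the parser
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    return "\n".join(lines).strip()


print(html_strip("<ul><li>red</li><li>green</li></ul>"))
print(html_strip('<a href="https://example.com/">home</a>'))
```

Under these assumptions, the first call prints the two list items on separate lines, each prefixed with an asterisk, and the second prints the URL in round brackets followed by the link text. As in the note on CSS above, the sketch ignores all style information.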
Usage Examples
- csvfox https://example.com/infile.csv +filetype=html +htmlstrip (...)
- Reads the first HTML table from the web page and strips all HTML markup from its fields.