CSVfox
Get the job done.

±htmlstrip

Strips all HTML markup from data fields

Description

When an HTML table is read into CSVfox, the field data may still contain HTML markup.
This can be desirable if the fields are to be displayed on a web page again later. If the markup should be removed instead, this command does that.

Pattern

±htmlstrip ±htmlstrip=y

Stripping is applied to every data field in all rows and columns at the time the table is read, before any column modification or data editing takes place.

[Source Example for the difference still missing]

Function

The ±htmlstrip setting makes the following changes:

  • Text line breaks that are invisible in the browser are removed, and <br>, <p>, and <div> tags are replaced with line breaks.
  • All <a href=...> and <img src=...> tags are replaced with their embedded URL, enclosed in round brackets.
  • All <ul> lists are replaced with their list items on separate lines, each preceded by a symbol or an asterisk (*).
  • All <ol> lists are replaced with their list items on separate lines, each preceded by its ordinal number in the intended format.
  • All embedded <table> tables are replaced with their rows on separate lines, each line consisting of the row's comma-separated fields.
  • Finally, all remaining HTML markup is removed, and all HTML entities are replaced with their plain-text characters (e.g. &Auml; is replaced with Ä, and &lt; with <).

Because CSS is not evaluated, CSS formatting (e.g. a list format defined there) cannot be taken into account. Only HTML attributes are used where appropriate.
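
To make the changes listed above more concrete, here is a minimal Python sketch that approximates them with the standard library's html.parser module. It is not CSVfox's implementation. The names MarkupStripper and html_strip are hypothetical, and several details are assumptions: invisible line breaks become a single space, the URL is placed exactly where the tag stood, unordered items always get an asterisk, ordered items are numbered "1.", "2.", and so on, and table fields are joined with commas without any quoting.

```python
import re
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Illustrative approximation only; not CSVfox's actual code."""

    def __init__(self):
        # convert_charrefs=True: entities such as &Auml; arrive already decoded
        super().__init__(convert_charrefs=True)
        self.out = []     # collected output fragments
        self.lists = []   # stack of [tag, counter] for open <ul>/<ol> lists
        self.row = None   # cells of the table row being read, or None

    def emit(self, text):
        if self.row is None:
            self.out.append(text)
        elif self.row:
            self.row[-1] += text  # inside a table cell
        # text between <tr> and its first cell is dropped

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("br", "p", "div"):
            self.emit("\n")
        elif tag == "a" and attrs.get("href"):
            self.emit(f"({attrs['href']})")
        elif tag == "img" and attrs.get("src"):
            self.emit(f"({attrs['src']})")
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            kind, count = self.lists[-1]
            self.lists[-1][1] = count + 1
            # Assumption: "*" for unordered items, "N." for ordered items
            prefix = "* " if kind == "ul" else f"{count + 1}. "
            self.emit("\n" + prefix)
        elif tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.row.append("")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
            self.emit("\n")
        elif tag == "tr" and self.row is not None:
            line = ",".join(cell.strip() for cell in self.row)
            self.row = None
            self.emit("\n" + line)

    def handle_data(self, data):
        # Assumption: a source line break (invisible in the browser) becomes a space
        self.emit(re.sub(r"\s*\n\s*", " ", data))


def html_strip(field):
    """Return the plain-text form of one data field (hypothetical helper)."""
    parser = MarkupStripper()
    parser.feed(field)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around breaks
    text = re.sub(r"\n{2,}", "\n", text)          # collapse blank lines
    return text.strip()


field = ('<p>See <a href="https://example.com/">our site</a></p>'
         '<ul><li>K&auml;se</li><li>Brot</li></ul>')
print(html_strip(field))
```

Like the command itself, the sketch looks only at HTML tags and attributes and ignores any CSS, so a list style defined in a stylesheet has no effect on the output.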

Usage Examples

csvfox https://example.com/infile.csv +filetype=html +htmlstrip (...)
Reads the first HTML table from the website and strips all HTML markup from its data fields.

Stripping only a subset of columns

Pattern

±htmlstrip[Column],[Column] ±htmlstrip[Column],[Column]=y
Under construction, coming soon