Class: HTMLReader

Extract the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the inner text of the tag. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
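The rules above can be modeled with a small, dependency-free sketch. This is not how string-strip-html works internally; it is a naive regex-based illustration, and the function name stripHtmlSketch, the constant name, and the "link text followed by URL" layout are assumptions made for the example.

```typescript
// Naive, regex-based model of the stripping rules described above.
// NOT the string-strip-html implementation; real-world HTML needs a real parser.

// Tags whose contents are dropped entirely (taken from the class description).
const REMOVED_WITH_CONTENTS = ["head", "script", "style", "xml"];

function stripHtmlSketch(html: string): string {
  let text = html;

  // 1. Drop head/script/style/xml blocks together with everything inside them.
  for (const tag of REMOVED_WITH_CONTENTS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}\\s*>`, "gi");
    text = text.replace(block, " ");
  }

  // 2. Keep the link text and append the href.
  //    The "text url" layout is an assumption, not documented behavior.
  text = text.replace(
    /<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a\s*>/gi,
    "$2 $1",
  );

  // 3. Remove every remaining tag but keep its inner text.
  text = text.replace(/<[^>]+>/g, " ");

  // 4. Collapse whitespace. Entities such as &amp; are deliberately left untouched.
  return text.replace(/\s+/g, " ").trim();
}
```

Because no decoding step is applied, an input fragment such as `<p>Fish &amp; chips</p>` keeps `&amp;` verbatim in the result.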

Extends

FileReader

Constructors

new HTMLReader()

new HTMLReader(): HTMLReader

Returns

HTMLReader

Inherited from

FileReader.constructor

Methods

getOptions()

getOptions(): object

Wrapper for the configuration options passed to the string-strip-html library.

Returns

object

An object of options for the underlying library

skipHtmlDecoding

skipHtmlDecoding: boolean = true

stripTogetherWithTheirContents

stripTogetherWithTheirContents: string[]

See

https://codsen.com/os/string-strip-html/examples

Source

packages/llamaindex/src/readers/HTMLReader.ts:42
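As a rough illustration of the returned object, the sketch below models its two documented fields and shows one way to extend the tag list. The interface name, the helper withExtraTags, and the order of the default tags are assumptions; only the field names, the `true` default, and the four tag names come from this page.

```typescript
// Shape of the options object, based on the two documented fields.
interface StripOptionsSketch {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
}

// Assumed defaults: the tag list mirrors the class description
// (head, script, style, xml); the order used in the source is not documented here.
const defaultOptionsSketch: StripOptionsSketch = {
  skipHtmlDecoding: true,
  stripTogetherWithTheirContents: ["script", "style", "xml", "head"],
};

// Hypothetical helper: return a copy of the options with extra tags added,
// without mutating the base object and without duplicate entries.
function withExtraTags(
  base: StripOptionsSketch,
  extraTags: string[],
): StripOptionsSketch {
  const all = base.stripTogetherWithTheirContents.concat(extraTags);
  const unique = all.filter((tag, i) => all.indexOf(tag) === i);
  return { ...base, stripTogetherWithTheirContents: unique };
}
```

An object built this way could presumably be passed as the options argument of parseContent(), or returned from an overridden getOptions() in a subclass.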

loadData()

loadData(filePath): Promise<Document<Metadata>[]>

Parameters

• filePath: string

loadDataAsContent()

loadDataAsContent(fileContent): Promise<Document<Metadata>[]>

Public method for this reader. Required by the BaseReader interface.

Parameters

• fileContent: Buffer

Returns

Promise<Document<Metadata>[]>

A Promise eventually yielding zero or one Document parsed from the HTML content of the specified file.

Overrides

FileReader.loadDataAsContent

Source

packages/llamaindex/src/readers/HTMLReader.ts:18
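The flow from raw bytes to documents can be sketched as follows. This is a synchronous model under stated assumptions, not the reader's actual code: SketchDocument stands in for the library's Document type, the strip parameter is a hypothetical hook for any HTML-to-text function, the input is assumed to be UTF-8, and the rule that empty text yields no document is a guess at what "zero or one" means.

```typescript
// Hypothetical stand-in for the library's Document type; only `text` is modeled.
interface SketchDocument {
  text: string;
}

// Synchronous model of the flow: bytes -> UTF-8 string -> tag stripping -> documents.
// `strip` is any HTML-to-text function supplied by the caller (hypothetical parameter).
function bytesToDocumentsSketch(
  fileContent: Uint8Array,
  strip: (html: string) => string,
): SketchDocument[] {
  // Assumption: the file content is UTF-8 encoded.
  const html = new TextDecoder("utf-8").decode(fileContent);
  const text = strip(html);
  // Assumption: empty text yields no document, matching "zero or one".
  return text.length > 0 ? [{ text }] : [];
}
```

In Node.js a Buffer is a Uint8Array subclass, so a Buffer read from disk can be passed to this sketch directly.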

parseContent()

parseContent(html, options): Promise<string>

Wrapper for string-strip-html usage.

Parameters

• html: string

Raw HTML content to be parsed.

• options: any = {}

An object of options for the underlying library

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes

See

getOptions

Source

packages/llamaindex/src/readers/HTMLReader.ts:32

addMetaData()

static addMetaData(filePath): (doc, index) => void

Parameters

• filePath: string

Returns

Function

Parameters

• doc: Document<Metadata>

• index: number

Returns

void

Inherited from

FileReader.addMetaData

Source

packages/llamaindex/src/readers/type.ts:24
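The returned (doc, index) => void function has the shape of an Array.prototype.forEach callback, which suggests it is meant to be applied to each loaded document. The sketch below illustrates that pattern only; MetaDocSketch, addMetaDataSketch, and the metadata keys file_path and file_name are hypothetical and are not confirmed by this page.

```typescript
// Hypothetical stand-in for Document<Metadata>; only `metadata` is modeled.
interface MetaDocSketch {
  metadata: Record<string, unknown>;
}

// Returns a callback that stamps file information onto a document.
// The key names below are assumptions made for this example.
function addMetaDataSketch(
  filePath: string,
): (doc: MetaDocSketch, index: number) => void {
  return (doc, _index) => {
    doc.metadata["file_path"] = filePath;
    // Use the last path segment as the file name ("/" or "\" separators).
    doc.metadata["file_name"] = filePath.split(/[\\/]/).pop() || filePath;
  };
}

// Usage: the callback plugs straight into forEach.
const docs: MetaDocSketch[] = [{ metadata: {} }, { metadata: {} }];
docs.forEach(addMetaDataSketch("data/page.html"));
```

Returning a closure keeps the file path out of the per-document call, so the same callback can be reused for every document produced from one file.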
