@data-master/web-extractor

v2.0.0

Web Extractor

Class: WebRule

Represents a web rule with its properties and methods.

Constructor

WebRule(id: string)

Creates a new instance of the WebRule class with the specified id.

  • id (string): The unique identifier for the web rule.

Properties

  • version (number): The version of the web rule.
  • id (string): The unique identifier for the web rule.
  • structures (array): An array of WebStructure instances associated with the web rule.
  • fields (array): An array of WebFields instances associated with the web rule.
  • meta (object): Additional metadata associated with the web rule.
  • type (string): The type of the web rule.

Methods

setVersion(version: number)

Sets the version of the web rule.

  • version (number): The version number to set.

addStructure(structure: WebStructure)

Adds a WebStructure instance to the web rule.

  • structure (WebStructure): The WebStructure instance to add.

addFields(field: WebFields)

Adds a WebFields instance to the web rule.

  • field (WebFields): The WebFields instance to add.

toJSON(): object

Serializes the web rule object to a JSON representation.

  • Returns: An object representing the serialized web rule.

setMeta(key: string, value: any)

Sets a metadata value for the specified key.

  • key (string): The key of the metadata.
  • value (any): The value to set for the specified key.

run(): object

Executes the web rule by running its associated structures and fields.

  • Returns: An object containing the storage information and the timestamp of the execution.

static fromJSON(ruleJSON: string): WebRule

Creates a new WebRule instance from a JSON representation.

  • ruleJSON (string): The JSON representation of the web rule.

  • Returns: A new WebRule instance created from the JSON representation.

Note: This method is static and does not require an existing instance of the class.

Class: WebFields

Represents a collection of web fields with their properties and methods. Extends the WebRetrieveItem class.

Constructor

WebFields(id: string)

Creates a new instance of the WebFields class with the specified id.

  • id (string): The unique identifier for the web fields.

Properties

  • Inherited from WebRetrieveItem:

    • id (string): The unique identifier for the web fields.
    • steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
    • source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
    • debug (object): Debug information associated with the web fields.
    • retrieved (boolean): A flag indicating whether the item has been retrieved or not.
    • sourceValue (any|null): The value retrieved from the source.
    • value (any|null): The final retrieved value.
    • type (string): The type of the web fields.
  • items (object): An object representing the collection of web field items. Each item is identified by its unique id.

  • confidence (number): The confidence level associated with the web fields.

Methods

setConfidence(confidence: number)

Sets the confidence level for the web fields.

  • confidence (number): The confidence level to set.

addFieldItem(item: WebFieldItem)

Adds a WebFieldItem instance to the collection of web fields.

  • item (WebFieldItem): The WebFieldItem instance to add.

toJSON(): object

Serializes the web fields object to a JSON representation.

  • Returns: An object representing the serialized web fields.

static fromJSON(fieldJSON: object): WebFields

Creates a new WebFields instance from a JSON representation.

  • fieldJSON (object): The JSON representation of the web fields.

  • Returns: A new WebFields instance created from the JSON representation.

Note: This method is static and does not require an existing instance of the class.

run(): array

Executes the retrieval process for the web fields by running the associated steps on the source value.

  • Returns: An array of field objects containing the retrieved values for each field.

Note: The retrieval process considers the source value, source items, and field items to generate the fields.

Class: WebFieldItem

Represents a single web field item with its properties and methods. Extends the WebRetrieveItem class.

Constructor

WebFieldItem(id: string)

Creates a new instance of the WebFieldItem class with the specified id.

  • id (string): The unique identifier for the web field item.

Properties

  • Inherited from WebRetrieveItem:
    • id (string): The unique identifier for the web field item.
    • steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
    • source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
    • debug (object): Debug information associated with the web field item.
    • retrieved (boolean): A flag indicating whether the item has been retrieved or not.
    • sourceValue (any|null): The value retrieved from the source.
    • value (any|null): The final retrieved value.
    • type (string): The type of the web field item.

Methods

static fromJSON(json: object): WebFieldItem

Creates a new WebFieldItem instance from a JSON representation.

  • json (object): The JSON representation of the web field item.

  • Returns: A new WebFieldItem instance created from the JSON representation.

Note: This method is static and does not require an existing instance of the class.

run(value: any): any

Executes the retrieval process for the web field item by running the associated steps on the provided value.

  • value (any): The value on which to run the retrieval steps.

  • Returns: The retrieved value after running the steps.

Note: The retrieval process considers the provided value and the associated steps to generate the retrieved value.

Class: WebRetrieveItem

Represents a web retrieve item with its properties and methods.

Constructor

WebRetrieveItem(id: string)

Creates a new instance of the WebRetrieveItem class with the specified id.

  • id (string): The unique identifier for the web retrieve item.

Properties

  • id (string): The unique identifier for the web retrieve item.
  • steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
  • source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
  • debug (object): Debug information associated with the web retrieve item.
  • retrieved (boolean): A flag indicating whether the item has been retrieved or not.
  • sourceValue (any|null): The value retrieved from the source.
  • value (any|null): The final retrieved value.
  • type (string): The type of the web retrieve item.

Methods

setType(type: string)

Sets the type of the web retrieve item.

  • type (string): The type to set for the web retrieve item.

setRetrieved(flag: boolean = true)

Sets the retrieved flag indicating whether the item has been retrieved or not.

  • flag (boolean): The flag value. Default is true.

setSource(source: WebRetrieveItem|null)

Sets the source WebRetrieveItem instance from which to retrieve the item.

  • source (WebRetrieveItem|null): The source WebRetrieveItem instance.

setSourceValue(sourceValue: any)

Sets the source value and updates the debug information.

  • sourceValue (any): The source value to set.

retrieveFromSource()

Retrieves the value from the source WebRetrieveItem instance and sets the source value and retrieved flag.

addSteps(steps: WebRetrieveMethod)

Adds a WebRetrieveMethod instance to the steps for retrieving the item.

  • steps (WebRetrieveMethod): The WebRetrieveMethod instance to add.

toJSON(): object

Serializes the web retrieve item object to a JSON representation.

  • Returns: An object representing the serialized web retrieve item.

runSteps(sourceValue: any): any

Executes the steps of retrieving the item by sequentially running each WebRetrieveMethod.

  • sourceValue (any): The source value to start with.

  • Returns: The final retrieved value.

run(): any

Executes the retrieval process by retrieving the item from the source and running the steps if necessary.

  • Returns: The final retrieved value.

static fromJSON(json: object): WebRetrieveItem

Creates a new WebRetrieveItem instance from a JSON representation.

  • json (object): The JSON representation of the web retrieve item.

  • Returns: A new WebRetrieveItem instance created from the JSON representation.

Note: This method is static and does not require an existing instance of the class.
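
The sequential behaviour described for runSteps can be pictured as a fold over the step list. The following is a minimal sketch with stand-in step objects, not the package's source; `runStepsSketch` and the inline steps are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in: each step only needs a run(value) method,
// mirroring the run(item) shape documented for WebRetrieveMethod.
function runStepsSketch(steps, sourceValue) {
  // Feed the output of each step into the next one, in order.
  return steps.reduce((value, step) => step.run(value), sourceValue);
}

const steps = [
  { run: (text) => text.split(",") }, // split into segments
  { run: (parts) => parts[1] },       // pick one segment
  { run: (text) => text.trim() },     // tidy whitespace
];

const result = runStepsSketch(steps, "Doe , Jane , 1990");
console.log(result); // → Jane
```

With an empty step list the source value comes back unchanged, which is one plausible reading of "running the steps if necessary".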

Class: WebRetrieveMethod

Represents a web retrieve method with its properties and methods.

Constructor

WebRetrieveMethod()

Creates a new instance of the WebRetrieveMethod class.

Properties

  • type (string): The type of the web retrieve method.
  • parameters (object): The parameters associated with the web retrieve method.
  • method (string): The method used for retrieval.
  • debug (object): Debug information associated with the web retrieve method.
  • classInstance (string): The class instance identifier.

Methods

setMethod(method: string)

Sets the method used for retrieval.

  • method (string): The method used for retrieval.

setParameter(key: string, value: any)

Sets a parameter value for the specified key.

  • key (string): The key of the parameter.
  • value (any): The value to set for the specified key.

getParameter(key: string): any

Retrieves the value of a parameter for the specified key.

  • key (string): The key of the parameter.

  • Returns: The value of the parameter.

run(item: any): any

Executes the web retrieve method by processing the provided item.

  • item (any): The item to be processed by the method.

  • Returns: The processed item.

fromJSON(stepJSON: object)

Populates the WebRetrieveMethod instance from a JSON representation.

  • stepJSON (object): The JSON representation of the web retrieve method.

toJSON(): object

Serializes the web retrieve method object to a JSON representation.

  • Returns: An object representing the serialized web retrieve method.
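
To make the parameter and serialization methods concrete, here is a small stand-in class. The property names follow the list above, but the JSON layout is an assumption (this README does not specify it), and `MethodSketch` is a hypothetical name rather than a class from the package.

```javascript
// Stand-in class; the JSON layout below is assumed, not taken from the package.
class MethodSketch {
  constructor() {
    this.method = "";
    this.parameters = {};
  }
  setMethod(method) { this.method = method; }
  setParameter(key, value) { this.parameters[key] = value; }
  getParameter(key) { return this.parameters[key]; }
  toJSON() {
    // Copy parameters so later mutations do not leak into the snapshot.
    return { method: this.method, parameters: { ...this.parameters } };
  }
  fromJSON(stepJSON) {
    // Instance method: populates this object, as documented for fromJSON above.
    this.method = stepJSON.method;
    this.parameters = { ...stepJSON.parameters };
  }
}

const original = new MethodSketch();
original.setMethod("split");
original.setParameter("splitby", ",");

// Serialize to text and restore into a fresh instance.
const restored = new MethodSketch();
restored.fromJSON(JSON.parse(JSON.stringify(original)));
```

`JSON.stringify` calls `toJSON` automatically, so an object that implements it can be stored as text and rebuilt later.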

Class: WebStructure

Represents a web structure with its properties and methods. Extends the WebRetrieveItem class.

Constructor

WebStructure(id: string)

Creates a new instance of the WebStructure class with the specified id.

  • id (string): The unique identifier for the web structure.

Properties

  • Inherited from WebRetrieveItem:
    • id (string): The unique identifier for the web structure.
    • steps (array): An array of WebRetrieveMethod instances representing the steps to retrieve the item.
    • source (WebRetrieveItem|null): The source WebRetrieveItem instance from which to retrieve the item.
    • debug (object): Debug information associated with the web structure.
    • retrieved (boolean): A flag indicating whether the item has been retrieved or not.
    • sourceValue (any|null): The value retrieved from the source.
    • value (any|null): The final retrieved value.
    • type (string): The type of the web structure.

Methods

static fromJSON(structureJSON: object): WebStructure

Creates a new WebStructure instance from a JSON representation.

  • structureJSON (object): The JSON representation of the web structure.

  • Returns: A new WebStructure instance created from the JSON representation.

Note: This method is static and does not require an existing instance of the class.

Web Data Retrieval Methods

This project provides a set of classes that implement various web data retrieval methods. These methods are designed to extract specific content from web documents based on different criteria. The classes are implemented in JavaScript and extend the WebRetrieveMethod class, which provides common functionality for data retrieval.

WebRetrieveMethod

This is the base class for all web data retrieval methods. It contains shared functionality and properties.

Constructor

  • No Arguments: Creates an instance of the WebRetrieveMethod class.

Methods

  • setMethod(method): Sets the method identifier for the retrieval process.

  • setParameter(name, value): Sets a parameter used by the retrieval method.

  • run(content): The main method responsible for extracting data from the web content. It takes the web content as input and returns the extracted data.

RetriveByDocumentTextContent

This class retrieves data by directly accessing the text content of a web document.

Constructor

  • No Arguments: Creates an instance of the RetriveByDocumentTextContent class.

Methods

  • run(documentNode): Takes a DOM element documentNode as input, and extracts the text content from it using the textContent property. It then passes the text content to the parent class's run method for further processing.

RetriveByDocumentGetAttribute

This class retrieves data by accessing a specific attribute of a web document.

Constructor

  • attribute: The attribute name to retrieve data from.

Methods

  • run(documentNode): Takes a DOM element documentNode as input, retrieves the value of the specified attribute using the getAttribute method, and passes it to the parent class's run method for further processing.

RetriveByDocumentParsedTextContent

This class retrieves data by parsing the inner HTML of a web document and extracting text content.

Constructor

  • No Arguments: Creates an instance of the RetriveByDocumentParsedTextContent class.

Methods

  • run(documentNode): Takes a DOM element documentNode as input, parses its inner HTML, removes HTML tags, and passes the resulting text content to the parent class's run method for further processing.

RetriveByVisibleElement

This class retrieves data from a visible web element by checking its visibility and dimensions.

Constructor

  • No Arguments: Creates an instance of the RetriveByVisibleElement class.

Methods

  • run(documentNode): Takes a DOM element documentNode as input and checks its visibility and dimensions. It passes the element itself (if visible) or null (if hidden) to the parent class's run method for further processing.

RetriveByTextSplit

This class retrieves data by splitting text content on a specified delimiter.

Constructor

  • splitby: The delimiter to split the text content.

Methods

  • run(content): Takes the text content as input, splits it using the specified delimiter, and passes the resulting array of segments to the parent class's run method for further processing.

RetriveByRegEx

This class retrieves data by matching a regular expression pattern in the text content.

Constructor

  • regex: The regular expression pattern to match.

Methods

  • run(content): Takes the text content as input, matches it against the specified regular expression, and passes the array of matched results to the parent class's run method for further processing.

RetriveByFromArray

This class retrieves data from an array by accessing a specific index.

Constructor

  • index: The index of the element to retrieve from the array.

Methods

  • run(content): Takes an array content as input, retrieves the element at the specified index, and passes it to the parent class's run method for further processing.

RetriveByStringTrim

This class retrieves data by trimming whitespace from a string.

Constructor

  • No Arguments: Creates an instance of the RetriveByStringTrim class.

Methods

  • run(content): Takes a string content as input, trims leading and trailing whitespace from it, and passes the trimmed string to the parent class's run method for further processing.
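
The regular-expression, array-index, and trim methods are typically chained. The sketch below reproduces that chain in plain JavaScript so the data flow is visible. It assumes the regex step behaves like `String.prototype.match` without the global flag, which this README does not confirm; `matchRegex`, `fromArray`, and `trim` are hypothetical helper names, not package exports.

```javascript
// Plain-JavaScript equivalents of three documented steps (assumed behaviour).
const matchRegex = (pattern) => (text) => text.match(new RegExp(pattern));
const fromArray = (index) => (list) => (list ? list[index] : null);
const trim = () => (text) => (typeof text === "string" ? text.trim() : text);

// Run a list of step functions left to right.
const runChain = (chain, input) => chain.reduce((value, step) => step(value), input);

const header = "Jane Doe ( Female | Jan 5, 1990 )";

// Backslashes are doubled because the pattern is written as a string literal.
const genderValue = runChain(
  [matchRegex("\\(\\s*(Male|Female|Other)\\s*\\|"), fromArray(1), trim()],
  header
);
const dobValue = runChain(
  [matchRegex("(\\w+ \\d{1,2}, \\d{4})"), fromArray(1), trim()],
  header
);

console.log(genderValue); // → Female
console.log(dobValue);    // → Jan 5, 1990
```

Index 1 selects the first capture group, since index 0 of a match result holds the full matched text.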

Constants: XPathResultType

A collection of constants representing different XPath result types.

  • ANY_TYPE (number): Represents any type of result.
  • NUMBER_TYPE (number): Represents a number result.
  • STRING_TYPE (number): Represents a string result.
  • BOOLEAN_TYPE (number): Represents a boolean result.
  • UNORDERED_NODE_ITERATOR_TYPE (number): Represents an unordered node iterator result.
  • ORDERED_NODE_ITERATOR_TYPE (number): Represents an ordered node iterator result.
  • UNORDERED_NODE_SNAPSHOT_TYPE (number): Represents an unordered node snapshot result.
  • ORDERED_NODE_SNAPSHOT_TYPE (number): Represents an ordered node snapshot result.
  • ANY_UNORDERED_NODE_TYPE (number): Represents any unordered node result.
  • FIRST_ORDERED_NODE_TYPE (number): Represents the first ordered node result.

Class: RetriveByXpath

Represents a web retrieval method that uses XPath expressions to select nodes from an XML or HTML document. Extends the WebRetrieveMethod class.

Constructor

RetriveByXpath(xpathExpression: string)

Creates a new instance of the RetriveByXpath class with the specified XPath expression.

  • xpathExpression (string): The XPath expression used for node selection.

Properties

  • Inherited from WebRetrieveMethod:
    • type (string): The type of the web retrieval method.
    • parameters (object): A dictionary of parameters for the retrieval method.
    • method (string): The specific retrieval method.
    • debug (object): Debug information associated with the retrieval method.
    • classInstance (string): The class instance identifier.

Methods

setExpression(xpathExpression: string): void

Sets the XPath expression used for node selection.

  • xpathExpression (string): The XPath expression to set.

setNamespaceResolver(namespaceResolver: any): void

Sets the namespace resolver for the XPath expression.

  • namespaceResolver (any): The namespace resolver to set.

setResultType(resultType: number): void

Sets the result type for the XPath expression.

  • resultType (number): The result type to set. Should be one of the constants defined in XPathResultType.

run(contextNode?: Node): any

Executes the retrieval process by evaluating the XPath expression and selecting nodes from the provided context node.

  • contextNode (Node): The context node from which to evaluate the XPath expression. If not specified, the document node will be used as the context.

  • Returns: The retrieved nodes or the result of running the retrieved nodes through the superclass's run method, depending on the result type.
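
For orientation, this is roughly what an XPath evaluation with a snapshot result looks like when written directly against the standard DOM `evaluate` API. It is a sketch of the underlying browser call, not the package's implementation; `selectNodes` is a hypothetical helper name.

```javascript
// Collect all nodes matching an XPath expression into a plain array.
// `doc` is any object exposing the DOM evaluate() method, such as a browser document.
function selectNodes(doc, expression, contextNode) {
  const ORDERED_NODE_SNAPSHOT_TYPE = 7; // standard XPathResult constant
  const result = doc.evaluate(
    expression,
    contextNode || doc, // fall back to the document when no context is given
    null,               // no namespace resolver
    ORDERED_NODE_SNAPSHOT_TYPE,
    null                // let the engine create a new result object
  );
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i += 1) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

In a browser this could be called as `selectNodes(document, '//tr[@class="clickable"]')`. A single-node variant would request `FIRST_ORDERED_NODE_TYPE` and read `singleNodeValue` from the result instead.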


Class: RetriveByXpathSingleNode

Represents a web retrieval method that uses XPath expressions to select a single node from an XML or HTML document. Extends the RetriveByXpath class.

Constructor

RetriveByXpathSingleNode(xpathExpression: string)

Creates a new instance of the RetriveByXpathSingleNode class with the specified XPath expression.

  • xpathExpression (string): The XPath expression used for node selection.

Properties

  • Inherited from RetriveByXpath:
    • type (string): The type of the web retrieval method.
    • parameters (object): A dictionary of parameters for the retrieval method.
    • method (string): The specific retrieval method.
    • debug (object): Debug information associated with the retrieval method.
    • classInstance (string): The class instance identifier.

Methods

  • Inherits all methods from the RetriveByXpath class.

Class: RetriveByXpathMultipleNodes

Represents a web retrieval method that uses XPath expressions to select multiple nodes from an XML or HTML document. Extends the RetriveByXpath class.

Constructor

RetriveByXpathMultipleNodes(xpathExpression: string)

Creates a new instance of the RetriveByXpathMultipleNodes class with the specified XPath expression.

  • xpathExpression (string): The XPath expression used for node selection.

Properties

  • Inherited from RetriveByXpath:
    • type (string): The type of the web retrieval method.
    • parameters (object): A dictionary of parameters for the retrieval method.
    • method (string): The specific retrieval method.
    • debug (object): Debug information associated with the retrieval method.
    • classInstance (string): The class instance identifier.

Methods

  • Inherits all methods from the RetriveByXpath class.

Web Data Retrieval Methods - Query Selector

This project provides two classes that implement web data retrieval methods based on query selector expressions. These methods let users identify elements on a web page with CSS selectors and retrieve specific content from those elements. The classes are implemented in JavaScript and extend the WebRetrieveMethod class, which provides common functionality for data retrieval.

RetriveByQuerySelector

This class retrieves data by using the querySelector method to select a single element on the web page.

Constructor

  • selectorExpression: The CSS-like selector expression to identify the target element.

Methods

  • run(contextNode): Takes an optional DOM element contextNode as input (defaults to the document) and calls querySelector with the specified selector expression. It passes the matching element, or null if nothing matches, to the parent class's run method for further processing.

RetriveByQuerySelectorAll

This class retrieves data by using the querySelectorAll method to select multiple elements on the web page.

Constructor

  • selectorExpression: The CSS-like selector expression to identify the target elements.

Methods

  • run(contextNode): Takes an optional DOM element contextNode as input (defaults to the document) and calls querySelectorAll with the specified selector expression. It passes an array of all matching elements to the parent class's run method for further processing.
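
The two selector-based methods correspond to the standard DOM calls shown below. This is a sketch of the underlying behaviour under the assumption that the package delegates to these calls; `selectOneSketch` and `selectAllSketch` are hypothetical names.

```javascript
// Single element: returns the first match, or null when nothing matches.
function selectOneSketch(selector, contextNode) {
  const root = contextNode ?? globalThis.document; // default to the document
  return root.querySelector(selector);
}

// Multiple elements: querySelectorAll returns a NodeList,
// so copy it into a real array before handing it on.
function selectAllSketch(selector, contextNode) {
  const root = contextNode ?? globalThis.document;
  return Array.from(root.querySelectorAll(selector));
}
```

Converting the NodeList to an array lets later steps, such as picking an index, treat the result like any other list.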

Note that this documentation is an overview of the classes and what their methods do. For implementation details, and for how the classes fit into a larger web data retrieval system, refer to the complete source code.

Example

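// Assumption: the classes are named exports of the package root.
// Adjust this import to match the actual module layout.
const {
  WebRule, WebStructure, WebFields, WebFieldItem,
  RetriveByDocumentTextContent, RetriveByXpathSingleNode, RetriveByXpathMultipleNodes,
  RetriveByRegEx, RetriveByFromArray, RetriveByStringTrim,
} = require('@data-master/web-extractor');

// Create retrieval methods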
const retriveTxtContent = new RetriveByDocumentTextContent();
const retriveuserListBody = new RetriveByXpathSingleNode('//*[@id="id_user_list_body"]');
const retrieveuserRows = new RetriveByXpathMultipleNodes('//tr[@class="clickable"]');
const retriveName = new RetriveByXpathSingleNode('td[2]');

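// Create web structures
const userInfoStructure = new WebStructure("user");
userInfoStructure.addSteps(retriveuserListBody);
userInfoStructure.addSteps(retrieveuserRows);

// Structure for the user header block, used as the source of the header fields.
// The id and XPath here are placeholders (hypothetical); point them at the real header element.
const userHeaderInfoStructure = new WebStructure("user-header");
userHeaderInfoStructure.addSteps(new RetriveByXpathSingleNode('//*[@id="id_user_header"]'));
userHeaderInfoStructure.addSteps(retriveTxtContent);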

// Create field items
const fnameFieldItem = new WebFieldItem('fname');
fnameFieldItem.addSteps(new RetriveByXpathSingleNode('td[3]'));
fnameFieldItem.addSteps(retriveTxtContent);

const lnameFieldItem = new WebFieldItem('lname');
lnameFieldItem.addSteps(new RetriveByXpathSingleNode('td[2]'));
lnameFieldItem.addSteps(retriveTxtContent);

// Create web fields
const nameGenderDOBFields = new WebFields('name-dob-gender');
nameGenderDOBFields.setSource(userInfoStructure);
nameGenderDOBFields.addFieldItem(fnameFieldItem);
nameGenderDOBFields.addFieldItem(lnameFieldItem);

// Create field items for user header info
const fullname = new WebFieldItem('fullname');
// Backslashes are doubled so the string literal keeps \s as a regex escape
fullname.addSteps(new RetriveByRegEx("([^\\s]+) \\([^)]+\\)"));
fullname.addSteps(new RetriveByFromArray(1));
fullname.addSteps(new RetriveByStringTrim());

const gender = new WebFieldItem('gender');
gender.addSteps(new RetriveByRegEx("\\(\\s*(Male|Female|Other|Unknown|Declined to Specify)\\s*\\|"));
gender.addSteps(new RetriveByFromArray(1));
gender.addSteps(new RetriveByStringTrim());

const dob = new WebFieldItem('dob');
// Pattern is passed as a string: no surrounding slashes, backslashes doubled
dob.addSteps(new RetriveByRegEx("(\\w+ \\d{1,2}, \\d{4})"));
dob.addSteps(new RetriveByFromArray(1));
dob.addSteps(new RetriveByStringTrim());

// Create web fields for user header info
const nogdF = new WebFields('nogd');
nogdF.setSource(userHeaderInfoStructure);
nogdF.addFieldItem(fullname);
nogdF.addFieldItem(gender);
nogdF.addFieldItem(dob);

// Create web scraping rule
const scrapRule = new WebRule("23232");
scrapRule.version = 3;
scrapRule.addStructure(userInfoStructure);
scrapRule.addStructure(userHeaderInfoStructure);
scrapRule.addFields(nameGenderDOBFields);
scrapRule.addFields(nogdF);

// Run the web scraping rule
scrapRule.run();

This example demonstrates the usage of various web retrieval methods, web structures, field items, web fields, and a web scraping rule to retrieve and extract data from a web page. The retrieved data can be further processed or used as needed.