Mercury Parser - Extracting content from chaos

Postlight's Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Mercury Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/mercury-parser

# If you're using npm
npm install @postlight/mercury-parser

Usage

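import Mercury from '@postlight/mercury-parser';

// Any article URL works; this one matches the sample result below.
const url = 'https://en.wikipedia.org/wiki/Thunder_(mascot)';

Mercury.parse(url).then(result => console.log(result));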

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Mercury is unable to find a field, that field's value will be null.
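Because any field may be null, calling code usually needs defaults before it renders a result. The following is a minimal sketch of one way to do that. `summarizeArticle` and its fallback strings are hypothetical and not part of Mercury Parser; only the field names come from the sample result above.

```javascript
// Hypothetical helper (not part of Mercury Parser): builds a one-line
// summary from a parse result, substituting defaults for null fields.
function summarizeArticle(result) {
  const title = result.title ?? 'Untitled';
  const author = result.author ?? 'Unknown author';
  // date_published is an ISO 8601 string or null; keep only the date part.
  const date = result.date_published
    ? result.date_published.slice(0, 10)
    : 'undated';
  return `${title} by ${author} (${date})`;
}

// Example input with a missing author, as Mercury may return it.
const summary = summarizeArticle({
  title: 'Thunder (mascot)',
  author: null,
  date_published: '2016-09-16T20:56:00.000Z',
});
console.log(summary);
```

In real use the argument would be the object that `Mercury.parse(url)` resolves to.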

`parse()` Options

Content Formats

By default, Mercury Parser returns the content field as HTML. You can override this by passing options to the parse function. The options control whether to scrape all pages of an article and which type of output to return via contentType (valid values are 'html', 'markdown', and 'text'). For example:

Mercury.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
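A typo in the content type is easy to make, so it can help to check the value before calling parse. The sketch below shows one way to do that. `contentTypeOptions` is a hypothetical helper, not part of Mercury Parser, and it assumes that only the three values listed above are valid.

```javascript
// The three content types named in this README.
const CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical helper (not part of Mercury Parser): returns an options
// object for Mercury.parse(), rejecting unknown content types early.
function contentTypeOptions(contentType = 'html') {
  if (!CONTENT_TYPES.includes(contentType)) {
    throw new RangeError(
      `Unknown contentType "${contentType}"; expected one of ${CONTENT_TYPES.join(', ')}`
    );
  }
  return { contentType };
}

// Intended use: Mercury.parse(url, contentTypeOptions('text'))
console.log(contentTypeOptions('text'));
```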

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Mercury.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
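When cookies are held as separate name-value pairs, the Cookie header value has to be joined into a single string first. Here is a small sketch of that step. `cookieHeader` is a hypothetical helper, not part of Mercury Parser, and it assumes the values need no extra encoding.

```javascript
// Hypothetical helper (not part of Mercury Parser): joins cookie
// name-value pairs into a single Cookie header value.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

const headers = {
  Cookie: cookieHeader({ name: 'value', name2: 'value2', name3: 'value3' }),
};

// Intended use: Mercury.parse(url, { headers })
console.log(headers.Cookie);
```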

Pre-fetched HTML

You can use Mercury Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Mercury.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that you must still supply the URL argument. It identifies the website so that its custom parser, if it has one, can be used, but it is not used to fetch content.

The command-line parser

Mercury Parser also ships with a CLI, meaning you can use the Mercury Parser from your command line like so:

# Install Mercury globally
yarn global add @postlight/mercury-parser
#   or
npm -g install @postlight/mercury-parser

# Then
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"
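If you call the CLI from a script, building the argument list in code avoids shell-quoting mistakes in selectors and header values. The sketch below assembles the flags shown above. `buildCliArgs` and its option names (`format`, `headers`, `extend`, `extendList`) are hypothetical and not part of Mercury Parser; only the flags themselves come from the examples above.

```javascript
// Hypothetical helper (not part of Mercury Parser): assembles the
// mercury-parser command-line arguments described above.
function buildCliArgs(
  url,
  { format, headers = {}, extend = {}, extendList = {} } = {}
) {
  const args = [url];
  if (format) args.push(`--format=${format}`);
  // One --header.name=value argument per custom request header.
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }
  // --extend name=selector adds a custom field with a single match.
  for (const [name, selector] of Object.entries(extend)) {
    args.push('--extend', `${name}=${selector}`);
  }
  // --extend-list name=selector adds a custom field with multiple matches.
  for (const [name, selector] of Object.entries(extendList)) {
    args.push('--extend-list', `${name}=${selector}`);
  }
  return args;
}

const cliArgs = buildCliArgs(
  'https://postlight.com/trackchanges/mercury-goes-open-source',
  { format: 'markdown', extendList: { links: '.body a|href' } }
);
console.log(cliArgs);
```

The resulting array can be passed as the argument list to Node's `child_process.execFile('mercury-parser', cliArgs, callback)`, which does not go through a shell, so the selector needs no extra quoting.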

License

Licensed under either of the below, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

Contributing

For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.

🔬 A Labs project from your friends at Postlight. Happy coding!