Mercury Parser - Extracting content from chaos #parser #url #html #extractor
Adam Pash 15f7fa1e27 a more explicit .prettierrc 5 years ago
.circleci feat: hook up ci to publish to npm (#226) 5 years ago
.github docs: PR and Issue templates (#211) 5 years ago
dist deps: upgrade (#218) 5 years ago
fixtures feat: add fortinet custom parser (#188) 5 years ago
scripts feat: hook up ci to publish to npm (#226) 5 years ago
src docs: cleanup and update docs (#238) 5 years ago
.agignore chore: renamed iris to mercury 8 years ago
.babelrc chore: update node rollup config (#229) 5 years ago
.eslintignore Feat: browser support (#19) 8 years ago
.eslintrc deps: upgrade (#218) 5 years ago
.gitattributes fix: i put a bad comment in .gitattributes (#125) 7 years ago
.gitignore dx: include test results in comment (#230) 5 years ago
.nvmrc chore: update node and some deps (#209) 5 years ago
.prettierrc a more explicit .prettierrc 5 years ago
.remarkrc feat: add remarklint for md docs (#213) 5 years ago
CHANGELOG.md release: 1.0.13 (#183) 6 years ago
CODE_OF_CONDUCT.md docs: add code of conduct (#204) 5 years ago
CONTRIBUTING.md docs: cleanup and update docs (#238) 5 years ago
LICENSE-APACHE docs: add license files (#217) 5 years ago
LICENSE-MIT docs: add license files (#217) 5 years ago
README.md docs: cleanup and update docs (#238) 5 years ago
RELEASE.md docs: document release process (#186) 5 years ago
appveyor.yml Feat: improving ci (#16) 8 years ago
karma.conf.js deps: upgrade (#218) 5 years ago
package.json feat: hook up ci to publish to npm (#226) 5 years ago
preview feat: preview with optional rebuild (#36) 8 years ago
rollup.config.js deps: upgrade (#218) 5 years ago
rollup.config.web.js deps: upgrade (#218) 5 years ago
score-move chore: refactored and linted 8 years ago
yarn.lock feat: hook up ci to publish to npm (#226) 5 years ago

README.md

Mercury Parser - Extracting content from chaos

CircleCI Build status Apache License MIT License

The Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

How? Like this.

Installation

yarn add @postlight/mercury-parser

Usage

import Mercury from '@postlight/mercury-parser';

Mercury.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.
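
Because `Mercury.parse` returns a promise, it can also be consumed with `async`/`await`. The sketch below is illustrative only: `parseArticle` is a hypothetical wrapper, not part of Mercury's API, and it takes the parser as an argument so it works with any object that exposes a promise-returning `parse(url)`. It assumes failures arrive as a rejected promise, which this README does not confirm.

```javascript
// Hypothetical helper, not part of Mercury's API.
// `parser` is any object with a parse(url) method that returns a promise,
// for example the `Mercury` import shown above.
async function parseArticle(parser, url) {
  try {
    const result = await parser.parse(url);
    return { ok: true, result };
  } catch (err) {
    // Assumption: failures surface as a rejected promise.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}
```

With the import from the Usage section, a call would look like `const outcome = await parseArticle(Mercury, url);`, after which `outcome.ok` tells the caller whether `outcome.result` or `outcome.error` is populated.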

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "<div><div><p>This is the content of the page!</p></div></div>",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Mercury is unable to find a field, that field will return null.
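
Since any field Mercury cannot find comes back as `null`, code that consumes a result usually wants explicit fallbacks. A minimal sketch, assuming a result shaped like the example above; `summarize` and its placeholder strings are hypothetical and not part of Mercury:

```javascript
// Hypothetical helper, not part of Mercury's API.
// Builds a display-ready summary from a parse result, substituting
// fallbacks for fields that came back as null.
function summarize(result) {
  return {
    title: result.title ?? '(untitled)',
    author: result.author ?? 'Unknown author',
    // date_published is an ISO 8601 string in the example result, or null.
    published: result.date_published ? new Date(result.date_published) : null,
    image: result.lead_image_url ?? null,
    words: result.word_count ?? 0,
  };
}

// Sample input copied from the example result above.
const summary = summarize({
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  date_published: '2016-09-16T20:56:00.000Z',
  lead_image_url: null,
  word_count: 4677,
});
console.log(summary.title, summary.image, summary.words);
```

Using `??` rather than `||` keeps legitimate falsy values, such as a `word_count` of `0`, from being replaced by the fallback.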

License

Licensed under either of the below, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

Contributing

For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.