Mercury Parser - Extracting content from chaos #parser #url #html #extractor

Mercury Parser

The Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Mercury Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/mercury-parser

# If you're using npm
npm install @postlight/mercury-parser

Usage

import Mercury from '@postlight/mercury-parser';

Mercury.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "<div><div><p>This is the content of the page!</p></div></div>",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Mercury is unable to find a field, that field will be null in the result.
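Because any field can come back as null, code that consumes a result usually wants fallbacks. The sketch below is one way to do that, not part of Mercury Parser itself: `summarizeResult` and `orDefault` are hypothetical helpers, the default strings are arbitrary, and the sample object is trimmed from the result shown above.

```javascript
// Hypothetical helper (not part of Mercury Parser): returns `fallback`
// only when `value` is null or undefined, so 0 and '' are preserved.
function orDefault(value, fallback) {
  return value === null || value === undefined ? fallback : value;
}

// Hypothetical helper: builds a display-ready summary from a parse result.
// The field names match the result shown above; the defaults are assumptions.
function summarizeResult(result) {
  return {
    title: orDefault(result.title, 'Untitled'),
    author: orDefault(result.author, 'Unknown author'),
    // date_published is an ISO 8601 string when present.
    published: result.date_published ? new Date(result.date_published) : null,
    image: orDefault(result.lead_image_url, null),
    wordCount: orDefault(result.word_count, 0),
  };
}

// Sample input trimmed from the result shown above.
const sample = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  date_published: '2016-09-16T20:56:00.000Z',
  lead_image_url: null,
  word_count: 4677,
};

const summary = summarizeResult(sample);
console.log(summary.title, '-', summary.author);
```

In real use, the object passed to `summarizeResult` would be the value that `Mercury.parse(url)` resolves to.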

Mercury Parser also ships with a CLI, so you can run it from your command line like so:

# Install Mercury globally
yarn global add @postlight/mercury-parser
#   or
npm -g install @postlight/mercury-parser

# Then
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source

License

Licensed under either of the following, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

Contributing

For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.