Thank you for your interest in contributing to Mercury Parser! It's people like you that make Mercury such a useful tool. The below guidelines will help answer any questions you may have about the contribution process. We look forward to receiving contributions from you — our community!
Thank you for your interest in contributing to Postlight Parser! It's people like you that make this such a useful tool. The below guidelines will help answer any questions you may have about the contribution process. We look forward to receiving contributions from you — our community!
_Please read our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating._
_Please read our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating._
## Contents
## Contents
- [Contributing to Mercury Parser](#contributing-to-mercury-parser)
- [Contributing to Postlight Parser](#contributing-to-postlight-parser)
- [Contents](#contents)
- [Contents](#contents)
- [Ways to Contribute](#ways-to-contribute)
- [Ways to Contribute](#ways-to-contribute)
- [Reporting a Bug](#reporting-a-bug)
- [Reporting a Bug](#reporting-a-bug)
@ -27,15 +27,15 @@ _Please read our [Code of Conduct](./CODE_OF_CONDUCT.md) before participating._
## Ways to Contribute
## Ways to Contribute
There are many ways you can contribute to the Mercury community. We value each type
There are many ways you can contribute to the Postlight Parser community. We value each type
of contribution and appreciate your help.
of contribution and appreciate your help.
Here are a few examples of what we consider a contribution:
Here are a few examples of what we consider a contribution:
- Updates to source code, including bug fixes, improvements, or [creating new custom site extractors](./src/extractors/custom/README.md)
- Updates to source code, including bug fixes, improvements, or [creating new custom site extractors](./src/extractors/custom/README.md)
- Answering questions and chatting with the community in the [Gitter](https://gitter.im/postlight/mercury) room
- Answering questions and chatting with the community in the [Gitter](https://gitter.im/postlight/mercury) room
- Filing, organizing, and commenting on issues in the [issue tracker](https://github.com/postlight/mercury-parser/issues)
- Filing, organizing, and commenting on issues in the [issue tracker](https://github.com/postlight/parser/issues)
- Teaching others how to use Mercury
- Teaching others how to use Postlight Parser
- Community building and outreach
- Community building and outreach
## Reporting a Bug
## Reporting a Bug
@ -49,41 +49,41 @@ as it's possible that someone else has already reported the error. This doesn't
always work, and sometimes it's hard to know what to search for, so consider
always work, and sometimes it's hard to know what to search for, so consider
this extra credit. We won't mind if you accidentally file a duplicate report.
this extra credit. We won't mind if you accidentally file a duplicate report.
Opening an issue is as easy as following [this link](https://github.com/postlight/mercury-parser/issues/new)
Opening an issue is as easy as following [this link](https://github.com/postlight/parser/issues/new)
and filling out the template.
and filling out the template.
### Security
### Security
If you find a security bug in Mercury, send an email with a descriptive subject line
If you find a security bug in Postlight Parser, send an email with a descriptive subject line
to [mercury+security@postlight.com](mailto:mercury+security@postlight.com). If you think
to [mercury+security@postlight.com](mailto:mercury+security@postlight.com). If you think
you’ve found a serious vulnerability, please do not file a public issue or share in the Mercury Gitter room.
you’ve found a serious vulnerability, please do not file a public issue or share in the Postlight Parser Gitter room.
Your report will go to Mercury's core development team. You will receive
Your report will go to Postlight Parser's core development team. You will receive
acknowledgement of the report in 24-48 hours, and our next steps should be to
acknowledgement of the report in 24-48 hours, and our next steps should be to
release a fix. If you don’t get a report acknowledgement in 48 hours, send an email to
release a fix. If you don’t get a report acknowledgement in 48 hours, send an email to
To request a change to the way that Mercury works, please open an issue in this repository named, "Feature Request: [Your Feature Idea]," followed by your suggestion.
To request a change to the way that Postlight Parser works, please open an issue in this repository named, "Feature Request: [Your Feature Idea]," followed by your suggestion.
## Development Workflow
## Development Workflow
This section of the document outlines how to build, run, and test Mercury locally.
This section of the document outlines how to build, run, and test Postlight Parser locally.
### Building
### Building
To build the Mercury Parser locally, execute the following commands:
To build the Postlight Parser locally, execute the following commands:
Mercury is a test-driven application; each component has its own test file. Tests are run for both node and web builds. Our testing frameworks are:
Postlight Parser is a test-driven application; each component has its own test file. Tests are run for both node and web builds. Our testing frameworks are:
- `Jest` for the node build
- `Jest` for the node build
- `Karma` for the web build
- `Karma` for the web build
@ -143,9 +143,9 @@ preset. This helps keep our Markdown tidy and consistent.
### Node.js Version Requirements
### Node.js Version Requirements
Mercury is built against Node `>= v12.8.1`. Since this is the
Postlight Parser is built against Node `>= v12.8.1`. Since this is the
version we run in our CI environments, we recommend you use it when working on
version we run in our CI environments, we recommend you use it when working on
the Mercury codebase.
the codebase.
If you use [nvm](https://github.com/creationix/nvm) to manage Node.js versions
If you use [nvm](https://github.com/creationix/nvm) to manage Node.js versions
and zsh (like [Oh-My-ZSH](https://github.com/robbyrussell/oh-my-zsh)), you can
and zsh (like [Oh-My-ZSH](https://github.com/robbyrussell/oh-my-zsh)), you can
@ -176,12 +176,12 @@ load-nvmrc
## Writing Documentation
## Writing Documentation
Improvements to documentation are a great way to start contributing to Mercury. The
Improvements to documentation are a great way to start contributing to Postlight Parser. The
source for the official documentation is the set of Markdown files that live in this repository.
source for the official documentation is the set of Markdown files that live in this repository.
## Submitting a Pull Request
## Submitting a Pull Request
Want to make a change to Mercury? Submit a pull request! We use the "fork and pull"
Want to make a change to Postlight Parser? Submit a pull request! We use the "fork and pull"
model [described here](https://help.github.com/articles/creating-a-pull-request-from-a-fork).
model [described here](https://help.github.com/articles/creating-a-pull-request-from-a-fork).
**Before submitting a pull request**, please make sure:
**Before submitting a pull request**, please make sure:
@ -203,7 +203,7 @@ Commit messages should follow the format outlined below:
| chore | does not affect the production version of the app in any way. |
| chore | does not affect the production version of the app in any way. |
| deps | add, update, or remove a dependency. |
| deps | add, update, or remove a dependency. |
| doc | add, update, or remove documentation. no code changes. |
| doc | add, update, or remove documentation. no code changes. |
| dx | improve the development experience of mercury core. |
| dx | improve the development experience of parser core. |
| feat | a feature or enhancement. can be incredibly small. |
| feat | a feature or enhancement. can be incredibly small. |
| fix | a bug fix for something that was broken. |
| fix | a bug fix for something that was broken. |
| perf | add, update, or fix a test. |
| perf | add, update, or fix a test. |
@ -222,9 +222,9 @@ fall behind. Feel free to reach out to the core team if you have not received a
Some useful places to look for information are:
Some useful places to look for information are:
- The main [README](./README.md) for this repository.
- The main [README](./README.md) for this repository.
- The Mercury Custom Parser [README](./src/extractors/custom/README.md).
- The Postlight Custom Parser [README](./src/extractors/custom/README.md).
- The postlight/mercury room on [Gitter](https://gitter.im/postlight/mercury)
- The postlight/mercury room on [Gitter](https://gitter.im/postlight/mercury)
- The Mercury Parser API [repository](https://github.com/postlight/mercury-parser-api).
- The Postlight Parser API [repository](https://github.com/postlight/parser-api).
_Adapted from [Contributing to Node.js](https://github.com/nodejs/node/blob/master/CONTRIBUTING.md)
_Adapted from [Contributing to Node.js](https://github.com/nodejs/node/blob/master/CONTRIBUTING.md)
and [ThinkUp Security and Data Privacy](http://thinkup.readthedocs.io/en/latest/install/security.html#thinkup-security-and-data-privacy)._
and [ThinkUp Security and Data Privacy](http://thinkup.readthedocs.io/en/latest/install/security.html#thinkup-security-and-data-privacy)._
[Postlight](https://postlight.com)'s Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.
[Postlight](https://postlight.com)'s Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.
Mercury Parser powers the [Mercury AMP Converter](https://mercury.postlight.com/amp-converter/) and [Mercury Reader](https://mercury.postlight.com/reader/), a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.
Postlight Parser powers [Postlight Reader](https://reader.postlight.com/), a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.
Mercury Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are [many examples available](https://github.com/postlight/mercury-parser/tree/master/src/extractors/custom) along with [documentation](https://github.com/postlight/mercury-parser/blob/master/src/extractors/custom/README.md).
Postlight Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are [many examples available](https://github.com/postlight/parser/tree/master/src/extractors/custom) along with [documentation](https://github.com/postlight/parser/blob/master/src/extractors/custom/README.md).
## How? Like this.
## How? Like this.
@ -22,21 +22,21 @@ Mercury Parser allows you to easily create custom parsers using simple JavaScrip
// NOTE: When used in the browser, you can omit the URL argument
// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Mercury.parse()` to parse the current page.
// and simply run `Parser.parse()` to parse the current page.
```
```
The result looks like this:
The result looks like this:
@ -60,16 +60,16 @@ The result looks like this:
}
}
```
```
If Mercury is unable to find a field, that field will return `null`.
If Parser is unable to find a field, that field will return `null`.
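Because missing fields come back as `null`, callers may want to check for them before using a result. The following is a minimal sketch of one way to do that, not part of the library itself: the `result` object is a hand-written stand-in (its field names are assumptions for illustration, not captured parser output), and `missingFields` and `withFallbacks` are hypothetical helpers.

```javascript
// Hypothetical result object for illustration only; the field names and
// values are assumptions, not output from a real parse.
const result = {
  title: 'Example article',
  author: null,
  date_published: null,
  content: '<p>Hello</p>',
};

// Return the names of all fields whose value is exactly null.
function missingFields(parsed) {
  return Object.keys(parsed).filter(key => parsed[key] === null);
}

// Return a shallow copy in which null fields are replaced by the
// caller-supplied fallback values, where a fallback exists.
function withFallbacks(parsed, fallbacks) {
  const filled = { ...parsed };
  for (const key of missingFields(parsed)) {
    if (key in fallbacks) {
      filled[key] = fallbacks[key];
    }
  }
  return filled;
}

// Logs the two null fields: author and date_published.
console.log(missingFields(result));

// Fills in the author but leaves date_published as null.
console.log(withFallbacks(result, { author: 'Unknown' }));
```

Comparing against `null` explicitly, rather than testing for falsy values, keeps legitimately empty strings or zero values from being treated as missing.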
#### `parse()` Options
#### `parse()` Options
##### Content Formats
##### Content Formats
By default, Mercury Parser returns the `content` field as HTML. However, you can override this behavior by passing in options to the `parse` function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are `'html'`, `'markdown'`, and `'text'`). For example:
By default, Postlight Parser returns the `content` field as HTML. However, you can override this behavior by passing in options to the `parse` function, specifying whether or not to scrape all pages of an article, and what type of output to return (valid values are `'html'`, `'markdown'`, and `'text'`). For example:
@ -85,7 +85,7 @@ This returns the page's `content` as GitHub-flavored Markdown:
You can include custom headers in requests by passing name-value pairs to the `parse` function as follows:
You can include custom headers in requests by passing name-value pairs to the `parse` function as follows:
```javascript
```javascript
Mercury.parse(url, {
Parser.parse(url, {
headers: {
headers: {
Cookie: 'name=value; name2=value2; name3=value3',
Cookie: 'name=value; name2=value2; name3=value3',
'User-Agent':
'User-Agent':
@ -96,10 +96,10 @@ Mercury.parse(url, {
##### Pre-fetched HTML
##### Pre-fetched HTML
You can use Mercury Parser to parse custom or pre-fetched HTML by passing an HTML string to the `parse` function as follows:
You can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the `parse` function as follows:
```javascript
```javascript
Mercury.parse(url, {
Parser.parse(url, {
html:
html:
'<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
'<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));
}).then(result => console.log(result));
@ -109,37 +109,36 @@ Note that the URL argument is still supplied, in order to identify the web site
#### The command-line parser
#### The command-line parser
Mercury Parser also ships with a CLI, meaning you can use the Mercury Parser
Postlight Parser also ships with a CLI, meaning you can use it from your command line like so:
# Pass optional --header.name=value arguments to include custom headers in the request
# Pass optional --header.name=value arguments to include custom headers in the request
mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
# Pass optional --extend argument to add a custom type to the response
# Pass optional --extend argument to add a custom type to the response
@ -153,7 +152,7 @@ Licensed under either of the below, at your preference:
## Contributing
## Contributing
For details on how to contribute to Mercury, including how to write a custom content extractor for any site, see [CONTRIBUTING.md](./CONTRIBUTING.md)
For details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see [CONTRIBUTING.md](./CONTRIBUTING.md)
Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.
Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.
git checkout -b release-1.x.x # (where 1.x.x reflects the release)
```bash
```
git checkout -b release-1.x.x # (where 1.x.x reflects the release)
```
2. Update package.json with the version number
2. Update package.json with the version number
3. Build the release
3. Build the release
```bash
yarn release
```bash
```
yarn release
```
4. Update the changelog
4. Update the changelog
```bash
```bash
# Copy the output of the command below and paste it into CHANGELOG.md
# Copy the output of the command below and paste it into CHANGELOG.md
# following the conventions of that file
# following the conventions of that file
yarn changelog-maker postlight mercury-parser
yarn changelog-maker postlight parser
```
```
5. Submit a PR
5. Submit a PR
6. Merge once the PR's tests pass
6. Merge once the PR's tests pass
7. [Create a release](https://github.com/postlight/mercury-parser/releases), linking to this release's entry in the changelog. (See other releases for context.)
7. [Create a release](https://github.com/postlight/parser/releases), linking to this release's entry in the changelog. (See other releases for context.)
"description":"Mercury transforms web pages into clean text. Publishers and programmers use it to make the web make sense, and readers use it to read any web article comfortably.",
"description":"Postlight Parser transforms web pages into clean text. Publishers and programmers use it to make the web make sense, and readers use it to read any web article comfortably.",
Mercury can extract meaningful content from almost any web site, but custom parsers/extractors allow the Mercury Parser to find the content more quickly and more accurately than it might otherwise do. Our goal is to include custom parsers for as many sites as we can, and we'd love your help!
Postlight Parser can extract meaningful content from almost any web site, but custom parsers/extractors allow the Postlight Parser to find the content more quickly and more accurately than it might otherwise do. Our goal is to include custom parsers for as many sites as we can, and we'd love your help!
## The basics of parsing a site with a custom parser
## The basics of parsing a site with a custom parser
Custom parsers allow you to write CSS selectors that will find the content you're looking for on the page you're testing against. If you've written any CSS or jQuery, CSS selectors should be very familiar to you.
Custom parsers allow you to write CSS selectors that will find the content you're looking for on the page you're testing against. If you've written any CSS or jQuery, CSS selectors should be very familiar to you.
You can query for every field returned by the Mercury Parser:
You can query for every field returned by the Postlight Parser:
As you might guess, the selectors key provides an array of selectors that Mercury will check to find your title text. In our `ExampleExtractor`, we're saying that the title can be found in the text of an `h1` header with a class name of `hed`.
As you might guess, the selectors key provides an array of selectors that Postlight Parser will check to find your title text. In our `ExampleExtractor`, we're saying that the title can be found in the text of an `h1` header with a class name of `hed`.
The selector you choose should return one element. If more than one element is returned by your selector, it will fail (and Mercury will fall back to its generic extractor).
The selector you choose should return one element. If more than one element is returned by your selector, it will fail (and Parser will fall back to its generic extractor).
Because the `selectors` property returns an array, you can write more than one selector for a property extractor. This is particularly useful for sites that have multiple templates for articles. If you provide an array of selectors, Mercury will try each in order, falling back to the next until it finds a match or exhausts the options (in which case it will fall back to its default generic extractor).
Because the `selectors` property returns an array, you can write more than one selector for a property extractor. This is particularly useful for sites that have multiple templates for articles. If you provide an array of selectors, Parser will try each in order, falling back to the next until it finds a match or exhausts the options (in which case it will fall back to its default generic extractor).
As you can see, to pass this test, we need to fill out our title selector. In order to do this, you need to know what your selector is. To do this, open the html fixture the generator downloaded for you in the [`fixtures`](/fixtures) directory. In our example, that file is `fixtures/www.newyorker.com/1475248565793.html`. Now open that file in your web browser.
As you can see, to pass this test, we need to fill out our title selector. In order to do this, you need to know what your selector is. To do this, open the html fixture the generator downloaded for you in the [`fixtures`](/fixtures) directory. In our example, that file is `fixtures/www.newyorker.com/1475248565793.html`. Now open that file in your web browser.
The page should look more or less exactly like the site you pointed it to, but this version is downloaded locally for test purposes. (You should always look for selectors using this local fixture rather than the actual web site; some sites re-write elements after the page loads, and we want to make sure we're looking at the page the same way Mercury will be.)
The page should look more or less exactly like the site you pointed it to, but this version is downloaded locally for test purposes. (You should always look for selectors using this local fixture rather than the actual web site; some sites re-write elements after the page loads, and we want to make sure we're looking at the page the same way Postlight Parser will be.)
(For the purpose of this guide, we're going to assume you're using Chrome as your default browser; any browser should do, but we're going to refer specifically to Chrome's developer tools in this guide.)
(For the purpose of this guide, we're going to assume you're using Chrome as your default browser; any browser should do, but we're going to refer specifically to Chrome's developer tools in this guide.)
@ -302,7 +302,7 @@ AssertionError: 'Hacking, Cryptography, and the Countdown to Quantum Computing'
'Schrödinger’s Hack';
'Schrödinger’s Hack';
```
```
When Mercury generated our test, it took a guess at the page's title, and in this case, it got it wrong. So update the test with the title we expect, save it, and your test should pass!
When Parser generated our test, it took a guess at the page's title, and in this case, it got it wrong. So update the test with the title we expect, save it, and your test should pass!