mirror of
https://github.com/postlight/mercury-parser
synced 2024-11-01 21:40:16 +00:00
8f42e119e8
Squashed commit of the following: commit deaf9e60d031d9ee06e74b8c0895495b187032a5 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Sep 20 10:31:09 2016 -0400 chore: README for custom parsers commit a8e8ad633e0d1576a52dbc90ce31b98fb2ec21ee Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 23:36:09 2016 -0400 draft of readme commit 4f0f463f821465c282ce006378e5d55f8f41df5f Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 17:56:34 2016 -0400 custom extractor used to build basic parser for theatlantic commit c5562a3cede41f56c4e723dcfa1181b49dcaae4d Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 17:20:13 2016 -0400 pre-commit to test custom parser generator commit 7d50d5b7ab780b79fae38afcb87a7d1da5d139b2 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 17:19:55 2016 -0400 feat: added nytimes parser commit 58b8d83a56927177984ddfdf70830bc4f328f200 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 17:17:28 2016 -0400 feat: can do fuzzy search or go straight to file commit c99add753723a8e2ac64d51d7379ac8e23125526 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Sep 19 10:52:26 2016 -0400 refactored export for custom extractors for easier renames commit 22563413669651bb497f1bb2a92085b71f2ae324 Author: Adam Pash <adam.pash@gmail.com> Date: Fri Sep 16 17:36:13 2016 -0400 feat: custom extractor generation in place commit 2285a29908a7f82a5de3c81f6b2b902ddec9bdaa Author: Adam Pash <adam.pash@gmail.com> Date: Fri Sep 16 16:42:20 2016 -0400 good progress
85 lines
2.1 KiB
Markdown
Each extractor should ultimately be an object that exports like so:

```javascript
import GenericContentExtractor from './content/extractor'
import GenericTitleExtractor from './title/extractor'
import GenericAuthorExtractor from './author/extractor'
import GenericDatePublishedExtractor from './date-published/extractor'
import GenericDekExtractor from './dek/extractor'
import GenericLeadImageUrlExtractor from './lead-image-url/extractor'

const GenericExtractor = {
  content: GenericContentExtractor,
  title: GenericTitleExtractor,
  author: GenericAuthorExtractor,
  datePublished: GenericDatePublishedExtractor,
  dek: GenericDekExtractor,
  leadImageUrl: GenericLeadImageUrlExtractor,
}
```

Custom parsers can then be merged with the generic parser, which fills in any gaps in their implementations. For example:

```javascript
import NYMagContentExtractor from '...'
import NYMagTitleExtractor from '...'

const NYMagExtractor = {
  content: NYMagContentExtractor,
  title: NYMagTitleExtractor,
}

const Extractor = {
  ...GenericExtractor,
  ...NYMagExtractor
}
```
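
The order of the spreads matters: later properties win, so fields defined by the custom extractor override the generic ones, and fields it omits fall through to the generic implementation. A minimal sketch of that behavior, using placeholder strings (hypothetical values) in place of real extractor objects:

```javascript
// Placeholder strings stand in for real extractor objects (hypothetical).
const GenericExtractor = {
  content: 'generic content',
  title: 'generic title',
  author: 'generic author',
}

const NYMagExtractor = {
  content: 'nymag content',
  title: 'nymag title',
}

// Later spreads win, so the custom extractor must come last.
const Extractor = {
  ...GenericExtractor,
  ...NYMagExtractor,
}

console.log(Extractor.content) // prints: nymag content
console.log(Extractor.author) // prints: generic author
```

Reversing the two spreads would let the generic extractor overwrite the custom fields, which defeats the purpose of the merge.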

# Declarative Custom Extractors

My goal is to create declarative extractors that describe what to extract rather than how to extract it. For example:

```javascript
const NYMagExtractor = {
  content: {
    // Order by most likely. Extractor will stop on first occurrence
    selectors: [
      'div.article-content',
      'section.body',
      'article.article',
    ],

    // Selectors to remove from the extracted content
    clean: [
      '.ad',
    ],

    // Array of transformations to make on matched elements.
    // Each item in the array is an object. The key is the
    // selector, the value is a transformation function
    // for the matching node.
    // Note: `$` and `convertNodeTo` are assumed to be in scope here.
    transforms: [
      // Convert h1s to h2s
      {
        'h1': ($node) => convertNodeTo($node, $, 'h2')
      },

      // Convert lazy-loaded noscript images to figures
      {
        'noscript': ($node) => {
          const $children = $node.children()
          if ($children.length === 1 && $children.get(0).tagName === 'img') {
            convertNodeTo($node, $, 'figure')
          }
        }
      }
    ]
  },

  title: [
    'h1',
  ]
}
```
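
The "stop on first occurrence" rule for `selectors` could be implemented roughly as follows. This is only a sketch, not the parser's actual code: `firstMatchingSelector` and `fake$` are hypothetical names, and `fake$` stands in for a real selection function so the example runs without a DOM library. The only assumption about `$` is that calling it with a selector returns an object with a numeric `length`.

```javascript
// Hypothetical helper: return the first selector that matches at least one
// node, or null if none match. `$` is any function that takes a selector and
// returns an object with a numeric `length`.
function firstMatchingSelector($, selectors) {
  return selectors.find(selector => $(selector).length > 0) || null
}

// Stand-in for a parsed document: maps selectors to match counts.
const matchCounts = { 'section.body': 1, 'article.article': 2 }
const fake$ = selector => ({ length: matchCounts[selector] || 0 })

const selectors = ['div.article-content', 'section.body', 'article.article']
console.log(firstMatchingSelector(fake$, selectors)) // prints: section.body
```

Because the list is ordered by likelihood, the more specific `section.body` is chosen even though `article.article` also matches.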