mercury-parser/NOTES.md

Each extractor should ultimately be an object that exports like so:

```javascript
import GenericContentExtractor from './content/extractor'
import GenericTitleExtractor from './title/extractor'
import GenericAuthorExtractor from './author/extractor'
import GenericDatePublishedExtractor from './date-published/extractor'
import GenericDekExtractor from './dek/extractor'
import GenericLeadImageUrlExtractor from './lead-image-url/extractor'

const GenericExtractor = {
  content: GenericContentExtractor,
  title: GenericTitleExtractor,
  author: GenericAuthorExtractor,
  datePublished: GenericDatePublishedExtractor,
  dek: GenericDekExtractor,
  leadImageUrl: GenericLeadImageUrlExtractor,
}
```

Custom parsers can then be merged with the generic parser, which fills in any fields the custom parser does not implement. E.g.:

```javascript
import NYMagContentExtractor from '...'
import NYMagTitleExtractor from '...'

const NYMagExtractor = {
  content: NYMagContentExtractor,
  title: NYMagTitleExtractor,
}

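// Later spreads win: custom fields override the generic ones,
// and anything the custom extractor omits falls back to generic.
const Extractor = {
  ...GenericExtractor,
  ...NYMagExtractor,
}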
```
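To make the merge order concrete, here is a minimal runnable sketch. The extractor functions are hypothetical stand-ins (plain functions returning strings), not the real extractor modules; the only point is that the custom `title` wins while `author` falls back to the generic one.

```javascript
// Hypothetical stand-ins for the real extractor modules.
const GenericExtractor = {
  content: () => 'generic content',
  title: () => 'generic title',
  author: () => 'generic author',
}

// A custom extractor that only knows how to find the title.
const NYMagExtractor = {
  title: () => 'nymag title',
}

// Object spread copies keys left to right, so later objects override earlier ones.
const Extractor = {
  ...GenericExtractor,
  ...NYMagExtractor,
}

const result = {
  title: Extractor.title(),   // comes from NYMagExtractor
  author: Extractor.author(), // falls back to GenericExtractor
}
```

Because the merge is shallow, a custom extractor replaces a generic field wholesale rather than combining with it.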

# Declarative Custom Extractors

My goal is to create declarative extractors that describe what to extract rather than how to extract it. So, for example:

```javascript
const NYMagExtractor = {
  content: {
    // Order by most likely. Extractor will stop on first occurrence
    selectors: [
      'div.article-content',
      'section.body',
      'article.article',
    ],

    // Selectors to remove from the extracted content
    clean: [
      '.ad',
    ],

    // Array of transformations to make on matched elements.
    // Each item in the array is an object. The key is the
    // selector, the value is a transformation function
    // for the matching node.
    // (`$` and `convertNodeTo` are assumed to be in scope here.)
    transforms: [
      // Convert h1s to h2s
      {
        'h1': ($node) => convertNodeTo($node, $, 'h2')
      },

      // Convert lazy-loaded noscript images to figures
      {
        'noscript': ($node) => {
          const $children = $node.children()
          if ($children.length === 1 && $children.get(0).tagName === 'img') {
            convertNodeTo($node, $, 'figure')
          }
        }
      }
    ]
  },

  title: [
    'h1',
  ]
}
```
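One way such a declarative definition could be interpreted is sketched below. This is only an illustration under stated assumptions, not the parser's actual implementation: `selectFirst`, `applyTransforms`, and the `query` callback are hypothetical names, and a plain object stands in for a parsed document so the example runs without any DOM library.

```javascript
// `query` is a hypothetical callback: (selector) => array of matching nodes.
// Returns the first selector that matches anything, in priority order.
function selectFirst(selectors, query) {
  for (const selector of selectors) {
    const matches = query(selector)
    if (matches.length > 0) {
      return { selector, node: matches[0] }
    }
  }
  return null
}

// Runs every { selector: fn } transform against each node it matches.
function applyTransforms(transforms, query) {
  for (const transform of transforms) {
    for (const [selector, fn] of Object.entries(transform)) {
      query(selector).forEach((node) => fn(node))
    }
  }
}

// Fake "document": maps selectors to the nodes they would match.
const fakeDocument = {
  'section.body': [{ tag: 'section' }],
  'article.article': [{ tag: 'article' }],
  'h1': [{ tag: 'h1' }],
}
const query = (selector) => fakeDocument[selector] || []

// 'div.article-content' matches nothing, so the search stops at 'section.body'.
const match = selectFirst(
  ['div.article-content', 'section.body', 'article.article'],
  query
)

// Convert h1s to h2s, mirroring the first transform in the example above.
applyTransforms([{ h1: (node) => { node.tag = 'h2' } }], query)
```

Keeping the interpreter generic like this means a custom extractor stays pure data plus small functions, which is the "what rather than how" split described above.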