feat: basic wikipedia custom extractor

Adam Pash 8 years ago
parent 9665fe7209
commit 603682239d

@@ -1,5 +1,6 @@
TODO:
- change customselector to rootselector. consider other options for generalizing cleaning (use generic cleaners)
- extract and generalize cleaners
- get custom datePublished selector to convert to date object (prob through cleaner)
- run makeLinksAbsolute on extracted content before returning
- remove logic for fetching meta attrs with custom props
- Resource (fetches page, validates it, cleans it, normalizes meta tags (!), converts lazy-loaded images, makes links absolute, etc)
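
The `datePublished` item in the TODO list could be handled by a small cleaner. The sketch below is only an illustration: `parseLastModified` and `MONTHS` are hypothetical names that do not appear in this commit, and it assumes the `#footer-info-lastmod` text looks like "This page was last modified on 26 August 2016, at 13:03." with the time given in UTC.

```javascript
// Hypothetical cleaner: turns Wikipedia's "last modified" footer text into a Date.
// Assumption: the text contains "<day> <Month> <year>, at <hh>:<mm>" in UTC.
const MONTHS = [
  'January', 'February', 'March', 'April', 'May', 'June',
  'July', 'August', 'September', 'October', 'November', 'December',
]

function parseLastModified(text) {
  const match = /(\d{1,2}) ([A-Za-z]+) (\d{4}), at (\d{1,2}):(\d{2})/.exec(text)
  if (!match) return null

  const [, day, monthName, year, hour, minute] = match
  const month = MONTHS.indexOf(monthName)
  if (month === -1) return null

  // Build the date in UTC so the result does not depend on the local timezone.
  return new Date(Date.UTC(Number(year), month, Number(day), Number(hour), Number(minute)))
}
```

Returning `null` for unrecognized text lets the caller fall back to whatever generic date handling already exists.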

@@ -0,0 +1 @@
0 priests were performing the holy ceremonies in the temple.<sup id="cite_ref-7" class="reference"><a href="#cite_note-7">[7]</a></sup> It is believed a fire cracker lit near the temple fell on the <i>yagasala</i>, a temporary structure built to accommodate the ritual ceremonies, and sparked the fire that spread to the thatched roofs. A stampede resulted when the panic-stricken devotees rushed to the only entrance to the temple on the eastern side.<sup id="cite_ref-yaga_6-1" class="reference"><a href="#cite_note-yaga-6">[6]</a></sup><sup id="cite_ref-8" class="reference"><a href="#cite_note-8">[8]</a></sup> However, another version claimed the fire was caused by a spark from the electric generator.<sup id="cite_ref-yaga_6-2" class="reference"><a href="#cite_note-yaga-6">[6]</a></sup> Most of the deaths were reported be caused by the inhalation of carbon monoxide and a few due to burn injuries. There were lot of inflammable material like ghee, condiments and thatched roof that resulted in spreading of fire. The only entrance was the narrow eastern side where many rushed and fell on stones.<sup id="cite_ref-9" class="reference"><a href="#cite_note-9">[9]</a></sup> Police reported that they recovered 37 bodies from the thatched roof that fell on the worshipers. The fire hampered the electric line in the neighbourhood, slowing down the rescue operations.<sup id="cite_ref-10" class="reference"><a href="#cite_note-10">[10]</a></sup></p> <p>The rescue operations were monitored by Pulavar Senguttuvan, the state Minister for Hindu Religious and Charitable Endowements, T N Ramanathan, the District Collector, S K Dogra, the Deputy Inspector-General of Police and Jayanth Murali, Superintendent of Police of <a href="/wiki/Thanjavur_district" title="Thanjavur district">Thanjavur district</a> at that time. 
The rescue operations were aided by Home Guards, member of <a href="/wiki/Red_Cross" class="mw-redirect" title="Red Cross">Red Cross</a> and the general public.<sup id="cite_ref-TV_11-0" class="reference"><a href="#cite_note-TV-11">[11]</a></sup><sup id="cite_ref-12" class="reference"><a href="#cite_note-12">[12]</a></sup> A special information cell was opened in the premises of the temple and also at Collector&apos;s office.<sup id="cite_ref-TV_11-1" class="reference"><a href="#cite_note-TV-11">[11]</a></sup></p> <h2><span class="mw-headline" id="Aftermath">Aftermath</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Brihadeeswarar_Temple_fire&amp;action=edit&amp;section=3" title="Edit section: Aftermath">edit</a><span class="mw-editsection-bracket">]</span></span></h2> <p>The accident was one of four major fire accidents in the state along with the fire accidents like the <a href="/wiki/Erwadi_fire_incident" title="Erwadi fire incident">Erwadi fire incident</a> on 6 August 2001 that killed 30 mentally challenged people, <a href="/wiki/Srirangam_marriage_hall_fire" title="Srirangam marriage hall fire">fire at marriage hall</a> on 23 January 2005 at <a href="/wiki/Srirangam" title="Srirangam">Srirangam</a> where 30 people including the bridegroom were killed and <a href="/wiki/2004_Kumbakonam_School_fire" title="2004 Kumbakonam School fire">2004 Kumbakonam School fire</a> where 94 school children were killed.<sup id="cite_ref-Teets103_13-0" class="reference"><a href="#cite_note-Teets103-13">[13]</a></sup> The Tamil Nadu Government announced a compensation of Rs 100,000 to the families of the deceased and the injured were paid from Rs 10,000 to Rs 50,000 each.<sup id="cite_ref-yaga_6-3" class="reference"><a href="#cite_note-yaga-6">[6]</a></sup> The Deputy Inspector General (DGI), during the investigation, ruled out any possibility of sabotage even though there was an attempt was made to blast the TV relay 
station at Eswari Nagar the previous week.<sup id="cite_ref-TV_11-2" class="reference"><a href="#cite_note-TV-11">[11]</a></sup></p> <h2><span class="mw-headline" id="Notes">Notes</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Brihadeeswarar_Temple_fire&amp;action=edit&amp;section=4" title="Edit section: Notes">edit</a><span class="mw-editsection-bracket">]</span></span></h2> <h2><span class="mw-headline" id="References">References</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Brihadeeswarar_Temple_fire&amp;action=edit&amp;section=5" title="Edit section: References">edit</a><span class="mw-editsection-bracket">]</span></span></h2> </div></div>
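
The fixture above still carries root-relative links such as `/wiki/Thanjavur_district` and fragment links such as `#cite_note-7`, which is what the `makeLinksAbsolute` TODO is about. A rough, dependency-free sketch of the idea follows; `absolutizeHrefs` is a hypothetical name, and rewriting attributes with a regular expression is a simplification of what a DOM-based implementation would do.

```javascript
// Hypothetical helper: resolves every double-quoted href in an HTML string
// against the page URL. A real implementation would walk the parsed DOM instead.
function absolutizeHrefs(html, baseUrl) {
  return html.replace(/href="([^"]*)"/g, (match, href) => {
    try {
      // The WHATWG URL constructor resolves relative references against baseUrl.
      return `href="${new URL(href, baseUrl).href}"`
    } catch (err) {
      // Leave values that cannot be parsed as URLs untouched.
      return match
    }
  })
}
```

Called with the article URL as `baseUrl`, both `/wiki/...` paths and `#cite_note-...` fragments end up as full URLs, while already absolute links pass through unchanged.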

@@ -1,10 +1,12 @@
import GenericExtractor from './generic'
import NYMagExtractor from './custom/nymag.com'
import BloggerExtractor from './custom/blogspot.com'
import WikipediaExtractor from './custom/wikipedia.org'
const Extractors = {
'nymag.com': NYMagExtractor,
'blogspot.com': BloggerExtractor,
'wikipedia.org': WikipediaExtractor,
}
export default Extractors
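
The registry is keyed by base domain, while the test URL in this commit uses the `en.wikipedia.org` subdomain. The lookup code is not part of this diff, so the following is only a sketch of one way it might work: `getExtractor` is a hypothetical name, the extractor objects are stubs, and reducing the hostname to its last two labels is an assumption (it would not handle suffixes like `co.uk`).

```javascript
// Stubs standing in for the real extractor modules.
const GenericExtractor = { domain: '*' }
const WikipediaExtractor = { domain: 'wikipedia.org' }

const Extractors = {
  'wikipedia.org': WikipediaExtractor,
}

// Hypothetical lookup: try the full hostname, then the last two labels,
// then fall back to the generic extractor.
function getExtractor(url) {
  const { hostname } = new URL(url)
  const baseDomain = hostname.split('.').slice(-2).join('.')
  return Extractors[hostname] || Extractors[baseDomain] || GenericExtractor
}
```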

@@ -0,0 +1,40 @@
const WikipediaExtractor = {
domain: 'wikipedia.org',
content: {
selectors: [
'#mw-content-text',
],
// transform top infobox to an image with caption
transforms: {
'.infobox img': ($node, $) => {
$node.parents('.infobox').prepend($node)
},
'.infobox': 'figure',
},
// Selectors to remove from the extracted content
clean: [
'.mw-editsection',
'figure tr, figure td, figure tbody',
],
},
author: 'Wikipedia Contributors',
title: {
selectors: [
'h2.title',
]
},
datePublished: {
selectors: [
'#footer-info-lastmod',
]
},
}
export default WikipediaExtractor

@@ -9,12 +9,17 @@ const RootExtractor = {
// This is the generic extractor. Run its extract method
if (extractor.domain === '*') return extractor.extract(opts)
-const title = extract({ ...opts, type: 'title', extractor })
-const datePublished = extract({ ...opts, type: 'datePublished', extractor })
-const author = extract({ ...opts, type: 'author', extractor })
-const content = extract({ ...opts, type: 'content', extractor, extractHtml: true })
-const leadImageUrl = extract({ ...opts, type: 'leadImageUrl', extractor })
-const dek = extract({ ...opts, type: 'dek', extractor })
+opts = {
+...opts,
+extractor
+}
+const title = extract({ ...opts, type: 'title' })
+const datePublished = extract({ ...opts, type: 'datePublished' })
+const author = extract({ ...opts, type: 'author' })
+const content = extract({ ...opts, type: 'content', extractHtml: true })
+const leadImageUrl = extract({ ...opts, type: 'leadImageUrl', content })
+const dek = extract({ ...opts, type: 'dek', content })
return {
title,
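
The hunk above folds `extractor` into `opts` once and passes the extracted `content` on to the `leadImageUrl` and `dek` calls. A stripped-down sketch of that pattern follows; `extractAll` is a hypothetical name, and the `extract` stub here only reports what it received rather than doing any real extraction.

```javascript
// Stub for the real `extract`: it just echoes the options it was given.
function extract({ type, extractor, content }) {
  return { type, domain: extractor.domain, sawContent: content !== undefined }
}

function extractAll(extractor, opts) {
  // Attach the extractor once instead of repeating it in every call.
  opts = { ...opts, extractor }

  const title = extract({ ...opts, type: 'title' })
  const content = extract({ ...opts, type: 'content', extractHtml: true })
  // Fields extracted later can inspect the content that was already found.
  const dek = extract({ ...opts, type: 'dek', content })

  return { title, content, dek }
}
```

Because later keys in an object spread win, each call can still override anything set on the shared `opts`.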
@@ -40,18 +45,20 @@ function select($, extractionOpts, extractHtml=false) {
// Skip if there's not extraction for this type
if (!extractionOpts) return
// If a string is hardcoded for a type (e.g., Wikipedia
// contributors), return the string
if (typeof extractionOpts === 'string') return extractionOpts
const { selectors } = extractionOpts
const matchingSelector = selectors.find((selector) => {
return $(selector).length === 1
})
console.log(matchingSelector)
// console.log($(matchingSelector).text())
console.log(extractHtml)
if (!matchingSelector) return
// If the selector type requests html as its return type
// clean the element with provided cleaning selectors
// transform and clean the element with provided selectors
if (extractHtml) {
let $content = $(matchingSelector)
@@ -59,8 +66,8 @@ function select($, extractionOpts, extractHtml=false) {
$content.wrap($('<div></div>'))
$content = $content.parent()
-$content = cleanBySelectors($content, $, extractionOpts)
$content = transformElements($content, $, extractionOpts)
+$content = cleanBySelectors($content, $, extractionOpts)
return $.html($content)
} else {
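
The selection rules in `select` (a hardcoded string wins outright, otherwise the first selector with exactly one match is used) can be shown without a DOM. In this sketch `selectValue` and `makeFakeDollar` are hypothetical names, and the fake `$` simply maps selectors to arrays of matched text; it is not cheerio.

```javascript
// Minimal stand-in for `$`: returns the list of matches for a selector.
function makeFakeDollar(matchesBySelector) {
  return (selector) => matchesBySelector[selector] || []
}

function selectValue($, extractionOpts) {
  // No extraction configured for this type.
  if (!extractionOpts) return undefined
  // A hardcoded string (e.g. the Wikipedia author) is returned as is.
  if (typeof extractionOpts === 'string') return extractionOpts

  const { selectors } = extractionOpts
  // Only a selector that matches exactly one element is trusted.
  const matchingSelector = selectors.find((selector) => $(selector).length === 1)
  if (!matchingSelector) return undefined

  return $(matchingSelector)[0]
}
```

Requiring exactly one match means an ambiguous selector is skipped in favor of the next one in the list rather than guessing which element was meant.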
@@ -93,7 +100,7 @@ export function transformElements($content, $, { transforms }) {
} else if (typeof value === 'function') {
// If value is function, apply function to node
$matches.each((index, node) => {
-const result = value($(node))
+const result = value($(node), $)
// If function returns a string, convert node to that value
if (typeof result === 'string') {
convertNodeTo(node, $, result)
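
Passing `$` as a second argument is what lets the `.infobox img` transform reach outside its own node. The dispatch between string and function transforms can be sketched without cheerio; here `applyTransforms` is a hypothetical name, nodes are plain objects, and `find` is a stand-in for `$`, so this shows the control flow only, not the real DOM manipulation.

```javascript
// Nodes are plain objects ({ tag, children }); `find` is a hypothetical
// stand-in for `$` that returns the nodes matching a selector.
function applyTransforms(find, transforms) {
  Object.keys(transforms).forEach((selector) => {
    const value = transforms[selector]
    const matches = find(selector)

    if (typeof value === 'string') {
      // A string value converts every match to that tag.
      matches.forEach((node) => { node.tag = value })
    } else if (typeof value === 'function') {
      matches.forEach((node) => {
        // The function also receives the lookup, mirroring `value($(node), $)`.
        const result = value(node, find)
        // A returned string means "convert this node to that tag".
        if (typeof result === 'string') node.tag = result
      })
    }
  })
}
```

A function transform that returns nothing, like the infobox image one, is used purely for its side effects.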

@@ -16,5 +16,11 @@ describe('Iris', function() {
// console.log(result)
})
it('does wikipedia', async function() {
const result = await Iris.parse('https://en.wikipedia.org/wiki/Brihadeeswarar_Temple_fire')
console.log(result)
})
})
})
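
The new test only logs the result. As a possible next step, the checks below are derived from the extractor config in this commit (the hardcoded author and the `.mw-editsection` clean rule). `checkWikipediaResult` is a hypothetical helper, and the result shape (`author`, `content`) is assumed from the fields `RootExtractor` builds.

```javascript
// Hypothetical helper: returns a list of problems found in a parse result.
function checkWikipediaResult(result) {
  const problems = []

  // The extractor hardcodes the author string.
  if (result.author !== 'Wikipedia Contributors') problems.push('unexpected author')

  if (typeof result.content !== 'string' || result.content.length === 0) {
    problems.push('empty content')
  } else if (result.content.includes('mw-editsection')) {
    // The clean list is supposed to strip the [edit] links.
    problems.push('edit links not cleaned')
  }

  return problems
}
```

Inside the `it` block, asserting that this list is empty could take the place of `console.log(result)`.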
