Improve metadata extraction (#478)
* Improve metadata extraction
* Recognize meta[property] as a space-separated list
* Recognize Dublin Core (dc:|dcterm:) metadata
* Prefer Dublin Core, Open Graph, Twitter, and HTML in that order
* _getArticleTitle() is now only used as a fallback if the document doesn't provide good metadata

pull/483/head
parent 0449dbf186
commit 5a69d4a8eb
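The preference order described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `PREFERENCE`, `collectMeta`, and `pickMetadata` are hypothetical names, each input object stands in for one `<meta>` element, and the "later tag overwrites earlier tag" behavior is an assumption chosen so that the sketch reproduces the expected values in the fixtures below.

```javascript
// Hypothetical sketch, not the actual Readability.js implementation.
// Keys are listed from most to least preferred:
// Dublin Core, Open Graph, Twitter, then plain HTML names.
const PREFERENCE = {
  title: ["dc:title", "dcterm:title", "og:title", "twitter:title", "title"],
  byline: ["dc:creator", "dcterm:creator", "author"],
  excerpt: ["dc:description", "dcterm:description", "og:description",
            "twitter:description", "description"],
};

// Build a key -> content map from objects shaped like
// { property, name, content } (one per <meta> element).
function collectMeta(metaElements) {
  const values = {};
  for (const { property, name, content } of metaElements) {
    if (!content) continue;
    let keys = [];
    if (property) {
      // meta[property] is treated as a space-separated list of keys.
      keys = property.trim().split(/\s+/);
    } else if (name) {
      // Normalize "DC.title" to "dc:title" (assumed convention).
      keys = [name.trim().replace(/\./g, ":")];
    }
    for (const key of keys) {
      // Assumption: a later tag overwrites an earlier one with the same key.
      values[key.toLowerCase()] = content.trim();
    }
  }
  return values;
}

// Pick the most preferred value for each field. fallbackTitle stands in
// for whatever a title heuristic such as _getArticleTitle() would return.
function pickMetadata(metaElements, fallbackTitle) {
  const values = collectMeta(metaElements);
  const first = (keys) => {
    for (const key of keys) {
      if (values[key]) return values[key];
    }
    return null;
  };
  return {
    title: first(PREFERENCE.title) || fallbackTitle || null,
    byline: first(PREFERENCE.byline),
    excerpt: first(PREFERENCE.excerpt),
  };
}

// Usage with tags shaped like the space-separated-property fixture.
const example = pickMetadata([
  { property: "x:title dc:title", content: "Preferred title" },
  { property: "og:title twitter:title", content: "A title" },
  { property: "dc:creator twitter:site_name", content: "Creator Name" },
  { name: "author", content: "FAIL" },
  { property: "dc:description og:description", content: "Preferred description" },
  { name: "description", content: "FAIL" },
], "Title Element");
console.log(example);
```

Keeping the preference lists as plain data means the ordering can be read and changed in one place; the real implementation may structure this differently.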
@@ -0,0 +1,7 @@
{
"title": "Dublin Core property title",
"byline": "Dublin Core property author",
"dir": null,
"excerpt": "Dublin Core property description",
"readerable": true
}
@@ -0,0 +1,20 @@
<div id="readability-page-1" class="page">
<article>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
</article>
</div>
@@ -0,0 +1,45 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Title Element</title>
<meta name="title" content="Meta name title"/>
<meta name="og:title" content="Open Graph name title"/>
<meta name="twitter:title" content="Twitter name title"/>
<meta name="DC.title" content="Dublin Core name title"/>
<meta property="dc:title" content="Dublin Core property title"/>
<meta property="twitter:title" content="Twitter property title"/>
<meta property="og:title" content="Open Graph property title"/>
<meta name="author" content="Meta name author"/>
<meta name="DC.creator" content="Dublin Core name author"/>
<meta property="dc:creator" content="Dublin Core property author"/>
<meta name="description" content="Meta name description"/>
<meta name="og:description" content="Open Graph name description"/>
<meta name="twitter:description" content="Twitter name description"/>
<meta name="DC.description" content="Dublin Core name description"/>
<meta property="dc:description" content="Dublin Core property description"/>
<meta property="twitter:description" content="Twitter property description"/>
<meta property="og:description" content="Open Graph property description"/>
</head>
<body>
<article>
<h1>Test document title</h1>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
</article>
</body>
</html>
@@ -0,0 +1,7 @@
{
"title": "Preferred title",
"byline": "Creator Name",
"dir": null,
"excerpt": "Preferred description",
"readerable": true
}
@@ -0,0 +1,20 @@
<div id="readability-page-1" class="page">
<article>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
</article>
</div>
@@ -0,0 +1,35 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Title Element</title>
<meta property="x:title dc:title" content="Preferred title"/>
<meta property="og:title twitter:title" content="A title"/>
<meta property="dc:creator twitter:site_name" content="Creator Name"/>
<meta name="author" content="FAIL"/>
<meta property="og:description x:description twitter:description" content="A description"/>
<meta property="dc:description og:description" content="Preferred description"/>
<meta name="description" content="FAIL"/>
</head>
<body>
<article>
<h1>Test document title</h1>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
<p>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</p>
</article>
</body>
</html>
@@ -1,6 +1,6 @@
{
"title": "Obama admits US gun laws are his 'biggest frustration'",
"title": "Obama admits US gun laws are his 'biggest frustration' - BBC News",
"byline": null,
"excerpt": "President Barack Obama tells the BBC his failure to pass",
"excerpt": "President Barack Obama tells the BBC his failure to pass \"common sense gun safety laws\" is the greatest frustration of his presidency.",
"readerable": true
}
@@ -1,6 +1,6 @@
{
"title": "Student Dies After Diet Pills She Bought Online \"Burned Her Up From Within\"",
"byline": "Mark Di Stefano",
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July...",
"byline": null,
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July.",
"readerable": true
}
@@ -1,7 +1,7 @@
{
"title": "Xbox One X review: A console that keeps up with gaming PCs",
"title": "Xbox One X review: A console that keeps up with gaming PCs",
"byline": null,
"dir": null,
"excerpt": "The Xbox One X is the ultimate video game system. It sports more horsepower than any system ever. And it plays more titles in native 4K than Sony's PlayStation...",
"excerpt": "The Xbox One X is the most powerful gaming console ever, but it's not for everyone yet.",
"readerable": true
}
@@ -1,6 +1,6 @@
{
"title": "1Password für Mac generiert Einmal-Passwörter",
"byline": null,
"byline": "Mac & i",
"excerpt": "Das in der iOS-Version bereits enthaltene TOTP-Feature ist nun auch für OS X 10.10 verfügbar. Zudem gibt es neue Zusatzfelder in der Datenbank und weitere Verbesserungen.",
"readerable": true
}
@@ -1,5 +1,5 @@
{
"title": "draft-dejong-remotestorage-04 - remoteStorage",
"byline": "AUTHORING",
"title": "remoteStorage",
"byline": "Jong, Michiel de",
"readerable": true
}
@@ -1,6 +1,6 @@
{
"title": "Una solución no violenta para la cuestión mapuche - 07.12.2017",
"title": "Una solución no violenta para la cuestión mapuche",
"byline": null,
"excerpt": "Una solución no violenta para la cuestión mapuche | Los pueblos indígenas reclaman por derechos que permanecen incumplidos, por eso es más eficiente canalizar la protesta que reprimirla - LA NACION",
"excerpt": "Los pueblos indígenas reclaman por derechos que permanecen incumplidos, por eso es más eficiente canalizar la protesta que reprimirla",
"readerable": true
}
@@ -1,6 +1,6 @@
{
"title": "Raspberry Pi 3 - The credit card sized PC that cost only $35 - All-time bestselling computer in UK",
"byline": null,
"excerpt": "The Raspberry Pi Foundation started by a handful of volunteers in 2012 when they released the original Raspberry Pi 256MB Model B without knowing what to expect. In a short four-year period they have grown to over sixty full-time employees and ha...",
"excerpt": "The Raspberry Pi Foundation started by a handful of volunteers in 2012 when they released the original Raspberry Pi 256MB Model B without knowing what to expect. In a short four-year period they have grown to over sixty full-time employees and ha...",
"readerable": true
}