soup
Web Scraper in Go, similar to BeautifulSoup
soup is a small web scraper package for Go whose interface closely resembles that of BeautifulSoup.
Functions implemented so far:
func Get(string) (string,error) // Takes the url as an argument, returns HTML string
func Header(string, string) // Takes key,value pair to set as headers for the HTTP request made in Get(), refer to PR #11 for more usage
func HTMLParse(string) struct{} // Takes the HTML string as an argument, returns a pointer to the DOM constructed
func Find([]string) struct{} // Element tag,(attribute key-value pair) as argument, pointer to first occurrence returned
func FindAll([]string) []struct{} // Same as Find(), but pointers to all occurrences returned
func FindNextSibling() struct{} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() struct{} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() struct{} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() struct{} // Pointer to the previous element sibling of the Element in the DOM returned
func Attrs() map[string]string // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string // Full text inside a non-nested tag returned
func SetDebug(bool) // Sets the debug mode to true or false; false by default
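To make the roles of Get and Header more concrete, here is a minimal sketch that uses only the Go standard library. It is not soup's implementation: fetchHTML, newEchoServer, the headers map and the "soup-sketch" User-Agent value are all hypothetical names chosen for illustration. It only shows the general shape of the operation: attach headers to a GET request, perform it, and return the body as a string alongside an error.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchHTML is a hypothetical stand-in for the Header + Get pair listed
// above: it attaches the given headers to a GET request and returns the
// response body as a string.
func fetchHTML(url string, headers map[string]string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	for key, value := range headers {
		req.Header.Set(key, value)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newEchoServer starts a local server that echoes the request's User-Agent
// header inside a small HTML page, so the sketch runs without network access.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "<html><body><p>%s</p></body></html>", r.Header.Get("User-Agent"))
	}))
}

func main() {
	srv := newEchoServer()
	defer srv.Close()

	html, err := fetchHTML(srv.URL, map[string]string{"User-Agent": "soup-sketch"})
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(html)
}
```

With soup itself, the equivalent steps would be a call to Header for each key-value pair followed by Get, with the returned string passed on to HTMLParse.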
The struct returned by these functions has three fields:
Pointer, containing the pointer to the current HTML node
NodeValue, containing the current HTML node's value, i.e. the tag name for an ElementNode, or the text in the case of a TextNode
Error, containing an error if one occurs, and nil otherwise
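Because a failed lookup is reported through the Error field rather than a second return value, callers may want to check that field before using a result. The sketch below illustrates the pattern with stand-in types only: node, Root, findFirst and sampleDoc are hypothetical and simplified, not the package's own definitions. How the real package behaves when a method is chained onto a failed lookup is not stated in this README; the sketch simply chooses to return another error in that case.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a simplified, hypothetical stand-in for a parsed HTML node.
type node struct {
	tag      string
	text     string
	children []*node
}

// Root mirrors the three fields described above (stand-in type).
type Root struct {
	Pointer   *node  // current node; nil when a lookup failed
	NodeValue string // tag name of the current node
	Error     error  // non-nil when a lookup failed
}

// Find returns the first descendant with the given tag, searching depth
// first, or a Root whose Error field is set when nothing matches.
func (r Root) Find(tag string) Root {
	if r.Pointer == nil {
		// Assumption of this sketch: chaining on a failed lookup yields an error.
		return Root{Error: errors.New("find called on an empty element")}
	}
	if n := findFirst(r.Pointer, tag); n != nil {
		return Root{Pointer: n, NodeValue: n.tag}
	}
	return Root{Error: fmt.Errorf("element %q not found", tag)}
}

// findFirst walks the children of n in document order and returns the
// first node whose tag matches, or nil.
func findFirst(n *node, tag string) *node {
	for _, c := range n.children {
		if c.tag == tag {
			return c
		}
		if found := findFirst(c, tag); found != nil {
			return found
		}
	}
	return nil
}

// sampleDoc builds a tiny tree: <body><div><a>xkcd</a></div></body>.
func sampleDoc() Root {
	link := &node{tag: "a", text: "xkcd"}
	div := &node{tag: "div", children: []*node{link}}
	body := &node{tag: "body", children: []*node{div}}
	return Root{Pointer: body, NodeValue: body.tag}
}

func main() {
	doc := sampleDoc()

	link := doc.Find("a")
	if link.Error != nil {
		fmt.Println("lookup failed:", link.Error)
		return
	}
	fmt.Println("found:", link.NodeValue, link.Pointer.text)

	// A missing element surfaces through Error, so check it before
	// touching Pointer or chaining another call.
	if missing := doc.Find("table"); missing.Error != nil {
		fmt.Println("lookup failed:", missing.Error)
	}
}
```

The same habit applies to chained calls: checking Error on the intermediate result avoids working with a node that was never found.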
Installation
Install the package using the command
go get github.com/anaskhan96/soup
Example
The example below scrapes the "Comics I Enjoy" section (link text and URLs) from xkcd.
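package main
import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)
func main() {
	// Fetch the page; exit with a non-zero status if the request fails.
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		os.Exit(1)
	}
	// Parse the HTML, then collect every link inside the div with id "comicLinks".
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		// Print each link's text followed by its href attribute.
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}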
Contributions
This package was developed in my free time. Contributions from the community are welcome and will help make it a better web scraper. If you think the package needs a particular feature or function, feel free to open an issue or pull request.