A Couple of Questions About Metadata Scraping

December 16, 2018

In Emby's Development Policy, it says "No code contribution or plug-in shall directly violate or otherwise circumvent or cause the Emby product as a whole to violate or circumvent any laws as governed by the United States of America. ... This includes but is not limited to: Using 'web scraping' techniques to obtain data from a web site."

First: In what way is scraping a public website for metadata illegal if it doesn't adversely affect the website?

And second, macr0dev's Audiobook Metadata Agent plugin for Plex scrapes Audible for audiobook metadata... so does this mean that implementing an identical metadata plugin to perform the same function for Emby would not be allowed?

Edited December 16, 2018 by chyron8472

December 16, 2018

Well, nobody's preventing you from writing your own private plugin, if that's what you're asking.

The main thing is we want to conform to terms of use of the websites we are pulling data from.

December 16, 2018

Look at the sites from which you are trying to scrape data. More often than not, they will contain a copyright notice. So, scraping the information from the site and then re-distributing it would be the same as xeroxing some books and selling them on the street.

December 17, 2018

I would liken it more to xeroxing a card file. It's not the book, it's the database entry that describes the book.

... however, as I look into this, the License and Access subsection under Amazon's Conditions of Use says:

Subject to your compliance with these Conditions of Use and any Service Terms, and your payment of any applicable fees, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable license to access and make personal and non-commercial use of the Amazon Services. This license does not include any resale or commercial use of any Amazon Service, or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of any Amazon Service or its contents; any downloading, copying, or other use of account information for the benefit of any third party; or any use of data mining, robots, or similar data gathering and extraction tools. All rights not expressly granted to you in these Conditions of Use or any Service Terms are reserved and retained by Amazon or its licensors, suppliers, publishers, rightsholders, or other content providers. No Amazon Service, nor any part of any Amazon Service, may be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of Amazon. You may not frame or utilize framing techniques to enclose any trademark, logo, or other proprietary information (including images, text, page layout, or form) of Amazon without express written consent. You may not use any meta tags or any other "hidden text" utilizing Amazon's name or trademarks without the express written consent of Amazon. You may not misuse the Amazon Services. You may use the Amazon Services only as permitted by law. The licenses granted by Amazon terminate if you do not comply with these Conditions of Use or any Service Terms.

Well, crap.

Edited December 17, 2018 by chyron8472

December 17, 2018

The other thing to think about is the load it puts on those websites. Their bill might go up if their data usage skyrockets due to apps scraping data from them.

December 17, 2018

@chyron8472 Are you looking to port over macr0dev's plugin to Emby? At this point I manually update the ID3 tag / Emby's album description box with Audible's info + download the 500x500px cover.

Currently the artist (author) image is auto-populated by an image from their albums. I leave the author bio blank for now, as I know when last.fm works, it should auto-populate images/bios for 50% of my authors. It did with Plex at any rate...

December 17, 2018

Truthfully, I do manually adjust the ID3 tags for my audiobooks after I convert and split them from aax. But, coming from using Plex for audiobooks, there are some tags that Plex lists for music-type libraries that aren't editable in mp3tag, such as the name of the publisher (or "Record Label" in Plex). And I'm not sure if embedded "Comment" tags support paragraph breaks. idk, I haven't tested that in Emby.

Second, I suppose I just like having it matched to an online source. I am really extremely picky about poster art, so 19 times out of 20 I will poke around Google and Bing for better covers or else higher resolution versions of what Audible provides for it (either with the downloaded .aax or scraped by the agent). ...oh, and also posters that don't have that obtrusive "Only on Audible" ribbon in the corner obscuring the artwork. So I don't really rely on macr0dev's plugin to supply artwork either.

To a degree, I just like the idea of the metadata matching to an online source. It makes it feel more legit somehow. More official or something.

Edited December 17, 2018 by chyron8472

December 18, 2018

From what I've seen, you're better off entering everything manually in Emby rather than waiting for a quality metadata scraper to not only exist - but also be supported in Emby.

At some point there'll have to be a community-driven scraper that uses developer APIs for all the major audiobook metadata sources, then creates its own API for end-users. I imagine it'd be like TVDB, where the community can add submissions, but it follows its own standardized framework so metadata isn't "every which way" like it is today.

How hard is it to build a MusicBrainz for audiobooks?

December 18, 2018

The current MusicBrainz metadata provider will also work for audiobooks.
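As a rough illustration of what a MusicBrainz lookup for an audiobook might look like, here is a minimal Python sketch that only builds a search URL for the public MusicBrainz web service (version 2). It is not Emby's provider code. The helper name `build_release_search_url` and the sample title and author are made up for the example, and the `release` and `artist` search fields are used as I understand the MusicBrainz search syntax.

```python
from urllib.parse import urlencode

# Release-search endpoint of the MusicBrainz web service, version 2.
MUSICBRAINZ_RELEASE_SEARCH = "https://musicbrainz.org/ws/2/release/"


def build_release_search_url(title: str, author: str, limit: int = 5) -> str:
    """Build a release-search URL (hypothetical helper, not part of Emby)."""
    # Lucene-style query: match the release title and the credited artist,
    # which for an audiobook would be the author (assumption for this sketch).
    # Titles containing double quotes would need escaping; not handled here.
    query = f'release:"{title}" AND artist:"{author}"'
    params = {"query": query, "fmt": "json", "limit": limit}
    return MUSICBRAINZ_RELEASE_SEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Sample values only; no request is sent.
    print(build_release_search_url("The Hobbit", "J. R. R. Tolkien"))
```

Nothing is fetched here on purpose. Anyone actually sending requests should identify their client with a descriptive User-Agent and stay within the rate limits MusicBrainz publishes.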

December 19, 2018

Does the audiobooks content type already scrape MusicBrainz, @Luke?

December 19, 2018

Yes.

December 19, 2018

I'm sorry, I see now that author bios are pulled from MusicBrainz and author images from Last.fm. I've been doing it all manually up till this point.

December 20, 2018

From another perspective, scraping was essentially ruled legal not too long ago in the LinkedIn case. It’s still under review, but as of now, you are free to scrape as long as the data isn’t behind a password-protected site. Note this can change at any time.

December 20, 2018

Is that in reference to the Meet Leonard fiasco a year back? https://meetleonard.com/

December 20, 2018

I'm not sure about the Meet Leonard fiasco; I'm referring to https://www.eff.org/cases/hiq-v-linkedin.

December 21, 2018

Wow, thanks for the info. These large companies definitely need to be held in check. No wonder people are suspicious of big corporations getting involved in FOSS. https://www.reddit.com/r/stallmanwasright

December 21, 2018

Legality aside, web-scraping implementations require constant attention because, with the help of the law or not, it is not the intention of most of these companies to give away their proprietary data for free - so they constantly change things to thwart scraping engines.

December 21, 2018

Hence why, when scraping is allowed, APIs are usually issued with strict call limits. Speaking of which, did you look into giantbomb.com for video game metadata, @ebr?
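For what it's worth, staying under a call limit on the client side can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `MinIntervalLimiter` is made up, the one-second interval is a placeholder rather than any real provider's limit, and the clock and sleep functions are injectable purely so the behaviour can be checked without actually waiting.

```python
import time


class MinIntervalLimiter:
    """Allow at most one call every `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # returns the current time in seconds
        self._sleep = sleep      # blocks for the given number of seconds
        self._last_call = None   # time at which the previous call was allowed

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            # Time still owed before the minimum interval has elapsed.
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


if __name__ == "__main__":
    limiter = MinIntervalLimiter(1.0)  # placeholder limit: one call per second
    for _ in range(3):
        limiter.wait()
        print("request would go here")
```

Calling `wait()` before each request spaces the calls out. A real client would also want to honour whatever limit the API documents and back off when the server says to.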

December 21, 2018

"Scraping allowed" is a little too broad of a statement. At a minimum, in the limited case, the court initially found that HiQ didn't break the law. That being said, it is under appeal, could be reversed, etc. This is not settled law, and I would recommend that Emby tread lightly in this area for a while.

Posters, in order of the posts above: Chyron, Luke, ebr, Chyron, Luke, VaporTrail, Chyron, VaporTrail, Luke, VaporTrail, Luke, VaporTrail, legallink, VaporTrail, legallink, VaporTrail, ebr, VaporTrail, legallink