
Anonymised Logs


adrianwi


9 hours ago, CBers said:

Yes, a good idea.

Hopefully @cayars can add something.
 

 

2 hours ago, adrianwi said:

Great idea!  And once you've come up with the list of things that may need anonymising, write a script that does it automatically when downloading them.  Thanks 😀

@CBers is that a message you can edit?
Normally the only thing that would need anonymizing I think would be your WAN address and Live TV account info.

I think the proper move would be to redact items like these by default at the point of writing the logs, with an option to not redact the info when working in private with a dev, mod or support staff.  I'm not sure this info ever needs to be written to the log, as it can be asked for.


This:

10 hours ago, Spaceboy said:

perhaps at the very least, the How to Report a Problem post should be updated to suggest this and the things that may need anonymising?

10 hours ago, CBers said:

Yes, a good idea.

Hopefully @cayars can add something.
 

 


Log file sanitization is being added for the upcoming 4.7 server release. Logs will be anonymized when you download them from the server web interface.

You can try this out starting with the next beta server release. It may take time for us to spot everything that needs to be marked for anonymization. Thanks.


Painkiller8818
On 1/11/2022 at 9:07 PM, ebr said:

And how do we know what those are...?

1. get all URLs that are not *.emby.media or other Emby-related URLs and replace them with some template URL or obfuscate them
2. get all IPs that are not in the private ranges and replace them with some template IP or obfuscate them
3. get every pw=, passwd=, password=, username=, user= etc. text phrase and delete those lines

Could be a simple search and replace or regex..
really not that hard to do.


Painkiller8818

I've put together a quick and dirty script for this.

Anyone running the server on WINDOWS can use it.

This should also give the devs an idea of how to get all the URLs.

 

###Default values used instead of domains and public IPs
###$outfile contains the name of the log file; this script must be run from the directory the log file is in
$ips = "XXX.XXX.XXX.XXX"
$outfile = "embyserver.txt"
$url = "DOMAIN.COM"

###regex fragments: any IPv4 address, and the private (RFC 1918) and loopback prefixes that are left untouched
$ipv4 = '\d{1,3}(?:\.\d{1,3}){3}'
$private = '(?:10|127|192\.168|172\.(?:1[6-9]|2\d|3[01]))\.'

###reads embyserver.txt and drops every line that contains a credential keyword such as pw=
###if your IPTV url uses different keywords for credentials just add them in there
###then masks public IPs and host names, keeping the prefix, port and path so the log stays readable
$newcontent = Get-Content $outfile -Encoding UTF8 |
    Where-Object { $_ -notmatch 'pw=|passwd=|password=|user=|uname=|username=' } |
    ForEach-Object {
        $_ -replace "(https?://|to |RemoteIp: )(?!$private)$ipv4\b", "`$1$ips" `
           -replace "(https?://)(?!$ipv4\b|XXX\.)[^/:\s]+", "`$1$url"
    }

###writes the cleaned log back to the same file
Set-Content -Path $outfile -Value $newcontent -Encoding UTF8

###the log now has no public IPs, no host names in URLs and no lines with login creds for IPTV URLs and lists

If you are using WINDOWS you can copy this into a text file and save it as a .ps1 file.

Remember this script needs to be executed in the same directory your log is in.

This should help until the beta becomes a stable release, for those of us not wanting to use a beta server ;)


seanbuff

Let's not forget the first few lines, which sometimes contain a user's real name in the file path:

2022-01-05 00:00:15.632 Info App: Application version: 4.7.0.18
2022-01-05 00:00:15.632 Info App: Emby
	Command line: C:\Users\MyRealName\AppData\Roaming\Emby-Server\system\EmbyServer.dll
	Operating system: Microsoft Windows 10.0.19043
	Framework: .NET 6.0.0-rtm.21522.10
	OS/Process: x64/x64
	Runtime: C:/Users/MyRealName/AppData/Roaming/Emby-Server/system/System.Private.CoreLib.dll
	Processor count: 16
	Data path: C:\Users\MyRealName\AppData\Roaming\Emby-Server\programdata
	Application path: C:\Users\MyRealName\AppData\Roaming\Emby-Server\system

 


Painkiller8818
8 hours ago, seanbuff said:

Let's not forget the first few lines which sometimes contains a users real name in the file path:

2022-01-05 00:00:15.632 Info App: Application version: 4.7.0.18
2022-01-05 00:00:15.632 Info App: Emby
	Command line: C:\Users\MyRealName\AppData\Roaming\Emby-Server\system\EmbyServer.dll
	Operating system: Microsoft Windows 10.0.19043
	Framework: .NET 6.0.0-rtm.21522.10
	OS/Process: x64/x64
	Runtime: C:/Users/MyRealName/AppData/Roaming/Emby-Server/system/System.Private.CoreLib.dll
	Processor count: 16
	Data path: C:\Users\MyRealName\AppData\Roaming\Emby-Server\programdata
	Application path: C:\Users\MyRealName\AppData\Roaming\Emby-Server\system

 

It contains your Windows user name. For most people this is not their real name, or at most just their first name. I don't really see this as a problem.


rbjtech

Ideally, every non-RFC1918 IP should be replaced with a meaningful but anonymised label - if you change IPs into Xs, then tracing what is going on may become much harder.

e.g. 64.64.64.64 always gets changed to WAN1, 128.128.128.128 always gets changed to WAN2
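
To make the labelling idea concrete, here is a minimal Python sketch. It is only an illustration, not how Emby does it: the function name `label_public_ips` and the choice to also leave loopback addresses alone are assumptions made for this example.

```python
import ipaddress
import re

# Candidate IPv4 addresses; validity is checked with the ipaddress module.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# The three RFC 1918 private ranges.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def label_public_ips(lines):
    """Replace each distinct non-private IPv4 address with a stable WANn label."""
    labels = {}  # address text -> label, shared across all lines

    def swap(match):
        text = match.group(0)
        try:
            addr = ipaddress.ip_address(text)
        except ValueError:
            return text  # e.g. 999.1.1.1 is not a real address
        if addr.is_loopback or any(addr in net for net in RFC1918):
            return text  # internal addresses stay readable
        # The same address always gets the same label.
        return labels.setdefault(text, f"WAN{len(labels) + 1}")

    return [IPV4.sub(swap, line) for line in lines]
```

Because the mapping is kept for the whole run, every occurrence of one address ends up with the same label, so a connection can still be followed through the log. One caveat: a version string such as 4.7.0.18 also looks like an IPv4 address, so a real tool would need more context than this pattern has.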

Another idea which I've seen a lot is to do a 'support pack' type download, where it downloads (and anonymises) ALL of the recent logs and maybe zips them up as sometimes they are large - i.e. you get embyserver.txt and maybe the previous rolled version, but you also get any ffmpeg logs, the hardware_detection.txt etc.?


Painkiller8818
10 minutes ago, rbjtech said:

Ideally, every non RFC1918 IP should be replaced with a meaningful but anonymised label - if you change IP's into X's - then tracing what is going on may become much harder.

 

This was just quick and dirty work. I also haven't excluded the Emby-related URLs; it should just show that it is possible, but I didn't want to do all their work :D 


rbjtech
4 minutes ago, Painkiller8818 said:

this was just a quick and dirty work, i also haven't excluded the emby related URLs, this should just show it is possible, but i didn't want to do all their work :D 

Mine was just a general comment - it wasn't about your script ;)


It will be right on the log screen, and the toggle will apply to all functions - the built-in log viewer, downloading a log file, opening in a new window, etc.

Untitled.png


  • 3 weeks later...
adrianwi

It looks like it's hiding most things that might lead someone to my server, but I've noticed it's displaying the full URL for the Slack Notification plugin.

2022-01-29 21:36:49.232 Info HttpClient: POST https://mattermost.domain.com/hooks/nht%%%%%%%%%%%%%1kjfby5ewa

Could this be added to the code?

Thanks


  • 3 months later...
zebo51

It would be nice to know what all the anonymizing is doing.  Next to the button maybe state what enabling it does, e.g. hides URLs, external IPs, passwords. 

I still see things that would make me uncomfortable posting a server log to the forums.  Some of us might be more security-conscious.  I personally wouldn't want internal IPs, my Windows account username or my library paths revealed, to name a few.  Any of those are very useful to a hacker. 

I am glad to see this is now being included and hope it will only continue to improve.

Thanks


Quote

Next to the button maybe put what enabling it does, ie Hides URLs, Ext IPs, Passwords. 

Hi, yes, this is a good idea.

Quote

I still see things I would not be comfortable with adding a server log to the forums.

Examples?


GrimReaper
7 hours ago, zebo51 said:

I personally wouldn't want internal IPs revealed, my windows username account, my library paths are just a few more.  Any of those are very useful to a hacker. 

They are also very useful for troubleshooting, since without them one would have absolutely no idea who/what/where tried to access what/how. Seeing a bunch of "x"s instead would surely not help in identifying/resolving your issue. Besides, I don't see how useful those would be to any malicious individual without your external IP being exposed/known.

My 2c.


Locutus64
23 hours ago, zebo51 said:

It would be nice to know what all the anonymizing is doing.  Next to the button maybe put what enabling it does, ie Hides URLs, Ext IPs, Passwords. 

I still see things I would not be comfortable with adding a server log to the forums.  Some of us might be more secure.  I personally wouldn't want internal IPs revealed, my windows username account, my library paths are just a few more.  Any of those are very useful to a hacker. 

I am glad to see this is now being included and hope it will only continue to improve.

Thanks

Like GrimReaper said, not without your external IP. 


CBers
On 29/05/2022 at 12:58, zebo51 said:

I still see things I would not be comfortable with adding a server log to the forums

You can always send an anonymised log to the dev who requested it.

You don't have to post it here on the forums, unless you want to anonymise it further yourself.
 


On 5/29/2022 at 1:58 PM, zebo51 said:

It would be nice to know what all the anonymizing is doing.

 

If you want to know HOW it works, that's actually pretty cool I think (guess why I'm saying that.. 🙂 )

There were several - partly contradictory - requirements for implementing this:

  • Something like simple pattern or regular expression replacement doesn't cut it
    The sanitation needs to work in a way that everybody can rely on, and regex replacements are not reliable enough: sooner or later a pattern will appear that hasn't been considered and isn't handled by such replacements
     
  • The sanitation replacements shall be
    • Precisely controllable
      e.g.: hide host name or ip address from a URL, but not the protocol (http/s), not the port (sometimes but not always), not the path (but sometimes certain sensitive path parts) and not the query (not in general but possibly sanitize specific query parameters in a URL)
       
    • Intelligent
      like @rbjtech mentioned above: replace different parts with different replacement strings, but replace identical parts always with the same replacement string (e.g. host1, host2, host3..)
       
    • Still Expressive
      e.g: IPs and host names are replaced by "host4", passwords are replaced by "*******5", path parts by "x_path6", etc.
       
  • The sanitation procedure needs to be extensible and re-usable
    to allow plugins to make use of it without much effort
     
  • There should not be an up-front decision to make between sanitized and non-sanitized logging
    just as Uncle Murphy taught us: the choice would always be wrong
    • When a log file is sanitized on writing, it would be irreversible and important information might be lost forever
    • When there's a problem and the logs aren't sanitized one couldn't post them
       
  • Dual logging (one regular and one sanitized) would have been very undesirable as well
    (IO overhead, disk space, keeping all in sync, etc: very ugly)
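
As an illustration of the "precisely controllable" and "intelligent" requirements in the list above, here is a small Python sketch that hides only the host of a URL and the values of selected query parameters, while keeping protocol, port and path. It is an assumption-laden example, not Emby's code: the name `sanitize_url`, the `SENSITIVE_PARAMS` set and the `*****` placeholder are all made up here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as sensitive.
SENSITIVE_PARAMS = {"api_key", "password", "pw"}


def sanitize_url(url, host_labels):
    """Hide the host and selected query values; keep scheme, port and path."""
    parts = urlsplit(url)
    if not parts.hostname:
        return url  # nothing recognisable to hide
    # Identical hosts always map to the same label (host1, host2, ...).
    label = host_labels.setdefault(parts.hostname, f"host{len(host_labels) + 1}")
    # Rebuilding the netloc also drops any user:password@ prefix.
    netloc = label if parts.port is None else f"{label}:{parts.port}"
    query = urlencode(
        [
            (key, "*****" if key.lower() in SENSITIVE_PARAMS else value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
        ],
        safe="*",  # keep the placeholder readable instead of %2A
    )
    return urlunsplit((parts.scheme, netloc, parts.path, query, parts.fragment))
```

Passing the same `host_labels` dictionary to every call is what keeps the replacement consistent across a whole log. Working on parsed URL parts rather than on raw text is one way to get this kind of per-part control, though it still depends on finding the URLs in the first place.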

     

So, these were some pretty tough requirements... Well - not the individual ones but fulfilling all of them - not that easy, especially now, as we're coming to the twist of the story where you'll get to understand....
 

The Magic

Seeing the toggle that Luke posted above, most of you probably thought that it controls whether the server writes sanitized or non-sanitized log files to disk.
But that's not the case. The green toggle just controls how you get the logs presented when downloading or viewing. In fact you can download a log with toggle on and right afterwards with toggle off - and you will get the same log, but one being sanitized and the other not.

All requirements above are fulfilled and a plugin can freely choose which parts of its log output should be sanitized and in which way (like as hostname, as password, etc.) 
In fact, a plugin could mark the part 'quire' of the word 'requirements' as host name and it would be treated as such.

It's a puzzle for those who enjoy one - how can this work? 🙂 
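
The thread does not reveal the answer, so the following is purely a guess at one way such a toggle could work, not a description of Emby's implementation. The idea in this Python sketch: the writer wraps each sensitive part in invisible marker characters together with a kind code, only one log is stored, and the markers are either stripped or replaced when the log is read. The marker characters, the kind codes and the names `mark` and `render` are all hypothetical, and the shared running counter is inferred only from the "host4", "*******5", "x_path6" examples above.

```python
import re

# Hypothetical invisible markers written around each sensitive part.
START, END = "\x01", "\x02"
# Kind code -> replacement prefix, modelled on the examples in the post.
PREFIX = {"h": "host", "p": "*******", "x": "x_path"}
MARKED = re.compile(START + r"(.)(.*?)" + END)


def mark(kind, text):
    """Called by the writer (server or plugin) to tag a sensitive part."""
    return f"{START}{kind}{text}{END}"


def render(line, sanitize, labels):
    """Produce the viewed or downloaded form of one stored log line."""

    def swap(match):
        kind, text = match.group(1), match.group(2)
        if not sanitize:
            return text  # toggle off: just strip the markers
        key = (kind, text)
        if key not in labels:
            # One running counter, so identical parts reuse their label.
            labels[key] = f"{PREFIX[kind]}{len(labels) + 1}"
        return labels[key]

    return MARKED.sub(swap, line)
```

With this approach nothing is lost on disk, there is no second log file, and a plugin decides what is sensitive simply by calling `mark`. It would also explain the 'quire' remark: `"re" + mark("h", "quire") + "ments"` renders as "requirements" with the toggle off and as "rehost1ments" with it on.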


On 1/13/2022 at 12:40 PM, rbjtech said:

Another idea which I've seen a lot, is to do a 'support pack' type download - where it downloads (and anonymises) ALL of the recent logs and maybe Zip's them up as sometimes they are large - ie you get embyserver.txt and maybe the previous rolled version but you also get any ffmpeg logs, the hardware_detection.txt etc ?

In fact, this is something I've been thinking about for a long time already (meaning years), but there are quite a few considerations involved:
(random order)

Privacy: Sensitive Data in Logs

For a (way too) long time, we had sensitive data in the logs, even things like login credentials to external services. It turned out that some hackers were regularly scanning log files on this platform to publish such credentials on piracy sites, and our users were wondering why they were regularly kicked out from m3u streams.
An automated mechanism would have removed even the last bit of control users had to redact their logs manually before posting.
This was a showstopper for me as long as we didn't have automatic sanitation in place.

==> Now we have sanitation, and when that is perfected, it's no longer an issue

Encrypted Transmission of Support Data?

There was an idea to have the server encrypt the collected data package in a way that only "we" can decrypt and view the data, but there are three problems with that approach:

  • Users who are submitting the data would not be able to review what they have sent
  • Submitting users cannot be sure what "we" means exactly and who would get access to their data
  • It would negate the whole concept of having a community with users helping other users 
    (which wouldn't be possible anymore)

==> We don't want to do this

Initial Questionnaire Should be Included

In order to improve the effectiveness of support operations (in the specific case of playback/transcoding issues), it would be very important to have some initial questions answered without us needing to ask each time (e.g. does it work with hw accel disabled, does it work without subtitles, does it happen with a specific file, some files or all files, etc.)
Ideally, the feature would then automatically create a new forum topic (or amend an existing one).

But previously, this would have required to enter your forum credentials into Emby server (or the Diagnostics plugin providing that kind of feature), which is of course something that smells, and I understand everybody who wouldn't want to make this connection between server and forum account.

==> Since a recent update, the forum supports OAuth login, so the browser could authenticate to the forums via OAuth and make that post without Emby Server (or the plugin) being involved

Log Handling

There's another problem with this method of creating support information packages: it sidelines the user and may include a lot of unneeded and unrelated logs. Currently, users are often able to preselect and post just those logs that are relevant for the actual issue.
Without that pre-selection it can be much more difficult for us to understand where we need to look, and looking at many logs is no fun. After looking at four or five I've often already forgotten about the first one.

==> In the recent past, we have improved the tooling on our side to better manage reading and understanding submitted logs

Summary

Meanwhile, most of the obstacles have been reduced or resolved. I think this will come in one form or another, I'm just not sure exactly how and when...


rbjtech

@softworkz

Thanks for taking the time once again to expand on thoughts and ideas - it's good to get these out for possible discussion and understanding.

One item that may help users solve their problems is for the user to actually look at the log for themselves.  Let's be honest here, system logs are not user friendly at the best of times - and without some form of class or message filters, their use is very limited and frankly scary to somebody not used to seeing reams and reams of meaningless text.

I know you guys have developed tools to combat this from an external support perspective - but maybe include some really basic filtering in the web log view itself?

I personally use an external log viewer that allows filtering on the log class, message etc - making it so much easier to filter output - if this basic filtering could be included for the user, then it may cut down on a % of support posts.  For example a user could filter by 'LiveTV' if they are having live TV issues (as an example)... ;)

 

image.png.82e33e1ef3bca0e1d6f984fd7c138c26.png

 

Maybe add simple filters here in the web view to allow filtering by Date/Time, Class, Message etc - making a meaningless 10,000 row log into something more useful ?   

image.thumb.png.1647ed10f12e395f822cd9b87f87c1e7.png


A prerequisite for this would be to have a decent grid control in the client UI for presentation. We're about to adopt one for a different purpose, but it could be used for log display as well in the future (maybe).
Also this won't work well without some kind of database backend, because with larger log sizes (like dozens of MB), you can't filter effectively by scanning through the full log text each time. To be honest - I'm not sure whether we will want to spend that much effort on this subject. It's a bit more effort than it seems at first sight.

Maybe a good idea for a plugin?

2 hours ago, rbjtech said:

For example a user could filter by 'LiveTV' if they are having live TV issues (as an example).

The problem is that a filter like this doesn't give you the full story you need to see. E.g., when there's transcoding involved, you won't see it with this filter and you won't see http log entries related to that. Also, filtering by 'LiveTV' would get you "everything" from LiveTV, like all recordings, all playback sessions, etc.

The one thing that I am thinking about adding is an additional "column" with a certain context id which provides an additional filtering dimension, and would allow you to get all log messages related to - for example - a certain playback session, so you would get all log entries relating to exactly that single playback operation but from all kinds of "message classes" (your naming 😉 ).

Another adjustment will be a shortening of IDs. It's in no way necessary to have full GUIDs for transcode or other session ids, so for example:

ffmpeg-transcode-10df6e86-74a0-4a16-8f5f-78e197801376_1.txt can become something like  ffmpeg-transcode-10df6e_1.txt

which makes it look much more friendly and readable, and there's still only about a 1 in 16.7 million chance (6 hex digits, 16^6 values) that a newly generated random id matches a given existing one.

Maybe  ffmpeg-transcode-10DF6E_1.txt would be even better because it will make characters appear more evenly sized.
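
The arithmetic behind the shortened ids can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: `short_id` and `any_collision_probability` are names invented for the example, and how the server actually generates its ids is not stated in the thread.

```python
import secrets

HEX_DIGITS = 6
# Number of distinct 6-digit hex ids: 16 ** 6 = 16,777,216.
id_space = 16 ** HEX_DIGITS


def short_id():
    """Random 6-digit uppercase hex id, e.g. for ffmpeg-transcode-<id>_1.txt."""
    return secrets.token_hex(HEX_DIGITS // 2).upper()


def any_collision_probability(n):
    """Chance that at least two of n independent random ids are identical."""
    p_unique = 1.0
    for i in range(n):
        # The (i+1)-th id must avoid the i ids already drawn.
        p_unique *= (id_space - i) / id_space
    return 1.0 - p_unique
```

A new id matches one particular existing id with probability 1 in 16,777,216. The chance that any two ids in a set coincide grows faster (the birthday effect): for 1,000 ids kept at the same time it is roughly 3%, which only matters if that many log files are ever alive together.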


rbjtech
I love the idea of a context or 'thread' ID to match all the items related to that action - that would be very handy.

wrt a plugin - it's easy enough to just point log viewers to the embyserver.txt now that it conforms to industry logging standards.

I use ALV which is open source, so maybe that could be forked onto a plugin .. not sure .. something to put on the never ending to do list .. haha


20 minutes ago, rbjtech said:

I love the idea of a context or 'thread' ID to match all the items related to that action - that would be very handy.

In a lab project for something new, we have three-level IDs which look like this: D3EFF_4A4CB_66D2E. This is about a common operation which can be used by multiple sub-operations in parallel, and each of them can have multiple sub-sub-operations. The top-level operation has the ID D3EFF only, a second-level one D3EFF_4A4CB, for example.

The filtering will need a slight adaptation, so that when filtering for D3EFF_4A4CB_66D2E, you get the third-level entries plus D3EFF_4A4CB plus D3EFF but not other sub-levels. 😉 
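
The filtering rule described here can be sketched in a few lines of Python. This is just an illustration of the rule as stated, assuming each log entry is available as a (context id, message) pair; the function names are made up for the example.

```python
def ancestors_and_self(context_id):
    """D3EFF_4A4CB_66D2E -> {D3EFF, D3EFF_4A4CB, D3EFF_4A4CB_66D2E}."""
    parts = context_id.split("_")
    return {"_".join(parts[:i]) for i in range(1, len(parts) + 1)}


def filter_by_context(entries, context_id):
    """Keep entries logged by the given operation or one of its parents.

    entries is assumed to be a list of (context_id, message) tuples.
    """
    wanted = ancestors_and_self(context_id)
    return [entry for entry in entries if entry[0] in wanted]
```

Matching against the exact set of parent ids, rather than using a plain "starts with D3EFF" test, is what keeps sibling sub-operations such as D3EFF_99999 out of the result.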

29 minutes ago, rbjtech said:

wrt a plugin - it's easy enough to just point log viewers to the embyserver.txt now that it conforms to industry logging standards

I think that's really the best option for users who want this; it would require a huge effort to make it as good in the web UI as those applications are.
