Ask HN: How to get aggregated user behaviour without tracking an individual user

33 points by harianus 2 months ago

For simpleanalytics.io I don’t want to track individual users, but I would love to give businesses insights into the combined user behaviour.

A business could ask: "How many visitors converted from DDG to sign up and what is the average duration?" To be able to calculate the conversion between landing and signup you need to know the history of events.

Let's say we have a few events including:

- Page view event with referrer DDG

- Signup event

The data could look like this:

  [
    ['/','mysite.com','ddg.com'],
    ['signup',30]
  ]
Explained:

  [
    [event, your website, referrer],
    [event, duration since first event]
  ]
When an event happens I add it to a functional session cookie (expires after 30 min) and send the complete cookie to the API. The time of the first request is stored in another cookie and is never sent to the API.

The two requests from the above example look like this:

  [['/','mysite.com','ddg.com']]

  [['/','mysite.com','ddg.com'],['signup',30]]
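
The client side of this can be sketched roughly as follows. This is a minimal Python sketch, not the actual simpleanalytics.io script: the `SessionCookies` class and its fields are hypothetical stand-ins for the two browser cookies, durations are assumed to be whole seconds, and the 30 min expiry is assumed to count from the first event.

```python
import json

SESSION_TTL_SECONDS = 30 * 60  # the 30 min session cookie lifetime from the post


class SessionCookies:
    """Hypothetical stand-in for the two browser cookies described above."""

    def __init__(self):
        self.events = []        # event cookie: sent to the API in full
        self.first_seen = None  # timestamp cookie: never sent to the API

    def record(self, now, event, site=None, referrer=None):
        """Append an event and return the JSON body that would be sent."""
        expired = (
            self.first_seen is not None
            and now - self.first_seen > SESSION_TTL_SECONDS
        )
        if self.first_seen is None or expired:
            # First event of a session: [event, site, referrer]
            self.events = [[event, site, referrer]]
            self.first_seen = now
        else:
            # Later events only carry the duration since the first event,
            # never an absolute timestamp.
            self.events.append([event, int(now - self.first_seen)])
        return json.dumps(self.events, separators=(",", ":"))


cookies = SessionCookies()
print(cookies.record(0, "/", "mysite.com", "ddg.com"))
print(cookies.record(30, "signup"))
```

The two printed bodies correspond to the two requests shown above (with JSON double quotes instead of single quotes).
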
When the first request happens it gets added to the database (see row 95):

  id | time     | event  | site        | referrer | link
  94 | 20:30:20 | /      | mysite.com  | ddg.com  | NULL
  95 | NOW()    | /      | mysite.com  | ddg.com  | NULL <---
The second request contains the information of the first request. When a request comes in with more than one array item, the server looks for the previous events in the database. It looks for a row where event=/, referrer=ddg.com, site=mysite.com, and time is no more than 30 min ago: row 94. The table after adding the row will look like:

  id | time     | event  | site        | referrer | link
  94 | 20:30:20 | /      | mysite.com  | ddg.com  | a
  95 | 20:38:28 | /      | mysite.com  | ddg.com  | NULL
  96 | 20:30:50 | signup | mysite.com  |          | a    <---
The connected row can be 30 min off, but I think that's okay.
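
The server-side linking could be sketched like this. It is a minimal in-memory Python sketch under several assumptions that are not in the post: the `EventStore` and `Row` names are hypothetical, times are seconds since midnight, the oldest unlinked matching row inside the 30 min window is chosen, and link values are simple counters. It only covers the two-event case from the example; how longer chains should be matched is not specified above.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

WINDOW_SECONDS = 30 * 60  # matching window from the post


@dataclass
class Row:
    id: int
    time: int  # seconds since midnight, for brevity
    event: str
    site: str
    referrer: Optional[str]
    link: Optional[str] = None


class EventStore:
    """Hypothetical in-memory stand-in for the events table."""

    def __init__(self, first_id=1):
        self.rows = []
        self._ids = itertools.count(first_id)
        self._links = itertools.count(1)

    def _insert(self, time, event, site, referrer, link=None):
        row = Row(next(self._ids), time, event, site, referrer, link)
        self.rows.append(row)
        return row

    def handle(self, payload, now):
        """Store the newest event of a payload and link it if possible."""
        first_event, site, referrer = payload[0]
        if len(payload) == 1:
            return self._insert(now, first_event, site, referrer)
        # Any unlinked row that looks like the first event and falls inside
        # the window will do; it need not belong to the same visitor.
        candidates = [
            r for r in self.rows
            if (r.event, r.site, r.referrer) == (first_event, site, referrer)
            and r.link is None
            and 0 <= now - r.time <= WINDOW_SECONDS
        ]
        if not candidates:
            return None
        anchor = min(candidates, key=lambda r: r.time)
        anchor.link = f"link-{next(self._links)}"
        event, duration = payload[-1]
        # The new row's time is derived from the anchor, not from `now`.
        return self._insert(anchor.time + duration, event, site, None, anchor.link)


store = EventStore(first_id=94)
landing = ["/", "mysite.com", "ddg.com"]
store.handle([landing], now=73820)                            # 20:30:20, row 94
store.handle([landing], now=74308)                            # 20:38:28, row 95
signup = store.handle([landing, ["signup", 30]], now=74338)   # row 96
print(signup)
```

Because the signup row takes its time from the matched landing row plus the reported duration, the stored time (20:30:50 here) can differ from the real arrival time by up to the window length.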

Do you think this is acceptable from a privacy perspective?

Chrissvo 2 months ago

That's a nice challenge!

If you're super distrustful you could argue that you should never store a timestamp with a signup event, because it could potentially reveal a user's identity...

Here's a crazy thought, what if you would do this:

1. You fire off a default first event, say "init". On the server you generate a PGP key pair, store the private key with the init-event, and return the public key

2. Second event (first real event) is fired by the website owner and encrypted with the PGP public key from 1

3. On the server you try to decrypt event #2 with all available active private keys (stored with init-events)

4. Once a solution is found you link the 2nd event to the 1st event, delete the private key of the 1st event, generate a new PGP key pair, store private key with 2nd event, and return the new public key

5. Third event is encrypted with the public key of 4 and...

No need to store timestamps, and all traffic is encrypted. Now, how to make step 3 fast?

  • harianus 2 months ago

    Thanks! I like the way you think.

    2. I think PGP encryption is pretty heavy and maybe not great for the performance of a script that loads on a lot of websites.

    3. I'm not sure how fast this will be. Especially on a very busy website with lots of page views per second.

    Basically you could also store a variable with the event and send that variable back. What would be the added value of using PGP encryption?
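
The "store a variable and send it back" alternative might look like the sketch below. This is a minimal Python sketch that swaps the PGP key pairs for random one-time tokens, so it is not the scheme Chrissvo proposed; `TokenChain` and its methods are hypothetical names. The lookup in step 3 becomes a dictionary hit instead of trial decryption, but the events themselves are not encrypted.

```python
import secrets


class TokenChain:
    """Hypothetical server-side store linking events via one-time tokens."""

    def __init__(self):
        self.events = {}    # event id -> event name
        self.links = []     # (previous event id, event id)
        self._pending = {}  # token -> id of the event awaiting a successor
        self._next_id = 1

    def record(self, event, token=None):
        """Store an event, link it to its predecessor, return a fresh token."""
        event_id = self._next_id
        self._next_id += 1
        self.events[event_id] = event
        # A token is valid once: it is removed as soon as it is used.
        previous = self._pending.pop(token, None) if token is not None else None
        if previous is not None:
            self.links.append((previous, event_id))
        new_token = secrets.token_urlsafe(16)
        self._pending[new_token] = event_id
        return new_token


chain = TokenChain()
t1 = chain.record("init")
t2 = chain.record("signup", token=t1)
print(chain.links)
```

Since the token changes after every event, no single value identifies the whole session, although the server can still follow the chain of links.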

harianus 2 months ago

I don't want to use a session cookie with an ID to link all events. I don't want any ID because I could potentially link those IDs together in the back end based on IP (I don't, but I want people not to have to trust me). I want to make sure I don't get any data that my system could misuse.

  • vokep 2 months ago

    I think maybe a good way to go is a compromise. Since you're already making an effort to protect privacy without needing people to trust you, that's already a good start. But maybe you need some kind of ID to tie behavior together, so you do record one temporarily, until you've processed it into an aggregate (anonymized individual behaviors).

    Basically, train a machine learning model on the data of individuals. You don't want to overfit, or the model could be de-anonymizable, but a slightly underfit model could capture most of the important patterns while throwing out most of the identifying aspects.

    The hard part then becomes finding a way to demonstrate this is actually happening so that you can be trusted. Unfortunately I can't think of a provable way, since you pretty much either can track users by IDs or not. And if you do, then trust has to be assumed.

    • harianus 2 months ago

      But with my solution in the main Ask HN I don't need any ID. So why should I not do it that way?

  • JadeNB 2 months ago

    I think that the problem is, while you have near-total control over the information you collect, and can carefully consider its interactions, you have no control over the interaction of your information with other publicly available information. For example, the famous AOL de-anonymisation (https://techcrunch.com/2006/08/06/aol-proudly-releases-massi...) did not (I think it is accurate to say) rely on any metadata attached to the queries, only on the queries themselves.

    • harianus 2 months ago

      While I can understand this being true for AOL:

      > The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

      I don't think I have a similar issue with the page views of one session on one website. I strip all query params and only save the hostname and path of the URL. I think it's nearly impossible to ever link that to a user. Maybe if you have very few users, but even then you don't get personal info.
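
The stripping described in this comment can be sketched with the Python standard library. The function name `strip_url` is hypothetical and the actual script may normalise URLs differently; this only shows dropping the query string and fragment while keeping hostname and path.

```python
from urllib.parse import urlsplit


def strip_url(url):
    """Keep only the hostname and path; drop scheme, query and fragment."""
    parts = urlsplit(url)
    # An empty path (e.g. "https://mysite.com") is treated as the root page.
    return (parts.hostname or "") + (parts.path or "/")


print(strip_url("https://mysite.com/pricing?token=abc123#plans"))
```

Anything that a site encodes in the path itself survives this step.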

      • chatmasta 2 months ago

        Why do you consider the path of the URL to be any less sensitive than the query parameters? Many websites use dynamic paths that may as well be query parameters.

        • harianus 2 months ago

          We are going way off topic here, too bad there is not even one answer to my question:

          > Do you think this is acceptable from a privacy perspective?

          But back to your point. Query params usually contain tokens, search queries, and IDs. This is not so much the case for paths. I think you agree with that. But indeed, paths can contain sensitive information too.

          How would you prevent that data from being sent to my server?

          • chatmasta 2 months ago

            Maybe you should allow the user to provide some sort of regex mask on URLs, or some sort of rule engine for which parts to keep or strip.
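
A regex mask of that kind could be sketched as below. This is a minimal Python sketch: the rule format (a list of pattern and replacement pairs), the `mask_path` name, and the example rules are all assumptions for illustration, not an existing simpleanalytics.io feature. In practice the masking would have to run in the visitor's browser so the raw path never reaches the server.

```python
import re

# Hypothetical rules a site owner might supply: (pattern, replacement).
RULES = [
    (r"/users/[^/]+", "/users/:id"),  # hide user names or IDs
    (r"/\d+(?=/|$)", "/:n"),          # hide purely numeric path segments
]


def mask_path(path, rules):
    """Apply each masking rule in order and return the masked path."""
    for pattern, replacement in rules:
        path = re.sub(pattern, replacement, path)
    return path


print(mask_path("/users/jane.doe/orders/1234", RULES))
print(mask_path("/pricing", RULES))
```

Rules are applied in order, so a more specific rule should come before a more general one.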

harianus 2 months ago

And how would it look from a privacy perspective if I set a cookie for 90 days? I can't link this to any personal information, and my customers will only see my tool where they can see the conversions (they don't get access to the "link" column in the tables above).

nartz 2 months ago

Differential privacy.

  • harianus 2 months ago

    Not really, that is more for when you have sensitive data and want to show that data publicly. I want to have only non-sensitive data and make sure I don't get sensitive data from the visitors.