Web Surfer Identification

Many techniques, both voluntary and involuntary on the user’s part, are available to identify web surfers. Every request (click or keystroke) carries over the web not only the information relevant to that particular request but also some highly revealing data in the information packet. This allows the destination server, and every server that the packet passes through on its journey and that is able to decipher it, to record the data in a database.

Voluntary identification techniques are those normally covered by an “ethical”, stated privacy policy: forms, registration, cookies, page counters and any other means by which the web surfer’s identity can be established with their permission. Involuntary identification techniques largely fall into a black hat classification in legal terms: user data is collected and used without that user’s agreement to a policy, user agreement and so on. This agreement can be implied and need not be express; for example, on a prominently displayed page entitled “Privacy Policy” it is common to see statements such as “Do not use this site if you do not agree to our policies, terms of service, etc.” The techniques listed as voluntary may therefore also be classified as involuntary if the user has not agreed to their use. In addition, some web sites use spyware and other malware to gain access to even more personal and sensitive information, to the point of enabling identity fraud.

Considering only the voluntary techniques, and setting aside data that users enter themselves, the information packets described above are enough to build a database profile of each visitor containing: the site they came from, their entire journey through the site in question (content viewed, clicks made), the browser and operating system they are using, the language(s) they prefer, their geographical location and Internet Service Provider (ISP) (both inferred from the IP address), and the search terms they used to arrive at the site.
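As a rough illustration of the profile described above, the following Python sketch collects the passively available fields of a single request into one record. The function names, the field names and the example values are hypothetical, and the sketch assumes that the referring search engine passes its search terms in a `q` query parameter, which not every engine does. Geographical location and ISP are left out because deriving them from the IP address requires an external lookup database.

```python
from urllib.parse import urlparse, parse_qs


def parse_languages(accept_language: str) -> list[str]:
    """Return language tags from an Accept-Language header, most preferred first."""
    weighted = []
    for part in accept_language.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without an explicit q-value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as least preferred
        weighted.append((tag.strip(), quality))
    # sorted() is stable, so tags with equal weight keep their header order
    return [tag for tag, _ in sorted(weighted, key=lambda item: -item[1])]


def build_visit_profile(remote_addr: str, path: str, headers: dict[str, str]) -> dict:
    """Collect the passively available fields of one request into a profile record."""
    referrer = headers.get("Referer", "")  # HTTP spells the header name this way
    query = parse_qs(urlparse(referrer).query)
    return {
        "ip_address": remote_addr,
        "page": path,
        "referrer": referrer,
        "search_terms": " ".join(query.get("q", [])),  # assumption: engine uses ?q=
        "user_agent": headers.get("User-Agent", ""),   # browser and OS are derived from this
        "languages": parse_languages(headers.get("Accept-Language", "")),
    }


# Hypothetical request, using documentation-only addresses and host names.
profile = build_visit_profile(
    "203.0.113.7",
    "/products/index.html",
    {
        "Referer": "http://search.example/results?q=cheap+flights",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) ExampleBrowser/1.0",
        "Accept-Language": "en-GB,en;q=0.8,fr;q=0.5",
    },
)
print(profile)
```

Storing one such record per request, keyed by visitor, would give the journey through the site as the ordered list of `page` values.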

With analysis software, any site can attempt to identify its visitors by IP address, or can attach the gathered data to a user account that the visitor has explicitly logged in to.
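The choice between the two identifiers can be sketched in a few lines of Python. The `visitor_key` function, the key prefixes and the sample requests are hypothetical and are not taken from any particular analysis package.

```python
from typing import Optional


def visitor_key(ip_address: str, account_id: Optional[str] = None) -> str:
    """Pick the identifier under which a request's data is filed."""
    if account_id:
        # The visitor is logged in, so the data can be tied to that account.
        return f"account:{account_id}"
    # Otherwise the IP address is only a best guess at who is visiting.
    return f"ip:{ip_address}"


# Pages viewed, grouped by whichever identifier was available per request.
profiles: dict[str, list[str]] = {}
for ip, account, page in [
    ("203.0.113.7", None, "/home"),
    ("203.0.113.7", "alice", "/orders"),
    ("198.51.100.4", "alice", "/orders/42"),
]:
    profiles.setdefault(visitor_key(ip, account), []).append(page)

print(profiles)
```

In this sketch the two logged-in requests are filed together even though they arrive from different IP addresses, while the anonymous request stays under its IP address.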

Where an accurate means of identification exists (login to an account, a long-lasting and undeleted cookie, identification by undetected malicious software, etc.), it is fairly easy for a well-designed database to maintain entity integrity, which by definition means that every record has a unique primary key, so that no duplicate records exist, and that the primary key is never null. Database software systems (such as MS SQL Server) can enforce constraints to ensure that records are not duplicated; however, this is only as good as the collected data itself. When relying solely on gathered information, the primary key (or a secondary key) is likely to be the IP address, as this is the only candidate for a unique key, and it can change depending on how and where the visitor is surfing from. Multiple visitors may use the same IP address (for example, if it is allocated dynamically by their ISP), and one visitor may use different machines, so that cookies and any other software method of identification are lost. The result is duplicate records for the same person in the database, and nothing in the database reveals that they are duplicates.
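Both halves of this argument can be demonstrated with a small Python sketch. It uses SQLite through the standard `sqlite3` module purely as a stand-in for a system such as MS SQL Server, and the table names, column names and visitors are hypothetical. `NOT NULL` is declared explicitly because SQLite otherwise permits null values in a non-integer primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def rejected(sql: str, params: tuple) -> bool:
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


# Case 1: a reliable identifier (a hypothetical account_id obtained from a login).
conn.execute(
    "CREATE TABLE account_profile (account_id TEXT NOT NULL PRIMARY KEY, browser TEXT)"
)
conn.execute("INSERT INTO account_profile VALUES ('alice', 'Firefox')")
duplicate_rejected = rejected(
    "INSERT INTO account_profile VALUES (?, ?)", ("alice", "Chrome")
)
null_rejected = rejected(
    "INSERT INTO account_profile VALUES (?, ?)", (None, "Chrome")
)

# Case 2: the IP address is the only available key.
conn.execute(
    "CREATE TABLE ip_profile (ip_address TEXT NOT NULL PRIMARY KEY, visits INTEGER NOT NULL)"
)
# (real person, IP address seen by the server); the site never sees the first column.
observed = [
    ("alice", "198.51.100.4"),  # Alice at home
    ("alice", "203.0.113.9"),   # Alice on another network
    ("alice", "192.0.2.77"),    # Alice on a third network
    ("bob", "198.51.100.4"),    # Bob, later allocated Alice's home address
]
for _person, ip in observed:
    conn.execute("INSERT OR IGNORE INTO ip_profile VALUES (?, 0)", (ip,))
    conn.execute(
        "UPDATE ip_profile SET visits = visits + 1 WHERE ip_address = ?", (ip,)
    )

rows = conn.execute(
    "SELECT ip_address, visits FROM ip_profile ORDER BY ip_address"
).fetchall()
print(duplicate_rejected, null_rejected, rows)
```

Every constraint holds in the second table, yet it contains three records for two people, and one of those records mixes the visits of both. The database satisfies entity integrity while the underlying data is still wrong.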

In conclusion, I agree somewhat with the CEO of Sun that there is no such thing as privacy on the web. In my opinion, using the Web carries an implied term of use: if you are concerned about privacy, don’t use it.

References

Coronel, Morris & Rob (2009) Database Systems: Design, Implementation, and Management (9th Edition). Cengage Learning.

Rosella (2005) Web Mining: Web search and Web navigation Pattern Analyzer [Online]. Available at http://www.roselladb.com/surf-pattern-analyzer.htm (Accessed 11 April 2010).

Webcredible (2005) Why track your visitor’s behaviour? [Online]. Available at http://www.webcredible.co.uk/user-friendly-resources/web-usability/track-visitors.shtml (Accessed 11 April 2010).