前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Data linkage and privacy

Data linkage and privacy

原创
作者头像
403 Forbidden
发布2021-05-19 14:26:25
4020
发布2021-05-19 14:26:25
举报
文章被收录于专栏:hsdoifh biuwedsyhsdoifh biuwedsy

Lectures 18-20: Data linkage and privacy

-understanding what the record (data) linkage problem is

  • combining related/equivalent records across data sources

-understand why matching a database against itself can be regarded as a data linkage task

  • business wishes to carry out an advertising campaign
  • the customer database changes over time, people move address, change their names.
  • duplicate records about individuals - business wishes to know if the same person appears more than once.

-be able to explain where record linkage is applied and why

Why:

  • For businesses purpose, need to identify if two or more records fefer to the same individual.
  • build connection between the similar record in separate table

-appreciate why record linkage can be difficult

Privacy

-be able to describe the methodology of blocking, as applied to record linkage, explain why it is useful and why it is challenging

Blocking

  • process
    • preprocessing
      • clean the data
    • Blocking
      • represent complex records as simple values(blocks). Only score records with simple value in common.
    • Scoring
      • comparing two records and asses their similarity
      • can use Jaccard similarity or edit distance here.
    • Matching
      • match sufficiently similar records
    • Merging
      • merge two groups, resolve conflicting attributes
  • Why?
    • since blocking is much faster than linear scan and compare (n*m), but blocking only takes (m/b*n/b*b = m*n/b)
  • challenges
    • to determine the size of b
      • when b is too small the speed is not significantly improved
      • when b is too large the accuracy will drop dramatically
      • for a good blocking strategy(uniform spread across blocks), then accuracy may be almost as good as strategy without blocking, but much much faster.

-understand the record linkage pipeline of preparation, blocking, scoring, matching and merging

  • Record linkage
    1. preparation
    • Clean records
  1. 8J9I
  2. Blocking
  • Represent complex records as simple values (blocks)
  • Only score records with block in common
  • Key words:
  • Process:
    • Each record from A allocated to one or more of the blocks
    • Each record from B allocated to one or more of the blocks
    • Within each block, compare the records from A against those from B & find those that match
    • If two records are not assigned to the same block = they are not a match
  • Number of blocks ??
    • Speed ??
    • Accuracy ?? (miss other correct blocks)
  1. score
  • Compare two records -> assess their similarity
  • Jaccard similarity

Treat si as a set of words

| si | = count number of words in set

Denominator normalises the score based on how long each word is

  • Edit distance
    • Minimum number of character insertions, deletion & substitutions to go from s1 to s2
    • Expensive to compute
    • Score Combination
      • Idea 1: sum up feature scores
      • Idea 2: + similarities, - dissimilarities
      • Idea 3: weighted sum
      • Idea 4: label examples of non-matching record pairs, train a classifier using machine learning
  1. matching
  • Match sufficiently similar records
  • Threshold
    • i.e. Match when score > 0.5
  1. merging
  • Merge matched records
  • Resolve conflicting attributes

-be able to explain in what circumstances privacy is an important issue for data linkage

  • When important?
    • Matched data is being passed to another organisation or being made public
    • Data matching is being conducted across databases from different organisation.

-understand the objective of privacy preserving data linkage

  • the objective of privacy preserving
    • without revealing any information about individuals who do not get linked across the databases

-understand the use of one way hashing for exact matching in a 2 party privacy preserving data linkage protocol

  • One way hashing
    • also means non invertible hash function
    • for each organisation, applies a one way hash function to the attribute used to join the databases and shares its hashed values with other organisation. Each checks which ones match. These are linked records.
    • Disadvantages
      • little big difference results in a completely different output
      • Dictionary attack: an organisation could mount a dictionary attack to "invert" the hash function. for example, organisation A scans the hashed values received from Organisation B. Checks if any match its hash dictionary. If yes, privacy is lost for those items.
  • 2 party protocal
    • Each organisation
      • Applies a one-way hash function to the attribute used to join the databases
      • Share its hashed values with the other organisation
      • Each checks which one matches
    • Disadvantage
      • single character differences in the original value -> completely different hashed value
      • dictionary attack: hash value could be inverted

-understand the vulnerabilities of 2 party privacy preserving data linkage protocol to i) small differences in input, ii) dictionary attack

-understand the operation of the 3 party protocol for privacy preserving linkage, using hash encoding with salt for exact matching. Understand the disadvantages of this protocol

3 party protocal

  • Organisation C = trusted 3rd party
  • Organisations A & B send their hashed values to Organisation C
  • Organisation C checks for matches
  • If Organisation C is malicious -> using salt
    • Organisations A & B concatenate a secret word (salt) to every name field in their data before hashing
    • Organisation C does not know what this word is -> cannot perform a dictionary attack

Third party protocol

  • using salt
    • organisation A and B concatenate a secret word to every name field in their data before hashing(known as salt). Organisation C does not know what this word is and thus can't perform a dictionary attack to "reverse" the hashed values it received. When we send
    • disadvantages
      • may not robust to frequent attck,that is, 3rd party compares the distribution of hashed values to some known distribution. E.g. distribution of surename frequencies in a public database versus distribution of hash value. To prevent this, we also need to add some random record.
      • adding salt does not help two parties protocol
      • doesnt help approximate matching

-understand when and in what circumstances dictonary attacks and frequency attacks might be mounted against 2 party and 3 party data linkage protocols

-understand the steps of the 3 party protocol for privacy preserving data linkage with approximate matching (using Bloom filters)

Bloom filter

  • how it works
    • A bloom filter is an array of n bits, with all bits initially set to zero. We may store strings in the bloom filter by using hash functions H1...Hk to turn on certain bit. If a bit was set to 1 before, no change is made.
    • similarity is (number of bits set to 1 in bot bloom filter) / (sum of 1 bitss in bot bloom filter)
    • need to set the thread for similar comparison.
  • how to protect string
    • Very hard for third party to guess the exact word by only using the bloom filter but still perform similar to exact string compare.
  • notice that chose a proper size of bloom filter and number of hash function is very important, when bloom filter is too small or use too many hash functions the comparison is useless since almost all string has full 1 bits in bloom filter.

-understand how to compute similarity of two strings based on 2-grams and why this method is useful

Similar match

  • using 2-grams to calculate the approximate similarity
    • 2*(number of 2 common 2-grams)/ (total number of 2-grams in both string)
  • easy comparison and effective method
  1. approximate similarity
  • two strings X and Y
  • Convert each string into 2-grams ( lower case and list all substring of length 2)
  • 0 <= sim(X,Y) <= 1

h = number of common 2-grams

x = number of 2-grams in X

y = number of 2-grams in Y

  • E.g.

-understand how a Bloom filter works (how it is created, how strings are inserted, how strings are checked for membership)

  1. bloom filter
  • An array of I bits & all bits initially set to 0
  • Store strings in the bloom filter by using hash functions H1…Hk to turn on certain bits
    • Each hash function Hi (1<=i<=k) maps an input string X to a value in the range [1,i-1]
    • To store a string X in the bloom filter, set array index Hi(X) in the bloom filter to 1 for each Hi(X)
    • Hash vale = index
  • Example: k=2, I = 25
  • If a bit was set to 1 before, no change is made
  • Check if a string is present:
    • If at least one of the bits is set to 0, then X is definitely not a member of the bloom filter
    • If all of the bits are set to 1 -> X appears to be a member of the bloom filter, but it might be false positive (same bloom filter -> different words)
  • Google Chrome Browser
    • use bloom filter to store malicious URLs
    • when visiting a website the browser checked if that URL is in the filter
      • if the answer is yes -> browser would query Google’s servers for more detailed information.
      • Otherwise, browsing continues as normal
    • The bloom filter was a compressed data structure of all malicious URLs Google new about, stored by the browser
      • Much less space than maintaining a full database of URLs
  • Comparing two strings for approximate similarity
    • Given two strings: X and Y
    • 2-grams of X are stored in bloom filter B1
    • 2-grams of Y are stored in bloom filter B2
    • Both bloom filters are the same length & use same hash function
    • If two strings have a lot of 2-grams in common -> their bloom filters will have a large number of identical bit positions set to 1

-understand how a Bloom filter provides a private representation for strings

  • A and B agree on a bit array length I and k hash functions
    • also agree on a salt and use it in their hash functions
  • A and B many add dummy records to lessen the chance of C mounting a frequency attack
  • May be problems with using bloom filters represent short strings (not very secure)
    • Might add random 2-grams / random bits to the bloom filters (for extra security)

-understand how Bloom filters can be used to compare two strings for approximate similarity and the formula for doing this (Dice similarity coefficient)

  1. similarity of bloom filter

h = number of bits set to 1 in both bloom filters

b1 = number of bits set to 1 in bloom filter B1

b2 = number of bits set to 1 in bloom filter B2

  • Need to choose a threshold value for deciding whether to report that X matches Y

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
数据库
云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档
http://www.vxiaotou.com