Sort the dataset by collect date in descending order so that the latest record is ranked at the top.
Keep only the single top-ranked record, which is the latest data collected as of today. This is the de-duplication step; its output is written to HDFS. |
|
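
The sort-and-keep-top step described above can be sketched in plain Python. This is only an illustrative sketch, not the pipeline's actual job: the field names `id`, `collect_date`, and `value`, the function name `deduplicate_latest`, and the idea of de-duplicating per `id` are all assumptions, and writing the result to HDFS is left out.

```python
from datetime import date


def deduplicate_latest(records, key_field="id", date_field="collect_date"):
    """Keep only the most recently collected record for each key.

    key_field and date_field are hypothetical column names.
    """
    # Sort newest first so the latest record of each key is ranked at the top.
    ranked = sorted(records, key=lambda r: r[date_field], reverse=True)
    latest = {}
    for record in ranked:
        # setdefault keeps the first (top-ranked) record seen for each key
        # and ignores the older duplicates that follow.
        latest.setdefault(record[key_field], record)
    return list(latest.values())


# Hypothetical sample data: two records share the key "A".
records = [
    {"id": "A", "collect_date": date(2020, 1, 1), "value": 10},
    {"id": "A", "collect_date": date(2020, 1, 3), "value": 30},
    {"id": "B", "collect_date": date(2020, 1, 2), "value": 20},
]
deduplicated = deduplicate_latest(records)
```

In a distributed job the same idea is usually expressed as a ranking over each key ordered by collect date, keeping rank 1; the in-memory version here only shows the logic.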