Some scientists even claim that only the discovery of “cleaning”, and of cleaning tools such as the broom and the rag, really made civilization possible. Before that, people had to live “on the move” or burn down their simple wooden houses from time to time because they were full of garbage and parasites.
Well, this is a joke (maybe not that big a one), but in the area of “big data”, cleaning really is the most basic and most time-consuming activity we must do, because big data are messy. Values are incomplete, missing, unreadable…
Some people point out that big data does not always mean “huge” data. It means “difficult” data: mostly unstructured, and growing at great speed.
Therefore, before we start any analysis, we need to look at the data and clean out the biggest nonsense and the critical gaps. This, of course, is also dangerous, because we could easily “clean away” some interesting things.
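As a rough illustration of "look first, clean carefully", here is a minimal Python sketch. Everything in it is an assumption made up for this example: the sample rows, the markers treated as missing, the function names and the 0 to 120 age range. The idea is only to count the gaps and to flag strange values instead of deleting them, so that nothing interesting gets "cleaned away" by accident.

```python
import math

# Hypothetical sample data, invented for this illustration.
rows = [
    {"age": 34, "city": "Prague"},
    {"age": None, "city": ""},
    {"age": 250, "city": "Brno"},
    {"age": 41, "city": "N/A"},
]

# Assumed list of strings that stand for "no value"; adjust to your data.
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def is_missing(value):
    """Return True for None, NaN, or a string that marks a missing value."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() in MISSING_MARKERS:
        return True
    return False


def profile_missing(data, columns):
    """Count missing values per column without changing the data."""
    counts = {column: 0 for column in columns}
    for row in data:
        for column in columns:
            if is_missing(row.get(column)):
                counts[column] += 1
    return counts


def flag_suspicious(data, column, low, high):
    """Return copies of the rows with a 'suspicious' flag instead of dropping them."""
    flagged = []
    for row in data:
        value = row.get(column)
        is_number = isinstance(value, (int, float)) and not is_missing(value)
        suspicious = is_number and not (low <= value <= high)
        flagged.append({**row, "suspicious": suspicious})
    return flagged


if __name__ == "__main__":
    print(profile_missing(rows, ["age", "city"]))
    for row in flag_suspicious(rows, "age", 0, 120):
        print(row)
```

Flagging keeps the original rows untouched, so a value like an age of 250 can be checked by a person later: it may be a typing error, or it may be the interesting thing.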
Basic big data cleaning techniques and tools: