Stochastic data mining

Is it time to rethink how we manage fast-growing databases?

Think of this:

CPU speed doubles every 18 months according to Moore's Law

Hard disk storage capacity doubles every 9 months (according to an estimate I saw in "Scientific American" many years ago).

Database size doubles every 6 months in the hottest fields of bioinformatics (this number is somewhat elusive).

This exponential growth leads to a conclusion: every 3 years you get a 4-fold increase in computational power, a 16-fold increase in disk storage capacity, and a 64-fold increase in information. Information overload is inevitable, since the ratio of database size to computational power still grows 16-fold every 3 years.
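The 4/16/64 figures follow directly from the doubling periods above. A minimal sketch of the arithmetic (the function name is mine, not from the post):

```python
# Growth factor over a time horizon, given a doubling period (both in months).
def growth_factor(doubling_months, horizon_months=36):
    return 2 ** (horizon_months / doubling_months)

cpu  = growth_factor(18)  # CPU doubles every 18 months -> 4x over 3 years
disk = growth_factor(9)   # disk doubles every 9 months -> 16x over 3 years
data = growth_factor(6)   # data doubles every 6 months -> 64x over 3 years

print(cpu, disk, data)  # 4.0 16.0 64.0
print(data / cpu)       # 16.0 -- data outpaces CPU 16-fold every 3 years
```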

Stochastic sampling may be the most viable option left in the long run. The idea is not new: square-root sampling has long been used in industry, and it has been discussed in database management for many years.

The only drawback of this method is that you can never be sure you have found the best answer; you get a range of probable answers instead. You cannot guarantee the global optimum, but you can get a good local optimum within a reasonable time.

In the case of bioinformatics information processing, I believe we have to go down to cube-root sampling.

The reason is this: over every 3-year period, CPU power grows by a factor of 4 while information grows by a factor of 64. Cube-root sampling reduces the effective growth of the processed information to the cube root of 64, i.e., 4. The growth on both sides would always be equal.
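The balancing argument can be checked numerically: if we only ever process N^(1/3) of the N records, a 64-fold growth in N produces only a 4-fold growth in workload, matching the 4-fold CPU speedup over the same 3 years. A small sketch (the starting size is an arbitrary assumption):

```python
# Workload under cube-root sampling: process only N**(1/3) of N records.
def cube_root_workload(n):
    return n ** (1.0 / 3.0)

n0 = 10**9        # database size today (arbitrary example)
n1 = 64 * n0      # database size 3 years later (64-fold growth)

ratio = cube_root_workload(n1) / cube_root_workload(n0)
print(round(ratio))  # 4 -- workload grows only 4x, same as CPU power
```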

It means "No Bottleneck, Forever."


This note was written on GotoKnow in Brief Note.

Keywords (Tags): #information #sampling #algorithm #cube-root #bioinformatics

Note number: 54135. Written: 11 Oct 2006 @ 10:25; last edited: 11 Feb 2012 @ 16:05. License: all rights reserved.
