Patent References
Scatter-gather: a cluster-based method and apparatus for browsing large document collections
Method for learning to infer the topical content of documents based upon their lexical content
Efficient information collection method for parallel data mining
Method and apparatus for information access employing overlapping clusters
Method and system for data clustering for very large databases
System and method for data mining from relational data by sieving through iterated relational reinforcement
System for selecting multimedia databases over networks
Method of locating related items in a geometric space for data mining
Method and apparatus for reducing the computational requirements of K-means data clustering
Automatic subspace clustering of high dimensional data for data mining applications

Inventors

Assignee

Application
No. 607365 filed on 06/30/2000

US Classes:
707/6, Pattern matching access
704/245, Clustering
707/104.1, Application of database or data structure (e.g., distributed, multimedia, image)
707/201, Coherency (e.g., same view to multiple users)

Examiners
Primary: Amsbury, Wayne
Assistant: Havan, Thu-Thao

Attorney, Agent or Firm

International Class
G06F 007/04

Abstract
In one exemplary embodiment, the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated.
An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time data is read from the database, a holdout set of data is used to evaluate the then-current model as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K' to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
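
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It makes several simplifying assumptions that are not in the source: points are hard-assigned to the nearest cluster center, each cluster is summarized by a count, a per-dimension sum and a per-dimension sum of squares, candidate models are scored by mean holdout log-likelihood under a diagonal Gaussian mixture, and the holdout evaluation is done once after a full pass rather than after every read. All names (ClusterStats, fit_in_chunks, holdout_loglik, choose_k) are hypothetical.

```python
import math


class ClusterStats:
    """Sufficient statistics for one cluster: count, sum, sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, x):
        # Fold one point into the summary; the raw point need not be kept.
        self.n += 1
        for d, v in enumerate(x):
            self.sum[d] += v
            self.sumsq[d] += v * v

    def mean(self):
        return [s / self.n for s in self.sum]

    def var(self, floor=1e-3):
        # Per-dimension variance, floored to avoid degenerate clusters.
        return [max(q / self.n - (s / self.n) ** 2, floor)
                for s, q in zip(self.sum, self.sumsq)]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fit_in_chunks(chunks, init_centers):
    """Process the data one buffer-sized chunk at a time."""
    centers = [list(c) for c in init_centers]
    stats = [ClusterStats(len(centers[0])) for _ in centers]
    for buffer in chunks:
        for x in buffer:
            k = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            stats[k].add(x)
        # Update cluster parameters from the summaries; an empty cluster
        # keeps its previous center. The buffer can now be discarded.
        centers = [s.mean() if s.n else c for s, c in zip(stats, centers)]
    return centers, stats


def holdout_loglik(holdout, stats):
    """Mean log-likelihood of holdout points under a diagonal Gaussian mixture."""
    live = [s for s in stats if s.n > 0]
    total = sum(s.n for s in live)
    ll = 0.0
    for x in holdout:
        terms = []
        for s in live:
            logp = math.log(s.n / total)  # mixture weight
            for v, m, var in zip(x, s.mean(), s.var()):
                logp -= 0.5 * (math.log(2 * math.pi * var) + (v - m) ** 2 / var)
            terms.append(logp)
        top = max(terms)  # log-sum-exp for numerical stability
        ll += top + math.log(sum(math.exp(t - top) for t in terms))
    return ll / len(holdout)


def choose_k(chunks, holdout, candidate_inits):
    """Fit each candidate model and pick the K with the best holdout score.

    `chunks` must be re-iterable (e.g. a list), since it is read once per
    candidate; `candidate_inits` maps K to a list of K initial centers.
    """
    scores = {}
    for k, init in candidate_inits.items():
        _, stats = fit_in_chunks(chunks, init)
        scores[k] = holdout_loglik(holdout, stats)
    return max(scores, key=scores.get), scores
```

A call such as choose_k(chunks, holdout, {1: [(5.0,)], 2: [(1.0,), (9.0,)]}) returns the selected cluster number together with the per-candidate scores. A likelihood-based score is used here because raw distance to the nearest center tends to favor larger K even on held-out data.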
Field of Search
Pattern matching access
Manipulating data structure (e.g., compression, compaction, compilation)
Generating database or data structure (e.g., via user interface)
Application of database or data structure (e.g., distributed, multimedia, image)
Coherency (e.g., same view to multiple users)
Sorting
For partial translation
Clustering
Creating patterns for matching
Update patterns
Cluster analysis
Sequential decision process (e.g., decision tree structure)
With a multilevel classifier