Varying cluster number in a scalable clustering system for use with large databases

Patent 6449612 Issued on September 10, 2002. Estimated Expiration Date: June 30, 2020. Estimated Expiration Date is calculated based on simple USPTO term provisions. It does not account for terminal disclaimers, term adjustments, failure to pay maintenance fees, or other factors which might affect the term of a patent.

Patent References

Scatter-gather: a cluster-based method and apparatus for browsing large document collections
Patent #: 5442778
Issued on: 08/15/1995
Inventor: Pedersen, et al.

Method for learning to infer the topical content of documents based upon their lexical content
Patent #: 5687364
Issued on: 11/11/1997
Inventor: Saund, et al.

Efficient information collection method for parallel data mining
Patent #: 5758147
Issued on: 05/26/1998
Inventor: Chen, et al.

Method and apparatus for information access employing overlapping clusters
Patent #: 5787422
Issued on: 07/28/1998
Inventor: Tukey, et al.

Method and system for data clustering for very large databases
Patent #: 5832182
Issued on: 11/03/1998
Inventor: Zhang, et al.

System and method for data mining from relational data by sieving through iterated relational reinforcement
Patent #: 5884305
Issued on: 03/16/1999
Inventor: Kleinberg, et al.

System for selecting multimedia databases over networks
Patent #: 5920856
Issued on: 07/06/1999
Inventor: Syeda-Mahmood

Method of locating related items in a geometric space for data mining
Patent #: 5930784
Issued on: 07/27/1999
Inventor: Hendrickson

Method and apparatus for reducing the computational requirements of K-means data clustering
Patent #: 5983224
Issued on: 11/09/1999
Inventor: Singh, et al.

Automatic subspace clustering of high dimensional data for data mining applications
Patent #: 6003029
Issued on: 12/14/1999
Inventor: Agrawal, et al.

Inventors

Assignee

Application

No. 607365 filed on 06/30/2000

US Classes:

707/6, Pattern matching access
704/245, Clustering
707/104.1, Application of database or data structure (e.g., distributed, multimedia, image)
707/201, Coherency (e.g., same view to multiple users)

Examiners

Primary: Amsbury, Wayne
Assistant: Havan, Thu-Thao

Attorney, Agent or Firm

International Class

G06F 007/04

Abstract

In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. A portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters in a clustering model. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Each time data is read from the database, a holdout set of data is used to evaluate the then-current model as well as other possible cluster models chosen from a candidate set of cluster models. The evaluation of the holdout data set allows a cluster model with a different cluster number K' to be chosen if that model more accurately models the data based upon the evaluation of the holdout set.
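
The loop described in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes one-dimensional data, hard (nearest-mean) assignment, summarization of every buffered point rather than only some, and an average Gaussian-mixture log-likelihood as the holdout score, and it omits the stopping criteria. All names (`Cluster`, `init_model`, `update_model`, `holdout_score`, `cluster_stream`) are hypothetical.

```python
import itertools
import math
import random


class Cluster:
    """One 1-D cluster kept only as sufficient statistics (count, sum, sum of squares)."""

    def __init__(self, seed_mean):
        # Assumption: the seed value is counted as the cluster's first point.
        self.n, self.s, self.ss = 1, seed_mean, seed_mean * seed_mean

    @property
    def mean(self):
        return self.s / self.n

    @property
    def var(self):
        # Floor the variance so a near-empty cluster cannot yield a degenerate density.
        return max(self.ss / self.n - self.mean ** 2, 1e-2)

    def absorb(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x


def init_model(first_chunk, k):
    """Seed k clusters at evenly spaced quantiles of the first buffer."""
    ordered = sorted(first_chunk)
    return [Cluster(ordered[(2 * i + 1) * len(ordered) // (2 * k)]) for i in range(k)]


def update_model(model, buffer):
    """Fold each buffered point into its nearest cluster; the raw point is then dropped."""
    for x in buffer:
        min(model, key=lambda c: abs(x - c.mean)).absorb(x)


def holdout_score(model, holdout):
    """Average log-likelihood of the holdout set under a Gaussian mixture."""
    total = sum(c.n for c in model)
    score = 0.0
    for x in holdout:
        p = sum(
            (c.n / total)
            * math.exp(-0.5 * (x - c.mean) ** 2 / c.var)
            / math.sqrt(2 * math.pi * c.var)
            for c in model
        )
        score += math.log(max(p, 1e-300))  # guard against underflow to zero
    return score / len(holdout)


def cluster_stream(chunks, holdout, candidate_ks):
    """Update one model per candidate K from each buffer and re-pick K on the holdout set."""
    chunks = iter(chunks)
    first = next(chunks)
    models = {k: init_model(first, k) for k in candidate_ks}
    best_k, scores = None, {}
    for buffer in itertools.chain([first], chunks):
        for model in models.values():
            update_model(model, buffer)
        # Re-evaluate every candidate after each read; K may change between reads.
        scores = {k: holdout_score(m, holdout) for k, m in models.items()}
        best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(1000)]
    rng.shuffle(data)
    holdout, stream = data[:300], data[300:]
    chunks = (stream[i:i + 200] for i in range(0, len(stream), 200))
    print(cluster_stream(chunks, holdout, candidate_ks=[1, 2, 3, 4, 5]))
```

Keeping one model per candidate K in parallel is a simplification chosen for brevity; it makes the holdout comparison direct at the cost of updating every candidate on every buffer.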

Other References

  • J. Banfield and A. Raftery, "Model-based Gaussian and Non-Gaussian Clustering", Biometrics, vol. 39:803-821, pp.15-34, (1993)
  • P.S. Bradley, O.I. Mangasarian, and W.N. Street, "Clustering via Concave Minimization", in Advances in Neural Information Processing Systems 9, M.C. Mozer, M.I. Jordan, and T. Petsche (Eds.), pp. 368-374, MIT Press, (1997)
  • P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results", in [FPSU96], pp. 153-180. MIT Press, (1996)
  • A.P. Dempster, N.M. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data Via the EM Algorithm", Journal of the Royal Statistical Society, Series B, 39(1):pp. 1-38. (1977)
  • U. Fayyad, D. Haussler, and P. Stolorz. "Mining Science Data", Communications of the ACM 39(11), (1996)
  • D. Fisher, "Knowledge Acquisition via Incremental Conceptual Clustering". Machine Learning, 2:139-172, (1987)
  • E. Forgy, "Cluster Analysis for Multivariate Data: Efficiency vs. Interpretability of Classifications", Biometrics 21:768 (1965)
  • Jones, "A Note on Sampling from a Tape File", Communications of the ACM, vol. 5, (1962)
  • T. Zhang, R. Ramakrishnan, and M. Livny, "Birch: A New Data Clustering Algorithm and its Applications", Data Mining and Knowledge Discovery, vol. 1, No. 2, (1997)
  • Radford M. Neal and Geoffrey E. Hinton, "A View of the EM Algorithm That Justifies Incremental, Sparse and Other Variants", (date unknown)
  • Bo Thiesson, Christopher Meek, David Maxwell Chickering and David Heckerman, "Learning Mixtures of DAG Models", Technical Report MSR-TR-97-30, Dec. 1997, revised May 1998
  • S.Z. Selim and M.A. Ismail, "K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-6, No. 1, (1984)