Outline
8.1 Data mining standards and specifications
8.2 Data mining tools
8.3 Research trends in data mining

8.1 Data mining standards and specifications

... has much in common with the established CRISP-DM.

Data mining standards

CRISP-DM: Cross Industry Standard Process for Data Mining
- SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM: phases and deliverables
- Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan
- Data Understanding: initial data collection report; data description report; data exploration report; data quality report
- Data Preparation: data description report; data pre-processing steps
- Modeling: modeling assumptions; test design; model description; model assessment (including validation)
- Evaluation: assessment of data mining results with respect to objectives
- Reporting (Deployment): final report
  - Summary: objectives, data mining process, data mining results, data mining assessment
  - Conclusions
  - Future work

CRISP-DM is:
- a widely accepted process model for data mining
- a framework for describing the modeling process in detail
- regarded as "best practice"

Business Understanding Phase
- Understand the business objectives
  - What is the status quo?
  - Understand business processes
  - Associated costs/pain
- Define the success criteria
- Develop a glossary of terms: speak the language
- Cost/benefit analysis
- Current systems assessment
- Identify the key actors
  - Minimum: the sponsor and the key user
- What forms should the output take?
- Integration of output with the existing technology landscape
- Understand market norms and standards

Business Understanding Phase (continued)
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical, financial, business, organisational)

Data Understanding Phase
- Collect data
  - What are the data sources?
  - Internal and external sources (e.g. Axiom, Experian)
  - Document reasons for inclusion/exclusion
  - Depend on a domain expert
  - Accessibility issues
  - Are there issues regarding data distribution across different databases/legacy systems?
  - Where are the disconnects?

Data Understanding Phase (continued)
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation Phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
- Select data
  - Attribute subset selection
  - Rationale for inclusion/exclusion
  - Data sampling
  - Training, validation, and test sets

Data Preparation Phase (continued)
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values and outliers
- Data construction
  - Derived attributes

The Modeling Phase
- Build model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

The Evaluation Phase
- Validate model
  - Human evaluation of results by domain experts
  - Evaluate usefulness of results from a business perspective
  - Define control groups
  - Calculate lift curves
  - Expected return on investment
- Review process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML: Predictive Model Markup Language
Data mining applications often need several kinds of data mining software and algorithms to work together, so mined models must be easy to inherit, reuse, and integrate.
- The Data Mining Group (DMG) proposed the PMML language for this purpose.
- The latest PMML version at the time of writing is 4.1. It supports 16 model types:
  - AssociationModel (association rules)
  - BaselineModel (baseline model)
  - ClusteringModel (clustering)
  - GeneralRegressionModel (general regression)
  - MiningModel (composite model)
  - NaiveBayesModel (naive Bayes)
  - NearestNeighborModel (nearest neighbor)
  - NeuralNetwork (neural network)
  - RegressionModel (linear, polynomial, and logistic regression)
  - RuleSetModel (rule set)
  - SequenceModel (sequence patterns)
  - Scorecard
  - TimeSeriesModel
  - SupportVectorMachineModel (support vector machine)
  - TextModel (text model)
  - TreeModel (decision tree)

A PMML model definition consists of the following parts:
- Header: contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as name and version. Example: PMML version="3.2".
- Data dictionary: records information about the data fields from which the model was built. Example: DataField name="Species".
- Data transformations: allow user data to be mapped into a more desirable form for use by the mining model. PMML defines several kinds of simple data transformations:
  - Normalization: map values to numbers; the input can be continuous or discrete.
  - Discretization: map continuous values to discrete values.
  - Value mapping: map discrete values to discrete values.
  - Functions (custom and built-in): derive a value by applying a function to one or more parameters.
  - Aggregation: summarize or collect groups of values.
- Model: contains the definition of the data mining model.
  - Model name (attribute modelName)
  - Algorithm name (attribute algorithmName)
  - Number of layers (attribute numberOfLayers)
- Mining schema: lists all fields used in the model.
  - Name: must refer to a field in the data dictionary.
  - Usage type: defines how a field is used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
  - Outlier treatment: defines the outlier treatment to be used.
  - Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
  - Missing value treatment: indicates how the missing value replacement was derived.
- Targets: allow post-processing of the predicted value in the form of scaling, if the output of the model is continuous.

PMML example: association rules
Transactions:
  t1: Cracker, Coke, Water
  t2: Cracker, Water
  t3: Cracker, Water
  t4: Cracker, Coke, Water
The PMML document for this example records, in order: the model attributes, the items, the item sets, and the association rules.

JDM: Java Data Mining API.
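For the four-transaction PMML association-rule example above, the support and confidence figures that an association-rule model reports can be checked by hand. The following is a minimal sketch in Python, not part of PMML or of any tool named in these slides; the function names `support` and `confidence` and the constant `TRANSACTIONS` are illustrative assumptions.

```python
# Illustrative sketch: support and confidence for the four example transactions.
# Names (TRANSACTIONS, support, confidence) are hypothetical, chosen for this example.

TRANSACTIONS = [
    {"Cracker", "Coke", "Water"},  # t1
    {"Cracker", "Water"},          # t2
    {"Cracker", "Water"},          # t3
    {"Cracker", "Coke", "Water"},  # t4
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


if __name__ == "__main__":
    print(support({"Cracker", "Water"}, TRANSACTIONS))
    print(support({"Coke"}, TRANSACTIONS))
    print(confidence({"Cracker"}, {"Coke"}, TRANSACTIONS))
    print(confidence({"Coke"}, {"Water"}, TRANSACTIONS))
```

Every basket contains both Cracker and Water, so that item set has support 1.0, while the rule Cracker → Coke holds in only two of the four baskets and has confidence 0.5.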
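JDM aims to provide a standard API for accessing data mining tools. It supports building and using data mining models, and creating, storing, accessing, and maintaining data and metadata, so that Java applications can integrate data mining technology easily.

Semantic Web standards

Tim Berners-Lee first presented the layered model of the Semantic Web (the "Layer Cake") in his talk at the XML 2000 conference.
- Its defining feature: on top of XML and RDF/RDFS, it builds ontologies and logical inference rules to achieve semantics-based knowledge representation and reasoning, so that content can be understood and processed by computers.

Layer 1: Unicode and URI (Uniform Resource Identifier)
- Unicode is aligned with the international standard ISO/IEC 10646, first published by ISO in 1993. Its aim is a single unified encoding for all the world's scripts.
- A URI has three parts: the naming scheme used to access the resource, the name of the host machine that holds the resource, and the name of the resource itself, given as a path.
- Unlike a URL, a URI in general is only an identifier and need not provide a way to access the resource.

Layer 2: XML (eXtensible Markup Language)
- XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange.
- Constraints in XML Schema are mainly used to validate the structure of XML documents.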
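The first two layers can be illustrated with Python's standard library. The sketch below splits a URI into the three parts listed above and checks whether a small XML document is well-formed. The URI and the XML snippets are made-up examples, and the helper names `uri_parts` and `is_well_formed` are hypothetical. Well-formedness is a weaker check than XML Schema validation, which the standard library does not perform.

```python
# Illustrative sketch of layers 1 and 2: URI parts and XML well-formedness.
# The sample URI and XML text are invented for this example.
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET


def uri_parts(uri):
    """Return (scheme, host, path): the three URI parts described above."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


def is_well_formed(xml_text):
    """True if the text parses as XML; this does not validate against a schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


SAMPLE_URI = "http://example.org/datasets/iris.xml"
GOOD_XML = "<flower><species>setosa</species></flower>"
BAD_XML = "<flower><species>setosa</flower>"  # <species> is never closed

if __name__ == "__main__":
    print(uri_parts(SAMPLE_URI))
    print(is_well_formed(GOOD_XML))
    print(is_well_formed(BAD_XML))
```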
Layer 3: RDF (Resource Description Framework), the metadata layer
- RDF is a framework for describing and exchanging metadata, built on XML. It describes objects as (Resource, Property, Property Value) triples.
- [Figure: an example RDF description]

Layer 4: RDF-S (RDF Schema)
- RDF-S extends RDF. It is RDF's Vocabulary Description Language, used to define the vocabulary that appears in RDF resource descriptions.

Layer 5: Ontology and Rule, the domain knowledge layer
- OWL is used to state explicitly the terms in a vocabulary and the relationships among them. For expressing meaning and semantics, OWL offers greater expressive power.
- Rules describe the premises and conclusions in domain knowledge.

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended language and protocol for querying RDF data.

8.2 Data mining tools

Free open-source data mining software and applications
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture, a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data mining software and applications
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

Weka: main features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators plus 10 search algorithms for feature selection

Weka: main GUI
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The KnowledgeFlow" (new process-model-inspired interface)

WEKA only deals with "flat" files:

  @relation heart-disease-simplified
  @attribute age numeric
  @attribute sex {female, male}
  @attribute chest_pain_type {typ_angina, asympt, non_anginal, atyp_angina}
  @attribute cholesterol numeric
  @attribute exercise_induced_angina {no, yes}
  @attribute class {present, not_present}
  @data
  63,male,typ_angina,233,no,not_present
  67,male,asympt,286,yes,present
  67,male,asympt,229,yes,present
  38,female,non_anginal,?,no,not_present
  ...

Example 1: association rule analysis
- Dataset: supermarket.arff (menu: Tools > ArffViewer)
- Pattern filters: department\d; department\d\d; department\d\d\d
- Minimum support: 0.1; minimum confidence: 0.6

Example 2: decision tree classification
- Dataset: iris.arff (menu: Tools > ArffViewer)
  - sepal length, sepal width, petal length, petal width
  - Class: Iris setosa, Iris versicolor, Iris virginica
- Classifier: J48
- 10-fold cross-validation
- Decision tree visualization
- Evaluation

8.3 Research trends in data mining

At ICDM 2005, experts in the field listed 10 challenging problems in data mining:
1. Developing a unified theory of data mining
2. Mining high-dimensional data and high-speed data streams
   - Scalability: classification over data with millions or billions of dimensions, and mining of ultra-high-speed data streams
3. Mining time series data
4. Mining complex knowledge from complex data
   - Graph data mining
   - Mining data that is not i.i.d. (independent and identically distributed), heterogeneous data, and interrelated data
5. Network data mining
   - The Web, social networks, physical networks
6. Distributed data mining and multi-agent data mining
   - Internet of Things data
   - Correlation mining across multiple data sources
7. Data mining for biological and environmental problems
8. Problems related to the data mining process
   - How to automate the mining process