Implementing a custom agglomerative algorithm from scratch

Question

Lon*_*guy 6 java algorithm math frameworks cluster-analysis

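I know about the agglomerative clustering algorithm: it starts with each data point as its own cluster and then combines points into clusters.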

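Now, I have an n-dimensional space and several data points, each with a value in every dimension. I want to cluster two points/clusters based on business rules such as: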

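Cluster two points c1 and c2 if the distance between the clusters along dimension 1 is < T1, and the distance along dimension 2 is < T2, ... and the distance along dimension n is < Tn.
If the rule for dimension 1 is met and the rule for dimension 2 is met, cluster them without worrying about the other dimensions...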

...and similar custom rules.

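In addition, I have my own way of defining and measuring the distance between any two clusters along any particular dimension. A dimension might hold only strings, and I want to define my own string distance metric for it. Another dimension might hold the names of locations, in which case the distance between two points along that dimension is the geographical distance between the named locations, and so on for the other dimensions.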

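Is there a framework/software that would let me define custom distance metrics like this and then perform agglomerative clustering? Of course, the agglomerative clustering stops as soon as no pair of clusters satisfies the business rules, and we are left with the clusters formed in the n-dimensional space at that point.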

Thanks, Abhishek S.

Answer 1

Вит*_*вич 5

You can do this with Weka.

You have to implement a distance function and pass it to the HierarchicalClusterer using the setDistanceFunction(DistanceFunction distanceFunction) method.

Other clusterers available in Weka are: Cobweb, EM, FarthestFirst, FilteredClusterer, MakeDensityBasedClusterer, RandomizableClusterer, RandomizableDensityBasedClusterer, RandomizableSingleClustererEnhancer, SimpleKMeans, SingleClustererEnhancer.

An example distance function, from the NormalizableDistance class:

  /** Index in ranges for MIN. */
  public static final int R_MIN = 0;

  /** Index in ranges for MAX. */
  public static final int R_MAX = 1;

  /** Index in ranges for WIDTH. */
  public static final int R_WIDTH = 2;

  /** the instances used internally. */
  protected Instances m_Data = null;

  /** True if normalization is turned off (default false).*/
  protected boolean m_DontNormalize = false;

  /** The range of the attributes. */
  protected double[][] m_Ranges;

  /** The range of attributes to use for calculating the distance. */
  protected Range m_AttributeIndices = new Range("first-last");

  /** The boolean flags, whether an attribute will be used or not. */
  protected boolean[] m_ActiveIndices;

  /** Whether all the necessary preparations have been done. */
  protected boolean m_Validated;

  public double distance(Instance first, Instance second, double cutOffValue, PerformanceStats stats) {
    double distance = 0;
    int firstI, secondI;
    int firstNumValues = first.numValues();
    int secondNumValues = second.numValues();
    int numAttributes = m_Data.numAttributes();
    int classIndex = m_Data.classIndex();

    validate();

    for (int p1 = 0, p2 = 0; p1 < firstNumValues || p2 < secondNumValues; ) {
      if (p1 >= firstNumValues)
        firstI = numAttributes;
      else
        firstI = first.index(p1); 

      if (p2 >= secondNumValues)
        secondI = numAttributes;
      else
        secondI = second.index(p2);

      if (firstI == classIndex) {
        p1++; 
        continue;
      }
      if ((firstI < numAttributes) && !m_ActiveIndices[firstI]) {
        p1++; 
        continue;
      }

      if (secondI == classIndex) {
        p2++; 
        continue;
      }
      if ((secondI < numAttributes) && !m_ActiveIndices[secondI]) {
        p2++;
        continue;
      }

      double diff;

      if (firstI == secondI) {
        diff = difference(firstI,
                  first.valueSparse(p1),
                  second.valueSparse(p2));
        p1++;
        p2++;
      }
      else if (firstI > secondI) {
        diff = difference(secondI, 
                  0, second.valueSparse(p2));
        p2++;
      }
      else {
        diff = difference(firstI, 
                  first.valueSparse(p1), 0);
        p1++;
      }
      if (stats != null)
        stats.incrCoordCount();

      distance = updateDistance(distance, diff);
      if (distance > cutOffValue)
        return Double.POSITIVE_INFINITY;
    }

    return distance;
  }

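This shows that you can handle the individual dimensions (called attributes in Weka) separately, so you can define a different distance for each dimension/attribute.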
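To illustrate the idea of one distance per dimension outside of Weka, here is a minimal plain-Java sketch. The class `PerDimensionDistance` and its methods are hypothetical names made up for this example, and Levenshtein edit distance is only a stand-in for whatever string metric you actually want; a geographic distance for a location dimension would be plugged in the same way.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch: one distance function per dimension (not Weka API). */
class PerDimensionDistance {

    /** Levenshtein edit distance, used here as an example string metric. */
    static double editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Absolute difference, used here for a numeric dimension. */
    static double absDiff(double a, double b) {
        return Math.abs(a - b);
    }

    /** Applies the i-th metric to the i-th coordinate of both points. */
    static double[] distances(Object[] p, Object[] q,
                              List<ToDoubleBiFunction<Object, Object>> metrics) {
        double[] d = new double[metrics.size()];
        for (int i = 0; i < d.length; i++) {
            d[i] = metrics.get(i).applyAsDouble(p[i], q[i]);
        }
        return d;
    }

    public static void main(String[] args) {
        ToDoubleBiFunction<Object, Object> strings =
            (a, b) -> editDistance((String) a, (String) b);
        ToDoubleBiFunction<Object, Object> numbers =
            (a, b) -> absDiff((Double) a, (Double) b);
        double[] d = distances(new Object[] {"kitten", 2.0},
                               new Object[] {"sitting", 5.5},
                               List.of(strings, numbers));
        System.out.println(d[0] + " " + d[1]);
    }
}
```

Keeping the per-dimension distances as a vector, rather than summing them right away, makes it possible to compare each one against its own threshold afterwards.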

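As for the business rules that prevent certain instances from being clustered together: I think you can create a distance function that returns Double.POSITIVE_INFINITY when the business rules are not satisfied.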
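The infinite-distance idea can be sketched in plain Java, independent of Weka, roughly as follows. Everything here is an assumption made for illustration: the class name `RuleGatedClustering`, the threshold values, the use of numeric coordinates with absolute differences, and single linkage as the cluster-to-cluster distance. The loop stops as soon as no pair of clusters has a finite distance, which is the stopping behaviour the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of rule-gated agglomerative clustering (not Weka API). */
class RuleGatedClustering {

    /** Assumed per-dimension thresholds T1..Tn (illustrative values only). */
    static final double[] THRESHOLDS = {1.0, 2.0};

    /** Sum of per-dimension distances, or infinity if any threshold is violated. */
    static double distance(double[] p, double[] q) {
        double total = 0;
        for (int i = 0; i < p.length; i++) {
            double d = Math.abs(p[i] - q[i]);
            if (d >= THRESHOLDS[i]) {
                return Double.POSITIVE_INFINITY; // business rule not satisfied
            }
            total += d;
        }
        return total;
    }

    /** Single linkage (an assumption): distance of the closest pair of members. */
    static double linkage(List<double[]> a, List<double[]> b) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : a) {
            for (double[] q : b) {
                best = Math.min(best, distance(p, q));
            }
        }
        return best;
    }

    /** Merges the closest pair repeatedly until no pair has a finite distance. */
    static List<List<double[]>> cluster(List<double[]> points) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : points) {
            List<double[]> single = new ArrayList<>();
            single.add(p);
            clusters.add(single);
        }
        while (true) {
            int bi = -1;
            int bj = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = linkage(clusters.get(i), clusters.get(j));
                    if (d < best) {
                        best = d;
                        bi = i;
                        bj = j;
                    }
                }
            }
            if (bi < 0) {
                return clusters; // every remaining pair violates the rules
            }
            // bj > bi, so removing index bj leaves index bi valid
            clusters.get(bi).addAll(clusters.remove(bj));
        }
    }

    public static void main(String[] args) {
        List<double[]> points = List.of(
            new double[] {0, 0}, new double[] {0.5, 1},
            new double[] {10, 10}, new double[] {10.4, 10.5});
        System.out.println(cluster(points).size() + " clusters");
    }
}
```

Note that with single linkage, two points that violate a threshold with each other can still end up in the same cluster through a chain of intermediate points. If every pair of members must satisfy the rules, complete linkage (taking the maximum instead of the minimum in `linkage`) would be the stricter choice.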
