
Introduction 

The feature generator plugins will be used to generate text-based features from a string field.

Use-case

A user has training data in which tweets are labeled as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) from the data, then use it to tag new tweets as positive, neutral, or negative.

User Stories

  • The user should be able to generate text-based features from a string field using HashingTF.

  • The user should be able to specify the number of features to use with HashingTF.

  • The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

  • The user should be able to set the vector size, min count, number of partitions, number of iterations, and window size when training the skip-gram model.

  • The user should be able to set which FileSet and path to use when storing the skip-gram model.

  • The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

  • The user should be able to use the generated features to train a model or for prediction.

Example

Skip-Gram (Spark's Word2Vec)

The following is a simple example showing how Spark's Word2Vec can be used for text-based feature generation using a skip-gram model.

The SkipGramFeatureTrainer will fit the data from the specified input column using the parameters vectorSize: 3, minCount: 2, numPartitions: 1, numIterations: 1, and windowSize: 3, and save the resulting model to a FileSet.
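
Internally this corresponds to fitting Spark's Word2Vec estimator. The following is a minimal Scala sketch under the same parameters; the whitespace tokenization and the toy corpus are illustrative assumptions (note that with minCount: 2 a token must appear at least twice to enter the vocabulary):

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SkipGramTrainer").getOrCreate()

// Toy corpus, tokenized by whitespace (illustrative only).
val documentDF = spark.createDataFrame(Seq(
  "Spark ML plugins".split(" "),
  "Spark ML classes in Java".split(" ")
).map(Tuple1.apply)).toDF("text")

// Parameters mirror the plugin configuration described above.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)       // vectorSize
  .setMinCount(2)         // minCount
  .setNumPartitions(1)    // numPartitions
  .setMaxIter(1)          // numIterations maps to Spark's maxIter
  .setWindowSize(3)       // windowSize

val model = word2Vec.fit(documentDF)

// The SparkSink would persist the model to the configured FileSet path.
model.write.overwrite().save("feature-generator/feature")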

Suppose the SkipGramFeatureGenerator receives the following input records:

offset | text
1      | Spark ML plugins
2      | Classes in Java

The SkipGramFeatureGenerator will use the saved model to generate records that contain all the input fields along with the output fields specified in ``outputColumnMapping``.

offset | text             | result
1      | Spark ML plugins | [0.040902843077977494, -0.010430609186490376, -0.04750693837801615]
2      | Classes in Java  | [-0.04352385476231575, 3.2448768615722656E-4, 0.02223073500208557]
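
Spark's Word2VecModel computes a document vector by averaging the vectors of the document's words, which is why each input row yields a single fixed-size array. A hedged sketch of the generation step, reusing documentDF and the FileSet path from the trainer sketch above:

import org.apache.spark.ml.feature.Word2VecModel

// Load the model that the SkipGramFeatureTrainer stored in the FileSet.
val model = Word2VecModel.load("feature-generator/feature")

// transform() appends the output column ("result"); each value is the
// average of the word vectors of the tokens in the input column ("text").
val featurized = model.transform(documentDF)
featurized.select("text", "result").show(truncate = false)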

 

HashingTF Feature Generator:

Suppose the feature generator receives the following records:

offset | text
1      | Hi I heard about Spark
2      | Logistic regression models are neat

The HashingTF Feature Generator will transform column ``text`` to generate a fixed-length vector of size 10 and emit the generated sparse vector as a combination of three columns: result_size, result_indices, result_value.

offset | text                                | result_size | result_indices  | result_value
1      | Hi I heard about Spark              | 10          | [3, 6, 7, 9]    | [2.0, 1.0, 1.0, 1.0]
2      | Logistic regression models are neat | 10          | [0, 2, 4, 5, 8] | [1.0, 1.0, 1.0, 1.0, 1.0]
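
The following is a minimal Scala sketch of the equivalent Spark transformation, showing how each sparse vector decomposes into the three emitted columns (the exact indices depend on Spark's hash function, so they may differ from the table above):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HashingTFGenerator").getOrCreate()

val sentenceData = spark.createDataFrame(Seq(
  (1L, "Hi I heard about Spark"),
  (2L, "Logistic regression models are neat")
)).toDF("offset", "text")

// Tokenize the raw text, then hash every token into one of numFeatures
// buckets, counting term frequencies.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(10)

val featurized = hashingTF.transform(tokenizer.transform(sentenceData))

// Each feature vector is sparse; the plugin flattens it into three columns.
featurized.select("offset", "features").collect().foreach { row =>
  val v = row.getAs[SparseVector]("features")
  println(s"${row.getLong(0)} -> result_size=${v.size}, " +
    s"result_indices=${v.indices.mkString("[", ", ", "]")}, " +
    s"result_value=${v.values.mkString("[", ", ", "]")}")
}

In the first example row, two of the five tokens hash to the same bucket, which is why only four indices appear and one of the counts is 2.0.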

 

Design 

SkipGramFeatureTrainer:

SparkSink to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

Properties:

    • fileSetName: The name of the FileSet to save the model to.
    • path: Path within the FileSet where the model will be saved.
    • vectorSize: The dimension of codes after transforming from words.
    • minCount: The minimum number of times a token must appear to be included in the word2vec model's vocabulary.
    • numPartitions: Number of partitions for sentences of words.
    • numIterations: Maximum number of iterations (>= 0).
    • windowSize: The window size (context words from [-window, window]). Default is 5.
    • inputCol: Input column to train the skip-gram model (Spark's Word2Vec).

Input Json Format

{
    "name": "FeatureTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "vectorSize": "3",
        "minCount": "2",
        "numPartitions": "1",
        "numIterations ": "1",
        "windowSize ": "3",
        "inputCol": "text"
    }
}

SkipGramFeatureGenerator:

SparkCompute to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

The SparkCompute will emit records containing the original input schema along with the transformed columns specified in the outputColumnMapping.

Properties:

    • fileSetName: The name of the FileSet to load the skip-gram model from.
    • path: Path within the FileSet to load the skip-gram model from.
    • outputColumnMapping: Input column to output column mapping, where each output column will contain the generated feature vector for the corresponding input field as a double array.

Input Json Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "outputColumnMapping": "text:result"
    }
}
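
The outputColumnMapping value is a delimited string of input:output pairs. Below is a hypothetical Scala sketch of how such a property could be parsed; the parseMapping helper and the comma-separated multi-column format are assumptions, not confirmed plugin syntax:

// Hypothetical parser for an outputColumnMapping value such as "text:result".
// Comma-separated support for multiple columns is an assumption.
def parseMapping(mapping: String): Map[String, String] =
  mapping.split(",").map(_.trim).filter(_.nonEmpty).map { pair =>
    val Array(in, out) = pair.split(":", 2)
    in.trim -> out.trim
  }.toMap

// parseMapping("text:result") returns Map("text" -> "result")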

 

HashingTFFeatureGenerator:

SparkCompute to generate text-based features from a string field using HashingTF.

The SparkCompute will emit records containing the original input schema along with 3 extra columns (representing the sparse vector of the value) for every transformed column specified in the outputColumnMapping.

Properties:

    • numFeatures: Number of features to be used for HashingTF.
    • outputColumnMapping: Input column to output column mapping, where for each input column the output will contain 3 corresponding fields: <output>_size, <output>_indices, <output>_value. Combined, these 3 columns give the sparse vector value for the input column.

Input Json Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "numFeatures": "16"
        "outputColumnMapping": "text:result"
    }
}
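
A downstream stage (for example, a model trainer, per the last user story) can rebuild the sparse feature vector from the three emitted columns. A minimal Scala sketch, with the literal values taken from the first HashingTF example row above:

import org.apache.spark.ml.linalg.Vectors

// Rebuild the sparse feature vector from result_size, result_indices,
// and result_value for the row "Hi I heard about Spark".
val features = Vectors.sparse(
  10,                        // result_size
  Array(3, 6, 7, 9),         // result_indices
  Array(2.0, 1.0, 1.0, 1.0)  // result_value
)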


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

Comments

  1. Albert Shau, Todd Greenstein

    We have started looking at this requirement. Please confirm/suggest below:

    1. Updated design
    2. Assumptions:
      1. Only one input column will be used for training the skip-gram model (Spark's Word2Vec) in the SparkSink. Please confirm.
      2. For each input column specified in the feature generator using HashingTF or the skip-gram model (SparkCompute), the output schema will contain the corresponding output columns specified using the columnMapping, along with the other columns on which no operation was performed, except the input columns used for transformation. Please confirm.
      3. The text based features generated for each input column will be emitted as an array in the corresponding output column. Please suggest.
    1. a. I think it is ok to start with just a single input column.

      b. I'm not sure I fully understand the question, but I think you are suggesting that if the input is 3 fields (age int, text string, other boolean), and we generate features from the 'text' field, the output will keep fields 'age' and 'other', while dropping 'text' and adding 'result'?  I think it makes sense to keep the original except add a field for the features, similar to how the predictors just add a prediction field.  So the output would have (age int, text string, other boolean, result double[]).

      c. If the feature vector is dense, a double array makes sense. If it is sparse, a Map<Integer, Double> makes sense. I believe Word2Vec is dense, but HashingTF is sparse. 

      1. Actually, after looking at SparseVector some more, you can't represent it as a Map<Integer, Double>. It has to be a record with 3 fields: size int, indices int[], values double[]

  2. I think it would be better to have a separate HashingTF feature generator and a skip-gram feature generator rather than clubbing them into one, because they don't really have any overlap in required properties

    1. Albert Shau

      Updated design with the changes. Please confirm.

      1. Looks good. The example still has 'featureGenerationType' for the skip-gram feature generator, but I think that's just leftover from the previous design.