4-29-25 Workshop Session 2
June 04, 2025
Information
- ID: 13191
- To Cite: DCA Citation Guide
Transcript
- 00:01 So in the previous session, I think, we mainly focused on showing you how to access the data and how to request computing resources. Right? In the next session, what we will do is more about technical details: how you can call existing large language models in the CHP safe environment, or, if you want to build your own customized large language model by leveraging an existing model and using your own data, how you can train that model in the environment. So that's what will happen in the next session.
- 00:44 And I will provide a very brief overview of large language models, and then a number of speakers will jump in to show the different tools that are available in the CHP safe environment, which you can use. Okay? For example, if you want to annotate data, we have a tool. If you want to train a model or call an existing API, how to call it, and how to use your own data to train the model. So that's what will happen next.
- 01:13 Let me just do a quick introduction to large language models. Because we are running out of time, I'm trying to shrink and reduce my session.
- 01:32 I think I'm going to skip this. I was planning to give a short history of AI. Basically, what I'm trying to say is that this is not just starting now; we have been doing this for a long time. But this wave of generative AI is slightly different from the previous ones in a few ways: how the models are trained, that they are much bigger than previous models, that they focus on generation tasks rather than trying to do prediction or analyze the data, and that they rely heavily on GPUs.
- 02:04 And one thing I really want to mention, because this is a large language model session: I'll just give a brief history of language models. Language models have been around for a while, starting in the 1960s, basically as probabilistic models that, given a sequence of words, try to predict the next word. Okay. Then later on, neural language models showed very good performance, but they suffered from computational efficiency and related issues, so they didn't really scale up. Until later, in 2017, the transformer model was proposed, and together with the abundance of GPUs available, it became possible to actually train neural language models with a lot of data. Then we moved to these pretrained language models.
- 02:49 Okay? At that time, the BERT model, which you have probably heard a lot about too, right, showed good performance when you pretrain on a reasonable amount of text. But then, in 2022, we really moved to what we call large language models. Basically, a large language model is a transformer-based pretrained language model, but trained on a lot more data. And the rationale behind this is the emergent phenomenon of large language models: people found that when you train on a lot of data, the model can not just do one thing with reasonable performance; it can actually do a lot of different tasks, all with reasonable performance. That's what we call the emergent phenomenon of the large language model. Suddenly, it becomes very smart.
- 03:38 And that led to a lot of development of all those large language models: open-source models like LLaMA and DeepSeek, which you have probably heard of, which give you all the weights, so you can actually use your own data to continue pretraining or to fine-tune. Okay? And commercial models like GPT used to be closed models, so you cannot really fine-tune them. But now GPT also has a service: you can upload data, do some kind of fine-tuning on their side, and then host the fine-tuned model on the GPT side. You still have to pay every time you call it. Okay? And then there are also different architectures, encoder versus decoder. And the trend is also that we are moving more towards multimodal large language models: instead of just a text-based model, text plus images, text plus other genomic data, and all those things. A lot of things are going on.
- 04:29 And in particular, in the NLP world, especially in the biomedical NLP world, we often focus on one NLP task called information extraction. So the idea is that in clinical data there is a lot of unstructured text; for example, a text document with a lot of details. The task of information extraction is: okay, given this document, can you extract all the disease information about the patient out of this document? I would say this accounts for about seventy to eighty percent of the requirements for a lot of EHR-based analysis, both for clinical practice and for clinical research.
- 05:16 So today, almost all the work we show here is on this information extraction task, and it can be further divided into three subtasks. The first one is called named entity recognition (NER). So the idea is that, given a document, the system needs to recognize that "MRI of the abdomen" is a test. Okay? You need to know both the type, that it is a test, and the boundary. And you need to know that "June 18, 2008" is a temporal expression: what type of entity it is and what its boundary is.
- 05:51 That's the NER task. The second one we call relation extraction. So what do you need to know? You want to know that this "June 18, 2008" is a modifier of the MRI. Right? That's a relation between those two entities. So it's very important for you to recognize the context of that clinical entity.
- 06:11 And the third one is called concept normalization. So if you read the note, what you see here is "renal cell carcinoma". But if you want to build a clinical decision support system, this entity needs to be coded to a concept in a medical terminology. It could be ICD-10; it could be SNOMED. Right? So you want to normalize this detected entity to a term in the standard vocabulary. And as you can see, it's actually not straightforward, because you see "renal cell carcinoma", but the term in the terminology is actually "malignant neoplasm of kidney". So you need this kind of mapping. Okay? So today, most of our work will show how we do those three tasks and build systems to extract information out of the text.
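The mapping step described above can be sketched as a simple dictionary lookup. Real systems use full terminologies and fuzzy matching; the codes and synonym table below are illustrative placeholders, not real SNOMED or ICD-10 content.

```python
# Minimal sketch of concept normalization as a dictionary lookup.
# Codes and synonyms below are illustrative placeholders only.

# Surface form (lowercased) -> (placeholder code, preferred term)
CONCEPT_TABLE = {
    "renal cell carcinoma": ("D001", "Malignant neoplasm of kidney"),
    "kidney cancer": ("D001", "Malignant neoplasm of kidney"),
}

def normalize(entity: str):
    """Map a detected entity string to a standard term, or None."""
    return CONCEPT_TABLE.get(entity.strip().lower())

print(normalize("Renal cell carcinoma"))
# ('D001', 'Malignant neoplasm of kidney')
```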
- 07:01 I will primarily talk about three different approaches. I'll skip those; I have a few slides about the history.
- 07:10 Around 2000 or so, we mainly worked on rule-based systems: you have a dictionary, and you try to look up all the diseases from the dictionary. Okay? Then around 2010, we annotated corpora, and we started to do machine learning. So what happens is, if you have "packed red blood cells" as an entity, the beginning of the entity is labeled B, intermediate tokens of the entity are labeled I, and all tokens outside an entity are labeled O. You then convert this to a sequence labeling task: you label each word as B, I, or O, so it becomes a machine learning task to learn. Okay?
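The B/I/O labeling scheme just described can be sketched in a few lines of Python (the whitespace tokenization and single-entity setup are simplifying assumptions):

```python
# Sketch of BIO sequence labeling: given a token list and a known entity
# (as a token sequence), label the entity's first token B, the rest I,
# and everything else O.

def bio_tags(tokens, entity_tokens):
    """Assign B/I/O labels for one entity occurring in the token list."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            tags[start] = "B"                      # beginning of entity
            for i in range(start + 1, start + n):  # intermediate tokens
                tags[i] = "I"
            break
    return tags

tokens = ["Transfused", "one", "pack", "of", "red", "blood", "cells", "today"]
entity = ["pack", "of", "red", "blood", "cells"]
print(bio_tags(tokens, entity))
# ['O', 'O', 'B', 'I', 'I', 'I', 'I', 'O']
```

A sequence model is then trained to predict these per-token labels directly from the text.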
- 07:48 So that's what we were doing around that time. And then around 2020, it was more about deep learning, which I think many of you have heard about. At that time, we were looking at contextual embeddings like the BERT model. So we actually fine-tuned BERT models from the open domain with clinical data and showed the performance. This is just a summary: all you want to know is that, moving from rule-based to machine learning to deep learning, the performance kept getting better.
- 08:17 And now we move to large language models: how you can use a large language model to do this information extraction task. I'll give you three examples, three different approaches we have worked on. I think, potentially, if you're going to work on your own task, those are the three approaches you may take.
- 08:39 The first one you probably all know: you have GPT over there, and all you need to do is write a prompt. Right? Give it the document and tell GPT what you want. So that's what we did as a first experiment here. Basically, we gave GPT-3.5 and GPT-4, at that time, the instruction that we want to extract medical problems, treatments, and tests out of clinical notes.
- 09:04 So the main exercise here is really about the prompt. So we actually tried different strategies for the prompt: you define the task, you define the output, and you also need to tell the model the definition of a medical problem. And then you can also give a guideline. Earlier I showed you what a boundary is, so you may say that the entity has to be a noun phrase; you give that kind of guideline. Then you can also give additional examples: here's the sentence, here's the entity I want to extract. This is what is called few-shot learning; it gives a few examples, say three, hence "few-shot". Right? And we tested all of those: we made a framework for the prompts and evaluated on the annotated corpus.
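The pieces just listed (task definition, output format, entity definition, guideline, and few-shot examples) can be assembled into a single prompt string. A sketch with illustrative wording, not the exact prompts used in the study:

```python
# Sketch of assembling a few-shot NER prompt from the components described
# above. All wording is illustrative; a real study would tune each part.

def build_prompt(note, examples):
    parts = [
        "Task: extract medical problems, treatments, and tests from the clinical note.",
        "Output: one entity per line, formatted as <type> | <text span>.",
        "Definitions: a medical problem is a disease, symptom, or abnormal finding.",
        "Guideline: each extracted span must be a noun phrase.",
    ]
    for sentence, entities in examples:  # few-shot demonstrations
        parts.append(f"Example sentence: {sentence}")
        parts += [f"{etype} | {span}" for etype, span in entities]
    parts.append(f"Clinical note: {note}")
    return "\n".join(parts)

prompt = build_prompt(
    "MRI of the abdomen was performed on June 18, 2008.",
    [("Patient denies chest pain.", [("problem", "chest pain")])],
)
print(prompt)
```

The resulting string is what gets sent to the model; zero-shot is the same prompt with the examples list left empty.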
- 09:49 And then we showed that, actually, if you have a lot of annotated data, the BERT model, the previous deep-learning-based approach, still works better. The zero-shot performance of GPT is not as good as the BERT model if you have a lot of annotated data. So that's what we found at that time. But it's actually close, because the GPT-4 model can reach about 86, versus about 90 for the BERT model trained on hundreds of samples, in relaxed matching. Relaxed matching means the entity you predicted and the entity you annotated overlap but are not exactly the same. Okay.
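Exact versus relaxed matching can be made concrete with character-offset spans; a minimal sketch (the offsets are made up for illustration):

```python
# Sketch of exact vs relaxed span matching for NER evaluation.
# Spans are (start, end) character offsets; relaxed matching counts any
# overlap between a predicted span and a gold span as a hit.

def exact_match(pred, gold):
    return pred == gold

def relaxed_match(pred, gold):
    (ps, pe), (gs, ge) = pred, gold
    return ps < ge and gs < pe  # half-open intervals overlap

gold = (10, 30)   # annotated span, e.g. "MRI of the abdomen"
pred = (10, 13)   # predicted span, e.g. just "MRI"

print(exact_match(pred, gold))    # False
print(relaxed_match(pred, gold))  # True
```

Precision, recall, and F1 can then be computed under either matching rule, which is why papers often report both.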
- 10:27 And I'll skip this one. Then I'll skip those two.
- 10:32 The second exercise we did: okay, later, LLaMA came out. You have all the weights, like I said, so you can actually fine-tune those using your additional data. That's what we did here. We were working on the same task, extracting medical problems, treatments, and tests, but now we have the open-source LLaMA model, and we actually used annotated data from a local corpus to fine-tune the LLaMA model for this task.
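Fine-tuning like this typically starts by converting each annotated note into an instruction-response record. A minimal sketch; the `instruction`/`input`/`output` field names are a common convention assumed here, not necessarily the exact format used in this work:

```python
# Sketch of converting one annotated NER example into an instruction-tuning
# record. Field names follow a common instruction-dataset convention.
import json

def to_instruction_record(text, entities):
    """entities: list of (type, span) pairs annotated in the text."""
    return {
        "instruction": "Extract medical problems, treatments, and tests "
                       "from the clinical text. List one entity per line "
                       "as <type> | <span>.",
        "input": text,
        "output": "\n".join(f"{etype} | {span}" for etype, span in entities),
    }

record = to_instruction_record(
    "MRI of the abdomen was done on June 18, 2008 for flank pain.",
    [("test", "MRI of the abdomen"), ("problem", "flank pain")],
)
print(json.dumps(record, indent=2))
```

A corpus of such records is then used to fine-tune the model so it learns to emit the structured output for unseen notes.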
- 11:00 And this is what we did through the instruction tuning approach. I think we are going to talk more about this later, so I'm not going to repeat it. Basically, you convert the annotated data, which I showed, to an instruction dataset, and then you fine-tune the LLaMA model to change the weights for this specific task.
- 11:21 And what we found here is that, when you have a lot of annotated data, the large language model, LLaMA 3, is actually almost the same as the BERT model, just slightly better. This comparison is more fair, because both models used those hundreds of annotated samples for training. Okay? But then the last dataset is an unseen dataset for both models, and there you can see that the LLaMA model actually has much better performance than the BERT model.
- 11:55 This indicates the LLaMA model is actually more generalizable, because it has almost an eight percent improvement: the BERT model is around 79, versus 87 here. So now we have started to actually build this as a LLaMA-based information extraction system. But one thing I want to point out: at least when we tested LLaMA 3, speed is an issue. The BERT model takes us 0.2 seconds to do named entity recognition for one document; LLaMA 3 took 39 seconds. So if you're processing millions of notes, that's another concern. There are a lot of other issues, in addition to the performance, that you want to consider. That's what I want to bring up.
- 12:39 So, for the third approach: think about the first approach, prompting; you don't really need much GPU, right? It just costs money to call GPT. The second is fine-tuning: you do need a GPU machine to load the model and fine-tune it, and it may take a couple of hours to a couple of days. But this one we call continued pretraining of the LLaMA model. You use a lot of clinical data, like all the notes and all the literature; we combined about 129 billion tokens to continue pretraining the LLaMA model. It took 150 GPUs running for a month. That would be a lot of money if you went to Amazon.
- 13:23 As you can see, the computational cost for training this model is much bigger compared to the previous ones. But the benefit is that the model becomes more generalizable: it can work on multiple clinical NLP tasks. That's what we call the Me-LLaMA model. We trained it on LLaMA 2 and showed better performance on multiple tasks: not just entity recognition, but question answering, inference, and other tasks. I'm just going to stop here.
- 13:54 So, in summary, just to quickly talk about what we have learned so far. Basically, when you try to extract information out of notes using a large language model, you can still think about: do you really need a large language model? If the task is simple, I think sometimes even regular expressions or a rule-based approach still work. And also, if you already have a lot of annotated data, then a deep learning model like BERT still works well. Okay? And it also costs less in terms of computational effort.
- 14:26 Then, if you think a large language model does help for that specific task, you also want to discuss: should I train my own large language model based on an open-source model, or should I go with GPT? Right? Then there are a lot of concerns: in addition to performance, you also have to think about the cost, right, and the GPU requirements; do you have GPUs locally, and all those issues?
- 14:54 So in the next three to four presentations, we will basically talk about several things: tools available in the CHP safe environment which will allow you to do this kind of work. The first tool we will talk about is an annotation tool. A lot of people don't really pay much attention to annotation, but if you really look at all the model training, even in the era of large language models, you still need to do some annotation, even just for validation and evaluation. Then you need a tool to do that, and we have a tool installed on the CHP for that purpose. The second one: we will show you a tool we already fine-tuned and made available on the CHP, which you can just call; I think Nate also talked about the services. The third one really goes deep: I think Lingfue is going to talk about, if you have your own data to start with, how you can fine-tune that model with your own data on the CHP. So let them just start.
- 15:58 Do you wanna go ahead? Start with the annotation. Just watch the time. Maybe go just a little fast.
- 16:15 Hi, everyone. Today, I'm just gonna go through why we need annotation and then try to introduce our annotation tool, Blue. So, annotation is the process of labeling data: marking spans of text, images, or other content with additional information such as entity types, categories, or relationships, just as Dr. Xu mentioned before when showing the graph with entities annotated and also the relationships annotated between those entities. Annotation is critically important because it serves as the foundation for machine learning and deep learning models. These models heavily rely on annotated datasets to learn meaningful patterns and then to make accurate predictions. Annotation remains important in the era of large language models: although LLMs are highly capable, they still depend on annotated data for fine-tuning for specific tasks and for evaluation against the ground truth.
- 17:28 Here you can see a table that compares performance between multiple models, including the LLaMA 3 variants and also a fine-tuned LLaMA model, on an annotation task. As seen in the results, the fine-tuned model, which was trained on a well-annotated dataset, tends to perform better on the specific targeted project.
- 18:04 There are several key topics related to annotation. The process always begins with developing a clear and detailed annotation guideline. A well-developed guideline improves consistency among annotators and leads to higher annotation quality; it also speeds up the onboarding of new annotators and makes conflict resolution easier. Once the guideline is developed, the next step is to select annotators with appropriate domain knowledge and then train them thoroughly based on the guideline. After the training, it is important to continuously check and monitor the annotation quality. This includes checking agreement among annotators, holding discussions to resolve disagreements, and refining the guideline based on common errors or ambiguities identified during the process. And later, I will also introduce the annotation tool that can support and streamline the annotation workflow.
- 19:17 For annotation guideline development, the first step is always to define the goal of your project. Clearly state what you are trying to achieve with the project. Next, provide a clear definition for all the concepts, like entities, relations, or special terms. After that, develop detailed annotation rules that cover the majority of scenarios and edge cases, to minimize ambiguity. It is also essential to include many real-world examples in the guideline, illustrating both correct annotations and common errors. Guideline development is not a one-time effort; it is an iterative process. It is important to involve both domain experts and linguists or informaticians to ensure both technical accuracy and practical usability. After the initial guideline is created, it should be refined during the annotator training and tested on real-world data. Given the variability and complexity of real-world data, new scenarios will inevitably arise and may require further guideline updates. Once the guideline is stable and robust, the process can move to corpus finalization.
- 20:51 Here you can see an example of an annotation guideline. The goal of this guideline is to identify meaningful clinical concepts from patient medical records and to help extract information like tests, problems, drugs, and treatments. As shown on the left side, we provide a detailed definition to ensure the annotators understand what should be labeled. In this guideline, we also introduced modifiers: concepts that complement an entity and extend its meaning. For each modifier, such as severity and body location, the guideline also needs to be specific on how it should be annotated and what its relationship with the entities is. Additionally, the guideline needs to include many real-world examples, like the diagram shown at the bottom, to illustrate correct annotation practice. And for ambiguous phrases or tricky scenarios, examples also need to be provided to establish clear, consistent rules for annotators to follow.
- 22:12 It is important to choose annotators with a proper background for your task. Depending on the complexity, you might need to choose domain experts like physicians, nurses, or medical students, or just some laypersons for more general and broad annotation. Training for annotators is an iterative process: annotators should be trained and evaluated multiple times until they achieve the expected level of performance. Quality checking needs to be ongoing during the annotation process: regular review is always needed during their work, and you also need to provide feedback on time and sometimes additional retraining for the annotators.
- 23:07 When managing a project with multiple annotators, there are several important steps that must be taken to ensure quality. Before starting the actual annotation, train each annotator thoroughly to ensure they can produce consistent and reliable annotation results that align with the guideline you developed. If resources allow, implement a double annotation strategy: ideally, each sample should be annotated by two annotators independently, and then a third, more experienced annotator can review any discrepancies and make the final decision. This process helps to maintain a high quality of annotation. If double annotation for the entire dataset is not feasible, assign a small overlapping subset of the data to multiple annotators. This overlap allows you to calculate inter-annotator agreement and then provides a way to monitor and maintain annotation quality.
- 24:20 When checking the annotation quality for the NER task, we focus on two main areas. The first one is entity type agreement, and then entity span agreement. For entity type agreement, we verify whether annotators assigned the same type to an entity. You can see in this graph that one of them annotated "Vancomycin HCl" as a drug and another one annotated it as a treatment. This mismatch will need to be discussed during the annotation process and then corrected for the final step. And for entity span agreement, we check whether both annotators selected the same portion of text. In the same example, one labeled "a lot of emotional stress" as a problem and another one annotated just "emotional stress". When such a mismatch occurs, it is important to refer back to the guideline and determine which is the correct one to move forward.
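Both kinds of disagreement can be detected automatically by comparing the two annotators' span lists. A sketch, with character offsets and labels made up to mirror the slide's examples:

```python
# Sketch of comparing two annotators' NER output: find type disagreements
# (same span, different label) and span disagreements (same label, partial
# overlap). Each annotation is (start, end, label) on the same document.

def compare(a, b):
    type_conflicts, span_conflicts = [], []
    for (s1, e1, t1) in a:
        for (s2, e2, t2) in b:
            if (s1, e1) == (s2, e2) and t1 != t2:
                type_conflicts.append(((s1, e1), t1, t2))
            elif s1 < e2 and s2 < e1 and (s1, e1) != (s2, e2) and t1 == t2:
                span_conflicts.append(((s1, e1), (s2, e2), t1))
    return type_conflicts, span_conflicts

ann1 = [(0, 14, "drug"), (20, 45, "problem")]       # annotator 1
ann2 = [(0, 14, "treatment"), (30, 45, "problem")]  # annotator 2
types, spans = compare(ann1, ann2)
print(types)  # [((0, 14), 'drug', 'treatment')]
print(spans)  # [((20, 45), (30, 45), 'problem')]
```

Flagged pairs like these are exactly the cases to bring to an adjudication discussion.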
- 25:40 When checking the annotation quality in relation extraction, there are three main aspects we need to evaluate. The first one is relation type agreement: we check whether both annotators assigned the same type of relation between entities. Then we evaluate the entity pair: we verify whether the same entities are being linked by the relation. And finally, we need to check the directionality, which is important for some tasks, because the direction may change the meaning.
- 26:19 To evaluate, there are several metrics we can use. The common ones are precision, recall, and F1 measure, which help quantify how consistently annotators identify and classify entities. Additionally, we can also use statistical measures such as Cohen's kappa. Another important method is self-train and self-test: by training a model on the annotated dataset and then testing on the same dataset, we can check whether the model achieves high performance. If the performance is low, it may indicate underlying issues with annotation inconsistency or quality.
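Cohen's kappa corrects raw agreement for the agreement expected by chance; a minimal sketch for two annotators labeling the same items:

```python
# Sketch of Cohen's kappa for two annotators who labeled the same items.
# kappa = (observed agreement - chance agreement) / (1 - chance agreement);
# assumes the annotators are not entirely in chance agreement.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["drug", "drug", "problem", "test", "problem", "drug"]
b = ["drug", "treatment", "problem", "test", "problem", "drug"]
print(round(cohens_kappa(a, b), 3))  # prints 0.76
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more than chance would predict.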
- 27:09 Here are some examples of widely used annotation tools: you can see there are MedTator, eHOST, and Doccano. All of those tools are open source and available on GitHub. Today, I'm gonna introduce the annotation tool Blue, which is implemented in the CHP environment, so that users don't need to install it by themselves and it can be managed by an admin user.
- 27:50and then, adding connect to
- 27:52VPN.
- 27:53So for Mac user, you
- 27:54will need to install a
- 27:55Windows application.
- 27:57And for Windows user, you
- 27:58can use the remote desk,
- 28:01connection to oh, application.
- 28:05First step, you need to
- 28:06connect to the VPN.
- 28:08Open the VPN
- 28:09application and then, in the
- 28:11address, type the telecom mute
- 28:14dot y h h dot
- 28:16org backslash y s m.
- 28:18Here, you need to use
- 28:19your Yale Net ID and
- 28:21password to log in.
- 28:24And then once you successfully,
- 28:27log in to the VPN
- 28:28environment, you can open the
- 28:29application and click the add
- 28:32button
- 28:33to add the
- 28:35IP address.
- 28:36It's ten dot forty eight
- 28:38dot one two eight dot,
- 28:40ninety six
- 28:42dot sixty nine.
- 28:44And
- 28:45once the PC successfully added,
- 28:48it will show on the
- 28:50application and then double click
- 28:53to insert your credential.
- 28:55Here, we'll need your one
- 28:57HH ID and the one
- 28:59HH password.
- 29:03Once you, log in to
- 29:06the PC, you will see
- 29:07a Ubuntu environment.
- 29:13 After you get access to that environment, you can use any browser on the left side, and in the address bar, enter the URL http://localhost to open the annotation tool. The first step is to create your account. You always want to have an admin person who creates an account first; that will be the person who can manage the whole group and assign the projects and tasks to each annotator. Please use your email, username, and password to sign up. And for the verification code field, we disabled that function, so you can just enter any four-digit number or combination of characters.
- 30:02 After you log in to Blue, you will be able to create projects and invite users to the tool. Once the admin person has successfully logged in, he or she can send invitations to the other group members: the person needs to click the invite button and then copy the invitation link to each of the annotators. The annotators need to use this link to register; otherwise, they will not be in the same group.
- 30:41And,
- 30:42by click the click on
- 30:44add new project button, you
- 30:46can you will be able
- 30:47to
- 30:48choose your task either NER
- 30:50or NER plus relational extraction.
- 30:57The created project will show on the front page, and then you will be able to add annotators to the project.
- 31:10To add the data source, you can click the data source button and then choose what kind of format you want to upload to the tool.
- 31:19We accept two formats. One is txt: plain text without any entity or relationship annotations.
- 31:27For pre-annotated data, you can also choose the Blue format, a JSON file in which you can include the entities and relationships.
- 31:42For each project, you can create tasks for the annotators by clicking the add task button.
- 31:52And for each task, you can assign multiple annotators to that one task, just as I mentioned before.
- 31:59Different annotators can annotate the same subgroup of data in order to calculate the agreement among annotators.
- 32:12After the tasks are created, you will be able to start annotation.
- 32:16First, you need to define the entities and relationships that you already have in your annotation guideline.
- 32:28And then, after that, highlight the phrase you want to annotate and choose what kind of entity or relationship you want to annotate it as.
- 32:40The Blue tool will also provide you a function to calculate the agreement among the annotators.
- 32:48Once the annotators finish the task, you can finalize the annotations. Then you can just use the button to check the agreement among them.
- 33:03It will give you an F1 score for both entities and relationships.
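The agreement score just described can be sketched as pairwise F1 over the finalized entity spans. The `(start, end, type)` span representation here is an assumption for illustration, not the tool's actual schema.

```python
def agreement_f1(anns_a, anns_b):
    """Score annotator B against annotator A over exact (start, end, type) spans."""
    set_a, set_b = set(anns_a), set(anns_b)
    matched = len(set_a & set_b)  # spans both annotators marked identically
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)

a = [(0, 5, "problem"), (10, 14, "drug")]
b = [(0, 5, "problem"), (20, 24, "test")]
print(agreement_f1(a, b))  # one span in common out of two each -> 0.5
```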
- 33:10Then I will have a
- 33:12quick demo for the process.
- 34:18Okay. As I mentioned, you just connect to the VPN and then type the password.
- 34:32And then open the Windows app and click the server we have.
- 35:09And then open the browser.
- 35:24Here, you can sign into
- 35:26your account.
- 35:29To create a new project,
- 35:30you can just click this
- 35:31button and
- 35:33type the project name and
- 35:35select the project type.
- 35:38Here, I already created a demo project.
- 35:41I want to import the data source here, so I just click this button.
- 35:48I downloaded some notes I need to annotate from the CHIP environment, and then I drag them here.
- 35:58This is a txt file, so I just choose text, and then I confirm.
- 36:08For each project, you can add the annotators; that's the people within your group.
- 36:19And then you just, create
- 36:21annotation task for them.
- 36:27You can choose multiple annotators here.
- 36:33And then you go to the file.
- 36:36On this side, you can define the entities. For example, we want to define "problem."
- 36:46And then you can start to do the annotation.
- 36:52Yeah. Basically, that's the whole process for how you do the annotation and how to use our tool.
- 37:06Yeah. Any questions?
- 37:09So how can we import our own data to this? Because I think this is your server, right?
- 37:17As the team also mentioned, we use the CHIP environment. With Camino, you can upload your own data to that environment, and this server will connect to Camino.
- 37:34You can download that data from the Camino environment.
- 37:47Not yet. Right now, this tool is hosted in a secure environment because there is a lot of PHI information. That's the purpose of hosting it there.
- 38:00So for example, let's say there are other publicly available datasets that we really want to annotate. Would it be possible for us to upload them to this server? Yes.
- 38:18Well, if you try to annotate public data, don't use this one.
- 38:23I think what we can do is set up a Blue instance on an open, public website; then you can just go over there.
- 38:32We can spin it up, and then you can just upload, because there's no sensitive data. We can just make another instance of Blue for public data.
- 38:41Because this one we installed in Camino, in the CHP, to support this annotation work.
- 38:47And if there is public data, well, we can just set up another instance, because it's a web application; we can set up another web application in a public space. Yeah, we can discuss that.
- 38:57And should we ask you to set up that specific public instance, or is it already available?
- 39:05We have not, but you can contact us. Maybe we can just give you a copy, and you can set it up yourselves.
- 39:10But right now, we didn't really distribute this package; we just set it up for ourselves.
- 39:35It's just a different tool.
- 39:38Yeah.
- 39:51Thanks, Silja. I'm gonna be
- 39:53very quick.
- 39:55And machine gun mode on.
- 39:57Okay.
- 39:59Yeah. Dr. Xu already discussed the difference between BERT and Llama.
- 40:04To summarize everything, there is a trade-off between performance, computational resources, and time.
- 40:12Okay? So if you need better performance and the computational resources are there, go for Llama models, high-billion-parameter models.
- 40:20Across a wide variety of tasks, they work well.
- 40:24But if time is a concern, he projected the issue of speed between BERT models and large language models: they are up to twenty to thirty times slower.
- 40:34So if that is a concern, you need to switch to BERT models.
- 40:36So I'm gonna talk about the clinical information extraction system where we have developed both BERT-based and Llama-based large language models for you,
- 40:49in such a way that whether you have no programming experience, some programming experience, or are a pro programmer,
- 40:58we have features that will help you take it and customize it to whatever task you want to use it for.
- 41:05And that
- 41:07is what we call Kiwi.
- 41:09Okay? So we are building
- 41:11Kiwi. The one pipeline that
- 41:13I'm currently gonna show you
- 41:14that is set for all
- 41:16these sort of use cases
- 41:17that I'm talking about
- 41:19is a general clinical information
- 41:21extraction pipeline.
- 41:23I also have things coming
- 41:24up for you, and if
- 41:25you have suggestions or something
- 41:27that you have been really
- 41:28working on, it's a real
- 41:29need of the time, let
- 41:30us know, and then we
- 41:32would work on developing those
- 41:33things.
- 41:34Okay.
- 41:36We have the clinical notes.
- 41:37We need to do some
- 41:38preprocessing,
- 41:39deidentification,
- 41:40these sort of things. Doctor
- 41:42Shu mentioned named entity recognition
- 41:45followed by relation extraction, then
- 41:47there is this concept mapping
- 41:48or concept normalization.
- 41:50Finally, post process it and
- 41:53get
- 41:54all the structured data
- 41:55from the unstructured
- 41:57clinical notes. So that is
- 41:59the basic block diagram of
- 42:01any clinical information extraction pipeline.
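The block diagram just described can be sketched as a chain of stages. Each stage here is an illustrative stub, and the function names and concept ID are assumptions for the sketch, not Kiwi's actual implementation.

```python
def preprocess(note):
    """Stand-in for cleanup / de-identification / sentence splitting."""
    return note.strip()

def extract_entities(text):
    """Stand-in for named entity recognition."""
    return [{"text": "fever", "type": "problem"}] if "fever" in text else []

def extract_relations(entities):
    """Stand-in for relation extraction (links modifiers to main entities)."""
    return []

def normalize(entities):
    """Stand-in for concept normalization to a standard vocabulary ID."""
    for e in entities:
        e["cui"] = "C0015967" if e["text"] == "fever" else None  # illustrative CUI
    return entities

def run_pipeline(note):
    entities = normalize(extract_entities(preprocess(note)))
    return {"entities": entities, "relations": extract_relations(entities)}

print(run_pipeline("Patient denied fever."))
```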
- 42:06I don't need to go over this: named entity recognition identifies the boundaries, relation extraction identifies the relationships between the entities,
- 42:14and normalization: doctors write the same thing in a hundred different ways. High BP, hypertension, all these are the same. Right?
- 42:20So you need to map it to a standardized vocabulary or terminology like ICD or SNOMED; that is what concept normalization does.
- 42:29All these three things come together; that is where you take unstructured data and get your structured thing out of it.
- 42:37Okay. What does our general
- 42:39clinical,
- 42:41information extraction pipeline give you?
- 42:44We mainly focused on four
- 42:45main entities,
- 42:47medical problem, treatment,
- 42:49drug, and test.
- 42:50Right? So our Kiwi tool will give you all these four main types of entities, but these entities are not just by themselves. Right?
- 43:00When you are talking about
- 43:01a drug, you have things
- 43:03like the strength, the dosage,
- 43:04the duration, the route, all
- 43:06these things are important. And
- 43:08we need to connect that
- 43:09specific drug to that specific
- 43:11route or specific
- 43:12strength or dosage
- 43:14to actually identify what the doctor has written about giving that information to the patient.
- 43:21So we have a bunch
- 43:22of main entities, and we
- 43:24have a bunch of modifiers
- 43:25that correspond to those main
- 43:27entities.
- 43:28Altogether, this is what Kiwi is gonna extract for you.
- 43:31I know many of the things that you may be needing might be missing from this, but if there are some other cases that you would like to extract, we may in the future think about incorporating them.
- 43:45So for medical problem, you have the severity, the condition, the uncertainty, and who the subject is: whether it is really talking about the patient or his family, because we can see all these sorts of things appearing in the notes,
- 43:57and whether that particular problem is negated or not. So we have the four main entities and all these modifiers.
- 44:08Going very briefly: so YuJa mentioned the annotation. When you annotate, the top figure is something that you get.
- 44:16Now suppose you are using a large language model: it understands the language of prompts.
- 44:21Right? And Dr. Xu covered this, how to write a proper prompt for named entity recognition.
- 44:26So you define the task: we want to identify medical problems, treatments, tests, and other things, and then you specify how you need the output.
- 44:36That is for making your programming life easy: taking the output in a particular format so that you can convert and evaluate it fast. So that is the output guideline markup.
- 44:46Then you define each entity, because we have developed the annotation guidelines, and that is how the humans actually annotate.
- 44:53So the model should also know how the humans have annotated. Otherwise, how do you compare that gold-standard human-annotated data with what the model is giving?
- 45:02So whatever information you are giving the human, you also give to the model, in terms of entity definitions.
- 45:11And then annotation guidelines: we talked about, okay, annotate only complete noun phrases, not partial ones, and complete adjective phrases.
- 45:19These sorts of rules that are there in the annotation guideline that you developed are also provided to the model.
- 45:25Now then you build your
- 45:27training data by showing the
- 45:29model a bunch of examples.
- 45:31Suppose your input is "At the time of admission, he denied fever, dysphoria," whatever it is.
- 45:37So how does the model provide you the output? It should say <span class="problem">fever</span>.
- 45:42That is telling the model: okay, fever is a problem. Whenever you see a medical problem, put it between the HTML tags: the opening tag, span class equal to problem, and the closing tag, slash span.
- 45:56We did that for our convenience because we were comparing it with the BERT and other models.
- 46:03You can provide the output
- 46:04in the way that you
- 46:05want. You can use JSON
- 46:07format, or if you just
- 46:08want it to be plain
- 46:09text in question answering and
- 46:10things like that, you can
- 46:12give the output in such
- 46:13a way. But at least with named entity recognition and relation extraction, this really helps us.
- 46:17And another thing is that this also helps us know that the model is not hallucinating.
- 46:24You see? You are giving
- 46:25the input sentence and you
- 46:27are also telling the model
- 46:28to repeat the same sentence
- 46:29but with some tags attached.
- 46:31You can compare your input and your output to see that the model is not inserting entities or things that are not already there in the original sentence.
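That input/output comparison can be sketched as: strip the span tags from the model's output and check that what remains equals the input sentence. The tag pattern follows the span markup described above.

```python
import re

def strip_span_tags(tagged):
    """Remove the <span class="..."> markup, leaving the raw sentence."""
    return re.sub(r"</?span[^>]*>", "", tagged)

source = "At the time of admission, he denied fever."
output = 'At the time of admission, he denied <span class="problem">fever</span>.'

# If the stripped output differs from the input, the model altered the text.
print(strip_span_tags(output) == source)  # True
```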
- 46:42Okay.
- 46:43So that is how you
- 46:45create a prompt and do
- 46:46NER with large language models.
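The pieces above (task definition, output markup, entity definitions, annotation rules, and a worked example) can be assembled into a prompt template along these lines. The section wording and entity definition are placeholders, not the actual Kiwi prompt.

```python
# Hypothetical prompt skeleton assembled from the parts described in the talk.
PROMPT_TEMPLATE = """### Task
Mark up medical problems, treatments, drugs, and tests in the sentence.

### Output format
Repeat the sentence, wrapping each entity in <span class="TYPE">...</span> tags.

### Entity definitions
{definitions}

### Annotation rules
Annotate complete noun phrases only, never partial phrases.

### Example
Input: At the time of admission, he denied fever.
Output: At the time of admission, he denied <span class="problem">fever</span>.

### Sentence
{sentence}
"""

prompt = PROMPT_TEMPLATE.format(
    definitions="problem: any disease, symptom, or abnormal finding.",
    sentence="His chest pain resolved after treatment.",
)
print(prompt.splitlines()[0])  # ### Task
```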
- 46:48The next step is relation
- 46:50extraction.
- 46:51There is a particular drug; you need to associate its strength, its route, its form, its frequency, everything, and connect that particular drug to whatever is mentioned for it. Right?
- 47:04So for the relation extraction, you need to slightly modify your prompt when you give it to the model.
- 47:17So how do we train
- 47:18the model for this task?
- 47:20We will show the model the main entity in the input text: span class equal to drug around the drug name.
- 47:26Then you will ask the model: given this main entity, what are the modifier entities associated with it?
- 47:32And then you give examples in the output, where you see that now the main entity is not annotated inside the span class tags, whereas 0.35 mg is within span class equal to strength.
- 47:47So given a drug, when the model repeatedly sees appearances like 0.5 milligram or mcg, these sorts of things, it's actually learning that this is the strength associated with that particular main entity.
- 48:01So a lot of examples annotated like this are what help the model learn.
- 48:07Again, this is another example of the same sort. His blood pressure on discharge was 126 over 63; heart rate is 80.
- 48:14You cannot say blood pressure is 80. Right? It's the same sentence, which has two values and two tests.
- 48:20You need to correctly associate blood pressure with 126 over 63 and heart rate with 80.
- 48:29Right? So we give the input: when we say blood pressure is the entity, its value should be 126 over 63. If we highlight heart rate as the entity, then the value should be 80.
- 48:44Again, so we originally had the annotated data, and we converted it into the instruction format that I have been showing for named entity recognition; this is an instruction demonstration sample.
- 48:56And for relation extraction, it's the one that you see at the bottom.
- 49:01So the entire dataset converted into such things is your instruction demonstration: a bunch of these examples that you collectively call your instruction dataset.
- 49:16Previously, we have annotated datasets for the other models; this is just slightly different. The term is instruction dataset because the dataset is comprised of a bunch of instructions, or prompts, with input and output examples.
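One common way to lay out such an instruction dataset is a list of instruction/input/output records, one per training example. The field names here are an assumption for illustration, not the exact format used.

```python
import json

# Hypothetical layout: one record per demonstration, for NER and for
# relation extraction, using the span markup described in the talk.
instruction_dataset = [
    {
        "instruction": "Mark up medical problems in the sentence with span tags.",
        "input": "At the time of admission, he denied fever.",
        "output": 'At the time of admission, he denied <span class="problem">fever</span>.',
    },
    {
        "instruction": "Given the main entity, mark up its modifier entities.",
        "input": 'His <span class="test">heart rate</span> is 80.',
        "output": 'His heart rate is <span class="value">80</span>.',
    },
]
print(json.dumps(instruction_dataset[0], indent=2))
```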
- 49:29To instruction fine-tune a large language model, all you need is such an instruction dataset specific to your task and a base large language model like Llama 2, Llama 3, Llama 4, or whatever it is.
- 49:43And then you give it to this model, you train it, and finally you get an instruction-tuned large language model.
- 49:51So if it is a Llama model as your base, you will get an instruction-tuned Llama, but one that is actually adapted for those tasks.
- 50:02So when you just take the originally available Llama model, it's a generally trained model. Right? It is not domain-adapted for your specific task.
- 50:12By fine-tuning a large language model, what you are doing is making its capabilities lean much more toward whatever task you want to perform, by showing it a lot of such examples and modifying its weights in such a way that it adapts to that specific task.
- 50:30That particular model, if you
- 50:32now go and test back
- 50:33on some general task, it
- 50:35might not perform the way
- 50:36that it previously
- 50:38performed
- 50:39because you have changed the
- 50:40model weights and adapted it
- 50:42to that specific task.
- 50:44Okay. So this is basically
- 50:46fine tuning and then you
- 50:47would evaluate the model.
- 50:49Now going back: I said we also had the BERT-based models. There too, as was just shown, you would annotate the dataset.
- 50:56But while for a large language model you would give a prompt, for BERT you would convert it in a different way.
- 51:11is a word. So vital
- 51:13sign remains stable. And then
- 51:15doctor Xu has covered this
- 51:17BIO tagging is what we
- 51:19call beginning of an entity,
- 51:20inside of an entity, outside
- 51:22of an entity. So vital
- 51:24sign is a test here,
- 51:25so you say b test.
- 51:27If you have a problem,
- 51:28you would say acute carcinoma
- 51:30or something. I'm making this
- 51:31up. So acute is gonna
- 51:33be b problem
- 51:34and carcinoma is gonna be
- 51:36I problem. If it is
- 51:37not within the four main
- 51:39entities and four modifiers that
- 51:41we have, we tag it
- 51:42as o, which means outside
- 51:44of an entity. So take
- 51:46the same annotated dataset, convert
- 51:48into two different formats. One,
- 51:50safe for the llama, another
- 51:51for the bird.
- 51:52And for bird, this is
- 51:54token classification. So given a
- 51:56content given a sentence, you
- 51:57are basically predicting whether vital
- 51:59is among the b test,
- 52:01I test, b problem, I
- 52:02for problem, b value, I
- 52:04value, whatever is the corresponding
- 52:06label that should be for
- 52:07that particular token. So it
- 52:09is token classification task what
- 52:11we do.
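The BIO conversion just described can be sketched as follows; whitespace tokenization and token-index spans are simplifications for illustration, since a real system would use the model's own tokenizer.

```python
def to_bio(tokens, spans):
    """spans: list of (first_token_idx, last_token_idx, type), inclusive."""
    tags = ["O"] * len(tokens)  # default: outside of any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"              # beginning of the entity
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{etype}"              # inside of the entity
    return tags

tokens = "Vital sign remains stable".split()
print(to_bio(tokens, [(0, 1, "test")]))
# ['B-test', 'I-test', 'O', 'O']
```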
- 52:13How does relation extraction work in the case of BERT? Here it is the same idea: now it becomes sentence classification.
- 52:21You have two classes: "has value," which is the positive class, and the negative class.
- 52:27So if you show blood pressure and 80, that is a negative sample: you should label that sentence as negative. If you have blood pressure and 126 over 63, then it is a positive sample.
- 52:42So it becomes a sentence classification task. And with many patterns like this, seeing repeated sentences like that, the model is learning that particular pattern and identifying it.
- 52:51The next time that sort of sentence appears: okay, this is a positive, has-value case, or this is a negative class there.
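Building those sentence-classification samples can be sketched by pairing each test with each value and labeling the gold pairs positive; the field names and label strings here are illustrative, not the actual training format.

```python
def make_re_samples(sentence, tests, values, gold_pairs):
    """One labeled sample per (test, value) candidate pair in the sentence."""
    samples = []
    for t in tests:
        for v in values:
            label = "has_value" if (t, v) in gold_pairs else "negative"
            samples.append({"text": sentence, "pair": (t, v), "label": label})
    return samples

sentence = "Blood pressure was 126/63, heart rate 80."
samples = make_re_samples(
    sentence,
    tests=["blood pressure", "heart rate"],
    values=["126/63", "80"],
    gold_pairs={("blood pressure", "126/63"), ("heart rate", "80")},
)
print(sum(s["label"] == "has_value" for s in samples))  # 2 positive of 4 pairs
```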
- 53:01This is the entire Kiwi
- 53:03pipeline.
- 53:04Okay.
- 53:06Having
- 53:07data
- 53:08from multiple sources is important.
- 53:11Something that works on your specific data, at a particular hospital, written by one specific doctor in one particular setting, might not generalize well when you try to use that same pipeline at another hospital,
- 53:24on another note that is written by another physician or health care provider.
- 53:29So for Kiwi, we actually
- 53:31have data from four sources
- 53:33so that we can
- 53:34make the model much more
- 53:36generalizable
- 53:37and make it see the
- 53:38patterns that happen in a wide variety of data.
- 53:42We have UTP, that is UT Physicians; MTSamples, which is a publicly available dataset; and MIMIC-III, which you might know.
- 53:48And all the data from these different sources are incorporated in our training process.
- 53:54And as I mentioned: instruction format for Llama, BERT format for training the BERT models.
- 54:00Then you fine-tune both the models, and then you test the models out.
- 54:05So you test on a subset of UTP, MTSamples, and MIMIC-III, and also on i2b2.
- 54:11Again, Dr. Xu mentioned that i2b2 is unseen data. Right? It's not in your training data.
- 54:16That's how we are testing the generalizability: to see whether it actually performs on unseen data.
- 54:23Then post-process, separate the entities and relationships, and calculate precision, recall, and F1; that is your evaluation.
- 54:31So this is, quickly, the composition of the Kiwi dataset, meaning the Kiwi model that we are giving out currently.
- 54:39It has been trained on about 1,400 documents and then tested on four different types, each having fifty or twenty-five documents.
- 54:52For evaluation, I mentioned precision, recall, and F1, and it is both exact match and relaxed match.
- 54:57To be clear: for exact match, the entity type should match and the boundary should also match; for relaxed match, the entity type should still match, but the boundary can just be overlapping.
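The two matching criteria can be sketched directly: exact match compares type and boundaries, relaxed match compares type and requires only span overlap. The character offsets are illustrative.

```python
def exact_match(gold, pred):
    """Same type AND identical boundaries."""
    return gold == pred

def relaxed_match(gold, pred):
    """Same type, boundaries need only overlap."""
    (gs, ge, gt), (ps, pe, pt) = gold, pred
    return gt == pt and gs < pe and ps < ge  # half-open intervals intersect

gold = (10, 25, "problem")  # character offsets, illustrative
pred = (10, 20, "problem")  # same type, shorter boundary
print(exact_match(gold, pred), relaxed_match(gold, pred))  # False True
```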
- 55:12How did we perform? Llama 3 70B was somewhat better for the NER task and, again, for relation extraction, but you also see that some smaller models still performed on par with it.
- 55:25Sometimes you do not have much difference with the BERT model, but here we saw that at least some statistical significance was there.
- 55:31And i2b2 is the unseen data, and Dr. Xu mentioned again how large language models are better on unseen data compared to BERT. And BERT models, again, definitely need a lot more data to train on.
- 55:49Now what about the memory usage, total GPU hours, GPU hours per epoch, energy consumption, carbon emission?
- 55:56That is where a lot of these computational resources come into play. You need a huge amount of memory.
- 56:03As you know, we are comparing a BERT model of about one hundred million to three hundred million parameters to something that is seven billion, eight billion, seventy billion parameters, and that difference really shows in the amount of compute and the hours that you require for training these models and the memory that they utilize.
- 56:23So if you want to fine-tune the model using parameter-efficient fine-tuning approaches like LoRA, which Melingfei is gonna discuss, then you need one A100 80 GB GPU.
- 56:39But if you need to do inference for the seventy-billion model, you need two A100 80 GB GPUs.
- 56:47Okay. Again, our paper has a lot of things; I can skip through this. I just want to talk about concept normalization.
- 56:53The actual way we do concept normalization is with Elasticsearch, which basically does exact match and partial match, and then BM25 to rerank the extracted candidates.
- 57:07So here in Kiwi, we have mapped it to UMLS concept unique identifiers.
- 57:13For anyone who's not familiar with UMLS: the UMLS Metathesaurus basically incorporates a hundred-some vocabularies and gives each concept a unique identity.
- 57:23The same concepts from all the different vocabularies are mapped to one unique concept ID.
- 57:30So here, this is a concept normalization pipeline that utilizes a large language model.
- 57:35Once you do the NER, you get the query. On the left side, you see "left atrium dilated": that is your query entity with its context, let's say the sentence that contains it.
- 57:46You give that to a large language model and ask it to generate multiple synonyms of it.
- 57:50Why are we doing that? Because the exact phrase might not appear in any of the standardized vocabularies.
- 57:56So we generate as many variations of that particular entity as we can, so that we can do the matching; Elasticsearch and BM25 actually do that.
- 58:05So you give the original utterance and all the synonyms and check in the database that you have created whether that entity is actually present there.
- 58:18So you search, you get a bunch of concepts that are sort of similar, and then you again use a large language model to find, among those concepts, the best one that actually represents the originally recognized entity.
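A toy version of that flow, with a fixed synonym list standing in for the LLM step and token overlap standing in for Elasticsearch/BM25; the concept names and IDs are illustrative, not real retrieval results.

```python
# Illustrative concept inventory; real retrieval would query Elasticsearch
# over the UMLS Metathesaurus and rerank candidates with BM25.
CONCEPTS = {
    "C0344720": "left atrial dilatation",
    "C0018802": "congestive heart failure",
}

def token_overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def normalize(query, synonyms):
    """Score every concept against the query and its synonyms; keep the best."""
    variants = [query] + synonyms
    return max(
        CONCEPTS,
        key=lambda cui: max(token_overlap(v, CONCEPTS[cui]) for v in variants),
    )

print(normalize("left atrium dilated", ["left atrial dilatation"]))
```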
- 58:33I know I'm going so
- 58:34fast, but
- 58:36the slides will be available,
- 58:37and we will also think
- 58:38of making the recordings available
- 58:40on the YBIG website.
- 58:42Okay. Last step: Kiwi usually gives you output in a JSON format, but we also have scripts to make it easy for you, so that the JSON can be converted into a CSV.
- 58:52And what I have highlighted: you see in the first column the entity, the term that we have actually extracted, and the highlighted one is the concept ID for it, which is basically the CUI, or concept unique identifier, of that particular thing from UMLS.
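The JSON-to-CSV step might look like this; the JSON field names and the CUIs here are hypothetical stand-ins for Kiwi's actual output schema.

```python
import csv
import io
import json

# Hypothetical Kiwi-style output: one record per extracted entity.
output_json = json.loads("""
[{"entity": "hypertension", "type": "problem", "cui": "C0020538"},
 {"entity": "lisinopril",   "type": "drug",    "cui": "C0065374"}]
""")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["entity", "type", "cui"])
writer.writeheader()
writer.writerows(output_json)
print(buf.getvalue())
```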
- 59:09And if you ask why UMLS: if there is a concept unique identifier, you can actually map it back to SNOMED, ICD, or MeSH, because UMLS includes all those things. That's a very easy task.
- 59:21Where can you find Kiwi? This is our website: kiwi.clinicalnlp.org.
- 59:27The QR code will take you right there. You press the live demo, and you get a prepopulated note, a few sentences.
- 59:38Click submit, and it will show you the entities and the relations extracted. You can remove that text and add your own text.
- 59:45No programming experience needed: you can put something in there and get to see what the entities are. Just play around with that.
- 59:52Okay. Now, if you want to download the models, we have another page called download. You need to fill in a form, and then we will send you the Docker images.
- 01:00:03Now how is Docker different? Everything is prepackaged into a container; you do not need to install things separately.
- 01:00:10The Docker image comes with instructions as to what to do. It's just like an executable you run, selecting, okay, options one, two, three.
- 01:00:19It also has a readme file, which tells you how to run it. You give it where your input data is and where you want the output to be, and it will run the entire Kiwi and give you the output there.
- 01:00:38Okay? So: easy-to-install Docker images, with all the dependencies taken care of. They can be run on Linux, Mac, or Windows.
- 01:00:44If you have a CPU, we have versions for that; if you have a GPU, we have versions for that.
- 01:00:50And we have both the BERT-based and Llama-based models that do the things I was talking about.
- 01:00:57Finally, what Vincent is gonna demo is: forget about all these things. Your data is on the CHIP; you want to use it directly, just with an API call.
- 01:01:07Currently, you need to contact Chris Gilman, who's a senior software engineer, to get the API for calling Kiwi.
- 01:01:14But in the future, we are gonna come up with a system where you can submit tickets and get the API key.
- 01:01:18So get your API key, put it into a program that we are gonna give you, and run it. That's as easy as it gets.
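What such an API call could look like, sketched as payload construction; the endpoint URL, header, and field names are placeholders, since the real values come with the API key you request.

```python
def build_request(api_key, text):
    """Assemble a hypothetical Kiwi extraction request (placeholder endpoint)."""
    return {
        "url": "https://example.org/kiwi/extract",  # placeholder, not the real URL
        "headers": {"Authorization": f"Bearer {api_key}"},
        "json": {"text": text},
    }

req = build_request("YOUR_API_KEY", "He denied fever.")
print(req["json"])
# A real call would then be, e.g.:
#   import requests
#   resp = requests.post(req["url"], headers=req["headers"], json=req["json"])
```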
- 01:01:27A growing database: about thirty-two requests so far since we released, and that's it. I don't want to go more into that.
- 01:01:35What's coming up? We have more packages that we have actually built but not made available as a Docker image or a service or something like that.
- 01:01:44One of those that we are thinking of making available on the CHIP, as Kiwi currently is, is the RECIST pipeline, which extracts systemic anticancer therapy and the responses based on the RECIST guidelines.
- 01:01:55Again, I'm not a clinician, so I'm not going into it.
- 01:01:59So probably you can see something similar in the future: available as Docker images or API services or something that you can download and play with.
- 01:02:09My main area of research is
- 01:02:11social determinants of health. This
- 01:02:12is another pipeline that I
- 01:02:14have built: twenty-one social
- 01:02:16determinants of health,
- 01:02:19with four different models,
- 01:02:21from XGBoost to
- 01:02:23TextCNN,
- 01:02:24SentenceBERT, and
- 01:02:25LLaMA. It can take your notes
- 01:02:28and annotate them on two levels
- 01:02:32across twenty-one social
- 01:02:34determinant factors. It does a
- 01:02:37sort of
- 01:02:38sentence classification:
- 01:02:39it takes your note,
- 01:02:41divides it into sentences, and tells
- 01:02:42you, okay, this sentence is
- 01:02:43talking about race, sex, or gender;
- 01:02:46this sentence is talking about
- 01:02:47the person's insurance;
- 01:02:48this sentence is talking
- 01:02:50about their education. Then we
- 01:02:51go one more level down. You
- 01:02:52also have models that
- 01:02:54tell you, okay, this
- 01:02:56person's education is high
- 01:02:57school or below; for insurance,
- 01:02:59yes, the person
- 01:03:01has insurance, or no.
- 01:03:02So: high-level labels on the
- 01:03:04twenty-one factors, plus the
- 01:03:06values and attributes
- 01:03:09one level deeper. So
- 01:03:11that's all we have here.
- 01:03:12And, also, I almost
- 01:03:15forgot this.
- 01:03:18When you sign a DUA
- 01:03:19with us,
- 01:03:20we are gonna give you
- 01:03:21the model weights of Kiwi.
- 01:03:23That is still in the
- 01:03:24pipeline; it will come through
- 01:03:26a form. So the
- 01:03:27form that I'm asking
- 01:03:28you to fill out to get
- 01:03:29the Docker images will also
- 01:03:31cover this: if you are good
- 01:03:32at programming,
- 01:03:33take our model,
- 01:03:35continuously fine-tune
- 01:03:37it with your data,
- 01:03:38and make it whatever you want.
- 01:03:41So that is another thing, but
- 01:03:42you need to sign a
- 01:03:43DUA with us, and that
- 01:03:44form will be available soon.
- 01:03:46With that, Vincent, take
- 01:03:48it over for the Kiwi
- 01:03:49API demo.
- 01:03:56We're not taking questions
- 01:03:58because of the time
- 01:03:59constraints.
- 01:04:07So good afternoon, everyone. My
- 01:04:09name is Vincent, and I'm
- 01:04:10a software developer in
- 01:04:12Dr. Shi's lab.
- 01:04:13And today, I will talk
- 01:04:15about how to use the
- 01:04:16Kiwi API service.
- 01:04:18The core concepts
- 01:04:20of Kiwi have
- 01:04:22already been discussed,
- 01:04:24so I will go through this
- 01:04:26quickly.
- 01:04:28So what is the Kiwi
- 01:04:31API service?
- 01:04:33The Kiwi API service provides
- 01:04:35an API-as-a-service
- 01:04:37interface
- 01:04:38that allows users within
- 01:04:41CHP's internal network to
- 01:04:42access Kiwi without requesting a
- 01:04:44high-performance GPU and having
- 01:04:46to install
- 01:04:47or manage the model locally.
- 01:04:50Users simply request an API
- 01:04:53key and make standard HTTP
- 01:04:55API calls to use
- 01:04:57the service.
- 01:04:58All computational
- 01:04:59resources run on CHP,
- 01:05:01so we don't need
- 01:05:03to request a local
- 01:05:05GPU.
- 01:05:06This setup streamlines
- 01:05:08access to Kiwi functionality
- 01:05:11and makes it more accessible
- 01:05:13in resource-constrained
- 01:05:15environments.
- 01:05:20So how does the Kiwi
- 01:05:21API service actually work?
- 01:05:23The process follows a simple
- 01:05:25request-and-response
- 01:05:29pattern.
- 01:05:30Users send a request
- 01:05:31to the Kiwi API server
- 01:05:33in the CHP environment, such as
- 01:05:35from Camino, which includes
- 01:05:38either clinical notes or other
- 01:05:40text-related data.
- 01:05:42Once the API service receives
- 01:05:44the request, it determines
- 01:05:46the task type based on
- 01:05:48the specific
- 01:05:49endpoint and then returns the
- 01:05:51appropriate response.
- 01:05:55Most tasks are handled by
- 01:05:57a background process on the
- 01:05:58API server.
- 01:06:00All incoming requests are
- 01:06:03queued and processed sequentially
- 01:06:06to ensure efficient use of
- 01:06:08the limited
- 01:06:10computational resources.
- 01:06:11So
- 01:06:15let's take a closer look
- 01:06:17at how to use the
- 01:06:19Kiwi API service.
- 01:06:20Before we get started,
- 01:06:23there are some things you need
- 01:06:25to prepare. First,
- 01:06:27obviously, you need
- 01:06:29access to the CHP environment,
- 01:06:31like Camino.
- 01:06:32You need to have a
- 01:06:33OneHx
- 01:06:35account.
- 01:06:36Then you
- 01:06:37need to request an API
- 01:06:38key; as just
- 01:06:40mentioned,
- 01:06:42you need to ask Chris
- 01:06:44Gilman to get the API
- 01:06:45key.
- 01:06:47Users who have
- 01:06:49some coding experience can
- 01:06:51write their own script to
- 01:06:52access the API, but we
- 01:06:54also provide an API launch
- 01:06:56script and some use cases
- 01:06:58in a Jupyter Notebook, provided at
- 01:07:00this GitHub
- 01:07:01link.
- 01:07:06Now, assuming you already
- 01:07:08have access to
- 01:07:09the platform, you
- 01:07:10open a Jupyter notebook,
- 01:07:12and you have an
- 01:07:13API key. First,
- 01:07:16you need to define a
- 01:07:17variable to instantiate
- 01:07:19the class that I
- 01:07:20provide in the script.
- 01:07:21At this step, you need
- 01:07:23to insert
- 01:07:24your API key into
- 01:07:26this instance.
- 01:07:28The first time
- 01:07:29you use the API
- 01:07:31server,
- 01:07:32you can use the key
- 01:07:33info function to test your
- 01:07:35connection.
- 01:07:37This will give you a result.
- 01:07:40It responds with
- 01:07:42information in JSON
- 01:07:43format.
- 01:07:44There are three main components
- 01:07:46in this response.
- 01:07:48You can see the usage
- 01:07:49count, which tells you
- 01:07:52how many tokens you have used
- 01:07:53since you created the API
- 01:07:54key; the tokens remaining,
- 01:07:57which tells you how many
- 01:07:58tokens you still have on
- 01:08:00the API key; and finally, the
- 01:08:02expire-at field, which tells you when
- 01:08:04the key expires.
- 01:08:06For the token limit and the expiration
- 01:08:07date, you can contact our
- 01:08:09team to extend
- 01:08:10the usage in the future.
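As a sketch of this setup step (the class name, header scheme, and JSON field names below are assumptions for illustration, not the actual script's API; no network call is made):

```python
import json

# Hypothetical client sketch; the real script's class and method
# names may differ. We only show key handling and parse a sample
# response of the shape described in the talk.
class KiwiClient:
    def __init__(self, api_key):
        self.api_key = api_key
        # A typical header scheme for key-based HTTP APIs (assumed).
        self.headers = {"Authorization": f"Bearer {api_key}"}

client = KiwiClient("my-api-key")

# Example key-info response with the three fields described:
# usage count, tokens remaining, and expiry date.
sample = '{"usage_count": 1200, "tokens_remaining": 98800, "expire_at": "2025-12-31"}'
info = json.loads(sample)
print(info["tokens_remaining"])
```

A real call would send `client.headers` with an HTTP request to the key-info endpoint and parse the returned JSON the same way.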
- 01:08:13Our main function is
- 01:08:15batch prediction,
- 01:08:17which allows users to process their
- 01:08:19clinical notes in bulk.
- 01:08:21To use this function,
- 01:08:23you need to provide the
- 01:08:24path of your files
- 01:08:26in the Camino environment.
- 01:08:29Currently, the supported upload
- 01:08:31formats are compressed files
- 01:08:32such as zip or
- 01:08:33tar, or a single text
- 01:08:36file, and you can
- 01:08:37compress your
- 01:08:38notes into a single
- 01:08:39file as well.
- 01:08:41Once submitted,
- 01:08:42your notes go onto the
- 01:08:44Kiwi server as a task
- 01:08:45in the queue. The function
- 01:08:47will return the task status,
- 01:08:48including the task ID
- 01:08:51for this task, in
- 01:08:52JSON format.
- 01:08:53All task IDs are
- 01:08:54tied to the API key,
- 01:08:57which means all user
- 01:08:58data is isolated by
- 01:09:00API key and
- 01:09:01task ID.
- 01:09:04Here is an example of
- 01:09:06a batch prediction.
- 01:09:09As you can see, it
- 01:09:10returns the task information
- 01:09:12in JSON format,
- 01:09:14including the task ID
- 01:09:16and a message that shows
- 01:09:17the status, how
- 01:09:20many tokens this task used,
- 01:09:22and how many tokens remain
- 01:09:24in your
- 01:09:25account.
- 01:09:26Finally, the estimated time for
- 01:09:28your task is calculated based
- 01:09:30on your task's queue position
- 01:09:31and the progress of the
- 01:09:33tasks ahead of it.
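A minimal sketch of reading such a response (the JSON field names below are assumed from the description, not the service's exact schema):

```python
import json

# Hypothetical batch-prediction response of the shape described:
# task ID, status message, tokens used, tokens remaining, estimate.
response_text = '''{
  "task_id": "abc123",
  "message": "queued",
  "tokens_used": 5000,
  "tokens_remaining": 95000,
  "estimated_seconds": 420
}'''

task = json.loads(response_text)

# Keep the task ID: every later call (status, download, cancel)
# is keyed by API key plus task ID.
task_id = task["task_id"]
print(task_id, task["message"], task["estimated_seconds"])
```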
- 01:09:37After you submit a task
- 01:09:39and receive a task ID,
- 01:09:41you can
- 01:09:42check the current
- 01:09:43status of your task at
- 01:09:45any time using the task
- 01:09:47status function.
- 01:09:49It will provide your
- 01:09:51task information in detail.
- 01:09:53Typically, there are three main
- 01:09:56statuses for your
- 01:09:58task. First
- 01:09:59is the queued status:
- 01:10:01when a
- 01:10:02task is in the queue, the
- 01:10:04system will report the
- 01:10:06task's current queue position as
- 01:10:08well as the estimated time
- 01:10:10to processing.
- 01:10:11Next is processing:
- 01:10:13when no one is ahead of
- 01:10:16you, your task is put into
- 01:10:18the processing pool. It
- 01:10:20will indicate how many
- 01:10:21files are in your task,
- 01:10:24including how many have not yet been
- 01:10:26processed,
- 01:10:27how many have
- 01:10:28been processed,
- 01:10:29and the remaining time based
- 01:10:31on the remaining files.
- 01:10:33Finally, the complete
- 01:10:35status: you can then use the
- 01:10:37task ID with the next
- 01:10:38function to download your results.
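The three states can be sketched as a simple polling loop; the status dictionaries below are simulated stand-ins for real status-endpoint responses, and the field names are illustrative:

```python
# Simulated status sequence for one task: queued -> processing -> complete.
# A real client would call the status endpoint each iteration instead.
def fake_status_stream():
    yield {"status": "queued", "queue_position": 2, "eta_seconds": 600}
    yield {"status": "processing", "files_total": 10, "files_done": 4}
    yield {"status": "complete"}

seen = []
final = None
for status in fake_status_stream():
    seen.append(status["status"])
    if status["status"] == "complete":
        final = status
        break
    # A real loop would sleep between polls, e.g. time.sleep(30).

print(seen)
```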
- 01:10:42Once your task is
- 01:10:44complete,
- 01:10:45you can download your
- 01:10:47results using the download
- 01:10:49function.
- 01:10:50In this function, you need
- 01:10:51to give the output path
- 01:10:53where you want to save
- 01:10:54your file
- 01:10:56locally, and the output type
- 01:10:58you prefer.
- 01:11:00By default, the path is
- 01:11:01the working directory,
- 01:11:03and the output
- 01:11:05is JSON format.
- 01:11:07For saving files, it
- 01:11:08supports three output types
- 01:11:10that are typically used. First
- 01:11:13is a zip file,
- 01:11:14which compresses the results
- 01:11:16into a separate JSON file
- 01:11:17for each of your input files.
- 01:11:19Then there is single
- 01:11:21JSON, which combines all the
- 01:11:23JSON results into a single
- 01:11:25JSON file.
- 01:11:26Finally, there is CSV:
- 01:11:28we have a
- 01:11:29converter
- 01:11:31integrated into the
- 01:11:33Kiwi API service, so it
- 01:11:34can just
- 01:11:37output CSV.
- 01:11:39Each file can only be
- 01:11:41downloaded once. After you download it,
- 01:11:43you cannot access it
- 01:11:45again because,
- 01:11:46for
- 01:11:48privacy reasons, the server deletes
- 01:11:50the record
- 01:11:52of the data.
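The three output types can be illustrated with a toy conversion; the result structure and entity contents below are made up for illustration, not the service's actual output schema:

```python
import csv
import io
import json

# Per-file JSON results, as the zip output type would contain
# (one JSON result per input file; contents are illustrative).
results = {
    "note1.txt": {"entities": [{"text": "Spanish", "tag": "language_fluent"}]},
    "note2.txt": {"entities": [{"text": "English", "tag": "language_some"}]},
}

# "Single JSON" output type: combine everything into one document.
combined = json.dumps(results)

# "CSV" output type: flatten to one row per extracted entity.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["file", "text", "tag"])
for fname, res in results.items():
    for ent in res["entities"]:
        writer.writerow([fname, ent["text"], ent["tag"]])

csv_text = buf.getvalue()
print(csv_text.splitlines()[1])
```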
- 01:11:54And here is an example
- 01:11:55of a download. Typically, it will
- 01:11:57save your file into the local
- 01:11:59directory and give you a
- 01:12:01message telling you whether
- 01:12:02it succeeded.
- 01:12:04On the left side, you
- 01:12:05can see the
- 01:12:06typical
- 01:12:08result format.
- 01:12:10And,
- 01:12:11in some cases, you might
- 01:12:13submit multiple tasks or
- 01:12:16forget a specific
- 01:12:18task ID; this function allows
- 01:12:20you to quickly review the
- 01:12:22status of each task.
- 01:12:24This function lists
- 01:12:26all tasks that have not yet been
- 01:12:27downloaded.
- 01:12:30And if you
- 01:12:33accidentally
- 01:12:34submitted a task, as long as the
- 01:12:36task has not
- 01:12:37entered processing,
- 01:12:39you can still use
- 01:12:42this function to cancel your task
- 01:12:44before the
- 01:12:46task enters
- 01:12:48the processing pool,
- 01:12:49and it will give you
- 01:12:51the tokens back.
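The cancel-and-refund rule can be sketched as a toy model of the queue semantics described (field names and token amounts are illustrative):

```python
# A task can be cancelled, with its tokens refunded, only while it
# is still queued; once it enters the processing pool it is too late.
def cancel_task(task, balance):
    if task["status"] == "queued":
        task["status"] = "cancelled"
        return balance + task["tokens_charged"]  # refund the charge
    return balance  # already processing or done: no refund

balance = 90000
queued = {"status": "queued", "tokens_charged": 5000}
processing = {"status": "processing", "tokens_charged": 5000}

balance = cancel_task(queued, balance)      # refunded
balance = cancel_task(processing, balance)  # no refund
print(balance)
```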
- 01:12:53So I will show you
- 01:12:54a quick demo of it.
- 01:13:17Okay. Sure. So
- 01:13:18I think we'll skip
- 01:13:19the demo and move to
- 01:13:20the next speaker. It's the
- 01:13:22same thing, but you'd see
- 01:13:23it running live.
- 01:13:25I actually showed you the
- 01:13:26program there. Right.
- 01:13:27Yeah.
- 01:13:29Thank you.
- 01:13:35Oh, hi, everyone. My name
- 01:13:37is Lingfei Chen.
- 01:13:39I am a postdoc
- 01:13:40in Dr. Vashu's group. Today,
- 01:13:42I'm going to show you
- 01:13:43how to develop
- 01:13:45customized models for some specific
- 01:13:46applications.
- 01:13:48So at the beginning, I
- 01:13:49would like to introduce why
- 01:13:51we need these customized models.
- 01:13:53We all know that large
- 01:13:54language models like LLaMA and
- 01:13:56the GPT series have shown great
- 01:13:58potential in many domains,
- 01:14:00as they are
- 01:14:01pretrained on large-scale
- 01:14:03text, they have strong
- 01:14:05instruction-following abilities
- 01:14:07across different tasks, and they
- 01:14:09have wide coverage of
- 01:14:10general knowledge.
- 01:14:11However, they may not fully
- 01:14:13capture the nuances of some
- 01:14:15specific tasks or user
- 01:14:17needs, especially when
- 01:14:20the task involves
- 01:14:23specific custom definitions
- 01:14:26or the task is
- 01:14:27rare among common users.
- 01:14:29And that's why we need
- 01:14:31to develop customized models for
- 01:14:33ourselves.
- 01:14:34We can enhance the
- 01:14:35model with domain-specific
- 01:14:37expertise
- 01:14:38for the task and improve
- 01:14:40on the performance of existing large
- 01:14:42language models.
- 01:14:43In the process of
- 01:14:45improving the performance, we can
- 01:14:47actually get
- 01:14:49smaller-size models to
- 01:14:51achieve performance comparable to
- 01:14:53larger-size models, and
- 01:14:55become more
- 01:14:57efficient and cost-effective.
- 01:14:59And, also, we can improve the
- 01:15:01user experience
- 01:15:03by reducing some
- 01:15:05of the hallucinations in
- 01:15:06existing large language models.
- 01:15:09And here are some key
- 01:15:10steps for developing customized models.
- 01:15:13The first is to
- 01:15:14actually define what
- 01:15:16your NLP task is, and the
- 01:15:18second is to prepare the
- 01:15:20data to
- 01:15:21train and evaluate the large
- 01:15:23language model.
- 01:15:27The preparation of
- 01:15:28data involves some steps
- 01:15:31that have been introduced
- 01:15:33before: data annotation
- 01:15:35and data preprocessing to
- 01:15:36feed into the models.
- 01:15:38And after we get
- 01:15:40the data, we can start
- 01:15:41model training to enhance
- 01:15:43the performance of the model
- 01:15:44with the task-specific data.
- 01:15:47And then,
- 01:15:48once we finish the
- 01:15:50model training, we can
- 01:15:52use another set of
- 01:15:53annotated data to evaluate the
- 01:15:55performance of our developed
- 01:15:57model, to see if the
- 01:15:59performance actually improved compared
- 01:16:02with the backbone model.
- 01:16:03And then, once we confirm
- 01:16:05that the performance of the
- 01:16:07model has
- 01:16:07improved, we can actually use
- 01:16:09these customized
- 01:16:11models in production.
- 01:16:14And here is a general
- 01:16:15workflow of model training and
- 01:16:17evaluation.
- 01:16:18Once we define our task
- 01:16:20and prepare our data, we
- 01:16:22need to split the data
- 01:16:23into different subsets.
- 01:16:26Usually, we would have three
- 01:16:27subsets. The first is the
- 01:16:28training data to develop the
- 01:16:30model, and the second would
- 01:16:31be the validation
- 01:16:32data to validate the effectiveness
- 01:16:35of the trained model. But
- 01:16:36for simplification here,
- 01:16:38we just use the
- 01:16:42test data to evaluate
- 01:16:45the trained model. If the
- 01:16:46trained model is effective compared
- 01:16:49with the backbone model, we
- 01:16:50can then use it
- 01:16:52as the production model to
- 01:16:54process the production data.
- 01:16:56And if the trained model's
- 01:16:58performance actually decreases,
- 01:17:00we might
- 01:17:01need to adjust the training
- 01:17:03process
- 01:17:06and redo the training part.
- 01:17:10And,
- 01:17:11next, I will show more
- 01:17:12details of each step. I
- 01:17:14will start from how to
- 01:17:16define tasks.
- 01:17:18This is actually a real
- 01:17:19example.
- 01:17:20Starting from the task
- 01:17:22design, I will show you
- 01:17:24how to develop customized
- 01:17:26models step by step.
- 01:17:28So let's say that we
- 01:17:29have a research project to investigate
- 01:17:31the impact of
- 01:17:34bilingualism
- 01:17:35on ADRD progression.
- 01:17:38So the first step,
- 01:17:40as in other clinical
- 01:17:42research, is to find eligible
- 01:17:43patients.
- 01:17:44Once we find those
- 01:17:46eligible patients, we need to
- 01:17:47identify the bilingual or monolingual
- 01:17:49patients among them.
- 01:17:53The first thought is to
- 01:17:55check the structured data,
- 01:17:57the preferred-language or
- 01:17:59written-language
- 01:18:01fields, to see which
- 01:18:03language the patient prefers.
- 01:18:05But when we
- 01:18:07check the actual data, we
- 01:18:08find that
- 01:18:11there might not be
- 01:18:13enough of this kind of structured data
- 01:18:15to support our research,
- 01:18:16and some of it might
- 01:18:18not even be accurate.
- 01:18:20But we noticed that there
- 01:18:22is a lot of language
- 01:18:23information contained in the clinical
- 01:18:25notes. For example, many notes
- 01:18:27record what the patient
- 01:18:29speaks, what their preferred
- 01:18:30language is, and how well
- 01:18:32they speak it. So we might
- 01:18:34comprehensively
- 01:18:36extract the language-speaking
- 01:18:38status from all the clinical
- 01:18:40notes using NLP
- 01:18:43models.
- 01:18:43models.
- 01:18:44There are two targets that
- 01:18:46we want to extract. The
- 01:18:47first is what language does
- 01:18:49the patient speak and how
- 01:18:50well do they speak. So,
- 01:18:52for these two
- 01:18:54specific tasks,
- 01:18:55aims, we, like,
- 01:18:57could formulate the task as
- 01:18:59a task.
- 01:19:01The first the first thing
- 01:19:02we want to do is
- 01:19:03to identify all the language
- 01:19:05entities in the clinical notes,
- 01:19:06and then we could assign
- 01:19:08different tags based on different
- 01:19:10context
- 01:19:11to indicate different speaking status
- 01:19:13of the patient.
- 01:19:16Once we formulate the
- 01:19:18task as an NER task, we
- 01:19:20need to further refine the
- 01:19:22details of the task.
- 01:19:24We might need to review
- 01:19:26some of the clinical notes
- 01:19:27and design different tags for
- 01:19:30the task.
- 01:19:34After the data review, we
- 01:19:36designed four different tags for
- 01:19:38this task. The first is
- 01:19:39language-fluent, which indicates the
- 01:19:41patient speaks some
- 01:19:42language fluently.
- 01:19:44Then there is
- 01:19:46language-some, to indicate the patient
- 01:19:47speaks some of the language,
- 01:19:49and there are
- 01:19:50language-no and language-other.
- 01:19:52And here are some examples.
- 01:19:54For the first one,
- 01:19:56language-fluent,
- 01:19:57one of the sentences says
- 01:19:59that the patient speaks Italian
- 01:20:03primarily,
- 01:20:05so this indicates the
- 01:20:06person is fluent
- 01:20:10in Italian.
- 01:20:11And for language-some,
- 01:20:12here's the sentence: she speaks
- 01:20:14some English.
- 01:20:15For language-no, the patient does
- 01:20:17not speak English. And for
- 01:20:19language-other,
- 01:20:22when we were reviewing the data,
- 01:20:24we found a lot of
- 01:20:25language mentions referring to other
- 01:20:28individuals, for example the patient's
- 01:20:30family, or to
- 01:20:32written language. These kinds
- 01:20:34of mentions do not indicate
- 01:20:36the language-speaking status of
- 01:20:38the patient, so we categorize
- 01:20:39them as language-other.
- 01:20:42Once we refine the details
- 01:20:44of the NER task, we
- 01:20:45need to start
- 01:20:47preparing the data for model
- 01:20:49training and model evaluation.
- 01:20:52Here is the overall flow
- 01:20:54to prepare the data for
- 01:20:56model development.
- 01:20:58We first need to get
- 01:20:59some raw data and
- 01:21:01do the annotation with the
- 01:21:03annotation guideline that we developed
- 01:21:05before. And once we get
- 01:21:07the annotated results, we may
- 01:21:09need to design a
- 01:21:10prompt for this task.
- 01:21:12Some of the prompts have
- 01:21:14been discussed
- 01:21:16by Dr. Shui and Vipina.
- 01:21:18And once we get the
- 01:21:19prompts and all the annotated
- 01:21:21results, we need to process
- 01:21:22the results for the models
- 01:21:24to load, to start the
- 01:21:25training and evaluation.
- 01:21:28And,
- 01:21:29this is the annotation
- 01:21:30using Blue, which Yujia has
- 01:21:32mentioned before, so I'm just
- 01:21:33gonna skip this. And here
- 01:21:35is what the annotated results look
- 01:21:37like. Usually, we would have
- 01:21:39a JSON file for each
- 01:21:40input sample, and each
- 01:21:43sample looks like this.
- 01:21:45It records the file
- 01:21:47name, the original sentence,
- 01:21:49and also the entities and
- 01:21:51the positions of the entities that
- 01:21:54we marked.
- 01:21:56And here is the prompt;
- 01:21:58I'm just gonna skip this.
- 01:22:00And
- 01:22:01after we get all the
- 01:22:03files and the prompt, we
- 01:22:04need to process the data
- 01:22:05based on the different prompts. For
- 01:22:07example,
- 01:22:08our task is to
- 01:22:10annotate the text in
- 01:22:12the original sentence with HTML
- 01:22:15tags. So we
- 01:22:17process the input as
- 01:22:19the original sentence, and
- 01:22:20the target output
- 01:22:22would be
- 01:22:24the same sentence, but with
- 01:22:26all the entities
- 01:22:27wrapped in
- 01:22:30the language-fluent, language-some,
- 01:22:32and other tags,
- 01:22:34as HTML
- 01:22:35tags.
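A minimal sketch of that transformation, assuming the annotations provide character offsets as in the JSON files just shown (the function name and tag strings are illustrative):

```python
# Wrap annotated entity spans in HTML-style tags. Working from the
# end of the sentence backwards keeps earlier offsets valid while
# we insert tag text.
def wrap_entities(sentence, entities):
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        s, e, tag = ent["start"], ent["end"], ent["tag"]
        sentence = (sentence[:s]
                    + f"<{tag}>" + sentence[s:e] + f"</{tag}>"
                    + sentence[e:])
    return sentence

sent = "She speaks some English."
ents = [{"start": 16, "end": 23, "tag": "language_some"}]
print(wrap_entities(sent, ents))
```

The input to the model would then be the plain sentence, and the fine-tuning target would be this tagged version.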
- 01:22:37And this is also the
- 01:22:39data preparation for the models
- 01:22:41to load.
- 01:22:42I will show
- 01:22:44the code later, so you
- 01:22:45can just directly try the
- 01:22:47code to process it.
- 01:22:49So after we prepare the
- 01:22:51data, we now finally
- 01:22:54can start the fine-tuning
- 01:22:55process.
- 01:22:58So the fine-tuning
- 01:23:00process is actually
- 01:23:01a process of adjusting the
- 01:23:03weights of the large language
- 01:23:05model to make it adapt
- 01:23:07to our task-specific data.
- 01:23:10So, usually, we would need to
- 01:23:13adjust all the weights of
- 01:23:14the model, but we
- 01:23:16know that large language
- 01:23:17models have a lot
- 01:23:18of parameters. So
- 01:23:20full fine-tuning would
- 01:23:21have a very high
- 01:23:24computational
- 01:23:26cost. So instead, we
- 01:23:28use a
- 01:23:30widely used method, LoRA, to
- 01:23:32do the fine-tuning. LoRA
- 01:23:33is
- 01:23:35low-rank adaptation,
- 01:23:36which uses two small
- 01:23:39matrices, shown here
- 01:23:46in the green part.
- 01:23:48Instead of fine-tuning the
- 01:23:50entire large language model, we
- 01:23:52only need to adjust
- 01:23:54the parameters in these
- 01:23:55small matrices.
- 01:23:57So, compared with full
- 01:23:59fine-tuning, it is much
- 01:24:00faster, and we only need
- 01:24:02minimal training resources.
- 01:24:04But it needs
- 01:24:05high-quality datasets,
- 01:24:07so we need to
- 01:24:08define the task and
- 01:24:10develop the annotation guideline carefully.
- 01:24:13And it also has some
- 01:24:15risk of overfitting.
- 01:24:17As for the resources,
- 01:24:19for an eight-billion-parameter model,
- 01:24:21you might need one
- 01:24:22A100 or H100
- 01:24:23GPU, while for
- 01:24:25seventy-billion-parameter models, you might need
- 01:24:26two H100 GPUs
- 01:24:28to do the fine-tuning.
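The parameter savings can be worked out directly: for a weight matrix of shape (d, k), full fine-tuning updates all d times k entries, while a rank-r LoRA adapter trains only the two low-rank factors A (d by r) and B (r by k). A quick check with typical sizes (the dimensions and rank below are common choices, not the talk's exact settings):

```python
# Trainable parameters for one weight matrix of shape (d, k).
def full_params(d, k):
    return d * k          # every weight is updated

def lora_params(d, k, r):
    return r * (d + k)    # only factors A (d x r) and B (r x k)

d = k = 4096   # a typical transformer projection size (assumed)
r = 16         # a common LoRA rank (assumed)
print(full_params(d, k), lora_params(d, k, r))
print(lora_params(d, k, r) / full_params(d, k))  # trainable fraction
```

This is why a LoRA adapter ends up being a very small file relative to the backbone weights.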
- 01:24:30So, here is the environment
- 01:24:32setup. I also provide the
- 01:24:34code
- 01:24:36at the end, so you can
- 01:24:37check it for more details.
- 01:24:40And if we want to
- 01:24:41do the fine-tuning, we
- 01:24:43need to
- 01:24:45modify
- 01:24:46the config file in the
- 01:24:48code that I provided.
- 01:24:50The first setting
- 01:24:51indicates where the
- 01:24:53model is
- 01:24:54stored in the Camino or
- 01:24:55CHP environment.
- 01:24:57So,
- 01:24:58in the folder, it should
- 01:24:59look like this: it has
- 01:25:01the
- 01:25:02weights of the model and some
- 01:25:03details of the model.
- 01:25:05And for the data, we
- 01:25:07also need to provide
- 01:25:08the path of the data
- 01:25:09that we
- 01:25:10processed before, to tell the
- 01:25:12model where the data is.
- 01:25:14And
- 01:25:15besides the model and the
- 01:25:16data, we also need to
- 01:25:17set up some other
- 01:25:19configs. For example, the most
- 01:25:21important one might be the
- 01:25:23learning rate. You can
- 01:25:25adjust the learning rate based
- 01:25:26on
- 01:25:27the evaluation results of the
- 01:25:30trained model.
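As a hypothetical illustration of such a config (the keys and paths below are made up to mirror the settings described, not the actual config file's schema):

```python
# Illustrative config values: model path, data path, and training
# hyperparameters. All keys and paths here are invented examples.
config = {
    "model_path": "/path/to/backbone-model",        # where the weights live
    "train_data": "/path/to/processed/train.json",  # the processed data
    "learning_rate": 1e-4,  # the knob to revisit if evaluation drops
    "num_epochs": 3,
    "lora_rank": 16,
}
print(config["learning_rate"])
```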
- 01:25:32And finally, we
- 01:25:35can start the
- 01:25:36model's fine-tuning.
- 01:25:38The fine-tuning process is
- 01:25:40actually very easy. Once we
- 01:25:42finish the config file, we
- 01:25:43can just start the
- 01:25:45fine-tuning with only one line
- 01:25:47of command.
- 01:25:48So after fine-tuning, we
- 01:25:49will get the
- 01:25:52adapter
- 01:25:54parameters, which is a very
- 01:25:55small file.
- 01:25:57So after we get this
- 01:25:58LoRA adapter, we need to
- 01:26:00combine the adapter with the
- 01:26:01original backbone model to form
- 01:26:04our own customized model.
- 01:26:07Once we get our customized
- 01:26:09model, we need to
- 01:26:10test the model to see
- 01:26:12if the performance actually improved
- 01:26:14compared with the backbone model.
- 01:26:16So we need to do
- 01:26:17inference on the test
- 01:26:18data.
- 01:26:19And here is an
- 01:26:21example of how to set up
- 01:26:23the environment,
- 01:26:24and this is an example
- 01:26:25of how to do
- 01:26:27the inference on the test
- 01:26:28data to get
- 01:26:29the results.
- 01:26:32This also sets
- 01:26:34up all the inference
- 01:26:36configs.
- 01:26:38For example, max
- 01:26:39tokens indicates how long you
- 01:26:42expect the model's output to be,
- 01:26:44and the stop token EOS
- 01:26:47means that once the model
- 01:26:50generates the EOS token,
- 01:26:52it finishes
- 01:26:53the generation
- 01:26:54instead of generating, you know,
- 01:26:56five hundred tokens.
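The stopping behavior can be sketched as a toy generation loop (the token stream here is a hand-made stand-in for real model output, and the EOS string is illustrative):

```python
# Two stopping rules: stop at the EOS token, or at the max-token
# budget, whichever is reached first.
EOS = "<eos>"

def generate(stream, max_tokens):
    out = []
    for tok in stream:
        if tok == EOS or len(out) >= max_tokens:
            break
        out.append(tok)
    return out

stream = ["<language_some>", "English", "</language_some>", EOS, "junk"]
print(generate(stream, max_tokens=500))  # stops at EOS, well before 500
```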
- 01:26:59So once we get the
- 01:27:01inference results, we can evaluate
- 01:27:03the performance
- 01:27:04of the model and compare
- 01:27:06it with the performance of
- 01:27:07the backbone model.
- 01:27:08And here are some evaluation
- 01:27:10metrics that Yujia
- 01:27:11introduced before, so I'm
- 01:27:13just gonna skip this.
- 01:27:15And I also provide some
- 01:27:16scripts
- 01:27:17for the
- 01:27:18evaluation. You can
- 01:27:20refer to the code that
- 01:27:21I provided for more details.
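The per-tag F1 score used in the comparison can be sketched as follows (the counts below are made up for illustration):

```python
# F1 from entity-level counts: precision = TP/(TP+FP),
# recall = TP/(TP+FN), F1 = their harmonic mean.
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one tag, e.g. language-fluent.
score = f1(tp=90, fp=10, fn=10)
print(round(score, 3))
```

Computing this per tag for both the fine-tuned model and the backbone gives the comparison shown on the results slide.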
- 01:27:23And here are the
- 01:27:25fine-tuning results that we
- 01:27:27got after we did
- 01:27:30the fine-tuning with the eight
- 01:27:31hundred samples that we annotated
- 01:27:33before.
- 01:27:34"Fine-tuned" means that
- 01:27:36we used the LLaMA 3
- 01:27:38seventy-billion instruct model as the
- 01:27:40backbone model to do the
- 01:27:42fine-tuning.
- 01:27:43So compared with the backbone
- 01:27:44model, we see that for
- 01:27:46every tag (language fluent, some,
- 01:27:48no, and other), the
- 01:27:49F1 scores all
- 01:27:51actually improved.
- 01:27:53So
- 01:27:54in this case, we can
- 01:27:55say that we have
- 01:27:56an effective fine-tuned
- 01:27:58customized model.
- 01:28:01And if we find that
- 01:28:02the customized model's performance dropped,
- 01:28:04we need to go back
- 01:28:06to the training process to
- 01:28:07retrain the model,
- 01:28:12iteratively
- 01:28:13checking the performance to see
- 01:28:14if there is any
- 01:28:16gain.
- 01:28:17And here are the code and
- 01:28:19data that we provide, with
- 01:28:21more details.
- 01:28:23And if you have any
- 01:28:25questions, you
- 01:28:28can leave comments or directly
- 01:28:30send me an email.
- 01:28:31Oh, okay. Thank you so
- 01:28:33much.